Geneset Enrichement Analysis

Geneset Enrichment Analysis (GSEA) is a set of different computational methods that have been developed to interpret the gene expression data and gain insight into the biology of the transcriptional changes in an experiment.

Multiple methods have been developed in different level of sophistication, to determine if a predefined set of genes that share common biological function, shows significant differences between two, or multiple, biological conditions, or phenotypes.

Herein, we go through the main methods that have been developed for GSEA, and explain how we can apply them to a simple experimental setting.

First Generation GSEA

Early methods for associating genesets with phenotype changes was developed by:

First, identifying the differential expressed genes between two conditions.
Use a priori defined genesets (e.g., Gene Ontology collection).
Run a simple Fisher exact test to determine whether a significant number of these genes belong to a geneset.
Ref: ¹

The most accessible tool to run this analysis on an experimental setting is the “Investigate Gene Sets”" page from:

https://www.gsea-msigdb.org/gsea/msigdb/annotate.jsp
It takes a list of genes (differential expressed genes) as input.
The user can use from the Molecular Signatures Database (MSigDB) as many geneset collections, and the tool runs the enrichment analysis between the user-provided geneset and the selected MSigDB.

Second Generation GSEA

An alternative approach to the simple fisher test was developed by ranking all genes in the transcriptomic data according to the level of differential expression between two conditions (or phenotypes). The second generation of GSEA methods determines if a predefined geneset is significantly overrepresented toward the top or bottom of the ranked list.

These methods work as the following:

Rank all genes by the magnitudes of their differential expression
Select a window in the ranked list
Define an enrichment score based on a weighted Kolmogorov Smirnov (WKS) test ²
Simulate a background distribution of the enrichment score by shuffling samples and estimate the statistical significance of the geneset
Repeat the first 3 steps for all predefined genesets and various window sizes
Correct for multiple hypotheses testing
Ref: ³

Noteworthy to mention that the second generation of GSEA is methodologically similar to 1st generation; because they put a cut-off for the window of differentially expressed genes. In other words, even though the score that they develop is through running down the list of differential expressed genes and penalize when the gene is out of geneset, they are still not wholly free of cut-off.

One of the available tools to run this method is GSEA software developed by Broad Institute, accessible via:

https://www.gsea-msigdb.org/gsea/downloads.jsp

Third Generation GSEA (Geneset Signature Score)

One of the most popular set of statistical methods that have been established as another alternative for GSEA work by developing a score for each sample (condition, or phenotype) based on a predefined geneset, and then compare these scores to evaluate the enrichment of the geneset in each condition. These methods are also called geneset signature score extraction. After defining a score for each sample, we can compare the signature scores in different conditions using conventional statistical methods, such as a t-test or ANOVA test.

One of the most used method in publications to develop a score based on transcriptomics data is Gene Set Variation Analysis (GSVA) ⁴:

First, it calculates the expression level statistics for unifying the expression level of the genes between samples.
Then the expression level statistics are ranked for each sample.
For each predefined geneset, the Kolmogorov-Smirnov-like rank statistic is calculated
And it constructs the score based on Kolmogorov-Smirnov-like rank statistic.

In the following you can see an example of running GSVA in R:

# Install GSVA R package from   Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GSVA")
BiocManager::install("GSVAdata")

# Run GSVA on sample data
library(GSVA)
library(GSVAdata)

# 1) load a gene expression data:
# in here we use a gene expression dataset  from  Verhaak et al., 2010, for Glioblastoma (GBM) patiant samples, including 4 subtypes: roneural, classical, neuraland  mesenchymal.

data(gbm_VerhaakEtAl) # load the gene expression dataset from GSVAdata
expr <- gbm_eset@assayData$exprs # expression matrix: rows -> genes , columns -> samples
subtypes <- gbm_eset$subtype # leukemia samples sybtypes 


# 2) load pre-defined genesets to be used for signature score extraction
# in here we are manually define 3 gene-signatures for oligodendrocytic, astroglia, and astrocytic cell types. (the signatures were extracted from brainTxDbSets)

oligodendrocytic_up <- c("DCT","ZNF536","GNG8","ELOVL6","NR2C1","RCBTB1")
astroglia_up <- c("BST2","SERPING1","ACTA2","C9orf167","C1orf31","ANXA4")
astrocytic_up <- c("GRHL1","GPAM","PAPSS2","MERTK","BTG1","SLC46A1")


signature.list <- list(oligodendrocytic = oligodendrocytic_up,
                       astroglia = astroglia_up,
                       astrocytic = astrocytic_up) # defining the final signature list

# And now run gsva analysis on the GBM expression dataset

score <- gsva(expr,
           gset.idx.list = signature.list, 
           method = 'ssgsea', # different methods can be used for score extration
           ssgsea.norm = TRUE # to normalize the scores across the samples
      )

The result is a table with the signatures in the rows, and samples in the columns:

##                  TCGA.02.0003.01A.01 TCGA.02.0010.01A.01 TCGA.02.0011.01B.01
## oligodendrocytic           0.2518755           0.3154164          0.03539221
## astroglia                 -0.1792207          -0.3525300         -0.17440859
## astrocytic                -0.1480899          -0.2814492          0.04311682

For representation, in here, we choose astroglia signature, which should be highly enriched in Mesenchymal subtype, and plot the signature scores for each subtype, using a boxplot:

astroglia.signature <- score ['astroglia',]
plot.df <- data.frame(astroglia.signature,
                      subtypes) # defining a dataframe including the signature score and the sybtypes

# here we show the result using Boxplot
boxplot(astroglia.signature ~ subtypes, data = plot.df)
stripchart(astroglia.signature ~ subtypes, data = plot.df,
           vertical = TRUE,
           add = TRUE,
           method = "jitter",
           pch = 20)

DOI: https://doi.org/10.1093/bioinformatics/bti565 ↩
WKS test: measures the difference between the number of genes in a predefined gene set that is observed in the window, and the number of occurrences if the genes in the set were uniformly distributed in the list.↩
DOI: https://doi.org/10.1073/pnas.0506580102 ↩
DOI: https://doi.org/10.1186/1471-2105-14-7 ↩