Geneset Enrichment Analysis (GSEA) is a set of different computational methods that have been developed to interpret the gene expression data and gain insight into the biology of the transcriptional changes in an experiment.

Multiple methods have been developed in different level of sophistication, to determine if a predefined set of genes that share common biological function, shows significant differences between two, or multiple, biological conditions, or phenotypes.

Herein, we go through the main methods that have been developed for GSEA, and explain how we can apply them to a simple experimental setting.

## First Generation GSEA

Early methods for associating genesets with phenotype changes was developed by:

• First, identifying the differential expressed genes between two conditions.
• Use a priori defined genesets (e.g., Gene Ontology collection).
• Run a simple Fisher exact test to determine whether a significant number of these genes belong to a geneset.
• Ref: 1

The most accessible tool to run this analysis on an experimental setting is the “Investigate Gene Sets”" page from:

• https://www.gsea-msigdb.org/gsea/msigdb/annotate.jsp
• It takes a list of genes (differential expressed genes) as input.
• The user can use from the Molecular Signatures Database (MSigDB) as many geneset collections, and the tool runs the enrichment analysis between the user-provided geneset and the selected MSigDB.

## Second Generation GSEA

An alternative approach to the simple fisher test was developed by ranking all genes in the transcriptomic data according to the level of differential expression between two conditions (or phenotypes). The second generation of GSEA methods determines if a predefined geneset is significantly overrepresented toward the top or bottom of the ranked list.

These methods work as the following:

• Rank all genes by the magnitudes of their differential expression
• Select a window in the ranked list
• Define an enrichment score based on a weighted Kolmogorov Smirnov (WKS) test 2
• Simulate a background distribution of the enrichment score by shuffling samples and estimate the statistical significance of the geneset
• Repeat the first 3 steps for all predefined genesets and various window sizes
• Correct for multiple hypotheses testing
• Ref: 3

Noteworthy to mention that the second generation of GSEA is methodologically similar to 1st generation; because they put a cut-off for the window of differentially expressed genes. In other words, even though the score that they develop is through running down the list of differential expressed genes and penalize when the gene is out of geneset, they are still not wholly free of cut-off.

One of the available tools to run this method is GSEA software developed by Broad Institute, accessible via:

## Third Generation GSEA (Geneset Signature Score)

One of the most popular set of statistical methods that have been established as another alternative for GSEA work by developing a score for each sample (condition, or phenotype) based on a predefined geneset, and then compare these scores to evaluate the enrichment of the geneset in each condition. These methods are also called geneset signature score extraction. After defining a score for each sample, we can compare the signature scores in different conditions using conventional statistical methods, such as a t-test or ANOVA test.

One of the most used method in publications to develop a score based on transcriptomics data is Gene Set Variation Analysis (GSVA) 4:

• First, it calculates the expression level statistics for unifying the expression level of the genes between samples.
• Then the expression level statistics are ranked for each sample.
• For each predefined geneset, the Kolmogorov-Smirnov-like rank statistic is calculated
• And it constructs the score based on Kolmogorov-Smirnov-like rank statistic.

In the following you can see an example of running GSVA in R:

# Install GSVA R package from   Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GSVA")

# Run GSVA on sample data
library(GSVA)

# 1) load a gene expression data:
# in here we use a gene expression dataset  from  Verhaak et al., 2010, for Glioblastoma (GBM) patiant samples, including 4 subtypes: roneural, classical, neuraland  mesenchymal.

expr <- gbm_eset@assayData$exprs # expression matrix: rows -> genes , columns -> samples subtypes <- gbm_eset$subtype # leukemia samples sybtypes

# 2) load pre-defined genesets to be used for signature score extraction
# in here we are manually define 3 gene-signatures for oligodendrocytic, astroglia, and astrocytic cell types. (the signatures were extracted from brainTxDbSets)

oligodendrocytic_up <- c("DCT","ZNF536","GNG8","ELOVL6","NR2C1","RCBTB1")
astroglia_up <- c("BST2","SERPING1","ACTA2","C9orf167","C1orf31","ANXA4")
astrocytic_up <- c("GRHL1","GPAM","PAPSS2","MERTK","BTG1","SLC46A1")

signature.list <- list(oligodendrocytic = oligodendrocytic_up,
astroglia = astroglia_up,
astrocytic = astrocytic_up) # defining the final signature list

# And now run gsva analysis on the GBM expression dataset

score <- gsva(expr,
gset.idx.list = signature.list,
method = 'ssgsea', # different methods can be used for score extration
ssgsea.norm = TRUE # to normalize the scores across the samples
)

The result is a table with the signatures in the rows, and samples in the columns:

##                  TCGA.02.0003.01A.01 TCGA.02.0010.01A.01 TCGA.02.0011.01B.01
## oligodendrocytic           0.2518755           0.3154164          0.03539221
## astroglia                 -0.1792207          -0.3525300         -0.17440859
## astrocytic                -0.1480899          -0.2814492          0.04311682

For representation, in here, we choose astroglia signature, which should be highly enriched in Mesenchymal subtype, and plot the signature scores for each subtype, using a boxplot:

astroglia.signature <- score ['astroglia',]
plot.df <- data.frame(astroglia.signature,
subtypes) # defining a dataframe including the signature score and the sybtypes
# here we show the result using Boxplot
boxplot(astroglia.signature ~ subtypes, data = plot.df)
stripchart(astroglia.signature ~ subtypes, data = plot.df,
vertical = TRUE,
pch = 20)