In this tutorial we will learn how to perform Gene Set Enrichment Analysis on differential gene expression results. We will use the R package fgsea for fast preranked gene set enrichment analysis (GSEA). GSEA is a method used to determine whether a predefined set of genes shows statistically significant differences in expression between two biological conditions. GSEA does not rely on a threshold (like fold-change) and instead assesses whether the genes in a set are overrepresented at the extremes of a ranked list of genes.
If you are not familiar with GSEA, don’t worry! You can check out the workshop material here
If you are not familiar with differential expression analysis, don’t worry! You can check out my previous workshop and the R demo
The fgsea
package in R provides a fast and efficient
method for performing preranked GSEA. It enables precise and rapid
calculation of extremely low GSEA p-values across a collection of gene
sets. The p-value estimation employs an adaptive multi-level split Monte
Carlo approach.
# Load necessary libraries
library(readr)
library(formattable)
# Install and load necessary packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
#BiocManager::install(c("fgsea", ))
library(fgsea)
library(ggplot2)
library(data.table)
library(dplyr)
library(msigdbr)
library(pander)
library(ggrepel)
library(ggnewscale)
Essentially, there are two things you need to perform preranked gene
set enrichment analysis with fgsea
:
That’s it!
There are additional parameters you may be interested in, such as:
std
, where
the enrichment score is computed normally. However, you can also use
one-tailed tests if you are only interested in positive
(pos
) enrichment – so only pathways that are
overrepresented – or negative (neg
) enrichment – to only
get pathways that are underrepresented in your list of genes.Kyoto Encyclopedia of Genes and Genomes (KEGG): Primarily concerned with metabolic pathways (many species including humans).
# Load gene sets (e.g., from MSigDB or KEGG)
kegg_gene_sets = msigdbr(species = "Homo sapiens", category = "C2") %>% filter(gs_subcat=="CP:KEGG")
kegg_list_pathway = split(x = kegg_gene_sets$gene_symbol, f = kegg_gene_sets$gs_name)
pander(head(kegg_list_pathway))
reactome_gene_sets = msigdbr(species = "Homo sapiens", category = "C2") %>% filter(gs_subcat=="CP:REACTOME")
reactome_list_pathway = split(x = reactome_gene_sets$gene_symbol, f = reactome_gene_sets$gs_name)
pander(head(reactome_list_pathway))
Gene Ontology (GO): Focuses on biological processes, molecular functions, and cellular components (many species including humans).
Gene Ontology is a controlled vocabulary for describing gene functions. It consists of three ontologies: 1. Biological Processes (BP): Describes large biological pathways or events (e.g., “cell cycle”). 2. Molecular Functions (MF): Describes molecular activities (e.g., “protein kinase activity”). 3. Cellular Components (CC): Describes the location within the cell (e.g., “plasma membrane”).
Reactome: A curated database of human molecular pathways (only humans).
go_bp_gene_sets = msigdbr(species = "Homo sapiens", category = "C5") %>% filter(gs_subcat=="GO:BP")
go_bp_list_pathway = split(x = go_bp_gene_sets$gene_symbol, f = go_bp_gene_sets$gs_name)
pander(head(go_bp_list_pathway))
go_cc_gene_sets = msigdbr(species = "Homo sapiens", category = "C5") %>% filter(gs_subcat=="GO:CC")
go_cc_list_pathway = split(x = go_cc_gene_sets$gene_symbol, f = go_cc_gene_sets$gs_name)
pander(head(go_cc_list_pathway))
go_mf_gene_sets = msigdbr(species = "Homo sapiens", category = "C5") %>% filter(gs_subcat=="GO:MF")
go_mf_list_pathway = split(x = go_mf_gene_sets$gene_symbol, f = go_mf_gene_sets$gs_name)
pander(head(go_mf_list_pathway))
# Look at them all if you want (uncomment)
#hallmark_gene_sets = msigdbr(species = "Homo sapiens", category = "H")
#hallmark_list = split(x = hallmark_gene_sets$gene_symbol, f = hallmark_gene_sets$gs_name)
For this demo, we are going to use KEGG as a gene sets
# Show the first few pathways, and within those, show only the first few genes.
kegg_list_pathway %>%
head() %>%
lapply(head)
## $KEGG_ABC_TRANSPORTERS
## [1] "ABCA1" "ABCA10" "ABCA12" "ABCA13" "ABCA2" "ABCA3"
##
## $KEGG_ACUTE_MYELOID_LEUKEMIA
## [1] "AKT1" "AKT2" "AKT3" "AKT3" "ARAF" "BAD"
##
## $KEGG_ADHERENS_JUNCTION
## [1] "ACP1" "ACTB" "ACTG1" "ACTN1" "ACTN2" "ACTN3"
##
## $KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY
## [1] "ACACB" "ACSL1" "ACSL3" "ACSL4" "ACSL5" "ACSL6"
##
## $KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM
## [1] "ABAT" "ACY3" "ADSL" "ADSS1" "ADSS2" "AGXT"
##
## $KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION
## [1] "ATP1A1" "ATP1A2" "ATP1A3" "ATP1A4" "ATP1B1" "ATP1B2"
# Import your differential expressed genes (DEGs) table from Github
urlfile="https://raw.githubusercontent.com/merlinis12/Downstream-Analysis-of-RNA-Seq-Results/refs/heads/main/data/DESeq2_results_notfiltered.csv"
deg_results <- read_csv(url(urlfile))
# View data dimensions and initial rows
dim(deg_results)
## [1] 25258 7
formattable(head(deg_results))
gene_id | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
---|---|---|---|---|---|---|
ENSG00000152583 | 954.7709 | 4.368359 | 0.23712679 | 18.42204 | 8.74e-76 | 1.34e-71 |
ENSG00000179094 | 743.2527 | 2.863889 | 0.17556931 | 16.31201 | 8.11e-60 | 6.22e-56 |
ENSG00000116584 | 2277.9135 | -1.034701 | 0.06509844 | -15.89440 | 6.93e-57 | 3.54e-53 |
ENSG00000189221 | 2383.7537 | 3.341544 | 0.21240579 | 15.73189 | 9.14e-56 | 3.51e-52 |
ENSG00000120129 | 3440.7038 | 2.965211 | 0.20369513 | 14.55710 | 5.26e-48 | 1.62e-44 |
ENSG00000148175 | 13493.9204 | 1.427168 | 0.10038904 | 14.21638 | 7.25e-46 | 1.85e-42 |
For human data, AnnotationDbi and org.Hs.eg.db provide another reliable way to convert IDs without needing an internet connection.
library(org.Hs.eg.db)
# Example gene ID vector
gene_ids <- deg_results$gene_id
# Define possible ID types to test
possible_id_types <- c("SYMBOL", "ENTREZID", "ENSEMBL", "REFSEQ")
# Loop through possible ID types to check which one matches
for (id_type in possible_id_types) {
result <- tryCatch({
mapIds(org.Hs.eg.db, keys = gene_ids, column = "SYMBOL", keytype = id_type)
}, error = function(e) NA) # Ignore errors
if (!all(is.na(result))) {
cat("The IDs match the type:", id_type, "\n")
}
}
## The IDs match the type: ENSEMBL
Now, you want to convert them to gene symbols:
ens2symbol <- AnnotationDbi::select(org.Hs.eg.db,
key=deg_results$gene_id,
columns="SYMBOL",
keytype="ENSEMBL")
ens2symbol <- as_tibble(ens2symbol)
formattable(head(ens2symbol))
ENSEMBL | SYMBOL |
---|---|
ENSG00000152583 | SPARCL1 |
ENSG00000179094 | PER1 |
ENSG00000116584 | ARHGEF2 |
ENSG00000189221 | MAOA |
ENSG00000120129 | DUSP1 |
ENSG00000148175 | STOM |
deg_results <- inner_join(deg_results, ens2symbol, by=c("gene_id"="ENSEMBL"))
formattable(head(deg_results))
gene_id | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | SYMBOL |
---|---|---|---|---|---|---|---|
ENSG00000152583 | 954.7709 | 4.368359 | 0.23712679 | 18.42204 | 8.74e-76 | 1.34e-71 | SPARCL1 |
ENSG00000179094 | 743.2527 | 2.863889 | 0.17556931 | 16.31201 | 8.11e-60 | 6.22e-56 | PER1 |
ENSG00000116584 | 2277.9135 | -1.034701 | 0.06509844 | -15.89440 | 6.93e-57 | 3.54e-53 | ARHGEF2 |
ENSG00000189221 | 2383.7537 | 3.341544 | 0.21240579 | 15.73189 | 9.14e-56 | 3.51e-52 | MAOA |
ENSG00000120129 | 3440.7038 | 2.965211 | 0.20369513 | 14.55710 | 5.26e-48 | 1.62e-44 | DUSP1 |
ENSG00000148175 | 13493.9204 | 1.427168 | 0.10038904 | 14.21638 | 7.25e-46 | 1.85e-42 | STOM |
write.csv(deg_results, "./deg_symbols.csv", row.names = FALSE)
Before diving into GSEA with fgsea
, let’s briefly
revisit where we left off in our previous workshop.
In our last session, we performed a Differential Gene Expression (DGE) analysis comparing gene expression between 4 control participants and 4 participants treated with dexamethasone (Dex), a corticosteroid.
Let’s plot the volcano plot now to remind ourselves of the DEGs that we’ll use for GSEA analysis.
# Add a column to categorize genes as upregulated, downregulated, or non-significant
results <- as.data.frame(deg_results)
results <- results %>%
mutate(Significance = case_when(
padj < 0.05 & log2FoldChange > 1 ~ "Upregulated",
padj < 0.05 & log2FoldChange < -1 ~ "Downregulated",
TRUE ~ "Non-significant"
))
#look at a subset of genes
results$label <- ifelse(results$SYMBOL %in% c("SPARCL1", "STOM", "PHC2", "MAOA", "ARHGEF2", "ZNF512", "MTHFSD", "CRISPLD2"), results$SYMBOL, NA)
results$is_labeled <- ifelse(results$SYMBOL %in% c("SPARCL1", "STOM", "PHC2", "MAOA", "ARHGEF2", "ZNF512", "MTHFSD", "CRISPLD2"), "Labeled", "Not Labeled")
# Volcano Plot
ggplot(results, aes(x = log2FoldChange, y = -log10(pvalue), color = Significance)) +
geom_point(alpha = 0.6, size = 1.5) +
scale_color_manual(values = c("blue", "grey", "red")) +
new_scale_color() +
geom_point(data = results[results$is_labeled == "Labeled", ],
aes(color = "Highlight"), size = 3, shape = 21) +
scale_color_manual(values = c("Highlight" = "black")) +
geom_text_repel(aes(label = label), size = 3, max.overlaps = 10, color = 'black', na.rm = TRUE) +
labs(title = "Volcano Plot of Differentially Expressed Genes",
x = "Log2 Fold Change",
y = "-log10(P-Value)") +
theme_minimal() +
theme(legend.position = "top")
We would like to perform GSEA on our differentially expressed genes. Unlike pathway enrichment analysis, GSEA provides the advantage of not needing to subset our gene list. We can provide all the genes that resulted from our differential expression analysis. Even genes that were not significantly differentially expressed. Offering popular women’s necklaces such as pendants, chokers and. Shop for jewelry in a variety of metals and gemstones to suit any occasion.
We just need to rank our genes first.
The way it works is, you need to give fgsea a named vector with the rankings and the genes symbols as names.
ranked_genes = deg_results %>%
na.omit() %>%
distinct() %>%
group_by(SYMBOL) %>%
mutate(rank = sign(log2FoldChange)*(-log10(pvalue))) %>%
arrange(desc(rank))
# create gene list
gene_list = ranked_genes$rank
names(gene_list) = toupper(ranked_genes$SYMBOL)
formattable(head(gene_list, 20))
## SPARCL1 PER1 MAOA DUSP1 STOM ZBTB16 PHC2 SAMHD1 FKBP5 NNMT
## 75.06 59.09 55.04 47.28 45.14 42.87 41.39 40.9 40.63 39.71
## MT2A NEXN MT1X SMIM3 PNPLA2 GASK1B CCDC69 MORF4L2 GLUL STEAP1
## 38.85 38.61 35 33.15 31.74 31.58 30.87 28.79 27.51 26.68
This way, the magnitude of the ranking will be given by the p value (since I use -log10(), the more significant a gene is differentially expressed, the larger that value will be). The sign of the ranking (positive or negative) will be given by the fold-change: upregulated genes will have positive rankings, downregulated genes will have negative rankings.
# Run fgsea
fgsea_results <- fgsea(pathways = kegg_list_pathway,
stats = gene_list, nperm = 1000,
scoreType = 'std', # in this case we have both pos and neg rankings. if only pos or neg, set to 'pos', 'neg'
minSize = 10,
maxSize = 500)
fgsea
returns a dataframe with the names of your
pathways or gene sets, p-values and p-adjusted values, the ES, NES, the
size of the gene sets and the leadingEdge.
fgseaResTidy <- fgsea_results %>%
as_tibble() %>%
arrange(desc(NES))
cat("Number of pathways in fgsea results:", nrow(fgseaResTidy), "\n")
## Number of pathways in fgsea results: 180
# Show in a nice table:
fgseaResTidy %>%
dplyr::select(-leadingEdge, -ES, -nMoreExtreme) %>%
arrange(padj) %>%
DT::datatable()
After performing Gene Set Enrichment Analysis (GSEA) using the
fgsea
package, we obtain the following key outputs:
fgsea
resultsfgseaResTidy_filtered <- filter(fgseaResTidy, pval<= 0.05)
plot_dat = fgseaResTidy %>%
arrange((NES)) %>%
mutate(sig = ifelse(padj < 0.05,"*"," "))
plot_dat$pathway=factor(plot_dat$pathway,levels=plot_dat$pathway,ordered=T)
ggplot(plot_dat, aes(NES,pathway,fill=NES,label=sig))+
geom_col(color="black")+
geom_text(size=5)+
scale_fill_gradient2(low="purple",high="orange2",midpoint=0)+
theme_bw()+
theme(
panel.grid = element_blank(),
plot.title = element_text(size = 12, face = "bold", hjust = 0.5), # Adjust title size and alignment
plot.subtitle = element_text(size = 10, hjust = 0.5), # Optional subtitle customization
plot.caption = element_text(size = 10, hjust = 0.5) # Optional caption customization
) +
labs(
title = "KEGG pathways NES for GSEA",
subtitle = "Normalised Enrichment Scores for Pathways",
x = "Normalised Enrichment Score (NES)",
y = "Pathway (KEGG)",
fill = "NES"
)
labs(x="Normalised enrichment score (NES)",y="Pathway (KEGG)",fill="NES")
## $x
## [1] "Normalised enrichment score (NES)"
##
## $y
## [1] "Pathway (KEGG)"
##
## $fill
## [1] "NES"
##
## attr(,"class")
## [1] "labels"
The list of pathways resulting from fgsea
analysis can
often be very long, making it challenging to identify the most relevant
biological insights. To manage this, we can take two main
approaches:
collapsePathways()
:
collapsePathways()
function in the
fgsea
package groups similar pathways together, retaining
only the most representative pathway from each cluster.TIP By combining filtering and collapsing pathways, you can focus on a manageable number of biologically relevant pathways while reducing redundancy in your analysis results.
fgseaResTidy_filtered <- filter(fgseaResTidy, pval<= 0.05)
plot_dat = fgseaResTidy_filtered %>%
arrange((NES)) %>%
mutate(sig = ifelse(padj < 0.05,"*"," "))
plot_dat$pathway=factor(plot_dat$pathway,levels=plot_dat$pathway,ordered=T)
ggplot(plot_dat, aes(NES,pathway,fill=NES,label=sig))+
geom_col(color="black")+
geom_text(size=5)+
scale_fill_gradient2(low="purple",high="orange2",midpoint=0)+
theme_bw()+
theme(
panel.grid = element_blank(),
plot.title = element_text(size = 15, face = "bold", hjust = 0.5), # Adjust title size and alignment
plot.subtitle = element_text(size = 12, hjust = 0.5), # Optional subtitle customization
plot.caption = element_text(size = 10, hjust = 0.5) # Optional caption customization
) +
labs(
title = "KEGG pathways NES for GSEA",
subtitle = "Pval < 0.05, * if padj < 0.05",
x = "Normalised Enrichment Score (NES)",
y = "Pathway (KEGG)",
fill = "NES"
)
# Visualize results
ggplot(fgseaResTidy_filtered, aes(reorder(pathway, NES), NES)) +
geom_col(aes(fill=padj<0.05)) +
coord_flip() +
theme_bw()+
theme(
panel.grid = element_blank(),
plot.title = element_text(size = 15, face = "bold", hjust = 0.5), # Adjust title size and alignment
plot.subtitle = element_text(size = 12, hjust = 0.5), # Optional subtitle customization
plot.caption = element_text(size = 10, hjust = 0.5) # Optional caption customization
) +
labs(
title = "KEGG pathways NES for GSEA",
subtitle = "Pval < 0.05",
x = "Normalised Enrichment Score (NES)",
y = "Pathway (KEGG)" )
You can also select only independent pathways, removing redundancies
or similar pathways with the function
collapsePathways()
:
# Select only independent pathways, removing redundancies/similar pathways
collapsedPathways <- collapsePathways(fgsea_results[order(pval)][pval < 0.05],
pathways = kegg_list_pathway,
stats = gene_list)
gsea_res_collapsed <- fgsea_results[pathway %in% collapsedPathways$mainPathways]
ggplot(gsea_res_collapsed, aes(reorder(pathway, NES), NES)) +
geom_col(aes(fill=padj<0.05)) +
coord_flip() +
theme_bw()+
theme(
panel.grid = element_blank(),
plot.title = element_text(size = 15, face = "bold", hjust = 0.5), # Adjust title size and alignment
plot.subtitle = element_text(size = 12, hjust = 0.5), # Optional subtitle customization
plot.caption = element_text(size = 10, hjust = 0.5) # Optional caption customization
) +
labs(
title = "KEGG pathways NES for GSEA",
subtitle = "Pval < 0.05, pathway collapsed",
x = "Normalised Enrichment Score (NES)",
y = "Pathway (KEGG)")
Save results
# Convert 'leadingEdge' to a comma-separated string
fgsea_results$leadingEdge <- sapply(fgsea_results$leadingEdge, function(x) paste(x, collapse = ","))
write.csv(fgsea_results, "./Results/fgsea_results.csv", row.names = FALSE)