Overview

In this tutorial we will learn how to perform Gene Set Enrichment Analysis on differential gene expression results. We will use the R package fgsea for fast preranked gene set enrichment analysis (GSEA). GSEA is a method used to determine whether a predefined set of genes shows statistically significant differences in expression between two biological conditions. GSEA does not rely on a threshold (like fold-change) and instead assesses whether the genes in a set are overrepresented at the extremes of a ranked list of genes.

If you are not familiar with GSEA, don’t worry! You can check out the workshop material here

If you are not familiar with differential expression analysis, don’t worry! You can check out my previous workshop and the R demo

How does GSEA work?

The fgsea package in R provides a fast and efficient method for performing preranked GSEA. It enables precise and rapid calculation of extremely low GSEA p-values across a collection of gene sets. The p-value estimation employs an adaptive multi-level split Monte Carlo approach.

Loading Libraries

# Load necessary libraries
library(readr)
library(formattable)
# Install and load necessary packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
#BiocManager::install(c("fgsea", ))

library(fgsea)
library(ggplot2)
library(data.table)
library(dplyr)
library(msigdbr)
library(pander)
library(ggrepel)
library(ggnewscale)

What do we need?

Essentially, there are two things you need to perform preranked gene set enrichment analysis with fgsea:

  1. pathways: A list of gene sets or pathways to check.
  2. stats: A named vector of your genes of interest for GSEA. The gene names must match the ones in the pathways list. Make sure you are using the same nomenclature (e.g., gene IDs or Ensembl IDs).

That’s it!

There are additional parameters you may be interested in, such as:

Prepare your pathways/gene sets

KEGG

Kyoto Encyclopedia of Genes and Genomes (KEGG): Primarily concerned with metabolic pathways (many species including humans).

# Load gene sets (e.g., from MSigDB or KEGG)
kegg_gene_sets = msigdbr(species = "Homo sapiens", category = "C2") %>% filter(gs_subcat=="CP:KEGG")
kegg_list_pathway = split(x = kegg_gene_sets$gene_symbol, f = kegg_gene_sets$gs_name)

pander(head(kegg_list_pathway))
  • KEGG_ABC_TRANSPORTERS: ABCA1, ABCA10, ABCA12, ABCA13, ABCA2, ABCA3, ABCA4, ABCA5, ABCA6, ABCA7, ABCA8, ABCA9, ABCB1, ABCB10, ABCB11, ABCB11, ABCB4, ABCB5, ABCB6, ABCB7, ABCB8, ABCB9, ABCC1, ABCC1, ABCC10, ABCC11, ABCC12, ABCC2, ABCC3, ABCC4, ABCC5, ABCC6, ABCC6, ABCC8, ABCC9, ABCD1, ABCD2, ABCD3, ABCD4, ABCG1, ABCG2, ABCG4, ABCG5, ABCG8, CFTR, TAP1, TAP1, TAP1, TAP1, TAP1, TAP1, TAP1, TAP1, TAP2, TAP2, TAP2, TAP2, TAP2, TAP2, TAP2 and TAP2
  • KEGG_ACUTE_MYELOID_LEUKEMIA: AKT1, AKT2, AKT3, AKT3, ARAF, BAD, BRAF, CCNA1, CCND1, CEBPA, CHUK, EIF4EBP1, FLT3, GRB2, HRAS, HRAS, IKBKB, IKBKG, JUP, KIT, KRAS, LEF1, MAP2K1, MAP2K2, MAPK1, MAPK3, MTOR, MYC, NFKB1, NRAS, PIK3CA, PIK3CB, PIK3CD, PIK3CG, PIK3R1, PIK3R2, PIK3R3, PIK3R5, PIM1, PIM2, PML, PPARD, RAF1, RARA, RELA, RPS6KB1, RPS6KB2, RUNX1, RUNX1T1, SOS1, SOS2, SPI1, STAT3, STAT5A, STAT5B, TCF7, TCF7L1, TCF7L2 and ZBTB16
  • KEGG_ADHERENS_JUNCTION: ACP1, ACTB, ACTG1, ACTN1, ACTN2, ACTN3, ACTN4, ACTN4, AFDN, BAIAP2, CDC42, CDH1, CREBBP, CSNK2A1, CSNK2A2, CSNK2B, CSNK2B, CSNK2B, CSNK2B, CSNK2B, CSNK2B, CSNK2B, CSNK2B, CTNNA1, CTNNA2, CTNNA3, CTNNB1, CTNND1, EGFR, EP300, ERBB2, FARP2, FER, FGFR1, FYN, IGF1R, INSR, IQGAP1, LEF1, LMO7, MAP3K7, MAPK1, MAPK3, MET, NECTIN1, NECTIN2, NECTIN3, NECTIN4, NLK, PARD3, PTPN1, PTPN6, PTPRB, PTPRF, PTPRJ, PTPRM, RAC1, RAC2, RAC3, RHOA, SMAD2, SMAD3, SMAD4, SNAI1, SNAI2, SORBS1, SRC, SSX2IP, TCF7, TCF7L1, TCF7L2, TGFBR1, TGFBR2, TJP1, TJP1, VCL, WAS, WASF1, WASF2, WASF3, WASL and YES1
  • KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY: ACACB, ACSL1, ACSL3, ACSL4, ACSL5, ACSL6, ADIPOQ, ADIPOR1, ADIPOR2, ADIPOR2, AGRP, AKT1, AKT2, AKT3, AKT3, CAMKK1, CAMKK2, CD36, CHUK, CPT1A, CPT1B, CPT1C, G6PC1, G6PC2, G6PC2, IKBKB, IKBKG, IRS1, IRS2, IRS4, JAK2, LEP, LEPR, MAPK10, MAPK8, MAPK9, MTOR, NFKB1, NFKBIA, NFKBIB, NFKBIB, NFKBIE, NPY, PCK1, PCK2, PCK2, POMC, PPARA, PPARGC1A, PRKAA1, PRKAA2, PRKAB1, PRKAB2, PRKAG1, PRKAG2, PRKAG3, PRKCQ, PTPN11, RELA, RXRA, RXRB, RXRB, RXRB, RXRB, RXRB, RXRB, RXRG, SLC2A1, SLC2A4, SLC2A4, SOCS3, STAT3, STK11, TNF, TNF, TNF, TNF, TNF, TNF, TNF, TNF, TNFRSF1A, TNFRSF1B, TRADD and TRAF2
  • KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM: ABAT, ACY3, ADSL, ADSS1, ADSS2, AGXT, AGXT2, ALDH4A1, ALDH5A1, ASL, ASNS, ASPA, ASS1, CAD, CPS1, DDO, GAD1, GAD2, GFPT1, GFPT2, GLS, GLS2, GLUD1, GLUD2, GLUD2, GLUL, GOT1, GOT2, GPT, GPT2, IL4I1, NIT2 and PPAT
  • KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION: ATP1A1, ATP1A2, ATP1A3, ATP1A4, ATP1B1, ATP1B2, ATP1B3, ATP1B4, FXYD2, FXYD4, HSD11B1, HSD11B2, IGF1, INS, INSR, IRS1, IRS2, IRS4, KCNJ1, KRAS, MAPK1, MAPK3, NEDD4L, NR3C2, PDPK1, PIK3CA, PIK3CB, PIK3CD, PIK3CG, PIK3R1, PIK3R2, PIK3R3, PIK3R5, PRKCA, PRKCB, PRKCG, SCNN1A, SCNN1B, SCNN1G, SFN, SGK1 and SLC9A3R2

REACTOME

reactome_gene_sets = msigdbr(species = "Homo sapiens", category = "C2") %>% filter(gs_subcat=="CP:REACTOME")
reactome_list_pathway = split(x = reactome_gene_sets$gene_symbol, f = reactome_gene_sets$gs_name)
pander(head(reactome_list_pathway))
  • REACTOME_2_LTR_CIRCLE_FORMATION: BANF1, HMGA1, LIG4, PSIP1, XRCC4, XRCC5 and XRCC6
  • REACTOME_A_TETRASACCHARIDE_LINKER_SEQUENCE_IS_REQUIRED_FOR_GAG_SYNTHESIS: AGRN, B3GALT6, B3GAT1, B3GAT2, B3GAT3, B4GALT7, BCAN, BGN, CSPG4, CSPG5, DCN, GPC1, GPC2, GPC3, GPC4, GPC5, GPC6, HSPG2, NCAN, SDC1, SDC2, SDC3, SDC4, VCAN, XYLT1 and XYLT2
  • REACTOME_ABACAVIR_METABOLISM: ADAL, ADH1A, GUK1, NT5C2 and PCK1
  • REACTOME_ABACAVIR_TRANSMEMBRANE_TRANSPORT: ABCB1, ABCG2, SLC22A1, SLC22A2 and SLC22A3
  • REACTOME_ABACAVIR_TRANSPORT_AND_METABOLISM: ABCB1, ABCG2, ADAL, ADH1A, GUK1, NT5C2, PCK1, SLC22A1, SLC22A2 and SLC22A3
  • REACTOME_ABC_FAMILY_PROTEINS_MEDIATED_TRANSPORT: ABCA10, ABCA12, ABCA2, ABCA3, ABCA4, ABCA5, ABCA6, ABCA7, ABCA8, ABCA9, ABCB1, ABCB10, ABCB4, ABCB5, ABCB6, ABCB7, ABCB8, ABCB9, ABCC1, ABCC1, ABCC10, ABCC11, ABCC2, ABCC3, ABCC4, ABCC5, ABCC6, ABCC6, ABCC9, ABCD1, ABCD2, ABCD3, ABCF1, ABCF1, ABCF1, ABCF1, ABCF1, ABCF1, ABCF1, ABCG1, ABCG4, ABCG5, ABCG8, APOA1, CFTR, DERL1, DERL2, DERL3, DERL3, EIF2S1, EIF2S2, EIF2S3, ERLEC1, ERLIN1, ERLIN2, KCNJ11, OS9, PEX19, PEX3, PSMA1, PSMA2, PSMA3, PSMA4, PSMA5, PSMA6, PSMA7, PSMA8, PSMB1, PSMB10, PSMB11, PSMB2, PSMB3, PSMB3, PSMB4, PSMB5, PSMB6, PSMB7, PSMB8, PSMB8, PSMB8, PSMB8, PSMB8, PSMB8, PSMB8, PSMB8, PSMB9, PSMB9, PSMB9, PSMB9, PSMB9, PSMB9, PSMB9, PSMB9, PSMC1, PSMC2, PSMC3, PSMC4, PSMC4, PSMC5, PSMC6, PSMD1, PSMD10, PSMD11, PSMD12, PSMD13, PSMD14, PSMD2, PSMD3, PSMD4, PSMD5, PSMD6, PSMD7, PSMD8, PSMD9, PSME1, PSME1, PSME2, PSME2, PSME3, PSME4, PSMF1, RNF185, RNF5, RNF5, RNF5, RNF5, RNF5, RNF5, RNF5, RPS27A, SEL1L, SEM1, UBA52, UBB, UBC and VCP

GO

Gene Ontology (GO): Focuses on biological processes, molecular functions, and cellular components (many species including humans).

Gene Ontology is a controlled vocabulary for describing gene functions. It consists of three ontologies: 1. Biological Processes (BP): Describes large biological pathways or events (e.g., “cell cycle”). 2. Molecular Functions (MF): Describes molecular activities (e.g., “protein kinase activity”). 3. Cellular Components (CC): Describes the location within the cell (e.g., “plasma membrane”).

GO BP

Reactome: A curated database of human molecular pathways (only humans).

go_bp_gene_sets = msigdbr(species = "Homo sapiens", category = "C5") %>% filter(gs_subcat=="GO:BP")
go_bp_list_pathway = split(x = go_bp_gene_sets$gene_symbol, f = go_bp_gene_sets$gs_name)
pander(head(go_bp_list_pathway))
  • GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS: AASDHPPT, ALDH1L1, ALDH1L2, MTHFD1, MTHFD1L and MTHFD2L
  • GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS: AADAT, ADHFE1, D2HGDH, DLST, GOT1, GOT2, GPT2, IDH1, IDH2, IDH3B, KYAT3, L2HGDH, MRPS36, MRPS36, OGDH, OGDHL, PHYH and TAT
  • GOBP_2FE_2S_CLUSTER_ASSEMBLY: BOLA2, BOLA2B, GLRX3, GLRX5, HSCB and NFS1
  • GOBP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_BIOSYNTHETIC_PROCESS: PAPSS1, PAPSS2, SLC26A1, SLC26A2, SLC35B2 and SLC35B3
  • GOBP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_METABOLIC_PROCESS: ABHD14B, BPNT1, ENPP1, PAPSS1, PAPSS2, SLC26A1, SLC26A2, SLC35B2, SLC35B3, SULT1A1, SULT1A2, SULT1A3, SULT1A4, SULT1B1, SULT1C3, SULT1C4, SULT1E1, SULT2A1, SULT2B1, TPST1 and TPST2
  • GOBP_3_UTR_MEDIATED_MRNA_DESTABILIZATION: CPEB3, DHX36, DHX36, DND1, DND1, HNRNPD, KHSRP, MOV10, PLEKHN1, RBM24, RC3H1, TARDBP, TRIM71, UPF1, ZC3H12A, ZC3H12D, ZFP36, ZFP36L1 and ZFP36L2

GO CC

go_cc_gene_sets = msigdbr(species = "Homo sapiens", category = "C5") %>% filter(gs_subcat=="GO:CC")
go_cc_list_pathway = split(x = go_cc_gene_sets$gene_symbol, f = go_cc_gene_sets$gs_name)
pander(head(go_cc_list_pathway))
  • GOCC_90S_PRERIBOSOME: BMS1, BOP1, BOP1, ENSG00000275464, HEATR1, IMP3, IMP4, KRI1, MPHOSPH10, NOC2L, NOC4L, NOC4L, NOL6, NOP14, NOP9, NOP9, PES1, PWP2, RRP36, RRP7A, RRP7BP, RSL1D1, SLX9, SRFBP1, TBL3, UTP18, UTP20, UTP4, UTP4, UTP6, WDR12, WDR3 and WDR36
  • GOCC_9PLUS0_NON_MOTILE_CILIUM: ABCA4, ARL3, ARR3, ATP1A4, ATP8A2, BBS4, BBS7, BSG, C1orf115, C4orf47, CACNA1F, CCDC66, CDHR1, CEP250, CEP290, CERKL, CETN1, CETN2, CETN3, CFAP410, CIB2, CNGA1, CNGB1, CNGB3, CRB1, DHRS3, DYNLL2, EYS, FAM161A, GNA11, GNAQ, GNAT1, GNAT2, GNAT3, GNB1, GNGT1, GRK1, GRK1, GRK1, GRK4, GRK7, GUCA1A, GUCA1B, GUCA1C, GUCY2D, GUCY2F, IFT122, IFT140, IFT20, IFT52, IFT57, IFT80, IFTAP, IMPG1, INHA, IQCB1, KIAA1549, KIF17, KIFAP3, LCA5, LYAR, MAK, MAP1B, MERTK, MYO7A, MYRIP, NAPEPLD, NAPEPLD, NPHP1, NPHP4, NXNL1, OCRL, OPN1LW, OPN1MW, OPN1MW2, OPN1MW3, OPN1SW, OPN3, OPN4, PCARE, PCDH15, PCDHB13, PCDHB15, PCDHB15, PDC, PDE6A, PDE6B, PDE6G, PDE6H, PEX6, PHLPP2, PHYH, PIP4K2A, PKHD1, PPEF2, PRCD, PROM1, PRPH2, PTGS1, RAB27A, RCVRN, RD3, RGR, RGS9BP, RHO, ROM1, RP1, RP1L1, RPGR, RPGRIP1, RPGRIP1L, RRH, SAG, SAG, SDCCAG8, SDCCAG8, SEPTIN2, SHANK2, SLC24A4, SMO, SPATA7, SPTBN5, TBCC, TMEM237, TOPORS, TSGA10IP, TTC8, TTLL4, TTLL6, TULP1, TULP3, USH1C, USH1G, USH2A, VCAN, WDR19 and WHRN
  • GOCC_9PLUS2_MOTILE_CILIUM: ABHD2, ACE, AK1, AK2, AK8, AKAP3, AKAP4, ATP1A1, ATP1A4, ATP1B1, ATP1B3, ATP2B4, CABCOCO1, CABS1, CABYR, CAPZB, CATSPER4, CATSPERD, CATSPERE, CATSPERG, CATSPERZ, CCDC172, CCDC181, CCDC189, CCDC39, CCDC39, CCR6, CD52, CETN2, CFAP221, CFAP251, CFAP43, CFAP45, CFAP47, CFAP52, CFAP58, CFAP65, CFAP69, CFAP70, CROCC, CST11, CUL3, DCDC2C, DEFB1, DEFB1, DNAH1, DNAH11, DNAH17, DNAH2, DNAH5, DNAH8, DNAH9, DNAI1, DNAI2, DNAJB13, DNALI1, DRC3, DRD2, DYNLT4, EFCAB2, EFCAB9, ENKUR, ENO4, FLACC1, FSCB, FSIP2, GABARAP, GAS8, GK2, GSTM3, HSPD1, HYAL3, IFT172, IFT27, IFT81, IFT88, IL4I1, IQCG, KIF9, LYZL4, LYZL6, LYZL6, MAFIP, MNS1, MROH2B, NME5, NME8, ODAD4, ODF1, ODF2, ODF3, PACRG, PFKM, PGAM4, PGK2, PMFBP1, PRKACA, PRKACA, PRKAR1A, QRICH2, RAN, RAP1A, RHO, RNF38, RPGR, RSPH1, RSPH6A, RSPH9, SAXO1, SAXO2, SCNN1A, SEPTIN12, SEPTIN2, SEPTIN4, SEPTIN6, SEPTIN7, SLC22A14, SLC26A3, SLC26A6, SLC9A3R1, SLC9B1, SLC9B2, SLIRP, SPA17, SPACA3, SPACA9, SPAG16, SPAG6, SPATA6, SPATA6L, SPATC1L, SPATC1L, SPEF1, SPEF2, SQSTM1, SQSTM1, SUN5, TACR1, TACR2, TACR3, TBC1D21, TCP11, TCP11X1, TCTE1, TEKT3, TEKT4, TEKT4, TEKT5, TOMM20, TPPP2, TSSK4, TSSK4, TTC29, TTLL3, TTLL8, VPS13A and WBP2NL
  • GOCC_A_BAND: ALDOA, ANK1, ANK2, CMYA5, CRYAB, DST, ENO1, KAT2B, KCTD6, KLHL40, KLHL41, LMOD2, LMOD3, LRRC39, MYBPC3, MYH1, MYL2, MYL3, MYL4, MYL7, MYOM1, MYOM2, MYOM2, MYOM3, NBR1, OBSCN, OBSL1, PPP1R12A, PPP1R12B, PPP2R5A, SMPX, SMTNL1, SMTNL2, SPTBN1 and TRIM63
  • GOCC_ACETYLCHOLINE_GATED_CHANNEL_COMPLEX: CHRFAM7A, CHRFAM7A, CHRNA1, CHRNA2, CHRNA3, CHRNA4, CHRNA5, CHRNA6, CHRNA7, CHRNA7, CHRNA7, CHRNA9, CHRNB1, CHRNB1, CHRNB2, CHRNB3, CHRNB4, CHRNE and STXBP5
  • GOCC_ACROSOMAL_MEMBRANE: ACRBP, ATP8B3, BSG, CAV1, CAV2, CCDC136, CD46, CFAP65, EQTN, FAM170B, FLOT2, HYAL3, IZUMO1, MFGE8, PCSK4, PLA1A, RND2, SERPINA5, SPACA1, SPACA3, SPACA6, TEKT3, TEX101, TMEM190, TMEM225, TMEM95, TMEM95, TRIP11 and ZPBP

GO MF

go_mf_gene_sets = msigdbr(species = "Homo sapiens", category = "C5") %>% filter(gs_subcat=="GO:MF")
go_mf_list_pathway = split(x = go_mf_gene_sets$gene_symbol, f = go_mf_gene_sets$gs_name)
pander(head(go_mf_list_pathway))
  • GOMF_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACTIVITY: GDPGP1, MTAP, PYGB, PYGL, PYGM and TYMP
  • GOMF_1_ACYLGLYCEROPHOSPHOCHOLINE_O_ACYLTRANSFERASE_ACTIVITY: LPCAT1, LPCAT1, LPCAT2, LPCAT3, LPCAT4, MBOAT1, MBOAT2, PRDX6 and TAFAZZIN
  • GOMF_1_ALKYL_2_ACETYLGLYCEROPHOSPHOCHOLINE_ESTERASE_ACTIVITY: ASPG, LCAT, PAFAH1B2, PAFAH1B3, PAFAH2, PLA2G10, PLA2G10, PLA2G6 and PLA2G7
  • GOMF_1_PHOSPHATIDYLINOSITOL_3_KINASE_ACTIVITY: ATM, PIK3C2A, PIK3C2B, PIK3C2G, PIK3C3, PIK3CA, PIK3CB, PIK3CD, PIK3CG and PIK3R3
  • GOMF_1_PHOSPHATIDYLINOSITOL_3_KINASE_REGULATOR_ACTIVITY: CCKBR, CISH, P3R3URF, PIK3R1, PIK3R2, PIK3R3, PIK3R5, PIK3R6, SLA2, SOCS1, SOCS2, SOCS3, SOCS4, SOCS5, SOCS6, SOCS7 and SOCS7
  • GOMF_1_PHOSPHATIDYLINOSITOL_4_KINASE_ACTIVITY: PI4K2A, PI4K2B, PI4KA, PI4KAP1, PI4KAP2 and PI4KB

ALL OF THEM

# Look at them all if you want (uncomment)
#hallmark_gene_sets = msigdbr(species = "Homo sapiens", category = "H") 
#hallmark_list = split(x = hallmark_gene_sets$gene_symbol, f = hallmark_gene_sets$gs_name)

For this demo, we are going to use KEGG as a gene sets

# Show the first few pathways, and within those, show only the first few genes. 
kegg_list_pathway %>% 
  head() %>% 
  lapply(head)
## $KEGG_ABC_TRANSPORTERS
## [1] "ABCA1"  "ABCA10" "ABCA12" "ABCA13" "ABCA2"  "ABCA3" 
## 
## $KEGG_ACUTE_MYELOID_LEUKEMIA
## [1] "AKT1" "AKT2" "AKT3" "AKT3" "ARAF" "BAD" 
## 
## $KEGG_ADHERENS_JUNCTION
## [1] "ACP1"  "ACTB"  "ACTG1" "ACTN1" "ACTN2" "ACTN3"
## 
## $KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY
## [1] "ACACB" "ACSL1" "ACSL3" "ACSL4" "ACSL5" "ACSL6"
## 
## $KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM
## [1] "ABAT"  "ACY3"  "ADSL"  "ADSS1" "ADSS2" "AGXT" 
## 
## $KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION
## [1] "ATP1A1" "ATP1A2" "ATP1A3" "ATP1A4" "ATP1B1" "ATP1B2"

Import Differential Expression Analysis Results (DEGs)

# Import your differential expressed genes (DEGs) table from Github
urlfile="https://raw.githubusercontent.com/merlinis12/Downstream-Analysis-of-RNA-Seq-Results/refs/heads/main/data/DESeq2_results_notfiltered.csv"
deg_results <- read_csv(url(urlfile))

# View data dimensions and initial rows
dim(deg_results)
## [1] 25258     7
formattable(head(deg_results))
gene_id baseMean log2FoldChange lfcSE stat pvalue padj
ENSG00000152583 954.7709 4.368359 0.23712679 18.42204 8.74e-76 1.34e-71
ENSG00000179094 743.2527 2.863889 0.17556931 16.31201 8.11e-60 6.22e-56
ENSG00000116584 2277.9135 -1.034701 0.06509844 -15.89440 6.93e-57 3.54e-53
ENSG00000189221 2383.7537 3.341544 0.21240579 15.73189 9.14e-56 3.51e-52
ENSG00000120129 3440.7038 2.965211 0.20369513 14.55710 5.26e-48 1.62e-44
ENSG00000148175 13493.9204 1.427168 0.10038904 14.21638 7.25e-46 1.85e-42

Convert Gene ID in the SYMBOL format

For human data, AnnotationDbi and org.Hs.eg.db provide another reliable way to convert IDs without needing an internet connection.

  1. Load Packages
library(org.Hs.eg.db)
  1. How to know your Gene IDs
# Example gene ID vector
gene_ids <- deg_results$gene_id

# Define possible ID types to test
possible_id_types <- c("SYMBOL", "ENTREZID", "ENSEMBL", "REFSEQ")

# Loop through possible ID types to check which one matches
for (id_type in possible_id_types) {
  result <- tryCatch({
    mapIds(org.Hs.eg.db, keys = gene_ids, column = "SYMBOL", keytype = id_type)
  }, error = function(e) NA)  # Ignore errors

  if (!all(is.na(result))) {
    cat("The IDs match the type:", id_type, "\n")
  }
}
## The IDs match the type: ENSEMBL

Now, you want to convert them to gene symbols:

ens2symbol <- AnnotationDbi::select(org.Hs.eg.db,
                                    key=deg_results$gene_id, 
                                    columns="SYMBOL",
                                    keytype="ENSEMBL")
ens2symbol <- as_tibble(ens2symbol)
formattable(head(ens2symbol))
ENSEMBL SYMBOL
ENSG00000152583 SPARCL1
ENSG00000179094 PER1
ENSG00000116584 ARHGEF2
ENSG00000189221 MAOA
ENSG00000120129 DUSP1
ENSG00000148175 STOM
deg_results <- inner_join(deg_results, ens2symbol, by=c("gene_id"="ENSEMBL"))
formattable(head(deg_results))
gene_id baseMean log2FoldChange lfcSE stat pvalue padj SYMBOL
ENSG00000152583 954.7709 4.368359 0.23712679 18.42204 8.74e-76 1.34e-71 SPARCL1
ENSG00000179094 743.2527 2.863889 0.17556931 16.31201 8.11e-60 6.22e-56 PER1
ENSG00000116584 2277.9135 -1.034701 0.06509844 -15.89440 6.93e-57 3.54e-53 ARHGEF2
ENSG00000189221 2383.7537 3.341544 0.21240579 15.73189 9.14e-56 3.51e-52 MAOA
ENSG00000120129 3440.7038 2.965211 0.20369513 14.55710 5.26e-48 1.62e-44 DUSP1
ENSG00000148175 13493.9204 1.427168 0.10038904 14.21638 7.25e-46 1.85e-42 STOM
write.csv(deg_results, "./deg_symbols.csv", row.names = FALSE)

Recap: Differential Gene Expression Analysis

Before diving into GSEA with fgsea, let’s briefly revisit where we left off in our previous workshop.

In our last session, we performed a Differential Gene Expression (DGE) analysis comparing gene expression between 4 control participants and 4 participants treated with dexamethasone (Dex), a corticosteroid.

Let’s plot the volcano plot now to remind ourselves of the DEGs that we’ll use for GSEA analysis.

# Add a column to categorize genes as upregulated, downregulated, or non-significant
results <- as.data.frame(deg_results)
results <- results %>%
  mutate(Significance = case_when(
    padj < 0.05 & log2FoldChange > 1 ~ "Upregulated",
    padj < 0.05 & log2FoldChange < -1 ~ "Downregulated",
    TRUE ~ "Non-significant"
  ))



#look at a subset of genes
results$label <- ifelse(results$SYMBOL %in% c("SPARCL1", "STOM", "PHC2", "MAOA", "ARHGEF2", "ZNF512", "MTHFSD", "CRISPLD2"), results$SYMBOL, NA)
results$is_labeled <- ifelse(results$SYMBOL %in% c("SPARCL1", "STOM", "PHC2", "MAOA", "ARHGEF2", "ZNF512", "MTHFSD", "CRISPLD2"), "Labeled", "Not Labeled")



# Volcano Plot
ggplot(results, aes(x = log2FoldChange, y = -log10(pvalue), color = Significance)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("blue", "grey", "red")) +
  new_scale_color() +
  geom_point(data = results[results$is_labeled == "Labeled", ], 
             aes(color = "Highlight"), size = 3, shape = 21) +
  scale_color_manual(values = c("Highlight" = "black")) +
  geom_text_repel(aes(label = label), size = 3, max.overlaps = 10, color = 'black', na.rm = TRUE) +
  labs(title = "Volcano Plot of Differentially Expressed Genes",
       x = "Log2 Fold Change",
       y = "-log10(P-Value)") +
  theme_minimal() +
  theme(legend.position = "top")

Prepare the ranked gene list

We would like to perform GSEA on our differentially expressed genes. Unlike pathway enrichment analysis, GSEA provides the advantage of not needing to subset our gene list. We can provide all the genes that resulted from our differential expression analysis. Even genes that were not significantly differentially expressed. Offering popular women’s necklaces such as pendants, chokers and. Shop for jewelry in a variety of metals and gemstones to suit any occasion.

We just need to rank our genes first.

The way it works is, you need to give fgsea a named vector with the rankings and the genes symbols as names.

ranked_genes = deg_results %>%
    na.omit() %>% 
    distinct() %>% 
    group_by(SYMBOL) %>% 
    mutate(rank = sign(log2FoldChange)*(-log10(pvalue))) %>%
    arrange(desc(rank))


# create gene list 
gene_list = ranked_genes$rank
names(gene_list) = toupper(ranked_genes$SYMBOL)

formattable(head(gene_list, 20))
## SPARCL1    PER1    MAOA   DUSP1    STOM  ZBTB16    PHC2  SAMHD1   FKBP5    NNMT 
##   75.06   59.09   55.04   47.28   45.14   42.87   41.39    40.9   40.63   39.71 
##    MT2A    NEXN    MT1X   SMIM3  PNPLA2  GASK1B  CCDC69 MORF4L2    GLUL  STEAP1 
##   38.85   38.61      35   33.15   31.74   31.58   30.87   28.79   27.51   26.68

This way, the magnitude of the ranking will be given by the p value (since I use -log10(), the more significant a gene is differentially expressed, the larger that value will be). The sign of the ranking (positive or negative) will be given by the fold-change: upregulated genes will have positive rankings, downregulated genes will have negative rankings.

Run fgsea()

# Run fgsea
fgsea_results <- fgsea(pathways = kegg_list_pathway,
                       stats = gene_list, nperm = 1000,
                       scoreType = 'std', # in this case we have both pos and neg rankings. if only pos or neg, set to 'pos', 'neg'
                       minSize = 10,
                       maxSize = 500)

fgsea returns a dataframe with the names of your pathways or gene sets, p-values and p-adjusted values, the ES, NES, the size of the gene sets and the leadingEdge.

fgseaResTidy <- fgsea_results %>%
  as_tibble() %>%
  arrange(desc(NES))


cat("Number of pathways in fgsea results:", nrow(fgseaResTidy), "\n")
## Number of pathways in fgsea results: 180
# Show in a nice table:
fgseaResTidy %>% 
  dplyr::select(-leadingEdge, -ES, -nMoreExtreme) %>% 
  arrange(padj) %>% 
  DT::datatable()

After performing Gene Set Enrichment Analysis (GSEA) using the fgsea package, we obtain the following key outputs:

  1. Pathway Name: The name of each gene set or pathway tested.
  2. Enrichment Score (ES): Measures the degree to which a pathway is overrepresented at the top or bottom of the ranked gene list.
  3. Normalized Enrichment Score (NES): The ES normalized across all gene sets, accounting for differences in pathway sizes, making it comparable across analyses.
  4. p-value: The probability that the observed enrichment score is due to random chance.
  5. Adjusted p-value (q-value): Corrected p-value to account for multiple testing using methods like Benjamini-Hochberg.
  6. Leading Edge Genes: Subset of genes that contribute most to the enrichment signal.

Interpreting the Results

  • Pathways with a low q-value (e.g., < 0.05) are considered significantly enriched in your ranked gene list.
  • Positive NES indicates pathways enriched with upregulated genes, while negative NES indicates pathways enriched with downregulated genes.
  • The leading edge genes provide insights into the core contributors driving the enrichment of each pathway.

Visualize fgsea results

fgseaResTidy_filtered <- filter(fgseaResTidy, pval<= 0.05)

plot_dat = fgseaResTidy %>%
    arrange((NES)) %>%
    mutate(sig = ifelse(padj < 0.05,"*"," "))

plot_dat$pathway=factor(plot_dat$pathway,levels=plot_dat$pathway,ordered=T)

ggplot(plot_dat, aes(NES,pathway,fill=NES,label=sig))+
    geom_col(color="black")+
    geom_text(size=5)+
    scale_fill_gradient2(low="purple",high="orange2",midpoint=0)+
    theme_bw()+
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5), # Adjust title size and alignment
    plot.subtitle = element_text(size = 10, hjust = 0.5),             # Optional subtitle customization
    plot.caption = element_text(size = 10, hjust = 0.5)              # Optional caption customization
  ) +
  labs(
    title = "KEGG pathways NES for GSEA",
    subtitle = "Normalised Enrichment Scores for Pathways",
    x = "Normalised Enrichment Score (NES)",
    y = "Pathway (KEGG)",
    fill = "NES"
  )

    labs(x="Normalised enrichment score (NES)",y="Pathway (KEGG)",fill="NES")
## $x
## [1] "Normalised enrichment score (NES)"
## 
## $y
## [1] "Pathway (KEGG)"
## 
## $fill
## [1] "NES"
## 
## attr(,"class")
## [1] "labels"

The list of pathways resulting from fgsea analysis can often be very long, making it challenging to identify the most relevant biological insights. To manage this, we can take two main approaches:

  1. Filtering by Adjusted p-value (q-value):
    • Pathways with a low q-value (e.g., < 0.05) are considered statistically significant.
    • Filtering results by q-value allows us to focus only on the most enriched pathways, reducing noise from less significant results.
  2. Grouping Similar Pathways with collapsePathways():
    • Often, many enriched pathways are functionally related, leading to redundancy in the results.
    • The collapsePathways() function in the fgsea package groups similar pathways together, retaining only the most representative pathway from each cluster.
    • This approach simplifies interpretation by providing a more concise and biologically meaningful summary.

TIP By combining filtering and collapsing pathways, you can focus on a manageable number of biologically relevant pathways while reducing redundancy in your analysis results.

fgseaResTidy_filtered <- filter(fgseaResTidy, pval<= 0.05)

plot_dat = fgseaResTidy_filtered %>%
    arrange((NES)) %>%
    mutate(sig = ifelse(padj < 0.05,"*"," "))

plot_dat$pathway=factor(plot_dat$pathway,levels=plot_dat$pathway,ordered=T)

ggplot(plot_dat, aes(NES,pathway,fill=NES,label=sig))+
    geom_col(color="black")+
    geom_text(size=5)+
    scale_fill_gradient2(low="purple",high="orange2",midpoint=0)+
    theme_bw()+
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5), # Adjust title size and alignment
    plot.subtitle = element_text(size = 12, hjust = 0.5),             # Optional subtitle customization
    plot.caption = element_text(size = 10, hjust = 0.5)              # Optional caption customization
  ) +
  labs(
    title = "KEGG pathways NES for GSEA",
    subtitle = "Pval < 0.05, * if padj < 0.05",
    x = "Normalised Enrichment Score (NES)",
    y = "Pathway (KEGG)",
    fill = "NES"
  )

# Visualize results
ggplot(fgseaResTidy_filtered, aes(reorder(pathway, NES), NES)) +
  geom_col(aes(fill=padj<0.05)) +
  coord_flip() +
  theme_bw()+
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5), # Adjust title size and alignment
    plot.subtitle = element_text(size = 12, hjust = 0.5),             # Optional subtitle customization
    plot.caption = element_text(size = 10, hjust = 0.5)              # Optional caption customization
  ) +
  labs(
    title = "KEGG pathways NES for GSEA",
    subtitle = "Pval < 0.05",
    x = "Normalised Enrichment Score (NES)",
    y = "Pathway (KEGG)"  )

You can also select only independent pathways, removing redundancies or similar pathways with the function collapsePathways():

# Select only independent pathways, removing redundancies/similar pathways
collapsedPathways <- collapsePathways(fgsea_results[order(pval)][pval < 0.05],
                                        pathways = kegg_list_pathway,
                                        stats = gene_list)
gsea_res_collapsed <- fgsea_results[pathway %in% collapsedPathways$mainPathways]
  
ggplot(gsea_res_collapsed, aes(reorder(pathway, NES), NES)) +
  geom_col(aes(fill=padj<0.05)) +
  coord_flip() +
  theme_bw()+
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5), # Adjust title size and alignment
    plot.subtitle = element_text(size = 12, hjust = 0.5),             # Optional subtitle customization
    plot.caption = element_text(size = 10, hjust = 0.5)              # Optional caption customization
  ) +
  labs(
    title = "KEGG pathways NES for GSEA",
    subtitle = "Pval < 0.05, pathway collapsed",
    x = "Normalised Enrichment Score (NES)",
    y = "Pathway (KEGG)")

Save results

# Convert 'leadingEdge' to a comma-separated string
fgsea_results$leadingEdge <- sapply(fgsea_results$leadingEdge, function(x) paste(x, collapse = ","))

write.csv(fgsea_results, "./Results/fgsea_results.csv", row.names = FALSE)