Downstream Analysis of RNA-Seq Results in R: Gene Set Enrichment Analysis

Overview

In this tutorial we will learn how to perform Gene Set Enrichment Analysis on differential gene expression results. We will use the R package fgsea for fast preranked gene set enrichment analysis (GSEA). GSEA is a method used to determine whether a predefined set of genes shows statistically significant differences in expression between two biological conditions. GSEA does not rely on a threshold (like fold-change) and instead assesses whether the genes in a set are overrepresented at the extremes of a ranked list of genes.

If you are not familiar with GSEA, don’t worry! You can check out the workshop material here

If you are not familiar with differential expression analysis, don’t worry! You can check out my previous workshop and the R demo

How does GSEA work?

The fgsea package in R provides a fast and efficient method for performing preranked GSEA. It enables precise and rapid calculation of extremely low GSEA p-values across a collection of gene sets. The p-value estimation employs an adaptive multi-level split Monte Carlo approach.

Loading Libraries

# Load necessary libraries
library(readr)
library(formattable)
# Install and load necessary packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
#BiocManager::install(c("fgsea", ))

library(fgsea)
library(ggplot2)
library(data.table)
library(dplyr)
library(msigdbr)
library(pander)
library(ggrepel)
library(ggnewscale)

What do we need?

Essentially, there are two things you need to perform preranked gene set enrichment analysis with fgsea:

pathways: A list of gene sets or pathways to check.
stats: A named vector of your genes of interest for GSEA. The gene names must match the ones in the pathways list. Make sure you are using the same nomenclature (e.g., gene IDs or Ensembl IDs).

That’s it!

There are additional parameters you may be interested in, such as:

minSize: The minimal size of a gene set to test. Pathways below this threshold are excluded.
maxSize: The maximum size of a gene set to test. Pathways above this threshold are excluded.
scoreType: The default is std, where the enrichment score is computed normally. However, you can also use one-tailed tests if you are only interested in positive (pos) enrichment – so only pathways that are overrepresented – or negative (neg) enrichment – to only get pathways that are underrepresented in your list of genes.

Prepare your pathways/gene sets

KEGG

Kyoto Encyclopedia of Genes and Genomes (KEGG): Primarily concerned with metabolic pathways (many species including humans).

# Load gene sets (e.g., from MSigDB or KEGG)
kegg_gene_sets = msigdbr(species = "Homo sapiens", category = "C2") %>% filter(gs_subcat=="CP:KEGG")
kegg_list_pathway = split(x = kegg_gene_sets$gene_symbol, f = kegg_gene_sets$gs_name)

pander(head(kegg_list_pathway))

KEGG_ABC_TRANSPORTERS: ABCA1, ABCA10, ABCA12, ABCA13, ABCA2, ABCA3, ABCA4, ABCA5, ABCA6, ABCA7, ABCA8, ABCA9, ABCB1, ABCB10, ABCB11, ABCB11, ABCB4, ABCB5, ABCB6, ABCB7, ABCB8, ABCB9, ABCC1, ABCC1, ABCC10, ABCC11, ABCC12, ABCC2, ABCC3, ABCC4, ABCC5, ABCC6, ABCC6, ABCC8, ABCC9, ABCD1, ABCD2, ABCD3, ABCD4, ABCG1, ABCG2, ABCG4, ABCG5, ABCG8, CFTR, TAP1, TAP1, TAP1, TAP1, TAP1, TAP1, TAP1, TAP1, TAP2, TAP2, TAP2, TAP2, TAP2, TAP2, TAP2 and TAP2
KEGG_ACUTE_MYELOID_LEUKEMIA: AKT1, AKT2, AKT3, AKT3, ARAF, BAD, BRAF, CCNA1, CCND1, CEBPA, CHUK, EIF4EBP1, FLT3, GRB2, HRAS, HRAS, IKBKB, IKBKG, JUP, KIT, KRAS, LEF1, MAP2K1, MAP2K2, MAPK1, MAPK3, MTOR, MYC, NFKB1, NRAS, PIK3CA, PIK3CB, PIK3CD, PIK3CG, PIK3R1, PIK3R2, PIK3R3, PIK3R5, PIM1, PIM2, PML, PPARD, RAF1, RARA, RELA, RPS6KB1, RPS6KB2, RUNX1, RUNX1T1, SOS1, SOS2, SPI1, STAT3, STAT5A, STAT5B, TCF7, TCF7L1, TCF7L2 and ZBTB16
KEGG_ADHERENS_JUNCTION: ACP1, ACTB, ACTG1, ACTN1, ACTN2, ACTN3, ACTN4, ACTN4, AFDN, BAIAP2, CDC42, CDH1, CREBBP, CSNK2A1, CSNK2A2, CSNK2B, CSNK2B, CSNK2B, CSNK2B, CSNK2B, CSNK2B, CSNK2B, CSNK2B, CTNNA1, CTNNA2, CTNNA3, CTNNB1, CTNND1, EGFR, EP300, ERBB2, FARP2, FER, FGFR1, FYN, IGF1R, INSR, IQGAP1, LEF1, LMO7, MAP3K7, MAPK1, MAPK3, MET, NECTIN1, NECTIN2, NECTIN3, NECTIN4, NLK, PARD3, PTPN1, PTPN6, PTPRB, PTPRF, PTPRJ, PTPRM, RAC1, RAC2, RAC3, RHOA, SMAD2, SMAD3, SMAD4, SNAI1, SNAI2, SORBS1, SRC, SSX2IP, TCF7, TCF7L1, TCF7L2, TGFBR1, TGFBR2, TJP1, TJP1, VCL, WAS, WASF1, WASF2, WASF3, WASL and YES1
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY: ACACB, ACSL1, ACSL3, ACSL4, ACSL5, ACSL6, ADIPOQ, ADIPOR1, ADIPOR2, ADIPOR2, AGRP, AKT1, AKT2, AKT3, AKT3, CAMKK1, CAMKK2, CD36, CHUK, CPT1A, CPT1B, CPT1C, G6PC1, G6PC2, G6PC2, IKBKB, IKBKG, IRS1, IRS2, IRS4, JAK2, LEP, LEPR, MAPK10, MAPK8, MAPK9, MTOR, NFKB1, NFKBIA, NFKBIB, NFKBIB, NFKBIE, NPY, PCK1, PCK2, PCK2, POMC, PPARA, PPARGC1A, PRKAA1, PRKAA2, PRKAB1, PRKAB2, PRKAG1, PRKAG2, PRKAG3, PRKCQ, PTPN11, RELA, RXRA, RXRB, RXRB, RXRB, RXRB, RXRB, RXRB, RXRG, SLC2A1, SLC2A4, SLC2A4, SOCS3, STAT3, STK11, TNF, TNF, TNF, TNF, TNF, TNF, TNF, TNF, TNFRSF1A, TNFRSF1B, TRADD and TRAF2
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM: ABAT, ACY3, ADSL, ADSS1, ADSS2, AGXT, AGXT2, ALDH4A1, ALDH5A1, ASL, ASNS, ASPA, ASS1, CAD, CPS1, DDO, GAD1, GAD2, GFPT1, GFPT2, GLS, GLS2, GLUD1, GLUD2, GLUD2, GLUL, GOT1, GOT2, GPT, GPT2, IL4I1, NIT2 and PPAT
KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION: ATP1A1, ATP1A2, ATP1A3, ATP1A4, ATP1B1, ATP1B2, ATP1B3, ATP1B4, FXYD2, FXYD4, HSD11B1, HSD11B2, IGF1, INS, INSR, IRS1, IRS2, IRS4, KCNJ1, KRAS, MAPK1, MAPK3, NEDD4L, NR3C2, PDPK1, PIK3CA, PIK3CB, PIK3CD, PIK3CG, PIK3R1, PIK3R2, PIK3R3, PIK3R5, PRKCA, PRKCB, PRKCG, SCNN1A, SCNN1B, SCNN1G, SFN, SGK1 and SLC9A3R2

REACTOME

reactome_gene_sets = msigdbr(species = "Homo sapiens", category = "C2") %>% filter(gs_subcat=="CP:REACTOME")
reactome_list_pathway = split(x = reactome_gene_sets$gene_symbol, f = reactome_gene_sets$gs_name)
pander(head(reactome_list_pathway))

REACTOME_2_LTR_CIRCLE_FORMATION: BANF1, HMGA1, LIG4, PSIP1, XRCC4, XRCC5 and XRCC6
REACTOME_A_TETRASACCHARIDE_LINKER_SEQUENCE_IS_REQUIRED_FOR_GAG_SYNTHESIS: AGRN, B3GALT6, B3GAT1, B3GAT2, B3GAT3, B4GALT7, BCAN, BGN, CSPG4, CSPG5, DCN, GPC1, GPC2, GPC3, GPC4, GPC5, GPC6, HSPG2, NCAN, SDC1, SDC2, SDC3, SDC4, VCAN, XYLT1 and XYLT2
REACTOME_ABACAVIR_METABOLISM: ADAL, ADH1A, GUK1, NT5C2 and PCK1
REACTOME_ABACAVIR_TRANSMEMBRANE_TRANSPORT: ABCB1, ABCG2, SLC22A1, SLC22A2 and SLC22A3
REACTOME_ABACAVIR_TRANSPORT_AND_METABOLISM: ABCB1, ABCG2, ADAL, ADH1A, GUK1, NT5C2, PCK1, SLC22A1, SLC22A2 and SLC22A3
REACTOME_ABC_FAMILY_PROTEINS_MEDIATED_TRANSPORT: ABCA10, ABCA12, ABCA2, ABCA3, ABCA4, ABCA5, ABCA6, ABCA7, ABCA8, ABCA9, ABCB1, ABCB10, ABCB4, ABCB5, ABCB6, ABCB7, ABCB8, ABCB9, ABCC1, ABCC1, ABCC10, ABCC11, ABCC2, ABCC3, ABCC4, ABCC5, ABCC6, ABCC6, ABCC9, ABCD1, ABCD2, ABCD3, ABCF1, ABCF1, ABCF1, ABCF1, ABCF1, ABCF1, ABCF1, ABCG1, ABCG4, ABCG5, ABCG8, APOA1, CFTR, DERL1, DERL2, DERL3, DERL3, EIF2S1, EIF2S2, EIF2S3, ERLEC1, ERLIN1, ERLIN2, KCNJ11, OS9, PEX19, PEX3, PSMA1, PSMA2, PSMA3, PSMA4, PSMA5, PSMA6, PSMA7, PSMA8, PSMB1, PSMB10, PSMB11, PSMB2, PSMB3, PSMB3, PSMB4, PSMB5, PSMB6, PSMB7, PSMB8, PSMB8, PSMB8, PSMB8, PSMB8, PSMB8, PSMB8, PSMB8, PSMB9, PSMB9, PSMB9, PSMB9, PSMB9, PSMB9, PSMB9, PSMB9, PSMC1, PSMC2, PSMC3, PSMC4, PSMC4, PSMC5, PSMC6, PSMD1, PSMD10, PSMD11, PSMD12, PSMD13, PSMD14, PSMD2, PSMD3, PSMD4, PSMD5, PSMD6, PSMD7, PSMD8, PSMD9, PSME1, PSME1, PSME2, PSME2, PSME3, PSME4, PSMF1, RNF185, RNF5, RNF5, RNF5, RNF5, RNF5, RNF5, RNF5, RPS27A, SEL1L, SEM1, UBA52, UBB, UBC and VCP

GO

Gene Ontology (GO): Focuses on biological processes, molecular functions, and cellular components (many species including humans).

Gene Ontology is a controlled vocabulary for describing gene functions. It consists of three ontologies: 1. Biological Processes (BP): Describes large biological pathways or events (e.g., “cell cycle”). 2. Molecular Functions (MF): Describes molecular activities (e.g., “protein kinase activity”). 3. Cellular Components (CC): Describes the location within the cell (e.g., “plasma membrane”).

GO BP

Reactome: A curated database of human molecular pathways (only humans).

go_bp_gene_sets = msigdbr(species = "Homo sapiens", category = "C5") %>% filter(gs_subcat=="GO:BP")
go_bp_list_pathway = split(x = go_bp_gene_sets$gene_symbol, f = go_bp_gene_sets$gs_name)
pander(head(go_bp_list_pathway))

GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS: AASDHPPT, ALDH1L1, ALDH1L2, MTHFD1, MTHFD1L and MTHFD2L
GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS: AADAT, ADHFE1, D2HGDH, DLST, GOT1, GOT2, GPT2, IDH1, IDH2, IDH3B, KYAT3, L2HGDH, MRPS36, MRPS36, OGDH, OGDHL, PHYH and TAT
GOBP_2FE_2S_CLUSTER_ASSEMBLY: BOLA2, BOLA2B, GLRX3, GLRX5, HSCB and NFS1
GOBP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_BIOSYNTHETIC_PROCESS: PAPSS1, PAPSS2, SLC26A1, SLC26A2, SLC35B2 and SLC35B3
GOBP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_METABOLIC_PROCESS: ABHD14B, BPNT1, ENPP1, PAPSS1, PAPSS2, SLC26A1, SLC26A2, SLC35B2, SLC35B3, SULT1A1, SULT1A2, SULT1A3, SULT1A4, SULT1B1, SULT1C3, SULT1C4, SULT1E1, SULT2A1, SULT2B1, TPST1 and TPST2
GOBP_3_UTR_MEDIATED_MRNA_DESTABILIZATION: CPEB3, DHX36, DHX36, DND1, DND1, HNRNPD, KHSRP, MOV10, PLEKHN1, RBM24, RC3H1, TARDBP, TRIM71, UPF1, ZC3H12A, ZC3H12D, ZFP36, ZFP36L1 and ZFP36L2

GO CC

go_cc_gene_sets = msigdbr(species = "Homo sapiens", category = "C5") %>% filter(gs_subcat=="GO:CC")
go_cc_list_pathway = split(x = go_cc_gene_sets$gene_symbol, f = go_cc_gene_sets$gs_name)
pander(head(go_cc_list_pathway))

GOCC_90S_PRERIBOSOME: BMS1, BOP1, BOP1, ENSG00000275464, HEATR1, IMP3, IMP4, KRI1, MPHOSPH10, NOC2L, NOC4L, NOC4L, NOL6, NOP14, NOP9, NOP9, PES1, PWP2, RRP36, RRP7A, RRP7BP, RSL1D1, SLX9, SRFBP1, TBL3, UTP18, UTP20, UTP4, UTP4, UTP6, WDR12, WDR3 and WDR36
GOCC_9PLUS0_NON_MOTILE_CILIUM: ABCA4, ARL3, ARR3, ATP1A4, ATP8A2, BBS4, BBS7, BSG, C1orf115, C4orf47, CACNA1F, CCDC66, CDHR1, CEP250, CEP290, CERKL, CETN1, CETN2, CETN3, CFAP410, CIB2, CNGA1, CNGB1, CNGB3, CRB1, DHRS3, DYNLL2, EYS, FAM161A, GNA11, GNAQ, GNAT1, GNAT2, GNAT3, GNB1, GNGT1, GRK1, GRK1, GRK1, GRK4, GRK7, GUCA1A, GUCA1B, GUCA1C, GUCY2D, GUCY2F, IFT122, IFT140, IFT20, IFT52, IFT57, IFT80, IFTAP, IMPG1, INHA, IQCB1, KIAA1549, KIF17, KIFAP3, LCA5, LYAR, MAK, MAP1B, MERTK, MYO7A, MYRIP, NAPEPLD, NAPEPLD, NPHP1, NPHP4, NXNL1, OCRL, OPN1LW, OPN1MW, OPN1MW2, OPN1MW3, OPN1SW, OPN3, OPN4, PCARE, PCDH15, PCDHB13, PCDHB15, PCDHB15, PDC, PDE6A, PDE6B, PDE6G, PDE6H, PEX6, PHLPP2, PHYH, PIP4K2A, PKHD1, PPEF2, PRCD, PROM1, PRPH2, PTGS1, RAB27A, RCVRN, RD3, RGR, RGS9BP, RHO, ROM1, RP1, RP1L1, RPGR, RPGRIP1, RPGRIP1L, RRH, SAG, SAG, SDCCAG8, SDCCAG8, SEPTIN2, SHANK2, SLC24A4, SMO, SPATA7, SPTBN5, TBCC, TMEM237, TOPORS, TSGA10IP, TTC8, TTLL4, TTLL6, TULP1, TULP3, USH1C, USH1G, USH2A, VCAN, WDR19 and WHRN
GOCC_9PLUS2_MOTILE_CILIUM: ABHD2, ACE, AK1, AK2, AK8, AKAP3, AKAP4, ATP1A1, ATP1A4, ATP1B1, ATP1B3, ATP2B4, CABCOCO1, CABS1, CABYR, CAPZB, CATSPER4, CATSPERD, CATSPERE, CATSPERG, CATSPERZ, CCDC172, CCDC181, CCDC189, CCDC39, CCDC39, CCR6, CD52, CETN2, CFAP221, CFAP251, CFAP43, CFAP45, CFAP47, CFAP52, CFAP58, CFAP65, CFAP69, CFAP70, CROCC, CST11, CUL3, DCDC2C, DEFB1, DEFB1, DNAH1, DNAH11, DNAH17, DNAH2, DNAH5, DNAH8, DNAH9, DNAI1, DNAI2, DNAJB13, DNALI1, DRC3, DRD2, DYNLT4, EFCAB2, EFCAB9, ENKUR, ENO4, FLACC1, FSCB, FSIP2, GABARAP, GAS8, GK2, GSTM3, HSPD1, HYAL3, IFT172, IFT27, IFT81, IFT88, IL4I1, IQCG, KIF9, LYZL4, LYZL6, LYZL6, MAFIP, MNS1, MROH2B, NME5, NME8, ODAD4, ODF1, ODF2, ODF3, PACRG, PFKM, PGAM4, PGK2, PMFBP1, PRKACA, PRKACA, PRKAR1A, QRICH2, RAN, RAP1A, RHO, RNF38, RPGR, RSPH1, RSPH6A, RSPH9, SAXO1, SAXO2, SCNN1A, SEPTIN12, SEPTIN2, SEPTIN4, SEPTIN6, SEPTIN7, SLC22A14, SLC26A3, SLC26A6, SLC9A3R1, SLC9B1, SLC9B2, SLIRP, SPA17, SPACA3, SPACA9, SPAG16, SPAG6, SPATA6, SPATA6L, SPATC1L, SPATC1L, SPEF1, SPEF2, SQSTM1, SQSTM1, SUN5, TACR1, TACR2, TACR3, TBC1D21, TCP11, TCP11X1, TCTE1, TEKT3, TEKT4, TEKT4, TEKT5, TOMM20, TPPP2, TSSK4, TSSK4, TTC29, TTLL3, TTLL8, VPS13A and WBP2NL
GOCC_A_BAND: ALDOA, ANK1, ANK2, CMYA5, CRYAB, DST, ENO1, KAT2B, KCTD6, KLHL40, KLHL41, LMOD2, LMOD3, LRRC39, MYBPC3, MYH1, MYL2, MYL3, MYL4, MYL7, MYOM1, MYOM2, MYOM2, MYOM3, NBR1, OBSCN, OBSL1, PPP1R12A, PPP1R12B, PPP2R5A, SMPX, SMTNL1, SMTNL2, SPTBN1 and TRIM63
GOCC_ACETYLCHOLINE_GATED_CHANNEL_COMPLEX: CHRFAM7A, CHRFAM7A, CHRNA1, CHRNA2, CHRNA3, CHRNA4, CHRNA5, CHRNA6, CHRNA7, CHRNA7, CHRNA7, CHRNA9, CHRNB1, CHRNB1, CHRNB2, CHRNB3, CHRNB4, CHRNE and STXBP5
GOCC_ACROSOMAL_MEMBRANE: ACRBP, ATP8B3, BSG, CAV1, CAV2, CCDC136, CD46, CFAP65, EQTN, FAM170B, FLOT2, HYAL3, IZUMO1, MFGE8, PCSK4, PLA1A, RND2, SERPINA5, SPACA1, SPACA3, SPACA6, TEKT3, TEX101, TMEM190, TMEM225, TMEM95, TMEM95, TRIP11 and ZPBP

GO MF

go_mf_gene_sets = msigdbr(species = "Homo sapiens", category = "C5") %>% filter(gs_subcat=="GO:MF")
go_mf_list_pathway = split(x = go_mf_gene_sets$gene_symbol, f = go_mf_gene_sets$gs_name)
pander(head(go_mf_list_pathway))

GOMF_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACTIVITY: GDPGP1, MTAP, PYGB, PYGL, PYGM and TYMP
GOMF_1_ACYLGLYCEROPHOSPHOCHOLINE_O_ACYLTRANSFERASE_ACTIVITY: LPCAT1, LPCAT1, LPCAT2, LPCAT3, LPCAT4, MBOAT1, MBOAT2, PRDX6 and TAFAZZIN
GOMF_1_ALKYL_2_ACETYLGLYCEROPHOSPHOCHOLINE_ESTERASE_ACTIVITY: ASPG, LCAT, PAFAH1B2, PAFAH1B3, PAFAH2, PLA2G10, PLA2G10, PLA2G6 and PLA2G7
GOMF_1_PHOSPHATIDYLINOSITOL_3_KINASE_ACTIVITY: ATM, PIK3C2A, PIK3C2B, PIK3C2G, PIK3C3, PIK3CA, PIK3CB, PIK3CD, PIK3CG and PIK3R3
GOMF_1_PHOSPHATIDYLINOSITOL_3_KINASE_REGULATOR_ACTIVITY: CCKBR, CISH, P3R3URF, PIK3R1, PIK3R2, PIK3R3, PIK3R5, PIK3R6, SLA2, SOCS1, SOCS2, SOCS3, SOCS4, SOCS5, SOCS6, SOCS7 and SOCS7
GOMF_1_PHOSPHATIDYLINOSITOL_4_KINASE_ACTIVITY: PI4K2A, PI4K2B, PI4KA, PI4KAP1, PI4KAP2 and PI4KB

ALL OF THEM

# Look at them all if you want (uncomment)
#hallmark_gene_sets = msigdbr(species = "Homo sapiens", category = "H") 
#hallmark_list = split(x = hallmark_gene_sets$gene_symbol, f = hallmark_gene_sets$gs_name)

For this demo, we are going to use KEGG as a gene sets

# Show the first few pathways, and within those, show only the first few genes. 
kegg_list_pathway %>% 
  head() %>% 
  lapply(head)

## $KEGG_ABC_TRANSPORTERS
## [1] "ABCA1"  "ABCA10" "ABCA12" "ABCA13" "ABCA2"  "ABCA3" 
## 
## $KEGG_ACUTE_MYELOID_LEUKEMIA
## [1] "AKT1" "AKT2" "AKT3" "AKT3" "ARAF" "BAD" 
## 
## $KEGG_ADHERENS_JUNCTION
## [1] "ACP1"  "ACTB"  "ACTG1" "ACTN1" "ACTN2" "ACTN3"
## 
## $KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY
## [1] "ACACB" "ACSL1" "ACSL3" "ACSL4" "ACSL5" "ACSL6"
## 
## $KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM
## [1] "ABAT"  "ACY3"  "ADSL"  "ADSS1" "ADSS2" "AGXT" 
## 
## $KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION
## [1] "ATP1A1" "ATP1A2" "ATP1A3" "ATP1A4" "ATP1B1" "ATP1B2"

Import Differential Expression Analysis Results (DEGs)

# Import your differential expressed genes (DEGs) table from Github
urlfile="https://raw.githubusercontent.com/merlinis12/Downstream-Analysis-of-RNA-Seq-Results/refs/heads/main/data/DESeq2_results_notfiltered.csv"
deg_results <- read_csv(url(urlfile))

# View data dimensions and initial rows
dim(deg_results)

## [1] 25258     7

formattable(head(deg_results))

gene_id	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
ENSG00000152583	954.7709	4.368359	0.23712679	18.42204	8.74e-76	1.34e-71
ENSG00000179094	743.2527	2.863889	0.17556931	16.31201	8.11e-60	6.22e-56
ENSG00000116584	2277.9135	-1.034701	0.06509844	-15.89440	6.93e-57	3.54e-53
ENSG00000189221	2383.7537	3.341544	0.21240579	15.73189	9.14e-56	3.51e-52
ENSG00000120129	3440.7038	2.965211	0.20369513	14.55710	5.26e-48	1.62e-44
ENSG00000148175	13493.9204	1.427168	0.10038904	14.21638	7.25e-46	1.85e-42

Convert Gene ID in the SYMBOL format

For human data, AnnotationDbi and org.Hs.eg.db provide another reliable way to convert IDs without needing an internet connection.

Load Packages

library(org.Hs.eg.db)

How to know your Gene IDs

# Example gene ID vector
gene_ids <- deg_results$gene_id

# Define possible ID types to test
possible_id_types <- c("SYMBOL", "ENTREZID", "ENSEMBL", "REFSEQ")

# Loop through possible ID types to check which one matches
for (id_type in possible_id_types) {
  result <- tryCatch({
    mapIds(org.Hs.eg.db, keys = gene_ids, column = "SYMBOL", keytype = id_type)
  }, error = function(e) NA)  # Ignore errors

  if (!all(is.na(result))) {
    cat("The IDs match the type:", id_type, "\n")
  }
}

## The IDs match the type: ENSEMBL

Now, you want to convert them to gene symbols:

ens2symbol <- AnnotationDbi::select(org.Hs.eg.db,
                                    key=deg_results$gene_id, 
                                    columns="SYMBOL",
                                    keytype="ENSEMBL")
ens2symbol <- as_tibble(ens2symbol)
formattable(head(ens2symbol))

ENSEMBL	SYMBOL
ENSG00000152583	SPARCL1
ENSG00000179094	PER1
ENSG00000116584	ARHGEF2
ENSG00000189221	MAOA
ENSG00000120129	DUSP1
ENSG00000148175	STOM

deg_results <- inner_join(deg_results, ens2symbol, by=c("gene_id"="ENSEMBL"))
formattable(head(deg_results))

gene_id	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	SYMBOL
ENSG00000152583	954.7709	4.368359	0.23712679	18.42204	8.74e-76	1.34e-71	SPARCL1
ENSG00000179094	743.2527	2.863889	0.17556931	16.31201	8.11e-60	6.22e-56	PER1
ENSG00000116584	2277.9135	-1.034701	0.06509844	-15.89440	6.93e-57	3.54e-53	ARHGEF2
ENSG00000189221	2383.7537	3.341544	0.21240579	15.73189	9.14e-56	3.51e-52	MAOA
ENSG00000120129	3440.7038	2.965211	0.20369513	14.55710	5.26e-48	1.62e-44	DUSP1
ENSG00000148175	13493.9204	1.427168	0.10038904	14.21638	7.25e-46	1.85e-42	STOM

write.csv(deg_results, "./deg_symbols.csv", row.names = FALSE)

Recap: Differential Gene Expression Analysis

Before diving into GSEA with fgsea, let’s briefly revisit where we left off in our previous workshop.

In our last session, we performed a Differential Gene Expression (DGE) analysis comparing gene expression between 4 control participants and 4 participants treated with dexamethasone (Dex), a corticosteroid.

Let’s plot the volcano plot now to remind ourselves of the DEGs that we’ll use for GSEA analysis.

# Add a column to categorize genes as upregulated, downregulated, or non-significant
results <- as.data.frame(deg_results)
results <- results %>%
  mutate(Significance = case_when(
    padj < 0.05 & log2FoldChange > 1 ~ "Upregulated",
    padj < 0.05 & log2FoldChange < -1 ~ "Downregulated",
    TRUE ~ "Non-significant"
  ))



#look at a subset of genes
results$label <- ifelse(results$SYMBOL %in% c("SPARCL1", "STOM", "PHC2", "MAOA", "ARHGEF2", "ZNF512", "MTHFSD", "CRISPLD2"), results$SYMBOL, NA)
results$is_labeled <- ifelse(results$SYMBOL %in% c("SPARCL1", "STOM", "PHC2", "MAOA", "ARHGEF2", "ZNF512", "MTHFSD", "CRISPLD2"), "Labeled", "Not Labeled")



# Volcano Plot
ggplot(results, aes(x = log2FoldChange, y = -log10(pvalue), color = Significance)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("blue", "grey", "red")) +
  new_scale_color() +
  geom_point(data = results[results$is_labeled == "Labeled", ], 
             aes(color = "Highlight"), size = 3, shape = 21) +
  scale_color_manual(values = c("Highlight" = "black")) +
  geom_text_repel(aes(label = label), size = 3, max.overlaps = 10, color = 'black', na.rm = TRUE) +
  labs(title = "Volcano Plot of Differentially Expressed Genes",
       x = "Log2 Fold Change",
       y = "-log10(P-Value)") +
  theme_minimal() +
  theme(legend.position = "top")

Prepare the ranked gene list

We would like to perform GSEA on our differentially expressed genes. Unlike pathway enrichment analysis, GSEA provides the advantage of not needing to subset our gene list. We can provide all the genes that resulted from our differential expression analysis. Even genes that were not significantly differentially expressed. Offering popular women’s necklaces such as pendants, chokers and. Shop for jewelry in a variety of metals and gemstones to suit any occasion.

We just need to rank our genes first.

The way it works is, you need to give fgsea a named vector with the rankings and the genes symbols as names.

ranked_genes = deg_results %>%
    na.omit() %>% 
    distinct() %>% 
    group_by(SYMBOL) %>% 
    mutate(rank = sign(log2FoldChange)*(-log10(pvalue))) %>%
    arrange(desc(rank))


# create gene list 
gene_list = ranked_genes$rank
names(gene_list) = toupper(ranked_genes$SYMBOL)

formattable(head(gene_list, 20))

## SPARCL1    PER1    MAOA   DUSP1    STOM  ZBTB16    PHC2  SAMHD1   FKBP5    NNMT 
##   75.06   59.09   55.04   47.28   45.14   42.87   41.39    40.9   40.63   39.71 
##    MT2A    NEXN    MT1X   SMIM3  PNPLA2  GASK1B  CCDC69 MORF4L2    GLUL  STEAP1 
##   38.85   38.61      35   33.15   31.74   31.58   30.87   28.79   27.51   26.68

This way, the magnitude of the ranking will be given by the p value (since I use -log10(), the more significant a gene is differentially expressed, the larger that value will be). The sign of the ranking (positive or negative) will be given by the fold-change: upregulated genes will have positive rankings, downregulated genes will have negative rankings.

Run fgsea()

# Run fgsea
fgsea_results <- fgsea(pathways = kegg_list_pathway,
                       stats = gene_list, nperm = 1000,
                       scoreType = 'std', # in this case we have both pos and neg rankings. if only pos or neg, set to 'pos', 'neg'
                       minSize = 10,
                       maxSize = 500)

fgsea returns a dataframe with the names of your pathways or gene sets, p-values and p-adjusted values, the ES, NES, the size of the gene sets and the leadingEdge.

fgseaResTidy <- fgsea_results %>%
  as_tibble() %>%
  arrange(desc(NES))


cat("Number of pathways in fgsea results:", nrow(fgseaResTidy), "\n")

## Number of pathways in fgsea results: 180

# Show in a nice table:
fgseaResTidy %>% 
  dplyr::select(-leadingEdge, -ES, -nMoreExtreme) %>% 
  arrange(padj) %>% 
  DT::datatable()

After performing Gene Set Enrichment Analysis (GSEA) using the fgsea package, we obtain the following key outputs:

Pathway Name: The name of each gene set or pathway tested.
Enrichment Score (ES): Measures the degree to which a pathway is overrepresented at the top or bottom of the ranked gene list.
Normalized Enrichment Score (NES): The ES normalized across all gene sets, accounting for differences in pathway sizes, making it comparable across analyses.
p-value: The probability that the observed enrichment score is due to random chance.
Adjusted p-value (q-value): Corrected p-value to account for multiple testing using methods like Benjamini-Hochberg.
Leading Edge Genes: Subset of genes that contribute most to the enrichment signal.

Interpreting the Results

Pathways with a low q-value (e.g., < 0.05) are considered significantly enriched in your ranked gene list.
Positive NES indicates pathways enriched with upregulated genes, while negative NES indicates pathways enriched with downregulated genes.
The leading edge genes provide insights into the core contributors driving the enrichment of each pathway.

Visualize `fgsea` results

fgseaResTidy_filtered <- filter(fgseaResTidy, pval<= 0.05)

plot_dat = fgseaResTidy %>%
    arrange((NES)) %>%
    mutate(sig = ifelse(padj < 0.05,"*"," "))

plot_dat$pathway=factor(plot_dat$pathway,levels=plot_dat$pathway,ordered=T)

ggplot(plot_dat, aes(NES,pathway,fill=NES,label=sig))+
    geom_col(color="black")+
    geom_text(size=5)+
    scale_fill_gradient2(low="purple",high="orange2",midpoint=0)+
    theme_bw()+
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5), # Adjust title size and alignment
    plot.subtitle = element_text(size = 10, hjust = 0.5),             # Optional subtitle customization
    plot.caption = element_text(size = 10, hjust = 0.5)              # Optional caption customization
  ) +
  labs(
    title = "KEGG pathways NES for GSEA",
    subtitle = "Normalised Enrichment Scores for Pathways",
    x = "Normalised Enrichment Score (NES)",
    y = "Pathway (KEGG)",
    fill = "NES"
  )

    labs(x="Normalised enrichment score (NES)",y="Pathway (KEGG)",fill="NES")

## $x
## [1] "Normalised enrichment score (NES)"
## 
## $y
## [1] "Pathway (KEGG)"
## 
## $fill
## [1] "NES"
## 
## attr(,"class")
## [1] "labels"

The list of pathways resulting from fgsea analysis can often be very long, making it challenging to identify the most relevant biological insights. To manage this, we can take two main approaches:

Filtering by Adjusted p-value (q-value):
- Pathways with a low q-value (e.g., < 0.05) are considered statistically significant.
- Filtering results by q-value allows us to focus only on the most enriched pathways, reducing noise from less significant results.
Grouping Similar Pathways with collapsePathways():
- Often, many enriched pathways are functionally related, leading to redundancy in the results.
- The collapsePathways() function in the fgsea package groups similar pathways together, retaining only the most representative pathway from each cluster.
- This approach simplifies interpretation by providing a more concise and biologically meaningful summary.

TIP By combining filtering and collapsing pathways, you can focus on a manageable number of biologically relevant pathways while reducing redundancy in your analysis results.

fgseaResTidy_filtered <- filter(fgseaResTidy, pval<= 0.05)

plot_dat = fgseaResTidy_filtered %>%
    arrange((NES)) %>%
    mutate(sig = ifelse(padj < 0.05,"*"," "))

plot_dat$pathway=factor(plot_dat$pathway,levels=plot_dat$pathway,ordered=T)

ggplot(plot_dat, aes(NES,pathway,fill=NES,label=sig))+
    geom_col(color="black")+
    geom_text(size=5)+
    scale_fill_gradient2(low="purple",high="orange2",midpoint=0)+
    theme_bw()+
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5), # Adjust title size and alignment
    plot.subtitle = element_text(size = 12, hjust = 0.5),             # Optional subtitle customization
    plot.caption = element_text(size = 10, hjust = 0.5)              # Optional caption customization
  ) +
  labs(
    title = "KEGG pathways NES for GSEA",
    subtitle = "Pval < 0.05, * if padj < 0.05",
    x = "Normalised Enrichment Score (NES)",
    y = "Pathway (KEGG)",
    fill = "NES"
  )

# Visualize results
ggplot(fgseaResTidy_filtered, aes(reorder(pathway, NES), NES)) +
  geom_col(aes(fill=padj<0.05)) +
  coord_flip() +
  theme_bw()+
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5), # Adjust title size and alignment
    plot.subtitle = element_text(size = 12, hjust = 0.5),             # Optional subtitle customization
    plot.caption = element_text(size = 10, hjust = 0.5)              # Optional caption customization
  ) +
  labs(
    title = "KEGG pathways NES for GSEA",
    subtitle = "Pval < 0.05",
    x = "Normalised Enrichment Score (NES)",
    y = "Pathway (KEGG)"  )

You can also select only independent pathways, removing redundancies or similar pathways with the function collapsePathways():

# Select only independent pathways, removing redundancies/similar pathways
collapsedPathways <- collapsePathways(fgsea_results[order(pval)][pval < 0.05],
                                        pathways = kegg_list_pathway,
                                        stats = gene_list)
gsea_res_collapsed <- fgsea_results[pathway %in% collapsedPathways$mainPathways]
  
ggplot(gsea_res_collapsed, aes(reorder(pathway, NES), NES)) +
  geom_col(aes(fill=padj<0.05)) +
  coord_flip() +
  theme_bw()+
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5), # Adjust title size and alignment
    plot.subtitle = element_text(size = 12, hjust = 0.5),             # Optional subtitle customization
    plot.caption = element_text(size = 10, hjust = 0.5)              # Optional caption customization
  ) +
  labs(
    title = "KEGG pathways NES for GSEA",
    subtitle = "Pval < 0.05, pathway collapsed",
    x = "Normalised Enrichment Score (NES)",
    y = "Pathway (KEGG)")

Save results

# Convert 'leadingEdge' to a comma-separated string
fgsea_results$leadingEdge <- sapply(fgsea_results$leadingEdge, function(x) paste(x, collapse = ","))

write.csv(fgsea_results, "./Results/fgsea_results.csv", row.names = FALSE)

Downstream Analysis of RNA-Seq Results in R: Gene Set Enrichment Analysis

Simona Merlini

2024-11-19

Overview

How does GSEA work?

Loading Libraries

What do we need?

Prepare your pathways/gene sets

KEGG

REACTOME

GO

GO BP

GO CC

GO MF

ALL OF THEM

Import Differential Expression Analysis Results (DEGs)

Convert Gene ID in the SYMBOL format

Recap: Differential Gene Expression Analysis

Prepare the ranked gene list

Run fgsea()

Interpreting the Results

Visualize `fgsea` results

Downstream Analysis of RNA-Seq Results in R: Gene Set Enrichment Analysis

Simona Merlini

2024-11-19

Overview

How does GSEA work?

Loading Libraries

What do we need?

Prepare your pathways/gene sets

KEGG

REACTOME

GO

GO BP

GO CC

GO MF

ALL OF THEM

Import Differential Expression Analysis Results (DEGs)

Convert Gene ID in the SYMBOL format

Recap: Differential Gene Expression Analysis

Prepare the ranked gene list

Run fgsea()

Interpreting the Results

Visualize fgsea results

Visualize `fgsea` results