Introduction to Differential Expression Analysis

Overview

This tutorial provides a comprehensive summary of differential gene expression (DGE) analysis techniques that are broadly applicable across various research scenarios. Specifically, it focuses on workflows using the popular tools limma, DESeq2, and edgeR—three of the most widely adopted packages in RNA-seq analysis.For more information visit RNA-Seq and Differential Expression Analysis in R.

Biological Background

The data comes from:

Himes et al. “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS ONE. 2014 Jun 13;9(6):e99625. PMID: 24926665.

Asthma is a long-term inflammatory lung disease affecting over 300 million people worldwide. Glucocorticoids are commonly used to reduce inflammation in asthma, especially in airway smooth muscle (ASM), but how they work at the cellular level in ASM is still unclear. In this study, they used RNA-Seq to analyze changes in gene activity in ASM cells treated with dexamethasone, a strong synthetic glucocorticoid. They found 316 genes with altered activity, including well-known glucocorticoid-responsive genes and others less studied, like CRISPLD2. Interestingly, variations in the CRISPLD2 gene were linked to resistance to inhaled steroids and bronchodilator response in asthma patients. Tests showed that dexamethasone and an inflammatory molecule (IL1β) both increased CRISPLD2 levels. When CRISPLD2 was reduced, inflammation markers IL6 and IL8 increased. This research highlights CRISPLD2 as a gene that could influence asthma treatment responses by regulating glucocorticoid anti-inflammatory effects in ASM.

They did their analysis using Tophat and Cufflinks. We’re using three different R packages called DESeq2, limma. Click here to read more on DESeq2 previous tutorial.

Data Import
DGE Analysis in limma
1. Converting Counts to CPM and Using voom
2. Fit Model and Test for Differential Expression
DGE Analysis in DESeq2
Visualizing Intersection of Significantly Expressed Genes
1. Venn Diagram
2. UpSet Plot
Results Interpretation and Visualization
Handling and Converting Gene IDs

Data Import

A read counts matrix is the essential starting point for DGE analysis. This matrix contains counts of sequencing reads mapped to each gene across multiple samples. Each row represents a gene, and each column corresponds to a sample. The values in the matrix indicate the number of reads aligning to each gene, providing a measure of expression.

Loading Libraries

# Load necessary libraries
library(DESeq2)
library(edgeR)
library(limma)
library(readr)
library(vsn)
library(ggplot2)
library(VennDiagram)
library(UpSetR)
library(dplyr)
library(ggplot2)
library(factoextra)
library(formattable)
library(apeglm)
library(ggnewscale)
library(ggrepel)

Importing Read Counts Table

Ensure that the rows represent genes and columns represent samples.

# Import your read count table from Github

urlfile="https://raw.githubusercontent.com/merlinis12/RNA-Seq-Data-Analysis-in-R/refs/heads/main/data/airway_scaledcounts.csv"
count_data_original <- read_csv(url(urlfile))

# View data dimensions and initial rows
dim(count_data_original)

## [1] 38694     9

formattable(head(count_data_original))

ensgene	SRR1039508	SRR1039509	SRR1039512	SRR1039513	SRR1039516	SRR1039517	SRR1039520	SRR1039521
ENSG00000000003	723	486	904	445	1170	1097	806	604
ENSG00000000005	0	0	0	0	0	0	0	0
ENSG00000000419	467	523	616	371	582	781	417	509
ENSG00000000457	347	258	364	237	318	447	330	324
ENSG00000000460	96	81	73	66	118	94	102	74
ENSG00000000938	0	0	1	0	2	0	0	0

Importing Metadata

#read metadata from github
urlfile="https://raw.githubusercontent.com/merlinis12/RNA-Seq-Data-Analysis-in-R/refs/heads/main/data/airway_metadata.csv"
meta_data <- read_csv(url(urlfile))
meta_data$dex <- as.factor(meta_data$dex)

# View data dimensions and initial rows
dim(meta_data)

## [1] 8 4

formattable(head(meta_data))

id	dex	celltype	geo_id
SRR1039508	control	N61311	GSM1275862
SRR1039509	treated	N61311	GSM1275863
SRR1039512	control	N052611	GSM1275866
SRR1039513	treated	N052611	GSM1275867
SRR1039516	control	N080611	GSM1275870
SRR1039517	treated	N080611	GSM1275871

Missing Data

# Check for missing values
missing_counts <- sum(is.na(count_data_original))
cat("Number of missing values in the read count table:", missing_counts)

## Number of missing values in the read count table: 0

If you decide to replace missing values with zeros (assuming missing values represent no expression), use the following code:

# Replace NA with zero
count_data <- count_data_original
count_data[is.na(count_data)] <- 0

Alternatively, you can filter out genes with too many missing values. For instance, remove genes missing in more than 20% of samples:

#Define a threshold for missing data (20%)
threshold <- 0.2

# Calculate the proportion of missing values per gene
missing_proportion <- rowMeans(is.na(count_data))

# Filter out genes with missing values above the threshold
filtered_counts <- count_data[missing_proportion <= threshold, ]

What if a gene has always expression = 0?

all_zero_rows <- apply(count_data[,-1], 1, function(row) all(row == 0))
row_indices <- which(all_zero_rows)

# This will print the indices of genes where all values are zero
cat("Number of genes with all zero values in the read count table:",length(row_indices))

## Number of genes with all zero values in the read count table: 13436

#remove gene with exp = 0
gene_list <- count_data[-row_indices, ][[1]]
count_data <- count_data[-row_indices,-1]
#gene names as rownames

rownames(count_data) <- gene_list

# View data dimensions and initial rows
dim(count_data)

## [1] 25258     8

formattable(head(count_data))

SRR1039508	SRR1039509	SRR1039512	SRR1039513	SRR1039516	SRR1039517	SRR1039520	SRR1039521
723	486	904	445	1170	1097	806	604
467	523	616	371	582	781	417	509
347	258	364	237	318	447	330	324
96	81	73	66	118	94	102	74
0	0	1	0	2	0	0	0
3413	3916	6000	4308	6424	10723	5039	7803

Visualization of Expression Data and Transformation

Most DGE tools operate on raw or normalized count values. However, transformations are often applied to facilitate downstream analyses. Common transformations include:

Log transformation: This helps reduce the skew in data, especially for high-count genes.
Variance-stabilizing transformations (VST): Methods like regularized log transformation (rlog()) in DESeq2 help address heteroskedasticity by shrinking variance for low-count genes.

Log2 Transformation

Log transformations are commonly applied using base 2 (log2), which is easier to interpret in biological terms (doubling, halving, etc.). When using log transformation on count data, a pseudocount (e.g., +1) is added to avoid taking the log of zero.

# Log2 transformation with pseudocount
log.norm.counts <- log2(count_data + 1)

# Visualize transformed data
par(mfrow=c(2,1))
boxplot(count_data, notch=TRUE, main="Untransformed read counts", ylab="Read counts")
boxplot(log.norm.counts, notch=TRUE, main="Log2-transformed read counts", ylab="Log2(read counts)")

This transformation is particularly useful for creating visualizations, as it compresses the range of read counts, making patterns more discernible.

Visualizing Transformed Counts and Checking for Variance Homogeneity

Heteroskedasticity occurs when variance changes with the mean, often observed in RNA-Seq data. Visual checks for heteroskedasticity are helpful before choosing a transformation method.

# Pairwise scatter plot to inspect count similarity across samples
plot(log.norm.counts[,c(1,2)], cex=.1, main="Normalized log2(read counts)")

# Mean-SD plot to check for variance dependency
msd_plot <- meanSdPlot(as.matrix(log.norm.counts), ranks=FALSE, plot=FALSE)
msd_plot$gg + ggtitle("Sequencing depth normalized log2(read counts)") + ylab("Standard deviation")

An upward trend on the left side of the mean-SD plot indicates that low-count genes show higher variability, suggesting the need for variance stabilization. Variance stabilizing transformations. Plots show the standard deviation of normalized counts (normalizedCounts) using log2() (A), rlog() (B), and variance stabilizing (vst()) (C) transformations by rank(mean). The red line shows the running median estimator. As (B) and (C) show, the transformations greatly reduce the standard deviation, and the rlog() transformation is most effective at stabilizing the variance across the mean.

Pattern Exploration

Before conducting DGE analysis, it’s important to explore patterns across your samples. This step helps you understand sample-to-sample variability, identify any outliers, and assess the similarity between replicates. Three common methods for examining global patterns in RNA-Seq data include:

Pairwise Correlation
Hierarchical Clustering
Principal Component Analysis (PCA)

Pairwise Correlation Analysis

Pairwise correlation is a straightforward method to assess similarity between samples. High correlations between replicates indicate consistency, while low correlations might signal technical issues or biological differences.

# Install required packages if necessary
if (!requireNamespace("pheatmap", quietly = TRUE)) install.packages("pheatmap")

# Load library
library(pheatmap)

# Assume `norm_counts` is a matrix of normalized counts (e.g., log2 transformed)
# Calculate pairwise correlation
cor_matrix <- cor(log.norm.counts, method="pearson")

# Plot heatmap of the correlation matrix
pheatmap(cor_matrix, main="Pairwise Sample Correlation", display_numbers=TRUE)

This heatmap provides a quick overview of which samples are similar or dissimilar based on their gene expression profiles. Values close to 1 indicate high similarity, while lower values suggest potential differences.

Hierarchical Clustering

Hierarchical clustering groups samples based on their similarity, providing a tree-like representation (dendrogram) that visually illustrates relationships among samples. This is particularly useful for detecting clusters of similar samples, identifying outliers, and understanding sample grouping.

# Convert correlation matrix to distance matrix
dist_matrix <- as.dist(1 - cor_matrix) # 1 - correlation to get distances

# Perform hierarchical clustering
hclust_res <- hclust(dist_matrix, method="average")

# Plot dendrogram
plot(hclust_res, main="Hierarchical Clustering of Samples", sub="", xlab="", ylab="Distance")

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique that summarizes the variation across samples in a few components. Each principal component represents a portion of the total variance in the data, and plotting samples based on the first two components often reveals sample groupings and potential outliers.

# Perform PCA
pca_res <- prcomp(as.data.frame(t(log.norm.counts)), center=TRUE, scale.=TRUE) # Transpose to analyze samples

# Plot PCA
pca_data <- as.data.frame(pca_res$x)
pca_data$Sample <- rownames(pca_data)
pca_data[['dex']] <- meta_data[which(meta_data$id == pca_data$Sample),][["dex"]]
pca_data[['celltype']] <- meta_data[which(meta_data$id == pca_data$Sample),][["celltype"]]

#plot eigenvalues
fviz_eig(pca_res)

ggplot(pca_data, aes(x=PC1, y=PC2, label=Sample, colour = dex)) +
  geom_point(size=3) +
  geom_text(vjust=1.5) +
  labs(title="PCA of Samples by Treatment", x="Principal Component 1", y="Principal Component 2") +
  theme_minimal()

ggplot(pca_data, aes(x=PC1, y=PC2, label=Sample,color = dex,  shape = celltype)) +
  geom_point(size=3) +
  geom_text(vjust=1.5) +
  labs(title="PCA of Samples by Treatment and Cell Type", x="Principal Component 1", y="Principal Component 2") +
  theme_minimal()

In the PCA plot, samples with similar expression profiles will cluster together. The variance explained by each principal component helps determine which components capture the most variation in the data.

DGE Analysis in limma

limma is typically used for microarray analysis but can also handle RNA-Seq data by first applying voom, which transforms count data into log2 counts-per-million (CPM) with associated precision weights based on mean-variance.

Converting Counts to CPM and Using voom

# Define experimental groups
ids <- colnames(count_data)
group <- meta_data[which(meta_data$id == ids),][['dex']]# Modify as per your setup
design <- model.matrix(~group)

# Apply edgeR’s DGEList to organize data
dge <- DGEList(counts = as.matrix(count_data))

# Filter out lowly expressed genes (optional but recommended)
keep <- filterByExpr(dge, design)
dge <- dge[keep, ]

# Use voom transformation to convert to log2 CPM and estimate mean-variance relationship
v <- voom(dge, design)

# Inspect the transformed data
formattable(head(v$E))

##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 5.097      4.649      5.117      4.828      5.542     
## ENSG00000000419 4.467      4.755      4.564      4.566      4.536     
## ENSG00000000457 4.039      3.737      3.806      3.921      3.665     
## ENSG00000000460 2.19       2.072      1.496      2.084      2.238     
## ENSG00000000971 7.335      7.658      7.847      8.102      7.999     
## ENSG00000001036 6.783      6.466      6.663      6.461      6.43      
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 5.108      5.356      4.787     
## ENSG00000000419 4.618      4.406      4.541     
## ENSG00000000457 3.814      4.069      3.89      
## ENSG00000000460 1.57       2.38       1.767     
## ENSG00000000971 8.396      7.999      8.478     
## ENSG00000001036 6.152      6.787      6.351

Ready Data for Differential Expression in limma The object v now contains log2-transformed counts with associated weights, suitable for DE analysis in limma.

Fit Model and Test for Differential Expression

# Fit the model
fit <- lmFit(v, design = model.matrix(~ group))
fit <- eBayes(fit)

# Extract results and order by adj p-value
formattable(topTable(fit, adjust = "fdr", sort.by = "P"))

	logFC	AveExpr	t	P.Value	adj.P.Val	B
ENSG00000120129	2.885463	6.606309	16.32247	6.966005e-08	0.0004233824	8.835158
ENSG00000152583	4.286436	4.151669	15.83640	9.017721e-08	0.0004233824	8.091505
ENSG00000189221	3.287994	5.922711	15.82970	9.050290e-08	0.0004233824	8.564216
ENSG00000179094	2.778021	4.445666	15.04101	1.398648e-07	0.0004233824	8.041050
ENSG00000162614	1.897122	7.676952	14.76806	1.634158e-07	0.0004233824	8.055725
ENSG00000116584	-1.116939	6.600136	-14.71454	1.685328e-07	0.0004233824	8.043382
ENSG00000178695	-2.589366	6.430372	-14.25608	2.204244e-07	0.0004528795	7.773064
ENSG00000096060	3.928518	5.773563	13.97779	2.604513e-07	0.0004528795	7.568161
ENSG00000101347	3.689697	8.245659	13.91589	2.704117e-07	0.0004528795	7.564826
ENSG00000134686	1.361308	6.880045	13.43925	3.629235e-07	0.0005470346	7.309653

Understanding limma Differential Expression Analysis Results

logFC (Log Fold Change):
The log2-transformed fold change in expression between the compared groups. A positive value indicates that the gene is more highly expressed in the test group, while a negative value suggests higher expression in the control group. This measure reflects the magnitude of change in gene expression.
AveExpr (Average Expression):
The average expression level of the gene across all samples in the analysis. This value provides context on the overall abundance of the gene in the dataset, regardless of any differences between groups.
t (t-statistic):
The t-statistic for each gene, which reflects the ratio of the log fold change to its standard error. Higher absolute values of t indicate greater evidence against the null hypothesis (i.e., no differential expression). A higher t-value generally corresponds to a more statistically significant result.
P.Value:
The raw p-value derived from the statistical test. This value assesses the likelihood of observing the given log fold change by chance, assuming there’s no actual difference in expression between groups. Lower p-values indicate stronger evidence that the gene is differentially expressed.
adj.P.Val (Adjusted P-Value):
The adjusted p-value, often calculated using the Benjamini-Hochberg method to control for false discovery rate (FDR). Adjusted p-values help account for multiple testing and reduce the likelihood of identifying false positives as significant. Genes with adjusted p-values below a chosen threshold (e.g., 0.05) are typically considered significantly differentially expressed.
B (Log-Odds Score):
The log-odds of differential expression, also known as the B-statistic. It estimates the likelihood that a gene is truly differentially expressed based on the model used. Higher values indicate higher confidence in the gene’s differential expression.

Save Results

# Save significant results
limma_genes <- topTable(fit, adjust = "fdr", number = Inf)
limma_genes <- filter(limma_genes, adj.P.Val < 0.05)
write.csv(limma_genes, "limma_voom_results.csv")

DGE Analysis in DESeq2

DESeq2 directly models raw counts without prior transformation, applying its own size factor-based normalization.

Creating a DESeq2 Dataset

# Define the sample information as a data frame
col_data <- data.frame(
  row.names = colnames(count_data),
  condition = group # Modify as per your setup
)

# Create DESeq2 dataset object
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~condition)

Normalizing Counts

# Apply DESeq’s normalization
dds <- estimateSizeFactors(dds)

# Inspect normalized counts (optional)
norm_counts <- counts(dds, normalized = TRUE)
formattable(head(norm_counts))

##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 709.3      539.7      767.1      663.2      997.3     
## ENSG00000000419 458.1      580.7      522.7      552.9      496.1     
## ENSG00000000457 340.4      286.5      308.9      353.2      271.1     
## ENSG00000000460 94.17      89.94      61.95      98.36      100.6     
## ENSG00000000938 0          0          0.8486     0          1.705     
## ENSG00000000971 3348       4348       5092       6420       5476      
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 787.5      878        635.5     
## ENSG00000000419 560.7      454.2      535.5     
## ENSG00000000457 320.9      359.5      340.9     
## ENSG00000000460 67.48      111.1      77.86     
## ENSG00000000938 0          0          0         
## ENSG00000000971 7698       5489       8210

Ready Data for Differential Expression in DESeq2 The dds object is now prepared for DE analysis. DESeq2 will automatically handle normalization during DE testing, but the size factors are already estimated in this step.

Run DESeq2 Analysis

# Run the DESeq2 pipeline
dds <- DESeq(dds)

# Extract results for all genes with default settings
res <- results(dds)

# Order by adjusted p-value
res <- res[order(res$padj),]

#get meaning of the columns
mcols(res, use.names = TRUE)

## DataFrame with 6 rows and 2 columns
##                        type            description
##                 <character>            <character>
## baseMean       intermediate mean of normalized c..
## log2FoldChange      results log2 fold change (ML..
## lfcSE               results standard error: cond..
## stat                results Wald statistic: cond..
## pvalue              results Wald test p-value: c..
## padj                results   BH adjusted p-values

#other summary
summary(res)

## 
## out of 25258 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1557, 6.2%
## LFC < 0 (down)     : 1179, 4.7%
## outliers [1]       : 142, 0.56%
## low counts [2]     : 9775, 39%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

The first column, baseMean, is a just the average of the normalized count values, divided by the size factors, taken over all samples in the DESeqDataSet. The remaining four columns refer to a specific contrast, namely the comparison of the treatment level over the control level for the factor variable dex.

The column log2FoldChange is the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment with dexamethasone in comparison to untreated samples. This value is reported on a logarithmic scale to base 2: for example, a log2 fold change of 1.5 means that the gene’s expression is increased by a multiplicative factor of 2^1.5≈2.82.

This estimate has an uncertainty associated with it, which is available in the column lfcSE, the standard error estimate for the log2 fold change estimate. The purpose of a test for differential expression is to test whether the data provides sufficient evidence to conclude that this value is really different from zero. DESeq2 performs for each gene a hypothesis test to see whether evidence is sufficient to decide against the null hypothesis that there is zero effect of the treatment on the gene and that the observed difference between treatment and control was merely caused by experimental variability (i.e., the type of variability that you can expect between different samples in the same treatment group). As usual in statistics, the result of this test is reported as a p value, and it is found in the column pvalue.

Examine and Export Results

Filter significant genes and save the results:

# Filter for significant results (adjusted p-value < 0.05)
deseq2_genes <- subset(res, padj < 0.05)

# Save to file
write.csv(as.data.frame(deseq2_genes), file="DESeq2_results.csv")

Visualizing Intersection of Significantly Expressed Genes

Venn Diagram

# Combine the lists of significant genes into a named list
significant_genes <- list(
  DESeq2 = rownames(deseq2_genes),
  limma = rownames(limma_genes)
)

## 1. Venn Diagram

# Create the Venn diagram
grid.newpage()

venn.plot <- venn.diagram(
  x = significant_genes,
  category.names = c("DESeq2", "limma"),
  filename = NULL,  # Set to NULL to plot directly
  output = TRUE,
  fill = c("cornflowerblue", "green"),
  alpha = 0.5,
  cex = 1.5,
  fontface = "bold",
  fontfamily = "sans"
)

# Display the Venn plot
grid.draw(venn.plot)

UpSet Plot

# Create binary matrix from gene lists for UpSetR
upset_data <- fromList(significant_genes)

# Generate the UpSet plot
upset(
  upset_data,
  sets = c("DESeq2", "limma"),
  order.by = "freq",
  main.bar.color = "steelblue",
  sets.bar.color = c("cornflowerblue", "green"),
  number.angles = 30,
  point.size = 3,
  line.size = 1
)

Results Interpretation and Visualization

Visualizing the results of differential gene expression (DGE) analysis is essential for interpreting patterns in the data, particularly the extent of upregulation and downregulation across genes. Here, we cover common visualization techniques using the results from DESeq2. We will generate several plots: volcano plots, MA plots, and heatmaps, each highlighting upregulated and downregulated genes effectively.

Volcano Plot

A volcano plot is a scatter plot used to visualize the significance (p-values) and magnitude (log-fold change) of gene expression changes. Genes with higher statistical significance and larger fold changes are easier to identify.

# Required Libraries
library(ggplot2)
library(dplyr)

# Example Data: Assume 'results' is a data frame containing columns logFC, P.Value (or PValue in edgeR), and adj.P.Val (or FDR)
# Add a column to categorize genes as upregulated, downregulated, or non-significant
results <- as.data.frame(res)
results <- results %>%
  mutate(Significance = case_when(
    padj < 0.05 & log2FoldChange > 1 ~ "Upregulated",
    padj < 0.05 & log2FoldChange < -1 ~ "Downregulated",
    TRUE ~ "Non-significant"
  ))

# Volcano Plot
ggplot(results, aes(x = log2FoldChange, y = -log10(pvalue), color = Significance)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("blue", "grey", "red")) +
  labs(title = "Volcano Plot of Differentially Expressed Genes",
       x = "Log2 Fold Change",
       y = "-log10(P-Value)") +
  theme_minimal() +
  theme(legend.position = "top")

Interpretation: Genes with higher logFC and lower p-values (higher -log10(p-value)) are either significantly upregulated (red) or downregulated (blue). Genes not meeting these thresholds are shown in grey. This plot quickly highlights the most significant and strongly differentially expressed genes.

MA Plot

An MA plot visualizes the relationship between the mean expression (average log counts per million, logCPM) and log-fold changes (logFC). This plot helps identify patterns in expression changes across the range of gene abundances.

# MA Plot
ggplot(results, aes(x = baseMean, y = log2FoldChange, color = Significance)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("blue", "grey", "red")) +
  xlim(0, 25000)+
  labs(title = "MA Plot of Differentially Expressed Genes",
       x = "Average Expression (Log CPM)",
       y = "Log2 Fold Change") +
  theme_minimal() +
  theme(legend.position = "top")

Interpretation: In an MA plot, points represent genes; the average expression (x-axis) is plotted against fold change (y-axis). Genes with strong positive or negative fold changes and low average expression values can indicate biologically relevant outliers or potential noise.

Heatmap of Differentially Expressed Genes

Heatmaps are powerful for visualizing expression patterns across multiple samples. In this example, we’ll plot the expression values of the top significantly differentially expressed genes.

# Required Libraries
library(pheatmap)
library(dplyr)

# Check Results Object
# Ensure results is a data.frame or tibble and contains necessary columns
head(results)

##                   baseMean log2FoldChange      lfcSE      stat       pvalue
## ENSG00000152583   954.7709       4.368359 0.23712679  18.42204 8.744898e-76
## ENSG00000179094   743.2527       2.863889 0.17556931  16.31201 8.107836e-60
## ENSG00000116584  2277.9135      -1.034701 0.06509844 -15.89440 6.928546e-57
## ENSG00000189221  2383.7537       3.341544 0.21240579  15.73189 9.144326e-56
## ENSG00000120129  3440.7038       2.965211 0.20369513  14.55710 5.264243e-48
## ENSG00000148175 13493.9204       1.427168 0.10038904  14.21638 7.251278e-46
##                         padj  Significance
## ENSG00000152583 1.341555e-71   Upregulated
## ENSG00000179094 6.219115e-56   Upregulated
## ENSG00000116584 3.543027e-53 Downregulated
## ENSG00000189221 3.507078e-52   Upregulated
## ENSG00000120129 1.615175e-44   Upregulated
## ENSG00000148175 1.854031e-42   Upregulated

# Prepare Data: Select top 50 differentially expressed genes
results <- results %>%
  mutate(gene = rownames(results)) # Add gene names if rownames exist

top_genes <- results %>%
  filter(padj < 0.05) %>%                     # Filter significant genes
  arrange(desc(abs(log2FoldChange))) %>%      # Sort by absolute log2 fold change
  slice_head(n = 40) %>%                      # Select top 50 genes
  pull(gene)                                  # Extract gene names

# Expression Matrix Preparation
# Assuming 'dds' is the DESeq2 object and you have a normalized counts matrix
norm_counts <- counts(dds, normalized = TRUE) # Get normalized counts
heatmap_data <- norm_counts[top_genes, ]      # Subset normalized counts by top genes

# Log Transform for Better Visualization
heatmap_data <- log2(heatmap_data + 1)        # Add pseudocount and log-transform

# Group Metadata
sample_groups <- as.data.frame(colData(dds)[, "condition", drop = FALSE])
colnames(sample_groups) <- "Group" # Rename for clarity
rownames(sample_groups) <- colnames(heatmap_data)

# Custom Colors for Groups
group_colors <- list(Group = c("control" = "green", "treated" = "purple"))

# Heatmap Visualization
pheatmap(heatmap_data,
         color = colorRampPalette(c("blue", "white", "red"))(40),
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         scale = "row",                      # Normalize rows (genes)
         show_rownames = TRUE,               # Show gene names
         show_colnames = TRUE,               # Show sample names
         annotation_col = sample_groups,     # Add group annotations
         annotation_colors = group_colors,   # Define group colors
         main = "Heatmap of Top Differentially Expressed Genes",
         fontsize_row = 7,                   # Adjust row font size
         fontsize_col = 7,                   # Adjust column font size
         angle_col = 45,
         fontsize = 8,
         #cellwidth = 20,                     # Width of each cell
         cellheight = 6)

Interpretation: This heatmap shows expression values across samples, with red representing upregulated genes and blue representing downregulated genes. Genes are clustered by similarity in expression patterns, revealing groups of co-expressed genes and allowing for the identification of up- and down-regulated clusters.

What if you have more than 8 samples?

To simplify the heatmap and represent one column for treated and one column for control, we aggregate the normalized counts for each condition by calculating the mean expression for each gene within each group.

metadata <- as.data.frame(colData(dds)[, "condition", drop = FALSE])
colnames(metadata) <- "Group" # Rename for clarity
metadata$Sample <- rownames(metadata)

# Aggregate Data by Condition
aggregated_data <- t(apply(heatmap_data, 1, function(x) {
  tapply(x, metadata$Group, mean) # Calculate mean expression per group
}))

# Check Aggregated Data
head(aggregated_data)

##                   control   treated
## ENSG00000179593 0.2657586 7.0069977
## ENSG00000277196 0.3916402 5.9022279
## ENSG00000109906 3.7684509 9.6560434
## ENSG00000171819 2.7840135 7.2426185
## ENSG00000127954 4.8958533 9.5460035
## ENSG00000118729 3.6497668 0.6836456

# Plot Heatmap
pheatmap(aggregated_data,
         color = colorRampPalette(c("blue", "white", "red"))(40),
         cluster_rows = FALSE,
         cluster_cols = FALSE,               # Disable clustering of conditions
         scale = "row",                      # Normalize rows (genes)
         show_rownames = TRUE,               # Show gene names
         show_colnames = TRUE,               # Show condition names
         main = "Heatmap of Top DE Genes (Aggregated by Group)",
         fontsize_row = 7,                   # Adjust row font size
         fontsize_col = 7,                   # Adjust column font size
         angle_col = 45,
         fontsize = 8,
         #cellwidth = 20,                     # Width of each cell
         cellheight = 6)                    # Adjust cell height

Other plots here and here

Handling and Converting Gene IDs

Common Gene ID Types and Formats

Typical Gene ID Formats 1. Ensembl Gene IDs: e.g., ENSG00000141510 for the TP53 gene. 2. NCBI Gene IDs: e.g., 7157 for the TP53 gene. 3. Gene Symbols: e.g., TP53, a more recognizable name, especially for human-readable purposes. 4. RefSeq IDs: e.g., NM_001126114 for specific transcript isoforms of TP53. 5. Custom IDs: Some datasets might use project-specific IDs or require converting IDs for cross-species analysis.

**Practical Steps for Inspecting and Converting Gene IDs*

First, load your read count table and inspect the gene ID column to understand the format in use.

# Preview the first few rows to inspect gene IDs
head(rownames(count_data))

## [1] "ENSG00000000003" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460"
## [5] "ENSG00000000938" "ENSG00000000971"

If the IDs are in a format that doesn’t match your annotation database, they will need to be converted.

Using the AnnotationDbi and org.Hs.eg.db Packages

For human data, AnnotationDbi and org.Hs.eg.db provide another reliable way to convert IDs without needing an internet connection.

Load Packages

library(AnnotationDbi)
library(org.Hs.eg.db)

How to know your Gene IDs

# Example gene ID vector
gene_ids <- rownames(count_data)

# Define possible ID types to test
possible_id_types <- c("SYMBOL", "ENTREZID", "ENSEMBL", "REFSEQ")

# Loop through possible ID types to check which one matches
for (id_type in possible_id_types) {
  result <- tryCatch({
    mapIds(org.Hs.eg.db, keys = gene_ids, column = "SYMBOL", keytype = id_type)
  }, error = function(e) NA)  # Ignore errors

  if (!all(is.na(result))) {
    cat("The IDs match the type:", id_type, "\n")
  }
}

## The IDs match the type: ENSEMBL

Now, you want to convert them to gene symbols:

ens2symbol <- AnnotationDbi::select(org.Hs.eg.db,
                                    key=results$gene, 
                                    columns="SYMBOL",
                                    keytype="ENSEMBL")
ens2symbol <- as_tibble(ens2symbol)
formattable(head(ens2symbol))

ENSEMBL	SYMBOL
ENSG00000152583	SPARCL1
ENSG00000179094	PER1
ENSG00000116584	ARHGEF2
ENSG00000189221	MAOA
ENSG00000120129	DUSP1
ENSG00000148175	STOM

deg_results <- inner_join(results, ens2symbol, by=c("gene"="ENSEMBL"))
formattable(head(deg_results))

baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	Significance	gene	SYMBOL
954.7709	4.368359	0.23712679	18.42204	8.744898e-76	1.341555e-71	Upregulated	ENSG00000152583	SPARCL1
743.2527	2.863889	0.17556931	16.31201	8.107836e-60	6.219115e-56	Upregulated	ENSG00000179094	PER1
2277.9135	-1.034701	0.06509844	-15.89440	6.928546e-57	3.543027e-53	Downregulated	ENSG00000116584	ARHGEF2
2383.7537	3.341544	0.21240579	15.73189	9.144326e-56	3.507078e-52	Upregulated	ENSG00000189221	MAOA
3440.7038	2.965211	0.20369513	14.55710	5.264243e-48	1.615175e-44	Upregulated	ENSG00000120129	DUSP1
13493.9204	1.427168	0.10038904	14.21638	7.251278e-46	1.854031e-42	Upregulated	ENSG00000148175	STOM

#write.csv(deg_results, "./deg_symbols.csv", row.names = FALSE)

#look at a subset of genes
deg_results$label <- ifelse(deg_results$SYMBOL %in% c("SPARCL1", "STOM", "PHC2", "MAOA", "ARHGEF2", "ZNF512", "MTHFSD", "CRISPLD2"), deg_results$SYMBOL, NA)
deg_results$is_labeled <- ifelse(deg_results$SYMBOL %in% c("SPARCL1", "STOM", "PHC2", "MAOA", "ARHGEF2", "ZNF512", "MTHFSD", "CRISPLD2"), "Labeled", "Not Labeled")


# Volcano Plot
ggplot(deg_results, aes(x = log2FoldChange, y = -log10(pvalue), color = Significance)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("blue", "grey", "red")) +
  new_scale_color() +
  geom_point(data = deg_results[deg_results$is_labeled == "Labeled", ], 
             aes(color = "Highlight"), size = 3, shape = 21) +
  scale_color_manual(values = c("Highlight" = "black")) +
  geom_text_repel(aes(label = label), size = 3, max.overlaps = 10, color = 'black', na.rm = TRUE) +
  labs(title = "Volcano Plot of Differentially Expressed Genes",
       x = "Log2 Fold Change",
       y = "-log10(P-Value)") +
  theme_minimal() +
  theme(legend.position = "top")