This tutorial provides a comprehensive summary of differential gene expression (DGE) analysis techniques that are broadly applicable across various research scenarios. Specifically, it focuses on workflows using the popular tools limma, DESeq2, and edgeR—three of the most widely adopted packages in RNA-seq analysis.For more inofrmation visit RNA-Seq and Differential Expression Analysis in R.
The data comes from:
Himes et al. “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS ONE. 2014 Jun 13;9(6):e99625. PMID: 24926665.
Asthma is a long-term inflammatory lung disease affecting over 300 million people worldwide. Glucocorticoids are commonly used to reduce inflammation in asthma, especially in airway smooth muscle (ASM), but how they work at the cellular level in ASM is still unclear. In this study, they used RNA-Seq to analyze changes in gene activity in ASM cells treated with dexamethasone, a strong synthetic glucocorticoid. They found 316 genes with altered activity, including well-known glucocorticoid-responsive genes and others less studied, like CRISPLD2. Interestingly, variations in the CRISPLD2 gene were linked to resistance to inhaled steroids and bronchodilator response in asthma patients. Tests showed that dexamethasone and an inflammatory molecule (IL1β) both increased CRISPLD2 levels. When CRISPLD2 was reduced, inflammation markers IL6 and IL8 increased. This research highlights CRISPLD2 as a gene that could influence asthma treatment responses by regulating glucocorticoid anti-inflammatory effects in ASM.
They did their analysis using Tophat and Cufflinks. We’re using three different R packages called DESeq2, limma. Click here to read more on DESeq2 previous tutorial.
A read counts matrix is the essential starting point for DGE analysis. This matrix contains counts of sequencing reads mapped to each gene across multiple samples. Each row represents a gene, and each column corresponds to a sample. The values in the matrix indicate the number of reads aligning to each gene, providing a measure of expression.
# Load necessary libraries
library(DESeq2)
library(edgeR)
library(limma)
library(readr)
library(vsn)
library(ggplot2)
library(VennDiagram)
library(UpSetR)
library(dplyr)
library(ggplot2)
library(factoextra)
library(formattable)
library(apeglm)
Ensure that the rows represent genes and columns represent samples.
# Import your read count table from Github
urlfile="https://raw.githubusercontent.com/merlinis12/RNA-Seq-Data-Analysis-in-R/refs/heads/main/data/airway_scaledcounts.csv"
count_data_original <- read_csv(url(urlfile))
# View data dimensions and initial rows
dim(count_data_original)
## [1] 38694 9
formattable(head(count_data_original))
ensgene | SRR1039508 | SRR1039509 | SRR1039512 | SRR1039513 | SRR1039516 | SRR1039517 | SRR1039520 | SRR1039521 |
---|---|---|---|---|---|---|---|---|
ENSG00000000003 | 723 | 486 | 904 | 445 | 1170 | 1097 | 806 | 604 |
ENSG00000000005 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSG00000000419 | 467 | 523 | 616 | 371 | 582 | 781 | 417 | 509 |
ENSG00000000457 | 347 | 258 | 364 | 237 | 318 | 447 | 330 | 324 |
ENSG00000000460 | 96 | 81 | 73 | 66 | 118 | 94 | 102 | 74 |
ENSG00000000938 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 |
#read metadata from github
urlfile="https://raw.githubusercontent.com/merlinis12/RNA-Seq-Data-Analysis-in-R/refs/heads/main/data/airway_metadata.csv"
meta_data <- read_csv(url(urlfile))
meta_data$dex <- as.factor(meta_data$dex)
# View data dimensions and initial rows
dim(meta_data)
## [1] 8 4
formattable(head(meta_data))
id | dex | celltype | geo_id |
---|---|---|---|
SRR1039508 | control | N61311 | GSM1275862 |
SRR1039509 | treated | N61311 | GSM1275863 |
SRR1039512 | control | N052611 | GSM1275866 |
SRR1039513 | treated | N052611 | GSM1275867 |
SRR1039516 | control | N080611 | GSM1275870 |
SRR1039517 | treated | N080611 | GSM1275871 |
# Check for missing values
missing_counts <- sum(is.na(count_data_original))
cat("Number of missing values in the read count table:", missing_counts)
## Number of missing values in the read count table: 0
If you decide to replace missing values with zeros (assuming missing values represent no expression), use the following code:
# Replace NA with zero
count_data <- count_data_original
count_data[is.na(count_data)] <- 0
Alternatively, you can filter out genes with too many missing values. For instance, remove genes missing in more than 20% of samples:
#Define a threshold for missing data (20%)
threshold <- 0.2
# Calculate the proportion of missing values per gene
missing_proportion <- rowMeans(is.na(count_data))
# Filter out genes with missing values above the threshold
filtered_counts <- count_data[missing_proportion <= threshold, ]
What if a gene has always expression = 0?
all_zero_rows <- apply(count_data[,-1], 1, function(row) all(row == 0))
row_indices <- which(all_zero_rows)
# This will print the indices of genes where all values are zero
cat("Number of genes with all zero values in the read count table:",length(row_indices))
## Number of genes with all zero values in the read count table: 13436
#remove gene with exp = 0
gene_list <- count_data[-row_indices, ][[1]]
count_data <- count_data[-row_indices,-1]
#gene names as rownames
rownames(count_data) <- gene_list
# View data dimensions and initial rows
dim(count_data)
## [1] 25258 8
formattable(head(count_data))
SRR1039508 | SRR1039509 | SRR1039512 | SRR1039513 | SRR1039516 | SRR1039517 | SRR1039520 | SRR1039521 |
---|---|---|---|---|---|---|---|
723 | 486 | 904 | 445 | 1170 | 1097 | 806 | 604 |
467 | 523 | 616 | 371 | 582 | 781 | 417 | 509 |
347 | 258 | 364 | 237 | 318 | 447 | 330 | 324 |
96 | 81 | 73 | 66 | 118 | 94 | 102 | 74 |
0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 |
3413 | 3916 | 6000 | 4308 | 6424 | 10723 | 5039 | 7803 |
Most DGE tools operate on raw or normalized count values. However, transformations are often applied to facilitate downstream analyses. Common transformations include:
rlog()
) in DESeq2 help address heteroskedasticity by shrinking variance for low-count genes.Log transformations are commonly applied using base 2 (log2), which is easier to interpret in biological terms (doubling, halving, etc.). When using log transformation on count data, a pseudocount (e.g., +1) is added to avoid taking the log of zero.
# Log2 transformation with pseudocount
log.norm.counts <- log2(count_data + 1)
# Visualize transformed data
par(mfrow=c(2,1))
boxplot(count_data, notch=TRUE, main="Untransformed read counts", ylab="Read counts")
boxplot(log.norm.counts, notch=TRUE, main="Log2-transformed read counts", ylab="Log2(read counts)")
This transformation is particularly useful for creating visualizations, as it compresses the range of read counts, making patterns more discernible.
Heteroskedasticity occurs when variance changes with the mean, often observed in RNA-Seq data. Visual checks for heteroskedasticity are helpful before choosing a transformation method.
# Pairwise scatter plot to inspect count similarity across samples
plot(log.norm.counts[,c(1,2)], cex=.1, main="Normalized log2(read counts)")
# Mean-SD plot to check for variance dependency
msd_plot <- meanSdPlot(as.matrix(log.norm.counts), ranks=FALSE, plot=FALSE)
msd_plot$gg + ggtitle("Sequencing depth normalized log2(read counts)") + ylab("Standard deviation")
An upward trend on the left side of the mean-SD plot indicates that low-count genes show higher variability, suggesting the need for variance stabilization. Variance stabilizing transformations. Plots show the standard deviation of normalized counts (normalizedCounts) using log2() (A), rlog() (B), and variance stabilizing (vst()) (C) transformations by rank(mean). The red line shows the running median estimator. As (B) and (C) show, the transformations greatly reduce the standard deviation, and the rlog() transformation is most effective at stabilizing the variance across the mean.
Before conducting DGE analysis, it’s important to explore patterns across your samples. This step helps you understand sample-to-sample variability, identify any outliers, and assess the similarity between replicates. Three common methods for examining global patterns in RNA-Seq data include:
Pairwise correlation is a straightforward method to assess similarity between samples. High correlations between replicates indicate consistency, while low correlations might signal technical issues or biological differences.
# Install required packages if necessary
if (!requireNamespace("pheatmap", quietly = TRUE)) install.packages("pheatmap")
# Load library
library(pheatmap)
# Assume `norm_counts` is a matrix of normalized counts (e.g., log2 transformed)
# Calculate pairwise correlation
cor_matrix <- cor(log.norm.counts, method="pearson")
# Plot heatmap of the correlation matrix
pheatmap(cor_matrix, main="Pairwise Sample Correlation", display_numbers=TRUE)
This heatmap provides a quick overview of which samples are similar or dissimilar based on their gene expression profiles. Values close to 1 indicate high similarity, while lower values suggest potential differences.
Hierarchical clustering groups samples based on their similarity, providing a tree-like representation (dendrogram) that visually illustrates relationships among samples. This is particularly useful for detecting clusters of similar samples, identifying outliers, and understanding sample grouping.
# Convert correlation matrix to distance matrix
dist_matrix <- as.dist(1 - cor_matrix) # 1 - correlation to get distances
# Perform hierarchical clustering
hclust_res <- hclust(dist_matrix, method="average")
# Plot dendrogram
plot(hclust_res, main="Hierarchical Clustering of Samples", sub="", xlab="", ylab="Distance")
Principal Component Analysis (PCA) is a dimensionality reduction technique that summarizes the variation across samples in a few components. Each principal component represents a portion of the total variance in the data, and plotting samples based on the first two components often reveals sample groupings and potential outliers.
# Perform PCA
pca_res <- prcomp(as.data.frame(t(log.norm.counts)), center=TRUE, scale.=TRUE) # Transpose to analyze samples
# Plot PCA
pca_data <- as.data.frame(pca_res$x)
pca_data$Sample <- rownames(pca_data)
pca_data[['dex']] <- meta_data[which(meta_data$id == pca_data$Sample),][["dex"]]
pca_data[['celltype']] <- meta_data[which(meta_data$id == pca_data$Sample),][["celltype"]]
#plot eigenvalues
fviz_eig(pca_res)
ggplot(pca_data, aes(x=PC1, y=PC2, label=Sample, colour = dex)) +
geom_point(size=3) +
geom_text(vjust=1.5) +
labs(title="PCA of Samples by Treatment", x="Principal Component 1", y="Principal Component 2") +
theme_minimal()
ggplot(pca_data, aes(x=PC1, y=PC2, label=Sample,color = dex, shape = celltype)) +
geom_point(size=3) +
geom_text(vjust=1.5) +
labs(title="PCA of Samples by Treatment and Cell Type", x="Principal Component 1", y="Principal Component 2") +
theme_minimal()
In the PCA plot, samples with similar expression profiles will cluster together. The variance explained by each principal component helps determine which components capture the most variation in the data.
limma is typically used for microarray analysis but can also handle RNA-Seq data by first applying voom
, which transforms count data into log2 counts-per-million (CPM) with associated precision weights based on mean-variance.
# Define experimental groups
ids <- colnames(count_data)
group <- meta_data[which(meta_data$id == ids),][['dex']]# Modify as per your setup
design <- model.matrix(~group)
# Apply edgeR’s DGEList to organize data
dge <- DGEList(counts = as.matrix(count_data))
# Filter out lowly expressed genes (optional but recommended)
keep <- filterByExpr(dge, design)
dge <- dge[keep, ]
# Use voom transformation to convert to log2 CPM and estimate mean-variance relationship
v <- voom(dge, design)
# Inspect the transformed data
formattable(head(v$E))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 5.097 4.649 5.117 4.828 5.542
## ENSG00000000419 4.467 4.755 4.564 4.566 4.536
## ENSG00000000457 4.039 3.737 3.806 3.921 3.665
## ENSG00000000460 2.19 2.072 1.496 2.084 2.238
## ENSG00000000971 7.335 7.658 7.847 8.102 7.999
## ENSG00000001036 6.783 6.466 6.663 6.461 6.43
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 5.108 5.356 4.787
## ENSG00000000419 4.618 4.406 4.541
## ENSG00000000457 3.814 4.069 3.89
## ENSG00000000460 1.57 2.38 1.767
## ENSG00000000971 8.396 7.999 8.478
## ENSG00000001036 6.152 6.787 6.351
Ready Data for Differential Expression in limma The object v
now contains log2-transformed counts with associated weights, suitable for DE analysis in limma
.
# Fit the model
fit <- lmFit(v, design = model.matrix(~ group))
fit <- eBayes(fit)
# Extract results and order by adj p-value
formattable(topTable(fit, adjust = "fdr", sort.by = "P"))
logFC | AveExpr | t | P.Value | adj.P.Val | B | |
---|---|---|---|---|---|---|
ENSG00000120129 | 2.885463 | 6.606309 | 16.32247 | 6.966005e-08 | 0.0004233824 | 8.835158 |
ENSG00000152583 | 4.286436 | 4.151669 | 15.83640 | 9.017721e-08 | 0.0004233824 | 8.091505 |
ENSG00000189221 | 3.287994 | 5.922711 | 15.82970 | 9.050290e-08 | 0.0004233824 | 8.564216 |
ENSG00000179094 | 2.778021 | 4.445666 | 15.04101 | 1.398648e-07 | 0.0004233824 | 8.041050 |
ENSG00000162614 | 1.897122 | 7.676952 | 14.76806 | 1.634158e-07 | 0.0004233824 | 8.055725 |
ENSG00000116584 | -1.116939 | 6.600136 | -14.71454 | 1.685328e-07 | 0.0004233824 | 8.043382 |
ENSG00000178695 | -2.589366 | 6.430372 | -14.25608 | 2.204244e-07 | 0.0004528795 | 7.773064 |
ENSG00000096060 | 3.928518 | 5.773563 | 13.97779 | 2.604513e-07 | 0.0004528795 | 7.568161 |
ENSG00000101347 | 3.689697 | 8.245659 | 13.91589 | 2.704117e-07 | 0.0004528795 | 7.564826 |
ENSG00000134686 | 1.361308 | 6.880045 | 13.43925 | 3.629235e-07 | 0.0005470346 | 7.309653 |
Understanding limma
Differential Expression Analysis Results
logFC (Log Fold Change):
The log2-transformed fold change in expression between the compared groups. A positive value indicates that the gene is more highly expressed in the test group, while a negative value suggests higher expression in the control group. This measure reflects the magnitude of change in gene expression.
AveExpr (Average Expression):
The average expression level of the gene across all samples in the analysis. This value provides context on the overall abundance of the gene in the dataset, regardless of any differences between groups.
t (t-statistic):
The t-statistic for each gene, which reflects the ratio of the log fold change to its standard error. Higher absolute values of t indicate greater evidence against the null hypothesis (i.e., no differential expression). A higher t-value generally corresponds to a more statistically significant result.
P.Value:
The raw p-value derived from the statistical test. This value assesses the likelihood of observing the given log fold change by chance, assuming there’s no actual difference in expression between groups. Lower p-values indicate stronger evidence that the gene is differentially expressed.
adj.P.Val (Adjusted P-Value):
The adjusted p-value, often calculated using the Benjamini-Hochberg method to control for false discovery rate (FDR). Adjusted p-values help account for multiple testing and reduce the likelihood of identifying false positives as significant. Genes with adjusted p-values below a chosen threshold (e.g., 0.05) are typically considered significantly differentially expressed.
B (Log-Odds Score):
The log-odds of differential expression, also known as the B-statistic. It estimates the likelihood that a gene is truly differentially expressed based on the model used. Higher values indicate higher confidence in the gene’s differential expression.
Save Results
# Save significant results
limma_genes <- topTable(fit, adjust = "fdr", number = Inf)
limma_genes <- filter(limma_genes, adj.P.Val < 0.05)
write.csv(limma_genes, "limma_voom_results.csv")
DESeq2
directly models raw counts without prior transformation, applying its own size factor-based normalization.
# Define the sample information as a data frame
col_data <- data.frame(
row.names = colnames(count_data),
condition = group # Modify as per your setup
)
# Create DESeq2 dataset object
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~condition)
# Apply DESeq’s normalization
dds <- estimateSizeFactors(dds)
# Inspect normalized counts (optional)
norm_counts <- counts(dds, normalized = TRUE)
formattable(head(norm_counts))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 709.3 539.7 767.1 663.2 997.3
## ENSG00000000419 458.1 580.7 522.7 552.9 496.1
## ENSG00000000457 340.4 286.5 308.9 353.2 271.1
## ENSG00000000460 94.17 89.94 61.95 98.36 100.6
## ENSG00000000938 0 0 0.8486 0 1.705
## ENSG00000000971 3348 4348 5092 6420 5476
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 787.5 878 635.5
## ENSG00000000419 560.7 454.2 535.5
## ENSG00000000457 320.9 359.5 340.9
## ENSG00000000460 67.48 111.1 77.86
## ENSG00000000938 0 0 0
## ENSG00000000971 7698 5489 8210
Ready Data for Differential Expression in DESeq2 The dds
object is now prepared for DE analysis. DESeq2
will automatically handle normalization during DE testing, but the size factors are already estimated in this step.
# Run the DESeq2 pipeline
dds <- DESeq(dds)
# Extract results for all genes with default settings
res <- results(dds)
# Order by adjusted p-value
res <- res[order(res$padj),]
#get meaning of the columns
mcols(res, use.names = TRUE)
## DataFrame with 6 rows and 2 columns
## type description
## <character> <character>
## baseMean intermediate mean of normalized c..
## log2FoldChange results log2 fold change (ML..
## lfcSE results standard error: cond..
## stat results Wald statistic: cond..
## pvalue results Wald test p-value: c..
## padj results BH adjusted p-values
#other summary
summary(res)
##
## out of 25258 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 1557, 6.2%
## LFC < 0 (down) : 1179, 4.7%
## outliers [1] : 142, 0.56%
## low counts [2] : 9775, 39%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
The first column, baseMean
, is a just the average of the normalized count values, divided by the size factors, taken over all samples in the DESeqDataSet. The remaining four columns refer to a specific contrast, namely the comparison of the treatment
level over the control
level for the factor variable dex.
The column log2FoldChange
is the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment with dexamethasone in comparison to untreated samples. This value is reported on a logarithmic scale to base 2: for example, a log2 fold change of 1.5 means that the gene’s expression is increased by a multiplicative factor of 21.5≈2.82.
This estimate has an uncertainty associated with it, which is available in the column lfcSE
, the standard error estimate for the log2 fold change estimate. The purpose of a test for differential expression is to test whether the data provides sufficient evidence to conclude that this value is really different from zero. DESeq2 performs for each gene a hypothesis test to see whether evidence is sufficient to decide against the null hypothesis that there is zero effect of the treatment on the gene and that the observed difference between treatment and control was merely caused by experimental variability (i.e., the type of variability that you can expect between different samples in the same treatment group). As usual in statistics, the result of this test is reported as a p value, and it is found in the column pvalue.
Filter significant genes and save the results:
# Filter for significant results (adjusted p-value < 0.05)
deseq2_genes <- subset(res, padj < 0.05)
# Save to file
write.csv(as.data.frame(deseq2_genes), file="DESeq2_results.csv")
# Combine the lists of significant genes into a named list
significant_genes <- list(
DESeq2 = rownames(deseq2_genes),
limma = rownames(limma_genes)
)
## 1. Venn Diagram
# Create the Venn diagram
grid.newpage()
venn.plot <- venn.diagram(
x = significant_genes,
category.names = c("DESeq2", "limma"),
filename = NULL, # Set to NULL to plot directly
output = TRUE,
fill = c("cornflowerblue", "green"),
alpha = 0.5,
cex = 1.5,
fontface = "bold",
fontfamily = "sans"
)
# Display the Venn plot
grid.draw(venn.plot)
# Create binary matrix from gene lists for UpSetR
upset_data <- fromList(significant_genes)
# Generate the UpSet plot
upset(
upset_data,
sets = c("DESeq2", "limma"),
order.by = "freq",
main.bar.color = "steelblue",
sets.bar.color = c("cornflowerblue", "green"),
number.angles = 30,
point.size = 3,
line.size = 1
)
Visualizing the results of differential gene expression (DGE) analysis is essential for interpreting patterns in the data, particularly the extent of upregulation and downregulation across genes. Here, we cover common visualization techniques using the results from DESeq2
. We will generate several plots: volcano plots, MA plots, and heatmaps, each highlighting upregulated and downregulated genes effectively.
A volcano plot is a scatter plot used to visualize the significance (p-values) and magnitude (log-fold change) of gene expression changes. Genes with higher statistical significance and larger fold changes are easier to identify.
# Required Libraries
library(ggplot2)
library(dplyr)
# Example Data: Assume 'results' is a data frame containing columns logFC, P.Value (or PValue in edgeR), and adj.P.Val (or FDR)
# Add a column to categorize genes as upregulated, downregulated, or non-significant
results <- as.data.frame(res)
results <- results %>%
mutate(Significance = case_when(
padj < 0.05 & log2FoldChange > 1 ~ "Upregulated",
padj < 0.05 & log2FoldChange < -1 ~ "Downregulated",
TRUE ~ "Non-significant"
))
# Volcano Plot
ggplot(results, aes(x = log2FoldChange, y = -log10(pvalue), color = Significance)) +
geom_point(alpha = 0.6, size = 1.5) +
scale_color_manual(values = c("blue", "grey", "red")) +
labs(title = "Volcano Plot of Differentially Expressed Genes",
x = "Log2 Fold Change",
y = "-log10(P-Value)") +
theme_minimal() +
theme(legend.position = "top")
An MA plot visualizes the relationship between the mean expression (average log counts per million, logCPM) and log-fold changes (logFC). This plot helps identify patterns in expression changes across the range of gene abundances.
# MA Plot
ggplot(results, aes(x = baseMean, y = log2FoldChange, color = Significance)) +
geom_point(alpha = 0.6, size = 1.5) +
scale_color_manual(values = c("blue", "grey", "red")) +
xlim(0, 25000)+
labs(title = "MA Plot of Differentially Expressed Genes",
x = "Average Expression (Log CPM)",
y = "Log2 Fold Change") +
theme_minimal() +
theme(legend.position = "top")
Heatmaps are powerful for visualizing expression patterns across multiple samples. In this example, we’ll plot the expression values of the top significantly differentially expressed genes.
# Required Libraries
library(pheatmap)
library(dplyr)
# Check Results Object
# Ensure results is a data.frame or tibble and contains necessary columns
head(results)
## baseMean log2FoldChange lfcSE stat pvalue
## ENSG00000152583 954.7709 4.368359 0.23712679 18.42204 8.744898e-76
## ENSG00000179094 743.2527 2.863889 0.17556931 16.31201 8.107836e-60
## ENSG00000116584 2277.9135 -1.034701 0.06509844 -15.89440 6.928546e-57
## ENSG00000189221 2383.7537 3.341544 0.21240579 15.73189 9.144326e-56
## ENSG00000120129 3440.7038 2.965211 0.20369513 14.55710 5.264243e-48
## ENSG00000148175 13493.9204 1.427168 0.10038904 14.21638 7.251278e-46
## padj Significance
## ENSG00000152583 1.341555e-71 Upregulated
## ENSG00000179094 6.219115e-56 Upregulated
## ENSG00000116584 3.543027e-53 Downregulated
## ENSG00000189221 3.507078e-52 Upregulated
## ENSG00000120129 1.615175e-44 Upregulated
## ENSG00000148175 1.854031e-42 Upregulated
# Prepare Data: Select top 50 differentially expressed genes
results <- results %>%
mutate(gene = rownames(results)) # Add gene names if rownames exist
top_genes <- results %>%
filter(padj < 0.05) %>% # Filter significant genes
arrange(desc(abs(log2FoldChange))) %>% # Sort by absolute log2 fold change
slice_head(n = 40) %>% # Select top 50 genes
pull(gene) # Extract gene names
# Expression Matrix Preparation
# Assuming 'dds' is the DESeq2 object and you have a normalized counts matrix
norm_counts <- counts(dds, normalized = TRUE) # Get normalized counts
heatmap_data <- norm_counts[top_genes, ] # Subset normalized counts by top genes
# Log Transform for Better Visualization
heatmap_data <- log2(heatmap_data + 1) # Add pseudocount and log-transform
# Group Metadata
sample_groups <- as.data.frame(colData(dds)[, "condition", drop = FALSE])
colnames(sample_groups) <- "Group" # Rename for clarity
rownames(sample_groups) <- colnames(heatmap_data)
# Custom Colors for Groups
group_colors <- list(Group = c("control" = "green", "treated" = "purple"))
# Heatmap Visualization
pheatmap(heatmap_data,
color = colorRampPalette(c("blue", "white", "red"))(40),
cluster_rows = FALSE,
cluster_cols = FALSE,
scale = "row", # Normalize rows (genes)
show_rownames = TRUE, # Show gene names
show_colnames = TRUE, # Show sample names
annotation_col = sample_groups, # Add group annotations
annotation_colors = group_colors, # Define group colors
main = "Heatmap of Top Differentially Expressed Genes",
fontsize_row = 7, # Adjust row font size
fontsize_col = 7, # Adjust column font size
angle_col = 45,
fontsize = 8,
#cellwidth = 20, # Width of each cell
cellheight = 6)
To simplify the heatmap and represent one column for treated and one column for control, we aggregate the normalized counts for each condition by calculating the mean expression for each gene within each group.
metadata <- as.data.frame(colData(dds)[, "condition", drop = FALSE])
colnames(metadata) <- "Group" # Rename for clarity
metadata$Sample <- rownames(metadata)
# Aggregate Data by Condition
aggregated_data <- t(apply(heatmap_data, 1, function(x) {
tapply(x, metadata$Group, mean) # Calculate mean expression per group
}))
# Check Aggregated Data
head(aggregated_data)
## control treated
## ENSG00000179593 0.2657586 7.0069977
## ENSG00000277196 0.3916402 5.9022279
## ENSG00000109906 3.7684509 9.6560434
## ENSG00000171819 2.7840135 7.2426185
## ENSG00000127954 4.8958533 9.5460035
## ENSG00000118729 3.6497668 0.6836456
# Plot Heatmap
pheatmap(aggregated_data,
color = colorRampPalette(c("blue", "white", "red"))(40),
cluster_rows = FALSE,
cluster_cols = FALSE, # Disable clustering of conditions
scale = "row", # Normalize rows (genes)
show_rownames = TRUE, # Show gene names
show_colnames = TRUE, # Show condition names
main = "Heatmap of Top DE Genes (Aggregated by Group)",
fontsize_row = 7, # Adjust row font size
fontsize_col = 7, # Adjust column font size
angle_col = 45,
fontsize = 8,
#cellwidth = 20, # Width of each cell
cellheight = 6) # Adjust cell height
Common Gene ID Types and Formats
Typical Gene ID Formats 1. Ensembl Gene IDs: e.g., ENSG00000141510 for the TP53 gene. 2. NCBI Gene IDs: e.g., 7157 for the TP53 gene. 3. Gene Symbols: e.g., TP53, a more recognizable name, especially for human-readable purposes. 4. RefSeq IDs: e.g., NM_001126114 for specific transcript isoforms of TP53. 5. Custom IDs: Some datasets might use project-specific IDs or require converting IDs for cross-species analysis.
**Practical Steps for Inspecting and Converting Gene IDs*
First, load your read count table and inspect the gene ID column to understand the format in use.
# Preview the first few rows to inspect gene IDs
head(rownames(count_data))
## [1] "ENSG00000000003" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460"
## [5] "ENSG00000000938" "ENSG00000000971"
If the IDs are in a format that doesn’t match your annotation database, they will need to be converted.
For human data, AnnotationDbi and org.Hs.eg.db provide another reliable way to convert IDs without needing an internet connection.
library(AnnotationDbi)
library(org.Hs.eg.db)
# Example gene ID vector
gene_ids <- rownames(count_data)
# Define possible ID types to test
possible_id_types <- c("SYMBOL", "ENTREZID", "ENSEMBL", "REFSEQ")
# Loop through possible ID types to check which one matches
for (id_type in possible_id_types) {
result <- tryCatch({
mapIds(org.Hs.eg.db, keys = gene_ids, column = "SYMBOL", keytype = id_type)
}, error = function(e) NA) # Ignore errors
if (!all(is.na(result))) {
cat("The IDs match the type:", id_type, "\n")
}
}
## The IDs match the type: ENSEMBL
Now, you want to convert them to gene symbols:
# Convert Ensembl IDs to Gene Symbols
gene_symbols <- mapIds(org.Hs.eg.db,
keys = rownames(count_data),
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first")
# If you want, update row names in read count data
#rownames(count_data) <- gene_symbols