Pre-requisites:

Linux based software/tools:

# The following tools are expected to be installed on a Linux-based system.
# Each tool is provided as a module and is loaded with the command 'module load <Tool_Name>'

module load fastp
module load fastqc
module load star
module load samtools
module load subread
module load rseqc
module load csvtk
module load tpmcalculator

R-packages:

library(DESeq2)
library(ggplot2)
library(gplots)
library(tidyverse)
library(RColorBrewer)
library(edgeR)
library(ggrepel)
library(ComplexHeatmap)
library(dplyr)
library(EnhancedVolcano)
library(circlize)
library(msigdbr)
library(clusterProfiler)
library(org.Hs.eg.db)
library(ggvenn)
library(openxlsx)

Step 1: Preparation: day 1

Timing: 2 hours

Create working directory:

# All analysis should be carried out in a specific directory.
# Here we define a base directory as "RNAseq_Analysis" and 
# all paths are in reference to this base directory.

cd ~
mkdir RNAseq_Analysis

Download reference data:

Raw FASTQ files were obtained from GSE255741 (Supinoxin) and GSE142024 (DDX5). The human reference genome sequence (hg38) in FASTA format and the annotation in GTF format were downloaded from the Ensembl genome browser (https://www.ensembl.org/). Gene annotations for human in BED format (hg38_GENCODE_V47.bed.gz) were downloaded from https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/.

cd ~/RNAseq_Analysis/
  
mkdir input

# Copy or download the paired-end FASTQ data.
# For this instruction, we assume data is labelled as: 
# control1_1.fastq.gz  and  control1_2.fastq.gz

##########

mkdir reference
cd reference

wget https://ftp.ensembl.org/pub/release-114/fasta/homo_sapiens\
/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

gunzip Homo_sapiens.GRCh38.dna.toplevel.fa.gz
mv Homo_sapiens.GRCh38.dna.toplevel.fa genome_ref.fasta

##########

cd ~/RNAseq_Analysis/
  
mkdir annotations
cd annotations

wget https://ftp.ensembl.org/pub/release-114/gtf/\
homo_sapiens/Homo_sapiens.GRCh38.114.chr.gtf.gz

gunzip Homo_sapiens.GRCh38.114.chr.gtf.gz
mv Homo_sapiens.GRCh38.114.chr.gtf genome_ref.gtf

##########

wget https://sourceforge.net/projects/rseqc/files/BED/\
Human_Homo_sapiens/hg38_GENCODE_V47.bed.gz

gunzip hg38_GENCODE_V47.bed.gz
sed -i 's/^chr//g'  hg38_GENCODE_V47.bed

Step 2: Quality control of FASTQ data: day 1

Timing: up to 4 hours per sample

a. Quality assessment

The quality of each FASTQ file was assessed using the fastqc tool.

cd ~/RNAseq_Analysis/
  
mkdir quality_control
cd quality_control

mkdir fastqc_before
cd fastqc_before

mkdir control1

fastqc ~/RNAseq_Analysis/input/control1_1.fastq.gz -t 30  -o control1
fastqc ~/RNAseq_Analysis/input/control1_2.fastq.gz -t 30  -o control1

# Repeat above steps for every FASTQ file for each sample
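
The per-sample commands above can be wrapped in a loop. The following is a minimal sketch, assuming every sample follows the <sample>_1.fastq.gz / <sample>_2.fastq.gz naming used in this protocol; `list_samples` is a hypothetical helper introduced here for illustration, not part of any tool.

```shell
# Hypothetical helper: print one sample name per line, derived from the
# files named <sample>_1.fastq.gz in the directory given as $1.
list_samples() {
    for f in "$1"/*_1.fastq.gz; do
        [ -e "$f" ] || continue        # the glob matched nothing
        basename "$f" _1.fastq.gz
    done
}

# Run FastQC on both mates of every sample; skipped when fastqc is not on PATH.
# Assumption: sample names contain no whitespace.
input_dir="$HOME/RNAseq_Analysis/input"
if command -v fastqc >/dev/null 2>&1; then
    for sample in $(list_samples "$input_dir"); do
        mkdir -p "$sample"
        fastqc "$input_dir/${sample}_1.fastq.gz" -t 30 -o "$sample"
        fastqc "$input_dir/${sample}_2.fastq.gz" -t 30 -o "$sample"
    done
fi
```

The same sample list can drive the trimming, mapping and counting steps, so that each sample name is typed only once.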

b. Quality-based trimming

Quality-based trimming (removal of adapters, low-quality bases and short sequences) was performed using fastp. After trimming, the trimmed data were re-assessed using fastqc to confirm data quality.

cd ~/RNAseq_Analysis/quality_control/

mkdir fastp 
cd fastp 

# Processing for sample control1

fastp -w 30 \
-i ~/RNAseq_Analysis/input/control1_1.fastq.gz \
-I ~/RNAseq_Analysis/input/control1_2.fastq.gz \
-o control1_1.trimmed.fastq.gz \
-O control1_2.trimmed.fastq.gz \
--length_required  50 \
-q 30 \
--detect_adapter_for_pe  \
-h control1.html  \
-j control1.fastp.json 


cd ~/RNAseq_Analysis/quality_control/

mkdir fastqc_after
cd fastqc_after

mkdir control1.trimmed

fastqc ~/RNAseq_Analysis/quality_control/fastp/control1_1.trimmed.fastq.gz \
-t 30 \
-o control1.trimmed 

fastqc ~/RNAseq_Analysis/quality_control/fastp/control1_2.trimmed.fastq.gz \
-t 30 \
-o control1.trimmed 

# Repeat above steps for each sample

Step 3: Indexing reference genome: day 1

Timing: 4 hours

a. Indexing:

The reference genome was indexed with the STAR aligner. An index is a structured representation of a genome that enables faster and more efficient searching and alignment of sequences.

cd ~/RNAseq_Analysis/

cd reference

STAR --runThreadN 30 \
--runMode genomeGenerate \
--genomeDir . \
--genomeFastaFiles genome_ref.fasta

Step 4: Alignment with reference genome: day 2

Timing: up to 6 hours per sample

a. Mapping:

Quality trimmed reads were mapped to reference genome using the STAR aligner. Alignment summary file was inspected to ensure a suitable percentage of reads are mapped to reference genome. The aligned data (BAM format) was generated for each sample.

cd ~/RNAseq_Analysis/

mkdir mapping
cd mapping


STAR --runThreadN 30 \
--runMode alignReads \
--outSAMunmapped Within \
--outSAMattrIHstart 0 \
--outFilterIntronMotifs RemoveNoncanonical \
--genomeDir ~/RNAseq_Analysis/reference/ \
--readFilesIn ~/RNAseq_Analysis/quality_control/fastp/control1_1.trimmed.fastq.gz \
~/RNAseq_Analysis/quality_control/fastp/control1_2.trimmed.fastq.gz \
--readFilesCommand zcat \
--twopassMode Basic \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix control1.

samtools sort -@ 30 -n  control1.Aligned.sortedByCoord.out.bam  -o control1_sortedbyread.bam 
samtools index -@ 30   control1.Aligned.sortedByCoord.out.bam 
samtools flagstat -@ 30 control1.Aligned.sortedByCoord.out.bam > control1_flagstat.txt  

# Repeat above steps for each sample
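
The alignment summary written by STAR (here control1.Log.final.out) can be checked from the command line. The sketch below is illustrative only: `uniquely_mapped_pct` and `check_mapping_rate` are hypothetical helper names, and the 70% default threshold is an assumption, not a value taken from this protocol.

```shell
# Hypothetical helper: print the "Uniquely mapped reads %" value from a
# STAR Log.final.out file (label and value are separated by "|").
uniquely_mapped_pct() {
    awk -F'|' '/Uniquely mapped reads %/ {
        gsub(/[ \t%]/, "", $2)      # strip whitespace and the percent sign
        print $2
    }' "$1"
}

# Hypothetical helper: warn when the rate is below a threshold
# ($2, default 70; the default is an illustrative assumption).
check_mapping_rate() {
    pct=$(uniquely_mapped_pct "$1")
    min="${2:-70}"
    if awk -v p="$pct" -v m="$min" 'BEGIN { exit !(p >= m) }'; then
        echo "OK: $pct% uniquely mapped"
    else
        echo "WARNING: only $pct% uniquely mapped (threshold $min%)"
    fi
}

# Example: check_mapping_rate control1.Log.final.out
```

What counts as a suitable mapping rate depends on the library and organism, so the threshold should be chosen per experiment.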

Step 5: Infer data strandedness: day 3

Timing: up to 30 minutes per sample

Running the infer_experiment.py script:

Determining whether data is stranded or un-stranded (which depends on the library preparation and sequencing protocols) is crucial for proper analysis and interpretation of RNAseq data. A stranded protocol preserves the directionality of the transcripts, while a non-stranded protocol does not.

cd ~/RNAseq_Analysis/
cd mapping

infer_experiment.py \
-i control1.Aligned.sortedByCoord.out.bam \
-r  ~/RNAseq_Analysis/annotations/hg38_GENCODE_V47.bed > control1.infer_exp.txt

# Repeat above steps for each sample

Infer strandedness:

Data strandedness for each sample was determined using the ‘infer_experiment.py’ script from the RSeQC package. The output denotes the fraction of reads assigned to each biological sequence orientation (5’-3’ – forward or 3’-5’ reverse). Generally, for un-stranded libraries, fractions of reads assigned to each orientation are roughly equal (50:50) while for stranded libraries a definitive bias is observed towards one orientation (80:20, 20:80 or similar).
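
The interpretation above can be turned into a small helper that reads a paired-end infer_experiment.py report and prints the corresponding featureCounts -s value. This is a sketch under stated assumptions: `suggest_strand_flag` is a hypothetical name, the 0.75 cut-off is an illustrative choice rather than an RSeQC default, and only the paired-end report layout ("1++,1--,2+-,2-+" and "1+-,1-+,2++,2--") is handled.

```shell
# Hypothetical helper: read an infer_experiment.py report ($1) and print the
# matching featureCounts -s value (0 = unstranded, 1 = forward, 2 = reverse).
# $2 is the minimum fraction needed to call a library stranded
# (default 0.75, an illustrative assumption).
suggest_strand_flag() {
    awk -F': ' -v cutoff="${2:-0.75}" '
        /1\+\+,1--,2\+-,2-\+/ { fwd = $2 }   # read 1 on the same strand as the gene
        /1\+-,1-\+,2\+\+,2--/ { rev = $2 }   # read 1 on the opposite strand
        END {
            if (fwd + 0 >= cutoff)      print 1
            else if (rev + 0 >= cutoff) print 2
            else                        print 0
        }' "$1"
}

# Example: suggest_strand_flag control1.infer_exp.txt
```

Borderline fractions are best inspected by hand and checked against the library preparation kit before a value is fixed for all samples.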

Step 6: Quantification to generate counts matrix: day 3

Timing: up to 30 minutes per sample

Counts matrix in RNAseq:

The counts matrix in RNAseq summarizes the expression level of each gene in each sample. It is generated by counting the number of reads aligned to each gene.
The ‘featureCounts’ program from the subread package was employed to count the reads assigned to genes in each sample. This program takes the genome annotation (GTF format), the aligned reads (BAM files from Step 4) and the inferred strandedness (from Step 5) as inputs; reads are counted and aggregated by gene for each sample.

############### QUANTIFICATION_PARAMETERS #########
# -T -> number of threads
# -a -> annotation file (GTF)
# -o -> name of the output file
# -t -> feature type to quantify
# -g -> attribute used to group features (e.g., gene_id)
# -s -> strandedness: 0 = unstranded, 1 = stranded and 2 = reversely stranded
# -p -> specify that data is paired-end
#       (subread >= 2.0.2 additionally needs --countReadPairs to count fragments instead of reads)
# -B -> count only read pairs that have both ends mapped
###############################################

cd ~/RNAseq_Analysis/

mkdir counts
cd counts

featureCounts -T 30 \
-a ~/RNAseq_Analysis/annotations/genome_ref.gtf \
-o control1.ct \
-s 0 \
-p -B  \
-t gene -g gene_id \
~/RNAseq_Analysis/mapping/control1.Aligned.sortedByCoord.out.bam

# The output file "control1.ct" has 7 columns and
# we are interested in column 1 (geneID) and column 7 (counts in sample control1).
# We drop the leading comment line, then use the `cut` command in Linux
# to extract only columns 1 and 7 and save them to another file

grep -v "^#" control1.ct | cut -f1,7- > control1.ct.counts

# Repeat above steps for each sample

combine counts:

Lastly, the per-sample count files were joined with csvtk to generate a combined counts matrix where each row represents a gene, and each column represents a sample. The values in the matrix denote the number of reads assigned to each gene in each sample.

cd ~/RNAseq_Analysis/
cd counts

csvtk join -t $(ls *.counts) | \
sed 's/.Aligned.sortedByCoord.out.bam//g' | \
sed 's/Geneid/Gene_ID/g' > combined_counts.tsv

# Above commands should generate the counts matrix for all the samples
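
Before moving on, it can help to confirm that the joined matrix is rectangular. The following is a minimal sketch; `check_matrix` is a hypothetical helper, and it assumes a tab-separated file whose first row is the header and whose first column holds the gene IDs.

```shell
# Hypothetical helper: report the dimensions of a tab-separated counts matrix
# and return a non-zero status if any row has a different number of fields
# than the header row.
check_matrix() {
    awk -F'\t' '
        NR == 1 { ncol = NF; next }
        NF != ncol {
            printf "line %d has %d fields, expected %d\n", NR, NF, ncol
            bad = 1
        }
        END {
            if (bad) exit 1
            printf "%d genes x %d samples\n", NR - 1, ncol - 1
        }' "$1"
}

# Example: check_matrix combined_counts.tsv
```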

Transcripts Per Million (TPM) counts:

Transcripts Per Million (TPM) are normalized expression values in RNA-seq data. TPM represents the relative abundance of transcripts: for each gene, it is the number of transcripts expected out of every one million transcripts in the sample. TPM normalizes for both sequencing depth and transcript length, making it useful for comparing gene expression across different samples.
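
As a worked illustration of the definition, the sketch below computes TPM from a toy table using the generic formula (reads per kilobase, rescaled so that the values sum to one million). It is not TPMCalculator's implementation; `tpm_from_counts` is a hypothetical helper and the numbers are made up.

```shell
# Hypothetical helper: compute TPM from a tab-separated, header-less table
# with three columns: gene ID, gene length in bp, read count.
tpm_from_counts() {
    awk -F'\t' '
        {
            id[NR] = $1
            rate[NR] = $3 / ($2 / 1000)    # reads per kilobase
            total += rate[NR]
        }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s\t%.1f\n", id[i], rate[i] / total * 1e6
        }' "$1"
}

# Toy data (made-up numbers, for illustration only)
toy=$(mktemp)
printf 'geneA\t1000\t100\ngeneB\t2000\t100\ngeneC\t500\t300\n' > "$toy"
tpm_from_counts "$toy"
# geneA   133333.3
# geneB   66666.7
# geneC   800000.0
rm -f "$toy"
```

geneA and geneB have the same read count, but geneB is twice as long and so receives half the TPM. The three values sum to one million.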

cd ~/RNAseq_Analysis/
mkdir TPM
cd TPM

TPMCalculator -p -e -a -b \
~/RNAseq_Analysis/mapping/control1.Aligned.sortedByCoord.out.bam \
-g ~/RNAseq_Analysis/annotations/genome_ref.gtf 

# output file "control1.Aligned.sortedByCoord.out_genes.out" 
# will be created containing TPM counts 

# extract columns of interest (geneID and TPMcounts)

cut -f 1,7 control1.Aligned.sortedByCoord.out_genes.out > control1.TPM.counts

# Repeat above steps for each sample 
# combine *.counts files with csvtk

csvtk join -t $(ls *.counts)  > TPM_counts.tsv

TPM counts matrix generation:

The TPMCalculator program adds a suffix (“#1”, “#2”, …, “#N”) to a gene ID when the same gene ID appears at different locations in the annotation (GTF) file. For the sake of simplicity, in such cases we keep only the gene location with the highest total across samples, so that each gene ID appears once with its maximum assigned values.

library(tidyverse)  # provides dplyr, stringr (str_detect, str_replace) and %>%

TPM = read.table(file = "TPM_counts.tsv", header = T, sep = "\t", quote = "")

TPM_nohash = TPM %>%
    dplyr::filter(!str_detect(Gene_ID, "#"))

TPM_withhash = TPM %>%
    dplyr::filter(str_detect(Gene_ID, "#"))

TPM_withhash$Gene_ID = str_replace(TPM_withhash$Gene_ID, "#.*", "")


TPM_withhash_unique = as.data.frame(TPM_withhash %>%
    rowwise() %>%
    mutate(Total_counts = sum(c_across(2:length(TPM_withhash)))) %>%
    dplyr::arrange(Gene_ID, desc(Total_counts)) %>%
    dplyr::distinct(Gene_ID, .keep_all = T) %>%
    dplyr::select(-c("Total_counts")))

TPM_final = rbind(TPM_nohash, TPM_withhash_unique)

saveRDS(TPM_final, "TPM_final.rds")

Step 7: Differential Expression (DE) analysis: day 4

Timing: up to 4 hours

DE_edger.R script for differential gene expression:

DE analysis identifies genes with significant changes in expression levels between two or more conditions. The analysis involves statistical tests to determine if observed differences in gene expression are likely due to biological factors rather than random noise.

# A complete script is available on GitHub as DE_edger.R

library(DESeq2)
library(ggplot2)
library(gplots)
library(tidyverse)
library(RColorBrewer)
library(edgeR)
library(ggrepel)
library(ComplexHeatmap)
library(dplyr)

args <- commandArgs(trailingOnly = TRUE)

countfile = args[1]  #path for the counts file
control = args[2]  #name for control (as in counts file)
treatment = args[3]  #name for treatment (as in counts file)
control_rep = as.integer(args[4])  #number of replicates for control
treatment_rep = as.integer(args[5])  #number of replicates for treatment
path = args[6]  #path for annotation file

outprefix <- paste(treatment, control, sep = "_vs_")
dir.create(outprefix)
setwd(outprefix)

header = paste(rep("#", 50), collapse = "")

sink(file = paste0(outprefix, ".sessioninfo.txt"))

cat(paste(header, "#Version Information", header, sep = "\n"))
cat("\n")
version
cat("\n")

cat(paste(header, "#Session Information", header, sep = "\n"))
cat("\n")
sessionInfo()
sink()

# Save log to file

sink(file = paste0(outprefix, ".log.txt"))

Anno <- read.table(path, sep = "\t", stringsAsFactors = F, header = TRUE,
    quote = "")

raw.data = read.table(countfile, row.names = "Gene_ID", header = T,
    sep = "\t", as.is = T, check.names = F, quote = "")

control
treatment

group <- factor(rep(c(control, treatment), times = c(control_rep,
    treatment_rep)), levels = c(control, treatment))

group
control_mat <- select(raw.data, starts_with(control))
treatment_mat <- select(raw.data, starts_with(treatment))
raw.counts <- merge(control_mat, treatment_mat, by = "row.names",
    all.x = TRUE)

raw.counts <- column_to_rownames(raw.counts, var = "Row.names")

#### edgeR Analysis

# Build the DGEList, drop lowly expressed genes, then normalize (TMM)
y <- DGEList(counts = as.matrix(raw.counts), group = group)

keep <- filterByExpr(y)
filtered.data <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(filtered.data)

design <- model.matrix(~0 + group)

colnames(design) <- levels(y$samples$group)
# Estimate dispersions on the normalized object so the normalization factors are kept
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, contrast = c(-1, 1))

edgeR_DE = as.data.frame(topTags(qlf, sort.by = "PValue", n = Inf))
edgeR_DE = rownames_to_column(edgeR_DE, "Gene_ID")
edgeR_DE = left_join(edgeR_DE, Anno, by = "Gene_ID")

write.table(edgeR_DE, file = paste0(outprefix, ".DE_edgeR_All.txt"),
    sep = "\t", quote = F, row.names = F)

saveRDS(edgeR_DE, file = paste0(outprefix, "edgeR_DE.rds"))

#### Filtering EdgeR - FDR 5% filtered
edgeR_DE_FDR5P = dplyr::filter(edgeR_DE, FDR < 0.05)
dim(edgeR_DE_FDR5P)

write.table(edgeR_DE_FDR5P, file = paste0(outprefix, ".DE_edgeR_5PCT.txt"),
    sep = "\t", quote = F, row.names = F)

sink()

DE analysis between (DDX5-knockdown and control) and (Supinoxin treated and control) was performed using edgeR R-package.

# DE analysis requires (1) DE_edger.R, (2) the counts matrix and
# (3) an annotation file <TAB delimited>

# The script was run on a Windows computer using RStudio as shown
# below. Please update the complete path for each file.

# DE analysis for Supinoxin data:
system("Rscript   DE_edger.R   counts_Supinoxin.TXT   H69ARWT   H69ARSUP  3   3   Annotation.TXT")

# DE analysis for DDX5KD data:
system("Rscript   DE_edger.R   counts_DDX5KD.TXT   WT   DDX5KD  3   3   Annotation.TXT")

Step 8: Determine shared up- and down-regulated genes between DDX5KD and Supinoxin data: day 4

Timing: up to 1 hour

Up- and down-regulated genes:

In each dataset, up- and down-regulated genes were defined as follows:
1. Up-regulated genes - FDR < 0.05 and log2fold-change > 1.
2. Down-regulated genes - FDR < 0.05 and log2fold-change < -1.
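
The two thresholds can also be applied on the command line to a tab-delimited edgeR result table such as the one written in Step 7. This is a sketch only: `classify_de` is a hypothetical helper, and it assumes the header contains the columns Gene_ID, logFC and FDR.

```shell
# Hypothetical helper: label each gene in a tab-separated edgeR table as
# up, down or ns (not significant), locating the columns by header name.
classify_de() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i   # map header name -> column index
            next
        }
        {
            fc  = $(col["logFC"]) + 0
            fdr = $(col["FDR"]) + 0
            status = "ns"
            if (fdr < 0.05 && fc > 1)  status = "up"
            if (fdr < 0.05 && fc < -1) status = "down"
            print $(col["Gene_ID"]), status
        }' "$1"
}

# Example: classify_de H69ARSUP_vs_H69ARWT.DE_edgeR_All.txt | cut -f2 | sort | uniq -c
```

Counting the labels this way gives a quick cross-check against the number of rows in the filtered R data frames.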

library(dplyr)
library(openxlsx)
library(ggvenn)

# Set the working directory to the folder containing the DE result files (update this path)
setwd("Z:/PCCR/Tran_Elizabeth/Methods_paper/GitHub/data")

SUP.DE = openxlsx::read.xlsx(xlsxFile = "Supinoxin.DE.xlsx",
    sheet = "EdgeR_All")

DDX5KD.DE = openxlsx::read.xlsx(xlsxFile = "DDX5KD.DE.xlsx",
    sheet = "EdgeR_All")

# edgeR results use the column names FDR and logFC
SUP.up = SUP.DE %>%
    dplyr::filter(FDR < 0.05) %>%
    dplyr::filter(logFC > 1)

SUP.down = SUP.DE %>%
    dplyr::filter(FDR < 0.05) %>%
    dplyr::filter(logFC < -1)


DDX5KD.up = DDX5KD.DE %>%
    dplyr::filter(FDR < 0.05) %>%
    dplyr::filter(logFC > 1)


DDX5KD.down = DDX5KD.DE %>%
    dplyr::filter(FDR < 0.05) %>%
    dplyr::filter(logFC < -1)

Venn diagram (Figure 1A) for up- and down-regulated genes:

A Venn diagram of the up- and down-regulated genes from each dataset was created using the ggvenn R-package.

my_list = list(DDX5KD.up = DDX5KD.up$Gene_ID, DDX5KD.down = DDX5KD.down$Gene_ID,
    SUP.down = SUP.down$Gene_ID, SUP.up = SUP.up$Gene_ID)

ggvenn(my_list, show_percentage = F, text_size = 8)

Custom figures: day 4