v1.0 by Jackie T.
Computing cluster with modules for FastQC, MultiQC, STAR aligner, and featureCounts.
Package | Version |
---|---|
fastqc | 0.11.7 |
star | 2.7.1a |
subread | 1.6.2 |
python | 3.7.9 |
multiqc | 1.10.1 |
Cloning this repo will create a directory holding the pipeline script and assorted files. Move to the directory where you'd like to save the pipeline.
cd pipelines/
git clone https://github.com/neidl-connor-lab/transcriptomics-bulk.git
cd transcriptomics-bulk
Create the alignment index from a reference sequence FASTA file and annotation GTF file. For most model organisms, FASTA and GTF references can be found at Ensembl. Run ./index.sh
with the -h
flag to view the full list of options.
An important note about
exon
vstranscript
. Use theexon
feature when aligning a library prepared with polyA selection. If only rRNA depletion was used, use thetranscript
feature.
Flag | Argument |
---|---|
-P |
SCC project |
-N |
job name |
-g |
genome annotation GTF file |
-f |
genome reference FASTA file |
-o |
output directory |
-x |
alignment feature; exon or transcript |
usage: qsub -P PROJECT -N JOBNAME index.sh -g GTF -f FASTA -o OUTPUT -x FEATURE
arguments (options):
-g genome GTF file
-f genome FASTA file
-o output directory
-x feature for alignments (transcript|exon)
-h show this message and exit
Here is an example, where we download the Homo sapiens reference files from Ensembl and use them to make an index for libraries prepped with polyA selection (exon
not transcript
).
# download the human reference
wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
wget http://ftp.ensembl.org/pub/release-107/gtf/homo_sapiens/Homo_sapiens.GRCh38.107.gtf.gz
# decompress the files
gunzip Homo_sapiens.GRCh38.dna_rm.toplevel.fa.gz
gunzip Homo_sapiens.GRCh38.107.gtf.gz
# move them to the indices/ folder
mkdir indices
mv Homo_sapiens.GRCh38.107.gtf indices/
mv Homo_sapiens.GRCh38.dna_rm.toplevel.fa indices/
# submit the indexing job
qsub -P test-project \
-N test-index \
index.sh \
-g indices/Homo_sapiens.GRCh38.107.gtf \
-f indices/Homo_sapiens.GRCh38.dna_rm.toplevel.fa \
-o indices/hsapiens \
-x exon
View pipeline options and required arguments by running pipeline.sh
with the -h
flag.
./pipeline.sh -h
The help message indicates the required arguments and how to pass them:
Flag | Argument |
---|---|
-P |
SCC project |
-N |
job name |
-i |
path to index created in step 2 |
-g |
path to GTF file used in step 2 |
-f |
directory with FASTQ files |
-o |
output directory |
-x |
alignment feature; exon or transcript |
-m |
feature label; gene_id or gene_name |
usage: qsub -P PROJECT -N JOBNAME pipeline.sh -i INDEX -g GTF -f FASTQ -o OUTPUT -x FEATURE -m LABEL
arguments (options):
-i path to STAR index
-g path to GTF annotation
-f input FASTQ directory
-o output directory
-x feature for alignments (transcript | exon)
-m label for count matrix (gene_id | gene_name)
-h show this message and exit
Here is an example, where the index materials are in indices/
, FASTQ input files are in input-files/
, and output files should go to output-files/
. The example job is named test-job
, and the project allocation used is test-project
. The job output will be written to a file named log-test-job.qlog
.
qsub -P test-project \
-N test-job \
pipeline.sh \
-i indices/hsapiens/ \
-g indices/Homo_sapiens.GRCh38.107.gtf \
-f input-files/ \
-o output-files/
Raw paired-end FASTQ files should be gzipped and renamed such that each sample (e.g., sample01
) has the following files in the input directory. I recommend putting all sample descriptors as columns in your metadata file instead of the filename!
sample01-r1.fq.gz
sample02-r2.fq.gz
Raw FASTQ files are aligned to the previously-constructed reference using STAR aligner. All output files have the sample ID as a prefix. The alignment itself is saved to output/sample01-Aligned.out.bam
.
Flag | Meaning |
---|---|
--runThreadN |
parallelize this job |
--runMode |
align reads instead of making an index |
--genomeDir |
give the path to the genome index |
--readFilesCommand |
the input files are gzipped |
--outSAM* |
compress the output and don't bother sorting the reads |
--readFilesIn |
give the path to the R1 and R2 files |
--outFileNamePrefix |
where should the output files go, and what should they be named? |
STAR --runThreadN 16 \
--runMode alignReads \
--genomeDir 'indices/genomedir' \
--readFilesCommand zcat \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within \
--readFilesIn 'input/sample01-r1.fq.gz' 'input/sample01-r2.fq.gz' \
--outFileNamePrefix 'output/sample01-'
Aligned reads in the *-Aligned.out.bam
file are quantified using featureCounts, which is a tool in the subread module. This will output a counts matrix, which is the input for differential expression calculations.
Flag | Meaning |
---|---|
-T |
parallelize this job |
-M |
count reads that align to more than one locus |
-p |
these alignments come from paired-end sequencing |
-t |
the feature type to use for quantification |
-g |
the annotation feature to use for labeling |
-O |
collapse all counts to their respective -g feature |
-a |
the annotation GTF file |
-o |
path and name of the output count matrix |
... |
BAM alignment files |
featureCounts -T 16 \
-M \
-p \
-t FEATURE \
-g LABEL \
-O \
-a 'annotation.gtf' \
-o 'output/counts.tsv' \
'output/sample01-Aligned.out.bam' ...
The FastQC tool evaluates raw read quality. Running this on each sample allow us to flag potential problematic samples before running differential expression analysis.
Flag | Meaning |
---|---|
--threads |
parallelize this job |
--quiet |
minimize output |
--outdir |
output directory |
... |
one FASTQ file per sample |
fastqc --threads 16 \
--quiet \
--outdir 'data' \
'data/sample01-r1.fq.gz' ...
The MultiQC tool will compile output logs from FastQC, STAR, and featureCounts to generate a report on our entire pipeline. This will be our first stop when pipeline.sh
finishes running -- you can see here what any problems may be before moving forward.
Flag | Meaning |
---|---|
--quiet |
minimize output |
--outdir |
output directory |
--filename |
output HTML file |
--module |
tool output to QC |
... |
input directory |
multiqc --quiet \
--outdir 'data' \
--filename 'multiqc.html' \
--module fastqc \
--module star \
--module featureCounts \
'data'