Reference data retrieval, variant calling, custom database setup, annotation & missense variant filtering
This repository provides a comprehensive genomic analysis pipeline for haploid organisms, integrating reference data retrieval, variant calling, annotation, & missense variant filtering.
It also supports custom SnpEff database setup for genome annotation.
Step # | Task Description |
---|---|
✔️ 1 | Download sequencing data from SRA |
✔️ 2 | Human DNA sequence removal using hostile |
✔️ 3 | Retrieve reference genome (FASTA format) |
✔️ 4 | Perform haploid variant calling using paired-end data |
✔️ 5 | Download and process GenBank files |
✔️ 6 | Convert GenBank to GFF3 format |
✔️ 7 | Set up a custom SnpEff annotation database |
✔️ 8 | Annotate variants and extract missense variants |
Ensure you have the following installed:
fastq-dump
(SRA Toolkit)Biopython
(pip install biopython
)BWA
,Samtools
,BCFtools
SnpEff
(for variant annotation)genbank_to_gff3.py
(for GenBank to GFF3 conversion)extract_missense_variants
(for missense variant filtering)
fastq-dump --gzip --split-3 SRR1735032
*Illumina*
python3 ~~/hostile_clean_human.py --fasta GCF_009914755.4 -i ~/fastq_samples_list.txt -o Clean
**Illumina_fastq_samples_list.txt
Forward Read (R1) | Reverse Read (R2) |
---|---|
~/39907_1_83_R1.fastq.gz |
~/39907_1_83_R2.fastq.gz |
~/39907_1_84_R1.fastq.gz |
~/39907_1_84_R2.fastq.gz |
~/39907_1_85_R1.fastq.gz |
~/39907_1_85_R2.fastq.gz |
*Nanopore*
python3 ~~/hostile_clean_ont_human_minimap2.py --fasta GCF_009914755.4 -i ~/ONT_fastq_samples_list.txt -o Clean
**ONT_fastq_samples_list.txt
Read (R) |
---|
~/39907_1_83.fastq.gz |
~/39907_1_84.fastq.gz |
~/39907_1_85.fastq.gz |
python download_fasta.py -i "KR781608.1"
Single sample
bash ~/snpEff/variant_calling_pipeline --ref ~/snpEff/EVD/Makona-SLE.fa -1 ~/SRR1735032_1.fastq.gz -2 ~/SRR1735032_2.fastq.gz -o SRR1735032
** For a batch run, create a text file (samples.txt) containing one sample ID per line, e.g.:**
SRR1735032 SRR1735033 SRR1735034
a for a batch run, execute the variant_calling_batch.sh script with your reference genome and sample list:
bash variant_calling_batch.sh --ref reference.fasta -s samples.txt
python download_genbank.py -i "Ebola virus zaire"
python download_genbank.py -i "KR781608.1"
python ~/genbank_to_gff3.py -i ~/KR781608.1.gb --output ~/KR781608.1.gff
Edit the snpEff.config
file and add:
Ebola_virus_zaire.genome : Ebola virus zaire
Create the necessary directories inside the data/
directory:
...Ebola_virus_zaire
Each directory should contain:
genes.gbk
sequence.gff
sequence.fa
java -Xmx8g -jar ~/snpEff/snpEff.jar build -genbank -v Ebola_virus_zaire
java -Xmx8g -jar ~/snpEff/snpEff.jar Ebola_virus_zaire ~/SRR1735032.filtered.vcf > ~/SRR1735032.output.ann.vcf
extract_missense_variants -i ~/SRR1735032.output.ann.vcf -o ~/SRR1735032.missense.vcf
After running the pipeline, the key output files include:
- Variant calling output:
SRR1735032.filtered.vcf
- Annotated variants:
SRR1735032.output.ann.vcf
- Filtered missense variants:
SRR1735032.missense.vcf
If you are working with long-read assemblies (PacBio/Nanopore), use Minimap2 for variant detection.
conda install -c bioconda minimap2
minimap2 -ax asm5 reference.fasta draft_genome.fasta > alignment.sam
Convert to BAM & Sort
samtools view -bS alignment.sam | samtools sort -o alignment_sorted.bam
samtools index alignment_sorted.bam
bcftools mpileup -Ou -f reference.fasta alignment_sorted.bam | bcftools call -mv -Oz -o variants.vcf.gz
Output: variants.vcf.gz (VCF file containing SNPs and structural variants).
--ref <reference.fasta> → Reference genome.
-s sample_list.txt → A text file containing paths to multiple draft genome assemblies.
Example (sample_list.txt):
/path/to/draft_genome1.fasta
/path/to/draft_genome2.fasta
/path/to/draft_genome3.fasta
bash variant_calling_minimap2.sh --ref reference.fasta -s sample_list.txt
Output:
Sorted BAM files: variant_results/*.sorted.bam
Final Decompressed VCFs: variant_results/*.vcf (for each draft genome)
Approach best for tool output
Whole-genome alignment (WGA) comparing multiple draft assemblies nucmer (MUMmer4) variants.snps
Read-Mapping Approach If you have sequencing reads BWA + BCFtools variants.vcf.gz
Long-Read Draft Assemblies Nanopore/PacBio assemblies Minimap2 + BCFtools variants.vcf.gz
You could also perform annotation (SnpEff) or visualization, this can follow earlier steps described
-
Ensure all paths are correct before running the scripts.
-
If SnpEff fails to build a custom database, check that
genes.gbk
,sequence.gff
, andsequence.fa
are correctly formatted. -
If variant calling fails, verify that BWA, Samtools, and BCFtools are installed and correctly linked.
If you use this workflow in your research, please cite:
- SnpEff: Cingolani et al., 2012.
- BCFtools: Li et al., 2009.
- Biopython: Cock et al., 2009.