A compact pipeline that allows for searching selected domains (based on Pfam HMM models) in proteins encoded in genomic sequences.
To set up the environment properly you need to have Miniconda installed on linux-64 platform. The pipeline was tested on Ubuntu 22.04 using conda 23.7.4. Once you are ready with that, you need to create a conda environment pfam-gen described in envs/pfam-gen.txt:
conda create --name pfam-gen --file envs/pfam-gen.txtNext, activate the environment:
conda activate pfam-genUsing test_data.sh script, you may download 20 staphylococcal genomic sequences from NCBI GenBank to genomes directory, as well as Pfam-A.hmm file to test the pipeline on a set of 10 CAT (catalytic) and 5 CWT (cell-wall-targeting) domains (input/domains.tsv) that occurr in bacterial peptidoglycan hydrolases. Simply run the script in the pipeline directory:
./test_data.shThe pipeline utilises the following directory structure:
your_pipeline_location/
├── Snakefile
├── config.yaml
├── input
│ ├── colors.tsv
│ ├── domains.tsv
│ ├── IUPACDNA.txt
│ └── TABLE11.txt
├── src
│ ├── build.sh
│ └── extractorfs.cpp
├── templates
│ ├── style.css
│ └── tmpl.html
├── scripts
│ ├── annot.py
│ ├── archchart.py
│ ├── extractfasta.py
│ ├── extractorfs
│ ├── extractpfm.sh
│ ├── filter.py
│ ├── lib
│ │ └── fasta.py
│ ├── preprocess.py
│ └── unique.py
├── log
└── output
In the working directory you can find the Snakefile describing the pipeline and config.yaml, in which paths to Pfam-A.hmm and the directory with genomic sequences as well as domain architectures of interes are provided. In input/ there are files describing nucleotide sequence alphabet and translation table to be used (IUPACDNA.txt, TABLE11.txt) as well as groups of domains and domains being sarched for next to colors these groups and domains are to be painted with in a visualisation of all possible architectures that are found in searched genomes (domains.tsv, colors.tsv). In src/ a source file extractorfs.cpp for scripts/extractorfs is placed. If build.sh is run from that directory, the application will be recompiled and saved as scripts/extractorfs. In templates/ CSS and HTML templates are located. These are used to generate a HTML visualisation of all possible architectures that are found in searched genomes, which next is converted to a PNG file. Necessary scripts and one compiled application are located in scripts/. Directories output/ and log/ will be created automatically once the pipeline is run. All diagnostic and error messages from tools and scripts used by the pipeline will be redirected to files in the log/ directory.
The pipline described in the Snakefile encompasses the following stages:
- extractorfs -- using
scripts/extractorfs, provided DNA alphabet (input/IUPACDNA.txt) and translation table (input/TABLE11.txt), from each genome extract all posisble open reading frames (ORFs) of lenght >= 300 nt. - uniquetrans -- using
scripts/unique.py, cluster quickly extracted protein sequences based on their 100% identity to prepare a non-redundant set for HMM searches. - hmmfetch -- using
hmmfetchtool, fetch domains, fromPfam-A.hmmfile, that are listed ininput/domains.tsvand their Pfam accession version numbers are provided inpfam_acccolumn of that file. - hmmsearch -- using
hmmsearchtool, search for retrived domains in the non-redundant protein sequence set. - preprocess -- using
scripts/preprocess.py, preprocess the rawhmmsearchresults and save relevant columns. - archchart -- using
scripts/archchart.py, branch to generate charts in HTML format that describe all domain architectures found, also in regard to groups of domains (columngroupininput/domains.tsv). - convertchart -- using
wkhtmltoimagetool, continue the branch to convert charts in HTML format to PNG. - filter -- using
scripts/filter.py, continue the main branch and from preprocessed hits select those of independent E-value (i-Evalue) <= 0.001 and domain coverage >= 80% (0.8). Next from target protein sequences select those with domain architecure of interest (config.yaml). - extracttrans -- using
scripts/extractfasta, extract finally selected sequences from the non-redundant set. - signalp -- using independently installed SignalP 5.0 tool available as
signalp, detect N-terminal signal sequences in the final set of protein sequences for each domain architecture of interest. If the tool is not available, the step generates an empty result file and is skipped. - annotdom -- using
scripts/annot.py, prepare a GFF3 file with annotations for the finally selected protein sequences. Annotations are prepared based onhmmsearchfiltered results andsignalpresults.
More detailed description on how the pipeline works you will find in comments in the Snakefile, config.yaml file and the script/source files.
Providing that you have pfam-gen environment properly set up and activated as well as at least the test data dowloaded, you can first look up the list of tasks to be done by running the following command from the pipeline directory, where Snakefile is located:
snakemake --dryrun --quietand then run the pipeline using as many cores as you wish:
snakemake --cores number_of_coresFinal data, non-redundant protein sequences and corresponding annotations for each domain architecture of interest, as well as HTML and PNG visualisatons of all domain architectures found, you will find in output/final directory.