GitHub - SergioAlias/fusariumid-train: ⚙️ 🍄 QIIME 2 TEF1 classifiers for the FUSARIUM-ID database

⚙️ 🍄 FUSARIUM-ID Naive Bayes classifiers for QIIME 2

🎉 Pre-trained FUSARIUM-ID classifiers available here!

A Snakemake workflow to train QIIME 2 taxonomic Naive Bayes classifiers for the FUSARIUM-ID database. This database contains sequences of the Translation Elongation Factor 1 alpha (TEF1, also known as EF1α), which serves as a considerably better marker for species identification in the filamentous fungal genus Fusarium than ITS, the standard marker for all Fungi.

If you don't want to run the workflow, you can pick one of the pre-computed classifiers here!

🐍 This workflow uses Snakemake 7.32.4. Newer versions (8+) contain backwards-incompatible changes that may break this pipeline on a Slurm HPC queue system.

This pipeline:

Parses the FUSARIUM-ID multi-FASTA headers for metadata and saves it as a TSV file (rules fid_correct_format, fid_extract_metadata and fid_reduce_metadata). You can read about how FUSARIUM-ID stores metadata in this manual (Spanish version here) and in the FUSARIUM-ID publication.
Formats metadata to match SILVA and UNITE taxonomy style (rule fid_build_taxonomy).
Imports taxonomy and sequences into QIIME 2 (rule fid_import_q2).
Downloads TEF1 sequences from GenBank for non-Fusarium fungi and other eukaryotes using a modified version of a query used in Boutigny et al. (2019) (rule download_ncbi).
Filters and dereplicates NCBI GenBank sequences (rule filter_ncbi).
Merges FUSARIUM-ID and NCBI GenBank sequences (rule merge_fid_ncbi).
Optionally, extracts the amplicon region using PCR primers and dereplicates again (rule extract_primers).
Trains a Naive Bayes classifier that can be used in qiime feature-classifier classify-sklearn (rule train).
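
Once trained, the classifier artifact can be passed to classify-sklearn. The following is a minimal sketch and not part of the workflow: the function name classify_reads and all file names are hypothetical placeholders, and it assumes an activated QIIME 2 environment.

```shell
# Hypothetical helper: classify representative sequences with a trained classifier.
# $1 = classifier artifact, $2 = representative sequences, $3 = output taxonomy artifact
classify_reads() {
    qiime feature-classifier classify-sklearn \
        --i-classifier "$1" \
        --i-reads "$2" \
        --o-classification "$3"
}

# Example call (placeholder file names):
# classify_reads fusariumid_classifier.qza rep-seqs.qza taxonomy.qza
```

QIIME 2 generally expects the classifier to be used with the same scikit-learn version it was trained with, so the environment used for classification should match the one used for training.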

Requisites

The only prerequisite is having Conda installed. In this regard, we highly recommend installing Miniconda and then installing Mamba (used by default by Snakemake) for a lightweight and fast experience.

Usage

Clone the repository
Create a Screen (see section Immediate submit and Screen)
Run the following command to download (if needed) and activate the FUSARIUM-ID-train environment, and to set aliases for the main functions:

source init_fusariumid_train.sh

Download the FUSARIUM-ID v3.0 FASTA file from https://github.com/fusariumid/fusariumid (FUSARIUMID_v.3.0_TEF1.fas).
Edit config/config.yml with your specific requirements. Variables annotated with #cluster# must also be updated in config/cluster_config.yml.
If needed, modify time, ncpus and memory variables in config/cluster_config.yml.
Run fidt_run to start the workflow. You can also run it only up to certain key steps (using --until rule_name) to check the results and change parameters if necessary before continuing (recommended). For example, a possible workflow split could be (see Drawing DAGs and rule graphs for a visual workflow including all rule names):

fidt_run --until download_ncbi     # download sequences from NCBI GenBank (read the warning below)
fidt_run --until filter_ncbi       # quality filtering and dereplication of NCBI sequences
fidt_run                           # rest of workflow


# Tip: add the flag -n to perform a dry-run. You will see how many jobs 
# will be executed without actually running the workflow.

# Example:

# fidt_run --until download_ncbi -n

⚠️ Before downloading sequences from NCBI GenBank, please be aware of the NCBI Disclaimer and Copyright notice (Policies and Disclaimers - NCBI), particularly "run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays for any series of more than 100 requests". As a rule of thumb, if you are downloading more than 125,000 sequences, only run this method at those times.
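
The time window quoted above can be checked before launching the download. This is a minimal sketch, not part of the workflow: the function name ncbi_offpeak is hypothetical, and it assumes the system has the America/New_York time zone data available.

```shell
# Return 0 (true) when the given Eastern Time falls in the off-peak window
# quoted from the NCBI policy (weekends, or 9 pm to 5 am on weekdays).
# $1 = ISO day of week (1 = Monday ... 7 = Sunday)
# $2 = hour (0-23, with or without a leading zero)
ncbi_offpeak() {
    case "$1" in
        6|7) return 0 ;;                                  # weekend
    esac
    case "$2" in
        21|22|23|0|1|2|3|4|00|01|02|03|04) return 0 ;;    # 9 pm to 5 am
    esac
    return 1
}

# Check the current Eastern Time before starting the download.
now=$(TZ=America/New_York date '+%u %H')
if ncbi_offpeak "${now% *}" "${now#* }"; then
    echo "Off-peak: OK to run fidt_run --until download_ncbi"
else
    echo "Peak hours: postpone large downloads"
fi
```

The hour is matched as a string rather than compared numerically, so values such as 08 or 09 printed by date cannot be misread as invalid octal numbers.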

Immediate submit and Screen

FUSARIUM-ID-train includes a command, fidt_immediate, that automatically sends all jobs to Slurm, correctly queued according to their dependencies. This is desirable, for example, when the allowed runtime on the cluster login machine is very short, because Snakemake may be killed in the middle of the workflow. If your HPC queue system only allows a limited number of jobs submitted at once, change that number in init_fusariumid_train.sh and source it again (this also applies to fidt_run).

Please note that if the number of simultaneous jobs accepted by the queue system is less than the total number of jobs you need to submit, the workflow will fail. For such cases, we highly recommend not using fidt_immediate. Instead, use fidt_run inside a Screen. Screen is a multiplexer that lets you create multiple virtual terminal sessions. It is installed by default in most Linux HPC systems.

To create a screen, use screen -S fusariumid_train. Then follow the Usage section inside it. You can detach from the screen with Ctrl+a followed by d, and reattach to it with screen -r fusariumid_train. For more details about Screen usage, please check this Gist.

Drawing DAGs and rule graphs

Since FUSARIUM-ID-train is built on Snakemake, you can generate DAGs, rule graphs and file graphs of the workflow. We provide three commands for this: fidt_draw_dag, fidt_draw_rulegraph and fidt_draw_filegraph. These commands create dag.pdf, rulegraph.pdf and filegraph.pdf, respectively, in the code directory.
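
For reference, Snakemake itself can emit these graphs in Graphviz dot format through its --dag, --rulegraph and --filegraph options. The sketch below shows one way to render them by hand. It is an assumption about what the fidt_draw_* commands roughly do, not a copy of them: the function name draw_graph is hypothetical, Graphviz (dot) must be installed, and the workflow configuration must already be in place.

```shell
# Hypothetical helper: render one Snakemake graph to PDF.
# $1 = graph type: dag, rulegraph or filegraph
draw_graph() {
    snakemake "--$1" | dot -Tpdf > "$1.pdf"
}

# Example calls, run from the repository root:
# draw_graph dag
# draw_graph rulegraph
# draw_graph filegraph
```

The rule graph is usually the most readable of the three for choosing a rule name to pass to --until, because it shows each rule once instead of once per job.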
