The pipeline is designed to harmonize summary statistics based on GWASLab.
- Singularity
see also environment.yml and Makefile
git clone --recurse-submodules https://github.com/ht-diva/harmonization_pipeline.git
cd harmonization_pipeline
The config/config.yaml file is a configuration file that sets various parameters and paths for setting up the pipeline. Here's a breakdown of its components that you must accord to your needs:
Rule Execution Flags
These flags control which parts of the pipeline should be executed.
run:
harmonization: True
liftoverbcf_harmonization: False
summarize: True
delivery: False
- harmonization: If set to True, the harmonization rules will be executed.
- liftoverbcf_harmonization: If set to False, the harmonization rules will not be executed after BCFtools liftover.
- summarize: If set to True, the summarization rules will be executed.
- delivery: If set to False, the delivery rules will not be executed.
Paths
These paths define the locations of input files and directories where intermediate and final results will be stored. Change the paths as per your requirements.
sumstats_path: config/sumstats_paths.tsv
sumstats_suffix: ".gwas.regenie.gz"
sumstats_sep: "\t"
dest_path: "../test/destination"
workspace_path: "../test/results"
hg37_fasta_file_path: '/public_data/liftOver/human_g1k_v37.fasta'
hg38_fasta_file_path: '/public_data/liftOver/hg38.fa'
chain_file_path: '/public_data/liftOver/hg19ToHg38.over.chain.gz'
- sumstats_path: Path to the file containing a path to each summary statistics to process.
- sumstats_suffix: Common suffix of the input summary statistics files.
- sumstats_sep: Separator of the input summary statistics files.
- dest_path: Destination path for the final results.
- workspace_path: Path where intermediate results will be stored.
- hg37_fasta_file_path: Path to the file with the reference genome used in the input summary statistics, hg19 (GRCh37).
- hg38_fasta_file_path: Path to the file with the target genome used in the liftover, hg38 (GRCh38).
- chain_file_path: Path to the chain file to map the genomic coordinates from hg19 (GRCh37) to hg38 (GRCh38) during the liftover.
Common parameters
These parameters are used across different steps of the pipeline.
input_format: &iformat "regenie"
- input_format: Defines the input format for summary statistics and aliased as &iformat. Check the Input formats section for more details.
Parameters for Specific Steps
params:
harmonize_sumstats:
input_format: *iformat
config_file: "config/config_harmonize_sumstats.yml"
snp_mapping:
input_format: *iformat
config_file: "config/config_snp_mapping.yml"
summarize_sumstats:
input_format: "gwaslab"
config_file: "config/config_summarize_sumstats.yml"
-
harmonize_sumstats: Parameters for the harmonize_sumstats rule.
- input_format: Uses the aliased *iformat.
- config_file: Path to the configuration file for harmonization.
-
snp_mapping: Parameters for the snp_mapping rule.
- input_format: Uses the aliased *iformat.
- config_file: Path to the configuration file for SNP mapping.
-
summarize_sumstats: Parameters for the summarize_sumstats rule.
- input_format: Set to "gwaslab".
- config_file: Path to the configuration file for summarization.
At the bottom of each rule-based configuration files described above, there is a filename transformation section that specifies how to extract some information from the input filenames.
For example:
# Filename transformation, e.g.: seq.3007.7.gwas.regenie.gz
filename_mask: [ True, True, True, False, False, False]
filename_sep: '.'
This transformation section tells the pipeline how to split the filename into parts. The filename_mask
list specifies which parts should be retained (True) and which should be discarded (False).
The filename_sep
specifies the separator used to split the filename into parts.
These configuration files are essential for setting up and running the analysis pipeline, ensuring that each step is correctly configured and that paths and formats are consistent across the workflow.
Examples of configuration files for BELIEVE, Decode, FinnGen, INTERVAL, UKB, and UKB-PPP input data are given in the folder examples.
To submit the workflow to the HT HPC cluster, you can use the submit.sbatch script with the command sbatch submit.sbatch
. Check the script to adapt it to your specific requirements.
The job name can now be displayed as rule name (rule_***
) followed by the summary statistics identifier (wildcards_***
) in the "COMMENT" field of squeue
. Use the command:
squeue --me --format="%.10i %.9P %.8j %.50k %.8u %.2t %.6M %.6D %.6R"
with output (example):
JOBID | PARTITION | NAME | COMMENT | USER | ST | TIME | NODES | NODELIST |
---|---|---|---|---|---|---|---|---|
199xxxx | cpuq | 72a9f3ce-8929-... | rule_harmonize_sumstats_wildcards_seq123 | username | R | mm:ss | 1 | cnodexx |
199xxxx | cpuq | harmonization_pipeline | (null) | username | R | mm:ss | 1 | cnodexx |
Possible input formats for summary statistics (see formatbook.json for more options to add):
- finngen
- vcf
- decode
- gwaslab
- regenie
- regenie_gene
- fastgwa
- ldsc
- fuma
- pickle
- metal_het
This pipeline requires 6 configuration files in the folder config: the main configuration file config/config.yaml, and 5 rule-based configuration files where to specify the parameters of each step of the rule.
Examples of configuration files for BELIEVE, Decode, FinnGen, Genes & Health, INTERVAL, Meta-CHRIS-INTERVAL, UK Biobank and UKB-PPP input data are given in the folder examples.
-
harmonize_sumstats (
harmonization: True
):
Purpose: Performs GWASLab harmonization on input data without filtering.
Output: {sumstat_id}.gwaslab.tsv.gz: Standardized and aligned GWAS summary statistics. -
bgzip_tabix (
harmonization: True
):
Purpose: Creates a region-based index (CHR and POS columns) of GWAS harmonized data for fast queries.
Output: {sumstat_id}.gwaslab.tsv.gz.tbi: Index of GWAS harmonized data. -
create_snp_mapping_table (
harmonization: True
):
Purpose: Creates mapping table to match input data and harmonized summary statistics.
Output: table.snp_mapping.tsv.gz: Table that links input SNPID (and rsID when available) to harmonized SNPID. -
convert_to_vcf (
liftoverbcf_harmonization: True
):
Purpose: Convert input data to VCF, i.e. the format required for BCFtools liftover (liftover_bcftools ).
Output: {sumstat_id}.vcf: Summary statistics in VCF format. -
liftover_bcftools (
liftoverbcf_harmonization: True
):
Purpose: Performs BCFtools liftover on the VCF-converted input before performing harmonization (harmonize_sumstats), indexing (bgzip_tabix), and SNPID mapping (create_snp_mapping_table).
Output: {sumstat_id}.liftover.vcf.gz: Liftover summary statistics in VCF format. -
summarize_sumstats, quality_check, create_if_table, create_min_pvalue_table and create_quality_check_table (
summarize: True
):
Purpose: Creates summary reports and plots of harmonized data.
Outputs:
{sumstat_id}.png: Includes a Manhattan plot of -log10(p-values) by chromosome/position, and a QQ plot of observed -log10(p-values) vs. expected, with thresholds for genome-wide significance.
min_pvalue_table.tsv: Table with top association hits (SNPs with the smallest p-value in the GWAS summary statistics).
inflation_factors_table.tsv: Table with genomic inflation factors (lambda GC, Median and Maximum chi-squared statistics).
quality_check_table.tsv: Table with quality check information: nr. of log errors, difference of line nr. between input and harmonized summary statistics, log messages for nr. of removed variants due to bad statistics and/or during liftover, log message of (in)consistent variants. -
sync_outputs_folder, sync_plots and sync_tables (
delivery: True
):
Purpose: Copies GWAS indexes, and summary reports and plots to destination folderdest_path
.
Outputs: Copies of {sumstat_id}.gwaslab.tsv.gz.tbi, {sumstat_id}.{sumstat_id}.png, min_pvalue_table.tsv, inflation_factors_table.tsv, and table.snp_mapping.tsv.gz.
Standardization and Harmonization includes the following steps:
- Check SNP identifiers (SNPID/rsID).
- Fix chromosome notation (CHR), basepair positions (POS) and alleles (EA and NEA).
- Sanity check on statistics.
- Infer genome reference build version.
- Align alleles sorting them alphabetically. Precisely EA is the effect allele for which BETA is estimated, and EA is alphabetically lower than NEA (in case, flip the alleles to match the alphabetical order).
- Flip allele-specific statistics for mismatches: BETA = - BETA; Z = - Z; EAF = 1 - EAF.
- Build SNPID column (CHR:POS:EA:NEA) (Optional with
fixid: True
andoverwrite: True
to specify in rule-based condiguration files, basic_check step). - Re-name and re-order columns based on GWASLab format.
See also the GWASLab website.
Check the dags for:
- the default option
- the BCFtools liftover option
- with the delivery option