Skip to content

ht-diva/harmonization_pipeline

Repository files navigation

harmonization_pipeline

The pipeline is designed to harmonize summary statistics based on GWASLab.

Requirements

  • Singularity

see also environment.yml and Makefile

Getting started

  • git clone --recurse-submodules https://github.com/ht-diva/harmonization_pipeline.git
  • cd harmonization_pipeline

The config/config.yaml file is a configuration file that sets various parameters and paths for setting up the pipeline. Here's a breakdown of its components that you must accord to your needs:

Rule Execution Flags

These flags control which parts of the pipeline should be executed.

run:
  harmonization: True
  liftoverbcf_harmonization: False
  summarize: True
  delivery: False
  • harmonization: If set to True, the harmonization rules will be executed.
  • liftoverbcf_harmonization: If set to False, the harmonization rules will not be executed after BCFtools liftover.
  • summarize: If set to True, the summarization rules will be executed.
  • delivery: If set to False, the delivery rules will not be executed.

Paths

These paths define the locations of input files and directories where intermediate and final results will be stored. Change the paths as per your requirements.

sumstats_path: config/sumstats_paths.tsv
sumstats_suffix: ".gwas.regenie.gz"
sumstats_sep: "\t"
dest_path: "../test/destination"
workspace_path: "../test/results"
hg37_fasta_file_path: '/public_data/liftOver/human_g1k_v37.fasta'
hg38_fasta_file_path: '/public_data/liftOver/hg38.fa'
chain_file_path: '/public_data/liftOver/hg19ToHg38.over.chain.gz'
  • sumstats_path: Path to the file containing a path to each summary statistics to process.
  • sumstats_suffix: Common suffix of the input summary statistics files.
  • sumstats_sep: Separator of the input summary statistics files.
  • dest_path: Destination path for the final results.
  • workspace_path: Path where intermediate results will be stored.
  • hg37_fasta_file_path: Path to the file with the reference genome used in the input summary statistics, hg19 (GRCh37).
  • hg38_fasta_file_path: Path to the file with the target genome used in the liftover, hg38 (GRCh38).
  • chain_file_path: Path to the chain file to map the genomic coordinates from hg19 (GRCh37) to hg38 (GRCh38) during the liftover.

Common parameters

These parameters are used across different steps of the pipeline.

input_format: &iformat "regenie"
  • input_format: Defines the input format for summary statistics and aliased as &iformat. Check the Input formats section for more details.

Parameters for Specific Steps

params:
  harmonize_sumstats:
    input_format: *iformat
    config_file: "config/config_harmonize_sumstats.yml"
  snp_mapping:
    input_format: *iformat
    config_file: "config/config_snp_mapping.yml"
  summarize_sumstats:
    input_format: "gwaslab"
    config_file: "config/config_summarize_sumstats.yml"
  • harmonize_sumstats: Parameters for the harmonize_sumstats rule.

    • input_format: Uses the aliased *iformat.
    • config_file: Path to the configuration file for harmonization.
  • snp_mapping: Parameters for the snp_mapping rule.

    • input_format: Uses the aliased *iformat.
    • config_file: Path to the configuration file for SNP mapping.
  • summarize_sumstats: Parameters for the summarize_sumstats rule.

    • input_format: Set to "gwaslab".
    • config_file: Path to the configuration file for summarization.

At the bottom of each rule-based configuration files described above, there is a filename transformation section that specifies how to extract some information from the input filenames.

For example:

# Filename transformation, e.g.: seq.3007.7.gwas.regenie.gz
filename_mask: [ True, True, True, False, False, False]
filename_sep: '.'

This transformation section tells the pipeline how to split the filename into parts. The filename_mask list specifies which parts should be retained (True) and which should be discarded (False). The filename_sep specifies the separator used to split the filename into parts.

These configuration files are essential for setting up and running the analysis pipeline, ensuring that each step is correctly configured and that paths and formats are consistent across the workflow.

Configuration file examples

Examples of configuration files for BELIEVE, Decode, FinnGen, INTERVAL, UKB, and UKB-PPP input data are given in the folder examples.

Submitting the workflow

To submit the workflow to the HT HPC cluster, you can use the submit.sbatch script with the command sbatch submit.sbatch. Check the script to adapt it to your specific requirements.

Note on job names

The job name can now be displayed as rule name (rule_***) followed by the summary statistics identifier (wildcards_***) in the "COMMENT" field of squeue. Use the command:

squeue --me --format="%.10i %.9P %.8j %.50k %.8u %.2t %.6M %.6D %.6R"

with output (example):

JOBID PARTITION NAME COMMENT USER ST TIME NODES NODELIST
199xxxx cpuq 72a9f3ce-8929-... rule_harmonize_sumstats_wildcards_seq123 username R mm:ss 1 cnodexx
199xxxx cpuq harmonization_pipeline (null) username R mm:ss 1 cnodexx

Input formats

Possible input formats for summary statistics (see formatbook.json for more options to add):

  • finngen
  • vcf
  • decode
  • gwaslab
  • regenie
  • regenie_gene
  • fastgwa
  • ldsc
  • fuma
  • pickle
  • metal_het

Configuration files

This pipeline requires 6 configuration files in the folder config: the main configuration file config/config.yaml, and 5 rule-based configuration files where to specify the parameters of each step of the rule.

Examples of configuration files for BELIEVE, Decode, FinnGen, Genes & Health, INTERVAL, Meta-CHRIS-INTERVAL, UK Biobank and UKB-PPP input data are given in the folder examples.

Rules description

  • harmonize_sumstats (harmonization: True):
    Purpose: Performs GWASLab harmonization on input data without filtering.
    Output: {sumstat_id}.gwaslab.tsv.gz: Standardized and aligned GWAS summary statistics.

  • bgzip_tabix (harmonization: True):
    Purpose: Creates a region-based index (CHR and POS columns) of GWAS harmonized data for fast queries.
    Output: {sumstat_id}.gwaslab.tsv.gz.tbi: Index of GWAS harmonized data.

  • create_snp_mapping_table (harmonization: True):
    Purpose: Creates mapping table to match input data and harmonized summary statistics.
    Output: table.snp_mapping.tsv.gz: Table that links input SNPID (and rsID when available) to harmonized SNPID.

  • convert_to_vcf (liftoverbcf_harmonization: True):
    Purpose: Convert input data to VCF, i.e. the format required for BCFtools liftover (liftover_bcftools ).
    Output: {sumstat_id}.vcf: Summary statistics in VCF format.

  • liftover_bcftools (liftoverbcf_harmonization: True):
    Purpose: Performs BCFtools liftover on the VCF-converted input before performing harmonization (harmonize_sumstats), indexing (bgzip_tabix), and SNPID mapping (create_snp_mapping_table).
    Output: {sumstat_id}.liftover.vcf.gz: Liftover summary statistics in VCF format.

  • summarize_sumstats, quality_check, create_if_table, create_min_pvalue_table and create_quality_check_table (summarize: True):
    Purpose: Creates summary reports and plots of harmonized data.
    Outputs:
    {sumstat_id}.png: Includes a Manhattan plot of -log10(p-values) by chromosome/position, and a QQ plot of observed -log10(p-values) vs. expected, with thresholds for genome-wide significance.
    min_pvalue_table.tsv: Table with top association hits (SNPs with the smallest p-value in the GWAS summary statistics).
    inflation_factors_table.tsv: Table with genomic inflation factors (lambda GC, Median and Maximum chi-squared statistics).
    quality_check_table.tsv: Table with quality check information: nr. of log errors, difference of line nr. between input and harmonized summary statistics, log messages for nr. of removed variants due to bad statistics and/or during liftover, log message of (in)consistent variants.

  • sync_outputs_folder, sync_plots and sync_tables (delivery: True):
    Purpose: Copies GWAS indexes, and summary reports and plots to destination folder dest_path.
    Outputs: Copies of {sumstat_id}.gwaslab.tsv.gz.tbi, {sumstat_id}.{sumstat_id}.png, min_pvalue_table.tsv, inflation_factors_table.tsv, and table.snp_mapping.tsv.gz.

Standardization and Harmonization

Standardization and Harmonization includes the following steps:

  • Check SNP identifiers (SNPID/rsID).
  • Fix chromosome notation (CHR), basepair positions (POS) and alleles (EA and NEA).
  • Sanity check on statistics.
  • Infer genome reference build version.
  • Align alleles sorting them alphabetically. Precisely EA is the effect allele for which BETA is estimated, and EA is alphabetically lower than NEA (in case, flip the alleles to match the alphabetical order).
  • Flip allele-specific statistics for mismatches: BETA = - BETA; Z = - Z; EAF = 1 - EAF.
  • Build SNPID column (CHR:POS:EA:NEA) (Optional with fixid: True and overwrite: True to specify in rule-based condiguration files, basic_check step).
  • Re-name and re-order columns based on GWASLab format.

See also the GWASLab website.

DAGs

Check the dags for:

Releases

No releases published

Packages

 
 
 

Contributors 3

  •  
  •  
  •