harmonization_pipeline

The pipeline is designed to harmonize summary statistics based on GWASLab.

Requirements

Singularity

Getting started

git clone --recurse-submodules https://github.com/ht-diva/harmonization_pipeline.git
cd harmonization_pipeline

The config/config.yaml file is a configuration file that sets various parameters and paths for setting up the pipeline. Here's a breakdown of its components that you must accord to your needs:

Rule Execution Flags

These flags control which parts of the pipeline should be executed.

run:
  harmonization: True
  liftoverbcf_harmonization: False
  summarize: True
  delivery: False

harmonization: If set to True, the harmonization rules will be executed.
liftoverbcf_harmonization: If set to False, the harmonization rules will not be executed after BCFtools liftover.
summarize: If set to True, the summarization rules will be executed.
delivery: If set to False, the delivery rules will not be executed.

Paths

These paths define the locations of input files and directories where intermediate and final results will be stored. Change the paths as per your requirements.

sumstats_path: config/sumstats_paths.tsv
sumstats_suffix: ".gwas.regenie.gz"
sumstats_sep: "\t"
dest_path: "../test/destination"
workspace_path: "../test/results"
hg37_fasta_file_path: '/public_data/liftOver/human_g1k_v37.fasta'
hg38_fasta_file_path: '/public_data/liftOver/hg38.fa'
chain_file_path: '/public_data/liftOver/hg19ToHg38.over.chain.gz'

sumstats_path: Path to the file containing a path to each summary statistics to process.
sumstats_suffix: Common suffix of the input summary statistics files.
sumstats_sep: Separator of the input summary statistics files.
dest_path: Destination path for the final results.
workspace_path: Path where intermediate results will be stored.
hg37_fasta_file_path: Path to the file with the reference genome used in the input summary statistics, hg19 (GRCh37).
hg38_fasta_file_path: Path to the file with the target genome used in the liftover, hg38 (GRCh38).
chain_file_path: Path to the chain file to map the genomic coordinates from hg19 (GRCh37) to hg38 (GRCh38) during the liftover.

Common parameters

These parameters are used across different steps of the pipeline.

input_format: &iformat "regenie"

input_format: Defines the input format for summary statistics and aliased as &iformat. Check the Input formats section for more details.

Parameters for Specific Steps

params:
  harmonize_sumstats:
    input_format: *iformat
    config_file: "config/config_harmonize_sumstats.yml"
  snp_mapping:
    input_format: *iformat
    config_file: "config/config_snp_mapping.yml"
  summarize_sumstats:
    input_format: "gwaslab"
    config_file: "config/config_summarize_sumstats.yml"

harmonize_sumstats: Parameters for the harmonize_sumstats rule.
- input_format: Uses the aliased *iformat.
- config_file: Path to the configuration file for harmonization.
snp_mapping: Parameters for the snp_mapping rule.
- input_format: Uses the aliased *iformat.
- config_file: Path to the configuration file for SNP mapping.
summarize_sumstats: Parameters for the summarize_sumstats rule.
- input_format: Set to "gwaslab".
- config_file: Path to the configuration file for summarization.

At the bottom of each rule-based configuration files described above, there is a filename transformation section that specifies how to extract some information from the input filenames.

For example:

# Filename transformation, e.g.: seq.3007.7.gwas.regenie.gz
filename_mask: [ True, True, True, False, False, False]
filename_sep: '.'

This transformation section tells the pipeline how to split the filename into parts. The filename_mask list specifies which parts should be retained (True) and which should be discarded (False). The filename_sep specifies the separator used to split the filename into parts.

These configuration files are essential for setting up and running the analysis pipeline, ensuring that each step is correctly configured and that paths and formats are consistent across the workflow.

Configuration file examples

Examples of configuration files for BELIEVE, Decode, FinnGen, INTERVAL, UKB, and UKB-PPP input data are given in the folder examples.

Submitting the workflow

To submit the workflow to the HT HPC cluster, you can use the submit.sbatch script with the command sbatch submit.sbatch. Check the script to adapt it to your specific requirements.

Note on job names

The job name can now be displayed as rule name (rule_***) followed by the summary statistics identifier (wildcards_***) in the "COMMENT" field of squeue. Use the command:

squeue --me --format="%.10i %.9P %.8j %.50k %.8u %.2t %.6M %.6D %.6R"

with output (example):

JOBID	PARTITION	NAME	COMMENT	USER	ST	TIME	NODES	NODELIST
199xxxx	cpuq	72a9f3ce-8929-...	rule_harmonize_sumstats_wildcards_seq123	username	R	mm:ss	1	cnodexx
199xxxx	cpuq	harmonization_pipeline	(null)	username	R	mm:ss	1	cnodexx

Input formats

Possible input formats for summary statistics (see formatbook.json for more options to add):

finngen
vcf
decode
gwaslab
regenie
regenie_gene
fastgwa
ldsc
fuma
pickle
metal_het

Configuration files

This pipeline requires 6 configuration files in the folder config: the main configuration file config/config.yaml, and 5 rule-based configuration files where to specify the parameters of each step of the rule.

Examples of configuration files for BELIEVE, Decode, FinnGen, Genes & Health, INTERVAL, Meta-CHRIS-INTERVAL, UK Biobank and UKB-PPP input data are given in the folder examples.

Rules description

harmonize_sumstats (harmonization: True):
Purpose: Performs GWASLab harmonization on input data without filtering.
Output: {sumstat_id}.gwaslab.tsv.gz: Standardized and aligned GWAS summary statistics.
bgzip_tabix (harmonization: True):
Purpose: Creates a region-based index (CHR and POS columns) of GWAS harmonized data for fast queries.
Output: {sumstat_id}.gwaslab.tsv.gz.tbi: Index of GWAS harmonized data.
create_snp_mapping_table (harmonization: True):
Purpose: Creates mapping table to match input data and harmonized summary statistics.
Output: table.snp_mapping.tsv.gz: Table that links input SNPID (and rsID when available) to harmonized SNPID.
convert_to_vcf (liftoverbcf_harmonization: True):
Purpose: Convert input data to VCF, i.e. the format required for BCFtools liftover (liftover_bcftools ).
Output: {sumstat_id}.vcf: Summary statistics in VCF format.
liftover_bcftools (liftoverbcf_harmonization: True):
Purpose: Performs BCFtools liftover on the VCF-converted input before performing harmonization (harmonize_sumstats), indexing (bgzip_tabix), and SNPID mapping (create_snp_mapping_table).
Output: {sumstat_id}.liftover.vcf.gz: Liftover summary statistics in VCF format.
summarize_sumstats, quality_check, create_if_table, create_min_pvalue_table and create_quality_check_table (summarize: True):
Purpose: Creates summary reports and plots of harmonized data.
Outputs:
{sumstat_id}.png: Includes a Manhattan plot of -log10(p-values) by chromosome/position, and a QQ plot of observed -log10(p-values) vs. expected, with thresholds for genome-wide significance.
min_pvalue_table.tsv: Table with top association hits (SNPs with the smallest p-value in the GWAS summary statistics).
inflation_factors_table.tsv: Table with genomic inflation factors (lambda GC, Median and Maximum chi-squared statistics).
quality_check_table.tsv: Table with quality check information: nr. of log errors, difference of line nr. between input and harmonized summary statistics, log messages for nr. of removed variants due to bad statistics and/or during liftover, log message of (in)consistent variants.
sync_outputs_folder, sync_plots and sync_tables (delivery: True):
Purpose: Copies GWAS indexes, and summary reports and plots to destination folder dest_path.
Outputs: Copies of {sumstat_id}.gwaslab.tsv.gz.tbi, {sumstat_id}.{sumstat_id}.png, min_pvalue_table.tsv, inflation_factors_table.tsv, and table.snp_mapping.tsv.gz.

Standardization and Harmonization

Standardization and Harmonization includes the following steps:

Check SNP identifiers (SNPID/rsID).
Fix chromosome notation (CHR), basepair positions (POS) and alleles (EA and NEA).
Sanity check on statistics.
Infer genome reference build version.
Align alleles sorting them alphabetically. Precisely EA is the effect allele for which BETA is estimated, and EA is alphabetically lower than NEA (in case, flip the alleles to match the alphabetical order).
Flip allele-specific statistics for mismatches: BETA = - BETA; Z = - Z; EAF = 1 - EAF.
Build SNPID column (CHR:POS:EA:NEA) (Optional with fixid: True and overwrite: True to specify in rule-based condiguration files, basic_check step).
Re-name and re-order columns based on GWASLab format.

DAGs

Check the dags for:

the default option
the BCFtools liftover option
with the delivery option

Name		Name	Last commit message	Last commit date
Latest commit History 297 Commits
.github/workflows		.github/workflows
config		config
examples		examples
slurm		slurm
workflow		workflow
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
Makefile		Makefile
README.md		README.md
dag_default.svg		dag_default.svg
dag_delivery.svg		dag_delivery.svg
dag_liftoverbcftools.svg		dag_liftoverbcftools.svg
dag_postfiltering.svg		dag_postfiltering.svg
dag_prefiltering.svg		dag_prefiltering.svg
environment.yml		environment.yml
environment_dev.yml		environment_dev.yml
submit.sbatch		submit.sbatch

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

harmonization_pipeline

Requirements

Getting started

Configuration file examples

Submitting the workflow

Note on job names

Input formats

Configuration files

Rules description

Standardization and Harmonization

DAGs

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors 3

Uh oh!

Languages

ht-diva/harmonization_pipeline

Folders and files

Latest commit

History

Repository files navigation

harmonization_pipeline

Requirements

Getting started

Configuration file examples

Submitting the workflow

Note on job names

Input formats

Configuration files

Rules description

Standardization and Harmonization

DAGs

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors 3

Uh oh!

Languages

Packages