
🧬 Mosaicome

Assembling the bigger picture of the genome, one piece at a time.

Status: In Progress · Hackathon: Mosaic SV · License: MIT


A central repository for Team Mosaicome's project for the Mosaic SV Collaborative Bioinformatics hackathon. Our goal is to develop a pipeline for the accurate detection of low-frequency (mosaic) genomic structural variants.

Table of Contents

  1. The Team
  2. The Problem
  3. Tech Stack
  4. The Data
  5. Project Roadmap
  6. Getting Started

👥 The Team


The Problem

Structural variant (SV) calling is the challenging task of detecting variants >50 bp relative to a reference genome. Many tools are available and already capable of detecting germline SVs. Low-frequency SVs (<20% VAF, a.k.a. mosaic) pose a greater challenge because the supporting signal is weak. Our current strategy detects ~50% of SVs (VAF 5–100%) at 90% precision.

ONT sequencing is still evolving: new basecalling models as well as new alignment tools have appeared since May this year. Higher read quality, i.e., better per-nucleotide calling probabilities, allows better mapping when suitable aligners are used.

Our goal is to improve recall without substantially sacrificing precision.


🛠️ Tech Stack

This project will leverage a modern, open-source technology stack.

| Category | Technology / Library | Purpose |
| --- | --- | --- |
| Language | Python 3.12+ | Core programming language |
| Data Handling | Pandas, NumPy | Data manipulation & numerical analysis |
| Alignment | Minimap2, Winnowmap, abpoa | Read mapping & local assembly |
| SV Calling | Sniffles2, cuteSV | Structural variant detection |
| Visualization | Matplotlib, IGV | Plotting results & data exploration |
| Development | Jupyter Lab, VS Code | Interactive analysis & coding |
| Collaboration | Git, GitHub | Version control & project management |

💾 The Data

Long-read sequencing of a mixture of six individuals from the HapMap project. The mix was prepared in vitro, so sequencing randomness affects all samples equally. We have a high-quality benchmark dataset derived from assemblies.


🗺️ Project Roadmap


Our progress will be tracked through the following key phases and milestones.

| Phase | Status | Key Milestones |
| --- | --- | --- |
| 1. Exploration & Strategy | ⚫ | Data Analysis |
| | ⚫ | Steps |
| 2. Development & Prototyping | ⚫ | Steps |
| | ⚫ | Steps |
| 3. Refinement & Validation | ⚫ | Steps |
| | ⚫ | Validation |
| | ⚫ | Final Documentation |

Status Key: Not Started, In Progress, Completed

Exploration & Strategy

Strategies:

  • Testing different parameters for each filter applied; some SVs were detected but then filtered out

  • Since we are at the mercy of the aligners, we can improve alignment in candidate regions:

    • Re-alignment with newer options (e.g., minimap2 lr:hq, suited for reads with Phred Q > 20), or optionally remapping with Winnowmap, which has shown a reduced mapping error rate.
    • Traversal of full mm2 alignments via SA-tag parsing: group sequences by their sets of aligned regions, possibly also merging across homologous regions using XA tags (see the sketch after this list).
    • Align reads to the reference (mm2, winnowmap) several times with different parameters.
      • Take reference windows and cluster reads by aligned regions.
      • Take abnormal events (e.g., aligning two similar regions on 2 different chromosomes or different strands on the same chromosome) and then feed into sequence clustering/consensus to get candidate haplotypes.
      • Align the local reads against generated haplotypes to filter for misassemblies/low quality results.
  • Assembly of candidate regions with abpoa, which can potentially provide local haplotypes. It automatically groups reads and generates consensus, and it is fast.

  • Define attributes of FNs to examine, then contrast their distributions with those of TPs

    • Attributes
      • number of supporting reads
      • Mapping quality
      • Avg. # of supplementary alignments for supporting reads
  • Impact of read quality on SV calling: while searching for a better strategy to recall structural variants (SVs), we observed that the quality of the ONT reads plays a crucial role.
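As a concrete illustration of the SA-tag grouping strategy above, here is a minimal Python sketch using pysam. It is not yet part of the pipeline: the bin size, region, file path, and function names are our own illustrative choices, and the support threshold simply mirrors the `--minsupport 3` idea.

```python
# Minimal sketch: group reads by their sets of aligned regions via the SA:Z
# tag. Assumes a coordinate-sorted, indexed BAM; pysam is the only dependency.
from collections import defaultdict

import pysam

BIN = 10_000  # collapse nearby positions into bins to tolerate small offsets

def alignment_signature(read):
    """Set of (chrom, strand, binned_pos) for the primary alignment plus
    every supplementary alignment listed in the SA:Z tag."""
    sig = {(read.reference_name,
            "-" if read.is_reverse else "+",
            read.reference_start // BIN)}
    if read.has_tag("SA"):
        # SA:Z entries look like: rname,pos,strand,CIGAR,mapQ,NM;
        for entry in read.get_tag("SA").rstrip(";").split(";"):
            rname, pos, strand, _cigar, _mapq, _nm = entry.split(",")
            sig.add((rname, strand, int(pos) // BIN))
    return frozenset(sig)

def group_reads_by_regions(bam_path, region):
    """Cluster read names by identical alignment signatures within `region`."""
    groups = defaultdict(list)
    with pysam.AlignmentFile(bam_path) as bam:
        for read in bam.fetch(region=region):
            if read.is_secondary or read.is_supplementary:
                continue  # the SA tag on the primary already covers these
            groups[alignment_signature(read)].append(read.query_name)
    return groups

# Reads sharing an abnormal multi-region signature (e.g., two chromosomes, or
# both strands of one chromosome) are candidate SV haplotype clusters.
for sig, names in group_reads_by_regions("bam/mosaic_data_chr21.bam",
                                         "chr21:10000000-10050000").items():
    if len(sig) > 1 and len(names) >= 3:  # mirrors the --minsupport 3 idea
        print(len(names), sorted(sig))
```

Groups with consistent abnormal signatures could then be fed to abpoa for consensus to produce candidate haplotypes, as outlined above.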

SV Calling Performance Comparison (alignment options)

| Dataset / Mapping Strategy | Alignments (Primary / Secondary / Supplementary) | SVs Found | FP | FN | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All 369,812 reads, minimap2 map-ont | 369,812 / 319,549 / 51,286 | 36,741 | 30 | 297 | 90.96% | 50.42% | 64.88% |
| All 369,812 reads, minimap2 lr:hq | 369,812 / 285,330 / 46,993 | 10,555 | 34 | 314 | 89.34% | 47.58% | 62.09% |
| QV > 20 (160,134 reads), minimap2 lr:hq | 160,134 / 117,005 / 21,019 | 5,253 | 23 | 326 | 92.23% | 45.58% | 61.01% |
| QV > 20 after Dorado Correct (210K reads), minimap2 lr:hq | 210,659 / 130,548 / 25,205 | 4,736 | 29 | 321 | 90.55% | 46.41% | 61.36% |
Interpretation

From this comparison, we see that using all reads with minimal quality filtering (QV > 10) produces a large excess of predicted SVs (36,741), almost 7× more than using only QV > 20 reads (5,253).
However, this larger callset does not substantially improve recall: only ~50% with all reads versus ~45% with QV > 20 reads.

Thus, although low-quality reads inflate the number of SV calls, they do not yield proportionally better sensitivity, and may instead increase false positives.
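For transparency about how the metrics in the table relate, here is a small consistency check. The TP count is not shown in the table (Truvari reports it directly); the value 302 below is back-calculated from recall and FN for the map-ont row, so treat this as arithmetic verification, not pipeline code.

```python
# Consistency check: precision, recall, and F1 from TP/FP/FN counts.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(tp=302, fp=30, fn=297)  # row 1: minimap2 map-ont
print(f"precision={p:.2%} recall={r:.2%} F1={f1:.2%}")
# precision=90.96% recall=50.42% F1=64.88%
```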

Region analysis of FN vs TP

QC — MAPQ Distribution

(Figure: MAPQ distribution, TP vs FN)

  1. In the original alignment, the FN distribution is bimodal with a large low-MAPQ mode near 0, and the TP/FN means differ substantially (Δμ ≈ 18): many FNs come from ambiguous or poor alignments.
  2. When low-MAPQ reads are removed (as Sniffles does), the separation collapses: at loci where most of the evidence lives at MAPQ < 20, Sniffles effectively sees little or nothing, so those sites become FNs. A residual subset of FNs still has high-MAPQ reads.
  3. Re-aligning with Q20 reads shrinks the low-MAPQ tail dramatically, meaning base-quality filtering removes much of the ambiguity even before MAPQ gating.
  4. If we additionally filter by MAPQ, the distributions become almost indistinguishable.

Conclusion: the key driver of FN vs TP in MAPQ space is the low-MAPQ mass present in the original alignment; once you gate at MAPQ > 20 (and/or re-align with Q20 reads), the difference largely vanishes.
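A minimal sketch of how such a MAPQ comparison can be computed with pysam. The site lists, window size, and BAM path are illustrative assumptions; in practice the TP/FN coordinates would come from the Truvari match.

```python
# Sketch: per-site MAPQ of reads overlapping a breakpoint window, contrasted
# between TP and FN sites. Re-run with min_mapq=20 to mimic the MAPQ gate.
import numpy as np
import pysam

def site_mapqs(bam, chrom, pos, win=500, min_mapq=0):
    return [r.mapping_quality
            for r in bam.fetch(chrom, max(0, pos - win), pos + win)
            if r.mapping_quality >= min_mapq]

# Hypothetical inputs: breakpoint coordinates from the Truvari TP/FN match.
tp_sites = [("chr21", 10_000_000), ("chr21", 12_345_678)]
fn_sites = [("chr21", 11_111_111)]

with pysam.AlignmentFile("bam/mosaic_data_chr21.bam") as bam:
    tp_vals = np.concatenate([site_mapqs(bam, c, p) for c, p in tp_sites])
    fn_vals = np.concatenate([site_mapqs(bam, c, p) for c, p in fn_sites])

print(f"mean MAPQ: TP={tp_vals.mean():.1f} FN={fn_vals.mean():.1f}")
```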

DELs — Coverage in interval vs flanks

Histogram (left): in TPs, the interval-to-flank coverage ratio at the breakpoint shows a strong tail at ratios < 0.8, reflecting a true coverage drop inside the interval; FNs cluster near 1.0–1.1, i.e., little to no drop. Boxplot (right): medians TP ≈ 0.63 vs FN ≈ 0.99 (Δ ≈ 0.36), p ≪ 1e-10. TPs exhibit a clear coverage drop, while FN coverage stays at flank level ⇒ weak or absent signal.

(Figures: DEL coverage-ratio histogram and boxplot)
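A minimal sketch of the interval-vs-flank coverage ratio, assuming pysam; the flank-sizing rule, coordinates, and path are illustrative.

```python
# Sketch: mean depth inside a called DEL interval divided by mean depth in
# equal-sized flanks. True deletions tend to give ratios well below ~0.8.
import numpy as np
import pysam

def coverage_ratio(bam, chrom, start, end):
    flank = max(end - start, 500)  # illustrative flank-sizing rule

    def mean_depth(s, e):
        # count_coverage returns four per-base arrays (A, C, G, T)
        return np.sum(bam.count_coverage(chrom, max(0, s), e)) / max(e - s, 1)

    inside = mean_depth(start, end)
    flanks = (mean_depth(start - flank, start) + mean_depth(end, end + flank)) / 2
    return inside / flanks if flanks else float("nan")

with pysam.AlignmentFile("bam/mosaic_data_chr21.bam") as bam:
    print(coverage_ratio(bam, "chr21", 10_000_000, 10_002_000))
```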

DELs — Coverage vs deletion signal inside the interval

There is a clear negative relationship: when the coverage ratio drops (< 0.8), the per-read deletion fraction inside the interval rises (up to ~1). TPs dominate this region; FNs cluster near ratio ≈ 1 with low fractions (< 0.1). Deletion enrichment is likewise high for TPs when coverage drops (ratio < 0.8), reaching values well above 1, whereas FNs stay near 0–5 and around ratio ≈ 1. Takeaway: enrichment is a robust signal for an excess of deletion events in the interval versus the flanks.

(Figures: coverage ratio vs in-interval deletion fraction; coverage ratio vs interval-to-flank deletion-fraction ratio)

INS — Insertion prevalence in windows (outside BP)

We examined the prevalence (fraction) of reads around the breakpoint with at least one insertion. Plots: left = histogram; right = KDE density (area ≈ 1 per group). TPs shift to high prevalence (≈0.7–1.0), while FNs cluster lower (≈0–0.4). A Mann–Whitney U p = 0.003 indicates a significant difference between TP and FN; with a higher median in TP (+0.156), insertion prevalence near the breakpoint is a useful discriminator of true INS in this dataset.

(Figures: INS prevalence in windows outside the BP — histogram and KDE density)

INS — Fraction of INS bases in windows (outside BP)

This metric is the median fraction of inserted bases among reads that contain ≥1 insertion in the breakpoint windows. Plots: Left = histogram (fraction of SVs per bin). Right = horizontal boxplot. Stats: Mann–Whitney U p = 4.61e-07; medians TP = 0.383 vs FN = 0.134 (Δ = +0.248). Takeaway: When present, TP insertions occupy a larger share of the read around the BP, whereas FN have small/dispersed insertions. This feature is coverage-robust and a strong discriminator of true INS.

(Figures: INS inserted-base fraction in windows outside the BP — histogram and boxplot)
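A minimal sketch of the TP-vs-FN test reported for both INS metrics, using scipy.stats.mannwhitneyu; the input vectors below are hypothetical per-SV values, not our data.

```python
# Sketch: Mann-Whitney U test between TP and FN values of an INS metric
# (insertion prevalence or inserted-base fraction), one value per SV.
import numpy as np
from scipy.stats import mannwhitneyu

def compare(tp_values, fn_values):
    """Test whether an INS metric differs between TP and FN calls."""
    stat, p = mannwhitneyu(tp_values, fn_values, alternative="two-sided")
    delta = np.median(tp_values) - np.median(fn_values)
    print(f"Mann-Whitney U p={p:.3g}, median shift (TP - FN)={delta:+.3f}")

# Hypothetical per-SV values for, e.g., insertion prevalence near the BP:
compare(np.array([0.8, 0.9, 0.75, 1.0]), np.array([0.1, 0.3, 0.2, 0.4]))
```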

The comparisons above contrast SV calling with all mapped reads (QV > 10) against analyses restricted to reads with Phred quality > 20.

Development

SV Calling Performance Comparison (different aligners)

We evaluate two aligners (minimap2, winnowmap) × three MAPQ filters (0/20/40) × five Sniffles2 profiles (very_sens, sensitive, balanced, strict, very_strict), benchmark each callset with Truvari, and optionally merge callsets with SURVIVOR to probe the precision/recall extremes: union (higher recall) and intersection (higher precision).
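A sketch of how this grid could be driven from Python. The profile flag sets below are stand-ins for our five presets (only flags already used elsewhere in this README are shown), and all paths are illustrative; the samtools, sniffles, and truvari invocations use real options.

```python
# Sketch: 2 aligners x 3 MAPQ filters x Sniffles2 profiles, each callset
# benchmarked with Truvari. One pre-made, indexed BAM per aligner is assumed.
import itertools
import subprocess

ALIGNERS = ["minimap2", "winnowmap"]
MAPQS = [0, 20, 40]
PROFILES = {  # placeholder presets; the real five differ
    "very_sens": ["--mosaic", "--mosaic-af-min", "0.01", "--minsupport", "2"],
    "balanced": ["--mosaic", "--minsupport", "3"],
    "strict": ["--mosaic", "--minsupport", "5"],
}

for aligner, mapq, (name, flags) in itertools.product(
        ALIGNERS, MAPQS, PROFILES.items()):
    bam = f"bam/{aligner}.bam"
    if mapq:  # apply the MAPQ gate up front with samtools
        gated = f"bam/{aligner}.q{mapq}.bam"
        subprocess.run(["samtools", "view", "-b", "-q", str(mapq),
                        bam, "-o", gated], check=True)
        subprocess.run(["samtools", "index", gated], check=True)
        bam = gated
    # Sniffles2 bgzips and indexes the VCF when the name ends in .vcf.gz
    vcf = f"calls/{aligner}.q{mapq}.{name}.vcf.gz"
    subprocess.run(["sniffles", "--input", bam, "--vcf", vcf, *flags],
                   check=True)
    subprocess.run(["truvari", "bench", "-b", "truth.vcf.gz", "-c", vcf,
                    "-o", f"bench/{aligner}.q{mapq}.{name}"], check=True)
```

A SURVIVOR merge of the resulting callsets (as in test 18 below) then gives the union/intersection extremes.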

Filtering options

We tested multiple filter configurations for Sniffles2:

| Test # | FP | FN | Precision | Recall | F1 | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 30 | 297 | 90.96% | 50.42% | 64.88% | default |
| 2 | 46 | 259 | 88.08% | 56.76% | 69.04% | `--mosaic-af-min 0.01` (USER) |
| 3 | 70 | 224 | 84.27% | 62.60% | 71.84% | `--mosaic-af-min 0.01 --minsupport 3` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) |
| 4 | 70 | 223 | 84.30% | 62.77% | 71.96% | `--mosaic-af-min 0.01 --minsupport 3` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL) |
| 5 | 34 | 314 | 89.34% | 47.58% | 62.09% | minimap2 lr:hq + default |
| 6 | 23 | 326 | 92.23% | 45.58% | 61.01% | reads Q20 + minimap2 lr:hq + default |
| 7 | 27 | 324 | 91.06% | 45.91% | 61.04% | winnowmap (parameters) + default |
| 8 | 36 | 322 | 88.50% | 46.24% | 60.75% | winnowmap (parameters) + default |
| 9 | 37 | 328 | 87.99% | 45.24% | 59.76% | winnowmap (parameters) + default |
| 10 | 38 | 327 | 87.74% | 45.41% | 59.85% | winnowmap (parameters) + default |
| 11 | 36 | 315 | 88.75% | 47.41% | 61.80% | winnowmap (parameters) + default |
| 12 | 70 | 222 | 84.34% | 62.94% | 72.08% | `--mosaic-af-min 0.01 --minsupport 3 --mapq 18` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL) |
| 13 | 34 | 320 | 89.13% | 46.78% | 61.18% | winnowmap (parameters) + default |
| 14* | 54 | 227 | 87.53% | 62.10% | 72.66% | `--mosaic-af-min 0.01 --minsupport 3 --mapq 18` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) if precise + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL) |
| 15* | 48 | 230 | 88.49% | 61.60% | 72.64% | `--mosaic-af-min 0.01 --minsupport 3 --mapq 18` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) if (precise, sd_len and sd_pos low) + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL) |
| 16 | 84 | 267 | 80% | 55% | 65% | `cuteSV -t 12 --genotype --report_readid --min_support 3 --min_size 50 bam/mosaic_data_chr21.bam ref/chr21.fasta cutesv/chr21.cutesv2.vcf ./cutesv/` |
| 17 | 26 | 382 | 89% | 36% | 52% | `cuteSV -t 12 --genotype --report_readid --min_support 3 --min_size 50 bam/mosaic_data_chr21.bam ref/chr21.fasta cutesv/chr21.cutesv2.vcf ./cutesv/` |
| 18 | 55 | 278 | 85% | 53% | 66% | `SURVIVOR merge samples.txt 1000 1 1 0 0 50` |
| 19 | 41 | 281 | 88.58% | 53.09% | 66.39% | reads Q20 + minimap2 lr:hq + `--mosaic-af-min 0.01 --minsupport 3 --mapq 18` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) if (precise, sd_len and sd_pos low) + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL) |
| 20 | 29 | 321 | 90.55% | 46.41% | 61.37% | reads Q20 + minimap2 lr:hq + default Sniffles |

Other SV calling tools

We tested other tools (cuteSV) and compared them to the benchmark set:

| Tool | Version | FP | FN | Precision | Recall | F1 | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cuteSV | v2.1.2 | 84 | 267 | 80% | 55% | 65% | All SVs included |
| cuteSV | v2.1.2 | 26 | 382 | 89% | 36% | 52% | Only PASS SVs included |
| SURVIVOR merge | v1.0.7 | 55 | 278 | 85% | 53% | 66% | Sniffles default merged with cuteSV (All) |

Comparing Sniffles FNs vs. cuteSV FNs (before and after filtering):

  • 251 FNs are shared between Sniffles and cuteSV (all SVs).
  • 42 FNs are shared between Sniffles and cuteSV (PASS SVs only).
  • 4 FNs are missed by Sniffles only.

Future perspectives

BAM-derived feature extraction (no re-calling Sniffles)

To study why certain truth SVs were missed (FN) and how detected SVs (TP) differ, we extract quantitative evidence directly from the BAM around each breakpoint, without re-calling Sniffles. We built a BAM-only feature extractor around breakpoints (with adaptive windows) that outputs per-SV aggregates and optional per-read details: depth at the edges and inside, read composition, robust MAPQ statistics, an identity proxy from NM, soft-clip burden and breakpoint-specific soft-clip fractions, split-read prevalence (SA:Z), indel load per kb, and strand balance. These tables are now the substrate for PCA/ML and for principled tuning of mapping/calling parameters, with a quick interpretation rubric for DEL/INS and for understanding typical FN signatures.
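A condensed sketch of such an extractor, assuming pysam. The window size, feature names, and thresholds are our illustrative choices; the actual extractor computes more than this (adaptive windows, per-read tables, breakpoint-specific soft-clip fractions).

```python
# Sketch: BAM-only per-SV features around a breakpoint interval. The keys
# mirror the feature list in the text; values are simple aggregates.
import numpy as np
import pysam

def sv_features(bam, chrom, start, end, win=500):
    reads = [r for r in bam.fetch(chrom, max(0, start - win), end + win)
             if not r.is_secondary]
    if not reads:
        return None
    mapqs = np.array([r.mapping_quality for r in reads])
    softclip = [sum(l for op, l in r.cigartuples if op == 4) / r.query_length
                for r in reads if r.cigartuples and r.query_length]
    identity = [1 - r.get_tag("NM") / r.query_alignment_length
                for r in reads if r.has_tag("NM") and r.query_alignment_length]
    indel_bp = [sum(l for op, l in r.cigartuples if op in (1, 2))  # I, D ops
                for r in reads if r.cigartuples]
    return {
        "n_reads": len(reads),
        "mapq_median": float(np.median(mapqs)),
        "mapq_frac_lt20": float(np.mean(mapqs < 20)),
        "softclip_frac_mean": float(np.mean(softclip)),
        "identity_mean": float(np.mean(identity)),         # NM-based proxy
        "split_read_prev": float(np.mean([r.has_tag("SA") for r in reads])),
        "indel_bp_per_kb": 1000 * np.sum(indel_bp) /
                           max(sum(r.query_alignment_length for r in reads), 1),
        "strand_balance": float(np.mean([r.is_reverse for r in reads])),
    }

with pysam.AlignmentFile("bam/mosaic_data_chr21.bam") as bam:
    print(sv_features(bam, "chr21", 10_000_000, 10_000_050))
```

One row per SV from this function, concatenated across the truth set, is the table fed to PCA/ML.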

Re-alignment to T2T to study FN

We re-aligned the reads to the T2T-CHM13 reference to evaluate the effect of the reference genome on SV calling, then examined a couple of regions classified as false-negative deletions by Sniffles with hg38.

  • The bottom figure shows a false-negative deletion from Sniffles/hg38 that is supported by only one read.
  • The top and middle figures show the hg38 and T2T regions, respectively.

(Figures: IGV screenshots of the hg38, T2T, and supporting-read views)
FN

(Figure: example false-negative region)

FP
  • Better clustering (+++)
  • Low support → keep only PRECISE calls → strict sd_len/sd_pos thresholds → stricter MAPQ!
  • Check split alignments more carefully
  • Two alleles?
  • Alignment error / repetitive sequence?

(Figures: clustering examples, two-allele cases, and alignment views)


🚀 Getting Started

(This section will be updated with instructions on how to set up the environment and run the project.)

  1. Clone the repository:
    git clone https://github.com/collaborativebioinformatics/Mosaicome.git
    
  2. Set up the environment:
    conda create --name mosaicome bioconda::winnowmap bioconda::sniffles bioconda::samtools bioconda::bcftools bioconda::truvari bioconda::minimap2
    conda activate mosaicome
