Assembling the bigger picture of the genome, one piece at a time.
A central repository for Team Mosaicome's project for the Mosaic SV ROM Collaborative Bioinformatics SVM Hackathon. Our goal is to develop a pipeline for the accurate detection of genomic structural variants.
- Luis Paulin
- Daniel Baker
- Rafal Woycicki
- Rajarshi Mondal
- Gerardo Fabián
- Asmaa Mahmoud
- Nadia Baig
- Adrian Chalco
- Gobikrishnan Subramaniam
- Ammara Saleem
- Gavin Monahan
- Jacob Krol
- Xingtian Yang
- Neda Ghohabi Esfahani
- Qichen Fu
Structural variant (SV) calling is a challenging process in which variants > 50 bp are detected relative to a reference genome. Many tools are already capable of detecting germline SVs. Low-frequency SVs (< 20% VAF, a.k.a. mosaic) pose a greater challenge because the signal is weak. Our current strategy detects ~50% of the SVs (VAF 5-100%) with 90% precision.
ONT sequencing is still evolving: new basecalling models and new alignment tools have appeared since May of this year. Better read quality, i.e., better per-nucleotide calling probabilities, allows better mapping when suitable aligners are used.
Our goal is to improve recall without substantially affecting precision.
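The precision/recall/F1 figures quoted throughout can be reproduced from TP/FP/FN counts. A minimal sketch in Python (the function name `sv_metrics` is ours; the TP count of 302 for the default Sniffles2 run is implied by the reported FP = 30, FN = 297, and precision/recall):

```python
def sv_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Default Sniffles2 run on the benchmark: FP=30, FN=297 imply TP=302.
p, r, f = sv_metrics(302, 30, 297)
print(f"precision={p:.2%} recall={r:.2%} f1={f:.2%}")
```

This reproduces the 90.96% / 50.42% / 64.88 row of the results table.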
This project will leverage a modern, open-source technology stack.
Category | Technology / Library | Purpose |
---|---|---|
Language | Python 3.12+ | Core programming language |
Data Handling | Pandas, NumPy | Data manipulation & numerical analysis |
Alignment | Minimap2, Winnowmap, abpoa | Read mapping & assembly |
SV Calling | Sniffles2, CuteSV | Structural variant detection |
Visualization | Matplotlib, IGV | Plotting results & data exploration |
Development | Jupyter Lab, VS Code | Interactive analysis & coding |
Collaboration | Git, GitHub | Version control & project management |
Long-read sequencing of six individuals from the HapMap project. The mix was done in vitro, so sequencing randomness affects all samples equally. We have a high-quality benchmark dataset derived from assemblies.
Our progress will be tracked through the following key phases and milestones.
Phase | Status | Key Milestones |
---|---|---|
1 | ✅ | Exploration & Strategy ⚫ Data Analysis ⚫ Steps |
2 | ✅ | Development & Prototyping ⚫ Steps ⚫ Steps |
3 | ⏳ | Refinement & Validation ⚫ Steps ⚫ Validation ⚫ Final Documentation |

Status Key: ⚪ Not Started, ⏳ In Progress, ✅ Completed
Strategies:
- Testing different parameters for each filter applied; some SVs were detected but then filtered out.
- As we are at the mercy of the aligners, we can improve alignment in candidate regions:
  - Re-alignment with newer presets (e.g., minimap2 lr:hq, suited for reads with Phred Q > 20), or optionally remapping with Winnowmap, which has shown a reduced mapping error rate.
  - Traversal of full minimap2 alignments via SA tag parsing. Group sequences by the sets of regions they align to, possibly merging across homologous regions using XA tags.
  - Align reads to the reference (minimap2, Winnowmap) several times with different parameters.
  - Take reference windows and cluster reads by aligned regions.
  - Take abnormal events (e.g., a read aligning to two similar regions on two different chromosomes, or to different strands of the same chromosome) and feed them into sequence clustering/consensus to get candidate haplotypes.
  - Align the local reads against the generated haplotypes to filter out misassemblies/low-quality results.
- Assembly of candidate regions with abpoa, which can potentially provide local haplotypes. It can automatically group reads and generate consensus, and it is fast.
- Define attributes of FNs to examine, then contrast their distributions against TPs:
  - Number of supporting reads
  - Mapping quality
  - Avg. number of supplementary alignments for supporting reads
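The SA-tag traversal mentioned in the strategies can be sketched as follows. The `SA:Z` value format (`rname,pos,strand,CIGAR,mapQ,NM;` per supplementary alignment) is standard SAM; the grouping key (chromosome + strand sets) is our own simplification:

```python
from collections import defaultdict

def parse_sa(sa_value: str) -> list[dict]:
    """Split a SAM SA:Z value into its supplementary-alignment records."""
    records = []
    for entry in sa_value.rstrip(";").split(";"):
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        records.append({"rname": rname, "pos": int(pos), "strand": strand,
                        "cigar": cigar, "mapq": int(mapq), "nm": int(nm)})
    return records

def group_reads_by_regions(reads: dict[str, str]) -> dict[frozenset, list[str]]:
    """Group read names by the set of (chrom, strand) regions their SA tags touch."""
    groups = defaultdict(list)
    for name, sa in reads.items():
        key = frozenset((r["rname"], r["strand"]) for r in parse_sa(sa))
        groups[key].append(name)
    return dict(groups)
```

Reads whose supplementary alignments touch the same region sets end up in one cluster, which is the starting point for the consensus/haplotype step.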
-
Impact of Read Quality on SV Calling On the way to finding a better strategy for recalling structural variants (SVs),
we observed that the quality of ONT reads plays a crucial role.
Dataset / Mapping Strategy | Alignments (Primary / Secondary / Supplementary) | SVs Found | FP | FN | Precision | Recall | F1 |
---|---|---|---|---|---|---|---|
All 369,812 reads, minimap2 map-ont | 369,812 / 319,549 / 51,286 | 36,741 | 30 | 297 | 90.96% | 50.42% | 64.88% |
All 369,812 reads, minimap2 lr:hq | 369,812 / 285,330 / 46,993 | 10,555 | 34 | 314 | 89.34% | 47.58% | 62.09% |
QV > 20 (160,134 reads), minimap2 lr:hq | 160,134 / 117,005 / 21,019 | 5,253 | 23 | 326 | 92.23% | 45.58% | 61.01% |
QV > 20 after Dorado Correct (210K reads), minimap2 lr:hq | 210,659 / 130,548 / 25,205 | 4,736 | 29 | 321 | 90.55% | 46.41% | 61.36% |
From this comparison, we see that using all reads with minimal quality filtering (QV > 10) yields a large excess of predicted SVs (36,741), almost 7× more than when using only QV > 20 reads (5,253 SVs).
However, this larger callset does not substantially improve recall: 50.42% with all reads versus 45.58% with QV > 20 reads.
Thus, although low-quality reads inflate the number of SV calls, they do not yield proportionally better sensitivity and may instead increase false positives.
(1) In the original alignment, the FN MAPQ distribution is bimodal with a large low-MAPQ mode near 0, and the means of FN and TP differ substantially (Δμ ≈ 18): many FNs come from ambiguous/poor alignments. (2) When we remove low-MAPQ reads (as Sniffles does), the separation collapses: at loci where most of the evidence lies in MAPQ < 20 reads, Sniffles effectively sees little or nothing, so those sites become FNs. A residual subset of FNs still has high-MAPQ reads. (3) Re-aligning with Q20 reads shrinks the low-MAPQ tail dramatically, meaning base-quality filtering removes much of the ambiguity even before MAPQ gating. (4) If we additionally filter by MAPQ, the distributions are almost indistinguishable. Conclusion: a key driver of the FN vs TP difference in MAPQ is the low-MAPQ mass present in the original alignment; once you gate at MAPQ > 20 (and/or re-align with Q20 reads), the difference largely vanishes.
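The FN-vs-TP mean shift (Δμ) used in this analysis can be computed with a small helper over per-SV summaries; the dict field names here (`support`, `mapq`) are hypothetical placeholders for whatever attributes are extracted:

```python
from statistics import mean

def contrast_attribute(tp_svs: list[dict], fn_svs: list[dict], attr: str) -> float:
    """Difference in means of one attribute between TP and FN call sets (TP minus FN)."""
    return mean(sv[attr] for sv in tp_svs) - mean(sv[attr] for sv in fn_svs)

# Toy per-SV summaries; field names are our own, not Sniffles output fields.
tp = [{"support": 12, "mapq": 55}, {"support": 9, "mapq": 60}]
fn = [{"support": 3, "mapq": 12}, {"support": 2, "mapq": 20}]
delta_mapq = contrast_attribute(tp, fn, "mapq")  # positive => TP loci map better
```

The same helper applies to any of the FN attributes listed earlier (supporting reads, MAPQ, supplementary-alignment counts).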
Histogram (left): for TPs, the coverage ratio of the interval to the flanks of the breakpoint shows a strong tail at ratios < 0.8, indicating a true coverage drop inside the interval. FNs cluster near 1.0–1.1, i.e., little to no drop. Boxplot (right): medians TP ≈ 0.63 vs FN ≈ 0.99 (Δ ≈ 0.36), p ≪ 1e-10. TPs exhibit a clear coverage drop; FNs remain at flank-level coverage ⇒ weak/absent signal.
There is a clear negative relationship: when the coverage ratio drops (< 0.8), the per-read deletion fraction inside the interval rises (up to ~1). TPs dominate this region; FNs cluster near ratio ≈ 1 with low fractions (< 0.1). Deletion enrichment is also high for TPs when coverage drops (ratio < 0.8), reaching values well above 1; FNs stay near 0–5 and around ratio ~1. Takeaway: enrichment is a robust signal for excess deletion events in the interval versus the flanks.
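The interval/flank coverage ratio behind these plots is a one-liner; a minimal sketch (function and parameter names are ours):

```python
def coverage_ratio(depth_inside: float, depth_left: float, depth_right: float) -> float:
    """Mean depth inside the SV interval divided by mean depth over the two flanks."""
    flank = (depth_left + depth_right) / 2
    return depth_inside / flank if flank else float("nan")

# A heterozygous-like deletion roughly halves inside coverage relative to the flanks:
ratio = coverage_ratio(depth_inside=15, depth_left=31, depth_right=29)
is_tp_like = ratio < 0.8  # the <0.8 threshold from the analysis above
```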
We examined the prevalence (fraction) of reads around the breakpoint with at least one insertion. Plots: left = histogram; right = KDE density (area ≈ 1 per group). TPs shift to high prevalence (≈ 0.7–1.0), while FNs cluster lower (≈ 0–0.4). Mann–Whitney U p = 0.003 indicates a significant difference between TP and FN; with a higher median in TP (+0.156), insertion prevalence near the breakpoint is a useful discriminator of true INS in this dataset.
This metric is the median fraction of inserted bases among reads that contain ≥1 insertion in the breakpoint windows. Plots: Left = histogram (fraction of SVs per bin). Right = horizontal boxplot. Stats: Mann–Whitney U p = 4.61e-07; medians TP = 0.383 vs FN = 0.134 (Δ = +0.248). Takeaway: When present, TP insertions occupy a larger share of the read around the BP, whereas FN have small/dispersed insertions. This feature is coverage-robust and a strong discriminator of true INS.
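The per-read inserted-base fraction underlying this metric can be computed directly from the SAM CIGAR string (standard operation codes; the helper name is ours):

```python
import re

_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def inserted_fraction(cigar: str) -> float:
    """Fraction of a read's query bases that sit in insertion (I) CIGAR operations."""
    ops = _CIGAR_RE.findall(cigar)
    ins = sum(int(n) for n, op in ops if op == "I")
    query_len = sum(int(n) for n, op in ops if op in "MIS=X")  # ops consuming the query
    return ins / query_len if query_len else 0.0
```

Taking the median of this value over reads with at least one insertion in the breakpoint window gives the per-SV feature contrasted above.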
Below is a comparison of SV calling results between using all mapped reads (QV > 10) and restricting the analysis to reads with Phred quality > 20.
Evaluate two aligners (minimap2, winnowmap) × three MAPQ filters (0/20/40) × five Sniffles2 profiles (very_sens, sensitive, balanced, strict, very_strict), then benchmark each callset with Truvari and optionally merge callsets with SURVIVOR (union, intersection), exploring the precision/recall extremes via union (↑ recall) and intersection (↑ precision).
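The 2 × 3 × 5 sweep enumerates 30 callsets; a small sketch of generating the grid (the label scheme is our own convention):

```python
import itertools

aligners = ["minimap2", "winnowmap"]
mapq_filters = [0, 20, 40]
profiles = ["very_sens", "sensitive", "balanced", "strict", "very_strict"]

# Every (aligner, MAPQ filter, Sniffles2 profile) combination = one callset to benchmark.
grid = list(itertools.product(aligners, mapq_filters, profiles))
labels = [f"{a}.mapq{q}.{p}" for a, q, p in grid]
```

Each label then names one alignment/filter/call run whose VCF is fed to Truvari, and the 30 VCFs are the inputs for the SURVIVOR union/intersection merges.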
We tested multiple filter configurations for Sniffles2:
# Test | FP | FN | Precision | Recall | F1 | Notes |
---|---|---|---|---|---|---|
1 | 30 | 297 | 90.96% | 50.42% | 64.88% | default |
2 | 46 | 259 | 88.08% | 56.76% | 69.04% | --mosaic-af-min 0.01 (USER) |
3 | 70 | 224 | 84.27% | 62.60% | 71.84% | "--mosaic-af-min 0.01 --minsupport 3 (USER) + mosaic_min_reads = 2 for DEL and INS (INTERNAL)" |
4 | 70 | 223 | 84.30% | 62.77% | 71.96% | "--mosaic-af-min 0.01 --minsupport 3 (USER) + mosaic_min_reads = 2 for DEL and INS (INTERNAL)+ minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL)" |
5 | 34 | 314 | 89.34% | 47.58% | 62.09% | minimap2 lr:hq + default |
6 | 23 | 326 | 92.23% | 45.58% | 61.01% | reads Q20 +minimap2 lr:hq + default |
7 | 27 | 324 | 91.06% | 45.91% | 61.04% | winnowmap (parameters) + default |
8 | 36 | 322 | 88.50% | 46.24% | 60.75% | winnowmap (parameters) + default |
9 | 37 | 328 | 87.99% | 45.24% | 59.76% | winnowmap (parameters) + default |
10 | 38 | 327 | 87.74% | 45.41% | 59.85% | winnowmap (parameters) + default |
11 | 36 | 315 | 88.75% | 47.41% | 61.80% | winnowmap (parameters) + default |
12 | 70 | 222 | 84.34% | 62.94% | 72.08% | "--mosaic-af-min 0.01 --minsupport 3 --mapq 18 (USER) + mosaic_min_reads = 2 for DEL and INS (INTERNAL) + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL)" |
13 | 34 | 320 | 89.13% | 46.78% | 61.18% | winnowmap (parameters) + default |
14* | 54 | 227 | 87.53% | 62.10% | 72.66% | "--mosaic-af-min 0.01 --minsupport 3 --mapq 18 (USER) + mosaic_min_reads = 2 for DEL and INS (INTERNAL) if precise + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL)" |
15* | 48 | 230 | 88.49% | 61.60% | 72.64% | "--mosaic-af-min 0.01 --minsupport 3 --mapq 18 (USER) + mosaic_min_reads = 2 for DEL and INS (INTERNAL) if (precise sd_len and sd_pos is low) + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL)" |
16 | 84 | 267 | 80% | 55% | 65% | cuteSV -t 12 --genotype --report_readid --min_support 3 --min_size 50 bam/mosaic_data_chr21.bam ref/chr21.fasta cutesv/chr21.cutesv2.vcf ./cutesv/ (all SVs) |
17 | 26 | 382 | 89% | 36% | 52% | cuteSV -t 12 --genotype --report_readid --min_support 3 --min_size 50 bam/mosaic_data_chr21.bam ref/chr21.fasta cutesv/chr21.cutesv2.vcf ./cutesv/ (PASS SVs only) |
18 | 55 | 278 | 85% | 53% | 66% | SURVIVOR merge samples.txt 1000 1 1 0 0 50 |
19 | 41 | 281 | 88.58% | 53.09% | 66.39% | "reads Q20 +minimap2 lr:hq + --mosaic-af-min 0.01 --minsupport 3 --mapq 18 (USER) + mosaic_min_reads = 2 for DEL and INS (INTERNAL) if (precise sd_len and sd_pos is low) + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL)" |
20 | 29 | 321 | 90.55% | 46.41% | 61.37% | "reads Q20 +minimap2 lr:hq + default sniffles" |


We tested other tools (cuteSV) and compared them to the benchmark set:
Tool | version | FP | FN | Precision | Recall | F1 | Notes |
---|---|---|---|---|---|---|---|
cuteSV | v2.1.2 | 84 | 267 | 80% | 55% | 65% | All SVs are included |
cuteSV | v2.1.2 | 26 | 382 | 89% | 36% | 52% | Only PASS SVs are included |
SURVIVOR merge | v1.0.7 | 55 | 278 | 85% | 53% | 66% | Sniffles default with cute SV (All) |
- 251 FNs are common between Sniffles and cuteSV (all SVs).
- 42 FNs are common between Sniffles and cuteSV (PASS SVs only).
- 4 FNs are missed by Sniffles only.
To study why certain truth SVs were missed (FN) and how detected SVs (TP) differ, we extract quantitative evidence directly from the BAM around each breakpoint—without re-calling Sniffles. Rather than re-calling Sniffles at candidate sites, we built a BAM-only feature extractor around breakpoints (with adaptive windows) that outputs per-SV aggregates and optional per-read details: depth at the edges and inside, read composition, robust MAPQ stats, an identity proxy from `NM`, soft-clip burden and breakpoint-specific soft-clip fractions, split-read prevalence (`SA:Z`), indel load per kb, and strand balance. These tables are now the substrate for PCA/ML and principled tuning of mapping/calling parameters, with a quick interpretation rubric for DEL/INS and for understanding typical FN signatures.
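The per-SV aggregation step of such an extractor can be sketched as below, assuming each read near a breakpoint has already been summarized into a dict (field names are ours; the real extractor pulls these from the BAM, e.g. via pysam):

```python
from statistics import median

def aggregate_sv_features(reads: list[dict]) -> dict:
    """Collapse per-read evidence near one breakpoint into per-SV features."""
    n = len(reads)
    return {
        "n_reads": n,
        "mapq_median": median(r["mapq"] for r in reads),
        "softclip_frac": sum(r["softclipped"] for r in reads) / n,
        "split_frac": sum(r["has_sa"] for r in reads) / n,      # SA:Z prevalence
        "strand_balance": sum(r["strand"] == "+" for r in reads) / n,
    }

# Toy example: three reads summarized near one candidate breakpoint.
reads = [
    {"mapq": 60, "softclipped": True,  "has_sa": True,  "strand": "+"},
    {"mapq": 20, "softclipped": False, "has_sa": False, "strand": "-"},
    {"mapq": 55, "softclipped": True,  "has_sa": True,  "strand": "+"},
]
features = aggregate_sv_features(reads)
```

One such row per SV is what feeds the PCA/ML step and the DEL/INS interpretation rubric.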
We re-aligned to the T2T-CHM13 reference genome to evaluate the effect of the reference genome on SV calling. Then we examined a couple of regions that were classified as false negatives (deletion) from Sniffles with hg38.
- The bottom figure shows a false negative deletion from Sniffles/hg38 that is supported by only one read.
- The top and middle figures show the hg38 and T2T regions, respectively.
- better clustering +++
- Low support -> only PRECISE -> strict SVDEV LEN/POS -> stricter MAPQ!
- check splits better
- 2 alleles?
- alignment error/repetitive sequence?
(This section will be updated with instructions on how to set up the environment and run the project.)
- Clone the repository:

```shell
git clone https://github.com/collaborativebioinformatics/Mosaicome.git
```

- Set up the environment:

```shell
conda create --name mosaicome bioconda::winnowmap bioconda::sniffles bioconda::samtools bioconda::bcftools bioconda::truvari bioconda::minimap2
```