
🧬 Mosaicome

Assembling the bigger picture of the genome, one piece at a time.

Status: In Progress · Hackathon: Mosaic SV · License: MIT


A central repository for Team Mosaicome's project for the Mosaic SV Collaborative Bioinformatics hackathon. Our goal is to develop a pipeline for the accurate detection of low-frequency (mosaic) genomic structural variants.

Table of Contents

  1. The Team
  2. The Problem
  3. Tech Stack
  4. The Data
  5. Project Roadmap
  6. Getting Started

👥 The Team


The Problem

Structural variant (SV) calling is the challenging task of detecting variants >50 bp relative to a reference genome. Many tools are available and already capable of detecting germline SVs. Low-frequency SVs (<20% VAF, a.k.a. mosaic) pose a greater challenge because the supporting signal is weak. Our current strategy detects ~50% of SVs (VAF 5–100%) at 90% precision.

ONT sequencing is still evolving: new basecalling models as well as new alignment tools have appeared since May this year. Higher read quality, i.e., better per-nucleotide calling probabilities, allows better mapping when suitable aligners are used.

Our goal is to improve recall without substantially sacrificing precision.


🛠️ Tech Stack

This project will leverage a modern, open-source technology stack.

| Category | Technology / Library | Purpose |
| --- | --- | --- |
| Language | Python 3.12+ | Core programming language |
| Data Handling | Pandas, NumPy | Data manipulation & numerical analysis |
| Alignment | Minimap2, Winnowmap, abpoa | Read mapping & local assembly |
| SV Calling | Sniffles2, cuteSV | Structural variant detection |
| Visualization | Matplotlib, IGV | Plotting results & data exploration |
| Development | Jupyter Lab, VS Code | Interactive analysis & coding |
| Collaboration | Git, GitHub | Version control & project management |

💾 The Data

Long-read sequencing of a mixture of six individuals from the HapMap project. The mix was prepared in vitro, so sequencing randomness affects all samples equally. We have a high-quality benchmark dataset derived from assemblies.


🗺️ Project Roadmap


Our progress will be tracked through the following key phases and milestones.

| Phase | Status | Key Milestones |
| --- | --- | --- |
| 1. Exploration & Strategy | ⚫ | Data Analysis |
| | ⚫ | Steps |
| 2. Development & Prototyping | ⚫ | Steps |
| | ⚫ | Steps |
| 3. Refinement & Validation | ⚫ | Steps |
| | ⚫ | Validation |
| | ⚫ | Final Documentation |

Status Key: Not Started, In Progress, Completed

Exploration & Strategy

Strategies:

  • Testing different parameters for each filter applied; some SVs were detected but then filtered out

  • Since we are at the mercy of the aligners, we can improve alignment in candidate regions:

    • Re-alignment with newer options (e.g., minimap2 lr:hq, suited for reads with Phred Q > 20), or optionally remapping with Winnowmap, which has shown a reduced mapping error rate.
    • Traversal of full mm2 alignments via SA-tag parsing: group sequences by their sets of aligned regions, possibly also merging across homologous regions using XA tags (see the sketch after this list).
    • Align reads to the reference (mm2, winnowmap) several times with different parameters.
      • Take reference windows and cluster reads by aligned regions.
      • Take abnormal events (e.g., aligning two similar regions on 2 different chromosomes or different strands on the same chromosome) and then feed into sequence clustering/consensus to get candidate haplotypes.
      • Align the local reads against generated haplotypes to filter for misassemblies/low quality results.
  • Assembly of candidate regions with abpoa, which can potentially provide local haplotypes. It automatically groups reads and generates consensus, and it is fast.

  • Define attributes of FNs to examine, then contrast their distributions with those of TPs

    • Attributes
      • number of supporting reads
      • Mapping quality
      • Avg. # of supplementary alignments for supporting reads
  • Impact of read quality on SV calling: while searching for a better strategy to recall structural variants (SVs), we observed that the quality of the ONT reads plays a crucial role.
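As a concrete illustration of the SA-tag grouping strategy above, here is a minimal Python sketch using pysam. It is not yet part of the pipeline: the bin size, region, file path, and function names are our own illustrative choices, and the support threshold simply mirrors the `--minsupport 3` idea.

```python
# Minimal sketch: group reads by their sets of aligned regions via the SA:Z
# tag. Assumes a coordinate-sorted, indexed BAM; pysam is the only dependency.
from collections import defaultdict

import pysam

BIN = 10_000  # collapse nearby positions into bins to tolerate small offsets

def alignment_signature(read):
    """Set of (chrom, strand, binned_pos) for the primary alignment plus
    every supplementary alignment listed in the SA:Z tag."""
    sig = {(read.reference_name,
            "-" if read.is_reverse else "+",
            read.reference_start // BIN)}
    if read.has_tag("SA"):
        # SA:Z entries look like: rname,pos,strand,CIGAR,mapQ,NM;
        for entry in read.get_tag("SA").rstrip(";").split(";"):
            rname, pos, strand, _cigar, _mapq, _nm = entry.split(",")
            sig.add((rname, strand, int(pos) // BIN))
    return frozenset(sig)

def group_reads_by_regions(bam_path, region):
    """Cluster read names by identical alignment signatures within `region`."""
    groups = defaultdict(list)
    with pysam.AlignmentFile(bam_path) as bam:
        for read in bam.fetch(region=region):
            if read.is_secondary or read.is_supplementary:
                continue  # the SA tag on the primary already covers these
            groups[alignment_signature(read)].append(read.query_name)
    return groups

# Reads sharing an abnormal multi-region signature (e.g., two chromosomes, or
# both strands of one chromosome) are candidate SV haplotype clusters.
for sig, names in group_reads_by_regions("bam/mosaic_data_chr21.bam",
                                         "chr21:10000000-10050000").items():
    if len(sig) > 1 and len(names) >= 3:  # mirrors the --minsupport 3 idea
        print(len(names), sorted(sig))
```

Groups with consistent abnormal signatures could then be fed to abpoa for consensus to produce candidate haplotypes, as outlined above.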

SV Calling Performance Comparison (alignment options)

| Dataset / Mapping Strategy | Alignments (Primary / Secondary / Supplementary) | SVs Found | FP | FN | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All 369,812 reads, minimap2 map-ont | 369,812 / 319,549 / 51,286 | 36,741 | 30 | 297 | 90.96% | 50.42% | 64.88% |
| All 369,812 reads, minimap2 lr:hq | 369,812 / 285,330 / 46,993 | 10,555 | 34 | 314 | 89.34% | 47.58% | 62.09% |
| QV > 20 (160,134 reads), minimap2 lr:hq | 160,134 / 117,005 / 21,019 | 5,253 | 23 | 326 | 92.23% | 45.58% | 61.01% |
| QV > 20 after Dorado Correct (210K reads), minimap2 lr:hq | 210,659 / 130,548 / 25,205 | 4,736 | 29 | 321 | 90.55% | 46.41% | 61.36% |
Interpretation

From this comparison, we see that using all reads with minimal quality filtering (QV > 10) produces a large excess of predicted SVs (36,741), almost 7× more than using only QV > 20 reads (5,253).
However, this larger callset does not substantially improve recall: only ~50% with all reads versus ~45% with QV > 20 reads.

Thus, although low-quality reads inflate the number of SV calls, they do not yield proportionally better sensitivity, and may instead increase false positives.
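For transparency about how the metrics in the table relate, here is a small consistency check. The TP count is not shown in the table (Truvari reports it directly); the value 302 below is back-calculated from recall and FN for the map-ont row, so treat this as arithmetic verification, not pipeline code.

```python
# Consistency check: precision, recall, and F1 from TP/FP/FN counts.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(tp=302, fp=30, fn=297)  # row 1: minimap2 map-ont
print(f"precision={p:.2%} recall={r:.2%} F1={f1:.2%}")
# precision=90.96% recall=50.42% F1=64.88%
```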

Region analysis of FN vs TP

QC — MAPQ Distribution

(Figure: MAPQ distribution, TP vs FN)

  1. In the original alignment, the FN distribution is bimodal with a large low-MAPQ mode near 0, and the TP/FN means differ substantially (Δμ ≈ 18): many FNs come from ambiguous or poor alignments.
  2. When low-MAPQ reads are removed (as Sniffles does), the separation collapses: at loci where most of the evidence lives at MAPQ < 20, Sniffles effectively sees little or nothing, so those sites become FNs. A residual subset of FNs still has high-MAPQ reads.
  3. Re-aligning with Q20 reads shrinks the low-MAPQ tail dramatically, meaning base-quality filtering removes much of the ambiguity even before MAPQ gating.
  4. If we additionally filter by MAPQ, the distributions become almost indistinguishable.

Conclusion: the key driver of FN vs TP in MAPQ space is the low-MAPQ mass present in the original alignment; once you gate at MAPQ > 20 (and/or re-align with Q20 reads), the difference largely vanishes.
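A minimal sketch of how such a MAPQ comparison can be computed with pysam. The site lists, window size, and BAM path are illustrative assumptions; in practice the TP/FN coordinates would come from the Truvari match.

```python
# Sketch: per-site MAPQ of reads overlapping a breakpoint window, contrasted
# between TP and FN sites. Re-run with min_mapq=20 to mimic the MAPQ gate.
import numpy as np
import pysam

def site_mapqs(bam, chrom, pos, win=500, min_mapq=0):
    return [r.mapping_quality
            for r in bam.fetch(chrom, max(0, pos - win), pos + win)
            if r.mapping_quality >= min_mapq]

# Hypothetical inputs: breakpoint coordinates from the Truvari TP/FN match.
tp_sites = [("chr21", 10_000_000), ("chr21", 12_345_678)]
fn_sites = [("chr21", 11_111_111)]

with pysam.AlignmentFile("bam/mosaic_data_chr21.bam") as bam:
    tp_vals = np.concatenate([site_mapqs(bam, c, p) for c, p in tp_sites])
    fn_vals = np.concatenate([site_mapqs(bam, c, p) for c, p in fn_sites])

print(f"mean MAPQ: TP={tp_vals.mean():.1f} FN={fn_vals.mean():.1f}")
```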

DELs — Coverage in interval vs flanks

Histogram (left): in TPs, the interval-to-flank coverage ratio at the breakpoint shows a strong tail at ratios < 0.8, reflecting a true coverage drop inside the interval; FNs cluster near 1.0–1.1, i.e., little to no drop. Boxplot (right): medians TP ≈ 0.63 vs FN ≈ 0.99 (Δ ≈ 0.36), p ≪ 1e-10. TPs exhibit a clear coverage drop, while FN coverage stays at flank level ⇒ weak or absent signal.

(Figures: DEL coverage-ratio histogram and boxplot)
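A minimal sketch of the interval-vs-flank coverage ratio, assuming pysam; the flank-sizing rule, coordinates, and path are illustrative.

```python
# Sketch: mean depth inside a called DEL interval divided by mean depth in
# equal-sized flanks. True deletions tend to give ratios well below ~0.8.
import numpy as np
import pysam

def coverage_ratio(bam, chrom, start, end):
    flank = max(end - start, 500)  # illustrative flank-sizing rule

    def mean_depth(s, e):
        # count_coverage returns four per-base arrays (A, C, G, T)
        return np.sum(bam.count_coverage(chrom, max(0, s), e)) / max(e - s, 1)

    inside = mean_depth(start, end)
    flanks = (mean_depth(start - flank, start) + mean_depth(end, end + flank)) / 2
    return inside / flanks if flanks else float("nan")

with pysam.AlignmentFile("bam/mosaic_data_chr21.bam") as bam:
    print(coverage_ratio(bam, "chr21", 10_000_000, 10_002_000))
```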

DELs — Coverage vs deletion signal inside the interval

There is a clear negative relationship: when the coverage ratio drops (< 0.8), the per-read deletion fraction inside the interval rises (up to ~1). TPs dominate this region; FNs cluster near ratio ≈ 1 with low fractions (< 0.1). Deletion enrichment is likewise high for TPs when coverage drops (ratio < 0.8), reaching values well above 1, whereas FNs stay near 0–5 and around ratio ≈ 1. Takeaway: enrichment is a robust signal for an excess of deletion events in the interval versus the flanks.

(Figures: coverage ratio vs in-interval deletion fraction; coverage ratio vs interval-to-flank deletion-fraction ratio)

INS — Insertion prevalence in windows (outside BP)

We examined the prevalence (fraction) of reads around the breakpoint with at least one insertion. Plots: left = histogram; right = KDE density (area ≈ 1 per group). TPs shift to high prevalence (≈0.7–1.0), while FNs cluster lower (≈0–0.4). A Mann–Whitney U p = 0.003 indicates a significant difference between TP and FN; with a higher median in TP (+0.156), insertion prevalence near the breakpoint is a useful discriminator of true INS in this dataset.

(Figures: INS prevalence in windows outside the BP — histogram and KDE density)

INS — Fraction of INS bases in windows (outside BP)

This metric is the median fraction of inserted bases among reads that contain ≥1 insertion in the breakpoint windows. Plots: Left = histogram (fraction of SVs per bin). Right = horizontal boxplot. Stats: Mann–Whitney U p = 4.61e-07; medians TP = 0.383 vs FN = 0.134 (Δ = +0.248). Takeaway: When present, TP insertions occupy a larger share of the read around the BP, whereas FN have small/dispersed insertions. This feature is coverage-robust and a strong discriminator of true INS.

(Figures: INS inserted-base fraction in windows outside the BP — histogram and boxplot)
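A minimal sketch of the TP-vs-FN test reported for both INS metrics, using scipy.stats.mannwhitneyu; the input vectors below are hypothetical per-SV values, not our data.

```python
# Sketch: Mann-Whitney U test between TP and FN values of an INS metric
# (insertion prevalence or inserted-base fraction), one value per SV.
import numpy as np
from scipy.stats import mannwhitneyu

def compare(tp_values, fn_values):
    """Test whether an INS metric differs between TP and FN calls."""
    stat, p = mannwhitneyu(tp_values, fn_values, alternative="two-sided")
    delta = np.median(tp_values) - np.median(fn_values)
    print(f"Mann-Whitney U p={p:.3g}, median shift (TP - FN)={delta:+.3f}")

# Hypothetical per-SV values for, e.g., insertion prevalence near the BP:
compare(np.array([0.8, 0.9, 0.75, 1.0]), np.array([0.1, 0.3, 0.2, 0.4]))
```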

The comparisons above contrast SV calling with all mapped reads (QV > 10) against analyses restricted to reads with Phred quality > 20.

Development

SV Calling Performance Comparison (different aligners)

We evaluate two aligners (minimap2, winnowmap) × three MAPQ filters (0/20/40) × five Sniffles2 profiles (very_sens, sensitive, balanced, strict, very_strict), benchmark each callset with Truvari, and optionally merge callsets with SURVIVOR to probe the precision/recall extremes: union (higher recall) and intersection (higher precision).
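A sketch of how this grid could be driven from Python. The profile flag sets below are stand-ins for our five presets (only flags already used elsewhere in this README are shown), and all paths are illustrative; the samtools, sniffles, and truvari invocations use real options.

```python
# Sketch: 2 aligners x 3 MAPQ filters x Sniffles2 profiles, each callset
# benchmarked with Truvari. One pre-made, indexed BAM per aligner is assumed.
import itertools
import subprocess

ALIGNERS = ["minimap2", "winnowmap"]
MAPQS = [0, 20, 40]
PROFILES = {  # placeholder presets; the real five differ
    "very_sens": ["--mosaic", "--mosaic-af-min", "0.01", "--minsupport", "2"],
    "balanced": ["--mosaic", "--minsupport", "3"],
    "strict": ["--mosaic", "--minsupport", "5"],
}

for aligner, mapq, (name, flags) in itertools.product(
        ALIGNERS, MAPQS, PROFILES.items()):
    bam = f"bam/{aligner}.bam"
    if mapq:  # apply the MAPQ gate up front with samtools
        gated = f"bam/{aligner}.q{mapq}.bam"
        subprocess.run(["samtools", "view", "-b", "-q", str(mapq),
                        bam, "-o", gated], check=True)
        subprocess.run(["samtools", "index", gated], check=True)
        bam = gated
    # Sniffles2 bgzips and indexes the VCF when the name ends in .vcf.gz
    vcf = f"calls/{aligner}.q{mapq}.{name}.vcf.gz"
    subprocess.run(["sniffles", "--input", bam, "--vcf", vcf, *flags],
                   check=True)
    subprocess.run(["truvari", "bench", "-b", "truth.vcf.gz", "-c", vcf,
                    "-o", f"bench/{aligner}.q{mapq}.{name}"], check=True)
```

A SURVIVOR merge of the resulting callsets (as in test 18 below) then gives the union/intersection extremes.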

Filtering options

We tested multiple filter configurations for Sniffles2:

| Test # | FP | FN | Precision | Recall | F1 | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 30 | 297 | 90.96% | 50.42% | 64.88% | default |
| 2 | 46 | 259 | 88.08% | 56.76% | 69.04% | `--mosaic-af-min 0.01` (USER) |
| 3 | 70 | 224 | 84.27% | 62.60% | 71.84% | `--mosaic-af-min 0.01 --minsupport 3` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) |
| 4 | 70 | 223 | 84.30% | 62.77% | 71.96% | `--mosaic-af-min 0.01 --minsupport 3` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL) |
| 5 | 34 | 314 | 89.34% | 47.58% | 62.09% | minimap2 lr:hq + default |
| 6 | 23 | 326 | 92.23% | 45.58% | 61.01% | reads Q20 + minimap2 lr:hq + default |
| 7 | 27 | 324 | 91.06% | 45.91% | 61.04% | winnowmap (parameters) + default |
| 8 | 36 | 322 | 88.50% | 46.24% | 60.75% | winnowmap (parameters) + default |
| 9 | 37 | 328 | 87.99% | 45.24% | 59.76% | winnowmap (parameters) + default |
| 10 | 38 | 327 | 87.74% | 45.41% | 59.85% | winnowmap (parameters) + default |
| 11 | 36 | 315 | 88.75% | 47.41% | 61.80% | winnowmap (parameters) + default |
| 12 | 70 | 222 | 84.34% | 62.94% | 72.08% | `--mosaic-af-min 0.01 --minsupport 3 --mapq 18` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL) |
| 13 | 34 | 320 | 89.13% | 46.78% | 61.18% | winnowmap (parameters) + default |
| 14* | 54 | 227 | 87.53% | 62.10% | 72.66% | `--mosaic-af-min 0.01 --minsupport 3 --mapq 18` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) if precise + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL) |
| 15* | 48 | 230 | 88.49% | 61.60% | 72.64% | `--mosaic-af-min 0.01 --minsupport 3 --mapq 18` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) if (precise, sd_len and sd_pos low) + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL) |
| 16 | 84 | 267 | 80% | 55% | 65% | `cuteSV -t 12 --genotype --report_readid --min_support 3 --min_size 50 bam/mosaic_data_chr21.bam ref/chr21.fasta cutesv/chr21.cutesv2.vcf ./cutesv/` |
| 17 | 26 | 382 | 89% | 36% | 52% | `cuteSV -t 12 --genotype --report_readid --min_support 3 --min_size 50 bam/mosaic_data_chr21.bam ref/chr21.fasta cutesv/chr21.cutesv2.vcf ./cutesv/` |
| 18 | 55 | 278 | 85% | 53% | 66% | `SURVIVOR merge samples.txt 1000 1 1 0 0 50` |
| 19 | 41 | 281 | 88.58% | 53.09% | 66.39% | reads Q20 + minimap2 lr:hq + `--mosaic-af-min 0.01 --minsupport 3 --mapq 18` (USER) + `mosaic_min_reads = 2` for DEL and INS (INTERNAL) if (precise, sd_len and sd_pos low) + minsvlen FILTER deactivated if SUPPORT >= 10 (INTERNAL) |
| 20 | 29 | 321 | 90.55% | 46.41% | 61.37% | reads Q20 + minimap2 lr:hq + default Sniffles |

Other SV calling tools

We tested other tools (cuteSV) and compared them to the benchmark set:

| Tool | Version | FP | FN | Precision | Recall | F1 | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cuteSV | v2.1.2 | 84 | 267 | 80% | 55% | 65% | All SVs included |
| cuteSV | v2.1.2 | 26 | 382 | 89% | 36% | 52% | Only PASS SVs included |
| SURVIVOR merge | v1.0.7 | 55 | 278 | 85% | 53% | 66% | Sniffles default merged with cuteSV (All) |

Comparing Sniffles FNs vs. cuteSV FNs (before and after filtering):

  • 251 FNs are shared between Sniffles and cuteSV (all SVs).
  • 42 FNs are shared between Sniffles and cuteSV (PASS SVs only).
  • 4 FNs are missed by Sniffles only.

Future perspectives

BAM-derived feature extraction (no re-calling Sniffles)

To study why certain truth SVs were missed (FN) and how detected SVs (TP) differ, we extract quantitative evidence directly from the BAM around each breakpoint, without re-calling Sniffles. We built a BAM-only feature extractor around breakpoints (with adaptive windows) that outputs per-SV aggregates and optional per-read details: depth at the edges and inside, read composition, robust MAPQ statistics, an identity proxy from NM, soft-clip burden and breakpoint-specific soft-clip fractions, split-read prevalence (SA:Z), indel load per kb, and strand balance. These tables are now the substrate for PCA/ML and for principled tuning of mapping/calling parameters, with a quick interpretation rubric for DEL/INS and for understanding typical FN signatures.
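A condensed sketch of such an extractor, assuming pysam. The window size, feature names, and thresholds are our illustrative choices; the actual extractor computes more than this (adaptive windows, per-read tables, breakpoint-specific soft-clip fractions).

```python
# Sketch: BAM-only per-SV features around a breakpoint interval. The keys
# mirror the feature list in the text; values are simple aggregates.
import numpy as np
import pysam

def sv_features(bam, chrom, start, end, win=500):
    reads = [r for r in bam.fetch(chrom, max(0, start - win), end + win)
             if not r.is_secondary]
    if not reads:
        return None
    mapqs = np.array([r.mapping_quality for r in reads])
    softclip = [sum(l for op, l in r.cigartuples if op == 4) / r.query_length
                for r in reads if r.cigartuples and r.query_length]
    identity = [1 - r.get_tag("NM") / r.query_alignment_length
                for r in reads if r.has_tag("NM") and r.query_alignment_length]
    indel_bp = [sum(l for op, l in r.cigartuples if op in (1, 2))  # I, D ops
                for r in reads if r.cigartuples]
    return {
        "n_reads": len(reads),
        "mapq_median": float(np.median(mapqs)),
        "mapq_frac_lt20": float(np.mean(mapqs < 20)),
        "softclip_frac_mean": float(np.mean(softclip)),
        "identity_mean": float(np.mean(identity)),         # NM-based proxy
        "split_read_prev": float(np.mean([r.has_tag("SA") for r in reads])),
        "indel_bp_per_kb": 1000 * np.sum(indel_bp) /
                           max(sum(r.query_alignment_length for r in reads), 1),
        "strand_balance": float(np.mean([r.is_reverse for r in reads])),
    }

with pysam.AlignmentFile("bam/mosaic_data_chr21.bam") as bam:
    print(sv_features(bam, "chr21", 10_000_000, 10_000_050))
```

One row per SV from this function, concatenated across the truth set, is the table fed to PCA/ML.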

Re-alignment to T2T to study FN

We re-aligned the reads to the T2T-CHM13 reference to evaluate the effect of the reference genome on SV calling, then examined a couple of regions classified as false-negative deletions by Sniffles with hg38.

  • The bottom figure shows a false-negative deletion from Sniffles/hg38 that is supported by only one read.
  • The top and middle figures show the hg38 and T2T regions, respectively.

(Figures: IGV screenshots of the hg38, T2T, and supporting-read views)
FN

(Figure: example false-negative region)

FP
  • Better clustering (+++)
  • Low support → keep only PRECISE calls → strict sd_len/sd_pos thresholds → stricter MAPQ!
  • Check split alignments more carefully
  • Two alleles?
  • Alignment error / repetitive sequence?

(Figures: clustering examples, two-allele cases, and alignment views)


🚀 Getting Started

(This section will be updated with instructions on how to set up the environment and run the project.)

  1. Clone the repository:
    git clone https://github.com/collaborativebioinformatics/Mosaicome.git
    
  2. Set up the environment:
    conda create --name mosaicome bioconda::winnowmap bioconda::sniffles bioconda::samtools bioconda::bcftools bioconda::truvari bioconda::minimap2
    conda activate mosaicome
