How to support multiple SNP-sets

Right now, Poseidon is officially flexible with respect to the SNP set of the genotype files, but the archives are not. All public archives currently support only the 1240K SNP set, or the HO subset of it.

It would be desirable to support more call-sets in the future, particularly as more shotgun sequencing is being done. Because our Poseidon-IDs are supposed to be unique per archive, we cannot place multiple packages with different call-sets into the same archive. That leaves two basic options to consider:

**Option 1**: Split the archives to allow call-sets other than 1240K, for example a "Community-Archive 1000G calls" or similar. This does not require any change to the schema and would be straightforward to do with the current infrastructure.
**Pros**: Simple, non-breaking and in principle immediately doable.
**Cons**: Metadata would be duplicated across archives, which causes redundancy and may require complex syncing infrastructure to keep Janno-files up to date across archives.
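
To make the syncing concern concrete, here is a minimal Python sketch of the kind of check a split-archive setup might need. It compares per-individual metadata from two archives and reports call-set-independent fields that have drifted apart. The in-memory layout (dicts keyed by `Poseidon_ID`), the function name `find_metadata_drift`, and the example IDs and values are all assumptions made for illustration. None of this is an existing Poseidon tool.

```python
def find_metadata_drift(archive_a, archive_b, callset_columns):
    """Return {poseidon_id: [column, ...]} for individuals present in both
    archives whose call-set-independent metadata differs."""
    drift = {}
    for pid in sorted(archive_a.keys() & archive_b.keys()):
        row_a, row_b = archive_a[pid], archive_b[pid]
        differing = [
            col
            for col in sorted(row_a.keys() | row_b.keys())
            # Differences in call-set-specific columns are expected, so skip them.
            if col not in callset_columns and row_a.get(col) != row_b.get(col)
        ]
        if differing:
            drift[pid] = differing
    return drift


# Columns named in this issue as specific to the call-set.
CALLSET_COLUMNS = {
    "Genotype_Ploidy",
    "Data_Preparation_Pipeline_URL",
    "Nr_SNPs",
    "Coverage_on_Target_SNPs",
}

# Made-up rows standing in for the Janno-files of two split archives.
archive_1240k = {
    "Ind001": {"Country": "Germany", "Nr_SNPs": "600000"},
    "Ind002": {"Country": "Spain", "Nr_SNPs": "450000"},
}
archive_1000g = {
    "Ind001": {"Country": "Germany", "Nr_SNPs": "3000000"},
    "Ind002": {"Country": "Portugal", "Nr_SNPs": "2500000"},
}

print(find_metadata_drift(archive_1240k, archive_1000g, CALLSET_COLUMNS))
```

In this toy example only `Ind002` is reported, because its `Country` differs between the archives, while the differing `Nr_SNPs` values are ignored as call-set specific.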

**Option 2**: Extend the schema to allow for multiple genotype datasets within one package.
This is currently not possible, but is in principle not hard to implement. We would simply allow the YAML schema to list multiple genotype datasets, each with its own `snpset` and separate genotype files.
There is one catch: The Janno-file contains several columns that are specific to the call-set (`Genotype_Ploidy`, `Data_Preparation_Pipeline_URL`, `Nr_SNPs`, `Coverage_on_Target_SNPs`). These can easily be made list-columns with one entry per call-set, which would be non-breaking.
**Pros**: Would be a non-redundant solution with respect to package metadata and Janno-files, as these would not be duplicated.
**Cons**: Would require some additional implementation in the server and in trident's `list` and `forge` functionality. A minimal solution would be to ignore any call-set after the first in trident, and see how we can support secondary call-sets later on. However, at least with `fetch`, large-scale adoption of hosting multiple call-sets would result in much larger package downloads. So perhaps the server software should be changed to create multiple zip-files for download, with or without secondary call-sets.
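
As a rough illustration of Option 2, the following Python sketch models a package with several genotype datasets and call-set-specific Janno columns stored as list-columns, one entry per dataset in the same order. It also shows the "minimal solution" of defaulting to the first call-set. The key names (`genotypeData` as a list, `snpSet`, `genoFile`), the `;` separator, the helper names and the example values are assumptions for illustration only. They are not an agreed schema and not trident's actual behaviour.

```python
def split_list_column(value, sep=";"):
    """Split a list-column cell into its entries; an empty cell gives []."""
    return [v.strip() for v in value.split(sep)] if value.strip() else []


def select_callset(package, janno_row, callset_columns, index=0):
    """Return the genotype dataset at `index` together with the matching
    entry of every call-set-specific Janno column."""
    dataset = package["genotypeData"][index]
    values = {}
    for col in callset_columns:
        entries = split_list_column(janno_row.get(col, ""))
        # A cell with fewer entries than call-sets (e.g. today's single
        # value) simply yields None for the later call-sets.
        values[col] = entries[index] if index < len(entries) else None
    return dataset, values


# Hypothetical package description with two call-sets.
package = {
    "title": "Example_Package",
    "genotypeData": [
        {"snpSet": "1240K", "genoFile": "example_1240K.geno"},
        {"snpSet": "1000G", "genoFile": "example_1000G.geno"},
    ],
}

# Hypothetical Janno row; list entries follow the order of genotypeData.
janno_row = {
    "Poseidon_ID": "Ind001",
    "Genotype_Ploidy": "haploid;diploid",
    "Nr_SNPs": "600000;3000000",
}

columns = ["Genotype_Ploidy", "Nr_SNPs"]

# Minimal solution: index 0 is the default, later call-sets are ignored.
primary, primary_values = select_callset(package, janno_row, columns)
print(primary["snpSet"], primary_values)

# A later extension could address secondary call-sets explicitly.
secondary, secondary_values = select_callset(package, janno_row, columns, index=1)
print(secondary["snpSet"], secondary_values)
```

Because a single-valued cell is just a list of length one, existing Janno-files would still parse under this layout, which is the sense in which the change would be non-breaking.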


