How to support multiple SNP-sets #77

@stschiff

Right now, Poseidon is officially flexible with respect to the SNP set of the genotype files, but the archives are not. All public archives currently support only the 1240K format, or the HO subset of it.

It would be desirable in the future to support more call sets, particularly as more shotgun sequencing is being done. Because our Poseidon-IDs are supposed to be unique per archive, multiple packages with different call sets cannot be placed into the same archive. That leaves two basic options to consider:

Option 1: Split archives to allow call sets other than 1240K, for example a "Community-Archive 1000G calls" or similar. This does not require any change to the schema and would be straightforward to do with the current infrastructure.
Pros: Simple, non-breaking and in principle immediately doable.
Cons: Metadata will be duplicated across archives, which causes redundancy and may require complex syncing infrastructure to keep Janno files updated across archives.

Option 2: Extend the schema to allow for multiple genotype datasets within one package.
This is currently not possible, but is in principle not hard to implement. We would simply allow the YAML schema to list multiple genotype datasets, each with its own snpset and separate genotype files.
There is one catch: the Janno file contains several columns that are specific to the call set (Genotype_Ploidy, Data_Preparation_Pipeline_URL, Nr_SNPs, Coverage_on_Target_SNPs). These could easily be made list columns, which would be non-breaking.
Pros: A non-redundant solution with respect to package metadata and Janno files, as these would not be duplicated.
Cons: Would require some additional implementation in the server and in trident's list and forge functionality. A minimal solution would be to have trident ignore any call set after the first, and work out support for secondary call sets later on. However, at least with fetch, large-scale adoption of hosting multiple call sets would result in much larger package downloads. So perhaps the server software should be changed to create multiple zip files for download, with or without secondary call sets.
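
To make Option 2 a bit more concrete, here is a rough Python sketch of how list columns could line up with multiple genotype datasets. Everything in it is an assumption for illustration: a `genotypeData` list is not part of the current schema, the file names and values are made up, `;` is assumed as the list separator, and `values_for_callset` and `primary_callset` are hypothetical helpers, not trident functions.

```python
# Hypothetical package description: "genotypeData" as a LIST is an assumption,
# not something the current schema allows. File names are invented.
package = {
    "title": "Example_Package",
    "genotypeData": [
        {"snpSet": "1240K", "genoFile": "example_1240K.geno"},
        {"snpSet": "Other", "genoFile": "example_1000G.geno"},
    ],
}

# One Janno row where the call-set-specific columns hold one entry per
# genotype dataset, in the same order as the list above (assumed ";" separator).
janno_row = {
    "Poseidon_ID": "Sample1",
    "Genotype_Ploidy": "haploid;diploid",
    "Nr_SNPs": "600000;5000000",
}

# The columns named in this issue as being specific to the call set.
CALLSET_COLUMNS = (
    "Genotype_Ploidy",
    "Data_Preparation_Pipeline_URL",
    "Nr_SNPs",
    "Coverage_on_Target_SNPs",
)


def values_for_callset(row, index, sep=";"):
    """Return the call-set-specific Janno values for the dataset at `index`.

    A column with fewer entries than `index + 1` yields None, so an existing
    single-valued Janno file still works and describes the first call set only.
    """
    result = {}
    for col in CALLSET_COLUMNS:
        raw = row.get(col)
        if not raw:
            continue  # column absent or empty in this row
        parts = raw.split(sep)
        result[col] = parts[index].strip() if index < len(parts) else None
    return result


def primary_callset(pkg):
    """Minimal solution: only look at the first call set, ignore the rest."""
    return pkg["genotypeData"][0]
```

With this layout, `values_for_callset(janno_row, 0)` would describe the 1240K calls and `values_for_callset(janno_row, 1)` the secondary call set, while a tool that only calls `primary_callset` behaves as it does today.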

Labels: stewing (for future consideration), wontfix (This will not be worked on)