Description
Right now, Poseidon is officially flexible with respect to the SNP set of the genotype files, but the archives are not. All public archives currently support only the 1240K SNP set or its HO subset.
It would be desirable to support more call-sets in the future, particularly as more shotgun sequencing is being done. Because our Poseidon-IDs are supposed to be unique per archive, we cannot place multiple packages with different call-sets for the same individuals into the same archive. That leaves two basic options to consider:
Option 1: Split archives to allow call-sets other than 1240K, for example a "Community-Archive 1000G calls" or similar. This does not require any change in the schema and would be straightforward to do with the current infrastructure.
Pros: Simple, non-breaking and in principle immediately doable.
Cons: Metadata will be duplicated across archives. This causes redundancy and may require complex syncing infrastructure to keep Janno files up to date across archives.
Option 2: Extend the schema to allow for multiple genotype-datasets within one package
This is currently not possible, but in principle not hard to implement. We would simply allow the YAML schema to list multiple genotype datasets, each with its own snpSet and separate genotype files.
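To make the idea concrete, here is a rough sketch (not a schema proposal) of how a reader could accept both the current single `genotypeData` entry and a list of entries. The package contents are shown as Python dictionaries standing in for the parsed POSEIDON.yml. The file names, the second dataset and the helper `genotype_datasets` are made up for illustration.

```python
from typing import Any


def genotype_datasets(package: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the genotype datasets of a package as a list.

    Accepts the current layout (one mapping) and the hypothetical
    extended layout (a list of mappings), so old packages stay valid.
    """
    entry = package["genotypeData"]
    if isinstance(entry, dict):
        return [entry]
    return list(entry)


# Current layout: exactly one genotype dataset per package.
current_style = {
    "title": "Example_Package",
    "genotypeData": {
        "format": "PLINK",
        "genoFile": "example.bed",
        "snpFile": "example.bim",
        "indFile": "example.fam",
        "snpSet": "1240K",
    },
}

# Hypothetical extended layout: a list, with the primary call-set first.
proposed_style = {
    "title": "Example_Package",
    "genotypeData": [
        current_style["genotypeData"],
        {
            "format": "PLINK",
            "genoFile": "example_1000G.bed",  # illustrative file names
            "snpFile": "example_1000G.bim",
            "indFile": "example_1000G.fam",
            "snpSet": "Other",
        },
    ],
}

for pkg in (current_style, proposed_style):
    print([d["snpSet"] for d in genotype_datasets(pkg)])
```

Treating the single-mapping form as a list of length one is what would keep existing packages readable without modification.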
There is one catch: the Janno file contains several columns that are specific to the call-set (Genotype_Ploidy, Data_Preparation_Pipeline_URL, Nr_SNPs, Coverage_on_Target_SNPs). These could easily be made list columns, with one entry per call-set, which would be a non-breaking change.
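As a sketch of why list columns would be non-breaking, the following splits the four call-set-specific columns of one Janno row into one record per call-set. It assumes `;` as the list separator and `n/a` as the placeholder for missing entries. The row values and the helper `split_call_set_columns` are invented for illustration.

```python
CALL_SET_COLUMNS = (
    "Genotype_Ploidy",
    "Data_Preparation_Pipeline_URL",
    "Nr_SNPs",
    "Coverage_on_Target_SNPs",
)


def split_call_set_columns(row: dict[str, str], sep: str = ";") -> list[dict[str, str]]:
    """Split call-set-specific Janno columns into one record per call-set.

    A row with single values (today's format) yields exactly one record.
    Columns with fewer entries than others are padded with "n/a" (assumption).
    """
    columns = {name: row.get(name, "n/a").split(sep) for name in CALL_SET_COLUMNS}
    n_call_sets = max(len(values) for values in columns.values())
    records = []
    for i in range(n_call_sets):
        records.append(
            {
                name: values[i].strip() if i < len(values) else "n/a"
                for name, values in columns.items()
            }
        )
    return records


# Today's format: one value per column.
legacy_row = {
    "Poseidon_ID": "Sample001",
    "Genotype_Ploidy": "haploid",
    "Data_Preparation_Pipeline_URL": "https://example.org/pipeline-a",
    "Nr_SNPs": "850000",
    "Coverage_on_Target_SNPs": "1.2",
}

# Hypothetical extended format: one entry per call-set, primary first.
extended_row = {
    "Poseidon_ID": "Sample001",
    "Genotype_Ploidy": "haploid;diploid",
    "Data_Preparation_Pipeline_URL": "https://example.org/pipeline-a;https://example.org/pipeline-b",
    "Nr_SNPs": "850000;4500000",
    "Coverage_on_Target_SNPs": "1.2;0.9",
}

print(len(split_call_set_columns(legacy_row)))
print(split_call_set_columns(extended_row)[1]["Nr_SNPs"])
```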
Pros: Would be a non-redundant solution with respect to package-metadata and Janno-files, as these would not be duplicated.
Cons: Would require some additional implementation in the server and in the trident list and forge functionality. A minimal solution would be to have trident ignore any call-set after the first, and to work out support for secondary call-sets later on. However, at least with fetch, large-scale adoption of hosting multiple call-sets would result in much larger package downloads. So perhaps the server software should be changed to create multiple zip files for download, with or without secondary call-sets.
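The minimal solution and the download idea could look roughly like this. It is only an illustration in Python, not how the server or trident are actually implemented. The function names, file names and the `include_secondary` switch are hypothetical. The first dataset in the list is treated as the primary call-set.

```python
import io
import zipfile


def files_for_download(package_files, datasets, include_secondary=False):
    """List the files to ship: metadata plus the primary call-set,
    and the secondary call-sets only on request."""
    selected = datasets if include_secondary else datasets[:1]
    names = list(package_files)
    for ds in selected:
        names += [ds["genoFile"], ds["snpFile"], ds["indFile"]]
    return names


def build_zip(contents, names):
    """Pack the named files (name -> bytes) into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            archive.writestr(name, contents[name])
    return buffer.getvalue()


# Illustrative package with a primary and one secondary call-set.
datasets = [
    {"genoFile": "ex.bed", "snpFile": "ex.bim", "indFile": "ex.fam"},
    {"genoFile": "ex_1000G.bed", "snpFile": "ex_1000G.bim", "indFile": "ex_1000G.fam"},
]
metadata_files = ["POSEIDON.yml", "ex.janno"]
all_names = files_for_download(metadata_files, datasets, include_secondary=True)
contents = {name: b"placeholder" for name in all_names}

small_zip = build_zip(contents, files_for_download(metadata_files, datasets))
full_zip = build_zip(contents, all_names)
print(zipfile.ZipFile(io.BytesIO(small_zip)).namelist())
print(zipfile.ZipFile(io.BytesIO(full_zip)).namelist())
```

With something like this, the default download would stay the same size as today, and the larger archive would only be produced for users who ask for secondary call-sets.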