Releases · poseidon-framework/poseidon-hs

20 Aug 11:03

v1.6.7.3

987b8d7

Release v1.6.7.3

This is a minor release with few changes in the behaviour of trident. It mainly includes internal alterations that allow for better error reporting. On the user side there are three notable changes:

Better reporting of parsing errors for .ssf files

Every .ssf file column is now represented by its own data type, as has already been the case for .janno columns. This allows for more precise reporting of issues: trident now points exactly to the broken column in case something is off.

More extensive warning mechanism for .janno and .ssf entries

We introduced a mechanism to report not only outright parsing failures on a per-column basis for .janno and .ssf files, but also minor deviations that make a given value not wrong per se, but suspicious. These are now reported as warnings, while the respective Poseidon package is still read. The initial set of such checks in this release is small, but it is now easy to add more in the future.

Loosened requirements on accession ID columns in .ssf file

This release finally does away with the hard requirements on sample_accession, study_accession, and run_accession in the .ssf file reading process. These requirements were based on a particularly strict reading of the Poseidon schema. Now unexpected accession IDs only raise a warning.

25 Jun 12:20

github-actions

v1.6.7.1

f82e697

Release v1.6.7.1

This release finally brings two long-anticipated features: VCF writing support and an HTML API for serve. It also includes some minor bugfixes.

Writing support for VCF files

v1.5.7.0 added experimental reading support for .vcf files. In this release trident finally learns to also write them as an output of forge and genoconvert. This new output option is available with --outFormat VCF.

VCF is a rich format (as specified here) and trident currently uses only the features relevant for the genotype data typically handled by Poseidon. In particular, as trident must be able to convert from Plink and Eigenstrat, many fields that are typically expected in VCF files (such as read- and allelic depths or genotype likelihoods) are not written.

On the other hand, VCF files written by trident contain the extra headers ##group_names=Group1,Group2,... and ##genetic_sex=F,F,M,U,... to encode information typically not stored in VCF. This is to ensure compatibility with the PLINK and EIGENSTRAT data formats. trident applies modified consistency checks between Ind- and Geno-file for VCFs, since VCF files need not define these custom header fields.
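As an illustration of the two extra headers described above, here is a minimal Python sketch that collects them from the meta-information lines of a VCF file. The function name `parse_trident_vcf_headers` and the example lines are hypothetical, and this is not how trident itself reads the headers; it only assumes the `##key=value` layout quoted above.

```python
def parse_trident_vcf_headers(lines):
    """Collect ##group_names and ##genetic_sex values from an iterable of VCF lines."""
    wanted = ("group_names", "genetic_sex")
    found = {}
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM column header line
        key, sep, value = line[2:].rstrip("\n").partition("=")
        if sep and key in wanted:
            found[key] = value.split(",")
    return found


# Hypothetical header of a two-sample VCF file
example_lines = [
    "##fileformat=VCFv4.2\n",
    "##group_names=Group1,Group2\n",
    "##genetic_sex=F,M\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\tSample2\n",
]
headers = parse_trident_vcf_headers(example_lines)
print(headers)
```

A VCF file without these headers simply yields an empty dictionary here, which mirrors the point that such files do not have to define them.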

Please note that the VCF format support is still not specified in the Poseidon schema version this trident version supports (v2.7.1), so the feature continues to be experimental.

HTML API for the web server implementation

trident includes a web server to host Poseidon packages and relevant meta-information. It can be started with the subcommand serve. The central Poseidon server at https://server.poseidon-adna.org is nothing but a public instance of serve with access to the public package archives. Previously this web server provided only context data through a JSON API and allowed downloading packages as .zip archives (these interfaces are used by list --remote and fetch).

This release now adds HTML output, so a human-readable website, to the server's API. The central, public version is available here, but by running serve locally one can just as well host such a website for a private package archive.

serve can still be started with trident serve -d <name_of_archive>=<path/to/archive>, but a new --archiveConfigFile argument now allows reading a more complex configuration in YML format.

More info from the POSEIDON.yml file in the `list` output

Added a new option --fullOutput for list --packages to extend the output with additional information from the underlying packages' POSEIDON.yml files (file names, contributors, etc.).

Fixed two bugs in `rectify`

Fixed a small bug that prevented calculation of checksums for genotype data in rectify, and another one that prevented trident from reading packages with a wrong individual file (.ind/.fam) checksum even in rectify, where this should be possible.

19 Jan 17:01

github-actions

v1.6.2.1

c574a10

Release v1.6.2.1

This is a bigger release with various new features and improvements. It is technically breaking, because a minor, redundant argument of genoconvert was removed.

Writing support for gzipped genotype data

After reading support for zipped data was added in v1.5.7.0, this release introduces the complementary writing feature for EIGENSTRAT and PLINK files in genoconvert and forge. Both commands get a new option -z which creates gzipped output.

  -z,--zip                 Should the resulting genotype- and snp-files be
                           gzipped?

Note that this feature includes a smart way of handling already available files to not overwrite them, but still consider them when updating a package's POSEIDON.yml file. -z is also usable with unpackaged genotype data (-p, --onlyGeno).

Future versions of the Poseidon package schema will formally specify this feature.

Bibliography information in `list` and the Web-API

The list subcommand now supports a new view (next to --packages, --groups and --individuals): --bibliography provides a tabular overview of publications in a package repository.

$ trident list -d 2010_RasmussenNature --bibliography
...
.---------------------.--------------------------------------------------------------.-----------------------.------.---------------------------.---------------.
|       BibKey        |                            Title                             |        Author         | Year |            DOI            | Nr of samples |
:=====================:==============================================================:=======================:======:===========================:===============:
| AADR                | The Allen Ancient DNA Resource (AADR): A curated compendium… | Swapan Mallick et al. | 2023 | 10.1101/2023.04.06.535797 | 1             |
| AADRv424            | The Allen Ancient DNA Resource (AADR): A curated compendium… | S Mallick and D Reich | 2023 | 10.7910/DVN/FFIDCW        | 1             |
| RasmussenNature2010 | Ancient human genome sequence of an extinct Palaeo-Eskimo    | M Rasmussen et al.    | 2010 | 10.1038/nature08835       | 1             |
'---------------------'--------------------------------------------------------------'-----------------------'------'---------------------------'---------------'

Additional fields from the .bib file can be added to this table with -b|--bibField ... (just as -j|--jannoColumn ... for --individuals). --fullBib adds everything that is available (just as --fullJanno). As usual, tab-separated output can be requested with --raw for derived analyses on the command line.

Correspondingly, the Web-API supports a new endpoint /bibliography to serve bibliography information via HTTP in JSON format. The optional query argument additionalJannoColumns=... allows requesting extra fields here.

Remove empty .janno columns with `rectify`

The rectify subcommand was upgraded with a first option to manipulate .janno files in one or multiple packages: --jannoRemoveEmpty. This removes empty columns from .janno files, that is, columns that only feature empty strings or n/a values.

  --jannoRemoveEmpty       Reorder the .janno file and remove empty colums.
                           Remember to pair this option with --checksumJanno to
                           also update the checksum.
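To make the notion of an "empty column" concrete, the following Python sketch drops every column of a tab-separated table whose cells are all empty strings or n/a. It is only an illustration under that stated definition: the function name `remove_empty_columns` and the demo table are hypothetical, and unlike --jannoRemoveEmpty the sketch does not reorder columns or touch checksums.

```python
import csv
import io


def remove_empty_columns(tsv_text, empty_values=("", "n/a")):
    """Return the TSV text without columns that hold only empty values."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    # Keep a column if at least one data row has a non-empty value in it
    keep = [
        i for i in range(len(header))
        if any(row[i] not in empty_values for row in body)
    ]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[i] for i in keep])
    return out.getvalue()


# Hypothetical three-column table; the Note column is entirely empty
demo = "Poseidon_ID\tCountry\tNote\nS1\tGreenland\t\nS2\tn/a\tn/a\n"
cleaned = remove_empty_columns(demo)
print(cleaned)
```

The Country column survives because one sample has a value, even though the other cell is n/a.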

With this change came a rewrite of the way trident fills empty fields with n/a when writing .janno and .ssf files. This behaviour now also affects the output of list!

Removed redundant `--onlyGeno` from `genoconvert`

We realized that --onlyGeno in genoconvert had the same effect as -o if a different output directory is chosen. We therefore decided to remove this argument and improve the documentation of -o:

  -o,--outPackagePath DIR  Path for the converted genotype files to be written
                           to. If a path is provided, only the converted
                           genotype files are written out, with no change of the
                           original package. If no path is provided, genotype
                           files will be converted in-place, including a change
                           in the POSEIDON.yml file to yield an updated valid
                           package (default: Nothing)

Bug fixes and technical changes

We fixed two bugs that broke the long-form genotype data input option (with --genoFile + --snpFile + ...). They were accidentally introduced with the recent interface changes for v1.5.7.0. This input interface should now be fully functional again.

We finally switched to a new compiler version (GHC 9.6.6) and a new stackage resolver version (lts-22.43). This required some minor adjustments in the server code, but should not have any user-facing consequences.

04 Nov 15:42

github-actions

v1.5.7.3

49cc87f

Release v1.5.7.3

This patch release fixes three minor bugs, some of which were accidentally introduced with the big changes in v1.5.7.0.

Fixed a bug in the .janno reading triggered by trailing à characters.
Reverted unspecified behaviour: 0 is again allowed in the Nr_SNPs .janno column.
Fixed a bug introduced in v1.5.5.0, where command line input using the -p option would not behave correctly if the input files have multiple file endings, separated by dots.

26 Oct 18:38

github-actions

v1.5.7.0

6d6e8f3

Release v1.5.7.0

Warning

On 2024/11/06 we realized that this release includes a breaking change that is not documented below.
The command line input interface for unpackaged genotype data was modified from previously --inFormat EIGENSTRAT|PLINK + --genoFile + --snpFile + --indFile to now --genoFile + --snpFile + --indFile and --bedFile + --bimFile + --famFile. So the format selection with the --inFormat argument was removed and replaced with separate file selectors for EIGENSTRAT and PLINK data.
This affects all trident subcommands that allow reading of unpackaged genotype data, namely init, forge, genoconvert and validate.

This release further improves .janno parsing error messages and adds reading support for gzipped PLINK (.bed and .bim) and EIGENSTRAT (.geno and .snp) files. We also added (experimental) support for reading VCF files.

Better .janno error messages

Working with Poseidon packages generally involves reading and validation of .janno files. trident parses them carefully and reports structural issues that compromise their machine-readability. So far the error reports generally only included the line and type of an offending entry. This sometimes made it hard to determine exactly which column is broken. For this release we introduced individual data types for all specified .janno columns, which allows more precise error messages.

To demonstrate this we modified an existing .janno file in the Poseidon community archive (2012_MeyerScience) and broke some of its columns. We added non-UTF8 encoded characters in the Relation_Note column of line 2, a trailing ; in the Coverage_on_Target_SNPs column of line 3, and a leading x to the Latitude column of line 7.

Here is how these issues were previously reported and how they are shown now:

[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 2:
-parse error (Failed reading: conversion error: Cannot decode byte '\x80': Data.Text.Encoding: Invalid UTF-8 stream)
+parse error (Failed reading: conversion error: Cannot decode byte '\x80': Data.Text.Encoding: Invalid UTF-8 stream in column Relation_Note)
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 3:
-parse error in one column (expected data type: Double, broken value: "32.12;", problematic characters: ";")
+parse error (Failed reading: conversion error: Coverage_on_Target_SNPs can not be converted to Double, because of a trailing ";")
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 7:
-parse error (Failed reading: conversion error: expected Double, got "x18.93726" (Failed reading: takeWhile1))
+parse error (Failed reading: conversion error: Latitude can not be converted to Double because input does not start with a digit)

The error messages now include the relevant column name and are more concrete and easy to understand.

Reading support for gzipped genotype data

Although not yet part of the Poseidon 2.7.1 standard, Poseidon packages can now contain gzipped genotype files. Specifically, for EIGENSTRAT-formatted genotype data, the genotype matrix file (.geno) and the snp-list file (.snp) can now also be zipped. This strictly requires file endings with .gz, so .geno.gz and .snp.gz, respectively. Similarly, for PLINK-formatted genotype data, we now also accept .bed.gz and .bim.gz. Any such files with the gz file ending are assumed to be gzipped, and are decoded on the fly using stream-processing. Gzipped and unzipped files can also be mixed within the same package.
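The rule "a .gz ending means gzipped, decode on the fly" can be sketched in Python as follows. This is an illustration of the idea only, not trident's Haskell implementation; the helper names `open_genotype_text` and `count_lines` and the demo file contents are hypothetical, and the sketch covers line-based text files such as .snp or .bim rather than binary .bed data.

```python
import gzip
import os
import tempfile


def open_genotype_text(path):
    """Open a line-based genotype file, decompressing lazily if it ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def count_lines(path):
    """Stream through the file line by line without loading it fully into memory."""
    with open_genotype_text(path) as handle:
        return sum(1 for _ in handle)


# Write the same hypothetical two-SNP file once plain and once gzipped
tmpdir = tempfile.mkdtemp()
plain = os.path.join(tmpdir, "demo.snp")
zipped = os.path.join(tmpdir, "demo.snp.gz")
content = "rs1 1 0.0 100 A G\nrs2 1 0.0 200 C T\n"
with open(plain, "w", encoding="utf-8") as f:
    f.write(content)
with gzip.open(zipped, "wt", encoding="utf-8") as f:
    f.write(content)

print(count_lines(plain), count_lines(zipped))
```

Because the decision depends only on the file ending, zipped and unzipped files can sit next to each other and be read through the same code path.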

For commands that support the --genoOne option (init, forge and genoconvert), note that we make some assumptions, which are summarised in the help text for the option:

 -p,--genoOne FILE        One of the input genotype data files. Expects .bed,
                          .bed.gz, .bim, .bim.gz or .fam for PLINK, or .geno,
                          .geno.gz, .snp, .snp.gz or .ind for EIGENSTRAT. The
                          other files must be in the same directory and must
                          have the same base name. If a gzipped file is given,
                          it is assumed that the file pairs (.geno.gz, .snp.gz)
                          or (.bim.gz, .bed.gz) are both zipped, but not the
                          .fam or .ind file. If a .ind or .fam file is given,
                          it is assumed that none of the file triples is
                          zipped. For VCF please see option --vcfFile
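The assumptions in this help text can be written down as a small Python sketch that derives the expected file triple from one given file. The function name `sibling_files` is hypothetical, and the behaviour for an unzipped .geno, .snp, .bed or .bim input (all three files assumed unzipped) is an assumption of this sketch, since the help text does not spell it out.

```python
def sibling_files(geno_one):
    """Derive (genotype, snp, individual) file names from one given file."""
    eigenstrat = (".geno", ".snp", ".ind")
    plink = (".bed", ".bim", ".fam")
    zipped = geno_one.endswith(".gz")
    stem = geno_one[:-3] if zipped else geno_one
    for geno_ext, snp_ext, ind_ext in (eigenstrat, plink):
        for ext in (geno_ext, snp_ext, ind_ext):
            if stem.endswith(ext):
                if zipped and ext == ind_ext:
                    # The help text only describes zipped genotype and snp files
                    raise ValueError("gzipped .ind/.fam files are not expected")
                base = stem[: -len(ext)]  # only the last ending is stripped
                gz = ".gz" if zipped else ""
                return (base + geno_ext + gz, base + snp_ext + gz, base + ind_ext)
    raise ValueError(f"unrecognised genotype file ending: {geno_one}")


print(sibling_files("data/pac.geno.gz"))
print(sibling_files("pac.fam"))
print(sibling_files("my.pac.v1.bim.gz"))
```

Only the final ending is removed to find the base name, so base names that themselves contain dots stay intact.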

At this point, genoconvert and forge do not support writing of gzipped files. This will be added in the future.

VCF support for genotype data

Although not yet part of the Poseidon 2.7.1 standard, Poseidon packages can now contain VCF (Variant Call Format) files as genotype data, optionally gzipped. In contrast to EIGENSTRAT and PLINK format, which require triples of files, the VCF format requires just one file with ending .vcf or .vcf.gz. VCF files contain sample names, but no information about genetic sex or group names. This information is usually provided in .janno files, so there is no loss of information in Poseidon packages. For trident init, which constructs a minimal .janno file from the genotype file, we set the Genetic_Sex column to "U", and the Group_Name column to "unknown".

The VCF file format is very flexible and can encode a large amount of information (see https://samtools.github.io/hts-specs/VCFv4.2.pdf). We do not consider our parsing of VCF files to be complete. The feature is for now experimental, since future users may encounter valid VCF files that cause parsing errors in edge cases. Do not hesitate to file an issue in such a case: https://github.com/poseidon-framework/poseidon-hs/issues.

At this point, genoconvert and forge do not support writing of VCF files. This will be added in the future.

12 Jul 08:02

github-actions

v1.5.4.0

83b6cdf

Release v1.5.4.0

This bigger release adds a number of useful features to trident, some of them long requested. The highlights are ordered output for forge, a way to preserve key information if forge is applied to a singular source package, a new Web-API option to return the content of all available .janno columns, and better error messages for common trident issues.

Order `forge` output with `--ordered`

The order of samples in a Poseidon package created with trident forge depends on the order in which the relevant source packages are discovered by trident (e.g. when it crawls for packages in the -d base directories) and then the sample order within these packages. This mechanism did not allow for any convenient way to manually set the output order.

v1.5.4.0 adds a new option --ordered, which causes trident to output the resulting package with samples ordered according to the selection in -f or --forgeFile. This works through an alternative, slower sample selection algorithm that loops through the list of entities and checks for each entity which samples it adds or removes respectively from the final selection.

For simple, positive selection, packages, groups and samples are added as expected. Negative selection removes samples from the list again. If an entity is selected twice via positive selection, then its first occurrence is considered for the ordering.
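The ordering rules just described can be sketched in Python as a loop over the selection entities. This is a simplified model, not trident's actual algorithm: the function `ordered_selection`, the `(include, kind, name)` tuples and the sample dictionaries are hypothetical stand-ins for the forge selection language, and a sample that is removed and later selected again is assumed to be appended at the end.

```python
def ordered_selection(entities, samples):
    """Return sample IDs in the order implied by a list of selection entities."""
    selected = []
    for include, kind, name in entities:
        matching = [s["id"] for s in samples if s[kind] == name]
        if include:
            for sample_id in matching:
                if sample_id not in selected:  # first occurrence fixes the position
                    selected.append(sample_id)
        else:
            # Negative selection removes matching samples again
            selected = [sid for sid in selected if sid not in matching]
    return selected


# Hypothetical samples from two packages
samples = [
    {"id": "A1", "group": "GroupA", "package": "Pac1"},
    {"id": "A2", "group": "GroupA", "package": "Pac1"},
    {"id": "B1", "group": "GroupB", "package": "Pac2"},
]
entities = [
    (True, "id", "B1"),         # select sample B1 first
    (True, "package", "Pac1"),  # then everything in Pac1
    (False, "id", "A2"),        # then remove A2 again
    (True, "group", "GroupB"),  # B1 is already selected, position unchanged
]
order = ordered_selection(entities, samples)
print(order)
```

Checking every entity against every sample is what makes this kind of selection slower than an unordered one, in exchange for a predictable output order.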

Preserve the source package in `forge` with `--preservePyml`

For the specific task of subsetting a singular, existing Poseidon package it can be useful to preserve some fields of the POSEIDON.yml file of the source package, as well as supplementary information in the README.md and the CHANGELOG.md file. These are typically discarded by forge, but can now be copied over to the output package with the new --preservePyml output mode. Naturally this only works with a single source package!

--preservePyml specifically preserves the following POSEIDON.yml fields:

description
contributor
packageVersion
lastModified
readmeFile
changelogFile

Note that this does not include the package title, which can be easily set to be identical to the source with -n or -o if it is desired. The poseidonVersion field is also not copied, because trident can only ever produce output packages with the latest Poseidon schema version.

While implementing this we clearly separated the different forge output modes (--onlyGeno, --minimal, --preservePyml and the default) and made them mutually exclusive. We did so to avoid an increasingly complex set of interactions between them for the future.

One particular application of --preservePyml is the reordering of samples in an existing Poseidon package MyPac with the new --ordered flag. We suggest the following workflow for this application:

Generate a --forgeFile with the desired order of the samples in MyPac. This can be done manually or with any suitable tool. Here is an example, where we employ qjanno to generate a forge selection so that the samples are ordered alphabetically by their Poseidon_ID:

qjanno "SELECT '<'||Poseidon_ID||'>' FROM d(MyPac) ORDER BY Poseidon_ID" --raw --noOutHeader > myOrder.txt

Use trident forge with --ordered and --preservePyml to create the package with the specified order:

trident forge -d MyPac --forgeFile myOrder.txt -o MyPac2 --ordered --preservePyml

Apply trident rectify to increment the package version number and document the reordering:

trident rectify -d MyPac2 --packageVersion Minor --logText "reordered the samples alphabetically by Poseidon_ID"

MyPac2 then acts as a stand-in replacement for MyPac that only differs in the order of samples (and maybe the order of variables/fields in the POSEIDON.yml, .janno, .ssf or .bib files). This workflow is not as convenient as in-place reordering would be -- but much safer.

Request all `.janno` columns in `list` and the Web-API

trident list --individuals gives access to per-sample information for Poseidon packages on the command line. With the -j option arbitrary additional columns from the .janno files can be appended to the output. Here, for example, the Country and the Genetic_Sex columns:

 trident list -d 2010_RasmussenNature --individuals -j "Country" -j "Genetic_Sex"

.------------.---------------------.----------------------.----------------.-----------.-----------.-------------.
| Individual |        Group        |       Package        | PackageVersion | Is Latest |  Country  | Genetic_Sex |
:============:=====================:======================:================:===========:===========:=============:
| Inuk.SG    | Greenland_Saqqaq.SG | 2010_RasmussenNature | 2.1.1          | True      | Greenland | M           |
'------------'---------------------'----------------------'----------------'-----------'-----------'-------------'

v1.5.4.0 adds a --fullJanno flag to request all columns at once, without having to list them individually with many -j arguments.

This convenience feature was also added to the Web-API, where it can be triggered with ?additionalJannoColumns=ALL on the /individuals endpoint:

https://server.poseidon-adna.org/individuals?additionalJannoColumns=ALL

Better error messages

In previous trident versions some common error messages were not well rendered on the command line. This concerned particularly errors when parsing command line input, the POSEIDON.yml file or genotype data. We applied multiple changes here to improve the cli output.

The behaviour of the global trident option --errLength was also changed. It now only truncates genotype data-related messages, but does so as well if these are raised on the [Warning] log level. This should make the previously often illegible trident output upon broken genotype data more readable.

06 May 20:12

github-actions

v1.5.0.1

466870a

Release v1.5.0.1

This very minor release only affects the static trident executables produced for every release.

It introduces a distinction between pre-built X64 and ARM64 executables for macOS, where changes in the main processor architecture have recently rendered old builds invalid for new systems and vice versa.

That means the executable trident-macOS will no longer exist from now on. It is replaced by the executables trident-macOS-X64 and trident-macOS-ARM64.

In the past we have not explicitly documented changes in the compilation pipeline - v1.5.0.0, for example, came with a major overhaul of the pipeline - but in this case a small version bump seems to be in order to announce the split in available artefacts.

03 May 15:56

github-actions

v1.5.0.0

6b28a20

Release v1.5.0.0

This is a minor, but technically breaking release. It removes the example contributor Josiah Carberry from new packages created by trident init and trident forge.

Previously every package created by init or forge included an example entry in the contributor field of the POSEIDON.yml file:

- name: Josiah Carberry
  email: [email protected]
  orcid: 0000-0002-1825-0097

This served the purpose of reminding users to actually set a contributor and giving an example how to do so. To simplify scripting with Poseidon packages we now remove this slightly gimmicky default.

To encourage setting the contributor field we instead introduce a reading/validation warning in case the contributor field is empty:

[Warning] Contributor missing in POSEIDON.yml file of package 2010_RasmussenNature-2.1.1

26 Feb 14:10

github-actions

v1.4.1.0

8c54f6f

Release v1.4.1.0

This release adds an entirely new subcommand to merge two .janno files (jannocoalesce) and improves the error messages for broken .janno files.

Merging `.janno` files with `jannocoalesce`

The need for a tool to combine the information of two .janno files arose in the Poseidon ecosystem as we started to conceptualize the Poseidon Minotaur Archive. This archive will be populated by paper-wise Poseidon packages for which the genotype data was regenerated through the Minotaur workflow (work in progress). We plan to reprocess various packages that are already in the Poseidon Community Archive and for these packages we want to copy e.g. spatiotemporal information from the already available .janno files. jannocoalesce is the answer to this specific need, but can also be useful for various other applications.

It generally works by reading a source .janno file with -s|--sourceFile (or all .janno files in a -d|--baseDir) and a target .janno file with -t|--targetFile. It then merges these files by a key column, which can be selected with --sourceKey and --targetKey. The default for both of these key columns is the Poseidon_ID. In case the entries in the key columns differ slightly and systematically, e.g. because the Poseidon_IDs in either file have a special suffix (for example _SG), the --stripIdRegex option allows stripping these with a regular expression so that the keys match.

jannocoalesce generally attempts to fill all empty cells in the target .janno file with information from the source. --includeColumns and --excludeColumns allow selecting specific columns for which this should be done. In some cases it may be desirable to not just fill empty fields in the target, but to overwrite the information already there with the -f|--force option. If the target file should be preserved, then the output can be directed to a new output .janno file with -o|--outFile.
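The merge logic described above can be sketched in Python with rows represented as dictionaries. This is a simplified model and not the jannocoalesce implementation: the function `coalesce` and its parameter names are hypothetical, and two behaviours are assumptions of the sketch, namely that empty source values never overwrite anything and that key columns are never modified.

```python
import re


def coalesce(source_rows, target_rows, source_key="Poseidon_ID",
             target_key="Poseidon_ID", strip_id_regex=None,
             include=None, exclude=(), force=False, empty=("", "n/a")):
    """Fill empty target cells from source rows matched by a key column."""
    def normalise(key):
        # Strip e.g. a systematic suffix so that keys from both files match
        return re.sub(strip_id_regex, "", key) if strip_id_regex else key

    lookup = {normalise(row[source_key]): row for row in source_rows}
    merged = []
    for target in target_rows:
        result = dict(target)  # leave the input rows untouched
        source = lookup.get(normalise(target[target_key]))
        if source is not None:
            for column, value in source.items():
                if column in (source_key, target_key) or column in exclude:
                    continue
                if include is not None and column not in include:
                    continue
                if value in empty:
                    continue  # assumption: nothing to copy from an empty cell
                if force or result.get(column, "") in empty:
                    result[column] = value
        merged.append(result)
    return merged


# Hypothetical rows whose IDs differ by an _SG suffix
source = [{"Poseidon_ID": "Inuk", "Country": "Greenland", "Genetic_Sex": "M"}]
target = [{"Poseidon_ID": "Inuk_SG", "Country": "n/a", "Genetic_Sex": "U"}]
filled = coalesce(source, target, strip_id_regex=r"_SG$")
forced = coalesce(source, target, strip_id_regex=r"_SG$", force=True)
print(filled)
print(forced)
```

Without force only the empty Country cell is filled. With force the existing Genetic_Sex value in the target is overwritten as well.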

Better error messages for broken `.janno` files

.janno file validation is a core feature of trident. With this release we try to improve the error messages for two common situations:

Broken number fields. This can happen if some text or wrong character ends up in a number field.

So far the error messages for this case have been pretty technical. Here for example if an integer field is filled with 430;, where the integer number 430 is accidentally written with a trailing ;:

parse error (Failed reading: conversion error: expected Int, got "430;" (incomplete field parse, leftover: [59]))

The new error message is clearer:

parse error in one column (expected data type: Int, broken value: "430;", problematic characters: ";")

Inconsistent Date_*, Contamination_* and Relation_* columns. These sets of columns have to be cross-consistent, following a logic that is especially complex for the Date_* fields (see here).

So far any inconsistency was reported with this generic error message:

The Date_* columns are not consistent

Now we include far more precise messages, for example:

Date_Type is not "C14", but either Date_C14_Uncal_BP or Date_C14_Uncal_BP_Err are not empty.

This should simplify tedious .janno file debugging in the future.
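Both kinds of checks can be illustrated with a short Python sketch that produces messages in the style quoted above. The functions `describe_int_error` and `check_date_columns` are hypothetical and cover only the two quoted cases. trident's real validation is written in Haskell and implements considerably more rules.

```python
import re


def describe_int_error(value):
    """Return a descriptive message if value is not a plain integer, else None."""
    if re.fullmatch(r"-?[0-9]+", value):
        return None
    body = value[1:] if value.startswith("-") else value
    # Collect each offending character once, in order of appearance
    bad = "".join(dict.fromkeys(ch for ch in body if ch not in "0123456789"))
    return (
        "parse error in one column (expected data type: Int, "
        f'broken value: "{value}", problematic characters: "{bad}")'
    )


def check_date_columns(row, empty=("", "n/a")):
    """Return a message if C14 fields are filled although Date_Type is not C14."""
    c14_values = [row.get("Date_C14_Uncal_BP", ""), row.get("Date_C14_Uncal_BP_Err", "")]
    if row.get("Date_Type", "") != "C14" and any(v not in empty for v in c14_values):
        return ('Date_Type is not "C14", but either Date_C14_Uncal_BP '
                "or Date_C14_Uncal_BP_Err are not empty.")
    return None


int_message = describe_int_error("430;")
date_message = check_date_columns(
    {"Date_Type": "contextual", "Date_C14_Uncal_BP": "3000", "Date_C14_Uncal_BP_Err": "30"}
)
print(int_message)
print(date_message)
```

Naming the offending characters and the specific rule that failed is what turns a generic parser failure into something a user can act on.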

30 Oct 08:56

github-actions

v1.4.0.3

a760f94

Release v1.4.0.3

This small release fixes a performance issue related to finding the latest version of all packages. The bug had severe detrimental effects on forge and fetch, which are now resolved.

We used this opportunity to switch to a new GHC version and new versions of a lot of dependencies for building trident.
