matsengrp/larch

Getting started

Installation

Install via conda

Larch is available as a conda package on the matsengrp channel:

conda install -c matsengrp larch-phylo

The conda package is currently available only for Linux.

Build from source

Requirements

  • GCC 7.5
  • cmake 3.16

On Ubuntu 18.04 LTS, the following command installs the requirements:

sudo apt install --no-install-recommends git git-lfs cmake make g++ mpi-default-dev libprotobuf-dev libboost-dev libboost-program-options-dev libboost-filesystem-dev libboost-iostreams-dev libboost-date-time-dev protobuf-compiler automake autoconf libtool nasm

To get a recent cmake, download from https://cmake.org/download/, for example:

wget https://github.com/Kitware/CMake/releases/download/v3.23.1/cmake-3.23.1-linux-x86_64.tar.gz
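
Before downloading a newer cmake, it can help to check whether the installed one already satisfies the requirement. The snippet below is a minimal sketch, assuming the versions in the Requirements list are minimums and that GNU sort (with -V) is available; version_ge is a hypothetical helper, not part of Larch.

```shell
# Hypothetical helper (not part of Larch): succeeds when version $1 >= version $2.
# Assumes GNU sort with -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="3.16"   # assumed to be a minimum, per the Requirements list
if command -v cmake >/dev/null 2>&1; then
    # The first line of `cmake --version` reads "cmake version X.Y.Z".
    have="$(cmake --version | head -n 1 | awk '{print $3}')"
    if version_ge "$have" "$required"; then
        echo "cmake $have satisfies >= $required"
    else
        echo "cmake $have is older than $required; install a newer release"
    fi
else
    echo "cmake not found on PATH"
fi
```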

Build Environments

  • singularity 3.5.3
  • conda 22.9.0

Larch can be built utilizing a Singularity container or a Conda environment.

To build the Singularity image, use the provided definition file:

singularity build larch-singularity.sif larch-singularity.def
singularity shell larch-singularity.sif --net

To set up a conda environment capable of building Larch, create the larch environment from the standard environment file provided:

conda env create -f environment.yml

To set up a conda environment capable of building Larch that also includes development tools, create the larch-dev environment from the development environment file provided:

conda env create -f environment-dev.yml

Building

There are 4 executables that are built automatically as part of the larch package and provide various methods for exploring tree space and manipulating DAGs/trees:

  • larch-test is the suite of tests used to validate the various routines.
  • larch-usher is a tool that takes an input tree/DAG and explores tree space through SPR moves.
  • larch-dagutil is a utility that manipulates (e.g. merge, prune) or inspects DAGs/trees.
  • larch-dag2dot is a utility that writes a DAG to a DOT file format for easier viewing.

Note: If you run into memory limitations during the cmake step, you can limit the number of parallel threads with export CMAKE_NUM_THREADS="8" (reduce the number as necessary).

To build everything from the larch/ directory, run:

git submodule update --init --recursive
mkdir build
cd build
cmake ..
make -j16

# optionally, to install outside of build directory
make install
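
The -j16 above is only an example value. The following is a minimal sketch for choosing a job count from the machine's core count, assuming nproc (GNU coreutils) is available; the cap of 16 and the variable names are illustrative, and a lower value may be needed if memory runs out.

```shell
# Pick a parallel job count: the number of cores, capped at an illustrative maximum.
max_jobs=16
cores="$(nproc 2>/dev/null || echo 1)"   # fall back to 1 if nproc is missing
if [ "$cores" -gt "$max_jobs" ]; then
    jobs_n="$max_jobs"
else
    jobs_n="$cores"
fi
# Print the command rather than running it, so it can be inspected first.
echo "make -j$jobs_n"
```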

CMake build options:

  • add -DCMAKE_BUILD_TYPE=Debug to build in debug mode. -DCMAKE_BUILD_TYPE=Release is enabled by default.
  • add -DCMAKE_CXX_CLANG_TIDY="clang-tidy" to enable clang-tidy.
  • add -DUSE_ASAN=yes to enable asan and ubsan.
  • add -DCMAKE_INSTALL_PREFIX=path/to/install to select install location. By default, this will perform a system-wide installation. To install in current conda environment, use -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX.
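
As one illustration of combining these options, the sketch below assembles a configure command for a debug build with sanitizers that installs into the active conda environment. The variable names and the $HOME/.local fallback are assumptions for illustration, CMAKE_BUILD_TYPE is the standard CMake variable name, and the command is printed rather than executed.

```shell
# Fall back to a user-local prefix when no conda environment is active (assumption).
prefix="${CONDA_PREFIX:-$HOME/.local}"
build_type="Debug"

# Combine the build type, sanitizer, and install-prefix options into one command.
configure_cmd="cmake .. -DCMAKE_BUILD_TYPE=$build_type -DUSE_ASAN=yes -DCMAKE_INSTALL_PREFIX=$prefix"
echo "$configure_cmd"
```

Run from the build directory, the printed command can be pasted in place of the plain cmake .. step.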

Running


File formats

For all tools in this suite, a number of file formats are supported for loading and storing MATs and MADAGs. When passing filepaths as arguments, the file format can be explicitly specified with --input-format/--output-format options. Alternatively, the program can infer the file format when filepath contains a recognized file extension.

File format options:

  • MADAG dagbin Supported as input and output. *.dagbin is the recognized extension.
  • MADAG protobuf Supported as input and output. *.pb_dag is the recognized extension, or using *.pb WITHOUT a --MAT-refseq-file option.
  • MAT protobuf Supported as input only. *.pb_tree is the recognized extension, or using *.pb WITH a --MAT-refseq-file option.
  • MADAG json Supported as input only. *.json_dag or *.json is the recognized extension.
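
The extension rules above can be summarized as a small lookup. This is a sketch of the rules as described here, not Larch's own inference code: guess_larch_format is a hypothetical helper, the mapping onto the dagbin/dag-pb/tree-pb/dag-json option names is an assumption, and so is the choice to ignore a trailing .gz.

```shell
# Hypothetical helper mirroring the extension rules listed above; it is not
# Larch's own code. $1 is the file path; $2 is non-empty when a
# --MAT-refseq-file would be passed.
guess_larch_format() {
    path="${1%.gz}"          # assumption: a trailing .gz is ignored
    refseq="${2:-}"
    case "$path" in
        *.dagbin)          echo "dagbin" ;;
        *.pb_dag)          echo "dag-pb" ;;
        *.pb_tree)         echo "tree-pb" ;;
        *.pb)              if [ -n "$refseq" ]; then echo "tree-pb"; else echo "dag-pb"; fi ;;
        *.json_dag|*.json) echo "dag-json" ;;
        *)                 echo "unknown"; return 1 ;;
    esac
}

guess_larch_format tree_1.pb.gz              # prints: dag-pb
guess_larch_format tree_1.pb.gz refseq.json  # prints: tree-pb
```

When the extension is ambiguous or missing, passing --input-format/--output-format explicitly avoids relying on inference at all.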

larch-test

From the larch/build/bin directory:

ln -s ../../data
./larch-test

larch-test options:

  • nocatch allows test exceptions to escape, which is useful for debugging. A gdb session can be started with gdb --args build/larch-test nocatch.
  • --list produces a list of all available tests, along with an ID number.
  • --range runs tests by ID with a string of comma-separated range or single ID arguments [e.g. 1-5,7,9,12-13].
  • -tag excludes tests with a given tag.
  • +tag includes tests with a given tag.
  • For example, -tag "slow" excludes tests that require a long runtime to complete.
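
To make the --range syntax concrete, the sketch below expands a range string into the individual test IDs it names. expand_range is a hypothetical helper, not part of larch-test, and it assumes that ranges are inclusive at both ends.

```shell
# Hypothetical helper: list the test IDs a --range spec would select,
# assuming ranges are inclusive. Not part of larch-test.
expand_range() {
    out=""
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        case "$part" in
            *-*)
                # A "first-last" entry: count from first up to last.
                i="${part%-*}"
                last="${part#*-}"
                while [ "$i" -le "$last" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)
                # A single ID.
                out="$out $part"
                ;;
        esac
    done
    echo "${out# }"
}

expand_range "1-5,7,9,12-13"   # prints: 1 2 3 4 5 7 9 12 13
```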

larch-usher

From the larch/build/bin directory:

./larch-usher -i ../data/testcase/tree_1.pb.gz -o output_dag.pb -c 10

This command runs 10 iterations of larch-usher on the provided tree and writes the final result to the file output_dag.pb.

larch-usher options:

  • -i,--input [REQUIRED] Filepath to the input tree/DAG (accepted file formats are: MADAG protobuf, MAT protobuf, JSON, Dagbin).
  • -o,--output [REQUIRED] Filepath to the output tree/DAG (accepted file formats are: MADAG protobuf, Dagbin).
  • -c,--count [Default: 1] Number of larch-usher iterations to run.
  • -r,--MAT-refseq-file [REQUIRED if provided input file is a MAT protobuf] Filepath to json reference sequence.
  • -v,--VCF-input-file Filepath to VCF containing ambiguous sequence data.
  • -l,--logpath [Default: optimization_log] Filepath to write summary log.
  • -s,--switch-subtrees [Default: never] Switch to optimizing subtrees after the specified number of iterations.
  • --min-subtree-clade-size [Default: 100] The minimum number of leaves in a subtree sampled for optimization (ignored without option -s).
  • --max-subtree-clade-size [Default: 1000] The maximum number of leaves in a subtree sampled for optimization (ignored without option -s).
  • --move-coeff-nodes [Default: 1] New node coefficient for scoring moves. Set to 0 to apply only parsimony-optimal SPR moves.
  • --move-coeff-pscore [Default: 1] Parsimony score coefficient for scoring moves. Set to 0 to apply only topologically novel SPR moves.
  • --sample-method [Default: parsimony] Select method for sampling optimization tree from the DAG. Options are: (parsimony, random, rf-minsum, rf-maxsum).
  • --sample-uniformly [Default: use natural distribution] Use a uniform distribution to sample trees for optimization.
  • For example, if the sampling method is parsimony and --sample-uniformly is provided, then a uniform distribution on parsimony-optimal trees is sampled from.
  • --callback-option [Default: best-moves] Specify which SPR moves are chosen and applied. Options are: (all-moves, best-moves-fixed-tree, best-moves-treebased, best-moves).
  • --trim [Default: do not trim] Trim optimized dag to contain only parsimony-optimal trees before writing to protobuf.
  • --keep-fragment-uncollapsed [Default: collapse] Do not collapse empty (non-mutation-bearing) edges in the optimization tree.
  • --quiet [Default: write intermediate files] Do not write intermediate protobuf file at each iteration.
  • --input-format [Default: format inferred by file extension] Specify the format of the input file. Options are: (dagbin, pb, dag-pb, tree-pb, json, dag-json)
  • --output-format [Default: format inferred by file extension] Specify the format of the output file. Options are: (dagbin, pb, dag-pb)
  • -S Enable smart stopping: larch-usher will terminate when parsimony improvement ceases to occur.
  • -T Specify a hard time limit after which larch-usher will terminate.
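
The dependency between -i and -r can be sketched as a small wrapper that assembles a command line and refuses a *.pb_tree input when no reference sequence is supplied. usher_cmd is a hypothetical helper for illustration, not part of Larch; it only prints the command and uses no flags beyond those listed above.

```shell
# Hypothetical wrapper (not part of Larch): build a larch-usher command line,
# refusing a *.pb_tree input when no reference sequence is given.
# Arguments: input, output, iteration count (default 1), optional refseq JSON.
usher_cmd() {
    input="$1"; output="$2"; count="${3:-1}"; refseq="${4:-}"
    case "${input%.gz}" in
        *.pb_tree)
            if [ -z "$refseq" ]; then
                echo "error: MAT protobuf input needs -r/--MAT-refseq-file" >&2
                return 1
            fi
            ;;
    esac
    cmd="./larch-usher -i $input -o $output -c $count"
    if [ -n "$refseq" ]; then
        cmd="$cmd -r $refseq"
    fi
    echo "$cmd"
}

usher_cmd tree_1.pb.gz output_dag.pb 10
# prints: ./larch-usher -i tree_1.pb.gz -o output_dag.pb -c 10
```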

larch-dagutil

From the larch/build/bin directory:

./larch-dagutil -i ../data/testcase/tree_1.pb.gz -i ../data/testcase/tree_2.pb.gz -o merged_trees.pb

This command takes a list of protobuf files and merges their DAGs into a single DAG.

Note about merging ambiguous data using larch-dagutil

Merging multiple DAGs over the same ambiguous leafset without providing a VCF can produce a non-deterministic parsimony score. larch-dagutil can merge multiple DAGs whose leafsets contain matching sampleIds into a single DAG, but the protobuf format stores only edge mutations, which are fully disambiguated, so the ambiguities must be recovered by passing a VCF file to the program. When a VCF is not supplied, the overall parsimony score of the merged DAG is not well-defined: nodes are added in parallel, so the disambiguation assigned to a given leaf node depends on the order in which the parallel algorithm accesses the leaves from each DAG. The disambiguation for each leaf therefore comes from an effectively arbitrary choice among the trees from which the DAG is constructed, and is not necessarily consistent with the disambiguation of its sister leaves.

larch-dagutil options:

  • -i,--input Filepath to the input Tree/DAG (accepted file formats are: MADAG protobuf, MAT protobuf, JSON, Dagbin).
  • -o,--output [Default: does not print output] Filepath to the output Tree/DAG (accepted file formats are: MADAG protobuf, Dagbin).
  • -r,--MAT-refseq-file [REQUIRED if input protobufs are MAT protobuf format] Filepath to json reference sequence.
  • -t,--trim Trim output (Default trimming method is trim to best parsimony).
  • --rf Trim output to minimize RF distance to the provided DAG file (Ignored if -t flag is not provided).
  • -s,--sample Write a sampled single tree from DAG to file, rather than the whole DAG.
  • --dag-info Print stats about the DAG (tree count, all parsimony scores, all RF distances)
  • --parsimony Print all parsimony scores.
  • --sum-rf-distance Print all sum RF distances.
  • --input-format [Default: format inferred by file extension] Specify the format of the input file(s). Options are: (dagbin, pb, dag-pb, tree-pb, json, dag-json)
  • --output-format [Default: format inferred by file extension] Specify the format of the output file. Options are: (dagbin, pb, dag-pb)
  • --rf-format [Default: format inferred by file extension] Specify the format of the RF file. Options are: (dagbin, pb, dag-pb, tree-pb, json, dag-json)
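
Since -i is repeated once per input, a merge over many files is easiest to assemble in a loop. The sketch below prints such a command with trimming and DAG statistics enabled; merge_cmd is a hypothetical helper, not part of Larch, and it uses only flags listed above.

```shell
# Hypothetical helper (not part of Larch): print a larch-dagutil merge command
# with one -i per input. $1 is the output path; the rest are inputs.
merge_cmd() {
    output="$1"
    shift
    cmd="./larch-dagutil"
    for f in "$@"; do
        cmd="$cmd -i $f"
    done
    # -t trims to best parsimony; --dag-info prints stats about the DAG.
    echo "$cmd -o $output -t --dag-info"
}

merge_cmd merged_trees.pb tree_1.pb.gz tree_2.pb.gz
# prints: ./larch-dagutil -i tree_1.pb.gz -i tree_2.pb.gz -o merged_trees.pb -t --dag-info
```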

larch-dag2dot

From the larch/build/bin directory:

./larch-dag2dot -i ../data/testcase/full_dag.pb

This command writes the provided DAG in dot format to stdout.

larch-dag2dot options:

  • -i,--input Filepath to the input Tree/DAG (accepted file formats are: MADAG protobuf, MAT protobuf, JSON, Dagbin).
  • -o,--output [Default: DOT written to stdout] Filepath to the output DOT file.
  • --input-format [Default: format inferred by file extension] Specify the format of the input file. Options are: (dagbin, pb, dag-pb, tree-pb, json, dag-json)
  • --dag/--tree [REQUIRED if file extension is *.pb] Specify whether input file is a DAG or a Tree.
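
Because the DOT text goes to stdout by default, it can be piped straight into Graphviz. The sketch below assumes Graphviz's dot program may or may not be installed; render_dot is a hypothetical helper, not part of Larch, and the output base name is illustrative.

```shell
# Hypothetical helper: read DOT text on stdin and write $1.svg when Graphviz's
# `dot` is installed, otherwise keep the raw text as $1.dot.
render_dot() {
    base="$1"
    if command -v dot >/dev/null 2>&1; then
        dot -Tsvg -o "$base.svg"
    else
        cat > "$base.dot"
    fi
}

# Intended use, assuming a built larch-dag2dot in the current directory:
#   ./larch-dag2dot -i ../data/testcase/full_dag.pb --dag | render_dot full_dag
```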

Third-party

About

Inference and manipulation of history DAGs
