Larch can be installed via conda from the matsengrp channel:
conda install -c matsengrp larch-phylo
Currently only available on Linux.
- GCC 7.5
- cmake 3.16
For Ubuntu 18.04 LTS, the following command installs the requirements:
sudo apt install --no-install-recommends git git-lfs cmake make g++ mpi-default-dev libprotobuf-dev libboost-dev libboost-program-options-dev libboost-filesystem-dev libboost-iostreams-dev libboost-date-time-dev protobuf-compiler automake autoconf libtool nasm
To get a recent cmake, download it from https://cmake.org/download/, for example:
wget https://github.com/Kitware/CMake/releases/download/v3.23.1/cmake-3.23.1-linux-x86_64.tar.gz
- singularity 3.5.3
- conda 22.9.0
Larch can be built utilizing a Singularity container or a Conda environment.
To build the Singularity image, use the definition provided:
singularity build larch-singularity.sif larch-singularity.def
singularity shell larch-singularity.sif --net
To set up a conda environment capable of building Larch, create larch using the standard environment file provided:
conda env create -f environment.yml
To set up a conda environment capable of building Larch, including development tools, create larch-dev using the development environment file provided:
conda env create -f environment-dev.yml
There are 4 executables that are built automatically as part of the larch package and provide various methods for exploring tree space and manipulating DAGs/trees:
- larch-test is the suite of tests used to validate the various routines.
- larch-usher is a tool that takes an input tree/DAG and explores tree space through SPR moves.
- larch-dagutil is a utility that manipulates (e.g. merge, prune) or inspects DAGs/trees.
- larch-dag2dot is a utility that writes a DAG to a DOT file format for easier viewing.
Note: If you run into memory limitations during the cmake step, you can limit the number of parallel threads with export CMAKE_NUM_THREADS="8" (reduce the number as necessary).
To build everything, run the following from the larch/ directory:
git submodule update --init --recursive
mkdir build
cd build
cmake ..
make -j16
# optionally, to install outside of build directory
make install
CMake build options:
- add -DCMAKE_BUILD_TYPE=Debug to build in debug mode. -DCMAKE_BUILD_TYPE=Release is enabled by default.
- add -DCMAKE_CXX_CLANG_TIDY="clang-tidy" to enable clang-tidy.
- add -DUSE_ASAN=yes to enable asan and ubsan.
- add -DCMAKE_INSTALL_PREFIX=path/to/install to select the install location. By default, this will perform a system-wide installation. To install in the current conda environment, use -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX.
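The build options above can be combined in a single configure call. As a minimal sketch, the helper below only prints the resulting cmake command rather than running it. The function name larch_cmake_cmd and the prefix /opt/larch are hypothetical, and the build-type flag is written with CMake's standard CMAKE_BUILD_TYPE variable.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a cmake configure command
# assembled from the build options listed above.
# Usage: larch_cmake_cmd [BUILD_TYPE] [USE_ASAN] [INSTALL_PREFIX]
larch_cmake_cmd() {
    build_type=${1:-Release}   # Release is the documented default
    use_asan=${2:-no}
    prefix=${3:-}
    cmd="cmake .. -DCMAKE_BUILD_TYPE=$build_type"
    if [ "$use_asan" = "yes" ]; then
        cmd="$cmd -DUSE_ASAN=yes"
    fi
    if [ -n "$prefix" ]; then
        cmd="$cmd -DCMAKE_INSTALL_PREFIX=$prefix"
    fi
    printf '%s\n' "$cmd"
}

# Debug build with sanitizers enabled.
larch_cmake_cmd Debug yes
# Release build installed under an example prefix.
larch_cmake_cmd Release no /opt/larch
```

Copy the printed line into the build directory to run it, or replace the final printf with a direct call once the options look right.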
For all tools in this suite, a number of file formats are supported for loading and storing MATs and MADAGs. When passing filepaths as arguments, the file format can be explicitly specified with the --input-format/--output-format options. Alternatively, the program can infer the file format when the filepath contains a recognized file extension.
File format options:
- MADAG dagbin: Supported as input and output. *.dagbin is the recognized extension.
- MADAG protobuf: Supported as input and output. *.pb_dag is the recognized extension, or *.pb WITHOUT a --MAT-refseq-file option.
- MAT protobuf: Supported as input only. *.pb_tree is the recognized extension, or *.pb WITH a --MAT-refseq-file option.
- MADAG json: Supported as input only. *.json_dag or *.json is the recognized extension.
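The extension rules above can be restated as a small lookup. The sketch below is not the tools' own inference code. The function name larch_guess_format is hypothetical. Mapping the rules onto the format names dagbin, dag-pb, tree-pb and dag-json is an assumption, as is ignoring a trailing .gz (suggested by example inputs such as tree_1.pb.gz).

```shell
#!/bin/sh
# Hypothetical helper: guess the format name for a filepath using the
# extension rules listed above.
# Usage: larch_guess_format FILEPATH [yes|no]
#   second argument: whether a --MAT-refseq-file option is also given
larch_guess_format() {
    path=${1%.gz}          # assumption: a trailing .gz is ignored
    has_refseq=${2:-no}
    case "$path" in
        *.dagbin)  echo "dagbin" ;;
        *.pb_dag)  echo "dag-pb" ;;
        *.pb_tree) echo "tree-pb" ;;
        *.pb)
            # A bare *.pb is a MAT only when a refseq file is supplied.
            if [ "$has_refseq" = "yes" ]; then
                echo "tree-pb"
            else
                echo "dag-pb"
            fi
            ;;
        *.json_dag|*.json) echo "dag-json" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

larch_guess_format merged.dagbin        # prints: dagbin
larch_guess_format tree_1.pb.gz yes     # prints: tree-pb
larch_guess_format tree_1.pb.gz         # prints: dag-pb
```

When the extension is not recognized, pass --input-format or --output-format explicitly instead of relying on inference.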
From the larch/build/bin directory:
ln -s ../../data
./larch-test
larch-test options:
- nocatch: allows test exceptions to escape, which is useful for debugging. A gdb session can be started with gdb --args build/larch-test nocatch.
- --list: produces a list of all available tests, along with an ID number.
- --range: runs tests by ID with a string of comma-separated range or single ID arguments [e.g. 1-5,7,9,12-13].
- -tag: excludes tests with a given tag.
- +tag: includes tests with a given tag.
  - For example, -tag "slow" removes tests which require a long runtime to complete.
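To make the --range syntax concrete, the sketch below expands a range string into the individual test IDs it names. It assumes ranges are inclusive at both ends, which the source does not state explicitly. The function name expand_test_range is hypothetical and is not part of larch-test.

```shell
#!/bin/sh
# Hypothetical helper: expand a --range style string such as
# "1-5,7,9,12-13" into a space-separated list of test IDs.
# Assumption: ranges are inclusive at both ends.
expand_test_range() {
    old_ifs=$IFS
    IFS=,
    set -- $1              # split the argument on commas
    IFS=$old_ifs
    out=""
    for part in "$@"; do
        case "$part" in
            *-*) start=${part%-*}; end=${part#*-} ;;
            *)   start=$part; end=$part ;;
        esac
        i=$start
        while [ "$i" -le "$end" ]; do
            out="$out $i"
            i=$((i + 1))
        done
    done
    printf '%s\n' "${out# }"
}

expand_test_range "1-5,7,9,12-13"   # prints: 1 2 3 4 5 7 9 12 13
```

Use --list first to see which ID corresponds to which test before choosing a range.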
From the larch/build/bin directory:
./larch-usher -i ../data/testcase/tree_1.pb.gz -o output_dag.pb -c 10
This command runs 10 iterations of larch-usher on the provided tree and writes the final result to the file output_dag.pb.
larch-usher options:
- -i,--input: [REQUIRED] Filepath to the input tree/DAG (accepted file formats are: MADAG protobuf, MAT protobuf, JSON, Dagbin).
- -o,--output: [REQUIRED] Filepath to the output tree/DAG (accepted file formats are: MADAG protobuf, Dagbin).
- -c,--count: [Default: 1] Number of larch-usher iterations to run.
- -r,--MAT-refseq-file: [REQUIRED if provided input file is a MAT protobuf] Filepath to json reference sequence.
- -v,--VCF-input-file: Filepath to VCF containing ambiguous sequence data.
- -l,--logpath: [Default: optimization_log] Filepath to write summary log.
- -s,--switch-subtrees: [Default: never] Switch to optimizing subtrees after the specified number of iterations.
- --min-subtree-clade-size: [Default: 100] The minimum number of leaves in a subtree sampled for optimization (ignored without option -s).
- --max-subtree-clade-size: [Default: 1000] The maximum number of leaves in a subtree sampled for optimization (ignored without option -s).
- --move-coeff-nodes: [Default: 1] New node coefficient for scoring moves. Set to 0 to apply only parsimony-optimal SPR moves.
- --move-coeff-pscore: [Default: 1] Parsimony score coefficient for scoring moves. Set to 0 to apply only topologically novel SPR moves.
- --sample-method: [Default: parsimony] Select method for sampling optimization tree from the DAG. Options are: (parsimony, random, rf-minsum, rf-maxsum).
- --sample-uniformly: [Default: use natural distribution] Use a uniform distribution to sample trees for optimization.
  - For example, if the sampling method is parsimony and --sample-uniformly is provided, then a uniform distribution on parsimony-optimal trees is sampled from.
- --callback-option: [Default: best-moves] Specify which SPR moves are chosen and applied. Options are: (all-moves, best-moves-fixed-tree, best-moves-treebased, best-moves).
- --trim: [Default: do not trim] Trim optimized dag to contain only parsimony-optimal trees before writing to protobuf.
- --keep-fragment-uncollapsed: [Default: collapse] Do not collapse empty (non-mutation-bearing) edges in the optimization tree.
- --quiet: [Default: write intermediate files] Do not write intermediate protobuf file at each iteration.
- --input-format: [Default: format inferred by file extension] Specify the format of the input file. Options are: (dagbin, pb, dag-pb, tree-pb, json, dag-json).
- --output-format: [Default: format inferred by file extension] Specify the format of the output file. Options are: (dagbin, pb, dag-pb).
- -S: Enable smart stopping: larch-usher will terminate when parsimony improvement ceases to occur.
- -T: Specify a hard time limit after which larch-usher will terminate.
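As an illustration of how the required options fit together, the sketch below assembles a larch-usher command line and prints it without running it. It adds -r only when a reference sequence file is given, as is required for MAT protobuf input. The function name larch_usher_cmd and the file names my_tree.pb_tree and refseq.json are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a larch-usher command line.
# Usage: larch_usher_cmd INPUT OUTPUT [COUNT] [REFSEQ_JSON]
larch_usher_cmd() {
    input=$1
    output=$2
    count=${3:-1}          # documented default is 1 iteration
    refseq=${4:-}
    cmd="./larch-usher -i $input -o $output -c $count"
    if [ -n "$refseq" ]; then
        # -r is required when the input is a MAT protobuf.
        cmd="$cmd -r $refseq"
    fi
    printf '%s\n' "$cmd"
}

# Reproduces the example invocation shown earlier in this section.
larch_usher_cmd ../data/testcase/tree_1.pb.gz output_dag.pb 10
# MAT protobuf input with an example reference sequence file.
larch_usher_cmd my_tree.pb_tree output.dagbin 5 refseq.json
```

The other options listed above can be appended to the printed command in the same way.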
From the larch/build/bin directory:
./larch-dagutil -i ../data/testcase/tree_1.pb.gz -i ../data/testcase/tree_2.pb.gz -o merged_trees.pb
This executable takes a list of protobuf files and merges the resulting DAGs together into one.
Merging multiple DAGs on the same ambiguous leafset without providing a VCF can produce a non-deterministic parsimony score. larch-dagutil can merge multiple DAGs whose leafsets contain matching sampleIds into a single DAG, but the protobuf format stores only edge mutations, which are fully disambiguated, so the ambiguities must be recovered by passing a VCF file to the program. When a VCF is not supplied, the overall parsimony score of the merged DAG is not well-defined: nodes are added in parallel, so the disambiguation assigned to any given leaf node depends on the order in which the parallel algorithm accesses the leaves from each DAG. The disambiguation for each leaf is therefore effectively a random choice among the trees from which the DAG is constructed, and is not necessarily consistent with the disambiguation for its sister leaves.
larch-dagutil options:
- -i,--input: Filepath to the input Tree/DAG (accepted file formats are: MADAG protobuf, MAT protobuf, JSON, Dagbin).
- -o,--output: [Default: does not print output] Filepath to the output Tree/DAG (accepted file formats are: MADAG protobuf, Dagbin).
- -r,--MAT-refseq-file: [REQUIRED if input protobufs are MAT protobuf format] Filepath to json reference sequence.
- -t,--trim: Trim output (Default trimming method is trim to best parsimony).
- --rf: Trim output to minimize RF distance to the provided DAG file (Ignored if -t flag is not provided).
- -s,--sample: Write a sampled single tree from DAG to file, rather than the whole DAG.
- --dag-info: Print stats about the DAG (tree count, all parsimony scores, all RF distances).
- --parsimony: Print all parsimony scores.
- --sum-rf-distance: Print all sum RF distances.
- --input-format: [Default: format inferred by file extension] Specify the format of the input file(s). Options are: (dagbin, pb, dag-pb, tree-pb, json, dag-json).
- --output-format: [Default: format inferred by file extension] Specify the format of the output file. Options are: (dagbin, pb, dag-pb).
- --rf-format: [Default: format inferred by file extension] Specify the format of the RF file. Options are: (dagbin, pb, dag-pb, tree-pb, json, dag-json).
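Because -i is repeated once per input when merging, the command line grows with the number of files. The sketch below builds such a merge command from a list of inputs and prints it without running it. The function name larch_dagutil_merge_cmd is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a larch-dagutil merge command
# with one -i per input file.
# Usage: larch_dagutil_merge_cmd OUTPUT INPUT [INPUT...]
larch_dagutil_merge_cmd() {
    output=$1
    shift
    cmd="./larch-dagutil"
    for input in "$@"; do
        cmd="$cmd -i $input"
    done
    printf '%s\n' "$cmd -o $output"
}

# Reproduces the two-tree merge example shown earlier in this section.
larch_dagutil_merge_cmd merged_trees.pb \
    ../data/testcase/tree_1.pb.gz ../data/testcase/tree_2.pb.gz
```

Flags such as -t or --dag-info from the list above can be appended to the printed command as needed.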
From the larch/build/bin directory:
./larch-dag2dot -i ../data/testcase/full_dag.pb
This command writes the provided DAG in DOT format to stdout.
larch-dag2dot options:
- -i,--input: Filepath to the input Tree/DAG (accepted file formats are: MADAG protobuf, MAT protobuf, JSON, Dagbin).
- -o,--output: [Default: DOT written to stdout] Filepath to the output DOT file.
- --input-format: [Default: format inferred by file extension] Specify the format of the input file. Options are: (dagbin, pb, dag-pb, tree-pb, json, dag-json).
- --dag/--tree: [REQUIRED if file extension is *.pb] Specify whether input file is a DAG or a Tree.
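Since larch-dag2dot writes DOT to stdout by default, its output can be piped straight into a renderer. The sketch below assumes Graphviz is installed and provides the dot command. Graphviz is not mentioned in the source and is not a Larch requirement. The function name render_dag_svg is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: render a DAG to SVG by piping larch-dag2dot output
# into Graphviz. Assumes Graphviz's `dot` is on PATH and that the current
# directory is larch/build/bin.
# Usage: render_dag_svg INPUT_DAG OUTPUT_SVG
render_dag_svg() {
    input=$1
    output=$2
    if ! command -v dot >/dev/null 2>&1; then
        echo "Graphviz 'dot' not found on PATH" >&2
        return 1
    fi
    # --dag is passed because a bare *.pb extension does not say whether
    # the file holds a DAG or a tree.
    ./larch-dag2dot -i "$input" --dag | dot -Tsvg -o "$output"
}

# Example call (not executed here):
#   render_dag_svg ../data/testcase/full_dag.pb full_dag.svg
```

For large DAGs, writing the DOT file with -o first and inspecting it before rendering may be more practical.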
- Lohmann, N. (2022). JSON for Modern C++ (Version 3.10.5) [Computer software]. https://github.com/nlohmann
- Niebler, E. Range library for C++14/17/20 [Computer software]. https://github.com/ericniebler/range-v3