MAF to Tile CSV Design
The common translation layer is a repository of tools that can be found at https://github.com/Intel-HLS/GenomicsSampleAPIs/tree/master/utils. The tools in this repository are described in the following subsections.
Loader utility for GenomicsDB. It uses a loader config to generate the loader configuration file that GenomicsDB requires. When `-l` is specified, this loader takes the callset and vid maps to automate the GenomicsDB loading process.
Loader configuration file that lets a user specify MPI (if desired) as well as pointers to the GenomicsDB loader executable, etc.
An example loader config with some basic settings for loading GenomicsDB. More details can be found on the GenomicsDB wiki.
Config defines the valid assemblies. This is a placeholder for future configuration information.
class `CSVLine` provides the methods to populate the fields that are expected for a Tile DB entry and generates a CSV line. It validates the entries before generating the line. The expected usage is that a higher-level program that understands the input format calls this class, populates the CSVLine structure, and gets back a CSV line that is compatible with the GenomicsDB vcf2tiledb loader.
Usage Note: Populate the ALT field first since it determines the size of PL and AD fields.
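The ordering constraint in the usage note can be illustrated with a small sketch. The class and method names below (`CSVLineSketch`, `set_alt`, `set_ad`, `set_pl`) are hypothetical and are not the actual `CSVLine` API. The sketch assumes diploid samples, where AD carries one value per allele (REF plus each ALT) and PL carries one value per unordered genotype.

```python
class CSVLineSketch:
    """Hypothetical stand-in for CSVLine; illustrates the ALT-first rule only."""

    def __init__(self):
        self.alt = None
        self.ad = None
        self.pl = None

    def set_alt(self, alt_alleles):
        if not alt_alleles:
            raise ValueError("at least one ALT allele is required")
        self.alt = list(alt_alleles)

    def _num_alleles(self):
        # AD and PL sizes are undefined until ALT is known.
        if self.alt is None:
            raise ValueError("populate ALT before AD or PL")
        return len(self.alt) + 1  # REF plus every ALT allele

    def expected_ad_len(self):
        # One allele depth per allele.
        return self._num_alleles()

    def expected_pl_len(self):
        # Diploid assumption: n alleles give n * (n + 1) / 2 genotypes.
        n = self._num_alleles()
        return n * (n + 1) // 2

    @staticmethod
    def _check_len(name, values, expected):
        if len(values) != expected:
            raise ValueError(
                "%s needs %d values, got %d" % (name, expected, len(values)))

    def set_ad(self, values):
        self._check_len("AD", values, self.expected_ad_len())
        self.ad = list(values)

    def set_pl(self, values):
        self._check_len("PL", values, self.expected_pl_len())
        self.pl = list(values)


line = CSVLineSketch()
line.set_alt(["A", "T"])                  # two ALT alleles, three alleles in total
line.set_ad([10, 5, 0])                   # 3 values: REF, A, T
line.set_pl([0, 30, 60, 90, 120, 150])    # 6 values: one per diploid genotype
```

Calling `set_ad` or `set_pl` before `set_alt` raises an error in this sketch, which mirrors why the ALT field has to be populated first.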
class `File2Tile` provides the key data structure and functionality required to build a conversion script. It uses the `ConfigReader` described below.
class `ConfigReader` takes the master configuration file that details the minimum mapping required to build a CSV file. These configuration files can be found here. The fields of the configuration file are described below:
Defines the HG19 assembly: the length of each chromosome, the order in which the chromosomes of the assembly are placed along the Tile DB horizontal dimension, and the offset factor that defines the padding between chromosomes. `translate.py` uses this information to compute the column numbers for Tile DB.
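As a rough illustration of how chromosome lengths, ordering, and an offset factor can be combined into column numbers, consider the sketch below. It is not the code in `translate.py`. The function names are hypothetical, the lengths are toy values rather than real HG19 lengths, and two details are assumptions: that each chromosome reserves `ceil(length * offset_factor)` columns, and that positions are treated as 0-based offsets within a chromosome.

```python
import math


def chromosome_offsets(lengths, order, offset_factor):
    """Return {chromosome: first column} along the horizontal dimension.

    Assumption: a chromosome reserves ceil(length * offset_factor) columns,
    so an offset_factor above 1 leaves padding before the next chromosome.
    """
    offsets = {}
    next_start = 0
    for chrom in order:
        offsets[chrom] = next_start
        next_start += math.ceil(lengths[chrom] * offset_factor)
    return offsets


def to_column(offsets, chrom, position):
    """Map a (chromosome, position) pair to a single column number."""
    return offsets[chrom] + position


# Toy assembly; real chromosome lengths are far larger.
toy_lengths = {"1": 1000, "2": 500, "X": 800}
toy_order = ["1", "2", "X"]

offsets = chromosome_offsets(toy_lengths, toy_order, offset_factor=1.5)
column = to_column(offsets, "2", 10)
```

With these toy values, chromosome `2` starts at column 1500 and chromosome `X` at column 2250, so position 10 on chromosome `2` maps to column 1510.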
This file defines the Reference and ReferenceSet in MetaDB.
The input translation layer is the custom script layer that understands the nuances of the input data set and passes them on to the common translation layer to generate the CSV file. The ICGC data set is used as an example input to show how an input translation layer is scripted. You can find the code at https://github.com/Intel-HLS/GenomicsSampleAPIs/tree/master/utils/.
###maf_importer.py
Conversion Script. See details in the Usage Notes below.
A few things to keep in mind:
- Connects to a database instance to register new entities and assign TileDB rows.
- Include all the files that need to be loaded into Tile DB in a single conversion instead of converting them piece-wise, because Tile DB expects a consistent set of sample IDs in the input CSV file.
- Produces the CallSet Map and VID map file required for GenomicsDB loading.
###maf_pyspark.py
Conversion script with the same functionality as `maf_importer.py`, but built on the Spark map-reduce hooks. Running it on a distributed Spark cluster reduces the conversion run time by orders of magnitude. The options are the same as for `maf_importer.py` through `import.py`, except that `-t` is not supported.
The input configuration expects at least a master configuration file that is described in Table. NOTE: ICGC data requires a variants config (JSON) file that describes the mapping of ICGC fields to variant names.
Syntax
    usage: maf2tile.py [-h] -c CONFIG -d OUTPUTDIR -i INPUTS [INPUTS ...] [-z]
                       [-s SPARK] [-o OUTPUT] [-a APPEND_CALLSETS] [-l LOADER]

    Convert MAF format to Tile DB CSV

    optional arguments:
      -h, --help            show this help message and exit
      -c CONFIG, --config CONFIG
                            input configuration file for MAF conversion
      -d OUTPUTDIR, --outputdir OUTPUTDIR
                            Output directory where the outputs need to be stored
      -i INPUTS [INPUTS ...], --inputs INPUTS [INPUTS ...]
                            List of input MAF files to convert
      -z, --gzipped         True/False indicating if the input file is a gzipped
                            file or not
      -s SPARK, --spark SPARK
                            Run as spark. Where SPARK is the spark-master URI
      -o OUTPUT, --output OUTPUT
                            output Tile DB CSV file (without the path) which will
                            be stored in the output directory. Required for spark.
      -a APPEND_CALLSETS, --append_callsets APPEND_CALLSETS
                            CallSet mapping file to append.
      -l LOADER, --loader LOADER
                            Loader JSON to load data into Tile DB.
See the usage section for examples.
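To make the option semantics concrete, here is a minimal `argparse` sketch that reproduces the interface shown in the help text. It is a reconstruction from that help output, not the script's actual source. Treating `-z` as a `store_true` flag is an assumption based on it taking no value in the usage line, and the file names in the example call are hypothetical.

```python
import argparse


def build_parser():
    """Rebuild the command-line interface described by the help text."""
    parser = argparse.ArgumentParser(
        prog="maf2tile.py",
        description="Convert MAF format to Tile DB CSV")
    # Shown without brackets in the usage line, so treated as required.
    parser.add_argument("-c", "--config", required=True,
                        help="input configuration file for MAF conversion")
    parser.add_argument("-d", "--outputdir", required=True,
                        help="Output directory where the outputs need to be stored")
    parser.add_argument("-i", "--inputs", nargs="+", required=True,
                        help="List of input MAF files to convert")
    # Assumption: a plain flag, since the usage line shows no value for -z.
    parser.add_argument("-z", "--gzipped", action="store_true",
                        help="indicates that the input files are gzipped")
    parser.add_argument("-s", "--spark",
                        help="Run as spark. Where SPARK is the spark-master URI")
    parser.add_argument("-o", "--output",
                        help="output Tile DB CSV file name. Required for spark.")
    parser.add_argument("-a", "--append_callsets",
                        help="CallSet mapping file to append.")
    parser.add_argument("-l", "--loader",
                        help="Loader JSON to load data into Tile DB.")
    return parser


# Hypothetical file names, used only to show how the options combine.
example_args = build_parser().parse_args(
    ["-c", "icgc_config.json", "-d", "out", "-i", "a.maf", "b.maf", "-z"])
```

In this sketch `example_args.inputs` holds both MAF files, which matches the earlier advice to pass every input file to a single conversion run.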