brandon-burciaga/PythonDataParsers

Biological Database Parsers

This directory contains Python scripts that parse various scientific databases for use on a work-related project.

All scripts written by Brandon Burciaga.

Database sources

NCBI Entrez Gene and Comparative Toxicogenomics Database (CTD)

  • refactor.py

    • Main module to run for CTD and NCBI Entrez Gene parsing

    • Run as: python refactor.py -p ~/path/to/top/dir -s source

      • source: either 'CTD' or 'NCBIEntrezGene'
    • Infile(s):

      • User-edited JSON data file specifying the files and attributes to parse for each source
      • NCBI Entrez Gene data (ftp://ftp.ncbi.nih.gov/gene/DATA/)
      • Comparative Toxicogenomics Database (http://ctdbase.org/)
    • Outfile(s):

      • Pipe-delimited files in the directory created by the createOutDirectory() function in general.py
      • See the individual source parsers for details on their outfiles
  • ncbientrezgene.py

    • Holds logic for processing NCBI Entrez Gene nodes and relationships in preparation for writing to outfiles
  • ctd.py

    • Class is a work in progress
    • Holds logic for processing Comparative Toxicogenomics Database nodes and relationships in preparation for writing to outfiles
  • parent.py

    • SourceClass is the parent class for each specific source class (NCBIEntrezGene, CTD)
    • Contains attributes passed in from the main() function in refactor.py
    • Provides class methods shared by the child classes for header fixing and generic TSV parsing
  • general.py

    • Contains general functions (creating the output directory, handling help flags, etc.) used by parent.py and refactor.py
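
The command-line usage and class layout described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's code: only the -p and -s flags, the two source names, SourceClass, and the child class names come from this README. The long option names, the constructor arguments, and the describe() and build_source() helpers are hypothetical.

```python
import argparse


class SourceClass:
    """Parent class holding attributes shared by every source parser."""

    def __init__(self, top_dir, source):
        # Attribute names are assumptions made for this sketch.
        self.top_dir = top_dir
        self.source = source

    def describe(self):
        # Hypothetical helper, not part of the original scripts.
        return "{} parser rooted at {}".format(self.source, self.top_dir)


class NCBIEntrezGene(SourceClass):
    """Child class for NCBI Entrez Gene (logic omitted in this sketch)."""


class CTD(SourceClass):
    """Child class for the Comparative Toxicogenomics Database (logic omitted)."""


# Map the value given to -s onto the class that handles it.
SOURCES = {"CTD": CTD, "NCBIEntrezGene": NCBIEntrezGene}


def parse_args(argv=None):
    """Parse the -p (top directory) and -s (source) flags."""
    parser = argparse.ArgumentParser(description="Parse a biological data source.")
    parser.add_argument("-p", "--path", required=True, help="top-level data directory")
    parser.add_argument("-s", "--source", required=True, choices=sorted(SOURCES))
    return parser.parse_args(argv)


def build_source(argv=None):
    """Instantiate the source class selected on the command line."""
    args = parse_args(argv)
    return SOURCES[args.source](args.path, args.source)
```

Passing an explicit argument list, for example `build_source(["-p", "/tmp/data", "-s", "CTD"])`, returns a CTD instance without touching sys.argv, which makes the dispatch easy to try out interactively.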

NCBI Taxonomy

  • taxonomyParser.py
    • Parses NCBI Taxonomy database files for Neo4j graph database node and relationship creation.
    • Outfiles are pipe-delimited .csv files; the first line is a header naming the columns used in the rest of the file.
    • See the howToRun() method for instructions on using this script.
    • Infile(s) [NCBI Taxonomy files in .dmp format, 2015 versions used]:
      • nodes.dmp, ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
      • names.dmp, ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
      • citations.dmp, ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
    • Outfile(s): taxNodeOut.csv, taxRelnOut.csv
    • Imports: taxonomyClasses.py
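
As a rough illustration of the input side, the sketch below splits rows in the NCBI Taxonomy .dmp layout, where fields are separated by a tab, a pipe, and a tab, and each row ends with a tab and a pipe. The function names and the choice to keep only the first three nodes.dmp columns (tax ID, parent tax ID, rank) are assumptions made for this example; taxonomyParser.py may do this differently.

```python
def split_dmp_line(line):
    """Split one NCBI Taxonomy .dmp row into a list of stripped fields."""
    line = line.rstrip("\n")
    # Each row is terminated by "\t|"; drop that terminator before splitting.
    if line.endswith("\t|"):
        line = line[:-2]
    # Fields are separated by "\t|\t".
    return [field.strip() for field in line.split("\t|\t")]


def node_rows(lines):
    """Yield (tax_id, parent_tax_id, rank) tuples from nodes.dmp-style lines."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = split_dmp_line(line)
        yield fields[0], fields[1], fields[2]


# Two hand-written rows in the nodes.dmp layout (truncated to five columns).
sample = [
    "1\t|\t1\t|\tno rank\t|\t\t|\t8\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\t\t|\t0\t|\n",
]
parsed = list(node_rows(sample))
```

Each tuple pairs a node with its parent, which is the information needed to emit one node record and one parent relationship per row.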

Therapeutic Target Database (TTD), Medical Subject Headings (MeSH)

National Agricultural Library (NAL)

  • nalParser.py
    • Parses the National Agricultural Library thesaurus for Neo4j graph database node and relationship creation.
    • Outfiles are pipe-delimited .csv files; the first line is a header naming the columns used in the rest of the file.
    • See the howToRun() method for instructions on using this script.
    • Infile(s) [NAL Thesaurus, 2015 versions used]:
    • Outfile(s): nalNodeOut.csv
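
The outfile layout described above (a header line followed by pipe-delimited rows) can be produced with Python's standard csv module. This is a minimal sketch, not the code in nalParser.py: the function name, column names, and sample rows are invented for illustration, and the real column set of nalNodeOut.csv is not documented here.

```python
import csv
import io


def write_pipe_delimited(handle, header, rows):
    """Write a header line and data rows to handle, separated by '|'."""
    writer = csv.writer(handle, delimiter="|", lineterminator="\n")
    writer.writerow(header)   # first line names the columns
    writer.writerows(rows)    # remaining lines follow that column order


# Hypothetical columns and rows, written to an in-memory buffer for the demo.
buffer = io.StringIO()
write_pipe_delimited(
    buffer,
    ["term_id", "label"],
    [["1", "alpha"], ["2", "beta"]],
)
text = buffer.getvalue()
```

To write a real file, pass a handle opened with `open(path, "w", newline="")` instead of the buffer. With the default quoting, csv.writer wraps any field that itself contains a pipe in double quotes, so the loader on the other side needs to expect that.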

Online Mendelian Inheritance in Man (OMIM)

  • omimParser.py
    • Parses Online Mendelian Inheritance in Man (OMIM) database files for Neo4j graph database node and relationship creation.
    • Outfiles are pipe-delimited .csv files; the first line is a header naming the columns used in the rest of the file.
    • See the howToRun() method for instructions on using this script.
    • Infile(s) [OMIM files, 2015 versions used]:
    • Outfile(s): mimDisorderNodeOut.csv, mimDisorderRenOut.csv, mimEntrezRelnOut.csv, mimGeneNodeOut.csv, mimGeneRelnOut.csv
    • Imports: omimClasses.py

Various Ontologies

About

Various biological data parsers used in Neo4j graph database ETL
