Some tools for building a Translator BigKG. This software project is experimental and unfinished.
There are two types of intended users for the stitch-proj software: someone who is
tasked with ingesting the
Babel concept identifier normalization database
into a local sqlite database (an "ingester") and someone developing an
application, such as a BigKG build system, that wants to programmatically query
a local Babel sqlite database for node normalization, etc. ("querier"). The
"ingester" type user will need to read this entire README document, in order to
be able to set up and run the ingest_babel.py program to carry out an ingest
of Babel into a local sqlite database. The "querier" type user can skip over the
sections of this document that discuss ingesting Babel, and focus on the
sections about downloading the pre-built Babel sqlite database from S3 and using
the local_babel.py python module that provides functions for querying the
local Babel sqlite database.
- ingest_babel.py: downloads and ingests the Babel concept identifier synonymization database into a local sqlite3 relational database
- local_babel.py: functions for querying the local Babel sqlite database
- row_counts.py: a script that prints out the row counts of the tables in the local Babel sqlite database
- CPython 3.12, which needs to be available in your path as python3.12, with the venv library installed and in the python path
- At least 32 GiB of system memory
- Sufficient disk space in whichever filesystem hosts your stitch-proj directory, which will depend on your use-case:
  - To build babel.sqlite, at least 600 GiB of free file system storage space (usage transiently spikes to ~522 GiB and then the final database size is ~181 GiB).
  - To use a local babel.sqlite in your application, 200 GiB of free system storage space to store the sqlite file.
- Linux or MacOS (this software has not been tested on Windows; see "Systems on which this software has been tested").
- If you want to download the pre-built Babel sqlite database file, you will need to have curl or wget installed.
- Optionally, you can install sqlite3_analyzer, if you want to obtain detailed database statistics (see instructions below in this page).
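A few of the requirements above can be checked from Python before going further. The following is only a minimal sketch: the function name check_prerequisites and the keys of the dictionary it returns are made up for this illustration and are not part of stitch-proj. It does not check memory or disk space.

```python
import importlib.util
import shutil
import sys


def check_prerequisites() -> dict[str, bool]:
    """Report which of the basic software requirements appear to be satisfied.

    Hypothetical helper for illustration only; not part of stitch-proj.
    """
    return {
        # The README asks for CPython 3.12 specifically.
        "python_3_12": sys.version_info[:2] == (3, 12),
        # The venv library must be importable by this interpreter.
        "venv_module": importlib.util.find_spec("venv") is not None,
        # Only needed for downloading the pre-built database file.
        "curl_or_wget": shutil.which("curl") is not None
        or shutil.which("wget") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Run it with the same python3.12 interpreter that you intend to use for the virtualenv, since the version and venv checks apply to the interpreter that executes the script.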
The stitch-proj project's module ingest_babel.py has been tested in three compute environments:
- We have tested a full run of ingest_babel.py on this system (release babel-sqlite-20250331 and release babel-sqlite-20250817). This instance has instance name stitch2.rtx.ai and is in the us-west-1 AWS region.
- Ubuntu 24.04
- i4i.2xlarge instance (Intel Xeon 8375C processor, which is x86_64 architecture), 64 GiB of memory
- gp3 root volume (500 GiB)
- Nitro SSD volume (1.7 TiB)

- We have tested a full run of ingest_babel.py on this system (release babel-sqlite-20250123).
- Ubuntu 24.04
- c7g.4xlarge instance (Graviton3 processor, which is ARM64 architecture), 32 GiB of memory
- gp3 root volume (800 GiB)
- CPython, Numpy, and Pandas were compiled locally using gcc/g++ with the following CFLAGS: -mcpu=neoverse-v1 -mtune=neoverse-v1 -march=armv8.4-a+crypto -O3 -pipe
- To enable local compilation of CPython, Numpy, and Pandas, the following packages were apt installed: sqlite3, build-essential, gcc, g++, make, libffi-dev, libssl-dev, zlib1g-dev, libbz2-dev, libreadline-dev, libsqlite3-dev, libncursesw5-dev, tk-dev, libgdbm-dev, libnss3-dev, liblzma-dev, uuid-dev, python3-dev, gfortran, libopenblas-dev, liblapack-dev, libfreetype6-dev, libpng-dev, libjpeg-dev, libtiff-dev, libffi-dev, liblzma-dev, pkg-config, cmake, python3.12-venv.

- We have tested only partial ingests of Babel on this system type. For reasons I don't fully understand, ingest_babel.py runs quite fast on the M1 Max, compared to the Graviton3 processor. I've tested on the following MacOS system:
- MacOS 14.6.1
- Apple M1 Max processor, 64 GiB of memory
- Apple SSD AP2048R Media SSD (2 TiB)
- python3.12 installed via Homebrew
- openblas installed via Homebrew
All external PyPI distribution package requirements for the stitch-proj project are listed in the
requirements.txt file.
The run-checks.sh script (see section "Running
the type checks, lint checks, ..." below) depends on the packages pytest,
ruff, vulture, and pylint. For a "querying" type user that is just using
local_babel.py, only three PyPI distribution packages are needed, requests,
numpy, and the Biolink Model Toolkit (bmt). Additionally, for an "ingester"
type user who wants to run ingest_babel.py to build a local Babel sqlite
database from scratch, the PyPI packages pandas, ray, and
htmllistparse are needed. The requirements.txt file contains
the full set of dependencies.
You can just run
cd stitch-proj
./run-setup-venv.sh
Or if you are using AWS,
- ssh [email protected] (if running in AWS); else just create a new bash session
- git clone https://github.com/Translator-CATRAX/stitch-proj.git
- cd stitch-proj (this is the directory that contains requirements.txt)
- ./run-setup-venv.sh

The last step above (i.e., the pip3 install -e . step) sets up some symbolic links within your virtualenv, so that stitchutils can be imported without manipulating the PYTHONPATH, no matter what the current working directory is. You will need this in order for the unit test module tests/test_ingest_babel.py to run successfully.
- ssh [email protected] (if running in AWS); else just create a new bash session
- cd stitch-proj
- screen (to enter a screen session)
- ./instance-memory-tracker.sh
- ctrl-X D (to exit the screen session)
- screen (to enter a second screen session)
- ./run-ingest-aws.sh
- ctrl-X D (to exit the second screen session)
- tail -f ingest-babel.log (so you can watch progress)
- In another terminal session, watch memory usage using top
After approximately 37 hours, the ingest script should complete, leaving
the finished database as a file
/home/ubuntu/stitch-proj/babel.sqlite (see Requirements for the expected size).
The ingest_babel.py script (internally) turns off buffering for the stdout
and stderr streams, so that output logging information is seen immediately
in the logfile as soon as an update is "printed" by the python script.
This behavior cannot be overridden at the python3.12 command-line.
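For readers curious how a Python script can make its own output appear immediately, here is one common approach, shown as a sketch. It is not necessarily how ingest_babel.py does it, and the helper name enable_immediate_output is invented for this example.

```python
import io
import sys


def enable_immediate_output(stream) -> bool:
    """Switch a text stream to line-buffered, write-through mode.

    Returns False (and does nothing) if the stream does not support
    reconfigure(), as is the case for io.StringIO.
    Hypothetical helper for illustration only.
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        return False
    reconfigure(line_buffering=True, write_through=True)
    return True


if __name__ == "__main__":
    # Apply the setting to both standard streams, then log a line.
    enable_immediate_output(sys.stdout)
    enable_immediate_output(sys.stderr)
    print("this line reaches the logfile as soon as it is printed")
```

Because the setting is applied inside the script, it takes effect regardless of how the interpreter was started.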
If you prefer to run ingest_babel.py by invoking it directly from the
command-line (rather than by using the run-ingest-aws.sh script), that
can be done using the ingest-babel script that is set up in your virtualenv.
After setting up your virtualenv and installing stitch-proj using the pip3 install -e . command as shown above, you can run
venv/bin/ingest-babel COMMAND_LINE_ARGS
where COMMAND_LINE_ARGS represents the various command-line arguments you wish
to pass to the Babel ingest script, ingest_babel.py. Note, if you do this,
you will want to ensure that whatever location you specify (or, alternatively,
the default location you opt to leave in place) for the ingest_babel.py
temporary file directory will have at least 600 GiB of free space available
(although upon script completion, ingest_babel.py will not need any temp
directory space). In most cases, the easiest way to ensure this is to specify,
in calling ingest_babel.py, the location that you choose for a temporary file
directory using the --temp-dir command-line option, and further, to specify a
temporary file directory location that is in the same filesystem as the
location where you are configuring ingest_babel.py to output the Babel sqlite
file. This way, the space on the filesystem is "shared" between the temp
directory and the final output database. The run-ingest-aws.sh script takes
care of this, in an idempotent way, by creating a local temp dir and then
configuring ingest_babel.py to use that temp dir (and ensuring that the final
output Babel sqlite file goes into the same filesystem).
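If you invoke the ingest script directly, a quick pre-flight check of the two directories can catch space problems early. The sketch below assumes both directories already exist. The function names are hypothetical, and the 600 GiB default simply mirrors the figure given above.

```python
import os
import shutil

GIB = 1024 ** 3


def same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True if both existing paths are on the same device."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def free_gib(path: str) -> float:
    """Return the free space, in GiB, of the filesystem containing path."""
    return shutil.disk_usage(path).free / GIB


def check_ingest_dirs(temp_dir: str, output_dir: str,
                      needed_gib: float = 600.0) -> list[str]:
    """Return a list of human-readable problems (empty if none found).

    Hypothetical helper for illustration only; not part of stitch-proj.
    """
    problems = []
    if not same_filesystem(temp_dir, output_dir):
        problems.append("temp dir and output dir are on different filesystems")
    available = free_gib(temp_dir)
    if available < needed_gib:
        problems.append(
            f"only {available:.1f} GiB free in temp dir filesystem; "
            f"{needed_gib:.0f} GiB recommended"
        )
    return problems


if __name__ == "__main__":
    # Check the current directory as both temp dir and output dir.
    for problem in check_ingest_dirs(".", "."):
        print("WARNING:", problem)
```

When the two directories are on different filesystems, the free space of each one matters separately, so this check is only a starting point.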
babel-20250331.sqlite
(173 GiB) is available for download from AWS S3. For details and an MD5
checksum hash, see the [Releases
page](https://github.com/Translator-CATRAX/stitch-proj/releases) for the stitch
project. You will need to download (or, alternatively, build from scratch using
ingest_babel.py) this file in order to be able to run the unit test
suite.
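After downloading, you may want to compare the file's MD5 hash against the one published on the Releases page. A file of this size should be hashed in chunks rather than read into memory at once. This is a minimal sketch, and the function name md5_of_file is just illustrative:

```python
import hashlib
import sys


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 this_script.py path/to/downloaded-file.sqlite
    print(md5_of_file(sys.argv[1]))
```

Compare the printed value by eye (or with a string comparison) to the published checksum. Hashing a file of this size will take a while.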
This schema diagram was generated using DbVisualizer Free version 24.3.3.

In the cliques table, the combination of columns primary_identifier_id and
type_id are unique, as confirmed by this SQL query returning no rows:
sqlite> SELECT primary_identifier_id, type_id, COUNT(*) as count
...> FROM cliques
...> GROUP BY primary_identifier_id, type_id
...> HAVING COUNT(*) > 1 LIMIT 10;
In contrast, the column primary_identifier_id on the cliques table by itself
is not unique; there can be more than one clique with the same
primary_identifier_id and different type_id values. In theory, I should
probably add a two-column uniqueness constraint to the cliques table, but I
have not yet done so. See issue 16.
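The same uniqueness check can be run from Python with the standard sqlite3 module. The sketch below uses a tiny in-memory stand-in for the cliques table that has only the two columns involved. The real table has more columns, and the helper names here are invented for the example.

```python
import sqlite3

DUPLICATES_SQL = """
SELECT primary_identifier_id, type_id, COUNT(*) AS count
FROM cliques
GROUP BY primary_identifier_id, type_id
HAVING COUNT(*) > 1
LIMIT 10;
"""


def find_duplicate_cliques(conn: sqlite3.Connection) -> list[tuple]:
    """Return up to 10 (primary_identifier_id, type_id, count) duplicates."""
    return conn.execute(DUPLICATES_SQL).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a toy, two-column stand-in for the cliques table (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cliques (primary_identifier_id INTEGER, type_id INTEGER)"
    )
    # Same primary_identifier_id with different type_id values is allowed.
    conn.executemany(
        "INSERT INTO cliques VALUES (?, ?)", [(1, 10), (1, 11), (2, 10)]
    )
    return conn


if __name__ == "__main__":
    print(find_duplicate_cliques(demo_connection()))  # no duplicate pairs
```

To run the check against the real database, pass find_duplicate_cliques a connection opened on your local Babel sqlite file instead of the demo connection.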
If you are a developer looking to improve local_babel.py,
consider installing and compiling sqlite3_analyzer, which is available
from the sqlite software project area on GitHub.
On Ubuntu, you can just perform the following steps to have
sqlite3_analyzer available in /usr/local/bin:
cd stitch-proj
git clone https://github.com/sqlite/sqlite.git
cd sqlite
./configure --prefix=/usr/local
make sqlite3_analyzer
sudo cp sqlite3_analyzer /usr/local/bin
sudo chmod a+x /usr/local/bin/sqlite3_analyzer
On MacOS, you can just use Homebrew to install sqlite3_analyzer:
brew install sqlite-analyzer
which will install the program in /opt/homebrew/bin/sqlite3_analyzer.
One analyzes the database like this:
sqlite3_analyzer babel.sqlite > babel-sqlite-analysis.txt
The analysis should take less than an hour.
For now, see the module tests/test_local_babel.py for examples.
First, download babel-20250331.sqlite from S3 as described above, and ensure
that in the top-level stitch-proj directory, there is a symbolic link db or
a subdirectory db such that if the current working directory is the top-level
stitch-proj directory, the relative path db/babel-20250331.sqlite can open
the database file. Something like this should do it:
cd stitch-proj
mkdir -p db
curl -s -L https://rtx-kg2-public.s3.us-west-2.amazonaws.com/babel-20250331.sqlite > \
db/babel-20250331.sqlite
These checks should be run before any commit:
cd stitch-proj
./run-checks.sh
which will run type checks (using mypy), lint checks (using ruff),
dead code tests (using vulture), and unit tests (using pytest).
Note that some of the unit tests require Internet connectivity; if
you do not have a working Internet connection, and if you run the unit
tests, you will see a runtime error like this:
E urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/urllib/request.py:1347: URLError
=========================================== short test summary info ============================================
FAILED tests/test_stitchutils.py::test_get_biolink_categories - urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
========================================= 1 failed, 17 passed in 1.96s =========================================
First, you need to make sure that underneath the top-level
"stitch-proj" directory, there is a subdirectory "db" containing
the babel-20250331.sqlite file (see section
"Downloading a pre-built Babel sqlite database file").
Then you can run the unit test suite, like this:
cd stitch-proj
venv/bin/pytest -v
Note that you should not try to run the unit tests like this:
cd stitch-proj/tests
../venv/bin/pytest -v
because if you do it that way, the test_local_babel.py module
won't be able to find the sqlite database that it depends on, and
you will get a large number of errors from that unit test module.
Running all three integration tests of ingest_babel.py
may take up to an hour (and will require a fast Internet connection,
since the integration tests ingest various Babel compendia and
conflation files, which they load remotely via HTTPS). To run the
tests:
cd stitch-proj
./run-integration-tests.sh
Use the ingest_babel.py script to generate the ddl.sql file as follows:
cd stitch-proj
venv/bin/python3 stitch/ingest_babel.py --print-ddl --dry-run 2>ddl.sql
On macOS, run the DbVisualizer application (free version
24.3.3). Under the "File" menu select "Open File...", then navigate to the new
ddl.sql file. In the treeview control under "SQLite" on the left, open
"Schema" and click on "Tables". In the "Tables" view in the main application
pane, click on the "References" tab. Use the macOS system screen-capture tool to
obtain a PNG of the schema diagram.
Run these steps:
cd stitch-proj
venv/bin/python3 stitch/row_counts.py babel.sqlite
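For a rough idea of what counting rows per table involves, here is a standalone sketch using the standard sqlite3 module. It is not the implementation of row_counts.py. The function name table_row_counts is hypothetical, and the demo uses made-up table names in an in-memory database.

```python
import sqlite3


def table_row_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        # Skip sqlite's own internal bookkeeping tables.
        if not row[0].startswith("sqlite_")
    ]
    counts = {}
    for name in names:
        # Table names cannot be bound as parameters, so quote the identifier.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


if __name__ == "__main__":
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE alpha (x INTEGER)")
    demo.execute("CREATE TABLE beta (y INTEGER)")
    demo.executemany("INSERT INTO alpha VALUES (?)", [(1,), (2,), (3,)])
    for table, count in table_row_counts(demo).items():
        print(f"{table}: {count}")
```

On a database the size of babel.sqlite, each COUNT(*) may require a scan of a table or index, so expect this to be slow.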
Then, every time you start the instance:
sudo mkdir -p /mnt/localssd
sudo lsblk
The last command (sudo lsblk) should provide the name of the 1.7 TiB local SSD device,
like /dev/nvme1n1. Use that in place of "/dev/nvme1n1" below. Continuing with the commands
that you should perform every time you start the instance:
sudo mkfs.ext4 /dev/nvme1n1
sudo mount /dev/nvme1n1 /mnt/localssd
sudo chown ubuntu:ubuntu /mnt/localssd
mkdir -p /mnt/localssd/stitch-proj
And if it is the first time you are setting up the instance, you should do this step:
ln -s /mnt/localssd/stitch-proj /home/ubuntu/stitch-proj
(but that symbolic link will persist even when you stop and then start the instance).
Like this:
stat -c %s babel-20250817.sqlite | awk '{printf "%.2f GiB\n", $1/1024/1024/1024}'
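The stat -c form above is the GNU coreutils syntax. If you want something that behaves the same on Linux and MacOS, the calculation is easy to reproduce in Python. This is a small sketch with illustrative function names:

```python
import os
import sys

GIB = 1024 ** 3


def format_gib(num_bytes: int) -> str:
    """Format a byte count as GiB with two decimal places."""
    return f"{num_bytes / GIB:.2f} GiB"


def file_size_gib(path: str) -> str:
    """Return the size of the file at path, formatted in GiB."""
    return format_gib(os.path.getsize(path))


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 this_script.py babel-20250817.sqlite
    print(file_size_gib(sys.argv[1]))
```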
Please see the Babel CITATION.cff file.