translator-ingests

NCATS Translator Data Ingests

This software repository is part of the Biomedical Data Translator Consortium's Performance Phase 3 effort at biomedical knowledge integration, under the auspices of the Data INGest and Operations (“DINGO”) Working Group. The repository aggregates and coordinates the development of knowledge-specific and shared library software used for Translator data ingests from primary (mostly external “third party”) knowledge sources into so-called Translator “Tier 1” knowledge graph(s). The software is written primarily in Python.

A general discussion of the Translator Data Ingest architecture is provided here.

Technical Prerequisites

The project uses the uv Python package and project manager. You will need to install uv on your system, along with a suitable Python (Release 3.12) interpreter.

The project initially (mid-June 2025) uses a conventional Unix-style Makefile to execute tasks. For this reason, working within a command line interface terminal is recommended, on macOS, Ubuntu, or Windows WSL2 (with Ubuntu). See the Developers’ README for tips on configuring your development environment.
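
As a quick sanity check of the prerequisites above, the following minimal sketch reports whether the running interpreter is Python 3.12 and whether the `uv` and `make` executables can be found on the PATH. The function name `check_prerequisites` is hypothetical and not part of this repository, and treating "Release 3.12" as an exact minor-version match is an assumption. Note also that uv can manage its own Python interpreters, so a mismatch reported here is a hint rather than a hard failure.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 12)        # "Release 3.12" from the prerequisites (assumed exact minor version)
REQUIRED_TOOLS = ("uv", "make")  # executables expected on PATH


def check_prerequisites(version=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed.

    `version` and `which` are injectable only to make the function easy to test.
    """
    version = tuple(version or sys.version_info[:2])
    problems = []
    if version[:2] != REQUIRED_PYTHON:
        found = ".".join(str(part) for part in version[:2])
        wanted = ".".join(str(part) for part in REQUIRED_PYTHON)
        problems.append(f"Python {wanted} expected, found {found}")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Running the script prints one warning per missing prerequisite and nothing at all when every check passes.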

Ingest Processes and Artifacts

To ensure that ingests are performed rigorously, consistently, and reproducibly, we have defined a Standard Operating Procedure (SOP) to guide the source ingest process.

The SOP is initially tailored to guide re-ingest of current sources to create a “functional replacement” of the Phase 2 system - but it can be adapted to guide ingest of new sources as well.

Below are descriptions and links for the various artifacts prescribed by the SOP.

  1. Ingest Assignment Tables: Records owner and contributor assignments for each ingest. (Sheet 1) (Sheet 2)
  2. Source Ingest Tickets: Tracks contributor questions and discussions about the ingest. (Tickets) (Project Board) (CTD Example)
  3. Ingest Surveys: Describe current ingests of a source from Phase 2 to facilitate comparison and alignment. (Directory) (CTD Example)
  4. Reference Ingest Guides (RIGs): Document scope, content, and modeling decisions for an ingest. (Template) (Instructions) (CTD Example)
  5. Ingest Code: Python code used to execute an ingest as described in a RIG. (Directory) (CTD Example)
  6. KGX Files: The final knowledge graphs and ingest metadata that are produced by ingest code. (CTD Example - TO DO)
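
To give a feel for the KGX Files artifact, here is a minimal sketch of KGX-style JSON Lines output: one JSON object per line, with nodes carrying an `id` and `category`, and edges carrying a `subject`, `predicate`, and `object`. The identifiers (`EXAMPLE:...`), the names, and the helper functions below are illustrative placeholders, not actual CTD content or code from this repository. The sketch uses only the Python standard library rather than the KGX or Koza APIs.

```python
import json
import tempfile
from pathlib import Path

# Illustrative placeholder records; real ingests emit curated identifiers.
EXAMPLE_NODES = [
    {"id": "EXAMPLE:chem1", "name": "example chemical", "category": ["biolink:ChemicalEntity"]},
    {"id": "EXAMPLE:disease1", "name": "example disease", "category": ["biolink:Disease"]},
]
EXAMPLE_EDGES = [
    {
        "subject": "EXAMPLE:chem1",
        "predicate": "biolink:related_to",
        "object": "EXAMPLE:disease1",
    },
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def dangling_edges(nodes, edges):
    """Return edges whose subject or object is missing from the node list."""
    node_ids = {node["id"] for node in nodes}
    return [
        edge for edge in edges
        if edge["subject"] not in node_ids or edge["object"] not in node_ids
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        nodes_path = Path(tmp) / "example_nodes.jsonl"
        edges_path = Path(tmp) / "example_edges.jsonl"
        write_jsonl(nodes_path, EXAMPLE_NODES)
        write_jsonl(edges_path, EXAMPLE_EDGES)
        nodes, edges = read_jsonl(nodes_path), read_jsonl(edges_path)
        print(len(nodes), "nodes,", len(edges), "edges,",
              len(dangling_edges(nodes, edges)), "dangling edges")
```

A check like `dangling_edges` is one simple way to confirm that every edge in an output file refers to nodes that were also written.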

Initial Minimal Viable Product: A CTD Example

Here, we apply a Koza transform to data from the Comparative Toxicology Database (CTD), writing the knowledge graph output to JSON Lines (jsonl) files. The project is built and executed using a conventional (Unix-like) Makefile:

│ Usage:
│     make <target>
│
│ Targets:
│     help                Print this help message
│ 
│     all                 Install everything and test
│     fresh               Clean and install everything
│     clean               Clean up build artifacts
│     clobber             Clean up generated files
│
│     install             install python requirements
│     download            Download data
│     run                 Run the transform
│
│     test                Run all tests
│
│     lint                Lint all code
│     format              Format all code
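
The targets listed above can also be driven from a script, for example in automation. The sketch below is a hypothetical helper, not part of this repository: it validates target names against the help listing and then invokes `make` once per target. The order `install`, `download`, `run` used in the example is an assumption about a sensible sequence, not a documented requirement.

```python
import subprocess

# Target names copied from the Makefile help listing above.
KNOWN_TARGETS = {
    "help", "all", "fresh", "clean", "clobber",
    "install", "download", "run", "test", "lint", "format",
}


def make_commands(targets):
    """Translate target names into argument lists, rejecting unknown targets."""
    unknown = [target for target in targets if target not in KNOWN_TARGETS]
    if unknown:
        raise ValueError(f"Unknown make target(s): {', '.join(unknown)}")
    return [["make", target] for target in targets]


def run_targets(targets, dry_run=False):
    """Run each target in order, stopping at the first failure.

    With dry_run=True nothing is executed; the commands are only returned.
    """
    commands = make_commands(targets)
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if make exits non-zero.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Assumed order: install dependencies, fetch data, then run the transform.
    for command in run_targets(["install", "download", "run"], dry_run=True):
        print(" ".join(command))
```

Setting `dry_run=False` executes the commands, which requires running the script from the repository root where the Makefile lives.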

The task involves the following steps/components: