
NCATS Translator Data Ingests

This software repository forms an integral part of the Biomedical Data Translator Consortium's Performance Phase 3 efforts at biomedical knowledge integration, under the auspices of the Data INGest and Operations ("DINGO") Working Group. The repository aggregates and coordinates the development of knowledge-specific and shared library software used for Translator data ingests from primary (mostly external "third party") knowledge sources into so-called Translator "Tier 1" knowledge graph(s). This software is primarily coded in Python.

A general discussion of the Translator Data Ingest architecture is provided here.

Technical Prerequisites

The project uses the uv Python package and project manager. You will need to install uv on your system, along with a suitable Python interpreter (release 3.12).

The project initially (mid-June 2025) uses a conventional unix-style Makefile to execute tasks. For this reason, working within a command-line terminal on MacOSX, Ubuntu, or Windows WSL2 (with Ubuntu) is recommended. See the Developers' README for tips on configuring your development environment.
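If you are new to uv, project setup may look something like the following (these are standard uv commands, but the exact workflow for this repository may differ, so defer to the Developers' README):

│ uv python install 3.12   # install a suitable Python interpreter
│ uv sync                  # create the project environment and install dependencies
│ uv run make help         # run project tasks inside the uv-managed environment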

Ingest Processes and Artifacts

To ensure that ingests are performed rigorously, consistently, and reproducibly, we have defined a Standard Operating Procedure (SOP) to guide the source ingest process.

The SOP is initially tailored to guide re-ingest of current sources to create a "functional replacement" of the Phase 2 knowledge provider sources, but it can be adapted to guide the ingest tasks of new sources as well.

Follow the steps and use/generate the artifacts described below to perform a source ingest according to the standard operating procedure.

  1. Ingest Assignment and Tracking (required): Record owner/contributor assignments and track status for each ingest. (ingest list)
  2. Ingest Surveys (as needed): Describe past ingests of a source to facilitate comparison and alignment (useful when there are multiple prior ingests). (directory) (ctd example)
  3. Resource Ingest Guides (RIGs) (required): Document scope, content, and modeling decisions for an ingest task, in a computable yaml file format. (yaml schema) (yaml template) (yaml example) (derived markdown example) (full rig catalog). For KGX passthrough ingests, if a meta_knowledge_graph.json file is available, RIGs can be partially populated, after creation, with node and edge target_info data using the mkg_to_rig.py script. If global additions or deletions of RIG property tag-values are needed, the annotate_rig.py script may be used.
  4. Source Ingest Tickets (as needed): If content or modeling questions arise, create a source ingest ticket in the DINGO repo. (ingest issues)
  5. Ingest Code and Tests (required): Author ingest code / artifacts following the RIG spec, along with unit tests, using the shared python code base (a minimal sketch follows this list). (ingest code) (code template) (code example) (unit tests) (unit test template)
  6. KGX Files (required): Execute ingest code and normalization services to generate normalized knowledge graphs and ingest metadata artifacts. (ctd example - TO DO)
  7. KGX Summary Reports (under development): Automated scripts generate reports that summarize the content of KGX ingest files, to facilitate manual QA/debugging, and provide documentation of KG content and modeling. (ctd example - TO DO)
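To make step 5 concrete, the following is a minimal, hypothetical sketch of the kind of transform function an ingest might define, assuming biolink-model's generated Pydantic classes; the module path, class names, predicate, and field values shown here are illustrative assumptions, so consult the shared code base and code template for the actual conventions:

│ # Hypothetical transform sketch; module paths and names are assumptions,
│ # not the repository's actual API -- see the code template for specifics.
│ import uuid
│
│ from biolink_model.datamodel.pydanticmodel_v2 import (  # path is an assumption
│     ChemicalEntity,
│     ChemicalToDiseaseOrPhenotypicFeatureAssociation,
│     Disease,
│ )
│
│ def transform_record(record: dict):
│     """Turn one source record into Biolink-compliant nodes and an edge."""
│     chemical = ChemicalEntity(id=f"MESH:{record['ChemicalID']}",
│                               name=record["ChemicalName"])
│     disease = Disease(id=record["DiseaseID"], name=record["DiseaseName"])
│     association = ChemicalToDiseaseOrPhenotypicFeatureAssociation(
│         id=str(uuid.uuid4()),
│         subject=chemical.id,
│         predicate="biolink:associated_with",  # predicate choice comes from the RIG
│         object=disease.id,
│         primary_knowledge_source="infores:ctd",
│         knowledge_level="knowledge_assertion",
│         agent_type="manual_agent",
│     )
│     return [chemical, disease], [association]

Note that this function only constructs model objects; as described in the notes below, serializing them to KGX jsonl files is the pipeline's job, not the parser's.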

Additional Notes

  • Populate the ingest-specific download.yaml file that describes the input data of the knowledge source (ingest template example).
  • Write the configuration file that describes the source and the transform to be applied. (directory) (ingest template example)
  • Write the Python script that executes the ingest task as described in its RIG and passes the unit tests that were written. (directory) (ingest template example)
  • Write unit tests with mock (but realistic) data to illustrate how input records from a specified source are transformed into knowledge graph nodes and edges (a sketch follows this list). See the (unit ingest tests directory) for some examples, and the (ingest template example), which highlights generic utility code available to fast-track the development of such ingest unit tests.
  • Ingest Code parsers are generally written to generate their knowledge graphs (nodes and edges) using a Biolink Model-constrained Pydantic model (the exception being a 'pass-through' KGX file processor, which bypasses the Pydantic model).
  • Use of the Pydantic model is recommended since it provides a standardized way to validate and transform input data.
  • The Translator Ingest pipeline converts the Koza KnowledgeGraph objects output by a parser into KGX node and edge (jsonl) file content (that is, the Ingest Code does not write the KGX files directly, nor does it need to worry about doing so).
  • That said, the KGX ingest metadata needs to be generated separately using the ingest metadata schema which has a Python implementation.
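As a companion to the transform sketch above, a unit test over mock data might look like the following (again hypothetical: transform_record is the illustrative helper from that sketch, and the repository's unit test template shows the real pattern):

│ # Hypothetical pytest sketch using mock (but realistic) CTD-like data;
│ # transform_record is the illustrative helper sketched earlier.
│ def test_chemical_to_disease_transform():
│     record = {
│         "ChemicalID": "D003042",
│         "ChemicalName": "Copper",
│         "DiseaseID": "MESH:D006973",
│         "DiseaseName": "Hypertension",
│     }
│     nodes, edges = transform_record(record)
│     assert {node.id for node in nodes} == {"MESH:D003042", "MESH:D006973"}
│     edge = edges[0]
│     assert edge.subject == "MESH:D003042"
│     assert edge.object == "MESH:D006973"
│     assert edge.predicate.startswith("biolink:")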

Initial Minimal Viable Product: A CTD Example

Here, we apply a Koza transform to data from the Comparative Toxicogenomics Database (CTD), writing the knowledge graph output to jsonlines (jsonl) files. The project is built and executed using targets of a conventional (unix-like) make command, operating on a Makefile in the repository.

Alternatively, a justfile provides functionally equivalent targets for the cross-platform just command runner. Install just, then type just help for usage.

│ Usage:
│     make <target>  # or just <target>
│
│ Targets:
│     help                Print this help message
│ 
│     all                 Install everything and test
│     fresh               Clean and install everything
│     clean               Clean up build artifacts
│     clobber             Clean up generated files
│
│     install             install python requirements
│     download            Download data
│     run                 Run the transform
│
│     test                Run all tests
│
│     lint                Lint all code
│     format              Format all code

The task involves the following steps/components: