SemMedDB Minimal Reference Ingest Guide (RIG)

Source Information

InfoRes ID: infores:semmeddb

Description: Literature-derived semantic predications extracted by SemRep from PubMed; this RIG documents a post-processed RTX-KG2 edges subset rather than a raw MySQL ingest.

Data Access Locations:
- SemMedDB (original NLM source behind SemRep; not the file this ingest reads directly): https://lhncbc.nlm.nih.gov/temp/SemRep_SemMedDB_SKR/SemMedDB_download.html
- RTX-KG2 public SemMedDB edges slice (gzipped JSONL; this is what translator-ingests downloads): https://rtx-kg2-public.s3.us-west-2.amazonaws.com/kg2.10.3-semmeddb-edges.jsonl.gz

RTX-KG2 has already projected SemMedDB predications into Biolink-shaped edges (predicate, publications, publications_info, kg2_ids, optional qualifiers, domain_range_exclusion). This pipeline transforms that slice with Koza; it does not ingest raw SemMedDB MySQL dumps.
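
For quick inspection of the downloaded slice outside of Koza, the gzipped JSONL can be streamed one record per line. The sketch below is illustrative only: `iter_jsonl_gz` and `count_predicates` are hypothetical helper names, and the code assumes each non-empty line is one JSON object with a `predicate` field, as listed above.

```python
import gzip
import json
from collections import Counter


def iter_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_predicates(path):
    """Tally the `predicate` field across all edges in the file."""
    return Counter(edge.get("predicate") for edge in iter_jsonl_gz(path))
```

Calling `count_predicates("kg2.10.3-semmeddb-edges.jsonl.gz")` on a locally downloaded copy would give a per-predicate edge count without decompressing the file to disk first.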

Data Provision Mechanisms: file_download

Data Formats: jsonl

Data Versioning and Releases: Versioning is inherited from the RTX-KG2 snapshot (kg2.10.3 in the download file name), which was built from SemMedDB VER43-era releases.

Ingest Information

Ingest Categories: primary_knowledge_provider

Utility: Large-scale literature-derived relationships suitable for hypothesis generation and cross-graph enrichment with evidence-bearing edges.

Scope: Edges taken directly from RTX-KG2's SemMedDB edges slice (edges.jsonl), restricted to a selected predicate set and evidence-bearing records. Nodes are derived from the endpoints of the included edges rather than read from a separate nodes file.

Relevant Files

| File Name | Location | Description |
| --- | --- | --- |
| kg2.10.3-semmeddb-edges.jsonl (RTX-KG2 SemMedDB slice) | https://rtx-kg2-public.s3.us-west-2.amazonaws.com/kg2.10.3-semmeddb-edges.jsonl.gz (downloaded via semmeddb/download.yaml; decompressed JSONL is the Koza reader input) | KGX-compatible edges with Biolink predicates, literature evidence, kg2_ids, and optional qualifiers. |
| kg2.10.3-semmeddb-nodes.jsonl (conceptual / transform output) | Not a separate download for this ingest; subject/object nodes are emitted by the transform from edge endpoints (PREFIX_TO_CLASS). A full KG2 graph build may ship a companion nodes file elsewhere on rtx-kg2-public. | Nodes correspond to CURIEs appearing on included edges after typing by prefix. |

Included Content

| File Name | Included Records | Fields Used |
| --- | --- | --- |
| kg2.10.3-semmeddb-edges.jsonl | Edges whose Biolink predicate is in the selected set (20 predicates): biolink:affects, biolink:causes, biolink:close_match, biolink:coexists_with, biolink:derives_from, biolink:diagnoses, biolink:disrupts, biolink:exacerbates_condition, biolink:has_input, biolink:has_part, biolink:interacts_with, biolink:located_in, biolink:manifestation_of, biolink:occurs_in, biolink:precedes, biolink:predisposes_to_condition, biolink:preventative_for_condition, biolink:produces, biolink:related_to, biolink:treats_or_applied_or_studied_to_treat. These are mapped by KG2 from the original SEMMEDDB predicates: ADMINISTERED_TO, AFFECTS, ASSOCIATED_WITH, AUGMENTS, CAUSES, COEXISTS_WITH, COMPLICATES, CONVERTS_TO, DIAGNOSES, DISRUPTS, INHIBITS, INTERACTS_WITH, LOCATION_OF (inverted), MANIFESTATION_OF, OCCURS_IN, PART_OF (inverted), PRECEDES, PREDISPOSES, PREVENTS, PROCESS_OF, PRODUCES, SAME_AS, STIMULATES, TREATS, USES, XREF. During transform, biolink:preventative_for_condition is remapped to biolink:treats_or_applied_or_studied_to_treat. | subject, predicate, object, publications, publications_info, domain_range_exclusion, kg2_ids, qualified_predicate, qualified_object_aspect, qualified_object_direction |
| kg2.10.3-semmeddb-nodes.jsonl | Only nodes referenced by included edges. | id, name, category, xrefs (as available) |
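
The predicate selection and the single remap described above can be sketched as a set lookup followed by a dictionary substitution. This is a minimal illustration rather than the ingest's actual code: `SELECTED_PREDICATES`, `PREDICATE_REMAP`, and `select_predicate` are hypothetical names; only the predicate strings themselves come from this guide.

```python
# The 20 Biolink predicates listed in the Included Content table.
SELECTED_PREDICATES = frozenset({
    "biolink:affects",
    "biolink:causes",
    "biolink:close_match",
    "biolink:coexists_with",
    "biolink:derives_from",
    "biolink:diagnoses",
    "biolink:disrupts",
    "biolink:exacerbates_condition",
    "biolink:has_input",
    "biolink:has_part",
    "biolink:interacts_with",
    "biolink:located_in",
    "biolink:manifestation_of",
    "biolink:occurs_in",
    "biolink:precedes",
    "biolink:predisposes_to_condition",
    "biolink:preventative_for_condition",
    "biolink:produces",
    "biolink:related_to",
    "biolink:treats_or_applied_or_studied_to_treat",
})

# The one remap this guide documents as applied during transform.
PREDICATE_REMAP = {
    "biolink:preventative_for_condition":
        "biolink:treats_or_applied_or_studied_to_treat",
}


def select_predicate(predicate):
    """Return the output predicate, or None if the edge is out of scope."""
    if predicate not in SELECTED_PREDICATES:
        return None
    return PREDICATE_REMAP.get(predicate, predicate)
```

Because the remap happens after selection, the output graph would carry at most 19 distinct predicates under this sketch, with biolink:preventative_for_condition edges folded into biolink:treats_or_applied_or_studied_to_treat.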

Filtered Content

| File Name | Filtered Records | Rationale |
| --- | --- | --- |
| kg2.10.3-semmeddb-edges.jsonl | Edges whose original SemMedDB predicate (extracted from kg2_ids) is in the BTE-excluded set: COMPARED_WITH, ISA, MEASURES, HIGHER_THAN, LOWER_THAN. All five map to biolink:related_to in KG2 but are removed because BTE considers them non-informative for knowledge graph use. | Align with BTE filtering of low-value predicates hidden inside biolink:related_to. |
| kg2.10.3-semmeddb-edges.jsonl | Predicates dropped by the original RTX-KG2 ingest: SEMMEDDB:MEASUREMENT_OF, SEMMEDDB:METHOD_OF, SEMMEDDB:NOM, SEMMEDDB:PREP, SEMMEDDB:VERB. | These predicates are removed upstream by KG2 before we receive the data. |
| kg2.10.3-semmeddb-edges.jsonl | Edges with 3 or fewer publications. | Ensure each edge has meaningful literature support; text-mining extractions backed by only one or a few publications are noisy. |
| kg2.10.3-semmeddb-edges.jsonl | Edges where domain_range_exclusion is True. | Remove semantically incoherent subject/object type combinations flagged by KG2. |
| kg2.10.3-semmeddb-edges.jsonl | For edges with more than 200 publications, the publication list is trimmed to the union of the top 100 by confidence score (the minimum of the subject and object scores) and the top 100 by recency (publication year), yielding at most 200 PMIDs per edge. This only affects rare extreme cases (some edges have 60k+ PMIDs). Trimming is disabled when the environment variable SEMMEDDB_UNCAPPED=1 is set. | Prevent oversized edges while retaining the most informative publications under two complementary rankings (confidence and recency). Configurable for uncapped builds. |
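
The downstream filters and the publication cap in the table above can be sketched as follows. This is an illustration under stated assumptions, not the ingest's implementation: all function and constant names are hypothetical; each `kg2_ids` entry is assumed to embed a `SEMMEDDB:<PREDICATE>` token; `publications_info` is assumed to be a dict keyed by PMID whose values carry `subject score`, `object score`, and `publication date` keys; and an edge is dropped if any of its extracted predicates is excluded, which the guide does not specify for mixed cases.

```python
import os
import re

EXCLUDED_SEMMEDDB_PREDICATES = {
    "COMPARED_WITH", "ISA", "MEASURES", "HIGHER_THAN", "LOWER_THAN",
}
MIN_PUBLICATIONS = 4   # edges with 3 or fewer publications are dropped
PUBLICATION_CAP = 200  # trimming applies only above this count
TOP_N = 100            # size of each ranked selection

# Assumed kg2_ids shape: "...---SEMMEDDB:PREDICATE---..."
_SEMMEDDB_TOKEN = re.compile(r"SEMMEDDB:([A-Za-z_]+)")


def original_predicates(kg2_ids):
    """Collect upper-cased SemMedDB predicate names found in kg2_ids."""
    found = set()
    for kg2_id in kg2_ids:
        found.update(name.upper() for name in _SEMMEDDB_TOKEN.findall(kg2_id))
    return found


def keep_edge(edge):
    """Apply the three edge-level filters; True means the edge survives."""
    if edge.get("domain_range_exclusion"):  # assumed to be a JSON boolean
        return False
    if len(edge.get("publications", [])) < MIN_PUBLICATIONS:
        return False
    if original_predicates(edge.get("kg2_ids", [])) & EXCLUDED_SEMMEDDB_PREDICATES:
        return False
    return True


def _confidence(info):
    # Confidence is the minimum of the subject and object scores.
    return min(info.get("subject score", 0), info.get("object score", 0))


def _year(info):
    # Take the first four-digit run in the date string; 0 if none is present.
    match = re.search(r"\d{4}", str(info.get("publication date", "")))
    return int(match.group()) if match else 0


def trim_publications(publications, publications_info, env=os.environ):
    """Keep the union of top-100 by confidence and top-100 by recency."""
    if env.get("SEMMEDDB_UNCAPPED") == "1" or len(publications) <= PUBLICATION_CAP:
        return list(publications)
    by_confidence = sorted(
        publications,
        key=lambda p: _confidence(publications_info.get(p, {})),
        reverse=True,
    )[:TOP_N]
    by_recency = sorted(
        publications,
        key=lambda p: _year(publications_info.get(p, {})),
        reverse=True,
    )[:TOP_N]
    selected = set(by_confidence) | set(by_recency)
    # Preserve the original ordering of the surviving PMIDs.
    return [p for p in publications if p in selected]
```

Taking a union rather than one combined ranking means a PMID that ranks highly on both criteria is counted once, so the trimmed list holds between 100 and 200 PMIDs.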

Target Information

Target InfoRes ID: infores:translator-semmeddb-kgx

Edge Types

| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
| --- | --- | --- | --- | --- | --- |
| | | | not_provided | text_mining_agent | SemMedDB data are generated by an automated text-mining agent that extracts relationships from biomedical literature. The record represents text evidence that [SUBJECT] and [OBJECT] were reported with [RELATIONSHIP] in the source literature. |
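
As a rough illustration of how the knowledge level and agent type above might be attached to each output edge, the sketch below stamps them onto a copy of an edge record. The helper name `annotate_edge` and the flat `primary_knowledge_source` key are assumptions about the output shape, not something this guide specifies; the values themselves come from this guide.

```python
def annotate_edge(edge):
    """Return a copy of the edge with SemMedDB provenance annotations."""
    annotated = dict(edge)  # shallow copy; the input record is not mutated
    annotated["knowledge_level"] = "not_provided"
    annotated["agent_type"] = "text_mining_agent"
    # Assumed flat KGX-style key; the InfoRes ID is from Source Information.
    annotated["primary_knowledge_source"] = "infores:semmeddb"
    return annotated
```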

Node Types

Node Category Source Identifier Types Additional Notes
biolink:SmallMolecule
biolink:Drug
biolink:ChemicalEntity
biolink:MolecularMixture
biolink:ChemicalMixture
biolink:Cell
biolink:Disease
biolink:ComplexMolecularMixture
biolink:PhenotypicFeature
biolink:BiologicalProcess
biolink:MolecularActivity
biolink:CellularComponent
biolink:Gene
biolink:OrganismTaxon
biolink:Protein
biolink:GrossAnatomicalStructure
biolink:AnatomicalEntity
biolink:Polypeptide
biolink:InformationContentEntity
biolink:Procedure
biolink:Behavior
biolink:Agent
biolink:Activity
biolink:Device
biolink:Cohort
biolink:PopulationOfIndividualOrganisms
biolink:Phenomenon
biolink:GenomicEntity
biolink:BiologicalEntity
biolink:Publication
biolink:NucleicAcidEntity
biolink:ClinicalAttribute
biolink:PhysicalEntity
biolink:DiseaseOrPhenotypicFeature
biolink:Human
biolink:Event
biolink:PhysiologicalProcess
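
Since nodes are emitted from edge endpoints and typed by CURIE prefix, the derivation can be sketched as a prefix lookup plus de-duplication. The mapping entries below are hypothetical placeholders, not the contents of the ingest's PREFIX_TO_CLASS; the fallback category and the helper names are likewise assumptions made for illustration.

```python
# Hypothetical entries; the real PREFIX_TO_CLASS mapping is not shown in this guide.
PREFIX_TO_CLASS = {
    "NCBIGene": "biolink:Gene",
    "MONDO": "biolink:Disease",
    "CHEBI": "biolink:ChemicalEntity",
}
FALLBACK_CATEGORY = "biolink:NamedThing"  # assumed default for unmapped prefixes


def node_for(curie):
    """Build a minimal node record typed by the CURIE's prefix."""
    prefix = curie.split(":", 1)[0]
    return {
        "id": curie,
        "category": [PREFIX_TO_CLASS.get(prefix, FALLBACK_CATEGORY)],
    }


def nodes_from_edges(edges):
    """Emit one node per distinct subject/object CURIE, in first-seen order."""
    seen = {}
    for edge in edges:
        for curie in (edge["subject"], edge["object"]):
            if curie not in seen:
                seen[curie] = node_for(curie)
    return list(seen.values())
```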

Provenance Information

Contributors:
- Erica Wood: code author
- Evan Morris: code support
- Adilbek Bazarkulov: code support, domain expertise
- Sierra Moxon: data modeling, domain expertise
- Matthew Brush: data modeling, domain expertise

Artifacts:
- SemMedDB overview: https://lhncbc.nlm.nih.gov/temp/SemRep_SemMedDB_SKR/dbinfo.html
- Biolink Model (schema): https://github.com/biolink/biolink-model
- KGX documentation: https://kgx.readthedocs.io
- Summary of predicates included and filtered and their mappings: https://docs.google.com/spreadsheets/d/12XmPE9eJp3H7yJnwg5Wmx5-BZdXXbiwO02vDRc1BY-c/edit?gid=520297121#gid=520297121