Text Mining Knowledge Provider (TMKP)

Source Information

InfoRes ID: infores:text-mining-provider-targeted

Description: The Text Mining Knowledge Provider (TMKP) is a Translator knowledge provider that extracts biomedical assertions from scientific literature using advanced text mining techniques. It processes PubMed abstracts and full-text articles to identify relationships between biological entities such as chemicals, genes, diseases, and phenotypes. The system uses natural language processing and machine learning approaches to extract high-confidence assertions about how these entities interact.

Citations: - https://github.com/NCATSTranslator/Text-Mining-Provider-Roadmap

Data Access Locations: - https://storage.googleapis.com/translator-text-workflow-dev-public/

Data Provision Mechanisms: file_download

Data Formats: tar.gz, tsv, json

Data Versioning and Releases: Releases are published periodically to Google Cloud Storage. Each release's version date is included in its file path (for example, 2023-03-05).
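
Because the version date lives in the file path, a consumer can recover it without opening the archive. The sketch below is only illustrative: it assumes the date appears as its own `YYYY-MM-DD` path segment, as in the one release path listed in this guide, and the function name `release_date_from_path` is hypothetical.

```python
import re
from datetime import date
from typing import Optional

# ASSUMPTION: the version date is a whole path segment in YYYY-MM-DD form,
# as in the single release path shown in this guide.
_DATE_SEGMENT = re.compile(r"/(\d{4})-(\d{2})-(\d{2})(?:/|$)")


def release_date_from_path(path: str) -> Optional[date]:
    """Return the release date embedded in a path, or None if none is found."""
    match = _DATE_SEGMENT.search(path)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)
```

Returning `None` rather than raising keeps the helper usable on paths that carry no date segment.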

Ingest Information

Ingest Categories: primary_knowledge_provider

Utility: TMKP provides literature-mined assertions that complement manually curated knowledge sources. It enables discovery of relationships that may not be captured in structured databases, particularly for emerging research areas and novel connections between biological entities.

Scope: Covers text-mined relationships between chemicals, genes/proteins, diseases, and phenotypes extracted from PubMed literature. Each assertion includes supporting evidence from the source text and publication metadata.

Relevant Files

| File Name | Location | Description |
| --- | --- | --- |
| targeted_assertions.tar.gz | https://storage.googleapis.com/translator-text-workflow-dev-public/kgx/UniProt/2023-03-05/ | Archive containing nodes.tsv, edges.tsv, and content_metadata.json |
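
As a rough illustration of unpacking the archive, the sketch below pulls the three listed files out of a locally downloaded copy of `targeted_assertions.tar.gz` using Python's standard `tarfile` module. The function name, the `EXPECTED_MEMBERS` constant, and the choice to match members by base name (in case the archive nests them in a directory) are assumptions, not part of the actual ingest code.

```python
import shutil
import tarfile
from pathlib import Path

# File names taken from the table above.
EXPECTED_MEMBERS = ("nodes.tsv", "edges.tsv", "content_metadata.json")


def extract_expected_members(archive_path, dest_dir):
    """Copy the expected files out of the archive; return {file name: path}."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = {}
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar:
            # ASSUMPTION: match on base name, since the archive layout is not
            # documented here and members may sit inside a subdirectory.
            base = Path(member.name).name
            if not member.isfile() or base not in EXPECTED_MEMBERS:
                continue
            target = dest / base
            with tar.extractfile(member) as source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            extracted[base] = target
    missing = [name for name in EXPECTED_MEMBERS if name not in extracted]
    if missing:
        raise FileNotFoundError("archive is missing: " + ", ".join(missing))
    return extracted
```

Copying members one at a time, rather than calling `extractall`, means only the three known files are written and nothing lands outside `dest_dir`.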

Included Content

| File Name | Included Records | Fields Used |
| --- | --- | --- |
| nodes.tsv | All biological entities (chemicals, proteins, diseases, phenotypes) | id, name, category |
| edges.tsv | All text-mined assertions between entities | subject, predicate, object, qualified_predicate, object_aspect_qualifier, object_direction_qualifier, relation, _attributes |
| content_metadata.json | Metadata about available biolink classes and predicates | nodes, edges |
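
To make the edge fields concrete, here is a minimal sketch of reading edges.tsv and keeping only the fields listed above. It assumes the file has a header row with exactly these column names and is tab-delimited without CSV-style quoting; neither detail is confirmed by this guide, and `read_edges` is a hypothetical helper.

```python
import csv

# Field names taken from the "Fields Used" column for edges.tsv.
EDGE_FIELDS = (
    "subject",
    "predicate",
    "object",
    "qualified_predicate",
    "object_aspect_qualifier",
    "object_direction_qualifier",
    "relation",
    "_attributes",
)


def read_edges(handle):
    """Yield one dict per edge row, restricted to EDGE_FIELDS."""
    # QUOTE_NONE: _attributes holds JSON, so double quotes are data, not quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [name for name in EDGE_FIELDS if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("edges file is missing columns: " + ", ".join(missing))
    for row in reader:
        yield {name: row[name] for name in EDGE_FIELDS}
```

A typical call would be `with open("edges.tsv", newline="", encoding="utf-8") as fh: rows = list(read_edges(fh))`. Note that `read_edges` is a generator, so the column check runs when the first row is requested.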

Filtered Content

| File Name | Filtered Records | Rationale |
| --- | --- | --- |
| edges.tsv | Edges whose subject or object ID prefix violates the Biolink Model domain/range constraints of the edge's association type. For example, "MONDO:xxx biolink:treats DRUGBANK:yyy" is excluded because the association type expects a chemical as the subject and a disease as the object, not the reverse. | Validation uses BMT (the Biolink Model Toolkit) to determine dynamically which ID prefixes are valid for each predicate's domain and range. This removes semantically invalid edges, such as reversed ChemicalToDiseaseOrPhenotypicFeatureAssociation edges in which diseases or phenotypes (MONDO, HP prefixes) are incorrectly placed as subjects. |
| edges.tsv | GeneRegulatesGeneAssociation edges that lack required qualifiers (object_aspect_qualifier or object_direction_qualifier). | These qualifiers are required for the semantic correctness of regulatory relationships. Without them, an edge's meaning would be ambiguous. |
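
The two filters above can be sketched roughly as follows. This is not the ingest's implementation: the real validation derives valid prefixes from BMT, whereas this example hard-codes a hypothetical `PREFIX_RULES` table for a single predicate. It also assumes that a gene-regulation edge can be recognized by UniProtKB prefixes on both ends, and that an edge is dropped when either qualifier is empty. All function names are illustrative.

```python
from typing import Dict, Mapping, Set, Tuple

# HYPOTHETICAL stand-in for the prefixes the ingest looks up through BMT:
# predicate -> (allowed subject prefixes, allowed object prefixes).
PREFIX_RULES: Dict[str, Tuple[Set[str], Set[str]]] = {
    "biolink:treats": ({"DRUGBANK", "CHEBI"}, {"MONDO", "HP"}),
}

REQUIRED_QUALIFIERS = ("object_aspect_qualifier", "object_direction_qualifier")


def curie_prefix(curie: str) -> str:
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def violates_domain_range(edge: Mapping[str, str]) -> bool:
    """True if a known rule exists and the subject or object prefix breaks it."""
    rule = PREFIX_RULES.get(edge["predicate"])
    if rule is None:
        return False  # no rule in this sketch, so nothing to violate
    subject_prefixes, object_prefixes = rule
    return (
        curie_prefix(edge["subject"]) not in subject_prefixes
        or curie_prefix(edge["object"]) not in object_prefixes
    )


def is_gene_regulation(edge: Mapping[str, str]) -> bool:
    # ASSUMPTION: protein-to-protein edges are the gene-regulation edges, and
    # both ends carry a UniProtKB prefix.
    return (
        curie_prefix(edge["subject"]) == "UniProtKB"
        and curie_prefix(edge["object"]) == "UniProtKB"
    )


def lacks_required_qualifiers(edge: Mapping[str, str]) -> bool:
    # ASSUMPTION: an edge fails if either qualifier is missing or empty.
    return any(not edge.get(name) for name in REQUIRED_QUALIFIERS)


def keep_edge(edge: Mapping[str, str]) -> bool:
    """Apply both filters; True means the edge survives."""
    if violates_domain_range(edge):
        return False
    if is_gene_regulation(edge) and lacks_required_qualifiers(edge):
        return False
    return True
```

With this table, the reversed example from the first row (`MONDO:xxx biolink:treats DRUGBANK:yyy`) is rejected, while the same edge with subject and object swapped is kept.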

Future Content Considerations

edge_property_content: Review the study results data and its modeling. The modeling may be simplified, depending on the requirements for this type of metadata in this phase.

other: Consider whether RIG Edge Type objects are sufficient as the source of truth for UI explanations, given that normalized KGs end up containing edges that do not fit into these categories. Open questions: how to ensure that every edge in a KG has an explanation, and that the UI team can find, access, and use these explanations easily and accurately.

edge_content: Decide on the approach for one-hop predictions like those described above (whether DINGO creates them at ingest, or the CQS is reinstated to create them).

Additional Notes: Attributes in the data are encoded as JSON objects that may contain nested attributes representing supporting studies and evidence. The ingest creates Study objects containing TextMiningStudyResult objects to model the nested study result data structure, with supporting text and document references captured in the TextMiningStudyResult objects.
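
The nesting described above might look like the following in simplified form. Everything about the JSON shape here is an assumption made for illustration: that `_attributes` is a JSON list of objects with `attribute_type_id`, `value`, and an optional nested `attributes` list, and that the type IDs are `biolink:supporting_study_result`, `biolink:supporting_text`, and `biolink:supporting_document`. The two dataclasses are plain stand-ins that borrow the names Study and TextMiningStudyResult; they are not the model classes the ingest uses.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

# ASSUMED type ID marking a study-result attribute; not confirmed by this guide.
STUDY_RESULT_TYPE = "biolink:supporting_study_result"


@dataclass
class TextMiningStudyResult:
    """Simplified stand-in: one piece of supporting evidence."""
    supporting_text: Optional[str] = None
    supporting_document: Optional[str] = None


@dataclass
class Study:
    """Simplified stand-in: container for the study results of one edge."""
    results: List[TextMiningStudyResult] = field(default_factory=list)


def study_from_attributes(raw: str) -> Study:
    """Build a Study from the JSON text of an edge's _attributes field."""
    study = Study()
    for attribute in json.loads(raw):
        if attribute.get("attribute_type_id") != STUDY_RESULT_TYPE:
            continue
        # Flatten the nested attributes into {type id: value} for lookup.
        nested = {
            item.get("attribute_type_id"): item.get("value")
            for item in attribute.get("attributes") or []
        }
        study.results.append(
            TextMiningStudyResult(
                supporting_text=nested.get("biolink:supporting_text"),
                supporting_document=nested.get("biolink:supporting_document"),
            )
        )
    return study
```

Attributes that are not study results are skipped here; a fuller version would carry them over as ordinary edge properties.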

Target Information

Target InfoRes ID: infores:text-mining-provider-targeted

Edge Types

| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
| --- | --- | --- | --- | --- | --- |
| biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture, biolink:ComplexMolecularMixture | | biolink:Protein | not_provided | text_mining_agent | Targeted text mining identified that a chemical affects a protein, with causal direction and magnitude qualifiers indicating how the chemical modulates protein activity or abundance. Evidence from scientific literature includes supporting text excerpts and publication references. |
| biolink:Protein | | biolink:Protein | not_provided | text_mining_agent | Targeted text mining identified gene regulatory relationships where one protein causes changes in another protein's activity or abundance. These edges include required qualifiers indicating the direction and magnitude of the regulatory effect. |
| biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture, biolink:ComplexMolecularMixture | | biolink:Disease, biolink:PhenotypicFeature | knowledge_assertion | text_mining_agent | Targeted text mining identified therapeutic or studied-for-treatment relationships where a chemical treats, or was applied or studied to treat, a disease or phenotypic feature. The broad predicate reflects that NLP cannot distinguish between actual treatment and studied/applied treatment relationships. |
| biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture, biolink:ComplexMolecularMixture | | biolink:Disease, biolink:PhenotypicFeature | not_provided | text_mining_agent | Targeted text mining identified that a chemical contributes to a disease or phenotypic feature, indicating a causal or risk relationship based on evidence from literature. |
| biolink:Protein | | biolink:Disease, biolink:PhenotypicFeature | not_provided | text_mining_agent | Targeted text mining identified that a protein/gene contributes to a disease or phenotypic feature, indicating molecular involvement based on evidence from literature. Uses the Biolink EPC pattern with 'affects' as the primary predicate and 'contributes_to' as the qualified predicate. The subject_form_or_variant_qualifier indicates the form of the protein involved (modified, loss-of-function, or gain-of-function). |
| biolink:Protein | | biolink:Disease, biolink:PhenotypicFeature | not_provided | text_mining_agent | Targeted text mining identified that a protein/gene affects a disease or phenotypic feature, based on linguistic patterns in scientific literature. |

Node Types

| Node Category | Source Identifier Types | Additional Notes |
| --- | --- | --- |
| biolink:ChemicalEntity | DRUGBANK | None |
| biolink:Protein | UniProtKB, DRUGBANK | None |
| biolink:SmallMolecule | DRUGBANK, CHEBI | None |
| biolink:Disease | MONDO, HP | None |
| biolink:MolecularMixture | DRUGBANK, CHEBI | None |
| biolink:PhenotypicFeature | HP | None |
| biolink:ComplexMolecularMixture | DRUGBANK | None |

Future Modeling Considerations

edge_properties: Review study results data and modeling - it may be that we can simplify modeling depending on what requirements are for this type of metadata. Some of the scores/metadata TMKP captures may be useful for scoring.

other: May want to improve ui_explanations for edge types once we understand how the UI will stitch together / display predicates and qualifier values for these edges.

Additional Notes: All nodes created from edge records are minimal NamedThing placeholders with only an id. Richer node information (names, categories, etc.) comes from the nodes.tsv file via the node transform, or from downstream normalization.
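
The relationship between placeholder nodes and nodes.tsv records can be illustrated with a small merge. This is a sketch of the net effect only, under the assumption that edge rows and node rows arrive as dicts keyed by the field names listed earlier; in practice the merge may happen in a separate downstream step, and `merge_nodes` is a hypothetical helper.

```python
from typing import Dict, Iterable, Mapping


def merge_nodes(
    edge_rows: Iterable[Mapping[str, str]],
    node_rows: Iterable[Mapping[str, str]],
) -> Dict[str, dict]:
    """Return {node id: node dict}, preferring nodes.tsv data over placeholders."""
    nodes: Dict[str, dict] = {}
    # Every edge endpoint first becomes a minimal placeholder with only an id.
    for edge in edge_rows:
        for key in ("subject", "object"):
            node_id = edge[key]
            nodes.setdefault(node_id, {"id": node_id, "category": "biolink:NamedThing"})
    # Records from nodes.tsv then replace placeholders with richer data.
    for row in node_rows:
        nodes[row["id"]] = {
            "id": row["id"],
            "name": row["name"],
            "category": row["category"],
        }
    return nodes
```

Any endpoint that never appears in nodes.tsv stays a bare NamedThing placeholder, which is the case downstream normalization is expected to resolve.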

Provenance Information

Contributors:
- TMKP Team, Matt Brush - modeling, data generation, and text mining pipeline
- Sierra Moxon - ingest, code, data modeling

Artifacts: - TMKP GitHub: https://github.com/NCATSTranslator/Text-Mining-Provider-Roadmap