Text Mining Knowledge Provider (TMKP)¶
Source Information¶
InfoRes ID: infores:text-mining-provider-targeted
Description: The Text Mining Knowledge Provider (TMKP) is a Translator knowledge provider that extracts biomedical assertions from scientific literature by text mining. It processes PubMed abstracts and full-text articles to identify relationships between biological entities such as chemicals, genes, diseases, and phenotypes, using natural language processing and machine learning to extract high-confidence assertions about how these entities interact.
Citations: - https://github.com/NCATSTranslator/Text-Mining-Provider-Roadmap
Data Access Locations: - https://storage.googleapis.com/translator-text-workflow-dev-public/
Data Provision Mechanisms: file_download
Data Formats: tar.gz, tsv, json
Data Versioning and Releases: Releases are published periodically to Google Cloud Storage, and each release's version date is included in its file path (for example, 2023-03-05).
Ingest Information¶
Ingest Categories: primary_knowledge_provider
Utility: TMKP provides literature-mined assertions that complement manually curated knowledge sources. It enables discovery of relationships that may not be captured in structured databases, particularly for emerging research areas and novel connections between biological entities.
Scope: Covers text-mined relationships between chemicals, genes/proteins, diseases, and phenotypes extracted from PubMed literature. Each assertion includes supporting evidence from the source text and publication metadata.
Relevant Files¶
| File Name | Location | Description |
|---|---|---|
| targeted_assertions.tar.gz | https://storage.googleapis.com/translator-text-workflow-dev-public/kgx/UniProt/2023-03-05/ | Archive containing nodes.tsv, edges.tsv, and content_metadata.json |
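A minimal sketch of fetching the archive and reading its members with the Python standard library. It assumes the archive name is appended directly to the listed location; the joined URL and the helper names (`archive_url`, `read_members`, `download_archive`) are illustrative and not part of the TMKP ingest code.

```python
import io
import tarfile
import urllib.request

# Assumption: the archive file name is appended directly to the listed location.
BASE_URL = "https://storage.googleapis.com/translator-text-workflow-dev-public/kgx/UniProt/2023-03-05/"
ARCHIVE_NAME = "targeted_assertions.tar.gz"


def archive_url(base_url: str = BASE_URL, name: str = ARCHIVE_NAME) -> str:
    """Join the listed location and file name into a download URL."""
    return base_url.rstrip("/") + "/" + name


def read_members(archive_bytes: bytes) -> dict:
    """Return {base file name: contents} for every regular file in a tar.gz."""
    members = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                extracted = tar.extractfile(member)
                # Drop any leading directory so lookups use the bare file name.
                members[member.name.rsplit("/", 1)[-1]] = extracted.read()
    return members


def download_archive(url: str) -> bytes:
    """Fetch the archive over HTTPS (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read()
```

With network access, `read_members(download_archive(archive_url()))` would be expected to yield entries for nodes.tsv, edges.tsv, and content_metadata.json if the archive is laid out as described in the table.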
Included Content¶
| File Name | Included Records | Fields Used |
|---|---|---|
| nodes.tsv | All biological entities (chemicals, proteins, diseases, phenotypes) | id, name, category |
| edges.tsv | All text-mined assertions between entities | subject, predicate, object, qualified_predicate, object_aspect_qualifier, object_direction_qualifier, relation, _attributes |
| content_metadata.json | Metadata about available biolink classes and predicates | nodes, edges |
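The field selection in the table can be sketched as follows. This assumes both TSV files start with a header row whose column names match the "Fields Used" column; `read_tsv`, `NODE_FIELDS`, and `EDGE_FIELDS` are hypothetical names. Quoting is disabled so that JSON stored in `_attributes` keeps its double quotes.

```python
import csv
import io

# Column names copied from the "Fields Used" column of the table.
NODE_FIELDS = ["id", "name", "category"]
EDGE_FIELDS = [
    "subject", "predicate", "object", "qualified_predicate",
    "object_aspect_qualifier", "object_direction_qualifier",
    "relation", "_attributes",
]


def read_tsv(text: str, fields: list) -> list:
    """Parse tab-separated text and keep only the requested columns.

    Assumes the first line is a header row. Columns that are absent or
    empty in a row are returned as empty strings.
    """
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [{name: (row.get(name) or "") for name in fields} for row in reader]
```

For example, `read_tsv(nodes_text, NODE_FIELDS)` returns one dictionary per node row, ignoring any columns the ingest does not use.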
Filtered Content¶
| File Name | Filtered Records | Rationale |
|---|---|---|
| edges.tsv | Edges whose subject or object ID prefixes violate Biolink Model domain/range constraints. For example, edges like "MONDO:xxx biolink:treats DRUGBANK:yyy" are excluded because the association type expects chemicals as subjects and diseases as objects, not the reverse. | Validation uses the Biolink Model Toolkit (BMT) to dynamically determine the valid ID prefixes for each predicate's domain and range, and filters out semantically invalid edges such as reversed ChemicalToDiseaseOrPhenotypicFeatureAssociation edges, where diseases or phenotypes (MONDO, HP prefixes) are incorrectly positioned as subjects. |
| edges.tsv | GeneRegulatesGeneAssociation edges that lack required qualifiers (object_aspect_qualifier or object_direction_qualifier). | These qualifiers are required for semantic correctness of regulatory relationships. Edges without them would have ambiguous meaning. |
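The two filters above can be sketched roughly as below. This is not the ingest's implementation: the real validation derives prefixes from BMT, whereas this sketch substitutes a hard-coded, hypothetical `ALLOWED_PREFIXES` table built from the prefixes listed under Node Types. It also assumes that an edge missing either qualifier is dropped, and that the caller already knows whether an edge is a GeneRegulatesGeneAssociation (the `is_gene_regulates_gene` flag is illustrative).

```python
# Hypothetical stand-in for the prefixes BMT would report for the domain
# and range of biolink:treats (chemicals as subjects, diseases as objects).
ALLOWED_PREFIXES = {
    "biolink:treats": {
        "subject": {"DRUGBANK", "CHEBI"},
        "object": {"MONDO", "HP"},
    },
}
REQUIRED_QUALIFIERS = ("object_aspect_qualifier", "object_direction_qualifier")


def prefix(curie: str) -> str:
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def has_valid_prefixes(edge: dict) -> bool:
    """Check subject/object prefixes against the table for this predicate."""
    rule = ALLOWED_PREFIXES.get(edge["predicate"])
    if rule is None:
        return True  # this sketch records no constraint for the predicate
    return (prefix(edge["subject"]) in rule["subject"]
            and prefix(edge["object"]) in rule["object"])


def has_required_qualifiers(edge: dict) -> bool:
    """Assumption: both qualifiers must be present and non-empty."""
    return all(edge.get(name) for name in REQUIRED_QUALIFIERS)


def keep_edge(edge: dict, is_gene_regulates_gene: bool = False) -> bool:
    """Apply both filters; return False for edges that would be dropped."""
    if not has_valid_prefixes(edge):
        return False
    if is_gene_regulates_gene and not has_required_qualifiers(edge):
        return False
    return True
```

Under these assumptions, the reversed example from the table (`MONDO:xxx biolink:treats DRUGBANK:yyy`) fails `has_valid_prefixes`, while the same edge with subject and object swapped passes.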
Future Content Considerations¶
edge_property_content: Review study results data and modeling. The modeling may be simplified depending on the requirements for this type of metadata in this phase.
other: Consider whether it is sufficient for RIG Edge Type objects to be the source of truth for UI explanations, given that normalized KGs end up containing edges that don't fit into these categories. Determine how to ensure that every edge in a KG has an explanation and that the UI team can find, access, and use these explanations easily and accurately.
edge_content: Decide on an approach for one-hop predictions like those described above (whether DINGO creates them at ingest, or the CQS is reinstated to create them).
Additional Notes: Attributes in the data are encoded as JSON objects that may contain nested attributes representing supporting studies and evidence. The ingest creates Study objects containing TextMiningStudyResult objects to model the nested study result data structure, with supporting text and document references captured in the TextMiningStudyResult objects.
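The nesting described in the note can be illustrated with a small sketch. The JSON layout is an assumption: a list of attribute objects with `attribute_type_id`, `value`, and an optional nested `attributes` list. The three type identifiers below are placeholders that may not match the actual data, and the two dataclasses are simplified stand-ins for the Study and TextMiningStudyResult objects the ingest creates.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed attribute type identifiers; the real data may use different ones.
RESULT_TYPE = "biolink:supporting_study_result"
TEXT_TYPE = "biolink:supporting_text"
DOCUMENT_TYPE = "biolink:supporting_document"


@dataclass
class TextMiningStudyResult:
    """Simplified stand-in holding supporting text and a document reference."""
    supporting_text: Optional[str] = None
    document: Optional[str] = None


@dataclass
class Study:
    """Simplified stand-in grouping the study results found on one edge."""
    results: List[TextMiningStudyResult] = field(default_factory=list)


def parse_attributes(raw: str) -> Study:
    """Build a Study from the JSON string stored in an edge's _attributes."""
    study = Study()
    for attribute in json.loads(raw):
        if attribute.get("attribute_type_id") != RESULT_TYPE:
            continue  # only study-result attributes carry nested evidence
        nested = {
            item.get("attribute_type_id"): item.get("value")
            for item in (attribute.get("attributes") or [])
        }
        study.results.append(TextMiningStudyResult(
            supporting_text=nested.get(TEXT_TYPE),
            document=nested.get(DOCUMENT_TYPE),
        ))
    return study
```

Each edge would yield one Study whose `results` list has one entry per nested study-result attribute, keeping the supporting sentence alongside the document it came from.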
Target Information¶
Target InfoRes ID: infores:text-mining-provider-targeted
Edge Types¶
| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
|---|---|---|---|---|---|
| biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture, biolink:ComplexMolecularMixture | | biolink:Protein | not_provided | text_mining_agent | Targeted text mining identified that a chemical affects a protein, with causal direction and magnitude qualifiers indicating how the chemical modulates protein activity or abundance. Evidence from scientific literature includes supporting text excerpts and publication references. |
| biolink:Protein | | biolink:Protein | not_provided | text_mining_agent | Targeted text mining identified gene regulatory relationships where one protein causes changes in another protein's activity or abundance. These edges include required qualifiers indicating the direction and magnitude of the regulatory effect. |
| biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture, biolink:ComplexMolecularMixture | | biolink:Disease, biolink:PhenotypicFeature | knowledge_assertion | text_mining_agent | Targeted text mining identified therapeutic or studied-for-treatment relationships where a chemical treats, or was applied or studied to treat, a disease or phenotypic feature. The broad predicate reflects that NLP cannot distinguish between actual treatment and studied/applied treatment relationships. |
| biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture, biolink:ComplexMolecularMixture | | biolink:Disease, biolink:PhenotypicFeature | not_provided | text_mining_agent | Targeted text mining identified that a chemical contributes to a disease or phenotypic feature, indicating a causal or risk relationship based on evidence from literature. |
| biolink:Protein | | biolink:Disease, biolink:PhenotypicFeature | not_provided | text_mining_agent | Targeted text mining identified that a protein/gene contributes to a disease or phenotypic feature, indicating molecular involvement based on evidence from literature. Uses the Biolink EPC pattern with 'affects' as the primary predicate and 'contributes_to' as the qualified predicate. The subject_form_or_variant_qualifier indicates the form of the protein involved (modified, loss-of-function, or gain-of-function). |
| biolink:Protein | | biolink:Disease, biolink:PhenotypicFeature | not_provided | text_mining_agent | Targeted text mining identified that a protein/gene affects a disease or phenotypic feature, based on linguistic patterns in scientific literature. |
Node Types¶
| Node Category | Source Identifier Types | Additional Notes |
|---|---|---|
| biolink:ChemicalEntity | DRUGBANK | None |
| biolink:Protein | UniProtKB, DRUGBANK | None |
| biolink:SmallMolecule | DRUGBANK, CHEBI | None |
| biolink:Disease | MONDO, HP | None |
| biolink:MolecularMixture | DRUGBANK, CHEBI | None |
| biolink:PhenotypicFeature | HP | None |
| biolink:ComplexMolecularMixture | DRUGBANK | None |
Future Modeling Considerations¶
edge_properties: Review study results data and modeling. The modeling may be simplified depending on the requirements for this type of metadata. Some of the scores and metadata TMKP captures may be useful for scoring.
other: The ui_explanations for edge types may need improvement once it is clear how the UI will stitch together and display predicates and qualifier values for these edges.
Additional Notes: All nodes created from edge records are minimal NamedThing placeholders with only an id. Richer node information (names, categories, etc.) comes from the nodes.tsv file via the node transform, or from downstream normalization.
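A rough sketch of the placeholder behaviour described in this note, assuming edges and node rows are plain dictionaries keyed by the TSV column names. `placeholder_nodes` and `merge_nodes` are hypothetical helpers, not functions from the ingest code.

```python
def placeholder_nodes(edges: list) -> dict:
    """Create an id-only placeholder for every subject and object seen."""
    nodes = {}
    for edge in edges:
        for key in ("subject", "object"):
            nodes.setdefault(edge[key], {"id": edge[key]})
    return nodes


def merge_nodes(placeholders: dict, node_rows: list) -> dict:
    """Overlay richer nodes.tsv rows onto the id-only placeholders."""
    merged = {node_id: dict(node) for node_id, node in placeholders.items()}
    for row in node_rows:
        target = merged.setdefault(row["id"], {"id": row["id"]})
        # Copy only non-empty values so blanks never erase known fields.
        target.update({key: value for key, value in row.items() if value})
    return merged
```

Placeholders that never appear in nodes.tsv keep only their `id`, which matches the note that richer information may instead come from downstream normalization.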
Provenance Information¶
Contributors:
- TMKP Team, Matt Brush - modeling, data generation, and text mining pipeline
- Sierra Moxon - ingest, code, data modeling
Artifacts: - TMKP GitHub: https://github.com/NCATSTranslator/Text-Mining-Provider-Roadmap