Text Mining Knowledge Provider (TMKP)

Source Information

InfoRes ID: infores:text-mining-provider-cooccurrence

Description: The Text Mining Knowledge Provider (TMKP) is a Translator knowledge provider that extracts biomedical assertions from scientific literature. It processes PubMed abstracts and full-text articles to identify relationships between biological entities such as chemicals, genes, diseases, and phenotypes, using natural language processing and machine learning to extract high-confidence assertions about how these entities interact.

Citations: https://github.com/NCATSTranslator/Text-Mining-Provider-Roadmap

Data Access Locations: https://storage.googleapis.com/translator-text-workflow-dev-public/

Data Provision Mechanisms: file_download

Data Formats: tar.gz, tsv, json

Data Versioning and Releases: Periodic releases managed through Google Cloud Storage. Version dates included in file paths.
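
Because the version date is carried in the file path, a consumer can recover it without opening the archive. The following is a minimal sketch, assuming the date appears as a whole path segment in YYYY-MM-DD form (as in the location listed under Relevant Files); the function name is hypothetical and not part of any TMKP tooling.

```python
import re
from datetime import date
from typing import Optional

# Assumption: the release date is a whole path segment shaped like YYYY-MM-DD.
_DATE_SEGMENT = re.compile(r"(?:^|/)(\d{4})-(\d{2})-(\d{2})(?:/|$)")


def release_date_from_path(path: str) -> Optional[date]:
    """Return the first YYYY-MM-DD path segment as a date, or None if absent."""
    match = _DATE_SEGMENT.search(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The segment looked like a date but is not a valid calendar day.
        return None


if __name__ == "__main__":
    url = (
        "https://storage.googleapis.com/translator-text-workflow-dev-public/"
        "kgx/UniProt/2023-03-05/"
    )
    print(release_date_from_path(url))
```

Returning `None` instead of raising lets a caller skip paths that carry no date segment.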

Additional Notes: None

Ingest Information

Ingest Categories: primary_knowledge_provider

Utility: TMKP provides literature-mined assertions that complement manually curated knowledge sources. It enables discovery of relationships that may not be captured in structured databases, particularly for emerging research areas and novel connections between biological entities.

Scope: Covers text-mined relationships between chemicals, genes/proteins, diseases, and phenotypes extracted from PubMed literature. Each assertion includes supporting evidence from the source text and publication metadata.

Relevant Files

| File Name | Location | Description |
| --- | --- | --- |
| targeted_assertions.tar.gz | https://storage.googleapis.com/translator-text-workflow-dev-public/kgx/UniProt/2023-03-05/ | Archive containing nodes.tsv, edges.tsv, and content_metadata.json |
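
Retrieving and unpacking the archive could look roughly like the sketch below. It assumes the archive sits directly under the listed location (file name appended to the URL), which this guide does not state explicitly; the function names are illustrative. Members are flattened to their base names so that nothing can be written outside the destination directory.

```python
import shutil
import tarfile
from pathlib import Path
from typing import List
from urllib.request import urlretrieve

# Assumption: the archive URL is the listed location plus the file name.
BASE_URL = (
    "https://storage.googleapis.com/translator-text-workflow-dev-public/"
    "kgx/UniProt/2023-03-05/"
)
ARCHIVE_NAME = "targeted_assertions.tar.gz"


def download_archive(dest_dir: Path) -> Path:
    """Download the archive into dest_dir and return the local file path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    local_path = dest_dir / ARCHIVE_NAME
    urlretrieve(BASE_URL + ARCHIVE_NAME, str(local_path))
    return local_path


def extract_archive(archive_path: Path, dest_dir: Path) -> List[str]:
    """Extract regular files from a .tar.gz; return their base names, sorted."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and special files
            name = Path(member.name).name  # drop any directory components
            source = tar.extractfile(member)
            if source is None or not name:
                continue
            with source, open(dest_dir / name, "wb") as target:
                shutil.copyfileobj(source, target)
            extracted.append(name)
    return sorted(extracted)
```

A typical call would be `extract_archive(download_archive(Path("data")), Path("data/kgx"))`, after which the three files named in the table should be present if the archive matches the description above.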

Included Content

| File Name | Included Records | Fields Used |
| --- | --- | --- |
| nodes.tsv | All biological entities (chemicals, proteins, diseases, phenotypes) | id, name, category |
| edges.tsv | All text-mined assertions between entities | subject, predicate, object, qualified_predicate, qualifiers, attributes |
| content_metadata.json | Metadata about available biolink classes and predicates | nodes, edges |
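
Reading only the fields listed above can be sketched with the standard `csv` module. This assumes each TSV file starts with a header row whose column names match the field names in the table, which should be checked against the actual files; `read_fields` is a hypothetical helper.

```python
import csv
from typing import Dict, Iterator, Sequence, TextIO

NODE_FIELDS = ("id", "name", "category")
EDGE_FIELDS = (
    "subject", "predicate", "object",
    "qualified_predicate", "qualifiers", "attributes",
)


def read_fields(handle: TextIO, fields: Sequence[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per row, keeping only the requested columns."""
    # QUOTE_NONE keeps double quotes inside JSON-valued cells untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = reader.fieldnames or []
    missing = [field for field in fields if field not in header]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    for row in reader:
        # Short rows produce None for absent cells; normalise to "".
        yield {field: row[field] or "" for field in fields}


if __name__ == "__main__":
    with open("nodes.tsv", newline="", encoding="utf-8") as handle:
        for node in read_fields(handle, NODE_FIELDS):
            print(node["id"], node["category"])
```

Failing on a missing column makes a header mismatch visible immediately instead of silently yielding empty values.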

Additional Notes: Attributes in the data are encoded as JSON objects that may contain nested attributes representing supporting studies and evidence.
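
Walking such nested attributes might look like the sketch below. It assumes each attribute is a JSON object that may carry its children under an `attributes` key and identifies itself with `attribute_type_id` and `value`; these key names are assumptions borrowed from the TRAPI attribute model, not confirmed by this guide, and the sample values are placeholders.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple


def parse_attributes_cell(cell: str) -> List[Dict[str, Any]]:
    """Decode one attributes cell; accept a single object or a list of objects."""
    if not cell.strip():
        return []
    decoded = json.loads(cell)
    return decoded if isinstance(decoded, list) else [decoded]


def iter_attributes(
    attributes: List[Dict[str, Any]], depth: int = 0
) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, attribute) pairs in depth-first order."""
    for attribute in attributes:
        yield depth, attribute
        # Assumption: children, if any, live under the "attributes" key.
        nested = attribute.get("attributes") or []
        yield from iter_attributes(nested, depth + 1)


if __name__ == "__main__":
    # Placeholder data for illustration only.
    cell = json.dumps([
        {
            "attribute_type_id": "example:outer",
            "value": "study-1",
            "attributes": [
                {"attribute_type_id": "example:inner", "value": "sentence"}
            ],
        }
    ])
    for depth, attribute in iter_attributes(parse_attributes_cell(cell)):
        print("  " * depth + attribute["attribute_type_id"])
```

The depth value lets a caller tell top-level attributes from the supporting-study and evidence records nested beneath them.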

Target Information

Target InfoRes ID: infores:translator-text-mining-kgx

Edge Types

| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
| --- | --- | --- | --- | --- | --- |
| biolink:Protein |  | biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture, biolink:Protein | not_provided | text_mining_agent | Text mining identified that a protein affects a chemical or other protein, with evidence from scientific literature. |
| biolink:Disease |  | biolink:Protein, biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture | not_provided | text_mining_agent | Text mining identified relationships between diseases and chemicals/proteins based on co-occurrence and linguistic patterns in literature. |
| biolink:PhenotypicFeature |  | biolink:Protein, biolink:SmallMolecule, biolink:ChemicalEntity | not_provided | text_mining_agent | Text mining identified relationships between phenotypes and chemicals/proteins. |

Node Types

| Node Category | Source Identifier Types | Additional Notes |
| --- | --- | --- |
| biolink:ChemicalEntity | DRUGBANK | None |
| biolink:Protein | UniProtKB, DRUGBANK | None |
| biolink:SmallMolecule | DRUGBANK, CHEBI | None |
| biolink:Disease | MONDO, HP | None |
| biolink:MolecularMixture | DRUGBANK, CHEBI | None |
| biolink:PhenotypicFeature | HP | None |
| biolink:ComplexMolecularMixture | DRUGBANK | None |
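
The category-to-identifier mapping above can serve as a sanity check during ingest. The sketch below assumes node identifiers are CURIEs of the form `PREFIX:local_id` and that prefix casing matches the table exactly; the function name is hypothetical.

```python
# Mapping transcribed from the Node Types table in this guide.
ALLOWED_PREFIXES = {
    "biolink:ChemicalEntity": {"DRUGBANK"},
    "biolink:Protein": {"UniProtKB", "DRUGBANK"},
    "biolink:SmallMolecule": {"DRUGBANK", "CHEBI"},
    "biolink:Disease": {"MONDO", "HP"},
    "biolink:MolecularMixture": {"DRUGBANK", "CHEBI"},
    "biolink:PhenotypicFeature": {"HP"},
    "biolink:ComplexMolecularMixture": {"DRUGBANK"},
}


def prefix_is_expected(node_id: str, category: str) -> bool:
    """Return True if the CURIE prefix of node_id is listed for the category."""
    prefix, separator, local_id = node_id.partition(":")
    if not separator or not local_id:
        return False  # not a CURIE of the assumed PREFIX:local_id shape
    return prefix in ALLOWED_PREFIXES.get(category, set())


if __name__ == "__main__":
    print(prefix_is_expected("HP:0000118", "biolink:PhenotypicFeature"))
    print(prefix_is_expected("MONDO:0000001", "biolink:PhenotypicFeature"))
```

Nodes that fail the check are not necessarily wrong; they may simply indicate that the table needs updating for a newer release.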

Provenance Information

Contributors:
- TMKP Team: data generation and text mining pipeline
- Sierra Moxon: ingest, code, data modeling

Artifacts: TMKP GitHub: https://github.com/NCATSTranslator/Text-Mining-Provider-Roadmap