Text Mining Knowledge Provider (TMKP)
Source Information
InfoRes ID: infores:text-mining-provider-cooccurrence
Description: The Text Mining Knowledge Provider (TMKP) is a Translator knowledge provider that extracts biomedical assertions from scientific literature. It processes PubMed abstracts and full-text articles to identify relationships between biological entities such as chemicals, genes, diseases, and phenotypes, using natural language processing and machine learning to extract high-confidence assertions about how these entities interact.
Citations: - https://github.com/NCATSTranslator/Text-Mining-Provider-Roadmap
Data Access Locations: - https://storage.googleapis.com/translator-text-workflow-dev-public/
Data Provision Mechanisms: file_download
Data Formats: tar.gz, tsv, json
Data Versioning and Releases: Periodic releases managed through Google Cloud Storage. Version dates included in file paths.
Additional Notes: None
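Because version dates are embedded in file paths, a consumer can recover the release date from a location URL. The following is a minimal Python sketch, assuming the date appears as a `YYYY-MM-DD` path segment (as in the dated location listed under Relevant Files). The function name `release_date_from_path` is illustrative and not part of any TMKP tooling.

```python
import re
from datetime import date
from typing import Optional

# Assumption: the release date is a whole path segment of the form YYYY-MM-DD.
_DATE_SEGMENT = re.compile(r"/(\d{4})-(\d{2})-(\d{2})(?=/|$)")


def release_date_from_path(url: str) -> Optional[date]:
    """Return the last YYYY-MM-DD path segment of `url` as a date, or None."""
    matches = _DATE_SEGMENT.findall(url)
    if not matches:
        return None
    year, month, day = (int(part) for part in matches[-1])
    try:
        return date(year, month, day)
    except ValueError:
        # The segment looked like a date but is not a valid calendar day.
        return None
```

Returning `None` rather than raising keeps the helper usable on unversioned paths such as the bucket root.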
Ingest Information
Ingest Categories: primary_knowledge_provider
Utility: TMKP provides literature-mined assertions that complement manually curated knowledge sources. It enables discovery of relationships that may not be captured in structured databases, particularly for emerging research areas and novel connections between biological entities.
Scope: Covers text-mined relationships between chemicals, genes/proteins, diseases, and phenotypes extracted from PubMed literature. Each assertion includes supporting evidence from the source text and publication metadata.
Relevant Files
| File Name | Location | Description |
|---|---|---|
| targeted_assertions.tar.gz | https://storage.googleapis.com/translator-text-workflow-dev-public/kgx/UniProt/2023-03-05/ | Archive containing nodes.tsv, edges.tsv, and content_metadata.json |
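As a rough illustration of how the archive might be consumed, the sketch below reads the three files named in the table from an in-memory copy of the `tar.gz`. It assumes the files may sit at the archive root or inside a subdirectory, which the source does not specify. `read_archive_members` and `EXPECTED_MEMBERS` are hypothetical names introduced here.

```python
import io
import tarfile
from typing import Dict

# File names taken from the Relevant Files table.
EXPECTED_MEMBERS = ("nodes.tsv", "edges.tsv", "content_metadata.json")


def read_archive_members(archive_bytes: bytes) -> Dict[str, bytes]:
    """Return the contents of the expected files, keyed by base file name."""
    found: Dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            # Assumption: members may be nested, so match on the base name only.
            base_name = member.name.rsplit("/", 1)[-1]
            if member.isfile() and base_name in EXPECTED_MEMBERS:
                extracted = archive.extractfile(member)
                if extracted is not None:
                    found[base_name] = extracted.read()
    missing = [name for name in EXPECTED_MEMBERS if name not in found]
    if missing:
        raise ValueError(f"archive is missing expected files: {missing}")
    return found
```

The bytes could come from, for example, `urllib.request.urlopen(url).read()`, where `url` is the Location joined with the file name. That the archive sits directly under that path is an assumption. Reading members into memory avoids extracting untrusted paths to disk, at the cost of holding the whole archive in RAM.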
Included Content
| File Name | Included Records | Fields Used |
|---|---|---|
| nodes.tsv | All biological entities (chemicals, proteins, diseases, phenotypes) | id, name, category |
| edges.tsv | All text-mined assertions between entities | subject, predicate, object, qualified_predicate, qualifiers, attributes |
| content_metadata.json | Metadata about available biolink classes and predicates | nodes, edges |
Additional Notes: Attributes in the data are encoded as JSON objects that may contain nested attributes representing supporting studies and evidence.
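The nested attribute encoding can be unpacked with a small recursive walk. The sketch below is illustrative only. It assumes that `edges.tsv` has a header row, that the `attributes` column holds a JSON array (or a single JSON object), and that each attribute object uses the keys `attribute_type_id`, `value`, and an optional nested `attributes` list. None of these details are confirmed by the source, and the function names are hypothetical.

```python
import csv
import io
import json
from typing import Any, Dict, Iterator, List, Tuple


def iter_attributes(
    attributes: List[Dict[str, Any]], depth: int = 0
) -> Iterator[Tuple[int, str, Any]]:
    """Yield (depth, attribute_type_id, value) for every attribute, depth-first."""
    for attribute in attributes:
        yield depth, attribute.get("attribute_type_id", ""), attribute.get("value")
        # Assumed key: nested evidence lives under "attributes".
        yield from iter_attributes(attribute.get("attributes") or [], depth + 1)


def parse_edges(tsv_text: str) -> List[Dict[str, Any]]:
    """Parse edges.tsv text and decode the JSON in the attributes column."""
    # QUOTE_NONE keeps the double quotes inside the JSON cell untouched.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    edges: List[Dict[str, Any]] = []
    for row in reader:
        decoded = json.loads(row.get("attributes") or "[]")
        if isinstance(decoded, dict):
            decoded = [decoded]  # tolerate a single object instead of a list
        edge: Dict[str, Any] = dict(row)
        edge["attributes"] = decoded
        edges.append(edge)
    return edges
```

If attribute cells are very long, the `csv` module's default field size limit may need to be raised with `csv.field_size_limit` before parsing.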
Target Information
Target InfoRes ID: infores:translator-text-mining-kgx
Edge Types
| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
|---|---|---|---|---|---|
| biolink:Protein | | biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture, biolink:Protein | not_provided | text_mining_agent | Text mining identified that a protein affects a chemical or other protein, with evidence from scientific literature. |
| biolink:Disease | | biolink:Protein, biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture | not_provided | text_mining_agent | Text mining identified relationships between diseases and chemicals/proteins based on co-occurrence and linguistic patterns in literature. |
| biolink:PhenotypicFeature | | biolink:Protein, biolink:SmallMolecule, biolink:ChemicalEntity | not_provided | text_mining_agent | Text mining identified relationships between phenotypes and chemicals/proteins. |
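The subject and object category pairings in this table can be expressed as a lookup, which is one way an ingest might flag edges that fall outside the documented scope. This is a sketch under the assumption that each edge carries a single subject category and a single object category. `ALLOWED_OBJECTS` and `edge_is_in_scope` are hypothetical names.

```python
from typing import Dict, FrozenSet

# Transcribed from the Edge Types table: subject category -> object categories.
ALLOWED_OBJECTS: Dict[str, FrozenSet[str]] = {
    "biolink:Protein": frozenset({
        "biolink:SmallMolecule", "biolink:ChemicalEntity",
        "biolink:MolecularMixture", "biolink:Protein",
    }),
    "biolink:Disease": frozenset({
        "biolink:Protein", "biolink:SmallMolecule",
        "biolink:ChemicalEntity", "biolink:MolecularMixture",
    }),
    "biolink:PhenotypicFeature": frozenset({
        "biolink:Protein", "biolink:SmallMolecule", "biolink:ChemicalEntity",
    }),
}


def edge_is_in_scope(subject_category: str, object_category: str) -> bool:
    """Return True if the category pair appears in the Edge Types table."""
    return object_category in ALLOWED_OBJECTS.get(subject_category, frozenset())
```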
Node Types
| Node Category | Source Identifier Types | Additional Notes |
|---|---|---|
| biolink:ChemicalEntity | DRUGBANK | None |
| biolink:Protein | UniProtKB, DRUGBANK | None |
| biolink:SmallMolecule | DRUGBANK, CHEBI | None |
| biolink:Disease | MONDO, HP | None |
| biolink:MolecularMixture | DRUGBANK, CHEBI | None |
| biolink:PhenotypicFeature | HP | None |
| biolink:ComplexMolecularMixture | DRUGBANK | None |
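One possible sanity check on `nodes.tsv` is to confirm that each node's CURIE prefix is among the identifier types listed for its category. The sketch below transcribes the table above into a mapping. It assumes node identifiers are CURIEs of the form `PREFIX:local_id` and that each node has a single category. The names `ALLOWED_PREFIXES` and `prefix_matches_category` are illustrative.

```python
from typing import Dict, FrozenSet

# Transcribed from the Node Types table: category -> source identifier prefixes.
ALLOWED_PREFIXES: Dict[str, FrozenSet[str]] = {
    "biolink:ChemicalEntity": frozenset({"DRUGBANK"}),
    "biolink:Protein": frozenset({"UniProtKB", "DRUGBANK"}),
    "biolink:SmallMolecule": frozenset({"DRUGBANK", "CHEBI"}),
    "biolink:Disease": frozenset({"MONDO", "HP"}),
    "biolink:MolecularMixture": frozenset({"DRUGBANK", "CHEBI"}),
    "biolink:PhenotypicFeature": frozenset({"HP"}),
    "biolink:ComplexMolecularMixture": frozenset({"DRUGBANK"}),
}


def prefix_matches_category(node_id: str, category: str) -> bool:
    """Return True if the CURIE prefix of `node_id` is listed for `category`."""
    prefix, separator, local_id = node_id.partition(":")
    if not separator or not local_id:
        return False  # not a CURIE of the assumed PREFIX:local_id form
    return prefix in ALLOWED_PREFIXES.get(category, frozenset())
```

Unknown categories return `False` rather than raising, so the check can be applied row by row and mismatches collected for review.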
Provenance Information
Contributors: - TMKP Team: data generation and text mining pipeline - Sierra Moxon: ingest, code, data modeling
Artifacts: - TMKP GitHub: https://github.com/NCATSTranslator/Text-Mining-Provider-Roadmap