Text Mining Knowledge Provider (TMKP)¶

Source Information¶

InfoRes ID: infores:text-mining-provider-targeted

Description: The Text Mining Knowledge Provider (TMKP) is a Translator knowledge provider that extracts biomedical assertions from scientific literature. It processes PubMed abstracts and full-text articles with natural language processing and machine learning to identify high-confidence relationships between biological entities such as chemicals, genes, diseases, and phenotypes.

Citations: - https://github.com/NCATSTranslator/Text-Mining-Provider-Roadmap

Data Access Locations: - https://storage.googleapis.com/translator-text-workflow-dev-public/

Data Provision Mechanisms: file_download

Data Formats: tar.gz, tsv, json

Data Versioning and Releases: Releases are published periodically to Google Cloud Storage. The release date is included in each file path (for example, 2023-03-05).

Ingest Information¶

Ingest Categories: primary_knowledge_provider

Utility: TMKP provides literature-mined assertions that complement manually curated knowledge sources. It enables discovery of relationships that may not be captured in structured databases, particularly for emerging research areas and novel connections between biological entities.

Scope: Covers text-mined relationships between chemicals, genes/proteins, diseases, and phenotypes extracted from PubMed literature. Each assertion includes supporting evidence from the source text and publication metadata.

Relevant Files¶

File Name	Location	Description
targeted_assertions.tar.gz	https://storage.googleapis.com/translator-text-workflow-dev-public/kgx/UniProt/2023-03-05/	Archive containing nodes.tsv, edges.tsv, and content_metadata.json
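
A minimal sketch of fetching and unpacking this archive follows. It assumes the download URL is the Location joined with the File Name, which this guide does not state explicitly. The helper names `archive_url`, `download_archive`, and `extract_members` are illustrative and are not part of any TMKP tooling.

```python
import io
import tarfile
import urllib.request
from typing import Dict

# Location and file name copied from the Relevant Files table.
LOCATION = "https://storage.googleapis.com/translator-text-workflow-dev-public/kgx/UniProt/2023-03-05/"
FILE_NAME = "targeted_assertions.tar.gz"


def archive_url(location: str = LOCATION, file_name: str = FILE_NAME) -> str:
    """Join location and file name (assumed layout, not confirmed by the source)."""
    return location.rstrip("/") + "/" + file_name


def download_archive(url: str) -> bytes:
    """Fetch the archive over HTTPS (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def extract_members(archive_bytes: bytes) -> Dict[str, bytes]:
    """Return {base file name: content} for every regular file in a tar.gz archive."""
    members = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for info in tar.getmembers():
            if info.isfile():
                # Drop any directory prefix so lookups work by plain file name.
                name = info.name.rsplit("/", 1)[-1]
                members[name] = tar.extractfile(info).read()
    return members
```

With network access, `extract_members(download_archive(archive_url()))` would be expected to return entries for nodes.tsv, edges.tsv, and content_metadata.json, if the archive layout matches the description above.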

Included Content¶

File Name	Included Records	Fields Used
nodes.tsv	All biological entities (chemicals, proteins, diseases, phenotypes)	id, name, category
edges.tsv	All text-mined assertions between entities	subject, predicate, object, qualified_predicate, object_aspect_qualifier, object_direction_qualifier, relation, _attributes
content_metadata.json	Metadata about available biolink classes and predicates	nodes, edges
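
As a rough illustration of how the edges.tsv fields listed above could be read, the sketch below uses the standard `csv` module. It assumes the file has a header row and that the `_attributes` column holds a JSON-encoded string; `read_edges` is a hypothetical helper name, not the ingest's actual code.

```python
import csv
import json
from typing import Iterator, TextIO

# Field names copied from the "Fields Used" column for edges.tsv.
EDGE_FIELDS = [
    "subject", "predicate", "object", "qualified_predicate",
    "object_aspect_qualifier", "object_direction_qualifier",
    "relation", "_attributes",
]


def read_edges(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per edges.tsv row, keeping only the fields listed above."""
    # QUOTE_NONE keeps the double quotes inside JSON cells as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Empty cells are normalized to None.
        edge = {name: (row.get(name) or None) for name in EDGE_FIELDS}
        # Assumption: _attributes is a JSON string; an empty cell becomes [].
        raw = edge["_attributes"]
        edge["_attributes"] = json.loads(raw) if raw else []
        yield edge
```

Reading lazily with a generator keeps memory use flat, which may matter if the edge file is large.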

Filtered Content¶

File Name	Filtered Records	Rationale
edges.tsv	Edges whose subject or object ID prefixes violate Biolink Model domain/range constraints. For example, an edge such as "MONDO:xxx biolink:treats DRUGBANK:yyy" is excluded because the association type expects a chemical as the subject and a disease as the object, not the reverse.	Validation uses the Biolink Model Toolkit (BMT) to dynamically determine the valid ID prefixes for each predicate's domain and range. This filters out semantically invalid edges, such as reversed ChemicalToDiseaseOrPhenotypicFeatureAssociation edges in which diseases or phenotypes (MONDO, HP prefixes) are incorrectly positioned as subjects.
edges.tsv	GeneRegulatesGeneAssociation edges that lack required qualifiers (object_aspect_qualifier or object_direction_qualifier).	These qualifiers are required for semantic correctness of regulatory relationships. Edges without them would have ambiguous meaning.
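
The two filters above can be sketched as a single predicate function. This is an illustration only: hard-coded prefix sets stand in for the lookups that the ingest performs dynamically with BMT, the protein-to-protein test is an assumed proxy for identifying GeneRegulatesGeneAssociation edges, and the sketch assumes both qualifiers must be present. The name `keep_edge` is hypothetical.

```python
# Hard-coded stand-ins for the prefix sets the real ingest derives from BMT.
CHEMICAL_PREFIXES = {"DRUGBANK", "CHEBI"}
DISEASE_OR_PHENOTYPE_PREFIXES = {"MONDO", "HP"}
PROTEIN_PREFIXES = {"UniProtKB"}

# Assumption: predicates with a chemical domain and a disease/phenotype range.
CHEMICAL_TO_DISEASE_PREDICATES = {"biolink:treats"}


def prefix(curie: str) -> str:
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def keep_edge(edge: dict) -> bool:
    """Return False for edges that either filter described above would drop."""
    subj, obj = prefix(edge["subject"]), prefix(edge["object"])

    # Filter 1: subject/object prefixes must match the predicate's domain/range.
    if edge["predicate"] in CHEMICAL_TO_DISEASE_PREDICATES:
        if subj not in CHEMICAL_PREFIXES or obj not in DISEASE_OR_PHENOTYPE_PREFIXES:
            return False

    # Filter 2: protein-to-protein regulatory edges need both qualifiers
    # (assumed reading of "required qualifiers").
    if subj in PROTEIN_PREFIXES and obj in PROTEIN_PREFIXES:
        if not (edge.get("object_aspect_qualifier") and edge.get("object_direction_qualifier")):
            return False

    return True
```

For instance, `keep_edge` rejects the reversed "MONDO:xxx biolink:treats DRUGBANK:yyy" edge from the table and accepts the same edge with subject and object swapped.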

Future Content Considerations¶

edge_property_content: Review the study results data and modeling. We may be able to simplify the modeling, depending on the requirements for this type of metadata in this phase.

other: Consider whether it is sufficient for RIG Edge Type objects to be the source of truth for UI explanations, given that normalized KGs end up containing edges that do not fit into these categories. Determine how to ensure that every edge in a KG has an explanation, and that the UI team can find, access, and use these explanations easily and accurately.

edge_content: Decide on an approach for one-hop predictions like those described above (whether DINGO creates them at ingest, or the CQS is reinstated to create them).

Additional Notes: Attributes in the data are encoded as JSON objects that may contain nested attributes representing supporting studies and evidence. The ingest creates Study objects containing TextMiningStudyResult objects to model the nested study result data structure, with supporting text and document references captured in the TextMiningStudyResult objects.
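
The nesting described above might be modeled roughly as follows. This is a sketch under stated assumptions, not the ingest's actual classes: the JSON keys (`attribute_type_id`, `value`, `attributes`) and the two attribute type identifiers are hypothetical, as are the field names on `Study` and `TextMiningStudyResult`.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical attribute type identifiers for the nested evidence values.
TEXT_KEY = "biolink:supporting_text"
DOCUMENT_KEY = "biolink:supporting_document"


@dataclass
class TextMiningStudyResult:
    supporting_text: Optional[str] = None  # sentence that supports the assertion
    document: Optional[str] = None         # reference to the source publication


@dataclass
class Study:
    results: List[TextMiningStudyResult] = field(default_factory=list)


def build_study(attributes_json: str) -> Study:
    """Collect each attribute that carries nested attributes into a study result."""
    study = Study()
    for attribute in json.loads(attributes_json):
        nested = attribute.get("attributes")
        if not nested:
            continue  # flat attribute, not a study result
        values = {item["attribute_type_id"]: item["value"] for item in nested}
        study.results.append(
            TextMiningStudyResult(
                supporting_text=values.get(TEXT_KEY),
                document=values.get(DOCUMENT_KEY),
            )
        )
    return study
```

Keeping one `TextMiningStudyResult` per nested attribute preserves the link between each supporting sentence and the document it came from.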

Target Information¶

Target InfoRes ID: infores:text-mining-provider-targeted

Edge Types¶

Subject Categories	Object Categories	Knowledge Level	Agent Type	UI Explanation
biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture, biolink:ComplexMolecularMixture	biolink:Protein	not_provided	text_mining_agent	Targeted text mining identified that a chemical affects a protein, with causal direction and magnitude qualifiers indicating how the chemical modulates protein activity or abundance. Evidence from scientific literature includes supporting text excerpts and publication references.

biolink:Protein	biolink:Protein	not_provided	text_mining_agent	Targeted text mining identified gene regulatory relationships where one protein causes changes in another protein's activity or abundance. These edges include required qualifiers indicating the direction and magnitude of the regulatory effect.

biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture, biolink:ComplexMolecularMixture	biolink:Disease, biolink:PhenotypicFeature	knowledge_assertion	text_mining_agent	Targeted text mining identified therapeutic or studied-for-treatment relationships where a chemical treats, or was applied or studied to treat, a disease or phenotypic feature. The broad predicate reflects that NLP cannot distinguish between actual treatment and studied/applied treatment relationships.

biolink:SmallMolecule, biolink:ChemicalEntity, biolink:MolecularMixture, biolink:ComplexMolecularMixture	biolink:Disease, biolink:PhenotypicFeature	not_provided	text_mining_agent	Targeted text mining identified that a chemical contributes to a disease or phenotypic feature, indicating a causal or risk relationship based on evidence from literature.

biolink:Protein	biolink:Disease, biolink:PhenotypicFeature	not_provided	text_mining_agent	Targeted text mining identified that a protein/gene contributes to a disease or phenotypic feature, indicating molecular involvement based on evidence from literature. Uses the Biolink EPC pattern with 'affects' as the primary predicate and 'contributes_to' as the qualified predicate. The subject_form_or_variant_qualifier indicates the form of the protein involved (modified, loss-of-function, or gain-of-function).

biolink:Protein	biolink:Disease, biolink:PhenotypicFeature	not_provided	text_mining_agent	Targeted text mining identified that a protein/gene affects a disease or phenotypic feature, based on linguistic patterns in scientific literature.
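
To make the qualified-predicate pattern in the protein "contributes to" row concrete, the sketch below builds one such edge as a plain dictionary. The function name, the placeholder identifiers, and the example qualifier value are illustrative assumptions; only the predicate, qualified predicate, qualifier name, knowledge level, and agent type come from the table.

```python
from typing import Optional


def protein_contributes_to_disease(
    protein_id: str,
    disease_id: str,
    form: Optional[str] = None,
) -> dict:
    """Build an edge with 'affects' as predicate and 'contributes_to' as qualified predicate."""
    edge = {
        "subject": protein_id,
        "predicate": "biolink:affects",
        "qualified_predicate": "biolink:contributes_to",
        "object": disease_id,
        "knowledge_level": "not_provided",
        "agent_type": "text_mining_agent",
    }
    # The form qualifier is only set when the text indicates a specific protein form.
    if form is not None:
        edge["subject_form_or_variant_qualifier"] = form
    return edge


# Placeholder identifiers and an assumed qualifier value, for illustration only.
example = protein_contributes_to_disease(
    "UniProtKB:P00000", "MONDO:0000000", form="loss_of_function_variant_form"
)
```

Read together, the qualified fields express the full statement (a loss-of-function form of the protein contributes to the disease), while the primary predicate alone still supports the weaker reading that the protein affects the disease.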

Node Types¶

Node Category	Source Identifier Types	Additional Notes
biolink:ChemicalEntity	DRUGBANK	None
biolink:Protein	UniProtKB, DRUGBANK	None
biolink:SmallMolecule	DRUGBANK, CHEBI	None
biolink:Disease	MONDO, HP	None
biolink:MolecularMixture	DRUGBANK, CHEBI	None
biolink:PhenotypicFeature	HP	None
biolink:ComplexMolecularMixture	DRUGBANK	None

Future Modeling Considerations¶

edge_properties: Review the study results data and modeling. We may be able to simplify the modeling, depending on the requirements for this type of metadata. Some of the scores and metadata that TMKP captures may be useful for scoring.

other: We may want to improve ui_explanations for edge types once we understand how the UI will stitch together and display predicates and qualifier values for these edges.

Additional Notes: All nodes created from edge records are minimal NamedThing placeholders with only an id. Richer node information (names, categories, etc.) comes from the nodes.tsv file via the node transform, or from downstream normalization.
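
The relationship between placeholder nodes and nodes.tsv records can be sketched as a simple merge. This is an illustration under assumptions: in the actual pipeline the combination happens through the node transform or downstream normalization, and `merge_nodes` is a hypothetical name.

```python
from typing import Dict, Iterable


def merge_nodes(edge_records: Iterable[dict], node_records: Iterable[dict]) -> Dict[str, dict]:
    """Create id-only placeholders from edges, then overlay richer nodes.tsv records."""
    nodes: Dict[str, dict] = {}

    # Placeholders: one minimal NamedThing per subject and object seen in the edges.
    for edge in edge_records:
        for curie in (edge["subject"], edge["object"]):
            nodes.setdefault(curie, {"id": curie, "category": "biolink:NamedThing"})

    # Records from nodes.tsv add name and a more specific category where available.
    for node in node_records:
        nodes[node["id"]] = {**nodes.get(node["id"], {}), **node}

    return nodes
```

Any node that appears only in edges stays a placeholder after the merge, which is the case downstream normalization would need to resolve.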

Provenance Information¶

Contributors:
- TMKP Team, Matt Brush - modeling, data generation, and text mining pipeline
- Sierra Moxon - ingest, code, data modeling

Artifacts: - TMKP GitHub: https://github.com/NCATSTranslator/Text-Mining-Provider-Roadmap