Pubtator Reference Ingest Guide

Source Information

InfoRes ID: infores:pubtator

Description: PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals.

Citations: - https://doi.org/10.1093/nar/gkae235

Data Access Locations: - https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/

Data Provision Mechanisms: file_download

Data Formats: tsv, other

Data Versioning and Releases: The FTP files are expected to be updated monthly, according to the FTP README. There isn't a formal version, so we use the last-modified date.
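Since there is no formal release version, a version label can be derived from the file's last-modified timestamp. The following is a minimal sketch, assuming the timestamp arrives as an HTTP `Last-Modified` header value and that a `YYYY-MM-DD` string is an acceptable label. Both are assumptions, and `version_from_last_modified` is a hypothetical helper, not part of the ingest code.

```python
from email.utils import parsedate_to_datetime


def version_from_last_modified(header_value: str) -> str:
    """Turn an HTTP Last-Modified header value into a YYYY-MM-DD version label."""
    # parsedate_to_datetime parses the standard HTTP date format,
    # e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    modified = parsedate_to_datetime(header_value)
    return modified.date().isoformat()


# Illustrative header value only, not a real PubTator timestamp.
example_version = version_from_last_modified("Wed, 21 Oct 2015 07:28:00 GMT")
```

A date-only label changes only when the file is republished, which fits a source that is expected to update monthly.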

Ingest Information

Ingest Categories: primary_knowledge_provider

Utility: Pubtator is a large, text-mined resource with a variety of entity relations. It could augment or replace other text-mined resources that we use but that are no longer maintained or updated. It could be used in MVP1 (may treat disease X), MVP2 (drug Y may increase/decrease gene Z's activity), or Pathfinder queries.

Relevant Files

File Name | Location | Description
relation2pubtator3.gz | https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/ | Whole set of relations extracted by BioREx

Included Content

File Name | Included Records | Fields Used
relation2pubtator3.gz | (1) Both entity types are "Chemical", "Disease", or "Gene"; (2) the "relation" value is mapped (excludes "compare" and "cotreat") |

Filtered Content

File Name | Filtered Records | Rationale
relation2pubtator3.gz | At least 1 entity type isn't "Chemical", "Disease", or "Gene". Currently, the other entity types are for variants ("DNAMutation", "ProteinMutation", "SNP", "Mutation"). | NodeNorm currently doesn't support any variant namespaces, so it cannot resolve the IDs for these entities.
relation2pubtator3.gz | relation value is 'compare' | This seems too vague to be a meaningful relationship. The definition is "The effect comparison of two chemicals/drugs" (https://www.ncbi.nlm.nih.gov/research/pubtator3/tutorial "Relation Annotations", Table 2).
relation2pubtator3.gz | relation value is 'cotreat' | Couldn't find a biolink-model predicate that represents this relationship well. The definition is "It is defined as the use of two or more chemical/drugs administered separately or in a fixed-dose combination" (https://www.ncbi.nlm.nih.gov/research/pubtator3/tutorial "Relation Annotations", Table 2).
relation2pubtator3.gz | Entity ID maps to NodeNorm clique with unexpected category | Currently, 3 entity IDs map to unexpected main categories (not ChemicalEntity, DiseaseOrPheno, or Gene-related). They are filtered out in case they produce odd edges. Some IDs could be removed from the filter if NodeNorm addresses their category issues. OrganismTaxon: (a) MESH:C100843 (Lacteol): supplement with heat-killed Lactobacillus (brand link). Pubtator classifies it as Chemical; should it be a biolink:Drug? (b) MESH:C000598555 (2,5-dihexyl-N,N'-dicyano-p-quinonediimine): very little info. Pubtator classifies it as Chemical, and the name looks like a chemical, so this appears to be a NodeNorm error. CellularComponent: (a) MESH:C000719328 (smoker's inclusion bodies): Pubtator classifies it as Disease, but NodeNorm is correct that it's a CellularComponent (see wiki paragraph 3).
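The inclusion and filtering rules above can be sketched as a single predicate over one line of the relation file. This is a hedged illustration rather than the actual parser. It assumes each line has four tab-separated columns (PMID, relation, first entity, second entity) and that each entity is written as `Type|ID`. The function names are hypothetical, and the sketch applies the unexpected-category filter by raw ID, although the real ingest may apply it after normalization.

```python
from typing import Tuple

# Entity types kept by the ingest, per the Included Content table.
SUPPORTED_TYPES = {"Chemical", "Disease", "Gene"}

# Relation values that are mapped to a predicate ('compare' and 'cotreat' are not).
MAPPED_RELATIONS = {
    "associate", "cause", "drug_interact", "interact", "inhibit",
    "negative_correlate", "stimulate", "positive_correlate", "treat",
}

# The 3 entity IDs whose NodeNorm cliques have unexpected categories.
EXCLUDED_IDS = {"MESH:C100843", "MESH:C000598555", "MESH:C000719328"}


def split_entity(field: str) -> Tuple[str, str]:
    """Split an assumed 'Type|ID' field into (type, id)."""
    entity_type, _, entity_id = field.partition("|")
    return entity_type, entity_id


def keep_record(line: str) -> bool:
    """Return True if a relation line passes every filter listed above."""
    # ASSUMPTION: columns are PMID, relation, entity 1, entity 2 (tab-separated).
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 4:
        return False
    _pmid, relation, first, second = parts
    if relation not in MAPPED_RELATIONS:
        return False
    for field in (first, second):
        entity_type, entity_id = split_entity(field)
        if entity_type not in SUPPORTED_TYPES or entity_id in EXCLUDED_IDS:
            return False
    return True
```

Keeping the three rule sets as module-level constants means a change such as supporting a variant namespace later only requires editing one set.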

Future Content Considerations

other: NodeNorm could add the variant rs# (dbSNP) namespace. Then some of the variant data would have IDs that can be NodeNormed and we could ingest it.
- Relevant files: relation2pubtator3.gz

edge_property_content: See if we can add text snippet information for relations using other files.
- Relevant files: entity files or BioC-XML files

Target Information

Target InfoRes ID: infores:translator-pubtator

Edge Types

Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation
biolink:ChemicalEntity, biolink:DiseaseOrPhenotypicFeature, biolink:Gene | | biolink:ChemicalEntity, biolink:DiseaseOrPhenotypicFeature, biolink:Gene | not_provided | text_mining_agent | The Pubtator-reported 'relation' was 'associate', which corresponds to a very general predicate.
biolink:ChemicalEntity | | biolink:DiseaseOrPhenotypicFeature | not_provided | text_mining_agent | The Pubtator-reported 'relation' was 'cause'. Based on the relation definition, we picked this predicate.
biolink:ChemicalEntity | | biolink:ChemicalEntity | not_provided | text_mining_agent | The Pubtator-reported 'relation' was 'drug_interact'. For a potential drug-drug interaction, we picked this predicate.
biolink:ChemicalEntity, biolink:Gene | | biolink:ChemicalEntity, biolink:Gene | not_provided | text_mining_agent | The Pubtator-reported 'relation' was 'interact'. Based on the relation definition (physical interaction), we picked this predicate.
biolink:DiseaseOrPhenotypicFeature, biolink:ChemicalEntity, biolink:Gene | | biolink:ChemicalEntity, biolink:Gene | not_provided | text_mining_agent | The Pubtator-reported 'relation' was 'inhibit' or 'negative_correlate'. Based on the relation definitions (negative correlation), we picked this predicate.
biolink:DiseaseOrPhenotypicFeature, biolink:ChemicalEntity, biolink:Gene | | biolink:ChemicalEntity, biolink:Gene | not_provided | text_mining_agent | The Pubtator-reported 'relation' was 'stimulate' or 'positive_correlate'. Based on the relation definitions (positive correlation), we picked this predicate.
biolink:ChemicalEntity | | biolink:DiseaseOrPhenotypicFeature | not_provided | text_mining_agent | The Pubtator-reported 'relation' was 'treat' for a Chemical - Disease pair. We use a more general predicate because this is a text-mined assertion.

Node Types

Node Category | Source Identifier Types | Additional Notes
biolink:ChemicalEntity | MESH |
biolink:DiseaseOrPhenotypicFeature | MESH, OMIM |
biolink:Gene | NCBIGene |
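The node types above imply a simple sanity check: each node category should only carry identifiers from its listed namespaces. A minimal sketch, assuming identifiers are already in `PREFIX:local_id` CURIE form (how the raw file writes identifiers is not specified here). `has_expected_prefix` is a hypothetical helper for illustration.

```python
# Expected source identifier prefixes per node category, from the Node Types table.
EXPECTED_PREFIXES = {
    "biolink:ChemicalEntity": {"MESH"},
    "biolink:DiseaseOrPhenotypicFeature": {"MESH", "OMIM"},
    "biolink:Gene": {"NCBIGene"},
}


def has_expected_prefix(category: str, curie: str) -> bool:
    """Return True if the CURIE's prefix is one listed for the given category."""
    prefix, sep, local_id = curie.partition(":")
    # A valid CURIE needs both a separator and a non-empty local identifier.
    if not sep or not local_id:
        return False
    return prefix in EXPECTED_PREFIXES.get(category, set())
```

A check like this could flag identifiers from an unexpected namespace before they reach normalization.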

Future Modeling Considerations

predicates: Add a biolink predicate for relation value 'cotreat', so we can ingest that data?

Additional Notes:
(1) A relation name can be specific, but its definition is usually very general (starts with positive/negative correlation…). The relation assignment can also be incorrect, so using weaker/general predicates may be better?
(2) A few MESH (chem/disease) and OMIM (disease) IDs are NodeNormed to NCBIGene, which can be seen in the normalization-metadata file, field normalized_to. This seems like an issue with the data or NodeNorm that should be investigated at some point.
(3) When running the ingest, there are some validation warnings that other categories were encountered when Chemical was expected. CX suspects something is going on with the MESH IDs: either issues with the data, issues with NodeNorm, or understandable issues resolving proteins vs drugs.
(4) Can't create links to Pubtator webpages (source_record_urls) because they include entity names/labels, which aren't in the relation file (ref: https://www.ncbi.nlm.nih.gov/research/pubtator3/tutorial "Relation Search").
(5) We use the relation definitions in the tutorial https://www.ncbi.nlm.nih.gov/research/pubtator3/tutorial ("Relation Annotations", Table 2), which seem to make more sense and match the current data better than the paper's Supplementary Table 2.
(6) The paper's Supplementary Table 7 summarizes each tool within Pubtator (what it does, citation, all public domain license).

Provenance Information

Contributors:
- Colleen Xu: code author, data modeling
- Andrew Su: domain expertise

Artifacts:
- Analyzing Pubtator MetaTriples and comparing it to semmeddb/TMKP: https://docs.google.com/spreadsheets/d/1O096szXwAkjJRYf3MxlRyhUeNkRvtZ6Gxk96v__mXgU/edit?gid=2045242875#gid=2045242875
- notebooks for development work currently in parser code directory