Drug-Gene Interaction Database (DGIdb) Reference Ingest Guide

Source Information

InfoRes ID: infores:dgidb

Description: The Drug-Gene Interaction Database (DGIdb) streamlines the search for druggable therapeutic targets through the aggregation, categorization, and curation of drug and gene data from publications and expert resources.

Citations:
- https://doi.org/10.1093/nar/gkad1040

Data Access Locations:
- Downloads page: https://dgidb.org/downloads

Data Provision Mechanisms: file_download, api_endpoint, other

Data Formats: tsv

Data Versioning and Releases: DGIdb has been updated 1-2 times a year since 2021 (based on the raw data dump table at https://dgidb.org/downloads). Releases are versioned by date (year and abbreviated month, e.g. 2024-Dec); the 2024-Dec release also has a semantic version number in its header.

Ingest Information

Ingest Categories: aggregation_interpreter

Utility: DGIdb provides associations between drugs/chemicals and genes/proteins. These associations could be used in MVP1 (may treat disease X), MVP2 (drug Y may increase/decrease gene Z's activity), or Pathfinder queries.

Relevant Files

File Name	Location	Description
interactions.tsv	https://dgidb.org/downloads	TSV download of drug-gene interaction claims

Included Content

File Name	Included Records	Fields Used
interactions.tsv	(1) Have a value in gene_concept_id and drug_concept_id columns. (2) Drug ID is from a namespace that NodeNorm uses for ChemicalEntities (currently: rxcui, chembl (compound), drugbank).
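The two inclusion rules above can be sketched as a simple row filter. This is a minimal illustration, not the ingest's actual parser code: the function names `keep_row` and `filter_interactions`, the assumption that concept IDs are written as `namespace:local_id`, and the lowercase namespace spellings are all assumptions. Only the column names and the three drug namespaces come from the table.

```python
import csv
import io

# Drug namespaces named in the inclusion rule (lowercase spelling is assumed).
ALLOWED_DRUG_NAMESPACES = {"rxcui", "chembl", "drugbank"}


def keep_row(row: dict) -> bool:
    """Return True if an interactions.tsv row passes both inclusion rules."""
    gene_id = (row.get("gene_concept_id") or "").strip()
    drug_id = (row.get("drug_concept_id") or "").strip()
    # Rule 1: both concept ID columns must have a value.
    if not gene_id or not drug_id:
        return False
    # Rule 2: the drug ID must be "namespace:local_id" with an allowed namespace.
    namespace, sep, local_id = drug_id.partition(":")
    return bool(sep and local_id) and namespace.lower() in ALLOWED_DRUG_NAMESPACES


def filter_interactions(tsv_text: str) -> list:
    """Parse TSV text and keep only the rows accepted by keep_row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if keep_row(row)]
```

Keeping the namespace list in one constant means that a change in what NodeNorm accepts for ChemicalEntities only needs to be reflected in one place.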

Future Content Considerations

node_property_content: interactions.tsv has some columns that could be ingested as node properties: drug_is_approved, drug_is_immunotherapy, drug_is_antineoplastic, drug_specificity_score, gene_specificity_score. However, we may want to use a different resource that's updated more frequently or has more coverage.
- Relevant files: interactions.tsv

node_property_content: DGIdb raw files for gene and drug claims could be sources of node properties. However, we may want to use a different resource that's updated more frequently or has more coverage.
- Relevant files: genes.tsv, drugs.tsv

other: According to others who have worked with DGIdb in the past, it is possible to load a local database that has more information. However, I looked at the GitHub instructions for recreating the database locally, and I don't think this can be done easily or automatically by the Translator ingest pipeline.

Target Information

Target InfoRes ID: infores:translator-dgidb

Edge Types

Subject Categories	Object Categories	Knowledge Level	Agent Type	UI Explanation
biolink:ChemicalEntity	biolink:Gene	knowledge_assertion	automated_agent	Source DGIdb aggregates drug-gene interaction statements of different types (e.g. 'antagonist', 'blocker', 'inhibitor', 'unknown') from several external knowledgebases. An automated algorithm assigns an 'interaction score' to each unique drug-gene pair across aggregated records. In performing this aggregation and 'meta-assessment', DGIdb makes its own implicit claims about each interaction, which are captured in Translator edges. The record(s) used to create this particular edge have a DGIdb 'interaction_type' that is empty or explicitly reported as 'other/unknown', and thus assert only that an interaction of some type is occurring. This relationship is represented using the broadly-scoped Biolink 'interacts with' predicate.
biolink:ChemicalEntity	biolink:Gene	knowledge_assertion	automated_agent	Source DGIdb aggregates drug-gene interaction statements of different types (e.g. 'antagonist', 'blocker', 'inhibitor', 'unknown') from several external knowledgebases. An automated algorithm assigns an 'interaction score' to each unique drug-gene pair across aggregated records. In performing this aggregation and 'meta-assessment', DGIdb makes its own implicit claims about each interaction, which are captured in Translator edges. The record(s) used to create this particular edge have a DGIdb 'interaction_type' that explicitly states an interaction (e.g. 'binder'), or necessarily implies that a physical interaction is occurring (e.g. 'agonist', 'antibody', 'blocker', 'inhibitor', 'inverse agonist'). This relationship is represented using the Biolink 'directly physically interacts with' predicate, with qualifiers that report the interaction mechanism when provided (e.g. 'binding').
biolink:ChemicalEntity	biolink:Gene	knowledge_assertion	automated_agent	Source DGIdb aggregates drug-gene interaction statements of different types (e.g. 'antagonist', 'blocker', 'inhibitor', 'unknown') from several external knowledgebases. An automated algorithm assigns an 'interaction score' to each unique drug-gene pair across aggregated records. In performing this aggregation and 'meta-assessment', DGIdb makes its own implicit claims about each interaction, which are captured in Translator edges. The record(s) used to create this particular edge have a DGIdb 'interaction_type' that indicates a specific type of effect the drug has on the activity or abundance of a gene/protein, and often the direction of the effect (e.g. that it causes increased activity of the gene/protein, or decreased stability of the gene/protein). These causal relationships are represented using the Biolink 'affects' predicate, with qualifiers capturing additional detail about the aspect and direction of the effect, and a causal mechanism where provided.
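The three-way split described in the table can be sketched as a small predicate selector. This is an illustration under stated assumptions, not the ingest's real mapping, which lives in the interaction type to mechanism mapping table linked under Artifacts. The function name `choose_predicate` is hypothetical, the physical-interaction set contains only the example values quoted in the table, and the set of effect-describing types is deliberately left for the caller to supply because the table does not enumerate it. Qualifiers are omitted.

```python
# Interaction types quoted in the table as stating or implying a physical
# interaction. The real list may be longer; this is only the quoted examples.
PHYSICAL_TYPES = {
    "binder", "agonist", "antibody", "blocker", "inhibitor", "inverse agonist",
}

# Values treated as "no specific type reported".
UNSPECIFIED_TYPES = {"", "other/unknown"}


def choose_predicate(interaction_type, affects_types) -> str:
    """Pick a Biolink predicate for a DGIdb interaction_type value.

    affects_types is a caller-supplied set of types that describe an effect
    on activity or abundance (ASSUMPTION: not enumerated in this guide).
    """
    value = (interaction_type or "").strip().lower()
    if value in UNSPECIFIED_TYPES:
        return "biolink:interacts_with"
    if value in PHYSICAL_TYPES:
        return "biolink:directly_physically_interacts_with"
    if value in affects_types:
        return "biolink:affects"
    # Surface unmapped values instead of silently guessing a predicate.
    raise ValueError(f"unmapped interaction_type: {interaction_type!r}")
```

Raising on unmapped values is one possible design choice; it makes new interaction types in a future DGIdb release visible rather than letting them fall through to a default predicate.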

Node Types

Node Category	Source Identifier Types	Additional Notes
biolink:ChemicalEntity	RXCUI, CHEMBL.COMPOUND, DRUGBANK
biolink:Gene	HGNC, NCBIGene, ENSEMBL
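As a rough illustration of how source identifiers might be rewritten with the prefixes listed above, the sketch below converts a `namespace:local_id` string into a CURIE. The helper name `to_curie`, the lowercase source namespaces on the left-hand side of each mapping, and the `namespace:local_id` input shape are assumptions; only the target prefixes come from the table.

```python
from typing import Optional

# Source namespace (assumed lowercase spelling) -> prefix from the Node Types table.
DRUG_PREFIXES = {
    "rxcui": "RXCUI",
    "chembl": "CHEMBL.COMPOUND",
    "drugbank": "DRUGBANK",
}
GENE_PREFIXES = {
    "hgnc": "HGNC",
    "ncbigene": "NCBIGene",
    "ensembl": "ENSEMBL",
}


def to_curie(concept_id: str, prefix_map: dict) -> Optional[str]:
    """Rewrite 'namespace:local_id' using prefix_map, or return None if unmapped."""
    namespace, sep, local_id = concept_id.strip().partition(":")
    prefix = prefix_map.get(namespace.lower())
    # Reject IDs without a colon, without a local part, or with an unknown namespace.
    if not sep or not local_id or prefix is None:
        return None
    return f"{prefix}:{local_id}"
```

Returning `None` for unmapped namespaces lets the caller decide whether to skip the record or log it.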

Future Modeling Considerations

spoq_pattern: 'immunotherapy' interaction_type: its full meaning isn't captured in the current qualifier-pattern because there were some concerns with using causal_mechanism_qualifier. In the future, we could try representing this using existing or new qualifiers (therapeutic context?).

other: It's unclear what the 'NCI' value in the interaction_source_db_name column stands for (it is not mentioned on the current DGIdb website, and the 2024 paper is unclear). We think it refers to NCIt, so we mapped it to NCIt's infores. Notes: (1) NCI is an organization, not an information resource; (2) Matt suggested that we could instead map it to 'NCI enterprise vocabulary services' to cover 'the more expansive suite of terminologies / knowledgebases / services NCI provides'.

other: Matt said underlying-source licensing and data overlap with 'directly-ingested' resources are not concerns because DGIdb is an 'interpreter' and can be set as the primary source (with the underlying sources set to supporting data providers or publications). However, if we are directly ingesting all underlying sources for an edge, the DGIdb edge then seems like extraneous duplication.

Additional Notes: (1) Edge provenance: primary source is DGIdb, and the underlying sources are supporting data providers or publications. (2) We couldn't generate links to DGIdb interaction webpages because they use an ID that looks random or like a hash. Links to the drug or gene pages can currently be generated, but they wouldn't be specific to the edge.
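The provenance pattern in note (1) might look roughly like the following when written as Biolink-style retrieval source records. This is a hedged sketch: the helper name `build_sources` is hypothetical, the use of `upstream_resource_ids` on the primary source to point at its supporting sources is an assumption about how the edge is serialized, and the supporting infores IDs passed in are whatever the caller has already mapped from the source data.

```python
def build_sources(supporting_infores_ids):
    """Build a list of retrieval-source dicts with DGIdb as the primary source."""
    # De-duplicate and sort so the output is stable across runs.
    supporting = sorted(set(supporting_infores_ids))
    primary = {
        "resource_id": "infores:dgidb",
        "resource_role": "primary_knowledge_source",
    }
    if supporting:
        # ASSUMPTION: the primary source lists its supporting sources as upstream.
        primary["upstream_resource_ids"] = supporting
    sources = [primary]
    # Each underlying source is recorded as a supporting data source.
    for infores_id in supporting:
        sources.append(
            {"resource_id": infores_id, "resource_role": "supporting_data_source"}
        )
    return sources
```

For example, `build_sources(["infores:example-source"])` (a made-up ID) would yield one primary record for DGIdb followed by one supporting record.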

Provenance Information

Contributors:
- Colleen Xu - code author, data modeling
- Matthew Brush - data modeling, domain expertise

Artifacts:
- Recording special logic info: https://github.com/NCATSTranslator/Data-Ingest-Coordination-Working-Group/issues/55
- notebooks for development work currently in parser code directory
- source data examples: https://docs.google.com/spreadsheets/d/1MtjzDFwLTa0iWIBGZuNiL8esXPsiANt7Gnm4FYK8QKA/edit?gid=302410930#gid=302410930
- interaction type to mechanism mapping table: https://docs.google.com/spreadsheets/d/1DeAE04O1mz3R9s3dCZpG2hQp9hwif5WjdkUMsBci-u8/edit?gid=1429630006#gid=1429630006