CTD Resource Ingest Guide

Source Information

InfoRes ID: infores:ctd

Description: CTD is a robust, publicly available database that aims to advance understanding of how environmental exposures affect human health. It provides knowledge, manually curated from the literature, about chemicals and their relationships to other biological entities: chemical to gene/protein interactions plus chemical to disease and gene to disease relationships. These data are integrated with functional and pathway data to aid in the development of hypotheses about the mechanisms underlying environmentally influenced diseases. CTD also generates novel inferences by further analyzing the knowledge it curates/creates, based on statistically significant connections with intermediate concepts (e.g. Chemical X associated with Disease Y based on shared associations with a common set of genes).

Citations: - Davis AP, Wiegers TC, Johnson RJ, Sciaky D, Wiegers J, Mattingly CJ. Comparative Toxicogenomics Database (CTD): update 2023. Nucleic Acids Res. 2022 Sep 28.

Data Access Locations:
- CTD Bulk Downloads - http://ctdbase.org/downloads/ (this page includes file sizes and simple data dictionaries for each download)
- CTD Catalog - https://ctdbase.org/reports/ (a simple list of files that reports the number of rows in each file)

Data Provision Mechanisms: file_download

Data Formats: tsv, csv, obo

Data Versioning and Releases: There is no consistent release cadence, but on average there are 1-2 releases each month. Versioning is based on the month and year of the release.
- Releases page / change log: https://ctdbase.org/about/changes/
- Latest status page: https://ctdbase.org/about/dataStatus.go

Ingest Information

Ingest Categories: primary_knowledge_provider

Utility: CTD is a rich source of manually curated chemical associations to other biological entities, which are an important type of edge for Translator query and reasoning use cases, including treatment predictions, chemical-gene regulation predictions, and pathfinder queries. It is one of the few sources that focus on non-drug chemicals, e.g. environmental stressors, and how these are related to diseases, biological processes, and genes.

Scope: Covers a diverse range of chemical association knowledge types, spread across several CTD files and based on many knowledge curation/creation approaches.

Relevant Files

File Name | Location | Description
CTD_chemicals_diseases.tsv.gz | http://ctdbase.org/downloads/ | Manually curated and computationally inferred associations between chemicals and diseases
CTD_exposure_events.tsv.gz | http://ctdbase.org/downloads/ | Descriptions of statistical studies of how exposure to chemicals affects a particular population, with some records providing outcomes
CTD_chem_gene_ixns.tsv.gz | http://ctdbase.org/downloads/ | Descriptions of curated, detailed causal associations describing how chemicals affect genes/gene products.
CTD_chem_go_enriched.tsv.gz | http://ctdbase.org/downloads/ | CTD calculates which GO terms are statistically enriched among the genes/proteins that interact with each chemical or its descendants.
CTD_chem_pathways_enriched.tsv.gz | http://ctdbase.org/downloads/ | Inferred associations between chemicals and biological pathways - computationally derived enrichments based on curated chemical–gene and gene–pathway data in CTD.
CTD_pheno_term_ixns.tsv.gz | http://ctdbase.org/downloads/ | Curated, literature-based associations between chemicals and phenotypes (often disease-like or physiological outcomes).
CTD_genes_diseases.tsv.gz and CTD_curated_genes_diseases.tsv.gz | http://ctdbase.org/downloads/ | Manually curated and computationally inferred associations between genes and diseases (to be ingested in future iteration)
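
As a rough illustration of how one of the files above might be read, the sketch below streams rows from a gzipped TSV using only the Python standard library. The function name `read_ctd_tsv` is hypothetical and is not taken from the ingest code. The sketch assumes that header and comment lines start with `#`, so column names are passed in explicitly; the data dictionaries on the downloads page describe the actual layout of each file.

```python
import csv
import gzip
import io


def read_ctd_tsv(source, fieldnames):
    """Yield one dict per data row of a gzipped, tab-separated file.

    `source` may be a file path or a binary file object.
    Assumption: lines starting with '#' are header/comment lines and are
    skipped, so the column names must be supplied by the caller.
    """
    with gzip.open(source, "rt", encoding="utf-8", newline="") as handle:
        data_lines = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(
            data_lines,
            fieldnames=fieldnames,
            delimiter="\t",
            quoting=csv.QUOTE_NONE,  # treat quote characters as plain text
        )
        yield from reader


# Made-up sample data, compressed in memory so the example is self-contained.
sample = gzip.compress(
    b"# Fields:\n# ChemicalName\tChemicalID\nexample chemical\tCHEM_1\n"
)
rows = list(read_ctd_tsv(io.BytesIO(sample), ["ChemicalName", "ChemicalID"]))
print(rows)
```

Streaming row by row avoids loading a whole file into memory, which may matter for the larger downloads.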

Included Content

File Name | Included Records | Fields Used
CTD_chemicals_diseases.tsv.gz | Associations curated and designated therapeutic or marker/mechanism | ChemicalName, ChemicalID, CasRN, DiseaseName, DiseaseID, DirectEvidence, OmimIDs, PubMedIDs
CTD_exposure_events.tsv.gz | All records with an outcomerelationship of "positive correlation" or "negative correlation"
CTD_chem_gene_ixns.tsv.gz | All records that involve only two entities; records that involve three or more entities are dropped.
CTD_chem_go_enriched.tsv.gz | Records with a corrected p value less than 1e-10 and a highest GO level greater than or equal to 3
CTD_chem_pathways_enriched.tsv.gz | Records with a corrected p value less than 1e-10
CTD_pheno_term_ixns.tsv.gz | All records (none filtered)
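
The two enrichment thresholds above can be sketched as simple row predicates. This is a minimal illustration, not the ingest code: the function names are hypothetical, and the dictionary keys `CorrectedPValue` and `HighestGOLevel` are assumed column names that should be checked against the CTD data dictionary.

```python
P_VALUE_CUTOFF = 1e-10  # corrected p value must be strictly below this
MIN_GO_LEVEL = 3        # highest GO level must be at least this


def keep_go_enrichment(row):
    """Return True if a chemical-GO enrichment row passes both thresholds."""
    # Assumed column names: "CorrectedPValue" and "HighestGOLevel".
    return (
        float(row["CorrectedPValue"]) < P_VALUE_CUTOFF
        and int(row["HighestGOLevel"]) >= MIN_GO_LEVEL
    )


def keep_pathway_enrichment(row):
    """Return True if a chemical-pathway enrichment row passes the p value cutoff."""
    return float(row["CorrectedPValue"]) < P_VALUE_CUTOFF


# Made-up rows for illustration only.
go_rows = [
    {"CorrectedPValue": "1e-12", "HighestGOLevel": "4"},  # kept
    {"CorrectedPValue": "1e-12", "HighestGOLevel": "2"},  # dropped: GO level too low
    {"CorrectedPValue": "1e-5", "HighestGOLevel": "5"},   # dropped: p value too high
]
kept_go_rows = [row for row in go_rows if keep_go_enrichment(row)]
```

Values are parsed from strings because rows read from a TSV file arrive as text.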

Filtered Content

File Name | Filtered Records | Rationale
CTD_chemicals_diseases.tsv.gz | Records lacking a DirectEvidence designation | These relationships are based only on computational inference and were found not to be reliably useful enough to include.
CTD_chem_gene_ixns.tsv.gz | Records involving three or more entities are removed. | It is not clear how we could split them into individual edges in a way that preserves the original knowledge assertion.
CTD_chem_go_enriched.tsv.gz | Records with a corrected p value of 1e-10 or higher, or a highestGoLevel lower than 3, are dropped. | This removes a large number of weak and overly vague associations.
CTD_chem_pathways_enriched.tsv.gz | Records with a corrected p value of 1e-10 or higher are dropped. | This removes a large number of weak associations.
CTD_exposure_events.tsv.gz | Records that do not have an outcomerelationship of "positive correlation" or "negative correlation" are dropped.
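
For the chemical-disease and exposure-event rules above, a hedged sketch of the corresponding checks might look like the following. The function names are hypothetical; the keys `DirectEvidence` and `outcomerelationship` are the field names mentioned in this guide, and exact string matching after trimming whitespace is an assumption about how the values appear in the files.

```python
# Outcome relationship values that are retained, as listed in the table above.
KEPT_OUTCOMES = {"positive correlation", "negative correlation"}


def keep_chemical_disease(row):
    """Keep only rows that carry a DirectEvidence designation."""
    return bool(row.get("DirectEvidence", "").strip())


def keep_exposure_event(row):
    """Keep only rows whose outcome relationship is a positive or negative correlation."""
    return row.get("outcomerelationship", "").strip() in KEPT_OUTCOMES


# Made-up rows for illustration only.
disease_rows = [
    {"DirectEvidence": "therapeutic"},
    {"DirectEvidence": ""},  # inferred only, dropped
]
kept_disease_rows = [row for row in disease_rows if keep_chemical_disease(row)]
```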

Future Content Considerations

edge_content: We should ingest gene-disease association edges from the CTD_genes_diseases.tsv file and/or the curated CTD_curated_genes_diseases.tsv.gz file. Lots of content here. Assess quality and compare to other gene-disease association sources when we do a comprehensive review of content/modeling in this area.

edge_content: For inferred Chemical-Disease associations, consider adding a cutoff to remove lower quality/confidence inferences (e.g. based on shared gene count, publication count, or inference score). At present we include even inferences based on a single shared gene/pub - which is not really meaningful. - Relevant files: CTD_chemicals_diseases.tsv.gz

edge_content: For enrichment-based Chemical-GO term associations, consider adding a cutoff to remove lower quality/confidence associations based on p-values. - Relevant files: CTD_chem_go_enriched.tsv

edge_content: For enrichment-based Chemical-Pathway associations, consider adding a cutoff to remove lower quality/confidence associations based on p-values. - Relevant files: CTD_chem_pathways_enriched.tsv

edge_property_content: Consider an edge property that reports the list of shared genes supporting C-D inferred associations. - Relevant files: CTD_chemicals_diseases.tsv.gz

node_property_content: Molepro ingested chemical properties in its previous ingests, which we will likely bring in at some point, in particular data from the CTD_chemicals.tsv file. - Relevant files: CTD_chemicals.tsv

edge_property_content: The exposure events file provides a lot of provenance/evidence metadata about supporting studies, including characteristics of subject populations, location of the study, and experimental/statistical data values. Consider for inclusion using a StudyResult object. - Relevant files: CTD_exposure_events.tsv.gz

Target Information

Target InfoRes ID: infores:translator-ctd

Edge Types

Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation
biolink:ChemicalEntity | | biolink:DiseaseOrPhenotypicFeature | knowledge_assertion | manual_agent | Source CTD data provides assertions about Chemical-Disease relationships that were manually curated from literature by CTD curators. The CTD record used to create this Translator edge has "T" (therapeutic) as its DirectEvidence code, indicating the chemical to be a "potential" treatment by virtue of its clinical use or scientific study. The broad and imprecise nature of this relationship is best represented with the Biolink predicate 'treats or applied or studied to treat' - which is used when it is not clear whether the chemical shows established efficacy or was merely studied or attempted as a treatment.
biolink:ChemicalEntity | | biolink:DiseaseOrPhenotypicFeature | knowledge_assertion | manual_agent | Source CTD data provides assertions about Chemical-Disease relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge has a 'DirectEvidence' code of "M" (marker/mechanism), indicating that the chemical is either a marker or contributing factor for a condition. This vague/imprecise relationship implies that at minimum there is correlation between the presence of the chemical and condition, and is best represented here using the Biolink 'correlated_with' predicate.
biolink:ChemicalEntity | | biolink:DiseaseOrPhenotypicFeature | statistical_association | data_analysis_pipeline | Source CTD data is based on statistical analysis of gene associations shared by a given Chemical-Disease pair, as determined by an automated CTD pipeline that assigns an 'inference score' to each pair. The CTD record(s) used to create this Translator edge have an 'inference score' that indicates a statistically significant overrepresentation of shared gene associations between the Chemical and Disease. This hints, but does not demonstrate, that a real biological relationship may exist. The indirect, statistical basis of this relationship is best represented using the relatively generic 'associated_with' Biolink predicate.
biolink:ChemicalEntity, biolink:ComplexMolecularMixture, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:PhenotypicFeature, biolink:Disease | statistical_association | manual_agent | Source CTD data provides correlations between Chemicals and Diseases based on a single real-world exposure study that was curated from the literature by CTD curators. These correlations are separately calculated by authors or curators for each study, not by an automated analysis pipeline. The CTD record used to create this Translator edge shows a statistically significant correlation between the studied exposure (usually a chemical) and a measured outcome (usually a disease or phenotype). This relationship is represented using the Biolink 'correlated with' predicate, or one of its directional sub-predicates if this information is provided ('positively/negatively correlated with'). Note that the edge is based on results from a single study - and may hold only in the context of the study's design or population.
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:Gene | knowledge_assertion | manual_agent | Source CTD data provides assertions about Chemical-Gene relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge describes how a chemical affects a particular aspect and/or form of a gene or gene product, and often the direction of the effect (e.g. that it causes increased activity of the gene, or decreased stability of the gene). These causal relationships are represented using the Biolink 'affects' or 'causes' predicates, with qualifiers capturing additional detail about the aspect and direction of the effect.
biolink:Gene | | biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | knowledge_assertion | manual_agent | Source CTD data provides assertions about Gene-Chemical relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge describes how a gene or gene product affects a particular aspect or form of a chemical entity, and often the direction of the effect (e.g. that it causes increased abundance of a chemical, or decreased stability). These causal relationships are represented using the Biolink 'affects' or 'causes' predicates, with qualifiers capturing additional detail about the aspect and direction of the gene's effect.
biolink:Gene | | biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | knowledge_assertion | manual_agent | Source CTD data provides assertions about Gene-Chemical relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge describes how a gene or gene product (usually a mutant form) may affect the susceptibility/sensitivity of a cell or organism to a particular chemical exposure. The relationship is represented using the Biolink 'affects sensitivity to' predicate - or one of its directional sub-predicates ('increases/decreases sensitivity to') if a direction is provided - with qualifiers capturing additional detail when a specific form of the gene mediates this sensitivity (e.g. a mutant_form).
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:Gene | | | Source CTD data provides assertions about Chemical-Gene relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge describes how a chemical may affect the susceptibility/sensitivity of a cell or organism to the actions of a particular gene or gene product (e.g. a signaling molecule like TNF). This relationship is represented using the Biolink 'affects sensitivity to' predicate - or one of its directional sub-predicates ('increases/decreases sensitivity to') if a direction is provided.
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:Gene | knowledge_assertion | manual_agent | Source CTD data provides assertions about Chemical-Gene relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge reports that a chemical entity binds to a particular gene/gene product, but does not indicate a specific effect/outcome resulting from this interaction on the gene. This relationship is represented using the Biolink 'directly physically interacts with' predicate, with qualifiers capturing additional detail about a specific form, part, or derivative of the gene or chemical that participates in the interaction, when relevant.
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:BiologicalProcess, biolink:BiologicalProcessOrActivity, biolink:MolecularActivity, biolink:Pathway | knowledge_assertion | manual_agent | Source CTD data report assertions about chemicals and cellular phenotypes that were manually curated from literature by CTD. The CTD record used to create this Translator edge reports that a chemical causes a cellular phenotype, which is framed as an effect on a biological process, activity, or pathway. This relationship is represented using the Biolink 'affects' or 'causes' predicates, with qualifiers capturing additional detail about the direction of the effect, and the anatomical or species context of the effect where provided (e.g. that it causes increased levels of the process or activity in a particular cell type, tissue, or species).
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:BiologicalProcessOrActivity, biolink:MolecularActivity, biolink:CellularComponent | statistical_association | data_analysis_pipeline | Source CTD data are based on an enrichment analysis of Gene Ontology (GO) terms annotated to all genes that interact with a particular chemical - to derive Chemical-GO Term associations. This analysis is performed by a CTD automated analysis pipeline. The CTD record used to create this Translator edge reports an overrepresented GO term associated with a particular chemical - as indicated by a sufficiently low p-value from the analysis. This hints, but does not directly demonstrate, that the chemical may impact the process, activity, or cellular component the GO term represents. The indirect, statistical basis of this relationship is best represented using the 'associated_with' predicate in Biolink.
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:Pathway | statistical_association | data_analysis_pipeline | Source CTD data are based on an enrichment analysis of Pathways annotated to all genes that interact with a particular chemical - to derive Chemical-Pathway associations. This analysis is performed by a CTD automated analysis pipeline. The CTD record used to create this Translator edge reports an overrepresented Pathway associated with a particular chemical - as indicated by a sufficiently low p-value from the analysis. This hints, but does not directly demonstrate, that the chemical may impact the Pathway. The indirect, statistical basis of this relationship is best represented using the 'associated_with' predicate in Biolink.
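
The predicate choices described in the UI explanations for chemical-disease and exposure-based edges could be sketched as a pair of lookup functions. This is an illustrative sketch under stated assumptions, not the ingest code: the function names are hypothetical, the accepted spellings of the DirectEvidence values ("T"/"therapeutic", "M"/"marker/mechanism") are assumptions, and case-insensitive matching is a convenience choice made here.

```python
from typing import Optional

TREATS = "biolink:treats_or_applied_or_studied_to_treat"
CORRELATED = "biolink:correlated_with"
POSITIVELY_CORRELATED = "biolink:positively_correlated_with"
NEGATIVELY_CORRELATED = "biolink:negatively_correlated_with"
ASSOCIATED = "biolink:associated_with"


def chemical_disease_predicate(direct_evidence: str) -> Optional[str]:
    """Map a DirectEvidence value to the predicate named in the edge type table."""
    code = direct_evidence.strip().lower()
    if code in {"t", "therapeutic"}:
        return TREATS
    if code in {"m", "marker/mechanism"}:
        return CORRELATED
    if code == "":
        # No direct evidence: the inferred edge type in the table uses
        # 'associated_with'. Note that the Filtered Content table says
        # records lacking DirectEvidence are dropped before this step.
        return ASSOCIATED
    return None  # unrecognized value: leave the decision to the caller


def exposure_predicate(outcome_relationship: str) -> str:
    """Pick a directional correlation predicate when a direction is given."""
    directional = {
        "positive correlation": POSITIVELY_CORRELATED,
        "negative correlation": NEGATIVELY_CORRELATED,
    }
    return directional.get(outcome_relationship.strip().lower(), CORRELATED)
```

Returning `None` for unrecognized values, rather than guessing a predicate, keeps unexpected source values visible to whatever code calls these functions.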

Node Types

Node Category Source Identifier Types Additional Notes
MeSH Majority are Biolink SmallMolecules
MeSH, OMIM
GO
NCBIGene
KEGG.PATHWAY, REACT

Future Modeling Considerations

edge_properties: Revisit use of 'has_confidence_score' edge property if/when we refactor this part of the Biolink Model.

predicates: Revisit 'correlated_with' and 'treats_or_applied_or_studied_to_treat' predicates if/when we refactor modeling or conventions here.

other: May want to improve ui_explanations for the chemical-gene interactions edge types, once we understand how the UI will stitch together / display predicates and qualifier values for these edges.

edges: CTD_pheno_term_ixns and CTD_chem_gene_ixns both have many interactions that involve three entities, like A causes B which results in B causing C. These are hard to model on a single edge, but we may be able to extract single edges from them.

edges: (related to the consideration above) The CTD_pheno_term_ixns.tsv file, which links chemicals to phenotypes (post-composed from GO term + direction), includes a column indicating the gene(s) mediating the phenotype. Consider adding this info (perhaps using a new statement-level qualifier such as 'gene context qualifier' or 'mediating gene qualifier').

Provenance Information

Contributors:
- Kevin Schaper: code author
- Evan Morris: code author
- Sierra Moxon: code support
- Vlado Dancik: code support, domain expertise
- Matthew Brush: data modeling, domain expertise

Artifacts:
- Ingest Survey: https://docs.google.com/spreadsheets/d/1R9z-vywupNrD_3ywuOt_sntcTrNlGmhiUWDXUdkPVpM/edit?gid=0#gid=0
- KGX Summary Report of phase 2 ingests: https://docs.google.com/spreadsheets/d/1IVpkL0tFyk6U7c3tKFKKOlOhtJdMyZlI-1jn69LT0Rs/edit?gid=663403777#gid=663403777