CTD Resource Ingest Guide

Source Information

InfoRes ID: infores:ctd

Description: CTD is a robust, publicly available database that aims to advance understanding of how environmental exposures affect human health. It provides knowledge, manually curated from the literature, about chemicals and their relationships to other biological entities: chemical to gene/protein interactions, plus chemical to disease and gene to disease relationships. These data are integrated with functional and pathway data to aid in developing hypotheses about the mechanisms underlying environmentally influenced diseases. CTD also generates novel inferences by further analyzing the knowledge it curates, based on statistically significant connections through intermediate concepts (e.g. Chemical X is associated with Disease Y based on shared associations with a common set of genes).

Citations:
- Davis AP, Wiegers TC, Johnson RJ, Sciaky D, Wiegers J, Mattingly CJ. Comparative Toxicogenomics Database (CTD): update 2023. Nucleic Acids Res. 2022 Sep 28.

Data Access Locations:
- CTD Bulk Downloads: http://ctdbase.org/downloads/ (this page includes file sizes and simple data dictionaries for each download)
- CTD Catalog: https://ctdbase.org/reports/ (a simple list of files that reports the number of rows in each file)

Data Provision Mechanisms: file_download

Data Formats: tsv, csv, obo

Data Versioning and Releases: There is no consistent release cadence, but on average there are 1-2 releases each month. Versioning is based on the month and year of the release.
- Releases page / change log: https://ctdbase.org/about/changes/
- Latest status page: https://ctdbase.org/about/dataStatus.go

Ingest Information

Ingest Categories: primary_knowledge_provider

Utility: CTD is a rich source of manually curated chemical associations to other biological entities which are an important type of edge for Translator query and reasoning use cases, including treatment predictions, chemical-gene regulation predictions, and pathfinder queries. It is one of the few sources that focus on non-drug chemicals, e.g. environmental stressors, and how these are related to diseases, biological processes, and genes.

Scope: Covers a diverse range of chemical association knowledge types, spread across several CTD files and produced by a variety of knowledge curation/creation approaches.

Relevant Files

| File Name | Location | Description |
| --- | --- | --- |
| CTD_chemicals_diseases.tsv.gz | http://ctdbase.org/downloads/ | Manually curated and computationally inferred associations between chemicals and diseases |
| CTD_exposure_events.tsv.gz | http://ctdbase.org/downloads/ | Descriptions of statistical studies of how exposure to chemicals affects a particular population, with some records providing outcomes |
| CTD_chem_gene_ixns.tsv.gz | http://ctdbase.org/downloads/ | Descriptions of curated, detailed causal associations describing how chemicals affect genes/gene products. |
| CTD_chem_go_enriched.tsv.gz | http://ctdbase.org/downloads/ | CTD calculates which GO terms are statistically enriched among the genes/proteins that interact with each chemical or its descendants. |
| CTD_chem_pathways_enriched.tsv.gz | http://ctdbase.org/downloads/ | Inferred associations between chemicals and biological pathways - computationally derived enrichments based on curated chemical–gene and gene–pathway data in CTD. |
| CTD_pheno_term_ixns.tsv.gz | http://ctdbase.org/downloads/ | Curated, literature-based associations between chemicals and phenotypes (often disease-like or physiological outcomes). |
| CTD_genes_diseases.tsv.gz and CTD_curated_genes_diseases.tsv.gz | http://ctdbase.org/downloads/ | Manually curated and computationally inferred associations between genes and diseases (to be ingested in future iteration) |

Included Content

| File Name | Included Records | Fields Used |
| --- | --- | --- |
| CTD_chemicals_diseases.tsv.gz | All records, including curated therapeutic and marker/mechanism associations, as well as inferred associations (records lacking a value in the DirectEvidence column). | ChemicalName, ChemicalID, CasRN, DiseaseName, DiseaseID, DirectEvidence, InferenceGeneSymbol, InferenceScore, OmimIDs, PubMedIDs |
| CTD_exposure_events.tsv.gz | All records (none filtered) | |
| CTD_chem_gene_ixns.tsv.gz | All records (none filtered) | |
| CTD_chem_go_enriched.tsv.gz | All records (none filtered) | |
| CTD_chem_pathways_enriched.tsv.gz | All records (none filtered) | |
| CTD_pheno_term_ixns.tsv.gz | All records (none filtered) | |
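
As a rough illustration of how the fields listed above might be read, the following Python sketch parses a gzipped, tab-separated stream into dictionaries. It is not the ingest code itself. It assumes that header and comment lines start with '#', that the "Fields Used" list matches the column order in the file, and that the demo rows (which are invented) resemble real records; check each of these against the data dictionary on the CTD downloads page.

```python
import csv
import gzip
import io

# Column names taken from the "Fields Used" cell for CTD_chemicals_diseases.tsv.gz.
# Assumption: this is also the column order in the file.
CHEM_DISEASE_FIELDS = [
    "ChemicalName", "ChemicalID", "CasRN", "DiseaseName", "DiseaseID",
    "DirectEvidence", "InferenceGeneSymbol", "InferenceScore",
    "OmimIDs", "PubMedIDs",
]


def read_ctd_rows(source, fieldnames):
    """Yield one dict per data row from a gzipped TSV (path or binary file object).

    Assumption: every header/comment line starts with '#'.
    """
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as text:
        data_lines = (line for line in text if not line.startswith("#"))
        reader = csv.DictReader(
            data_lines,
            fieldnames=fieldnames,
            delimiter="\t",
            quoting=csv.QUOTE_NONE,  # treat quote characters as plain text
        )
        yield from reader


def _demo_bytes():
    """Build a tiny gzipped file in memory; all values are invented."""
    lines = [
        ["# Fields:"],
        ["# " + "\t".join(CHEM_DISEASE_FIELDS)],
        ["chemA", "C1", "", "diseaseX", "MESH:X1", "T", "", "", "", "111|222"],
        ["chemA", "C1", "", "diseaseY", "MESH:Y1", "", "GENE1", "4.5", "", "333"],
    ]
    text = "".join("\t".join(cells) + "\n" for cells in lines)
    return gzip.compress(text.encode("utf-8"))


demo_rows = list(read_ctd_rows(io.BytesIO(_demo_bytes()), CHEM_DISEASE_FIELDS))
```

Passing a file path instead of the in-memory buffer works the same way, since `gzip.open` accepts either. All values come back as strings, so numeric fields such as InferenceScore need explicit conversion.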

Future Content Considerations

edge_content: We should ingest gene-disease association edges from the CTD_genes_diseases.tsv.gz file and/or the curated CTD_curated_genes_diseases.tsv.gz file. These files hold a large amount of content. Assess quality and compare to other gene-disease association sources when we do a comprehensive review of content/modeling in this area.

edge_content: For inferred Chemical-Disease associations, consider adding a cutoff to remove lower quality/confidence inferences (e.g. based on shared gene count, publication count, or inference score). At present we include even inferences based on a single shared gene/pub - which is not really meaningful. - Relevant files: CTD_chemicals_diseases.tsv.gz
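
One possible shape for such a cutoff is sketched below in Python. The class name, function name, and default thresholds are all hypothetical; no specific values have been chosen for this ingest. The function expects values that have already been aggregated for one chemical-disease pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceCutoff:
    """Hypothetical thresholds; the defaults only exclude single gene/pub inferences."""
    min_shared_genes: int = 2
    min_publications: int = 2
    min_inference_score: float = 0.0


def passes_cutoff(shared_genes, pubmed_ids, inference_score, cutoff=InferenceCutoff()):
    """Return True if an inferred chemical-disease association meets every threshold.

    shared_genes and pubmed_ids are iterables of strings; duplicates are ignored.
    """
    return (
        len(set(shared_genes)) >= cutoff.min_shared_genes
        and len(set(pubmed_ids)) >= cutoff.min_publications
        and inference_score >= cutoff.min_inference_score
    )


# An inference resting on one gene and one publication is rejected by the defaults.
single_support = passes_cutoff(["GENE1"], ["111"], 50.0)
# Two genes and two publications pass when no minimum score is required.
double_support = passes_cutoff(["GENE1", "GENE2"], ["111", "222"], 3.0)
# The same association fails once a stricter (made-up) score threshold is applied.
strict = passes_cutoff(
    ["GENE1", "GENE2"], ["111", "222"], 3.0,
    cutoff=InferenceCutoff(min_inference_score=5.0),
)
```

Keeping the thresholds in one small object makes it easy to compare how many edges survive under different settings before committing to a value.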

edge_content: For enrichment-based Chemical-GO term associations, consider adding a cutoff to remove lower quality/confidence associations based on p-values. - Relevant files: CTD_chem_go_enriched.tsv

edge_content: For enrichment-based Chemical-Pathway associations, consider adding a cutoff to remove lower quality/confidence associations based on p-values. - Relevant files: CTD_chem_pathways_enriched.tsv
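
A p-value cutoff of this kind could be applied to either enrichment file with a small generator like the Python sketch below. The column name "CorrectedPValue" and the threshold of 0.01 are assumptions made for illustration; confirm the actual column names in the CTD data dictionary and choose the threshold after reviewing the data.

```python
def filter_enrichments(rows, max_p_value=0.01, p_value_field="CorrectedPValue"):
    """Yield only the rows whose p-value is at or below max_p_value.

    rows is an iterable of dicts with string values (e.g. from csv.DictReader).
    Rows with a missing or unparseable p-value are skipped rather than kept,
    which is a design choice of this sketch.
    """
    for row in rows:
        try:
            p_value = float(row[p_value_field])
        except (KeyError, TypeError, ValueError):
            continue
        if p_value <= max_p_value:
            yield row


# Invented demo rows; "term" is a placeholder key, not a CTD column name.
demo_enrichments = [
    {"term": "a", "CorrectedPValue": "1.5E-8"},
    {"term": "b", "CorrectedPValue": "0.03"},
    {"term": "c", "CorrectedPValue": ""},
]
kept_terms = [row["term"] for row in filter_enrichments(demo_enrichments)]
```

Python's `float` accepts scientific notation such as "1.5E-8", so values can be compared without extra parsing.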

edge_property_content: Consider an edge property that reports the list of shared genes supporting C-D inferred associations. - Relevant files: CTD_chemicals_diseases.tsv.gz

node_property_content: MolePro ingested chemical properties in its previous ingests, and we will likely bring these in at some point, in particular data from the CTD_chemicals.tsv file. - Relevant files: CTD_chemicals.tsv

edge_property_content: The exposure events file provides a lot of provenance/evidence metadata about supporting studies, including characteristics of subject populations, location of the study, and experimental/statistical data values. Consider for inclusion using a StudyResult object. - Relevant files: CTD_exposure_events.tsv.gz

Target Information

Target InfoRes ID: infores:translator-ctd

Edge Types

| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
| --- | --- | --- | --- | --- | --- |
| biolink:ChemicalEntity | | biolink:DiseaseOrPhenotypicFeature | knowledge_assertion | manual_agent | CTD Chemical-Disease records with a "T" (therapeutic) DirectEvidence code indicate the chemical to be a "potential" treatment by virtue of its clinical use or study, which maps best to the Biolink predicate 'treats_or_applied_or_studied_to_treat'. |
| biolink:ChemicalEntity | | biolink:DiseaseOrPhenotypicFeature | knowledge_assertion | manual_agent | CTD Chemical-Disease records with a DirectEvidence code of "M" (marker/mechanism) indicate that the chemical is manually flagged as a marker or contributing factor for a condition. This implies that at minimum there is correlation between the presence of the chemical and condition, for which we use the Biolink 'correlated_with' predicate. |
| biolink:ChemicalEntity | | biolink:DiseaseOrPhenotypicFeature | statistical_association | data_analysis_pipeline | CTD Chemical-Disease records with an inference score have a statistically significant number of shared gene associations that suggest a biological relationship may exist. The statistical basis of this general inferred relationship is best reported using the Biolink 'associated_with' predicate. |
| biolink:ChemicalEntity, biolink:ComplexMolecularMixture, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:PhenotypicFeature, biolink:Disease | statistical_association | manual_agent | A positive or negative correlation edge is created when the results of an environmental exposure study curated by CTD report a statistically significant association between the exposure (usually a chemical) and outcome measure (usually a disease or phenotype). Note that this edge is based on results from a single study. |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:Gene | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications about how chemical entities affect genes/gene products in the body, often reporting a particular aspect or form of the gene that is affected, and the direction of the effect. Translator represents details of these causal associations using specific combinations of predicates (biolink:affects / biolink:causes) and qualifier properties (e.g. object_aspect, object_direction). |
| biolink:Gene | | biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications about how genes affect chemical entities in the body, often reporting a particular form or aspect of the chemical that is affected, and the direction of the effect. Translator represents details of these causal associations using specific combinations of predicates (biolink:affects / biolink:causes) and qualifier properties (e.g. object_aspect, object_direction). |
| biolink:Gene | | biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications reporting that a gene or gene product (usually a mutant form) may affect susceptibility/sensitivity to the effects of a particular chemical exposure. Translator represents this relationship using the biolink:affects_sensitivity_to predicate, or biolink:increases/decreases_sensitivity_to predicates if a direction is provided. |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:Gene | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications reporting that exposure to a chemical entity may affect susceptibility/sensitivity to the effects of a particular gene or gene product (e.g. a signaling molecule like TNF). Translator represents this relationship using the biolink:affects_sensitivity_to predicate, or biolink:increases/decreases_sensitivity_to predicates if a direction is provided. |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:BiologicalProcess, biolink:BiologicalProcessOrActivity, biolink:MolecularActivity, biolink:Pathway | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications about how chemicals affect processes, activities, or pathways in the body, often reporting the direction of the effect. Translator represents details of these causal associations using specific combinations of predicates (biolink:affects, biolink:causes) and qualifier properties (e.g. object_direction). |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:Gene | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications about chemicals 'binding' to gene products. Translator represents this using the biolink:directly_physically_interacts_with predicate, and qualifiers to indicate when a certain form, part, or derivative of the chemical or gene product is relevant to this interaction. |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:BiologicalProcessOrActivity, biolink:MolecularActivity, biolink:CellularComponent | statistical_association | data_analysis_pipeline | A CTD analysis defines associations between a chemical entity and a GO term based on statistically significant enrichment (p < 0.05) of the GO term among the genes that are reported to be affected by the chemical in CTD's curated data. Translator reports this using the biolink:associated_with predicate, and provides the p-value as supporting evidence. |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:Pathway | statistical_association | data_analysis_pipeline | CTD defines associations between a chemical entity and a pathway based on statistically significant overrepresentation/enrichment (p < 0.05) in the Pathway of the genes reported to be affected by the chemical in CTD's curated data. Translator reports this using the biolink:associated_with predicate, and provides the p-value as supporting evidence. |
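
For the three chemical-disease edge types at the top of the table above, the choice of predicate, knowledge level, and agent type can be sketched as a small lookup in Python. The predicates and provenance values come from the table; the function and class names are hypothetical, and the literal DirectEvidence strings are an assumption (the guide refers to the codes "T" and "M", while the spelled-out forms "therapeutic" and "marker/mechanism" are accepted here in case the file uses them).

```python
from typing import NamedTuple, Optional


class EdgeTemplate(NamedTuple):
    predicate: str
    knowledge_level: str
    agent_type: str


THERAPEUTIC = EdgeTemplate(
    "biolink:treats_or_applied_or_studied_to_treat", "knowledge_assertion", "manual_agent"
)
MARKER_MECHANISM = EdgeTemplate(
    "biolink:correlated_with", "knowledge_assertion", "manual_agent"
)
INFERRED = EdgeTemplate(
    "biolink:associated_with", "statistical_association", "data_analysis_pipeline"
)

# Assumption: the DirectEvidence column holds one of these strings.
_TEMPLATE_BY_EVIDENCE = {
    "T": THERAPEUTIC,
    "therapeutic": THERAPEUTIC,
    "M": MARKER_MECHANISM,
    "marker/mechanism": MARKER_MECHANISM,
}


def chem_disease_template(direct_evidence, inference_score) -> Optional[EdgeTemplate]:
    """Pick the edge template for one CTD_chemicals_diseases row.

    Curated rows are identified by a DirectEvidence value; rows without one
    but with an inference score are treated as inferred. Returns None when
    the row fits neither case or carries an unrecognized evidence value.
    """
    evidence = (direct_evidence or "").strip()
    if evidence:
        return _TEMPLATE_BY_EVIDENCE.get(evidence)
    if (inference_score or "").strip():
        return INFERRED
    return None


curated_example = chem_disease_template("T", "")
inferred_example = chem_disease_template("", "4.5")
```

Returning None for unrecognized values, instead of raising, leaves it to the caller to decide whether such rows should be logged or dropped.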

Node Types

| Node Category | Source Identifier Types | Additional Notes |
| --- | --- | --- |
| | MeSH, CAS RN | Majority are Biolink SmallMolecules |
| | MeSH, OMIM, DOID | |
| | GO | |
| | NCBIGene | |
| | KEGG, Reactome, WikiPathways ID | |
| | MeSH | |

Future Modeling Considerations

edge_properties: Revisit use of 'has_confidence_score' edge property if/when we refactor this part of the Biolink Model.

predicates: Revisit 'correlated_with' and 'treats_or_applied_or_studied_to_treat' predicates if/when we refactor modeling or conventions here.

other: May want to improve ui_explanations for the chemical-gene interactions edge types, once we understand how the UI will stitch together / display predicates and qualifier values for these edges.

edges: CTD_pheno_term_ixns and CTD_chem_gene_ixns both have many interactions that involve three entities, such as A causes B, which results in B causing C. These are hard to model on a single edge, but we may be able to extract single edges from them.

Additional Notes: The CTD_chemicals_diseases.tsv data includes one row per curated 'T' or 'M' association with publication reference(s), plus one row per shared-gene association with publication reference(s) and an inference score. Separate edges are created for each type of association reported between a chemical and a given disease, according to the mappings described above. All "shared gene" rows in the source data file for a given C-D pair are aggregated into a single 'associated_with' edge that carries the inference score as an edge property (and possibly the list of shared genes). As a result, a given C-D pair in the CTD file may yield 1, 2, or 3 separate edges in the Translator graph.
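
The aggregation described in this note might look roughly like the Python sketch below. It is an illustration under several assumptions: rows are dicts keyed by the field names listed under Included Content, multi-valued PubMedIDs are pipe-delimited, and when a pair's rows carry different inference scores the maximum is kept. The function name and output keys are hypothetical, and the demo rows are invented.

```python
from collections import defaultdict


def build_chem_disease_edges(rows):
    """Collapse CTD chemical-disease rows into one edge per association type per pair."""
    curated = defaultdict(set)  # (chemical, disease, evidence code) -> publications
    inferred = defaultdict(lambda: {"genes": set(), "pubs": set(), "score": None})

    for row in rows:
        pair = (row["ChemicalID"], row["DiseaseID"])
        # Assumption: multiple PubMed IDs are separated by '|'.
        pubs = {p for p in (row.get("PubMedIDs") or "").split("|") if p}
        evidence = (row.get("DirectEvidence") or "").strip()
        if evidence:
            curated[pair + (evidence,)].update(pubs)
            continue
        entry = inferred[pair]
        gene = (row.get("InferenceGeneSymbol") or "").strip()
        if gene:
            entry["genes"].add(gene)
        entry["pubs"].update(pubs)
        score_text = (row.get("InferenceScore") or "").strip()
        if score_text:
            score = float(score_text)
            # Assumption: keep the highest score seen for the pair.
            entry["score"] = score if entry["score"] is None else max(entry["score"], score)

    edges = []
    for (chemical, disease, evidence), pubs in curated.items():
        edges.append({"subject": chemical, "object": disease,
                      "evidence_code": evidence, "publications": sorted(pubs)})
    for (chemical, disease), entry in inferred.items():
        edges.append({"subject": chemical, "object": disease,
                      "predicate": "biolink:associated_with",
                      "inference_score": entry["score"],
                      "shared_genes": sorted(entry["genes"]),
                      "publications": sorted(entry["pubs"])})
    return edges


def _row(evidence, gene="", score="", pubs=""):
    return {"ChemicalID": "C1", "DiseaseID": "D1", "DirectEvidence": evidence,
            "InferenceGeneSymbol": gene, "InferenceScore": score, "PubMedIDs": pubs}


demo_edges = build_chem_disease_edges([
    _row("T", pubs="1|2"),
    _row("M", pubs="3"),
    _row("", gene="G1", score="4.5", pubs="4"),
    _row("", gene="G2", score="4.5", pubs="4|5"),
])
```

In this demo the single C-D pair yields three edges: one per curated evidence code and one aggregated inferred edge listing both shared genes. The curated edges keep the evidence code so that the predicate can be assigned separately according to the Edge Types mappings.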

Provenance Information

Contributors:
- Kevin Schaper: code author
- Evan Morris: code author
- Sierra Moxon: code support
- Vlado Dancik: code support, domain expertise
- Matthew Brush: data modeling, domain expertise

Artifacts:
- Ingest Survey: https://docs.google.com/spreadsheets/d/1R9z-vywupNrD_3ywuOt_sntcTrNlGmhiUWDXUdkPVpM/edit?gid=0#gid=0
- KGX Summary Report of phase 2 ingests: https://docs.google.com/spreadsheets/d/1IVpkL0tFyk6U7c3tKFKKOlOhtJdMyZlI-1jn69LT0Rs/edit?gid=663403777#gid=663403777