CTD Resource Ingest Guide
Source Information
InfoRes ID: infores:ctd
Description: CTD is a robust, publicly available database that aims to advance understanding of how environmental exposures affect human health. It provides knowledge, manually curated from the literature, about chemicals and their relationships to other biological entities: chemical to gene/protein interactions, plus chemical to disease and gene to disease relationships. These data are integrated with functional and pathway data to aid in the development of hypotheses about the mechanisms underlying environmentally influenced diseases. CTD also generates novel inferences by further analyzing the knowledge it curates, based on statistically significant connections through intermediate concepts (e.g. Chemical X is associated with Disease Y based on shared associations with a common set of genes).
Citations:
- Davis AP, Wiegers TC, Johnson RJ, Sciaky D, Wiegers J, Mattingly CJ. Comparative Toxicogenomics Database (CTD): update 2023. Nucleic Acids Res. 2022 Sep 28.
Data Access Locations:
- CTD Bulk Downloads - http://ctdbase.org/downloads/ (this page includes file sizes and simple data dictionaries for each download)
- CTD Catalog - https://ctdbase.org/reports/ (a simple list of files that reports the number of rows in each file)
Data Provision Mechanisms: file_download
Data Formats: tsv, csv, obo
Data Versioning and Releases: There is no fixed release cadence, but on average there are 1-2 releases each month. Versioning is based on the month and year of the release.
- Releases page / change log: https://ctdbase.org/about/changes/
- Latest status page: https://ctdbase.org/about/dataStatus.go
Ingest Information
Ingest Categories: primary_knowledge_provider
Utility: CTD is a rich source of manually curated associations between chemicals and other biological entities. These are an important type of edge for Translator query and reasoning use cases, including treatment predictions, chemical-gene regulation predictions, and pathfinder queries. It is one of the few sources that focus on non-drug chemicals, e.g. environmental stressors, and how these are related to diseases, biological processes, and genes.
Scope: Covers a diverse range of chemical association knowledge types, distributed across several CTD files and based on multiple knowledge curation/creation approaches.
Relevant Files
| File Name | Location | Description |
|---|---|---|
| CTD_chemicals_diseases.tsv.gz | http://ctdbase.org/downloads/ | Manually curated and computationally inferred associations between chemicals and diseases |
| CTD_exposure_events.tsv.gz | http://ctdbase.org/downloads/ | Descriptions of statistical studies of how exposure to chemicals affects a particular population, with some records providing outcomes |
| CTD_chem_gene_ixns.tsv.gz | http://ctdbase.org/downloads/ | Descriptions of curated, detailed causal associations describing how chemicals affect genes/gene products. |
| CTD_chem_go_enriched.tsv.gz | http://ctdbase.org/downloads/ | CTD calculates which GO terms are statistically enriched among the genes/proteins that interact with each chemical or its descendants. |
| CTD_chem_pathways_enriched.tsv.gz | http://ctdbase.org/downloads/ | Inferred associations between chemicals and biological pathways - computationally derived enrichments based on curated chemical–gene and gene–pathway data in CTD. |
| CTD_pheno_term_ixns.tsv.gz | http://ctdbase.org/downloads/ | Curated, literature-based associations between chemicals and phenotypes (often disease-like or physiological outcomes). |
| CTD_genes_diseases.tsv.gz and CTD_curated_genes_diseases.tsv.gz | http://ctdbase.org/downloads/ | Manually curated and computationally inferred associations between genes and diseases (to be ingested in future iteration) |
Included Content
| File Name | Included Records | Fields Used |
|---|---|---|
| CTD_chemicals_diseases.tsv.gz | All records, including curated therapeutic and marker/mechanism associations, as well as inferred associations (records lacking a value in the DirectEvidence column). | ChemicalName, ChemicalID, CasRN, DiseaseName, DiseaseID, DirectEvidence, InferenceGeneSymbol, InferenceScore, OmimIDs, PubMedIDs |
| CTD_exposure_events.tsv.gz | All records (none filtered) | |
| CTD_chem_gene_ixns.tsv.gz | All records (none filtered) | |
| CTD_chem_go_enriched.tsv.gz | All records (none filtered) | |
| CTD_chem_pathways_enriched.tsv.gz | All records (none filtered) | |
| CTD_pheno_term_ixns.tsv.gz | All records (none filtered) | |
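
As a rough illustration of how the fields listed for CTD_chemicals_diseases.tsv.gz might be read, the sketch below streams records from a gzipped TSV file. It assumes that header and comment lines in the download start with `#` and that the columns appear in the order given in the table above; both should be checked against the data dictionary on the CTD downloads page. The function name `iter_ctd_records` is hypothetical and is not part of the actual ingest code.

```python
import csv
import gzip
from typing import Dict, Iterator, Sequence

# Field names as listed in the Included Content table (column order is an assumption).
CHEMICAL_DISEASE_FIELDS = (
    "ChemicalName", "ChemicalID", "CasRN", "DiseaseName", "DiseaseID",
    "DirectEvidence", "InferenceGeneSymbol", "InferenceScore", "OmimIDs", "PubMedIDs",
)


def iter_ctd_records(path: str, fieldnames: Sequence[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row of a gzipped, tab-separated CTD file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # Assumption: lines starting with '#' are comments or header lines.
        data_lines = (line for line in handle if not line.startswith("#"))
        reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
        for values in reader:
            if not values:
                continue  # skip blank lines
            yield dict(zip(fieldnames, values))
```

Streaming row by row avoids loading a large download into memory at once.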
Future Content Considerations
edge_content: We should ingest gene-disease association edges from the CTD_genes_diseases.tsv.gz file and/or the curated CTD_curated_genes_diseases.tsv.gz file. There is a lot of content here. Assess its quality and compare it to other gene-disease association sources when we do a comprehensive review of content/modeling in this area.
edge_content: For inferred Chemical-Disease associations, consider adding a cutoff to remove lower quality/confidence inferences (e.g. based on shared gene count, publication count, or inference score). At present we include even inferences based on a single shared gene/pub - which is not really meaningful. - Relevant files: CTD_chemicals_diseases.tsv.gz
edge_content: For enrichment-based Chemical-GO term associations, consider adding a cutoff to remove lower quality/confidence associations based on p-values. - Relevant files: CTD_chem_go_enriched.tsv.gz
edge_content: For enrichment-based Chemical-Pathway associations, consider adding a cutoff to remove lower quality/confidence associations based on p-values. - Relevant files: CTD_chem_pathways_enriched.tsv.gz
edge_property_content: Consider an edge property that reports the list of shared genes supporting C-D inferred associations. - Relevant files: CTD_chemicals_diseases.tsv.gz
node_property_content: Molepro ingested chemical properties in its previous ingests, which we will likely bring in at some point, in particular data from the CTD_chemicals.tsv file. - Relevant files: CTD_chemicals.tsv
edge_property_content: The exposure events file provides a lot of provenance/evidence metadata about supporting studies, including characteristics of subject populations, location of the study, and experimental/statistical data values. Consider including these using a StudyResult object. - Relevant files: CTD_exposure_events.tsv.gz
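
The cutoffs suggested above for inferred and enrichment-based associations could be sketched as simple predicate functions. This is only an illustration: the function names and default thresholds are hypothetical placeholders, not settings used or agreed on for the ingest. The default of two shared genes reflects the observation above that inferences resting on a single shared gene are not very meaningful.

```python
from typing import Optional, Sequence


def passes_inference_cutoff(
    shared_genes: Sequence[str],
    inference_score: Optional[float],
    min_shared_genes: int = 2,          # hypothetical default
    min_score: Optional[float] = None,  # hypothetical; None disables the score check
) -> bool:
    """Return True if an inferred chemical-disease association should be kept."""
    if len(set(shared_genes)) < min_shared_genes:
        return False
    if min_score is not None:
        if inference_score is None or inference_score < min_score:
            return False
    return True


def passes_enrichment_cutoff(p_value: float, max_p_value: float = 0.05) -> bool:
    """Return True if an enrichment-based association is significant enough to keep."""
    return p_value < max_p_value
```

Keeping the thresholds as parameters would make it easy to compare edge counts under different cutoffs before settling on values.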
Target Information
Target InfoRes ID: infores:translator-ctd
Edge Types
| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
|---|---|---|---|---|---|
| biolink:ChemicalEntity | biolink:treats_or_applied_or_studied_to_treat | biolink:DiseaseOrPhenotypicFeature | knowledge_assertion | manual_agent | CTD Chemical-Disease records with a "T" (therapeutic) DirectEvidence code indicate that the chemical is a "potential" treatment by virtue of its clinical use or study, which maps best to the Biolink predicate 'treats_or_applied_or_studied_to_treat'. |
| biolink:ChemicalEntity | biolink:correlated_with | biolink:DiseaseOrPhenotypicFeature | knowledge_assertion | manual_agent | CTD Chemical-Disease records with a DirectEvidence code of "M" (marker/mechanism) indicate that the chemical is manually flagged as a marker or contributing factor for a condition. This implies that at minimum there is a correlation between the presence of the chemical and the condition, for which we use the Biolink 'correlated_with' predicate. |
| biolink:ChemicalEntity | biolink:associated_with | biolink:DiseaseOrPhenotypicFeature | statistical_association | data_analysis_pipeline | CTD Chemical-Disease records with an inference score have a statistically significant number of shared gene associations, which suggests that a biological relationship may exist. The statistical basis of this general inferred relationship is best reported using the Biolink 'associated_with' predicate. |
| biolink:ChemicalEntity, biolink:ComplexMolecularMixture, biolink:MolecularMixture, biolink:SmallMolecule | | biolink:PhenotypicFeature, biolink:Disease | statistical_association | manual_agent | A positive or negative correlation edge is created when the results of an environmental exposure study curated by CTD report a statistically significant association between the exposure (usually a chemical) and the outcome measure (usually a disease or phenotype). Note that this edge is based on results from a single study. |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | biolink:affects, biolink:causes | biolink:Gene | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications about how chemical entities affect genes/gene products in the body, often reporting a particular aspect or form of the gene that is affected, and the direction of the effect. Translator represents details of these causal associations using specific combinations of predicates (biolink:affects / biolink:causes) and qualifier properties (e.g. object_aspect, object_direction). |
| biolink:Gene | biolink:affects, biolink:causes | biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications about how genes affect chemical entities in the body, often reporting a particular form or aspect of the chemical that is affected, and the direction of the effect. Translator represents details of these causal associations using specific combinations of predicates (biolink:affects / biolink:causes) and qualifier properties (e.g. object_aspect, object_direction). |
| biolink:Gene | biolink:affects_sensitivity_to, biolink:increases_sensitivity_to, biolink:decreases_sensitivity_to | biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications reporting that a gene or gene product (usually a mutant form) may affect susceptibility/sensitivity to the effects of a particular chemical exposure. Translator represents this relationship using the biolink:affects_sensitivity_to predicate, or the biolink:increases_sensitivity_to / biolink:decreases_sensitivity_to predicates if a direction is provided. |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | biolink:affects_sensitivity_to, biolink:increases_sensitivity_to, biolink:decreases_sensitivity_to | biolink:Gene | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications reporting that exposure to a chemical entity may affect susceptibility/sensitivity to the effects of a particular gene or gene product (e.g. a signaling molecule like TNF). Translator represents this relationship using the biolink:affects_sensitivity_to predicate, or the biolink:increases_sensitivity_to / biolink:decreases_sensitivity_to predicates if a direction is provided. |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | biolink:affects, biolink:causes | biolink:BiologicalProcess, biolink:BiologicalProcessOrActivity, biolink:MolecularActivity, biolink:Pathway | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications about how chemicals affect processes, activities, or pathways in the body, often reporting the direction of the effect. Translator represents details of these causal associations using specific combinations of predicates (biolink:affects, biolink:causes) and qualifier properties (e.g. object_direction). |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | biolink:directly_physically_interacts_with | biolink:Gene | knowledge_assertion | manual_agent | CTD curators manually extract statements from publications about chemicals 'binding' to gene products. Translator represents this using the biolink:directly_physically_interacts_with predicate, with qualifiers to indicate when a certain form, part, or derivative of the chemical or gene product is relevant to this interaction. |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | biolink:associated_with | biolink:BiologicalProcessOrActivity, biolink:MolecularActivity, biolink:CellularComponent | statistical_association | data_analysis_pipeline | A CTD analysis defines associations between a chemical entity and a GO term based on statistically significant enrichment (p < 0.05) of the GO term among the genes that are reported to be affected by the chemical in CTD's curated data. Translator reports this using the biolink:associated_with predicate, and provides the p-value as supporting evidence. |
| biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule | biolink:associated_with | biolink:Pathway | statistical_association | data_analysis_pipeline | CTD defines associations between a chemical entity and a pathway based on statistically significant overrepresentation/enrichment (p < 0.05) in the pathway of the genes reported to be affected by the chemical in CTD's curated data. Translator reports this using the biolink:associated_with predicate, and provides the p-value as supporting evidence. |
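
For the three Chemical-Disease edge types described in the table, the choice of predicate, knowledge level, and agent type depends only on the DirectEvidence value. A minimal sketch of that mapping follows. The function and class names are hypothetical, and the sketch accepts both the single-letter codes used in this guide ("T", "M") and spelled-out values ("therapeutic", "marker/mechanism") on the assumption that the file may use either form.

```python
from typing import NamedTuple, Optional


class EdgeType(NamedTuple):
    predicate: str
    knowledge_level: str
    agent_type: str


# Curated edge types, keyed by the DirectEvidence code used in this guide.
CURATED_EDGE_TYPES = {
    "T": EdgeType(
        "biolink:treats_or_applied_or_studied_to_treat",
        "knowledge_assertion",
        "manual_agent",
    ),
    "M": EdgeType("biolink:correlated_with", "knowledge_assertion", "manual_agent"),
}

# Records with no DirectEvidence value are inferred associations.
INFERRED_EDGE_TYPE = EdgeType(
    "biolink:associated_with", "statistical_association", "data_analysis_pipeline"
)

# Assumption: the file may spell the codes out in full.
_CODE_ALIASES = {"t": "T", "therapeutic": "T", "m": "M", "marker/mechanism": "M"}


def classify_chemical_disease_row(direct_evidence: Optional[str]) -> EdgeType:
    """Pick the edge type for one CTD_chemicals_diseases record."""
    value = (direct_evidence or "").strip().lower()
    if not value:
        return INFERRED_EDGE_TYPE
    code = _CODE_ALIASES.get(value)
    if code is None:
        raise ValueError(f"Unrecognized DirectEvidence value: {direct_evidence!r}")
    return CURATED_EDGE_TYPES[code]
```

Raising on unrecognized values, rather than silently defaulting, makes it easier to notice if CTD introduces a new evidence code.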
Node Types
| Node Category | Source Identifier Types | Additional Notes |
|---|---|---|
| | MeSH, CAS RN | Majority are Biolink SmallMolecules |
| | MeSH, OMIM, DOID | |
| | GO | |
| | NCBIGene | |
| | KEGG, Reactome, WikiPathways ID | |
| | MeSH | |
Future Modeling Considerations
edge_properties: Revisit use of 'has_confidence_score' edge property if/when we refactor this part of the Biolink Model.
predicates: Revisit 'correlated_with' and 'treats_or_applied_or_studied_to_treat' predicates if/when we refactor modeling or conventions here.
other: May want to improve ui_explanations for the chemical-gene interactions edge types, once we understand how the UI will stitch together / display predicates and qualifier values for these edges.
edges: CTD_pheno_term_ixns and CTD_chem_gene_ixns both contain many interactions that involve three entities, like A causes B, which results in B causing C. These are hard to model on a single edge, but we may be able to extract single edges from them.
Additional Notes: The CTD_chemicals_diseases.tsv data includes one row per curated 'T' or 'M' association with pub reference(s), plus one row per shared gene association with pub reference(s) and inference scores. Separate edges will be created for each type of association reported between a chemical and a given disease, according to the mappings described above. All "shared gene" rows in the source data file for a given C-D pair will be aggregated into a single 'associated_with' edge that reports the inference score as an edge property (and possibly the list of shared genes). This means that for a given C-D pair in the CTD file, there may be 1, 2, or 3 separate edges created in the Translator graph.
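
The aggregation of "shared gene" rows described in these notes could look roughly like the sketch below. It is not the actual ingest code. It assumes that rows are dicts keyed by the CTD column names, that PubMedIDs are pipe-delimited, and that taking the maximum inference score across a pair's rows is acceptable. The output keys (`inference_score`, `shared_genes`, `publications`) are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def aggregate_inferred_rows(rows: Iterable[Dict[str, str]]) -> List[dict]:
    """Collapse all inferred rows for a chemical-disease pair into one edge."""
    grouped = defaultdict(lambda: {"genes": set(), "pubs": set(), "score": None})
    for row in rows:
        if row.get("DirectEvidence"):
            continue  # curated 'T'/'M' rows become their own edges elsewhere
        entry = grouped[(row["ChemicalID"], row["DiseaseID"])]
        if row.get("InferenceGeneSymbol"):
            entry["genes"].add(row["InferenceGeneSymbol"])
        # Assumption: multiple PubMed IDs are separated by '|'.
        entry["pubs"].update(p for p in row.get("PubMedIDs", "").split("|") if p)
        if row.get("InferenceScore"):
            score = float(row["InferenceScore"])
            # Assumption: keep the highest score seen for the pair.
            entry["score"] = score if entry["score"] is None else max(entry["score"], score)
    return [
        {
            "subject": chemical_id,
            "predicate": "biolink:associated_with",
            "object": disease_id,
            "inference_score": entry["score"],
            "shared_genes": sorted(entry["genes"]),
            "publications": sorted(entry["pubs"]),
        }
        for (chemical_id, disease_id), entry in grouped.items()
    ]
```

Sorting the gene and publication lists keeps the output stable across runs, which helps when comparing successive builds.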
Provenance Information
Contributors:
- Kevin Schaper: code author
- Evan Morris: code author
- Sierra Moxon: code support
- Vlado Dancik: code support, domain expertise
- Matthew Brush: data modeling, domain expertise
Artifacts:
- Ingest Survey: https://docs.google.com/spreadsheets/d/1R9z-vywupNrD_3ywuOt_sntcTrNlGmhiUWDXUdkPVpM/edit?gid=0#gid=0
- KGX Summary Report of phase 2 ingests: https://docs.google.com/spreadsheets/d/1IVpkL0tFyk6U7c3tKFKKOlOhtJdMyZlI-1jn69LT0Rs/edit?gid=663403777#gid=663403777