CTD Resource Ingest Guide

Source Information

InfoRes ID: infores:ctd

Description: CTD is a robust, publicly available database that aims to advance understanding of how environmental exposures affect human health. It provides knowledge, manually curated from the literature, about chemicals and their relationships to other biological entities: chemical to gene/protein interactions, plus chemical to disease and gene to disease relationships. These data are integrated with functional and pathway data to aid in the development of hypotheses about the mechanisms underlying environmentally influenced diseases. CTD also generates novel inferences by further analyzing the knowledge it curates, based on statistically significant connections through intermediate concepts (e.g. Chemical X is associated with Disease Y based on shared associations with a common set of genes).

Citations:
- Davis AP, Wiegers TC, Johnson RJ, Sciaky D, Wiegers J, Mattingly CJ. Comparative Toxicogenomics Database (CTD): update 2023. Nucleic Acids Res. 2022 Sep 28.

Data Access Locations:
- CTD Bulk Downloads - http://ctdbase.org/downloads/ (this page includes file sizes and simple data dictionaries for each download)
- CTD Catalog - https://ctdbase.org/reports/ (a simple list of files that reports the number of rows in each file)

Data Provision Mechanisms: file_download

Data Formats: tsv, csv, obo

Data Versioning and Releases: There is no consistent release cadence, but on average there are 1-2 releases each month. Versioning is based on the month and year of the release.
- Releases page / change log: https://ctdbase.org/about/changes/
- Latest status page: https://ctdbase.org/about/dataStatus.go

Ingest Information

Ingest Categories: primary_knowledge_provider

Utility: CTD is a rich source of manually curated associations between chemicals and other biological entities. These are an important type of edge for Translator query and reasoning use cases, including treatment predictions, chemical-gene regulation predictions, and pathfinder queries. It is one of the few sources that focus on non-drug chemicals, e.g. environmental stressors, and how these are related to diseases, biological processes, and genes.

Scope: Covers a diverse range of chemical association knowledge types, spread across several CTD files and produced by a variety of knowledge curation/creation approaches.

Relevant Files

File Name	Location	Description
CTD_chemicals_diseases.tsv.gz	http://ctdbase.org/downloads/	Manually curated and computationally inferred associations between chemicals and diseases
CTD_exposure_events.tsv.gz	http://ctdbase.org/downloads/	Descriptions of statistical studies of how exposure to chemicals affects a particular population, with some records providing outcomes
CTD_chem_gene_ixns.tsv.gz	http://ctdbase.org/downloads/	Descriptions of curated, detailed causal associations describing how chemicals affect genes/gene products.
CTD_chem_go_enriched.tsv.gz	http://ctdbase.org/downloads/	CTD calculates which GO terms are statistically enriched among the genes/proteins that interact with each chemical or its descendants.
CTD_chem_pathways_enriched.tsv.gz	http://ctdbase.org/downloads/	Inferred associations between chemicals and biological pathways - computationally derived enrichments based on curated chemical–gene and gene–pathway data in CTD.
CTD_pheno_term_ixns.tsv.gz	http://ctdbase.org/downloads/	Curated, literature-based associations between chemicals and phenotypes (often disease-like or physiological outcomes).
CTD_genes_diseases.tsv.gz and CTD_curated_genes_diseases.tsv.gz	http://ctdbase.org/downloads/	Manually curated and computationally inferred associations between genes and diseases (to be ingested in a future iteration)
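
All of the files listed above are gzipped, tab-separated downloads. As a rough illustration of how one might stream such a file, the sketch below assumes that the header and notes sit on '#'-prefixed comment lines, so the column names are passed in explicitly. The function name `read_ctd_tsv` is hypothetical and is not taken from the ingest code.

```python
import csv
import gzip
from typing import Dict, Iterator, Sequence


def read_ctd_tsv(path: str, fieldnames: Sequence[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row of a gzipped TSV, skipping '#' comment lines.

    Assumption: comment and header lines start with '#', so the caller
    supplies the column names in file order.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        data_lines = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(
            data_lines,
            fieldnames=list(fieldnames),
            delimiter="\t",
            quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
        )
        yield from reader
```

Streaming row by row keeps memory use flat, which matters for the larger downloads. The column list for each file should be taken from the data dictionaries on the CTD downloads page.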

Included Content

File Name	Included Records	Fields Used
CTD_chemicals_diseases.tsv.gz	Associations curated and designated therapeutic or marker/mechanism	ChemicalName, ChemicalID, CasRN, DiseaseName, DiseaseID, DirectEvidence, OmimIDs, PubMedIDs
CTD_exposure_events.tsv.gz	All records with an outcomerelationship of "positive correlation" or "negative correlation"
CTD_chem_gene_ixns.tsv.gz	All records that involve exactly two entities. Records that involve three or more entities are dropped.
CTD_chem_go_enriched.tsv.gz	Records with a corrected p value less than 1e-10 and a highest go level greater than or equal to 3
CTD_chem_pathways_enriched.tsv.gz	Records with a corrected p value less than 1e-10
CTD_pheno_term_ixns.tsv.gz	All records (none filtered)
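
The two enrichment thresholds in the table can be sketched as simple per-record checks. This is a minimal illustration, not the ingest's actual code: the column names `CorrectedPValue` and `HighestGOLevel` are assumptions about the CTD file headers, and the function and constant names are hypothetical.

```python
# Thresholds taken from the Included Content table.
MAX_CORRECTED_P_VALUE = 1e-10
MIN_HIGHEST_GO_LEVEL = 3


def keep_pathway_enrichment(record: dict) -> bool:
    """Keep records whose corrected p-value is strictly below the cutoff."""
    # "CorrectedPValue" is an assumed column name.
    return float(record["CorrectedPValue"]) < MAX_CORRECTED_P_VALUE


def keep_go_enrichment(record: dict) -> bool:
    """Apply the p-value cutoff plus the minimum GO level."""
    # "HighestGOLevel" is an assumed column name.
    return (
        float(record["CorrectedPValue"]) < MAX_CORRECTED_P_VALUE
        and int(record["HighestGOLevel"]) >= MIN_HIGHEST_GO_LEVEL
    )
```

Note that the p-value comparison is strict, so a record with a corrected p-value of exactly 1e-10 is not kept.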

Filtered Content

File Name	Filtered Records	Rationale
CTD_chemicals_diseases.tsv.gz	Records lacking a DirectEvidence designation	These relationships are based only on computational inference and were found not to be reliably useful enough to include.
CTD_chem_gene_ixns.tsv.gz	Records involving three or more entities are removed.	It is not clear how we could split them into individual edges in a way that preserves the original knowledge assertion.
CTD_chem_go_enriched.tsv.gz	Records with a corrected p-value of 1e-10 or higher, or a highestGoLevel lower than 3, are dropped.	This removes a large number of weak and overly vague associations.
CTD_chem_pathways_enriched.tsv.gz	Records with a corrected p-value of 1e-10 or higher are dropped.	This removes a large number of weak associations.
CTD_exposure_events.tsv.gz	Records that do not have an outcomerelationship of "positive correlation" or "negative correlation" are dropped.
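
Two of the drop rules in the table depend only on a single field being present or having a specific value. The sketch below shows one way to express them and to count how many records each rule removes. The column names `DirectEvidence` and `outcomerelationship` follow the wording used in this guide. The function names are hypothetical, and the case-insensitive matching is an assumption.

```python
from typing import Callable, List, Tuple

CORRELATION_OUTCOMES = {"positive correlation", "negative correlation"}


def drop_chemical_disease(record: dict) -> bool:
    """Drop chemical-disease records that carry no DirectEvidence designation."""
    return not (record.get("DirectEvidence") or "").strip()


def drop_exposure_event(record: dict) -> bool:
    """Drop exposure records whose outcome relationship is not a correlation."""
    outcome = (record.get("outcomerelationship") or "").strip().lower()
    return outcome not in CORRELATION_OUTCOMES


def apply_filter(
    records: List[dict], should_drop: Callable[[dict], bool]
) -> Tuple[List[dict], int]:
    """Return the kept records and the number of records dropped."""
    kept = [record for record in records if not should_drop(record)]
    return kept, len(records) - len(kept)
```

Reporting the dropped count alongside the kept records makes it easy to check the effect of each rule against the row counts published in the CTD Catalog.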

Future Content Considerations

edge_content: We should ingest gene-disease association edges from the CTD_genes_diseases.tsv.gz file and/or the curated CTD_curated_genes_diseases.tsv.gz file. These files hold a large amount of content. Assess quality and compare to other gene-disease association sources when we do a comprehensive review of content/modeling in this area.

edge_content: For inferred Chemical-Disease associations, consider adding a cutoff to remove lower quality/confidence inferences (e.g. based on shared gene count, publication count, or inference score). At present we include even inferences based on a single shared gene/pub - which is not really meaningful. - Relevant files: CTD_chemicals_diseases.tsv.gz
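
A cutoff based on shared gene count could look roughly like the sketch below. It assumes that the chemical-disease file reports one inference gene per row in a column named `InferenceGeneSymbol`. That column name, the function names, and the `min_shared_genes` parameter are illustrative assumptions, and no particular threshold is proposed here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (ChemicalID, DiseaseID)


def shared_genes_by_pair(records: Iterable[dict]) -> Dict[Pair, Set[str]]:
    """Collect the distinct inference genes reported for each chemical-disease pair."""
    genes: Dict[Pair, Set[str]] = defaultdict(set)
    for record in records:
        # "InferenceGeneSymbol" is an assumed column name (one gene per row).
        gene = (record.get("InferenceGeneSymbol") or "").strip()
        if gene:
            genes[(record["ChemicalID"], record["DiseaseID"])].add(gene)
    return genes


def pairs_passing_cutoff(records: Iterable[dict], min_shared_genes: int) -> Set[Pair]:
    """Return the pairs supported by at least `min_shared_genes` distinct genes."""
    return {
        pair
        for pair, genes in shared_genes_by_pair(records).items()
        if len(genes) >= min_shared_genes
    }
```

The same grouping step yields the per-pair gene list, so a cutoff on publication count or inference score could be added in the same place.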

edge_content: For enrichment-based Chemical-GO term associations, consider adding a cutoff to remove lower quality/confidence associations based on p-values. - Relevant files: CTD_chem_go_enriched.tsv

edge_content: For enrichment-based Chemical-Pathway associations, consider adding a cutoff to remove lower quality/confidence associations based on p-values. - Relevant files: CTD_chem_pathways_enriched.tsv

edge_property_content: Consider an edge property that reports the list of shared genes supporting C-D inferred associations. - Relevant files: CTD_chemicals_diseases.tsv.gz

node_property_content: MolePro included chemical properties in its previous ingests, in particular data from the CTD_chemicals.tsv file, and we will likely bring these in at some point. - Relevant files: CTD_chemicals.tsv

edge_property_content: The exposures event file provides a lot of provenance/evidence metadata about supporting studies, including characteristics of subject populations, location of the study, and experimental/statistical data values. Consider for inclusion using a StudyResult object. - Relevant files: CTD_exposure_events.tsv.gz

Target Information

Target InfoRes ID: infores:translator-ctd

Edge Types

Subject Categories	Object Categories	Knowledge Level	Agent Type	UI Explanation
biolink:ChemicalEntity	biolink:DiseaseOrPhenotypicFeature	knowledge_assertion	manual_agent	Source CTD data provides assertions about Chemical-Disease relationships that were manually curated from literature by CTD curators. The CTD record used to create this Translator edge has a DirectEvidence code of "T" (therapeutic), indicating the chemical to be a "potential" treatment by virtue of its clinical use or scientific study. The broad and imprecise nature of this relationship is best represented with the Biolink predicate 'treats or applied or studied to treat', which is used when it is not clear whether the chemical shows established efficacy or was merely studied or attempted as a treatment.
biolink:ChemicalEntity	biolink:DiseaseOrPhenotypicFeature	knowledge_assertion	manual_agent	Source CTD data provides assertions about Chemical-Disease relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge has a 'DirectEvidence' code of "M" (marker/mechanism), indicating that the chemical is either a marker or contributing factor for a condition. This vague/imprecise relationship implies that at minimum there is correlation between the presence of the chemical and condition, and is best represented here using the Biolink 'correlated_with' predicate.
biolink:ChemicalEntity	biolink:DiseaseOrPhenotypicFeature	statistical_association	data_analysis_pipeline	Source CTD data is based on statistical analysis of gene associations shared by a given Chemical-Disease pair, as determined by an automated CTD pipeline that assigns an 'inference score' to each pair. The CTD record(s) used to create this Translator edge have an 'inference score' that indicates a statistically significant overrepresentation of shared gene associations between the Chemical and Disease. This hints at, but does not demonstrate, a real biological relationship. The indirect, statistical basis of this relationship is best represented using the relatively generic 'associated_with' Biolink predicate.
biolink:ChemicalEntity, biolink:ComplexMolecularMixture, biolink:MolecularMixture, biolink:SmallMolecule	biolink:PhenotypicFeature, biolink:Disease	statistical_association	manual_agent	Source CTD data provides correlations between Chemicals and Diseases based on a single real-world exposure study that was curated from the literature by CTD curators. These correlations are calculated separately for each study by authors or curators, not by an automated analysis pipeline. The CTD record used to create this Translator edge shows a statistically significant correlation between the studied exposure (usually a chemical) and a measured outcome (usually a disease or phenotype). This relationship is represented using the Biolink 'correlated with' predicate, or one of its directional sub-predicates if this information is provided ('positively/negatively correlated with'). Note that the edge is based on results from a single study and may hold only in the context of that study's design or population.
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule	biolink:Gene	knowledge_assertion	manual_agent	Source CTD data provides assertions about Chemical-Gene relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge describes how a chemical affects a particular aspect and/or form of a gene or gene product, and often the direction of the effect (e.g. that it causes increased activity of the gene, or decreased stability of the gene). These causal relationships are represented using the Biolink 'affects' or 'causes' predicate, with qualifiers capturing additional detail about the aspect and direction of the effect.
biolink:Gene	biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule	knowledge_assertion	manual_agent	Source CTD data provides assertions about Gene-Chemical relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge describes how a gene or gene product affects a particular aspect or form of a chemical entity, and often the direction of the effect (e.g. that it causes increased abundance of a chemical, or decreased stability). These causal relationships are represented using the Biolink 'affects' or 'causes' predicates, with qualifiers capturing additional detail about the aspect and direction of the gene's effect.
biolink:Gene	biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule	knowledge_assertion	manual_agent	Source CTD data provides assertions about Gene-Chemical relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge describes how a gene or gene product (usually a mutant form) may affect the susceptibility/sensitivity of a cell or organism to a particular chemical exposure. The relationship is represented using the Biolink 'affects sensitivity to' predicate - or one of its directional sub-predicates ('increases/decreases sensitivity to') if a direction is provided - with qualifiers capturing additional detail when a specific form of the gene mediates this sensitivity (e.g. a mutant_form).
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule	biolink:Gene			Source CTD data provides assertions about Chemical-Gene relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge describes how a chemical may affect the susceptibility/sensitivity of a cell or organism to the actions of a particular gene or gene product (e.g. a signaling molecule like TNF). This relationship is represented using the Biolink 'affects sensitivity to' predicate - or one of its directional sub-predicates ('increases/decreases sensitivity to') if a direction is provided.
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule	biolink:Gene	knowledge_assertion	manual_agent	Source CTD data provides assertions about Chemical-Gene relationships that were manually curated from literature by CTD. The CTD record used to create this Translator edge reports that a chemical entity binds to a particular gene/gene product, but does not indicate a specific effect/outcome of this interaction on the gene. This relationship is represented using the Biolink 'directly physically interacts with' predicate, with qualifiers capturing additional detail about a specific form, part, or derivative of the gene or chemical that participates in the interaction, when relevant.
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule	biolink:BiologicalProcess, biolink:BiologicalProcessOrActivity, biolink:MolecularActivity, biolink:Pathway	knowledge_assertion	manual_agent	Source CTD data reports assertions about chemicals and cellular phenotypes that were manually curated from literature by CTD. The CTD record used to create this Translator edge reports that a chemical causes a cellular phenotype, which is framed as an effect on a biological process, activity, or pathway. This relationship is represented using the Biolink 'affects' or 'causes' predicates, with qualifiers capturing additional detail about the direction of the effect, and the anatomical or species context of the effect where provided (e.g. that it causes increased levels of the process or activity in a particular cell type, tissue, or species).
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule	biolink:BiologicalProcessOrActivity, biolink:MolecularActivity, biolink:CellularComponent	statistical_association	data_analysis_pipeline	Source CTD data are based on an enrichment analysis of Gene Ontology (GO) terms annotated to all genes that interact with a particular chemical - to derive Chemical-GO Term associations. This analysis is performed by a CTD automated analysis pipeline. The CTD record used to create this Translator edge reports an overrepresented GO term associated with a particular chemical - as indicated by a sufficiently low p-value from the analysis. This hints but does not directly demonstrate that the chemical may impact the process, activity, or cellular component the GO term represents. The indirect, statistical basis of this relationship is best represented using the 'associated_with' predicate in Biolink.
biolink:ChemicalEntity, biolink:MolecularMixture, biolink:SmallMolecule	biolink:Pathway	statistical_association	data_analysis_pipeline	Source CTD data are based on an enrichment analysis of Pathways annotated to all genes that interact with a particular chemical - to derive Chemical-Pathway associations. This analysis is performed by a CTD automated analysis pipeline. The CTD record used to create this Translator edge reports an overrepresented Pathway associated with a particular chemical - as indicated by a sufficiently low p-value from the analysis. This hints but does not directly demonstrate that the chemical may impact the Pathway. The indirect, statistical basis of this relationship is best represented using the 'associated_with' predicate in Biolink.
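
The Chemical-Disease and exposure rows of the table amount to a lookup from a source field value to a predicate, knowledge level, and agent type. The sketch below restates that lookup in code as an illustration only. It assumes the DirectEvidence values are spelled "therapeutic" and "marker/mechanism" as in the Included Content table, and that an empty value marks an inferred association. The names `EdgeTemplate`, `chemical_disease_template`, and `exposure_predicate` are hypothetical.

```python
from typing import NamedTuple, Optional


class EdgeTemplate(NamedTuple):
    predicate: str
    knowledge_level: str
    agent_type: str


# Curated designations (assumed spellings) and the edge metadata from the table.
CURATED_TEMPLATES = {
    "therapeutic": EdgeTemplate(
        "biolink:treats_or_applied_or_studied_to_treat",
        "knowledge_assertion",
        "manual_agent",
    ),
    "marker/mechanism": EdgeTemplate(
        "biolink:correlated_with", "knowledge_assertion", "manual_agent"
    ),
}
INFERRED_TEMPLATE = EdgeTemplate(
    "biolink:associated_with", "statistical_association", "data_analysis_pipeline"
)

EXPOSURE_PREDICATES = {
    "positive correlation": "biolink:positively_correlated_with",
    "negative correlation": "biolink:negatively_correlated_with",
}


def chemical_disease_template(direct_evidence: str) -> EdgeTemplate:
    """Pick edge metadata for a chemical-disease record from its DirectEvidence value."""
    value = direct_evidence.strip().lower()
    if not value:
        return INFERRED_TEMPLATE  # assumption: empty means inferred
    if value not in CURATED_TEMPLATES:
        raise ValueError(f"Unexpected DirectEvidence value: {direct_evidence!r}")
    return CURATED_TEMPLATES[value]


def exposure_predicate(outcome_relationship: str) -> Optional[str]:
    """Return the directional predicate, or None for non-correlation outcomes."""
    return EXPOSURE_PREDICATES.get(outcome_relationship.strip().lower())
```

Raising on an unrecognized DirectEvidence value, rather than silently defaulting, is a design choice that surfaces new source vocabulary when CTD releases change.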

Node Types

Node Category	Source Identifier Types	Additional Notes
	MeSH	Majority are Biolink SmallMolecules
	MeSH, OMIM
	GO
	NCBIGene
	KEGG.PATHWAY, REACT

Future Modeling Considerations

edge_properties: Revisit use of 'has_confidence_score' edge property if/when we refactor this part of the Biolink Model.

predicates: Revisit 'correlated_with' and 'treats_or_applied_or_studied_to_treat' predicates if/when we refactor modeling or conventions here.

other: May want to improve ui_explanations for the chemical-gene interactions edge types, once we understand how the UI will stitch together / display predicates and qualifier values for these edges.

edges: CTD_pheno_term_ixns and CTD_chem_gene_ixns both have many interactions that involve three entities, such as A causing B, which results in B causing C. These are hard to model on a single edge, but we may be able to extract single edges from them.

edges: (related to the consideration above) The CTD_pheno_term_ixns.tsv file, which links chemicals to phenotypes (post-composed from GO term + direction), includes a column indicating the gene(s) mediating the phenotype. Consider adding this information, perhaps using a new statement-level qualifier such as 'gene context qualifier' or 'mediating gene qualifier'.

Provenance Information

Contributors:
- Kevin Schaper: code author
- Evan Morris: code author
- Sierra Moxon: code support
- Vlado Dancik: code support, domain expertise
- Matthew Brush: data modeling, domain expertise

Artifacts:
- Ingest Survey: https://docs.google.com/spreadsheets/d/1R9z-vywupNrD_3ywuOt_sntcTrNlGmhiUWDXUdkPVpM/edit?gid=0#gid=0
- KGX Summary Report of phase 2 ingests: https://docs.google.com/spreadsheets/d/1IVpkL0tFyk6U7c3tKFKKOlOhtJdMyZlI-1jn69LT0Rs/edit?gid=663403777#gid=663403777