Jensen Lab DISEASES Database Reference Ingest Guide
Source Information
InfoRes ID: infores:diseases
Description: The DISEASES database is a web resource that integrates knowledge on gene-disease associations. It generates de novo associations through automated text mining, and aggregates associations from external sources of manually curated knowledge and GWAS-based study results. The associations are assigned a confidence score to facilitate comparisons across data types and sources.
Citations:
- https://doi.org/10.1093/database/baac019
- https://www.sciencedirect.com/science/article/pii/S1046202314003831

Data Access Locations:
- https://diseases.jensenlab.org/Downloads
Data Provision Mechanisms: file_download
Data Formats: tsv
Data Versioning and Releases: Updated weekly on Fridays, according to the current maintainer (as of 2025-08-05). It is unclear whether Friday is when the update is posted or when it starts; the update process may take days. The website offers only the latest version for download, and that download carries no version number or creation date. Older, versioned releases are archived at https://figshare.com/authors/Lars_Juhl_Jensen/96428
Ingest Information
Ingest Categories: primary_knowledge_provider, aggregation_provider
Utility: DISEASES contains gene-disease associations from unique sources, including its own text-mining pipeline and external human-curated resources that are hard to access or parse (MedlinePlus, AmyCo). These associations could be used in MVP1 (may treat disease X) or Pathfinder queries.
Scope: This ingest covers text-mined co-occurrence associations and manually curated associations from MedlinePlus and AmyCo. Content aggregated from UniProt is not ingested. Experiment-based associations from TIGA data are not ingested either; GWAS-based associations will instead be taken from a direct source (TIGA and/or something else).
Relevant Files
| File Name | Location | Description |
|---|---|---|
| human_disease_textmining_filtered.tsv | https://diseases.jensenlab.org/Downloads | Text-mined associations, filtered to contain only the non-redundant associations that are shown within the web interface when querying for a gene |
| human_disease_knowledge_filtered.tsv | https://diseases.jensenlab.org/Downloads | Curated associations, filtered to contain only the non-redundant associations that are shown within the web interface when querying for a gene |
Included Content
| File Name | Included Records | Fields Used |
|---|---|---|
| human_disease_textmining_filtered.tsv | All gene-disease (G-D) association records generated by the DISEASES text-mining tool | gene_id, disease_id, z_score, confidence_score, url. The file has no header row; Colleen Xu assigned these column names after manually reviewing the content and reading the file descriptions at https://diseases.jensenlab.org/Downloads. |
| human_disease_knowledge_filtered.tsv | G-D association records aggregated from MedlinePlus and AmyCo sources | gene_id, disease_id, source_db, confidence_score. The file has no header row; Colleen Xu assigned these column names after manually reviewing the content and reading the file descriptions at https://diseases.jensenlab.org/Downloads. |
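
Because neither file has a header row, column names have to be supplied when the file is read. The following is a minimal sketch of that step, not the ingest's actual parser. The seven-column layouts, the names `gene_name`, `disease_name`, and `evidence_type`, and the helper `read_headerless_tsv` are assumptions for illustration and should be checked against the downloaded files; the sample row is made up.

```python
import csv
import io

# Assumed column layouts (the files have no header row). Only the fields listed
# in the table above are used by the ingest; the other names are guesses and
# should be verified against https://diseases.jensenlab.org/Downloads.
TEXTMINING_COLUMNS = ["gene_id", "gene_name", "disease_id", "disease_name",
                      "z_score", "confidence_score", "url"]
KNOWLEDGE_COLUMNS = ["gene_id", "gene_name", "disease_id", "disease_name",
                     "source_db", "evidence_type", "confidence_score"]


def read_headerless_tsv(handle, columns):
    """Yield one dict per row, keyed by the supplied column names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(row)}")
        yield dict(zip(columns, row))


# Made-up sample row for illustration only (not real DISEASES data).
sample = io.StringIO(
    "ENSP00000000001\tGENE1\tDOID:0000001\tExample disease\t4.2\t2.1\t"
    "https://example.org/record\n"
)
records = list(read_headerless_tsv(sample, TEXTMINING_COLUMNS))
print(records[0]["gene_id"], records[0]["disease_id"], records[0]["z_score"])
```

Raising on an unexpected field count makes a silent change in the upstream file layout visible early, which matters here because the download carries no version information.
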
Filtered Content
| File Name | Filtered Records | Rationale |
|---|---|---|
| human_disease_textmining_filtered.tsv | Records with no ENSP ID in gene ID column or no DOID in disease ID column | Node IDs must be within NodeNorm's scope. Other values are either non-ID strings (judging from string searches and limited manual review) or IDs that NodeNorm would not resolve (AmyCo). |
| human_disease_textmining_filtered.tsv | Records that had NodeNorm mapping failure on gene or disease ID | Need node IDs that NodeNorm successfully maps to entities. |
| human_disease_knowledge_filtered.tsv | G-D association records aggregated from UniProt | Questionable quality and completeness of UniProt data in DISEASES; best to get this content directly from UniProt. |
| human_disease_knowledge_filtered.tsv | Complete duplicates | Only one copy of each unique record is needed. |
| human_disease_knowledge_filtered.tsv | Records with no ENSP ID in gene ID column or no DOID in disease ID column | Node IDs must be within NodeNorm's scope. Other values are either non-ID strings (judging from string searches and limited manual review) or IDs that NodeNorm would not resolve (AmyCo). |
| human_disease_knowledge_filtered.tsv | Records that had NodeNorm mapping failure on gene or disease ID | Need node IDs that NodeNorm successfully maps to entities. |
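
The record-level filters in the table above can be sketched as a single pass over parsed rows. This is an illustration under stated assumptions rather than the ingest's actual code: the prefix checks (`ENSP`, `DOID:`), the case-insensitive match on "uniprot" in `source_db`, and the function name `filter_records` are hypothetical, and the sample rows are made up. The NodeNorm mapping check is left out because it requires a call to the normalization service.

```python
def filter_records(records):
    """Apply the source, ID-format, and duplicate filters to parsed rows."""
    seen = set()
    kept = []
    for rec in records:
        # Drop UniProt-derived rows (only the knowledge file has source_db).
        if "uniprot" in rec.get("source_db", "").lower():
            continue
        # Keep only rows with an ENSP gene ID and a DOID disease ID.
        if not rec["gene_id"].startswith("ENSP"):
            continue
        if not rec["disease_id"].startswith("DOID:"):
            continue
        # Drop complete duplicates: every field must match an earlier row.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept


# Made-up rows for illustration only (not real DISEASES data).
good = {"gene_id": "ENSP00000000001", "disease_id": "DOID:0000001",
        "source_db": "MedlinePlus", "confidence_score": "5"}
sample_records = [
    good,
    dict(good),                                   # complete duplicate
    {**good, "source_db": "UniProtKB-KW"},        # UniProt-derived
    {**good, "gene_id": "not an identifier"},     # no ENSP ID
    {**good, "disease_id": "AmyCo:1"},            # no DOID
]
kept = filter_records(sample_records)
print(len(kept))
```

For the text-mining file, rows have no `source_db` field, so the UniProt check is skipped automatically and only the ID-format and duplicate checks apply.
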
Future Content Considerations
edge_content: Consider filtering out lower-scoring text-mined associations if a suitable score threshold/cutoff can be defined. Relevant files: human_disease_textmining_filtered.tsv
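
If a cutoff is eventually chosen, the filter itself would be small. The sketch below assumes records parsed into dicts with a `confidence_score` string field; the function name `above_cutoff` and the cutoff value in the usage lines are placeholders, not a recommended threshold.

```python
def above_cutoff(records, min_confidence):
    """Keep records whose confidence score is at or above the cutoff."""
    return [r for r in records if float(r["confidence_score"]) >= min_confidence]


# Made-up scores for illustration; 2.0 is a placeholder cutoff.
scored = [{"confidence_score": "0.9"}, {"confidence_score": "2.0"},
          {"confidence_score": "3.4"}]
strong = above_cutoff(scored, min_confidence=2.0)
print(len(strong))
```
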
Target Information
Target InfoRes ID: infores:translator-diseases
Edge Types
| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
|---|---|---|---|---|---|
| biolink:Gene, biolink:Protein | biolink:occurs_together_in_literature_with | biolink:Disease | text_co_occurrence | data_analysis_pipeline | Source DISEASES data is generated by their text-mining and analysis pipeline, which uses NER to identify Genes and Diseases in literature and statistical methods to determine the co-occurrence patterns of these concepts. The DISEASES record used to create this Translator edge represents a statistically significant co-occurrence (z-score >= 3) of the Gene and Disease concepts in literature. This relationship is therefore represented using the Biolink 'occurs together in literature with' predicate. |
| biolink:Gene, biolink:Protein | biolink:associated_with | biolink:Disease | knowledge_assertion | manual_agent | Source DISEASES provides Gene-Disease associations ingested from external, manually curated sources. The DISEASES record used to create this Translator edge does not report a specific type of gene-disease relationship for these 'associations', so the edge is represented using the relatively generic Biolink 'associated with' predicate. |
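
As a rough illustration of how one parsed record might be turned into an edge of each type, the sketch below builds plain dicts. The predicates follow the names given in the UI Explanation column; the dict shape, the function names, and the `ENSEMBL:` CURIE prefix applied to the ENSP identifier are assumptions for illustration, not the ingest's actual output format.

```python
def textmining_edge(record):
    """Build a text-mined co-occurrence edge from one parsed record."""
    return {
        "subject": f"ENSEMBL:{record['gene_id']}",  # assumed CURIE form
        "predicate": "biolink:occurs_together_in_literature_with",
        "object": record["disease_id"],
        "knowledge_level": "text_co_occurrence",
        "agent_type": "data_analysis_pipeline",
        "z_score": float(record["z_score"]),
        "confidence_score": float(record["confidence_score"]),
    }


def curated_edge(record):
    """Build a manually curated association edge from one parsed record."""
    return {
        "subject": f"ENSEMBL:{record['gene_id']}",  # assumed CURIE form
        "predicate": "biolink:associated_with",
        "object": record["disease_id"],
        "knowledge_level": "knowledge_assertion",
        "agent_type": "manual_agent",
        "source_db": record["source_db"],
        "confidence_score": float(record["confidence_score"]),
    }


# Made-up record for illustration only (not real DISEASES data).
edge = textmining_edge({"gene_id": "ENSP00000000001",
                        "disease_id": "DOID:0000001",
                        "z_score": "4.2", "confidence_score": "2.1"})
print(edge["subject"], edge["predicate"], edge["object"])
```
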
Node Types
| Node Category | Source Identifier Types | Additional Notes |
|---|---|---|
| biolink:Gene | ENSEMBL | Source uses the ENSP (protein) identifiers from Ensembl |
| biolink:Protein | ENSEMBL | Source uses the ENSP (protein) identifiers from Ensembl |
| biolink:Disease | DOID | |
Future Modeling Considerations
predicates: Revisit use of associated_with predicate for curated edges after we refactor the associated_with and/or gene-disease-relationship branches of the Biolink predicate hierarchy (if we reserve this predicate for statistically-based relationships, we may need to use related_to)
edge_properties: Revisit modeling of confidence score/levels and z-score if/when we refactor these parts of the Biolink Model
Provenance Information
Contributors:
- Colleen Xu - code author, data modeling
- Andrew Su - code support, domain expertise
- Matthew Brush - data modeling, domain expertise

Artifacts:
- GitHub ticket: https://github.com/NCATSTranslator/Data-Ingest-Coordination-Working-Group/issues/13
- Notebooks for development work are currently in the parser code directory