Jensen Lab DISEASES Database Reference Ingest Guide

Source Information

InfoRes ID: infores:diseases

Description: The DISEASES database is a web resource that integrates knowledge on gene-disease associations. It generates de novo associations through automated text mining, and aggregates associations from external sources of manually curated knowledge and GWAS-based study results. The associations are assigned a confidence score to facilitate comparisons across data types and sources.

Citations:
- https://doi.org/10.1093/database/baac019
- https://www.sciencedirect.com/science/article/pii/S1046202314003831

Data Access Locations:
- https://diseases.jensenlab.org/Downloads

Data Provision Mechanisms: file_download

Data Formats: tsv

Data Versioning and Releases: According to the current maintainer (as of 2025-08-05), the data are updated weekly on Fridays. It is unclear whether Friday is when the update starts or when it is posted, since the update process may take days. The website offers only the latest version for download, and that download carries no version number or creation date. Older, versioned releases are archived at https://figshare.com/authors/Lars_Juhl_Jensen/96428
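
Because the latest download carries no version or creation date, an ingest may want to record its own stamp. The following is only a sketch of one option, combining the download date with a short content hash. The function name `make_version_stamp` and the stamp format are hypothetical and are not part of DISEASES or of the ingest code.

```python
import hashlib
from datetime import date


def make_version_stamp(content: bytes, downloaded_on: date) -> str:
    """Build a local version label for an unversioned download.

    Hypothetical helper: the label is '<ISO download date>_<first 12 hex
    characters of the SHA-256 of the file content>'.
    """
    digest = hashlib.sha256(content).hexdigest()[:12]
    return f"{downloaded_on.isoformat()}_{digest}"


# Illustrative call with placeholder bytes instead of a real download.
stamp = make_version_stamp(b"placeholder file content", date(2025, 8, 5))
print(stamp)
```

Two downloads with identical content on different days share the hash part of the stamp, which makes it easy to see whether a weekly update actually changed the file.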

Ingest Information

Ingest Categories: primary_knowledge_provider, aggregation_provider

Utility: DISEASES contains gene-disease associations from unique sources, including its own text-mining pipeline and external human-curated resources that are hard to access or parse (MedlinePlus, AmyCo). These associations could be used in MVP1 (may treat disease X) or Pathfinder queries.

Scope: This ingest covers text-mined co-occurrence associations and manually curated associations from MedlinePlus and AmyCo. Content aggregated from UniProt is not ingested. Experiment-based associations from TIGA data are not ingested either; GWAS-based associations will instead come from a direct source (TIGA and/or another resource).

Relevant Files

File Name	Location	Description
human_disease_textmining_filtered.tsv	https://diseases.jensenlab.org/Downloads	Text mined associations, filtered to contain only the non-redundant associations that are shown within the web interface when querying for a gene
human_disease_knowledge_filtered.tsv	https://diseases.jensenlab.org/Downloads	Curated associations, filtered to contain only the non-redundant associations that are shown within the web interface when querying for a gene

Included Content

File Name	Included Records	Fields Used
human_disease_textmining_filtered.tsv	All gene-disease association records generated by the DISEASES text-mining tool	gene_id, disease_id, z_score, confidence_score, url. The file has no header row. Colleen Xu assigned these column names after manually reviewing the content and reading the file descriptions at https://diseases.jensenlab.org/Downloads.
human_disease_knowledge_filtered.tsv	Gene-disease association records aggregated from MedlinePlus and AmyCo sources	gene_id, disease_id, source_db, confidence_score. The file has no header row. Colleen Xu assigned these column names after manually reviewing the content and reading the file descriptions at https://diseases.jensenlab.org/Downloads.
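
Since neither file has a header row, column names have to be supplied when reading. A minimal sketch of that step follows, using the standard-library `csv` module. The full column lists and their order are assumptions here (only the "Fields Used" names come from this guide; `gene_name`, `disease_name` and `evidence_type` are guesses) and should be checked against the Downloads page. The sample row is made up for illustration.

```python
import csv
import io

# ASSUMPTION: column order and the extra name columns are not confirmed by
# this guide; verify against the file descriptions before relying on them.
TEXTMINING_COLUMNS = [
    "gene_id", "gene_name", "disease_id", "disease_name",
    "z_score", "confidence_score", "url",
]
KNOWLEDGE_COLUMNS = [
    "gene_id", "gene_name", "disease_id", "disease_name",
    "source_db", "evidence_type", "confidence_score",
]


def read_headerless_tsv(handle, columns):
    """Read a tab-separated file without a header into a list of dicts."""
    reader = csv.DictReader(
        handle, fieldnames=columns, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


# Made-up row standing in for a line of the text-mining file.
sample = "ENSP00000000233\tGENE1\tDOID:1612\tdisease one\t4.2\t2.1\thttps://example.org\n"
rows = read_headerless_tsv(io.StringIO(sample), TEXTMINING_COLUMNS)
print(rows[0]["gene_id"], rows[0]["disease_id"], rows[0]["z_score"])
```

All values come back as strings, so scores need an explicit `float(...)` conversion before any numeric comparison.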

Filtered Content

File Name	Filtered Records	Rationale
human_disease_textmining_filtered.tsv	Records with no ENSP ID in gene ID column or no DOID in disease ID column	Need node IDs that are in NodeNorm's scope. Other values are non-ID strings (based on string searches and limited manual review) or IDs that NodeNorm would not resolve (AmyCo).
human_disease_textmining_filtered.tsv	Records that had NodeNorm mapping failure on gene or disease ID	Need node IDs that NodeNorm successfully maps to entities.
human_disease_knowledge_filtered.tsv	G-D association records aggregated from UniProt	Questionable quality and completeness of UniProt data in DISEASES; best to get this content directly from UniProt.
human_disease_knowledge_filtered.tsv	Complete duplicates	Only need 1 copy of each unique record
human_disease_knowledge_filtered.tsv	Records with no ENSP ID in gene ID column or no DOID in disease ID column	Need node IDs that are in NodeNorm's scope. Other values are non-ID strings (based on string searches and limited manual review) or IDs that NodeNorm would not resolve (AmyCo).
human_disease_knowledge_filtered.tsv	Records that had NodeNorm mapping failure on gene or disease ID	Need node IDs that NodeNorm successfully maps to entities.
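
The filters listed above, apart from the NodeNorm mapping check (which needs a call to the NodeNorm service and is not shown), might be sketched as follows. This is an illustration rather than the actual parser code: the function names are hypothetical, the prefix tests stand in for whatever ID checks the parser really uses, and matching UniProt rows by the substring "uniprot" in `source_db` is an assumption about how that source is labelled in the file.

```python
def has_usable_ids(record):
    """True when the gene ID looks like an ENSP ID and the disease ID like a DOID."""
    return (
        record["gene_id"].startswith("ENSP")
        and record["disease_id"].startswith("DOID:")
    )


def is_from_uniprot(record):
    """ASSUMPTION: UniProt-derived rows carry 'uniprot' in source_db.

    Text-mining records have no source_db field, so they never match.
    """
    return "uniprot" in record.get("source_db", "").lower()


def filter_records(records):
    """Drop UniProt rows, complete duplicates, and rows without usable IDs."""
    seen = set()
    kept = []
    for record in records:
        key = tuple(sorted(record.items()))  # identifies a complete duplicate
        if key in seen:
            continue
        seen.add(key)
        if is_from_uniprot(record) or not has_usable_ids(record):
            continue
        kept.append(record)
    return kept


# Made-up records for illustration only.
records = [
    {"gene_id": "ENSP00000000233", "disease_id": "DOID:1612", "source_db": "MedlinePlus"},
    {"gene_id": "ENSP00000000233", "disease_id": "DOID:1612", "source_db": "MedlinePlus"},
    {"gene_id": "ENSP00000000233", "disease_id": "DOID:1612", "source_db": "UniProtKB-KW"},
    {"gene_id": "some gene name", "disease_id": "DOID:1612", "source_db": "AmyCo"},
]
print(len(filter_records(records)))
```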

Future Content Considerations

edge_content: Consider filtering some of the lower scoring text-mined associations if we can define a threshold/cutoff - Relevant files: human_disease_textmining_filtered.tsv
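
No cutoff has been chosen yet. As an aid to choosing one, the sketch below counts how many text-mined records would survive at several candidate cutoffs on `confidence_score`. The function name, the candidate values and the sample scores are all hypothetical placeholders, not recommendations.

```python
def survivors_by_cutoff(records, cutoffs):
    """Map each candidate cutoff to the number of records scoring at or above it."""
    scores = [float(record["confidence_score"]) for record in records]
    return {cutoff: sum(score >= cutoff for score in scores) for cutoff in cutoffs}


# Made-up scores; real values would come from the text-mining file.
records = [
    {"confidence_score": "1.0"},
    {"confidence_score": "2.5"},
    {"confidence_score": "4.0"},
]
print(survivors_by_cutoff(records, [2.0, 3.0]))
```

The same counting could be run on `z_score` instead if that turns out to be the better basis for a threshold.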

Target Information

Target InfoRes ID: infores:translator-diseases

Edge Types

Subject Categories	Predicate	Object Categories	Knowledge Level	Agent Type	UI Explanation
biolink:Gene, biolink:Protein	biolink:occurs_together_in_literature_with	biolink:Disease	text_co_occurrence	data_analysis_pipeline	Source DISEASES data is generated by their text-mining and analysis pipeline that uses NER to identify Genes and Diseases in literature and statistical methods to determine the co-occurrence patterns of these concepts. The DISEASES record used to create this Translator edge represents a statistically significant co-occurrence (z-score >=3) of the Gene and Disease concepts in literature. This relationship is therefore represented using the Biolink 'occurs together in literature with' predicate.
biolink:Gene, biolink:Protein	biolink:associated_with	biolink:Disease	knowledge_assertion	manual_agent	Source DISEASES provides Gene-Disease associations ingested from external, manually curated sources. The DISEASES record used to create this Translator edge does not report a specific type of gene-disease relationship for these 'associations' and is therefore represented using the relatively generic Biolink 'associated with' predicate.
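
To illustrate how the two edge types differ, here is a sketch that turns a parsed record into a plain edge dictionary. The dictionary layout and the names `EDGE_TEMPLATES` and `build_edge` are hypothetical and do not reflect the real parser's output format. The predicate identifiers are derived from the predicate names quoted in the UI Explanation column, and the knowledge level and agent type values are copied from the table. IDs are passed through unchanged.

```python
# One template per source file; values taken from the Edge Types table.
EDGE_TEMPLATES = {
    "textmining": {
        "predicate": "biolink:occurs_together_in_literature_with",
        "knowledge_level": "text_co_occurrence",
        "agent_type": "data_analysis_pipeline",
    },
    "knowledge": {
        "predicate": "biolink:associated_with",
        "knowledge_level": "knowledge_assertion",
        "agent_type": "manual_agent",
    },
}


def build_edge(record, file_kind):
    """Combine a record's gene and disease IDs with the template for its file."""
    template = EDGE_TEMPLATES[file_kind]
    return {
        "subject": record["gene_id"],
        "object": record["disease_id"],
        **template,
    }


# Made-up record for illustration.
edge = build_edge({"gene_id": "ENSP00000000233", "disease_id": "DOID:1612"}, "knowledge")
print(edge["predicate"])
```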

Node Types

Node Category	Source Identifier Types	Additional Notes
biolink:Gene	ENSEMBL	Source uses the ENSP (protein) identifiers from Ensembl
biolink:Protein	ENSEMBL	Source uses the ENSP (protein) identifiers from Ensembl
biolink:Disease	DOID
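
As a sketch of how source identifiers could be turned into CURIEs before normalization: ENSP IDs get an `ENSEMBL:` prefix, while DOIDs are assumed to be usable as they appear. The function names are hypothetical, the `ENSEMBL:` prefix is inferred from the identifier type in the table, and the assumption that DOIDs arrive in `DOID:<digits>` form is not confirmed by this guide.

```python
import re

# Human Ensembl protein IDs are 'ENSP' followed by 11 digits.
ENSP_PATTERN = re.compile(r"ENSP\d{11}")
# ASSUMPTION: disease IDs in the files already look like 'DOID:<digits>'.
DOID_PATTERN = re.compile(r"DOID:\d+")


def to_gene_curie(raw_id):
    """Return 'ENSEMBL:<ENSP id>', or None when raw_id is not an ENSP ID."""
    if ENSP_PATTERN.fullmatch(raw_id):
        return f"ENSEMBL:{raw_id}"
    return None


def to_disease_curie(raw_id):
    """Return raw_id when it is a DOID CURIE, otherwise None."""
    return raw_id if DOID_PATTERN.fullmatch(raw_id) else None


# Made-up identifiers for illustration.
print(to_gene_curie("ENSP00000000233"))
print(to_disease_curie("DOID:1612"))
```

Returning `None` for anything else lets the caller drop the record, in line with the ID-based filtering described under Filtered Content.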

Future Modeling Considerations

predicates: Revisit use of associated_with predicate for curated edges after we refactor the associated_with and/or gene-disease-relationship branches of the Biolink predicate hierarchy (if we reserve this predicate for statistically-based relationships, we may need to use related_to)

edge_properties: Revisit modeling of confidence score/levels and z-score if/when we refactor these parts of the Biolink Model

Provenance Information

Contributors:
- Colleen Xu - code author, data modeling
- Andrew Su - code support, domain expertise
- Matthew Brush - data modeling, domain expertise

Artifacts:
- GitHub Ticket: https://github.com/NCATSTranslator/Data-Ingest-Coordination-Working-Group/issues/13
- Notebooks for development work are currently in the parser code directory