EBI Gene2Phenotype Reference Ingest Guide
Source Information
InfoRes ID: infores:gene2phenotype
Description: EBI's Gene2Phenotype dataset contains high-quality gene-disease associations curated by UK disease domain experts and consultant clinical geneticists. It integrates data on genes, their variants, and related disorders. It is constructed by experts reviewing published literature, and it is primarily an inclusion list to allow targeted filtering of genome-wide data for diagnostic purposes. Each entry associates a gene with a disease, including a confidence level, allelic requirement and molecular mechanism.
Citations:
- https://doi.org/10.1186/s13073-024-01398-1
- https://www.nature.com/articles/s41467-019-10016-3
Data Access Locations:
- Latest data is provided at https://www.ebi.ac.uk/gene2phenotype/download (downloads created on-the-fly)
- Archived static releases are provided on the FTP site at https://ftp.ebi.ac.uk/pub/databases/gene2phenotype/G2P_data_downloads/
Data Provision Mechanisms: file_download
Data Formats: csv, other
Data Versioning and Releases:
- On-the-fly downloads: the creation date and the download date are the same, so either can be used for versioning. Note that the date in the filename may differ from the date in your timezone.
- Static releases: the creation date (shown on the FTP site and in folder and file names) can be used for versioning. Releases are cut and archived roughly every 1-2 months.
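As a rough illustration of date-based versioning, the sketch below reads a version string from a downloaded filename. It assumes the `[date]` part of `G2P_all_[date].csv` is written as `YYYY-MM-DD`, which this guide does not confirm; `version_from_filename` is a hypothetical helper name, not part of any existing ingest code.

```python
import re
from datetime import date

# Assumption: the date embedded in the filename is ISO formatted (YYYY-MM-DD).
_DATE_IN_NAME = re.compile(r"(\d{4}-\d{2}-\d{2})")


def version_from_filename(filename):
    """Return the date found in the filename as an ISO string, or None.

    Raises ValueError if the matched digits are not a real calendar date.
    """
    match = _DATE_IN_NAME.search(filename)
    if match is None:
        return None
    # Round-trip through date to reject impossible dates such as 2025-13-40.
    return date.fromisoformat(match.group(1)).isoformat()
```

Taking the date from the filename rather than from the local clock avoids the timezone mismatch noted above.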
Ingest Information
Ingest Categories: primary_knowledge_provider
Utility: EBI G2P associations are useful as edges in support paths for MVP1 ('what may treat disease X'), and in Pathfinder queries.
Relevant Files
| File Name | Location | Description |
|---|---|---|
| G2P_all_[date].csv | https://www.ebi.ac.uk/gene2phenotype/api/panel/all/download/ | Associations from all panels (disease categories) |
Included Content
| File Name | Included Records | Fields Used |
|---|---|---|
| G2P_all_[date].csv | Records where 'confidence' value is 'definitive', 'strong', or 'moderate' | g2p id, hgnc id, disease mim, disease MONDO, allelic requirement, confidence, molecular mechanism, publications, date of last review |
Filtered Content
| File Name | Filtered Records | Rationale |
|---|---|---|
| G2P_all_[date].csv | Records where 'confidence' value is 'limited', 'disputed', or 'refuted' | Evidence level not sufficient for inclusion |
| G2P_all_[date].csv | Records with no value in either the 'disease mim' or the 'disease MONDO' column | No IDs to use for disease nodes |
| G2P_all_[date].csv | Records with NodeNorm mapping failures for the node IDs | Failed normalization means that the node would not be connected to other data/nodes in Translator graphs |
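The inclusion and filtering rules in the tables above can be sketched as a single row-level check. This is a minimal illustration rather than the ingest's actual code: `keep_record` and `filter_records` are hypothetical names, the column headers are taken from the tables above, and the lowercase comparison of confidence values is an assumption about how they appear in the file. The NodeNorm filter is left out because it depends on an external normalization service.

```python
import csv
import io

# Confidence values listed under "Included Records" above.
INCLUDED_CONFIDENCE = {"definitive", "strong", "moderate"}


def keep_record(row):
    """Return True if a row passes the confidence and disease-ID filters."""
    confidence = (row.get("confidence") or "").strip().lower()
    if confidence not in INCLUDED_CONFIDENCE:
        return False  # 'limited', 'disputed', 'refuted', or unrecognized
    has_mim = bool((row.get("disease mim") or "").strip())
    has_mondo = bool((row.get("disease MONDO") or "").strip())
    # A record needs at least one disease identifier to build a disease node.
    return has_mim or has_mondo


def filter_records(csv_text):
    """Parse CSV text and return the rows that pass keep_record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if keep_record(row)]
```

Keeping the check as a pure function of one row makes it easy to test each filter rule in isolation before the normalization step.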
Future Content Considerations
edge_content: Revisit exclusion of 'disputed' and/or 'refuted' records once Translator can model/handle negation better
edge_property_content: The source provides additional edge-level information that could be included in future iterations: 'confidence' level values, once the modeling of confidence in Biolink is improved or refactored; variant information (the 'variant consequence' and 'variant types' columns, whose values map to SO terms); and (Matt's note) rich evidence and provenance metadata provided by the source (e.g. the types of experiments or methods used to determine the molecular mechanism, and supporting publications).
Target Information
Target InfoRes ID: infores:translator-gene2phenotype-kgx
Edge Types
| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
|---|---|---|---|---|---|
| biolink:Gene | | biolink:Disease | knowledge_assertion | manual_agent | EBI G2P curators manually determined through the evaluation of different types of evidence that variants of this gene of the indicated form (e.g. loss of function, gain of function, dominant negative) play a causal role in this disease. |
Node Types
| Node Category | Source Identifier Types | Additional Notes |
|---|---|---|
| biolink:Gene | HGNC | |
| biolink:Disease | OMIM, orphanet, MONDO | The 'disease mim' column is the source of OMIM and orphanet IDs. MONDO IDs from the 'disease MONDO' column are used only if the row has no value in the 'disease mim' column |
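The precedence described in the notes for biolink:Disease can be sketched as follows. `pick_disease_id` is a hypothetical helper; it returns the raw cell value together with the column it came from and makes no assumption about how the identifiers are prefixed or later turned into CURIEs.

```python
def pick_disease_id(row):
    """Choose the disease identifier for a row.

    Prefers the 'disease mim' column (OMIM and orphanet IDs) and falls back
    to 'disease MONDO' only when 'disease mim' is empty. Returns a
    (column_name, raw_value) tuple, or None if both columns are empty.
    """
    mim = (row.get("disease mim") or "").strip()
    if mim:
        return ("disease mim", mim)
    mondo = (row.get("disease MONDO") or "").strip()
    if mondo:
        return ("disease MONDO", mondo)
    return None
```

Returning the source column alongside the value lets downstream code apply the right prefix handling for each identifier type.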
Future Modeling Considerations
qualifiers: May want to revisit how we handle the 'molecular mechanism' and 'variant types' columns versus the biolink-model qualifier options
edge_properties: Revisit modeling of allelic_requirement (it currently uses a regex pattern to match HP ID syntax, rather than an enumerated list of permissible values)
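To illustrate what pattern-based validation of allelic_requirement looks like, here is a minimal sketch. The exact pattern used in biolink-model is not given in this guide; the regex below only encodes the usual HPO identifier shape (`HP:` followed by seven digits) and should be read as an assumption, as should the helper name `is_hp_id`.

```python
import re

# Assumption: HPO identifiers have the form "HP:" plus exactly seven digits.
HP_ID_PATTERN = re.compile(r"HP:\d{7}")


def is_hp_id(value):
    """Return True if the whole string looks like an HPO identifier."""
    return HP_ID_PATTERN.fullmatch(value) is not None
```

A pattern check like this accepts any syntactically valid HPO ID, including terms unrelated to inheritance, which is the gap an enumerated list of permissible values would close.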
Provenance Information
Contributors:
- Colleen Xu: code author, data modeling
- Andrew Su: domain expertise
- Sierra Moxon: domain expertise
- Matthew Brush: data modeling, domain expertise
Artifacts:
- Github Ticket on confidence 'limited' value: https://github.com/biolink/biolink-model/issues/1581
- PR on biolink allelic_requirement: https://github.com/biolink/biolink-model/pull/1576