Alliance of Genome Resources (AGR) Reference Ingest Guide
Source Information
InfoRes ID: infores:agrkb
Description: The Alliance of Genome Resources (AGR) is a consortium of model organism databases and the Gene Ontology Consortium that provides a unified view of gene function, biological processes, phenotypes, and disease associations across multiple model organisms. The AGR aggregates and harmonizes data from member databases including MGI (mouse), RGD (rat), SGD (yeast), WormBase (C. elegans), FlyBase (D. melanogaster), ZFIN (zebrafish), and Xenbase (X. laevis and X. tropicalis).
Citations:
- Alliance of Genome Resources Portal: enhancing a MOD resource. Nucleic Acids Res. 2020 Jan 8;48(D1):D650-D658. PMID: 31691787
- https://www.alliancegenome.org/
Data Access Locations:
- All downloads: https://www.alliancegenome.org/downloads
- FMS (File Management System): https://fms.alliancegenome.org/download/
Data Provision Mechanisms: file_download
Data Formats: json, tsv
Data Versioning and Releases: The Alliance releases data in numbered versions (e.g., 7.0.0, 7.1.0) approximately every 2-3 months. Each release is archived and accessible via the FMS. Release information is available at https://www.alliancegenome.org/downloads and version metadata can be retrieved from the API at https://www.alliancegenome.org/api/releaseInfo.
Ingest Information
Ingest Categories: aggregation_provider
Utility: The Alliance provides high-quality, manually curated gene-phenotype associations and gene expression data for mouse and rat, which are critical model organisms for understanding human biology and disease. These associations support Translator queries relating to gene function, phenotypes, and anatomical expression patterns.
Scope: This ingest focuses on mouse (NCBITaxon:10090) and rat (NCBITaxon:10116) data only, specifically: (1) gene-to-phenotype associations where the subject is a gene (not genotype or variant), and (2) gene-to-expression site associations. All nodes are created as minimal placeholders (id + category only) for downstream merging with richer entity data from other sources.
Relevant Files
| File Name | Location | Description |
|---|---|---|
| BGI_MGI.json.gz | https://fms.alliancegenome.org/download/BGI_MGI.json.gz | Mouse gene basic information (used for gene ID lookup to filter phenotype records) |
| BGI_RGD.json.gz | https://fms.alliancegenome.org/download/BGI_RGD.json.gz | Rat gene basic information (used for gene ID lookup to filter phenotype records) |
| PHENOTYPE_MGI.json.gz | https://fms.alliancegenome.org/download/PHENOTYPE_MGI.json.gz | Mouse phenotype associations (includes gene, genotype, and variant associations) |
| PHENOTYPE_RGD.json.gz | https://fms.alliancegenome.org/download/PHENOTYPE_RGD.json.gz | Rat phenotype associations (includes gene, genotype, and variant associations) |
| EXPRESSION_MGI.json.gz | https://fms.alliancegenome.org/download/EXPRESSION_MGI.json.gz | Mouse gene expression data |
| EXPRESSION_RGD.json.gz | https://fms.alliancegenome.org/download/EXPRESSION_RGD.json.gz | Rat gene expression data |
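
The six locations in the table follow one naming pattern: a data type, an underscore, a provider abbreviation, and the `.json.gz` suffix under the FMS download path. A minimal sketch of that pattern is shown here. The names `FMS_BASE`, `fms_url`, and `all_urls` are hypothetical helpers introduced for illustration and are not part of the ingest code.

```python
# Hypothetical helpers that rebuild the URLs listed in the table above.
FMS_BASE = "https://fms.alliancegenome.org/download"
DATA_TYPES = ("BGI", "PHENOTYPE", "EXPRESSION")
PROVIDERS = ("MGI", "RGD")  # mouse and rat, the only species in scope


def fms_url(data_type: str, provider: str) -> str:
    """Return the FMS download URL for one data type and provider."""
    return f"{FMS_BASE}/{data_type}_{provider}.json.gz"


def all_urls() -> list[str]:
    """Return the URLs for every data type and provider combination."""
    return [fms_url(d, p) for d in DATA_TYPES for p in PROVIDERS]


if __name__ == "__main__":
    for url in all_urls():
        print(url)
```
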
Included Content
| File Name | Included Records | Fields Used |
|---|---|---|
| BGI_MGI.json.gz | Gene IDs for mouse genes only, used for phenotype filtering. Data is loaded into DuckDB for fast lookup but no gene nodes are created from this file. | basicGeneticEntity.primaryId, basicGeneticEntity.taxonId |
| BGI_RGD.json.gz | Gene IDs for rat genes only, used for phenotype filtering. Data is loaded into DuckDB for fast lookup but no gene nodes are created from this file. | basicGeneticEntity.primaryId, basicGeneticEntity.taxonId |
| PHENOTYPE_MGI.json.gz | Only records where objectId corresponds to a gene (filtered using BGI gene lookup). Records where objectId is a genotype or variant are excluded. | objectId, phenotypeTermIdentifiers, evidence.publicationId, conditionRelations |
| PHENOTYPE_RGD.json.gz | Only records where objectId corresponds to a gene (filtered using BGI gene lookup). Records where objectId is a genotype or variant are excluded. | objectId, phenotypeTermIdentifiers, evidence.publicationId, conditionRelations |
| EXPRESSION_MGI.json.gz | All gene expression records | geneId, whereExpressed.anatomicalStructureTermId, whereExpressed.cellularComponentTermId, whenExpressed.stageTermId, evidence.publicationId, crossReference.id, assay |
| EXPRESSION_RGD.json.gz | All gene expression records | geneId, whereExpressed.anatomicalStructureTermId, whereExpressed.cellularComponentTermId, whenExpressed.stageTermId, evidence.publicationId, crossReference.id, assay |
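
As an illustration of how the expression fields listed above could be read, the sketch below pulls the gene and its expression sites out of a single record. It assumes each record is a JSON object whose nesting matches the dotted field paths in the table. The function name `expression_sites` and the identifiers in the sample record are hypothetical.

```python
from typing import Iterator


def expression_sites(record: dict) -> Iterator[tuple[str, str]]:
    """Yield (gene_id, site_id) pairs for one expression record.

    A record may carry an anatomical structure term, a cellular
    component term, both, or neither; one pair is yielded per term.
    """
    gene_id = record.get("geneId")
    where = record.get("whereExpressed") or {}
    if not gene_id:
        return
    for key in ("anatomicalStructureTermId", "cellularComponentTermId"):
        term = where.get(key)
        if term:
            yield gene_id, term


# Illustrative record; the identifiers are placeholders, not real data.
sample = {
    "geneId": "MGI:0000001",
    "whereExpressed": {
        "anatomicalStructureTermId": "EMAPA:0000001",
        "cellularComponentTermId": "GO:0000001",
    },
}

if __name__ == "__main__":
    for gene, site in expression_sites(sample):
        print(gene, site)
```
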
Filtered Content
| File Name | Filtered Records | Rationale |
|---|---|---|
| PHENOTYPE_MGI.json.gz | Phenotype associations where objectId is a genotype ID (e.g., MGI genotype IDs) or variant/allele ID | This ingest focuses only on direct gene-to-phenotype associations. Genotype and variant associations may be considered for future ingests but require different modeling patterns and are currently out of scope. |
| PHENOTYPE_RGD.json.gz | Phenotype associations where objectId is a genotype ID (e.g., RGD genotype IDs) or variant/allele ID | This ingest focuses only on direct gene-to-phenotype associations. Genotype and variant associations may be considered for future ingests but require different modeling patterns and are currently out of scope. |
Future Content Considerations
edge_content: Consider adding genotype-to-phenotype and variant-to-phenotype associations, which are currently filtered out. These would require extending the ingest to create proper genotype and variant nodes.
- Relevant files: PHENOTYPE_MGI.json.gz, PHENOTYPE_RGD.json.gz, AGM_MGI.json.gz, AGM_RGD.json.gz, VARIANT-ALLELE files
edge_content: Consider adding gene-to-disease associations from the DISEASE-ALLIANCE_COMBINED.tsv.gz file, which provides curated disease annotations.
- Relevant files: DISEASE-ALLIANCE_COMBINED.tsv.gz
edge_content: Consider ingesting data for additional species beyond mouse and rat (e.g., zebrafish, fly, worm, yeast).
- Relevant files: All Alliance files for other species
edge_property_content: Consider adding richer provenance information, for example a more granular mapping of evidence codes to knowledge_level and agent_type enum values.
node_property_content: Currently all nodes are minimal placeholders. Future iterations could enrich gene nodes with names, symbols, and other properties from the BGI files, or rely on downstream normalization for this enrichment.
Additional Notes: The ingest uses DuckDB to create an in-memory lookup table of gene IDs from BGI files, allowing efficient filtering of phenotype records by entity type without loading all genotype and allele data. This approach significantly reduces memory footprint and processing time compared to loading all entity types.
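
The lookup-and-filter idea described in this note can be sketched as follows. To keep the example runnable with the standard library alone, it uses `sqlite3` as a stand-in for DuckDB, so it shows the pattern and not the ingest's actual code. The function names, the table layout, and the sample identifiers are all assumptions.

```python
import sqlite3


def build_gene_lookup(bgi_records: list[dict]) -> sqlite3.Connection:
    """Load gene IDs from BGI records into an in-memory lookup table.

    sqlite3 stands in for DuckDB here; only the two fields named in the
    guide (primaryId, taxonId) are kept.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE genes (gene_id TEXT PRIMARY KEY, taxon_id TEXT)")
    rows = [
        (r["basicGeneticEntity"]["primaryId"], r["basicGeneticEntity"]["taxonId"])
        for r in bgi_records
    ]
    con.executemany("INSERT OR IGNORE INTO genes VALUES (?, ?)", rows)
    return con


def is_gene(con: sqlite3.Connection, object_id: str) -> bool:
    """Return True if object_id is a known gene ID."""
    row = con.execute(
        "SELECT 1 FROM genes WHERE gene_id = ?", (object_id,)
    ).fetchone()
    return row is not None


def gene_phenotype_records(con: sqlite3.Connection, records: list[dict]) -> list[dict]:
    """Keep only phenotype records whose objectId is a gene."""
    return [r for r in records if is_gene(con, r.get("objectId"))]


# Illustrative records; the identifiers are placeholders, not real data.
bgi = [{"basicGeneticEntity": {"primaryId": "MGI:0000001", "taxonId": "NCBITaxon:10090"}}]
phenotypes = [{"objectId": "MGI:0000001"}, {"objectId": "MGI:0000002"}]

if __name__ == "__main__":
    lookup = build_gene_lookup(bgi)
    print(gene_phenotype_records(lookup, phenotypes))
```

Because only gene IDs are loaded, a genotype or allele ID never matches the table and its phenotype record is dropped without any genotype or allele data being read.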
Target Information
Target InfoRes ID: infores:translator-agr-kgx
Edge Types
| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
|---|---|---|---|---|---|
| biolink:Gene | | biolink:PhenotypicFeature | knowledge_assertion | manual_agent | These associations represent manually curated phenotype observations for genes in mouse and rat model organisms. Each association links a gene to one or more phenotypic features observed when the gene is perturbed, knocked out, or otherwise modified. |
| biolink:Gene | | biolink:AnatomicalEntity, biolink:CellularComponent | knowledge_assertion | manual_agent | These associations represent gene expression patterns observed in specific anatomical locations or cellular components in mouse and rat. Expression data is typically derived from experimental assays and annotated with developmental stage and assay type information. |
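
To make the edge annotations concrete, here is a minimal sketch of one edge assembled as a plain dictionary. The table does not state a predicate, so the caller must supply one; `biolink:has_phenotype` in the usage example is an assumption. Using `infores:agrkb` as the `primary_knowledge_source` is also an assumption based on the Source InfoRes ID. The function name `make_edge` and the sample identifiers are hypothetical.

```python
def make_edge(subject: str, predicate: str, obj: str, publications=()) -> dict:
    """Build one edge with the fixed annotations from the Edge Types table."""
    return {
        "subject": subject,
        "predicate": predicate,  # supplied by the caller; not given in the table
        "object": obj,
        "knowledge_level": "knowledge_assertion",
        "agent_type": "manual_agent",
        # Assumption: the source InfoRes ID is recorded as the primary source.
        "primary_knowledge_source": "infores:agrkb",
        "publications": list(publications),
    }


if __name__ == "__main__":
    # Placeholder identifiers; the predicate is an assumed value.
    edge = make_edge(
        "MGI:0000001",
        "biolink:has_phenotype",
        "MP:0000001",
        publications=["PMID:0000000"],
    )
    print(edge)
```
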
Node Types
| Node Category | Source Identifier Types | Additional Notes |
|---|---|---|
| biolink:Gene | MGI (Mouse Genome Informatics), RGD (Rat Genome Database) | Gene nodes are created as minimal placeholders with only id and category. Richer gene information (names, symbols, descriptions, etc.) is expected to be merged from authoritative gene sources during downstream normalization. |
| biolink:PhenotypicFeature | MP (Mammalian Phenotype Ontology), HP (Human Phenotype Ontology) | Phenotype nodes are minimal placeholders. Full phenotype definitions, names, and hierarchical relationships should come from ontology sources during normalization. |
| biolink:AnatomicalEntity | UBERON (Uber-anatomy ontology), EMAPA (Mouse anatomical dictionary), MA (Mouse adult gross anatomy) | Anatomical entity nodes are minimal placeholders. Full anatomical information should come from ontology sources during normalization. |
| biolink:CellularComponent | GO (Gene Ontology - Cellular Component aspect) | Cellular component nodes are minimal placeholders. Full GO term information should come from ontology sources during normalization. |
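
A minimal placeholder node, as described in the table, carries only an id and a category. The sketch below derives the category from the CURIE prefix using the identifier types listed above. This prefix-based mapping is a simplification for illustration: the ingest may instead assign categories from the field a term came from. The names `PREFIX_TO_CATEGORY` and `placeholder_node` are hypothetical.

```python
# Prefix-to-category mapping taken from the Node Types table.
PREFIX_TO_CATEGORY = {
    "MGI": "biolink:Gene",
    "RGD": "biolink:Gene",
    "MP": "biolink:PhenotypicFeature",
    "HP": "biolink:PhenotypicFeature",
    "UBERON": "biolink:AnatomicalEntity",
    "EMAPA": "biolink:AnatomicalEntity",
    "MA": "biolink:AnatomicalEntity",
    "GO": "biolink:CellularComponent",
}


def placeholder_node(curie: str) -> dict:
    """Return a minimal node (id + category) for a CURIE.

    Assumes MGI and RGD identifiers have already been filtered to genes,
    since those prefixes are also used for genotypes and alleles.
    """
    prefix = curie.split(":", 1)[0]
    category = PREFIX_TO_CATEGORY.get(prefix)
    if category is None:
        raise ValueError(f"No category known for prefix {prefix!r}")
    return {"id": curie, "category": [category]}


if __name__ == "__main__":
    # Placeholder identifiers, not real data.
    print(placeholder_node("MGI:0000001"))
    print(placeholder_node("EMAPA:0000001"))
```
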
Future Modeling Considerations
node_properties: If genes are not adequately normalized/enriched downstream, consider adding basic gene properties (symbol, name) from BGI files to gene placeholder nodes.
qualifiers: Review whether additional qualifying information could be extracted from Alliance data to further contextualize associations (e.g., genetic background, sex, age).
edge_properties: Consider mapping evidence codes to more granular knowledge_level and agent_type enums if such mappings become available from Alliance.
Additional Notes: The minimal placeholder approach for nodes is intentional: it reduces redundancy and allows authoritative sources for each entity type to provide the canonical entity information during normalization. The Alliance ingest focuses on providing high-quality associations while leaving entity enrichment to specialized sources.
Provenance Information
Contributors:
- Kevin Schaper: code refactoring, optimization
- Matthew Brush: data modeling guidance
Artifacts:
- Related Alliance standalone ingest repository: https://github.com/monarch-initiative/alliance-ingest