Alliance of Genome Resources (AGR) Reference Ingest Guide

Source Information

InfoRes ID: infores:agrkb

Description: The Alliance of Genome Resources (AGR) is a consortium of model organism databases and the Gene Ontology Consortium that provides a unified view of gene function, biological processes, phenotypes, and disease associations across multiple model organisms. The AGR aggregates and harmonizes data from member databases including MGI (mouse), RGD (rat), SGD (yeast), WormBase (C. elegans), FlyBase (D. melanogaster), ZFIN (zebrafish), and Xenbase (X. laevis and X. tropicalis).

Citations:
- Alliance of Genome Resources Portal: enhancing a MOD resource. Nucleic Acids Res. 2020 Jan 8;48(D1):D650-D658. PMID: 31691787
- https://www.alliancegenome.org/

Data Access Locations:
- All downloads: https://www.alliancegenome.org/downloads
- FMS (File Management System): https://fms.alliancegenome.org/download/

Data Provision Mechanisms: file_download

Data Formats: json, tsv

Data Versioning and Releases: The Alliance releases data in numbered versions (e.g., 7.0.0, 7.1.0) approximately every 2-3 months. Each release is archived and accessible via the FMS. Release information is available at https://www.alliancegenome.org/downloads and version metadata can be retrieved from the API at https://www.alliancegenome.org/api/releaseInfo.
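
The version check described above could be sketched as follows. This is a minimal illustration, not the ingest's actual code: the helper names are hypothetical, and the `releaseVersion` key is an assumption about the payload returned by the releaseInfo endpoint that should be verified against a live response.

```python
import json
from urllib.request import urlopen

RELEASE_INFO_URL = "https://www.alliancegenome.org/api/releaseInfo"


def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '7.1.0' into a comparable tuple."""
    return tuple(int(part) for part in version.strip().split("."))


def is_newer(candidate: str, current: str) -> bool:
    """Return True when `candidate` is a later release than `current`."""
    return parse_version(candidate) > parse_version(current)


def fetch_release_version(url: str = RELEASE_INFO_URL) -> str:
    """Fetch release metadata from the API.

    The 'releaseVersion' key is an assumption, not confirmed by this guide.
    """
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload["releaseVersion"]
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating 7.10.0 as older than 7.9.0.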

Ingest Information

Ingest Categories: aggregation_provider

Utility: The Alliance provides high-quality, manually curated gene-phenotype associations and gene expression data for mouse and rat, which are critical model organisms for understanding human biology and disease. These associations support Translator queries relating to gene function, phenotypes, and anatomical expression patterns.

Scope: This ingest focuses on mouse (NCBITaxon:10090) and rat (NCBITaxon:10116) data only, specifically: (1) gene-to-phenotype associations where the subject is a gene (not genotype or variant), and (2) gene-to-expression site associations. All nodes are created as minimal placeholders (id + category only) for downstream merging with richer entity data from other sources.

Relevant Files

| File Name | Location | Description |
|---|---|---|
| BGI_MGI.json.gz | https://fms.alliancegenome.org/download/BGI_MGI.json.gz | Mouse gene basic information (used for gene ID lookup to filter phenotype records) |
| BGI_RGD.json.gz | https://fms.alliancegenome.org/download/BGI_RGD.json.gz | Rat gene basic information (used for gene ID lookup to filter phenotype records) |
| PHENOTYPE_MGI.json.gz | https://fms.alliancegenome.org/download/PHENOTYPE_MGI.json.gz | Mouse phenotype associations (includes gene, genotype, and variant associations) |
| PHENOTYPE_RGD.json.gz | https://fms.alliancegenome.org/download/PHENOTYPE_RGD.json.gz | Rat phenotype associations (includes gene, genotype, and variant associations) |
| EXPRESSION_MGI.json.gz | https://fms.alliancegenome.org/download/EXPRESSION_MGI.json.gz | Mouse gene expression data |
| EXPRESSION_RGD.json.gz | https://fms.alliancegenome.org/download/EXPRESSION_RGD.json.gz | Rat gene expression data |
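
The file names and locations listed above follow a regular `<TYPE>_<PROVIDER>.json.gz` pattern, so the download list can be generated rather than hard-coded. A minimal sketch, with hypothetical helper names:

```python
FMS_BASE_URL = "https://fms.alliancegenome.org/download/"
DATA_TYPES = ("BGI", "PHENOTYPE", "EXPRESSION")
PROVIDERS = ("MGI", "RGD")  # mouse and rat only, per the ingest scope


def fms_file_name(data_type: str, provider: str) -> str:
    """Build a file name such as 'BGI_MGI.json.gz'."""
    return f"{data_type}_{provider}.json.gz"


def fms_urls() -> dict:
    """Map every in-scope file name to its FMS download URL."""
    return {
        fms_file_name(data_type, provider): FMS_BASE_URL + fms_file_name(data_type, provider)
        for data_type in DATA_TYPES
        for provider in PROVIDERS
    }
```

Adding another species in a future iteration would then only require extending `PROVIDERS`, assuming its files follow the same naming pattern.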

Included Content

| File Name | Included Records | Fields Used |
|---|---|---|
| BGI_MGI.json.gz | Gene IDs for mouse genes only, used for phenotype filtering. Data is loaded into DuckDB for fast lookup but no gene nodes are created from this file. | basicGeneticEntity.primaryId, basicGeneticEntity.taxonId |
| BGI_RGD.json.gz | Gene IDs for rat genes only, used for phenotype filtering. Data is loaded into DuckDB for fast lookup but no gene nodes are created from this file. | basicGeneticEntity.primaryId, basicGeneticEntity.taxonId |
| PHENOTYPE_MGI.json.gz | Only records where objectId corresponds to a gene (filtered using BGI gene lookup). Records where objectId is a genotype or variant are excluded. | objectId, phenotypeTermIdentifiers, evidence.publicationId, conditionRelations |
| PHENOTYPE_RGD.json.gz | Only records where objectId corresponds to a gene (filtered using BGI gene lookup). Records where objectId is a genotype or variant are excluded. | objectId, phenotypeTermIdentifiers, evidence.publicationId, conditionRelations |
| EXPRESSION_MGI.json.gz | All gene expression records | geneId, whereExpressed.anatomicalStructureTermId, whereExpressed.cellularComponentTermId, whenExpressed.stageTermId, evidence.publicationId, crossReference.id, assay |
| EXPRESSION_RGD.json.gz | All gene expression records | geneId, whereExpressed.anatomicalStructureTermId, whereExpressed.cellularComponentTermId, whenExpressed.stageTermId, evidence.publicationId, crossReference.id, assay |
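
As an illustration of how the two BGI fields listed above might be read, the sketch below collects gene IDs for one taxon from a gzipped BGI file. It assumes the records sit under a top-level `data` key, which this guide does not state; the function name is hypothetical.

```python
import gzip
import json


def load_gene_ids(path: str, taxon_id: str) -> set:
    """Collect basicGeneticEntity.primaryId values for one taxon.

    Assumes the file is a JSON object whose records live under 'data'.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        document = json.load(handle)

    gene_ids = set()
    for record in document.get("data", []):
        entity = record.get("basicGeneticEntity", {})
        # Keep only records for the requested taxon that carry an ID.
        if entity.get("taxonId") == taxon_id and entity.get("primaryId"):
            gene_ids.add(entity["primaryId"])
    return gene_ids
```

For example, `load_gene_ids("BGI_MGI.json.gz", "NCBITaxon:10090")` would return the set of mouse gene IDs used for phenotype filtering.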

Filtered Content

| File Name | Filtered Records | Rationale |
|---|---|---|
| PHENOTYPE_MGI.json.gz | Phenotype associations where objectId is a genotype ID (e.g., MGI genotype IDs) or variant/allele ID | This ingest focuses only on direct gene-to-phenotype associations. Genotype and variant associations may be considered for future ingests but require different modeling patterns and are currently out of scope. |
| PHENOTYPE_RGD.json.gz | Phenotype associations where objectId is a genotype ID (e.g., RGD genotype IDs) or variant/allele ID | This ingest focuses only on direct gene-to-phenotype associations. Genotype and variant associations may be considered for future ingests but require different modeling patterns and are currently out of scope. |
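
The filtering rule above can be sketched as a small generator that keeps only phenotype records whose objectId is a known gene. It assumes each entry in `phenotypeTermIdentifiers` is an object with a `termId` key, which is not confirmed by this guide; the function name is hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple


def gene_phenotype_pairs(
    records: Iterable[dict], gene_ids: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (gene ID, phenotype term ID) pairs for gene-level records only."""
    for record in records:
        subject = record.get("objectId")
        if subject not in gene_ids:
            # Genotype or variant/allele association: out of scope.
            continue
        for term in record.get("phenotypeTermIdentifiers", []):
            term_id = term.get("termId")  # assumed key name
            if term_id:
                yield subject, term_id
```

A record with several phenotype terms yields one pair per term, so each pair can become its own edge.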

Future Content Considerations

edge_content: Consider adding genotype-to-phenotype and variant-to-phenotype associations, which are currently filtered out. These would require extending the ingest to create proper genotype and variant nodes.
- Relevant files: PHENOTYPE_MGI.json.gz, PHENOTYPE_RGD.json.gz, AGM_MGI.json.gz, AGM_RGD.json.gz, VARIANT-ALLELE files

edge_content: Consider adding gene-to-disease associations from the DISEASE-ALLIANCE_COMBINED.tsv.gz file, which provides curated disease annotations.
- Relevant files: DISEASE-ALLIANCE_COMBINED.tsv.gz

edge_content: Consider ingesting data for additional species beyond mouse and rat (e.g., zebrafish, fly, worm, yeast).
- Relevant files: All Alliance files for other species

edge_property_content: Consider adding richer provenance information, for example by mapping evidence codes to more granular knowledge_level and agent_type values.

node_property_content: Currently all nodes are minimal placeholders. Future iterations could enrich gene nodes with names, symbols, and other properties from the BGI files, or rely on downstream normalization for this enrichment.

Additional Notes: The ingest uses DuckDB to create an in-memory lookup table of gene IDs from BGI files, allowing efficient filtering of phenotype records by entity type without loading all genotype and allele data. This approach significantly reduces memory footprint and processing time compared to loading all entity types.
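
The lookup-table idea can be illustrated with the standard library. The sketch below deliberately substitutes an in-memory SQLite database for DuckDB so that it runs without extra dependencies; it is not the ingest's implementation, and the table, column, and function names are hypothetical.

```python
import sqlite3
from typing import Iterable


def build_gene_lookup(gene_ids: Iterable[str]) -> sqlite3.Connection:
    """Create an in-memory table holding only gene IDs (SQLite stand-in for DuckDB)."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE genes (gene_id TEXT PRIMARY KEY)")
    connection.executemany(
        "INSERT OR IGNORE INTO genes (gene_id) VALUES (?)",
        ((gene_id,) for gene_id in gene_ids),
    )
    return connection


def is_gene(connection: sqlite3.Connection, entity_id: str) -> bool:
    """Return True when `entity_id` is present in the gene lookup table."""
    row = connection.execute(
        "SELECT 1 FROM genes WHERE gene_id = ? LIMIT 1", (entity_id,)
    ).fetchone()
    return row is not None
```

Because only gene IDs are stored, an objectId that is absent from the table can be treated as a genotype or variant without those entity files ever being loaded.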

Target Information

Target InfoRes ID: infores:translator-agr-kgx

Edge Types

| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
|---|---|---|---|---|---|
| biolink:Gene | | biolink:PhenotypicFeature | knowledge_assertion | manual_agent | These associations represent manually curated phenotype observations for genes in mouse and rat model organisms. Each association links a gene to one or more phenotypic features observed when the gene is perturbed, knocked out, or otherwise modified. |
| biolink:Gene | | biolink:AnatomicalEntity, biolink:CellularComponent | knowledge_assertion | manual_agent | These associations represent gene expression patterns observed in specific anatomical locations or cellular components in mouse and rat. Expression data is typically derived from experimental assays and annotated with developmental stage and assay type information. |
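
A rough sketch of what an edge record for these two edge types might look like is shown below. The predicates are not shown in the table above, so `biolink:has_phenotype` and `biolink:expressed_in` are assumptions here, as are the function name and the flat dictionary layout.

```python
from typing import List, Optional

# Assumed predicates: the Edge Types table does not list them.
HAS_PHENOTYPE = "biolink:has_phenotype"
EXPRESSED_IN = "biolink:expressed_in"


def make_edge(
    subject: str,
    predicate: str,
    obj: str,
    publications: Optional[List[str]] = None,
) -> dict:
    """Build an edge carrying the knowledge level and agent type from the table."""
    return {
        "subject": subject,
        "predicate": predicate,
        "object": obj,
        "knowledge_level": "knowledge_assertion",
        "agent_type": "manual_agent",
        "publications": list(publications or []),
    }
```

For instance, `make_edge("MGI:0000001", HAS_PHENOTYPE, "MP:0000001", ["PMID:1"])` builds a gene-to-phenotype edge; the identifiers in this call are placeholders, not real records.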

Node Types

| Node Category | Source Identifier Types | Additional Notes |
|---|---|---|
| biolink:Gene | MGI (Mouse Genome Informatics), RGD (Rat Genome Database) | Gene nodes are created as minimal placeholders with only id and category. Richer gene information (names, symbols, descriptions, etc.) is expected to be merged from authoritative gene sources during downstream normalization. |
| biolink:PhenotypicFeature | MP (Mammalian Phenotype Ontology), HP (Human Phenotype Ontology) | Phenotype nodes are minimal placeholders. Full phenotype definitions, names, and hierarchical relationships should come from ontology sources during normalization. |
| biolink:AnatomicalEntity | UBERON (Uber-anatomy ontology), EMAPA (Mouse anatomical dictionary), MA (Mouse adult gross anatomy) | Anatomical entity nodes are minimal placeholders. Full anatomical information should come from ontology sources during normalization. |
| biolink:CellularComponent | GO (Gene Ontology - Cellular Component aspect) | Cellular component nodes are minimal placeholders. Full GO term information should come from ontology sources during normalization. |
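
The placeholder pattern described above (id plus category only) could be sketched as a prefix-to-category lookup. The mapping mirrors the table; the function name and the choice to store the category as a one-element list are assumptions. Note that MGI and RGD prefixes are also used for genotypes and alleles, so this sketch assumes the ID has already passed the gene filter.

```python
PREFIX_TO_CATEGORY = {
    "MGI": "biolink:Gene",
    "RGD": "biolink:Gene",
    "MP": "biolink:PhenotypicFeature",
    "HP": "biolink:PhenotypicFeature",
    "UBERON": "biolink:AnatomicalEntity",
    "EMAPA": "biolink:AnatomicalEntity",
    "MA": "biolink:AnatomicalEntity",
    "GO": "biolink:CellularComponent",
}


def placeholder_node(curie: str) -> dict:
    """Return a minimal node (id + category) for a CURIE with a known prefix."""
    prefix, separator, _local_id = curie.partition(":")
    if not separator or prefix not in PREFIX_TO_CATEGORY:
        raise ValueError(f"Unsupported identifier: {curie!r}")
    return {"id": curie, "category": [PREFIX_TO_CATEGORY[prefix]]}
```

Raising on unknown prefixes, rather than guessing a category, keeps unexpected identifier types visible during an ingest run.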

Future Modeling Considerations

node_properties: If genes are not adequately normalized/enriched downstream, consider adding basic gene properties (symbol, name) from BGI files to gene placeholder nodes.

qualifiers: Review whether additional qualifying information could be extracted from Alliance data to further contextualize associations (e.g., genetic background, sex, age).

edge_properties: Consider mapping evidence codes to more granular knowledge_level and agent_type enums if such mappings become available from Alliance.

Additional Notes: The minimal placeholder approach for nodes is intentional: it reduces redundancy and allows authoritative sources for each entity type to provide the canonical entity information during normalization. The Alliance ingest focuses on providing high-quality associations while leaving entity enrichment to specialized sources.

Provenance Information

Contributors:
- Kevin Schaper: code refactoring, optimization
- Matthew Brush: data modeling guidance

Artifacts:
- Related Alliance standalone ingest repository: https://github.com/monarch-initiative/alliance-ingest