Alliance of Genome Resources (AGR) Reference Ingest Guide

Source Information

InfoRes ID: infores:agrkb

Description: The Alliance of Genome Resources (AGR) is a consortium of model organism databases and the Gene Ontology Consortium that provides a unified view of gene function, biological processes, phenotypes, and disease associations across multiple model organisms. The AGR aggregates and harmonizes data from member databases including MGI (mouse), RGD (rat), SGD (yeast), WormBase (C. elegans), FlyBase (D. melanogaster), ZFIN (zebrafish), and Xenbase (X. laevis and X. tropicalis).

Citations:
- Alliance of Genome Resources Portal: enhancing a MOD resource. Nucleic Acids Res. 2020 Jan 8;48(D1):D650-D658. PMID: 31691787
- https://www.alliancegenome.org/

Data Access Locations:
- All downloads: https://www.alliancegenome.org/downloads
- FMS (File Management System): https://fms.alliancegenome.org/download/

Data Provision Mechanisms: file_download

Data Formats: json, tsv

Data Versioning and Releases: The Alliance releases data in numbered versions (e.g., 7.0.0, 7.1.0) approximately every 2-3 months. Each release is archived and accessible via the FMS. Release information is available at https://www.alliancegenome.org/downloads and version metadata can be retrieved from the API at https://www.alliancegenome.org/api/releaseInfo.
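
Because releases use dotted numeric versions, comparing them as plain strings can misorder them (for instance, "7.10.0" sorts before "7.2.0" lexically). The following is a minimal sketch of a numeric comparison. The helper names `parse_release` and `latest_release` are hypothetical and are not part of any Alliance API.

```python
from __future__ import annotations


def parse_release(version: str) -> tuple[int, ...]:
    """Turn a dotted release string such as '7.1.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def latest_release(versions: list[str]) -> str:
    """Return the highest release using numeric, not lexical, comparison."""
    return max(versions, key=parse_release)
```

An ingest could use a comparison like this to decide whether a newly published release differs from the one it last processed.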

Ingest Information

Ingest Categories: aggregation_provider

Utility: The Alliance provides high-quality, manually curated gene-phenotype associations and gene expression data for mouse and rat, which are critical model organisms for understanding human biology and disease. These associations support Translator queries relating to gene function, phenotypes, and anatomical expression patterns.

Scope: This ingest focuses on mouse (NCBITaxon:10090) and rat (NCBITaxon:10116) data only, specifically: (1) gene-to-phenotype associations where the subject is a gene (not genotype or variant), and (2) gene-to-expression site associations. All nodes are created as minimal placeholders (id + category only) for downstream merging with richer entity data from other sources.
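
The taxon restriction in the scope statement can be sketched as follows. This is an illustration only: it assumes each BGI record is a dict carrying the `basicGeneticEntity.primaryId` and `basicGeneticEntity.taxonId` fields named in this guide, and the function name `in_scope_gene_ids` is hypothetical.

```python
from __future__ import annotations

from typing import Iterable

# Taxa named in the Scope statement: mouse and rat.
IN_SCOPE_TAXA = {"NCBITaxon:10090", "NCBITaxon:10116"}


def in_scope_gene_ids(bgi_records: Iterable[dict]) -> set[str]:
    """Collect primaryId values for genes whose taxonId is in scope.

    Assumes BGI-shaped records (an assumption about the file layout).
    """
    gene_ids: set[str] = set()
    for record in bgi_records:
        entity = record.get("basicGeneticEntity", {})
        if entity.get("taxonId") in IN_SCOPE_TAXA:
            primary_id = entity.get("primaryId")
            if primary_id:
                gene_ids.add(primary_id)
    return gene_ids
```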

Relevant Files

File Name	Location	Description
BGI_MGI.json.gz	https://fms.alliancegenome.org/download/BGI_MGI.json.gz	Mouse gene basic information (used for gene ID lookup to filter phenotype records)
BGI_RGD.json.gz	https://fms.alliancegenome.org/download/BGI_RGD.json.gz	Rat gene basic information (used for gene ID lookup to filter phenotype records)
PHENOTYPE_MGI.json.gz	https://fms.alliancegenome.org/download/PHENOTYPE_MGI.json.gz	Mouse phenotype associations (includes gene, genotype, and variant associations)
PHENOTYPE_RGD.json.gz	https://fms.alliancegenome.org/download/PHENOTYPE_RGD.json.gz	Rat phenotype associations (includes gene, genotype, and variant associations)
EXPRESSION_MGI.json.gz	https://fms.alliancegenome.org/download/EXPRESSION_MGI.json.gz	Mouse gene expression data
EXPRESSION_RGD.json.gz	https://fms.alliancegenome.org/download/EXPRESSION_RGD.json.gz	Rat gene expression data
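
The six files above follow a regular `<DATATYPE>_<PROVIDER>.json.gz` naming pattern under the FMS download URL. A small sketch of generating the download list from that pattern is shown here. The constant and function names are hypothetical, and the sketch only builds URLs; it does not download anything.

```python
from __future__ import annotations

FMS_BASE_URL = "https://fms.alliancegenome.org/download/"
DATA_TYPES = ("BGI", "PHENOTYPE", "EXPRESSION")
PROVIDERS = ("MGI", "RGD")  # mouse and rat


def fms_urls() -> dict[str, str]:
    """Map each file name in the table above to its FMS download URL."""
    urls = {}
    for data_type in DATA_TYPES:
        for provider in PROVIDERS:
            file_name = f"{data_type}_{provider}.json.gz"
            urls[file_name] = FMS_BASE_URL + file_name
    return urls
```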

Included Content

File Name	Included Records	Fields Used
BGI_MGI.json.gz	Gene IDs for mouse genes only, used for phenotype filtering. Data is loaded into DuckDB for fast lookup but no gene nodes are created from this file.	basicGeneticEntity.primaryId, basicGeneticEntity.taxonId
BGI_RGD.json.gz	Gene IDs for rat genes only, used for phenotype filtering. Data is loaded into DuckDB for fast lookup but no gene nodes are created from this file.	basicGeneticEntity.primaryId, basicGeneticEntity.taxonId
PHENOTYPE_MGI.json.gz	Only records where objectId corresponds to a gene (filtered using BGI gene lookup). Records where objectId is a genotype or variant are excluded.	objectId, phenotypeTermIdentifiers, evidence.publicationId, conditionRelations
PHENOTYPE_RGD.json.gz	Only records where objectId corresponds to a gene (filtered using BGI gene lookup). Records where objectId is a genotype or variant are excluded.	objectId, phenotypeTermIdentifiers, evidence.publicationId, conditionRelations
EXPRESSION_MGI.json.gz	All gene expression records	geneId, whereExpressed.anatomicalStructureTermId, whereExpressed.cellularComponentTermId, whenExpressed.stageUberonSlimTerm.uberonTerm, evidence.publicationId, assay
EXPRESSION_RGD.json.gz	All gene expression records	geneId, whereExpressed.anatomicalStructureTermId, whereExpressed.cellularComponentTermId, whenExpressed.stageUberonSlimTerm.uberonTerm, evidence.publicationId, assay
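
The "Fields Used" column lists dotted paths into nested JSON records. One way to read such paths safely is sketched below for the expression files. It assumes each record is a nested dict in which any of these fields may be absent; `get_path` and `extract_expression_fields` are hypothetical helper names.

```python
from __future__ import annotations

from typing import Any

# Dotted paths copied from the "Fields Used" column for the EXPRESSION files.
EXPRESSION_FIELDS = (
    "geneId",
    "whereExpressed.anatomicalStructureTermId",
    "whereExpressed.cellularComponentTermId",
    "whenExpressed.stageUberonSlimTerm.uberonTerm",
    "evidence.publicationId",
    "assay",
)


def get_path(record: dict, dotted_path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if a step is missing."""
    current: Any = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_expression_fields(record: dict) -> dict[str, Any]:
    """Return the used fields of one expression record, keyed by dotted path."""
    return {path: get_path(record, path) for path in EXPRESSION_FIELDS}
```

Returning `None` for missing paths lets a caller decide whether a record has an anatomical structure, a cellular component, both, or neither before creating an edge.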

Filtered Content

File Name	Filtered Records	Rationale
PHENOTYPE_MGI.json.gz	Phenotype associations where objectId is a genotype ID (e.g., MGI genotype IDs) or variant/allele ID	This ingest focuses only on direct gene-to-phenotype associations. Genotype and variant associations may be considered for future ingests but require different modeling patterns and are currently out of scope.
PHENOTYPE_RGD.json.gz	Phenotype associations where objectId is a genotype ID (e.g., RGD genotype IDs) or variant/allele ID	This ingest focuses only on direct gene-to-phenotype associations. Genotype and variant associations may be considered for future ingests but require different modeling patterns and are currently out of scope.

Future Content Considerations

edge_content: Consider adding genotype-to-phenotype and variant-to-phenotype associations, which are currently filtered out. These would require extending the ingest to create proper genotype and variant nodes.
- Relevant files: PHENOTYPE_MGI.json.gz, PHENOTYPE_RGD.json.gz, AGM_MGI.json.gz, AGM_RGD.json.gz, VARIANT-ALLELE files

edge_content: Consider adding gene-to-disease associations from the DISEASE-ALLIANCE_COMBINED.tsv.gz file, which provides curated disease annotations.
- Relevant files: DISEASE-ALLIANCE_COMBINED.tsv.gz

edge_content: Consider ingesting data for additional species beyond mouse and rat (e.g., zebrafish, fly, worm, yeast).
- Relevant files: All Alliance files for other species

edge_property_content: Consider adding richer provenance information, for example a more granular mapping of evidence codes to knowledge_level and agent_type enum values.

node_property_content: Currently all nodes are minimal placeholders. Future iterations could enrich gene nodes with names, symbols, and other properties from the BGI files, or rely on downstream normalization for this enrichment.

Additional Notes: The ingest uses DuckDB to create an in-memory lookup table of gene IDs from BGI files, allowing efficient filtering of phenotype records by entity type without loading all genotype and allele data. This approach significantly reduces memory footprint and processing time compared to loading all entity types.
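
The lookup-table idea can be sketched as follows. To keep the example dependent only on the standard library, it swaps in Python's built-in `sqlite3` module for DuckDB, so it illustrates the pattern rather than the ingest's actual code. The table name `genes` and the function name are hypothetical.

```python
import sqlite3
from typing import Iterable, List


def filter_gene_phenotypes(
    gene_ids: Iterable[str], phenotype_records: Iterable[dict]
) -> List[dict]:
    """Keep phenotype records whose objectId is a known gene ID.

    sqlite3 stands in for DuckDB here (an assumption made for portability).
    """
    conn = sqlite3.connect(":memory:")
    try:
        # In-memory lookup table holding only gene IDs, not genotypes or alleles.
        conn.execute("CREATE TABLE genes (gene_id TEXT PRIMARY KEY)")
        conn.executemany(
            "INSERT OR IGNORE INTO genes (gene_id) VALUES (?)",
            ((gene_id,) for gene_id in gene_ids),
        )
        kept = []
        for record in phenotype_records:
            row = conn.execute(
                "SELECT 1 FROM genes WHERE gene_id = ?",
                (record.get("objectId"),),
            ).fetchone()
            if row is not None:
                kept.append(record)
        return kept
    finally:
        conn.close()
```

Records whose `objectId` is a genotype, variant, or allele ID never match the table and are dropped, which mirrors the exclusions listed under Filtered Content.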

Target Information

Target InfoRes ID: infores:translator-agr-kgx

Edge Types

Subject Categories	Predicate	Object Categories	Knowledge Level	Agent Type	UI Explanation
biolink:Gene		biolink:PhenotypicFeature	knowledge_assertion	manual_agent	These associations represent manually curated phenotype observations for genes in mouse and rat model organisms. Each association links a gene to one or more phenotypic features observed when the gene is perturbed, knocked out, or otherwise modified.
biolink:Gene		biolink:AnatomicalEntity, biolink:CellularComponent	knowledge_assertion	manual_agent	These associations represent gene expression patterns observed in specific anatomical locations or cellular components in mouse and rat. Expression data is typically derived from experimental assays and annotated with developmental stage and assay type information.
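
A rough sketch of how one edge from the table above might be assembled is given below. The dict layout, the `uuid`-based edge id, and the function name `make_edge` are assumptions for illustration. The predicate is left as a caller-supplied argument because the table above does not show a value for it.

```python
from __future__ import annotations

import uuid
from typing import Iterable


def make_edge(
    subject: str,
    predicate: str,  # supplied by the caller; not taken from the table above
    obj: str,
    publications: Iterable[str] | None = None,
) -> dict:
    """Build one association with the knowledge level and agent type listed above."""
    return {
        "id": f"uuid:{uuid.uuid4()}",  # hypothetical id scheme
        "subject": subject,
        "predicate": predicate,
        "object": obj,
        "knowledge_level": "knowledge_assertion",
        "agent_type": "manual_agent",
        "publications": list(publications or []),
    }
```

A caller might pass a Biolink predicate such as `biolink:has_phenotype` for phenotype edges or `biolink:expressed_in` for expression edges; both values are assumptions here, not values confirmed by this guide.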

Node Types

Node Category	Source Identifier Types	Additional Notes
biolink:Gene	MGI (Mouse Genome Informatics), RGD (Rat Genome Database)	Gene nodes are created as minimal placeholders with only id and category. Richer gene information (names, symbols, descriptions, etc.) is expected to be merged from authoritative gene sources during downstream normalization.
biolink:PhenotypicFeature	MP (Mammalian Phenotype Ontology)	Phenotype nodes are minimal placeholders. Full phenotype definitions, names, and hierarchical relationships should come from ontology sources during normalization.
biolink:AnatomicalEntity	EMAPA (Mouse Developmental Anatomy Ontology)	Anatomical entity nodes are minimal placeholders. Full anatomical information should come from ontology sources during normalization.
biolink:CellularComponent	GO (Gene Ontology - Cellular Component aspect)	Cellular component nodes are minimal placeholders. Full GO term information should come from ontology sources during normalization.
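
The placeholder nodes in the table above carry only an id and a category. The sketch below derives the category from the CURIE prefix. This is a simplification: MGI and RGD prefixes also identify non-gene entities such as genotypes, so it assumes the IDs have already been filtered to the types in the table. The mapping constant and function name are hypothetical, and storing the category as a list is also an assumption.

```python
from __future__ import annotations

# Prefix-to-category mapping taken from the Node Types table above.
PREFIX_TO_CATEGORY = {
    "MGI": "biolink:Gene",
    "RGD": "biolink:Gene",
    "MP": "biolink:PhenotypicFeature",
    "EMAPA": "biolink:AnatomicalEntity",
    "GO": "biolink:CellularComponent",
}


def placeholder_node(curie: str) -> dict:
    """Create a minimal node (id + category only) for an already-filtered CURIE."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or not local_id:
        raise ValueError(f"Not a CURIE: {curie!r}")
    category = PREFIX_TO_CATEGORY.get(prefix)
    if category is None:
        raise ValueError(f"No category known for prefix {prefix!r}")
    return {"id": curie, "category": [category]}
```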

Future Modeling Considerations

node_properties: If genes are not adequately normalized/enriched downstream, consider adding basic gene properties (symbol, name) from BGI files to gene placeholder nodes.

qualifiers: Review whether additional qualifying information could be extracted from Alliance data to further contextualize associations (e.g., genetic background, sex, age).

edge_properties: Consider mapping evidence codes to more granular knowledge_level and agent_type enums if such mappings become available from Alliance.

edge_content: Consider ingesting additional species beyond mouse and rat.

Additional Notes: The minimal placeholder approach for nodes is intentional: it reduces redundancy and allows authoritative sources for each entity type to provide the canonical entity information during normalization. The Alliance ingest focuses on providing high-quality associations while leaving entity enrichment to specialized sources.
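
To illustrate why minimal placeholders merge cleanly, here is a sketch of a downstream merge step. It is not the Translator normalization pipeline; the function name `merge_node` and the node layout (a dict with `id` and a list-valued `category`) are assumptions.

```python
def merge_node(placeholder: dict, authoritative: dict) -> dict:
    """Overlay an authoritative node's properties onto a placeholder with the same id."""
    if placeholder["id"] != authoritative["id"]:
        raise ValueError("Cannot merge nodes with different ids")
    merged = dict(placeholder)
    merged.update(authoritative)  # authoritative values win on conflict
    # Keep every category from both sides, without duplicates, in first-seen order.
    combined = placeholder.get("category", []) + authoritative.get("category", [])
    merged["category"] = list(dict.fromkeys(combined))
    return merged
```

Because the placeholder holds no names or symbols of its own, there is nothing for the authoritative record to conflict with beyond the category list.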

Provenance Information

Contributors:
- Kevin Schaper: code refactoring, optimization
- Matthew Brush: data modeling guidance

Artifacts:
- Related Alliance standalone ingest repository: https://github.com/monarch-initiative/alliance-ingest