GO Annotations (GOA) Reference Ingest Guide

Source Information

InfoRes ID: infores:goa

Description: Each GO Annotation connects a gene or gene product to a Gene Ontology term that describes a molecular function it enables, a biological process in which it participates, or a cellular component in which it is located. Most are produced through rigorous manual curation of the literature, although some are assigned by automated pipelines based on evidence such as orthology or sequence similarity.

Citations:
- Data Archive: https://zenodo.org/records/10536401
- Publication: https://doi.org/10.1093/nar/gky1055

Data Access Locations:
- All downloads: https://geneontology.org/docs/download-go-annotations/
- Commonly studied organisms: https://current.geneontology.org/products/pages/downloads.html

Data Provision Mechanisms: file_download

Data Formats: tsv

Data Versioning and Releases:
- Release cadence: Approximately every four weeks, synchronized with UniProtKB.
- Versioning: By date; each GAF header includes a !Generated: YYYY-MM-DD line.
- Release notes: https://geneontology.org/docs/download-go-annotations/ and https://geneontology.org/docs/go-annotation-file-gaf-format-2.2/
- Release archive: https://release.geneontology.org/
- Formats: tsv in GAF format (17 columns)
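The date-based versioning described here can be read directly from the header of a GAF file. The following is a minimal sketch, not the ingest's actual code: the function name `gaf_release_date` is hypothetical, the sample header lines are made up for illustration, and the optional `date-` prefix in the pattern is an assumption added to tolerate spelling variants of the header key.

```python
import re
from typing import Iterable, Optional

# Matches "!Generated: YYYY-MM-DD" as described in this guide. The optional
# "date-" prefix and case-insensitivity are assumptions for header variants.
_GENERATED = re.compile(
    r"^!\s*(?:date-)?generated:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE
)


def gaf_release_date(lines: Iterable[str]) -> Optional[str]:
    """Return the YYYY-MM-DD date found in the GAF header, or None if absent."""
    for line in lines:
        if not line.startswith("!"):
            break  # header comment lines end at the first data row
        match = _GENERATED.match(line)
        if match:
            return match.group(1)
    return None


# Illustrative header lines only; not copied from a real release.
header = ["!gaf-version: 2.2", "!Generated: 2024-01-17", "UniProtKB\tP00000"]
print(gaf_release_date(header))  # prints 2024-01-17
```

Stopping at the first non-header line keeps the scan cheap on large files, since the date is expected near the top.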

Ingest Information

Ingest Categories: primary_knowledge_provider

Utility: GOA is a rich source of manually curated knowledge about gene function with broad relevance to all Translator queries and use cases.

Scope: This initial ingest of GOA covers molecular function, biological process, and cellular component annotations about human, mouse, and rat genes, including both manually curated and electronically inferred content. Only GAF files are ingested; the GPAD and GPI formats are not. Other species may be added in future updates to the ingest.

Relevant Files

File Name | Location | Description
goa_human.gaf | https://current.geneontology.org/products/pages/downloads.html | Human gene-product to GO term associations (GAF 2.2)
mgi.gaf | https://current.geneontology.org/products/pages/downloads.html | Mouse gene-product to GO term associations (GAF 2.2)
rgd.gaf | https://current.geneontology.org/products/pages/downloads.html | Rat gene-product to GO term associations (GAF 2.2)

Included Content

File Name | Included Records | Fields Used
goa_human.gaf | All records included | DB, DB Object ID, DB Object Symbol, Relation, GO ID, DB:Reference(s), Evidence Code, With (or) From, Aspect, DB Object Name, DB Object Type, Taxon
mgi.gaf | All records included | DB, DB Object ID, DB Object Symbol, Relation, GO ID, DB:Reference(s), Evidence Code, With (or) From, Aspect, DB Object Name, DB Object Type, Taxon
rgd.gaf | All records included | DB, DB Object ID, DB Object Symbol, Relation, GO ID, DB:Reference(s), Evidence Code, With (or) From, Aspect, DB Object Name, DB Object Type, Taxon
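As a rough illustration of how the fields listed above could be pulled from a GAF row, the sketch below splits a tab-separated line into the 17 GAF 2.2 columns and keeps only the 12 fields this ingest uses. It is not the ingest's actual parser: `parse_gaf_row` is a hypothetical helper, and the sample row uses placeholder values. Note that the GAF 2.2 specification names column 4 "Qualifier", which this guide calls "Relation".

```python
from typing import Dict, Optional

# The 17 GAF 2.2 columns in file order. Column 4 is "Qualifier" in the GAF
# specification; it is named "Relation" here to match this guide.
GAF_COLUMNS = [
    "DB", "DB Object ID", "DB Object Symbol", "Relation", "GO ID",
    "DB:Reference(s)", "Evidence Code", "With (or) From", "Aspect",
    "DB Object Name", "DB Object Synonym", "DB Object Type", "Taxon",
    "Date", "Assigned By", "Annotation Extension", "Gene Product Form ID",
]

# The subset of columns listed under "Fields Used" in this guide.
FIELDS_USED = [
    "DB", "DB Object ID", "DB Object Symbol", "Relation", "GO ID",
    "DB:Reference(s)", "Evidence Code", "With (or) From", "Aspect",
    "DB Object Name", "DB Object Type", "Taxon",
]


def parse_gaf_row(line: str) -> Optional[Dict[str, str]]:
    """Parse one GAF line; return None for header comments and blank lines."""
    if not line.strip() or line.startswith("!"):
        return None
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(GAF_COLUMNS):
        raise ValueError(f"expected {len(GAF_COLUMNS)} columns, got {len(values)}")
    record = dict(zip(GAF_COLUMNS, values))
    return {name: record[name] for name in FIELDS_USED}


# Placeholder row for illustration only; the IDs are not real annotations.
sample = "\t".join([
    "UniProtKB", "P00000", "GENE1", "enables", "GO:0000000", "PMID:0000000",
    "IDA", "", "F", "Example protein", "", "protein", "taxon:9606",
    "20240101", "ExampleDB", "", "",
])
row = parse_gaf_row(sample)
print(row["DB Object ID"], row["Relation"], row["GO ID"])
```

Raising on an unexpected column count is a design choice in this sketch; a production ingest might instead log and skip malformed rows.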

Future Content Considerations

edge_content: Consider ingesting Gene/Product to GO Term annotations from other taxa (beyond human, mouse, and rat).

edge_content: Consider adding qualifying information (as may be found in the Annotation Extensions or With (or) From columns) to existing and new Gene/Product to GO Term annotations. The With (or) From column in particular could show the provenance of inferred associations that carry evidence codes such as IBA or IEA.

edge_content: Consider ingesting associations between two GO Terms, per the specification at https://wiki.geneontology.org/index.php/Annotation_Relations#Standard_Annotation:_Annotation_Extension_Relations

node_property_content: GOA generally adds some curatorial value when pulling annotations from sources like MODs, PANTHER, etc. At minimum this means ID mapping and quality checking, but often more. This is why we are OK with making GOA the primary source. However, we should attribute MODs, PANTHER, etc. as supporting data sources in these cases, if we can find this information in the data.

node_property_content: To be determined whether we will bring in taxon information about gene/gene product nodes from GOA, or rely on other gene property authorities for this information (e.g. ncbigene).

Target Information

Target InfoRes ID: infores:translator-goa-kgx

Edge Types

Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation
biolink:Gene, biolink:Protein, biolink:MacromolecularComplex, biolink:RNAProduct | | biolink:MolecularActivity | knowledge_assertion, prediction, not_provided | manual_agent, automated_agent, manual_validation_of_automated_agent, not_provided | A GO Annotation uses the 'enables' predicate when a gene product is solely capable of executing the reported function.
biolink:MacromolecularComplex | | biolink:MolecularActivity | knowledge_assertion, prediction, not_provided | manual_agent, automated_agent, manual_validation_of_automated_agent, not_provided | A GO Annotation uses the 'contributes_to' predicate when a gene product is required as part of a macromolecular complex for executing the reported function.
biolink:Gene, biolink:Protein, biolink:MacromolecularComplex, biolink:RNAProduct | | biolink:BiologicalProcess | knowledge_assertion, prediction, not_provided | manual_agent, automated_agent, manual_validation_of_automated_agent, not_provided | A GO Annotation uses the 'involved_in' predicate when a gene product's molecular function plays an integral role in the reported biological process.
biolink:Gene, biolink:Protein, biolink:MacromolecularComplex, biolink:RNAProduct | | biolink:BiologicalProcess | knowledge_assertion, prediction, not_provided | manual_agent, automated_agent, manual_validation_of_automated_agent, not_provided | A GO Annotation uses this upstream predicate family when a gene product acts upstream of a biological process, with predicate variants capturing mechanism/timing and effect direction.
biolink:Gene, biolink:Protein, biolink:MacromolecularComplex, biolink:RNAProduct | | biolink:CellularComponent | knowledge_assertion, prediction, not_provided | manual_agent, automated_agent, manual_validation_of_automated_agent, not_provided | A GO Annotation uses this cellular component predicate family for active localization, static localization, or transient co-localization in a cellular component.
biolink:Gene, biolink:Protein, biolink:RNAProduct | | biolink:MacromolecularComplex | knowledge_assertion, prediction, not_provided | manual_agent, automated_agent, manual_validation_of_automated_agent, not_provided | A GO Annotation uses the 'part_of' predicate when a gene product is a component of the reported macromolecular complex.

Node Types

Node Category | Source Identifier Types | Additional Notes
biolink:Gene | MGI, RGD |
biolink:Protein | UniProtKB accession |
biolink:MacromolecularComplex | ComplexPortal IDs |
biolink:RNAProduct | RNAcentral IDs |
biolink:BiologicalProcess | Gene Ontology IDs (Aspect P) |
biolink:MolecularActivity | Gene Ontology IDs (Aspect F) |
biolink:CellularComponent | Gene Ontology IDs (Aspect C) |
biolink:Pathway | Gene Ontology IDs (Aspect P, re-categorized by normalization) |
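The node categories in this table can be thought of as two lookups: one keyed on the identifier source of the annotated entity, the other on the GAF Aspect code of the GO term. The sketch below illustrates that idea under stated assumptions. The dictionary and function names are hypothetical, and the prefix strings are assumed to match the values found in the GAF DB column. It does not model the re-categorization of some Aspect P terms to biolink:Pathway during normalization, nor the edges whose object is a biolink:MacromolecularComplex.

```python
from typing import Optional

# Subject category by identifier source, taken from the Node Types table.
# Assumption: these strings match the values in the GAF "DB" column.
SUBJECT_CATEGORY_BY_PREFIX = {
    "MGI": "biolink:Gene",
    "RGD": "biolink:Gene",
    "UniProtKB": "biolink:Protein",
    "ComplexPortal": "biolink:MacromolecularComplex",
    "RNAcentral": "biolink:RNAProduct",
}

# Object category by GAF Aspect code, taken from the Node Types table.
OBJECT_CATEGORY_BY_ASPECT = {
    "F": "biolink:MolecularActivity",
    "P": "biolink:BiologicalProcess",
    "C": "biolink:CellularComponent",
}


def subject_category(db: str) -> Optional[str]:
    """Return the Biolink category for a GAF DB value, or None if unmapped."""
    return SUBJECT_CATEGORY_BY_PREFIX.get(db)


def object_category(aspect: str) -> Optional[str]:
    """Return the Biolink category for a GAF Aspect code, or None if unmapped."""
    return OBJECT_CATEGORY_BY_ASPECT.get(aspect)


print(subject_category("UniProtKB"), object_category("F"))
# prints biolink:Protein biolink:MolecularActivity
```

Returning None for unmapped values leaves the caller to decide whether to skip the record or report it.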

Future Modeling Considerations

qualifiers: Introduce qualifier-based representation if/when we decide to ingest any qualifying context on GO annotations

node_properties: If we end up ingesting taxon info for gene nodes, we may have to update the Biolink Model to support this (currently in_taxon is represented as a predicate, and species_context_qualifier as an edge property - but there is no taxon node property)

edge_content: AB: Completed initial edge type consolidation in this QA pass for upstream and cellular component predicate families; monitor whether additional grouping is needed after next production validation run.

Provenance Information

Contributors:
- Adilbek Bazarkulov: code author
- Evan Morris: code support
- Adilbek Bazarkulov: code support, domain expertise
- Sierra Moxon: data modeling, domain expertise
- Matthew Brush: data modeling, domain expertise

Artifacts:
- Ingest Survey: https://docs.google.com/spreadsheets/d/18wGm2a0W1oIXm7cn8TZ99xn_aAMJ91SgAsuPDcV-lII/edit?gid=325339947#gid=325339947
- Ingest Ticket: https://github.com/NCATSTranslator/Data-Ingest-Coordination-Working-Group/issues/8