NCBI Gene Reference Ingest Guide

Source Information

InfoRes ID: infores:ncbi-gene

Description: "NCBI Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide."

Citations: National Center for Biotechnology Information (NCBI) [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1988] – [cited 2024 Dec]. Available from: https://www.ncbi.nlm.nih.gov/

Data Access Locations: NCBI FTP Site: https://ftp.ncbi.nih.gov/gene/DATA/

Data Provision Mechanisms: file_download

Data Formats: tsv

Data Versioning and Releases: "NCBI Gene data is updated regularly. Gene information files are available as current snapshots from the FTP site."

Ingest Information

Ingest Categories: primary_provider

Utility: "NCBI Gene provides authoritative gene information for human, mouse, and rat genes, including official gene symbols, descriptions, and taxonomic information critical for gene normalization and identification in biomedical knowledge graphs."

Scope: "This ingest covers only human (NCBITaxon:9606), mouse (NCBITaxon:10090), and rat (NCBITaxon:10116) genes from the gene_info.gz file. All gene records for these three species are included to provide comprehensive gene coverage."

Relevant Files

| File Name | Location | Description |
| --- | --- | --- |
| gene_info.gz | https://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz | Tab-delimited file containing gene information for all species in NCBI Gene; this ingest filters it to human, mouse, and rat genes |

Included Content

| File Name | Included Records | Fields Used |
| --- | --- | --- |
| gene_info.gz | Gene records for human (taxon ID 9606), mouse (taxon ID 10090), and rat (taxon ID 10116) only | tax_id, GeneID, Symbol, description, Full_name_from_nomenclature_authority |
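The record filter and field selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the ingest's actual implementation: the function names `iter_target_genes` and `read_gene_info` are hypothetical, and the sketch assumes the file's header line begins with `#tax_id` and that the file has already been downloaded locally.

```python
import csv
import gzip

# Taxon IDs named in the scope statement: human, mouse, rat.
TARGET_TAXA = {"9606", "10090", "10116"}

# Columns listed under "Fields Used".
FIELDS_USED = [
    "tax_id",
    "GeneID",
    "Symbol",
    "description",
    "Full_name_from_nomenclature_authority",
]


def iter_target_genes(lines):
    """Yield a dict of the used fields for each human, mouse, or rat row."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Assumption: the header line starts with "#tax_id", so strip the "#".
    header = [name.lstrip("#") for name in next(reader)]
    index = {name: header.index(name) for name in FIELDS_USED}
    for row in reader:
        # Skip blank lines and rows for any other species.
        if row and row[index["tax_id"]] in TARGET_TAXA:
            yield {name: row[i] for name, i in index.items()}


def read_gene_info(path):
    """Stream a local copy of gene_info.gz (path is caller-supplied)."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from iter_target_genes(handle)
```

Streaming the file row by row avoids loading every species into memory, since the three target species are only a small fraction of the full file.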

Filtered Content

| File Name | Filtered Records | Rationale |
| --- | --- | --- |
| gene_info.gz | All gene records for species other than human, mouse, and rat | This ingest focuses specifically on human genes and on mouse and rat, the primary mammalian model organisms of interest for biomedical research and translation. |

Future Content Considerations

node_content: Consider expanding to include additional model organisms (e.g., zebrafish, fly, worm, yeast) if needed for broader comparative genomics support. Relevant files: gene_info.gz (additional taxon IDs)

node_property_content: Consider adding additional gene properties such as map_location, chromosome, synonyms, or dbXrefs for richer gene annotation
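If the synonym and cross-reference properties were added, their multi-valued columns would need to be split. The sketch below assumes the gene_info.gz convention that multi-valued fields are pipe-delimited, that a lone `-` marks a missing value, and that each dbXrefs entry has the form `database:identifier`. The helper names are hypothetical.

```python
def split_multivalued(value):
    """Split a pipe-delimited field; "-" or an empty string means no values."""
    return [] if value in ("", "-") else value.split("|")


def parse_dbxrefs(value):
    """Group dbXrefs entries by database name.

    Only the first colon separates the database from the identifier,
    so identifiers that themselves contain a colon are kept whole.
    """
    xrefs = {}
    for entry in split_multivalued(value):
        database, _, identifier = entry.partition(":")
        xrefs.setdefault(database, []).append(identifier)
    return xrefs
```

Splitting on the first colon only matters for entries whose identifier carries its own prefix, such as `HGNC:HGNC:5`.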

Additional Notes: The NCBI Gene ingest provides foundational gene nodes that serve as authoritative identifiers and basic properties for downstream integration with other biomedical data sources.

Target Information

Target InfoRes ID: infores:translator-ncbi-gene-kgx

Node Types

| Node Category | Source Identifier Types | Additional Notes |
| --- | --- | --- |
| biolink:Gene | NCBIGene (NCBI Gene ID) | Gene nodes include official gene symbols, descriptions, full names, and taxonomic information. |
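As an illustration of how one filtered row might become a `biolink:Gene` node, the sketch below builds a plain dictionary. The `NCBIGene:` and `NCBITaxon:` prefixes come from this guide; the property names (`name`, `description`, `full_name`, `in_taxon`) and the function `gene_row_to_node` are assumptions made for the example, not a statement of the ingest's actual output schema.

```python
def gene_row_to_node(row):
    """Map one gene_info row (a dict of strings) to a node dictionary."""

    def clean(value):
        # Assumption: gene_info.gz writes "-" for a missing value.
        return None if value in ("", "-") else value

    node = {
        "id": f"NCBIGene:{row['GeneID']}",
        "category": ["biolink:Gene"],
        # Property names below are assumed for illustration.
        "name": clean(row["Symbol"]),
        "description": clean(row["description"]),
        "full_name": clean(row["Full_name_from_nomenclature_authority"]),
        "in_taxon": [f"NCBITaxon:{row['tax_id']}"],
    }
    # Drop properties that have no value rather than emitting placeholders.
    return {key: value for key, value in node.items() if value is not None}
```

Omitting empty properties keeps placeholder strings such as `-` out of the graph, so downstream consumers do not mistake them for real names.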

Future Modeling Considerations

edge_content: Future iterations could include gene-to-gene orthology relationships or gene-to-pathway associations if such data becomes available from NCBI Gene

Additional Notes: This ingest creates comprehensive gene nodes for human, mouse, and rat that serve as foundational entities for biomedical knowledge graphs, providing authoritative gene identifiers and basic properties.

Provenance Information

Contributors: Generated following NCBI Gene documentation and requirements for human, mouse, and rat gene coverage