NCBI Gene Reference Ingest Guide
Source Information
InfoRes ID: infores:ncbi-gene
Description: "NCBI Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide."
Citations: - National Center for Biotechnology Information (NCBI) [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1988] – [cited 2024 Dec]. Available from: https://www.ncbi.nlm.nih.gov/
Data Access Locations: - NCBI FTP Site: https://ftp.ncbi.nih.gov/gene/DATA/
Data Provision Mechanisms: file_download
Data Formats: tsv
Data Versioning and Releases: "NCBI Gene data is updated regularly. Gene information files are available as current snapshots from the FTP site."
Ingest Information
Ingest Categories: primary_provider
Utility: "NCBI Gene provides authoritative information on human, mouse, and rat genes, including official gene symbols, descriptions, and taxonomic information critical for gene normalization and identification in biomedical knowledge graphs."
Scope: "This ingest covers only human (NCBITaxon:9606), mouse (NCBITaxon:10090), and rat (NCBITaxon:10116) genes, all taken from the gene_info.gz file. Every gene record for these three species is included to provide comprehensive gene coverage."
Relevant Files
| File Name | Location | Description |
|---|---|---|
| gene_info.gz | https://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz | Gzip-compressed, tab-delimited file containing gene information for all species in NCBI Gene; the ingest filters it to human, mouse, and rat genes |
Included Content
| File Name | Included Records | Fields Used |
|---|---|---|
| gene_info.gz | Gene records for human (taxon ID 9606), mouse (taxon ID 10090), and rat (taxon ID 10116) only | tax_id, GeneID, Symbol, description, Full_name_from_nomenclature_authority |
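The selection described in the table above can be sketched in Python. This is a minimal illustration under stated assumptions, not the ingest's actual implementation: the function names `iter_gene_records` and `read_gene_info` are hypothetical, the file is assumed to have been downloaded to a local path, and the header line is assumed to begin with `#tax_id`, as it does in the published gene_info.gz.

```python
import csv
import gzip

# Taxon IDs in scope for this ingest: human, mouse, rat.
TAXA = {"9606", "10090", "10116"}
# Columns listed under "Fields Used".
FIELDS = [
    "tax_id",
    "GeneID",
    "Symbol",
    "description",
    "Full_name_from_nomenclature_authority",
]


def iter_gene_records(lines):
    """Yield a dict of the used fields for each in-scope gene row.

    `lines` is any iterable of text lines with the header line first.
    (Hypothetical helper, not part of any library.)
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # The header is written as a comment line, so drop the leading "#".
    header = [name.lstrip("#") for name in next(reader)]
    index = {name: header.index(name) for name in FIELDS}
    for row in reader:
        if len(row) < len(header):
            continue  # skip blank or truncated lines
        if row[index["tax_id"]] in TAXA:
            yield {name: row[i] for name, i in index.items()}


def read_gene_info(path):
    """Stream records from a local copy of gene_info.gz without unpacking it."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from iter_gene_records(handle)
```

Streaming row by row keeps memory use flat, which matters because the file covers every species in NCBI Gene while only three taxa are retained.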
Filtered Content
| File Name | Filtered Records | Rationale |
|---|---|---|
| gene_info.gz | All gene records for species other than human, mouse, and rat | This ingest is limited to human genes and the two primary mammalian model organisms (mouse and rat), which are the species of interest for biomedical research and translation. |
Future Content Considerations
node_content: Consider expanding to additional model organisms (e.g., zebrafish, fly, worm, yeast) if broader comparative genomics support is needed. Relevant files: gene_info.gz (additional taxon IDs)
node_property_content: Consider adding additional gene properties such as map_location, chromosome, synonyms, or dbXrefs for richer gene annotation
Additional Notes: The NCBI Gene ingest provides foundational gene nodes that serve as authoritative identifiers and basic properties for downstream integration with other biomedical data sources.
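If the synonyms and dbXrefs properties mentioned above were added, the corresponding gene_info columns would need to be split, since they hold pipe-delimited lists and use `-` for a missing value. The sketch below shows one possible way to do that; the helper names `split_multivalued` and `parse_dbxrefs` are hypothetical, and the ingest does not currently use these columns.

```python
def split_multivalued(value):
    """Split a pipe-delimited gene_info field; "-" marks a missing value."""
    if value in ("", "-"):
        return []
    return value.split("|")


def parse_dbxrefs(value):
    """Split dbXrefs entries into (database, identifier) pairs.

    Only the first colon separates the database name, because some
    identifiers contain a colon themselves (e.g. "HGNC:HGNC:5").
    """
    pairs = []
    for entry in split_multivalued(value):
        database, _, identifier = entry.partition(":")
        pairs.append((database, identifier))
    return pairs
```

Whether the identifier part should be kept as-is or normalized to a CURIE prefix is a modeling decision left open here.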
Target Information
Target InfoRes ID: infores:translator-ncbi-gene-kgx
Node Types
| Node Category | Source Identifier Types | Additional Notes |
|---|---|---|
| biolink:Gene | NCBIGene (NCBI Gene ID) | Gene nodes include official gene symbols, descriptions, full names, and taxonomic information. |
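As an illustration of the node type in the table above, the following sketch turns one filtered gene_info record into a KGX-style node dictionary. The property names (`name`, `description`, `full_name`, `in_taxon`) and the function name `gene_node` are assumptions chosen for the example; this guide does not specify the exact output property names.

```python
def gene_node(record):
    """Build a KGX-style node dict from one filtered gene_info record.

    `record` maps the gene_info column names used by this ingest to
    their string values. (Hypothetical helper for illustration.)
    """
    node = {
        "id": f"NCBIGene:{record['GeneID']}",
        "category": ["biolink:Gene"],
        "name": record["Symbol"],
        "in_taxon": [f"NCBITaxon:{record['tax_id']}"],
    }
    # gene_info writes "-" for a missing value; omit the property
    # rather than storing the placeholder.
    optional = (
        ("description", "description"),
        ("Full_name_from_nomenclature_authority", "full_name"),
    )
    for source_field, slot in optional:
        value = record.get(source_field, "-")
        if value != "-":
            node[slot] = value
    return node
```

Dropping placeholder values keeps downstream consumers from treating `-` as a real description or full name.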
Future Modeling Considerations
edge_content: Future iterations could include gene-to-gene orthology relationships or gene-to-pathway associations if such data becomes available from NCBI Gene
Additional Notes: This ingest creates comprehensive gene nodes for human, mouse, and rat that serve as foundational entities for biomedical knowledge graphs, providing authoritative gene identifiers and basic properties.
Provenance Information
Contributors: - Generated following NCBI Gene documentation and requirements for human, mouse, and rat gene coverage