Search & lookup terms#

Entities and ontologies can be complex with many different identifiers.

Here we show Bionty’s lookup model for species, genes, proteins and cell markers. You’ll see how to

  • access the reference table via .df()

  • look up an entity term via .lookup()

  • search for an entity term via .search()

import bionty as bt

.fields: fields of an ontology reference#

gene_bt = bt.Gene()

gene_bt


Gene
Species: human
Source: ensembl, release-109
#terms: 75124

📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
🧐 Gene.inspect(): check if identifiers are mappable
👽 Gene.map_synonyms(): map synonyms to standardized names
🔗 Gene.ontology: Pronto.Ontology object
gene_bt.fields
{'biotype',
 'description',
 'ensembl_gene_id',
 'hgnc_id',
 'ncbi_gene_id',
 'symbol',
 'synonyms'}

Fields can be accessed as attributes for autocompletion:

(You can pass them to the field parameter in any bionty function instead of strings.)

gene_bt.ncbi_gene_id
ncbi_gene_id

.df(): reference table#

Data scientists love DataFrames, and every entity has a reference table containing all the fields.

df = gene_bt.df()
df.head()
ensembl_gene_id symbol ncbi_gene_id hgnc_id biotype description synonyms
0 ENSG00000000003 TSPAN6 7105 HGNC:11858 protein_coding tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] TSPAN-6|T245|TM4SF6
1 ENSG00000000005 TNMD 64102 HGNC:17757 protein_coding tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757] tendin|ChM1L|TEM|myodulin|BRICD4
2 ENSG00000000419 DPM1 8813 HGNC:3005 protein_coding dolichyl-phosphate mannosyltransferase subunit... CDGIE|MPDS
3 ENSG00000000457 SCYL3 57147 HGNC:19285 protein_coding SCY1 like pseudokinase 3 [Source:HGNC Symbol;A... PACE1|PACE-1
4 ENSG00000000460 C1orf112 55732 HGNC:25565 protein_coding chromosome 1 open reading frame 112 [Source:HG... FLJ10706

To access the records for several genes at once, for example by gene symbol, select the corresponding rows with pandas:

df.set_index("symbol").loc[["LMNA", "TCF7", "BRCA1"]]
ensembl_gene_id ncbi_gene_id hgnc_id biotype description synonyms
symbol
LMNA ENSG00000160789 4000 HGNC:6636 protein_coding lamin A/C [Source:HGNC Symbol;Acc:HGNC:6636] PRO1|LMNL1|MADA|CMD1A|HGPS|LMN1|LGMD1B
TCF7 ENSG00000081059 6932 HGNC:11639 protein_coding transcription factor 7 [Source:HGNC Symbol;Acc... TCF-1
BRCA1 ENSG00000012048 672 HGNC:1100 protein_coding BRCA1 DNA repair associated [Source:HGNC Symbo... PPP1R53|RNF53|FANCS|BRCC1

.lookup(): Look up terms and records with autocompletion#

Terms can be looked up with autocompletion using a lookup object.

lookup = gene_bt.lookup()

The lookup object provides a dot accessor for normalized terms (lowercase, containing only alphanumeric characters and underscores):

lookup.tcf7
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', hgnc_id='HGNC:11639', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
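
The normalization rule can be sketched in a few lines. This is only an illustration of the rule as described here, not Bionty's actual implementation: `normalize_key` is a hypothetical helper, and collapsing runs of special characters into a single underscore is an assumption.

```python
import re


def normalize_key(term: str) -> str:
    """Hypothetical helper: turn a term into an attribute-friendly key."""
    # Replace every run of characters other than letters, digits and
    # underscores with a single underscore, then lowercase the result.
    return re.sub(r"[^0-9a-zA-Z_]+", "_", term).lower()


print(normalize_key("TCF7"))        # tcf7
print(normalize_key("HGNC:10478"))  # hgnc_10478
```

Under this rule a symbol such as TCF7 becomes `tcf7`, and an identifier such as HGNC:10478 becomes `hgnc_10478`, which is why both can be typed after the dot.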

To look up the exact original strings, convert the lookup object to a dict and use the bracket accessor [], which also supports autocompletion:

lookup_dict = lookup.dict()
lookup_dict["TCF7"]
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', hgnc_id='HGNC:11639', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')

By default, the name field is used to generate lookup keys (for genes this is the symbol, as in the examples above).

You can specify another field to look up:

lookup = gene_bt.lookup(gene_bt.hgnc_id)

If multiple entries are matched, they are returned as a list:

lookup.hgnc_10478
[Gene(ensembl_gene_id='ENSG00000227322', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000206289', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000235712', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000204231', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000231321', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000228333', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP')]
lookup_dict = lookup.dict()
lookup_dict["HGNC:10478"]
[Gene(ensembl_gene_id='ENSG00000227322', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000206289', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000235712', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000204231', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000231321', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000228333', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP')]
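
One way to picture why a key can return either a single record or a list is the following minimal sketch. It is not Bionty's code: `build_lookup` is a hypothetical helper, and the records are plain dicts reduced to three fields taken from the outputs above.

```python
from collections import defaultdict


def build_lookup(records, field):
    """Hypothetical helper: map each value of `field` to its record(s)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[field]].append(record)
    # Unwrap keys with a single match; keep a list where several
    # records share the same key.
    return {
        key: recs[0] if len(recs) == 1 else recs
        for key, recs in grouped.items()
    }


records = [
    {"ensembl_gene_id": "ENSG00000081059", "symbol": "TCF7", "hgnc_id": "HGNC:11639"},
    {"ensembl_gene_id": "ENSG00000227322", "symbol": "RXRB", "hgnc_id": "HGNC:10478"},
    {"ensembl_gene_id": "ENSG00000206289", "symbol": "RXRB", "hgnc_id": "HGNC:10478"},
]

by_hgnc = build_lookup(records, "hgnc_id")
print(by_hgnc["HGNC:11639"]["symbol"])  # TCF7
print(len(by_hgnc["HGNC:10478"]))       # 2
```

Code that consumes lookup results should therefore be prepared to handle both a single record and a list of records.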

.search(): Search a term against a field#

celltype_bt = bt.CellType()


celltype_bt.search("cytotoxic T cells").head(2)
ontology_id definition synonyms parents __ratio__
name
cytotoxic T cell CL:0000910 A Mature T Cell That Differentiated And Acquir... cytotoxic T-lymphocyte|cytotoxic T lymphocyte|... [CL:0000911] 96.969697
Tc2 cell CL:0000918 A Cd8-Positive, Alpha-Beta Positive T Cell Exp... Tc2 T lymphocyte|Tc2 T cell|Tc2 T-cell|Th2 CD8... [CL:0000908] 76.190476

By default, search also matches against each of the synonyms:

celltype_bt.search("P cell").head(2)
ontology_id definition synonyms parents __ratio__
name
nodal myocyte CL:0002072 A Specialized Cardiac Myocyte In The Sinoatria... cardiac pacemaker cell|myocytus nodalis|P cell [CL:0002086] 100.000000
pigmented ciliary epithelial cell CL:0002303 A Cell That Is Part Of Pigmented Ciliary Epith... PE cell [CL:0000529] 92.307692

You can turn off synonym matching with synonyms_field=None:

celltype_bt.search("P cell", synonyms_field=None).head(2)
ontology_id definition synonyms parents __ratio__
name
PP cell CL:0000696 A Cell That Stores And Secretes Pancreatic Pol... type F enteroendocrine cell [CL:0000167, CL:0000164] 92.307692
peg cell CL:4033014 A Small, Narrow, Peg-Shaped Epithelial Cell Wi... FTE stem-like cell|fallopian tube epithelial s... [CL:0000066] 85.714286
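
The effect of synonym matching can be reproduced with a small sketch. It assumes a term's score is the best score over its name and each pipe-separated synonym, and it uses `difflib` from the standard library as a stand-in scorer, so the numbers are not guaranteed to equal Bionty's `__ratio__` values. The `score` and `search` functions and the `use_synonyms` parameter are hypothetical names.

```python
from difflib import SequenceMatcher


def score(query, candidate):
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100 * SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def search(terms, query, use_synonyms=True):
    """Hypothetical sketch: rank terms by best score over name and synonyms."""
    ranked = []
    for term in terms:
        candidates = [term["name"]]
        if use_synonyms and term["synonyms"]:
            candidates += term["synonyms"].split("|")
        best = max(score(query, c) for c in candidates)
        ranked.append((best, term["name"]))
    # Highest score first; the sort is stable, so ties keep input order.
    return sorted(ranked, key=lambda pair: pair[0], reverse=True)


terms = [
    {"name": "PP cell", "synonyms": "type F enteroendocrine cell"},
    {"name": "nodal myocyte",
     "synonyms": "cardiac pacemaker cell|myocytus nodalis|P cell"},
]

print(search(terms, "P cell")[0][1])                      # nodal myocyte
print(search(terms, "P cell", use_synonyms=False)[0][1])  # PP cell
```

With synonyms enabled, "nodal myocyte" wins because one of its synonyms is an exact match; with synonyms disabled, only the names are compared and "PP cell" is the closest.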

Match against another field (default is “name”):

celltype_bt.search("CD8+ alpha beta T cells", field=celltype_bt.definition).head(2)
ontology_id name synonyms parents __ratio__
definition
A T Cell That Expresses An Alpha-Beta T Cell Receptor Complex. CL:0000789 alpha-beta T cell alpha-beta T lymphocyte|alpha-beta T-lymphocyt... [CL:0000084] 85.000000
A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor. CL:0000625 CD8-positive, alpha-beta T cell CD8-positive, alpha-beta T-cell|CD8-positive, ... [CL:0000791] 81.481481

Return only the top hit as a record instead of a DataFrame ranked by matching ratio:

celltype_bt.search("P cell", top_hit=True)
CellType(name='nodal myocyte', ontology_id='CL:0002072', definition='A Specialized Cardiac Myocyte In The Sinoatrial And Atrioventricular Nodes. The Cell Is Slender And Fusiform Confined To The Nodal Center, Circumferentially Arranged Around The Nodal Artery.', synonyms='cardiac pacemaker cell|myocytus nodalis|P cell', parents=array(['CL:0002086'], dtype=object))

Tied results are all returned as top hits:

celltype_bt.search("A cell", top_hit=True, synonyms_field=None)
[CellType(name='fat cell', ontology_id='CL:0000136', definition='A Fat-Storing Cell Found Mostly In The Abdominal Cavity And Subcutaneous Tissue Of Mammals. Fat Is Usually Stored In The Form Of Triglycerides.', synonyms='adipocyte|adipose cell', parents=array(['CL:0002320', 'CL:0000325'], dtype=object)),
 CellType(name='cap cell', ontology_id='CL:0000676', definition=None, synonyms=None, parents=array(['CL:0000548', 'CL:0000378'], dtype=object))]
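
The tie behavior can be illustrated with a short sketch: keep every name whose score equals the maximum, and unwrap the result when there is only one. As before, this is an assumption-laden illustration rather than Bionty's implementation; `top_hit` here is a hypothetical standalone function and `difflib` is a stand-in scorer.

```python
from difflib import SequenceMatcher


def top_hit(names, query):
    """Hypothetical sketch: return the best name, or a list of names if tied."""
    scores = {
        name: SequenceMatcher(None, query.lower(), name.lower()).ratio()
        for name in names
    }
    best = max(scores.values())
    hits = [name for name, value in scores.items() if value == best]
    # A unique winner is returned on its own; ties are returned together.
    return hits[0] if len(hits) == 1 else hits


names = ["fat cell", "cap cell", "PP cell", "peg cell"]
print(top_hit(names, "A cell"))  # ['fat cell', 'cap cell']
print(top_hit(names, "P cell"))  # PP cell
```

Because the return type depends on whether a tie occurred, callers that request only the top hit should check for a list before using the result.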