Fergal J Martin, M Ridwan Amode, Alisha Aneja, Olanrewaju Austine-Orimoloye, Andrey G Azov, If Barnes, Arne Becker, Ruth Bennett, Andrew Berry, Jyothish Bhai, Simarpreet Kaur Bhurji, Alexandra Bignell, Sanjay Boddu, Paulo R Branco Lins, Lucy Brooks, Shashank Budhanuru Ramaraju, Mehrnaz Charkhchi, Alexander Cockburn, Luca Da Rin Fiorretto, Claire Davidson, Kamalkumar Dodiya, Sarah Donaldson, Bilal El Houdaigui, Tamara El Naboulsi, Reham Fatima, Carlos Garcia Giron, Thiago Genez, Gurpreet S Ghattaoraya, Jose Gonzalez Martinez, Cristi Guijarro, Matthew Hardy, Zoe Hollis, Thibaut Hourlier, Toby Hunt, Mike Kay, Vinay Kaykala, Tuan Le, Diana Lemos, Diego Marques-Coelho, José Carlos Marugán, Gabriela Alejandra Merino, Louisse Paola Mirabueno, Aleena Mushtaq, Syed Nakib Hossain, Denye N Ogeh, Manoj Pandian Sakthivel, Anne Parker, Malcolm Perry, Ivana Piližota, Irina Prosovetskaia, José G Pérez-Silva, Ahamed Imran Abdul Salam, Nuno Saraiva-Agostinho, Helen Schuilenburg, Dan Sheppard, Swati Sinha, Botond Sipos, William Stark, Emily Steed, Ranjit Sukumaran, Dulika Sumathipala, Marie-Marthe Suner, Likhitha Surapaneni, Kyösti Sutinen, Michal Szpak, Francesca Floriana Tricomi, David Urbina-Gómez, Andres Veidenberg, Thomas A Walsh, Brandon Walts, Elizabeth Wass, Natalie Willhoft, Jamie Allen, Jorge Alvarez-Jarreta, Marc Chakiachvili, Bethany Flint, Stefano Giorgetti, Leanne Haggerty, Garth R Ilsley, Jane E Loveland, Benjamin Moore, Jonathan M Mudge, John Tate, David Thybert, Stephen J Trevanion, Andrea Winterbottom, Adam Frankish, Sarah E Hunt, Magali Ruffier, Fiona Cunningham, Sarah Dyer, Robert D Finn, Kevin L Howe, Peter W Harrison, Andrew D Yates, and Paul Flicek Ensembl 2023 Nucleic Acids Res. 2023, 51(D1):D933-D941 PMID: 36318249 doi:10.1093/nar/gkac958
Kanehisa, M. and Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes Nucleic Acids Res. 28, 27-30 (2000) PMID: 10592173 doi: 10.1093/nar/28.1.27
Kanehisa, M. Toward understanding the origin and evolution of cellular organisms Protein Sci. 28, 1947-1951 (2019) PMID: 31441146 doi: 10.1002/pro.3715
Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. and Ishiguro-Watanabe, M. KEGG for taxonomy-based analysis of pathways and genomes Nucleic Acids Res. 51, D587-D592 (2023) PMID: 36300620 doi: 10.1093/nar/gkac963
Malkov, Y. A., & Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs IEEE Transactions on Pattern Analysis and Machine Intelligence 42(4), 824–836 (2020) doi: 10.1109/tpami.2018.2889473
Ukkonen, E. On-line construction of suffix trees Algorithmica, 14(3), 249–260 (1995) doi: 10.1007/bf01206331
EnsEMBL search generates a CSV file (comma delimited) of EnsEMBL identifier entries.
Usage: ensembl_search [OPTIONS] --certificate --index --file --output
Options:
-c, --certificate A DER-encoded X.509 file
-i, --index A column index to take the set of values
-d, --delimiter The delimiter character that separates each field value (e.g. ',', ';', '\t')
-f, --file The flat file (e.g. CSV, TSV) file path to parse for identifiers
-n, --no-headers A flag that indicates no header row is present
-O, --output The output file name and path to write a CSV file
-h, --help Print help information
-V, --version Print version information
Given the following file /home/user/data/csv/gene_expressions.csv with the following entries:
| Gene name | Gene | Tissue region | Transcripts per million |
|---|---|---|---|
| CLHC1 | ENSG00000162994 | cerebral cortex | 0.9 |
| CLHC1 | ENSG00000162994 | basal ganglia | 1.4 |
| CLHC1 | ENSG00000162994 | hippocampal formation | 1.5 |
| SLC19A2 | ENSG00000117479 | hippocampal formation | 2.0 |
| SLC19A2 | ENSG00000117479 | cerebral cortex | 3.5 |
| SETD9 | ENSG00000155542 | midbrain | 0.7 |
| ... | ... | ... | ... |
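The following command looks up the identifiers in the Gene column (--index 1, so the example implies that column indices count from zero) and writes the matching EnsEMBL entries to the output CSV file: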
ensembl_search \
--file "/home/user/data/csv/gene_expressions.csv" \
--index 1 \
--certificate "/home/user/data/certificates/authorities.pem" \
--certificate "/home/user/data/certificates/additional_authorities.pem" \
--output "/home/user/data/csv/EnsEMBL_entries.csv"
The output file then contains entries such as:
| assembly_name | biotype | canonical_transcript | db_type | description | display_name | dna | end | id | logic_name | object_type | seq_region_name | source | species | start | strand | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GRCh38 | protein_coding | ENST00000285947.5 | core | SET domain containing 9 [Source:HGNC Symbol;Acc:HGNC:28508] | SETD9 | GACAGCCGT... | 56925532 | ENSG00000155542 | ensembl_havana_gene_homo_sapiens | Gene | 5 | ensembl_havana | homo_sapiens | 56909260 | 1 | 12 |
| GRCh38 | protein_coding | ENST00000401408.6 | core | clathrin heavy chain linker domain containing 1 [Source:HGNC Symbol;Acc:HGNC:26453] | CLHC1 | TTTTTATGT... | 55232563 | ENSG00000162994 | ensembl_havana_gene_homo_sapiens | Gene | 2 | ensembl_havana | homo_sapiens | 55172547 | -1 | 16 |
| GRCh38 | protein_coding | ENST00000236137.10 | core | solute carrier family 19 member 2 [Source:HGNC Symbol;Acc:HGNC:10938] | SLC19A2 | TTTGATTAA... | 169485944 | ENSG00000117479 | ensembl_havana_gene_homo_sapiens | Gene | 1 | ensembl_havana | homo_sapiens | 169463909 | -1 | 15 |
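The columns in the table above appear in alphabetical order. As a rough illustration of how such a file could be assembled, the sketch below writes a list of entry records to CSV with alphabetically sorted headers. It is not the tool's implementation: the function name `write_entries` and the shape of the records (plain dictionaries, possibly with missing fields) are assumptions made for this example.

```python
import csv
import io


def write_entries(records, out):
    """Write dictionaries to `out` as CSV with alphabetically sorted columns.

    `records` is a list of dicts (a hypothetical in-memory form of the
    entries); fields missing from a record are written as empty cells.
    """
    # Union of all keys, sorted so the header order is deterministic.
    columns = sorted({key for record in records for key in record})
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return columns


if __name__ == "__main__":
    # A few fields taken from the example output above.
    entries = [
        {"id": "ENSG00000155542", "display_name": "SETD9", "biotype": "protein_coding"},
        {"id": "ENSG00000162994", "display_name": "CLHC1", "biotype": "protein_coding", "strand": -1},
    ]
    buffer = io.StringIO()
    write_entries(entries, buffer)
    print(buffer.getvalue())
```

Sorting the union of keys keeps the header stable even when individual records lack some fields.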
Identifiers writes the set of unique identifiers found in one column of a flat file (e.g. CSV, TSV) to standard output.
Usage: identifiers [OPTIONS] --index --file
Options:
-i, --index A column index to take the set of values
-d, --delimiter The delimiter character that separates each field value (e.g. ',', ';', '\t')
-f, --file The flat file (e.g. CSV, TSV) file path to parse for identifiers
-n, --no-headers A flag that indicates no header row is present
-h, --help Print help information
-V, --version Print version information
Given the following file /home/user/data/csv/gene_expressions.csv with the following entries,
| Gene name | Gene | Tissue region | Transcripts per million |
|---|---|---|---|
| CLHC1 | ENSG00000162994 | cerebral cortex | 0.9 |
| CLHC1 | ENSG00000162994 | basal ganglia | 1.4 |
| CLHC1 | ENSG00000162994 | hippocampal formation | 1.5 |
| SLC19A2 | ENSG00000117479 | hippocampal formation | 2.0 |
| SLC19A2 | ENSG00000117479 | cerebral cortex | 3.5 |
| SETD9 | ENSG00000155542 | midbrain | 0.7 |
The following command,
identifiers \
--file "/home/user/data/csv/gene_expressions.csv" \
--index 1
will output EnsEMBL identifiers to standard output:
ENSG00000162994
ENSG00000117479
ENSG00000155542
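The behaviour shown above (one column in, each distinct value out once) can be approximated in a few lines. The following is a minimal sketch, not the tool's own code: the function name `unique_column_values` is hypothetical, the index is assumed to count from zero as in the example, and first-seen ordering is an assumption based on the sample output.

```python
import csv

SAMPLE = """Gene name,Gene,Tissue region,Transcripts per million
CLHC1,ENSG00000162994,cerebral cortex,0.9
CLHC1,ENSG00000162994,basal ganglia,1.4
CLHC1,ENSG00000162994,hippocampal formation,1.5
SLC19A2,ENSG00000117479,hippocampal formation,2.0
SLC19A2,ENSG00000117479,cerebral cortex,3.5
SETD9,ENSG00000155542,midbrain,0.7
"""


def unique_column_values(lines, index, delimiter=",", has_headers=True):
    """Return the distinct values of column `index`, in first-seen order."""
    reader = csv.reader(lines, delimiter=delimiter)
    if has_headers:
        next(reader, None)  # skip the header row, if there is one
    seen = {}  # dicts preserve insertion order, so this acts as an ordered set
    for row in reader:
        if index < len(row):  # ignore rows that are too short
            seen.setdefault(row[index].strip(), None)
    return list(seen)


if __name__ == "__main__":
    for identifier in unique_column_values(SAMPLE.splitlines(), index=1):
        print(identifier)
```

Passing `has_headers=False` mirrors the `--no-headers` flag: the first row is then treated as data.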
Constructs an SVG image of a suffix tree from a string provided as input.
Usage: suffix_tree [OPTIONS] --string --file
Options:
-s, --string A string to construct a suffix tree
-f, --file A file path and name
-h, --help Print help information
-V, --version Print version information
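For intuition about the structure being drawn, here is a small sketch that builds a suffix tree by inserting every suffix in turn. This is the naive quadratic construction, not Ukkonen's on-line algorithm, and it is not taken from the tool itself. The names `Node`, `build_suffix_tree`, `contains`, and `count_leaves` are hypothetical, as is the `$` terminator appended to the input.

```python
class Node:
    def __init__(self):
        # Maps the first character of an edge label to (label, child node).
        self.children = {}


def build_suffix_tree(text, terminator="$"):
    """Naively build a compressed suffix tree of `text + terminator`."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    root = Node()
    for start in range(len(s)):
        _insert(root, s[start:])
    return root


def _insert(node, suffix):
    while suffix:
        first = suffix[0]
        if first not in node.children:
            node.children[first] = (suffix, Node())  # new leaf edge
            return
        label, child = node.children[first]
        # Length of the common prefix of the edge label and the suffix.
        k = 0
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k < len(label):
            # Split the edge at the point where the two strings diverge.
            middle = Node()
            middle.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], middle)
            child = middle
        node, suffix = child, suffix[k:]


def contains(root, pattern):
    """Return True if `pattern` occurs as a substring of the indexed text."""
    node = root
    while pattern:
        entry = node.children.get(pattern[0])
        if entry is None:
            return False
        label, child = entry
        n = min(len(label), len(pattern))
        if label[:n] != pattern[:n]:
            return False
        node, pattern = child, pattern[n:]
    return True


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for _, child in node.children.values())


if __name__ == "__main__":
    tree = build_suffix_tree("cacao")
    print(count_leaves(tree))     # one leaf per suffix of "cacao$"
    print(contains(tree, "aca"))  # substring query
```

The unique terminator guarantees that no suffix is a prefix of another, so every suffix ends at its own leaf.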
The command,
suffix_tree \
--string cacao \
--file ./output.svg
will generate the image: