NathanWEdwards/utils

Utils

An eclectic mix of introductory musings with the Rust language.

CircleCI License: MIT

Citations and References

Fergal J Martin, M Ridwan Amode, Alisha Aneja, Olanrewaju Austine-Orimoloye, Andrey G Azov, If Barnes, Arne Becker, Ruth Bennett, Andrew Berry, Jyothish Bhai, Simarpreet Kaur Bhurji, Alexandra Bignell, Sanjay Boddu, Paulo R Branco Lins, Lucy Brooks, Shashank Budhanuru Ramaraju, Mehrnaz Charkhchi, Alexander Cockburn, Luca Da Rin Fiorretto, Claire Davidson, Kamalkumar Dodiya, Sarah Donaldson, Bilal El Houdaigui, Tamara El Naboulsi, Reham Fatima, Carlos Garcia Giron, Thiago Genez, Gurpreet S Ghattaoraya, Jose Gonzalez Martinez, Cristi Guijarro, Matthew Hardy, Zoe Hollis, Thibaut Hourlier, Toby Hunt, Mike Kay, Vinay Kaykala, Tuan Le, Diana Lemos, Diego Marques-Coelho, José Carlos Marugán, Gabriela Alejandra Merino, Louisse Paola Mirabueno, Aleena Mushtaq, Syed Nakib Hossain, Denye N Ogeh, Manoj Pandian Sakthivel, Anne Parker, Malcolm Perry, Ivana Piližota, Irina Prosovetskaia, José G Pérez-Silva, Ahamed Imran Abdul Salam, Nuno Saraiva-Agostinho, Helen Schuilenburg, Dan Sheppard, Swati Sinha, Botond Sipos, William Stark, Emily Steed, Ranjit Sukumaran, Dulika Sumathipala, Marie-Marthe Suner, Likhitha Surapaneni, Kyösti Sutinen, Michal Szpak, Francesca Floriana Tricomi, David Urbina-Gómez, Andres Veidenberg, Thomas A Walsh, Brandon Walts, Elizabeth Wass, Natalie Willhoft, Jamie Allen, Jorge Alvarez-Jarreta, Marc Chakiachvili, Bethany Flint, Stefano Giorgetti, Leanne Haggerty, Garth R Ilsley, Jane E Loveland, Benjamin Moore, Jonathan M Mudge, John Tate, David Thybert, Stephen J Trevanion, Andrea Winterbottom, Adam Frankish, Sarah E Hunt, Magali Ruffier, Fiona Cunningham, Sarah Dyer, Robert D Finn, Kevin L Howe, Peter W Harrison, Andrew D Yates, and Paul Flicek Ensembl 2023 Nucleic Acids Res. 2023, 51(D1):D933-D941 PMID: 36318249 doi:10.1093/nar/gkac958

Kanehisa, M. and Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes Nucleic Acids Res. 28, 27-30 (2000) PMID: 10592173 doi: 10.1093/nar/28.1.27

Kanehisa, M. Toward understanding the origin and evolution of cellular organisms Protein Sci. 28, 1947-1951 (2019) PMID: 31441146 doi: 10.1002/pro.3715

Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. and Ishiguro-Watanabe, M. KEGG for taxonomy-based analysis of pathways and genomes Nucleic Acids Res. 51, D587-D592 (2023) PMID: 36300620 doi: 10.1093/nar/gkac963

Malkov, Y. A., & Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs IEEE Transactions on Pattern Analysis and Machine Intelligence 42(4), 824–836 (2020) doi: 10.1109/tpami.2018.2889473

Ukkonen, E. On-line construction of suffix trees Algorithmica, 14(3), 249–260 (1995) doi: 10.1007/bf01206331

EnsEMBL Search

EnsEMBL search generates a CSV file (comma delimited) of EnsEMBL identifier entries.

Usage: ensembl_search [OPTIONS] --certificate --index --file --output

Options:

-c, --certificate A DER-encoded X.509 file

-i, --index A column index to take the set of values

-d, --delimiter The delimiter character that separates each field value (e.g. ',', ';', '\t')

-f, --file The flat file (e.g. CSV, TSV) file path to parse for identifiers

-n, --no-headers A flag that indicates no header row is present

-O, --output The output file name and path to write a CSV file

-h, --help Print help information

-V, --version Print version information

Example

Given the following file /home/user/data/csv/gene_expressions.csv with the following entries:

Gene name  Gene             Tissue region          Transcripts per million
CLHC1      ENSG00000162994  cerebral cortex        0.9
CLHC1      ENSG00000162994  basal ganglia          1.4
CLHC1      ENSG00000162994  hippocampal formation  1.5
SLC19A2    ENSG00000117479  hippocampal formation  2.0
SLC19A2    ENSG00000117479  cerebral cortex        3.5
SETD9      ENSG00000155542  midbrain               0.7
...        ...              ...

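The following command searches for the identifiers in column index 1 (the second column, counting from zero) and writes the matching EnsEMBL entries to the output file: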

ensembl_search \
--file "/home/user/data/csv/gene_expressions.csv" \
--index 1 \
--certificate "/home/user/data/certificates/authorities.pem" \
--certificate "/home/user/data/certificates/additional_authorities.pem" \
--output "/home/user/data/csv/EnsEMBL_entries.csv"

The resulting /home/user/data/csv/EnsEMBL_entries.csv contains one entry per distinct identifier:

assembly_name biotype canonical_transcript db_type description display_name dna end id logic_name object_type seq_region_name source species start strand version
GRCh38 protein_coding ENST00000285947.5 core SET domain containing 9 [Source:HGNC Symbol;Acc:HGNC:28508] SETD9 GACAGCCGT... 56925532 ENSG00000155542 ensembl_havana_gene_homo_sapiens Gene 5 ensembl_havana homo_sapiens 56909260 1 12
GRCh38 protein_coding ENST00000401408.6 core clathrin heavy chain linker domain containing 1 [Source:HGNC Symbol;Acc:HGNC:26453] CLHC1 TTTTTATGT... 55232563 ENSG00000162994 ensembl_havana_gene_homo_sapiens Gene 2 ensembl_havana homo_sapiens 55172547 -1 16
GRCh38 protein_coding ENST00000236137.10 core solute carrier family 19 member 2 [Source:HGNC Symbol;Acc:HGNC:10938] SLC19A2 TTTGATTAA... 169485944 ENSG00000117479 ensembl_havana_gene_homo_sapiens Gene 1 ensembl_havana homo_sapiens 169463909 -1 15

Identifiers

Identifiers writes the set of distinct values found in one column of a flat file (e.g. CSV, TSV) to standard output.

Usage: identifiers [OPTIONS] --index --file

Options:

-i, --index A column index to take the set of values

-d, --delimiter The delimiter character that separates each field value (e.g. ',', ';', '\t')

-f, --file The flat file (e.g. CSV, TSV) file path to parse for identifiers

-n, --no-headers A flag that indicates no header row is present

-h, --help Print help information

-V, --version Print version information

Example

Given the following file /home/user/data/csv/gene_expressions.csv with the following entries,

Gene name  Gene             Tissue region          Transcripts per million
CLHC1      ENSG00000162994  cerebral cortex        0.9
CLHC1      ENSG00000162994  basal ganglia          1.4
CLHC1      ENSG00000162994  hippocampal formation  1.5
SLC19A2    ENSG00000117479  hippocampal formation  2.0
SLC19A2    ENSG00000117479  cerebral cortex        3.5
SETD9      ENSG00000155542  midbrain               0.7

The following command,

identifiers \
--file "/home/user/data/csv/gene_expressions.csv" \
--index 1

will output EnsEMBL identifiers to standard output:

ENSG00000162994
ENSG00000117479
ENSG00000155542
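
The column extraction described above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions, not the tool's actual implementation: the function name `unique_column` is hypothetical, the sample file is assumed to be comma-delimited, fields are split naively (quoted fields that contain the delimiter are not handled), and values are reported in first-seen order.

```rust
use std::collections::HashSet;

/// Hypothetical helper: collects the distinct values of one column.
/// `index` is zero-based; `has_headers` skips the first line.
/// Assumption: a naive split on `delimiter`, with no support for quoted fields.
fn unique_column(text: &str, index: usize, delimiter: char, has_headers: bool) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut values = Vec::new();
    let skip = if has_headers { 1 } else { 0 };

    for line in text.lines().skip(skip) {
        // Rows that are too short to contain the column are ignored.
        if let Some(field) = line.split(delimiter).nth(index) {
            let field = field.trim();
            // `insert` returns false when the value was already seen.
            if !field.is_empty() && seen.insert(field.to_string()) {
                values.push(field.to_string());
            }
        }
    }
    values
}

fn main() {
    // Assumed comma-delimited rendering of the example file above.
    let data = "Gene name,Gene,Tissue region,Transcripts per million\n\
                CLHC1,ENSG00000162994,cerebral cortex,0.9\n\
                CLHC1,ENSG00000162994,basal ganglia,1.4\n\
                CLHC1,ENSG00000162994,hippocampal formation,1.5\n\
                SLC19A2,ENSG00000117479,hippocampal formation,2.0\n\
                SLC19A2,ENSG00000117479,cerebral cortex,3.5\n\
                SETD9,ENSG00000155542,midbrain,0.7\n";

    for id in unique_column(data, 1, ',', true) {
        println!("{id}");
    }
}
```

Tracking seen values in a `HashSet` while pushing to a `Vec` keeps the output deterministic, which a plain set would not guarantee.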

Suffix Tree

Constructs an SVG image of a suffix tree from a string provided as input.

Usage: suffix_tree [OPTIONS] --string --file

Options:

-s, --string A string to construct a suffix tree

-f, --file A file path and name

-h, --help Print help information

-V, --version Print version information

Example

The command,

suffix_tree \
--string cacao \
--file ./output.svg

will generate the image:

[Image: Suffix Tree example]
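
For readers unfamiliar with the data structure, the following is a minimal sketch of a related, simpler structure: an uncompressed suffix trie built by inserting every suffix of the input one character at a time. It is not Ukkonen's linear-time construction cited above and not necessarily how this tool builds its tree; the names `Node`, `build_suffix_trie`, `contains`, and `count_nodes` are hypothetical, and no terminator symbol is appended to the input.

```rust
use std::collections::BTreeMap;

/// One trie node; each outgoing edge is labeled with a single character.
#[derive(Default)]
struct Node {
    children: BTreeMap<char, Node>,
}

/// Naive construction: insert every suffix of `text` character by character.
/// Takes O(n^2) time and space, unlike a compressed suffix tree.
fn build_suffix_trie(text: &str) -> Node {
    let chars: Vec<char> = text.chars().collect();
    let mut root = Node::default();
    for start in 0..chars.len() {
        let mut node = &mut root;
        for &c in &chars[start..] {
            // Follow the existing edge or create a new child.
            node = node.children.entry(c).or_default();
        }
    }
    root
}

/// A pattern is a substring of the text exactly when it labels a path from the root.
fn contains(root: &Node, pattern: &str) -> bool {
    let mut node = root;
    for c in pattern.chars() {
        match node.children.get(&c) {
            Some(next) => node = next,
            None => return false,
        }
    }
    true
}

/// Counts all nodes in the trie, including the root.
fn count_nodes(node: &Node) -> usize {
    1 + node.children.values().map(count_nodes).sum::<usize>()
}

fn main() {
    let trie = build_suffix_trie("cacao");
    println!("contains \"aca\": {}", contains(&trie, "aca"));
    println!("contains \"oca\": {}", contains(&trie, "oca"));
    println!("nodes: {}", count_nodes(&trie));
}
```

A suffix tree stores the same paths but merges each chain of single-child nodes into one edge labeled with a substring, which is what keeps its size linear in the length of the input.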
