Utils

An eclectic mix of introductory musings with the Rust language.

Citations and References

Fergal J Martin, M Ridwan Amode, Alisha Aneja, Olanrewaju Austine-Orimoloye, Andrey G Azov, If Barnes, Arne Becker, Ruth Bennett, Andrew Berry, Jyothish Bhai, Simarpreet Kaur Bhurji, Alexandra Bignell, Sanjay Boddu, Paulo R Branco Lins, Lucy Brooks, Shashank Budhanuru Ramaraju, Mehrnaz Charkhchi, Alexander Cockburn, Luca Da Rin Fiorretto, Claire Davidson, Kamalkumar Dodiya, Sarah Donaldson, Bilal El Houdaigui, Tamara El Naboulsi, Reham Fatima, Carlos Garcia Giron, Thiago Genez, Gurpreet S Ghattaoraya, Jose Gonzalez Martinez, Cristi Guijarro, Matthew Hardy, Zoe Hollis, Thibaut Hourlier, Toby Hunt, Mike Kay, Vinay Kaykala, Tuan Le, Diana Lemos, Diego Marques-Coelho, José Carlos Marugán, Gabriela Alejandra Merino, Louisse Paola Mirabueno, Aleena Mushtaq, Syed Nakib Hossain, Denye N Ogeh, Manoj Pandian Sakthivel, Anne Parker, Malcolm Perry, Ivana Piližota, Irina Prosovetskaia, José G Pérez-Silva, Ahamed Imran Abdul Salam, Nuno Saraiva-Agostinho, Helen Schuilenburg, Dan Sheppard, Swati Sinha, Botond Sipos, William Stark, Emily Steed, Ranjit Sukumaran, Dulika Sumathipala, Marie-Marthe Suner, Likhitha Surapaneni, Kyösti Sutinen, Michal Szpak, Francesca Floriana Tricomi, David Urbina-Gómez, Andres Veidenberg, Thomas A Walsh, Brandon Walts, Elizabeth Wass, Natalie Willhoft, Jamie Allen, Jorge Alvarez-Jarreta, Marc Chakiachvili, Bethany Flint, Stefano Giorgetti, Leanne Haggerty, Garth R Ilsley, Jane E Loveland, Benjamin Moore, Jonathan M Mudge, John Tate, David Thybert, Stephen J Trevanion, Andrea Winterbottom, Adam Frankish, Sarah E Hunt, Magali Ruffier, Fiona Cunningham, Sarah Dyer, Robert D Finn, Kevin L Howe, Peter W Harrison, Andrew D Yates, and Paul Flicek Ensembl 2023 Nucleic Acids Res. 2023, 51(D1):D933-D941 PMID: 36318249 doi:10.1093/nar/gkac958

Kanehisa, M. and Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes Nucleic Acids Res. 28, 27-30 (2000) PMID: 10592173 doi: 10.1093/nar/28.1.27

Kanehisa, M. Toward understanding the origin and evolution of cellular organisms Protein Sci. 28, 1947-1951 (2019) PMID: 31441146 doi: 10.1002/pro.3715

Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. and Ishiguro-Watanabe, M. KEGG for taxonomy-based analysis of pathways and genomes Nucleic Acids Res. 51, D587-D592 (2023) PMID: 36300620 doi: 10.1093/nar/gkac963

Malkov, Y. A., & Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs IEEE Transactions on Pattern Analysis and Machine Intelligence 42(4), 824–836 (2020) doi: 10.1109/tpami.2018.2889473

Ukkonen, E. On-line construction of suffix trees Algorithmica, 14(3), 249–260 (1995) doi: 10.1007/bf01206331

EnsEMBL Search

EnsEMBL search generates a CSV file (comma delimited) of EnsEMBL identifier entries.

Usage: ensembl_search [OPTIONS] --certificate --index --file --output

Options:

-c, --certificate A DER-encoded X.509 file

-i, --index A column index to take the set of values

-d, --delimiter The delimiter character that separates each field value (e.g. ',', ';', '\t')

-f, --file The flat file (e.g. CSV, TSV) file path to parse for identifiers

-n, --no-headers A flag that indicates no header row is present

-O, --output The output file name and path to write a CSV file

-h, --help Print help information

-V, --version Print version information

Example

Given the following file /home/user/data/csv/gene_expressions.csv with the following entries:

Gene name	Gene	Tissue region	Transcripts per million
CLHC1	ENSG00000162994	cerebral cortex	0.9
CLHC1	ENSG00000162994	basal ganglia	1.4
CLHC1	ENSG00000162994	hippocampal formation	1.5
SLC19A2	ENSG00000117479	hippocampal formation	2.0
SLC19A2	ENSG00000117479	cerebral cortex	3.5
SETD9	ENSG00000155542	midbrain	0.7
...	...	...

The following command will write the corresponding EnsEMBL entries to the output CSV file:

ensembl_search \
--file "/home/user/data/csv/gene_expressions.csv" \
--index 1 \
--certificate "/home/user/data/certificates/authorities.pem" \
--certificate "/home/user/data/certificates/additional_authorities.pem" \
--output "/home/user/data/csv/EnsEMBL_entries.csv"

assembly_name	biotype	canonical_transcript	db_type	description	display_name	dna	end	id	logic_name	object_type	seq_region_name	source	species	start	strand	version
GRCh38	protein_coding	ENST00000285947.5	core	SET domain containing 9 [Source:HGNC Symbol;Acc:HGNC:28508]	SETD9	GACAGCCGT...	56925532	ENSG00000155542	ensembl_havana_gene_homo_sapiens	Gene	5	ensembl_havana	homo_sapiens	56909260	1	12
GRCh38	protein_coding	ENST00000401408.6	core	clathrin heavy chain linker domain containing 1 [Source:HGNC Symbol;Acc:HGNC:26453]	CLHC1	TTTTTATGT...	55232563	ENSG00000162994	ensembl_havana_gene_homo_sapiens	Gene	2	ensembl_havana	homo_sapiens	55172547	-1	16
GRCh38	protein_coding	ENST00000236137.10	core	solute carrier family 19 member 2 [Source:HGNC Symbol;Acc:HGNC:10938]	SLC19A2	TTTGATTAA...	169485944	ENSG00000117479	ensembl_havana_gene_homo_sapiens	Gene	1	ensembl_havana	homo_sapiens	169463909	-1	15
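Because the output is comma delimited and description values are free text, any field containing a comma, a double quote, or a line break has to be quoted when written. The following is a minimal sketch of RFC 4180-style quoting, not the code ensembl_search actually uses; the function names escape_csv_field and to_csv_row are hypothetical.

```rust
/// Quotes a field only when it contains a comma, a double quote, or a line break.
/// Embedded double quotes are doubled, as described in RFC 4180.
/// (Hypothetical helper for illustration; not part of ensembl_search.)
fn escape_csv_field(field: &str) -> String {
    let needs_quoting = field.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r'));
    if needs_quoting {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

/// Joins already-ordered field values into one comma-delimited row (no trailing newline).
fn to_csv_row(fields: &[&str]) -> String {
    fields
        .iter()
        .map(|field| escape_csv_field(field))
        .collect::<Vec<_>>()
        .join(",")
}
```

A row would be built by passing the column values in header order, for example to_csv_row(&["GRCh38", "protein_coding", "SETD9"]). In practice a maintained crate for CSV writing is likely a better choice than hand-rolled quoting.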

Identifiers

Identifiers outputs to standard output a set of identifiers from a column present in a flat file (e.g. CSV, TSV).

Usage: identifiers [OPTIONS] --index --file

Options:

-i, --index A column index to take the set of values

-d, --delimiter The delimiter character that separates each field value (e.g. ',', ';', '\t')

-f, --file The flat file (e.g. CSV, TSV) file path to parse for identifiers

-n, --no-headers A flag that indicates no header row is present

-h, --help Print help information

-V, --version Print version information

Example

Given the following file /home/user/data/csv/gene_expressions.csv with the following entries,

Gene name	Gene	Tissue region	Transcripts per million
CLHC1	ENSG00000162994	cerebral cortex	0.9
CLHC1	ENSG00000162994	basal ganglia	1.4
CLHC1	ENSG00000162994	hippocampal formation	1.5
SLC19A2	ENSG00000117479	hippocampal formation	2.0
SLC19A2	ENSG00000117479	cerebral cortex	3.5
SETD9	ENSG00000155542	midbrain	0.7

The following command,

identifiers \
--file "/home/user/data/csv/gene_expressions.csv" \
--index 1

will output EnsEMBL identifiers to standard output:

ENSG00000162994
ENSG00000117479
ENSG00000155542
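The behaviour shown in this example can be sketched with the standard library alone. This is an illustration under stated assumptions, not the tool's implementation: the function name unique_column_values is hypothetical, fields are split naively on the delimiter (quoted fields containing the delimiter are not handled), and values are reported in first-seen order, which the actual tool may not guarantee.

```rust
use std::collections::HashSet;

/// Collects the distinct values of one column from delimited text, in first-seen order.
/// `index` is zero-based; rows that are too short or have an empty field are skipped.
/// (Hypothetical helper for illustration; naive splitting, no quoted-field support.)
fn unique_column_values(
    text: &str,
    delimiter: char,
    index: usize,
    has_headers: bool,
) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut values = Vec::new();
    let rows_to_skip = if has_headers { 1 } else { 0 };

    for line in text.lines().skip(rows_to_skip) {
        if let Some(field) = line.split(delimiter).nth(index) {
            let field = field.trim();
            // `insert` returns true only the first time a value is seen.
            if !field.is_empty() && seen.insert(field.to_string()) {
                values.push(field.to_string());
            }
        }
    }
    values
}
```

Called with the file contents above, a ',' delimiter, index 1, and has_headers set to true, the function returns each gene identifier once.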

Suffix Tree

Constructs an SVG image of a suffix tree from a string provided as input.

Usage: suffix_tree [OPTIONS] --string --file

Options:

-s, --string A string to construct a suffix tree

-f, --file A file path and name

-h, --help Print help information

-V, --version Print version information

Example

The command,

suffix_tree \
--string cacao \
--file ./output.svg

will write an SVG image of the suffix tree for the string cacao to ./output.svg.
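For readers who want to see the structure behind the image, here is a minimal sketch of a suffix tree for a short string. The references cite Ukkonen's on-line algorithm; this sketch instead uses a naive quadratic construction (insert every suffix, splitting edges where they diverge), which is easier to follow but is not what the tool is documented to use. The names Node, build_suffix_tree, leaf_count and contains are hypothetical, the input is assumed to be ASCII, and a '$' terminator is appended on the assumption that it does not occur in the input.

```rust
use std::collections::BTreeMap;

/// A node of a naive suffix tree. Each outgoing edge is keyed by the first
/// byte of its label and stores the full label plus the child node.
#[derive(Debug, Default)]
struct Node {
    children: BTreeMap<u8, (String, Node)>,
}

impl Node {
    /// Inserts one suffix below this node, splitting an edge where the suffix diverges.
    fn insert(&mut self, suffix: &str) {
        if suffix.is_empty() {
            return;
        }
        let first = suffix.as_bytes()[0];
        if !self.children.contains_key(&first) {
            // No edge starts with this byte: add a new leaf edge.
            self.children.insert(first, (suffix.to_string(), Node::default()));
            return;
        }
        let (label, child) = self.children.get_mut(&first).unwrap();
        let common = label
            .bytes()
            .zip(suffix.bytes())
            .take_while(|(a, b)| a == b)
            .count();
        if common < label.len() {
            // Split the edge: keep the shared prefix, push the rest one level down.
            let tail = label.split_off(common);
            let old_child = std::mem::take(child);
            child.children.insert(tail.as_bytes()[0], (tail, old_child));
        }
        child.insert(&suffix[common..]);
    }

    /// Number of leaves below this node (one per suffix of the terminated text).
    fn leaf_count(&self) -> usize {
        if self.children.is_empty() {
            1
        } else {
            self.children.values().map(|(_, c)| c.leaf_count()).sum()
        }
    }

    /// Returns true if `pattern` occurs as a substring of the indexed text.
    fn contains(&self, pattern: &str) -> bool {
        if pattern.is_empty() {
            return true;
        }
        match self.children.get(&pattern.as_bytes()[0]) {
            None => false,
            Some((label, child)) => {
                let common = label
                    .bytes()
                    .zip(pattern.bytes())
                    .take_while(|(a, b)| a == b)
                    .count();
                if common == pattern.len() {
                    true
                } else if common == label.len() {
                    child.contains(&pattern[common..])
                } else {
                    false
                }
            }
        }
    }
}

/// Builds the tree by inserting every suffix of `text` followed by '$'.
fn build_suffix_tree(text: &str) -> Node {
    assert!(
        text.is_ascii() && !text.contains('$'),
        "sketch assumes ASCII input without '$'"
    );
    let terminated = format!("{}$", text);
    let mut root = Node::default();
    for start in 0..terminated.len() {
        root.insert(&terminated[start..]);
    }
    root
}
```

For the string cacao, the terminated text cacao$ has six suffixes, so the tree has six leaves; the shared prefixes "ca" and "a" become internal nodes. A substring query such as contains("aca") walks edges from the root and succeeds if the pattern is exhausted along the way.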
