
One row per sample/allele in csv output #33

@domenico-simone

VCF file:

##fileformat=VCFv4.0
##reference=chrRCRS
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=.,Type=Integer,Description="Reads covering the REF position">
##FORMAT=<ID=HF,Number=.,Type=Float,Description="Heteroplasmy Frequency of variant allele">
##FORMAT=<ID=CILOW,Number=.,Type=Float,Description="Value defining the lower limit of the confidence interval of the heteroplasmy fraction">
##FORMAT=<ID=CIUP,Number=.,Type=Float,Description="Value defining the upper limit of the confidence interval of the heteroplasmy fraction">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SRR043366	SRR043354
chrRCRS	263	0	A	G,C	0	PASS	AC=1;AN=2	GT:DP:HF:CILOW:CIUP	0/1:167:0.994:0.963:1.0	0/2:167:0.994:0.963:1.0

Idea for a CSV output:

SAMPLE	CHROM	POS	ID	REF	ALT	QUAL	AC	AN	Locus	FunctionalLocus	CodonPosition	AaChange	HF	CILOW	CIUP	…
SRR043366	chrRCRS	263	0	A	G	0	2;-1	4	MT-DLOOP	MT-HV2 (Hypervariable segment 2)	.	.	0.994	0.963	1	…
SRR043354	chrRCRS	263	0	A	C	0	2;-1	4	MT-DLOOP	MT-HV2 (Hypervariable segment 2)	.	.	0.853	0.79	0.9	…

So we have one row for each SAMPLE-ALT pair. This means that if one sample has more than one ALT allele, there will be as many rows for that sample as it has ALT alleles. This would make the table easy to use for downstream processing, e.g. wrangling and plotting with tidyverse packages, or creating dynamic (sortable/filterable) plots with htmlwidgets.
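
A rough sketch of the reshaping, in Python, just to make the idea concrete. The function names (`vcf_to_long_rows`, `rows_to_table`) are made up for illustration and are not part of the existing code. It assumes that the k-th comma-separated value of HF/CILOW/CIUP belongs to the k-th ALT allele listed in that sample's GT, which may not match how the VCF is actually written. The annotation columns (Locus, AaChange, ...) and AC/AN are left out.

```python
import csv
import io

FIXED = ["SAMPLE", "CHROM", "POS", "ID", "REF", "ALT", "QUAL"]


def vcf_to_long_rows(vcf_text, per_allele_fields=("HF", "CILOW", "CIUP")):
    """Yield one dict per (sample, ALT allele) pair found in the genotypes."""
    samples = []
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, var_id, ref, alt, qual = cols[:6]
        alts = alt.split(",")
        keys = cols[8].split(":")
        for sample, cell in zip(samples, cols[9:]):
            fmt = dict(zip(keys, cell.split(":")))
            gt = fmt.get("GT", "").replace("|", "/").split("/")
            # ALT indices carried by this sample (0 is REF, "." is missing),
            # kept in order and without repeats.
            alt_idx = list(dict.fromkeys(
                int(a) for a in gt if a.isdigit() and int(a) > 0
            ))
            for k, idx in enumerate(alt_idx):
                row = dict(zip(
                    FIXED,
                    [sample, chrom, pos, var_id, ref, alts[idx - 1], qual],
                ))
                for field in per_allele_fields:
                    # Assumption: k-th value <-> k-th ALT allele in this GT.
                    values = fmt[field].split(",") if field in fmt else []
                    row[field] = values[k] if k < len(values) else "."
                yield row


def rows_to_table(rows, per_allele_fields=("HF", "CILOW", "CIUP"), delimiter="\t"):
    """Serialize the rows as a delimited table with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=FIXED + list(per_allele_fields),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Something like `rows_to_table(vcf_to_long_rows(open(path).read()))` would then give the long table, where `path` is any multi-sample VCF; pass `delimiter=","` for a true CSV.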

Labels: enhancement (New feature or request)
