Hello and thank you for this useful tool!
I am investigating repeat expansion distributions in a cohort of ONT-sequenced samples and have noticed that, in multiple instances, the tool infers very different copy number values for samples whose insertion distribution profiles look nearly identical when inspected via CIGAR strings.
Example
Alignment:
minimap2 -a -x map-ont --MD -t 16 reference.fa sample.fastq -o sample.sam
Straglr:
straglr.py sample.bam hg38.fa sample --min_ins_size 5 --region regions.bed
At locus chr6:75866276, the VCFs for sample_1 and sample_2 report the same repeat unit ("GT") but different copy numbers (21.5 and 44.5, respectively):
- Sample1:
chr6 75866276 . T <CNV:TR> . PASS RUS_REF=GT;SVLEN=32;RN=1;RUS=GT;RUC=21.5;CIRUC=-2.0,1.0 GT:DP:AD 1:41:21
- Sample2:
chr6 75866276 . T <CNV:TR> . PASS RUS_REF=GT;SVLEN=32;RN=1;RUS=GT;RUC=44.5;CIRUC=-10.5,4.0 GT:DP:AD 1:32:23
When inspecting CIGAR-derived insertion lengths and repeat unit counts (GT/TG) at this locus, both samples show nearly identical distributions (see below). This is roughly consistent with Sample1's RUC=21.5 (taking into account the 16 copies already present in hg38), but for Sample2 a call of RUC=44.5 would imply a distribution shifted ~20–25 units higher.
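For reference, the kind of CIGAR-based check described above can be sketched roughly as follows. This is only an illustration of my cross-check, not how straglr computes RUC. The function names, the locus window, and the example read are hypothetical. It assumes a 2 bp motif and the 16 reference copies mentioned above, and it ignores deletions and soft-clipped sequence.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def insertions_in_window(pos, cigar, start, end, min_ins_size=5):
    """Return lengths of insertions whose reference position falls in [start, end].

    pos is the 1-based leftmost mapping position (SAM POS field).
    Insertions shorter than min_ins_size are skipped.
    """
    ref = pos
    lengths = []
    for n, op in CIGAR_OP.findall(cigar):
        n = int(n)
        if op == "I":
            if start <= ref <= end and n >= min_ins_size:
                lengths.append(n)
        elif op in "MDN=X":
            # Only these operations consume reference bases.
            ref += n
    return lengths

def estimated_copies(pos, cigar, start, end, motif_len=2, ref_copies=16.0,
                     min_ins_size=5):
    """Reference copies plus inserted bases in the window, in motif units."""
    inserted = sum(insertions_in_window(pos, cigar, start, end, min_ins_size))
    return ref_copies + inserted / motif_len

if __name__ == "__main__":
    # Synthetic read, not taken from the attached files.
    print(estimated_copies(100, "50M12I30M3I20M", start=140, end=160))
```

Running this per read over the locus and comparing the resulting distributions between samples is what the comparison above amounts to.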
Question
Is the RUC=44.5 call for Sample2 reliable, or should it be flagged as a miscall? Does it make sense to cross-check RUC against CIGAR-derived insertion lengths, or am I missing something?
I’ll attach straglr tsv output and alignment files for both samples.
Thank you.
sample_1.tsv
sample_2.tsv
sample_1.txt
sample_2.txt
