Describe the bug
The query for transcripts against the UTA database is non-deterministic, and returns an arbitrary matching transcript each time it's run.
This is a result of a join between associated_accessions and tx_exon_aln_v which doesn't fully constrain the rows in the tx_exon_aln_v table, resulting in the database choosing one arbitrarily. The fields of this arbitrary row are used to order the results, resulting in a random transcript being selected and returned.
I ran fusor 50 times using a script, and it resulted in a variety of transcripts being selected for the 5' partner (columns are count and transcript ID):
8 refseq:NM_001320453.1
6 refseq:NM_001320454.1
4 refseq:NM_001320454.2
11 refseq:NM_001320455.1
3 refseq:NM_001320455.2
14 refseq:NM_006365.2
4 refseq:NM_006365.3
Steps to reproduce
- Download this arriba fusion.
- Use fusor to annotate this fusion.
- Check the 5' partner's transcript.
- Rerun fusor, and check the transcript again.
Expected behavior
The chosen transcript is deterministic and does not change from one run to another.
Current behavior
The chosen transcript is non-deterministic, and returns an arbitrary matching transcript each time it is run.
Possible reason(s)
The query joins the tables associated_accessions with tx_exon_aln_v on associated_accessions.tx_ac = tx_exon_aln_v.tx_ac (this happens transitively through a join on the transcript with the same column).
For a given value of associated_accessions.tx_ac, there may be several matching tx_exon_aln_v rows (example below).
As a result, the join chooses an arbitrary row from tx_exon_aln_v. These rows differ in their tx_start_i and tx_end_i, which is used to order the transcripts, resulting in the transcript order varying each time the query is run.
Example query of matching values of tx_exon_aln_v for a given tx_ac:
uta=> select tx_ac, tx_start_i, tx_end_i from tx_exon_aln_v where tx_ac = 'NM_001320453.1';
tx_ac | tx_start_i | tx_end_i
----------------+------------+----------
NM_001320453.1 | 0 | 173
NM_001320453.1 | 173 | 248
NM_001320453.1 | 248 | 344
NM_001320453.1 | 344 | 444
NM_001320453.1 | 444 | 794
NM_001320453.1 | 0 | 173
NM_001320453.1 | 173 | 248
NM_001320453.1 | 248 | 344
NM_001320453.1 | 344 | 444
NM_001320453.1 | 444 | 794
(10 rows)
Suggested fix
Find a way to make the join from associated_accessions.tx_ac = tx_exon_aln_v.tx_ac deterministic.
Branch, commit, and/or version
main
Screenshots
No response
Environment details
Not relevant
Additional details
No response
Contribution
I can help prepare a PR, but I would like someone with a better understanding of the structure of the UTA database to determine how best to make the join deterministic.
Describe the bug
The query for transcripts against the UTA database is non-deterministic, and returns an arbitrary matching transcript each time it's run.
This is a result of a join between
associated_accessionsandtx_exon_aln_vwhich doesn't fully constrain the rows in thetx_exon_aln_vtable, resulting in the database choosing one arbitrarily. The fields of this arbitrary row are used to order the results, resulting in a random transcript being selected and returned.I ran fusor 50 times using a script, and it resulted in a variety of transcripts being selected for the 5' partner (columns are count and transcript ID):
Steps to reproduce
Expected behavior
The chosen transcript is deterministic and does not change from one run to another.
Current behavior
The chosen transcript is non-deterministic, and returns an arbitrary matching transcript each time it is run.
Possible reason(s)
The query joins the tables
associated_accessionswithtx_exon_aln_vonassociated_accessions.tx_ac = tx_exon_aln_v.tx_ac(this happens transitively through a join on thetranscriptwith the same column).For a given value of
associated_accessions.tx_ac, there may be several matchingtx_exon_aln_vrows (example below).As a result, the join chooses an arbitrary row from
tx_exon_aln_v. These rows differ in theirtx_start_iandtx_end_i, which is used to order the transcripts, resulting in the transcript order varying each time the query is run.Example query of matching values of
tx_exon_aln_vfor a giventx_ac:Suggested fix
Find a way to make the join from
associated_accessions.tx_ac = tx_exon_aln_v.tx_acdeterministic.Branch, commit, and/or version
main
Screenshots
No response
Environment details
Not relevant
Additional details
No response
Contribution
I can help prepare a PR, but I would like someone with a better understanding of the structure of the UTA database to determine how best to make the join deterministic.