I am running into issues with a set of models for ModelArchive where the alignment for the modeled sequence with the reference sequence contains insertions and deletions.
I noticed that one can use `details="insertion"` and `details="deletion"` when defining `ihm.reference.SeqDif`, but:
- This is not actually documented anywhere as far as I can tell (and so I am not sure if I am using this correctly).
- The deletions do not contain any information about the position of the deletion, since `ihm.reference.SeqDif` only stores the position in the entity sequence (which does not exist here).
- The checks for the validity of the alignment are disabled if any `SeqDif` object has `details="insertion"` or `details="deletion"`. Having these checks would be very helpful, since dictionary validation cannot catch any issue which relates to the reference sequence.
I looked at examples in the PDB and noticed the following:
- In 9zrl, there are deletions which are handled by setting `_struct_ref_seq_dif.pdbx_seq_db_seq_num`. I think this makes a lot of sense and should also be done in python-modelcif (and -ihm).
- In 9z4e, there are insertions which look like the ones produced by python-modelcif. So I think that part is handled well.
I have a small test code attached here (testing_insertion.py.gz) to check the insertions and deletions and whether mismatches are recognized.
So to sum up, here is what I think needs to be done (open for discussion of course):
- Update documentation of `modelcif.reference.SeqDif` to document how insertions and deletions can be defined
- Update `modelcif.reference.SeqDif` to also allow the definition of a `db_seq_id` variable to set `_struct_ref_seq_dif.pdbx_seq_db_seq_num` for deletions
- Update the checks for alignment validity (or at least display a warning that this is skipped if there are insertions and deletions)
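To make the last two points more concrete, here is a rough sketch of what an indel-aware alignment check could look like. This is not the current python-modelcif or python-ihm code: `SeqDifSketch` is a hypothetical stand-in for `SeqDif`, `db_seq_id` is the field proposed above, `check_alignment` is a made-up helper, and residues are plain one-letter strings rather than chemical component objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SeqDifSketch:
    """Hypothetical stand-in for SeqDif; not the real library class."""
    seq_id: Optional[int]       # 1-based entity (model) position; None for a deletion
    db_seq_id: Optional[int]    # proposed field: 1-based reference position; None for an insertion
    db_monomer: Optional[str]   # residue in the reference; None for an insertion
    monomer: Optional[str]      # residue in the model; None for a deletion
    details: Optional[str] = None


def check_alignment(entity_seq: str, db_seq: str,
                    seq_difs: List[SeqDifSketch]) -> List[str]:
    """Walk both sequences and return a list of problems (empty if consistent)."""
    insertions = {d.seq_id: d for d in seq_difs if d.details == "insertion"}
    deletions = {d.db_seq_id: d for d in seq_difs if d.details == "deletion"}
    mutations = {d.seq_id: d for d in seq_difs
                 if d.details not in ("insertion", "deletion")}
    errors = []
    i, j = 1, 1  # 1-based cursors into the entity and reference sequences
    while i <= len(entity_seq) or j <= len(db_seq):
        if i <= len(entity_seq) and i in insertions:
            i += 1  # model residue without a reference counterpart
        elif j <= len(db_seq) and j in deletions:
            if deletions[j].db_monomer != db_seq[j - 1]:
                errors.append(f"deletion at reference position {j} "
                              f"does not match the reference residue")
            j += 1  # reference residue without a model counterpart
        elif i > len(entity_seq) or j > len(db_seq):
            errors.append("sequences differ in length after accounting for indels")
            break
        else:
            model_res, ref_res = entity_seq[i - 1], db_seq[j - 1]
            dif = mutations.get(i)
            if dif is not None:
                if (dif.monomer, dif.db_monomer) != (model_res, ref_res):
                    errors.append(f"SeqDif at entity position {i} "
                                  f"does not match the sequences")
            elif model_res != ref_res:
                errors.append(f"undeclared mismatch at entity position {i} "
                              f"({ref_res} -> {model_res})")
            i += 1
            j += 1
    return errors


# Reference ACDEFG vs. model ACEWFK: D deleted, W inserted, G -> K mutated.
difs = [
    SeqDifSketch(None, 3, "D", None, details="deletion"),
    SeqDifSketch(4, None, None, "W", details="insertion"),
    SeqDifSketch(6, 6, "G", "K"),
]
print(check_alignment("ACEWFK", "ACDEFG", difs))
```

With `db_seq_id` available for deletions, the check no longer has to be skipped when indels are present; dropping the point-mutation record from `difs` would instead report an undeclared mismatch at entity position 6.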
Side note: in PDB-IHM, do you add SeqDif details as "conflict" or "variant" or something else for point mutations? We could do the same in ModelArchive or just keep it empty since we usually do not really know where the mismatches come from (often enough the reference and model sequences just deviated over time).