Handling of insertions and mismatches in modelcif.reference.Alignment

I am running into issues with a set of models for ModelArchive where the alignment of the modeled sequence with the reference sequence contains insertions and deletions.

I noticed that one can use `details="insertion"` and `details="deletion"` when defining `ihm.reference.SeqDif` but:

1. This is not actually documented anywhere as far as I can tell (and so I am not sure if I am using this correctly).
2. The deletions do not contain any information about the position of the deletion since `ihm.reference.SeqDif` only stores the position in the entity sequence (which does not exist here).
3. The checks for the validity of the alignment are disabled if any `SeqDif` object has `details="insertion"` or `details="deletion"`. Keeping these checks would be very helpful, since dictionary validation cannot catch any issue that relates to the reference sequence.
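
To make point 2 concrete, here is a minimal sketch using a stand-in dataclass rather than the real `ihm.reference.SeqDif`. The class name `SeqDifStandIn`, the use of `seq_id=None` for a deletion, the one-letter residue codes, and the example sequences are all assumptions made for illustration. It shows that two deletions at different reference positions end up as identical records when only the entity position is stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeqDifStandIn:
    """Hypothetical stand-in holding the same kind of fields as a SeqDif."""
    seq_id: Optional[int]      # position in the entity (model) sequence
    db_monomer: Optional[str]  # residue in the reference sequence
    monomer: Optional[str]     # residue in the entity (model) sequence
    details: Optional[str] = None


reference = "MKTAYIAK"
model_a = "MKTYIAK"  # reference residue 4 (A) is deleted
model_b = "MKTAYIK"  # reference residue 7 (A) is deleted

# A deleted residue has no position in the entity sequence, so (by
# assumption) seq_id is left unset in both cases.
dif_a = SeqDifStandIn(seq_id=None, db_monomer="A", monomer=None,
                      details="deletion")
dif_b = SeqDifStandIn(seq_id=None, db_monomer="A", monomer=None,
                      details="deletion")

# The two records cannot be told apart although the deletions differ.
print(dif_a == dif_b)
```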

I looked at examples in the PDB and noticed the following:

- In [9zrl](https://files.rcsb.org/header/9zrl.cif), there are deletions, which are handled by setting `_struct_ref_seq_dif.pdbx_seq_db_seq_num`. I think this makes a lot of sense and should also be done in python-modelcif (and python-ihm).
- In [9z4e](https://files.rcsb.org/header/9z4e.cif) there are insertions which look like the ones produced by python-modelcif. So I think that part is handled well.

I have attached a small test script ([testing_insertion.py.gz](https://github.com/user-attachments/files/25683047/testing_insertion.py.gz)) that checks the insertions and deletions and whether mismatches are recognized.

So to sum up, here is what I think needs to be done (open for discussion of course):

- [ ] Update documentation of `modelcif.reference.SeqDif` to document how insertions and deletions can be defined
- [ ] Update `modelcif.reference.SeqDif` to also allow the definition of a `db_seq_id` variable to set `_struct_ref_seq_dif.pdbx_seq_db_seq_num` for deletions
- [ ] Update the checks for alignment validity (or at least display a warning that they are skipped if there are insertions or deletions)
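
As a rough illustration of the last two items, here is a sketch of how a validity check could keep working once deletions carry a reference position. It is not the python-ihm or python-modelcif implementation: `SeqDifSketch`, its `db_seq_id` field, and `check_alignment` are hypothetical names, residues are one-letter codes for brevity, and the alignment is assumed to start at position 1 in both sequences.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SeqDifSketch:
    """Hypothetical record; db_seq_id is the proposed new field."""
    seq_id: Optional[int]      # 1-based model position (None for a deletion)
    db_seq_id: Optional[int]   # 1-based reference position (None for an insertion)
    db_monomer: Optional[str]  # reference residue (None for an insertion)
    monomer: Optional[str]     # model residue (None for a deletion)
    details: Optional[str] = None


def check_alignment(ref_seq: str, model_seq: str,
                    difs: List[SeqDifSketch]) -> List[str]:
    """Walk both sequences and return a list of problems (empty if OK)."""
    by_ref = {d.db_seq_id: d for d in difs if d.db_seq_id is not None}
    inserts = {d.seq_id: d for d in difs if d.details == "insertion"}
    problems = []
    i = j = 0  # 0-based indices into ref_seq and model_seq
    while i < len(ref_seq) or j < len(model_seq):
        ins = inserts.get(j + 1)
        if ins is not None and j < len(model_seq):
            # Insertion: consumes a model residue only
            if ins.monomer != model_seq[j]:
                problems.append(f"insertion at model {j + 1} is wrong")
            j += 1
            continue
        if i >= len(ref_seq):
            problems.append(f"model residue {j + 1} is unaligned")
            j += 1
            continue
        dif = by_ref.get(i + 1)
        if dif is not None and dif.details == "deletion":
            # Deletion: consumes a reference residue only
            if dif.db_monomer != ref_seq[i]:
                problems.append(f"deletion at reference {i + 1} is wrong")
            i += 1
            continue
        if j >= len(model_seq):
            problems.append(f"reference residue {i + 1} is unaligned")
            i += 1
            continue
        if dif is not None:
            # Declared point difference: both sides must match the record
            if dif.db_monomer != ref_seq[i] or dif.monomer != model_seq[j]:
                problems.append(f"SeqDif at reference {i + 1} is wrong")
        elif ref_seq[i] != model_seq[j]:
            problems.append(
                f"undeclared mismatch at reference {i + 1} / model {j + 1}")
        i += 1
        j += 1
    return problems


ref = "ACDEF"
model = "AGCEF"  # G inserted at model position 2, reference D (3) deleted
difs = [
    SeqDifSketch(seq_id=2, db_seq_id=None, db_monomer=None, monomer="G",
                 details="insertion"),
    SeqDifSketch(seq_id=None, db_seq_id=3, db_monomer="D", monomer=None,
                 details="deletion"),
]
print(check_alignment(ref, model, difs))
print(check_alignment(ref, "AGCEY", difs))
```

With the reference position available, the walk can skip deleted reference residues and inserted model residues explicitly, so an undeclared mismatch elsewhere in the alignment is still reported instead of the whole check being disabled.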

Side note: in PDB-IHM, do you add SeqDif details as "conflict" or "variant" or something else for point mutations? We could do the same in ModelArchive or just keep it empty since we usually do not really know where the mismatches come from (often enough the reference and model sequences just deviated over time).

Handling of insertions and mismatches in modelcif.reference.Alignment #54