Handling of insertions and mismatches in modelcif.reference.Alignment #54

@gtauriello

I am running into issues with a set of models for ModelArchive where the alignment of the modeled sequence to the reference sequence contains insertions and deletions.

I noticed that one can use details="insertion" and details="deletion" when defining ihm.reference.SeqDif but:

  1. This is not actually documented anywhere as far as I can tell (and so I am not sure if I am using this correctly).
  2. Deletions carry no information about where they occur, since ihm.reference.SeqDif only stores the position in the entity sequence, and a deleted residue has no position there.
  3. The checks for the validity of the alignment are disabled if any SeqDif object has details="insertion" or details="deletion". Having these checks would be very helpful, since dictionary validation cannot catch any issue that relates to the reference sequence.
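
To make point 2 concrete, here is a minimal, self-contained sketch. It does not use python-ihm; `SeqDifSketch`, its `db_seq_id` field and the `locatable` helper are hypothetical names made up for illustration, and the residue and position values are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SeqDifSketch:
    """Hypothetical stand-in for a sequence-difference record (not the ihm class)."""
    seq_id: Optional[int]       # position in the entity (model) sequence, if any
    db_monomer: Optional[str]   # residue in the reference, None for an insertion
    monomer: Optional[str]      # residue in the entity, None for a deletion
    details: Optional[str] = None
    db_seq_id: Optional[int] = None  # hypothetical: position in the reference sequence


def locatable(dif: SeqDifSketch) -> bool:
    """True if the record says where the difference sits in at least one sequence."""
    return dif.seq_id is not None or dif.db_seq_id is not None


# A deleted residue has no entity position, so without a reference position
# the record cannot say where the deletion is.
deletion_without_db_pos = SeqDifSketch(
    seq_id=None, db_monomer="ALA", monomer=None, details="deletion")

# With a reference position (invented value), the deletion becomes locatable.
deletion_with_db_pos = SeqDifSketch(
    seq_id=None, db_monomer="ALA", monomer=None, details="deletion", db_seq_id=42)
```

An insertion has the opposite shape: it has an entity position but no reference residue, which is why the existing entity-side position is enough for it.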

I looked at examples in the PDB and noticed the following:

  • In 9zrl, there are deletions, which are handled by setting _struct_ref_seq_dif.pdbx_seq_db_seq_num. I think this makes a lot of sense and should also be done in python-modelcif (and -ihm).
  • In 9z4e there are insertions which look like the ones produced by python-modelcif. So I think that part is handled well.

I have attached a small test script (testing_insertion.py.gz) that exercises insertions and deletions and checks whether mismatches are recognized.

So to sum up, here is what I think needs to be done (open for discussion of course):

  • Update documentation of modelcif.reference.SeqDif to document how insertions and deletions can be defined
  • Update modelcif.reference.SeqDif to also allow the definition of a db_seq_id variable to set _struct_ref_seq_dif.pdbx_seq_db_seq_num for deletions
  • Update the checks for alignment validity so that they also work with insertions and deletions (or at least display a warning that the check is skipped when any are present)
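
As a rough illustration of the third point, an alignment check could remain active in the presence of insertions and deletions if both kinds of position are known. The sketch below is plain Python under stated assumptions, not the check implemented in python-modelcif or python-ihm: `check_alignment` and its parameters are hypothetical, sequences are one-letter strings, insertions are given as 1-based entity positions and deletions as 1-based reference positions.

```python
def check_alignment(ref_seq, entity_seq, insertions=(), deletions=(),
                    db_begin=1, entity_begin=1):
    """Walk both sequences and return mismatches not explained by indels.

    insertions: 1-based entity positions that are absent from the reference.
    deletions:  1-based reference positions that are absent from the entity.
    Returns a list of (db_pos, entity_pos, db_residue, entity_residue).
    """
    insertions, deletions = set(insertions), set(deletions)
    mismatches = []
    i, j = db_begin, entity_begin  # 1-based cursors into reference and entity
    while i <= len(ref_seq) or j <= len(entity_seq):
        if j <= len(entity_seq) and j in insertions:
            j += 1  # extra entity residue, nothing to compare against
        elif i <= len(ref_seq) and i in deletions:
            i += 1  # reference residue missing from the entity
        elif i <= len(ref_seq) and j <= len(entity_seq):
            if ref_seq[i - 1] != entity_seq[j - 1]:
                mismatches.append((i, j, ref_seq[i - 1], entity_seq[j - 1]))
            i += 1
            j += 1
        else:
            raise ValueError("aligned ranges differ in length after "
                             "applying insertions and deletions")
    return mismatches


# Invented example: reference residue 3 (T) is deleted, entity residue 5 (G)
# is inserted, and the last residue is a point mismatch (K vs R).
ref = "MKTAYIAK"
entity = "MKAYGIAR"
result = check_alignment(ref, entity, insertions=[5], deletions=[3])
```

Here `result` contains only the final K/R mismatch, which is the kind of discrepancy the check could then compare against the declared point-mutation SeqDif records. Note that the deletion can only be skipped because its reference position is supplied, which ties this point to the proposed db_seq_id.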

Side note: in PDB-IHM, do you add SeqDif details as "conflict" or "variant" or something else for point mutations? We could do the same in ModelArchive or just keep it empty since we usually do not really know where the mismatches come from (often enough the reference and model sequences just deviated over time).
