[Unexpected behaviour] Unexpected behaviour when parsing mmCIF files #269

Open
Croydon-Brixton opened this issue Feb 16, 2023 · 5 comments

Comments

@Croydon-Brixton

Thank you for providing pdbfixer.

I was using it to fix various protein structures and noticed the following unexpected behaviour (including the code to reproduce):

  • PDBFixer reports a different number of chains, and different chain names, depending on whether I load the same structure from a .cif or a .pdb file.
  • This doesn't happen for other parsers, for example the biotite parser.

=> Does this suggest that .cif files might not be parsed correctly with respect to chain names?

Code to reproduce:

from pdbfixer import PDBFixer

print("=== As loaded from CIF file ===")
fixer = PDBFixer(filename="2rb2.cif")
print("Chains:", [chain.id for chain in fixer.topology.chains()])
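# Print the Topology summary (chain, residue, atom and bond counts) rather than the bound `atoms` method.
print(fixer.topology)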

print("=== As loaded from PDB file ===")
fixer = PDBFixer(filename="2rb2.pdb")
print("Chains:", [chain.id for chain in fixer.topology.chains()])
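# Print the Topology summary (chain, residue, atom and bond counts) rather than the bound `atoms` method.
print(fixer.topology)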

print("==================================")
# As comparison parse with biotite:
import biotite.structure.io as bsio
import biotite.structure as bs

print(" === COMPARISON: Biotite from CIF ===")
biotite_from_cif = bsio.load_structure("2rb2.cif")
print(biotite_from_cif[:3], f"\nAtoms: {len(biotite_from_cif)}", f"\nChains: {bs.get_chains(biotite_from_cif)}")

print(" === COMPARISON: Biotite from PDB ===")
biotite_from_pdb = bsio.load_structure("2rb2.pdb")
print(biotite_from_pdb[:3], f"\nAtoms: {len(biotite_from_pdb)}", f"\nChains: {bs.get_chains(biotite_from_pdb)}")
@Croydon-Brixton
Author

EDIT: Looking further, it seems that .cif files provide two chain IDs (_atom_site.auth_asym_id and _atom_site.label_asym_id), and the PDBFixer parser prefers a different one than the biotite parser does.

Is there a way to change which field PDBFixer reads the chain ID from by default, so that it matches the chain ID provided in PDB files?
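
To see how the two columns can disagree, here is a minimal sketch that reads the _atom_site loop from mmCIF text and lists the distinct chain IDs in each column. It uses only the standard library and is not how PDBFixer or biotite parse files. The helper names (`read_atom_site`, `distinct_chain_ids`) and the toy data are hypothetical, invented for illustration rather than taken from 2rb2, and the parser assumes a simple whitespace-separated loop with no multi-line or space-containing values.

```python
# Toy mmCIF fragment (made up for illustration, not taken from 2rb2):
# the water has its own label_asym_id but shares the protein's auth_asym_id.
TOY_CIF = """
data_TOY
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_asym_id
ATOM   1 N  MET A A
ATOM   2 CA MET A A
HETATM 3 O  HOH B A
#
"""


def read_atom_site(cif_text):
    """Collect the rows of the _atom_site loop as a list of dicts.

    Assumption: values are whitespace-separated and never contain spaces.
    """
    headers, rows = [], []
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site."):
            # Column declaration, e.g. "_atom_site.auth_asym_id".
            headers.append(line.split(".", 1)[1])
        elif headers:
            # The loop ends at a blank line, "#", or the next category/loop.
            if not line or line.startswith(("#", "_", "loop_", "data_")):
                break
            rows.append(dict(zip(headers, line.split())))
    return rows


def distinct_chain_ids(rows, column):
    """Distinct values of one chain-ID column, in order of first appearance."""
    return list(dict.fromkeys(row[column] for row in rows))


rows = read_atom_site(TOY_CIF)
print("label_asym_id chains:", distinct_chain_ids(rows, "label_asym_id"))
print("auth_asym_id chains: ", distinct_chain_ids(rows, "auth_asym_id"))
```

On this toy input the label_asym_id column names two chains while auth_asym_id names one, which is the kind of mismatch described above.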

@swails
Contributor

swails commented Feb 16, 2023

From the code, it seems that the parser chooses between the label_asym_id and auth_asym_id columns based on which one specifies the larger number of distinct chains.

However, the auth_asym_id column is the one intended to align with what's found in the published literature (and, in my experience, the corresponding PDB file). It is also a mandatory data item, so is guaranteed to always be in a (valid) PDBx/mmCIF file.

By contrast, I've found from recent investigation that the _atom_site.label_* fields are used primarily as internal relational keys between the different sections of the PDBx/mmCIF file (e.g., to map data between sections, such as anisotropic temperature factors to atoms). For many "unimportant" atoms, like solvent and ions, no effort is made to assign meaningful values to these fields.

I personally think the OpenMM PDBx/mmCIF parser should stick to the auth_* fields where possible.
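
The two policies discussed here can be contrasted in a short sketch. This is not the actual OpenMM or PDBFixer code: both function names are hypothetical, the tie-breaking rule in the first function is an assumption, and the per-atom ID lists are toy values.

```python
def pick_by_most_chains(label_ids, auth_ids):
    """Heuristic as described in this thread: use whichever column
    names more distinct chains. Assumption: ties go to auth_asym_id."""
    if len(set(label_ids)) > len(set(auth_ids)):
        return "label_asym_id"
    return "auth_asym_id"


def pick_auth_when_present(label_ids, auth_ids):
    """Alternative policy: prefer auth_asym_id whenever it holds real values.

    "?" and "." are the mmCIF placeholders for unknown and inapplicable.
    """
    placeholders = {"?", "."}
    if auth_ids and not set(auth_ids) <= placeholders:
        return "auth_asym_id"
    return "label_asym_id"


# Toy per-atom chain IDs: two protein atoms and one water that has its own
# label_asym_id but shares the protein's auth_asym_id.
label_ids = ["A", "A", "B"]
auth_ids = ["A", "A", "A"]

print(pick_by_most_chains(label_ids, auth_ids))     # label_asym_id
print(pick_auth_when_present(label_ids, auth_ids))  # auth_asym_id
```

With these inputs the first policy yields two chains and the second yields one, so the same file can produce different chain lists depending on the policy.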

@Croydon-Brixton
Author

Thank you for the clarification @swails, this is very helpful.

I agree with you that it would make sense for the default behaviour to stick to the auth_* fields when possible.
In this case, it seems we would simply need to delete the following lines.

@peastman
Member

See #194 and #195. We had to make it work that way because neither field consistently identifies chains in all files.

The presence of duplicate auth_ and label_ fields in PDBx/mmCIF is a mess that causes lots of problems. They don't get used consistently in all files. The documentation on them is ambiguous and sometimes contradictory. It also sometimes conflicts with how they're used in files from RCSB.

@swails
Contributor

swails commented Feb 17, 2023

Interesting... that's too bad. :(
