Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated Solvation Shell Analysis #227

Open
wants to merge 115 commits into
base: develop
Choose a base branch
from

Conversation

cadeduckworth
Copy link
Contributor

No description provided.

cadeduckworth and others added 30 commits September 15, 2022 10:27
* separating functionality for use with _single_universe without iterating frames
* retaining functionality for use with _single_frame
* try/except pattern condensed to one run method
* single run method implemented for _single_universe and _single_frame
* functionality of original run method retained, while removing frame iteration for _single_universe usage
* progress bar still active when using _single_universe, but no frame iteration occurs
directory_paths: finds MDPOW project data and returns pandas DataFrame
directory_iteration: takes pandas DataFrame or csv and iterates automated_dihedral_analysis over MDPOW project directories
automated_dihedral_analysis: group of functions that includes automation scripts for DihedralAnalysis, automatic dihedral atom group selection, seaborn violin plots function, and other required functions
Future changes will eliminate redundancies following initial testing
…class to test possible ways to call automation-related functions
removed raise
added docstrings to describe use of NotImplementedError
responded with reason for removing _conclude_universe()
next step: updating html documentation following successful testing
fixed expected indent block error
removed raise
removed _conclude_universe()
…se automation for DihedralAnalysis

automated_dihedral_analysis.py
directory_iteration.py
directory_paths.py
…ted-dihedral-analysis

updating with changes for backend and testing from PR #216 ensemble_run_update
*incomplete, pushing to test
*source and html doc additions
*changed all instances of datadir to dirname for consistency with DihedralAnalysis
*all changes pertain to new workflows modules
@codecov
Copy link

codecov bot commented Jan 8, 2023

Codecov Report

Merging #227 (1f62fa6) into develop (cafb300) will decrease coverage by 1.45%.
The diff coverage is 0.00%.

❗ Current head 1f62fa6 differs from pull request most recent head 61f53b9. Consider uploading reports for the commit 61f53b9 to get more accurate results

@@             Coverage Diff             @@
##           develop     #227      +/-   ##
===========================================
- Coverage    80.15%   78.70%   -1.45%     
===========================================
  Files           15       16       +1     
  Lines         1900     1935      +35     
  Branches       291      295       +4     
===========================================
  Hits          1523     1523              
- Misses         286      321      +35     
  Partials        91       91              
Impacted Files Coverage Δ
mdpow/workflows/solvations.py 0.00% <0.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@orbeckst orbeckst mentioned this pull request Jan 16, 2023
@orbeckst
Copy link
Member

orbeckst commented Apr 4, 2023

Lots of conflicts to resolve, @cadeduckworth !

@cadeduckworth
Copy link
Contributor Author

@orbeckst
The following is a brief overview of the new manual EnsembleAnalysis workflow based on the changes I made to the original SolvationAnalysis: (the changes to mdpow.analysis.solvation are in the most recent commit`)

I made changes that allow a user to input a max distance, and obtain all of the solute-solvent interactions within that distance (distance and atom/atomgroup info included), per molecule, per solvent, per interaction, per lambda.

This is a very rough draft, especially for the plot, but the workflow, results DataFrame, and FacetGrid get the main idea across for what I am going for.

SAMPL9 Data, FF = GAFF, Molecule = S08

Workflow:
Screen Shot 2023-04-05 at 12 37 38 PM

Results DF: (it might be better to break the *_ix lists up into more columns)
Screen Shot 2023-04-05 at 12 37 55 PM

Figure (FacetGrid): (please excuse the quality, I am trying to work out the kinks in the .map() function)
Screen Shot 2023-04-05 at 12 30 41 PM

@orbeckst
Copy link
Member

orbeckst commented Apr 5, 2023

The manual workflow as shown is a good mini tutorial and looks already very good.

Presenting the data will be more of a challenge—the distance KDEs are naturally cut off at the cutoff, which makes it look a bit weird/incomplete. And once the solute disappears, the N becomes quite big and swamps the small numbers. Perhaps try a log plot?? Or normalize??? Also, are the large numbers compatible with what you get for a sphere of radius with the cutoff at the density of water or octanol/other solvent?

@orbeckst
Copy link
Member

orbeckst commented Apr 5, 2023

The distance histograms are interesting. You can see how molecules start to overlap with the disappearing solute.

@cadeduckworth
Copy link
Contributor Author

cadeduckworth commented Apr 6, 2023

@orbeckst
Although there is a large step size specified, from a theoretical perspective I interpreted the violins as a stepwise visual contribution to confirmation that things are going according to plan in the context of turning off intermolecular interactions. I might be thinking about it wrong, but the solute or solvent is essentially being electrostatically and otherwise alchemically removed from the system which is positively indicated by the increased equilibration of solvent molecules, right?

On that note, I think seeing that there is not much change when turning of Coulombic interactions for neutral solute/solvent is a good indication, and then seeing the gradual increase of number of solvent molecules at a decreased distance when turning off VDW in that sense is also a good indication of sufficient sampling, IFF, this holds up for other cases with more frames included.

Perhaps I should run a simulation with one of the charged molecules that was excluded for logP calculations from one of the SAMPL sets and see what happens if I can get GROMACS to comply?

EDIT: paragraph spacing, wording

After thought: a nearest neighbors calculation or comparison of individual lambdas to a population distribution could help later in identifying when some systems are not behaving?

@cadeduckworth
Copy link
Contributor Author

@orbeckst

I will look into approximately homogenous equilibration of solvents to try to confirm the data for more completely sampled datasets when I get the chance (those lambdas for VDW closer to 1, when the solute is essentially removed). I apologize if I did not use the correct terms there, it has been a while since I covered equilibrium chemistry, but I think I understand what you are getting at.

@cadeduckworth
Copy link
Contributor Author

@orbeckst

I ran a more complete analysis of the same SAMPL9 GAFF molecule (S08) with a step size of 500 and cleaned up the plots a bit.
This is the product of what I have been playing with using cairosvg and svgutils instead of sns.FacetGrid, which, outside of .catplots, when initialized directly, pretty much requires that axes be shared for the same data being analyzed, which is exactly the same as sharing axes in .catplots.

So, this is the SVG converted to pdf. The width ratios are slightly off, which affects the placement of the Mol object, but I cleaned up the scale on the y-axes as well as the height/width aspect ratio. Fixing the axes/scales and including much more data seems to give a better look at what is going on.

If we do eventually go with a stacked plot scheme, then I will reserve less of the right side and reduce it to a single Mol object and legend to make it look better.

S08.pdf

@cadeduckworth
Copy link
Contributor Author

From earlier this morning before making changes to SolvationAnalysis:

Using the original SolvationAnalysis class, which summed the number of solvent molecules within a specified first and second cutoff distance, separately, I made a time series of the number of solvent molecules.
The concept does not make much sense since each lambda simulation starts from the same relaxed equilibrated MD sim(?), but if I can map the time series to the lambda values then it might be useful as an option later on for something. Note: this is using very under sampled data, large step size, but also shows 95% CI

Screen Shot 2023-04-06 at 12 00 55 AM

@orbeckst
Copy link
Member

orbeckst commented Apr 7, 2023

For the S08 data, the number of solvent atoms is initially 60,000 — there aren't that many waters in the box, are they?

Can you do the following calculation: Given the standard density of water at the simulation conditions, how many water molecules would you expect to find in a volume roughly equivalent to (1) the solute, (2) the solute with its surface blown up by 4 Å? A rough physicist estimate will suffice as a start, e.g., approximating the molecule as a cylinder or sphere...

Then compare the estimate to the numbers from solvation analysis.

@@ -56,21 +56,38 @@ def __init__(self, solute: EnsembleAtomGroup, solvent: EnsembleAtomGroup, distan
self._dists = distances

def _prepare_ensemble(self):
self._col = ['distance', 'solvent', 'interaction',
'lambda', 'time', 'N_solvent']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you now returning distances? If so, better write a new class and leave the old one as-is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orbeckst
I agree, because they are doing different things. Can they both exist as separate classes within the same module?

@orbeckst
Copy link
Member

orbeckst commented Apr 7, 2023

I am not sure if the time series contain more interesting information than the averages or distributions — at least in equilibrium. Time series in equilibrium are good as a diagnostic to check if the system is likely sampling equilibrium: if the fluctuations are around a stable mean then it might be equilibrium...

@orbeckst
Copy link
Member

orbeckst commented Apr 7, 2023

I don't think that we need to add the molecular structure next to the solvation plots unless you want to highlight specific atoms. Otherwise I'd expect a molecular structure to be included in an overview panel and the reader can use the ID to cross-reference.

@cadeduckworth
Copy link
Contributor Author

@orbeckst
I am working on some calculations/estimates, but for clarification, the 'counts' bar plot is counting each solvent atom that is within the cutoff distance near any solute atom non-exclusively, so multiple solvent atom - solute atoms distances are being recorded and counted, meaning that solvent atom A near solute atom B and solute atom C are counted separately. I believe this is why the numbers are so high. There could be an easy way to count each solvent atom only once with a minor adjustment to the script. Additionally, we could still record all of the data, and only count each solvent molecule once in the plot by only counting the unique resids.

@orbeckst
Copy link
Member

orbeckst commented Apr 8, 2023

Each atom should only be counted once. There's a single closest interaction between a solvent atom and the solute (barring coincidental equidistant triangles).

@orbeckst
Copy link
Member

@cadeduckworth is this something you'd want to complete for a MDPOW paper?

Can you resolve the conflicts so that one can look at the changes in the context of the latest version of the code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants