-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring the way data is returned in panedr #33
Conversation
I have now added a test that compares the dictionary of arrays that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start. have a look at comment for some changes.
panedr/panedr.py
Outdated
Does not roll the reading position back. | ||
""" | ||
magic = data.unpack_int() | ||
return magic == -7777777 | ||
|
||
|
||
def edr_to_df(path, verbose=False): | ||
def read_edr(path, verbose_set=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the change from verbose
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt weird about verbose=verbose
, but yeah, I guess there is not really a reason not to put this 😅
panedr/panedr.py
Outdated
|
||
|
||
def edr_to_df(path: str, verbose: bool = False): | ||
import pandas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is pandas now an optional dependency in requirements.txt etc? If so I would guard this with a try: except
similar to
try:
import pandas
except ImportError:
raise ImportError("""ERROR --- pandas was not found!
pandas is required to use the `.edr_to_df()` functionality.
try installing it using pip eg:
pip install pandas """)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'll make a note to add a test for raising this error as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure what the best way to make pandas optional is. I have now done this by removing pandas from requirements.txt and adding a section under [extras] in setup.cfg. panedr can now be installed with pandas by running
pip install -e .[pandas]
Please let me know if this is not the best way to do these things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather do things a little bit differently. The way you do it here will break the user experience for the standalone case. A user who do not use mdanalysis will install panedr and won't have pandas to use the main function out of the box.
Instead, I would create 2 packages: the default one that depends on pandas and a "lite" one for downstream integrators who want to minimise dependencies.
I don't know how to do that, though...
panedr/panedr.py
Outdated
|
||
|
||
def edr_to_dict(path: str, verbose: bool = False): | ||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an optional dependency? I think its probably safe to make numpy
compulsory for panedr. If we do make it compulsory you can import this at the top of the file otherwise use a guard like the one for pandas mentioned in my other comment.
Thoughts @jbarnoud?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is reasonable to make numpy a compulsory dependency. The current users have to install it already because of pandas anyway; the new users will likely use it as well.
tests/test_edr.py
Outdated
@@ -163,6 +163,14 @@ def _assert_progress_range(self, progress, dt, start, stop, step): | |||
assert ref_line == progress_line | |||
|
|||
|
|||
def test_edr_to_dict(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only really tests that it returns the same as edr_to_df
so perhaps a more descriptive name like edr_to_dict_matches_edr_to_df
. I know its a bit verbose but clarity is always good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no issue with the tests depending on pandas. Optional tests are usually a bad idea because they allow issues to fall through the cracks. It is justifiable in some cases, though, like when a dependency is very specialised and difficult to install.
panedr/panedr.py
Outdated
@@ -75,7 +75,7 @@ | |||
Enxnm = collections.namedtuple('Enxnm', 'name unit') | |||
ENX_VERSION = 5 | |||
|
|||
__all__ = ['edr_to_df'] | |||
__all__ = ['edr_to_df', 'edr_to_dict'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a reason not to add read_edr
here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw read_edr
as merely providing data for the two user-exposed functions edr_to_df
and edr_to_dict
, so that the user should never need to call read_edr
itself directly. I am not sure if the return values of this function are of use to a user, but yeah, that's not really a reason to add it here.
for idx, name in enumerate(all_names): | ||
energy_dict[name] = np.array( | ||
[all_energies[frame][idx] for frame in range(len(times))]) | ||
return energy_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure that the "Time" key is in. I expect it to be, but I do not remember exactly how I treated it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Time" is part of all_names
and all_energies
as returned by read_edr
. This is done specifically by
all_names = [u'Time'] + [nm.name for nm in edr_file.nms]
[...]
all_energies.append([frame.t] + [ener.e for ener in frame.ener])
Codecov Report
@@ Coverage Diff @@
## master #33 +/- ##
==========================================
+ Coverage 81.99% 82.84% +0.85%
==========================================
Files 2 2
Lines 261 274 +13
==========================================
+ Hits 214 227 +13
Misses 47 47
Continue to review full report at Codecov.
|
try: | ||
import pandas | ||
except ImportError: | ||
raise ImportError("""ERROR --- pandas was not found! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a test for that, which means a pipeline that does not install pandas and a test that runs the function and asserts the exception is raised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BFedder you should be able to just use monkeypatch here to do this: https://github.com/MDAnalysis/mdanalysis/blob/3e249fe7173e68bb61d4ad7a2dfd316014760410/testsuite/MDAnalysisTests/utils/test_datafiles.py#L28-L38
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added this test now
@@ -1,2 +1,2 @@ | |||
pandas | |||
numpy>=1.19.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IAlibay is bumping the version to 1.20.0 for MDAnalysis in MDAnalysis/mdanalysis#3737
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't impact things here so it wouldn't think it matters. If you want to raise numpy to 1.20 you'll have to drop py3.6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should we prioritise here? Staying in-step with MDAnalysis' dependencies or keeping Py3.6?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't impact MDAnalysis if panedr supports a wider range of python & numpy versions, so IMHO it's fine to just leave it as-is.
I'm thinking if there are no further comments here, it might make sense to merge this PR as is so that #42 can be merged soon as well. I will then add docstrings/type hints in a separate PR. What do you think @jbarnoud @hmacdope @IAlibay? |
Works with me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the one thing on my end - I'm happy to move "further cleaning up" (docstring, type hints) to a future PR if this means we can keep things progressing.
@hmacdope can you check if everything you requested has been addressed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Addressing #25
I have changed the way data than panedr returns the energy data that it reads from EDR files.
Old behaviour
edr_to_df()
opens the file, parses its binary contents, populates lists with the energy data, and assembles and returns a Pandas DataFrame.New behaviour
The parsing of the EDR file is now done by a new
read_edr()
function.read_edr()
returns the lists that were previously generated byedr_to_df()
as an intermediate step.The data can now be returned as a dictionary of NumPy arrays with
edr_to_dict()
. Alternatively, returning the data as a Pandas DataFrame is possible viaedr_to_df()
. Both of these functions callread_edr()
.The old behaviour is maintained, users can still call
edr_to_df()
and obtain the results they expect.To Do