re-implement external objects #86
I'd like to snapshot parquet/arrow data. You can get a meaningful, human-readable diff via polars/pandas assert_frame_equal and such. So I think it would be nice to get a load/dump function and/or a diff function hook so I can load/dump as parquet but compare using assert_frame_equal. For binary images we’ve also implemented a mask where a heatmap of the diff is saved to a failure file ( |
I agree your current suggestion would be a nice improvement of the current implementation. What do you think of adding

    import pickle

    from inline_snapshot import external, outsource, snapshot
    from pydantic import BaseModel

    class CustomModel(BaseModel):
        name: str
        age: int

    def test_model():
        model = CustomModel(name="foo", age=42)
        # encode/decode/suffix are the parameters proposed in this comment
        assert outsource(model, encode=pickle.dumps, decode=pickle.loads, suffix=".pkl") == snapshot(external("76a970z09f413-56d9-erfd-095aerdv60a1f.pkl"))

I believe it could provide more explicit error traces for non-textual data. Additionally, would it be relevant to add a

    assert pd.assert_frame_equal(outsource(df, encode=DataFrame.to_csv, decode=pd.read_csv, suffix=".csv").value, snapshot(external("76a970z09f413-56d9-erfd-095aerdv60a1f.csv").value))

Thank you for your work. It tackles a pain point I had to implement several times in data science pipelines with barely satisfactory solutions... |
Thank you @Galsor and @adriangb for your feedback. Support for DataFrames seems to be a very much wanted feature (see also #67), but there are some problems with DataFrames.

    >>> DataFrame()==DataFrame()
    Empty DataFrame
    Columns: []
    Index: []

A DataFrame comparison returns a DataFrame, which leads to problems in the current logic of inline-snapshot (it expects a bool). The other problem is that @lucianosrp proposed the following in #67:

    def test_get_data():
        df = get_data()
        assert (df == snapshot()).all().all()

This might work, but it does not look very intuitive to me, though I do not have much experience with pandas. Maybe it is what someone who uses pandas wants to write. Please let me know. Another problem with this syntax is the following.

    # conftest.py
    from pandas import DataFrame
    from pandas.testing import assert_frame_equal

    def pytest_assertrepr_compare(config, op, left, right):
        # only called when both sides of the failing comparison are DataFrames
        if isinstance(left, DataFrame) and isinstance(right, DataFrame):
            try:
                assert_frame_equal(left, right)
            except AssertionError as e:
                return ["diff:"] + str(e).split("\n")

    # test module
    d1 = DataFrame({'col1': [True, True], 'col2': [True, False]})
    d2 = DataFrame({'col1': [True, True], 'col2': [True, True]})

    def test_foo():
        assert (d1 == d2).all().all()

I don't know if this would be ok for you. I don't like the first

I think there might be two ways how DataFrames could be integrated:
encode/decode load/dump

Yes, customization for specific types is also important. My current idea looks something like this:

    @register_external_handler(DataFrame)
    class DataFrameHandler:
        suffix = ".parquet"

        def load(filename: Path) -> DataFrame:
            ...

        def save(filename: Path, value: DataFrame):
            ...

This handler would be used every time you

some questions
@adriangb are you still talking about DataFrames? and which
@Galsor I don't know how this syntax could work. There are currently ways to use snapshots in custom functions (see #81 (comment)), but there is no way to access the value of the snapshot. I know that you want to access the |
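The handler idea above (a decorator that associates a type with a `suffix`, a `load` and a `save`) could be sketched roughly as follows. This is only an illustration using the standard library: `register_external_handler` here is a local stand-in written for this example, not an existing inline-snapshot API, and `find_handler`, `JsonHandler` and `roundtrip` are hypothetical names. A plain `dict` stored as JSON takes the place of a DataFrame stored as parquet.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical registry: maps a Python type to the handler class that stores it.
_HANDLERS = {}


def register_external_handler(value_type):
    """Class decorator registering a handler for value_type (illustrative only)."""
    def decorator(handler_cls):
        _HANDLERS[value_type] = handler_cls
        return handler_cls
    return decorator


def find_handler(value):
    """Return the first registered handler whose type matches the value."""
    for value_type, handler_cls in _HANDLERS.items():
        if isinstance(value, value_type):
            return handler_cls
    raise TypeError(f"no external handler registered for {type(value).__name__}")


@register_external_handler(dict)
class JsonHandler:
    # stands in for a DataFrame handler with suffix ".parquet"
    suffix = ".json"

    @staticmethod
    def load(filename: Path) -> dict:
        return json.loads(filename.read_text())

    @staticmethod
    def save(filename: Path, value: dict) -> None:
        filename.write_text(json.dumps(value, sort_keys=True))


def roundtrip(value, directory: Path):
    """Save the value with its handler and load it back from the same file."""
    handler = find_handler(value)
    target = directory / f"example{handler.suffix}"
    handler.save(target, value)
    return target, handler.load(target)


with tempfile.TemporaryDirectory() as tmp:
    path, loaded = roundtrip({"col0": [1, 2]}, Path(tmp))
    print(path.name, loaded)
```

The lookup by `isinstance` means a handler registered for a base class would also serve its subclasses, which may or may not be what the final design wants.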
Yes, I'm still talking about DataFrames. As you point out they are not trivially comparable. Hence why libraries like pandas ship testing helpers: https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_frame_equal.html. I would propose that you add a |
I think the compare logic can be viewed separately from load/store. The following example works with the current version of inline-snapshot and uses assert_frame_equal to show a diff. Here is a small wrapper for the DataFrame:

    from pandas import DataFrame

    class Wrapper:
        def __init__(self, *a, **ka):
            # accept either an existing DataFrame or the DataFrame constructor arguments
            if len(a) == 1 and len(ka) == 0 and isinstance(a[0], DataFrame):
                self.df = a[0]
            else:
                self.df = DataFrame(*a, **ka)

        def __repr__(self):
            return f"Wrapper({self.df.to_dict()!r})"

        def __eq__(self, other):
            if not isinstance(other, Wrapper):
                return NotImplemented
            return (self.df == other.df).all().all()

I extend pytest in conftest.py:

    from df_wrapper import Wrapper
    from pandas.testing import assert_frame_equal

    def pytest_assertrepr_compare(config, op, left, right):
        if isinstance(left, Wrapper) and isinstance(right, Wrapper):
            try:
                assert_frame_equal(left.df, right.df)
            except AssertionError as e:
                return str(e).split("\n")

the test:

    from pandas import DataFrame
    from inline_snapshot import snapshot
    from df_wrapper import Wrapper

    d1 = DataFrame({"col1": [1, 2], "col2": [3, 4]})

    def test_foo():
        assert Wrapper(d1) == snapshot(
            Wrapper({"col1": {0: 1, 1: 2}, "col2": {0: 3, 1: 5}})
            # I changed the last 4 to 5 in col2 to make the test fail
        )

and this is how it looks when the test fails.
This does not use

Do you want something like this? |
Yes I agree it can be handled separately. And yes I do want something like that! I think it can even be done without

    from pandas import DataFrame
    from pandas.testing import assert_frame_equal

    def pytest_assertrepr_compare(config, op, left, right):
        if isinstance(left, DataFrame) and isinstance(right, DataFrame) and op == "==":
            try:
                assert_frame_equal(left, right)
            except AssertionError as e:
                return str(e).split("\n")

Thank you for teaching me about |
Yes, but you get an error message which looks like this (which is from my previous comment):
You need the

    >>> bool(DataFrame()==DataFrame())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/frank/.pyenv/versions/dirty_equal_3.12/lib/python3.12/site-packages/pandas/core/generic.py", line 1577, in __nonzero__
        raise ValueError(
    ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This also means that you cannot use a DataFrame in a complex list/dict structure.

    >>> [DataFrame({"col0":[1]})]==[DataFrame({"col0":[2]})]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/frank/.pyenv/versions/dirty_equal_3.12/lib/python3.12/site-packages/pandas/core/generic.py", line 1577, in __nonzero__
        raise ValueError(
    ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Which would mean that you are not able to snapshot such structures. Using a Wrapper which provides an

Note: I have found several other ways to compare DataFrames https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.equals.html |
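To illustrate why an element-wise `==` breaks comparisons of nested structures, and why a wrapper with a bool-returning `__eq__` avoids the problem, here is a small pandas-free sketch. `Elementwise` is a made-up stand-in that only mimics the relevant DataFrame behaviour (comparison returns an object whose truth value raises); it is not pandas code, and `compares_cleanly` is a helper invented for this example.

```python
class Elementwise:
    """Stand-in for a DataFrame-like object: == returns an object, not a bool."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # element-wise comparison, like DataFrame == DataFrame
        return Elementwise(a == b for a, b in zip(self.values, other.values))

    def __bool__(self):
        raise ValueError("The truth value of an Elementwise is ambiguous.")

    def all(self):
        return all(self.values)


class Wrapper:
    """Wraps an Elementwise and reduces the comparison to a single bool."""

    def __init__(self, inner):
        self.inner = inner

    def __eq__(self, other):
        if not isinstance(other, Wrapper):
            return NotImplemented
        same_length = len(self.inner.values) == len(other.inner.values)
        return same_length and (self.inner == other.inner).all()


def compares_cleanly(left, right):
    """Return True if `left == right` can be turned into a bool without raising."""
    try:
        bool(left == right)
    except ValueError:
        return False
    return True


# list.__eq__ needs the truth value of each element comparison, so this fails:
print(compares_cleanly([Elementwise([1])], [Elementwise([2])]))
# with the wrapper the nested comparison is well defined:
print(compares_cleanly([Wrapper(Elementwise([1]))], [Wrapper(Elementwise([2]))]))
```

The same reasoning applies to dicts and other containers, since they also ask for the truth value of `==` on their elements.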
Doesn't look intuitive to me either. Is there any constraint related to the

    def test_df(mock_input):
        # Given
        expected_df = load_test_df_expected()
        # When
        df = custom_process(mock_input)
        # Then
        assert_frame_equal(df, expected_df)

And I would love to be able to write:

    def test_df(mock_input):
        # When
        df = custom_process(mock_input)
        # Then
        assert_frame_equal(outsource(df), snapshot())

But this involves

Wrapper solution

This implementation faces issues with the nasty

    import pandas as pd
    from datetime import timedelta

    data = {
        "category": pd.Categorical(["A", "B", "C"]),
        "string": ["foo", "bar", "baz"],
        "timedelta": [timedelta(days=1), timedelta(hours=2), timedelta(minutes=30)],
    }
    df = pd.DataFrame(data)

After --inline-snapshot=create it results in:

    def test_df():
        assert Wrapper(df) == snapshot(
            Wrapper(
                {
                    "category": {0: "A", 1: "B", 2: "C"},
                    "string": {0: "foo", 1: "bar", 2: "baz"},
                    "timedelta": {
                        0: Timedelta("1 days 00:00:00"),
                        1: Timedelta("0 days 02:00:00"),
                        2: Timedelta("0 days 00:30:00"),
                    },
                }
            )
        )

Where

    type(df.to_dict()["timedelta"][0])
    Out[8]: pandas._libs.tslibs.timedeltas.Timedelta

It generates NameErrors in the generated code due to the missing imports. |
I agree, currently using: df.to_dict("records") == snapshot() Which is more intuitive and produces a nice and readable dict. Suits perfectly for my case. |
One goal I have with inline-snapshot is to provide the ability to write custom functions which make use of the inline-snapshot logic.

    from df_wrapper import Wrapper
    from pandas.testing import assert_frame_equal
    from pandas import DataFrame
    from inline_snapshot import outsource, snapshot

    def assert_frame_equal(df, df_snapshot):
        # just to show that you can do some dynamic things with inline-snapshot
        if df.size > 5:
            # this case does not work currently, because outsource only works with str/bytes
            assert outsource(Wrapper(df)) == df_snapshot
        else:
            assert Wrapper(df) == df_snapshot

    def custom_process():
        return DataFrame({"col0": [1, 2]})

    def test_df():
        # When
        df = custom_process()
        # Then
        assert_frame_equal(df, snapshot(Wrapper({"col0": {0: 1, 1: 2}})))

This example is still missing some parameters, but I think that it should be possible to write an implementation which provides the same signature. Changing your imports would then be enough if you want to use snapshots with assert_frame_equal:

    #from pandas.testing import assert_frame_equal
    from inline_snapshot_pandas import assert_frame_equal

This might be the easiest way to adopt inline-snapshot for existing pandas tests.
I plan to generate import-statements in a future release, but this is not my priority now. |
I created a

You can find the code in #87. I'm interested in your feedback and whether something like this could work for you. Outsourcing of DataFrames does not currently work, but I think it could be implemented. A short example:

    from inline_snapshot._pandas import assert_frame_equal
    from inline_snapshot import snapshot
    from pandas import DataFrame

    def test_df():
        df = DataFrame({"col0": [1, 2]})
        # the second argument can be a snapshot,
        assert_frame_equal(df, snapshot())
        # and the generated code looks like this.
        assert_frame_equal(df, snapshot(DataFrame({"col0": {0: 1, 1: 2}})))
        # it can also be used without a snapshot
        assert_frame_equal(df, df) |
That looks really nice! Personally I don't particularly care about inline dataframes, I would only ever extern them. I'd also be interested in a more generic version of this because:
|
@adriangb could you provide me a code sample of what you are doing? I'm new to pandas/polars/... and want to understand how you are using these libraries and what are common testing patterns. Are these real (visual) images? |
I think I understand what you mean. You want to outsource images like png and show a visual diff (color different pixels red, for example). Am I right? I also looked at polars, and assert_frame_equal looks the same there. It should be possible to apply the same solution as for pandas. |
Yes, but I don't need that to be built into inline-snapshot. A callback where I print out the path to the file like I have now is fine. I do think a heatmap for comparing images is a good idea for your web UI. And yes polars can be implemented similar to pandas, but do you really want to have custom stuff in inline-snapshot for every library out there? |
No, I want to put it into an extra package. This merge-request (#87) is just for me to play around and for you to try it out and give feedback. My implementation for |
I guess my question is: will I be able to write my own |
Yes, extensibility is one of my goals. #85 will allow you to register custom repr implementations and overwrite the default repr() behavior of classes which you have no control over (because the classes are part of some other package). This implementation is just a bit more tricky because DataFrame.__eq__ does not return a bool. |
Greetings. First of all, thank you so much for the work here! It solves my main issue with snapshot tests in python, which was separating the expected outputs from the test code, even with small tests. The problem now is that current solution for outsourcing and external files does not really suit my needs* 😅 Right now, what I am doing is using *My setup involves expensive and long-to-run snapshot updates in a largeish codebase. To avoid running all of them, I am using |
My idea was to provide multiple format handlers. The user should then have control over how the data is written to the file (".png", ".txt", ".parquet", ...). A special handler for syrupy-like formats could also be possible. The reason why this issue took a bit longer is that I refactored the internal core in 0.15.0, which should allow a nice new external implementation. You should be able to replace a part of the snapshot with

The new external format will probably look like this

The protocol will be used to specify where the external object should be stored. This should work together with inline-snapshot-pandas:

    assert_frame_equal(df, external("uuid:.parquet"))

This should then generate a new uuid and write the dataframe in parquet format. Your use case could maybe look like this:

    assert your_value == external("uuid:.syrupy")

But I will have to look into how syrupy actually generates the files.
sorry #147 should solve this issue. |
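As a rough illustration of the `external("uuid:.parquet")` idea, the string could be read as a protocol, an optional name and a suffix, with a fresh uuid generated when the name is empty. The sketch below is an assumption based only on the two examples in the comment above; the exact format is not given there, and `ExternalSpec` and `parse_external` are hypothetical names, not part of inline-snapshot.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalSpec:
    protocol: str  # where the object is stored, e.g. "uuid"
    name: str      # file name without suffix
    suffix: str    # storage format, e.g. ".parquet"


def parse_external(spec: str) -> ExternalSpec:
    """Split '<protocol>:<name><suffix>' and fill in a new uuid if the name is empty."""
    protocol, sep, rest = spec.partition(":")
    if not sep or not protocol:
        raise ValueError(f"expected '<protocol>:<name><suffix>', got {spec!r}")
    stem, dot, ext = rest.rpartition(".")
    if dot:
        suffix = "." + ext
    else:
        # no suffix given: everything after the colon is the name
        stem, suffix = rest, ""
    if not stem:
        # assumed behaviour: an empty name means "create a new external object"
        stem = str(uuid.uuid4())
    return ExternalSpec(protocol, stem, suffix)


spec = parse_external("uuid:.parquet")
print(spec.protocol, spec.suffix, len(spec.name))
```

Keeping the suffix separate from the name would let a format handler be chosen by suffix alone, independent of where the protocol decides to store the file.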
current implementation
The current implementation of external objects has some drawbacks.
There were reasons why it was implemented the way it currently is, but some changes in the implementation of inline-snapshot could allow a different implementation now.
new implementation
The interface will not change very much. outsource(data) will still create external(name) objects, but the name of the external objects will now be a uuid.
The uuid/name of the file will not change. inline-snapshot will only change the content of the file when you run --inline-snapshot=fix.
These external files will be stored beside your tests. The snapshot will be stored in __inline_snapshot__/test_things/test_something/92f570b0-e411-46d3-bece-108aeaac6b1f.txt if the example above is from test_things.py.
inline-snapshot will report an error if it detects that the same external object is referenced twice in the same file. It might technically not be necessary, but it would make the behavior and implementation more complicated if it were allowed.
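The storage layout and the duplicate check described above can be sketched as follows. This is a minimal illustration, not the planned implementation: `external_path` and `duplicated_externals` are hypothetical helpers, and the regular expression is a simple stand-in for whatever source analysis inline-snapshot would actually use.

```python
import re
from collections import Counter
from pathlib import Path


def external_path(test_file: Path, test_name: str, name: str) -> Path:
    """Build <test dir>/__inline_snapshot__/<test module>/<test function>/<name>."""
    return test_file.parent / "__inline_snapshot__" / test_file.stem / test_name / name


# matches external("...") or external('...') and captures the name
_EXTERNAL_CALL = re.compile(r'external\(\s*["\']([^"\']+)["\']\s*\)')


def duplicated_externals(source: str) -> list[str]:
    """Return the external names that are referenced more than once in the source."""
    counts = Counter(_EXTERNAL_CALL.findall(source))
    return sorted(name for name, n in counts.items() if n > 1)


path = external_path(
    Path("tests/test_things.py"),
    "test_something",
    "92f570b0-e411-46d3-bece-108aeaac6b1f.txt",
)
print(path.as_posix())
```

Because the name is a uuid that never changes, the path only depends on where the test lives, which is what makes the folder structure sufficient for finding the data.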
I don't want to provide the option to name the external objects (at least for now) like outsource("text"*100, name="long_test") == snapshot(external("long_test.txt")). The folder structure should provide enough information to find your data.
An explicit name would have the following problems:
This is currently only an idea and feedback is very welcome.