
Adds a type change utility for append data keys in lib_tool #1932

Open

IvoDD wants to merge 1 commit into master from type-change-lib-tool-utility
Conversation

@IvoDD (Collaborator) commented Oct 18, 2024

This is done by adding C++-layer functionality to overwrite append data
keys, and by using a read + type change + overwrite sequence in the Python
layer (a sketch of that flow follows the change list below).

More specifically, this change:

  • Uses a LocalVersionEngine instead of an AsyncStore as the library
    tool's state
  • Exposes more Python bindings to allow iteration over APPEND_DATA
    keys with the library tool
  • Allows using normalization in the library tool, so append data
    keys can be overwritten with a custom dataframe
  • Provides the type change functionality by reading a dataframe, changing
    its type with pandas, and overwriting it
  • Adds an elaborate test verifying that iterating over and reading the
    append data linked list works with various overwrites
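
To make the intended flow concrete, here is a minimal sketch of the Python-layer usage, assuming `lib` is a NativeVersionStore (as in the tests below) and `sym` is a symbol with APPEND_DATA keys. The library-tool method names used here (`find_keys_for_id`, `read_to_dataframe`, `overwrite_append_data_with_dataframe`) and the column name are illustrative assumptions, not confirmed API from this PR:

import numpy as np
from arcticdb_ext.storage import KeyType  # import path is an assumption

lib_tool = lib.library_tool()

# Walk the symbol's APPEND_DATA keys, fix a column's dtype, and write back.
for key in lib_tool.find_keys_for_id(KeyType.APPEND_DATA, sym):  # name assumed
    df = lib_tool.read_to_dataframe(key)      # read the incomplete segment
    df["col"] = df["col"].astype(np.float64)  # type change done with pandas
    lib_tool.overwrite_append_data_with_dataframe(key, df)  # write back (name assumed)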

Reference Issues/PRs

What does this implement or fix?

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

lib.write(sym, get_df(1, 4, str), incomplete=True)
lib.write(sym, get_df(1, 5, np.int64), incomplete=True)

def read_append_data_keys_from_ref(symbol):
IvoDD (Collaborator, Author) commented:

These helper functions could be added to the library_tool API, but I decided against it because they would require frequent tweaks (e.g. if you want to load only the last 20 keys). Instead, after this is merged I'll include some examples in the lib tool docs (see the sketch below).
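
For illustration, the "load only the last 20" variant mentioned above could be as simple as the sketch below, reusing the read_append_data_keys_from_ref helper from this test; a production version would stop following the linked list after n keys instead of loading the whole chain first:

def read_last_n_append_data_keys(symbol, n=20):
    # Naive sketch: load the full chain via the helper above, then truncate.
    return read_append_data_keys_from_ref(symbol)[:n]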

@IvoDD force-pushed the type-change-lib-tool-utility branch from 47cd9ef to 36a1032 on October 18, 2024 at 08:01
@IvoDD force-pushed the type-change-lib-tool-utility branch from 36a1032 to 87731cb on October 18, 2024 at 08:26
@IvoDD marked this pull request as ready for review on October 18, 2024 at 08:29
const py::object &norm,
const py::object & user_meta) {
if (!std::holds_alternative<AtomKey>(key) || std::get<AtomKey>(key).type() != KeyType::APPEND_DATA) {
throw_error<ErrorCode::E_INVALID_USER_ARGUMENT>(fmt::format("Can only override APPEND_DATA keys. Received: {}", key));
Collaborator commented:

util::check?


def test_overwrite_append_data(object_and_mem_and_lmdb_version_store):
    lib = object_and_mem_and_lmdb_version_store
    if lib._lib_cfg.lib_desc.version.encoding_version == 1:
Collaborator commented:

Just checking that this is definitely zero-indexed and we aren't skipping on v1 encoding by mistake?

@@ -230,3 +232,92 @@ def iterate_through_version_chain(key):
assert len(keys_by_key_type[KeyType.TABLE_DATA]) == (num_versions-1) % 3 + 1
assert len(keys_by_key_type[KeyType.TOMBSTONE_ALL]) == num_versions // 3


def test_overwrite_append_data(object_and_mem_and_lmdb_version_store):
Collaborator commented:

Why all 3 stores? Why not use one of the _v1 fixtures to avoid the skip below?

lib.write(sym, get_df(1, 5, np.int64), incomplete=True)

def read_append_data_keys_from_ref(symbol):
    nonlocal lib_tool
Collaborator commented:

You don't need the nonlocal, do you? Given that you aren't assigning to lib_tool.
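
(As a generic illustration of the scoping rule, independent of this PR's code: a nested function may freely read a name from the enclosing scope; nonlocal is only required when it rebinds that name.)

def outer():
    counter = 0

    def read_only():
        return counter      # reading a closed-over name needs no nonlocal

    def rebind():
        nonlocal counter    # required only because we assign to it below
        counter += 1

    rebind()
    return read_only()      # returns 1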

# We assert that types are as we wrote them and we can't read or compact because of type mismatch
append_keys = read_append_data_keys_from_ref(sym)
assert len(append_keys) == 3
# Different storages use either fixed or dynamic strings
Collaborator commented:

Ah, this is a good reason for the different backends 👍

# And test that compaction now works with the new types
lib.compact_incomplete(sym, append=True, convert_int_to_float=False, via_iteration=False)
assert read_append_data_keys_from_ref(sym) == []
assert_frame_equal(lib.read(sym).data, get_df(15, 0, np.int64))
Collaborator commented:

Test for appending more incompletes (with the right types) and compacting?
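
(Something like the following sketch would cover that, reusing only calls already present in this test; the get_df arguments are illustrative since its full signature isn't shown here.)

# Append more incompletes with the now-matching int64 type and compact again.
lib.write(sym, get_df(16, 0, np.int64), incomplete=True)
lib.write(sym, get_df(17, 0, np.int64), incomplete=True)
lib.compact_incomplete(sym, append=True, convert_int_to_float=False, via_iteration=False)
assert read_append_data_keys_from_ref(sym) == []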


# Deliberately write mismatching incomplete types
lib.write(sym, get_df(3, 0, np.int64))
lib.write(sym, get_df(1, 3, np.int64), incomplete=True)
Collaborator commented:

Test where all the incompletes are the wrong type?

auto old_segment_in_memory = decode_segment(std::move(old_segment));
const auto& tsd = old_segment_in_memory.index_descriptor();
std::optional<AtomKey> next_key = std::nullopt;
if (tsd.proto().has_next_key()){
@poodlewars (Collaborator) commented Oct 18, 2024:

How does the testing for the next key logic work, given that ArcticDB doesn't write it? Or am I wrong, and normal (non-streaming) incompletes also have the linked list structure?
