
Adds a type change utility for append data keys in lib_tool #1932

Open

IvoDD wants to merge 1 commit into master from type-change-lib-tool-utility
Conversation

@IvoDD (Collaborator) commented Oct 18, 2024

This is done by adding C++-layer functionality to overwrite append data
keys, and by using a read + type change + overwrite sequence in the Python
layer (a sketch of that flow follows the change list below).

More specifically, this change:

  • Uses a LocalVersionEngine instead of an AsyncStore as the library
    tool's state
  • Exposes more Python bindings to allow iteration over APPEND_DATA
    keys with the library tool
  • Allows using normalization in the library tool, so append data
    keys can be overwritten with a custom dataframe
  • Provides the type change functionality by reading a dataframe, changing
    its type with pandas, and overwriting it
  • Adds an elaborate test verifying that iterating over and reading the
    append data linked list works with various overwrites
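
To make the intended flow concrete, here is a minimal sketch of the Python-layer usage, assuming `lib` is a NativeVersionStore (as in the tests below) and `sym` is a symbol with APPEND_DATA keys. The library-tool method names used here (`find_keys_for_id`, `read_to_dataframe`, `overwrite_append_data_with_dataframe`) and the column name are illustrative assumptions, not confirmed API from this PR:

import numpy as np
from arcticdb_ext.storage import KeyType  # import path is an assumption

lib_tool = lib.library_tool()

# Walk the symbol's APPEND_DATA keys, fix a column's dtype, and write back.
for key in lib_tool.find_keys_for_id(KeyType.APPEND_DATA, sym):  # name assumed
    df = lib_tool.read_to_dataframe(key)      # read the incomplete segment
    df["col"] = df["col"].astype(np.float64)  # type change done with pandas
    lib_tool.overwrite_append_data_with_dataframe(key, df)  # write back (name assumed)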

Reference Issues/PRs

What does this implement or fix?

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

lib.write(sym, get_df(1, 4, str), incomplete=True)
lib.write(sym, get_df(1, 5, np.int64), incomplete=True)

def read_append_data_keys_from_ref(symbol):
IvoDD (Collaborator, Author) commented:

These helper functions could be added to the library_tool API, but I decided against it because they would require frequent tweaks (e.g. if you want to load only the last 20 keys). Instead, after this is merged I'll include some examples in the lib tool docs (see the sketch below).
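
For illustration, the "load only the last 20" variant mentioned above could be as simple as the sketch below, reusing the read_append_data_keys_from_ref helper from this test; a production version would stop following the linked list after n keys instead of loading the whole chain first:

def read_last_n_append_data_keys(symbol, n=20):
    # Naive sketch: load the full chain via the helper above, then truncate.
    return read_append_data_keys_from_ref(symbol)[:n]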

@IvoDD force-pushed the type-change-lib-tool-utility branch from 47cd9ef to 36a1032 on October 18, 2024 at 08:01
@IvoDD force-pushed the type-change-lib-tool-utility branch from 36a1032 to 87731cb on October 18, 2024 at 08:26
@IvoDD marked this pull request as ready for review on October 18, 2024 at 08:29
const py::object &norm,
const py::object & user_meta) {
if (!std::holds_alternative<AtomKey>(key) || std::get<AtomKey>(key).type() != KeyType::APPEND_DATA) {
throw_error<ErrorCode::E_INVALID_USER_ARGUMENT>(fmt::format("Can only override APPEND_DATA keys. Received: {}", key));
Collaborator commented:

util::check?


def test_overwrite_append_data(object_and_mem_and_lmdb_version_store):
    lib = object_and_mem_and_lmdb_version_store
    if lib._lib_cfg.lib_desc.version.encoding_version == 1:
Collaborator commented:

Just checking that this is definitely zero-indexed and we aren't skipping on v1 encoding by mistake?

@@ -230,3 +232,92 @@ def iterate_through_version_chain(key):
assert len(keys_by_key_type[KeyType.TABLE_DATA]) == (num_versions-1) % 3 + 1
assert len(keys_by_key_type[KeyType.TOMBSTONE_ALL]) == num_versions // 3


def test_overwrite_append_data(object_and_mem_and_lmdb_version_store):
Collaborator commented:

Why all 3 stores? Why not use one of the _v1 fixtures to avoid the skip below?

lib.write(sym, get_df(1, 5, np.int64), incomplete=True)

def read_append_data_keys_from_ref(symbol):
    nonlocal lib_tool
Collaborator commented:

You don't need the nonlocal, do you? Given that you aren't assigning to lib_tool.
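
(As a generic illustration of the scoping rule, independent of this PR's code: a nested function may freely read a name from the enclosing scope; nonlocal is only required when it rebinds that name.)

def outer():
    counter = 0

    def read_only():
        return counter      # reading a closed-over name needs no nonlocal

    def rebind():
        nonlocal counter    # required only because we assign to it below
        counter += 1

    rebind()
    return read_only()      # returns 1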

# We assert that types are as we wrote them and we can't read or compact because of type mismatch
append_keys = read_append_data_keys_from_ref(sym)
assert len(append_keys) == 3
# Different storages use either fixed or dynamic strings
Collaborator commented:

Ah, this is a good reason for the different backends 👍

# And test that compaction now works with the new types
lib.compact_incomplete(sym, append=True, convert_int_to_float=False, via_iteration=False)
assert read_append_data_keys_from_ref(sym) == []
assert_frame_equal(lib.read(sym).data, get_df(15, 0, np.int64))
Collaborator commented:

Test for appending more incompletes (with the right types) and compacting?
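
(Something like the following sketch would cover that, reusing only calls already present in this test; the get_df arguments are illustrative since its full signature isn't shown here.)

# Append more incompletes with the now-matching int64 type and compact again.
lib.write(sym, get_df(16, 0, np.int64), incomplete=True)
lib.write(sym, get_df(17, 0, np.int64), incomplete=True)
lib.compact_incomplete(sym, append=True, convert_int_to_float=False, via_iteration=False)
assert read_append_data_keys_from_ref(sym) == []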


# Deliberately write mismatching incomplete types
lib.write(sym, get_df(3, 0, np.int64))
lib.write(sym, get_df(1, 3, np.int64), incomplete=True)
Collaborator commented:

Test where all the incompletes are the wrong type?

auto old_segment_in_memory = decode_segment(std::move(old_segment));
const auto& tsd = old_segment_in_memory.index_descriptor();
std::optional<AtomKey> next_key = std::nullopt;
if (tsd.proto().has_next_key()){
@poodlewars (Collaborator) commented Oct 18, 2024:

How does the testing for the next key logic work, given that ArcticDB doesn't write it? Or am I wrong, and normal (non-streaming) incompletes also have the linked list structure?
