[dvc][doc] Create MVP for DaVinciRecordTransformer #1087

Open · wants to merge 42 commits into main

Conversation

@kvargha (Contributor) commented Jul 30, 2024

Summary

  1. Added a storeRecordsInDaVinci boolean to the constructor to indicate whether the user wants to store their transformed records in DaVinci or in another storage system of their choice.
  2. Created DaVinciRecordTransformerConfig to be used by DaVinciConfig, so we have access to the output value class and schema without needing to instantiate a dummy record transformer.
  3. Added a recovery method to bootstrap from VT or from local state on startup. If blob transfer is enabled, an error will be thrown since it's not currently supported with the record transformer.
  4. Created AbstractStorageIterator and a RocksDB implementation to be used by the recovery method.
  5. Added logic to detect on startup whether the transformer logic has changed, using a hash of the class bytecode. If the transform logic has changed, the local state is stale and we need to bootstrap directly from the VT.
  6. Added logic to create a DVRT specific Avro deserializer.
  7. Split put into transform and processPut, so that when we bootstrap from local state we don't transform the data again, since it has already been transformed. Also created a wrapper method transformAndProcessPut for when both need to be called (see the sketch after this list).
  8. Added delete logic in StoreIngestionTask.
  9. Updated record transformer doc with new logic.
  10. Made Venice repo a link in the workspace setup doc.
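
For illustration, here is a rough sketch of what a user-defined transformer might look like under this design. The class and hook names (DaVinciRecordTransformer, transform, processPut, storeRecordsInDaVinci) come from the summary above; the exact signatures, constructor shape, and package paths are assumptions.

```java
// Hypothetical user transformer. The class and hook names come from the PR
// summary; the signatures, constructor, and package paths are assumptions.
import com.linkedin.davinci.client.DaVinciRecordTransformer; // assumed path
import com.linkedin.venice.utils.lazy.Lazy;                   // assumed path

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StringPrefixTransformer extends DaVinciRecordTransformer<Integer, String, String> {
  // Stand-in for "another storage system of their choice", used when
  // storeRecordsInDaVinci is false.
  private final Map<Integer, String> externalStore = new ConcurrentHashMap<>();

  public StringPrefixTransformer(boolean storeRecordsInDaVinci) {
    super(storeRecordsInDaVinci); // assumed constructor shape
  }

  @Override
  public String transform(Lazy<Integer> key, Lazy<String> value) {
    // Pure transformation step: runs while ingesting from the version topic (VT).
    // Skipped when bootstrapping from local state, because the records on disk
    // are already transformed.
    return "transformed_" + value.get();
  }

  @Override
  public void processPut(Lazy<Integer> key, Lazy<String> transformedValue) {
    // Persistence step: runs for freshly transformed records and for records
    // replayed from local state.
    externalStore.put(key.get(), transformedValue.get());
  }
}
```

Ingestion would then go through the transformAndProcessPut wrapper from item 7 when both steps are needed, while a bootstrap from local state would invoke processPut alone.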

How was this PR tested?

Added more unit tests.

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

@FelixGV (Contributor) commented Jul 30, 2024

The thing about doing a hash of the bytecode sounds pretty interesting, but I wonder how reliable it is... I suspect that it probably has extremely low chances of false negative (the code changed but we fail to detect it) but that it might have a somewhat large probability of false positives (the code did not change but we think it did). The latter could be due to differences in JDK versions or flags used for compilation... Maybe it's not a big deal, but IDK...

Also, coming back to false negatives, those may still exist if the scope of the checksumming is not large enough, e.g. the class we checksum did not change but it calls into another class which did change.

Overall, I am not sure how I feel about this mechanism, and I think we may want to have a config to opt out of this check. Possibly even: this should be opt in to begin with (i.e. off by default) until we can gain a better understanding of its limitations.
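
For concreteness, the kind of class-level fingerprint being discussed could be computed roughly like this (a sketch only, not necessarily how the PR implements it):

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.security.MessageDigest;

// Sketch of the kind of fingerprint under discussion: digest the compiled
// .class file of the transformer. Not necessarily how the PR computes it.
public final class ClassBytecodeHash {
  public static String hash(Class<?> clazz) throws Exception {
    String resource = clazz.getName().replace('.', '/') + ".class";
    try (InputStream in = clazz.getClassLoader().getResourceAsStream(resource)) {
      ByteArrayOutputStream bytecode = new ByteArrayOutputStream();
      byte[] chunk = new byte[8192];
      for (int n = in.read(chunk); n != -1; n = in.read(chunk)) {
        bytecode.write(chunk, 0, n);
      }
      // False positives: a different JDK or different compiler flags can emit
      // different bytecode for identical source. False negatives: classes this
      // one calls into are not covered by the digest.
      byte[] digest = MessageDigest.getInstance("SHA-256").digest(bytecode.toByteArray());
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    }
  }
}
```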

@kvargha (Contributor, Author) commented Jul 30, 2024

> The thing about doing a hash of the bytecode sounds pretty interesting, but I wonder how reliable it is... I suspect that it probably has extremely low chances of false negative (the code changed but we fail to detect it) but that it might have a somewhat large probability of false positives (the code did not change but we think it did). The latter could be due to differences in JDK versions or flags used for compilation... Maybe it's not a big deal, but IDK...
>
> Also, coming back to false negatives, those may still exist if the scope of the checksumming is not large enough, e.g. the class we checksum did not change but it calls into another class which did change.
>
> Overall, I am not sure how I feel about this mechanism, and I think we may want to have a config to opt out of this check. Possibly even: this should be opt in to begin with (i.e. off by default) until we can gain a better understanding of its limitations.

From my testing, the hash would change under these circumstances:

  1. Change in JDK version.
  2. Any changes made within the user's implementation (I found this to be very reliable).

Circumstances where it didn't change:

  1. Changes in the version numbers of imported packages used by the user's implementation.
  2. Code changes in imported functions.
  3. Changes in the abstract class.

I think having it off by default and having the user enable it makes sense.

We could have a global variable that's initially set to false, and add another constructor where a user can pass in a boolean indicating whether to enable it. We could call the variable detectChangesInLogic. If it's set to false, then always bootstrap from the VT. Thoughts @ZacAttack?
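
A minimal sketch of that proposal, with detectChangesInLogic and storeRecordsInDaVinci taken from the discussion; the constructor shape and helper method are assumptions for illustration:

```java
// Sketch of the proposed opt-in flag. detectChangesInLogic and
// storeRecordsInDaVinci come from the discussion; everything else is assumed.
public abstract class DaVinciRecordTransformer<K, V, O> {
  private final boolean storeRecordsInDaVinci;
  private final boolean detectChangesInLogic;

  public DaVinciRecordTransformer(boolean storeRecordsInDaVinci) {
    // Off by default until the limitations of bytecode hashing are better understood.
    this(storeRecordsInDaVinci, false);
  }

  public DaVinciRecordTransformer(boolean storeRecordsInDaVinci, boolean detectChangesInLogic) {
    this.storeRecordsInDaVinci = storeRecordsInDaVinci;
    this.detectChangesInLogic = detectChangesInLogic;
  }

  boolean shouldBootstrapFromVersionTopic(String previousClassHash, String currentClassHash) {
    if (!detectChangesInLogic) {
      // Detection disabled: always rebuild local state from the VT.
      return true;
    }
    return !previousClassHash.equals(currentClassHash);
  }
}
```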

@ZacAttack (Contributor) commented:
I have finished my first two passes over this. Good progress, but there are some significant gotchas that have to be addressed, the first and most important being that I do not believe this implementation will work with version pushes.

```java
newBuffer.put(transformedBytes);
newBuffer.flip();
return newBuffer;

public final O transformAndProcessPut(Lazy<K> key, Lazy<V> value) {
```
@kvargha (Contributor, Author) commented Sep 6, 2024

Reading about the behavior of the key and value in SIT, I don't think the value needs to be lazy, since the assembly operation has already occurred by that point. Right now, it doesn't really do anything.

Another potential issue here is that the key is being deserialized twice, once in transform and once in processPut, unless we expect most of our users to only utilize the key in processPut.

Thoughts @ZacAttack?

@ZacAttack (Contributor) commented Sep 17, 2024

I'm not sure that's exactly right, is it? The point of the lazy is to avoid doing value deserialization or chunking unless the user needs it. The code today acts like a follower node should, which is just storing keys/values as bytes, and the point of the lazy is to give users the ability to read the value as a deserialized/assembled record if they need to process it for their logic.

Deserializing twice is a problem, but it should be solved via the lazy, right? The lazy function should deserialize once, and then successive calls to get() will just return a cached value.
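
For reference, the caching behavior being described boils down to memoization; here is a minimal sketch (not Venice's actual Lazy implementation):

```java
import java.util.function.Supplier;

// Minimal memoization sketch of the behavior described above: the underlying
// deserialization runs at most once, and later calls to get() return the
// cached result. Not Venice's actual Lazy implementation.
final class MemoizingLazy<T> {
  private final Supplier<T> supplier;
  private boolean computed;
  private T cached;

  MemoizingLazy(Supplier<T> supplier) {
    this.supplier = supplier;
  }

  synchronized T get() {
    if (!computed) {
      cached = supplier.get(); // e.g. the expensive value deserialization/assembly
      computed = true;
    }
    return cached;
  }
}
```

With that shape, calling get() in both transform and processPut deserializes the value at most once.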

@kvargha (Contributor, Author) commented Sep 17, 2024

> Deserializing twice is a problem, but it should be solved via the lazy, right? The lazy function should deserialize once, and then successive calls to get() will just return a cached value.

Yes, you're right.

@kvargha (Contributor, Author):

Added a todo to make chunking lazy.

@kvargha (Contributor, Author):

Just to document the discussion from the meeting: we need to think about what to do with the lazy value, since the value parameter for transform is lazy and processPut's is also lazy. Should transform return a lazy value, or should we create lazy and non-lazy APIs? TBD.

A contributor commented:

To document this a bit further, the 4 scenarios we need to make sure we have correctness for are (a hypothetical sketch follows the list):

  1. User deserialized the value and altered it
  2. User deserialized the value and didn't alter it
  3. User did not deserialize the value and just passed it along to the storage engine
  4. User did not deserialize the value and passed along something else entirely
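
One hypothetical way to make those outcomes explicit (purely illustrative; the enum name and cases are not from the PR, and the actual API was still TBD at this point):

```java
// Hypothetical labels for the four outcomes above; not the API chosen in this PR.
enum TransformOutcome {
  DESERIALIZED_AND_ALTERED,       // 1. must re-serialize the user's modified value
  DESERIALIZED_UNCHANGED,         // 2. the original assembled bytes can be written as-is
  PASSED_THROUGH,                 // 3. bytes forwarded to the storage engine untouched
  REPLACED_WITHOUT_DESERIALIZING  // 4. user substituted a different value entirely
}
```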
