-
Notifications
You must be signed in to change notification settings - Fork 875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JAVA-3061 Remove CqlVector, represent CQL vector types as Lists #1656
Conversation
My original comment reviewing what became this change I mentioned that I didn't want to use both the newInstance() methods and the builder in the same impl. After reading comments from the other PR (and in other spots) I realized this wasn't really a sensible position... so I decided to leave both in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only have one issue which I think is critical to address, which is the direct values accessor. Please see my comment for details.
* <p>Instances may be created iteratively using {@link CqlVector.Builder} or in a single method | ||
* call via {@link #newInstance(Object[])} or {@link #newInstance(Collection)}. | ||
*/ | ||
public class CqlVector<T> implements Iterable<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still a critical need to have simple access to the underlying values. The iterative accessor is simply not sufficient for our application's needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the pros/cons, my understanding is:
- Iterable access provides a generic approach and could be used where lazy evaluation and large datasets are known.
- List commonly used interface, easy way to work with collections, random access, specific list operations. This also allows for getting an idea of the size.
Since our full set of vector search use cases are somewhat vague at this point, +1 for having both options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I attempted to address this by adding an export(Collection) method which dumps the vectors content into a collection type of your choosing. This is significantly more flexible than exporting a single collection type; whatever type we export may have properties (list vs. set vs. queue, immutable vs. immutable, linked lists vs. array lists and so on) which almost certainly will not apply in even a majority of situations. This approach allows API users to create the list in whatever way they see fit (and even perform some initial operations on said list if they choose to do so) and then hand it over to our vector for supplying the data.
It's my hope that this addresses your immediate concern @jshook while also maximizing our flexibility for the many Java driver users who will have use cases with different parameters. If I outright missed the boat here and didn't understand something please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an ergonomic improvement, and quite flexible. However, it doesn't improve the situation around the heap allocation acrobatics for large (assume 2K nominal) vectors. I'm working up some microbenches to quantify this better. It seems worth it at least for something that, once settled, will be how CqlVector works for a long time.
…in order to work with it in other contexts.
…SequenceType interface and make both ListType and VectorType extend from that. Also converted VectorType to a pure interface and moved impl to DefaultVectorType to better match driver conventions.
I spent some time looking at this last night, trying out different ideas like supporting custom codecs for mutable vector types. There just wasn't a clean way to do this which generalized nicely, but fortunately while digging into these ideas I hit upon some comments which inverted my original way of thinking. CqlDuration is the exception here, it isn't the model to be followed... which in turn means we have no particular need to have a custom type for vectors. With that in mind it was much easier to simply represent vectors in Java as a List with elements of some subtype, a change implemented by the most recent set of commits. The change was quite intrusive, changing the direction of this PR and making it quite a bit bigger than it was. But after this change you now do things like the following: ResultSet rs = session.execute("select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1");
for (Row row : rs) {
List<Float> v = row.getVector(0, Float.class);
System.out.println(v)
} And this: PreparedStatement preparedInsert = session.prepare("insert into test.foo (i, j) values (?,?)");
List<Float> vector = Lists.newArrayList(1.4f, 2.5f, 3.6f);
session.execute(preparedInsert.bind(3, vector));
session.execute(preparedInsert.bind(4).setVector(1, vector, Float.class));
session.execute(preparedInsert.bind(5).setVector("j", vector, Float.class));
session.execute(preparedInsert.bind(6).setVector(CqlIdentifier.fromInternal("j"), vector, Float.class)); This avoids all the Sturm und Drang brought on by the prior implementation while also behaving in a way that fits more naturally with the Java drivers re-use of existing Java types wherever possible. Finally, it also more closely mirrors what we do with the Python driver, which isn't necessary but is a sign that we're on a good path. A lot of this PR now covers driver internals, so really my interest for the reviewers who don't work on the Java driver on a regular basis is to make sure they're okay with the idea of doing away with CqlVector and using a List instead. We can handle the review of these changes from the perspective of the Java driver code base separately. |
@absurdfarce as you stated, the internal changes are less of a concern for the majority of folks on this PR interested in the general use of what was |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm for the cqlvector usage; the driver specific bits, I'm leaving that detail to those more familiar with the internals.
…es what's already done with ListCodec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 very nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.We'll try it out as soon as we can in a release. It looks like you also added some APIs to help manage collection/container types more cleanly and efficently too, from what little I know of the driver internals. Thanks for taking this to the next level!
A quick note on this: @hhughes and I will be walking through these changes (including the changes to the driver innards) in detail later this afternoon. If that goes well this will get merged and the set of other fixes dependent on the outcome of this PR will get addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment about edge case checking when encoding vectors and a few inconsistencies in javadocs. LGTM
@@ -423,8 +423,9 @@ default SelfT setCqlDuration(int i, @Nullable CqlDuration v) { | |||
*/ | |||
@NonNull | |||
@CheckReturnValue | |||
default SelfT setCqlVector(int i, @Nullable CqlVector<?> v) { | |||
return set(i, v, CqlVector.class); | |||
default <ElementT> SelfT setVector( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Sets the {@code i}th value to the provided duration.
Should this be vector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should indeed... good catch @hhughes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -570,10 +570,11 @@ default SelfT setCqlDuration(@NonNull String name, @Nullable CqlDuration v) { | |||
*/ | |||
@NonNull | |||
@CheckReturnValue | |||
default SelfT setCqlVector(@NonNull String name, @Nullable CqlVector<?> v) { | |||
default <ElementT> SelfT setVector( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace duration
with vector
in the javadoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, this one too!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if (value == null || cqlType.getDimensions() <= 0) { | ||
return null; | ||
} | ||
ByteBuffer[] valueBuffs = new ByteBuffer[cqlType.getDimensions()]; | ||
Iterator<SubtypeT> values = value.getValues().iterator(); | ||
Iterator<SubtypeT> values = value.iterator(); | ||
int allValueBuffsSize = 0; | ||
for (int i = 0; i < cqlType.getDimensions(); ++i) { | ||
ByteBuffer valueBuff = this.subtypeCodec.encode(values.next(), protocolVersion); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ListCodec and MapCodec have some handling around null elements and classcastexceptions, do we want to have the same checks here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original thought was that this wasn't necessary because the codecs were pretty good about not meeting these error cases. Then I went back and looked at the code and realized that recollection was entirely wrong. Added that code in there (as well as check for a few other error conditions) and beefed up the codec test a bit to detect them.
Okay, we have buy-in from consumers on the API changes and some level of confidence for the driver internals (hat tip to @hhughes). I'm gonna call this good and merge so that I can start working on the downstream dependencies of this one. |
While also playing nice with CqlVectorCodec. Also set out to add some Javadoc in order to make usage a bit clearer.
Largely derived from previous work done by @jshook in PR 1641