Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added serde customization support in Fast-Avro #520

Merged
merged 3 commits into from
Oct 26, 2023

Conversation

gaojieliu
Copy link
Collaborator

We have the following requirements:
For serialization, we would like to validate whether all the map fields are using the desired map type.
For deserialization, we would like to deserialize the map type into a special map impelementation for later use.

These customized requirements are not supported in the past because of the following reasons:

  1. Fast classes generated are shared, so it is possible different users of the same schema may have different requirement.
  2. For the same process, for different schema, the requirements can be different too.
  3. No way to specify customized logic/data type when generating fast classes.

This PR adds a new functionality to specify customized logic and it is expandable and backward compatible.
DatumReaderCustomization : customization for read
DatumWriterCustomization : customization for write

Currently, these classes only support the requirements mentioned at the beginning.

How it works internally?

  1. Each Fast DatumReader/DatumWriter constructor will take a new param for customization.
  2. Each Fast DatumReader/DatumWriter will keep a local vanilla-Avro based implementation with customization support since the shared vanilla-Avro based implementation is still the default implementation.
  3. Each generated Fast class will have a new param for customization in serialize/deserialize APIs.
  4. Fast DatumReader/DatumWriter will call this new API with customization param of Fast classes.
  5. The read/write API in Fast DatumReader/DatumWriter doesn't change, so it is backward compatible.

Copy link
Collaborator

@FelixGV FelixGV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't finish reviewing, but LGTM so far. I'll finish later.

// Test with an empty map
GenericRecord recordWithEmptyMap = new GenericData.Record(recordSchema);
recordWithEmptyMap.put("testInt", new Integer(200));
recordWithEmptyMap.put("testMap", Collections.emptyMap());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me think of something... using the builder introduced in this PR, we could take advantage of datasets with many empty maps to make them take up less space on heap by leveraging the singleton empty map, e.g.:

    DatumReaderCustomization customization = new DatumReaderCustomization.Builder()
        .setNewMapOverrideFunc( (reuse, size) -> {
          if (reuse instanceof ConcurrentHashMap) {
            ((ConcurrentHashMap)reuse).clear();
            return reuse;
          } else if (size == 0) {
            return Collections.emptyMap();
          } else {
            return new ConcurrentHashMap<>(size);
          }
        })
        .build();

This could be done built-into fast-avro itself, but it's probably not a good default, since it means the returned map is immutable, which would be a breaking change in some cases. But it's a good use case for customization...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, we can consider doing this in OSS Venice.

@codecov-commenter
Copy link

codecov-commenter commented Oct 23, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2fb570c) 45.77% compared to head (854aab3) 45.78%.

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #520      +/-   ##
============================================
+ Coverage     45.77%   45.78%   +0.01%     
- Complexity     4440     4444       +4     
============================================
  Files           403      403              
  Lines         28070    28070              
  Branches       4622     4622              
============================================
+ Hits          12850    12853       +3     
  Misses        13664    13664              
+ Partials       1556     1553       -3     

see 5 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@FelixGV
Copy link
Collaborator

FelixGV commented Oct 24, 2023

Thanks for adding back the generated classes @gaojieliu, but can we add them before the code change, so we can see the diff? i.e. these steps as separate commits:

  1. Tweak the ignore file to add back the generated code (perhaps in just one Avro version, or all version... either way may be fine...).
  2. Do the code change.
  3. Changes to the generated code.
  4. (Possibly optional) undo the tweak to the ignore file to re-ignore generated code...

On a separate note, @radai-rosenblatt, how strongly do you feel about not checking in the generated code? It makes it quite tedious, needing to do these git acrobatics in order to see the effect of the changes to the meta code...

@gaojieliu
Copy link
Collaborator Author

Thanks for adding back the generated classes @gaojieliu, but can we add them before the code change, so we can see the diff? i.e. these steps as separate commits:

  1. Tweak the ignore file to add back the generated code (perhaps in just one Avro version, or all version... either way may be fine...).
  2. Do the code change.
  3. Changes to the generated code.
  4. (Possibly optional) undo the tweak to the ignore file to re-ignore generated code...

On a separate note, @radai-rosenblatt, how strongly do you feel about not checking in the generated code? It makes it quite tedious, needing to do these git acrobatics in order to see the effect of the changes to the meta code...

It will be great to checkin one version of generated classes, otherwise the PR effort will be too high..

@gaojieliu gaojieliu force-pushed the fastavro_add_customization branch 2 times, most recently from ecb02fb to 0f1e01e Compare October 25, 2023 16:53
Copy link
Collaborator

@FelixGV FelixGV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for restructuring the PR's commits. I left one more comment.

We have the following requirements:
For serialization, we would like to validate whether all the map fields are using the desired map type.
For deserialization, we would like to deserialize the map type into a special map impelementation for later use.

These customized requirements are not supported in the past because of the following reasons:

1. Fast classes generated are shared, so it is possible different users of the same schema may have different requirement.
2. For the same process, for different schema, the requirements can be different too.
3. No way to specify customized logic/data type when generating fast classes.
This PR adds a new functionality to specify customized logic and it is expandable and backward compatible.
DatumReaderCustomization : customization for read
DatumWriterCustomization : customization for write

Currently, these classes only support the requirements mentioned at the beginning.

How it works internally?

1. Each Fast DatumReader/DatumWriter constructor will take a new param for customization.
2. Each Fast DatumReader/DatumWriter will keep a local vanilla-Avro based implementation with customization support since the shared vanilla-Avro based implementation is still the default implementation.
3. Each generated Fast class will have a new param for customization in serialize/deserialize APIs.
4. Fast DatumReader/DatumWriter will call this new API with customization param of Fast classes.
5. The read/write API in Fast DatumReader/DatumWriter doesn't change, so it is backward compatible.
gaojieliu added a commit to gaojieliu/avro-util that referenced this pull request Oct 25, 2023
Copy link
Collaborator

@FelixGV FelixGV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this Gaojie. It LGTM overall. I have a few more minor comments.

Also, I think it would be good to make the float list customization part of this new framework (but it doesn't have to be part of this PR).

@gaojieliu gaojieliu merged commit cd17a70 into linkedin:master Oct 26, 2023
2 checks passed
@krisso-rtb
Copy link
Contributor

Hi @gaojieliu
After upgrading to 0.3.21 we started observing errors:

java.lang.NullPointerException: Cannot invoke "org.apache.avro.Schema.equals(Object)" because "writer" is null
	at org.apache.avro.Schema.applyAliases(Schema.java:1890)
	at org.apache.avro.generic.GenericDatumReader.getResolver(GenericDatumReader.java:131)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at com.linkedin.avro.fastserde.FastSerdeUtils$FastDeserializerWithAvroSpecificImpl.deserialize(FastSerdeUtils.java:68)
	at com.linkedin.avro.fastserde.FastGenericDatumReader.read(FastGenericDatumReader.java:126)
	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:263)
	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:248)

Presence of the new class (FastSerdeUtils) pointed us to this PR...
Has something changed?

We use

new FastSpecificDatumReader<>(null,  schema);

but it looks like null is not allowed anymore.

@FelixGV
Copy link
Collaborator

FelixGV commented Nov 2, 2023

Thanks for reporting... that is weird, since I don't see changes to the FastSpecificDatumReader constructors... but... what does it mean to have a null writer schema, though? Are you expecting the writer schema to default the reader schema in that case?

@krisso-rtb
Copy link
Contributor

OK, it's a bit more complex scenario :)

Schema schema = SomeAvroClass.getClassSchema();
FastSpecificDatumReader<SomeAvroClass> datumReader = new FastSpecificDatumReader<>(null, schema);

Path path = Paths.get("/Users/kris/dev/data-file-with-schema-in-header.avro");
try (InputStream inputStream = Files.newInputStream(path);
    DataFileStream<SomeAvroClass> reader = new DataFileStream<>(inputStream, datumReader)) {
    for (SomeAvroClass obj : reader) {
        System.out.println(obj);
    }
}

DataFileStream constructor calls void initialize(InputStream in, byte[] magic) method which extracts writerSchema from inputStream metadata. At the end the helper method invokes

reader.setSchema(header.schema); // replaces writerSchema if it's null --> that's our case

where reader is our datumReader.

Now, in version 0.3.21, new field was added to FastGenericDatumReader:

private final FastDeserializer<T> coldDeserializer;

which ignores invocation of public void setSchema(Schema schema) {

Basically setSchema(Schema schema) should be somehow cascaded to coldDeserializer.


I did an experiment and injecting

((FastSerdeUtils.FastDeserializerWithAvroSpecificImpl) this.coldDeserializer).customizedDatumReader.setSchema(schema);

to com.linkedin.avro.fastserde.FastGenericDatumReader.setSchema() fixed the issue.
Of course the real fix should should be done in a better way (i.e. no casting).

@FelixGV
Copy link
Collaborator

FelixGV commented Nov 2, 2023

Basically setSchema(Schema schema) should be somehow cascaded to coldDeserializer.

I did an experiment and injecting

((FastSerdeUtils.FastDeserializerWithAvroSpecificImpl) this.coldDeserializer).customizedDatumReader.setSchema(schema);

to com.linkedin.avro.fastserde.FastGenericDatumReader.setSchema() fixed the issue. Of course the real fix should should be done in a better way (i.e. no casting).

Ah, I see, that's very interesting... we should add a test for that (:

@krisso-rtb
Copy link
Contributor

@krisso-rtb
Copy link
Contributor

Hi,
It looks like the feature has a bug which should be fixed by:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants