Data compression as an option #58

Open
sylvlecl opened this issue Oct 21, 2020 · 4 comments

sylvlecl (Contributor) commented Oct 21, 2020

  • Do you want to request a feature or report a bug?

Feature.

  • What is the current behavior?

Binary data stored in AFS is compressed and uncompressed automatically by several components.

The Cassandra-based implementation:

  • automatically gzips chunks of data on write
  • automatically gunzips chunks of data on read

The remote implementation:

  • on write, automatically gzips data on the client side
  • on write, automatically gunzips data on the server side
  • on read, automatically gzips data on the server side
  • on read, automatically gunzips data on the client side

When we want to read or write already compressed data, those steps are unnecessary and can hurt performance (and possibly memory usage).
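To make the overhead concrete, here is a minimal sketch (plain java.util.zip, not AFS code; the payload content is illustrative) of what happens when already-gzipped bytes go through another automatic gzip pass:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public final class DoubleGzipSketch {

    // gzip a byte array in memory
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // repetitive XML-like text standing in for a case file (Java 11+ for String.repeat)
        byte[] caseContent = "<network>...large XIIDM content...</network>"
                .repeat(100_000).getBytes(StandardCharsets.UTF_8);
        byte[] clientSide = gzip(caseContent);   // useful: text compresses well
        byte[] serverSide = gzip(clientSide);    // redundant: CPU spent for no size gain
        System.out.printf("raw=%d, 1st gzip=%d, 2nd gzip=%d bytes%n",
                caseContent.length, clientSide.length, serverSide.length);
    }
}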

  • What is the expected behavior?

If those components could be configured to skip compression, performance could improve (to be measured).

  • What is the motivation / use case for changing the behavior?

Performance optimization in a typical setup with a client connected to an AFS server, which itself relies on a Cassandra implementation of AFS.

In this kind of setup, when writing/reading data blobs, the data is unnecessarily compressed and uncompressed on the server side.

Some benchmarking with JMH shows that compressing a large XIIDM case (100 MB) takes around 2 s on my laptop CPU.
At around 50 cases received per hour, that amounts to 1-2 minutes of CPU time consumed every hour just for compression.
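For reference, a measurement of that kind can be reproduced with a small JMH harness along these lines (a sketch, not the exact harness used for the figure above; the payload construction is illustrative):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPOutputStream;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class CaseGzipBenchmark {

    private byte[] payload;

    @Setup
    public void buildPayload() {
        // ~100 MB of repetitive XML-like text standing in for a large XIIDM case
        byte[] pattern = "<line id=\"L1\" r=\"0.1\" x=\"0.2\"/>".getBytes();
        payload = new byte[100 * 1024 * 1024];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = pattern[i % pattern.length];
        }
    }

    @Benchmark
    public int gzipWholeCase() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload);
        }
        return bos.size();
    }
}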

sylvlecl (Contributor, Author) commented Oct 21, 2020

Note that for the Cassandra implementation, this evolution would require changing the current mechanism, which compresses each chunk of data separately instead of compressing the whole blob. With the current mechanism, getting a compressed blob requires first uncompressing each chunk, reconstituting the blob, and then compressing it again.
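To illustrate (names and types are illustrative, not the actual CassandraAppStorage code), serving a compressed blob under the current chunk-wise scheme looks roughly like this:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ChunkedBlobSketch {

    // Current scheme: every chunk is gzipped separately, so returning one
    // compressed blob means gunzipping every chunk and re-gzipping the whole.
    static byte[] readAsCompressedBlob(List<byte[]> gzippedChunks) throws IOException {
        ByteArrayOutputStream blob = new ByteArrayOutputStream();
        for (byte[] chunk : gzippedChunks) {
            try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(chunk))) {
                in.transferTo(blob);                       // 1) gunzip each chunk
            }
        }
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(blob.toByteArray());                  // 2) re-gzip the reconstituted blob
        }
        return compressed.toByteArray();
    }
}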

Changing this mechanism will require some caution: we will still need to be able to read data written the "old way", to ensure a smooth migration.

sylvlecl (Contributor, Author) commented:

An additional thought on that issue:
as already discussed with @yichen88, the automatic compression of data does not seem all that relevant a feature, because the benefit of compressing data depends on its actual content.

In CassandraAppStorage and RemoteAppStorage, "binary" data is automatically gzipped, which can be counter-productive since we do not know whether this binary data is well suited for compression (it is, after all, "binary", not "text").

Therefore I would propose simplifying those implementations: no automatic compression, with compression handled by the business objects, since they are the ones that know whether compressing their data is relevant.

However, implementing this would need a lot of care to ensure non-regression and a smooth migration of existing systems.
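As a small illustration of where the knowledge lives (a hypothetical helper, not existing API), a business object can cheaply decide from its own format whether another gzip pass is worth anything:

// Hypothetical helper on the business-object side: formats that are already
// compressed gain nothing from an extra gzip pass.
static boolean worthCompressing(String fileName) {
    String lower = fileName.toLowerCase();
    return !(lower.endsWith(".gz") || lower.endsWith(".zip")
            || lower.endsWith(".xz") || lower.endsWith(".bz2"));
}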

sylvlecl (Contributor, Author) commented Nov 26, 2020

Some additional benchmarking results:

  • setup: a Cassandra storage behind an AFS server (which, in the current state, means 2 automatic compressions when uploading data)
  • Uploading a 70 MB case (compressed or not) with the current implementation takes around 5 s
  • Uploading the same compressed case, after removing the compression steps (quick and dirty change), takes 0.5 s

So for that use case, the system without automatic compression would be able to scale 10x better.

sylvlecl (Contributor, Author) commented Dec 4, 2020

So, the principle has been agreed: give particular business objects the possibility to define whether their data should be compressed or not.

Steps to achieve this could be:

  • have an option in app storage implementations to enable/disable automatic compression
  • provide the possibility in the app storage API to tell the storage whether compression may be relevant (distinguish text/binary methods? add an option to the existing methods? see below)
  • migrate business objects to use that new possibility

Work in progress:

For example, ImportedCaseBuilder could have a new method:

public ImportedCaseBuilder withRawData(String format, byte[] data /* or something more streaming-friendly */);

The provided data would be passed "as is" to the app storage, and not compressed.

Note: this is to circumvent the data source API limitations, which is another issue.
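A hypothetical usage, assuming the method above existed and the usual builder flow (wiring and imports simplified; none of this is current API):

// Hypothetical usage of the proposed method; importedCaseBuilder is assumed
// to be obtained from the project folder as usual.
byte[] gzippedCase = Files.readAllBytes(Paths.get("case.xiidm.gz"));
ImportedCase importedCase = importedCaseBuilder
        .withRawData("XIIDM", gzippedCase)   // handed to the storage as is, no extra gzip
        .build();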

In AppStorage we could have new methods to write data:

OutputStream writeBinaryData(String nodeId, String name, boolean mayCompress);

/*
 * kind of shortcut for mayCompress = true ?
 */
OutputStream /* or Writer ? */ writeTextData(String nodeId, String name);
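For example, a caller holding an already-gzipped blob could then write it untouched, while plain text keeps the automatic compression (a sketch using only the proposed signatures above; storage, nodeId and alreadyGzippedBytes are assumed to be in scope, imports omitted):

// Sketch only: uses the proposed signatures above.
try (OutputStream os = storage.writeBinaryData(nodeId, "case.xiidm.gz", false)) {
    os.write(alreadyGzippedBytes);           // stored raw, no automatic gzip
}
try (OutputStream os = storage.writeTextData(nodeId, "description.txt")) {
    os.write("plain text compresses well".getBytes(StandardCharsets.UTF_8));
}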

On the read side, we need to:

  • be able to ask for raw or uncompressed data?
  • be able to know what is returned

How could we achieve this?
Note that there are similarities with content negotiation in HTTP; could we get some inspiration from there?
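One possible shape for the read side, loosely borrowing the Accept-Encoding / Content-Encoding idea from HTTP (a sketch only; none of these types exist today):

import java.io.InputStream;
import java.util.Set;

enum Encoding { RAW, GZIP }

// The storage tells the caller what it actually returned, so no blind
// decompression is needed on either side.
final class DataBlob {
    final InputStream stream;
    final Encoding encoding;

    DataBlob(InputStream stream, Encoding encoding) {
        this.stream = stream;
        this.encoding = encoding;
    }
}

interface ReadSideSketch {
    // acceptedEncodings plays the role of an Accept-Encoding header: the caller
    // states what it can handle, the storage answers with whatever is cheapest
    // for it to produce among those.
    DataBlob readBinaryData(String nodeId, String name, Set<Encoding> acceptedEncodings);
}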
