Data compression as an option #58

Open
sylvlecl opened this issue Oct 21, 2020 · 4 comments

sylvlecl (Contributor) commented Oct 21, 2020

  • Do you want to request a feature or report a bug?

Feature.

  • What is the current behavior?

Binary data stored in AFS is compressed and uncompressed automatically by several components.

The Cassandra-based implementation:

  • automatically gzips chunks of data on write
  • automatically gunzips chunks of data on read

The remote implementation:

  • on write, automatically gzips data on the client side
  • on write, automatically gunzips data on the server side
  • on read, automatically gzips data on the server side
  • on read, automatically gunzips data on the client side

When we want to read or write already compressed data, those steps are unnecessary and can hurt performance (and possibly memory usage).
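To make the overhead concrete, here is a minimal sketch (plain java.util.zip, not AFS code; the payload content is illustrative) of what happens when already-gzipped bytes go through another automatic gzip pass:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public final class DoubleGzipSketch {

    // gzip a byte array in memory
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // repetitive XML-like text standing in for a case file (Java 11+ for String.repeat)
        byte[] caseContent = "<network>...large XIIDM content...</network>"
                .repeat(100_000).getBytes(StandardCharsets.UTF_8);
        byte[] clientSide = gzip(caseContent);   // useful: text compresses well
        byte[] serverSide = gzip(clientSide);    // redundant: CPU spent for no size gain
        System.out.printf("raw=%d, 1st gzip=%d, 2nd gzip=%d bytes%n",
                caseContent.length, clientSide.length, serverSide.length);
    }
}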

  • What is the expected behavior?

If those components could be configured to skip compression, performance could improve (to be measured).

  • What is the motivation / use case for changing the behavior?

Performance optimization in a typical setup with a client connected to an AFS server, which itself relies on a Cassandra implementation of AFS.

In this kind of setup, when writing/reading data blobs, the data is unnecessarily compressed and uncompressed on the server side.

Some benchmarking with JMH shows that compressing a large XIIDM case (100 MB) takes around 2 s on my laptop CPU.
At around 50 cases received per hour, that amounts to 1-2 minutes of CPU time consumed every hour just for compression.
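For reference, a measurement of that kind can be reproduced with a small JMH harness along these lines (a sketch, not the exact harness used for the figure above; the payload construction is illustrative):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPOutputStream;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class CaseGzipBenchmark {

    private byte[] payload;

    @Setup
    public void buildPayload() {
        // ~100 MB of repetitive XML-like text standing in for a large XIIDM case
        byte[] pattern = "<line id=\"L1\" r=\"0.1\" x=\"0.2\"/>".getBytes();
        payload = new byte[100 * 1024 * 1024];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = pattern[i % pattern.length];
        }
    }

    @Benchmark
    public int gzipWholeCase() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload);
        }
        return bos.size();
    }
}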

sylvlecl (Contributor, Author) commented Oct 21, 2020

Note that for the Cassandra implementation, this evolution would require changing the current mechanism, which compresses each chunk of data separately instead of compressing the whole blob. With the current mechanism, getting a compressed blob requires first uncompressing each chunk, reconstituting the blob, and then compressing it again.
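To illustrate (names and types are illustrative, not the actual CassandraAppStorage code), serving a compressed blob under the current chunk-wise scheme looks roughly like this:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ChunkedBlobSketch {

    // Current scheme: every chunk is gzipped separately, so returning one
    // compressed blob means gunzipping every chunk and re-gzipping the whole.
    static byte[] readAsCompressedBlob(List<byte[]> gzippedChunks) throws IOException {
        ByteArrayOutputStream blob = new ByteArrayOutputStream();
        for (byte[] chunk : gzippedChunks) {
            try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(chunk))) {
                in.transferTo(blob);                       // 1) gunzip each chunk
            }
        }
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(blob.toByteArray());                  // 2) re-gzip the reconstituted blob
        }
        return compressed.toByteArray();
    }
}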

Changing this mechanism will require some caution: we will still need to be able to read data written the "old way", to ensure a smooth migration.

sylvlecl (Contributor, Author) commented:

An additional thought on that issue:
as already discussed with @yichen88, the automatic compression of data does not seem all that relevant a feature, because the benefit of compressing data depends on its actual content.

In CassandraAppStorage and RemoteAppStorage, "binary" data is automatically gzipped, which can be counter-productive since we do not know whether this binary data is well suited for compression (it is, after all, "binary", not "text").

Therefore I would propose simplifying those implementations: no automatic compression, with compression handled by the business objects, since they are the ones that know whether compressing their data is relevant.

However, implementing this would need a lot of care to ensure non-regression and a smooth migration of existing systems.
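As a small illustration of where the knowledge lives (a hypothetical helper, not existing API), a business object can cheaply decide from its own format whether another gzip pass is worth anything:

// Hypothetical helper on the business-object side: formats that are already
// compressed gain nothing from an extra gzip pass.
static boolean worthCompressing(String fileName) {
    String lower = fileName.toLowerCase();
    return !(lower.endsWith(".gz") || lower.endsWith(".zip")
            || lower.endsWith(".xz") || lower.endsWith(".bz2"));
}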

sylvlecl (Contributor, Author) commented Nov 26, 2020

Some additional benchmarking results:

  • setup: a Cassandra storage behind an AFS server (which, in the current state, means 2 automatic compressions when uploading data)
  • Uploading a 70 MB case (compressed or not) with the current implementation takes around 5 s
  • Uploading the same compressed case, after removing the compression steps (quick and dirty change), takes 0.5 s

So for that use case, the system without automatic compression would be able to scale 10x better.

sylvlecl (Contributor, Author) commented Dec 4, 2020

So, the principle has been agreed: give particular business objects the possibility to define whether their data should be compressed or not.

Steps to achieve this could be:

  • have an option in app storage implementations to enable/disable automatic compression
  • provide the possibility in the app storage API to tell the storage whether compression may be relevant (distinguish text/binary methods? add an option to the existing methods? see below)
  • migrate business objects to use that new possibility

Work in progress:

For example, ImportedCaseBuilder could have a new method:

public ImportedCaseBuilder withRawData(String format, byte[] data /* or something more streaming-friendly */);

The provided data would be passed "as is" to the app storage, and not compressed.

Note: this is to circumvent the data source API limitations, which is another issue.
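A hypothetical usage, assuming the method above existed and the usual builder flow (wiring and imports simplified; none of this is current API):

// Hypothetical usage of the proposed method; importedCaseBuilder is assumed
// to be obtained from the project folder as usual.
byte[] gzippedCase = Files.readAllBytes(Paths.get("case.xiidm.gz"));
ImportedCase importedCase = importedCaseBuilder
        .withRawData("XIIDM", gzippedCase)   // handed to the storage as is, no extra gzip
        .build();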

In AppStorage we could have new methods to write data:

OutputStream writeBinaryData(String nodeId, String name, boolean mayCompress);

/*
 * kind of shortcut for mayCompress = true ?
 */
OutputStream /* or Writer ? */ writeTextData(String nodeId, String name);
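For example, a caller holding an already-gzipped blob could then write it untouched, while plain text keeps the automatic compression (a sketch using only the proposed signatures above; storage, nodeId and alreadyGzippedBytes are assumed to be in scope, imports omitted):

// Sketch only: uses the proposed signatures above.
try (OutputStream os = storage.writeBinaryData(nodeId, "case.xiidm.gz", false)) {
    os.write(alreadyGzippedBytes);           // stored raw, no automatic gzip
}
try (OutputStream os = storage.writeTextData(nodeId, "description.txt")) {
    os.write("plain text compresses well".getBytes(StandardCharsets.UTF_8));
}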

On the read side, we need to:

  • be able to ask for raw or uncompressed data?
  • be able to know what is returned

How could we achieve this?
Note that there are similarities with content negotiation in HTTP; could we get some inspiration from there?
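One possible shape for the read side, loosely borrowing the Accept-Encoding / Content-Encoding idea from HTTP (a sketch only; none of these types exist today):

import java.io.InputStream;
import java.util.Set;

enum Encoding { RAW, GZIP }

// The storage tells the caller what it actually returned, so no blind
// decompression is needed on either side.
final class DataBlob {
    final InputStream stream;
    final Encoding encoding;

    DataBlob(InputStream stream, Encoding encoding) {
        this.stream = stream;
        this.encoding = encoding;
    }
}

interface ReadSideSketch {
    // acceptedEncodings plays the role of an Accept-Encoding header: the caller
    // states what it can handle, the storage answers with whatever is cheapest
    // for it to produce among those.
    DataBlob readBinaryData(String nodeId, String name, Set<Encoding> acceptedEncodings);
}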
