Large datasets #1

Open
tafia opened this issue Dec 7, 2017 · 3 comments

@tafia

tafia commented Dec 7, 2017

First, thanks for the library!

What is the recommended approach for writing large datasets (e.g. 20+ GB CSV files)? Is there any way to stream reads and writes?

I am having a hard time finding documentation on how to use it. The only documentation I found uses data frames. I am not an R expert, but I think those are in-memory only.

Also, I would ideally like to use it in a Rust program, which means I'll probably need to write a Rust binding for the required parts. Happy to share it if you want!

@MarcusKlik
Collaborator

Hi @tafia, first of all, congratulations on filing the first issue in the fstlib repository :-)

Up till now, the core C++ code of R's fst package was part of the R package itself. But now I've published the library as a separate component, so that it can be used from languages other than R.

As you noticed, I have yet to write documentation on the fstlib API and will do so in the coming months. In short, with the fstlib library you can and will be able to:

  • Write in-memory datasets to a file in the fst format
  • Access that fst file randomly, both row- and column-wise
  • Use custom type-specific compression on each column in the fst file
  • Use very fast multi-threaded compression of memory blocks
  • Use very fast multi-threaded hashing of memory blocks
  • Add new datasets to existing fst files (row binding) (future expansion, but the format is ready)
  • Add new columns to existing fst files (column binding) (future expansion, but the format is ready)
  • Retrieve data using on-the-fly sub-setting (e.g. YEAR == 2016) without any memory overhead (future expansion, but the format is ready)
  • Run on-the-fly ('chunked') operations on data in a fst file, much like applying map-reduce type algorithms to chunked data. This will be a fully multi-threaded feature. (future expansion)
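To make the last two items more concrete, here is a minimal sketch of what "chunked" processing with on-the-fly sub-setting means in general. It is not the fstlib API (which is not documented yet): the `Row` type and the `chunked_sum_for_year` function are hypothetical names invented for this illustration, the data source is a plain iterator rather than a fst file, and the sketch is single-threaded, whereas the planned feature is meant to be multi-threaded.

```rust
/// Hypothetical record type, used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub year: u16,
    pub amount: f64,
}

/// Sums `amount` over all rows with the given `year`, reading the source in
/// chunks of `chunk_size` rows. Each chunk is filtered and reduced to a
/// partial sum (the "map" step); the partial sums are then combined (the
/// "reduce" step). Only one chunk is held in memory at a time.
pub fn chunked_sum_for_year<I>(rows: I, chunk_size: usize, year: u16) -> f64
where
    I: IntoIterator<Item = Row>,
{
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut rows = rows.into_iter();
    let mut total = 0.0;
    loop {
        // Pull the next chunk from the source without consuming the iterator.
        let chunk: Vec<Row> = rows.by_ref().take(chunk_size).collect();
        if chunk.is_empty() {
            break;
        }
        // Map: sub-set the chunk (e.g. YEAR == 2016) and reduce it locally.
        let partial: f64 = chunk
            .iter()
            .filter(|r| r.year == year)
            .map(|r| r.amount)
            .sum();
        // Reduce: combine the partial result with the running total.
        total += partial;
    }
    total
}
```

Because every chunk is dropped before the next one is read, peak memory depends on `chunk_size` and not on the size of the dataset.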

The future expansion features will be developed in the coming period using the R package as a technology driver.

I/O operations using fstlib are designed to be as fast as possible, typically exceeding (thanks to compression) the maximum speed of an NVMe SSD. At the same time, the library will be very small, so it can easily be included in other packages or components.

Having a Rust binding would be great!

@MarcusKlik MarcusKlik self-assigned this Dec 7, 2017
@MarcusKlik MarcusKlik added this to the v0.9.0 milestone Dec 7, 2017
@tafia
Author

tafia commented Dec 7, 2017

> first of all, congratulations on filing the first issue in the fstlib repository :-)

🥇

> As you noticed, I have yet to write documentation on the fstlib API and will do so in the coming months.

You sure have a lot of work to do! I certainly don't want to bother you too much. For the moment, I'll split my input file into as many chunks as necessary.
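As a rough idea of the splitting step, here is a minimal sketch that streams a CSV file line by line and writes it out in fixed-size chunks, repeating the header in each one. The `split_csv` function and the `chunk_NNNNN.csv` naming are made up for this example, and it assumes that the first line is a header and that no field contains an embedded newline (a real CSV parser would be needed otherwise).

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Splits `input` into files in `out_dir` holding at most `rows_per_chunk`
/// data rows each, repeating the header line at the top of every chunk.
/// Returns the paths of the chunk files, in order.
/// Assumption: one record per line (no quoted fields containing newlines).
pub fn split_csv(
    input: &Path,
    out_dir: &Path,
    rows_per_chunk: usize,
) -> io::Result<Vec<PathBuf>> {
    assert!(rows_per_chunk > 0, "rows_per_chunk must be positive");
    let mut lines = BufReader::new(File::open(input)?).lines();
    let header = match lines.next() {
        Some(line) => line?,
        None => return Ok(Vec::new()), // empty input: nothing to write
    };

    let mut paths = Vec::new();
    let mut writer: Option<BufWriter<File>> = None;
    let mut rows_in_chunk = 0;

    for line in lines {
        let line = line?;
        // Open a new chunk file if there is none yet or the current one is full.
        if writer.is_none() || rows_in_chunk == rows_per_chunk {
            if let Some(mut full) = writer.take() {
                full.flush()?;
            }
            let path = out_dir.join(format!("chunk_{:05}.csv", paths.len()));
            let mut fresh = BufWriter::new(File::create(&path)?);
            writeln!(fresh, "{}", header)?;
            paths.push(path);
            writer = Some(fresh);
            rows_in_chunk = 0;
        }
        if let Some(w) = writer.as_mut() {
            writeln!(w, "{}", line)?;
        }
        rows_in_chunk += 1;
    }
    // Flush the last, possibly partial, chunk.
    if let Some(mut last) = writer.take() {
        last.flush()?;
    }
    Ok(paths)
}
```

Only one line is held in memory at a time, so the size of the input file does not matter; each resulting chunk can then be loaded and written to its own fst file.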

For the moment, I am mainly interested in creating fst files (writing in-memory datasets and saving them to disk). There are examples in the tests, so I guess that if I manage to get Rust bindings working, that should be enough for me.

@MarcusKlik
Collaborator

That's great, please let me know if you need anything. The Visual Studio 2017 solution contains 4 projects:

  • Project fstcpp: this is a very basic implementation of a fstlib wrapper in C++ (let's say the C++ variant of the R package).
  • Project fstlib: that's the fstlib library.
  • Project fstlibtest: a Google test project to test basic functionality. Currently I mostly use this to track and debug issues that arise from the R package users. Eventually, this will be the main test repository for fstlib.
  • Project googletests: the Google library for writing unit tests

Unfortunately, I have no experience with Rust, but if you can make a wrapper for C++ code, then you should have no problems. It would be nice if you could keep your work in a GitHub repository, so that we can learn from the process!
