R crashes while reading an fst file #271
Can you please try to turn this into a self-contained reproducible example, with a script creating a file which subsequently crashes R? No sane person will read a random binary file off the internet.
@eddelbuettel: thanks for looking at this. To prevent multicore processing I have also added the following two lines, as recommended in one of the GitHub issue threads (sketched below).
But I still regularly get the corruption and the resulting crashes.
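The exact two lines are not preserved in this thread; a plausible sketch of disabling fst's multithreading, using the package's documented `threads_fst()` function, might look like this:

```r
# Force fst to use a single thread, to rule out multithreaded
# writes as the source of the file corruption.
library(fst)
threads_fst(1)                                   # set thread count to 1
cat("fst now uses", threads_fst(), "thread\n")   # no args: returns current setting
```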
Hi @sanjmeh, thanks for reporting. And I will definitely adhere to @eddelbuettel's warning to not try to load your binary file :-)

In the fst format, the metadata determines how much memory is allocated for storing the result table. However, the actual column data is decompressed from data blocks in the file using LZ4 or ZSTD decompression, and a corrupted block can make the decompressor write outside the allocated buffers, crashing R instead of raising an error.

To remedy, we could use safe versions of the decompression functions, at some cost in read speed.

Alternatively, hashes of the compressed data blocks could be calculated and verified before decompression. This will have a smaller impact on performance and could be used for files read from the internet or other suspicious sources (and would need to be done only once after downloading).
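The second alternative can be approximated at user level today. A minimal sketch, assuming only base R's `tools::md5sum()` and a hypothetical file name, that verifies a downloaded file once before reading it:

```r
library(fst)

path <- "data.fst"                     # hypothetical file name
checksum_path <- paste0(path, ".md5")

# Once, right after downloading or writing: record the checksum.
writeLines(unname(tools::md5sum(path)), checksum_path)

# Before each read: refuse files whose checksum has changed.
if (!identical(unname(tools::md5sum(path)), readLines(checksum_path))) {
  stop("checksum mismatch: '", path, "' may be corrupted")
}
df <- read_fst(path)
```

This does not make `read_fst()` itself safe against corruption; it only detects corruption that happened before the checksum was recorded or after it.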
Thank you @MarcusKlik, and welcome back to your own repository. That was indeed a long break, and I was wondering whether you would be back soon.
I do not see the option you mention yet. Meanwhile, I will test the first alternative.
Could you please specify how to try the safe options? I have not been able to locate the arguments so far. By the way, may I request you to have a look at the fst file I attached, and not treat it as just any random binary file from the internet? I am here to claim that it originates from my system, not from the internet :-)
@sanjmeh As another open-source volunteer, I am a little surprised by your tone. We give you our labor for free.
Oh my! My intention is not at all to offend you guys. You are doing a fantastic job in the open-source R community, and I would never want to turn you away. I hope I am making that clear.
Yes, unfortunately time is a scarce resource that can only be spent once (except for @eddelbuettel; my theory is that Dirk is somehow able to clone himself into identical copies that can do work in parallel, proof pending...) :-) About your file @sanjmeh: I will scan the metadata from a container and take a look at where things go wrong.
Hi Marcus, any progress on the bug?
Hello, I'm suffering from this bug too. I never had an issue before; it appeared when multiple machines started to write files to the shared drive.
I have previously encountered the error as well, and today again. I suspect the .fst file becomes corrupt during a 'forced' system reboot on a Windows machine (which is a secondary on-premise solution; primary/production runs in the cloud on Ubuntu). I can read the metadata of the .fst file fine, but reading the whole file causes R to crash. It would be great if this somehow just resulted in an error instead of crashing R. I'm happy to provide the .fst file if needed for testing. Otherwise the fst package is great, and so far I haven't encountered a better alternative (except for maybe parquet, because of its cross-language (e.g. Python) support).
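For context, a minimal sketch of the behaviour described above, using fst's documented `metadata_fst()` and `read_fst()` functions on a hypothetical file name:

```r
library(fst)

path <- "measurements.fst"   # hypothetical corrupted file

# Reading the metadata only touches the file header,
# so it succeeds even on this kind of corrupted file...
print(metadata_fst(path))

# ...while decompressing the actual data blocks is what crashes R.
df <- read_fst(path)
```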
I switched from fst to qs. About the same performance, maybe a bit faster. The only drawback is that you need to read the whole file; you cannot query rows or columns. But you can store any R object, and attributes are preserved.
And what is its advantage over RDS files? |
@sanjmeh Start here: https://github.com/traversc/qs
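For comparison, a minimal sketch of the qs API mentioned above (`qsave()` and `qread()` are the package's documented entry points; the file name is made up):

```r
library(qs)

# qs serializes arbitrary R objects, attributes included,
# but the file can only be read back as a whole.
m <- matrix(rnorm(1440 * 100), nrow = 1440)
attr(m, "sensor_type") <- "temperature"

qsave(m, "day.qs")       # write the whole object
m2 <- qread("day.qs")    # read the whole object back
stopifnot(identical(attr(m2, "sensor_type"), "temperature"))
```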
Thank you for the tip. However, the ability to read only certain rows or columns is one of the main reasons I use the fst package.

I have matrices with per-minute measurements for a certain number of sensors. As a result, my matrices are 1,440 (the number of minutes in a day) x 18,000 or 80,000 (depending on the sensor type). Using these daily matrices and their pivoted clones, I can very quickly read just one minute of a specific day (the date is the file name, the minute the n-th column) or read the 24-hour series of a sensor (again, the date is the file name and the column name is the ID of the sensor). Reading such a column (or a set of them) only takes a few milliseconds, and reading an entire year of data for a couple of sensors (using their IDs) is done in a couple of seconds. It is very quick to create time aggregates that way. The same holds for reading a few minutes of data for all sensors: for example, you can very quickly calculate a typical (average) value for a Tuesday at 11:00 based on a set of previous Tuesdays (also at 11:00).

The entire dataset is historically available from 2018 and is still updated every minute. It is about 500 GB (compressed) and stored on SSD-based storage (FSx for Lustre at AWS). Results are presented through a dashboard. For these purposes it is simply way too slow to read the whole matrix every time; with the layout above, I can read along the 'sensor' dimension and the 'time' dimension very quickly, no matter whether the data is recent or old (no caching needed). I have also tested databases, but they are either too slow or too costly.
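A minimal sketch of this access pattern, using fst's documented `columns`, `from` and `to` arguments of `read_fst()`; the file and column names are hypothetical:

```r
library(fst)

# One file per day: 1,440 rows (minutes) x N columns (sensors).
path <- "2020-06-02.fst"   # hypothetical daily file

# 24-hour series of one sensor: read a single column by name.
series <- read_fst(path, columns = "sensor_00042")

# One minute (11:00 is row 661) for all sensors: read a single row.
eleven <- read_fst(path, from = 661, to = 661)
```

Because fst stores data column-wise and supports row ranges, both reads touch only a small part of the file, which is what keeps them in the millisecond range.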
I have exactly the same application, and we also started with the fst package for exactly this reason. But I have now moved to MariaDB due to this occasional corruption of the fst file. We use RDS for data up to 100 MB and move larger data to the RDBMS with the timestamp as primary index, so we can quickly query a specific time range.
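A minimal sketch of that time-range query, assuming the DBI and RMariaDB packages and a hypothetical `readings` table with a primary index on its `ts` column:

```r
library(DBI)

con <- dbConnect(RMariaDB::MariaDB(), dbname = "sensors")  # hypothetical DB

# The primary index on the timestamp column makes this range scan fast.
res <- dbGetQuery(
  con,
  "SELECT ts, sensor_id, value FROM readings WHERE ts BETWEEN ? AND ?",
  params = list("2020-06-02 11:00:00", "2020-06-02 11:01:00")
)
dbDisconnect(con)
```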
A simple fst read can bring R crashing down if the file is corrupted!

How could a data file be so bad that reading it crashes R? Perhaps the fst read function does some aggressive memory management that interferes with the OS.
To replicate, just execute a simple read of the file, as sketched below. You will get a series of error messages, followed by R crashing.
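The exact call is not preserved in this thread; a minimal reconstruction, assuming the standard `read_fst()` entry point and a hypothetical local name for the attached file:

```r
# Reading the corrupted file is the whole reproduction:
# no other code runs before the session crashes.
library(fst)
df <- read_fst("corrupted.fst")  # hypothetical name for the attached file
```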
I have uploaded the offending file here.
https://drive.google.com/file/d/1hYJLAcqct_5JxTNNXN1c-qKH9bWFhgmO/view?usp=sharing