any plans for python/julia interfaces #184
Comments
Hi @jangorecki, thanks for asking, the answer is yes, absolutely! The underlying lib that powers To add a new client language, a wrapper for I don't have much experience using By the way, @xiaodaigh created a wrapper for the (the same holds true for |
hello there! as you may have noticed, I think there is a need for an efficient storage format that works with R and Python. Do you have timeline for the Thanks!!! |
Hi @randomgambit, thanks for the heads-up! Yes, there seems to be a void between Your offer to help with testing is much appreciated! |
I really wish you said before the end of the month instead!! :D |
Ha @randomgambit, yes, the same here! If only I had more time, I'll make sure to talk to my 'day-time-job' director on your behalf :-) |
Just curious if there is any update on the progress on the Python side? I use R, but a lot of people in my research field (astronomy) use Python. Feather seems the current best bet in this regard, but FST seems like it could be a decent step up given its subsetting and compression capabilities. |
Hi @asgr, thanks for your question. Yes, the Python bindings are long overdue and the fst format could be a faster and more dynamic bridge between The I'll try to get a repository up and running soon with an initial package version and we can work from there (user input would be much appreciated :-)). |
pydt would be useful too, fyi @st-pasha |
that sounds like a great default return type for python's read_fst() 😺 |
@MarcusKlik Do you have any documentation for the fst file format? |
Hi @st-pasha, thanks for asking, are you interested in a specification of the format meta-data, data-block design, etc., or the C++ API documentation? (neither is readily available at the moment, but I'd like to know where to direct my efforts 😸) |
format meta-data, data-block design for me as I am writing a Julia serializer |
Ha @xiaodaigh, that's great to hear. I suspect that you won't need the exact details of the (that is, unless you mean you'd like to write your own format; in that case format specs are of interest, of course) The Please let me know if I can help you there, implementing a (see also this issue in fstlib) |
Hi @MarcusKlik, sorry I should have given more context for my question. So, I'm a primary developer of the Python datatable library. This library provides a data frame object and facilities to manipulate this data frame. So, I guess it's pretty close to Some time in the near future (maybe around winter) we were planning to add integrations for other on-disk data formats, foremost arrow (feather) and parquet. And, as @jangorecki points out, the In other words, I'm not looking to use the So, if this all sounds agreeable to you, I would be looking for a document describing how to interpret data stored in a |
Hi @st-pasha, no problem! I'm very familiar with your work on (py)datatable (big fan ;-)) and just wanted to get clear how you would integrate In short, So it was explicitly not designed to manipulate in-memory datasets, like What the This is a difference with the goals of With The fst format is tightly bound to the For Currently, please let me know what you think, thanks! |
Hi Marcus, Based on your description it looks like the fst format is sufficiently complicated that it doesn't make sense to create an independent reader. In that case the simplest solution would be to have a separate Then in datatable we could have simple wrappers such as
We also have a feature proposal (h2oai/datatable#1950) for implementing For now, however, there are 2 main questions:
|
Hi @st-pasha, thanks, that sounds excellent. On your 2 main questions:
There are currently a few virtual columns in the fst format, but only for boundary cases like a factor column with just a single factor (which can be represented by a few numbers). Columns like sequences from n to m will also be encoded in dense format later on. Virtual columns would be a tremendous enhancement to that and I would very much like to see how we can support that. The challenge is to provide a cross-language way of encoding common expressions and constants. Virtual columns that depend on other virtual columns should also be possible. Does Interestingly, on the Your So, bottom line, the setup is very similar to the setup used for the Thanks! |
Hi Marcus, I presume you have much more experience with developing R libraries than Python extensions, so let me point out a few peculiarities of Python that could be relevant to the design process.
Virtual columns are a new functionality in datatable, their implementation is largely complete, though there's still some refactoring to do to make sure the existing code uses the new functionality to full extent. And yes, in our design a "virtual column" is an object that knows how to calculate its i-th element. For example, a
|
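To illustrate the "virtual column" idea discussed above (an object that knows how to calculate its i-th element, including a virtual column that depends on another one), here is a minimal Python sketch. The class and method names (`VirtualColumn`, `RangeColumn`, `MappedColumn`, `get_element`, `materialize`) are hypothetical and chosen for illustration only; they are not the datatable or fst API.

```python
from abc import ABC, abstractmethod


class VirtualColumn(ABC):
    """Hypothetical base class: a column that computes elements on demand."""

    def __init__(self, nrows):
        self.nrows = nrows

    @abstractmethod
    def get_element(self, i):
        """Return the i-th element without storing the whole column."""

    def materialize(self):
        """Compute every element and return them as a plain list."""
        return [self.get_element(i) for i in range(self.nrows)]


class RangeColumn(VirtualColumn):
    """A sequence start, start + 1, ..., stop - 1, stored as two numbers."""

    def __init__(self, start, stop):
        super().__init__(max(stop - start, 0))
        self.start = start

    def get_element(self, i):
        if not 0 <= i < self.nrows:
            raise IndexError(i)
        return self.start + i


class MappedColumn(VirtualColumn):
    """A virtual column defined in terms of another (possibly virtual) column."""

    def __init__(self, source, fn):
        super().__init__(source.nrows)
        self.source = source
        self.fn = fn

    def get_element(self, i):
        return self.fn(self.source.get_element(i))


# Usage: a sequence column and a column derived from it.
seq = RangeColumn(3, 7)
squares = MappedColumn(seq, lambda x: x * x)
```

Nothing is stored per row here: `seq` keeps only its start and length, and `squares` keeps a reference to `seq` plus a function. A cross-language file format would still need an agreed encoding for such expressions, which is the open challenge raised in the thread.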
Hi @st-pasha, thanks for the pointers! And yes, your assumption is very correct :-) About your second point, could we:
That way, wouldn't it be possible to use Or, perhaps simpler, when column A is being transformed, column B can be read into memory on background threads. When that's finished, column B can be transformed while column C is being read, etc. Obviously, we would have thanks! |
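The overlap suggested above (transform column A while column B is read on a background thread, then transform B while C is read, and so on) can be sketched in Python as follows. This is only a sketch under stated assumptions: `read_column` and `transform` are hypothetical callables supplied by the caller, standing in for whatever the fst reader and the client library would actually provide.

```python
from concurrent.futures import ThreadPoolExecutor


def read_and_transform(column_names, read_column, transform):
    """Transform each column while the next one is read in the background.

    read_column(name) and transform(raw) are hypothetical callables passed
    in by the caller; neither is part of any real fst or datatable API.
    """
    results = {}
    if not column_names:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start reading the first column before entering the loop.
        pending = pool.submit(read_column, column_names[0])
        for idx, name in enumerate(column_names):
            raw = pending.result()  # wait for the current column to arrive
            if idx + 1 < len(column_names):
                # Kick off the next read before transforming this column.
                pending = pool.submit(read_column, column_names[idx + 1])
            results[name] = transform(raw)
    return results


# Usage with an in-memory stand-in for a file on disk.
fake_store = {"A": [1, 2, 3], "B": [4, 5], "C": [6]}
doubled = read_and_transform(
    ["A", "B", "C"],
    read_column=fake_store.__getitem__,
    transform=lambda values: [v * 2 for v in values],
)
```

A single worker thread is enough for this one-column look-ahead. Whether the overlap pays off in practice depends on the read releasing the interpreter lock (as blocking I/O or a native reader typically would).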
In my team some use R and others use Python, so we had to use HDF5 because fst is only for R. But I like fst better. |
Hi @ssh352, thanks, I'm happy to hear that Hope you and your team can wait for that :-) |
Just wanted to express that being able to read in Python would be extremely useful :) |
This issue has been quiet for a while. Has any progress been made with Python access to fst? (I'm very excited for this feature!) |
Hi @richierocks, thanks for checking in on the progress. Unfortunately, I haven't had much time to work on a I do think the |
This SO answer can be improved once fst in Python is ready: https://stackoverflow.com/a/64880745/2490497 |
Hi @jangorecki, thanks for the heads up, when |
Has there been any progress on the project to create a python interface for |
Hey, just a heads-up: we in the genomics community are starting to experiment with this format; it is a very powerful substitute for older formats like BAM files. If you do not have time to work on this yourself, I believe funding could be acquired through grants etc., and supervised master's students could do project courses to implement simpler/smaller parts. Many possibilities here, let me know if this could be of interest. |
Are there any plans to make interfaces from other languages to your binary format? With a Python or Julia interface we could easily move data between different platforms. That is something feather was meant to do, but it is slow and crashes in R even for a 500 MB CSV input.