Thoughts on VegaFusion 2.0 #433

jonmmease · 2023-12-13T16:00:46Z

jonmmease
Dec 13, 2023
Collaborator

Wanted to share some general plans for what I'm picturing for VegaFusion 2.0.

Background

To recap, VegaFusion 1.0 marked several important milestones for the project:

Relicensing from AGPL3 to BSD3.
Inclusion of an Altair mime renderer in addition to the Jupyter Widget based renderer.
Support for extracting transformed data from an Altair chart with the vf.transformed_data function.

Version 1.2 introduced a suite of save functions for exporting Altair charts to external file formats after performing VegaFusion's pre-transform process.

Since VegaFusion 1.2, I've been working on integrating these same features into Altair itself. Altair 5.1 includes the initial integration with VegaFusion for:

Accessing transformed data from an Altair chart
Enabling the "vegafusion" data transformer causes the existing Altair renderers to use VegaFusion to pre-transform chart specifications before sending the results to the browser. It also causes the existing Altair chart.save and chart.to_json to use VegaFusion.

Drop Altair features and `vegafusion-jupyter` package

As of Altair 5.1, the only Altair feature of VegaFusion that's not possible with Altair directly is the VegaFusion widget renderer. Yesterday, I opened a PR (vega/altair#3281) to update Altair's JupyterChart to support the functionality of VegaFusionWidget, where interactive chart transformations are performed in Python.

Once this is merged into Altair and released, there will no longer be any reason for an end user to import vegafusion as vf, as all of the VegaFusion Altair functionality will be available directly in Altair. There will also be no need to install the vegafusion-jupyter Python package.

This is really exciting! And it makes VegaFusion useful to a much larger user base. For VegaFusion 2.0, I'd like to remove all of the Altair functionality from the vegafusion Python package and to remove the vegafusion-jupyter package from the VegaFusion repo.

Combine `vegafusion` and `vegafusion-python-embed` packages

vegafusion-python-embed is the native Python/Rust library, and vegafusion is a pure Python package that is intended to be the public interface to VegaFusion's functionality. The reason these are separate packages is that I was picturing supporting the scenario where the vegafusion package communicates with VegaFusion server over gRPC. I didn't see this all the way through, and I haven't run into any demand for this feature. By combining these packages, we can remove the nascent code for communicating with VegaFusion server, and we no longer have to make sure the versions of vegafusion and vegafusion-python-embed match.

I also want to fully type the pure Python API so it's easier to use from Altair.

Documentation

The VegaFusion documentation would need a near total overhaul, as it's mostly focused on the Altair functionality. The new documentation should focus on VegaFusion's role as a collection of building blocks for scaling Vega systems.

jonmmease · 2024-09-27T23:13:52Z

jonmmease
Sep 27, 2024
Collaborator Author

I'm finally coming back around to thinking about this, and wanted to jot down some more detailed plans.

Rust

Simplify transform evaluation

In the core Rust implementation, I would like to simplify how Vega transforms are converted into representations for evaluation. Currently, Vega transforms are implemented against our custom DataFrame trait (defined in the vegafusion-dataframe crate). This trait has methods that accept DataFusion expressions for things like selecting fields and filtering.

The main implementation of this trait is the SqlDataFrame struct (defined in the vegafusion-sql crate). This implementation builds up an SQL CTE chain, one CTE per DataFrame method. The CTEs are represented as sqlparser-rs ASTs, so the main task of the vegafusion-sql crate is to convert DataFusion expressions, and higher-level Vega transform constructs, into sqlparser-rs ASTs. The sqlparser-rs ASTs are then converted into a sql string, which is evaluated by a Connection implementation (also defined in vegafusion-sql). So when using the default DataFusion connection for evaluation, queries are round-tripped through sqlparser-rs, to a string, then parsed by DataFusion from the string.
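To make the "one CTE per DataFrame method" idea concrete, here is a minimal toy sketch in Python. It is not VegaFusion code: `ToySqlDataFrame` and its methods are hypothetical names, the CTE bodies are plain strings rather than sqlparser-rs ASTs, and SQLite stands in for the engine that evaluates the final SQL string.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySqlDataFrame:
    """Hypothetical stand-in for SqlDataFrame: each method appends one CTE."""

    source: str       # name of the base table
    ctes: tuple = ()  # SQL bodies of the CTEs built so far

    def _parent(self) -> str:
        # The newest CTE (or the base table) feeds the next step.
        return f"t{len(self.ctes) - 1}" if self.ctes else self.source

    def _chain(self, body: str) -> "ToySqlDataFrame":
        return ToySqlDataFrame(self.source, self.ctes + (body,))

    def filter(self, predicate: str) -> "ToySqlDataFrame":
        return self._chain(f"SELECT * FROM {self._parent()} WHERE {predicate}")

    def select(self, *columns: str) -> "ToySqlDataFrame":
        return self._chain(f"SELECT {', '.join(columns)} FROM {self._parent()}")

    def as_sql(self) -> str:
        # Render the whole chain as a single query string.
        if not self.ctes:
            return f"SELECT * FROM {self.source}"
        items = ", ".join(f"t{i} AS ({body})" for i, body in enumerate(self.ctes))
        return f"WITH {items} SELECT * FROM {self._parent()}"


# The "connection" step: hand the generated string to an engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (name TEXT, mpg REAL)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?)", [("a", 18.0), ("b", 31.5), ("c", 40.0)]
)

df = ToySqlDataFrame("cars").filter("mpg > 30").select("name")
sql = df.as_sql()
rows = conn.execute(sql).fetchall()
```

Every method call adds a layer to the string that the engine then has to parse again, which is the round trip described above.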

I developed this architecture in order to support generating SQL for non-DataFusion dialects. The only current application of this functionality is using DuckDB as an alternative SQL engine in Python. Testing the other dialects is a challenge, and currently the only tests are snapshot tests that I've manually validated using a Hex notebook with each data connection.

In the meantime, DataFusion has added support for unparsing LogicalPlans back to SQL in several dialects (currently DataFusion, Postgres, MySQL, and SQLite). My assumption is that the DataFusion or Postgres dialects will be compatible with DuckDB for the subset of functionality that VegaFusion relies on, and so it should be possible to maintain DuckDB support while dramatically simplifying the architecture using this approach.

The simplest option here is probably to keep our current DataFrame abstraction, but turn it from a trait back into a struct. The method implementations would use a DataFusion LogicalPlanBuilder to build up a logical plan. We would then update the Connection trait to have a collect method that accepts a DataFusion LogicalPlan. The default DataFusionConnection implementation would evaluate this logical plan using DataFusion's SessionContext.execute_logical_plan. Other implementations would unparse the LogicalPlan to SQL in the appropriate dialect and then evaluate it. This is how the Python DuckDB implementation would work.
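The shape of that design can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the DataFusion API: `Scan`, `Filter`, `Project`, `InMemoryConnection`, `SqliteConnection`, and `unparse` are all hypothetical names, the in-memory connection plays the role of the default connection that evaluates the plan directly, and SQLite plays the role of an external engine that only understands SQL.

```python
import operator
import sqlite3
from dataclasses import dataclass
from typing import Protocol, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    input: "Plan"
    column: str
    op: str  # ">", "<" or "="
    value: float


@dataclass(frozen=True)
class Project:
    input: "Plan"
    columns: tuple


Plan = Union[Scan, Filter, Project]
_OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass(frozen=True)
class DataFrame:
    """A plain struct: methods only grow the plan, nothing is evaluated."""

    plan: Plan

    def filter(self, column, op, value):
        return DataFrame(Filter(self.plan, column, op, value))

    def select(self, *columns):
        return DataFrame(Project(self.plan, columns))


class Connection(Protocol):
    def collect(self, plan: Plan) -> list: ...


class InMemoryConnection:
    """Evaluates the plan directly, with no SQL string in between."""

    def __init__(self, tables):
        self.tables = tables

    def collect(self, plan):
        if isinstance(plan, Scan):
            return [dict(row) for row in self.tables[plan.table]]
        rows = self.collect(plan.input)
        if isinstance(plan, Filter):
            return [r for r in rows if _OPS[plan.op](r[plan.column], plan.value)]
        return [{c: r[c] for c in plan.columns} for r in rows]


def unparse(plan):
    """Turn a plan back into SQL text for engines that only accept SQL."""
    if isinstance(plan, Scan):
        return f"SELECT * FROM {plan.table}"
    inner = unparse(plan.input)
    if isinstance(plan, Filter):
        return f"SELECT * FROM ({inner}) WHERE {plan.column} {plan.op} {plan.value!r}"
    return f"SELECT {', '.join(plan.columns)} FROM ({inner})"


class SqliteConnection:
    """Unparses the plan to SQL, then lets the external engine evaluate it."""

    def __init__(self, conn):
        self.conn = conn

    def collect(self, plan):
        cursor = self.conn.execute(unparse(plan))
        names = [d[0] for d in cursor.description]
        return [dict(zip(names, row)) for row in cursor.fetchall()]


cars = [{"name": "a", "mpg": 18.0}, {"name": "b", "mpg": 31.5}]
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cars (name TEXT, mpg REAL)")
db.executemany("INSERT INTO cars VALUES (:name, :mpg)", cars)

df = DataFrame(Scan("cars")).filter("mpg", ">", 30.0).select("name")
direct = InMemoryConnection({"cars": cars}).collect(df.plan)
via_sql = SqliteConnection(db).collect(df.plan)
```

Both connections receive the same plan object; only the second one ever produces a SQL string.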

With this approach, we can drop the vegafusion-sql crate and associated testing. This will make it much less burdensome to update to newer versions of sqlparser-rs and DataFusion.

Drop some DataFusion UDFs

DataFusion's feature set has increased dramatically across the board since I initially wrote VegaFusion. This means that it shouldn't be necessary to use as many custom UDFs. We'll still want an architecture that makes it easy to use UDFs/UDAFs, but we should be able to remove a bunch at this point. Looking at how DataFusion's unparser works, it looks like the names of custom UDFs that aren't otherwise intercepted are passed through to the generated SQL. This means that if we were only targeting DuckDB to start, we could implement DuckDB functions as DataFusion UDFs/UDAFs and the generated SQL would work out.
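A toy model of that pass-through behaviour, as described above, might look like the following. This is not DataFusion's unparser: `Call`, `REWRITES`, and `unparse_call` are hypothetical, and `my_engine_fn` is a made-up function name assumed to exist in the target engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """A function call expression: a name plus already-unparsed arguments."""

    name: str
    args: tuple


# Hypothetical intercept table: names the toy unparser rewrites per dialect.
REWRITES = {"sqlite": {"char_length": "length"}}


def unparse_call(call: Call, dialect: str) -> str:
    # Intercepted names are rewritten; everything else passes through verbatim,
    # so a UDF named after a target-engine function yields valid SQL there.
    name = REWRITES.get(dialect, {}).get(call.name, call.name)
    return f"{name}({', '.join(call.args)})"


intercepted = unparse_call(Call("char_length", ("name",)), "sqlite")
passed_through = unparse_call(Call("my_engine_fn", ("x", "y")), "sqlite")
```

Under this model, the UDF body only matters when DataFusion itself evaluates the plan; for an external engine, only the name reaches the SQL.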

Lift ChartState

I'd like to move the ChartState struct from vegafusion-runtime up to vegafusion-core (vegafusion-core has fewer dependencies, and is what vegafusion-wasm currently depends on). To do this, we'll need to add a trait abstraction for interacting with the VegaFusion runtime. This will make it possible for a ChartState to communicate with a VegaFusion server instance from Python or WASM.
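As a rough illustration of the dependency inversion involved, here is a Python sketch in which the chart state depends only on an abstract runtime interface. All names and method signatures here (`Runtime.query`, `LocalRuntime`, `ToyChartState.update`) are hypothetical and do not reflect VegaFusion's actual ChartState API.

```python
from typing import Any, Protocol


class Runtime(Protocol):
    """The abstraction: anything that can answer queries for chart variables."""

    def query(self, variables: list, updates: dict) -> dict: ...


class LocalRuntime:
    """In-process runtime; a remote client could expose the same method."""

    def __init__(self, values):
        self.values = values

    def query(self, variables, updates):
        # Apply an optional "brush" interval, then compute requested variables.
        lo, hi = updates.get("brush", (float("-inf"), float("inf")))
        selected = [v for v in self.values if lo <= v <= hi]
        available = {"count": len(selected), "total": sum(selected)}
        return {name: available[name] for name in variables}


class ToyChartState:
    """Holds the current variable values; knows nothing about how they are computed."""

    def __init__(self, runtime: Runtime, watched: list):
        self.runtime = runtime
        self.watched = watched
        self.values: dict = runtime.query(watched, {})

    def update(self, updates: dict) -> dict:
        changed = self.runtime.query(self.watched, updates)
        self.values.update(changed)
        return changed


state = ToyChartState(LocalRuntime([1, 5, 9]), ["count"])
initial = dict(state.values)
after_brush = state.update({"brush": (4, 10)})
```

Because `ToyChartState` only sees the `Runtime` interface, swapping the in-process runtime for a client that talks to a server would not change the state object itself.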

JavaScript / WASM

If we update vegafusion-wasm to use ChartState then we can create an initial Vega spec that includes the initial transformed data. This will make it possible to use the regular VegaEmbed library to display charts, and so we can drop the vegafusion-embed JavaScript library, and have vegafusion-wasm use vega-embed directly. We can move the optional grpc-web support to vegafusion-wasm as well.
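To illustrate what "an initial Vega spec that includes the initial transformed data" could mean, here is a small Python sketch that inlines already-computed rows into a spec's `data` entries. The function name `inline_datasets` is hypothetical, and this is only an approximation of the idea, not how VegaFusion builds its specs.

```python
import copy


def inline_datasets(spec: dict, transformed: dict) -> dict:
    """Return a copy of `spec` with pre-computed rows inlined as `values`.

    `transformed` maps dataset names to lists of row dicts that were already
    computed elsewhere (for example, on a server).
    """
    out = copy.deepcopy(spec)
    for dataset in out.get("data", []):
        rows = transformed.get(dataset.get("name"))
        if rows is None:
            continue
        # The rows are final, so drop the inputs that would recompute them.
        for key in ("url", "source", "transform"):
            dataset.pop(key, None)
        dataset["values"] = rows
    return out


spec = {
    "data": [
        {
            "name": "source_0",
            "url": "data/cars.json",
            "transform": [{"type": "filter", "expr": "datum.mpg > 30"}],
        }
    ],
    "marks": [],
}
ready = inline_datasets(spec, {"source_0": [{"name": "b", "mpg": 31.5}]})
```

The resulting spec is self-contained for its first render, so a standard Vega viewer can display it without asking a server for data.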

This doesn't need to be part of 2.0, but unlike past versions, DataFusion now supports evaluating queries when compiled to wasm, so it should be possible to also use VegaFusion entirely client side with DataFusion and/or DuckDB. Due to package size, it may make sense to publish separate packages for the workflow of connecting VegaFusion to a runtime on the server, and the workflow of running VegaFusion entirely in the browser. But we can see if the package size difference is enough to warrant doing it this way.

Python

Drop Altair Functionality

VegaFusion was initially designed to work with Altair entirely from the outside. And this is how it's still documented at vegafusion.io. Since then, we've integrated nearly all of VegaFusion's original Altair functionality into Altair itself, including chart.transformed_data(), the "vegafusion" data transformer, and integrating VegaFusion with JupyterChart.

I'd like to remove all of this functionality from the vegafusion Python package.

Merge vegafusion-python-embed

It has become more complicated than helpful to have vegafusion-python-embed and vegafusion as separate packages, and I no longer see a compelling use case for keeping them apart.

I'd like to merge these together into a single vegafusion package.

Drop vegafusion-jupyter, rework VegaFusionWidget

One use case that isn't handled by Altair's JupyterChart is the display of plain Vega (not Vega-Lite) charts with VegaFusionWidget. To continue supporting this, I'd like to add a new widget to the vegafusion package (not vegafusion-jupyter) that is based on AnyWidget. This will work with Vega (not Vega-Lite) specs, and will generally follow the same design as Altair's JupyterChart. (Vega-Lite specs can be converted to Altair charts and displayed with JupyterChart.) Offline support will be possible using vl-convert, the same way it's done for JupyterChart.

With this change, we can drop the vegafusion-jupyter package.

Use Narwhals and PyCapsule API

To make VegaFusion a little lighter weight for use with Polars, I'd like to remove the hard pyarrow and pandas dependencies. Instead, we can follow Altair and use Narwhals as the general interface to DataFrames (for things like introspecting the schema), and we can use the Arrow PyCapsule interface to pass Arrow data from Python to VegaFusion's Rust layer without copying.

This would make it possible to use Altair+VegaFusion with Polars without pulling in pandas or pyarrow.
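The Arrow PyCapsule interface is duck-typed: a library opts in by exposing dunder methods such as `__arrow_c_stream__` or `__arrow_c_array__`, so the consumer never needs to import pyarrow or pandas. A minimal sketch of the detection step follows; `pick_arrow_export` and `FakeStreamTable` are hypothetical names, and the step where the Rust layer consumes the capsule is not shown.

```python
def pick_arrow_export(table) -> str:
    """Return the name of the Arrow PyCapsule method to call on `table`."""
    # Prefer the stream export (tables), then fall back to a single array.
    for method in ("__arrow_c_stream__", "__arrow_c_array__"):
        if callable(getattr(table, method, None)):
            return method
    raise TypeError(
        f"{type(table).__name__} does not expose the Arrow PyCapsule interface"
    )


class FakeStreamTable:
    """Test double only; a real library would return a PyCapsule here."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError


chosen = pick_arrow_export(FakeStreamTable())
```

Any DataFrame library that implements these methods would work with this check without needing library-specific code.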

Java

The Java API is pretty incomplete, is broken on CI, and is not used as far as I know. So I'd like to drop it for now. If someone has a use case for it in the future, it will be pretty easy to pull back out from git.

Documentation

Content

The current VegaFusion docs focus primarily on the original Altair integration. This should all be removed and replaced by a few links to the Altair documentation. Instead, the docs should primarily focus on the topics outlined in https://vegafusion.io/low_level.html.

Some possible angles:

Provide enough info to do what we do in Python for Altair from another language.
Configure a chart state to point to an instance of VegaFusion server.
Host a Vega chart in a standalone web app and communicate with VegaFusion server over grpc-web.
Describe how the Python library uses Narwhals and the Arrow PyCapsule interface, so that if someone has a new DataFrame library, they know where to work on things upstream.
Describe how SQL generation works, and that additional dialects need to be added at the DataFusion level.

Location

I'd like to move the docs into the vegafusion repo and add a pixi task to sync them to the docs repo for GitHub Pages.

I'd also like to move the integration demos to the main repo.

Next steps

I'm planning to create a v2 branch and then target that with PRs that implement the above.

After 2.0

After VegaFusion 2.0, I'm most interested in integrating Avenger into VegaFusion to support rendering select marks from a chart into images. My general idea is that VegaFusion should be able to replace a mark (like a symbol mark) with a Vega image mark containing the result of rendering the original mark using Avenger. My plan is to scale the image using the same scales as the original mark, so that things like pan and zoom still work. And when displayed in an interactive context like JupyterChart, the image would re-render asynchronously during pan and zoom operations. The feel should be similar to using pan and zoom in mapping software, where the map tiles fill in asynchronously.

This will provide a way to support scatter charts with millions of points, and it will be possible to create rect marks with millions of instances, which can be used to represent heatmaps and images.
