Incremental view maintenance #3

Merged

Conversation

suremarc
Collaborator

@suremarc suremarc commented Dec 26, 2024

The heart of this PR is a new UDTF, called file_dependencies. This UDTF takes in an identifier for a materialized view and spits out a "build plan" that lists "build targets" for the materialized view, which are Hive directories in object storage, and then lists the "dependencies" (existing files in object storage) for each target. For example:

CREATE EXTERNAL TABLE t1 (column0 TEXT, date DATE)
STORED AS PARQUET
PARTITIONED BY (date)
LOCATION 's3://t1/';

INSERT INTO t1 VALUES 
('a', '2021-01-01'), 
('b', '2022-02-02'), 
('c', '2022-02-03'),
('d', '2023-03-03');

-- Pretend we can create materialized views in SQL
CREATE MATERIALIZED VIEW m1 AS SELECT
    COUNT(*) AS count,
    date_part('YEAR', date) AS year
FROM t1
GROUP BY date_part('YEAR', date)
PARTITIONED BY (year)
LOCATION 's3://m1/';

SELECT * FROM file_dependencies('m1');

+--------------------+----------------------+---------------------+-------------------+--------------------------------------+----------------------+
| target             | source_table_catalog | source_table_schema | source_table_name | source_uri                           | source_last_modified |
+--------------------+----------------------+---------------------+-------------------+--------------------------------------+----------------------+
| s3://m1/year=2021/ | datafusion           | public              | t1                | s3://t1/date=2021-01-01/data.parquet | 2023-07-11T16:29:26  |
| s3://m1/year=2022/ | datafusion           | public              | t1                | s3://t1/date=2022-02-02/data.parquet | 2023-07-11T16:45:22  |
| s3://m1/year=2022/ | datafusion           | public              | t1                | s3://t1/date=2022-02-03/data.parquet | 2023-07-11T16:45:44  |
| s3://m1/year=2023/ | datafusion           | public              | t1                | s3://t1/date=2023-03-03/data.parquet | 2023-07-11T16:45:44  |
+--------------------+----------------------+---------------------+-------------------+--------------------------------------+----------------------+

The build plan is computed by analyzing the fragment of the materialized view's query that contains the partition columns. This carries a restriction: the partition columns of the materialized view must reference only partition columns of other tables, directly or indirectly. (There are ways to lift this restriction in special cases, such as computing some of the partition columns at runtime, but this has not been implemented yet.)
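To illustrate why the restriction makes the plan computable from file paths alone, here is a minimal sketch using only the standard library. It is not the PR's implementation: `partition_value` and `build_plan` are hypothetical helpers, and the `date` to `year` mapping is hard-coded to match the example view rather than derived from a query plan.

```rust
use std::collections::BTreeMap;

/// Extract the value of a Hive partition column (e.g. `date=2021-01-01`) from a URI.
/// Hypothetical helper, not part of the crate.
fn partition_value<'a>(uri: &'a str, column: &str) -> Option<&'a str> {
    uri.split('/')
        .find_map(|segment| segment.strip_prefix(column)?.strip_prefix('='))
}

/// Group source files by the target directory they feed into.
/// Only the partition value in the path is consulted; file contents are never read.
fn build_plan<'a>(target_root: &str, sources: &[&'a str]) -> BTreeMap<String, Vec<&'a str>> {
    let mut plan: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for &uri in sources {
        // Assumption: for an ISO date, date_part('YEAR', date) is the first four characters.
        if let Some(year) = partition_value(uri, "date").and_then(|d| d.get(..4)) {
            plan.entry(format!("{target_root}year={year}/"))
                .or_default()
                .push(uri);
        }
    }
    plan
}

fn main() {
    let sources = [
        "s3://t1/date=2021-01-01/data.parquet",
        "s3://t1/date=2022-02-02/data.parquet",
        "s3://t1/date=2022-02-03/data.parquet",
        "s3://t1/date=2023-03-03/data.parquet",
    ];
    for (target, deps) in build_plan("s3://m1/", &sources) {
        println!("{target} <- {deps:?}");
    }
}
```

If the view were partitioned by a non-partition column such as `column0`, the target could not be determined without reading the Parquet files, which is the case the restriction rules out.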

After this PR I plan to add a new UDTF, stale_files, that shows which files in a materialized view have been invalidated by newer dependencies. Then I plan to write some examples on how to use the IVM code + detailed documentation on how it works.
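Since stale_files is not part of this PR, the following is only a guess at the underlying idea: a target is stale when it has never been built or when any of its dependencies was modified after the target was last written. The `Dependency` struct, the `stale_targets` function, and the use of integer timestamps are all assumptions made for this sketch.

```rust
use std::collections::BTreeMap;

/// One row of a build plan. Timestamps are seconds since the Unix epoch
/// (an assumption for this sketch).
struct Dependency<'a> {
    target: &'a str,
    source_last_modified: u64,
}

/// Return the targets that are missing or older than at least one dependency.
fn stale_targets<'a>(
    deps: &[Dependency<'a>],
    target_last_modified: &BTreeMap<&str, u64>,
) -> Vec<&'a str> {
    // Find the newest dependency timestamp for each target.
    let mut newest: BTreeMap<&'a str, u64> = BTreeMap::new();
    for dep in deps {
        let entry = newest.entry(dep.target).or_insert(0);
        *entry = (*entry).max(dep.source_last_modified);
    }
    newest
        .into_iter()
        .filter(|(target, newest_source)| match target_last_modified.get(*target) {
            Some(built_at) => built_at < newest_source,
            None => true, // never built, so it needs building
        })
        .map(|(target, _)| target)
        .collect()
}

fn main() {
    let deps = [
        Dependency { target: "s3://m1/year=2021/", source_last_modified: 100 },
        Dependency { target: "s3://m1/year=2022/", source_last_modified: 200 },
        Dependency { target: "s3://m1/year=2022/", source_last_modified: 300 },
    ];
    let built = BTreeMap::from([("s3://m1/year=2021/", 150), ("s3://m1/year=2022/", 250)]);
    println!("{:?}", stale_targets(&deps, &built));
}
```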

@@ -126,79 +123,6 @@ impl TableProvider for FileMetadata {
}
}

impl RowMetadataSource for FileMetadata {
Collaborator Author


This implementation was moved to the row_metadata module and replaced with ObjectStoreRowMetadataSource. The new type allows mocking the file metadata if needed, which is useful in some test environments.
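As a rough illustration of why putting metadata behind a trait helps with testing, here is a standard-library-only sketch. The names `FileMeta`, `MetadataSource`, `MockSource`, and `newest_file` are hypothetical and do not mirror the actual `RowMetadataSource` signature, which works with DataFusion types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for per-file metadata (not the crate's actual type).
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    uri: String,
    last_modified: u64,
}

/// Hypothetical abstraction over where file metadata comes from.
trait MetadataSource {
    fn list(&self, table: &str) -> Vec<FileMeta>;
}

/// In-memory mock that returns canned metadata instead of querying object storage.
struct MockSource {
    files: HashMap<String, Vec<FileMeta>>,
}

impl MetadataSource for MockSource {
    fn list(&self, table: &str) -> Vec<FileMeta> {
        self.files.get(table).cloned().unwrap_or_default()
    }
}

/// Code under test depends only on the trait, so it never needs a real store.
fn newest_file(source: &dyn MetadataSource, table: &str) -> Option<FileMeta> {
    source.list(table).into_iter().max_by_key(|f| f.last_modified)
}

fn main() {
    let mock = MockSource {
        files: HashMap::from([(
            "t1".to_string(),
            vec![
                FileMeta { uri: "s3://t1/date=2022-02-02/data.parquet".into(), last_modified: 10 },
                FileMeta { uri: "s3://t1/date=2022-02-03/data.parquet".into(), last_modified: 20 },
            ],
        )]),
    };
    println!("{:?}", newest_file(&mock, "t1"));
}
```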

@suremarc suremarc marked this pull request as ready for review December 26, 2024 21:03
@suremarc
Collaborator Author

Hey @alamb, I just thought I would request your review since I thought you might be curious about what's going on with the project, but don't feel obligated to leave a review, and please let me know if it's distracting you 😅

@alamb

alamb commented Dec 27, 2024

> Hey @alamb, I just thought I would request your review since I thought you might be curious about what's going on with the project, but don't feel obligated to leave a review, and please let me know if it's distracting you 😅

Not distracting at all! I will likely not have time to review in detail but I'll take a quick look

@alamb

alamb commented Dec 27, 2024

One thing I suggest, while you still have all this context in your head, would be to take the description / example of this PR and adapt it to the main README in this repo -- so that people can quickly evaluate what code is here and if it works for them

https://github.com/datafusion-contrib/datafusion-materialized-views/blob/main/README.md


@alamb alamb left a comment


Thanks for tagging me @suremarc -- I didn't read / understand all of this PR but given what I did read was well commented and tested I think this codebase would be relatively easy to pick up and figure out how to modify.

Really really cool to see this

struct TableTypeRegistry {
    listing_table_accessors: DashMap<TypeId, (&'static str, Downcaster<dyn ListingTableLike>)>,
    materialized_accessors: DashMap<TypeId, (&'static str, Downcaster<dyn Materialized>)>,

this is an interesting approach -- very cool. I wonder if there is anything we can do to the DataFusion APIs to make this easier to use

Collaborator Author


Thanks, yeah it's a bit unfortunate, but it's the only way I could think of to support "trait downcasting" in Rust code. In Go I know it is assisted by some runtime type information handled by the language itself. The only case I know of where someone has implemented their own such mechanism is the bevy_reflect crate, which seems to support trait downcasting here: https://docs.rs/bevy_reflect/latest/bevy_reflect/attr.reflect_trait.html
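For readers unfamiliar with the pattern, a stripped-down sketch of a registry-based "trait downcast" might look like the following. It uses a plain `HashMap` from the standard library rather than `DashMap`, and the names `Named`, `Registry`, and `ListingTable` are hypothetical placeholders, not the crate's types.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

/// Placeholder for a trait we want to recover from a type-erased value.
trait Named {
    fn name(&self) -> String;
}

/// A function that tries to view a type-erased value as `&dyn Named`.
type Downcaster = fn(&dyn Any) -> Option<&dyn Named>;

#[derive(Default)]
struct Registry {
    accessors: HashMap<TypeId, Downcaster>,
}

impl Registry {
    /// Record how to turn a concrete `T` into `&dyn Named`.
    fn register<T: Named + 'static>(&mut self) {
        fn cast_impl<T: Named + 'static>(any: &dyn Any) -> Option<&dyn Named> {
            any.downcast_ref::<T>().map(|t| t as &dyn Named)
        }
        self.accessors.insert(TypeId::of::<T>(), cast_impl::<T>);
    }

    /// Look up the value's concrete type at runtime and apply its registered cast.
    fn cast<'a>(&self, any: &'a dyn Any) -> Option<&'a dyn Named> {
        self.accessors
            .get(&Any::type_id(any))
            .and_then(|f| f(any))
    }
}

/// Example concrete type that opts in by being registered.
struct ListingTable;

impl Named for ListingTable {
    fn name(&self) -> String {
        "listing_table".to_string()
    }
}

fn main() {
    let mut registry = Registry::default();
    registry.register::<ListingTable>();

    let table = ListingTable;
    let erased: &dyn Any = &table;
    println!("{:?}", registry.cast(erased).map(|n| n.name()));
}
```

The trade-off is that every concrete type has to be registered up front; an unregistered type simply yields `None`.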


use super::{cast_to_materialized, row_metadata::RowMetadataRegistry, util, Materialized};

/// A table function that shows build targets and dependencies for a materialized view:

this is very cool

"+------+",
],
},
TestCase {

this is a pretty cool setup (and an extensive body of tests)

Collaborator Author


Probably not enough either! 😅

@suremarc
Collaborator Author

> Thanks for tagging me @suremarc -- I didn't read / understand all of this PR but given what I did read was well commented and tested I think this codebase would be relatively easy to pick up and figure out how to modify.
>
> Really really cool to see this

I added a README including the example, as well as some background on the problems this repo currently aims to solve. Let me know what you think 👍

@suremarc suremarc merged commit 14a3518 into datafusion-contrib:main Dec 27, 2024
8 checks passed
@github-actions github-actions bot mentioned this pull request Dec 31, 2024
@github-actions github-actions bot mentioned this pull request Jan 3, 2025
2 participants