Presenting the "Data Explorer" Feature for Onyxia: A Proposal for Our Community #611

fcomte · 2023-10-03T16:05:37Z

fcomte
Oct 3, 2023
Maintainer

Dear Onyxia community members,

At the heart of our ongoing commitment to evolve and enhance Onyxia is our belief in co-creation with the community that drives Onyxia's dynamism. Today, we are eager to introduce a proposal for a new feature: the "Data Explorer." But before diving into its intricacies, let's discuss the motivation behind its inception.

Why the Need for "Data Explorer"?

Our existing File Explorer has served the Onyxia community diligently, allowing users to organize and manage their files efficiently. However, its primary function is file management, which means it doesn't provide the necessary features to view and describe the actual content of the data. While the File Explorer is great for handling files, the community has expressed a desire for a more in-depth interaction with their data.

What is the "Data Explorer"?

The Data Explorer is a new, distinct part of the Onyxia application. While it will operate separately from the File Explorer, some light integration will be present to ensure users can conveniently transition between the two. The Data Explorer is designed to delve into the intricacies of data content, enabling users to view, describe, and interact with their data beyond mere file management. This distinction ensures that while the File Explorer handles the organization, the Data Explorer tackles comprehension, enhancing the overall user experience in the Onyxia ecosystem.

How Will Onyxia Handle the "Data Explorer"?

When a user switches to the Data Explorer mode, they are not only changing the view but immersing themselves in a more analytical environment. This mode provides a deeper dive into the dataset, beyond what the File Explorer offers. Let's delve into its core features:

Dataset Description: Upon accessing a dataset, the Data Explorer will showcase a summary. This encapsulates metadata like the dataset's origin, size, last update timestamp, and other pivotal details offering clarity about the data's nature and lineage.
Attribute Exploration: For those datasets which translate to dataframes (akin to tables), users will be greeted with a list of attributes or columns. These attributes come enriched with data type information, potential value ranges, and even brief descriptions if annotations exist.
Data Preview: Beyond just metadata and structural insight, users can catch a glimpse of the actual data. The initial rows of the dataset surface, offering a snapshot of its content and the information treasures it might hold.
SQL Editor: Our roadmap for the Data Explorer also includes an integrated SQL editor. It will use DuckDB compiled to WebAssembly (WASM) to run SQL directly in the browser, allowing users to query and manipulate their datasets without leaving Onyxia. This feature aims to bridge the gap between data viewing and data analysis, and it will be implemented in a second iteration.
Metadata Registration in Data View: Once a user is in the Data Explorer mode and has an analytical view over a specific dataset, they have the added capability to register or update the associated metadata. This feature is beneficial for enriching the data catalog and ensuring datasets are accompanied by comprehensive and up-to-date information.

Metadata Storage & Integration in Onyxia

1. Metadata Storage

Once metadata is captured, it's stored in an S3 bucket with a specific path format:
bucket/.onyxia/data-catalog/default-source/tableX.json.
This JSON file serves as a reference, containing vital information such as:
- A pointer to the actual data.
- Information about the columns, including their data types.
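
As an illustration only, here is a minimal Python sketch of what such a JSON file could contain. The field names (`source`, `columns`, `name`, `type`) and the example data URI are assumptions made for this sketch; the proposal only states that the file holds a pointer to the data and the column types.

```python
import json


def build_table_descriptor(data_uri, columns):
    """Build a metadata document: a pointer to the data plus column info.

    The field names are hypothetical, not a schema defined by Onyxia.
    """
    return {
        "source": data_uri,
        "columns": [{"name": name, "type": dtype} for name, dtype in columns],
    }


def descriptor_key(table_name, source="default-source"):
    """Object key inside the bucket, following the path format given above."""
    return f".onyxia/data-catalog/{source}/{table_name}.json"


descriptor = build_table_descriptor(
    "s3://bucket/data/tableX.parquet",  # hypothetical data location
    [("id", "BIGINT"), ("name", "VARCHAR")],
)
payload = json.dumps(descriptor, indent=2)  # body to upload under descriptor_key("tableX")
```

Keeping the document this small means any tool that can read JSON from S3 can consume the catalog.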

2. Metadata Capture

Onyxia facilitates automatic extraction of metadata directly from the data files.
Users are assisted in refining and saving the extracted metadata directly within the Data Explorer.
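
To make the idea of automatic extraction concrete, here is a rough Python sketch that infers column names and types from a CSV sample. It is not the planned implementation (the roadmap mentions DuckDB for the real thing); the type names and the `extract_columns` helper are assumptions for illustration.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of INTEGER, DOUBLE, VARCHAR that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "VARCHAR"
    for caster, type_name in ((int, "INTEGER"), (float, "DOUBLE")):
        try:
            for v in non_empty:
                caster(v)
            return type_name
        except ValueError:
            continue  # at least one value does not fit; try the next, wider type
    return "VARCHAR"


def extract_columns(csv_text, sample_rows=100):
    """Infer column names and types from the first `sample_rows` rows of a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for _, row in zip(range(sample_rows), reader)]
    return [
        {"name": field, "type": infer_type([(row[field] or "") for row in rows])}
        for field in (reader.fieldnames or [])
    ]


columns = extract_columns("id,price,label\n1,2.5,a\n2,3,b\n")
```

The inferred list is only a starting point: the user would still review and refine it in the Data Explorer before saving.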

3. Data Explorer Integration

Within the Data Explorer, users will have a holistic view of all their datasets. They can easily access the data, view its associated metadata, and interact with it for analytical purposes.
Metadata updates are reflected in real-time, ensuring that users always interact with the most recent dataset information.

4. Integration with Self-Services

When users initiate a self-service, like the hive-metastore or superset, Onyxia can automatically register the user's datasets within these platforms.
This seamless integration ensures users can leverage other open-source software platforms without the hassle of manual data migrations or registrations.

This mechanism ensures efficient metadata management, promoting a collaborative environment and ensuring seamless integration with various open-source tools.

Metadata Management in Onyxia: A Simple Yet Robust Approach

Rationale:

In the complex world of data management, there's no one-size-fits-all standard for metadata. The landscape is diverse, with varying requirements, tools, and standards. In navigating this complexity, Onyxia's strategy has been to prioritize simplicity and practicality.

1. Why S3 for Metadata Storage?

Scalability: S3 is inherently scalable, accommodating vast amounts of data without manual intervention.
Reliability: S3 is designed for high data durability and availability, so metadata remains accessible and intact.
Integration: Since the data is already in S3, there's no additional system to integrate. This streamlines the architecture, reducing potential points of failure and ensuring faster access.

2. Potential Drawbacks

Vendor Lock-in: Onyxia will define its own metadata system rather than adopt an existing one. This could limit flexibility in the future.

3. Onyxia's Vision

Despite the potential drawbacks, Onyxia's proposal to use S3 stems from a need for a simple, low-engineering solution. Instead of building a complex system from scratch or integrating several components, leveraging S3's capabilities presents an efficient way to manage metadata. The approach acknowledges the vendor lock-in risk but considers the trade-off acceptable given the benefits of reduced engineering complexity and rapid deployment.

fcomte · 2023-10-03T19:12:24Z

fcomte
Oct 3, 2023
Maintainer Author

Technical trick: maybe the file name of the metadata should be a hash of the source path, so that we can quickly tell whether a path is a registered table.
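
A minimal Python sketch of that idea, assuming SHA-256 as the hash and the path layout from the proposal (both the hash choice and the helper names are assumptions):

```python
import hashlib


def metadata_key(source_path, source="default-source"):
    """Derive the metadata object key from a hash of the data's source path."""
    digest = hashlib.sha256(source_path.encode("utf-8")).hexdigest()
    return f".onyxia/data-catalog/{source}/{digest}.json"


def is_registered(source_path, existing_keys):
    """A path is a registered table if its derived key already exists in the catalog."""
    return metadata_key(source_path) in existing_keys


catalog = {metadata_key("s3://bucket/data/tableX.parquet")}
```

With this scheme, checking whether a path is registered is a single key lookup rather than a scan of every metadata file. Paths would need to be normalized consistently before hashing, otherwise the same object could map to different keys.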

0 replies

garronej · 2023-10-04T12:40:43Z

garronej
Oct 4, 2023
Maintainer

💯Let's get to work!

0 replies

vancauwe · 2023-10-06T07:23:02Z

vancauwe
Oct 6, 2023

Hi! Exciting new feature. If I've understood correctly, the metadata would be in JSON format?
Have you considered using JSON-LD, so that it's in a linked data format?
(JSON-LD could be combined with DCAT, an open ontology for describing datasets.)
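
For readers unfamiliar with the combination, here is a rough Python sketch of a dataset described as JSON-LD with DCAT and Dublin Core terms. The selection of properties and the example values are assumptions for illustration, not a proposed schema for Onyxia.

```python
import json


def dataset_as_jsonld(title, access_url, modified):
    """Describe a dataset as a JSON-LD document using DCAT and Dublin Core terms."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:modified": modified,
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:accessURL": {"@id": access_url},
            }
        ],
    }


doc = dataset_as_jsonld(
    "tableX",                           # hypothetical dataset title
    "s3://bucket/data/tableX.parquet",  # hypothetical data location
    "2023-10-06",
)
serialized = json.dumps(doc, indent=2)
```

The document remains plain JSON, so it could be stored the same way as the metadata files described in the proposal, while the `@context` lets linked-data tools interpret the keys.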

3 replies

cmdoret Oct 6, 2023

Also, in the spirit of reusing existing metadata standards, perhaps the Dublin Core terms could be of interest: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/

fcomte Oct 6, 2023
Maintainer Author

Yes, you've understood correctly. In our initial approach, we've chosen to use JSON format for metadata.

On the topic of JSON-LD: It's an interesting proposition. We're definitely aware of the benefits that JSON-LD and linked data formats bring, especially in terms of semantic consistency and interoperability.

Regarding the DCAT ontology: It's a good idea. Integrating with established standards like DCAT could enhance the semantic richness of our metadata and potentially streamline integration with other platforms and tools.

Speaking of our users, it's evident that Onyxia's primary objective centers around data manipulation rather than the creation of semantically-rich metadata. However, selecting the right standards from the outset can pave the way for smoother integrations with other platforms, especially within an organizational framework.

It seems we'll initially roll out a version focused solely on the data explorer, and then in subsequent iterations, we'll integrate the metadata saving and editing features. By focusing initially on the data explorer component, we can ensure that users have a seamless experience in viewing and understanding their datasets. Once that foundation is robust and user-friendly, we can then introduce the metadata editing and saving component. This phased approach allows for iterative improvements based on user feedback and ensures that each feature is thoroughly tested and refined before moving on to the next.

vancauwe Oct 9, 2023

That sounds like a very good and reasonable approach. Please feel free to reach out to me/Cyril in the process. We look forward to seeing the first stage of the Data Explorer and rejoice in advance for the future possible enhancements ⭐

qgau · 2023-10-10T14:49:28Z

qgau
Oct 10, 2023

Really promising! Looking forward to seeing that.

Actually, your approach is really in line with what we have been doing in the oceanographic field for a few years now, especially:

hosting files on S3, using for example an Analysis-Ready Cloud-Optimized standard format such as Zarr
hosting metadata on S3, to maintain a catalog of the data, using the JSON-based STAC standard
exploring (searching, viewing) data, as a consequence, in an S3-native and optimized way. The only difference here is that some of our users are not fond of SQL, but that might change

Both Zarr and STAC are Open Geospatial Consortium standards.
It would be awesome if, in the future, the Data Explorer could support some sort of plugin so that it would work natively with our storage organisation.
Do not hesitate to contact me if you wish to go deeper in that direction.

Not fully related, but speaking of OGC standards and data, we are currently playing around with charts that launch not services but (container-based) functions, to execute scientific (or not) computations with inputs and outputs.
We foresee using the web UI to let a user launch such a function on demand (specifying the input at launch time), while the output data and metadata would go directly into the user's file storage.
The web API will follow OGC API Processes, which, despite being defined by OGC, is really simple and generic enough for any kind of remote function specification.
We planned to contact you to see whether this would be interesting to add natively to Onyxia (below "Services", for example).
The main use case we have in mind is to let the community share common/generic data pre- and post-processing, data interpolations and so on (and, in our case, more complex things such as running ocean forecasts on demand), and to allow users to run these on demand against their data.
We also want to use it internally to automatically generate/extract catalog metadata from data that contains embedded metadata, which leads us back to the data explorer. Again, do not hesitate to contact us if you are interested before we decide it is mature enough on our side.

Thanks to the team, again, great work!

2 replies

fcomte Oct 10, 2023
Maintainer Author

Function as a Service (FaaS) deeply intrigues me. While I'm enamored with the concept, the current maturity of the ecosystem gives me pause. The experience can vary greatly based on the framework chosen. My concern is investing significant time in a framework that might not stand the test of time, potentially leading our users down a path that might not be sustainable in the long run. On the container front, Kubernetes has proven its worth, and I believe we can train our data scientists to use it effectively. However, I might be mistaken, and I'd appreciate insights from the Onyxia community regarding building computations over FaaS. So let's share that.

On the metadata front, we aim to keep the door open to multiple standards.

Thanks a lot for your detailed response.

qgau Oct 11, 2023

Well, to be honest, I deliberately didn't call it "FaaS", because FaaS is intended (in my view, at least) to ease cloud adoption for developers (simplicity, scalability, etc.), while OGC Processes were actually defined a long time ago, under the name WPS (version 1.0.0 was published in 2007), with the intention of integrating geo-scientific work and building end-user applications on top of it. It doesn't mean the actual computation has to happen in the cloud; precisely, we aim to enable job dispatching on HPC for some of these "remote functions", because a lot of physical-ocean forecasting systems are already in place on HPC.

Moreover, there is an OGC API Processes STAC extension, enabling two features for us:

asset materialization -> you can define metadata before the asset(s) is actually produced, and trigger the materialization programmatically and/or on-demand for the end-users. Indirectly, it enables storage optimization.
data lineage -> you can track how data is produced (with which process and inputs)

In the end, OGC API Processes simply provides a web API for remote functions, callable by anyone, regardless of the underlying complexity of the job dispatching (HPC access is often only possible via SSH or telnet, which is not a great UX for end users). There are other similar protocols/standards, such as RPC/gRPC.

Another way to understand our motivation (hopefully) is: processes are to FaaS what Onyxia self-services are to cloud-native services with auto-scaling. For example, instead of having JupyterLab in the datalab, if all our users need JupyterLab, it might be more efficient (user-friendly, cost-effective) to deploy it once and for all on Kubernetes with autoscaling. Processes are just the same: they allow users to define functions and call them, but, at some point, if a function is built to compute something in the cloud, something that can be optimized with the cloud (replication, memoization, etc.), then a migration to a FaaS system to call this "function" in production can be considered, as it is supposed to be optimized for that.

I hope it is clearer now ^^

"On the metadata front, we aim to keep the door open to multiple standards."

Thank you for that ❤️

agarrone · 2023-10-11T08:51:39Z

agarrone
Oct 11, 2023

Nice!

It would be great to discuss this between Onyxia and data.gouv.fr.

In terms of product inspiration, here are some resources we used when building our data explorer:

You can also learn more on our own data explorer here.

2 replies

fcomte Nov 21, 2023
Maintainer Author

Hello @agarrone, I think we can discuss this after a quick first iteration, once we have something concrete on our side.

agarrone Nov 21, 2023

Sounds good :)

fcomte · 2023-11-21T00:15:06Z

fcomte
Nov 21, 2023
Maintainer Author

We are ready to start development. The proposal for the first step is here:

#664

0 replies

odysseu · 2023-11-21T08:09:15Z

odysseu
Nov 21, 2023
Maintainer

There will have to be some implementation for SSE-Client side encryption.

0 replies
