Add file level metadata extraction #244

Open
yarikoptic opened this issue Sep 14, 2023 · 5 comments

Comments

@yarikoptic
Member

One of the possible use cases would be to establish/feed the datalad-catalog as outlined in #123 (comment).
For that we should start with metalad_core file-level extraction. But I think we might need some guidance, or even new functionality at the metalad level, to make it feasible in a reasonable time. One relevant issue, with a comment there, is datalad/datalad-metalad#379; I am also now inquiring with @christian-monch on Element.

@yarikoptic
Member Author

Here is the issue which describes it: datalad/datalad-metalad#395. It would be nice to first run it across a good sample of datasets to assess how large those dumps would be: would PostgreSQL scale up to store that many JSONB records?

@candleindark
Collaborator

I know that JSONB is the best type for storing JSON data in Postgres in many respects.

As for how well it scales, below is an answer from ChatGPT.

The JSONB data type in PostgreSQL is highly flexible and efficient for storing JSON-formatted data. Below are some points that discuss how well it scales:

Storage

  • Binary Format: JSONB is stored in a binary format, which allows for more efficient processing and retrieval, though it can use slightly more storage space than plain-text JSON.

Query Performance

  • Indexing: One of the main advantages of JSONB is its support for specialized JSON indexing, through GIN (Generalized Inverted Index) or GiST (Generalized Search Tree), which can dramatically improve query performance.

  • Nested Queries: If your application requires complex nested queries within JSON structures, JSONB can offer a performance advantage.

Write Performance

  • Overhead: JSONB incurs a small overhead during the insert or update operations as it parses and rewrites the JSON data into a binary format.

Space Utilization

  • Compression: JSONB data is more efficiently compressed than plain JSON or text data types, but the difference is usually marginal.

Search Operations

  • Advanced Operators: JSONB supports complex query expressions and operators that are not available in the JSON type, providing more flexibility and power.

Scalability

  • Partitioning: PostgreSQL supports table partitioning, and you can partition JSONB columns as well. This can help in scaling the database horizontally.

  • Read Replicas: For read-heavy workloads, you can use read replicas to distribute the read operations and improve performance.

Compatibility and Interoperability

  • Compatibility with Other Data Types: JSONB can be easily cast to other data types and manipulated using a range of built-in PostgreSQL functions.

Recommendations:

  1. Caching: If your JSONB fields are often queried but rarely updated, consider caching the results to avoid hitting the database repeatedly.

  2. Selective Indexing: Use indexes judiciously. Create them only for the keys that are frequently used in queries.

  3. Batch Processing: If possible, batch insert or update operations to reduce the per-operation overhead of JSONB conversion.

  4. Examine Query Plans: Use tools like EXPLAIN to understand how PostgreSQL is handling your JSONB queries and optimize accordingly.

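To make the GIN-indexing and containment-operator points above concrete, here is a minimal sketch of how such a per-file metadata table could be declared and queried with SQLAlchemy. All table, column, and connection names here are hypothetical illustrations, not the project's actual schema:

```python
# Hedged sketch: hypothetical per-file metadata table with a JSONB column
# and a GIN index, assuming SQLAlchemy 1.4+ with the psycopg2 driver.
from sqlalchemy import Column, Integer, Text, Index, create_engine, select
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()


class FileMetadata(Base):
    """One extracted metadata record per (dataset, path) -- hypothetical model."""
    __tablename__ = "file_metadata"

    id = Column(Integer, primary_key=True)
    dataset_id = Column(Text, nullable=False)
    path = Column(Text, nullable=False)
    # JSONB stores the extractor output in a parsed binary form, so keys
    # inside it can be indexed and queried directly.
    extracted = Column(JSONB, nullable=False)

    # GIN index over the whole JSONB document; it speeds up containment (@>)
    # and key-existence (?) queries at the cost of slower inserts/updates.
    __table_args__ = (
        Index("ix_file_metadata_extracted", "extracted", postgresql_using="gin"),
    )


engine = create_engine("postgresql+psycopg2://user:pass@localhost/metadata_db")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Containment query (can use the GIN index): records whose JSONB
    # document contains the given sub-document.
    stmt = select(FileMetadata).where(
        FileMetadata.extracted.contains({"type": "file"})
    )
    for rec in session.scalars(stmt):
        print(rec.dataset_id, rec.path)
```

If only a few keys are ever queried, a smaller expression index on just those keys (the "selective indexing" recommendation above) keeps write overhead and index size down.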

@yarikoptic
Member Author

> Scalability
>
> Partitioning: PostgreSQL supports table partitioning, and you can partition JSONB columns as well. This can help in scaling the database horizontally.

sounds relevant.

I guess we will see how well (or how badly) it works after we fill this one up with some initial per-file dumps. Then we might want to look into partitioning etc.
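If partitioning does turn out to be needed, a rough sketch of Postgres declarative LIST partitioning by dataset might look like the following. The table and column names are hypothetical, and the DDL is issued through SQLAlchemy only for illustration:

```python
# Hedged sketch: hypothetical partitioned variant of the per-file metadata
# table, one partition per dataset (PostgreSQL 10+ declarative partitioning).
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/metadata_db")

with engine.begin() as conn:
    # Parent table is partitioned by dataset id; it holds no rows itself.
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS file_metadata_partitioned (
            dataset_id text  NOT NULL,
            path       text  NOT NULL,
            extracted  jsonb NOT NULL
        ) PARTITION BY LIST (dataset_id)
    """))
    # One partition per dataset; Postgres routes inserted rows by dataset_id.
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS file_metadata_ds000001
            PARTITION OF file_metadata_partitioned
            FOR VALUES IN ('ds000001')
    """))
```

Indexes defined on the parent are created on each partition automatically (PostgreSQL 11+), and queries filtered on dataset_id are pruned to the matching partition.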

@candleindark
Collaborator

I think JSONB is the best option Postgres offers for JSON data for our use. If we want anything better, we would have to look into NoSQL databases: CouchDB, Elasticsearch, etc. Many applications use both relational and NoSQL databases.

@yarikoptic
Member Author

OK, let's plan to tackle this one soonish after we merge #257. For that one we will enable dandi:files, and we will also need to create a "pipeline" to extract e.g. metalad_core:files as described in datalad/datalad-metalad#395 -- starting with just that single metalad_core extractor.

Later, after we see how well (or how badly) this scales for the various uses (e.g. search, datalad-catalog, etc.), we might want to establish a dedicated table with per-dataset/path metadata records.
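For orientation only, here is a very rough sketch of what driving per-file extraction with the metalad_core extractor could look like; it is not the pipeline described in datalad/datalad-metalad#395, and it assumes that `datalad meta-extract metalad_core <path>` (from datalad-metalad) prints the extracted metadata record as JSON on stdout, which should be verified against the metalad documentation:

```python
# Hedged sketch of per-file extraction via datalad-metalad, assuming the
# meta-extract command emits one JSON metadata record on stdout per call.
import json
import subprocess

DATASET = "/path/to/dataset"  # hypothetical local clone of a dataset


def extract_file_metadata(relpath: str) -> dict:
    """Run the metalad_core extractor on one file and parse its JSON output."""
    result = subprocess.run(
        ["datalad", "meta-extract", "-d", DATASET, "metalad_core", relpath],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Files known to git in the dataset; start with a small sample to gauge
    # how large the per-file dumps would be before loading them into Postgres.
    files = subprocess.run(
        ["git", "-C", DATASET, "ls-files"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    for relpath in files[:10]:
        record = extract_file_metadata(relpath)
        print(relpath, len(json.dumps(record)))
```

Record sizes from a small sample like this would give a first estimate of how large the full per-file dump for a dataset would be.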
