
file cache: Moving files and updating primary keys versus dropping and revalidating files #2

Open
SevorisDoe opened this issue May 12, 2023 · 2 comments


@SevorisDoe
Collaborator

Since Neorg stores files using their relative path from the workspace root, when files are moved on the file system, their primary keys need to be updated in the database cache. This has the advantage that primary-key cascades will automatically ensure any linked resources are updated as well.
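
As a sketch of how a relative-path primary key plus cascades could behave (the table and column names here are hypothetical; SQLite is driven from Python purely for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this pragma on
conn.execute("CREATE TABLE documents (path TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE links (
        source TEXT REFERENCES documents(path)
            ON UPDATE CASCADE ON DELETE CASCADE,
        target TEXT
    )
""")
conn.execute("INSERT INTO documents VALUES ('notes/a.norg')")
conn.execute("INSERT INTO links VALUES ('notes/a.norg', 'notes/b.norg')")

# Moving the file becomes a single UPDATE on the primary key; the cascade
# rewrites every referencing row in links without touching them explicitly.
conn.execute("UPDATE documents SET path = 'archive/a.norg' WHERE path = 'notes/a.norg'")
moved = conn.execute("SELECT source FROM links").fetchone()[0]
# moved == "archive/a.norg": the linked row followed the key change
```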

This, however, means that any movement of files on the file system must occur in a way that Norgberg can intercept and identify. Depending on which tools users run on norg files and the larger system, no such guarantee exists.

The alternative is to treat any absence of a file, or any moved or newly created file, as requiring a re-parse. That means all rows related to that file are dropped and rebuilt from scratch.

This behavior is probably not entirely avoidable. Norgberg has no oversight over how files may have been changed while the database is not active, so any file under plausible suspicion of having changed needs to be re-validated. But this can be expensive, due to the need to re-run link resolution and metadata registration, and it may lead to loss of plugin data.

So we might consider some options for identifying files more quickly. For complete Zettelkasten notes, we can make use of the ZK timestamp/ID identifier. We could also use file hashes to reinforce a "probably the same file" suspicion when a file has disappeared from its last known location and another file with the same name and hash exists elsewhere. Renamed but otherwise unchanged or minimally changed files could be identified by diffing. Since there is interest in a full-text search index capability, we might use the underlying full-text storage blob in the DB to drive such methods of re-validation.
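
A minimal sketch of the ZK-identifier idea, assuming a timestamp-prefix naming scheme such as `202305121034-title.norg` (the exact scheme is an assumption, not something decided here):

```python
import re
from pathlib import Path

# Assumed Zettelkasten naming scheme: a 12-14 digit timestamp prefix
# (YYYYMMDDHHMM[SS]) at the start of the file name.
ZK_ID = re.compile(r"^(\d{12,14})")

def zk_id(path):
    """Return the timestamp ID embedded in the file name, or None."""
    m = ZK_ID.match(Path(path).name)
    return m.group(1) if m else None

# Two paths carrying the same ZK ID are strong evidence of a moved file,
# even before hashing the contents.
assert zk_id("notes/202305121034-cache.norg") == "202305121034"
assert zk_id("inbox/todo.norg") is None
```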

All of this would allow us to avoid unnecessary update activity on the database state. While re-parsing with Tree-sitter and extracting the data to update should be pretty cheap, skipping resolution behavior and the like is of interest to keep database load and insert locks to a minimum, and to avoid destroying annotations by accident if at all possible.

@BoManev

BoManev commented May 15, 2023

We could have a table, per workspace, containing:

  • file path
  • file hash
  • last update timestamp
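
The tracking table above could be sketched as follows (table and column names are illustrative, not decided):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE files (
        path       TEXT PRIMARY KEY,  -- workspace-relative file path
        hash       BLOB NOT NULL,     -- content hash at the last sync
        updated_at INTEGER NOT NULL   -- mtime recorded at the last sync (unix seconds)
    )
""")
conn.execute(
    "INSERT INTO files VALUES (?, ?, ?)",
    ("notes/a.norg", b"\x01\x02", 1683900000),
)
path, digest, mtime = conn.execute(
    "SELECT path, hash, updated_at FROM files"
).fetchone()
```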

There are four situations we could encounter: a deleted file, an updated file, a renamed file without updates, and a renamed file with updates. We could assume timestamps from the file system are not tampered with by the user, so we only trigger rehashing for mismatched timestamps and for unknown files (present in a workspace, but not in the "files" table).

  • Updated file: mismatched timestamp and hash at a known path; we trigger a re-parse (hopefully with some incremental sync to the database).
  • Renamed file without updates: an unknown file whose hash matches a known file; we update the path.
  • Deleted file: a known file that is gone and matches no hash among the unknown files; we remove any associated tables/data.
  • Renamed file with updates: the tricky case, since we can't use hashing to match files. We could examine the contents of the file and try to match it to an existing table (this could get hairy). The safest option is to remove any data associated with unmatched files and parse the rest of the unknown files (build new tables).
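
The four-case scan could be sketched roughly as follows; `cached` and `on_disk` stand in for the "files" table and a fresh workspace scan, and the matching rules are simplified to pure hash/timestamp comparisons:

```python
def classify(cached, on_disk):
    """Compare the cached files table against a fresh workspace scan.

    Both arguments map path -> (hash, mtime). As described above, matching
    timestamps are trusted; only mismatches and unknown paths are examined.
    """
    result = {"deleted": [], "updated": [], "renamed": [], "new": []}
    unknown = {p: h for p, (h, _) in on_disk.items() if p not in cached}
    unknown_hashes = {h: p for p, h in unknown.items()}

    for path, (old_hash, old_mtime) in cached.items():
        if path in on_disk:
            new_hash, new_mtime = on_disk[path]
            if new_mtime != old_mtime and new_hash != old_hash:
                result["updated"].append(path)  # same path, changed content
        elif old_hash in unknown_hashes:
            # Same content at a new location: rename without edits.
            result["renamed"].append((path, unknown_hashes.pop(old_hash)))
        else:
            # No path match and no hash match: treat as deleted. A file that
            # was renamed AND edited falls here and is re-ingested as new.
            result["deleted"].append(path)

    result["new"] = sorted(unknown_hashes.values())
    return result
```

Hashes and mtimes here are opaque placeholders; in practice the hashes would come from the rehashing step and the mtimes from the file system.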

This part is not relevant to the 0.1 release, but it is something to keep in mind.
We have to consider ownership here: Norgberg can't remove/update tables it doesn't own, since we don't know what business logic is needed for these operations. Additionally, we probably don't want other modules to directly invoke the parser; it's inefficient to parse the same file across different modules. We have to consider 2 types of changes:

  • internal, done from nvim instances
  • external, done by any other application (e.g. adding a new image/document to a "resources" folder inside a workspace)

The nvim Norgberg plugin will notify Norgberg of any changes (either on file save or on buffer changes), which Norgberg will then relay to other modules. I'm not sure about the implications of external changes, but we might have to monitor the workspaces for any file system operations. For example, if a module wants to use Norgberg to provide cross-workspace path completion when linking to files, then we have to monitor for external changes. However, this might be out of scope for Norgberg, and we would have to coordinate with Neorg's dirman module (which is currently being rewritten in Rust).

Fast hashing with incremental updating and streaming.
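
A chunked-hashing sketch along these lines (BLAKE2b is chosen here only as a fast stdlib hash; the incremental per-chunk `update` keeps memory use constant regardless of file size):

```python
import hashlib

def file_digest(path, chunk_size=1 << 16):
    """Hash a file in fixed-size chunks, streaming it from disk."""
    h = hashlib.blake2b(digest_size=16)
    with open(path, "rb") as f:
        # Feed the hash incrementally; only one chunk is in memory at a time.
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```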

@SevorisDoe
Collaborator Author

Generally I would recommend that any other module storing information related to specific documents declare ON UPDATE CASCADE ON DELETE CASCADE, or otherwise understand that its entries may become orphaned if the primary key changes or is deleted. (Incidentally, we should restrict, ah, forbid RESTRICT from being set on foreign keys referencing the primary document table.)

Agreed on the chain of suspicion here:

  1. files in the same place in the file system are synced, and thus also hashed when written to disk (based on having a newer edit timestamp than the last write to the database).
  2. if a file is moved using Neorg/nvim facilities, we can probably intercept the call and use that to know how the file was renamed.
  3. Otherwise, we use "same file name, different path, same hash" to generate an UPDATE on the primary key, avoiding the need to re-parse.
  4. If that fails, we delete the old entry and trigger a completely fresh ingestion for the file.
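
Steps 2 through 4 of that chain, reduced to a per-file decision function (step 1 is the ordinary in-place sync, so it does not appear; all names and return values here are illustrative):

```python
def resolve_missing_file(old_path, old_hash, rename_events, moved_candidates):
    """Decide what to do about a file that vanished from its last known path.

    `rename_events` maps old paths to new paths as reported by Neorg/nvim
    hooks (step 2); `moved_candidates` maps content hashes of files found at
    unknown paths to those paths (step 3).
    """
    if old_path in rename_events:        # step 2: intercepted rename
        return ("update_key", rename_events[old_path])
    if old_hash in moved_candidates:     # step 3: same hash elsewhere
        return ("update_key", moved_candidates[old_hash])
    return ("reingest", old_path)        # step 4: drop and re-parse from scratch
```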

Regarding database syncs: this is some type of synchronization module. Probably still Rust-native, but it is basically what has to sort out which files are new, task the multi-threaded parser with producing a Tree-sitter CST, then cache those trees in memory while various modules go over them and consume them. Probably some type of observable logic with subscriptions: modules that want to check the CST when a file update occurs subscribe, and get called back when a sync to the database occurs.
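
That subscription idea, as a minimal observer sketch (Python purely for illustration; the real module would presumably be Rust, and the names are invented here):

```python
class SyncBus:
    """Modules subscribe once; the sync module calls them back with each
    freshly parsed tree while it is still cached in memory."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, path, tree):
        # Called by the sync module once the parser has produced a CST.
        for cb in self._subscribers:
            cb(path, tree)

bus = SyncBus()
seen = []
bus.subscribe(lambda path, tree: seen.append(path))
bus.publish("notes/a.norg", object())  # object() stands in for a Tree-sitter CST
```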

We may also want to post event notices, as a last step after an update transaction has rolled through, for modules that parse against the database rather than any CSTs: the state has changed, and they may want to check whether there is anything for them to do.
