file cache: Moving files and updating primary keys versus dropping and revalidating files #2
We could have a table, per workspace, containing:

- the file's path relative to the workspace root (the primary key)
- a content hash
- the last-modified timestamp from the file system
There are four situations we could encounter: a deleted file, an updated file, a renamed file without updates, and a renamed file with updates. We could assume timestamps from the file system are not tampered with by the user, and only trigger rehashing for mismatched timestamps and for unknown files (present in the workspace, but not in the `files` table). Then:

- **Updated file:** detected by a mismatched timestamp and hash; we trigger a re-parse (hopefully some incremental sync with the database).
- **Deleted file:** detected when none of the unknown files' hashes match it; we remove any associated tables/data.
- **Renamed file without updates:** detected by matching on the hashes of the unknown files.
- **Renamed file with updates:** the tricky case, since we can't use hashing to match files. We could examine the contents of the file and try to match it to an existing table (this could get hairy). The safest option is to remove any data associated with unmatched files and parse the rest of the unknown files (building new tables).

This part is not relevant to the 0.1 release, but something to keep in mind. A sketch of this detection logic follows below.
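As a minimal sketch of how that detection could look, assuming the per-workspace table above and a hypothetical `FileRecord` row type (all names here are illustrative, not an existing API):

```rust
use std::collections::HashMap;

/// One row of the hypothetical per-workspace `files` table.
struct FileRecord {
    path: String,   // relative path, primary key
    hash: [u8; 32], // content hash
    mtime: i64,     // last-modified timestamp from the file system
}

enum FileChange {
    Updated { path: String },             // same path, new hash: re-parse
    Deleted { path: String },             // known path gone, hash unmatched
    Renamed { from: String, to: String }, // new path, known hash
    New { path: String },                 // new path, unknown hash
}

/// Compare the cached table against what is on disk right now.
/// `on_disk` maps relative path -> (hash, mtime); in a real sync,
/// hashing would only be run for files whose mtime disagrees with
/// the cache, per the timestamp assumption above.
fn diff_workspace(
    cached: &[FileRecord],
    on_disk: &HashMap<String, ([u8; 32], i64)>,
) -> Vec<FileChange> {
    let mut changes = Vec::new();
    let mut hash_index: HashMap<[u8; 32], &str> = HashMap::new();
    for rec in cached {
        match on_disk.get(&rec.path) {
            // Path still exists: only a changed hash means an update.
            Some((hash, _)) if *hash != rec.hash => {
                changes.push(FileChange::Updated { path: rec.path.clone() })
            }
            Some(_) => {} // untouched
            // Path is gone: remember the hash so a rename can claim it.
            None => {
                hash_index.insert(rec.hash, &rec.path);
            }
        }
    }
    for (path, (hash, _)) in on_disk {
        if cached.iter().any(|r| &r.path == path) {
            continue; // already handled above
        }
        match hash_index.remove(hash) {
            // Unknown path but known hash: a rename without updates.
            Some(old) => changes.push(FileChange::Renamed {
                from: old.to_string(),
                to: path.clone(),
            }),
            // Unknown path and unknown hash: treated as new. A rename
            // *with* updates would need content matching to detect.
            None => changes.push(FileChange::New { path: path.clone() }),
        }
    }
    // Whatever is left unclaimed was deleted (or renamed and edited,
    // which the safe option treats the same way: drop and rebuild).
    for (_, old) in hash_index {
        changes.push(FileChange::Deleted { path: old.to_string() });
    }
    changes
}
```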
Generally I would recommend that any other modules storing information related to specific documents declare `ON UPDATE CASCADE ON DELETE CASCADE`, or otherwise understand that their entries may get orphaned if the primary key changes or is deleted. (Incidentally, we should restrict, ah, RESTRICT as being illegal to set on foreign keys referencing the primary document table.) Agreed on the chain of suspicion here.
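To make the cascade recommendation concrete, here is a minimal schema sketch, assuming SQLite accessed through the `rusqlite` crate; the `files` and `links` table names are illustrative, not an existing Norgberg schema:

```rust
use rusqlite::{Connection, Result};

fn create_schema(conn: &Connection) -> Result<()> {
    // SQLite only enforces foreign keys when this pragma is on.
    conn.execute_batch(
        "PRAGMA foreign_keys = ON;

         CREATE TABLE IF NOT EXISTS files (
             path  TEXT PRIMARY KEY,  -- relative to the workspace root
             hash  BLOB NOT NULL,
             mtime INTEGER NOT NULL
         );

         -- A module table keyed on the document: the cascades keep its
         -- rows in sync when a file's primary key is renamed or deleted.
         CREATE TABLE IF NOT EXISTS links (
             source TEXT NOT NULL
                 REFERENCES files(path)
                 ON UPDATE CASCADE
                 ON DELETE CASCADE,
             target TEXT NOT NULL
         );",
    )
}
```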
Regarding database syncs: this is some type of synchronization module. Probably still Rust-native, but it is basically what has to sort out which files are new, task the multi-threaded parser with producing a Tree-sitter CST, then cache those trees in memory while various modules go over them and consume them. Probably some type of observable logic with subscription: modules that want to check the CST when a file update occurs subscribe, and get called back when a sync to the database occurs. We may also want to post event notices for modules that parse against the database instead of any CSTs, as a last step after an update transaction has rolled through, signalling that the state has changed and that they may want to check whether there is anything for them to do.
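As a rough sketch of that subscription shape (all trait and type names here are hypothetical; `Cst` stands in for whatever the Tree-sitter parse hands back):

```rust
/// Placeholder for a cached Tree-sitter CST.
struct Cst;

/// Modules that want the CST on every file update implement this.
trait CstSubscriber {
    fn on_file_synced(&self, path: &str, tree: &Cst);
}

/// Modules that query the database instead of CSTs implement this;
/// they are notified once per completed update transaction.
trait DbSubscriber {
    fn on_db_changed(&self);
}

#[derive(Default)]
struct SyncBus {
    cst_subs: Vec<Box<dyn CstSubscriber>>,
    db_subs: Vec<Box<dyn DbSubscriber>>,
}

impl SyncBus {
    /// Called by the sync module after parsing a file and writing it
    /// to the database: fan the CST out to every subscriber.
    fn publish_file(&self, path: &str, tree: &Cst) {
        for s in &self.cst_subs {
            s.on_file_synced(path, tree);
        }
    }

    /// Called once after the whole update transaction has committed.
    fn publish_db_changed(&self) {
        for s in &self.db_subs {
            s.on_db_changed();
        }
    }
}
```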
Since Neorg stores files using their path relative to the workspace root, when files are moved on the file system their primary keys need to be updated in the database cache. This has the advantage that primary key cascades will automatically ensure any linked resources are moved along as well.
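Under the cascading schema sketched above, a detected move then reduces to a single UPDATE of the primary key (again using the illustrative `files` table):

```rust
use rusqlite::{params, Connection, Result};

/// Record a file move: the ON UPDATE CASCADE foreign keys pull every
/// dependent row (links, metadata, plugin data) along automatically.
fn move_file(conn: &Connection, old_path: &str, new_path: &str) -> Result<usize> {
    conn.execute(
        "UPDATE files SET path = ?1 WHERE path = ?2",
        params![new_path, old_path],
    )
}
```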
This however means that any movement of files on the file system must occur in a way that Norgberg can intercept and identify. Depending on how users use tools around norg files and the larger system, such guarantees do not exist.
The alternative is to treat any absence of files, and any movement or creation of files, as a need for a re-parse. That means all rows related to a given file are dropped and rebuilt from scratch.
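The drop-and-rebuild path is correspondingly simple under that schema; the ON DELETE CASCADE takes the dependent rows with it (illustrative names again):

```rust
use rusqlite::{params, Connection, Result};

/// Invalidate a file completely: the cascade drops every dependent
/// row, after which the file is re-parsed as if it were new.
fn drop_file(conn: &Connection, path: &str) -> Result<usize> {
    conn.execute("DELETE FROM files WHERE path = ?1", params![path])
}
```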
This behavior is probably not entirely avoidable. Norgberg has no oversight over how the files may have been changed while the database is not active. That means any file under plausible suspicion of having been changed needs to be re-validated. But this can be expensive, due to the need to re-run link resolution and registration of metadata, and it may lead to loss of plugin data.
So we might consider some options for identifying files more quickly. In the case of complete Zettelkasten notes, we can make use of the ZK timestamp/ID identifier. We could also use file hashes to reinforce a "probably the same file" suspicion: if a file has disappeared from its last known location and another file with the same name and hash exists elsewhere, it was most likely moved. We could also identify renamed but otherwise unchanged or minimally changed files using file diffing. Since there is interest in a full-text search index capability, we might use the underlying full-text storage blob in the DB to drive such re-validation methods.
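A sketch of the hash-based reinforcement, with hypothetical names throughout (it checks only the hash; the same-filename condition above could be layered on top): if the content hash stored for the vanished path matches a newly discovered file, we rewrite the key instead of dropping and re-parsing.

```rust
use rusqlite::{params, Connection, OptionalExtension, Result};

/// If a file vanished from `missing_path` and a yet-unregistered file
/// elsewhere carries the same content hash, treat it as a move rather
/// than a delete-plus-create, and just rewrite the primary key.
fn try_reidentify(
    conn: &Connection,
    missing_path: &str,
    candidate_path: &str,
    candidate_hash: &[u8],
) -> Result<bool> {
    let known: Option<Vec<u8>> = conn
        .query_row(
            "SELECT hash FROM files WHERE path = ?1",
            params![missing_path],
            |row| row.get(0),
        )
        .optional()?;
    if known.as_deref() == Some(candidate_hash) {
        conn.execute(
            "UPDATE files SET path = ?1 WHERE path = ?2",
            params![candidate_path, missing_path],
        )?;
        return Ok(true); // moved: cascades keep dependent rows intact
    }
    Ok(false) // no match: fall back to drop-and-reparse
}
```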
All of this would allow us to avoid unnecessary update activity on the database state. While re-parsing with Tree-sitter and extracting the data to update should be pretty cheap, skipping resolution behavior and the like is of interest to keep database load and insert locking to a minimum, and to avoid accidentally destroying annotations wherever possible.