GitHub issues #495 #102
-
Medley Repo Cleanup - Git Bloat

Problem

Early construction of the Interlisp Medley repository resulted in the loading of binary files (scanned PDFs and other binary files) that were deleted as the current structure of the repository evolved. This has left the git repository's history bloated. At present, the repository is just over 1 GB in size. The goal of this activity is to clean up the repository by removing unneeded files from the git history and, in turn, trim the repository down to a more manageable size.

Rebase

An initial experiment is to rebase the complete repository, squashing together the commits that add files with the commits that delete them. The idea is that by bringing the commits together we end up removing the files from git. This plan does not work. Rebasing, moving, and squashing commits doesn't remove the files from git's object store. They are still there taking up room. We need a different approach to excise the files from git's object library.

Nuclear Option

This option produces the best possible outcome, removing all wasted space and creating a minimal-sized repository. By deleting the .git directory and recreating the git repository we can create a benchmark, albeit a benchmark that lacks any history. The following steps from Stack Overflow illustrate the process:

git log > original.log
rm -rf .git
git init
git add .
git commit -F original.log

At the conclusion of this exercise the repository is approximately 40 MB. This tells us the best achievable outcome is a repository close to 40 MB.

git filter repo

A Python tool, git filter repo, is available that automates rewriting and cleanup of git repos. One function this tool provides is the ability to select and remove dead directories from the git object store. The tool, in turn, updates the

One of the biggest consumers of space in the medley repository is the

To fix this problem correctly, we need to (1) remove the files and (2) rewrite history so that if we reset our view back to August 29th we don't end up with a corrupted branch with missing files. Git has the ability to do this. Git Filter Repo simplifies this work, helping ensure the repository remains consistent. Removing the directory is a large first step in cleaning up the repository. First, some analysis to better understand what the object library holds:

medley(master) $ git_filter_repo.py --analyze
Processed 14640 blob sizes
Processed 317 commits
Writing reports to .git/filter-repo/analysis...done.

The results show:

The deleted

NOTE: The unpacked and packed size values can be slightly misleading and only provide a rough estimate of space used. The analyze option generates a README file in the analysis directory that provides additional details.

Removing the

medley(master) $ git_filter_repo.py --path docs/irm.pdf --invert-paths
Parsed 317 commits
New history written in 0.40 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at d08ce2e See PR #275 for discusssion
Enumerating objects: 14510, done.
Counting objects: 100% (14510/14510), done.
Delta compression using up to 8 threads
Compressing objects: 100% (8825/8825), done.
Writing objects: 100% (14510/14510), done.
Total 14510 (delta 5528), reused 14507 (delta 5525)
Completely finished after 1.19 seconds.

This operation does two things. It removes the docs/irm.pdf directory and all its subdirectories and their contents. It then parses the git history, removing all references to the files that have been removed. This can be seen by reviewing the git log:

The git history has been rewritten, so the hash of every commit has changed. More importantly, the first commit (the newest commit) exists only in the before log; it has been removed in the after log. If we were to examine the commit where the documents were checked in to docs/irm.pdf, we would find the commit edited and all those files removed. Since the commit contains other files that are still available, the commit remains. History has been rewritten to omit any mention of the docs/irm.pdf related files, meaning we could check out the "check in docs" commit (now 5bab28c) and have a stable view of the master branch, albeit missing the documents that were in the docs/irm.pdf directory.

This process can be continued to remove other unused directories. In addition, large standalone files that are no longer needed can also be removed. The same process would apply: the files would be removed from the git repository and history rewritten to exclude them. Using this approach and just removing the
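
When deciding which other directories or standalone files are worth removing, it can help to list the largest blobs that are still reachable anywhere in history, including files that were deleted from the working tree long ago. The following is a minimal sketch using only core git commands; it is not part of git filter repo, and the function name `largest_blobs` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list the N largest blobs reachable from any ref, largest first.
# Each output line is "<size-in-bytes> <path>". Run inside a git repository.
# The function name is hypothetical, not something provided by git.
largest_blobs() {
  local count="${1:-10}"
  # rev-list prints "<sha> <path>" for every object in history;
  # cat-file looks up type and size and passes the path through as %(rest).
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |
    sort -rn |
    head -n "$count"
}

# Example usage (from the top of a clone):
#   largest_blobs 20
```

Because the listing walks history rather than the working tree, a file that was added and later deleted still shows up, which is exactly the kind of content the cleanup is trying to find.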
-
(Moving the discussion back to GitHub) I was assuming the use of git lfs migrate including Second, it will carry forward to sysouts as well as scanned image PDFs from older scanners. I'm not sure how seriously it impacts development workflows that there are older machines that can't run git lfs. @nbriggs ?
-
@stumbo wrote earlier:

In the 7 June meeting, Larry mentioned I had proposed a second option for resolving space issues on GitHub. Since I had to drop off early for a work meeting, I thought I'd send an email and hopefully spur on the discussion. The initial conversation is documented at: #102

The short synopsis is that there is a lot of bloat in the repo, driven primarily by the addition of a collection of PDF documents that were later moved out of the repository. And, as Larry has also mentioned, saving old tilde versions of files into git has also led to some clutter and wasted space. We've discussed two potential options.

Option one is to 'surgically' remove unneeded content from git storage and remove any mention of it from git's history. We have tested this approach with the docs/irm.pdf directory. Using a third-party Python package I was able to remove all the files in this directory and its subdirectories, and rewrite git history so there is no knowledge of them within the Medley git repository. The discussion mentioned above describes the package and steps in greater detail. This action by itself significantly reduces the size of the Medley repository.

There are a couple of drawbacks to this approach. First, it rewrites git history, meaning that anyone who has forked our repository is going to suddenly find their repository no longer matches ours. Essentially, after running the operation and updating a local git repository, we need to do a force push to reset the GitHub version of the repo to match the revised local one. Anyone who uses the repository is going to need to update their local version and, depending on how much work they have done locally, may need to do more than stashing and popping from the stash to be resynced with the repo. Given it's a potentially painful exercise, the next concern is minimizing the number of times we rewrite global history. It would be best to force push a new version of the repo only once.

Which leads to the question, how much scrubbing should be done? When have we removed the right amount of cruft? And, if we are removing stuff and rewriting history, is there a point where we accidentally break something? What is our level of confidence that the toolset we use won't make a mistake, and what is our ability to identify an error before committing the changes?

Option two is much simpler. Decide on a point in time to start clean. Archive the existing repository. Delete the existing repository from GitHub and create a new git repository using the current source code base. This way, you're guaranteed the only files in the repository are what we deemed worthy of being under source control at time zero. There is no need to hunt down old files or worry about corrupting history. Of course, the downside is that there is no history. History, if needed, could be recovered by consulting the archived repository.

This approach has the same issue that the first approach does: we're replacing the repository on GitHub, and anyone who has forked a copy or pulled from our repo would suddenly find themselves out of sync with the existing repository. Again, there would likely be some effort required to resync with the repo. On the positive side, with this approach, the repository starts at its smallest possible size. Another positive: option two is much simpler to implement. There is no risk of creating a version of master that somehow has a corrupted history.

A couple of questions to consider as we determine the path forward:
Thoughts? Questions?

My thought is that "option two" (just start clean) could be done using something else to give ready access to the "archive" for anyone who wants it -- Dropbox, Google Drive, etc.
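
For the "archive the existing repository" step of option two, one possible approach (an assumption, not something the thread settled on) is `git bundle`, which packs every ref and the full history into a single file that can be stored anywhere and cloned from later. The sketch below assumes it is run inside the existing clone; the function name and the bundle file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: archive the complete history of the current repository into one file.
# "archive_repo_history" and the bundle path are illustrative names only.
archive_repo_history() {
  local bundle_path="$1"
  # --all includes every branch and tag; verify checks the bundle is usable.
  git bundle create "$bundle_path" --all &&
    git bundle verify "$bundle_path"
}

# Example usage (hypothetical file name):
#   archive_repo_history /tmp/medley-archive.bundle
# History can later be inspected by cloning the bundle:
#   git clone /tmp/medley-archive.bundle medley-archive
```

A single bundle file is easy to put on Dropbox, Google Drive, or similar, and anyone who needs the old history can clone it without the archive being hosted as a live repository.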
-
email from @stumbo 5/23

Instructions for removing unused directories from a git repository:

Prerequisite: git-filter-repo installed. See:

Let me know if you have any questions.
-
Git Large File Storage

Initial Results

Testing was done using Ubuntu running in WSL2 on a Windows 10 system.

Install git-lfs:

> sudo apt-get install git-lfs

Initialize git lfs:

> git lfs install

Run migrate info:

> git lfs migrate info --include-ref=master
migrate: Sorting commits: ..., done.
migrate: Examining commits: 100% (313/313), done.
*.pdf 1.3 GB 242/242 files(s) 100%
*.sysout 558 MB 56/56 files(s) 100%
*.~2~ 26 MB 465/466 files(s) 100%
*.LCOM 21 MB 999/1002 files(s) 100%
*.venuesysout 20 MB 2/2 files(s) 100%

The results show that PDF files consume 1.3 GB of the repository and sysout files consume 558 MB.

Set lfs to manage pdf and sysout files:

> git lfs migrate import --include-ref=master --include="*.pdf,*.sysout"
migrate: Sorting commits: ..., done.
migrate: Rewriting commits: 100% (313/313), done.
master dbe239de240004a746275af5ef0862a9adfc1cbf -> a68484e2650b65c6f7c6f73f425abc573028b40b
daily-210326 78d53039c540fc18c82165e899fbb4d9f68a319e -> df1b4956c65bfbdf85e58a76820dec563ac2a05c
nightly-210329 78d53039c540fc18c82165e899fbb4d9f68a319e -> df1b4956c65bfbdf85e58a76820dec563ac2a05c
nightly-210401 3e64317db5cbb39af8efd3eeac82bd08cb28f093 -> d2e94bd29c378e01642402a04505d57e84b8edee
nightly-210424 21c8759084459e4e47a9b859e077a48b2beac02a -> 98c8cc38d43b1e74f9adc3a0c2248de935f7e871
nightly-210428 f0ad3c5f6020598c5242a074c390657f76dd210e -> 2d0d375a2b633b0d6e78712fec31072f01fa898b
nightly-210502 0a5ff043937f9d0ecc5ac8cd5bd6786b67f582b9 -> f1e2a22ab19e9b242f56e8038732ac4bd92cabaf
nightly-210506 2cf33cebcfe6c99effd040aa22fe7f1ee0ac873d -> 8d32339cae35309524945bdac02820aeec2caf95
nitely-210330 78d53039c540fc18c82165e899fbb4d9f68a319e -> df1b4956c65bfbdf85e58a76820dec563ac2a05c
v0.14 3c33ba0b7e9f53135bb5848fd0470cb5db685cff -> ba2e6fac11fa27674cf90f0645b999fe0085c5e2
v3.5.1.13 99f28008dcb13462861b82e6df5957b890af8ae9 -> d381d663278117d3a09e44e13c8ecda4348dd6c6
v3.5.2-pre-alpha-test 511a73fd15fb80ab8b0cb18ad7866e9289c1a05d -> b88d0f5db6df0e61321f1396b4106192602c9ac7
migrate: Updating refs: ..., done.
migrate: checkout: ..., done.

Re-run migrate info to see what's changed:

> git lfs migrate info --include-ref=master
migrate: Sorting commits: ..., done.
migrate: Examining commits: 100% (313/313), done.
*.~2~ 26 MB 465/466 files(s) 100%
*.LCOM 21 MB 999/1002 files(s) 100%
*.venuesysout 20 MB 2/2 files(s) 100%
*.hash 17 MB 8/8 files(s) 100%
*.txt 15 MB 108/109 files(s) 99%

After running

To test storing the repository to GitHub I created a new remote repository,

The results were a remote repository of 234 MB and 1.74 GB of lfs storage. All of that was pretty painless; the only issue I ran into was the 1 GB individual lfs limit on GitHub. Any usage above that starts incurring costs. Since I went significantly over (>150%), I was locked out of lfs. Given this is just a test, I didn't see a need to keep the repository around and pay for access. I'll put up a version without the pdf files so I can experiment with the sysouts. I want to see how the system performs with lfs installed and without it.
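
To compare the local repository before and after an experiment like this, one option is to read the object-store size that git itself reports. This is a rough sketch using `git count-objects`; it only measures the local object store (loose plus packed objects, in KiB), not the remote or lfs storage figures quoted above, and the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print the size of the local git object store in KiB.
# Sums the "size" (loose objects) and "size-pack" (packfiles) fields
# reported by `git count-objects -v`. The function name is hypothetical.
repo_size_kib() {
  git count-objects -v |
    awk -F': ' '$1 == "size" || $1 == "size-pack" { total += $2 }
                END { print total + 0 }'
}

# Example usage:
#   before=$(repo_size_kib)
#   ... run the migration or history rewrite, then repack ...
#   after=$(repo_size_kib)
#   echo "object store went from ${before} KiB to ${after} KiB"
```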
-
Those are two precious sysouts that should never be overwritten, which is why they are not named with the same .sysout extension. They will never change, so I don't think it's a problem to have them directly in the git repo.
On Jul 19, 2021, at 8:08 PM, Bill Stumbo wrote:
Ok, since lfs migrate info didn't show anything matching uppercase, if there are any, they currently aren't using much space. But no harm in adding them to the .gitattributes files to cover ourselves going forward.
What about *.venuesysout?
-rw-r--r-- 1 wstumbo wstumbo 11501056 Jul 18 22:22 full.venuesysout
-rw-r--r-- 1 wstumbo wstumbo 8862208 Jul 18 22:22 lisp.venuesysout
Both of these files are in the loadups directory. A poorly named sysout? An artifact of value, an experiment, or something else?
We may also want to consider setting a minimum size for objects we store in lfs. The import command supports the --above option to set a minimum size.
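
The .gitattributes idea in the quoted email can be sketched as follows. The attribute line is the one `git lfs track` writes for a pattern; everything else here (the function names and the example paths) is hypothetical. The check uses `git check-attr`, which is core git, so it shows which paths a pattern set would route to lfs even on a machine without git-lfs installed.

```shell
#!/usr/bin/env bash
# Sketch: append lfs tracking rules for the given patterns to .gitattributes,
# using the same attribute line that `git lfs track` writes.
# Function names are illustrative, not part of git or git-lfs.
write_lfs_attributes() {
  local pattern
  for pattern in "$@"; do
    printf '%s filter=lfs diff=lfs merge=lfs -text\n' "$pattern"
  done >> .gitattributes
}

# Print the filter a path would get: "lfs" if tracked, "unspecified" if not.
lfs_filter_for() {
  git check-attr filter -- "$1" | sed 's/.*: filter: //'
}

# Example usage, covering lower- and uppercase extensions:
#   write_lfs_attributes '*.pdf' '*.PDF' '*.sysout' '*.SYSOUT'
#   lfs_filter_for loadups/full.venuesysout
```

With these patterns the two .venuesysout files are not matched, which is consistent with keeping them directly in the git repo as suggested in the reply above.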
-
moved to issue #495
There are several problems with the way the project has misused, and continues to misuse, GitHub.
Fixing these requires more Git expertise than I have, and concerted effort. Let me assure you that each of these has a longer explanation than given here.
FOO and FOO.~1~ in the repo. Doing this enables some workflows that are important, now, to be able to pluck out a previous definition with a simple `GETDEF(FOO FILE;3)`. There might be some way of doing the same with Git, but that doesn't matter unless we hook into the Git API.

foo and foo.~3~ then foo with no version is really version 4. Lisp instead makes an explicit hard link between foo.~4~ and foo, which is fine. Except Git knows nothing about hard links and treats them as two separate files, doubling the space (and Git hashes file names with contents so they don't help). A simple fix would be to write a quick script that scans through looking for foo and foo.~nn~ with the same size and content, then replaces one with a hard link to the other. (This also would remove the annoyance of having an extra version pop up.)

Please use these numbers in your reply. There's a separate discussion category for 'Configuration and Startup' even though there's some overlap. And please use the 'view it on github' link from your notification email rather than replying via email, at least until I get some response from the GitHub community forum.
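
The "quick script" suggested above for finding foo and foo.~nn~ pairs with identical content could start out as a report-only scan like the sketch below. It is an illustration under assumptions: the function name is invented, it only prints the matching versioned files and changes nothing, and it skips anything under .git.

```shell
#!/usr/bin/env bash
# Sketch: print every versioned file NAME.~nn~ whose bytes are identical
# to the unversioned NAME next to it. Report only; nothing is modified.
# The function name is hypothetical.
find_duplicate_versions() {
  local root="${1:-.}" versioned base
  find "$root" -type f -name '*.~[0-9]*~' ! -path '*/.git/*' -print |
    sort |
    while IFS= read -r versioned; do
      base="${versioned%.~*~}"        # strip the trailing ".~nn~"
      if [ -f "$base" ] && cmp -s "$base" "$versioned"; then
        printf '%s\n' "$versioned"
      fi
    done
}

# Example usage:
#   find_duplicate_versions . > duplicates.txt
# A follow-up step could replace each reported file with a hard link,
# for example: ln -f "$base" "$versioned"
```

Keeping the scan separate from the linking step makes it possible to review the list before any file is replaced.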