GitHub issues #495 #102

masinter · 2020-12-22T07:06:56Z

masinter
Dec 22, 2020
Maintainer

moved to issue #495

There are several problems with the way the project has misused, and continues to misuse, GitHub.
Fixing these requires more Git expertise than I have, and concerted effort. Let me assure you that each of these has a longer explanation than given here.

1. Storing old versions as separate files in the repo. This refers to seeing FOO and FOO.~1~ in the repo. Doing this enables some workflows that are important now, such as being able to pluck out a previous definition with a simple `GETDEF(FOO FILE;3)`. There might be some way of doing the same with Git, but that doesn't matter unless we hook into the Git API.
2. This is a minor artifact of 1, but it has an easier fix. The problem is that the VM doesn't exactly follow Emacs' way of numbering versions. In Emacs, if you see foo and foo.~3~, then foo with no version is really version 4. Lisp instead makes an explicit hard link between foo.~4~ and foo, which is fine, except that Git knows nothing about hard links and treats them as two separate files, doubling the space (and Git hashes file names with contents so they don't help). A simple fix would be to write a quick script that scans through looking for foo and foo.~nn~ with the same size and content, then replaces one with a hard link to the other. (This would also remove the annoyance of having an extra version pop up.)
3. We're storing derived files that should be rebuilt -- compiled files (John had code to batch compile everything), sysouts, whereis.hash, exports.all. There are good reasons for each of these, but it should be a goal that someone with no experience could rebuild from source. It's the only way. That's separate from releases (which we're not using now).
4. I didn't understand the Git model (I still don't, but I've learned a few things): Move a bunch of files in. Add and commit. Then move them out. git rm and then commit that. Then move them back in, etc. Multiple copies.
5. A diff that worked for Lisp. And things like that: shellcommand git commands that work with Lisp file names.
6. LFS of sysouts might help a little, but 1 and 2 and 3 and 4

Please use these numbers in your reply. There's a separate discussion category for 'Configuration and Startup', even though there's some overlap. And please use the 'view it on GitHub' link from your notification email rather than replying via email, at least until I get some response from the GitHub community forum.
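
For item 2, the scan for duplicated versions might look something like the following Python sketch. It only reports pairs whose bytes are identical and does not replace anything with a hard link. The function name `find_duplicate_versions` and the decision to skip `.git` are illustrative assumptions, not existing project code.

```python
import filecmp
import os
import re

# Matches versioned names such as "FOO.~12~" (base name plus a numeric version).
VERSIONED = re.compile(r"^(?P<base>.+)\.~(?P<version>\d+)~$")


def find_duplicate_versions(root):
    """Return (unversioned_path, versioned_path) pairs with identical contents."""
    pairs = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: Git's own storage should never be scanned.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        names = set(filenames)
        for name in sorted(names):
            match = VERSIONED.match(name)
            if not match or match.group("base") not in names:
                continue
            plain = os.path.join(dirpath, match.group("base"))
            versioned = os.path.join(dirpath, name)
            # Cheap size check first, then a byte-by-byte comparison.
            if (os.path.getsize(plain) == os.path.getsize(versioned)
                    and filecmp.cmp(plain, versioned, shallow=False)):
                pairs.append((plain, versioned))
    return pairs
```

Calling `find_duplicate_versions(".")` from the top of a working copy would list candidates to review. Whether to hard-link them or drop one copy is a separate decision.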

stumbo · 2021-05-14T04:43:48Z

stumbo
May 14, 2021
Collaborator

Medley Repo Cleanup - Git Bloat

Problem

Early construction of the Interlisp Medley repository involved checking in scanned PDFs and other binary files that were later deleted as the current structure of the repository evolved.

This has led to the git repository's history becoming bloated. At present, the repository is just over 1 GB in size.

The goal of this activity is to clean up the repository by removing unneeded files from the git history and, in turn, to trim the repository down to a more manageable size.

Rebase

An initial experiment was to rebase the complete repository, smashing together the commits that add files with the commits that delete them. The goal was that, by bringing the commits together, we would end up removing the files from git.

This plan does not work. Rebasing, moving, and smashing commits doesn't remove the files from git's object store. They are still there, taking up room. We need a different approach to excise the files from git's object library.

Nuclear Option

This option produces the best possible outcome, removing all wasted space and creating a minimal-sized repository.

By deleting the .git directory and recreating the git repository we can create a benchmark - albeit a benchmark that lacks any history.

The following steps from Stack Overflow illustrate the process:

git log > original.log
rm -rf .git
git init
git add .
git commit -F original.log

At the conclusion of this exercise the repository is approximately 40 MB. This tells us the best achievable outcome would be close to 40 MB.
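
To compare this benchmark with later experiments, the on-disk size of the `.git` directory can be measured the same way each time. The following is a minimal Python sketch; the helper name `directory_size_bytes` is hypothetical, and it simply sums file sizes, so it will differ somewhat from what GitHub reports.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files below path (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


if __name__ == "__main__":
    # Report the size of the local object store in megabytes.
    print(f"{directory_size_bytes('.git') / 1e6:.1f} MB")
```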

git filter repo

A Python tool, git filter repo, is available that automates rewriting and cleaning up git repos. One function this tool provides is the ability to select and remove dead directories from the git object store. The tool, in turn, updates the git history to account for the changes. The result is that the general history remains, but files that are no longer needed or used are wiped from it.

One of the biggest consumers of space in the medley repository is the docs/irm.pdf directory. The files in this directory were checked in on 2020-08-29 as part of a larger check-in of 154 files. On 2020-09-14 the files were removed. Git, however, is structured so that if we reset our view of the repository back to the August 29th check-in, we'll see the structure that existed at that time, no matter that a short time later we decided the check-in was a mistake and the files were not needed. The upshot is that these files are part of the history and remain there, consuming space but adding no value.

To fix this problem correctly, we need to (1) remove the files and (2) rewrite history so that if we reset our view back to August 29th we don't end up with a corrupted branch with missing files.

Git has the ability to do this. Git Filter Repo simplifies this work helping ensure the repository remains consistent.

Removing the directory is a large first step in cleaning up the repository.

First, some analysis to better understand what the object library holds:

medley(master) $ git_filter_repo.py --analyze

Processed 14640 blob sizes
Processed 317 commits
Writing reports to .git/filter-repo/analysis...done.

The results show:

medley(master) $ cat .git/filter-repo/analysis/directories-deleted-sizes.txt

=== Deleted directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
  1308953054  989315852 2020-09-13 docs/irm.pdf
   346514872  258132318 2020-09-13 docs/irm.pdf/Lisp Library
   338076414  254058427 2020-09-13 docs/irm.pdf/Release Notes
   203767197  156106554 2020-09-13 docs/irm.pdf/Interlisp Volume 2 Environment
   184924178  140801565 2020-09-13 docs/irm.pdf/Release Notes/Lyric Medley
   180824781  140011765 2020-09-13 docs/irm.pdf/Interlisp Volume 3 Input-Output
   105651331   83461300 2020-09-13 docs/irm.pdf/Interlisp Volume 1 Language
    90241594   61553633 2020-09-13 docs/irm.pdf/Lisp Users
    57627138   43418959 2020-09-13 docs/irm.pdf/Lisp Library/Tedit
    55111927   39862533 2020-09-13 docs/irm.pdf/Lisp Library/Sketch
   174407044   38353303 2020-11-15 lispcore
...

The deleted docs/irm.pdf directory is large; even compressed, it accounts for a large part of the space consumed by the repository.

NOTE: The unpacked and packed size values can be slightly misleading and only provide a rough estimate of space used. The analyze option generates a README file in the analysis directory that provides additional details.
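
The report above can also be summarized with a short script. The following Python sketch assumes the four-column layout shown in the listing (unpacked size, packed size, date deleted, directory name) and assumes that a parent row already includes its subdirectories, which the numbers above suggest but which has not been verified against the tool's README. The names `DeletedDir`, `parse_deleted_dirs`, and `top_level_only` are hypothetical.

```python
from typing import NamedTuple


class DeletedDir(NamedTuple):
    unpacked: int   # bytes, uncompressed
    packed: int     # bytes, as stored in the pack
    deleted: str    # date the directory was deleted
    name: str       # directory path (may contain spaces)


def parse_deleted_dirs(text):
    """Parse data rows of the report; header and '...' lines are ignored."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[0].isdigit() and parts[1].isdigit():
            rows.append(DeletedDir(int(parts[0]), int(parts[1]),
                                   parts[2], parts[3].strip()))
    return rows


def top_level_only(rows):
    """Drop rows nested under another listed directory to avoid double counting."""
    names = [row.name for row in rows]
    return [row for row in rows
            if not any(row.name.startswith(other + "/")
                       for other in names if other != row.name)]
```

Summing `packed` over `top_level_only(parse_deleted_dirs(report_text))` gives a rough estimate of the space that deleted directories still occupy, subject to the caveat in the note above.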

Removing the docs/irm.pdf directory will have a significant effect on repository size. Removing the directory can be accomplished by:

medley(master) $ git_filter_repo.py --path docs/irm.pdf --invert-paths

Parsed 317 commits
New history written in 0.40 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at d08ce2e See PR #275 for discusssion
Enumerating objects: 14510, done.
Counting objects: 100% (14510/14510), done.
Delta compression using up to 8 threads
Compressing objects: 100% (8825/8825), done.
Writing objects: 100% (14510/14510), done.
Total 14510 (delta 5528), reused 14507 (delta 5525)
Completely finished after 1.19 seconds.

This operation does two things. It removes the docs/irm.pdf directory and all its subdirectories and their contents. It then parses the git history, removing all references to the files that have been removed.

This can be seen by reviewing the git log:

Before	After
`b739c0b` move (huge) Interlisp manual files to history repo	---
`40bf2ea` Update run-medley	`16e888a` Update run-medley
`e4a5840` Update run-medley	`507f4dc` Update run-medley
`1357d69` Update README.md	`e2f013c` Update README.md
`b8234b1` Create README.md	`9fa719e` Create README.md
`80c1d59` Created READMEmd	`cca8440` Created READMEmd
`7807a81` Create README.md	`45bd052` Create README.md
`e3ae191` Update README.md	`8414109` Update README.md
`cfbf5df` Update README.md	`aa9336e` Update README.md
`ad0eaca` tweaks to match runtime	`2843f5c` tweaks to match runtime
`8519258` initial checkin for sources	`ddaffda` initial checkin for sources
`cb46b0b` initial checkin for library	`f7316f3` initial checkin for library
`d6580ff` initial checkin for library	`ae8b5c8` initial checkin for library
`b58c88b` initial checkin for lispusers	`586fc4d` initial checkin for lispusers
`feaf0a5` initial checkin for misc files	`f76d961` initial checkin for misc files
`758c289` intial checkin some useful sysouts	`78d7bef` intial checkin some useful sysouts
`32bd326` initial checkin docs	`5bab28c` initial checkin docs
`c9afda1` initial checkin fonts	`f781373` initial checkin fonts
`e820723` Update README.md	`0b20760` Update README.md
`511a73f` (tag: v3.5.2-pre-alpha-test) Update LICENSE	`4d3959b` (tag: v3.5.2-pre-alpha-test) Update LICENSE
`7f57df7` Create LICENSE	`522bc69` Create LICENSE
`44e05b2` Update README.md	`a297818` Initial commit

The git history has been rewritten, so the hash of each commit has changed. More importantly, the first commit listed (the newest commit) exists only in the before log; it has been removed in the after log. If we were to examine the commit where the documents were checked in to docs/irm.pdf, we would find the commit edited and all those files removed. Since the commit contains other files that are still available, the commit remains.

History has been rewritten, omitting any mention of the docs/irm.pdf related files. This means we could check out the 'initial checkin docs' commit (now 5bab28c) and have a stable view of the master branch, albeit missing the documents that were in the docs/irm.pdf directory.

This process can be continued, removing other unused directories. In addition, large standalone files that are no longer needed can also be removed. The same process would apply: the files would be removed from the git repository and history rewritten to exclude them.

Using this approach and removing just the docs/irm.pdf directory reduces the footprint of the medley repo to approximately 100 MB. Removing additional unused directories and files will increase the savings. It's unclear how close we could come to 40 MB. The advantage of this approach is that the repository history remains intact, though slightly edited.
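
If several directories are removed in one pass, so that history is rewritten only once, the command line could be assembled as in the following Python sketch. It assumes the tool is invoked as `git_filter_repo.py`, as in the transcripts above, and that repeated `--path` options may be combined with a single `--invert-paths`. The function name `filter_repo_command` is hypothetical, and `lispcore` is used only as an example taken from the deleted-directories report, not as a recommendation to remove it.

```python
import shlex


def filter_repo_command(paths, tool="git_filter_repo.py"):
    """Build the argument list that removes every given path from history."""
    if not paths:
        raise ValueError("at least one path is required")
    args = [tool]
    for path in paths:
        args += ["--path", path]
    # A single --invert-paths turns "keep these paths" into "remove these paths".
    args.append("--invert-paths")
    return args


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(shlex.join(filter_repo_command(["docs/irm.pdf", "lispcore"])))
```

Printing the command rather than executing it keeps the history rewrite a deliberate, manual step.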

2 replies

nbriggs May 14, 2021
Maintainer

I vote for filtering out the documentation files that we subsequently decided should be in a different repo. I'm not sure that we need to worry about getting from 100MB down to 40MB.

masinter May 14, 2021
Maintainer Author

I agree -- 100MB is wonderful improvement; the problem with our "keep old versions committed separately" is more about other factors than github size.

masinter · 2021-07-18T03:42:51Z

masinter
Jul 18, 2021
Maintainer Author

(Moving the discussion back to GitHub from lispcore email.)

I was assuming the use of git lfs migrate including *.PDF* *.sysout. I know this also rewrites history. I’m a little less wary of doing that with a supported package like LFS.

Second, it will carry forward to sysouts as well as scanned image PDFs from older scanners.
(and motivating the questions about when we do loadups to make new sysouts, whether badly compressed PDFs go in the medley repo vs interlisp.org/docs)
One stone, three birds.

I’m not sure how seriously it impacts development workflows that there are older machines that can’t run git lfs. @nbriggs ?

2 replies

stumbo Jul 18, 2021
Collaborator

I'll go ahead and install git lfs and do some experimenting with it. It'll be interesting to see what happens to the repo size when I move all the PDF and sysouts to lfs. I'll report the results here.

Once we're satisfied, we still need to address the question of when to do the changeover. It is a breaking change: force pushing the updated repo will invalidate every cloned version of Medley. Given there are currently only 10 forked copies, the impact is probably manageable.

A key milestone or accomplishment might present a good time to make the change.

nbriggs Jul 18, 2021
Maintainer

When you've made an "lfs" repo, have a look and see what happens when you look at it with git without the lfs extension installed -- that will be the situation I am in... since the lfs developers, for no apparent reason, decided to require Go 1.16 rather than 1.15, and 1.16 is not compatible with my non-upgradeable macOS.

masinter · 2021-07-18T03:55:07Z

masinter
Jul 18, 2021
Maintainer Author

@stumbo wrote earlier

In the 7 June meeting, Larry mentioned I had proposed a second option for resolving space issues on GitHub. Since I had to drop off early for a work meeting, I thought I'd send an email and hopefully spur on the discussion.

The initial conversation is documented at: #102

The short synopsis is that there is a lot of bloat in the repo, driven primarily by the addition of a collection of pdf documents that were later moved out of the repository. And, as Larry has also mentioned, saving old tilde versions of files into git has also led to some clutter and wasted space.

We've discussed two potential options. Option one is to 'surgically' remove unneeded content from git storage and remove any mention of it from git's history. We have tested this approach with the docs/irm.pdf directory. Using a third party python package I was able to remove all the files in this directory, its subdirectories and rewrite git history so there is no knowledge of them within the git Medley repository. The discussion mentioned above details the package and steps in greater detail.

This action by itself significantly reduces the size of the Medley repository.

There are a couple of drawbacks to this approach. First, it rewrites git history, meaning that anyone who has forked our repository is going to suddenly find their repository no longer matches ours. Essentially, after running the operation and updating a local git repository, we need to do a force push to reset the GitHub version of the repo to match the revised local one. Anyone who uses the repository is going to need to update their local version and, depending on how much work they have done locally, may need to do more than stashing and popping from the stash to be resynced with the repo.

Given it's a potentially painful exercise, the next concern is minimizing the number of times we rewrite global history. It would be best to force push a new version of the repo only once. Which leads to the question, how much scrubbing should be done? When have we removed the right amount of cruft? And, if we are removing stuff and rewriting history, is there a point where we accidentally break something? What is our level of confidence that the toolset we use won't make a mistake, and what is our ability to identify an error before committing the changes?

Option two is much simpler. Decide on a point in time to start clean. Archive the existing repository. Delete the existing repository from GitHub and create a new git repository using the current source code base. This way, we're guaranteed the only files in the repository are what we deemed worthy of being under source control at time zero. There is no need to hunt down old files or worry about corrupting history. Of course, the downside is that there is no history. History, if needed, could be extracted from the archived repository.

This approach has the same issue that the first approach does - we're replacing the repository on GitHub and anyone that has forked a copy or pulled from our repo would suddenly find themselves out of sync with the existing repository. Again, there would likely be some effort required to resync with the repo.

On the positive side, with this approach, the repository size is going to start at its smallest possible size. Another positive, option two is much simpler to implement. There is no risk of creating a version of master that somehow has a corrupted history.

A couple questions to consider as we determine the path forward:

What is the value of maintaining an uninterrupted history?
How much of the early work, grappling with the available content and making decisions on where things should go is worth keeping?

Thoughts? Questions?

My thought is that using "option two" (just start clean) could be done using something else for anyone who wants ready access to the "archive" -- Dropbox, Google Drive, etc.

0 replies

masinter · 2021-07-18T17:45:06Z

masinter
Jul 18, 2021
Maintainer Author

email from @stumbo 5/23

Instructions for removing unused directories from a git repository:

Prerequisite: git-filter-repo installed. see:
https://github.com/newren/git-filter-repo/blob/main/INSTALL.md

1. Do a pull to update your local copy of the repository to match GitHub:
   git pull
2. Optional: Analyze the current repository to identify unused directories and files:
   git_filter_repo.py --analyze
   The results will be in .git/filter-repo/analysis. The directories-deleted-sizes.txt file is the one I find most useful for identifying directories to remove.
3. Remove an unneeded directory:
   git_filter_repo.py --path docs/irm.pdf --invert-paths
   This removes the docs/irm.pdf directory and rewrites the git history to exclude any of the files in the initial commit.
4. Restore the remote. One of the side effects of git_filter_repo when removing files or directories from a repo is to remove the remote repository information. This makes it a conscious activity to overwrite the existing repo.
   git remote add origin git@github.com:Interlisp/medley.git
5. Update the repository with the updated copy of master:
   git push --set-upstream --force origin master

Let me know if you have any questions.

0 replies

stumbo · 2021-07-19T22:50:47Z

stumbo
Jul 19, 2021
Collaborator

Git Large File System

Initial Results

Testing was done using Ubuntu running in wsl2 on a Windows 10 system.

Install git-lfs:

> sudo apt-get install git-lfs

Set up and initialize git lfs:

> git lfs install

Run migrate info:

> git lfs migrate info --include-ref=master
migrate: Sorting commits: ..., done.
migrate: Examining commits: 100% (313/313), done.
*.pdf           1.3 GB   242/242 files(s)       100%
*.sysout        558 MB     56/56 files(s)       100%
*.~2~           26 MB    465/466 files(s)       100%
*.LCOM          21 MB   999/1002 files(s)       100%
*.venuesysout   20 MB        2/2 files(s)       100%

The results state that pdf files consume 1.3 GBs of the repository and sysout files consume 558 MBs.

Set lfs to manage pdf and sysout files:

> git lfs migrate import --include-ref=master --include="*.pdf,*.sysout"
migrate: Sorting commits: ..., done.
migrate: Rewriting commits: 100% (313/313), done.
  master                dbe239de240004a746275af5ef0862a9adfc1cbf -> a68484e2650b65c6f7c6f73f425abc573028b40b
  daily-210326          78d53039c540fc18c82165e899fbb4d9f68a319e -> df1b4956c65bfbdf85e58a76820dec563ac2a05c
  nightly-210329        78d53039c540fc18c82165e899fbb4d9f68a319e -> df1b4956c65bfbdf85e58a76820dec563ac2a05c
  nightly-210401        3e64317db5cbb39af8efd3eeac82bd08cb28f093 -> d2e94bd29c378e01642402a04505d57e84b8edee
  nightly-210424        21c8759084459e4e47a9b859e077a48b2beac02a -> 98c8cc38d43b1e74f9adc3a0c2248de935f7e871
  nightly-210428        f0ad3c5f6020598c5242a074c390657f76dd210e -> 2d0d375a2b633b0d6e78712fec31072f01fa898b
  nightly-210502        0a5ff043937f9d0ecc5ac8cd5bd6786b67f582b9 -> f1e2a22ab19e9b242f56e8038732ac4bd92cabaf
  nightly-210506        2cf33cebcfe6c99effd040aa22fe7f1ee0ac873d -> 8d32339cae35309524945bdac02820aeec2caf95
  nitely-210330         78d53039c540fc18c82165e899fbb4d9f68a319e -> df1b4956c65bfbdf85e58a76820dec563ac2a05c
  v0.14                 3c33ba0b7e9f53135bb5848fd0470cb5db685cff -> ba2e6fac11fa27674cf90f0645b999fe0085c5e2
  v3.5.1.13             99f28008dcb13462861b82e6df5957b890af8ae9 -> d381d663278117d3a09e44e13c8ecda4348dd6c6
  v3.5.2-pre-alpha-test 511a73fd15fb80ab8b0cb18ad7866e9289c1a05d -> b88d0f5db6df0e61321f1396b4106192602c9ac7
migrate: Updating refs: ..., done.
migrate: checkout: ..., done.

Re-run migrate info to see what's changed:

> git lfs migrate info --include-ref=master
migrate: Sorting commits: ..., done.
migrate: Examining commits: 100% (313/313), done.
*.~2~           26 MB    465/466 files(s)       100%
*.LCOM          21 MB   999/1002 files(s)       100%
*.venuesysout   20 MB        2/2 files(s)       100%
*.hash          17 MB        8/8 files(s)       100%
*.txt           15 MB    108/109 files(s)        99%

After running git lfs migrate import for pdf and sysout files, they no longer show in the migrate info report, meaning they are no longer stored in the standard remote git repository.

To test storing the repository on GitHub, I created a new remote repository, medleyTest, and pushed the lfs version to it.

The results were a remote repository of 234 MBytes and 1.74 GBytes of lfs storage.

All of that was pretty painless. The only issue I ran into was a 1 GB individual lfs limit on GitHub. Any usage above that starts incurring costs. Since I went significantly over (>150%), I was locked out of lfs. Given this is just a test, I didn't see a need to keep the repository around and pay for access. I'll put up a version without the pdf files so I can experiment with the sysouts. I want to see how the system performs with lfs installed and without it.

2 replies

masinter Jul 20, 2021
Maintainer Author

Neither Linux nor Git is case-insensitive. Try *.PDF and *.SYSOUT too.

stumbo Jul 20, 2021
Collaborator

Ok, since lfs migrate info didn't show anything matching uppercase, if there are any, they currently aren't using much space. But no harm in adding them to the .gitattributes files to cover ourselves going forward.

What about *.venuesysout?

-rw-r--r--  1 wstumbo wstumbo 11501056 Jul 18 22:22 full.venuesysout
-rw-r--r--  1 wstumbo wstumbo  8862208 Jul 18 22:22 lisp.venuesysout

Both of these files are in the loadups directory. A poorly named sysout? An artifact of value, an experiment, or something else?

We may also want to consider setting a minimum size for objects we store in lfs. The import command supports the --above option to set a minimum size.
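
To cover both lowercase and uppercase extensions going forward, the .gitattributes entries could be generated rather than typed by hand. The following Python sketch emits lines in the form that `git lfs track` writes (`filter=lfs diff=lfs merge=lfs -text`). The function name `lfs_attribute_lines` is hypothetical, and the extension list is only the one discussed in this thread.

```python
def lfs_attribute_lines(extensions):
    """Return .gitattributes lines routing each extension, in both cases, to LFS."""
    lines = []
    for ext in extensions:
        # Git matches patterns case-sensitively, so list both spellings.
        for variant in (ext.lower(), ext.upper()):
            line = f"*.{variant} filter=lfs diff=lfs merge=lfs -text"
            if line not in lines:
                lines.append(line)
    return lines


if __name__ == "__main__":
    # Print the entries so they can be reviewed before being added.
    print("\n".join(lfs_attribute_lines(["pdf", "sysout"])))
```

This leaves *.venuesysout out on purpose, since its handling is still an open question here.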

nbriggs · 2021-07-20T05:00:07Z

nbriggs
Jul 20, 2021
Maintainer

Those are two precious sysouts that should never be overwritten, which is why they are not named with the same .sysout extension. They will never change, so I don't think it's a problem to have them directly in the git repo.

0 replies
