An uncompressed-repack filter & diff visualizer for ZIP based files stored in git repos.
Most repos hosting ZIP based files should use ReZipDoc.
git does not like binary files. They make the repo grow quickly in size (see delta compression), and when you try to see what changed in a commit, you only get this:
Binary files A and B differ!
... not very useful!
ReZipDoc solves both of these issues, though only for ZIP based files, which includes for example FreeCAD and LibreOffice files.
NOTE It does not work for all binary files!
HINT If you are unsure whether a file format is ZIP based, try to look at it with software that can peek into ZIP files.
On Linux or OSX: unzip -l someFile.xyz
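If you would rather check this from a script, the following is a minimal sketch (not part of ReZipDoc) that looks for the ZIP local file header signature, the bytes PK\003\004, at the start of a file. The function name is_zip_based is made up for illustration, and the check is only a heuristic: empty or self-extracting archives start with different bytes and will be reported as not ZIP based.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with ReZipDoc.
# Succeeds (exit status 0) if the file starts with the ZIP
# local file header signature "PK\003\004" (hex 50 4b 03 04).
is_zip_based() {
    # Read the first 4 bytes and print them as plain hex digits
    magic=$(dd if="$1" bs=4 count=1 2>/dev/null | od -An -tx1 | tr -d '[:space:]')
    [ "$magic" = "504b0304" ]
}
```

For example: is_zip_based someFile.xyz && echo "looks ZIP based"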
So if you are storing ZIP based files in your git repo, you probably want to use ReZipDoc.
- Project state
- How to use
- Installation
- Filter repo history
- Culprits
- Motivation
- How it works
- Benefits
- Observations
- Based on
This repo contains a heavily revised, refined version of ReZip (and ZipDoc), plus unit tests and helper scripts, which were not available in the original.
If your git repo makes heavy use of ZIP based files, then you probably want to use ReZipDoc in one of these three ways:
- install ZipDoc diff viewer - This allows you to see changes within your ZIP based files in a human-readable way when looking at git history. It changes neither your past nor your future git history.
  To use this, install with --diff only.
- install ReZip filter - This will change your repo's future git history, storing ZIP based files without compression.
  To use this, install with --commit --diff --renormalize.
- install ReZip filter & filter repo - This changes both the past (<- Caution!) and future history of your repo.
  To use this, create a copy of the repo with filtered history.
The filter and diff tool require Java 8 or newer.
The helper scripts - which are mostly used for installing the filter - require a POSIX (~= Unix) environment. This is the case on OSX, Linux, BSD, Unix and even Windows, if git is installed.
The recommended procedure is to install the helper scripts once, and then use them to comfortably install the filter into local git repos.
NOTE
This downloads an online script and executes it on your machine, which is a potential security risk. You may want to inspect the script before running it.
NOTE
This has to be done once per developer machine.
The helper scripts get installed into ~/bin/, and if the directory did not exist before, it will get added to PATH.
To install:
curl --silent --location \
https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-scripts-tool.sh \
| sh -s install --path
To update (to latest development version):
curl --silent --location \
https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-scripts-tool.sh \
| sh -s update --dev
To remove:
curl --silent --location \
https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-scripts-tool.sh \
| sh -s remove
NOTE
This has to be done once per repo.
This installs the latest release of ReZipDoc into your local git repo.
Make sure you already have installed the helper scripts on your machine.
Switch to the local git repo you want to install this filter to, for example:
cd ~/src/myRepo/
As explained in How to use, you now want to use one of the following:
- Install the diff viewer
  rezipdoc-repo-tool.sh install --diff
- Install the filter
  rezipdoc-repo-tool.sh install --commit --renormalize
- Filter the history & install the filter
  If you filter the repo history, the freshly created, filtered repo will already have the filter installed as above.
To uninstall the diff viewer and/or filter, run:
rezipdoc-repo-tool.sh remove
Only use this manual procedure if, for some reason, you cannot use the helper scripts as described above.
- Build the JAR
  Run this in bash:
  cd
  mkdir -p src
  cd src
  git clone git@github.com:hoijui/ReZipDoc.git
  cd ReZipDoc
  mvn package
  echo "Created ReZipDoc binary:"
  ls -1 "$PWD"/target/rezipdoc-*.jar
- Install the JAR
  Store rezipdoc-*.jar somewhere locally, either:
  - (global) in your home directory, for example under ~/bin/
  - (repo - tracked) in your repository, tracked, for example under /tools/
  - (repo - local, recommended) in your repository, locally only, under /.git/
- Install the Filter(s)
  Execute these lines:
  # Install the add/commit filter
  git config --replace-all filter.reZip.clean "java -cp .git/rezipdoc-*.jar io.github.hoijui.rezipdoc.ReZip --uncompressed"
  # (optionally) Install the checkout filter
  git config --replace-all filter.reZip.smudge "java -cp .git/rezipdoc-*.jar io.github.hoijui.rezipdoc.ReZip --compressed"
  # (optionally) Install the diff filter
  git config --replace-all diff.zipDoc.textconv "java -cp .git/rezipdoc-*.jar io.github.hoijui.rezipdoc.ZipDoc"
- Enable the filters
  In one of these files:
  - (global) ${HOME}/.gitattributes
  - (repo - tracked) /.gitattributes
  - (repo - local, recommended) /.git/info/attributes
  Assign attributes to paths:
  # This forces git to treat files as if they were text-based (for example in diffs)
  [attr]textual diff merge text
  # This makes git re-zip ZIP files uncompressed on commit
  # NOTE See the ReZipDoc README for how to install the required git filter
  [attr]reZip textual filter=reZip
  # This makes git visualize ZIP files as uncompressed text with some meta info
  # NOTE See the ReZipDoc README for how to install the required git filter
  [attr]zipDoc textual diff=zipDoc
  # This combines in-history decompression and uncompressed view of ZIP files
  [attr]reZipDoc reZip zipDoc
  # MS Office
  *.docx reZipDoc
  *.xlsx reZipDoc
  *.pptx reZipDoc
  # OpenOffice
  *.odt reZipDoc
  *.ods reZipDoc
  *.odp reZipDoc
  # Misc
  *.mcdx reZipDoc
  *.slx reZipDoc
  # Archives
  *.zip reZipDoc
  # Java archives
  *.jar reZipDoc
  # FreeCAD files
  *.fcstd reZipDoc
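To check that git actually picks up the attributes for a given path, you can query them with git check-attr. The sketch below wraps that call; assigned_filter is a hypothetical helper name used only for illustration, not something ReZipDoc ships. It has to be run from inside the repo.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with ReZipDoc.
# Prints the name of the filter git would apply to the given path,
# or nothing if no filter attribute is assigned to it.
assigned_filter() {
    # `git check-attr` prints lines of the form "<path>: <attribute>: <value>";
    # the value is "unspecified" when the attribute is not set for the path.
    git check-attr filter -- "$1" \
        | sed -n 's/^.*: filter: //p' \
        | grep -v '^unspecified$'
}
```

For example, assigned_filter drawing.fcstd should print reZip once the attributes above are in place. The path does not need to exist for the query to work.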
This always creates a new copy of the repository.
NOTE
This only filters a single branch.
Make sure you have the helper scripts installed and in your PATH.
This filters the master branch of the repo at ~/src/myRepo into a new local repo ~/src/myRepo_filtered, using the original commit messages, authors and dates:
rezipdoc-history-filter.sh \
--source ~/src/myRepo \
--branch master \
--orig \
--target ~/src/myRepo_filtered
It also works with an online source:
rezipdoc-history-filter.sh \
--source "https://github.com/case06/ZACplus.git" \
--branch master \
--orig \
--target /tmp/ZACplus_filtered
After doing this, the new, filtered repo will already have the filter installed, so future commits will be filtered.
We are going to run a script that filters the repo of the Zinc-Oxide Open Hardware battery (ZAC+) project. The script has a header comment explaining what it does in detail.
In short, it downloads the ReZipDoc helper scripts to ~/bin, adds that dir to PATH if it is not there yet, creates temporary git repos in /tmp/, and generates some command-line output.
Run it like this:
curl --silent --location \
https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-sample-filter-session.sh \
| sh
As described in gitattributes, you may see unnecessary merge conflicts when you add attributes to a file that causes the repository format for that file to change. To prevent this, Git can be told to run a virtual check-out and check-in of all three stages of a file when resolving a three-way merge:
git config --add --bool merge.renormalize true
Many popular applications, such as Microsoft Office and Libre/Open Office, save their documents as XML in compressed ZIP containers. Small changes to these documents' contents may result in big changes to their compressed binary container file. When compressed files are stored in a Git repository, these big differences make delta compression inefficient or impossible, and the repository size is roughly the sum of its revisions.
This small program acts as a Git clean filter driver. It reads a ZIP file from stdin and outputs the same ZIP content to stdout, but without compression.
- human-readable/plain-text diffs of (ZIP based) archives (if they contain plain-text files)
- smaller overall repository size if the archive contents change frequently
- slower git add/git commit process
- slower checkout process, if the smudge filter is used
When adding/committing a ZIP based file, ReZip unpacks it and repacks it without compression, before adding it to the index/commit. In an uncompressed ZIP file, the archived files appear as-is in its content (together with some binary meta-info before each file). If those archived files are plain-text files, this method will play nicely with git.
The main benefit of ReZip over Zippey is that the actual file stored in the repository is still a ZIP file. Thus, in many cases, it will still work as-is with the respective application (for example Open Office), even if it is obtained without going through the smudge filter that re-packs it with compression, for example when downloading the file through a web interface instead of checking it out with git.
The following are based on my experience in real-world cases. Use at your own risk. Your mileage may vary.
- One packed repository with ReZip was 54% of the size of the packed repository storing compressed ZIPs.
- Another repository with 280 *.slx files and over 3000 commits was originally 281 MB and was reduced to 156 MB using this technique (55% of baseline).
I found that the loose objects stored without this filter were about 5% smaller than the original file size (zLib on top of zip compression). When using the ReZip filter, the loose objects were about 10% smaller than the original files, since zLib could work more efficiently on uncompressed data. The packed repository with ReZip was only 10% smaller than the packed repository storing compressed zips. I think this unremarkable efficiency improvement is due to a large number of *.png files in the presentation which were already stored without compression in the original *.pptx.
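To make a similar comparison on your own repos, for example between an original and a filtered copy, the packed size can be read from git count-objects. This is a minimal sketch under that assumption; packed_size_kib is a hypothetical helper name, and note that it runs git gc on the repo it is given, so that loose objects get packed before measuring.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with ReZipDoc.
# Packs the repo at the given path and prints the size of its
# pack files in KiB, as reported by `git count-objects -v`.
packed_size_kib() {
    git -C "$1" gc --quiet &&
        git -C "$1" count-objects -v | awk '/^size-pack:/ { print $2 }'
}
```

For example, run packed_size_kib ~/src/myRepo and packed_size_kib ~/src/myRepo_filtered and compare the two numbers.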
- ReZip - For more efficient Git packing of ZIP based files
- ZipDoc - A Git textconv program to show text-based diffs of ZIP files
- png-inflate - Does the same uncompressed repack for PNG image files