Save disk space and shorten download times by stripping large blobs of useless data from a repository.
On one of our projects, we intensively use the VCR gem to record and mock HTTP requests to external websites and APIs. These records are plain YAML files and act as a proxy cache to replay HTTP requests within our test suite. We add them to version control so they can also be used by our CI¹ system and speed up the feedback loop.
But when the external websites we request change, these records have to be updated, and new commits with new heavy data blobs enter our git repository (sometimes a few MB each). Due to the git database² structure, the .git folder tends to grow rapidly as we add more and more VCR records.
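Before any cleanup, it helps to see which blobs actually weigh the repository down. Here is a minimal sketch using standard git plumbing commands (run it from inside any clone; the paths listed will be whatever lives in your own history):

```shell
# List the ten largest blobs in the whole history, with their paths.
# Sizes are in bytes.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '/^blob/ { print $3, $4 }' |
  sort -rn |
  head -n 10
```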
Plus, keeping obsolete HTTP requests has very limited interest, as we rarely have to check out older commits in our workflow. So one possible way to shrink the git database is to use git-filter-branch and clean all the unwanted data out of the git history. This command is particularly handy when a password or any other sensitive value has been wrongly committed and pushed to a remote repository.
git-filter-branch is neither the fastest nor the most efficient way to filter and remove data, though. BFG Repo-Cleaner, on the other hand, proves much better at this specific task. Ok, I know… we’re talking about a Java tool here. But it does the job, pretty well and pretty fast!
Using BFG is quite simple: it works on a bare copy of the original repository, since BFG only needs the git database structure (the .git folder). To get one, clone your target repository with the --mirror option:
$ git clone --mirror git://github.com/levups/some-big-repo.git
This clones our remote repository into a local folder with a .git extension, which is the convention to distinguish bare repositories from normal ones. Then, you have to tell BFG which elements you want to filter. In our case, we want to delete all blobs bigger than 1 MB anywhere in the git history.
$ java -jar bfg.jar --strip-blobs-bigger-than 1M some-big-repo.git
After BFG has finished working, it reports that all commits selected by the filter rule (and their children) have been rewritten, and that the cleaning part is now our responsibility. The git garbage collector can be triggered inside the bare folder with these commands:
$ cd some-big-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
The main drawback of this operation is the complete rewriting of all modified commits. That is not a problem at the git level, but on a remote repository, which acts as the reference for all developers, changing history has to be prepared and discussed³ before pushing back these new commits. This operation changes all the hashes referenced in commit messages or pull-request discussions, so be aware of the potential confusion.
If the benefits of the change outweigh the cost of losing reference information on the remote repository, then all you have to do is push the bare repository back to the remote.
$ git push
All developer repositories then have to be updated to get the new references, and running garbage collection locally is also recommended to get rid of old orphaned commits. A simpler way is to delete the local folder and git clone again from the remote repository. You will then be able to check whether download time has improved after the cleanup.
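A quick way to compare sizes before and after is git’s object-count summary; size-pack is the figure that shrinks once the large blobs are gone:

```shell
# Show the on-disk size of the object database in human-readable units.
# "size-pack" covers the packfiles, which make up the bulk of .git.
git count-objects -vH
```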
Note that there’s no guarantee that GitHub will trigger its own garbage collector after those new elements are pushed. But who really cares? After trying this on a sample repository, the GitHub interface was still completely functional, and old commit/pull-request/issue references were still accessible, even though the underlying objects no longer exist. Is this temporary, or will it last as the repository evolves? Hard to tell: there’s not much documentation about this on the web, and only a few tickets about it on BFG Repo-Cleaner’s own repository.
We do not have enough feedback on these manipulations yet to jump blindly into such an operation on a production-level repository. Maybe time will tell…
Continuous Integration. ↩
You can refer to the git porcelain commands for more details. ↩
Being well organized is essential in this process. Informing contributors, freezing development, and merging all pending pull requests can help because, once the new elements are pushed, there is no way back other than force-pushing a backup. ↩