tl;dr: Objects can have multiple file names, and the "base script" answer doesn't find the second and later names. Start with git-filter-repo --analyze
and work from there.
I think the implication is that one must purge, then check, then purge, then check... until all the undesired files are gone. I am not certain because I haven't done the purge yet.
---
I inherited a code base that had a lot of issues. One was that things had been checked in that shouldn't have been; another was that the repo size was approaching the size limit of our GitHub-like host, so the repo would shortly no longer be usable.
In an effort to clean up the repo I used the Base Script from this answer https://stackoverflow.com/a/42544963/1352761 (elsewhere on this question) to generate a list of files-in-the-repo sorted by size.
I came across this idiom (don't have a source now) git log --diff-filter=D --summary
and as a double check ran it, and then compared the output to the above. I wanted to be sure that I didn't miss purging any files on the first go, because I am not looking to do this multiple times.
Lo and behold. A file of the shouldn't-have-been-checked-in variety was present in the "deleted files" summary but not present in the "base script" summary. How can that be? To verify the file, I checked out the commit that deleted it and verified the file's presence on disk. So it is still in the repo, but why doesn't the "base script" version find it?
Doing a lot of digging didn't turn up any solutions. Most or all of the "find big files" scripts are based on git rev-list --objects --all
which simply... did not report the mystery file. Several tools built on git verify-pack -v .git/objects/pack/pack-*.idx
also didn't return anything useful.
Finally I gave up, and moved on to having a look at filter-repo https://github.com/newren/git-filter-repo which is going to be the purging tool. One git-filter-repo --analyze
to get started and there it is:
blob-shas-and-paths.txt: 8d65390d2b76d34a8970c83ce02288a93280ba01 5315 1459 [build_dir/qtvars_x64_Debug.props, build_dir/temp/qtvars_x64_Debug.props]
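To pull just the multi-name blobs out of that report, an awk filter along these lines should do. This is a sketch, not something taken from the filter-repo docs: it assumes multi-name entries are written as a bracketed, comma-separated list exactly like the line above, the function name multi_name_blobs is made up for illustration, and a path that itself contains ", " would be split incorrectly.

```shell
# Print "sha<TAB>path" for every blob that the analysis report lists under
# more than one path. Assumes multi-name entries look like "[path1, path2]".
multi_name_blobs() {
    # $1: path to blob-shas-and-paths.txt
    awk '
    {
        start = index($0, "[")
        # Skip header lines and single-name entries (no bracketed list).
        if (start == 0 || substr($0, length($0)) != "]") next
        list = substr($0, start + 1, length($0) - start - 1)
        n = split(list, paths, ", ")
        if (n < 2) next
        # $1 here is the first field of the report line, i.e. the blob SHA.
        for (i = 1; i <= n; i++) printf "%s\t%s\n", $1, paths[i]
    }' "$1"
}
```

Usage would be something like multi_name_blobs .git/filter-repo/analysis/blob-shas-and-paths.txt, where that directory is my assumption about where --analyze writes its reports; adjust it to wherever the file ended up.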
To be fair, the git rev-list
documentation for the --object-names
option https://git-scm.com/docs/git-rev-list#Documentation/git-rev-list.txt---object-names does say "and if an object would appear multiple times with different names, only one name is shown.", so the error here is mine, I guess, except --object-names
wasn't in use in the "base script". The documentation does not tell you how to find the other names, which is frustrating.
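One way to recover the other names without filter-repo is to walk every commit's tree and collect each path that points at the blob. The following is a rough sketch using only git rev-list --all and git ls-tree -r; the two function names are hypothetical, and it will be slow on a large history because it lists the full tree of every commit.

```shell
# Emit the full tree listing of every commit reachable from any ref.
# Each line looks like: <mode> <type> <sha><TAB><path>
all_tree_entries() {
    git rev-list --all |
    while read -r commit; do
        git ls-tree -r "$commit"
    done
}

# Read ls-tree lines on stdin and print each distinct path whose object
# SHA equals $1 (the full blob SHA).
filter_paths_for_blob() {
    awk -F '\t' -v sha="$1" '
    {
        split($1, meta, " ")          # meta[3] is the object SHA
        if (meta[3] == sha) print $2  # $2 is the path after the tab
    }' | sort -u
}
```

Run inside the repo as all_tree_entries | filter_paths_for_blob 8d65390d2b76d34a8970c83ce02288a93280ba01. Paths with unusual characters may come back quoted by ls-tree.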
It turns out that my repo has 178 objects with multiple names; one of them has 18 names. I am assuming that purging is done by pathname and that git-filter-repo
will not remove the blob until all pathnames referencing it are purged. That means I'm in for 18 cycles of purge, check repo health, purge... unless git-filter-repo
has some tricks up its sleeve.
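For what it's worth, the git-filter-repo manual lists a --strip-blobs-with-ids option that takes a file of object IDs and strips those blobs from history whatever they are named, which would sidestep the per-path cycles if it behaves as documented. Below is an untested sketch of how that might be driven from the analysis report. The function names and file names are made up, the parsing assumes the report layout shown above (sha, two sizes, then a path or a bracketed list), and the rewrite itself is destructive, so try it on a fresh clone.

```shell
# Print the SHA of every blob that the analysis report lists under the
# exact path $2. $1 is the path to blob-shas-and-paths.txt.
blob_ids_for_path() {
    awk -v want="$2" '
    NF >= 4 && $1 ~ /^[0-9a-f]+$/ {
        names = $0
        # Drop the sha and the two size columns, keeping only the name(s).
        for (k = 1; k <= 3; k++) {
            sub(/^[ \t]+/, "", names)
            sub(/^[^ \t]+/, "", names)
        }
        sub(/^[ \t]+/, "", names)
        if (substr(names, 1, 1) == "[" && substr(names, length(names)) == "]") {
            names = substr(names, 2, length(names) - 2)
            n = split(names, p, ", ")
        } else {
            n = 1
            p[1] = names
        }
        for (i = 1; i <= n; i++)
            if (p[i] == want) { print $1; break }
    }' "$1"
}

# Rewrite history, removing every blob whose SHA is listed in file $1.
# Destructive: run only in a fresh clone.
strip_blobs() {
    git filter-repo --strip-blobs-with-ids "$1"
}
```

The idea would be to collect IDs with blob_ids_for_path blob-shas-and-paths.txt build_dir/temp/qtvars_x64_Debug.props >> unwanted-ids.txt for each unwanted file, then call strip_blobs unwanted-ids.txt once and re-run the analysis to check the result.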