Information about performing a git-commit-ectomy correctly. https://pages.charlesreid1.com/git-commit-ectomy/

This page walks through a demonstration
git-commit-ectomy that you can perform
starting with an empty git repository.

It addresses the single-branch case.

Painting: The Agnew Clinic (Eakins 1889).





requirements

This guide utilizes GNU xargs.
You should run it on Linux,
or use Homebrew’s gxargs if
on a Mac.

consult with your doctor

You should consult with your doctor to determine if a
git-commit-ectomy is right for you.

This one-liner lists the 40 largest files in the repo
(modify the tail line to change the number):

$ 
git rev-list --all --objects | \
     sed -n $(git rev-list --objects --all | \
     cut -f1 -d' ' | \
     git cat-file --batch-check | \
     grep blob | \
     sort -n -k 3 | \
     \
     tail -n40 | \
     \
     while read hash type size; do
          echo -n "-e s/$hash/$size/p ";
     done) | \
     sort -n -r -k1 

When you’re ready to perform the surgery, append a cut command to get the
relative path to the file only, without listing the size of the file, which
is what we will need when we carry out the git-commit-ectomy:

$ 
git rev-list --all --objects | \
     sed -n $(git rev-list --objects --all | \
     cut -f1 -d' ' | \
     git cat-file --batch-check | \
     grep blob | \
     sort -n -k 3 | \
     \
     tail -n40 | \
     \
     while read hash type size; do
          echo -n "-e s/$hash/$size/p ";
     done) | \
     sort -n -r -k1 | \
     cut -f 2 -d' ' 
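
On reasonably recent versions of git, a similar report can be produced in a single pass by asking git cat-file to print each object's type, size, and path together. This is a minimal sketch rather than the one-liner used in this guide; list_largest_blobs is a hypothetical helper name, not part of git.

```shell
# Sketch: print the N largest blobs reachable from any ref as "<size-in-bytes> <path>".
# list_largest_blobs is a hypothetical helper name (an assumption, not a git command).
list_largest_blobs() {
    n="${1:-40}"    # number of files to list, 40 by default
    git rev-list --objects --all |
        # %(rest) echoes back whatever followed the object hash on the input line (the path)
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        # keep blobs only, then strip the "blob <size> " prefix so paths with spaces survive
        awk '$1 == "blob" { size = $2; sub(/^blob [0-9]+ /, ""); print size, $0 }' |
        sort -n -r -k1,1 |
        head -n "$n"
}
```

Run list_largest_blobs 10 from inside a repository to see the ten largest files; piping the result through cut -d' ' -f2- leaves only the paths.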

an ascii art crash course in surgery

Suppose our patient has a particularly painful
and unnaturally large commit located in their
commit history:

            This is the commit
             to be "ectomied"
             \/
             __
    o---o---(__)---o---o--o

The git commit ectomy will rewrite the history
without the problematic commit:

             __
    o---o---(__)---o---o---o   old history
         \
          \---o----o---o---o   new history

              /\
            The problematic commit
             is modified to be smaller
              in the new history

If there is only one branch, no branches or tags
or labels will refer to the old history, so the
old commits can be garbage-collected out of the .git directory.

Finally, running a git push <remote-name> <branch-name> --force
will force the old history to be replaced with
the new history on the remote. Here is the final
commit history:

    o---o
         \
          \---o----o---o---o   new history

demo surgery: setup

Clone an example repo for performing surgery. You don’t need
a remote repository to do the demo surgery, but we will use one
in our walkthrough.

$ 
git clone https://github.com/charlesreid1/git-commit-ectomy-example

side note: how to make a fat file

We will use the dd command to create files with a specified number
of bytes. For example, to create a 10 MB file, we can issue the command:

$ 
dd if=/dev/urandom of=my_big_fat_file bs=1048576 count=10

Important: You must use /dev/urandom with a non-zero block size.
If you use /dev/zero then each file will be identical and git
will not store them separately. Then your surgery will go very badly.
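
To see why identical files are a problem, compare the blob IDs git would assign to them. This is a small sketch using git hash-object, which computes an object ID without storing anything; the temporary file names are made up for the illustration.

```shell
# Sketch: identical content maps to a single blob ID, so git stores it only once.
zero_a=$(mktemp); zero_b=$(mktemp)
rand_a=$(mktemp); rand_b=$(mktemp)

# Two files of zeros and two files of random bytes, 16 KB each
dd if=/dev/zero    of="$zero_a" bs=1024 count=16 2>/dev/null
dd if=/dev/zero    of="$zero_b" bs=1024 count=16 2>/dev/null
dd if=/dev/urandom of="$rand_a" bs=1024 count=16 2>/dev/null
dd if=/dev/urandom of="$rand_b" bs=1024 count=16 2>/dev/null

# hash-object prints the blob ID git would use for each file
zero_hash_a=$(git hash-object "$zero_a")
zero_hash_b=$(git hash-object "$zero_b")
rand_hash_a=$(git hash-object "$rand_a")
rand_hash_b=$(git hash-object "$rand_b")

echo "zeros:  $zero_hash_a $zero_hash_b"
echo "random: $rand_hash_a $rand_hash_b"

rm -f "$zero_a" "$zero_b" "$rand_a" "$rand_b"
```

The two zero-filled files share one ID, while the two random files get different IDs and would therefore each take up space in the repository.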

Note: 1048576 = 2^20 bytes comes from
the fact that 1 KB = 2^10 bytes, and 1 MB = 2^10 KB,
for a total of 2^20 bytes per megabyte.

count=10 means we make 10 blocks, each of size 1 MB (1048576 bytes).
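
The size arithmetic can be checked directly in the shell. This sketch writes to a temporary file (an arbitrary choice for the illustration) and compares the byte count reported by wc with bs * count.

```shell
# Sketch: file size in bytes = block size * number of blocks.
bs=$((1024 * 1024))          # 2^20 bytes = 1 MB
count=10
expected=$((bs * count))     # 10 MB in bytes

tmpfile=$(mktemp)
dd if=/dev/urandom of="$tmpfile" bs="$bs" count="$count" 2>/dev/null

# wc -c counts bytes; tr strips the padding some wc implementations add
actual=$(wc -c < "$tmpfile" | tr -d ' ')
echo "expected=$expected actual=$actual"

rm -f "$tmpfile"
```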

make some text files

We start by adding some small boring text files to the repository:

$ 
echo "hello foo" > foo.txt; echo "hello bar" > bar.txt

Now add them to the repo history:

$ 
for item in *.txt; do
    git add "${item}" && git commit "${item}" -m "adding ${item}"
done

make some fat files

To demonstrate the importance of specifying the path to the
large files being removed from the repository, we add several
10 MB files inside of a subdirectory. Start with the directory
structure:

$ 
mkdir data1; mkdir data2

Now create some files in each of the two directories:

$ 
cd data1/
dd if=/dev/urandom of=bat bs=1048576 count=10
dd if=/dev/urandom of=cat bs=1048576 count=10
dd if=/dev/urandom of=dat bs=1048576 count=10
cd ../


cd data2/
dd if=/dev/urandom of=fat bs=1048576 count=10
dd if=/dev/urandom of=rat bs=1048576 count=10
cd ../

Now we have the following directory structure:

$
tree .
.
├── bar.txt
├── data1
│   ├── bat
│   ├── cat
│   └── dat
├── data2
│   ├── fat
│   └── rat
└── foo.txt
$ 
ls -lhg data1
-rw-r--r--   1 staff    10M Apr 10 18:30 bat
-rw-r--r--   1 staff    10M Apr 10 18:30 cat
-rw-r--r--   1 staff    10M Apr 10 18:30 dat

$ 
ls -lhg data2
-rw-r--r--   1 staff    10M Apr 10 18:30 fat
-rw-r--r--   1 staff    10M Apr 10 18:30 rat

Also make sure they are unique (hence /dev/urandom and not /dev/zero):

$ 
for i in data1/*at data2/*at; do
    md5 "${i}"
done

MD5 (bat) = 140c7d1e5a12c0eb2fefd5529250a280
MD5 (cat) = 9345ca4033c7e42fb009e3b8014570dc
MD5 (dat) = fadc3114fe9a70f688eba0db4e0dc7a9
MD5 (fat) = 39e98200043e438f9070369989b2d964
MD5 (rat) = 77b1c3077041078fd1f371b1bb7dd6e4

commit files

Add the files to the repo in separate commits:

$
for item in data1/bat data1/cat data1/dat data2/fat data2/rat; do
    git add ${item} && git commit ${item} -m "adding ${item}"
done

Now push all the commits to the remote (this will take a while):

$
git push origin master

Now you should see everything in the commit history on Github:

Repo commit history

You should also see it locally in the git log:

$ 
git log --oneline
902b0d8 adding rat
b3376bd adding fat
e2427de adding dat
25682b5 addding cat
495235a addding bat
2506d38 adding bar.txt
2eb8d13 adding foo.txt

demo surgery: procedure

prepare tools

Use git-forget-blob.sh
to forget the blob. Start by downloading it:

$ 
wget https://tinyurl.com/git-forget-blob-mac-sh -O git-forget-blob.sh
chmod +x git-forget-blob.sh

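This script will detect if you are on a Mac,
and if so, will use the GNU gxargs instead of
the BSD xargs. This requires GNU tools to be
installed via Homebrew. In current Homebrew, gxargs
is provided by the findutils formula:

$
brew install findutils

Older versions of Homebrew accepted a --with-default-names flag
to install the GNU tools under their usual names, shadowing the default
BSD version of xargs (which is not compatible with the GNU version of xargs).
Current versions of Homebrew no longer support that flag. If you want the
unprefixed names, add the formula’s gnubin directory to your PATH instead:

$
export PATH="$(brew --prefix)/opt/findutils/libexec/gnubin:$PATH"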

To use the git-forget-blob.sh script:

$
./git-forget-blob.sh <relative-path-to-file>

(See below for more detail.)

the command that doesn’t work: git rm

Start by checking the size of the repo:

$ 
du -hs .git
 50M	.git

Now remove dat using git rm:

$ 
git rm data1/dat
git commit -m 'Removing dat'
git push origin master

This, of course, does not change the size of the repo.
If we clone a fresh copy from Github, the size of the
repo is still the same:

$
du -hs .git
 50M    .git

Why? Because git is cursed with perfect memory, and will not
forget a large file that’s been added to the repo.
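
You can confirm this on a throwaway repository. The sketch below uses made-up identity settings and a small random file standing in for dat. It removes the file with git rm and then checks whether the blob is still reachable from the history.

```shell
# Sketch: after "git rm", the file is gone from the working tree
# but its blob is still reachable through the older commit.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo Surgeon"          # hypothetical identity for the demo
git config user.email "surgeon@example.com"

mkdir data1
dd if=/dev/urandom of=data1/dat bs=1024 count=64 2>/dev/null
git add data1/dat
git commit -q -m "adding data1/dat"

git rm -q data1/dat
git commit -q -m "removing data1/dat"

# Count how many objects in the full history are still recorded under data1/dat
still_listed=$(git rev-list --objects --all | grep -c ' data1/dat$' || true)
echo "objects still recorded for data1/dat: $still_listed"
```

A non-zero count means the blob is still part of the history, so every clone continues to download it.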

the command that does work: git forget blob

To force git to forget a large file that’s been added to the repo,
use the git-forget-blob.sh script to permanently remove it.
Here, we remove the bat file from the repo history by modifying
all commits that involve the bat file, and rewriting those commits
(and, by consequence, all commits that happened after them).

Here is how to permanently remove bat from the repo history and rewrite
all commits (we specify data1/bat and not bat):

$ 
./git-forget-blob.sh data1/bat
Enumerating objects: 26, done.
Counting objects: 100% (26/26), done.
Delta compression using up to 4 threads.
Compressing objects: 100% (20/20), done.
Writing objects: 100% (26/26), done.
Total 26 (delta 1), reused 26 (delta 1)
Rewrite 495235a86e70be03ee0749733645615a093b547a (3/7) (0 seconds passed, remaining 0 predicted)    rm 'data1/bat'
Rewrite 25682b53e6c8c88328346fc2e245b5946adec6cb (4/7) (0 seconds passed, remaining 0 predicted)    rm 'data1/bat'
Rewrite e2427def6f9de095928aaecfd9fef892880e6ce8 (5/7) (0 seconds passed, remaining 0 predicted)    rm 'data1/bat'
Rewrite b3376bdc847e26bdb323408afa06112dd4c2b36d (6/7) (0 seconds passed, remaining 0 predicted)    rm 'data1/bat'
Rewrite 902b0d8e46ec8d1487ae3db3b2989dfade5dacbe (7/7) (1 seconds passed, remaining 0 predicted)    rm 'data1/bat'

Ref 'refs/heads/master' was rewritten
Enumerating objects: 23, done.
Counting objects: 100% (23/23), done.
Delta compression using up to 4 threads.
Compressing objects: 100% (18/18), done.
Writing objects: 100% (23/23), done.
Total 23 (delta 1), reused 12 (delta 0)

Verify it worked by checking the size of the .git directory:

$ 
du -hs .git
 40M	.git

Success!
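
The "Rewrite ... rm 'data1/bat'" lines in the output above are what git filter-branch prints when its index filter runs git rm --cached on each commit. The following is a rough sketch of that general technique on a throwaway repository. It is an assumption about the approach, not a copy of git-forget-blob.sh, and the identity settings and file sizes are made up for the demo.

```shell
# Sketch: rewrite history without data1/bat, then drop every leftover reference to it.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo Surgeon"          # hypothetical identity for the demo
git config user.email "surgeon@example.com"

echo "hello foo" > foo.txt
git add foo.txt && git commit -q -m "adding foo.txt"
mkdir data1
dd if=/dev/urandom of=data1/bat bs=1024 count=64 2>/dev/null
git add data1/bat && git commit -q -m "adding data1/bat"
echo "hello bar" > bar.txt
git add bar.txt && git commit -q -m "adding bar.txt"

root_before=$(git rev-list --max-parents=0 HEAD)
head_before=$(git rev-parse HEAD)

# Remove the path from the index of every commit on every ref
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
    'git rm --cached --ignore-unmatch -q data1/bat' -- --all >/dev/null 2>&1

# filter-branch keeps backups under refs/original/; delete them,
# expire the reflog, and garbage-collect so the blob can actually go away
git for-each-ref --format='%(refname)' refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc --quiet --prune=now

root_after=$(git rev-list --max-parents=0 HEAD)
head_after=$(git rev-parse HEAD)
remaining=$(git rev-list --objects --all | grep -c ' data1/bat$' || true)
echo "objects still recorded for data1/bat: $remaining"
```

The commit made before the file was introduced keeps its hash, while the tip of the branch gets a new one.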

Note that if you mistakenly specify the name of the file only,
without the relative path to the file, git will be looking for
the file at the top level of the repository, and the file will
not be found:

$ 
./git-forget-blob.sh bat
Enumerating objects: 26, done.
Counting objects: 100% (26/26), done.
Delta compression using up to 4 threads.
Compressing objects: 100% (21/21), done.
Writing objects: 100% (26/26), done.
Total 26 (delta 1), reused 0 (delta 0)
bat not found in repo history

how it worked

If we check the git log we can see what happened - every commit from the
one that introduced bat onward was rewritten. It’s important to note that when
git computes the hash of each commit, it includes the hash of the prior
commit - meaning, if one commit in a repository’s history changes, every
commit that comes after it changes too.

Thus, we will rewrite every single commit since the very first commit that introduced
the file we removed.

Compare the old and new logs:

# NEW LOG                    # OLD LOG                                          
                                                                                
$ git log --oneline          $ git log --oneline                                
5bc57f6 adding rat           902b0d8 adding rat 
3153621 adding fat           b3376bd adding fat                                 
c456173 adding dat           e2427de adding dat                                 
078a5be addding cat          25682b5 addding cat                                
3cd75ce addding bat          495235a addding bat                                
2506d38 adding bar.txt       2506d38 adding bar.txt                             
2eb8d13 adding foo.txt       2eb8d13 adding foo.txt                             

Note that the first two commits, which did not involve
the bat file, remain identical, but every commit
from 495235a (which first introduced bat) onward is changed.

Each commit hash is computed using the prior commit hash,
so once commit 495235a changes, it cascades through the
entire history by changing all subsequent commit hashes.
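
The parent hash is stored in plain sight inside each commit object. As a small illustration on a throwaway repository (with a made-up identity), git cat-file -p prints the raw commit, including the parent line that ties it to the prior commit.

```shell
# Sketch: a commit object records its parent's hash, so changing the parent
# changes the content (and therefore the hash) of every commit after it.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo Surgeon"          # hypothetical identity for the demo
git config user.email "surgeon@example.com"

echo "hello foo" > foo.txt
git add foo.txt
git commit -q -m "adding foo.txt"
echo "hello bar" > bar.txt
git add bar.txt
git commit -q -m "adding bar.txt"

# Print the raw commit object: tree, parent, author, committer, message
git cat-file -p HEAD

# Pull out the recorded parent and compare it with the previous commit's hash
recorded_parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
actual_parent=$(git rev-parse HEAD~1)
echo "recorded parent: $recorded_parent"
echo "actual parent:   $actual_parent"
```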

stitch the patient back up

Of course, for surgeons, as for airline pilots,
if the last step is screwed up, nothing else counts.

We asked git to forget a file, which it did,
but that required rewriting every commit since
the file was introduced. At this point we have two parallel
master branches - the old history (still on the remote) and the new
history (in our local repository).

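If we simply git push our branch to the Github remote,
the push will be rejected, because the new history is not
a descendant of the old one. If we then follow git’s advice
and git pull before pushing again, we will have a huge headache:
the old and new histories get merged, both end up
on Github, and our git history will contain duplicates of every
rewritten commit, side by side.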

To do this correctly, we need to use the force
when we push, which tells the Github remote to replace
whatever commit history it currently has with the
commit history that we are pushing:

$
git push origin master --force

This will ensure that Github does not keep duplicate
copies of all commits.
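
If other people may have pushed in the meantime, a slightly more cautious variant is git push --force-with-lease, which only overwrites the remote branch if it is still where your local copy last saw it. The sketch below is not part of the original procedure. It uses a local bare repository as a stand-in for the Github remote and an amended commit as a stand-in for the surgery.

```shell
# Sketch: a rewritten branch needs a forced push; a plain push is rejected.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"        # stand-in for the Github remote
git init -q "$work/local"
cd "$work/local"
git config user.name "Demo Surgeon"          # hypothetical identity for the demo
git config user.email "surgeon@example.com"
git remote add origin "$work/remote.git"

echo "hello foo" > foo.txt
git add foo.txt
git commit -q -m "adding foo.txt"
branch=$(git symbolic-ref --short HEAD)      # master or main, depending on git config
git push -q origin "$branch"

# Rewrite the latest commit (stand-in for the history rewrite)
git commit -q --amend -m "adding foo.txt, rewritten"

# A plain push fails because the new history does not descend from the old one
if git push -q origin "$branch" 2>/dev/null; then
    plain_push=accepted
else
    plain_push=rejected
fi

# Forced push, refused if the remote moved since we last saw it
git push -q --force-with-lease origin "$branch"

remote_head=$(git --git-dir="$work/remote.git" rev-parse "$branch")
local_head=$(git rev-parse HEAD)
echo "plain push: $plain_push"
```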

Here is a screenshot of the repo on github before
we ran git-forget-blob:

Commit history before git-forget-blob

And a screenshot of the repo after:

Commit history after git-forget-blob

tips for surgery

Size up your patient before you start.

Use the one-liner in the “Consult with your Doctor”
section to size up your patient before you start
(modify the tail line to change the number of files).

Get your patient some insurance.

Back up any files you want to remove but still want to keep.
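
Beyond copying individual files somewhere safe, one option is to snapshot the whole pre-surgery history into a single file with git bundle. This is a sketch of one possible approach; backup_repo is a hypothetical helper name, and both paths should be absolute.

```shell
# Sketch: save every branch and tag of a repository into one bundle file.
# backup_repo is a hypothetical helper name (an assumption, not a git command).
backup_repo() {
    repo_dir="$1"       # absolute path to the repository to back up
    bundle_file="$2"    # absolute path of the bundle file to create
    git -C "$repo_dir" bundle create "$bundle_file" --all
}
```

For example, backup_repo /path/to/repo /path/to/pre-surgery.bundle writes the bundle, and git clone /path/to/pre-surgery.bundle restored recreates the old history from it if the operation goes wrong.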

Make sure you specify relative paths to file names.

The git-forget-blob.sh script requires you to specify the
path to the file you want to remove, relative to the top
level directory of the repository.

Like this:

# CORRECT
./git-forget-blob.sh data/my_project/phase-1/proprietary/super_huge.file

Not like this:

# INCORRECT
./git-forget-blob.sh super_huge.file

The long one-liner in the “Consult with your Doctor” section
will list the largest files in the repository, with the relative
path to that file (relative to the root of the repository).

If you pass it a filename without a path to the file,
the script will most likely complain that the file could
not be found. But it may attempt to remove the file and
rewrite history anyway without removing any files.

If you are running git-forget-blob.sh and the size of the
.git folder is not going down, it may be because you are
specifying an incorrect path to the files you are trying to
remove.
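
One way to catch a wrong path before operating is to ask git how many commits touch it. This is a minimal sketch; commits_touching is a hypothetical helper name, and it should be run from the top level of the repository so that the path is interpreted relative to the root.

```shell
# Sketch: count the commits, on any branch or tag, that touch an exact path.
# commits_touching is a hypothetical helper name (an assumption, not a git command).
commits_touching() {
    git rev-list --all --count -- "$1"
}
```

In the demo repository, commits_touching data1/bat reports at least one commit, while commits_touching bat reports 0, which is a sign that the path is wrong.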