Rewriting history with git filter-branch

Welcome to the second post in my series on lesser-known git commands. If you missed last week’s post about git worktree, you can find it here. This week we will look at git filter-branch.

Hopefully, you will not need it that often, but if you do, git filter-branch is extremely powerful at rewriting your git history. If your spidey senses aren’t tingling by now, let me be as clear as possible: rewriting your git history is a very destructive process. Afterward you have to force push your changes, and everybody working with the repository has to clone it again, so please be very careful. Essentially, git filter-branch can create a completely new repository for you.

But why do it at all if it is as destructive as advertised? Because sometimes we have to. Here are a couple of use cases where git filter-branch is the way to go.

Completely remove files from the repository

Let’s say you committed a file to the repository that should not be there. It could be a huge log file, a secret, or even a private key. Deleting the file with a new commit would not suffice, because the file would still be available in the previous versions in your git history. In the case of a large file, the repository size would not go down, and everybody cloning would still have to fetch the history with the large file in it. In the case of sensitive data, an attacker could just check out an older version and get your secret data from there.

This is where git filter-branch comes into play. With the following command, you can remove a file from all branches in a repository.

$ git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch FILE_TO_DELETE" \
  --prune-empty -- --all

Let’s break this down:

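--force
As you would imagine, this overrides a safety check: without it, git filter-branch refuses to start if a temporary directory or a backup under refs/original/ is left over from an earlier run.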

--index-filter
This is one of the possible filters we can apply. With an index-filter, git will not check out any files into a working directory; it only works on the index itself. Most of the time this makes the execution much faster, but it makes complex changes more difficult because you have to work with update-index and such. In this example, the change we want to perform is simple enough, so we can go for this faster variant.

"git rm --cached --ignore-unmatch FILE_TO_DELETE"
This is the command that will be executed on every state of your repository. That is how index-filter works: it runs the given command once per commit. So if you have a large repository with a long history, bring drinks & snacks. Whenever you can use rebase instead of filter-branch, you should choose to do so. But back to the topic at hand.
We use git rm to delete FILE_TO_DELETE. Since we are only operating on the index, we have to use --cached. The whole filter-branch command will fail if a filter command fails, so we don’t want to get any errors if the file we want to delete isn’t present in a commit. To achieve this we append --ignore-unmatch.

--prune-empty
Since we are removing a file, there may be a commit that had the sole purpose of introducing this file to the repository. After removing the file, this commit would be empty. With this option, we get rid of such distracting empty commits and make them disappear from the repository completely.

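--all
This selects the refs you want to rewrite. You could name master only, or HEAD for the current branch. In our case, we want to rewrite all branches (and every other ref), so the file will be gone for real. Put a standalone -- in front of --all so that it is handed to git rev-list instead of being read as an option of filter-branch itself.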
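If you want to try this without risking a real repository, here is a minimal sketch on a throwaway repo. The file names (README, huge.log) are made up for the demo. The cleanup at the end is my assumption about what is usually needed before the repository actually shrinks: filter-branch keeps the old history reachable under refs/original/, so the backups have to go and the garbage has to be collected.

```shell
set -eu
# Newer git versions print a deprecation notice and pause before
# filter-branch starts; this variable skips that.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Build a throwaway repository with three commits (hypothetical files).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo "hello" > README
git add README
git commit -q -m "Add README"

echo "a lot of log output" > huge.log
git add huge.log
git commit -q -m "Add log file by accident"

echo "more text" >> README
git commit -q -am "Update README"

# Remove huge.log from every commit on every ref.
git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch huge.log" \
  --prune-empty -- --all >/dev/null

# Delete the backup refs, expire the reflog and collect garbage,
# otherwise the old objects stay around.
git for-each-ref --format='%(refname)' refs/original/ |
  while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
git gc --quiet --prune=now

# Check the result: no commit should mention the file any more.
remaining=$(git log --all --oneline -- huge.log)
commits=$(git rev-list --count HEAD)
echo "commits touching huge.log: ${remaining:-none}"
echo "commits left on HEAD: $commits"
```

The commit that only added huge.log becomes empty and is pruned, which is why one commit fewer is left at the end.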

Extract a part of your repository into a repo of its own

Maybe up to this point you have managed your source code and your documentation in the same repository, but for whatever reason you now want to keep the documentation in a separate repository.

With the following command, you can do just that.

$ git filter-branch --force --prune-empty --subdirectory-filter docs @

--force --prune-empty
These are just the same as in our previous example.

--subdirectory-filter docs
This time we are using a different type of filter: the subdirectory filter. It moves all contents of docs into the repository root and throws everything else away. You will end up with a “new” repository containing only the history of your docs directory.

@
This is just git’s way of messing with you. The @ is simply shorthand for HEAD, that is, the commit currently checked out on the branch you are working on.
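Since the subdirectory filter throws everything else away, I would run it on a separate clone rather than on the original. The following is only a sketch under that assumption; the repository layout (src/main.c, docs/manual.md) and the directory names project and docs-only are invented for the demo.

```shell
set -eu
# Skip the deprecation notice and delay of newer git versions.
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)

# Hypothetical original repository with source code and documentation.
git init -q "$work/project"
cd "$work/project"
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir src docs
echo "int main(void) { return 0; }" > src/main.c
git add src
git commit -q -m "Add source"
echo "# Manual" > docs/manual.md
git add docs
git commit -q -m "Add manual"

# Rewrite a clone so the original repository stays untouched.
git clone -q "$work/project" "$work/docs-only"
cd "$work/docs-only"
git filter-branch --force --prune-empty --subdirectory-filter docs @ >/dev/null

# The contents of docs/ are now the repository root, and only the
# commits that touched docs/ are left.
files=$(git ls-files)
commits=$(git rev-list --count HEAD)
echo "files: $files"
echo "commits: $commits"
```

The clone still has the original as its origin remote, so before pushing anywhere you would point it at the new, empty documentation repository.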

Remove a secret from a configuration file

For example, somebody accidentally committed an AWS token instead of the placeholder AWS_TOKEN and you want to make sure every reference to the real token is gone.

Short side note: If something like this happens with any secret data you should treat the data as compromised and create new secrets ASAP!

$ git filter-branch --tree-filter \
  "sed -i 's/12hgjg324g23g413gjh1234/AWS_TOKEN/g' config.yml" \
  -- --all

--tree-filter
This is maybe the most powerful filter. It checks out every state in your history into a temporary directory, executes the given command there, and commits whatever the directory contains afterward. This means that with this filter you can remove, add, move or change files.

"sed -i 's/12hgjg324g23g413gjh1234/AWS_TOKEN/g' config.yml"
Just like the index-filter, the tree-filter executes a command on every state of the repository. In our case, we run a sed command that searches for the AWS token in the configuration file and replaces it with the placeholder. As mentioned, you can change nearly anything here. Git will execute your command and commit all the changes present in the temporary directory. Keep in mind that the command has to succeed in every commit, including those where config.yml does not exist yet; otherwise the rewrite aborts.

--all
Since we don’t want any reference to the token to remain in any branch, we run the sed command for all branches.
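Here is a hedged sketch of the same idea on a throwaway repository. Two things are my additions and not part of the command above: the filter only calls sed where config.yml actually exists (a failing filter would stop the whole run), and the in-place flag is written the GNU sed way, so BSD/macOS sed would need a different spelling. README is just a made-up second file.

```shell
set -eu
# Skip the deprecation notice and delay of newer git versions.
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

# The first commit has no config.yml at all.
echo "notes" > README
git add README
git commit -q -m "Add README"

# The second commit leaks the token.
echo "token: 12hgjg324g23g413gjh1234" > config.yml
git add config.yml
git commit -q -m "Add config"

# Only touch commits that contain config.yml (GNU sed syntax assumed).
git filter-branch --force --tree-filter \
  "if [ -f config.yml ]; then sed -i 's/12hgjg324g23g413gjh1234/AWS_TOKEN/g' config.yml; fi" \
  -- --all >/dev/null

# Drop the backup refs, then search every remaining commit for the token.
git for-each-ref --format='%(refname)' refs/original/ |
  while read -r ref; do git update-ref -d "$ref"; done
leaks=$(git grep -l 12hgjg324g23g413gjh1234 $(git rev-list --all) || true)
cleaned=$(git show HEAD:config.yml)
echo "commits still containing the token: ${leaks:-none}"
echo "config.yml at HEAD: $cleaned"
```

The search at the end is a cheap sanity check. It does not make the leaked token safe again, so rotating the secret is still required.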

You can run multiple consecutive filter-branch commands. For example, if the stuff you want to extract into a repository of its own isn’t located in a single directory, you can first move everything you need into a single directory using a tree-filter and afterward run the subdirectory-filter on the newly created directory.

Most of the time I have used git filter-branch, it was to get rid of some useless big files that made cloning the repo a rather unpleasant experience. But as you have seen, you can do wicked stuff with git filter-branch. Just remember that after using it everybody has to re-clone the repository, so only use it if really necessary.
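The two-step extraction (tree-filter first, subdirectory-filter second) could look roughly like this. Everything named here is hypothetical: USAGE.txt, guides/ and app.sh stand in for your files, and export/ is just a directory name I picked for collecting them.

```shell
set -eu
# Skip the deprecation notice and delay of newer git versions.
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

# The files we want to extract are scattered: USAGE.txt and guides/.
echo "usage" > USAGE.txt
mkdir guides
echo "intro" > guides/intro.txt
echo "echo app" > app.sh
git add .
git commit -q -m "Initial commit"

# Step 1: in every commit, move the wanted paths into export/.
git filter-branch --force --tree-filter '
  mkdir -p export
  for p in USAGE.txt guides; do
    if [ -e "$p" ]; then mv "$p" export/; fi
  done
' @ >/dev/null

# Step 2: make export/ the new repository root. --force is needed
# here because step 1 left a backup under refs/original/.
git filter-branch --force --prune-empty --subdirectory-filter export @ >/dev/null

files=$(git ls-files)
echo "$files"
```

On a real repository I would do this on a fresh clone, as both steps rewrite the current branch in place.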

Some more words of warning

Even the git documentation states that git filter-branch has “some” pitfalls. Please make sure to have a look at the safety instructions before copy-pasting any of the shown commands into your terminal, especially if you have spaces or non-ASCII characters in your file names.

Secondly, remember that all attached commit signatures will be gone after applying git filter-branch. A signature covers the contents of a commit, and every rewritten commit is a new commit, so this is to be expected. Don’t be surprised by any missing signatures.

I hope you enjoyed the second part of my lesser-known git commands series.

See you next time.

2 thoughts on “Rewriting history with git filter-branch”

  • Hi,

    I tried the command below to remove a directory of files from my commits, and it did work! But I had no intention of removing that directory from my working directory. Strangely enough, the directory disappeared from my working directory after I executed this command. Any idea what could be the cause?

    git filter-branch --index-filter "git rm --cached --ignore-unmatch -r 'Frontend/Sample CSV files'" HEAD

    • Hi Han,

      that behavior is to be expected. filter-branch checks out the rewritten HEAD when it finishes, so files that no longer exist in the rewritten history disappear from the working directory as well, much like a plain ‘git rm’ would remove them.
      If you want to keep the files, you should clone another copy to the side first and copy the files back over to the cleaned-up repository afterward. Don’t forget to put them on the ignore list then, otherwise you would have to start all over again 😉
