Guided: Rewriting Git History

It's common for developers to mistakenly commit code that contains sensitive data to Git repositories. In this Guided Code Lab, you will learn how to remove this information from previous Git commits!

Level: Advanced
Duration: 35m
Published: Mar 27, 2024

Table of Contents

  1. Challenge

    Introduction to Git

    In this lab, you will be working with Git, a tool for distributed version control of changes made to files. Git allows you to create a local repository of files that you can work with and provides tracking of those files. Git is commonly used to track changes to source code among software development teams.

    In the following steps, you will learn:

    1. How to identify sensitive data in a Git repository
    2. How to perform automated scanning of Git repositories for sensitive information
    3. How data can be leaked via commit history
    4. How to remove all traces of a file from a repository and its commit history
    5. How to extract a subdirectory from an existing repository and establish it as a standalone repository
    6. The risks and benefits of rewriting Git history
  2. Challenge

    Identifying Sensitive Information in Git

    In this step, you will learn how to view the contents of a Git repository and locate potentially sensitive information that has been committed to the Sensitive_Information repository.

    The git show command is used to display various types of objects in a Git repository, such as commits, tags, and trees. When used with a specific file path, git show can display the content and metadata of that file at a particular commit or version.

    First, you will need to acquire the hash of the latest Git commit in the Sensitive_Information repository. There are two different methods you could use:

    • You could use the Terminal to run the git show command, which will immediately give you the hash.

    info> To exit the command, simply press the q key.

    • You can find this information by going to the Sensitive_Information/.git/refs/heads directory and examining the master file.

    Using the command line is usually a lot more efficient.

    Next, you will generate a list of all the files that are inside of the commit. The full output of the git show command can be hard to read because it includes the content changes for every file. The --name-only flag instructs Git to display only the names of the files that were affected by the commit being shown, without showing the actual content changes. This is useful when you're only interested in knowing which files were modified, added, or deleted in a particular commit, rather than seeing the detailed content changes within each file.

    git show --name-only <commit_hash>

    Now that you have a list of files that are committed to the repo, you can inspect their contents for sensitive data.
    
    Sensitive data can take many forms, but typically anything that should remain private that is committed to a repo is considered a data leak. Some common examples include IP addresses, password information, and personal user data.
    
    Similar to using the `git show` command with the hash, you can also use it with a filename.
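The commands from this step can be sketched end to end as follows. This is a minimal illustration, not the lab's own solution: it builds a throwaway repository in a temporary directory, and the file names (config.txt, readme.txt), their contents, and the "Lab User" identity are all hypothetical. It uses git rev-parse HEAD as one non-interactive way to read the latest hash, then git show with a hash and with a hash:filename pair.

```shell
set -eu

# Build a throwaway repository so the commands have something to inspect.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Lab User"            # hypothetical identity
git config user.email "lab@example.com"
printf 'server_ip=10.0.0.5\n' > config.txt # hypothetical sensitive file
printf 'hello\n' > readme.txt
git add config.txt readme.txt
git commit -q -m "Add config and readme"

# Hash of the latest commit, read without opening a pager.
commit_hash="$(git rev-parse HEAD)"
echo "latest commit: $commit_hash"

# Names of the files affected by that commit (--format= hides the commit header).
changed_files="$(git show --name-only --format= "$commit_hash")"
printf '%s\n' "$changed_files"

# Contents of one file as it existed at that commit.
file_contents="$(git show "$commit_hash:config.txt")"
printf '%s\n' "$file_contents"
```

The hash:filename form is what makes it possible to read a file exactly as it was stored in a given commit, regardless of what the working directory currently contains.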
  3. Challenge

    Automated Scanning for Sensitive Data in Git

    In a real-world scenario, manually sifting through each file in search of sensitive data can be time-consuming and inefficient. It's much better to use an automated approach. For this, you will use the command git log with the -S flag.

    Using the -S flag in git log allows you to quickly identify commits that added or removed instances of a specified keyword in file content. This lets you focus your investigation on confirmed matches instead of reading every file by hand.

    git log -S <keyword>

    With this command, you'll receive the hash of the commit where the keyword `IP` was located, along with the author of the commit and the date and time it was made. This method gives a faster and more precise scan of Git repositories for sensitive information than inspecting files one at a time with the `git show` command.
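As a rough sketch of how the -S search behaves, the script below creates a small temporary repository with two commits, only one of which introduces the string being searched for. The file names, the IP= value, and the "Lab User" identity are made up for illustration.

```shell
set -eu

# Throwaway repository with one harmless commit and one that leaks a value.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Lab User"            # hypothetical identity
git config user.email "lab@example.com"

printf 'meeting notes\n' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

printf 'IP=192.168.1.20\n' > servers.txt   # hypothetical leaked value
git add servers.txt
git commit -q -m "Add server list"
leak_commit="$(git rev-parse HEAD)"

# -S reports commits that change how many times the string occurs,
# that is, commits that add or remove it.
matches="$(git log -S "IP=" --format=%H)"
printf '%s\n' "$matches"

# Adding --name-only also prints the files in which the match occurred.
git log -S "IP=" --name-only --format='%h %an %ad %s'
```

Only the second commit is reported, because the first one never adds or removes the searched string.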
  4. Challenge

    Leaking Data through Git Commit History

    It's important to understand that removing sensitive data from a Git repository is not as simple as just deleting it. Even if you delete the file and commit the change, the sensitive data can still be leaked via the commit history, which records all commits in a particular repository. To demonstrate this, you will go through the process of deleting a file and verifying that you are still able to access its contents via the commit history.

    For the following tasks, make sure to navigate to the Insecure_Deletion repository in the Terminal. If you run the git log command, you will see that there is one commit that contains two files.

    Once the file containing sensitive data has been deleted and the change committed to the repository, running the git log command again shows two hashes within the commit history. Using this information, it's possible to retrieve the contents of the deleted file.

    In this next task, you will see how easily data can be leaked via the commit history.

    This illustrates the importance of not only removing files from a Git repository, but also rewriting the Git history. By doing so, previous commits cannot be accessed and used to view sensitive data that should have already been removed.

    In the upcoming step, you will learn how to remove a file and rewrite the repository's commit history to ensure all data is properly removed.
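The leak described in this step can be reproduced with a short script. This is an illustrative sketch in a temporary repository rather than the lab's Insecure_Deletion repository; secrets.txt, app.txt, and their contents are hypothetical.

```shell
set -eu

# Throwaway repository: one commit containing two files.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Lab User"            # hypothetical identity
git config user.email "lab@example.com"
printf 'password=example-only\n' > secrets.txt   # hypothetical sensitive file
printf 'app code\n' > app.txt
git add secrets.txt app.txt
git commit -q -m "Initial commit"
first_commit="$(git rev-parse HEAD)"

# Delete the sensitive file and commit the deletion.
git rm -q secrets.txt
git commit -q -m "Remove secrets"

# The file is gone from the working directory...
ls

# ...but the history now holds two commits,
commit_count="$(git rev-list --count HEAD)"
echo "commits in history: $commit_count"

# and the earlier commit still serves the deleted file's contents.
recovered="$(git show "$first_commit:secrets.txt")"
printf '%s\n' "$recovered"
```

Anyone with a copy of the repository can run the last command, which is why deleting the file in a new commit is not enough.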

  5. Challenge

    Removing a Commit Using the `git filter-branch` Command

    Now for this step, you are going to remove all traces of a file using the git filter-branch command.

    git filter-branch is a powerful tool in Git used for rewriting history by applying custom filters to the commits in a repository. It allows you to modify the repository's commit history in various ways, such as removing sensitive data, splitting or merging directories, or altering commit messages.

    With git filter-branch, you can apply different filters to each commit, including modifying the commit message, changing file contents, or entirely removing commits from the history.

    However, it's essential to use git filter-branch with caution, especially in shared repositories, as it rewrites history and can disrupt collaboration if not used carefully.

    The git filter-branch command with the --tree-filter flag checks out each commit in turn and runs a specified shell command in that commit's working tree.

    The rm -rf command is used to forcibly remove files and directories. The -r flag stands for recursive, meaning it removes directories and their contents recursively, and the -f flag stands for force, which suppresses confirmation prompts and ignores files that do not exist. The second point matters here: the file may be absent from some commits, and without -f the filter would fail on those commits and abort the rewrite.

    git filter-branch -f --tree-filter 'rm -rf <filename>'
    

    For the tasks in this step, navigate to the Filtered_Repo repository. Now, you will practice the secure way of removing a file from a Git repository. To verify that the change was made correctly, you can use one of two methods.

    1. You can run the ls command to check the directory itself.
    2. You can use the git show method to obtain the new hash and then use the git show --name-only <hash value> command to list all of the files in the current repository.
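The removal and both verification methods can be sketched as below. This runs in a temporary repository with hypothetical files (secrets.txt, app.txt) rather than the lab's Filtered_Repo. Two things here are assumptions beyond the lab text: the FILTER_BRANCH_SQUELCH_WARNING variable, which recent Git versions read to skip the warning and pause shown before filter-branch runs, and the use of git rev-list with a path as a scripted stand-in for reading git show output by eye.

```shell
set -eu
# Skip the warning and pause that recent Git versions print before filter-branch runs.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository with a hypothetical sensitive file in its history.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Lab User"            # hypothetical identity
git config user.email "lab@example.com"
printf 'password=example-only\n' > secrets.txt
printf 'app code\n' > app.txt
git add secrets.txt app.txt
git commit -q -m "Initial commit"
printf 'more code\n' >> app.txt
git commit -q -am "Update app"

# Rewrite every commit on the current branch, deleting secrets.txt from each tree.
git filter-branch -f --tree-filter 'rm -rf secrets.txt' >/dev/null

# Method 1: the file is gone from the working directory.
ls

# Method 2: no commit on the rewritten branch touches the file any more.
commits_touching_file="$(git rev-list --count HEAD -- secrets.txt)"
echo "commits still touching secrets.txt: $commits_touching_file"

# filter-branch keeps the pre-rewrite commits reachable under refs/original.
backup_refs="$(git for-each-ref --format='%(refname)' refs/original)"
echo "backup refs: $backup_refs"
```

The last command is worth noting: the backup refs and the reflog still point at the old commits, so the file's contents stay in the local object store until those are removed and the repository is garbage-collected. The git filter-branch documentation describes that cleanup. Existing clones and remotes also keep the old history until they are replaced or force-updated.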
  6. Challenge

    Splitting a Subdirectory into a New Repo

    For this final step, you will split a subdirectory out into its own repository. This is a common use case when a repository begins to grow so large that it isn't practical to have it all in one place. In these situations, it's helpful to know how to automate the process of splitting a subdirectory into its own repository.

    For this task, navigate to the Split_Repo repository. You will utilize an open-source script called git-filter-repo, which has been added to the folder for easy access.

    git-filter-repo is a tool that provides advanced filtering capabilities for Git repositories. It allows users to rewrite history, remove sensitive data, split repositories, and perform various other repository manipulations with ease.

    You can use this script to turn the subdirectory Logs into its own repository.

    To do so you will utilize the following command:

    python3 git-filter-repo --subdirectory-filter <subdirectory> --force
    

    The --subdirectory-filter option is a feature provided by the git-filter-repo tool. When used with git-filter-repo, --subdirectory-filter allows you to filter the repository history to include only the commits and files related to a specific subdirectory. This means that after applying this filter, only the history and files within the specified subdirectory will remain in the repository.

    The --force flag tells git-filter-repo to proceed even if the repository does not look like a fresh clone, a safety check the tool otherwise applies before rewriting history. To see the outcome, run the ls command or use the git show method you used in the previous step. You will be able to see that the repository now only contains the contents of the Logs folder, which is the access.log file.
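Outside the lab environment the git-filter-repo script may not be present, so the sketch below swaps in Git's built-in git filter-branch --subdirectory-filter to show the same effect on a temporary repository. This is a stand-in for illustration, not the lab's method; the lab's own command is shown as a comment. The file contents and the FILTER_BRANCH_SQUELCH_WARNING setting are assumptions.

```shell
set -eu
# Skip the warning and pause that recent Git versions print before filter-branch runs.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository with a Logs subdirectory next to other files.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Lab User"            # hypothetical identity
git config user.email "lab@example.com"
mkdir Logs
printf 'GET /index.html 200\n' > Logs/access.log   # hypothetical log line
printf 'app code\n' > app.txt
git add Logs/access.log app.txt
git commit -q -m "Add app and logs"

# Lab command:  python3 git-filter-repo --subdirectory-filter Logs --force
# Built-in stand-in used here so the sketch runs without the script:
git filter-branch -f --subdirectory-filter Logs >/dev/null

# Logs/ has become the repository root, so only access.log is tracked.
tracked_files="$(git ls-files)"
printf '%s\n' "$tracked_files"
```

Either tool rewrites the history so that the chosen subdirectory becomes the root of the repository and everything outside it is dropped.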

  7. Challenge

    Risk and Benefits of Rewriting Git History

    Congratulations on completing this lab! You have covered how to rewrite Git history to remove sensitive information from a Git repository. Before you use your skills in the real world, it's important for you to weigh the risks and benefits. Here are some of the common risks of rewriting Git history:

    • Loss of Data: If changes are made to Git History without proper backups or without proper authorization, important files and data can be lost forever.

    • Loss of Accountability: When making changes to the commit history, you can lose track of who has made changes in the past, and this can lead to a loss of accountability among the development team.

    • Wasted Time: If the rewritten history is not communicated to other developers and shared promptly, it could result in developers working on outdated repositories and files, leading to wasted time and effort.

    Here are some of the common benefits:

    • Removing Sensitive Information: It's common for developers to accidentally commit files that contain sensitive information to a repository. In this case, from a security point of view, it's important that developers understand how to remove these files.

    • Better Organization: You can rewrite history to clean up and improve the quality of your commit history. You can also remove unnecessary files and consolidate all of your files under one central repository.

    • Remove Mistakes: If a developer added a file by accident, you can easily remove it from your Git history.
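One way to reduce the data-loss risk listed above is to take a full copy of the repository before rewriting anything. The sketch below uses git clone --mirror for that purpose; it is a suggestion under stated assumptions, not a step from the lab, and the repository contents and backup path are hypothetical.

```shell
set -eu

# Throwaway repository standing in for the one about to be rewritten.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Lab User"            # hypothetical identity
git config user.email "lab@example.com"
printf 'app code\n' > app.txt
git add app.txt
git commit -q -m "Initial commit"
original_head="$(git rev-parse HEAD)"

# A mirror clone copies every ref and object into a separate bare repository.
backup="$(mktemp -d)/backup.git"
git clone -q --mirror "$repo" "$backup"

# Confirm the backup holds the same commit before any history is rewritten.
backup_head="$(git --git-dir="$backup" rev-parse HEAD)"
echo "original: $original_head"
echo "backup:   $backup_head"
```

Keep in mind that a backup taken before removing sensitive data still contains that data, so it needs the same access controls and should be destroyed once the rewrite has been verified.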

Shimon Brathwaite is a seven-year cybersecurity professional with extensive experience in Incident Response, Vulnerability Management, Identity and Access Management and Consulting.
