The ability to make and use tools is one of the many things that make us special, and the skill to use them properly is what makes some of us elite.
For us software engineers, interacting with Git is an important part of daily life. These days Git is the de facto standard among version control systems and almost everyone uses it. Git is one of those special tools that every engineer has to be familiar with, since it is so widespread in the tech world. It would be a big surprise to find a new project or company that is not using Git.
As a free software contributor, I have spent my whole professional career in FOSS communities and projects, and proper use of Git seems natural to me. But to my surprise, every now and then I witness how some "commercial engineers" (air quotes) use Git, and it makes me sad that in a commercial space, where you get paid to build technology, people do it so poorly. After many incidents of this type I decided to put together a document to help improve my team's Git workflows. While there is plenty of reading material on the internet dedicated to Git best practices, I thought it might be useful to publish that document to help others as well. For lack of a better word I've chosen the title "Git Etiquette". Following Git etiquette helps teams get more out of their Git workflows and avoid frustration.
I'll try to keep it short and refer to essays from others who have explained it much better than me. I borrowed some wording from others and have listed most of those sources in the resources section to the best of my ability, but since I wrote the original document long ago, and it was supposed to be a private doc for a few people, some of the sources might have been lost.
Also, I'll add more items to the list over time.
Commits are the building blocks of version control with Git. It's obvious that improving commit quality will improve the overall quality of the repository.
Single purpose commits
Engineers often get sidetracked into doing too many things while working on one particular thing. You are trying to fix one particular bug, you spot another one, and you can’t resist the urge to fix that as well. And another one. Soon it snowballs, and you end up with many unrelated changes all going together in one commit.
This is problematic, and it is better to keep commits as small and focused as possible for many reasons, including:
- It makes it easier for other people in the team looking at your change, making code reviews more efficient.
- If the commit has to be rolled back completely, it’s far easier to do so.
- It's straightforward to track these changes with your ticketing system.
- It helps you mentally parse changes you’ve made using git log.
A commit should be a wrapper for related changes. For example, fixing two different bugs should produce two separate commits. Small commits make it easier for other team members to understand the changes and roll them back if something goes wrong. With tools like the staging area and the ability to stage only parts of a file, Git makes it easy to create very granular commits.
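As a rough illustration of staging only part of a file, the sketch below fixes two unrelated bugs in one file and still produces two commits. The file name, contents, and commit messages are made up for the demo, and the answers to `git add -p` are piped in only so the example runs unattended; in daily use you would answer the prompts yourself.

```shell
# Throwaway repository so the demo does not touch real work.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Example Dev"

# One file with two unrelated bugs, far enough apart to form separate hunks.
printf '%s\n' 'bug one' 2 3 4 5 6 7 8 9 10 11 12 13 14 'bug two' > app.txt
git add app.txt
git commit -q -m "Add app skeleton"

# Both bugs get fixed in the working tree at the same time.
printf '%s\n' 'fix one' 2 3 4 5 6 7 8 9 10 11 12 13 14 'fix two' > app.txt

# `git add -p` normally asks about each hunk interactively. The answers are
# piped in here: "y" stages the first hunk, "n" leaves the second unstaged.
printf 'y\nn\n' | git add -p app.txt > /dev/null
git commit -q -m "Fix first bug"

# The remaining hunk becomes its own commit.
git add app.txt
git commit -q -m "Fix second bug"

git log --oneline
```

Each fix can now be reviewed or reverted without dragging the other one along.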
On many occasions we need to inspect the Git history to find something: a commit, specific changes, clues about an error, or even the engineer who made a certain change. I have had bittersweet experiences when it comes to dealing with commit messages in the Git history of projects. Let me demonstrate with real examples.
I have seen many times in commercial teams that engineers don't bother writing a proper, useful Git commit message. For some reason that is beyond my understanding, they think that having "I hate my life!" as the commit message for a commit with ~1200 changed lines, in a repository with more than 300k commits (at the time) that is used by about 200 engineers, is a cool thing to do. I came across this commit message long ago when I was trying to figure out why a service was malfunctioning. The commit message wasn't helpful at all, and I had to read through the diff to figure out whether or not that commit was the root of the issue. I could tell many stories like this one, but for the sake of this essay one is enough.
But let's have a look at real Git history of a repository that I don't like at all (using
    2683332a333a Update tests
    3315442a4983e Remove icon from manage header
    aa234e8aa83f8 test fix
    29c35ba3adcee Class migration
    fbde3a265ab3f Migrate header styles
    01eaac4b4cc13 tests
    8d004a970eef7 fix tests
    d2890dfdc360 add tests
    91c2aa31720f2 add test for notice variable
    135a2df25e86a fix tests
    3aa4101546a93 refactor
    0eaae58006f51 add test for global variable
    3ae7ee7297104 remove unnecessary check
These commits are taken from a repository with more than 400k commits and many active contributors in a commercial space (Don't worry, the SHAs are not the original SHAs).
On the other hand, a few weeks ago I pulled from the LLVM repository and built it again (I do this weekly) and tried to build the Serene compiler (a programming language that I'm working on) against it. But the compilation failed with an error like "Identifier is unknown". I grepped the Git log of the LLVM repository, found a commit, and all of a sudden smiled and praised the author in my mind. Here is the commit message (I removed the commit details):
Date: Wed Jan 12 11:20:18 2022 -0800

    [mlir] Finish removing Identifier from the C++ API

    There have been a few API pieces remaining to allow for a smooth transition for
    downstream users, but these have been up for a few months now. After this only
    the C API will have reference to "Identifier", but those will be reworked in a
    followup.

    The main updates are:
    * Identifier -> StringAttr
    * StringAttr::get requires the context as the first parameter
      - i.e. `Identifier::get("...", ctx)` -> `StringAttr::get(ctx, "...")`
It was so obvious how to fix my issue by looking at this fantastic commit message.
Which one would you rather read? Which one helps you understand what happened in any specific commit?
According to Chris Beams, a well-crafted Git commit message is the best way to communicate the context about a change to other engineers (and our future selves). A diff will tell you what changed, but only the commit message can properly tell you why.
Peter Hutterer makes this point well:
Re-establishing the context of a piece of code is wasteful. We can’t avoid it completely, so our efforts should go to reducing it [as much] as possible. Commit messages can do exactly that and as a result, a commit message shows whether a developer is a good collaborator.
If you have ever used git log or any other Git subcommand that requires interacting with commits (which many of them do), you'll understand what a valuable asset a well-written commit message is.
The Git history is just a bunch of commits in a certain order. It's up to the engineers to make the most of it. As any project grows, maintenance becomes an issue, and the messier your history is, the harder it is to maintain the project. It also becomes painful for others to get involved in the project.
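To make the value of a searchable history concrete, here is a small sketch in a throwaway repository. Both commit messages are invented for the demo; the point is only that `git log --grep` can find a change by a word from an error message when the commit message actually mentions it.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Example Dev"

# Two commits: one vague subject, one descriptive (both messages are made up).
echo "v1" > api.txt
git add api.txt
git commit -q -m "fix tests"

echo "v2" > api.txt
git commit -q -a -m "Remove Identifier from the public API" \
    -m "Callers should use StringAttr instead."

# Search commit messages for the symbol named in the compiler error.
git log --oneline --grep="Identifier"
```

Only the descriptive commit shows up in the search; the "fix tests" commit can only be understood by reading its diff.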
There are seven easy rules that you can follow to rock your commit messages:
- Separate subject from body with a blank line
- Limit the subject line to 50 characters
- Capitalize the subject line
- Do not end the subject line with a period
- Use the imperative mood in the subject line
- Wrap the body at 72 characters
- Use the body to explain what and why vs. how
I highly recommend reading the How to Write a Git Commit Message post by Chris Beams, which explains these rules in depth.
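Some of these rules can be checked mechanically. The function below is a minimal sketch of such a check, not an established tool: the name `check_commit_msg` and its messages are my own, it covers only the blank line, subject length, capitalization, trailing period, and body wrapping, and it does not understand subject prefixes such as `[mlir]`. The imperative mood and the "what and why" rules still need a human reader.

```shell
# check_commit_msg FILE
# Prints one line per violated rule and returns non-zero if any rule fails.
check_commit_msg() {
    ccm_file=$1
    ccm_status=0
    ccm_subject=$(sed -n '1p' "$ccm_file")
    ccm_second=$(sed -n '2p' "$ccm_file")

    # Subject of at most 50 characters.
    if [ "${#ccm_subject}" -gt 50 ]; then
        echo "subject is longer than 50 characters"
        ccm_status=1
    fi

    # Capitalized subject.
    case $ccm_subject in
        [[:upper:]]*) ;;
        *) echo "subject does not start with a capital letter"; ccm_status=1 ;;
    esac

    # No trailing period.
    case $ccm_subject in
        *.) echo "subject ends with a period"; ccm_status=1 ;;
    esac

    # Blank line between subject and body.
    if [ -n "$ccm_second" ]; then
        echo "second line is not blank"
        ccm_status=1
    fi

    # Body wrapped at 72 characters.
    if awk 'NR > 2 && length($0) > 72 { bad = 1 } END { exit !bad }' "$ccm_file"; then
        echo "body has lines longer than 72 characters"
        ccm_status=1
    fi

    return "$ccm_status"
}

# Example run on a made-up message that follows the rules.
msg=$(mktemp)
printf '%s\n' 'Fix redirect loop on login' '' \
    'Expired sessions were sent back to the login page, which' \
    'redirected them again. Clear the stale cookie first.' > "$msg"
check_commit_msg "$msg" && echo "message looks fine"
```

A function like this could be called from a `commit-msg` hook, which Git runs with the path of the message file as its first argument.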
Commit early, commit often
Git works best, and works in your favor, when you commit your work often. Instead of waiting to make the commit perfect, it is better to work in small chunks and keep committing your work. Personally, I have found it much easier to have smaller commits that group together related changes. This way you can easily revert commits that you don't like, cherry-pick those that you want, and avoid dealing with unnecessary changes that come along in a commit.
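The following sketch shows why small commits are easy to undo. It uses a throwaway repository with three made-up files, one per commit, and reverts only the middle one.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Example Dev"

# Three small commits, each adding one (hypothetical) file.
for name in a b c; do
    echo "$name" > "$name.txt"
    git add "$name.txt"
    git commit -q -m "Add $name.txt"
done

# Undo only the middle commit; the other two stay untouched.
git revert --no-edit HEAD~1 > /dev/null

git log --oneline
ls
```

Had all three files gone into one large commit, removing just `b.txt` would have meant editing by hand instead of reverting.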
If you are working on a feature branch that could take some time to finish, committing often helps you keep your code up to date with the latest changes so that you avoid conflicts.
Also, Git only takes full responsibility for your data when you commit. Committing protects you from losing work, lets you revert changes, and helps you trace what you did when using
Don’t commit generated files
This one is fairly obvious, but many times I have had to look at the history to figure out who committed an auto-generated or massive file to the repository.
Generally, only files that took manual effort to create and cannot be re-generated should be committed. Generated files can be re-created at any time and normally don’t work well with line-based diff tracking. It is useful to add a .gitignore file in your repository’s root to tell Git which files or paths you don’t want to track.
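As a small sketch of how this looks in practice, the example below writes a .gitignore into a throwaway repository and asks Git which files it still considers. The ignore patterns and file names are hypothetical; the right list depends on your build tools.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical ignore rules for build output and logs; adjust to your project.
cat > .gitignore <<'EOF'
build/
dist/
*.o
*.log
EOF

mkdir build
touch build/app.o debug.log main.c

# Untracked files that Git still offers to track.
git status --short

# Show which rule, if any, ignores a given path.
git check-ignore -v build/app.o debug.log
```

The .gitignore file itself is committed, so everyone who clones the repository gets the same rules.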
Don’t alter published history
Once a commit has been merged to an upstream default branch (and is visible to others), it is strongly advised not to alter its history. Git and other VCS tools let you rewrite branch history, but doing so is problematic for everyone who has access to the repository. While git-rebase is a useful feature, it should only be used on branches that only you are working with (private branches).
One of the key aspects of Git is its distributed nature: everyone can have their own repository, push commits to their own fork, and send pull requests asking others to pull from it. These days the process is centralized through Git hosting services, especially in the commercial space. While they provide forking functionality, forking is not common in commercial, closed source projects, so engineers end up sharing feature branches. It has happened to me many times, in different roles, that someone force pushed to a public (within the org) branch and screwed up everyone's workflow. For the sake of your own peace of mind and others' sanity, DO NOT CHANGE THE PUBLIC HISTORY.
It's kind of a joke, but if you are a public force pusher, I'll end my friendship with you.
Having said that, there will inevitably be occasions where a history rewrite on a published branch is needed. Extreme care must be exercised while doing so.
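One way to exercise that care, if a rewrite really cannot be avoided, is `git push --force-with-lease`, which refuses to overwrite the remote branch when it contains commits you have not fetched. The sketch below simulates this with a local bare repository standing in for the shared remote; the users "Alice" and "Bob", the branch name, and the file are all made up.

```shell
work=$(mktemp -d)
git init -q --bare "$work/remote.git"

# Alice publishes a feature branch.
git clone -q "$work/remote.git" "$work/alice" 2> /dev/null
cd "$work/alice"
git config user.email "alice@example.com"
git config user.name "Alice"
git symbolic-ref HEAD refs/heads/feature   # start on a branch named "feature"
echo "draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
git push -q -u origin feature

# Bob builds on top of it and pushes.
git clone -q -b feature "$work/remote.git" "$work/bob"
cd "$work/bob"
git config user.email "bob@example.com"
git config user.name "Bob"
echo "bob's addition" >> notes.txt
git commit -q -a -m "Extend notes"
git push -q origin feature

# Alice rewrites her commit, unaware of Bob's push.
cd "$work/alice"
git commit -q --amend -m "Add project notes"
if git push -q --force-with-lease origin feature 2> /dev/null; then
    echo "remote overwritten"
else
    echo "push rejected: the remote has commits Alice has not seen"
fi
```

A plain `git push --force` at the same point would have silently discarded Bob's commit. The lease is not a substitute for talking to the people who share the branch.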
Merge VS Rebase
The golden rule is to never rebase public branches and to always merge into them. When it comes to merge vs rebase, there are two simple rules.
Note: It's better to use squash and merge instead of normal merge because in projects with many contributors, it is easier to maintain a Git history on the main branch that contains one commit per feature.
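A squash merge can be sketched as follows, using a throwaway repository with made-up branch and file names. Hosting services usually offer the same operation as a button on the pull request.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
git config user.email "dev@example.com"
git config user.name "Example Dev"

echo "base" > README
git add README
git commit -q -m "Add README"

# Feature branch with two work-in-progress commits.
git checkout -q -b feature/login
echo "form" > login.txt
git add login.txt
git commit -q -m "WIP login form"
echo "form with validation" > login.txt
git commit -q -a -m "Fix typo in form"

# Bring the whole feature into main as a single commit.
git checkout -q main
git merge --squash feature/login > /dev/null
git commit -q -m "Add login form"

git log --oneline
```

The main branch ends up with one commit for the feature, while the intermediate commits remain only on the feature branch.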
Don’t change other people’s history
You must never ever destroy other people's history. You must not rebase commits that other people made. Basically, if it is not your branch, you can't rebase it. Notice that this really is about other people's history, not about other people's code. If you want to pull down some changes from other developers into your branch, it’s fine to rebase, because it’s their code but it’s your history. So you can go wild with rebase on it, even though you didn't write the code, as long as the commit itself is your private one.
Minor clarification: once you've published your history in a public branch, other people may be using it, and so now it's clearly not your private history anymore. So the minor clarification really is that it's not just about your commit, it's also about it being private to your tree, and you haven't pushed it out and announced it yet.
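The allowed case, rebasing your own unpublished commits on top of other people's newer work, might look like the sketch below. Everything happens in a throwaway repository and the branch and file names are invented.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
git config user.email "dev@example.com"
git config user.name "Example Dev"

echo "base" > README
git add README
git commit -q -m "Add README"

# Private feature branch with one local commit (never pushed).
git checkout -q -b feature/report
echo "report" > report.txt
git add report.txt
git commit -q -m "Add report generator"

# Meanwhile, main moves forward with someone else's change.
git checkout -q main
echo "config" > config.txt
git add config.txt
git commit -q -m "Add default config"

# Replay the private commit on top of the updated main.
git checkout -q feature/report
git rebase -q main

git log --oneline --graph
```

Only the private commit is rewritten here. The commits on main keep their identity, so nobody else's history changes.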
Don’t expose your unfinished work to public
Keep your own history readable. Some people do this by just working things out in their head first, and not making mistakes. But that's very rare, and for the rest of us, we use git rebase etc while we work on our problems. So git rebase is not wrong. But it's right only if it's YOUR VERY OWN PRIVATE git tree. If you're still in the git rebase phase, you don't push it out. If it's not ready, you don't tell the public at large about it. Don’t push your changes to a shared feature branch or the main
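One possible way to tidy a private branch before publishing it is sketched below, using fixup commits and `git rebase -i --autosquash`. The branch, file, and messages are made up, and `GIT_SEQUENCE_EDITOR=true` is set only so the generated todo list is accepted without opening an editor; normally you would review it.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
git config user.email "dev@example.com"
git config user.name "Example Dev"

echo "base" > README
git add README
git commit -q -m "Add README"

# Private branch: a first attempt, then a correction recorded as a fixup.
git checkout -q -b feature/search
echo "search v1" > search.txt
git add search.txt
git commit -q -m "Add search endpoint"
echo "search v2" > search.txt
git commit -q -a --fixup=HEAD

# Fold the fixup into the commit it corrects before anyone else sees it.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash main

git log --oneline
```

After the rebase the branch carries a single clean commit, which is the state worth pushing.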
Don’t merge upstream changes at random points. If you’re working on a shared feature branch, don’t pull down changes before they are verified and finalized. Doing so puts your history in an inconsistent state: it will contain changes that might later be removed upstream, and when you push, you will put those removed changes back again.
This essay was just a superficial attempt to explain some of the Git etiquette that we need to follow when we're collaborating on a project with others. At the end of the day we are looking to make it easier for ourselves to develop software, and following certain rules will help us get there faster and make the process more pleasant.
References and Resources
- https://www.kernel.org/doc/html/v4.10/process/submitting-patches.html The kernel community is one of the biggest communities of paid and volunteer contributors using Git intensively with really high traffic. To manage the development process and stay productive, it has really strict guidelines, some of which can be useful for us.
- https://chris.beams.io/posts/git-commit/ Chris Beams researched best practices around commit messages by reviewing many projects. His article is one of the most referenced articles in this field.
- https://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html Another short but widely referenced article on best practices around Git commit messages
- https://yarchive.net/comp/linux/commit_messages.html Who better to follow on Git best practices than Linus Torvalds himself?
- https://lwn.net/Articles/328438/ A famous email from Linus Torvalds describing how to maintain a git tree from merge vs rebase perspective
- https://www.atlassian.com/git/tutorials/merging-vs-rebasing Atlassian's guidelines on merge vs rebase