Sunday, January 25, 2015

How does Git internally represent a commit?

Have you ever wondered how Git represents all those commits under the hood? The short answer: Git stores every version of every file internally. Now let us dig deeper and discover what Git actually does.

What is Git, by the way?
I will quote git-scm.com here: "Git is a content-addressable filesystem." Wow! What does that mean? It simply means that Git is, at its core, a simple key-value store: you hand it some content, it returns a hash, and that hash can be used to address the content later on. To illustrate this I will use the example below.


So what happened above?

  1. In the first command I simply initialized a Git repository.
  2. Then I ran `echo 'Hello World'` and piped its output to git hash-object.
  3. And finally I ran git cat-file.
The second and third steps need some explanation. The git hash-object command takes its input and, when given the -w flag, stores it in the "Git file system". It returns a hash, in this case 980ad5f..6b60e3. This is a SHA-1 hash, 40 hexadecimal characters long, computed from the content given to git hash-object (plus a short header that Git prepends). The hash can later be used to retrieve the content, which is exactly what step 3 does: there I use git cat-file to display it.
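
If you are curious how such a hash comes about, here is a minimal Python sketch of the calculation for a blob. It is my own illustration, not Git's code, and the function name git_blob_hash is made up for this post. It assumes the default SHA-1 object format. The digest depends on the exact bytes fed in, including the trailing newline that echo adds.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the object id Git would assign to a blob with this content."""
    # Git hashes a short header ("blob <size in bytes>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# echo appends a newline, so the bytes being hashed are b"Hello World\n".
digest = git_blob_hash(b"Hello World\n")
print(digest, len(digest))
```

Because the hash is derived only from the content, storing the same content twice always yields the same hash, and therefore a single object.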

Now if you look inside the .git/objects folder you will see a folder named 98/ and, inside it, a file named 0ad5f..6b60e3. As you may have guessed, .git/objects is where Git stores all its objects. Git takes the first two characters (out of 40) of the hash as a folder name and stores the object inside that folder, in a file named after the remaining 38 characters of the hash.
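
The 2 + 38 split can be sketched in a few lines of Python. The helper name loose_object_path is hypothetical, and the sketch only covers objects stored as individual ("loose") files with 40-character SHA-1 ids. The id used in the example is the well-known hash of an empty blob.

```python
from pathlib import PurePosixPath


def loose_object_path(object_id: str) -> PurePosixPath:
    """Map a 40-character object id to its location under .git/objects."""
    if len(object_id) != 40:
        raise ValueError("expected a full 40-character hash")
    # First two characters name the folder, the remaining 38 name the file.
    return PurePosixPath(".git/objects") / object_id[:2] / object_id[2:]


path = loose_object_path("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
print(path)
```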

Now if you see a commit hash 40 characters in length, you know where to look! Unfortunately, the contents of an object file are not human readable: Git compresses them with zlib, so a plain cat only prints garbled text. That is why I used git cat-file. Given a hash and the -p switch (-p stands for pretty-print), it prints the content that the hash refers to.
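
To see why a plain cat shows garbage, here is a rough Python sketch that writes an object the way a loose object file is laid out (header plus content, zlib-compressed) and reads it back. It works in a temporary folder instead of a real repository, and write_loose_object and read_loose_object are names I made up for illustration.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir, kind, body):
    """Store header + body, zlib-compressed, under <2 chars>/<38 chars>."""
    store = kind + b" " + str(len(body)).encode("ascii") + b"\0" + body
    object_id = hashlib.sha1(store).hexdigest()
    path = objects_dir / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return object_id


def read_loose_object(objects_dir, object_id):
    """Roughly what cat-file has to do: decompress, then split off the header."""
    raw = (objects_dir / object_id[:2] / object_id[2:]).read_bytes()
    header, _, body = zlib.decompress(raw).partition(b"\0")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return kind.decode("ascii"), body


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    oid = write_loose_object(objects_dir, b"blob", b"Hello World\n")
    on_disk = (objects_dir / oid[:2] / oid[2:]).read_bytes()
    kind, body = read_loose_object(objects_dir, oid)

print(on_disk)      # compressed bytes, not readable text
print(kind, body)   # the type and the original content
```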


Now let us see how a commit works!

To represent a commit, Git basically uses three types of object. They are:


  1. Commit object, to store details of a commit.
  2. Tree object, to represent folders.
  3. Blob object, to store contents of a file.
Please have a look at the following illustration.



In steps 1, 2 and 3 above I create the files file1 and folder1/file2 and put some content in them. Then, in steps 4 and 5, I add the files and commit them. The commit hash is fe40ed6, and in step 6 I use it to inspect the commit object with git cat-file. That shows the details Git has stored in the commit object. Inside the commit object you can see a reference to a tree object with hash cd886..6e943. This tree object represents the root folder of the repo.

When you inspect that tree object with git cat-file, you see a reference to a blob, which represents file1, and a reference to a tree, which is folder1. If you inspect the blob you will see the content of file1. What to expect inside the folder1 tree is anybody's guess, or better, try it yourself!
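
The chain commit → tree → blob can be imitated without Git at all. The Python sketch below keeps objects in a plain dictionary instead of .git/objects and builds the same layout (file1 and folder1/file2). The file contents, author line and timestamp are made up, so the hashes will not match the ones above, and the helper names (put, tree_body, parse_tree, show) are my own.

```python
import hashlib

objects = {}  # object id -> (type, body); an in-memory stand-in for .git/objects


def put(kind, body):
    """Store an object and return its id (SHA-1 of header + body)."""
    store = f"{kind} {len(body)}".encode("ascii") + b"\0" + body
    oid = hashlib.sha1(store).hexdigest()
    objects[oid] = (kind, body)
    return oid


def tree_body(entries):
    """Serialize (mode, name, object id) entries: 'mode name\\0' + 20 raw hash bytes."""
    body = b""
    # Simplification: plain sort by name (Git compares folder names as if they ended in "/").
    for mode, name, oid in sorted(entries, key=lambda e: e[1]):
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(oid)
    return body


def parse_tree(body):
    """Inverse of tree_body: return the list of (mode, name, object id)."""
    entries, pos = [], 0
    while pos < len(body):
        nul = body.index(b"\0", pos)
        mode, name = body[pos:nul].decode().split(" ", 1)
        entries.append((mode, name, body[nul + 1:nul + 21].hex()))
        pos = nul + 21
    return entries


# Hypothetical contents; 100644 marks a regular file, 40000 a folder (tree).
file1 = put("blob", b"contents of file1\n")
file2 = put("blob", b"contents of file2\n")
folder1 = put("tree", tree_body([("100644", "file2", file2)]))
root = put("tree", tree_body([("100644", "file1", file1), ("40000", "folder1", folder1)]))
commit = put("commit", (
    f"tree {root}\n"
    "author A U Thor <author@example.com> 1422144000 +0000\n"
    "committer A U Thor <author@example.com> 1422144000 +0000\n"
    "\n"
    "first commit\n"
).encode())


def show(oid, label, depth=0):
    """Walk from a commit down to the blobs, printing one line per object."""
    kind, body = objects[oid]
    print("  " * depth + f"{kind} {oid[:7]} {label}")
    if kind == "commit":
        tree_oid = body.split(b"\n", 1)[0].split(b" ")[1].decode()
        show(tree_oid, "(root folder)", depth + 1)
    elif kind == "tree":
        for _mode, name, child in parse_tree(body):
            show(child, name, depth + 1)


show(commit, "first commit")
```

The printed outline has the same shape as what you get by following the hashes by hand with git cat-file -p: a commit pointing at the root tree, which lists a blob for file1 and a tree for folder1, which in turn lists a blob for file2.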

Now suppose you change just the contents of file1 and commit again. How do you expect things to change? I am going to illustrate it below.

In steps 1 and 2 above, I change the content of file1 and commit it. In step 3 I inspect the resulting commit object. As expected, it contains a reference to a tree. What you did not see in the previous example is the parent, which is nothing but a reference to the previous commit. The previous commit had no parent because it was the first commit in the repo.
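
Here is a small Python sketch of what the text of a commit object looks like with and without a parent. The author identity, timestamp and messages are invented for the example, commit_body and commit_id are hypothetical helper names, and the tree used is the well-known hash of an empty tree, just to keep the sketch short.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of a tree with no entries


def commit_body(tree, parents, who, message):
    """Build the text of a commit object: headers, a blank line, then the message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # no parent line at all for a first commit
    lines += [f"author {who}", f"committer {who}", "", message]
    return ("\n".join(lines) + "\n").encode()


def commit_id(body):
    """Hash the body the same way as any other object, with a 'commit' header."""
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


who = "A U Thor <author@example.com> 1422144000 +0000"  # made-up identity and time
first_body = commit_body(EMPTY_TREE, [], who, "first commit")
first = commit_id(first_body)
second_body = commit_body(EMPTY_TREE, [first], who, "second commit")

print(first_body.decode())
print(second_body.decode())
```

Since the parent's hash is part of the text that gets hashed, the id of every commit depends on the whole history before it.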

In step 4 I inspect the tree object with hash c78af..dfd7f, which represents the root directory of the repo. As you would expect, it contains a reference to file1's blob and to folder1's tree. If you look closely, you may notice that the hash of folder1's tree is the same as it was in the previous commit. So Git creates new objects only for the things that have changed and reuses the existing objects for the things that have not.
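
The reuse falls out of the hashing itself, and a short Python sketch can show it. It computes the object ids for two snapshots that differ only in file1 and compares them. The file contents are invented, and object_id, tree_id and snapshot are names I made up. As before, this only imitates the object format and is not Git's own code.

```python
import hashlib


def object_id(kind, body):
    """SHA-1 of '<type> <size>\\0' followed by the body."""
    return hashlib.sha1(f"{kind} {len(body)}".encode("ascii") + b"\0" + body).hexdigest()


def tree_id(entries):
    """Id of a tree built from (mode, name, object id) entries, sorted by name."""
    body = b""
    for mode, name, oid in sorted(entries, key=lambda e: e[1]):
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(oid)
    return object_id("tree", body)


def snapshot(file1_content):
    """Ids for a repo holding file1 and folder1/file2; only file1 varies."""
    file1 = object_id("blob", file1_content)
    file2 = object_id("blob", b"contents of file2\n")  # hypothetical, never changes
    folder1 = tree_id([("100644", "file2", file2)])
    root = tree_id([("100644", "file1", file1), ("40000", "folder1", folder1)])
    return {"file1": file1, "folder1": folder1, "root": root}


before = snapshot(b"old contents of file1\n")
after = snapshot(b"new contents of file1\n")

for name in ("file1", "folder1", "root"):
    status = "unchanged" if before[name] == after[name] else "changed"
    print(f"{name}: {status}")
# file1: changed
# folder1: unchanged
# root: changed
```

The root tree changes because it lists file1's new hash, while folder1 hashes to the same id as before, so the existing object can simply be referenced again.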

To sum up, Git stores every version of each file as a blob and links the blobs together with tree objects and commit objects.

I hope I have done a good job of explaining how a Git commit is tracked at a basic level. If you have understood these things properly, then figuring out what happens during checkouts and cherry-picks is a cakewalk.
