Sunday, January 25, 2015

How does Git internally represent a commit?

Have you ever wondered how Git represents all those commits under the hood? The short answer to that question is: Git stores each and every version of a file internally. Now let us dig deeper and discover what Git actually does.

What is Git by the way?
I will quote here: "Git is a content-addressable filesystem." Wow! What does that mean? It simply means that, at its core, Git is a key-value store: you give it content, it returns a hash, and that hash can be used to address the content later on. To illustrate this I will use the example below.
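The original screenshot is missing from this copy of the post; the following is a hypothetical re-creation of that session (note the -w switch, assumed here so that the object is actually written to the repository; the hash your content produces is what matters, not any particular value):

```shell
cd "$(mktemp -d)"                 # work in a scratch directory

git init                          # 1. initialize a git repository

# 2. pipe content to git hash-object; -w writes the object, --stdin reads it
hash=$(echo 'Hello World' | git hash-object -w --stdin)
echo "$hash"                      # a 40-character SHA-1 hash

# 3. use the hash to get the content back
git cat-file -p "$hash"           # prints: Hello World
```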

So what happened above?

  1. In the first command I simply initialized a git repository.
  2. Then I did an `echo 'Hello World'` and piped it to git hash-object.
  3. And finally I did a git cat-file.
The second and third steps require explanation. The command git hash-object takes the input and stores it in the "Git file system". The hash-object command returns a hash; in this case the hash is 980ad5f..6b60e3. This is a 40-character SHA-1 hash computed from the content given to git hash-object. Later on, this hash can be used to access the content, which is exactly what I am doing in step 3: using git cat-file to display the content.

Now if you look inside the .git/objects folder you will see a folder named 98/ and inside it a file 0ad5f..6b60e3. As you guessed, .git/objects is where Git stores all its objects. Git takes the first two characters (out of 40) of the hash, makes a folder out of them, and stores the object inside that folder with a file name made of the remaining 38 characters of the hash.
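You can see this layout for yourself in a scratch repository (a sketch; the exact hash depends on the content you store):

```shell
cd "$(mktemp -d)" && git init -q

# store one object, then look at where git put it on disk
hash=$(echo 'Hello World' | git hash-object -w --stdin)
echo "$hash"                 # the full 40-character hash
find .git/objects -type f    # .git/objects/<first 2 chars>/<remaining 38 chars>
```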

Now if you see a commit hash 40 characters in length, you know where to look! Unfortunately, whatever is stored in an object file is not human readable: the contents are zlib-compressed, so a simple cat only returns garbled text. That is why I used git cat-file. I gave it the hash with the switch -p (-p stands for pretty-print) and it returns the content referred to by that hash.

Now let us see how commit works!

To represent a commit, Git basically uses three types of objects. They are

  1. Commit object, to store details of a commit.
  2. Tree object, to represent folders.
  3. Blob object, to store contents of a file.
Please have a look at the following illustration.
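The illustration itself is missing from this copy; here is a hypothetical session along the lines the steps below describe (file contents and commit message are assumptions, and HEAD stands in for the commit hash fe40ed6 mentioned in the text):

```shell
cd "$(mktemp -d)" && git init -q
git config user.name 'me' && git config user.email 'me@example.com'

echo 'some content' > file1            # step 1
mkdir folder1                          # step 2
echo 'other content' > folder1/file2   # step 3
git add .                              # step 4
git commit -m 'first commit'           # step 5: prints the short commit hash

git cat-file -p HEAD                   # step 6: inspect the commit object
# tree <40-character hash of the root tree>
# author ...
# committer ...
#
# first commit
```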

In steps 1, 2 and 3 above I create the files file1 and folder1/file2 and put content in them. After that, in steps 4 and 5, I add the files and commit them, respectively. Now I get the commit hash fe40ed6, and I use it to inspect the commit object with git cat-file in step 6. That shows the details Git has stored in the commit object. You can see that inside the commit object there is a reference to a tree object with hash cd886..6e943. This tree object represents the root folder of the repo.

When you inspect the aforementioned tree object with git cat-file, you see a reference to a blob, which represents file1, and a tree, which is folder1. If you check the blob you will see the content of file1. Now, what to expect inside folder1's tree is anybody's guess, or better yet, try it yourself!
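If you want to try it, git cat-file can print the root tree and folder1's tree directly (a sketch; HEAD^{tree} and HEAD:folder1 stand in for the tree hashes from the illustration):

```shell
# setup: a repo with file1 and folder1/file2, as in the illustration
cd "$(mktemp -d)" && git init -q
git config user.name 'me' && git config user.email 'me@example.com'
echo 'some content' > file1
mkdir folder1 && echo 'other content' > folder1/file2
git add . && git commit -qm 'first commit'

git cat-file -p 'HEAD^{tree}'   # the root tree:
# 100644 blob <hash>    file1
# 040000 tree <hash>    folder1

git cat-file -p HEAD:folder1    # folder1's tree: a single blob, file2
```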

Now consider that you changed just the contents of file1 and committed it. How do you expect things to change? I am going to illustrate it below.
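Again the screenshot is missing; here is a hypothetical session matching the steps described below (HEAD stands in for the new commit's hash):

```shell
# setup: the repo from the previous illustration
cd "$(mktemp -d)" && git init -q
git config user.name 'me' && git config user.email 'me@example.com'
echo 'some content' > file1 && mkdir folder1
echo 'other content' > folder1/file2
git add . && git commit -qm 'first commit'

echo 'changed content' > file1     # step 1
git commit -am 'second commit'     # step 2

git cat-file -p HEAD               # step 3: note the extra "parent" line
# tree <hash>
# parent <hash of the first commit>
# ...

git cat-file -p 'HEAD^{tree}'      # step 4: folder1's hash is unchanged
```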

In steps 1 and 2 above, I change the content of file1 and commit it. In step 3 I inspect the object of the resulting commit. In it, as expected, we find a reference to a tree. But we also find something not seen in the previous example: the parent. The parent is nothing but a reference to the previous commit. The previous commit did not have one because it was the first commit in the repo.

In step 4 I inspect the tree object that represents the root directory of the repo, with hash c78af..dfd7f. In it, as you would expect, there are references to file1's blob and folder1's tree. If you look closely, you may notice that folder1's reference hash is the same as the one inside the previous commit's tree. So what Git does is store the things that have changed and reuse the things that have not.

To sum up, Git stores all versions of each file and links them up with tree objects and commit objects.

I hope I have done a good job of explaining, at a basic level, how a Git commit is represented. If you have understood these things properly, then figuring out what happens during checkouts and cherry-picks is a cakewalk.

Tuesday, June 25, 2013

IPC device using Linux Kernel "completions"

The Linux kernel provides a facility named "completions" for the purpose of notifying another process that a particular task has been completed. Here is the source code for a device driver for a simple inter-process communication device that can be used to send a string of 4 characters between two processes.


Here are some details about "completions":

  • Completions are a technique that can be used to convey to another process that a particular task has been completed.
  • <linux/completion.h> must be included to use this.
  • A completion can be declared by DECLARE_COMPLETION(my_completion)
  • If it has to be created and initialized dynamically then we use the following method:
struct completion my_completion;
init_completion(&my_completion);
  • Waiting can be done by wait_for_completion(struct completion *c)
  • It performs an uninterruptible wait, and if nobody ever issues a complete(), that results in an unkillable process.
  • Completion can be notified by
    • void complete(struct completion *c)
    • void complete_all(struct completion *c)
    • Both of the above functions behave differently if more than one process is waiting.
    • In that case complete() will notify only one process, while complete_all() will notify all the waiting processes.
    • The structure can be reused without any problem unless complete_all is used. If complete_all is used then re-initialization has to be done in the following way: INIT_COMPLETION(struct completion c)
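The points above can be sketched as follows (kernel code, so it will not build in user space; a hypothetical fragment, not the driver's actual source):

```c
#include <linux/completion.h>

/* static declaration; the completion starts out "not done" */
static DECLARE_COMPLETION(my_completion);

static void waiter(void)
{
        /* uninterruptible wait until someone issues complete() */
        wait_for_completion(&my_completion);
}

static void notifier(void)
{
        complete(&my_completion);   /* wakes exactly one waiter */
        /* complete_all(&my_completion) would wake every waiter, after
         * which the completion must be re-initialized before reuse */
}

static void dynamic_init_example(void)
{
        struct completion c;
        init_completion(&c);        /* dynamic initialization */
}
```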

The Device

The structure of the device is simple. Here is the write operation.
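The original listing is missing from this copy of the post; here is a hypothetical sketch of such a write handler, assuming a module-level completion named sig and a 4-byte kernel buffer:

```c
#include <linux/completion.h>
#include <linux/fs.h>
#include <linux/uaccess.h>

static DECLARE_COMPLETION(sig);   /* shared with the read side */
static char ipc_data[4];          /* the 4-character message buffer */

static ssize_t ipc_write(struct file *filp, const char __user *buf,
                         size_t count, loff_t *f_pos)
{
        if (count != 4)
                return -EINVAL;
        if (copy_from_user(ipc_data, buf, 4))
                return -EFAULT;
        complete(&sig);           /* let a waiting reader proceed */
        return 4;
}
```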
The write operation simply copies the data from user space to an array in kernel space. Then it issues a complete(). This is nothing but a declaration that the particular task associated with sig has been completed, and a process waiting on sig can proceed.
The read operation is as follows.
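Again a hypothetical sketch, assuming the same module-level sig completion and 4-byte buffer as on the write side:

```c
#include <linux/completion.h>
#include <linux/fs.h>
#include <linux/uaccess.h>

static DECLARE_COMPLETION(sig);   /* shared with the write side */
static char ipc_data[4];

static ssize_t ipc_read(struct file *filp, char __user *buf,
                        size_t count, loff_t *f_pos)
{
        if (count != 4)
                return -EINVAL;
        wait_for_completion(&sig);    /* block until a writer completes */
        if (copy_to_user(buf, ipc_data, 4))
                return -EFAULT;
        return 4;
}
```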
The read operation on the other hand waits on the sig using wait_for_completion(). So when another process issues a complete() on sig, then the reader can proceed.
The device has the following behavior:
  • Read and write can only be done with a data size of 4 characters.
  • The data written to the device persists until a read operation is done on the device. The write operation is non-blocking.
  • The read operation has to wait until something is written to the device. Hence the read operation is blocking.
Reference : LDD 3e Chapter 5

Saturday, May 11, 2013

HIGHMEM and Memory Zones from a newbie perspective


Earlier Linux kernels (about 10 years ago) were unable to support more than 1GB of physical memory on 32-bit x86 systems. This was attributed to the way in which the virtual memory (VM) was arranged. In earlier 32-bit Linux systems with a 3:1 split, the first 3GB of the VM was the user address space and the remaining 1GB was the kernel address space. The 1GB allocated to the kernel can be mapped to any part of physical memory, but if the kernel has to access a page in physical memory, that page has to be mapped into its address space. Earlier kernels used a static mapping, and hence, with a 1GB address space, the kernel could only map 1GB of physical memory. Any physical memory beyond 1GB could never be accessed by the kernel. Thus the earlier Linux kernels were restricted to using 1GB of physical memory on 32-bit x86 systems.

The solution that was devised for this problem was high memory, or simply HIGHMEM. In this strategy we divide the physical memory into low memory and high memory. Low memory keeps the permanent mapping: certain pages of physical memory are always mapped into the kernel address space, and those pages are called low memory. The remaining pages are called high memory; the kernel reaches them through temporary mappings in its address space, set up as and when required. This enables the kernel to access any page within the 4GB range. Kernel data structures like linked lists must live in low memory itself; this helps in maintaining pointer consistency.


In the Linux kernel, page frames are grouped logically into three zones. They are
  1. ZONE_DMA
  2. ZONE_NORMAL
  3. ZONE_HIGHMEM
There are some reasons in the Linux kernel to divide the physical memory this way, as explained next. In some architectures there is a constraint on which range of memory can be used for certain operations. An example can be given in the x86 architecture: ISA devices are only capable of addressing the first 16MB of RAM, which implies that for DMA operations, ISA devices can only use the first 16MB of RAM. Some architectures do not have this constraint (example: PPC). Besides this, we have to accommodate the HIGHMEM solution. With this division of memory, managing memory becomes easier, with each zone having a struct to store its details.
Above I mentioned that ISA devices on x86 have a constraint in memory addressing. So on the x86 architecture, ZONE_DMA is the group of pages that belong to the first 16MB of RAM. PPC does not have this constraint, and hence on PPC ZONE_DMA is empty.

In the x86 architecture, the first 896MB of page frames are permanently mapped to the kernel address space. This can be otherwise stated as: the first 896MB of the kernel address space is mapped to the first 896MB of physical memory. That leaves 128MB of unmapped addresses in the 1GB kernel address space. These unmapped addresses can be used to make the temporary mappings from high memory. The union of the ZONE_DMA pages and the ZONE_NORMAL pages gives us the low memory pages. If ZONE_NORMAL pages are not available for allocation, the kernel can, as a last resort, use ZONE_DMA, but not vice versa.

ZONE_HIGHMEM is the collection of high memory pages. On 32-bit x86 that is any memory above 896MB. These pages are not permanently mapped to the kernel address space, as explained in the HIGHMEM section above.

Sources:
[1] Linux Device Drivers, 3rd ed., Corbet et al.
[2] Linux Kernel Development, Robert Love

Wednesday, July 18, 2012

Need for Shell Built-ins

Usually a command is executed by the shell by spawning a new process. Though this strategy is effective for usual commands, it turns ineffective for certain others, especially ones that change the behavior of the shell itself. An example of such a command is cd. As you know, cd is the command used to change the current working directory of a shell.

Consider that we decided to adopt the conventional method for implementing the cd command, i.e. we spawn a new process whenever the cd command is encountered. If you were to explore the Linux system call API, you would find that the only syscall available for changing the directory is chdir(). The problem with chdir() is that it can only change the current working directory of the calling process. Thus if we implement cd as a separate process, it will have no effect on our main process, the shell. That is where shell built-ins kick in. Built-ins are nothing but commands whose implementation is the task of the shell itself. So whenever a built-in command is encountered, it is executed by the shell itself rather than as a separate process.

The program below is the source code for a minimal shell that implements cd as a built-in.
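The listing is missing from this copy of the post; here is a hypothetical minimal shell along the lines described, with cd and exit handled in the shell's own process and everything else run from /bin with at most one argument:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    char line[256];

    while (printf("msh> "), fflush(stdout),
           fgets(line, sizeof(line), stdin)) {
        line[strcspn(line, "\n")] = '\0';
        char *cmd = strtok(line, " ");
        char *arg = strtok(NULL, " ");    /* at most one argument */

        if (!cmd)
            continue;
        if (strcmp(cmd, "exit") == 0)
            break;
        if (strcmp(cmd, "cd") == 0) {
            /* built-in: chdir() must run in the shell's own process */
            if (chdir(arg ? arg : "/") != 0)
                perror("cd");
            continue;
        }

        pid_t pid = fork();
        if (pid == 0) {                   /* child: run /bin/<cmd> */
            char path[300];
            snprintf(path, sizeof(path), "/bin/%s", cmd);
            execl(path, cmd, arg, (char *)NULL);
            perror(path);                 /* reached only if exec fails */
            _exit(127);
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);        /* wait for the command */
        }
    }
    return 0;
}
```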
Note: The minimal shell given above accepts at most one argument for any given command, including switches. Also, the only commands supported are the ones in the /bin directory, along with exit and cd.