Wednesday, December 18, 2013

How to translate C source code into assembly instruction for a function

Converting the function source code into the assembly is not a big deal , it is just a matter of 3 simple steps:

1.) Identifying the non-executable statements within a function.
2.)Create a function symbol table and resolve all the non-executable instructions.
3.)Translate C source into assembly source.

These steps sounds to be tough but if you have basic knowledge of c language these are very easy . let us take the steps one by one considering the follwing example of small code:
  1. int main()  
  2. {  
  3. int a = 100;  
  4. int b = 100;  
  5. int c;  
  6. c = a+b;  
  7. }  

Step1: Identifying the non-executable statements within a function

To understand this step we have to know what are the non-executable statements ??
Whenever you write a program you always write them at the very beginning of you function body.yes you guessed it right the variable declarations are the ones.
we can define the non-executable statements as:

Non-executable statements are those which are not converted to the instructions to be executed by the processor. It means that they are just the memory allocation instruction which are processed by the compilers.

So, in our example we have three non-executable statements shown in the figure below.

Step2: Creating the function symbol and resolving all non-executable statements.

Function Symbol table consists of 4 values:
1.)Symbol Name
3.)Composition (size of data type of symbol)
4.)Offset Address

You can easily fill the symbol table as shown below for our example.

As you can see in the above table , we have taken the variable names as symbols and added their type and size in the respective columns.

After filling the table we have calculate the total size which comes out to be 12 . Basically this 12 is the size of the stackfor the function main which will be allocated when the main function will be called.

Now the last column is the offset address . Offest Addresses are allocated by the compiler. let us see how we will fill that column.

Let us take the base pointer to the stack as ebp and assume that stack grows downwards (which happens most of the tym ).

As the size of the stack is 12 and a is the first variable declared in the function so,

Variable 'a' will get the memory at -12(%ebp)  that is -12 from ebp (stack base pointer)
Variable 'b' will get the memory at -8(%ebp) because 'a' is of 4 bytes.
Variable 'c' will get the memory ar -4(%ebp) because 'b' is also of 4 bytes.

So, The final function symbol will be:

Step3: Translating C code to assembly language.

In this process whenever we have to access the variable we will replace that by the offset address given in the functionsymbol table.

To convert the function C source code into assembly we need to know one thing that is GNU compilers follows a standard which says that:

"Every function consist of 4 parts which are:
1.) Function name
2.)Pre-amble (the opening brace of function)
4.)Post-amble (the closing brace of function)"

And the same is written as the following:

Function name :

Here pre-amble defines the code for the opening curly brace and
       post-amble defines the code for the closing curly brace.

Let us take a look at the both the pre-amble ,function body and post-amble small piece of code:

Pre-amble code:
  1. Pushl %ebp  
  2. Movl %esp,%ebp  
Function-body code:
  1. Subl $12, %esp  
  2. Movl $100, -12(%ebp)  
  3. Movl $100, -8(%ebp)  
  4. Movl -8(%ebp), %eax  
  5. Addl -12(%ebp) , %eax  
  6. Movl %eax, -4(%ebp)  
Post-amble code:
  1. movl %ebp, %esp  
  2. popl %ebp  
  3. popl %eip  

pre-amble and post-amble are the most important part to understand how the stack flow works in  function in c language,
what is the ebp ,esp and eip ?
ebp is the pointer to base address of the stack and
esp is the pointer to top address of the stack.
eip is the pointer to the current instruction to be executed .

Now let us take the instruction one by one and see what it means:

pushl %ebp :--> This instruction means push the content/value of the ebp to the top of the stack.
movl %esp,%ebp :--> This instruction means copy the content/value of the esp to the ebp that is now both the pointers points to the same location.
subl $12, %esp :-->This the first instruction of the function body and it does the job of allocating the space for the function on the stack by subtracting the total amount of space required for the function from the esp so that esp points to the top of the stack as it is defined to be.
you can also note that now there will be a difference of 12 bytes in the ebp and esp as per our example which is nothing but the space required by the function to allocate its local variables.

 movl $100 , -12(%ebp) :--> This instruction means that move the value 100 at the location 12 less than the location of ebp. that is if ebp is pointing to 1012 address location put the value 100 at 1000(1012-12).
popl %ebp --> This instruction means that pop out the contents at the top of the stack that is value at the address pointed by the esp pointer.

Rest the instruction meanings are self explanatory in corresponding to these instructions.

Full code:
  1. Main:  
  2. Pushl %ebp  
  3. Movl %esp,%ebp  
  4. Subl $12, %esp  
  5. Movl $100, -12(%ebp)  
  6. Movl $100, -8(%ebp)  
  7. Movl -8(%ebp), %eax  
  8. Addl -12(%ebp) , %eax  
  9. Movl %eax, -4(%ebp)  
  10. movl %ebp, %esp  
  11. popl %ebp  
  12. popl %eip  

This is how we can convert the C code of a function to the assembly source code. The main thing to understand is the pre-amble and the post-amble .