Assembler and Machine Language
Understanding machine language helps you become a better programmer because you understand how a computer really works. I have a German YouTube video where I show how to change a program after it has already been compiled.
If you don't understand German or prefer a blog format that you can use for copying and pasting, stay here. I want to explain how to write a "hello world" C program, translate it to assembler and then to machine language, find out which syscalls it makes, and then debug it. We will use Ubuntu Linux for it.
Let's start with the C program:
#include <stdio.h>
int main()
{
printf("hello world");
}
Compile it with the -S option:
gcc -S hello.c
This translates the C code into assembler code, stored in hello.s:
thorsten@ubuntu:~$ cat hello.s
.file "hello.c"
.text
.section .rodata
.LC0:
.string "hello world"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
endbr64
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
leaq .LC0(%rip), %rdi
movl $0, %eax
call printf@PLT
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
.section .note.GNU-stack,"",@progbits
.section .note.gnu.property,"a"
.align 8
.long 1f - 0f
.long 4f - 1f
.long 5
0:
.string "GNU"
1:
.align 8
.long 0xc0000002
.long 3f - 2f
2:
.long 0x3
3:
.align 8
4:
thorsten@ubuntu:~$
As you can see, the compiler puts the string "hello world" right into the file, in the .rodata section. To go all the way to machine code, use the command
gcc -o hello hello.c
This command creates a loadable ELF file. Don't use cat to show its content:
The output is gibberish. The reason is that cat sends every byte of the file to the terminal, including control characters such as line feed, carriage return, the bell (the "bing" sound a typewriter makes when you approach the end of the line), end of file and backspace. Better use a hex editor like okteta:
Here you see on the left the value of every byte in hexadecimal. Every byte is two hex digits. On the right you see the corresponding ASCII character if it is displayable; otherwise, you see a dot to show that the byte is not a displayable character. The file is in the ELF format. In its different forms, the hello world program needs:
- in the C source code (hello.c): 60 bytes
- in the Assembler source code (hello.s): 670 bytes
- in the executable ELF format (hello): about 16 KB
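To make the hex editor view described above more concrete, here is a minimal sketch of how one line of such a view could be produced. The function name hex_line and the limit of 16 bytes per line are my own assumptions for illustration; this is not how okteta is implemented.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats up to 16 bytes the way a hex editor shows
 * them, with two hex digits per byte on the left and the printable ASCII
 * characters on the right. Non-printable bytes are shown as a dot.
 * Returns a pointer to a static buffer, so it is not thread-safe. */
const char *hex_line(const unsigned char *bytes, size_t count)
{
    static char line[16 * 3 + 2 + 16 + 1];
    size_t pos = 0;

    if (count > 16)
        count = 16;

    /* Left column: every byte as two hex digits followed by a space. */
    for (size_t i = 0; i < count; i++)
        pos += (size_t)snprintf(line + pos, sizeof line - pos, "%02x ", bytes[i]);

    /* One extra space separates the two columns. */
    line[pos++] = ' ';

    /* Right column: the character itself, or '.' if it is not printable. */
    for (size_t i = 0; i < count; i++)
        line[pos++] = isprint(bytes[i]) ? (char)bytes[i] : '.';

    line[pos] = '\0';
    return line;
}
```

Every ELF file starts with the four bytes 7f 45 4c 46. Passing these to hex_line gives the hex values on the left and ".ELF" on the right, because 7f is not a displayable character.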
Now let's execute the file and watch which system calls it makes, using the command strace. These are the last lines of the output:
mprotect(0x7f6900439000, 4096, PROT_READ) = 0
munmap(0x7f69003f7000, 85571) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
brk(NULL) = 0x55619dfa7000
brk(0x55619dfc8000) = 0x55619dfc8000
write(1, "hello world", 11hello world) = 11
exit_group(0) = ?
+++ exited with 0 +++
You don't need to understand all of it. What it shows is that the operating system's loader becomes active, loads a lot of libraries into memory, maps them, and protects them with mprotect. Then there is one syscall, write, that corresponds to our printf.
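To see the connection between printf and write more directly, you can call write yourself. The following is only a sketch: the function name write_greeting is made up for this example, and it uses the POSIX write wrapper from unistd.h instead of stdio.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: sends the greeting to a file descriptor with the
 * write() wrapper directly, without going through printf and stdio.
 * Returns the number of bytes written, or -1 on error. */
ssize_t write_greeting(int fd)
{
    const char *text = "hello world";

    /* File descriptor 1 is standard output; strlen(text) is 11 here. */
    return write(fd, text, strlen(text));
}
```

If you call write_greeting(1) from main and run the result under strace, I would expect a write(1, "hello world", 11) line like the one above.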
Now, how do we debug this? We use the GNU Debugger gdb:
thorsten@ubuntu:~$ gdb hello
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from hello...
(No debugging symbols found in hello)
OK, first we set a breakpoint at the beginning of the main procedure:
(gdb) break main
Breakpoint 1 at 0x1149
Now we start the program:
(gdb) run
Starting program: /home/thorsten/hello
Breakpoint 1, 0x0000555555555149 in main ()
As you can see above, it stopped at the breakpoint. Now let's look around and see what we have there:
(gdb) disassemble
Dump of assembler code for function main:
=> 0x0000555555555149 <+0>: endbr64
0x000055555555514d <+4>: push %rbp
0x000055555555514e <+5>: mov %rsp,%rbp
0x0000555555555151 <+8>: lea 0xeac(%rip),%rdi # 0x555555556004
0x0000555555555158 <+15>: mov $0x0,%eax
0x000055555555515d <+20>: callq 0x555555555050 <printf@plt>
0x0000555555555162 <+25>: mov $0x0,%eax
0x0000555555555167 <+30>: pop %rbp
0x0000555555555168 <+31>: retq
End of assembler dump.
As you can see, this is exactly the sequence of assembler instructions that we already saw in the .s file. The lea instruction loads the address 0x555555556004 into the processor register rdi (%rdi); gdb shows the computed address in the comment after the #. This seems to be the memory address where "hello world" has been put. Then the program calls printf through the entry printf@plt at the memory address 0x555555555050.
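You can check by hand how gdb arrives at 0x555555556004. With %rip-relative addressing, the displacement (here 0xeac) is added to the address of the next instruction (here 0x555555555158, the mov after the lea). A small sketch of this arithmetic follows; the function name rip_relative_target is my own.

```c
#include <stdint.h>

/* RIP-relative addressing: the target is the address of the NEXT
 * instruction plus the signed 32-bit displacement stored in the
 * instruction. The function name is hypothetical. */
uint64_t rip_relative_target(uint64_t next_instruction, int32_t displacement)
{
    /* Sign-extend the displacement to 64 bits before adding it. */
    return next_instruction + (uint64_t)(int64_t)displacement;
}
```

With the values from the disassembly, rip_relative_target(0x555555555158, 0xeac) yields 0x555555556004, the address shown in the gdb comment.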
Let's now check in the debugger whether we can really find "hello world" at this memory address ending in 6004. To eXamine a RAM address, we use the debugger command x:
(gdb) x 0x555555556004
0x555555556004: 0x6c6c6568
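By default, x prints a 4-byte word as one number. The x86-64 processor is little-endian, so the byte at the lowest address (here 68) is the least significant one and appears at the right end of the number. Read from right to left, the bytes in memory order are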
68 65 6c 6c
Now guess what, 68 (hexadecimal) is the ASCII code for h, 65 for e, and 6c for l. So we are looking at the four bytes
hell
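The byte reordering can also be done in a few lines of C. This is a minimal sketch; the function name word_as_text is an assumption of mine, and it takes the 32-bit number exactly as gdb prints it.

```c
#include <stdint.h>

/* Hypothetical helper: turns a 32-bit word, as printed by gdb's x command
 * on a little-endian machine, back into its four bytes in memory order.
 * Returns a pointer to a static buffer, so it is not thread-safe. */
const char *word_as_text(uint32_t word)
{
    static char text[5];

    /* The least significant byte sits at the lowest address, so it comes first. */
    for (int i = 0; i < 4; i++)
        text[i] = (char)((word >> (8 * i)) & 0xff);

    text[4] = '\0';
    return text;
}
```

Calling word_as_text(0x6c6c6568) returns the string "hell". The shifts work on the numeric value, so the function gives the same result no matter which byte order the machine running it uses.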
So, let's look at the adjacent bytes in memory:
(gdb) x 0x555555556005
0x555555556005: 0x6f6c6c65
(gdb) x 0x555555556009
0x555555556009: 0x726f7720
(gdb)
Decoded the same way, these two words read "ello" and " wor". It is indeed hello world. So, let's tell gdb to continue running our program:
(gdb) continue
Continuing.
hello world[Inferior 1 (process 2887) exited normally]
(gdb)
So, indeed, it outputs "hello world". Congratulations to ourselves: we have written a "hello world" C program, translated it to assembler using the command gcc -S, translated it to machine language using the command gcc, looked at it using okteta, traced it using strace, and debugged it using gdb. Understanding how a computer works is fun.