by Carl Burch, Hendrix College, October 2012
To make large programs more manageable, programmers break programs into smaller pieces. In C and Python, these pieces are called functions; in Java, they are called methods; and in assembly language, they are called subroutines. We'll now turn to seeing how to write subroutines for ARM's ISA. Through learning about subroutines, we'll also learn more about how methods/functions in higher-level languages are actually executed.
We've seen already that each processor has a register called its
program counter to track the address of the instruction it is
about to execute. The ARM processor uses R15 for this purpose. In fact,
assembly programs refer to
as
R15
instead, though the two are synonymous.PC
In invoking a subroutine, a program must store where the processor
should return after completing the subroutine. The ARM processor uses
R14 for this purpose; it is called the link register,
and is usually referenced as
in programs.LR
Below is an example illustrating how this works, using our
strcpy
example from Section 3.2.
By the end of this current section, we'll see that one
uses the BL
opcode rather than the two-instruction sequence
used below, but first we'll look at how calling a subroutine works without
using that special-purpose instruction.
; Note: This way of calling subroutines is stylistically poor.
ADD LR, PC, #after ; place into LR the address to return to
B strcpy
after ; main code continues its work here after subroutine completes
strcpy LDRB R2, [R1], #1
STRB R2, [R0], #1
TST R2, R2 ; repeat if R2 is nonzero
BNE strcpy
MOV PC, LR ; return back into the code calling strcpy
This fragment first loads into
LR
the address of the instruction
following the call to strcpy
, labeled after
in the
above program. In the next instruction, it branches into
strcpy
, and the subroutines starts its work. Once complete,
the subroutine copies LR
back into PC
.
Since PC
is used to determine which instruction should
be loaded and executed next, changing its value leads the processor
to execute its next instruction at the after
label.
The process of calling subroutines is so common that the ARM
designers felt that the two-instruction process in the above
illustration was too cumbersome. So they created a new opcode
explicitly for this purposes,
called BL
for Branch and Link.
Thus, rather than the first two instructions above,
we'd write simply
.
This instruction loads into BL strcpy
LR
the address of the
instruction following the BL
instruction, and it sets
PC
so that the next instruction executed is the one at
strcpy
.
Often our subroutines will themselves call subroutines. But that brings up a problem: How should we avoid the situation where one subroutine needs the values of some registers, but another subroutine that it calls will change those same registers?
The natural solution is to save registers' values into memory before the called subroutine does its work, and then to restore the values from memory once that subroutine has completed. In fact, we'll do this with something called the program stack. When a subroutine starts, it will allocate a new block of memory on the top of the stack; and when it returns, the subroutine will release the block from the top, restoring the stack to its older state.
We implement the stack using a a register called the
stack pointer, which holds the address of where the top
value in the stack can be found.
In the ARM processor, R12
is conventionally
used for the stack pointer, which we can reference this
as
.SP
The stack pointer will start at a large address, but as we
add things into the stack, the stack pointer will decrease;
thus, SP
points to the top of the stack,
while SP
+ 4 is the address of the next-to-top value,
while SP
+ 8 points to the third-from-top value,
and so on.
For the ARM, we'll be able to push several things onto the
stack using the STMDB
instruction; for example, the
instruction
decreases STMDB SP!, {R4,R5}
SP
by 8, effectively growing the stack by 8
bytes, and R4
and R5
are saved into those 8 bytes.
When we want to remove values from the top of the stack, we can
use the LDMIA
instruction, as with
.STMIA SP!, {R4,R5}
An example of such a subroutine is below. This subroutine, an adaptation of our earlier fragment for adding the numbers of an array, needs two additional registers beyond the registers containing the parameters. Since any code that calls this subroutine may not be able to afford those registers, we instead opt to write the subroutine so that it saves both registers onto the stack as soon as it is called; and just before returning, it restores the registers to their previous values.
; sumArray: Places sum of entries in array into R0. On entry, R0
; should be address of first array element, R1 should be array length.
sumArray STMDB SP!, { R4, R5 } ; push R4 and R5 onto stack
MOV R4, #0
sumLoop LDR R5, [R0], #1
ADD R4, R4, R5
SUBS R1, R1, #1
BNE sumLoop
MOV R0, R4
LDMIA SP!, { R4, R5 } ; restore R4 and R5 from stack
MOV PC, LR ; return back to after sumArray call
One of the most common registers a subroutine will want to save is
the link register, since the subroutine will often want to modify the
link register itself as it calls other subroutines. There is a handy
trick involving this: When we restore the registers at the subroutine's
end, we can easily restore the link register LR
's saved value
into the program counter PC
instead. As a result, we won't need
the
instruction.MOV PC, LR
subName STMDB SP!, { R4-R5,LR }
; code within subroutine goes here, with perhaps some calls
; to other subroutines (thus changing LR)
LDMIA SP!, { R4-R5,PC } ; loading into PC returns out of subroutine
To write large assembly programs, we need a standard system for passing parameters, returning values, and allocating registers between subroutines. After all, if each subroutine created its own system of using registers, things would quickly get very confusing as we try to remember each subroutine's system. A standard system is called a calling convention, and it's important enough to have this convention that ISA designers typically define one even though it is not technically something that is built into a manufactured processor.
For the ARM processor, we'll follow the standard calling convention
that parameters are passed by placing the parameter values into
registers R0 through R3 before calling the subroutine, and a
subroutine
returns a value by placing it into R0 before returning.
In the rare situation that a subroutine wants more than four parameters,
we'd place any additional parameters onto the stack before entering the
subroutine (with the earlier parameters pushed last onto the stack, so
that the fifth parameter is on the stack's top (referenced by
SP
).
Each subroutine is allowed to alter R0 through R3 as it wishes; but if it uses R4 through R12, it must restore them to their previous values. It must also restore the stack pointer R13, effectively removing everything from the stack. It may change the link pointer R14.
The registers are thus divided into caller-save registers and callee-save registers. Caller-save registers are those that the subroutine may change, such as R0 through R3 in the ARM convention described above: They are caller-save because since a caller of a subroutine must save the registers' values if it wants the values after the subroutine completes. Callee-save registers are those that a subroutine must leave unchanged, like R4 through R12 is the convention described above: Upon being called, the subroutine (the callee) must save the registers' values if it wishes to use them.
It's beneficial for a calling convention to designate both caller-save registers and callee-save registers. If the convention designated all registers as callee-save, then subroutines would not be able to use any registers at all without saving them onto the stack first — which would be a waste, since some of the saved registers would be transient values that the calling subroutine did not care about long-term. And if the convention designated all registers as caller-save, then programmers would be forced to save many registers before every call to a subroutine and to restore them afterwards, lengthening the amount of time to call a subroutine.
Let's get a bit more practice with ARM assembly programming
by looking at some more examples. In particular, we'll look at
examples for processing a linked list of integers. In C, we'd
define this as a struct
with two fields, an int
field named value
and a struct node*
field named
next
. In assembly language, if R0 contains an address of a node
of the linked list, then the node's integer value
(node->value
)
would be at address R0 and the address of the following node
(node->next
)
would be at address R0 + 4.
We'll begin by writing a subroutine for adding up all the values in a linked list. In C, we'd write a function to do this as follows.
int sumList(struct node *head) {
struct node *cur = head;
int sum = 0;
while (cur != NULL) {
sum += cur->value;
cur = cur->next;
}
return sum;
}
Following is a translation of this into ARM assembly language.
sumList MOV R3, R0 ; R3 holds "cur"
MOV R0, #0 ; R0 holds "sum"
sumList_loop TST R3, R3
MOVEQ PC, LR ; if cur is NULL, return: R0 already holds sum
LDR R2, [R3] ; load cur->value into R2, add it to R0
ADD R0, R0, R2
LDR R3, [R3, #4] ; update cur to be cur->next
B sumList_loop
In this translation, we were somewhat clever in deciding
conveniently to use R0 for holding the accumulated sum: By the
time we're ready to return from the subroutine, the return value
is already in R0, so we simply need to return using the
MOVEQ
instruction.
Now let's look at a subroutine that adds a new node at the front of a linked list. The C function is below.
struct node* addFront(struct node *head, int val) {
struct node *n = (struct node*) malloc(sizeof(struct node));
n->value = val;
n->next = head;
return n;
}
This will be harder to translate into assembly language
since the addFront
subroutine will need to call another
subroutine (malloc
). What's more, the value of the R0,
R1, and LR registers will need to be saved before we enter
malloc
, since these are all caller-save registers
and addFront
will need their values after
malloc
returns. (We need to save LR
since the
instruction
modifies it.)BL malloc
In the below translation,
we end up pushing these registers onto the stack and later
popping these values into different registers;
thus, the head
argument originally in R0 ends up in R2,
and the val
argument originally in R1 ends up in R3.
We do this because malloc
places its return value into
R0, and we wish to keep it in R0 for the return value of
addFront
.
addFront STMDB SP!, {R0, R1, LR} ; push R0, R1, LR onto stack
MOV R0, #8 ; call malloc(8)
BL malloc
LDMIA SP!, {R2, R3, LR} ; pop stack into R2, R3, LR
STR R3, [R0] ; n->value = val (pushed R1 popped into R3)
STR R2, [R0, #4] ; n->next = head (pushed R0 popped into R2)
MOV PC, LR ; return from subroutine
Finally, an example that's complex enough to involve some real manipulation of the stack: Removing a number from the linked list.
struct node *remove(struct node *head, int val) {
struct node *cur = head;
struct node *prev = NULL;
while (cur != NULL) {
if (cur->value == val) {
if (cur == head) {
head = cur->next;
} else {
prev->next = cur->next;
}
free(cur);
return head;
}
prev = cur;
cur = cur->next;
}
return head;
}
Our implementation ends up needing more than just the
registers R0 through R3, so we'll need to use a callee-save
register. In our implementation below, you can see that the
callee-save register R4 is
saved onto the stack initially and restored when we return from
the subroutine. It also saves R0 onto the stack, since the value
of the head
parameter is needed following the free
subroutine. However, in the innermost if
statement, we
must modify head
; we do this by storing the intended value into
the stack so that the new value is restored after free
completes.
remove STMDB SP!, {R0, R4, LR} ; Save "head", R4, LR onto stack.
MOV R2, R0 ; Use R2 for "cur".
MOV R3, R0 ; Use R3 for "prev".
remove_loop TST R2, R2 ; If cur == NULL, return with R0 unchanged.
LDMEQIA SP!, {R0, R4, PC}
LDR R4, [R2] ; Load cur->value,
CMP R4, R1 ; compare to "val" argument,
BEQ remove_find ; and go to remove_find if they match.
MOV R3, R2 ; Otherwise prev = cur,
LDR R2, [R2, #4] ; cur = cur->next,
B remove_loop ; and repeat.
remove_find LDR R4, [R2, #4] ; Load cur->next into R4.
CMP R2, R0 ; Check whether cur == head:
STREQ R4, [SP] ; if so, replace "head" on stack with R4;
STRNE R4, [R3, #4] ; if not, save R4 into prev->next.
MOV R0, R2 ; Now free(cur).
BL free
LDMIA SP!, {R0, R4, PC} ; Return "head" saved on stack.
The ARM processor has many more details than we have been able to discuss here. Notably missing from our discussion is any interaction with input or output devices. But one commonly writes subroutines for this (usually included as part of an operating system) without needing to think about this interaction again. But we've now seen all the pieces we need for writing computational code in the ARM assembly language.