Saturday, February 21, 2026
Modern CPUs don't love reading data from arbitrary addresses. They prefer data to sit at addresses that are multiples of the data's size. This is alignment.
int x; // int is 4 bytes, wants to live at address divisible by 4
double d; // double is 8 bytes, wants address divisible by 8
So address 0x1000 is fine for a double. Address 0x1003 is not.
What happens on a misaligned access depends on the CPU: x86 generally tolerates it with a performance penalty, while some ARM and embedded cores raise a hardware fault. So on our machine it might seem fine, but the same code could break on mobile/embedded. And "might be slower" is already bad news for an allocator.
Let's say we have a very dumb but just-works allocator:
void *bump_alloc(size_t size) {
    void *ptr = sbrk(size); // ask the OS for 'size' more bytes
    if (ptr == (void *)-1) return NULL; // sbrk returns (void *)-1 on error
    return ptr;
}
What does sbrk do here? The OS maintains a pointer, the program break, which marks the end of the heap segment and the beginning of unallocated memory. When a program needs more memory, sbrk() is called with a positive integer and the kernel moves the program break upward by the requested number of bytes. The newly allocated space is contiguous (in virtual memory, of course).
sbrk() returns the previous value of the program break, which is a pointer to the start of the newly allocated, usable memory block.
Say we now call
bump_alloc(1);
Let's say the OS gave us 0x5000; the break moves to 0x5001. On the next call we want to store a double, so we call
bump_alloc(8); // next call returns 0x5001 ← misaligned for double
0x5001 isn't divisible by 8; the next pointer is off by 1, which breaks alignment for anything that needs 4- or 8-byte alignment.
This is largely because of how the CPU fetches memory internally: it reads aligned to its bus width, e.g. 8 bytes at a time, always starting at multiples of 8.
If our int starts at 0x1000, the CPU fetches 0x1000-0x1007 in one go. But say it starts at 0x1006: the 4-byte int now spans two 8-byte chunks (0x1000-0x1007 and 0x1008-0x100F), so depending on the CPU it might
fetch 0x1000-0x1007 → get bytes 1-2 of your int
fetch 0x1008-0x100F → get bytes 3-4 of your int
stitch them together → extra work
Two fetches instead of one. On strict architectures it just refuses and crashes.
#define ALIGN 16
size_t align_up(size_t size) {
    return (size + ALIGN - 1) & ~(ALIGN - 1);
}
This rounds whatever size we pass up to the next multiple of 16 (so 1 → 16, 17 → 32, and so on). Note the bit trick only works when ALIGN is a power of two.
And the allocator becomes:
void *bump_alloc(size_t size) {
    size = align_up(size);
    void *ptr = sbrk(size);
    if (ptr == (void *)-1) return NULL;
    return ptr;
}
#include <stdio.h>

int main() {
    char *a = bump_alloc(1);   // wastes 15 bytes, but the pointer is safe
    double *b = bump_alloc(8); // guaranteed aligned
    *b = 3.14;
    printf("double: %f\n", *b);
    printf("addresses: %p %p\n", (void *)a, (void *)b); // %p prints the address in hex -> they should differ by 16
    return 0;
}
Now this "wasted" space becomes internal fragmentation. Many allocators, like jemalloc, maintain separate pools for common sizes instead of one pool, e.g.:
8 byte pool → for allocations 1-8 bytes
16 byte pool → for allocations 9-16 bytes
32 byte pool → for allocations 17-32 bytes
64 byte pool → ...
Of course this barely scratches the surface of something as big, complex, and amazing as allocators: handling free(), for example, or dealing with external fragmentation after many malloc/free cycles, where large allocations no longer fit even though the total free space would be enough.
You can check out the code used here at the repo.