Saturday, February 21, 2026
Modern CPUs don't love reading data from arbitrary addresses. They prefer data to sit at addresses that are multiples of the data's size. This is alignment.
int x; // int is 4 bytes, wants to live at address divisible by 4
double d; // double is 8 bytes, wants address divisible by 8
So address 0x1000 is fine for a double. Address 0x1003 is not.
What happens on a misaligned access depends on the CPU: x86 generally tolerates it with a performance penalty, while some ARM and embedded cores raise a hardware fault. So on our machine it might seem fine, but the same code could break on mobile/embedded. And "might be slower" is already bad news for an allocator.
Let's say we have a very dumb but just-works allocator:
void *bump_alloc(size_t size) {
    void *ptr = sbrk(size); // ask the OS for 'size' more bytes
    if (ptr == (void *)-1) return NULL; // sbrk returns (void *)-1 on error
    return ptr;
}
What does sbrk do here? The OS maintains a pointer, the program break, which marks the end of the heap segment and the beginning of unallocated memory. When a program needs more memory, sbrk() is called with a positive integer and the kernel moves the program break upward by the requested number of bytes. The newly allocated space is contiguous (in virtual memory, of course).
sbrk() returns the previous value of the program break, which is a pointer to the start of the newly allocated, usable memory block.
Say we now call
bump_alloc(1);
Let's say the OS gave us 0x5000; the break moves to 0x5001. On the next call we want to store a double, so we call
bump_alloc(8); // next call returns 0x5001 ← misaligned for double
0x5001 isn't divisible by 8; the next pointer is off by 1, which breaks alignment for anything that needs 4- or 8-byte alignment.
This is largely because of how the CPU fetches memory internally: it reads aligned to its bus width, e.g. 8 bytes at a time, always starting at multiples of 8.
If our int starts at 0x1000, the CPU fetches 0x1000-0x1007 in one go. But say it starts at 0x1006: the 4-byte int now spans two 8-byte chunks (0x1000-0x1007 and 0x1008-0x100F), so depending on the CPU it might
fetch 0x1000-0x1007 → get bytes 1-2 of your int
fetch 0x1008-0x100F → get bytes 3-4 of your int
stitch them together → extra work
Two fetches instead of one. On strict architectures it just refuses and crashes.
#define ALIGN 16
size_t align_up(size_t size) {
    return (size + ALIGN - 1) & ~(ALIGN - 1);
}
This rounds whatever size we pass up to the next multiple of 16 (so 1 → 16, 17 → 32, and so on). Note the bit trick only works when ALIGN is a power of two.
And the allocator becomes:
void *bump_alloc(size_t size) {
    size = align_up(size);
    void *ptr = sbrk(size);
    if (ptr == (void *)-1) return NULL;
    return ptr;
}
#include <stdio.h>

int main() {
    char *a = bump_alloc(1);   // wastes 15 bytes, but the pointer is safe
    double *b = bump_alloc(8); // guaranteed aligned
    *b = 3.14;
    printf("double: %f\n", *b);
    printf("addresses: %p %p\n", (void *)a, (void *)b); // %p prints the address in hex -> they should differ by 16
    return 0;
}
Now this "wasted" space becomes internal fragmentation. Many allocators, like jemalloc, maintain separate pools for common sizes instead of one pool, e.g.:
8 byte pool → for allocations 1-8 bytes
16 byte pool → for allocations 9-16 bytes
32 byte pool → for allocations 17-32 bytes
64 byte pool → ...
Of course this barely scratches the surface of something as big, complex, and amazing as allocators: handling free(), for example, or dealing with external fragmentation after many malloc/free cycles, where large allocations no longer fit even though the total free space would be enough.
You can check out the code used here at the repo.