How Python's Memory Management Works Under the Hood
Introduction
Every time you create a Python object — a string, a list, an integer, a class instance — Python allocates memory for it. When you are done with it, that memory is freed. Most of the time, this happens invisibly.
But Python’s memory management is not just “call malloc, call free.” CPython has a sophisticated multi-layer memory architecture designed to make small object allocation fast and to minimize the overhead of talking to the operating system.
This article covers:
- How CPython requests memory from the OS
- The three-tier allocator architecture
- The small object allocator (pymalloc)
- Memory pools and arenas
- Object-specific allocators
- How memory grows and shrinks
- How to measure and profile memory usage
This is one of the deepest topics in Python internals. Understanding it explains why Python uses more memory than C programs, why memory does not always return to the OS immediately, and how to diagnose memory growth in long-running applications.
All examples are tested on CPython 3.12.
The Three-Layer Architecture
CPython’s memory management has three distinct layers, each building on the one below:
┌─────────────────────────────────────┐
│  Object-Specific Allocators         │  Layer 3
│  (int cache, string interning...)   │
├─────────────────────────────────────┤
│  pymalloc (Small Object Allocator)  │  Layer 2
│  (objects ≤ 512 bytes)              │
├─────────────────────────────────────┤
│  General Allocator (malloc/free)    │  Layer 1
│  (OS memory via C runtime)          │
└─────────────────────────────────────┘
Most Python objects are small. The small object allocator (pymalloc) handles them efficiently without touching the OS on every allocation. Only large objects and the allocator’s own bookkeeping go through the general allocator.
Layer 1: The General Allocator
At the bottom, CPython uses the C runtime’s malloc and free to request and release raw memory from the operating system. This is the same mechanism used by C programs.
malloc is general-purpose and carries overhead: it tracks allocated blocks, deals with fragmentation, and may need system calls to grow the heap. For a language that creates thousands of small objects per second (like Python), calling malloc and free directly for every object would be prohibitively slow.
CPython wraps the general allocator with PyMem_RawMalloc, PyMem_RawFree, and related functions. These can be overridden with a custom allocator, which is how memory profiling tools like tracemalloc work.
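You rarely call these C-level functions yourself, but the interpreter exposes a coarse view of its allocation activity from Python. One simple probe is sys.getallocatedblocks(), which returns how many memory blocks the interpreter currently has allocated; a small sketch:
import sys
# Count interpreter-allocated memory blocks before and after creating
# 10,000 objects, then after releasing them again.
before = sys.getallocatedblocks()
data = [object() for _ in range(10_000)]
after = sys.getallocatedblocks()
print(f"Blocks before: {before}")
print(f"Blocks after:  {after} (+{after - before})")
del data
print(f"Blocks after del: {sys.getallocatedblocks()}")
The exact numbers depend on your interpreter version and on what has already run, but the counter is a convenient leak detector in tests.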
Layer 2: pymalloc — The Small Object Allocator
For objects 512 bytes or smaller (the vast majority of Python objects), CPython uses its own allocator called pymalloc. The design is built around three hierarchical structures: arenas, pools, and blocks.
Blocks
A block is the smallest unit of allocation. Each block is a fixed size — 8, 16, 24, … up to 512 bytes in multiples of 8. When Python needs to allocate 30 bytes, it rounds up to 32 bytes and gives you a 32-byte block.
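As a rough illustration of that rounding rule (the real logic lives in C in Objects/obmalloc.c and uses bit shifts rather than arithmetic like this), a request maps onto a size class as follows:
def size_class(nbytes):
    # Illustrative only: round a request up to the next multiple of 8,
    # mirroring pymalloc's size classes (8, 16, 24, ..., 512).
    return ((nbytes + 7) // 8) * 8

for request in (1, 8, 30, 100, 512):
    print(request, "->", size_class(request), "byte block")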
Pools
A pool is a 4KB region of memory that holds blocks of a single size. If a pool holds 32-byte blocks, it contains roughly 4096 / 32 = 128 blocks (slightly fewer in practice, since each pool starts with a small header).
Pools are the key to fast allocation. To allocate a 32-byte object, Python just needs to find a pool of 32-byte blocks with a free slot. This is much faster than calling malloc with its general-purpose overhead.
Each pool has three states:
- used — partially allocated (some blocks free, some in use)
- full — all blocks are in use
- empty — no blocks are in use (returned to its arena)
import sys
# Small objects — handled by pymalloc
x = 42
y = "hello"
z = [1, 2, 3]
print(sys.getsizeof(x)) # int size
print(sys.getsizeof(y)) # str size
print(sys.getsizeof(z)) # list size (not including elements)
Expected output:
28
54
88
Arenas
An arena is a 256KB region of memory requested from the OS via malloc. Each arena is divided into 64 pools of 4KB each. CPython requests new arenas as needed and maintains a list of arenas sorted by how full they are, preferring to allocate from the most-full arenas to maximize the chance of returning empty arenas to the OS.
This three-tier structure (arenas → pools → blocks) is the core of why Python’s small object allocation is fast:
- Finding a free block in a used pool is O(1)
- Pools avoid per-object malloc calls
- Arenas batch OS memory requests into 256KB chunks
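If you want to see these structures on your own interpreter, CPython ships a debugging helper, sys._debugmallocstats(), which dumps per-size-class counts of blocks, pools, and arenas to stderr (it is CPython-specific and its output format is an implementation detail):
import sys
# Force some small allocations, then dump pymalloc's internal
# statistics (size classes, pools, arenas) to stderr.
junk = [str(i) for i in range(10_000)]
sys._debugmallocstats()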
Layer 3: Object-Specific Allocators
On top of pymalloc, CPython has specialized allocators for frequently created types.
Integer Caching
CPython pre-allocates integers from -5 to 256 and reuses them. Creating x = 42 does not allocate new memory — it returns a pointer to the pre-existing integer 42 object:
a = 100
b = 100
print(a is b) # True — same object
a = 1000
b = 1000
print(a is b) # False — different objects (outside cache range)
Expected output:
True
False
This optimization saves significant memory in programs that use small integers extensively — which is nearly every program.
String Interning
Python automatically interns (reuses) string objects that look like identifiers:
a = "hello"
b = "hello"
print(a is b) # True — interned
a = "hello world"
b = "hello world"
print(a is b) # Might be False — not automatically interned (contains a space)
You can explicitly intern strings using sys.intern():
import sys
a = sys.intern("hello world")
b = sys.intern("hello world")
print(a is b) # True
Interning is useful when the same string is compared many times — identity comparison (is) is faster than equality comparison (==).
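Interning also pays off purely as a memory saving when the same value keeps being built at run time (for example, tokens parsed from a file). A small sketch:
import sys
# Strings built at run time are not automatically interned, so equal
# values can be distinct objects; sys.intern() collapses them to one.
words = [("tag" + str(i % 3)) for i in range(9)]
print(len({id(w) for w in words}))        # 9 distinct string objects
interned = [sys.intern(w) for w in words]
print(len({id(w) for w in interned}))     # 3 distinct string objects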
The None, True, False Singletons
None, True, and False each exist as exactly one object in the entire interpreter. Every use of None refers to the same object:
a = None
b = None
print(a is b) # Always True
print(id(a) == id(b)) # Always True
Expected output:
True
True
Memory Growth and Fragmentation
Why Python Uses More Memory Than Expected
Python objects have overhead beyond the data they store. Every Python object contains:
- A reference count (8 bytes on 64-bit systems)
- A pointer to the type object (8 bytes)
- The actual data
A Python integer takes 28 bytes, even though the value itself fits in 8 bytes or less. A Python list of 1000 integers uses much more memory than 1000 integers would in C.
import sys
# Memory overhead of different objects
print(sys.getsizeof(0)) # int
print(sys.getsizeof([])) # empty list
print(sys.getsizeof({})) # empty dict
print(sys.getsizeof("")) # empty string
print(sys.getsizeof(())) # empty tuple
Expected output:
28
56
64
49
40
Why Memory Does Not Always Return to the OS
When a Python object is freed, its memory goes back to pymalloc’s pool, not directly to the OS. A pool is only returned to its arena when all blocks in it are free. An arena is only returned to the OS when all its pools are empty.
This means Python’s memory footprint tends to grow over time, especially if you allocate many objects and then free most of them — the remaining objects prevent pools from becoming empty, which prevents arenas from being returned to the OS.
import tracemalloc
tracemalloc.start()
# Create a large list
data = list(range(1_000_000))
snapshot1 = tracemalloc.take_snapshot()
# Delete it
del data
snapshot2 = tracemalloc.take_snapshot()
size_with_list = sum(stat.size for stat in snapshot1.statistics("lineno"))
size_after_del = sum(stat.size for stat in snapshot2.statistics("lineno"))
print(f"Traced memory with list: {size_with_list / 1024 / 1024:.1f} MB")
print(f"Traced memory after del: {size_after_del / 1024 / 1024:.1f} MB")
# tracemalloc shows the Python-level allocations disappearing, but the
# process's resident memory may not shrink by the same amount.
List Over-Allocation
Python lists over-allocate to make append() amortized O(1):
import sys
lst = []
for i in range(10):
    lst.append(i)
    allocated = sys.getsizeof(lst)
    print(f"len={len(lst)}, size={allocated} bytes")
Expected output:
len=1, size=88 bytes
len=2, size=88 bytes
len=3, size=88 bytes
len=4, size=88 bytes
len=5, size=120 bytes
len=6, size=120 bytes
...
The list does not resize on every append — it allocates extra capacity in chunks (roughly a 1.125x growth factor). This is why sys.getsizeof() on a list can report more space than its current elements strictly need.
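The exact growth rule lives in Objects/listobject.c and has changed slightly between CPython versions; the sketch below mirrors its general shape (grow by about one eighth plus a small constant) rather than any specific release:
def approx_new_allocated(newsize):
    # Simplified model of CPython's list over-allocation: about 1/8
    # extra capacity plus a small constant, rounded down to a multiple of 4.
    return (newsize + (newsize >> 3) + 6) & ~3

for n in (1, 5, 9, 17, 100, 1000):
    print(n, "->", approx_new_allocated(n))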
Measuring Memory Usage
sys.getsizeof
Returns the size of a single object in bytes, not counting the objects it references:
import sys
data = [1, 2, 3, 4, 5]
print(sys.getsizeof(data)) # size of the list object itself
print(sys.getsizeof(data) +
      sum(sys.getsizeof(x) for x in data)) # list + elements
Expected output:
104
244
For containers, sys.getsizeof only measures the container, not its contents. To measure the total size of a nested structure, you need a recursive approach.
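One possible recursive sketch is below. It only follows the common built-in containers; a production version would also need to walk instance attributes (__dict__, __slots__) and guard against exotic types:
import sys

def total_size(obj, seen=None):
    # Recursively sum sys.getsizeof over an object and everything it
    # references, counting each object at most once.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

data = {"a": [1, 2, 3], "b": ("x", "y")}
print(total_size(data))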
tracemalloc — Built-in Memory Profiler
import tracemalloc
tracemalloc.start()
# Code you want to profile
result = [x ** 2 for x in range(100_000)]
current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024:.1f} KB")
print(f"Peak: {peak / 1024:.1f} KB")
tracemalloc.stop()
Expected output:
Current: 3516.2 KB
Peak: 3516.2 KB
Finding the Biggest Memory Users
import tracemalloc
tracemalloc.start()
# Your code here
data = {str(i): list(range(100)) for i in range(1000)}
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
print("Top 5 memory consumers:")
for stat in top_stats[:5]:
    print(stat)
tracemalloc is the right tool for finding memory leaks and understanding where memory is being allocated in your program.
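For leak hunting specifically, comparing two snapshots is usually more informative than inspecting a single one. Snapshot.compare_to() reports which lines allocated the most new memory between the two points; a small sketch with a simulated leak:
import tracemalloc
tracemalloc.start()
snapshot_before = tracemalloc.take_snapshot()
leaky = []
for i in range(100_000):
    leaky.append("item " + str(i))  # simulated leak: the list only grows
snapshot_after = tracemalloc.take_snapshot()
# Show the lines that allocated the most new memory between snapshots
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:3]:
    print(stat)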
Process-Level Memory
To see the total memory used by the Python process:
import os
import psutil # pip install psutil
process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Process memory: {memory_mb:.1f} MB")
Practical Implications
Use Generators Instead of Lists for Large Data
# Loads all data into memory at once
# (process() and large_dataset stand in for your own function and data source)
data = [process(x) for x in large_dataset]
# Generates one item at a time — much less memory
data = (process(x) for x in large_dataset)
Generators, covered in the list comprehensions guide, are the primary tool for memory-efficient iteration over large datasets.
Use __slots__ for Memory-Efficient Classes
By default, Python stores instance attributes in a dictionary (__dict__), which has overhead:
class PointWithDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointWithSlots:
    __slots__ = ["x", "y"]
    def __init__(self, x, y):
        self.x = x
        self.y = y
import sys
p1 = PointWithDict(1, 2)
p2 = PointWithSlots(1, 2)
print(f"With __dict__: {sys.getsizeof(p1) + sys.getsizeof(p1.__dict__)} bytes")
print(f"With __slots__: {sys.getsizeof(p2)} bytes")
Expected output:
With __dict__: 232 bytes
With __slots__: 56 bytes
__slots__ replaces the instance dictionary with a fixed-size array of slots, saving roughly 4x memory per instance. This matters when you create millions of instances of the same class.
Use array or numpy for Numeric Data
A Python list of 1 million integers needs roughly 8MB just for its array of pointers, before counting the integer objects those pointers refer to. A typed array (the array module or a numpy array) stores raw machine values at 4 or 8 bytes per element and skips the per-object overhead entirely:
import sys
import array
# Python list — each element is a pointer to a full Python int object
python_list = list(range(1_000_000))
print(f"Python list: {sys.getsizeof(python_list) / 1024 / 1024:.1f} MB")
# array module — C-level typed array
typed_array = array.array("i", range(1_000_000))
print(f"array.array: {sys.getsizeof(typed_array) / 1024 / 1024:.1f} MB")
Expected output:
Python list: 8.0 MB
array.array: 3.8 MB
For numerical computation, numpy arrays are even more efficient and unlock vectorized operations.
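If numpy is installed, the same comparison looks like this; nbytes reports the size of the raw data buffer, excluding the small array header:
import numpy as np  # pip install numpy
arr = np.arange(1_000_000, dtype=np.int32)
print(f"numpy array: {arr.nbytes / 1024 / 1024:.1f} MB")  # about 3.8 MB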
Quick Reference
| Concept | Description |
|---|---|
| pymalloc | CPython’s small object allocator (≤512 bytes) |
| Block | Smallest allocation unit, fixed size (8 to 512 bytes) |
| Pool | 4KB region holding blocks of one size |
| Arena | 256KB region divided into 64 pools |
| Integer cache | Ints -5 to 256 are pre-allocated and reused |
| String interning | Identifier-like strings are reused automatically |
| sys.getsizeof(obj) | Size of one object (not its contents) |
| tracemalloc | Built-in memory profiler |
| __slots__ | Replace instance __dict__ with fixed-size slots |
| Generator expressions | Iterate without loading everything into memory |
Wrap-Up
Python’s memory management is a carefully designed system that trades raw memory efficiency for programmer productivity and allocation speed. The three-layer architecture — OS allocator, pymalloc, and object-specific allocators — makes small object creation fast. Integer caching, string interning, and singletons reduce redundant allocations. The garbage collector handles the edge cases that reference counting cannot.
Understanding this system helps you make better decisions: when to use generators instead of lists, when __slots__ is worth the complexity, and when to reach for numpy or array instead of plain Python lists.
For a full picture of how these pieces fit into Python’s execution model, see the deep dive on what happens when you run python script.py. For how Python decides when to free objects, see the garbage collector guide. For memory-efficient iteration patterns, see the list comprehensions guide. For questions or future tutorial ideas, get in touch via the Contact page.