How Python's Memory Management Works Under the Hood
Introduction
Every time you create a Python object — a string, a list, an integer, a class instance — Python allocates memory for it. When you are done with it, that memory is freed. Most of the time, this happens invisibly.
But Python’s memory management is not just “call malloc, call free.” CPython has a sophisticated multi-layer memory architecture designed to make small object allocation fast and to minimize the overhead of talking to the operating system.
This article covers:
- How CPython requests memory from the OS
- The three-tier allocator architecture
- The small object allocator (pymalloc)
- Memory pools and arenas
- Object-specific allocators
- How memory grows and shrinks
- How to measure and profile memory usage
This is one of the deepest topics in Python internals. Understanding it explains why Python uses more memory than C programs, why memory does not always return to the OS immediately, and how to diagnose memory growth in long-running applications.
All examples are tested on CPython 3.12.
The Three-Layer Architecture
CPython’s memory management has three distinct layers, each building on the one below:
┌─────────────────────────────────────┐
│  Object-Specific Allocators         │  Layer 3
│  (int cache, string interning...)   │
├─────────────────────────────────────┤
│  pymalloc (Small Object Allocator)  │  Layer 2
│  (objects ≤ 512 bytes)              │
├─────────────────────────────────────┤
│  General Allocator (malloc/free)    │  Layer 1
│  (OS memory via C runtime)          │
└─────────────────────────────────────┘
Most Python objects are small. The small object allocator (pymalloc) handles them efficiently without touching the OS on every allocation. Only large objects and the allocator’s own bookkeeping go through the general allocator.
Layer 1: The General Allocator
At the bottom, CPython uses the C runtime’s malloc and free to request and release raw memory from the operating system. This is the same mechanism used by C programs.
malloc is general-purpose and carries overhead: it tracks allocated blocks, deals with fragmentation, and may need system calls to grow the heap. For a language that creates thousands of small objects per second (like Python), calling malloc and free directly for every object would be prohibitively slow.
CPython wraps the general allocator with PyMem_RawMalloc, PyMem_RawFree, and related functions. These can be overridden with a custom allocator, which is how memory profiling tools like tracemalloc work.
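You rarely call these C-level functions yourself, but the interpreter exposes a coarse view of its allocation activity from Python. One simple probe is sys.getallocatedblocks(), which returns how many memory blocks the interpreter currently has allocated; a small sketch:
import sys
# Count interpreter-allocated memory blocks before and after creating
# 10,000 objects, then after releasing them again.
before = sys.getallocatedblocks()
data = [object() for _ in range(10_000)]
after = sys.getallocatedblocks()
print(f"Blocks before: {before}")
print(f"Blocks after:  {after} (+{after - before})")
del data
print(f"Blocks after del: {sys.getallocatedblocks()}")
The exact numbers depend on your interpreter version and on what has already run, but the counter is a convenient leak detector in tests.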
Layer 2: pymalloc — The Small Object Allocator
For objects 512 bytes or smaller (the vast majority of Python objects), CPython uses its own allocator called pymalloc. The design is built around three hierarchical structures: arenas, pools, and blocks.
Blocks
A block is the smallest unit of allocation. Each block is a fixed size — 8, 16, 24, … up to 512 bytes in multiples of 8. When Python needs to allocate 30 bytes, it rounds up to 32 bytes and gives you a 32-byte block.
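As a rough illustration of that rounding rule (the real logic lives in C in Objects/obmalloc.c and uses bit shifts rather than arithmetic like this), a request maps onto a size class as follows:
def size_class(nbytes):
    # Illustrative only: round a request up to the next multiple of 8,
    # mirroring pymalloc's size classes (8, 16, 24, ..., 512).
    return ((nbytes + 7) // 8) * 8

for request in (1, 8, 30, 100, 512):
    print(request, "->", size_class(request), "byte block")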
Pools
A pool is a 4KB region of memory that holds blocks of a single size. If a pool holds 32-byte blocks, it contains roughly 4096 / 32 = 128 blocks (slightly fewer in practice, since each pool starts with a small header).
Pools are the key to fast allocation. To allocate a 32-byte object, Python just needs to find a pool of 32-byte blocks with a free slot. This is much faster than calling malloc with its general-purpose overhead.
Each pool has three states:
- used — partially allocated (some blocks free, some in use)
- full — all blocks are in use
- empty — no blocks are in use (returned to its arena)
import sys
# Small objects — handled by pymalloc
x = 42
y = "hello"
z = [1, 2, 3]
print(sys.getsizeof(x)) # int size
print(sys.getsizeof(y)) # str size
print(sys.getsizeof(z)) # list size (not including elements)
Expected output:
28
54
88
Arenas
An arena is a 256KB region of memory requested from the OS via malloc. Each arena is divided into 64 pools of 4KB each. CPython requests new arenas as needed and maintains a list of arenas sorted by how full they are, preferring to allocate from the most-full arenas to maximize the chance of returning empty arenas to the OS.
This three-tier structure (arenas → pools → blocks) is the core of why Python’s small object allocation is fast:
- Finding a free block in a used pool is O(1)
- Pools avoid per-object malloc calls
- Arenas batch OS memory requests into 256KB chunks
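If you want to see these structures on your own interpreter, CPython ships a debugging helper, sys._debugmallocstats(), which dumps per-size-class counts of blocks, pools, and arenas to stderr (it is CPython-specific and its output format is an implementation detail):
import sys
# Force some small allocations, then dump pymalloc's internal
# statistics (size classes, pools, arenas) to stderr.
junk = [str(i) for i in range(10_000)]
sys._debugmallocstats()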
Layer 3: Object-Specific Allocators
On top of pymalloc, CPython has specialized allocators for frequently created types.
Integer Caching
CPython pre-allocates integers from -5 to 256 and reuses them. Creating x = 42 does not allocate new memory — it returns a pointer to the pre-existing integer 42 object:
a = 100
b = 100
print(a is b) # True — same object
a = 1000
b = 1000
print(a is b) # False — different objects (outside cache range)
Expected output:
True
False
This optimization saves significant memory in programs that use small integers extensively — which is nearly every program.
String Interning
Python automatically interns (reuses) string objects that look like identifiers:
a = "hello"
b = "hello"
print(a is b) # True — interned
a = "hello world"
b = "hello world"
print(a is b) # Might be False — not automatically interned (contains a space)
You can explicitly intern strings using sys.intern():
import sys
a = sys.intern("hello world")
b = sys.intern("hello world")
print(a is b) # True
Interning is useful when the same string is compared many times — identity comparison (is) is faster than equality comparison (==).
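Interning also pays off purely as a memory saving when the same value keeps being built at run time (for example, tokens parsed from a file). A small sketch:
import sys
# Strings built at run time are not automatically interned, so equal
# values can be distinct objects; sys.intern() collapses them to one.
words = [("tag" + str(i % 3)) for i in range(9)]
print(len({id(w) for w in words}))        # 9 distinct string objects
interned = [sys.intern(w) for w in words]
print(len({id(w) for w in interned}))     # 3 distinct string objects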
The None, True, False Singletons
None, True, and False each exist as exactly one object in the entire interpreter. Every use of None refers to the same object:
a = None
b = None
print(a is b) # Always True
print(id(a) == id(b)) # Always True
Expected output:
True
True
Memory Growth and Fragmentation
Why Python Uses More Memory Than Expected
Python objects have overhead beyond the data they store. Every Python object contains:
- A reference count (8 bytes on 64-bit systems)
- A pointer to the type object (8 bytes)
- The actual data
A Python integer takes 28 bytes, even though the value itself fits in 8 bytes or less. A Python list of 1000 integers uses much more memory than 1000 integers would in C.
import sys
# Memory overhead of different objects
print(sys.getsizeof(0)) # int
print(sys.getsizeof([])) # empty list
print(sys.getsizeof({})) # empty dict
print(sys.getsizeof("")) # empty string
print(sys.getsizeof(())) # empty tuple
Expected output:
28
56
64
49
40
Why Memory Does Not Always Return to the OS
When a Python object is freed, its memory goes back to pymalloc’s pool, not directly to the OS. A pool is only returned to its arena when all blocks in it are free. An arena is only returned to the OS when all its pools are empty.
This means Python’s memory footprint tends to grow over time, especially if you allocate many objects and then free most of them — the remaining objects prevent pools from becoming empty, which prevents arenas from being returned to the OS.
import tracemalloc
tracemalloc.start()
# Create a large list
data = list(range(1_000_000))
snapshot1 = tracemalloc.take_snapshot()
# Delete it
del data
snapshot2 = tracemalloc.take_snapshot()
size_with_list = sum(stat.size for stat in snapshot1.statistics("lineno"))
size_after_del = sum(stat.size for stat in snapshot2.statistics("lineno"))
print(f"Traced memory with list: {size_with_list / 1024 / 1024:.1f} MB")
print(f"Traced memory after del: {size_after_del / 1024 / 1024:.1f} MB")
# tracemalloc shows the Python-level allocations disappearing, but the
# process's resident memory may not shrink by the same amount.
List Over-Allocation
Python lists over-allocate to make append() amortized O(1):
import sys
lst = []
for i in range(10):
    lst.append(i)
    allocated = sys.getsizeof(lst)
    print(f"len={len(lst)}, size={allocated} bytes")
Expected output:
len=1, size=88 bytes
len=2, size=88 bytes
len=3, size=88 bytes
len=4, size=88 bytes
len=5, size=120 bytes
len=6, size=120 bytes
...
The list does not resize on every append — it allocates extra capacity in chunks (roughly a 1.125x growth factor). This is why sys.getsizeof() on a list can report more space than its current elements strictly need.
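The exact growth rule lives in Objects/listobject.c and has changed slightly between CPython versions; the sketch below mirrors its general shape (grow by about one eighth plus a small constant) rather than any specific release:
def approx_new_allocated(newsize):
    # Simplified model of CPython's list over-allocation: about 1/8
    # extra capacity plus a small constant, rounded down to a multiple of 4.
    return (newsize + (newsize >> 3) + 6) & ~3

for n in (1, 5, 9, 17, 100, 1000):
    print(n, "->", approx_new_allocated(n))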
Measuring Memory Usage
sys.getsizeof
Returns the size of a single object in bytes, not counting the objects it references:
import sys
data = [1, 2, 3, 4, 5]
print(sys.getsizeof(data)) # size of the list object itself
print(sys.getsizeof(data) +
      sum(sys.getsizeof(x) for x in data)) # list + elements
Expected output:
104
244
For containers, sys.getsizeof only measures the container, not its contents. To measure the total size of a nested structure, you need a recursive approach.
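One possible recursive sketch is below. It only follows the common built-in containers; a production version would also need to walk instance attributes (__dict__, __slots__) and guard against exotic types:
import sys

def total_size(obj, seen=None):
    # Recursively sum sys.getsizeof over an object and everything it
    # references, counting each object at most once.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

data = {"a": [1, 2, 3], "b": ("x", "y")}
print(total_size(data))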
tracemalloc — Built-in Memory Profiler
import tracemalloc
tracemalloc.start()
# Code you want to profile
result = [x ** 2 for x in range(100_000)]
current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024:.1f} KB")
print(f"Peak: {peak / 1024:.1f} KB")
tracemalloc.stop()
Expected output:
Current: 3516.2 KB
Peak: 3516.2 KB
Finding the Biggest Memory Users
import tracemalloc
tracemalloc.start()
# Your code here
data = {str(i): list(range(100)) for i in range(1000)}
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
print("Top 5 memory consumers:")
for stat in top_stats[:5]:
    print(stat)
tracemalloc is the right tool for finding memory leaks and understanding where memory is being allocated in your program.
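For leak hunting specifically, comparing two snapshots is usually more informative than inspecting a single one. Snapshot.compare_to() reports which lines allocated the most new memory between the two points; a small sketch with a simulated leak:
import tracemalloc
tracemalloc.start()
snapshot_before = tracemalloc.take_snapshot()
leaky = []
for i in range(100_000):
    leaky.append("item " + str(i))  # simulated leak: the list only grows
snapshot_after = tracemalloc.take_snapshot()
# Show the lines that allocated the most new memory between snapshots
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:3]:
    print(stat)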
Process-Level Memory
To see the total memory used by the Python process:
import os
import psutil # pip install psutil
process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Process memory: {memory_mb:.1f} MB")
Practical Implications
Use Generators Instead of Lists for Large Data
# Loads all data into memory at once
# (process() and large_dataset stand in for your own function and data source)
data = [process(x) for x in large_dataset]
# Generates one item at a time — much less memory
data = (process(x) for x in large_dataset)
Generators, covered in the list comprehensions guide, are the primary tool for memory-efficient iteration over large datasets.
Use __slots__ for Memory-Efficient Classes
By default, Python stores instance attributes in a dictionary (__dict__), which has overhead:
class PointWithDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointWithSlots:
    __slots__ = ["x", "y"]
    def __init__(self, x, y):
        self.x = x
        self.y = y
import sys
p1 = PointWithDict(1, 2)
p2 = PointWithSlots(1, 2)
print(f"With __dict__: {sys.getsizeof(p1) + sys.getsizeof(p1.__dict__)} bytes")
print(f"With __slots__: {sys.getsizeof(p2)} bytes")
Expected output:
With __dict__: 232 bytes
With __slots__: 56 bytes
__slots__ replaces the instance dictionary with a fixed-size array of slots, saving roughly 4x memory per instance. This matters when you create millions of instances of the same class.
Use array or numpy for Numeric Data
A Python list of 1 million integers needs roughly 8MB just for its array of pointers, before counting the integer objects those pointers refer to. A typed array (the array module or a numpy array) stores raw machine values at 4 or 8 bytes per element and skips the per-object overhead entirely:
import sys
import array
# Python list — each element is a pointer to a full Python int object
python_list = list(range(1_000_000))
print(f"Python list: {sys.getsizeof(python_list) / 1024 / 1024:.1f} MB")
# array module — C-level typed array
typed_array = array.array("i", range(1_000_000))
print(f"array.array: {sys.getsizeof(typed_array) / 1024 / 1024:.1f} MB")
Expected output:
Python list: 8.0 MB
array.array: 3.8 MB
For numerical computation, numpy arrays are even more efficient and unlock vectorized operations.
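If numpy is installed, the same comparison looks like this; nbytes reports the size of the raw data buffer, excluding the small array header:
import numpy as np  # pip install numpy
arr = np.arange(1_000_000, dtype=np.int32)
print(f"numpy array: {arr.nbytes / 1024 / 1024:.1f} MB")  # about 3.8 MB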
Quick Reference
| Concept | Description |
|---|---|
| pymalloc | CPython’s small object allocator (≤512 bytes) |
| Block | Smallest allocation unit, fixed size (8 to 512 bytes) |
| Pool | 4KB region holding blocks of one size |
| Arena | 256KB region divided into 64 pools |
| Integer cache | Ints -5 to 256 are pre-allocated and reused |
| String interning | Identifier-like strings are reused automatically |
| sys.getsizeof(obj) | Size of one object (not its contents) |
| tracemalloc | Built-in memory profiler |
| __slots__ | Replace instance __dict__ with fixed-size slots |
| Generator expressions | Iterate without loading everything into memory |
Wrap-Up
Python’s memory management is a carefully designed system that trades raw memory efficiency for programmer productivity and allocation speed. The three-layer architecture — OS allocator, pymalloc, and object-specific allocators — makes small object creation fast. Integer caching, string interning, and singletons reduce redundant allocations. The garbage collector handles the edge cases that reference counting cannot.
Understanding this system helps you make better decisions: when to use generators instead of lists, when __slots__ is worth the complexity, and when to reach for numpy or array instead of plain Python lists.
For a full picture of how these pieces fit into Python’s execution model, see the deep dive on what happens when you run python script.py. For how Python decides when to free objects, see the garbage collector guide. For memory-efficient iteration patterns, see the list comprehensions guide. For questions or future tutorial ideas, get in touch via the Contact page.