PostgreSQL Internals - Module 3: MVCC & Transaction Isolation Levels
Third Post in the Seris PostgreSQL Internals - MVCC & Transaction Isolation Levels
Series: PostgreSQL Internals
- 1.PostgreSQL Internals - Module 1: Process & Memory Architecture
- 2.PostgreSQL Internals - Module 2: Page Layout & Storage Internals
- 3.PostgreSQL Internals - Module 3: MVCC & Transaction Isolation Levels
In the previous blogs, we explored the concepts related to PostgreSQL Architecture and Storage layer.
Now that we know what is xmin, xmax, ctid and cid looks like on the disk, we use them to understand how the concurrency works in PostgreSQL.
This module is about the single most important question PostgreSQL asks millions of times per second:
"Is this tuple visible to this transaction?"
Part 1: What is a Transaction ID (XID)?
Every transaction in PostgreSQL gets a transaction ID, a 32-bit unsigned integer which increases monotonically.
-- Helps to see your current transaction's XID
SELECT txid_current();
-- Returns 9093
-- XID is only assigned when you actually write something
-- Read-only transactions may get a "virtual XID" instead
SELECT txid_current_if_assigned();
-- NULL - If read-only transactionThe XID Space
0 = Invalid TransactionID
1 = Bootstrap TransactionID (system init)
2 = Frozen TransactionID (always visible)
3...231-1 = Normal Transaction IDs, assigned sequentially
231 - 1 = 4,294,967,295 (~4 Billion transactions)
Note: When the transaction ID reaches 231-1 it might cause the Transaction WrapAround, which we will discuss shortly in this blog
Part 2: The Snapshot - PostgreSQL's view of NOW
When a transaction starts, PostgreSQL takes a snapshot of which transactions are currently active. And this snapshot will be used to make all the tuple visibility decisions for this current transaction.
What a Snapshot contains
typedef struct SnapshotData {
TransactionId xmin; /* oldest XID still active */
TransactionId xmax; /* first XID not yet assigned */
TransactionId *xip; /* in-progress XIDs[] between xmin & xmax */
uint32 xcnt; /* number of in-progress XIDs */
} SnapshotData;In simple terms:
xmin = any XID below this is either committed or aborted.
xmax = any XID above this is yet to start in POV of this transaction. So invisible.
xip[] = XIDs between xmin and xmax that are still in progress at the time of taking this snapshot. So these are also invisible.
-- See the current snapshot
SELECT pg_current_snapshot();
-- Returns something like: 1040:1045:1041,1043
-- xmin = 1040 (oldest active)
-- xmax = 1045 (next to be assigned)
-- xip = {1041, 1043} (currently running)Snapshot Construction - Step by step
PostgreSQL builds this snapshot by scanning the ProcArray, the shared memory array of all active backends we covered in Module 1.
Active backends at snapshot time:
XID 1040 — committed before snapshot
XID 1041 — IN PROGRESS ← goes into xip[]
XID 1042 — YOUR transaction (taking snapshot)
XID 1043 — IN PROGRESS ← goes into xip[]
XID 1044 — not started yet
Snapshot result:
xmin = 1040
xmax = 1044
xip = [1041, 1043]Part 3: Visibility Rules - Exact Algorithm
Given a tuple with xmin, xmax, and a snapshot, here is the complete visibility check PostgreSQL runs. This will be done in the HeapTupleSatisfiesMVCC() in src/backend/access/heap/heapam_visibility.c.
Algorithm
FUNCTION is_visible(tuple, snapshot):
── Check xmin (who inserted this tuple) ──
IF xmin == InvalidXID:
RETURN false ← never valid
IF xmin_committed hint bit SET:
xmin_ok = true ← fast path, no CLOG lookup
ELSE IF xmin_aborted hint bit SET:
RETURN false ← inserter aborted, tuple never existed
ELSE:
status = lookup_clog(xmin)
IF status == ABORTED:
set_hint_bit(XMIN_ABORTED)
RETURN false
IF status == COMMITTED:
set_hint_bit(XMIN_COMMITTED)
xmin_ok = true
ELSE: (IN_PROGRESS)
IF xmin == my_own_xid:
xmin_ok = true ← I can see my own inserts
ELSE:
RETURN false ← another txn's uncommitted insert
── xmin passed ───
── Now check if it's visible per (or to this) snapshot ───
IF xmin >= snapshot.xmax:
RETURN false ← started after my snapshot
IF xmin in snapshot.xip[]:
RETURN false ← was in-progress when I started
── xmin is visible ──────
── xmin is visible. Now check xmax (who deleted it) ──────
IF xmax == 0:
RETURN true ← not deleted, live tuple
IF xmax_committed hint bit SET:
xmax_visible = true
ELSE IF xmax_aborted hint bit SET:
RETURN true ← deleter aborted, tuple still live
ELSE:
status = lookup_clog(xmax)
IF status == ABORTED:
set_hint_bit(XMAX_ABORTED)
RETURN true
IF status == COMMITTED:
set_hint_bit(XMAX_COMMITTED)
xmax_visible = true
ELSE: (IN_PROGRESS)
IF xmax == my_own_xid:
RETURN false ← I deleted it, I can't see it
ELSE:
RETURN true ← another txn is deleting, not committed
── xmax committed ──
── Check if that delete is in my snapshot ──
IF xmax >= snapshot.xmax:
RETURN true ← deleter started after my snapshot
IF xmax in snapshot.xip[]:
RETURN true ← deleter was in-progress when I started
RETURN false ← deleter committed before my snapshotThis runs for every tuple a query touches. The hint bits are what make it fast — once set, CLOG is never consulted again for that tuple.
Part 4: Isolation Levels
PostgreSQL Isolation levels are implemented by controlling when the snapshot is taken in transactions.
Different Isolation Levels:
Read Committed (Default)
Repeatable Read
Serializable
READ COMMITTED
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;A new snapshot will be taken at the start of each statement. That means, even in a single transaction, different statement see different snapshot that means different data.
Example:
T1: BEGIN; (XID = 1054)
T1: SELECT balance FROM accounts WHERE id=1; --Sees $100(Snapshot A)
T2: BEGIN; (XID = 1055)
T2: UPDATE accounts SET balance=200 WHERE id=1;
T2: COMMIT; (Committed XID 1055)
T1: SELECT balance FROM accounts WHERE id=1; --Sees $200(Snapshot B)
T2: COMMIT;T1 got two different answers for the same query within one/same transaction. This is a non-repeatable read allowed under READ COMMITTED.
REPEATABLE READ
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;The snapshot is taken once at the first statement and held for the entire transaction.
Example:
T1: BEGIN; (XID = 1056)
T1: SELECT balance FROM accounts WHERE id=1;--Snapshot taken,sees $100
T2: BEGIN; (XID = 1057)
T2: UPDATE accounts SET balance=200 WHERE id=1;
T2: COMMIT;
T1: SELECT balance FROM accounts WHERE id=1; -- Still sees $100
T2: COMMIT;T1 is completely isolated from T2's commit. This prevents non-repeatable reads AND phantom reads, PostgreSQL's REPEATABLE READ is actually stronger than the SQL standard requires.
SERIALIZABLE
Uses Serializable Snapshot Isolation (SSI) tracks read/write dependencies between transactions to detect serialization anomalies and abort the "dangerous" transaction. Much deeper than just snapshot timing.
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- May raise: ERROR: could not serialize access due to concurrent updateAnamolies
Dirty Read - You read uncommitted data from the another transaction
Not possible in any transaction isolation. That is because tuple & visibility rules are designed in such a way that uncommitted tuples never visible to other transactions.
Non-Repeatable Read - Same row returns different values due to updates in another transactions
Possible in READ COMMITTED isolation
Phantom Reads - Same query returns different rows due to insert/delete in another transaction
Possible in READ COMMITTED isolation
Part 6: XID WrapAround Problem
XID is 32-bit. It wraps around after ~2 billion transactions. When it wraps, old XIDs appear to be "in the future" and PostgreSQL would think every tuple is invisible.
This would corrupt your entire database.
How PostgreSQL solves it: Freezing
VACUUM replaces old xmin with FrozenTransactionID (2).Frozen tuples are always visible to everyone and no snapshot check needed.
Normal tuple: xmin=1000000 → visible only if XID 1M is in snapshot
Frozen tuple: xmin=2 → ALWAYS visible, no snapshot checkWhen does a tuple get frozen?
A tuple is eligible for freezing when its xmin is older than vacuum_freeze_min_age transactions (default: 50 million). It must be frozen before it's vacuum_freeze_table_page old (default: 150 million XIDs).
-- Monitor freeze health
SELECT datname,
age(datfrozenxid) AS xid_age,
2^31 - age(datfrozenxid) AS xids_until_wraparound
FROM pg_database
ORDER BY xid_age DESC;
-- Per-table freeze age
SELECT relname,
age(relfrozenxid) AS table_xid_age,
pg_size_pretty(pg_total_relation_size(oid)) AS size
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;When age(datfrozenxid) approaches 2 billion, PostgreSQL forces an autovacuum to freeze everything.
If it hits 2^31 - 1, PostgreSQL will refuse to accept new transactions and shut down and this we can call as the wraparound safety circuit.
The all_frozen VM bit
When VACUUM freezes every tuple on a page, it sets the all_frozen bit in the visibility map. Future VACUUMs skip that page entirely since they know there's nothing left to freeze. This is why keeping tables well-vacuumed doesn't just prevent bloat, it keeps freeze overhead manageable.
Part 7: HOT Updates - Avoiding Index Churn
Normal UPDATE writes a new tuple version and requires updating every index that covers the table. HOT (Heap Only Tuple) avoids this when:
The updated columns are not part of any index
There is free space on the same page as the old tuple
CREATE TABLE orders (id int PRIMARY KEY, status text, notes text);
-- index only on id
INSERT INTO orders VALUES (1, 'pending', 'rush');
UPDATE orders SET notes = 'very rush'; -- HOT: notes not indexed, same pageWhat happens physically:
Page 0:
ItemId[1]: lp_flags=REDIRECT → points to slot 2 ← index still points here!
ItemId[2]: xmin=101, xmax=0, t_infomask2=HEAP_ONLY_TUPLE ← new tuple, no index entryThe index entry still points to ItemId[1]. When the index scan reaches it, it follows the redirect chain to ItemId[2], the live version. No index update needed.
Part 8: Hands-On Lab
-- Setup
CREATE TABLE mvcc_demo (id int, val text);
INSERT INTO mvcc_demo VALUES (1, 'original');
-- Note the XID
SELECT xmin, xmax, ctid, * FROM mvcc_demo;
-- xmin | xmax | ctid | id | val
-- 9581 | 0 | (0,1) | 1 | original
-- ── Terminal 1 ──
BEGIN;
SELECT txid_current(); -- XID = 9582
UPDATE mvcc_demo SET val='updated' WHERE id=1;
-- Before committing, open Terminal 2
-- ── Terminal 2 (while Terminal 1 is open) ───
-- Under READ COMMITTED (default):
SELECT xmin, xmax, ctid, * FROM mvcc_demo;
-- xmin | xmax | ctid | id | val
-- 9581 | 9582 | (0,1) | 1 | original
-- xmax=9582 is in progress → old tuple still visible
-- ── Terminal 1 ──
COMMIT;
-- ── Terminal 2 ──
SELECT xmin, xmax, ctid, * FROM mvcc_demo;
-- NOW sees: xmin=9582, xmax=0, val='updated'
-- xmin | xmax | ctid | id | val
-- 9582 | 0 | (0,2) | 1 | updated
-- New snapshot taken (READ COMMITTED), sees committed data
-- ── Inspect both tuple versions with pageinspect ────
SELECT lp, t_xmin, t_xmax, t_ctid,
t_infomask,
t_data
FROM heap_page_items(get_raw_page('mvcc_demo', 0));
-- Both tuples visible here: the dead one (xmax set) and the live one
-- lp | t_xmin | t_xmax | t_ctid | t_infomask | t_data
-- 1 | 9581 | 9582 | (0,2) | 1282 | \x0100000013
-- 2 | 9582 | 0 | (0,2) | 10498 | \x0100000024
-- ── Watch XID age ───
SELECT age(relfrozenxid), relname
FROM pg_class WHERE relname = 'mvcc_demo';
-- age | relname
-- 3 | mvcc_demo
-- ── Trigger a manual freeze ──
VACUUM FREEZE mvcc_demo;
SELECT lp, t_xmin, t_xmax, t_infomask
FROM heap_page_items(get_raw_page('mvcc_demo', 0));
-- lp | t_xmin | t_xmax | t_ctid | t_infomask
-- 1 | | | |
-- 2 | 9582 | 0 | (0,2) | 11010This is the key insight, on PostgreSQL 14+, the freeze mechanism changed. Instead of rewriting t_xmin to 2 (FrozenTransactionId), PostgreSQL now sets the HEAP_XMIN_FREEZE infomask bit on the tuple and leaves t_xmin intact. This is why you still see t_xmin=9582 instead of 2. In older vesion i.e <= 13, you would see t_xmin set to 2.
Part 9: Conclusion
In this module, we explored how MVCC, Transaction isolation levels works in PostgreSQL in detail, starting from:
What is Transaction XID
What are the isolation levels and uses of them
How snapshot plays a role in confirming the visibility of a tuple
and many more
Finally, we make our hands dirty by trying to understand what we learned till now with practical examples.
In the next blog, we are going to discuss about WAL, checkpoints, crash recovery.
What's Next → Module 4: WAL, Checkpoints and Crash Recovery
You now understand exactly how PostgreSQL decides what you can see. Module 4 goes into WAL (Write-Ahead Logging), the mechanism that makes every write durable, enables crash recovery, and powers replication.
We'll cover the WAL record format, LSNs, checkpoints, and what happens byte-by-byte when you COMMIT.
Enjoyed this post?
2 reactions
Related Posts
Comments
Leave a comment
Stay Updated
Get notified when I publish new articles. No spam, unsubscribe anytime.
By subscribing, you agree to our Privacy Policy.