In the previous blogs, we explored the concepts related to PostgreSQL Architecture and Storage layer.

Now that we know what is xmin, xmax, ctid and cid looks like on the disk, we use them to understand how the concurrency works in PostgreSQL.

This module is about the single most important question PostgreSQL asks millions of times per second:

"Is this tuple visible to this transaction?"

Part 1: What is a Transaction ID (XID)?

Every transaction in PostgreSQL gets a transaction ID, a 32-bit unsigned integer which increases monotonically.

-- Helps to see your current transaction's XID
SELECT txid_current();
-- Returns 9093 

-- XID is only assigned when you actually write something
-- Read-only transactions may get a "virtual XID" instead
SELECT txid_current_if_assigned();
-- NULL - If read-only transaction

The XID Space

0 = Invalid TransactionID
1 = Bootstrap TransactionID (system init)
2 = Frozen TransactionID (always visible)
3...2³²-1 = Normal Transaction IDs, assigned sequentially
2³²- 1 = 4,294,967,295 (~4 Billion transactions)

Note: When the transaction ID reaches 2³²-1 it might cause the Transaction WrapAround, which we will discuss shortly in this blog

Part 2: The Snapshot - PostgreSQL's view of NOW

When a transaction starts, PostgreSQL takes a snapshot of which transactions are currently active. And this snapshot will be used to make all the tuple visibility decisions for this current transaction.

What a Snapshot contains

typedef struct SnapshotData {
    TransactionId xmin;   /* oldest XID still active */
    TransactionId xmax;   /* first XID not yet assigned */
    TransactionId *xip;   /* in-progress XIDs[] between xmin & xmax */
    uint32        xcnt;   /* number of in-progress XIDs */
} SnapshotData;

In simple terms:

xmin = any XID below this is either committed or aborted.
xmax = any XID above this is yet to start in POV of this transaction. So invisible.
xip[] = XIDs between xmin and xmax that are still in progress at the time of taking this snapshot. So these are also invisible.

-- See the current snapshot
SELECT pg_current_snapshot();
-- Returns something like: 1040:1045:1041,1043
--   xmin = 1040  (oldest active)
--   xmax = 1045  (next to be assigned)
--   xip  = {1041, 1043}  (currently running)

Snapshot Construction - Step by step

PostgreSQL builds this snapshot by scanning the ProcArray, the shared memory array of all active backends we covered in Module 1.

Active backends at snapshot time:
  XID 1040 — committed before snapshot
  XID 1041 — IN PROGRESS  ← goes into xip[]
  XID 1042 — YOUR transaction (taking snapshot)
  XID 1043 — IN PROGRESS  ← goes into xip[]
  XID 1044 — not started yet

Snapshot result:
  xmin = 1040
  xmax = 1044
  xip  = [1041, 1043]

Part 3: Visibility Rules - Exact Algorithm

Given a tuple with xmin, xmax, and a snapshot, here is the complete visibility check PostgreSQL runs. This will be done in the HeapTupleSatisfiesMVCC() in src/backend/access/heap/heapam_visibility.c.

Algorithm

FUNCTION is_visible(tuple, snapshot):
  ── Check xmin (who inserted this tuple) ──
  IF xmin == InvalidXID:
      RETURN false  ← never valid

  IF xmin_committed hint bit SET:
      xmin_ok = true ← fast path, no CLOG lookup
  ELSE IF xmin_aborted hint bit SET:
      RETURN false  ← inserter aborted, tuple never existed
  ELSE:
      status = lookup_clog(xmin)
      IF status == ABORTED:
          set_hint_bit(XMIN_INVALID)
          RETURN false
      IF status == COMMITTED:
          set_hint_bit(XMIN_COMMITTED)
          xmin_ok = true
      ELSE: (IN_PROGRESS)
          IF xmin == my_own_xid:
              xmin_ok = true  ← I can see my own inserts
          ELSE:
              RETURN false  ← another txn's uncommitted insert
  ── xmin passed ───

  ── Now check if it's visible per (or to this) snapshot ───
  IF xmin >= snapshot.xmax:
      RETURN false  ← started after my snapshot
  IF xmin in snapshot.xip[]:
      RETURN false  ← was in-progress when I started
  ── xmin is visible ──────

  ── xmin is visible. Now check xmax (who deleted it) ──────
  IF xmax == 0:
      RETURN true ← not deleted, live tuple

  IF xmax_committed hint bit SET:
      xmax_visible = true
  ELSE IF xmax_aborted hint bit SET:
      RETURN true ← deleter aborted, tuple still live
  ELSE:
      status = lookup_clog(xmax)
      IF status == ABORTED:
          set_hint_bit(XMAX_INVALID)
          RETURN true
      IF status == COMMITTED:
          set_hint_bit(XMAX_COMMITTED)
          xmax_visible = true
      ELSE: (IN_PROGRESS)
          IF xmax == my_own_xid:
              RETURN false ← I deleted it, I can't see it
          ELSE:
              RETURN true ← another txn is deleting, not committed
  ── xmax committed ──

  ── Check if that delete is in my snapshot ──
  IF xmax >= snapshot.xmax:
      RETURN true  ← deleter started after my snapshot
  IF xmax in snapshot.xip[]:
      RETURN true  ← deleter was in-progress when I started
  RETURN false  ← deleter committed before my snapshot

This runs for every tuple a query touches. The hint bits are what make it fast — once set, CLOG is never consulted again for that tuple.

Part 4: Isolation Levels

PostgreSQL Isolation levels are implemented by controlling when the snapshot is taken in transactions.

Different Isolation Levels:

Read Committed (Default)
Repeatable Read
Serializable

READ COMMITTED

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

A new snapshot will be taken at the start of each statement. That means, even in a single transaction, different statement see different snapshot that means different data.

Example:

T1: BEGIN; (XID = 1054)
T1: SELECT balance FROM accounts WHERE id=1; --Sees $100(Snapshot A)

T2: BEGIN; (XID = 1055)
T2: UPDATE accounts SET balance=200 WHERE id=1;
T2: COMMIT; (Committed XID 1055)

T1: SELECT balance FROM accounts WHERE id=1; --Sees $200(Snapshot B)
T2: COMMIT;

T1 got two different answers for the same query within one/same transaction. This is a non-repeatable read allowed under READ COMMITTED.

REPEATABLE READ

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

The snapshot is taken once at the first statement and held for the entire transaction.

Example:

T1: BEGIN; (XID = 1056)
T1: SELECT balance FROM accounts WHERE id=1;--Snapshot taken,sees $100

T2: BEGIN; (XID = 1057)
T2: UPDATE accounts SET balance=200 WHERE id=1;
T2: COMMIT;

T1: SELECT balance FROM accounts WHERE id=1; -- Still sees $100
T2: COMMIT;

T1 is completely isolated from T2's commit. This prevents non-repeatable reads AND phantom reads, PostgreSQL's REPEATABLE READ is actually stronger than the SQL standard requires.

SERIALIZABLE

Uses Serializable Snapshot Isolation (SSI) tracks read/write dependencies between transactions to detect serialization anomalies and abort the "dangerous" transaction. Much deeper than just snapshot timing.

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- May raise: ERROR: could not serialize access due to concurrent update

Anamolies

Dirty Read - You read uncommitted data from the another transaction
1. Not possible in any transaction isolation. That is because tuple & visibility rules are designed in such a way that uncommitted tuples never visible to other transactions.
Non-Repeatable Read - Same row returns different values due to updates in another transactions
1. Possible in READ COMMITTED isolation
Phantom Reads - Same query returns different rows due to insert/delete in another transaction
1. Possible in READ COMMITTED isolation

Part 6: XID WrapAround Problem

XID is 32-bit. It wraps around after ~4 billion transactions. When it wraps, old XIDs appear to be "in the future" and PostgreSQL would think every tuple is invisible.

This would corrupt your entire database.

How PostgreSQL solves it: Freezing

VACUUM replaces old xmin with FrozenTransactionID (2).Frozen tuples are always visible to everyone and no snapshot check needed.

Normal tuple:  xmin=1000000  → visible only if XID 1M is in snapshot
Frozen tuple:  xmin=2        → ALWAYS visible, no snapshot check

When does a tuple get frozen?

A tuple is eligible for freezing when its xmin is older than vacuum_freeze_min_age transactions (default: 50 million). It must be frozen before it's vacuum_freeze_table_page old (default: 150 million XIDs).

-- Monitor freeze health
SELECT datname,
       age(datfrozenxid)          AS xid_age,
       2^32 - age(datfrozenxid)   AS xids_until_wraparound
FROM pg_database
ORDER BY xid_age DESC;

-- Per-table freeze age
SELECT relname,
       age(relfrozenxid)          AS table_xid_age,
       pg_size_pretty(pg_total_relation_size(oid)) AS size
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;

When age(datfrozenxid) approaches 4 billion, PostgreSQL forces an autovacuum to freeze everything.

If it hits 2^32 - 1, PostgreSQL will refuse to accept new transactions and shut down and this we can call as the wraparound safety circuit.

The all_frozen VM bit

When VACUUM freezes every tuple on a page, it sets the all_frozen bit in the visibility map. Future VACUUMs skip that page entirely since they know there's nothing left to freeze. This is why keeping tables well-vacuumed doesn't just prevent bloat, it keeps freeze overhead manageable.

Part 7: HOT Updates - Avoiding Index Churn

Normal UPDATE writes a new tuple version and requires updating every index that covers the table. HOT (Heap Only Tuple) avoids this when:

The updated columns are not part of any index
There is free space on the same page as the old tuple

CREATE TABLE orders (id int PRIMARY KEY, status text, notes text);
-- index only on id

INSERT INTO orders VALUES (1, 'pending', 'rush');
UPDATE orders SET notes = 'very rush'; -- HOT: notes not indexed, same page

What happens physically:

Page 0:
  ItemId[1]: lp_flags=REDIRECT → points to slot 2 ← index still points here!
  ItemId[2]: xmin=101, xmax=0, t_infomask2=HEAP_ONLY_TUPLE ← new tuple, no index entry

The index entry still points to ItemId[1]. When the index scan reaches it, it follows the redirect chain to ItemId[2], the live version. No index update needed.

Part 8: Hands-On Lab

-- Setup
CREATE TABLE mvcc_demo (id int, val text);
INSERT INTO mvcc_demo VALUES (1, 'original');
-- Note the XID
SELECT xmin, xmax, ctid, * FROM mvcc_demo;
-- xmin | xmax | ctid  | id |   val    
-- 9581 |    0 | (0,1) |  1 | original

-- ── Terminal 1 ──
BEGIN;
SELECT txid_current(); -- XID = 9582

UPDATE mvcc_demo SET val='updated' WHERE id=1;

-- Before committing, open Terminal 2


-- ── Terminal 2 (while Terminal 1 is open) ───
-- Under READ COMMITTED (default):
SELECT xmin, xmax, ctid, * FROM mvcc_demo;
-- xmin | xmax | ctid  | id |   val    
-- 9581 | 9582 | (0,1) |  1 | original
-- xmax=9582 is in progress → old tuple still visible

-- ── Terminal 1 ──
COMMIT;

-- ── Terminal 2 ──
SELECT xmin, xmax, ctid, * FROM mvcc_demo;
-- NOW sees: xmin=9582, xmax=0, val='updated'
-- xmin | xmax | ctid  | id |   val   
-- 9582 |    0 | (0,2) |  1 | updated
-- New snapshot taken (READ COMMITTED), sees committed data

-- ── Inspect both tuple versions with pageinspect ────
SELECT lp, t_xmin, t_xmax, t_ctid,
       t_infomask,
       t_data
FROM heap_page_items(get_raw_page('mvcc_demo', 0));
-- Both tuples visible here: the dead one (xmax set) and the live one
-- lp | t_xmin | t_xmax | t_ctid | t_infomask | t_data            
--  1 |   9581 |   9582 | (0,2)  |       1282 | \x0100000013
--  2 |   9582 |      0 | (0,2)  |      10498 | \x0100000024


-- ── Watch XID age ───
SELECT age(relfrozenxid), relname
FROM pg_class WHERE relname = 'mvcc_demo';
-- age |  relname  
--   3 | mvcc_demo

-- ── Trigger a manual freeze ──
VACUUM FREEZE mvcc_demo;

SELECT lp, t_xmin, t_xmax, t_infomask
FROM heap_page_items(get_raw_page('mvcc_demo', 0));
-- lp | t_xmin | t_xmax | t_ctid | t_infomask 
--  1 |        |        |        |           
--  2 |   9582 |      0 | (0,2)  |      11010

This is the key insight, on PostgreSQL 14+, the freeze mechanism changed. Instead of rewriting t_xmin to 2 (FrozenTransactionId), PostgreSQL now sets the HEAP_XMIN_FREEZE infomask bit on the tuple and leaves t_xmin intact. This is why you still see t_xmin=9582 instead of 2. In older vesion i.e <= 13, you would see t_xmin set to 2.

Part 9: Conclusion

In this module, we explored how MVCC, Transaction isolation levels works in PostgreSQL in detail, starting from:

What is Transaction XID
What are the isolation levels and uses of them
How snapshot plays a role in confirming the visibility of a tuple
and many more

Finally, we make our hands dirty by trying to understand what we learned till now with practical examples.

In the next blog, we are going to discuss about WAL, checkpoints, crash recovery.

What's Next → Module 4: WAL, Checkpoints and Crash Recovery

You now understand exactly how PostgreSQL decides what you can see. Module 4 goes into WAL (Write-Ahead Logging), the mechanism that makes every write durable, enables crash recovery, and powers replication.

We'll cover the WAL record format, LSNs, checkpoints, and what happens byte-by-byte when you COMMIT.