Inside PostgreSQL: A Bird’s-Eye View

PostgreSQL, or "Postgres," is a beast known for being a powerful and reliable relational database. But have you ever wondered what actually happens when you run a simple SELECT query? How does it find your data so fast? How does it keep everything safe if the power goes out? This blog is for the curious. We're going to peek under the hood and connect the dots between our DBMS lectures and how a real world database actually works. Let's dive in! We'll explore everything from how Postgres guarantees the sacred ACID properties to the indexing that makes queries fly, and much more.

Handling Connections

When a client (be it a Flask, Node.js or Go backend) connects to the Postgres database, it does not talk to a monolithic program. A supervisor-like process called the Postmaster process which is listening on port 5432 (postgres’ default) forks off a new process for the connection. This means that each connection gets its own dedicated process on the database server (they communicate via TCP), which provides a great amount of isolation from other clients. However, it also means that the opening and closing of connections is a relatively expensive operation. Recall from operating systems that creation of a new process is only possible through the fork syscall which obviously causes the OS to trap into kernel mode.

For applications with frequent queries, constantly creating and closing connections is extremely inefficient. The standard solution here is “connection-pooling”, essentially: we maintain a set of active connections, and everytime we have a query, we borrow one of the connections and return when not in use, this removes the overhead of process creation (and even the TCP handshake).

ACID - But how?

While most relational databases promise ACID guarantees, the implementation details are what differs. Let's examine how PostgreSQL elegantly achieves these properties.

Atomicity and Durability: The Write Ahead Log (WAL)

Postgres stores the actual data of the table in heap files on the disk which are just an unordered collection of the rows of the table.

But before making any actual changes to these files on disk, postgres first writes a description of those changes to the WAL. These log records must be written and flushed to disk. This is known as the WAL Rule. Finally, when a transaction says COMMIT, a commit record is also written to the WAL. Let’s say the server crashes in the middle of a transaction before the COMMIT, then during recovery, it will read the WAL sequentially from the last checkpoint and if it does not find a COMMIT record for a particular transaction, it will roll back those changes, ensuring that the transaction never happened. Thus, the transaction is treated as a single, indivisible unit; it either fully completes (commits) or has no effect. Atomicity!

Whenever our application receives a “success” confirmation for a COMMIT, postgres ensures that the WAL commit record is safely flushed to the disk, even if the server crashes a second later, the changes made by the committed transaction will persist due to the previously mentioned recovery mechanism. Durability!

Consistency

Consistency guarantees that any transaction will only bring the database from one valid state to another valid state. In simpler words, all the data integrity rules like NOT NULL, UNIQUE and FOREIGN KEY constraints are never violated. Postgres enforces this by checking constraints at the end of each statement. If a statement's result violates a constraint, the statement is immediately rolled back. Consistency!

Isolation

Arguably the most important of the ACID properties is Isolation. Isolation ensures that concurrent transactions do not interfere with each other. Postgres achieves this through MVCC (Multi Version Concurrency Control). Instead of locking the row when it is being read or written to, postgres creates a new version of the row for each UPDATE. Thus a single logical row in a table can have multiple physical tuples on disk. In other words, a “tuple” is one of the many possible “versions” of a row. Consequently readers do not block writers and writers do not block readers. But then does a single transaction see multiple different versions of the same row? No. That would be quite silly. Let’s dive deeper into the tuple visibility logic.

When you begin a new transaction, it only sees a “snapshot” of the database. Basically there are a set of possible rules you can choose from, which govern what version of a row you see. These are known as Isolation Levels. We will see 2 of the most common of them here.

Read Committed (Default) - A new snapshot is taken at the start of every statement. This means that your transaction sees the changes made by newer transactions if they commit before your transaction does. Consequently, if you run the same SELECT query twice, you may see different results.

This will get clear with this example:

Imagine T1 starts. It runs SELECT balance FROM accounts WHERE id=1; and sees 1000. Then, T2 starts, runs UPDATE accounts SET balance = 900 WHERE id=1;, and COMMITs. Now, T1 runs the exact same SELECT query again. Because it takes a new snapshot, it now sees the balance as 900.
Repeatable Read - A single snapshot is taken only at the start of the transaction. Any changes (committed or not) made by other transactions are not visible to you, providing a stable view of the database. For long running queries, this can be crucial. Let us take the same example to see the difference:

T1 starts and takes its one-and-only snapshot. It sees the balance as 1000. T2 makes and commits its update. When T1 runs its SELECT query again, it uses its original snapshot. It still sees the balance as 1000, as if T2's change never happened.

Storage and Indexing

Postgres stores table data in what's called a heap file, which is an unordered collection of rows. Each row is assigned a Tuple ID (TID) and a direct physical pointer (block_number, offset) to its location on disk. When you create an index (a B-Tree), the index stores the indexed value and the TID of the corresponding row. When we want to search a particular row by its index,

Search Key → TID → Row Data

This is very efficient for lookups. However, because of MVCC, an UPDATE creates a new row version with a new TID. This means every single index on the table must be updated to point to this new location. This adds significant overhead.

Conclusion

Hopefully, this blog has demystified some of what goes on under the hood of postgres. The next time you write a SELECT statement, you'll have a much better appreciation for the incredible journey your query takes. Until next time!

Inside PostgreSQL: A Bird’s-Eye View

Handling Connections

ACID - But how?

Atomicity and Durability: The Write Ahead Log (WAL)

Consistency

Isolation

Storage and Indexing

Conclusion

Comments

More from this blog

GSoC: The Complete Guide - By Example

The Binary Indexed Tree (BIT) data structure and how it is used in Redis

Command Palette

Handling Connections

ACID - But how?

Atomicity and Durability: The Write Ahead Log (WAL)

Consistency

Isolation

Storage and Indexing

Conclusion

Comments

More from this blog