Inside Redis: How a Fenwick Tree Optimizes Performance

In this blog, we are going to

Learn about the Binary Indexed Tree (also called as Fenwick Tree) data structure in depth
Learn a little bit about Redis and what it does
Learn how Redis uses the BIT and see it in action inside the source code of Redis

What does a data structure from competitive programming have in common with one of the world's most popular databases? As it turns out, quite a lot. Let's dive in.

Why use a BIT?

Let us say in typical DSA fashion we are given an array A of size N, and we want to handle prefix sum queries, as well as point updates to the array. More formally, our ADT would look something like:

query(idx) → should return the sum of A from indices 0 to idx

update(idx, delta) → should increase the value of A[idx] by delta

Using the array A or a prefix sum array of A are naive approaches, as one can easily realize. However the BIT provides a balanced trade-off here, making both of the operations blazingly fast.

Approach	Complexity of query operation	Complexity of update operation
Use array `A` itself	`O(N)`	`O(1)`
Use prefix sum array of `A`	`O(1)`	`O(N)`
BIT	`O(log N)`	`O(log N)`

Now that we have learnt the abstraction that the BIT provides, let us learn its internal implementation. The BIT is simply an array tree of size N + 1, where

tree[i] stores the sum of the subarray from [i - i & (-i) + 1, i] inclusive

In other words, each tree[i] covers an interval of size i & (-i), ending exactly at i.

At first glance, the expression i & (-i) may look esoteric, but it is actually a simple bit trick

-i is the two’s complement of i
taking i & (-i) isolates the rightmost set bit of i

Example: i = 6 = 0b0110, then -i = -6 = 0b1010, and finally i & (-i) = 0b0010 = 2, thus the 6th index will be responsible for 2 values.

This works because the two's complement operation (inverting all bits and adding one) preserves the rightmost '1' bit while flipping all bits to its left. The subsequent & operation then zeroes out everything else, perfectly isolating it.

This way of storing partial information in each index will beautifully come together when we take a look at the implementation of the operations themselves.

query(idx):

Since the BIT is 1 indexed, our i will be equal to idx + 1. Now we want sum till i, what we can do is add tree[i] to our result, then move our i pointer to i - i & (-i), and then continue to repeat the same procedure again, till i is not out of bounds. This effectively corresponds to breaking our sum query into multiple subarrays, using the defined rule that we only store sum of i & (-i) elements at any given index. Since at each step we cover exactly the last uncovered range, the pieces glue together to form the entire prefix sum.

def query(idx, tree):
    """Return sum A[0..idx] using BIT."""
    i = idx + 1   # BIT is 1-indexed
    res = 0
    while i > 0:
        res += tree[i]
        i -= i & -i  # move to "parent"
    return res

It is easy to see that a call to query takes O(log N) time, notice that at each step, we unset the rightmost set bit of i. Since i has at most log₂(N) bits, the loop runs at most O(log N) times.

update(idx, delta):

Since the BIT is 1-indexed, our internal pointer will start at i = idx + 1. The thing to notice is: if tree[i] stores the sum of the interval [i - i & (-i) + 1, i], then any ancestor of i whose interval contains idx must also be updated. We can find these ancestor nodes by incrementing i such that we go to the “next” index which needs to be updated, we do this by incrementing i by i & (-i).

def update(idx, delta, tree):
    """Increase A[idx] by delta in BIT of size n."""
    n = len(tree) - 1
    i = idx + 1   # BIT is 1-indexed
    while i <= n:
        tree[i] += delta
        i += i & -i  # move to "next responsible node"

The logic for the time complexity is similar, adding i & (-i) unsets the rightmost set bit and sets the next higher bit on. Eventually, we propagate the change all the way up until we cross n. This takes O(log N) steps.

Here is an example you can play around with:

Let A = [1, 2, 3, 4, 5] (n = 5). The internal tree[1...5] becomes:

tree[1] = A[0] = 1
tree[2] = A[0] + A[1] = 3
tree[3] = A[2] = 3
tree[4] = A[0] + A[1] + A[2] + A[3] = 10
tree[5] = A[4] = 5

So tree = [_, 1, 3, 3, 10, 5]

That’s the sauce of the BIT!

What is Redis?

Redis is an open source, in memory database store which you can use as a cache. Unlike a traditional RDBMS which stores all the data on disk, Redis primarily stores it in memory for extremely low latency, and also gives us a rich set of data types and algorithms. The common use case of Redis is as a key-value cache, sitting in front of a traditional database. The backend application would first check the Redis instance if it has what its looking for, and only if it is a miss will it check the database, and then write the result back to Redis with some expiration time.

At it’s core, Redis is a key-value store, typically powered by a hash table. But what happens when a hash table grows to a million keys? It inevitably needs to resize in order to avoid performance degradation. This resizing can lead to the database becoming unresponsive for a while, which is unacceptable for a software like Redis where predictable low latency is a key feature. To solve this problem, Redis has a kvstore abstraction which keeps an array of hash tables. Most of the time the source code refers to the hash tables as dictionaries.

The comment at the top of redis/src/kvstore.c makes the purpose of the file clear, this will be the main source file that we will explore today.

/*
 * Index-based KV store implementation
 * This file implements a KV store comprised of an array of dicts (see dict.c)
 * The purpose of this KV store is to have easy access to all keys that belong
 * in the same dict (i.e. are in the same dict-index)
 *
 * For example, when Redis is running in cluster mode, we use kvstore to save
 * all keys that map to the same hash-slot in a separate dict within the kvstore
 * struct.
 * This enables us to easily access all keys that map to a specific hash-slot.
... 
 */

So when we insert a new key-value pair, we pick one of these dicts and store the pair there. Having multiple dicts lets us

perform per-slot maintenance (like rehash a single dict at a time)
keep keys that belong to the same logical slot together (e.g., cluster hash-slots)
iterate and scan through a slot efficiently

The kvstore struct looks as follows, with the fields not relevant to our discussion hidden.

struct _kvstore {
    ... 
    dict **dicts;
    long long num_dicts;
    ...
    int non_empty_dicts;                   /* The number of non-empty dicts. */
    unsigned long long key_count;          /* Total number of keys in this kvstore. */
    ...
    unsigned long long *dict_size_index;   /* Binary indexed tree (BIT) that describes cumulative key frequencies up until given dict-index. */
    ... 
};

Here, **dicts is the pointer to the array of the actual dictionaries (of type dict defined in redis/src/dict.c), num_dicts is the size of this array and the rest of the fields are as explained in the comments, with the dict_size_index being the actual BIT. This BIT stores the cumulative key counts of the array of dictionaries, i.e it is a BIT on top of the number of keys in a particular dictionary.

Redis allocates this BIT only when num_dicts is greater than 1, as seen in the kvstoreCreate function

kvs->dict_size_index = kvs->num_dicts > 1 ? 
    zcalloc(sizeof(unsigned long long) * (kvs->num_dicts + 1)) : NULL;

Note the +1, this clearly shows us that even here the BIT is 1 indexed.

When Redis inserts or deletes a key, the appropriate functions of the kvstore API are called,. For example when setting a key value pair, the kvstoreDictSetAtLink function is called, if the key is new (not an update to the old value), then the BIT needs to be updated by an increment of 1. The actual BIT update logic is done by calling the cumulativeKeyCountAdd function, passing in 1 as the “delta”. Note that for the delete operation, the delta is -1.

/* Set a key (or key-value) in the specified kvstore. 
 *
 * This function inserts a new key or updates an existing one, depending on 
 * the `newItem` flag.
 *
 * Parameters:
 * link:      - When `newItem` is set, `link` points to the bucket of the key.
 *            - When `newItem` is not set, `link` points to the link of the key.
 *            - If link is NULL, dictFindLink() will be called to locate the link.
 *          
 * newItem: - If set, add a new key with a new dictEntry.
 *          - If not set, update the key of an existing dictEntry.
 */
void kvstoreDictSetAtLink(kvstore *kvs, int didx, void *kv, dictEntryLink *link, int newItem) {
    dict *d;
    if (newItem) {
        d = createDictIfNeeded(kvs, didx);
        dictSetKeyAtLink(d, kv, link, newItem);
        cumulativeKeyCountAdd(kvs, didx, 1); /* must be called only after updating dict */
    } else {
        d = kvstoreGetDict(kvs, didx);
        dictSetKeyAtLink(d, kv, link, newItem);
    }
}

The cumulativeKeyCountAdd function is the clear cousin of the update method in our ADT. It takes in 3 parameters:

kvs - pointer to the kvstore struct
didx - the index of the dictionary of which the key count is to be updated, this is 0 indexed
delta - the key count will be increased by delta

Let us see the meat of the logic in this function.

/* Updates binary index tree (also known as Fenwick tree), increasing key count for a given dict.
 * You can read more about this data structure here https://en.wikipedia.org/wiki/Fenwick_tree
 * Time complexity is O(log(kvs->num_dicts)). Take care to call it only after 
 * adding or removing keys from the kvstore.
 */
static void cumulativeKeyCountAdd(kvstore *kvs, int didx, long delta) {
    kvs->key_count += delta;

    dict *d = kvstoreGetDict(kvs, didx);
    size_t dsize = dictSize(d);
    /* Increment if dsize is 1 and delta is positive (first element inserted, dict becomes non-empty).
     * Decrement if dsize is 0 (dict becomes empty). */
    int non_empty_dicts_delta = (dsize == 1 && delta > 0) ? 1 : (dsize == 0) ? -1 : 0;
    kvs->non_empty_dicts += non_empty_dicts_delta;

    /* BIT does not need to be calculated when there's only one dict. */
    if (kvs->num_dicts == 1)
        return;

    /* Update the BIT */
    int idx = didx + 1; /* Unlike dict indices, BIT is 1-based, so we need to add 1. */
    while (idx <= kvs->num_dicts) {
        if (delta < 0) {
            assert(kvs->dict_size_index[idx] >= (unsigned long long)labs(delta));
        }
        kvs->dict_size_index[idx] += delta;
        idx += (idx & -idx);
    }
}

The last while loop is verbatim the BIT logic we saw in python, with an added check so that the values do not drop below zero, as the count of the keys cannot be negative. The code above the while loop is doing work to maintain the fields of the struct like the total key_count, getting the number of keys in the actual dict and the number of non_empty_dicts.

Now, to get the number of keys up to and including a certain dict, Redis uses cumulativeKeyCountRead, a function that is a direct implementation of the query logic we saw earlier.

/* Returns total (cumulative) number of keys up until given dict-index (inclusive).
 * Time complexity is O(log(kvs->num_dicts)). */
static unsigned long long cumulativeKeyCountRead(kvstore *kvs, int didx) {
    if (kvs->num_dicts == 1) {
        assert(didx == 0);
        return kvstoreSize(kvs);
    }
    int idx = didx + 1;
    unsigned long long sum = 0;
    while (idx > 0) {
        sum += kvs->dict_size_index[idx];
        idx -= (idx & -idx);
    }
    return sum;
}

It is really interesting to see how closely they have followed the standard implementation, and still have covered their unique needs by carefully tying it together with the kvstore struct and the family of functions which operate on it, providing a nice abstraction to the high level set and delete operations.

There is another powerful use case for the BIT that Redis leverages: finding a random dictionary, but weighted fairly according to the number of keys it holds. This is crucial for operations where Redis needs to pick a random key from the entire keyspace, for example a background job that checks for TTL expired keys and removes them. A naive random choice would pick any dictionary with equal probability, which is unfair if one dictionary holds thousands of keys and another holds only a few. A fair choice means a dictionary with more keys should have a proportionally higher chance of being selected. This is exactly what the kvstoreGetFairRandomDictIndex function is for. Let's see how it works.

/* Returns fair random dict index, probability of each dict being returned is proportional to the number of elements that dictionary holds.
 * This function guarantees that it returns a dict-index of a non-empty dict, unless the entire kvstore is empty.
 * Time complexity of this function is O(log(kvs->num_dicts)). */
int kvstoreGetFairRandomDictIndex(kvstore *kvs) {
    unsigned long target = kvstoreSize(kvs) ? (randomULong() % kvstoreSize(kvs)) + 1 : 0;
    return kvstoreFindDictIndexByKeyIndex(kvs, target);
}

The function first picks a random number target between 1 and this total size. This target represents the k-th key in our giant conceptual array. The problem now is to find which dictionary this k-th key belongs to. This is where the BIT shines again, but in a different way. Basically now we want to find the smallest didx such that the cumulative number of keys uptil that point is greater than or equal to our target. If you have a strong grasp over DSA, it will immediately strike you that the problem laid out in front of us looks like it could be solved by some version of binary search.

This is handled by the kvstoreFindDictIndexByKeyIndex function, which performs a clever binary search directly on the structure of the BIT to find the answer in O(log N) time.

/* Finds a dict containing target element in a key space ordered by dict index.
 * Consider this example. Dictionaries are represented by brackets and keys by dots:
 *  #0   #1   #2     #3    #4
 * [..][....][...][.......][.]
 *                    ^
 *                 target
 *
 * In this case dict #3 contains key that we are trying to find.
 *
 * The return value is 0 based dict-index, and the range of the target is [1..kvstoreSize], kvstoreSize inclusive.
 *
 * To find the dict, we start with the root node of the binary index tree and search through its children
 * from the highest index (2^num_dicts_bits in our case) to the lowest index. At each node, we check if the target
 * value is greater than the node's value. If it is, we remove the node's value from the target and recursively
 * search for the new target using the current node as the parent.
 * Time complexity of this function is O(log(kvs->num_dicts))
 */
int kvstoreFindDictIndexByKeyIndex(kvstore *kvs, unsigned long target) {
    if (kvs->num_dicts == 1 || kvstoreSize(kvs) == 0)
        return 0;
    assert(target <= kvstoreSize(kvs));

    int result = 0, bit_mask = 1 << kvs->num_dicts_bits;
    for (int i = bit_mask; i != 0; i >>= 1) {
        int current = result + i;
        /* When the target index is greater than 'current' node value the we will update
         * the target and search in the 'current' node tree. */
        if (target > kvs->dict_size_index[current]) {
            target -= kvs->dict_size_index[current];
            result = current;
        }
    }
    /* Adjust the result to get the correct dict:
     * 1. result += 1;
     *    After the calculations, the index of target in dict_size_index should be the next one,
     *    so we should add 1.
     * 2. result -= 1;
     *    Unlike BIT(dict_size_index is 1-based), dict indices are 0-based, so we need to subtract 1.
     * As the addition and subtraction cancel each other out, we can simply return the result. */
    return result;
}

The core of the logic is the if condition: target > kvs->dict_size_index[current]. This checks if our target key lies beyond the block of dictionaries represented by the BIT node at kvs->dict_size_index[current].

If it does, it means our key is in a later dictionary. So, we subtract the number of keys in this block from our target and set our result (our new base position) to current, effectively skipping over this entire block of dictionaries.
If it doesn't, we do nothing and try a smaller jump in the next iteration (i >>= 1).

We’ve seen how the Binary Indexed Tree is expertly used in Redis to solve two crucial problems: computing cumulative key counts and enabling fair, weighted random selection across multiple dicts. It stands as a perfect example of how algorithmic thinking and knowledge can be quite fundamental to building real world software, showing that the elegant solutions of computer science aren’t just theoretical. They are actively powering the tools we use every day.

References

Redis Source Code on GitHub - Explore the full source code of the project. All snippets in this article are from this repository.
Fenwick Tree (Binary Indexed Tree) on CP-Algorithms - An excellent resource for a deeper dive into the data structure.

The Binary Indexed Tree (BIT) data structure and how it is used in Redis

Why use a BIT?

What is Redis?

References

Comments

More from this blog

GSoC: The Complete Guide - By Example

Inside PostgreSQL: A Bird’s-Eye View

Command Palette

Why use a BIT?

What is Redis?

References

Comments

More from this blog