Our team at مینون uses Redis as one of our message queues. The Redis server is deployed in a master/replica setup. Recently, we noticed a spike in the CPU usage of the Redis replicas every time we deployed our service, even when the replicas were not in use and there was no read traffic to them. However, the issue was resolved once we rebooted the replica.
Because a reboot of the replica fixed the issue every time, we assumed it was caused by some ElastiCache replication quirk and didn't pursue it further. However, a recent Redis failover brought the problem back to our attention. After the failover, the problematic replica became the new master and its CPU immediately went to 100% under the read traffic, which essentially meant the cluster was not functional after the failover. This time we investigated the issue with new vigour. What we found led us to deep dive into the details of Redis replication and its implementation of the Hash data type.
Did you know that Redis master/replica can become inconsistent in certain scenarios?
Did you know the encoding of a Hash object on the master and the replica can be different even if the write operations are exactly the same and in the same order? Read on to find out why.
The Problem
The following graph shows the CPU utilisation of the master vs. the replica immediately after our service is deployed.
From the graph, you can see the following trends in the replica's CPU usage:
- Increases immediately after our service is deployed.
- Spikes higher than the master after a certain time.
- Gets back to normal after a reboot.
Cursory Investigation
Because the spike occurs only when we deploy our service, we scrutinised all the scripts that are triggered immediately after the deployment. A Lua monitor script was identified as a possible suspect. The script redistributes messages belonging to inactive service instances in the queue to active service instances, so that those messages can still be processed by healthy instances.
We ran a few experiments on the Lua monitor script using the Redis MONITOR command to compare the script's behaviour on the master and the replica. As a side note: because MONITOR degrades performance, use it with discretion. Coming back to the script, we were surprised to find that it behaves differently on the master and the replica:
- Redis executes the script separately on the master and the replica. We expected the script to execute only on the master and the resulting changes to be replicated to the replica.
- The Redis command HGETALL used in the script returns the hash keys in a different order on the master compared to the replica.
For these reasons, the script causes data inconsistencies between the master and its replica. From that point on, the data on the master and the replica keeps diverging until it is completely different. Because of the inconsistency, the data on the replica does not get deleted correctly and grows into an extremely large dataset. Any further operations on this large dataset require more CPU, which explains why the replica's CPU usage is higher than the master's.
When the replica reboots, its data gets synced from the master and becomes consistent again, which is why the CPU usage returns to normal after a reboot.
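For anyone chasing a similar symptom, a quick way to confirm this kind of divergence is to compare the same key on both nodes. The commands below are a sketch: the endpoints are placeholders, and queue_stats is the hash our monitor script maintains.

redis-cli -h <master-endpoint> hlen queue_stats
redis-cli -h <master-endpoint> memory usage queue_stats
redis-cli -h <replica-endpoint> hlen queue_stats
redis-cli -h <replica-endpoint> memory usage queue_stats

If the replica reports a much larger field count or memory footprint for the same key than the master, deletes issued on the master are most likely not taking effect on the replica.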
Diving Deeper on HGETALL
We knew that the keys of a hash are not ordered and that we should not rely on their order. But it still puzzled us that the order differs even when the write sequence is identical on the master and the replica. The fact that the order was always the same in our local environment, which has a similar setup, made us even more curious.
So, to better understand the underlying magic of Redis and to avoid similar bugs in the future, we decided to press on and read the Redis source code for more details.
HGETALL Command Handling Code
The HGETALL command is handled by the function genericHgetallCommand, which in turn calls hashTypeNext to iterate through the Hash object. A snippet of the code is shown below:
/* Move to the next entry in the hash. Return C_OK when the next entry
 * could be found and C_ERR when the iterator reaches the end. */
int hashTypeNext(hashTypeIterator *hi) {
    if (hi->encoding == OBJ_ENCODING_ZIPLIST) {
        // call zipListNext
    } else if (hi->encoding == OBJ_ENCODING_HT) {
        // call dictNext
    } else {
        serverPanic("Unknown hash encoding");
    }
    return C_OK;
}
From the code snippet, you can see that the Redis Hash object actually has two underlying representations:
- ZIPLIST
- HASHTABLE (dict)
A bit of research online helped us understand that, to save memory, Redis chooses between the two hash representations based on the following limits:
- hash-max-ziplist-entries (128 by default): the hash is stored as a ZIPLIST only while it contains no more than this many fields.
- hash-max-ziplist-value (64 by default): the hash is stored as a ZIPLIST only while every field and value is no longer than this many bytes.
Once either limit is exceeded, Redis converts the hash to the HASHTABLE encoding.
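To make the limits concrete, here is a minimal sketch run against a scratch Redis instance with default settings (the key name small_hash is ours; on Redis 7 and above the compact encoding is reported as listpack rather than ziplist):

127.0.0.1:6379> config get hash-max-ziplist-entries
1) "hash-max-ziplist-entries"
2) "128"
127.0.0.1:6379> hset small_hash field1 value1
(integer) 1
127.0.0.1:6379> object encoding small_hash
"ziplist"
127.0.0.1:6379> hset small_hash field2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
(integer) 1
127.0.0.1:6379> object encoding small_hash
"hashtable"

The second HSET writes a value longer than 64 bytes, so Redis converts the whole hash to the HASHTABLE encoding on the spot.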
Eureka Moment
Based on this understanding, we checked the encoding of the problematic hash in staging.
stg-bookings-qu-002.pcxebj.0001.apse1.cache.amazonaws.com:6379> object encoding queue_stats
"hashtable"
stg-bookings-qu-001.pcxebj.0001.apse1.cache.amazonaws.com:6379> object encoding queue_stats
"ziplist"
To our surprise, the encodings of the Hash object on the master and its replica were different. This means that if we add or delete elements in the hash, the sequence of the keys won't be the same, because one node performs hashtable operations while the other performs list operations!
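The ordering effect itself is easy to reproduce on a scratch instance. In the sketch below (the key name demo and the config change are ours, purely for illustration), we force a small hash from the ZIPLIST encoding to the HASHTABLE encoding:

127.0.0.1:6379> hset demo a 1 b 2 c 3
(integer) 3
127.0.0.1:6379> object encoding demo
"ziplist"
127.0.0.1:6379> hgetall demo
1) "a"
2) "1"
3) "b"
4) "2"
5) "c"
6) "3"
127.0.0.1:6379> config set hash-max-ziplist-entries 0
OK
127.0.0.1:6379> hset demo d 4
(integer) 1
127.0.0.1:6379> object encoding demo
"hashtable"
127.0.0.1:6379> hgetall demo

With the ZIPLIST encoding, HGETALL walks the list and returns the fields in insertion order. After the conversion, the final HGETALL iterates a dict instead, so the order depends on how the fields hash into buckets and on the table's history of rehashes and deletions. Two hashtables built through different sequences of writes can therefore return the same fields in different orders, which is exactly what we observed between the master and the replica.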
Although we had identified the root cause, we were still curious about why the encodings on the master and the replica differed in the first place.
How Could the Underlying Representations be Different?
We reasoned, “If the master and its replica receive exactly the same write operations in the same order, why are the underlying representations still different?”
To answer this, we looked further through the Redis source to find all the possible places where a Hash object's representation could be changed, and soon found the following code snippet:
/* Load a Redis object of the specified type from the specified file.
* On success a newly allocated object is returned,