The silent memory leak: debugging intermittent 500s in WordPress Redis caching

A WordPress site started throwing intermittent 500 errors roughly every ten days after we enabled Redis object caching. Refresh the page and it loads fine. Flush the cache and the errors disappear — for another ten days. That pattern is the kind of thing that makes you doubt yourself, because it “works” every time you look at it.

This is what I found, what I changed, and where I’m still not entirely certain the fix is complete.

Finding the actual offender

The first useful move was running redis-cli directly against the database the site was using:

redis-cli -n 1 --bigkeys

The output was the most concrete anchor in the whole investigation:

Sampled 77726 keys in the keyspace!
Total key length in bytes is 8083517 (avg len 104.00)

Biggest string found '…:options:notoptions' has 3971911 bytes
Biggest zset  found '…:redis-cache:metrics'  has 2471 members

A single key — options:notoptions — sitting at just under 4MB. Everything else was noise by comparison: site-transient:feed_* keys around 0.5MB, various post_meta:* values in the 100–220KB range. The structure was almost entirely flat strings, which at least ruled out a data type mismatch problem.
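To track these numbers over time rather than reading them by eye, the "Biggest ... found" summary lines can be parsed into records. This is a minimal sketch, assuming the output format shown above (it may differ between redis-cli versions). The function name parse_bigkeys is hypothetical, and the sample text is copied from the scan output, elided key prefixes included.

```python
import re

# Matches summary lines such as:
#   Biggest string found '…:options:notoptions' has 3971911 bytes
# Accepts single or double quotes around the key name.
BIGGEST_LINE = re.compile(
    r"^Biggest\s+(?P<type>\w+)\s+found\s+['\"](?P<key>.*)['\"]"
    r"\s+has\s+(?P<size>\d+)\s+(?P<unit>\w+)"
)


def parse_bigkeys(output):
    """Return one dict per 'Biggest <type> found' line in --bigkeys output."""
    results = []
    for line in output.splitlines():
        match = BIGGEST_LINE.match(line.strip())
        if match:
            results.append({
                "type": match.group("type"),
                "key": match.group("key"),
                "size": int(match.group("size")),
                "unit": match.group("unit"),  # bytes for strings, members for zsets
            })
    return results


SAMPLE = """\
Sampled 77726 keys in the keyspace!
Total key length in bytes is 8083517 (avg len 104.00)

Biggest string found '…:options:notoptions' has 3971911 bytes
Biggest zset  found '…:redis-cache:metrics'  has 2471 members
"""

for entry in parse_bigkeys(SAMPLE):
    print(entry["type"], entry["key"], entry["size"], entry["unit"])
```

The unit matters when comparing entries: strings are reported in bytes, while sets and sorted sets are reported in member counts, so the two numbers are not directly comparable.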

The notoptions key is a WordPress Options API artifact. WordPress caches the names of options that don’t exist, so it can skip a database lookup the next time something calls get_option('missing_key'). In principle, that’s reasonable. In practice, if your codebase, or a plugin, makes a lot of get_option() calls for keys that never exist, the array keeps growing. With no TTL, the key lives in Redis indefinitely, and the whole array is pulled into PHP memory and unserialized on every request that reads it.
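The growth mechanism is easier to see in a toy model. The sketch below is Python, not WordPress code: the OptionsCache class, its db dictionary, and the plugin_widget_state_* names are all hypothetical stand-ins. It only mimics the behaviour described above, where every miss adds a name to one array stored under a single cache key.

```python
import pickle


class OptionsCache:
    """Toy model of negative caching for option lookups (not WordPress code)."""

    def __init__(self, db):
        self.db = db              # stand-in for the options table
        self.notoptions = {}      # names known to be missing, kept as ONE cache value
        self.db_lookups = 0

    def get_option(self, name, default=False):
        if name in self.notoptions:
            return default        # known miss: skip the database
        self.db_lookups += 1
        if name in self.db:
            return self.db[name]
        # Remember the miss. In this model nothing ever removes the entry.
        self.notoptions[name] = True
        return default

    def notoptions_size(self):
        # Rough size of the serialized blob that would sit in the cache.
        return len(pickle.dumps(self.notoptions))


cache = OptionsCache(db={"siteurl": "https://example.com"})

# Hypothetical plugin that builds option names dynamically; none of them exist.
for i in range(10_000):
    cache.get_option(f"plugin_widget_state_{i}")

print(len(cache.notoptions), "cached misses,", cache.notoptions_size(), "bytes serialized")
```

Each individual lookup is cheap, which is why this is easy to miss. The cost is in the single value that every one of those misses keeps appending to.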

The intermittent 500s were almost certainly PHP memory spikes on requests where that 3.97MB blob was deserialized in full. The flush “fixed” it because the key was deleted. Ten days later, it had grown back to the same size.
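As a back-of-the-envelope check, the regrowth rate can be estimated from the two numbers above. This assumes the key grew roughly linearly from empty over the ten days, which was not measured, and it reads the 1MB alert threshold used for the weekly check later in this post as decimal megabytes.

```python
KEY_BYTES = 3_971_911      # size of options:notoptions from the --bigkeys scan
REGROWTH_DAYS = 10         # approximate time between a flush and the next 500s
ALERT_BYTES = 1_000_000    # 1MB alert threshold, taken as decimal megabytes

# Assumption: linear growth from an empty key after each flush.
bytes_per_day = KEY_BYTES / REGROWTH_DAYS
days_to_alert = ALERT_BYTES / bytes_per_day

print(f"growth: ~{bytes_per_day / 1000:.0f} KB/day")
print(f"1MB threshold crossed after ~{days_to_alert:.1f} days")
```

Under that assumption the key gains roughly 0.4MB per day, so a 1MB alert would fire about two and a half days after a flush, well before the key reaches the size that coincided with the errors.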

Here is the part I want to be careful about, because a reviewer of an earlier draft caught it: setting a global TTL cap does not automatically fix notoptions in all configurations. In some WordPress and Redis Object Cache plugin versions, system option groups are excluded from TTL caps or treated as persistent. In our case, the WP_REDIS_MAXTTL setting of 86400 seconds did eventually evict the stale entries that had accumulated: the key expired and was rebuilt from a smaller, cleaner baseline. But if the TTL had been ignored for that group, the right move would have been to exclude the options group entirely or patch the code generating the misses. I cannot tell you with certainty which mechanism mattered more here until we see whether the pattern repeats.
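One way to tell which case applies is to read the key's TTL directly. Redis's TTL command replies with -2 when the key does not exist and -1 when it exists with no expiry. The helper below is a small sketch that interprets that reply against the configured cap. The function name describe_ttl and its labels are hypothetical.

```python
def describe_ttl(ttl, max_ttl=86400):
    """Interpret the integer reply of Redis's TTL command for one key."""
    if ttl == -2:
        return "missing"       # key does not exist (expired, evicted, or flushed)
    if ttl == -1:
        return "persistent"    # key exists with no expiry: the cap was not applied
    if 0 <= ttl <= max_ttl:
        return "capped"        # key will expire within the configured maximum
    return "over-cap"          # key expires, but later than the configured maximum


for reply in (-2, -1, 43200, 172800):
    print(reply, describe_ttl(reply))
```

The reply itself would come from something like redis-cli -n 1 TTL followed by the full key name. The prefix is elided in the scan output above, so the real name has to be substituted. A "persistent" result for notoptions would mean the cap is being ignored for that group.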

What we changed

Three configuration changes, added to wp-config.php:

// Exclude cache groups that do more harm than good when externalized to Redis
define('WP_REDIS_IGNORED_GROUPS', [
    'site-transient',   // remote feed blobs (~0.5MB) not worth serializing to Redis
    'users',
    'userlogins',
    'usermeta',
    'user_meta',        // per-request user data; cost of cache miss is low
]);

// Enforce consistent serialization to avoid igbinary/php mismatch crashes
define('WP_REDIS_SERIALIZER', 'php');

// Cap all TTLs at 24 hours so stale entries can't accumulate indefinitely
define('WP_REDIS_MAXTTL', 86400);

We also raised the PHP memory limit to 512MB. I want to be direct about that decision: raising the limit didn’t solve the leak, it just widened the margin for error. The site stopped crashing under the immediate load while the TTL had time to cycle out the bad key. That is not an architecture fix — it is a stabilization measure while the real cause is still being watched.

The rationale behind WP_REDIS_IGNORED_GROUPS deserves a short explanation. Not every cache group benefits from being externalized to Redis. User meta and login state are per-request, short-lived, and the penalty for a cache miss is a fast database read. Pushing them through Redis adds network round-trip overhead without meaningful benefit. The site-transient group was generating large feed blobs that were eating memory without obvious read performance gains. Excluding these groups reduced the Redis footprint without measurably slowing the site.

The serializer flag is a defensive move. If igbinary is installed for one PHP handler but not another (common in mixed FPM + CLI environments), you can get intermittent deserialization failures that look like random application errors. Pinning to PHP’s native serializer removes one variable.

The Elementor angle

The --bigkeys output also showed a pattern of large compiled CSS entries from Elementor. For context, the compiled CSS for the homepage alone was sitting at roughly 450KB — held in Redis memory and read on every request. That adds up fast on a site with many Elementor-built pages.

The fix here was running Elementor → Regenerate CSS & Data, which forced a clean rebuild of those compiled files. The more durable change is switching Elementor’s CSS print method to External File mode, which writes CSS to disk instead of serializing it into the database and then into Redis. The Element Cache feature also helps if you set a controlled TTL rather than letting it inherit the default.

I should have caught the Elementor contribution earlier. The --bigkeys scan was the first time I looked at Redis at key-level granularity rather than just watching aggregate memory usage in a dashboard.

What this actually changed

The 500s have not returned in the observation period since the changes. Whether that is down to the TTL expiring the bad notoptions key, the serializer setting closing off a specific deserialization failure path, or the extra headroom from the 512MB memory limit, I cannot say with precision. Probably all three contributed, which is an unsatisfying answer.

The real blind spot here was assuming that Redis object caching was net-positive without ever instrumenting it for size, only for hit rate. Hit rate was fine. The cache was technically “working.” The 4MB key never appeared in any dashboard because nothing was watching for key-level anomalies — only aggregate cache performance.

I’ve since added a weekly script that scans for keys exceeding 1MB and flags them in a log. That turns this from intermittent guesswork into a measurable, repeatable check. The threshold is arbitrary for now, but having any threshold is better than none. The next time something grows out of control, I want to see it before the 500s start.
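The check itself is small. What follows is a sketch of the idea rather than the exact script. It assumes a client object with redis-py style scan_iter() and memory_usage() methods. The function names, the logger name, and the demo key names are hypothetical, and FakeClient exists only so the example runs without a Redis server.

```python
import logging

logger = logging.getLogger("redis-bigkeys")


def find_large_keys(client, threshold_bytes=1_000_000, match="*"):
    """Return (key, size) pairs for keys whose reported size exceeds the threshold.

    `client` is anything with redis-py style `scan_iter(match=...)` and
    `memory_usage(key)` methods.
    """
    large = []
    for key in client.scan_iter(match=match):  # incremental SCAN, not a blocking KEYS
        size = client.memory_usage(key)        # None if the key vanished mid-scan
        if size is not None and size > threshold_bytes:
            large.append((key, size))
    return sorted(large, key=lambda item: item[1], reverse=True)


def log_large_keys(client, threshold_bytes=1_000_000):
    """Log every oversized key as a warning and return the flagged list."""
    large = find_large_keys(client, threshold_bytes)
    for key, size in large:
        logger.warning("large key %s: %d bytes", key, size)
    return large


class FakeClient:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self, sizes):
        self.sizes = sizes

    def scan_iter(self, match="*"):
        return iter(self.sizes)

    def memory_usage(self, key):
        return self.sizes.get(key)


# Illustrative key names and sizes only; real keys carry a site-specific prefix.
demo = FakeClient({
    "options:notoptions": 3_971_911,
    "site-transient:feed_example": 500_000,
    "post_meta:example": 220_000,
})
flagged = log_large_keys(demo)
print(flagged)
```

Against a real server, a redis-py client such as redis.Redis(db=1) could be passed in place of FakeClient, with the script run weekly from cron. Note that MEMORY USAGE includes per-key overhead, so its figures will not match the string lengths that --bigkeys reports exactly.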
