WordPress staging that actually mirrors production: what it took to get there

·1336 words·7 mins
Tech WordPress DevOps Infrastructure

The premise sounds obvious: test before you update. The problem is that most WordPress staging setups test the wrong thing. A local Docker container will tell you, in most configurations, whether a plugin activates without a fatal error — but it won’t tell you whether your LiteSpeed cache behaves correctly, whether Plesk’s PHP-FPM config survives the change, or whether Cloudflare is quietly routing traffic somewhere unexpected. I learned that last one the hard way, through a debugging session where the apparent WordPress problem turned out to have nothing to do with WordPress.

What started as a sysops hygiene project — a repeatable weekly update workflow for a handful of production WordPress sites — turned into a deeper look at where staging can actively mislead you.

Why local Docker isn’t enough for this class of work

Local Docker is a fine development environment, but for sysops workflow validation it’s the wrong tool.

When I’m testing whether a WordPress core update, a WooCommerce bump, or a plugin combination will survive in production, I need the test environment to share the same constraints as production: the same web server (LiteSpeed, in this case), the same PHP-FPM configuration managed by Plesk, the same caching layer, and ideally the same volume of Action Scheduler jobs, webhook endpoints, and third-party integrations running in the background. A Docker container gives you WordPress and a database but none of that surrounding context, and the context is where failures tend to hide.

The approach I settled on was a throwaway subdomain on the actual Plesk server: a real vhost, provisioned via Plesk WP Toolkit, populated by syncing production down to it. The local machine stays the control plane — it runs orchestration scripts over SSH and runs Playwright smoke tests externally against the staging URL. The staging environment itself lives on the same infrastructure as production, which keeps staging genuinely close to real conditions without requiring a separate server.

The neutralization problem after a production sync

Once you copy production to staging, you have a live replica of your production site sitting on a throwaway subdomain: useful for testing, and a liability if you don’t treat it carefully. I know this because the first time I skipped the cleanup step, the CRM received a test order event and created a duplicate contact record. It was a low-stakes version of a higher-stakes failure, but it made the problem concrete.

A production WordPress site for an e-commerce or marketing operation typically has a lot of outbound activity: CRM webhooks, Action Scheduler queues processing email sends and order events, pixel and analytics scripts firing on every page load, shipping integrations resolving rates, SEO plugins pinging for index status. A naive copy of that environment will do all of those things from a staging URL, and some of them will cause real damage.

I ended up with what I think of as a post-copy neutralization pass: a set of operations that run automatically after each production-to-staging sync, before any smoke testing begins. The specific items depend on your plugin stack, but the pattern splits into two categories.

Things that cause active damage if left on:

  • Outbound webhook endpoints — blank or reroute them
  • Action Scheduler queues — clear or pause so background jobs don’t run
  • Transactional email — redirect through a trap (I use a staging-specific SMTP route that discards or captures rather than delivers)
  • Pixel and tracking scripts — disable, or replace real account IDs with dead placeholder values

Things that clutter but don’t break anything:

  • wp_options values like site URL, admin email, and debug settings — re-apply staging-safe versions
  • Read-only integrations (an SEO plugin reading its own database, for instance) — I leave these active where I can confirm they’re harmless

The goal isn’t to disable everything; it’s to know exactly what’s live and make a deliberate call on each integration rather than leaving it to chance. This pass is not exciting work, but the duplicate contact record was a good enough argument for doing it every time.
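
A minimal sketch of how part of such a pass could be scripted with WP-CLI, not the actual script used here. It assumes WP-CLI is installed on the server; `STAGING_PATH`, `STAGING_URL`, the trap address, and the `crm_webhook_url` option name are hypothetical placeholders (real webhook option names depend on the plugin). Action Scheduler, email trapping, and pixel IDs are plugin-specific and deliberately left out. It defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Post-copy neutralization sketch. Dry-run by default: commands are printed,
# not executed. Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"
STAGING_PATH="${STAGING_PATH:-/path/to/staging/docroot}"   # placeholder path
STAGING_URL="${STAGING_URL:-https://staging.example.com}"  # placeholder URL

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

neutralize() {
  # Core WordPress options: point the copy at the staging URL and
  # send admin notifications to a trap address (assumed address).
  run wp --path="$STAGING_PATH" option update home "$STAGING_URL"
  run wp --path="$STAGING_PATH" option update siteurl "$STAGING_URL"
  run wp --path="$STAGING_PATH" option update admin_email "staging-trap@example.com"

  # HYPOTHETICAL option name: blank an outbound webhook endpoint.
  # Look up the real option key for your CRM plugin before using this.
  run wp --path="$STAGING_PATH" option update crm_webhook_url ""
}
```

Calling `neutralize` with the defaults prints four `DRY-RUN: wp ...` lines, which makes it easy to review what would change before setting `DRY_RUN=0`.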

The weekly gate: only promote if staging passes

With a neutralized staging environment in place, the weekly update workflow becomes straightforward:

  1. Sync production database and files to staging
  2. Create a staging restore point via Plesk WP Toolkit
  3. Run the neutralization pass
  4. Apply WordPress core and plugin updates on staging
  5. Run Playwright smoke tests against the staging URL from the local machine
  6. If tests pass: apply the same updates to production
  7. If tests fail: restore staging from the restore point, investigate, notify
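
The control flow of the steps above can be sketched in shell. This is an illustration of the gate logic only: every step function is a hypothetical stub that prints what it would do, standing in for the real SSH, Plesk WP Toolkit, and Playwright invocations, which are not shown in this post.

```shell
#!/usr/bin/env bash
# Weekly gate sketch. All step functions are HYPOTHETICAL stubs; replace
# their bodies with real commands for your environment.
sync_prod_to_staging()  { echo "sync: production -> staging"; }
create_restore_point()  { echo "restore point: created"; }
neutralize_staging()    { echo "neutralize: done"; }
apply_updates_staging() { echo "updates: applied on staging"; }
run_smoke_tests()       { echo "smoke: running"; return 0; }
apply_updates_prod()    { echo "updates: applied on production"; }
restore_staging()       { echo "restore: staging rolled back"; }
notify()                { echo "notify: $*"; }

run_weekly_gate() {
  # Steps 1-3: a failure here aborts before anything is updated.
  sync_prod_to_staging || { notify "sync failed"; return 1; }
  create_restore_point || { notify "restore point failed"; return 1; }
  neutralize_staging   || { notify "neutralization failed"; return 1; }

  # Step 4: updates on staging; roll back if they fail outright.
  if ! apply_updates_staging; then
    restore_staging
    notify "staging update failed"
    return 1
  fi

  # Steps 5-7: production is only touched when the smoke tests pass.
  if run_smoke_tests; then
    apply_updates_prod
  else
    restore_staging
    notify "smoke tests failed; production untouched"
    return 1
  fi
}
```

The point of the structure is that `apply_updates_prod` sits behind exactly one condition, so there is no path to production that skips the smoke tests.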

The Playwright tests aren’t exhaustive — they cover the homepage, a product or content page, checkout initiation, and login. Enough to catch a white screen of death, a broken layout, or a missing critical element. The goal is a fast gate, not a full regression suite.

Having the restore point before applying updates matters more than it sounds. When something breaks on staging you want to reset cleanly and try again without re-syncing production, because re-syncing is the slow step.

The Cloudflare detour that looked like a WordPress problem

While validating this workflow on one of the sites, I ran into a breakage that cost me more time than it should have.

The symptom: visiting the domain returned only the site header, no content, and wp-login.php returned a 404. The Plesk error log showed:

File not found: /var/www/vhosts/default/htdocs/wp-login.php

That path — /var/www/vhosts/default/htdocs/ — is Plesk’s fallback docroot, not the actual vhost docroot. Something was resolving the request against the wrong directory.

I ran plesk repair web on the domain:

Checking web server configuration
Repair web server configuration for the domain? [Y/n] y
Repairing web server configuration for the domain .......... [OK]
Checking php-fpm configuration
Repair php-fpm configuration for the domain? [Y/n] y
Repairing php-fpm configuration for the domain .......... [OK]
Error messages: 0; Warnings: 0; Errors resolved: 0

Clean. No errors were resolved because there was nothing to resolve: Plesk’s own configuration was fine. I reread that output twice before I accepted what it was telling me and tried something different.

I tested directly against the server’s internal IP, bypassing DNS entirely:

curl -sk --resolve example.com:443:<internal-ip> https://example.com/ | sed -n '1,60p'

Full WordPress homepage. wp-login.php returned 200. The vhost, LiteSpeed, PHP-FPM, and the WordPress files were all working correctly.

So the public URL was serving a static index.html from /var/www/vhosts/default/htdocs/ — which matched exactly what I was seeing in the browser. Cloudflare was routing traffic somewhere else, and whatever it was routing to was serving that fallback file.

Switching the Cloudflare DNS records for the apex and www from Proxied to DNS only fixed it immediately. Full site came back, login worked, no further changes needed.

I still don’t know what changed in Cloudflare to cause this — nothing in the configuration had been touched recently. My best guess is a Cloudflare-side routing anomaly; I can’t say whether it would have resolved on its own, because switching the DNS record to DNS-only fixed it before I had time to find out. What I can say is that Plesk, LiteSpeed, the vhost config, and WordPress were never the problem.

The lesson I’ll carry from this: when a WordPress problem only reproduces through a public URL and not through a direct-to-origin request, stop debugging WordPress — a broken proxy can look exactly like a broken CMS, and without the habit of testing directly against origin, I would have spent hours in the wrong layer.
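
That habit can be turned into a small check. The sketch below compares the HTTP status of the same path fetched through the public route and directly from origin using curl's `--resolve`. The function names and the verdict labels are made up for illustration, and the host and IP are whatever you pass in.

```shell
#!/usr/bin/env bash
# http_status URL [extra curl args...]: print only the HTTP status code.
# curl prints 000 when the request itself fails.
http_status() {
  local url="$1"; shift
  curl -sk -o /dev/null -w '%{http_code}' --max-time 10 "$@" "$url"
}

# classify_layer PUBLIC_CODE ORIGIN_CODE: rough hint at which layer to debug.
classify_layer() {
  local public="$1" origin="$2"
  if [ "$public" = "$origin" ]; then
    if [ "$origin" = "200" ]; then echo "ok"; else echo "origin"; fi
  elif [ "$origin" = "200" ]; then
    # Origin is healthy but the public route is not: look at proxy or DNS.
    echo "proxy-or-dns"
  else
    echo "unclear"
  fi
}

# check_origin_vs_public HOST ORIGIN_IP [PATH]
check_origin_vs_public() {
  local host="$1" origin_ip="$2" path="${3:-/wp-login.php}"
  local public origin
  public="$(http_status "https://${host}${path}")"
  origin="$(http_status "https://${host}${path}" --resolve "${host}:443:${origin_ip}")"
  echo "public=${public} origin=${origin} verdict=$(classify_layer "$public" "$origin")"
}
```

Usage would look like `check_origin_vs_public example.com <internal-ip>`. Status codes are a coarse signal: a fallback page can also return 200, so a homepage mismatch like the one described here would likely need a body comparison as well.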

What changed after all of this

The staging workflow now runs weekly and has surfaced two plugin conflicts before the production gate ran — one between a shipping rate plugin and a WooCommerce update, one involving a cache plugin that silently broke the checkout session after a WordPress core bump. Neither would have been obvious from a local Docker environment.

I still have gaps. The neutralization pass is manual in places where I’d prefer automation, and the Playwright tests are thinner than I’d like on mobile flows. But the gate is real now, and that’s the thing that was missing before.
