Understanding Content Caching

with WordPress: what it is and how it works

Mark Montague

mark@catseye.org

Many slides have notes with extra material:

View this presentation at
http://www-personal.umich.edu/~markmont/wpcc/

Source files for this presentation can be downloaded from
http://www-personal.umich.edu/~markmont/wpcc.zip

Disclaimer

The description of this talk on the WP Ann Arbor web site said that this presentation would not cover how to set up caching for your WordPress site. If you want that, please refer to the presentation done at WordCamp Ann Arbor earlier this month by Topher DeRosia:

Site Caching, From Nothing to Everything

or see the documentation for the W3 Total Cache or WP Super Cache plugins.

Instead, this presentation focuses on an in-depth examination of what caching is and how it works behind the scenes.

What is caching?

Picture of an open cardboard box, symbolizing a cache

Content caching is saving a copy of content that a visitor asks for in a special, high-speed location called a cache, so that it can be sent again very quickly to the next visitor who asks for it. This avoids the need to generate the content again for each visitor.

Flowchart showing how a cache is used
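
The flow above can be sketched in a few lines of Python. This is only an illustration, not how any particular plugin works: generate_page is a hypothetical stand-in for the expensive work WordPress does, and a plain dictionary plays the role of the cache.

```python
cache = {}        # the "box": maps a URL path to a saved copy of the content
generated = []    # records every time content actually had to be generated


def generate_page(path):
    """Hypothetical stand-in for expensive page generation."""
    generated.append(path)
    return f"<html>content for {path}</html>"


def serve(path):
    """Return the content for path, generating it only on a cache miss."""
    if path in cache:                # cache hit: reuse the saved copy
        return cache[path]
    content = generate_page(path)    # cache miss: do the expensive work
    cache[path] = content            # save a copy for the next visitor
    return content


first = serve("/hello-world/")    # first visitor: the page is generated
second = serve("/hello-world/")   # next visitor: served from the cache
```

The second call returns the same content without calling generate_page again, which is the whole point of a content cache.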

Types of caching

Picture of an open cardboard box, symbolizing a cache

While content caching is saving a copy of the content that a visitor asks for so that it can be sent again to other visitors, there are also other types of caching that a WordPress site can do:

  • Database query caching
  • Object caching / transient caching
  • Opcode caching (PHP script caching)

All of these save expensive-to-generate intermediate results so that they can be reused to speed up generating responses to future requests for articles.

This presentation is only concerned with content caching, that is, caching copies of HTML pages, CSS files, JavaScript files, image files, font files, and so on. By caching and re-using content, we don't care as much about speeding up generating the content in the first place (although, it's best to speed things up there, too).

Why do we care about caching?

Caching speedup example

Graph showing content caching speedup

This is a fairly simple site with a small number of plugins and several other performance optimizations enabled (such as gzip compression).

The speedup will be larger with a more out-of-the-box WordPress configuration.

Something to keep in mind

The W3 Total Cache and WP Super Cache plugins do:

  • CDN support
  • Content compression (gzip)
  • HTML minification
  • CSS minification
  • Content caching
  • JavaScript minification
  • CSS / JavaScript inlining
  • Database object caching
  • PHP opcode caching

...so it is more accurate to call them "performance optimization plugins" rather than caching plugins.

There are also other speedups that you can and should do that are not part of these plugins: minimizing number of plugins, tuning the database, and much more.

In this presentation we will look ONLY at web content caching, even though the other speedups are important, too.

When NOT to cache

Telling caches to throw away the things they have cached is called "invalidating" the cache or "purging" it. This can be done for only a single thing in the cache, or for everything in the cache.

Caches can only be purged or invalidated when they are under control of WordPress. In particular, WordPress is not able to purge visitor web browser caches or ISP/corporate caches.

Note that turning off caching for all logged-in users slows things down considerably for them and puts extra load on the web server, but is a necessary trade-off.

Where does content caching occur?


Diagram showing that in the default configuration, caching only occurs in the user's web browser.

By default, caching occurs only in the user's web browser.

The WordPress web server usually needs additional configuration in order for the web browser cache to be fully effective.

Where does content caching occur?


Diagram showing the WordPress web server configured with a WordPress plugin for caching.

The combination of WordPress caching and web browser caching is a very common configuration.

Where does content caching occur?


Diagram showing a dedicated caching server sitting in between the WordPress web server and the Internet.

A specialized, dedicated caching server between WordPress and the Internet can greatly speed up WordPress and handle heavy visitor loads. This usually requires special control by the WordPress server and so is usually used in conjunction with a WordPress caching plugin.

Caching servers include Varnish, Squid, and mod_cache.

Where does content caching occur?


Diagram showing Content Distribution Network (CDN) caches inside the Internet 'cloud'.

A Content Distribution Network (CDN) such as CloudFlare or MaxCDN is usually used in conjunction with a WordPress caching plugin but without a caching server. (A caching server doesn't add any benefit when behind a CDN).

The CDN will have caches (nodes) in different geographical regions, and each user will use the cache that is closest to them. This both reduces the number of network hops for traffic and spreads traffic out across multiple caches for performance and fault tolerance.

Where does content caching occur?


Diagram showing the position of an ISP or corporate cache (or proxy) between the visitor and the Internet.

If a user is accessing a WordPress site from work, their employer may put a cache or proxy in between their computer and the Internet. Or, a consumer's ISP may put a cache in between them and the Internet.

This is often done to save on bandwidth costs, or to deal with high-latency networks, such as when a visitor is using a satellite uplink to access the Internet. However, it can also be done to control, monitor, or alter the content the user is accessing, either with or without caching the content.

All of the other forms of caching we have looked at affect all visitors to the site, but this form will only affect some visitors to the site, usually a very small minority. Problems caused by this form of caching can therefore be difficult to diagnose, especially if the ISP or corporate cache chooses not to adhere to Internet standards (that is, when it doesn't "play nicely").

Where does content caching occur?


Diagram showing that caching can occur in the user's web browser, at their ISP, on the Internet (for example, in a Content Distribution Network), on a special caching server (such as Varnish) in front of the WordPress web server, or on the WordPress web server / within WordPress itself.

Here are all of the places, together, that caching can occur. While it is possible for all types of cache to be present simultaneously, it would be unusual.

KEY POINT: Each of these open boxes could contain a different saved copy of each of your articles, images, and other content.

Browser caching

Diagram showing where the web browser cache is on the network

Browser caching

Locations of browser caches under MacOS X:

Clear the browser cache and then see what happens in these directories when you access various URLs:

There are similar locations under Microsoft Windows, Linux, and on smartphones. Microsoft Internet Explorer and Opera also have web browser caches. Do a web search if you want to find where these are.

Note that both Chrome and Firefox can have multiple user profiles, each with their own cache. Chrome names the default profile "Default", but Firefox will choose a random name for each profile.

Browsers may group their cache as a part of their history or private data. When you clear one of these things, you may be asked what time period to clear; make sure you specify "from the beginning of time" or "everything", or the cache will only be partially cleared. I also recommend checking the checkboxes to clear all types of data.

You may want to use a different browser from the one you use day-to-day when doing this so that you don't lose important data or configuration.

Caching - general

Caching is actually fairly tricky. If something is cached for too long, visitors will get out of date content. On the other hand, if something is not cached at all, or not cached as long as it could be, then visitors won't get the performance and scalability that are the reasons we are doing caching in the first place.

Modern web browsers and modern web servers currently exchange information via version 1.1 of the Hypertext Transfer Protocol (HTTP), which is documented in a series of six RFCs (RFC 7230 - 7235).

RFC 7234 describes how caching works. The "meat" of this document is only 32 pages, so consider looking at it.

Rather than explaining the standard, let's take a look at a real-world example.

RFC stands for "Request for Comments": some are just proposals, others are informative memoranda, and yet others (including the HTTP/1.1 RFCs) are designated by the Internet Engineering Task Force and the Internet Society as official standards.

HTTP/2, which is the next version of HTTP, will hopefully be finalized as a standard soon (December 2014 - February 2015).

Caching - general

Clear the browser cache and then look at the HTTP request for a JPEG file (https://dev.catseye.org/wpannarbor.jpg):

GET /wpannarbor.jpg HTTP/1.1
Host: dev.catseye.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:32.0) Gecko/20100101 Firefox/32.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive

This is a standard HTTP request; there's nothing special about it.

Caching - general

The HTTP response for the image with cache-related lines highlighted:

HTTP/1.1 200 OK
Date: Mon, 27 Oct 2014 19:43:55 GMT
Server: Apache/2.4
Last-Modified: Mon, 27 Oct 2014 16:58:26 GMT
Etag: "2472-5066a706efac9"
Accept-Ranges: bytes
Content-Length: 9330
Cache-Control: max-age=604800
content-security-policy: default-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; img-src 'self' data: ; font-src 'self' data: ; report-uri /csp-report.php
Age: 300
X-Cache: HIT from dev.catseye.org
X-Cache-Detail: "cache hit" from dev.catseye.org
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: image/jpeg

The X-Cache and X-Cache-Detail lines explain that the JPEG came from the web server's cache rather than being retrieved by the web server itself. You shouldn't encounter these lines on publicly accessible sites.

This response is from a WordPress web server that has its own cache (Apache HTTP Server mod_cache_disk, which was named mod_disk_cache before Apache 2.4). Since HTTPS is being used, we know that there are no other caches between the WordPress web server cache and the browser cache.

Note that nearly half of the lines affect caching!

The two X-Cache lines are actually debugging information that I've had the web server add to make it more obvious what is going on; they are not a part of the HTTP/1.1 standard.

Caching - general

Removing the debugging information and reordering things gives us:

Date: Mon, 27 Oct 2014 19:43:55 GMT
Age: 300

Last-Modified: Mon, 27 Oct 2014 16:58:26 GMT
Etag: "2472-5066a706efac9"

Cache-Control: max-age=604800

If both Age and Date are present, Age will be used. In this case, both are present because the JPEG file came from the web server cache rather than from the web server itself; if it had come from the web server itself, only the Date header would be present.

Caching - general


Date: Mon, 27 Oct 2014 19:43:55 GMT
Age: 300

Last-Modified: Mon, 27 Oct 2014 16:58:26 GMT
Etag: "2472-5066a706efac9"

Cache-Control: max-age=604800

The web browser will use these things later on when asking the web server cache and web server if they have a newer version of the image.

Caching - general


Date: Mon, 27 Oct 2014 19:43:55 GMT
Age: 300

Last-Modified: Mon, 27 Oct 2014 16:58:26 GMT
Etag: "2472-5066a706efac9"

Cache-Control: max-age=604800

Cache-Control contains instructions from WordPress, the web server, and any other caches about whether the web browser should store the response in its own cache. Multiple directives can be present after the colon, separated by commas.
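
As an illustration of the "directives separated by commas" format, here is a minimal Python sketch that splits a Cache-Control value into its directives. The function name parse_cache_control is made up for this example, and the parsing is simplified: it assumes no directive argument contains a comma inside a quoted string.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a dict of directives.

    Directives without an argument (such as no-cache) map to None.
    Simplified: assumes no quoted argument contains a comma.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


image_directives = parse_cache_control("max-age=604800")
```

The same function handles headers with several directives, such as "no-cache, must-revalidate, max-age=0".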

max-age=604800 says that cached copies of this JPEG file should only be considered "fresh" for a maximum of 604,800 seconds (1 week). After this, the JPEG file will be considered "stale" and if the web browser needs it again past this point, it will try to get a new copy rather than using the copy in its cache.

Note that if the web browser tries and fails to get a fresh copy of the JPEG file after it becomes stale, it will still use the stale copy that it has in its cache.
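
The fresh/stale decision can be sketched as simple arithmetic. This is a simplified version of the calculation in RFC 7234: it assumes the only inputs are the max-age directive, the Age header received with the response, and the number of seconds the copy has been sitting in the browser cache. The function and variable names are made up for this example.

```python
def is_fresh(max_age, age_header, seconds_in_cache):
    """Return True while a cached copy may be used without revalidation.

    Simplified: the current age is the Age value received with the
    response plus the time the copy has spent in this cache.
    """
    current_age = age_header + seconds_in_cache
    return current_age < max_age


MAX_AGE = 604800   # from Cache-Control: max-age=604800
AGE = 300          # from Age: 300

fresh_after_one_day = is_fresh(MAX_AGE, AGE, 24 * 60 * 60)
fresh_after_one_week = is_fresh(MAX_AGE, AGE, 7 * 24 * 60 * 60)
```

Because the copy was already 300 seconds old when the browser received it, it becomes stale 300 seconds before a full week has passed in the browser cache.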

Caching - general


Date: Mon, 27 Oct 2014 19:43:55 GMT
Age: 300

Last-Modified: Mon, 27 Oct 2014 16:58:26 GMT
Etag: "2472-5066a706efac9"

Cache-Control: max-age=604800

What will the browser do? It will store the response in its cache IF

Otherwise, the response will not be stored in the cache, and the web browser will instead request it again the next time it is needed.

The Expires header is an older version of Cache-Control: max-age. Instead of giving the number of seconds the resource remains valid, it gives the date at which the resource becomes invalid.
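
To make the relationship concrete, the following Python sketch converts a max-age value into the equivalent Expires date, using the Date header from the example response. It is a simplification that ignores the Age header, and the function name expires_from_max_age is made up for this example.

```python
from datetime import timedelta
from email.utils import format_datetime, parsedate_to_datetime


def expires_from_max_age(date_header, max_age):
    """Return the Expires value equivalent to max_age seconds after Date."""
    generated_at = parsedate_to_datetime(date_header)
    expires_at = generated_at + timedelta(seconds=max_age)
    return format_datetime(expires_at, usegmt=True)


# Date and max-age taken from the example response for wpannarbor.jpg.
expires = expires_from_max_age("Mon, 27 Oct 2014 19:43:55 GMT", 604800)
# expires == "Mon, 03 Nov 2014 19:43:55 GMT"
```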

The Pragma: no-cache header is an older version of Cache-Control: no-cache.

WordPress can prevent web browser — or other — caches from storing things by specifying Cache-Control: no-store. (Note that private will only prevent something from being cached if the cache is shared by multiple clients.)

The Authorization header is used by HTTP Basic Authentication and HTTP Digest Authentication, neither of which are used by WordPress. It indicates that the visitor is logged in and hence that the response probably should not be cached. (Note that since WordPress does not use these forms of authentication, it will set Cache-Control: private for its logged in users, instead.)

Caching - general

Now let's go back to the web browser tab in which we loaded the JPEG image, and, still capturing HTTP requests and responses, request the image again by clicking in the web browser location bar and pressing Return.

We don't see any HTTP traffic, because the web browser retrieved the JPEG image from its cache.

Next, click the web browser Reload button. This will — at least in Firefox — ask the web server to check to see if a new version of the image is available on the web server, and, if so, to get it.

(Note that if you hold the Shift key while clicking Reload, instead of checking if a new version is available, Firefox will just go get the resource again, just like it did the first time it loaded it, completely disregarding anything in its cache.)

Caching - general

HTTP request sent when we click the Reload button:

GET /wpannarbor.jpg HTTP/1.1
Host: dev.catseye.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:32.0) Gecko/20100101 Firefox/32.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
If-Modified-Since: Mon, 27 Oct 2014 16:58:26 GMT
If-None-Match: "2472-5066a706efac9"
Cache-Control: max-age=0

The Cache-Control: max-age=0 header, when used in a request, tells any caches that may see this request to only respond if what they have is at most 0 seconds old — that is, never, thus letting this request all the way through to the WordPress web server.

Caching - general


GET /wpannarbor.jpg HTTP/1.1
If-Modified-Since: Mon, 27 Oct 2014 16:58:26 GMT
If-None-Match: "2472-5066a706efac9"

This tells the web server to only respond with a new copy of the JPEG file if the copy the web server has...

Note that doing a GET with If- headers is much more efficient than doing a HEAD, interpreting the response, and doing a regular GET if a new version is available.

A HEAD should only be used if the client needs to check to see if a new version of a resource is available but does not need to retrieve it.

Caching - general

And here is the web server's response to the reload request:

HTTP/1.1 304 Not Modified
Date: Mon, 27 Oct 2014 20:12:34 GMT
Server: Apache/2.4
Connection: Keep-Alive
Keep-Alive: timeout=5, max=100
Etag: "2472-5066a706efac9"
Cache-Control: max-age=604800
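
The decision the server makes for a conditional request can be sketched in Python as follows. This is a simplified model, not Apache's implementation: conditional_status is a made-up name, entity tags are compared as plain strings, and weak validators (the W/ prefix) are not handled. It does follow the rule from RFC 7232 that If-None-Match takes precedence over If-Modified-Since when both are present.

```python
from email.utils import parsedate_to_datetime


def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still current, else 200."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match wins; If-Modified-Since is ignored when it is present.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if if_none_match.strip() == "*" or etag in candidates:
            return 304
        return 200
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        unchanged = (parsedate_to_datetime(last_modified)
                     <= parsedate_to_datetime(if_modified_since))
        if unchanged:
            return 304
    return 200


# Values from the example request and response for wpannarbor.jpg.
ETAG = '"2472-5066a706efac9"'
LAST_MODIFIED = "Mon, 27 Oct 2014 16:58:26 GMT"

reload_status = conditional_status(
    {"If-Modified-Since": LAST_MODIFIED, "If-None-Match": ETAG},
    ETAG,
    LAST_MODIFIED,
)
```

With the headers from the reload request, the result is 304, so the server sends only headers and the browser keeps using the copy in its cache.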

Caching - shared

Diagram showing the location of all caches on the network

A shared cache is one that is used by multiple visitors. Most caches are shared. The exceptions to this are web browser caches and private proxying caches, which are private to a single user.

A WordPress cache, a web server cache, a CDN, and an ISP or corporate cache are all examples of shared caches.

WordPress can use the Cache-Control: s-maxage=XXXXXX header to control how long something is stored by shared caches. s-maxage works exactly the same as max-age except that private caches will ignore s-maxage.

In the diagram, all of the caches are shared caches except for the web browser cache.

Caching - shared

Diagram showing the location of the web server cache on the network

In the following example, the resource (page) will be cached by the web server cache for 4 hours because s-maxage=14400 is set, but it will not be cached by the web browser cache because max-age=0 is set.

This configuration can be used if you know you only have a single shared cache between you and the web browser (for example, because you are using HTTPS) which is under control of WordPress. WordPress can then purge the page from the shared cache if it changes; otherwise, the cache will avoid unnecessarily regenerating the page.

Cache-Control: max-age=0, s-maxage=14400

In addition to s-maxage=0, the Cache-Control: private directive can be used to prevent something from being stored in a shared cache.
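
The difference between how shared and private caches read the same header can be sketched in Python. This is a simplified model under stated assumptions: freshness_lifetime is a made-up name, only the no-store, private, s-maxage, and max-age directives are considered, and a response with no explicit lifetime is treated as immediately stale (real caches may apply heuristics instead).

```python
def parse_cache_control(value):
    """Split a Cache-Control value into a dict (simplified parser)."""
    directives = {}
    for part in value.split(","):
        name, sep, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg if sep else None
    return directives


def freshness_lifetime(cache_control, shared):
    """Seconds the cache may treat the response as fresh.

    Returns None when this kind of cache must not store the response.
    """
    directives = parse_cache_control(cache_control)
    if "no-store" in directives:
        return None
    if shared and "private" in directives:
        return None
    if shared and "s-maxage" in directives:
        return int(directives["s-maxage"])   # shared caches prefer s-maxage
    if "max-age" in directives:
        return int(directives["max-age"])
    return 0   # assumption of this sketch: no explicit lifetime means stale


HEADER = "max-age=0, s-maxage=14400"
web_server_cache = freshness_lifetime(HEADER, shared=True)    # 14400
web_browser_cache = freshness_lifetime(HEADER, shared=False)  # 0
```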

The Vary header

WordPress or the web server can set the Vary header in HTTP responses to indicate that the response that was generated is specific to one or more HTTP request headers.

For example, the response header

Vary: Accept-Encoding

means that the response is either compressed or not compressed, depending on what the client said that it could accept. Compressed content must not be sent to browsers that are not able to deal with it, and if uncompressed content gets sent to a browser that can deal with compressed content, then performance will be slower than it needs to be.

Another common example would be for WordPress to set Vary: User-Agent in responses to indicate that it is serving different versions of the site to traditional versus mobile web browsers.

Note that the Vary header only affects shared caches, not private caches.

The Vary header

If the Vary header is set, caches will store a different version of the content for each different value of the headers named by the Vary header. A cache will only serve a piece of content to a client if the client's request headers exactly match the request headers specified by the Vary header that applied when the content was cached.

For example, if four web browsers all request a site's main page:

Then the main page will be generated and stored in the cache three separate times (there will be three copies), once for each of clients 1, 2, and 3. Client 4 will then get the cached copy that was generated for client 2.

Note that this only applies if WordPress or the web server sets the Vary: Accept-Encoding header in its responses. If it doesn't, then the Accept-Encoding header in requests is ignored, and the main page will only be generated and cached once, for client 1. Clients 2, 3, and 4 will then get the copy that was cached when client 1 requested it.
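
One way to picture this is that the headers named by Vary become part of the cache key. The Python sketch below illustrates the idea. The class VaryCache is made up for this example, and the Accept-Encoding values for the four clients are assumptions chosen so that clients 2 and 4 send identical headers, as in the scenario described above.

```python
class VaryCache:
    """Toy cache whose key includes the request headers named by Vary."""

    def __init__(self, vary=()):
        self.vary = tuple(name.lower() for name in vary)
        self.store = {}
        self.generated = 0   # how many times the page had to be generated

    def _key(self, path, request_headers):
        headers = {k.lower(): v for k, v in request_headers.items()}
        # Exact string match on each header named by Vary.
        return (path,) + tuple(headers.get(name, "") for name in self.vary)

    def get(self, path, request_headers):
        key = self._key(path, request_headers)
        if key not in self.store:
            self.generated += 1
            self.store[key] = f"page generated for {key}"
        return self.store[key]


# Hypothetical request headers for clients 1 through 4.
CLIENTS = [
    {"Accept-Encoding": "gzip, deflate"},
    {"Accept-Encoding": "gzip"},
    {},                                   # no Accept-Encoding header at all
    {"Accept-Encoding": "gzip"},
]

with_vary = VaryCache(vary=["Accept-Encoding"])
responses = [with_vary.get("/", headers) for headers in CLIENTS]

without_vary = VaryCache()
for headers in CLIENTS:
    without_vary.get("/", headers)
```

With Vary: Accept-Encoding the page is generated three times and client 4 receives client 2's copy; without it, the page is generated once.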

Static vs dynamic assets

There are two types of things that can be cached:

For these reasons, the easiest and safest thing to do is to only cache static assets, and always generate dynamic assets for every request. But most static assets are quick to serve up, and the big performance improvements are all from correctly caching dynamic assets.

Also for these reasons, many CDNs have a low-end or "starter" mode where they only cache static assets.

Static assets

Static assets

For example, if there is a file named

my-buttons.css

and it is accessed via the URI path

/wp-content/plugins/my-plugin/my-buttons.css

then make sure all of your PHP, CSS, JavaScript, and HTML files always access it as something like

/wp-content/plugins/my-plugin/my-buttons.css?ver=1.2

The ?ver=1.2 will be ignored by the web server but treated as a part of the file name by caches.

Static assets

When it's time to upgrade the file, change all of the places that reference it from

/wp-content/plugins/my-plugin/my-buttons.css?ver=1.2

to

/wp-content/plugins/my-plugin/my-buttons.css?ver=1.3

Then, the next time a file referencing my-buttons.css is served by the web server, the browser will request version 1.3. Since only the copy with ?ver=1.2 is in the caches, the request for my-buttons.css will fall through to the web server. Version 1.3 will get cached; version 1.2 won't be served from the cache again and will eventually reach its maximum lifetime and be dropped from the caches.
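
The reason this works is that caches key on the full URL, query string included, while the web server maps both URLs to the same file. A minimal Python sketch of that behaviour, with a dictionary standing in for a cache and a made-up helper named versioned:

```python
from urllib.parse import urlsplit


def versioned(path, version):
    """Append a ?ver= query string to a static asset path."""
    return f"{path}?ver={version}"


CSS = "/wp-content/plugins/my-plugin/my-buttons.css"
old_url = versioned(CSS, "1.2")
new_url = versioned(CSS, "1.3")

# A cache keys on the whole URL, so only the 1.2 URL is present.
cache = {old_url: "cached copy of the version 1.2 stylesheet"}

old_is_cached = old_url in cache    # True: still served from the cache
new_is_cached = new_url in cache    # False: falls through to the web server

# The web server ignores the query string; both URLs name the same file.
same_file = urlsplit(old_url).path == urlsplit(new_url).path
```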

Note that this depends on files referencing the static asset being served freshly by the web server rather than from a cache. Ultimately, the file that references all other files will be a PHP script, a dynamic asset. We'll take a look at how dynamic assets are handled on the next few slides.

Dynamic assets

For dynamically generated pages that should never be cached (such as the WordPress dashboard and other pages generated for logged-in users), WordPress generates the following HTTP header:

Cache-Control: no-cache, must-revalidate, max-age=0

However, we want as many pages as possible to be cached so that they do not have to be generated by WordPress each time any visitor requests them, while still ensuring that no cached page is ever served if it would be regenerated differently.

To do this, we ensure that these dynamic assets only get put into caches that can be directly controlled by WordPress. Then, when WordPress changes something, it can reach into the cache via an API and tell the cache to purge (invalidate) affected assets.
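
The idea of a cache that WordPress can reach into can be sketched as a small Python class. This is only a model of the concept: PageCache, render, and the /post-123/ path are made up for this example, and real caches expose purging through their own mechanisms.

```python
class PageCache:
    """Toy cache that the application can purge when content changes."""

    def __init__(self, generate):
        self._generate = generate
        self._pages = {}

    def get(self, path):
        if path not in self._pages:
            self._pages[path] = self._generate(path)
        return self._pages[path]

    def purge(self, path):
        """Invalidate a single cached page."""
        self._pages.pop(path, None)

    def purge_all(self):
        """Invalidate everything in the cache."""
        self._pages.clear()


comment_counts = {"/post-123/": 0}


def render(path):
    """Hypothetical page generator that shows the comment count."""
    return f"{path}: {comment_counts[path]} comments"


cache = PageCache(render)
before = cache.get("/post-123/")
comment_counts["/post-123/"] += 1    # a visitor adds a comment
stale = cache.get("/post-123/")      # still the old page: nothing was purged
cache.purge("/post-123/")            # the application tells the cache to purge
after = cache.get("/post-123/")      # regenerated with the new comment count
```

Without the purge call the cache keeps serving the old page, which is exactly the situation that purging is meant to prevent.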

In particular, for dynamic assets, we must make sure that they are cached by some cache that is controlled directly by WordPress (such as the web server cache) while at the same time making sure that they are not cached by the web browser cache. This can be tricky, but the most common solution involves having the web server cache modify the caching headers before passing them further along.

Up until this point, everything has happened solely via the HTTP/1.1 protocol. The problem with this is that HTTP only allows things to happen in response to requests for specific resources for a single client. To provide the level of control we need to have our cake and eat it too in regards to dynamic assets, WordPress and the cache need some additional way to communicate.

Dynamic assets

WordPress is able to directly control the following types of caches, as WordPress and the cache know about each other and the cache trusts the WordPress site:

However, WordPress is NOT able to control the following caches via means other than HTTP, since they are set up by and run by people who don't necessarily trust the WordPress site:

Cache invalidation

How WordPress tells a cache to invalidate (purge, remove) something that it has cached depends on the specific caching software that is being used:

Some local caching WordPress plugins have the ability to use memcached, which is a memory-based cache rather than a disk-based cache. In this case, the plugin sends a delete command to memcached.

Cache invalidation

As an example, if a visitor adds a new comment to an article with WordPress post ID 123, the following URI paths need to be invalidated so that each of them will be regenerated the next time it is requested:

/(permalink to post 123)
/(permalink to post 123)/(page number)    multi-page posts
/(permalink to post 123)/comments-page-(page number)
/                                         site main page (may show comment count)
/page/(page number)                       additional main-page content
/feed/...                                 RDF, RSS, ATOM feeds
/comments/feed
/tag/(each tag of post 123)
/category/(each category of post 123)
/author/(author of post 123)
/(year of post 123)
/(year of post 123)/(month of post 123)
/(year of post 123)/(month of post 123)/(day of post 123)
/author/(author's username)
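
The list above can be approximated with a short Python function. Everything about the post here (permalink, date, author, tags, categories) is hypothetical, the URL patterns simply mirror the list, and the paged variants and the feed URLs under /feed/ are left out because they depend on site configuration.

```python
from datetime import date


def paths_to_purge(permalink, published, author, tags, categories):
    """Return URI paths to invalidate after a post gains a comment.

    Simplified: omits paged variants and the /feed/ URLs.
    """
    year, month, day = published.year, published.month, published.day
    paths = [
        permalink,
        "/",                                   # site main page
        "/comments/feed",
        f"/author/{author}",
        f"/{year}",
        f"/{year}/{month:02d}",
        f"/{year}/{month:02d}/{day:02d}",
    ]
    paths += [f"/tag/{tag}" for tag in tags]
    paths += [f"/category/{category}" for category in categories]
    return paths


# Hypothetical post used only for illustration.
purge_list = paths_to_purge(
    permalink="/2014/10/hello-world",
    published=date(2014, 10, 27),
    author="mark",
    tags=["caching", "http"],
    categories=["talks"],
)
```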

In addition, most WordPress caching plugins will provide a dashboard button that will purge the entire cache; this is useful during upgrades and other site maintenance.

Non-HTTP caching

Up until now we've talked mostly about caching at the HTTP protocol level. HTTP caching completely avoids running WordPress and its plugins unless a WordPress page needs to be regenerated. Also, the HTTP request is intercepted by the cache as early as possible, so that most of the work that the web server does for each HTTP request is avoided, too. For these reasons, HTTP caching is the fastest caching option available.

The downside to HTTP caching is that you need to set up some software to do it. This adds considerable complexity to a WordPress installation, and is not always possible. (We're ignoring web browser caching here, since we can't safely cache dynamic assets such as articles in the web browser cache, which loses a lot of performance.)

Non-HTTP caching

A WordPress caching plugin can save a copy of every page that is generated, however, and when new requests come in to WordPress itself, it can check to see if there is a saved copy that can be used instead of generating a new copy.

This avoids all of the complexity of HTTP caching, but is substantially slower:

  1. The web server has to do the full work of handling both the HTTP request and response.
  2. WordPress actually starts up to handle the request and does a fair portion of its initialization (although the caching plugin will avoid as much of the initialization as possible unless it turns out to be necessary to regenerate the page).
  3. The checks to see if a previously saved copy can be served are done in PHP, which is relatively slow.

However, #2 and #3 can be completely avoided if you are able to set up mod_rewrite rules in the WordPress .htaccess file to check for and serve cached content.
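
The "check for a saved copy, otherwise generate and save" logic can be sketched in Python. The directory layout used here (one index.html file per URI path) is an assumption made for this example and is not taken from any particular plugin; the function names are also made up.

```python
import tempfile
from pathlib import Path


def cache_file_for(cache_dir, uri_path):
    """Map a URI path to a file inside the cache directory (assumed layout)."""
    return Path(cache_dir) / uri_path.strip("/") / "index.html"


def serve(cache_dir, uri_path, generate):
    """Serve a saved copy if one exists; otherwise generate and save it."""
    cached = cache_file_for(cache_dir, uri_path)
    if cached.is_file():
        return cached.read_text(encoding="utf-8"), "hit"
    html = generate(uri_path)
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_text(html, encoding="utf-8")
    return html, "miss"


with tempfile.TemporaryDirectory() as tmp:
    first = serve(tmp, "/2014/10/hello/", lambda p: f"<html>{p}</html>")
    second = serve(tmp, "/2014/10/hello/", lambda p: "never generated")
```

Because the saved copy is an ordinary file, a web server rewrite rule can test for its existence and serve it directly, which is what lets steps 2 and 3 be skipped.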

Troubleshooting caching problems

Caching problems can be very non-obvious and difficult to troubleshoot since so much is hidden. The most common problems are:

If the problem is identified to be a particular plugin or theme, options include replacing it with a similar plugin or theme that does not have the problem, or modifying (customizing) the plugin or theme to fix the problem.

Troubleshooting caching problems

References:

Questions?


Diagram showing most of the places caching can occur.