Design Doc: HTML Caching in PageSpeed

Jeff Kaufman, 2013-03-14

What would it take to make the HTML we generate be cacheable?

Some sites put a cache such as Varnish in front of their HTML, either all the time or only during a traffic spike. This can dramatically increase the QPS that a dynamically generated site can handle. mod_pagespeed and ngx_pagespeed break this optimization because they're built on the assumption that the HTML they generate will not be cached. Where do we currently rely on this assumption?

Visitor-Specific Optimizations

PageSpeed can serve different HTML to different browsers:

WebP on Chrome but not most other browsers
mobile vs desktop
disabling rewriters (inline_images) that don’t work on IE6 or IE7
lazyload doesn’t work on BlackBerry 5

Partial Rewriting

PageSpeed almost never fully rewrites a page on the first view. If the slightly-rewritten page were cached we wouldn’t get an opportunity to fully rewrite the page for a while.

Resource Cache Lifetimes

When we rewrite resources we extend their cache lifetime to a year, which moves the burden of resource lifetime management to the HTML. To be completely correct, rewritten HTML should not be cacheable for longer than the shortest original lifetime of any cache-extended resource it references (including cache-extended resources referenced by cache-extended resources, and so on). In the case of LoadFromFile we don’t know what the intended cache lifetime was. And unless we keep per-page statistics about resources, we don’t yet know the resources’ cache lifetimes at the point where we have to choose caching headers for the HTML.
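The lifetime rule above can be sketched as a walk over the resource graph. This is only an illustration: the `Resource` record, the `cap_s` ceiling, and the choice to treat an unknown lifetime (as with LoadFromFile) as zero are assumptions, not anything PageSpeed defines.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """Hypothetical record of a cache-extended resource and what it references."""
    url: str
    remaining_ttl_s: Optional[int]  # None when the origin lifetime is unknown
    children: List["Resource"] = field(default_factory=list)


def html_max_age(resources, cap_s=300, unknown_ttl_s=0):
    """Return the longest time (seconds) the HTML can be cached without
    outliving any resource it references, directly or transitively."""
    limit = cap_s
    stack = list(resources)
    seen = set()  # guards against cycles and repeated work
    while stack:
        res = stack.pop()
        if res.url in seen:
            continue
        seen.add(res.url)
        ttl = unknown_ttl_s if res.remaining_ttl_s is None else res.remaining_ttl_s
        limit = min(limit, max(ttl, 0))
        stack.extend(res.children)
    return limit


sprite = Resource("sprite.png", 120)
css = Resource("site.css", 600, children=[sprite])
js = Resource("app.js", 900)
print(html_max_age([css, js]))  # limited by sprite.png, reached via site.css
```

Treating unknown lifetimes as zero is the conservative choice; a site owner who knows better could pass a larger `unknown_ttl_s`.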

Beaconing

Some features (critical images/CSS) depend on a beacon. We don’t want to request beacons every time, just when we don’t have enough data. Each beacon requires sending (currently) 1.9K of JavaScript to the client, a lot of client-side processing, a response back to the server, and then a server-side cache update.

[Are there other issues I’m not thinking of?]

Potential Solutions

The simplest solution would be a mode where we don’t do any browser-specific optimizations and just emit HTML cacheable for N minutes. Either it’s the site owner’s job to make sure this doesn’t cause any problems with resource cache lifetimes, or we refuse to rewrite anything with a (remaining) cache lifetime of less than N. Partial rewriting would then fix itself over time: once the page falls out of the cache and is requested again, we get another chance to rewrite it more fully.
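As a rough sketch of the "refuse to rewrite" variant, the filter below keeps only resources whose remaining lifetime covers the HTML lifetime N. The function name and the dictionary input are hypothetical, and resources with an unknown lifetime are skipped as an assumption.

```python
def select_rewritable(remaining_ttl_s_by_url, html_ttl_s):
    """Return the URLs that are safe to rewrite when the HTML will be
    cached for html_ttl_s seconds. Unknown lifetimes (None) are skipped."""
    return {
        url
        for url, ttl in remaining_ttl_s_by_url.items()
        if ttl is not None and ttl >= html_ttl_s
    }


ttls = {"a.css": 3600, "b.js": 90, "c.png": None}
print(sorted(select_rewritable(ttls, html_ttl_s=300)))
```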

If N is large enough, however, we’ll always return under-optimized HTML, because origin resources will have expired from our cache during those N minutes. A simple partial fix would be to keep per-URL statistics on how long ago the last view was, and only mark the HTML as cacheable if there was already a recent pageview to seed our caches.
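A minimal sketch of the recent-pageview heuristic follows. The class name, the 60-second window, and the exact header strings are assumptions chosen for illustration; a real implementation would also need shared storage across server processes.

```python
import time


class RecentViewTracker:
    """Marks HTML cacheable only if the same URL was viewed recently,
    on the theory that the earlier view seeded the resource caches."""

    def __init__(self, window_s=60.0, clock=time.monotonic):
        self._window_s = window_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_view = {}

    def cache_control(self, url, max_age_s):
        now = self._clock()
        previous = self._last_view.get(url)
        self._last_view[url] = now
        if previous is not None and now - previous <= self._window_s:
            return f"public, max-age={max_age_s}"
        return "max-age=0, no-cache"


tracker = RecentViewTracker()
print(tracker.cache_control("/index.html", 300))  # first view: not cacheable
print(tracker.cache_control("/index.html", 300))  # second view shortly after
```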

(A trailing header would be the elegant solution here. We don’t know how fully we rewrote a page until after we’re done processing it. We can remember this for next time and use it in the header of the next response, or we can use a heuristic like the one above, but really we’d like to send a header at the end of this response saying “this page is done, keep it” or “don’t save it, still need to do more work”. HTTP/1.1 supports putting headers at the end of a response in a trailer, when combined with chunked encoding, but as far as I can tell no one supports this HTTP/1.1 feature. See mod_pagespeed issue 418.)
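For reference, this is roughly what such a response would look like on the wire. The sketch only builds the bytes; the trailer field name `X-Rewrite-Complete` is made up for illustration, and nothing here implies that any cache would act on it.

```python
def chunked_with_trailer(chunks, trailer_name, trailer_value):
    """Build an HTTP/1.1 response that uses chunked transfer encoding and
    announces, then sends, one trailer field after the last chunk."""
    head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html\r\n"
        "Transfer-Encoding: chunked\r\n"
        f"Trailer: {trailer_name}\r\n"
        "\r\n"
    ).encode("ascii")
    # Each chunk: hex length, CRLF, data, CRLF. Empty chunks are skipped
    # because a zero-length chunk would end the body early.
    body = b"".join(b"%x\r\n%s\r\n" % (len(c), c) for c in chunks if c)
    # The zero-length chunk ends the body; trailer fields follow it.
    tail = b"0\r\n" + f"{trailer_name}: {trailer_value}\r\n".encode("ascii") + b"\r\n"
    return head + body + tail


message = chunked_with_trailer([b"<html>", b"</html>"], "X-Rewrite-Complete", "yes")
print(message.decode("ascii"))
```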

If we want to keep browser-specific optimizations without supporting client hints we could do the HTML caching ourselves. Cache multiple versions of the HTML keyed on our interpretation of the User Agent string. I don’t think any of our core filters are browser-specific, however, so this might not be worth it.
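One way to key the cache is to reduce the User-Agent string to the handful of capabilities listed under Visitor-Specific Optimizations, so that browsers receiving identical HTML share an entry. The substring checks below are deliberately crude stand-ins, not PageSpeed’s actual user-agent matching, and all names are hypothetical.

```python
import re
from typing import NamedTuple


class UaBucket(NamedTuple):
    webp: bool
    mobile: bool
    inline_images: bool
    lazyload: bool


# Illustrative patterns only; real UA classification is far messier.
_OLD_IE = re.compile(r"msie [67]\.")
_BLACKBERRY_5 = re.compile(r"blackberry\d+/5\.")


def ua_bucket(user_agent):
    ua = user_agent.lower()
    return UaBucket(
        webp="chrome/" in ua,
        mobile="mobile" in ua or "blackberry" in ua,
        inline_images=_OLD_IE.search(ua) is None,
        lazyload=_BLACKBERRY_5.search(ua) is None,
    )


def html_cache_key(url, user_agent):
    """Combine the URL with the capability flags, not the raw UA string."""
    flags = "".join("1" if flag else "0" for flag in ua_bucket(user_agent))
    return f"{url}#ua={flags}"


chrome = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22"
print(html_cache_key("/index.html", chrome))
```

Keying on the flags rather than the raw string bounds the number of cached variants per URL at 2^4 here, whatever the variety of User-Agent strings seen.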

For beacons there are several options:

use a short lifetime and lots of beacons until you have the information you need, and then a long lifetime and fewer beacons to keep it up to date
in the client JavaScript, randomly decide whether to send a beacon and, if so, fetch the complex JavaScript via AJAX
send back beacons every time
just not support features that need beacons
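The first option might look something like the sketch below. It is one interpretation only: the thresholds, lifetimes, and the rule that any instrumented copy gets the short lifetime (so a cached page does not keep triggering beacons for long) are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BeaconPlan:
    inject_beacon: bool
    html_max_age_s: int


def plan_beaconing(samples_seen, roll, samples_needed=10,
                   refresh_fraction=0.05, short_ttl_s=30, long_ttl_s=600):
    """Decide, each time the HTML is regenerated, whether to include the
    beacon script and how long the result may be cached.

    roll is a number in [0, 1), passed in so the decision is testable.
    """
    if samples_seen < samples_needed:
        # Still learning: beacon every time and expire quickly.
        return BeaconPlan(inject_beacon=True, html_max_age_s=short_ttl_s)
    if roll < refresh_fraction:
        # Occasional refresh to keep the collected data up to date.
        return BeaconPlan(inject_beacon=True, html_max_age_s=short_ttl_s)
    return BeaconPlan(inject_beacon=False, html_max_age_s=long_ttl_s)


print(plan_beaconing(samples_seen=3, roll=random.random()))
```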
