Design Doc: Critical Image Beaconing

Critical Image Beaconing

Jud Porter, 2012-12-26

Introduction

Critical images are the images displayed "above the fold" on the initial load of the page. By knowing which images on a page are critical, we can make our image rewriters smarter. Currently, critical image information is used by several rewriting filters:

  • inline_images and inline_preview_images will only inline critical images.
  • lazyload_images will not lazily load the critical images.
  • add_render_events will tag critical images with a "_pagespeed_atf_img" attribute and fire JS events when all critical images have loaded. Specifically, it adds timing instrumentation to track how long it takes for all of the critical images to load.

Critical images are currently detected internally by rendering pages with a headless browser instance. This process, however, suffers from a number of shortcomings. The renderer's resolution is fixed at 1376x768, which may not be representative of all clients, especially mobile devices. More fundamentally, the headless solution is not deployable in mod_pagespeed.

To mitigate these issues, this document describes the design and implementation of a critical image beacon meant to provide an alternative method for collecting critical image information. The main idea is to inject client-side JS into rewritten pages that will send a beacon URL request from the user’s browser back to the server identifying the critical images. This is similar in design to the current add_instrumentation filter, which injects a beacon to measure page load time from the client.

Implementation components

Detecting critical images: critical_images_beacon filter and JS

The critical_images_beacon filter is responsible for rewriting pages to include the beacon JS code. Rather than requiring MPS users to enable this filter explicitly (for example, by adding it to ModPagespeedEnableFilters), it is enabled implicitly when a filter that uses critical images is enabled. A "CriticalImagesBeaconEnabled" flag is exposed to allow users to disable beacon insertion.

An option will also be provided to allow only a sampling of pages to be instrumented. Initially, this will allow a configurable percentage of pages (say 1 of every 100 requests) to have beacon code injected. Later, this could be modified to specify a desired QPS of beacon responses, to better allow low-traffic sites to collect enough beacon information.
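As a rough sketch of the sampling decision (the parameter name and threshold here are illustrative, not the actual configuration):

```typescript
// Illustrative only: decide whether to instrument this request with the
// beacon JS. `beaconPercent` stands in for the configurable sampling
// percentage (e.g. 1 means roughly 1 in every 100 requests is instrumented).
function shouldInjectBeacon(beaconPercent: number): boolean {
  return Math.random() * 100 < beaconPercent;
}
```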

The critical image beacon solution needs to be robust to the various transformations we perform on image URLs, so that the server can match the image identifiers sent back by the beacon with the URLs encountered during rewriting. This is difficult because of the URL rewriting we perform, including our normal URL scheme for rewritten resources, domain changes using ModPagespeedMapRewriteDomain, and image inlining with a data URL. We also want to make the identifiers as short as possible because of the ~2000 character limit on request URLs (NOTE: http://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url). The proposed solution to these issues is to have the BeaconCriticalImagesFinder insert into image tags a data-beaconurl attribute containing a hash of the original image URL. This way the beacon does not need to know anything about the URL transformations performed by mod_pagespeed, nor does any hashing of URLs to shorten them need to be done on the client side. The downside of this approach is that each image URL will have this hash included, increasing the size of the served HTML.
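For illustration, a minimal sketch of the hashing idea; the hash function below is a stand-in for whatever hasher mod_pagespeed actually uses, and the exact attribute value format is not specified here:

```typescript
// Illustrative stand-in for the server-side hasher. The filter would emit
// something like:
//   <img src="...rewritten....jpg" data-beaconurl="<hash of original URL>">
// so the beacon JS can report short, stable identifiers without knowing
// anything about mod_pagespeed's URL transformations.
function hashUrl(url: string): string {
  let h = 0;
  for (let i = 0; i < url.length; i++) {
    h = ((h << 5) - h + url.charCodeAt(i)) | 0;  // simple 32-bit rolling hash
  }
  return (h >>> 0).toString(36);                 // compact, URL-safe string
}
```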

Other solutions considered:

  1. Send hashes of the (potentially rewritten) image URLs to the server and check for critical images using rewritten URLs. This is not feasible because we will not have the rewritten image URL early enough (during image rewriting) to check whether an image is critical.

  2. Disable MPS (or at least everything that could modify image URLs) when performing critical image beaconing. This could work as long as we are sampling/measuring critical images at a low enough frequency. A solution that is robust to resource rewriting and does not require disabling all of our optimizations is preferred.

  3. Have the beacon JS parse out the original image URL from the rewritten URL, and then send hashes of the original URL in the beacon. This wouldn't work for inlined images, and would also not be possible if ModPagespeedMapRewriteDomain is used, unless the beacon were provided with information to undo that transformation.

The injected JS adds an onload event handler which scans each image tag and checks the image's location relative to the window size to determine whether it is above the fold.
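A minimal sketch of what that handler could look like, reading the data-beaconurl attributes described above (the actual injected JS may differ in detail):

```typescript
// Collect the hashes of images that intersect the initial viewport when
// the page finishes loading.
function findCriticalImageHashes(): string[] {
  const hashes: string[] = [];
  const imgs = document.getElementsByTagName("img");
  for (let i = 0; i < imgs.length; i++) {
    const img = imgs[i];
    const hash = img.getAttribute("data-beaconurl");
    if (!hash) continue;                       // not annotated by the rewriter
    const rect = img.getBoundingClientRect();  // position relative to viewport
    const aboveTheFold =
        rect.top < window.innerHeight && rect.bottom > 0 &&
        rect.left < window.innerWidth && rect.right > 0;
    if (aboveTheFold) hashes.push(hash);
  }
  return hashes;
}

// Sketched in the beacon request section below.
declare function sendCriticalImageBeacon(hashes: string[]): void;

window.addEventListener("load", function () {
  sendCriticalImageBeacon(findCriticalImageHashes());
});
```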

The beacon is then sent as a URL request to the server with the following format.

/mod_pagespeed_beacon?url=<pageurl>&ci=<crit_img_hash1>,<crit_img_hash2>...

The beacons are sent as POST requests, with the critical image information in the body of the POST. GET requests were originally used; however, they have the limitation of only supporting ~2000 characters in the URL. While this is likely sufficient in most cases for critical images, other filters such as critical CSS will require more, so it is simpler to use a single method for sending beacons.
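A sketch of sending that beacon, with the page URL in the query string and the critical image hashes in the body; the exact body format is not specified in this design, so the "ci=" form below is an assumption carried over from the GET format:

```typescript
// Send the critical image hashes back to the server as a POST. Keeping
// the "ci" list in the body (rather than the query string) avoids the
// ~2000 character URL limit mentioned above.
function sendCriticalImageBeacon(hashes: string[]): void {
  const endpoint =
      "/mod_pagespeed_beacon?url=" + encodeURIComponent(window.location.href);
  const body = "ci=" + encodeURIComponent(hashes.join(","));
  const xhr = new XMLHttpRequest();
  xhr.open("POST", endpoint, true);  // async
  xhr.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
  xhr.send(body);
}
```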

Storing critical images: ServerContext::HandleBeacon

HandleBeacon is responsible for parsing the beacons sent by clients and storing the critical image hashes in the property cache. Different clients will return different results for the critical image set, and it's likely that mobile and tablet clients will report a different set of critical images than other users, so HandleBeacon must store more than just the most recent set of critical images returned. We also want the critical image set to be responsive to changes on the page, so that new critical images are quickly added to the set.

The initial approach will be to store the last N reported critical image sets, and then report an image as critical if it is present in M of the last N sets. Since the critical images are stored in the property cache, we will take advantage of Kishore's implementation of segregating property cache entries by user agent type, thereby storing separate results for each device type. We will use simple regex-based parsing of user agent strings to identify the device category.
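A sketch of the M-of-N aggregation, with deliberately simple in-memory data structures; in the real implementation these sets live in the property cache, keyed by device type, and the values of M and N below are illustrative:

```typescript
const N = 10;  // number of beaconed sets retained (illustrative)
const M = 3;   // appearances required to count as critical (illustrative)

// Most recent critical-image sets, one per beacon response, newest last.
const recentSets: Set<string>[] = [];

// Record one beacon response, dropping the oldest set once N are stored.
function recordBeacon(hashes: string[]): void {
  recentSets.push(new Set(hashes));
  if (recentSets.length > N) recentSets.shift();
}

// An image is considered critical if its hash appears in M of the last N sets.
function isCriticalHash(hash: string): boolean {
  let count = 0;
  for (let i = 0; i < recentSets.length; i++) {
    if (recentSets[i].has(hash)) count++;
  }
  return count >= M;
}
```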

Using critical images: BeaconCriticalImagesFinder

The BeaconCriticalImagesFinder is responsible for providing a queryable interface to the critical images through the IsCriticalImage function. This function is responsible for aggregating the multiple critical image sets for the various user agent types and generating a single set against which the queried image URL is checked. Since the critical images are stored as hashes in the property cache, care needs to be taken to ensure that only non-rewritten, untransformed image URLs are checked.
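A sketch of the lookup side, assuming the caller passes the original, untransformed image URL and that the per-device-type sets have already been loaded from the property cache (the Map shape and the hashUrl stand-in are illustrative):

```typescript
// hashUrl is the same stand-in hash used when annotating the HTML above.
declare function hashUrl(url: string): string;

// Aggregate the per-device-type critical sets into one set, then check
// whether the hash of the original (never rewritten) image URL is in it.
function isCriticalImage(
    originalUrl: string,
    setsByDeviceType: Map<string, Set<string>>): boolean {
  const aggregated = new Set<string>();
  setsByDeviceType.forEach(function (set) {
    set.forEach(function (hash) { aggregated.add(hash); });
  });
  return aggregated.has(hashUrl(originalUrl));
}
```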

Security Considerations

A malicious user could flood the server with fake beacon requests that, say, indicate that all or none of the images are critical. The overall impact of this would be to decrease the effectiveness of our image rewriters that use critical image information. There isn't a risk of a large number of beacon requests filling our cache, since we only keep the last N beaconed critical image responses. The overhead of processing beacon requests should also be small compared to our normal HTML rewriting flow, so this shouldn't open a new vector for DoS attacks.

As a mitigation for fake beacons, once QPS-based rate limiting of beacon injection is implemented, we will have an expectation of how many beacon responses the server should be receiving, and could discard beacon responses that exceed the expected rate.
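As a sketch of that mitigation, assuming the server can derive an expected beacon rate from the injection QPS (the class and parameter names are illustrative):

```typescript
// Accept beacon responses only up to the expected per-second rate; anything
// beyond that in the current one-second window is discarded.
class BeaconRateLimiter {
  private windowStartMs = 0;
  private countInWindow = 0;

  constructor(private readonly maxBeaconsPerSecond: number) {}

  // Returns true if this beacon should be processed, false if it should
  // be discarded as exceeding the expected rate.
  accept(nowMs: number): boolean {
    if (nowMs - this.windowStartMs >= 1000) {
      this.windowStartMs = nowMs;  // start a new one-second window
      this.countInWindow = 0;
    }
    this.countInWindow++;
    return this.countInWindow <= this.maxBeaconsPerSecond;
  }
}
```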

Alternative Approaches Considered

Instead of using beaconing, we could implement an approach comparable to the current internal-only headless browser, using PhantomJS. This would have the advantage of not requiring any modification of the pages sent to clients in order to detect the features of the page we are interested in. We've ruled this out, though, due to the complexity it would add to the installation process and the increased resource utilization it would add to servers.
