
Design Doc: Cache invalidation of URL patterns

Jeff Kaufman edited this page Jan 5, 2017 · 1 revision

Cache invalidation of URL patterns

Srihari Sukumaran

last updated - July 19, 2012

Objective

Allow users (webmasters) to invalidate cache entries for individual URLs or URL patterns.

Motivation

The current cache invalidation feature allows a webmaster to invalidate all URLs of their site. This invalidates all items cached prior to a timestamp (corresponding to the time of the click). It does not give the webmaster the flexibility to invalidate only specific URLs; e.g., when the webmaster updates a resource, they will want to invalidate only the cache entries corresponding to that resource. With the current invalidation feature, the webmaster is forced either to invalidate all URLs (thus increasing latency for loading unaffected resources) or to accept stale content.

Design overview: Invalidate at the time of cache lookup using domain config / rewrite options

We extend the idea behind the current implementation of "all cache items for a domain" invalidation: invalidation is done at the time of cache lookup. Each URL to be invalidated, together with the timestamp at which the invalidation was requested, is written to config (storage config and then rewrite options). The current approach for the HTTP cache, which checks against these when the actual cache lookup is performed (specifically, in the callback), can be applied here as well. However, the current approach to invalidating the metadata and property caches, which includes the cache invalidation timestamp in the rewrite options signature (and hence in the metadata and property cache keys), is not necessarily what we want for URL cache invalidation. Hence we plan to support two URL cache invalidation modes:

  1. 'Strict' URL cache invalidation: Here we will not include the URL patterns and timestamps in the rewrite options signature, so the metadata cache will not be invalidated at all. For the property cache we will explicitly use the URL patterns and timestamps, either as for HTTPCache or by including the timestamps (of all matching patterns) in the pcache key. This 'strict' invalidation is suitable for HTML, but not for resources.

  2. 'Reference' URL cache invalidation: Here we will include the URL patterns and timestamps in the rewrite options (RO) signature, thus invalidating all metadata and pcache values for the domain. This is suitable for resources, where it makes sense to also invalidate all potential 'references' to the resource being invalidated (e.g., all rewriting metadata, and HTML cached in blink, since it could contain references to the rewritten resource).

Pros:

  • No major changes in the core rewriter code -- basically adding logic to existing functions.
  • Allows URL patterns for invalidation.

Cons:

  • Slightly more code complexity in webmaster interface.

Detailed Design

The StorageConfig proto will have the following related to cache invalidation:

message StorageConfig {
  ...
  // Existing field for invalidating ALL urls:
  optional int64 cache_invalidation_timestamp;
  // This list will be in increasing order of
  // URLCacheInvalidationItem::timestamp. This should happen naturally
  repeated URLCacheInvalidationItem url_cache_invalidation_items;
  ...
}

message URLCacheInvalidationItem {
  int64 timestamp;
  string url_pattern;
  bool strict;  // default true
}
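
As a rough illustration of the ordering requirement on url_cache_invalidation_items, the following C++ sketch mirrors the proto with a plain struct and keeps the list sorted by timestamp on insertion. The names UrlCacheInvalidationItem and AddUrlCacheInvalidationItem are hypothetical, and the sorted insert is an assumption made here so that the invariant holds even if an item arrives out of order; it is not part of the design above.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory mirror of the URLCacheInvalidationItem proto.
struct UrlCacheInvalidationItem {
  int64_t timestamp;        // when the invalidation was requested
  std::string url_pattern;  // URL or URL pattern to invalidate
  bool strict;              // the proto defaults this to true
};

// Inserts `item` so that `items` stays in increasing timestamp order.
// Items with equal timestamps keep their arrival order.
void AddUrlCacheInvalidationItem(std::vector<UrlCacheInvalidationItem>* items,
                                 const UrlCacheInvalidationItem& item) {
  std::vector<UrlCacheInvalidationItem>::iterator pos = std::upper_bound(
      items->begin(), items->end(), item,
      [](const UrlCacheInvalidationItem& a,
         const UrlCacheInvalidationItem& b) {
        return a.timestamp < b.timestamp;
      });
  items->insert(pos, item);
}
```

When items are only ever appended at request time, the insert position is always the end of the list, so the extra cost over a plain append is one binary search.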

Frontend

We plan to minimally modify the current console UI to support URL cache invalidation.

The ‘Flush Domain’ button should be replaced by two buttons: ‘Flush Html’ and ‘Flush all related resources’, corresponding to the strict and reference modes, respectively.

The user can enter a URL pattern and click either of the buttons. This will be used to invalidate caches for all the domains in the project.

Invalidating the caches

RewriteOptions should be modified to contain a field corresponding to url_cache_invalidation_items, which will be populated from storage config.

For invalidating individual URLs, it is more important that the latency in propagating the storage config to the different rewrite_proxy instances is minimal. The pubsub channel should take care of this.

HTTPCache invalidation

The current HTTP cache invalidation is based on the (pure virtual) function HTTPCache::Callback::IsCacheValid(). In HTTPCache, whenever an item is retrieved from the cache, its validity is checked (among other things) by invoking this function, which is provided by the client’s callback. All cache gets require clients to create and pass in an HTTPCache::Callback object. These objects belong to subclasses of HTTPCache::Callback that have access to RewriteOptions; hence, in IsCacheValid, cache_invalidation_timestamp can be compared to the date field of the response header of the item retrieved from the cache to decide validity.

In addition to comparing against the response header’s date field, IsCacheValid will need to perform the equivalent of the following:

valid = true;
for each item in url_cache_invalidation_items traversed backwards:
  if (item’s timestamp < date field of response header of
       cache entry)
    break;   // all remaining items have an even smaller timestamp
  if (url of cache entry matches item’s url_pattern) {
    valid = false;
    break;
  }
end

The above check is performed in the line of the request, i.e., on the serving path of every cache lookup.
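
The lookup-time check can be sketched in C++ roughly as follows. This is a minimal sketch under stated assumptions, not the actual IsCacheValid implementation: the struct and the function names IsCacheEntryValid and MatchesPattern are hypothetical, and because this document does not specify the pattern syntax, the sketch assumes that '*' matches any (possibly empty) sequence of characters and that every other character matches literally.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of the URLCacheInvalidationItem proto.
struct UrlCacheInvalidationItem {
  int64_t timestamp;
  std::string url_pattern;
  bool strict;
};

// Assumed pattern syntax: '*' matches any run of characters.
bool MatchesPattern(const std::string& pattern, const std::string& url) {
  std::size_t p = 0, u = 0;
  std::size_t star = std::string::npos;  // index of the last '*' seen
  std::size_t mark = 0;                  // url index to retry from
  while (u < url.size()) {
    if (p < pattern.size() && pattern[p] == '*') {
      star = p++;
      mark = u;
    } else if (p < pattern.size() && pattern[p] == url[u]) {
      ++p;
      ++u;
    } else if (star != std::string::npos) {
      // Backtrack: let the last '*' absorb one more character.
      p = star + 1;
      u = ++mark;
    } else {
      return false;
    }
  }
  while (p < pattern.size() && pattern[p] == '*') ++p;
  return p == pattern.size();
}

// Walks the items from newest to oldest. `items` must be in increasing
// timestamp order, and `entry_date` is the date of the cached response.
bool IsCacheEntryValid(const std::string& url, int64_t entry_date,
                       const std::vector<UrlCacheInvalidationItem>& items) {
  for (std::vector<UrlCacheInvalidationItem>::const_reverse_iterator it =
           items.rbegin();
       it != items.rend(); ++it) {
    if (it->timestamp < entry_date) break;  // the rest are older still
    if (MatchesPattern(it->url_pattern, url)) return false;
  }
  return true;
}
```

Because the walk stops at the first item older than the cache entry, the cost per lookup depends on how many invalidations were requested after the entry was written, not on the total length of the list.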

Metadata cache invalidation

Non-strict (reference) invalidation items will be included in rewrite options signature, thus invalidating the metadata cache. Strict invalidation items have no effect on metadata cache.
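
To make the distinction concrete, here is a minimal C++ sketch of a signature component that folds in only the non-strict items. The function name InvalidationSignature and the string format are hypothetical and are not taken from the RewriteOptions signature code; the point is only that adding a strict item leaves the signature (and hence the cache keys) unchanged, while adding a reference item changes it.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of the URLCacheInvalidationItem proto.
struct UrlCacheInvalidationItem {
  int64_t timestamp;
  std::string url_pattern;
  bool strict;
};

// Builds the invalidation-related part of an options signature.
// The "ts:...;time@pattern" format is an assumption for illustration.
std::string InvalidationSignature(
    int64_t cache_invalidation_timestamp,
    const std::vector<UrlCacheInvalidationItem>& items) {
  std::string sig = "ts:" + std::to_string(cache_invalidation_timestamp);
  for (std::vector<UrlCacheInvalidationItem>::const_iterator it =
           items.begin();
       it != items.end(); ++it) {
    if (it->strict) continue;  // strict items never affect the signature
    sig += ";" + std::to_string(it->timestamp) + "@" + it->url_pattern;
  }
  return sig;
}
```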

Property cache invalidation

Non-strict (reference) invalidation items will be included in rewrite options signature, thus invalidating the property cache.

Strict invalidation items will be handled as follows:

Pass RewriteOptions::url_cache_invalidation_items into PropertyPage constructor (in ProxyInterface::InitiatePropertyCacheLookup).

Then in PropertyCache::CacheInterfaceCallback::Done() we can compare pcache_value->write_timestamp_ms() against the page->url_cache_invalidation_items entries to determine whether pcache_value is invalidated. If it is invalidated, do not add it to the page (i.e., do not call page_->AddValueFromProtobuf for it). Either all or no values in a cohort should get invalidated (since the values in a cohort are all written together), so when a cohort is invalidated we can pass false to the collector’s Done method.
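
The per-value comparison and the all-or-nothing cohort rule might look roughly like the following C++ sketch. All names here (PatternMatcher, IsPropertyValueValid, IsCohortValid) are hypothetical and do not correspond to the PropertyCache API. The sketch assumes that item timestamps are in milliseconds, like write_timestamp_ms, and it takes pattern matching as a caller-supplied function because the pattern syntax is not specified in this document.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical mirror of the URLCacheInvalidationItem proto.
struct UrlCacheInvalidationItem {
  int64_t timestamp;  // assumed to be in milliseconds
  std::string url_pattern;
  bool strict;
};

// Caller-supplied matcher: returns true if `url` matches `pattern`.
typedef std::function<bool(const std::string& pattern,
                           const std::string& url)>
    PatternMatcher;

// A value is invalidated if a matching strict item is at least as new as
// the value's write time. `items` must be in increasing timestamp order.
bool IsPropertyValueValid(const std::string& page_url,
                          int64_t write_timestamp_ms,
                          const std::vector<UrlCacheInvalidationItem>& items,
                          const PatternMatcher& matches) {
  for (std::vector<UrlCacheInvalidationItem>::const_reverse_iterator it =
           items.rbegin();
       it != items.rend(); ++it) {
    if (it->timestamp < write_timestamp_ms) break;
    // Reference items are covered by the options signature instead.
    if (it->strict && matches(it->url_pattern, page_url)) return false;
  }
  return true;
}

// Treats the cohort as invalid as soon as any one value is invalid.
bool IsCohortValid(const std::string& page_url,
                   const std::vector<int64_t>& value_write_timestamps_ms,
                   const std::vector<UrlCacheInvalidationItem>& items,
                   const PatternMatcher& matches) {
  for (std::size_t i = 0; i < value_write_timestamps_ms.size(); ++i) {
    if (!IsPropertyValueValid(page_url, value_write_timestamps_ms[i], items,
                              matches)) {
      return false;
    }
  }
  return true;
}
```

A false result from IsCohortValid would correspond to passing false to the collector’s Done method for that cohort.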

Alternate Design Considered

Explicitly delete from http cache and metadata cache

When the webmaster enters a URL to invalidate, it is propagated to all server instances.

The URL entered for invalidation need not be persisted; in fact, it is awkward to do so.

We need a URL invalidator class that is a consumer of invalidation URL publications. On receiving an update (a URL), it should invoke Delete on all caches: http_cache() and metadata_cache().

Pros:

  • Simple to implement.
  • Simple UI.
  • No invalidation checks in line of request.

Cons:

  • Implementing the URL invalidator class is very tricky
    • How to figure out which metadata keys to call Delete on?
  • Will this lead to redundant Delete RPCs to remote cache servers?
  • Supporting URL patterns is hard (if not impossible)

Config service

A service, e.g., InvalidateCacheUrl, with the request proto containing the URL to be invalidated and the domain. The devconsole backend server, on receipt of an InvalidateCacheUrl request, will publish the URL and domain on the pubsub channel for URL invalidation.

Invalidating the caches

There should be a URLInvalidator class that subscribes to the channel. This class requires RewriteOptions, http_cache, metadata_cache and PssUrlNamer. It has to run as a "background task" that wakes up when an update on the pubsub channel is received. When it receives an update, it has to explicitly invoke Delete on all the caches. This will involve two Deletes on the http_cache, one for the URL and the other for its rewritten version, and multiple Deletes on the metadata cache. However, it is not clear how it will synthesize all the cache keys to be invalidated.
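
For comparison, the HTTP cache half of this alternative can be sketched in C++ as below. Everything here is hypothetical: FakeCache stands in for a cache with a Delete-by-key operation, and the rewritten URL comes from a caller-supplied function rather than from PssUrlNamer, whose interface is not described in this document. The metadata cache is deliberately left out, since how to derive its keys is exactly the open question.

```cpp
#include <functional>
#include <set>
#include <string>

// Hypothetical stand-in for a cache that supports Delete by key.
class FakeCache {
 public:
  void Put(const std::string& key) { keys_.insert(key); }
  void Delete(const std::string& key) { keys_.erase(key); }
  bool Contains(const std::string& key) const {
    return keys_.count(key) > 0;
  }

 private:
  std::set<std::string> keys_;
};

// Hypothetical consumer of invalidation URL publications.
class UrlInvalidator {
 public:
  typedef std::function<std::string(const std::string&)> Namer;

  UrlInvalidator(FakeCache* http_cache, const Namer& rewritten_url_for)
      : http_cache_(http_cache), rewritten_url_for_(rewritten_url_for) {}

  // Would be called when a URL arrives on the pubsub channel.
  void OnInvalidationUpdate(const std::string& url) {
    http_cache_->Delete(url);                      // the original URL
    http_cache_->Delete(rewritten_url_for_(url));  // its rewritten version
    // Metadata cache keys are not handled: it is unclear how to derive them.
  }

 private:
  FakeCache* http_cache_;
  Namer rewritten_url_for_;
};
```

Even in this reduced form the sketch only handles exact URLs; a pattern would require enumerating every cached key, which is part of why this design was not pursued.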

It is not even clear if this alternate design is feasible at all.
