O'Reilly Network   Published on the O'Reilly Network (http://www.oreillynet.com/)
  http://linux.oreillynet.com/pub/a/linux/2002/02/28/cachefriendly.html
  See this if you're having trouble printing code examples


Cache-Friendly Web Pages

by Jennifer Vesperman
03/07/2002

There are a lot of HTTP caches out there. How long are they holding your pages? How long should they hold your pages? RFC 2616 (HTTP/1.1) specifies that caches must obey Expires and Cache-Control headers--but do your pages have them? How do you add them? What happens to your pages if you don't?

"The goal of caching in HTTP/1.1 is to eliminate the need to send requests in many cases, and to eliminate the need to send full responses in many other cases." RFC 2616

Advantages of cache-friendly pages

"HTTP caching works best when caches can entirely avoid making requests to the origin server. The primary mechanism for avoiding requests is for an origin server to provide an explicit expiration time in the future, indicating that a response MAY be used to satisfy subsequent requests. In other words, a cache can return a fresh response without first contacting the server." RFC 2616

The RFC was written with the expectation that Web pages would include expiration headers. If the expiration times in the headers are chosen carefully, caches can serve stored pages without losing any meaning.

When origin servers don't provide expiration headers, caches use heuristics based on headers like "Last-Modified" to guess at a suitable expiration. Heuristic methods are inefficient compared to expiry dates set by humans who know a page's content and frequency of changes.

"Since heuristic expiration times might compromise semantic transparency, they ought to be used cautiously, and we encourage origin servers to provide explicit expiration times as much as possible." RFC 2616

Notes about caching

The HTTP/1.1 specification (section 13) has strict requirements for caches: It requires them to provide semantic transparency--returning a cached response should provide the same data as would be retrieved from the origin server; and it calls for them to read freshness requirements provided by origin servers and clients.

Caches must pass on warnings provided by upstream caches or the origin server, and they must add warnings if providing a stale response. A cache may provide a stale response in limited circumstances, mostly if the cache cannot connect to the origin server and the client has stated that it will accept a stale response.

If a cache receives a request for a stale page, it sends a validation request to ask the origin server if the page has changed. The most common validation tool is the last modification time. If there are two changes stored within one second, Last-Modified will be incorrect. Because of this, HTTP/1.1 offers strict validation using the Entity Tag header.

The simplest way to assist caching is to keep accurate time on your HTTP server and always send the Date and Last-Modified headers with your responses.

To be a really cache-friendly webmaster, though, include the cache headers in your pages.

Available cache headers

The Expires header is the quick and easy solution. This header states the time at which the page is considered no longer cacheable, after which any cache storing a copy of the page should contact the origin server rather than send the stored copy. The syntax is:

Expires: <date in RFC 1123 format>

For example:

Expires: Sun, 10 Feb 2002 16:00:00 GMT

To mark a response as "already expired" or "not cacheable", the header should be set to the time the response is sent. To mark a response as "never expires", the header should be set to one year in the future.

The other header is Cache-Control. Cache-Control includes elements to designate the maximum age of a page element, whether it can be cached, whether it can be transformed into a different storage type, and whether it can be stored on nonvolatile media.

This article uses Apache as an example for setting up headers and discusses the Cache-Control header in more detail during the example.

Setting up cache headers in Apache

Main method: Expires header

To use the Expires header, you will need to be running Apache 1.2 or later and have mod_expires enabled. Uncomment the expires_module line in the "Dynamic Shared Object Support" section of the httpd.conf file, then recompile Apache.

#
# Dynamic Shared Object (DSO) Support
#

# LoadModule cern_meta_module /usr/lib/apache/1.3/mod_cern_meta.so
LoadModule expires_module /usr/lib/apache/1.3/mod_expires.so

(If you are running Apache 1.3 or later and it is configured to load modules at runtime, you can edit httpd.conf and then restart Apache instead of recompiling.)

mod_expires calculates the Expires header based on three directives. The directives apply to document realms and are usable in any of the following realms: "server config", "virtual host", "directory", or ".htaccess".

The Expires directives have two syntaxes. One is fairly unreadable; it expects you to calculate how many seconds until expiry. Fortunately, the module will also read a much more human syntax. This article describes the readable syntax.

The directives are:

ExpiresActive on|off
ExpiresDefault "<base> [plus] {<num>  <type>}*"
ExpiresByType type/encoding "<base> [plus] {<num> <type>}*"

base is one of:

num is an integer value that applies to the type. type is one of:

If you're using the Expires directives for a server, virtual host, or directory, edit the httpd.conf file and add the directives inside those realms.

<Directory /whichever/directory/here>
    # Everything else you want to add to this section
    ExpiresActive on
    ExpiresByType image/gif "access plus 1 year"
    ExpiresByType text/html "modification plus 2 days"
    # ExpiresDefault "now plus 0 seconds"
    ExpiresDefault "now plus 1 month"
</Directory>

If you're using the Expires header in the .htaccess file, you will need to edit httpd.conf to set the AllowOverride header for the relevant directory. Apache will only read .htaccess in directories which have the "Indexes" override set.

# Allow the Indexes override for the directories using .htaccess.
<Directory /whichever/directory/here>
    # Everything else you want to add to this section
    AllowOverride Indexes 
</Directory>

Add the Expires directives to the .htaccess file in the relevant directory. The webmaster can edit the .htaccess file without needing access to httpd.conf.

The main problem with the ".htaccess" method is that the Indexes override and the .htaccess file give the webmaster more configuration options than just the Expires header. This may not be what the system administrator intends.

Alternative method: Cache-Control header

mod_cern_meta allows file-level control, and it also allows the use of Cache-Control headers (or any other header). The headers are put in a subdirectory of the origin directory, with a name based on the origin file's name.

Uncomment the cern_meta_module line and recompile, as for expires_module in the last section.

In the httpd.conf file, set MetaFiles on, MetaDirectory to the subdirectory name, and MetaSuffix to a suffix for the header files.

MetaFiles on
MetaDirectory .web
MetaSuffix .meta

Using these values, the file /var/www/www.example.org/index.html would have a meta file at /var/www/www.example.org/.web/index.html.meta.

Any valid HTTP headers can be put in these files. This provides another way to apply the Expires header, and it's a way to add the Cache-Control headers. The relevant Cache-Control headers are:

Cache-Control : max-age = [delta-seconds]
Modifies the expiration mechanism, overriding the Expires header. Max-age implies Cache-Control : public.
Cache-Control : public
Indicates that the object may be stored in a cache. This is the default.
Cache-Control : private
Cache-Control : private = [field-name]
Indicates that the object (or specified field) must not be stored in a shared cache and is intended for a single user. It may be stored in a private cache.
Cache-Control : no-cache
Cache-Control : no-cache = [field-name]
Indicates that the object (or specified field) may be cached, but may not be served to a client unless revalidated with the origin server.
Cache-Control : no-store
Indicates that the item must not be stored in nonvolatile storage, and should be removed as soon as possible from volatile storage.
Cache-Control : no-transform
Proxies may convert data from one storage system to another. This directive indicates that (most of) the response must not be transformed. (The RFC allows for transformation of some fields, even with this header present.)
Cache-Control : must-revalidate
Cache-Control : proxy-revalidate
Forces the proxy to revalidate the page even if the client will accept a stale response. Read the RFC before using these headers, there are restrictions on their use.

Caveats and gotchas

Final words

Caching is a reality of the Internet and enables efficient usage of bandwidth. Your clients probably view your pages through a cache, and sometimes multiple caches. Applying cache headers to your pages protects the page content and allows your clients to save their bandwidth.

Further reading

Jennifer Vesperman is the author of Essential CVS. She writes for the O'Reilly Network, the Linux Documentation Project, and occasionally Linux.Com.

Return to Related Articles from the O'Reilly Network .


Library Navigation Links

oreillynet.com Copyright © 2003 O'Reilly & Associates, Inc.