On URL accuracy, versioning and the immutable-redirect pattern

Suppose you have these requirements:

  • serve content (might be a static file or a dynamic page) under a URL
  • make it cacheable by defining appropriate HTTP expiration headers
  • yet, the content might change – serve it under a new URL
  • the old URLs should still be valid but show the new content

In my current project I solved this with what I’m calling the “immutable redirect” pattern: URLs contain a unique ID and a version identifier. If the browser requests the correct ID and version, it is served the page immediately; otherwise it is redirected to the current version. The benefit is that the content behind each URL becomes immutable and can be served with long expiration dates.

The URLs look like this: http://example.com/[id]/[version]/[title]

  • URLs consist of a unique identifier [id] that references the page, regardless of whether it has been modified or not
  • URLs also contain a [version] identifier which changes with every revision of the content (or new deployment of the application because the rendering might have changed)
  • URLs end with a human readable string, i.e. the title of the page
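To make the URL layout concrete, here is a minimal sketch of building and parsing such paths. The names PageUrl, build_path and parse_path are my own, chosen for illustration; they are not taken from the project's code, and the real routing may well look different.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import quote, unquote


class PageUrl(NamedTuple):
    """The three path components of an /[id]/[version]/[title] URL."""
    page_id: str
    version: str
    title: str


# Exactly three path segments; the title segment may be empty.
_PATH_RE = re.compile(r"^/([^/]+)/([^/]+)/([^/]*)$")


def build_path(page_id: str, version: str, title: str) -> str:
    """Percent-encode each component and join them into a URL path."""
    parts = (page_id, version, title)
    return "/" + "/".join(quote(part, safe="") for part in parts)


def parse_path(path: str) -> Optional[PageUrl]:
    """Split a path into its components, or return None if the shape is wrong."""
    match = _PATH_RE.match(path)
    if match is None:
        return None
    return PageUrl(*(unquote(group) for group in match.groups()))
```

Only page_id is needed to find the content; version and title are carried along so they can be compared against the current values.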

When a browser requests the page, it sends a URL which contains the ID of the page it wants to see. This allows the application to easily and unambiguously fetch the content from a database or generate it at runtime. The URL also contains a version ID; if it matches the current version of the page, the page is simply rendered. If the versions do not match, the browser is redirected to the current revision.
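The decision described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual handler: Page, Response and handle are hypothetical names, the in-memory dict stands in for the database, and the 302 status and the exact Cache-Control values are my own choices.

```python
from typing import Dict, NamedTuple
from urllib.parse import quote


class Page(NamedTuple):
    version: str
    title: str
    body: str


class Response(NamedTuple):
    status: int
    headers: Dict[str, str]
    body: str


ONE_YEAR = 365 * 24 * 60 * 60  # long expiration, in seconds (assumed value)


def handle(page_id: str, version: str, pages: Dict[str, Page]) -> Response:
    """Serve the page if the version matches, otherwise redirect to the current one."""
    page = pages.get(page_id)
    if page is None:
        return Response(404, {}, "")
    if version == page.version:
        # This exact URL never changes its content, so it may be cached for long.
        headers = {"Cache-Control": f"public, max-age={ONE_YEAR}, immutable"}
        return Response(200, headers, page.body)
    # Stale version: send the browser to the current revision. The redirect
    # itself is not cached long-term, because its target changes over time.
    location = f"/{quote(page_id, safe='')}/{quote(page.version, safe='')}/{quote(page.title, safe='')}"
    return Response(302, {"Location": location, "Cache-Control": "no-cache"}, "")
```

The point of the split is that only the redirect has to stay fresh; everything served with a 200 can sit in caches indefinitely.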

Human readable titles and cacheability

As we have said, the determining component in a URL is the ID; the rest is just “sugar”. So what do you do when the server receives a URL with the correct ID but everything else mismatched?

We’ve already discussed the case of a version mismatch: a simple redirect to the new URL will suffice. But what do you do with a mismatched title? This case is actually much more common than one might think; titles that contain special characters are often mangled by social media websites and buggy mobile phone browsers. On a pet project that recently got some attention, about 25% of the requested URLs led to 404s because of exactly that problem: special characters in titles were not requested correctly.

One solution might be to just serve the content anyway; since we wanted human readable titles in the URL only as an aid to the visitor, we could serve the content regardless of which title was in the requested URL. However, this soon leads to cache pollution, as the website’s reverse proxy has to store multiple copies of the same content under different URLs.

Another, obvious, solution would be to redirect the browser to the correct URL with the correct title. This ends badly because some mobile phone browsers have bugs in their URL encoding routines. Every time we redirected them to the correct URL, they would decode it incorrectly and request the old URL again. This leads to an endless chain of redirects and eventually to an error message.

The solution we finally opted for was “lenient” URL matching: we observed all mismatched URLs and found a maximum Levenshtein distance of 5 between the requested title and the “correct” title. We built that threshold into the URL matching a few days ago, so a requested title within that distance of the correct one is now accepted, and URL mismatches have since dropped to below 1%.
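For illustration, lenient title matching along these lines could look like the sketch below. The threshold of 5 comes from the observation above; the function names and the plain dynamic-programming implementation of the Levenshtein distance are my own, not the code we deployed.

```python
MAX_TITLE_DISTANCE = 5  # largest distance we observed among mismatched titles


def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def title_matches(requested: str, canonical: str,
                  max_distance: int = MAX_TITLE_DISTANCE) -> bool:
    """Accept a requested title that is close enough to the canonical one."""
    return levenshtein(requested, canonical) <= max_distance
```

A request for "cafe-menu" would then be accepted for a page titled "café-menu", while a completely different title would still be rejected.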
