@madpilot makes

Be a good netizen: Use the correct HTTP response code

Remember the good ol’ days back before dymanic websites where pages had .html extensions and when you tried to access a page that didn’t exist you got an ugly, yet reassuring 404 Not found page? The significance of this page is actually pretty important – not only does it tell the user that the page is not found but it returns a special HTTP status that tells web spiders the same thing. As web developers, sometimes we forget that humans aren’t the only ones accessing our pages, and as a result don’t use the correct HTTP response codes to denote what is going on.

What the hell is a HTTP response code?

When your web browser makes a request to a web server, the web server will return a status code as well as the web page, which tells your browser what has happened. This response is usually made up of two parts: a number (which is for any spiders or bots that might be accessing the web site) and a string (which is for humans) and back when everything was static the web server took care of everything.

Unfortunately for web developers, in this environment of database driven web sites, we often don’t have the luxury of letting the server take care of everything, so this article aims to show you that it isn’t that difficult to do HTTP responses correctly. As I will show later, this can adversely affect your ranking in your favourite search engine.

Let’s try it out

Firstly, lets see what happens when you actually make a request – you can see what is going on using another old school application: Telnet.

Open up command line or Terminal.app or terminal depending on you flavour and type the following:

telnet madpilot.com.au 80

You will get a prompt and type the following (Windows users might not see anything as the telnet client won’t echo what you type):

HEAD / HTTP/1.1

Host: madpilot.com.au

…and hit enter twice – you should see something like:

HTTP/1.1 200 OK

Server: Mongrel 1.0.1

Status: 200 OK

Cache-Control: no-cache

Content-Type: text/html; charset=utf-8

Content-Length: 4123

The first line of the response is the important bit – it tells the web browser that the response conforms to HTTP version 1.1 and more importantly the response code is 200 and the response type is OK! The 200 type is the most common response you will come across, it means that the page was found and served up correctly. Generally your web server WILL take care of this one for you. Let’s look at how you can change that status code.

I’ll use PHP as an example, because it is still the most common dynamic language – but you can do this in any language, leave a comment if you would like an example of how to do it in another dialect. It is all very simple – BEFORE you output any HTML, call the header() function as such:

header("HTTP/1.1 404 Page Not Found");

As you have probably guessed, this will tell the browser that the page it requested was not found. Why would you want to do that? Obviously if the script is being run, it has been found? Well, yes – that is true, however, when the HTTP specification was written, CMSs and dynamic product catalogues weren’t even thought about – so we need to think a little bit differently.

Let’s look at an example: Your customer has requested to see the details for product #10 by browsing to

http://www.yourcomany.com/products/view/10

Product #10 exists, so we serve up the details, but what if the customer decides to see what product #20 is? If product 20 doesn’t exist, then what should we do? One option is to print out “Product not found” which is fine for the user, but what happens if your favourite search engine tries to hit product #20? If you maintain the default action the search engine will receive a 200 OK status which makes it think that the product exists and it will index it! This just pollutes the search engines’ index and hurts your ranking, which is bad. So what we can do is serve up the 404 header from above and this let’s both the user AND the search engine know that what they have requested doesn’t exist.

So what other response codes can we use? There is a complete list here, but I’ll run through a couple of the common ones:

301 Moved Permanently: This response code means the resource that has been requested USED to live here but has now moved somewhere else and will never return. Returning this status code is extremely important if you are changing the structure of your website, as you can tell the browser where it needs to go to get the resource. More importantly, it also tells your search engine to update it’s index with the new URL. You need to supply the new URL as part of the request, so it looks something like this:

header("HTTP:1/1 301 http://www.yoursite.com/new-url");

302 Found: The 302 is actually generally used incorrectly. The most common use is to redirect a user to another page TEMPORARILY which is actually what the 303 code is for. unfortunately, not all browsers support 303 and actually expect a 302 in this case. So who are we to argue? If you are a PHP developer, you have probably used

header("Location http://www.yoursite.com/somewhere");

before – this is exactly what this does.

403 Forbidden: If you wanted to really play HTTP right, you would return this code every time someone tried to access a private URL when they weren’t logged in. It means the server knows what you are trying to do, but isn’t going to let you do it.

404 Page not found: This has been covered – basically if the resource the agent wants doesn’t exist, you serve this up.

410 Gone: This page you are looking for used to exist, but it doesn’t anymore. In fact there isn’t even a new URL, so if you are a search engine, just forget about it. Whether search engines listen to this, I’m not sure, but it can’t hurt.

500 Server Error: Something went wrong with the server. I would throw this up if there is an error that is stopping the page from loading, such as a missing database or a broken web service or similar.

Don’t forget that you can also server up content to the browser (in fact, if you don’t humans will just get a blank page), so it is recommended that you serve up a nice friendly message to your visitors explaining what happened.

So there you go – now there is no excuse for serving up errors to your users and forgetting about our automated friends. So when you are writing your next kick-arse web app, spare a thought for the visitors that aren’t so good at parsing human talk.

WordPress Hack: Changing your permalink structure without upsetting Google

WordPress has the ability to generate permalinks, which is great for Search Engine Optimisation. But what can you do if you need to change between them? Changing them in WordPress isn’t a problem – you go to the “Options” tab, click permalinks, and select a new one. However! If others bloggers have linked to your posts, or a search engine has already indexed your blog, their links will break.

With a little bit of .htaccess trickery you easily* change between the different options without breaking your old links!

Why the star next to the “easily”? If you have a lot of posts, it could take a while, but read on….

Out of the box you have three options (There is a fourth, but if you use that option, you probably don’t need this guide!):

The WordPress permalink options

In terms of SEO, The best option is Date and name based – having the title of the post should give a higher ranking. Next best is Numeric, but only because it gets rid of the &p=123 part from the URL. Luckily moving from Default and Numeric to Date and Time is easy!

If you are moving from Default to one of the other options, then there is nothing else to do – WordPress automatically responds to this style of this URL regardless of what option is selected.

There are two ways of moving from the Numeric Option to one of the other options – the quick way and the right way!

The Right way

The right way uses a permanent redirect (using Apache’s mod_alias module) for each blog entry – this will tell search engines that the page doesn’t exist any more and that they should index the new page instead. Unfortunately, this doesn’t update links of other peoples pages, so you will need to leave the hack for the life of the blog.

Open up the .htaccess file and add the following BEFORE the # BEGIN WordPress line:

  1. <ifmodule mod_alias.c>
  2. RedirectPermanent /blog/2007/02/05/this-is-my-blog-post /blog/?p=123
  3. RedirectPermanent /blog/archive/123 /blog/?p=123
  4. <IfModule>

You will need to dig into the database to find what ID number corresponds to each post. This is tedious even for a small number of posts, so I prefer the quick way. The IfModule line check to see if the mod_alias module is installed. The first RedirectPermanent link shows an example of changing from the Date and Title option. The second line shows how to change from the Numeric option.

The Quick way

The quick way uses Apache’s mod_rewrite module to rewrite the URL – as such it will only work when converting FROM numeric mode. If you need to convert from Date and Name to numeric, you have to user the Right way. Drop the following code BEFORE the # BEGIN WordPress line:

  1. <ifmodule mod_rewrite.c>
  2. RewriteEngine On
  3. RewriteBase /blog/
  4. RewriteRule ^archives/([0-9]+) /blog/index.php?p=$1
  5. <IfModule>

The IfModule line makes sure mod_rewrite is enabled, the next link tells Apache to turn mod_rewrite on. The RewriteBase line tells apache to automatically prepend /blog/ to all rewrite tests. The second last line is the meat and potatoes: it tells apache to rewrite any url that looks like /blog/archives/[one or more numbers] to /blog/index.php?p=[the numbers].

I would highly recommend using the Date and Name option – thankfully converting to that option is the easiest to do!