Saturday, December 08, 2007

I'm back! New and improved! After nearly a year out of the picture, I've finally completed my move to a Linux-based dedicated server. This has been a multi-month project with many wrong turns and bad design decisions. In this post I want to record some of my conclusions.

The Goals of the Project

Even though this site takes very little traffic compared to the power of the machine that hosts it, I like my site to be minimalist. It should use the minimum amount of resources required to host it. One of the goals of this project was to design a web-site that could take a very large load and still operate cleanly. By making your site light on resources you can achieve this fairly easily.

The thinking behind this requirement is that if my site uses the lowest amount of resources per hit, one can imagine that over time it will become cheaper and cheaper for me to host the site. Having a fast site allows me to target underpowered hardware with all the obvious cost benefits that provides.

The other key goal of the project was that I wanted a site that wouldn't need to be fundamentally changed for twenty or so years. I imagine most people at this point are sort of scratching their heads: twenty years? Are you serious?

I'm completely serious.

As I start to get older, I realise that my own time is a precious thing. Whenever you're doing something in your spare time there is an opportunity cost associated with that task, in the sense that you could be doing something else.

There are some people who like to recode their web-site in the latest flavour of the month technology as a way of learning new tools. I completely understand the motivation behind that sort of development but I'd wager that eventually you're going to get bored with that treadmill. This is where I am today and I want a site that sits there and hosts my posts.

In short, I want it so that it is platform independent, fast and maintainable.

Why Move at All?

If you already have a site that works on a Microsoft platform, why move? It's true that Windows Server 2003 is a very stable web-hosting environment. As much as I like to complain about the machines at work, they are on the whole very stable.

There are a few reasons to move. Firstly, ASP is dying. It's by no means dead yet but it will probably not exist in ten years' time, let alone twenty. To continue to use Microsoft technologies would require rewriting the site in ASP.NET. Even so, it is unlikely I'll be able to take C# source written today and run it on a machine twenty years from now. Historically, Microsoft hasn't offered that sort of source-code longevity.

Secondly, you're locked in to one vendor. While in theory Chillisoft may be sufficiently feature complete to run an ASP web-site from, no-one can deny that you'd prefer to run that application on a Windows machine. At any rate, it is unlikely that ASP will be usable on any platform in even a few years' time.

Thirdly, Windows dedicated servers are much more expensive than other servers. I've pretty much decided that I want a dedicated server for my site. This has less to do with the web-site and more to do with what else you can do with that box. I want this box to be as low cost as possible.

All of these things point to a move away from the current code-base.

Choosing a platform

You've probably already guessed that this was the easiest decision to make. Linux runs everywhere, costs nothing and it's a Unix-style operating system. Unix has existed since the sixties. Even if the entire open source community folded tomorrow, it's a good bet that something like Unix will exist in twenty years' time. There are literally hundreds of vendors of Unix-style computing platforms.

The only question then was what distribution of Linux to use. I selected Debian. Debian cares more about user freedoms than other, more commercial distributions. It also puts raw stability ahead of anything else. Stability is the key feature you want from a web-server.

Initial thoughts on moving to Linux

The original site was coded up using ASP, with Microsoft SQL Server as the back-end. That code-base was three years old and not something I particularly wanted to maintain once I moved over to Linux. You could theoretically use Chillisoft's ASP engine on Linux but then you're tying yourself in to a single vendor, and one that will probably not exist in twenty years' time. Even if you solved the ASP problem, you still have the problem of migrating from SQL Server.

The most straight-forward way of completing the switch was to simply recode the site in PHP and move the database over to MySQL. This is what most people would do. However, if you do this you've swapped being tied to one vendor for being tied to two vendors.

It would be wrong to say that this is no improvement over being tied solely to Microsoft. PHP and MySQL are open source so in theory I could take on the maintenance task once the current versions are no longer supported. However, that clearly conflicts with my goals. I want to do the least work possible to maintain the site. I don't have the inclination to try and get a twenty-year-old database and scripting language to work on the hardware of 2027. This would probably be a bigger task than rewriting the existing code in the languages of that day.

The database problem could be resolved by designing queries according to the relevant SQL standards. Then you could just move your schema to the latest database engine every couple of years. I do admit that's a solid migration path and probably one you'd take for a larger system.

I haven't done this with my site. In fact, I've actually decided to scrap the database entirely. This is the next subject I tackle.

Do we actually need a database?

The old blog used the common pattern for a web application. Each page load would go to the database and ask it for a copy of the raw HTML to stick in the middle of the page. That's a lot of CPU cycles to spend on getting a bunch of text to put on a page. One solution to the problem is to cache the page in memory. The theory behind this is that you'll spend a bunch of CPU cycles on the first hit of the page but subsequent hits will be serviced straight out of memory.
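To make the caching idea concrete, here is a minimal sketch in Python. It is not the code the old site used; `fetch_from_database` is a hypothetical stand-in for the real database query, and the cache is a plain dictionary that never expires anything.

```python
from typing import Callable, Dict, List


def make_cached_loader(fetch: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a slow page fetch so each page is only fetched once."""
    cache: Dict[str, str] = {}

    def load(page_id: str) -> str:
        if page_id not in cache:
            # First hit: pay the full cost and remember the result.
            cache[page_id] = fetch(page_id)
        # Later hits: served straight out of memory.
        return cache[page_id]

    return load


# Hypothetical stand-in for the database round trip; it records each call
# so the effect of the cache can be observed.
calls: List[str] = []


def fetch_from_database(page_id: str) -> str:
    calls.append(page_id)
    return f"<p>Body of {page_id}</p>"


load_page = make_cached_loader(fetch_from_database)
load_page("about")  # goes to the "database"
load_page("about")  # answered from the cache
```

Even in this toy form the extra moving part is visible: something now has to decide when a cached page is stale.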

However, my pages change on a glacial time scale. This semi-literate garbage page was written four years ago. The last time I changed it before today was four years ago. My seemingly dynamic site is actually just a bunch of static text.

This raises the question: "Why bother with having active scripting at all?"

I felt that was quite a compelling question. When you look out across the Web, how many sites are actually just a collection of static pages? If you look at articles older than a couple of weeks on Slashdot or Digg, they are simply not going to change. I could view them in two years' time and the HTML would largely be the same. This principle seems to apply very broadly indeed. With the exception of genuine web applications, almost all sites are fundamentally static.

I think there's something wrong with the web in that respect. A lot of the performance problems that we run in to are a direct result of too much dynamism. We then put caching in to remove some of that dynamism in order to get the required performance. This seems backwards to me. It just makes the whole system much more complicated. Perhaps the correct way to design a high traffic web-site is to make the site fundamentally static then use background processes to handle state changes to that static content.
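One way a background process could change a static page without readers ever seeing a half-written file is to write the new version to a temporary file and then rename it over the old one. The sketch below assumes a POSIX file system, where the rename is atomic; the function name `publish_atomically` is made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def publish_atomically(target: Path, html: str) -> None:
    """Replace `target` with `html` so readers see the old or the new page, never a mix."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temporary file lives in the same directory so the final rename
    # stays on one file system.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(html)
        # mkstemp creates the file readable by its owner only; open it up
        # so the web-server user can read it (an assumption about the setup).
        os.chmod(tmp_name, 0o644)
        os.replace(tmp_name, str(target))
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The web-server only ever reads files, while the writer runs on its own schedule, which is what lets reads scale independently of writes.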

This allows you to scale reads of pages independently of writes to resources. There are very few web-sites that are write heavy. Even Wikipedia is predominantly configured for reading.

I took this approach with this site. Every page you see on the site is actually a normal XHTML file. I mutate the various files with custom processes that I run on the server itself. The choice of XHTML is handy because reading and writing XML files is trivial in any language.
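As an illustration of how little code reading and writing XHTML takes, here is a sketch using Python's standard `xml.etree.ElementTree`. It is not one of the actual processes that run on the server; the function `set_title` and the sample page are invented for the example. Note that ElementTree does not keep a DOCTYPE declaration when it serialises, so a real tool would need to put it back.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialise XHTML elements without an "ns0:" prefix.
ET.register_namespace("", XHTML_NS)


def set_title(xhtml_text: str, new_title: str) -> str:
    """Return a copy of the XHTML document with its <title> replaced."""
    root = ET.fromstring(xhtml_text)
    title = root.find(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}title")
    if title is None:
        raise ValueError("document has no <title> element")
    title.text = new_title
    return ET.tostring(root, encoding="unicode")


# Invented sample page, kept tiny on purpose.
page = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Old title</title></head>"
    "<body><p>Hello</p></body>"
    "</html>"
)
updated = set_title(page, "New title")
```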

I want to return to this principle in a future blog post and expand upon it. I think there is great mileage to be had in making the majority of sites out there fundamentally static.

Have I met my goals?

My solution of having a bunch of XHTML files is actually a very nice solution. An HTTP server will always be able to dish files out from a location on disk. This is the core functionality of a web-server. Something would have to change radically in web-servers over the next twenty years for that functionality to disappear. Anything that radical would clearly need a redesign of the site.
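To show how small "dish files out from a location on disk" really is, here is a sketch using the file server that ships in Python's standard library. This is a stand-in for illustration, not the server this site runs on, and `make_static_server` is a made-up helper name.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_static_server(
    directory: str, host: str = "127.0.0.1", port: int = 8000
) -> ThreadingHTTPServer:
    """Build a server that answers GET requests with files under `directory`."""
    # The `directory` argument needs Python 3.7 or later.
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)


# Usage (blocks until interrupted); the path is a placeholder:
#   server = make_static_server("/path/to/site")
#   server.serve_forever()
```

There is no application logic in there at all, which is the point: any server that can map a URL to a file will do.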

Using flat files allows me to remove dependencies left, right and centre. I no longer have to worry about a certain scripting language's performance, longevity or security. There is no database to configure or secure either.

There is no need to worry about scripting vulnerabilities. There are no scripts running so there aren't any. I only have to worry about a hole in Apache itself.

There are other advantages to using flat files. They can be version controlled. This is a really nice feature of this layout. It means I can check the history of any file on the site and make changes safe in the knowledge there's a decent backup.

I didn't use server side includes either. What's the point? I can write a program to modify the contents of all the XML files in one go. Maintaining the files is not a problem. Having a full copy of the page per blog entry is actually useful because you can validate the XHTML locally before putting it up on the server.
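A sketch of what such a program might look like, again in Python and again not the actual tool. It assumes the pages end in `.xhtml` and that the shared fragment is a `<div id="footer">`; both are assumptions made for the example. The check is for XML well-formedness only, which is weaker than validating against the XHTML DTD, and ElementTree drops the DOCTYPE when it rewrites a file.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)


def find_malformed(site_dir: str) -> List[Path]:
    """Return every .xhtml file under site_dir that is not well-formed XML."""
    bad = []
    for path in sorted(Path(site_dir).rglob("*.xhtml")):
        try:
            ET.parse(str(path))
        except ET.ParseError:
            bad.append(path)
    return bad


def set_footer_everywhere(site_dir: str, footer_text: str) -> int:
    """Set the text of <div id="footer"> in every page; return files changed."""
    changed = 0
    for path in sorted(Path(site_dir).rglob("*.xhtml")):
        tree = ET.parse(str(path))
        touched = False
        for div in tree.getroot().iter(f"{{{XHTML_NS}}}div"):
            if div.get("id") == "footer":
                div.text = footer_text
                touched = True
        if touched:
            # Only rewrite files that actually changed.
            tree.write(str(path), encoding="utf-8", xml_declaration=True)
            changed += 1
    return changed
```

Running `find_malformed` on a local copy before uploading catches a broken page before any visitor does.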

In conclusion, going for the simplest of all the possible solutions has worked out for the best. If the site is still here in twenty years, I'll make a note to go back to this entry and compare my decisions to what actually happened.

10:13:23 GMT | #Website