The problems with Dynamic Web-Sites

Introduction

In the beginning there were static web-pages. Each page on your site had a file associated with it and when you wanted to change it, you did so with a text editor. For the first web-pages this was all we had. There was no way of taking input from a user and working with it.

Then came the Common Gateway Interface (CGI). This allowed a web-request to trigger a console-based program on the server. The web-server communicated with that process using environment variables and the standard-input and standard-output channels. The program would write the HTML of the page to standard-output and the HTTP server would relay this to the viewer’s browser.
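
To make that concrete, here is a rough sketch of what a CGI program looks like. I’m using Python purely for illustration (early CGI programs were more often Perl or C) and the output is made up; the point is only the mechanism: read the request from the environment, write headers and HTML to standard-output.

    #!/usr/bin/env python3
    # Minimal CGI sketch: the web-server puts request details into environment
    # variables and relays whatever this process prints to standard-output.
    import os

    query = os.environ.get("QUERY_STRING", "")  # e.g. "item=42"

    print("Content-Type: text/html")
    print()  # a blank line ends the HTTP headers
    print("<html><body>")
    print("<p>You asked for: {}</p>".format(query))
    print("</body></html>")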

However, using CGI meant that every request made to that program required a whole new process to be started. Starting a process is typically very expensive, which meant that performance dropped severely when the site had many visitors.

There are actually a number of ways to solve this problem, but the solution the industry took was to stop creating a new process for each request. Instead, a single process runs all the time. This process is a script interpreter that executes a script written in a given scripting language. The script outputs HTML just like the CGI programs before it, but it can handle many more concurrent requests because there is no longer any process-creation overhead per request.
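
As a sketch of that long-running-process model (again Python, and again only illustrative, using the standard library’s http.server rather than any real production stack): the process starts once, and the same handler code simply runs again for each request that arrives.

    # One long-running process: nothing is forked per request, the same
    # handler method just runs again for every incoming request.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class PageHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = "<html><body><p>Generated for {}</p></body></html>".format(self.path)
            data = body.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    if __name__ == "__main__":
        HTTPServer(("", 8000), PageHandler).serve_forever()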

However, interpreted code is much slower than compiled code. As web-traffic started to increase, the performance problems that come out of this difference started to show. Eventually, platforms like ASP.NET emerged where the code is “just in time” compiled on the first run. The first hit to the program would be slow, but subsequent hits would run at compiled speed. This solved the problem.

This entire process took about eight years. Somewhere in the middle of it, one of the key assumptions of the web changed. It used to be that when you were accessing a resource on the web, everything was a file. The images are files, the HTML document is a file, the CSS document is a file, and so on. However, somewhere along the line the assumption changed. When you visited a web-page, you were no longer reading a file that physically existed on the server; you were interacting with a program executing on that server.

When I go to Amazon, EBay, Slashdot or wherever, conceptually what I am getting is not a file but the output of a program. When I go to these domains a program is run which generates my HTML, and in most cases that program is run anew for every single request.

I think that change to one of the web’s basic assumptions happened more or less by accident, and it was probably a big mistake. We pay for this mistake with poor performance and additional solution complexity.

Most pages are not dynamic

The vast majority of web pages are not dynamic. They have very few, if any, dynamic elements. This page, for example, contains no dynamic information. Once I have finished typing this post, it will remain the same until the end of time. If you take a look at a product page on Amazon, or at the old threads on a web forum (which, don’t forget, make up the vast majority of pages on a forum), these things will not change.

For the most part, the update frequency of web-pages is very low. On a computer whose clock ticks two billion times per second, even a web-page that updates once a minute is not really dynamic. From the computer’s point of view, it is static for the vast majority of its operation.

To some degree, this is already recognised as a problem. Most web-application frameworks have some way of temporarily holding information in a cache for use on subsequent requests. Problem solved? Not quite.
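
The caching most frameworks offer boils down to something like the sketch below. This is not any particular framework’s API, just an illustration of the idea, and build_home_page is a made-up rendering function: the expensive work runs at most once per time window, and everyone else gets the stored copy.

    import time

    _cache = {}        # key -> (expires_at, html)
    TTL_SECONDS = 60   # how long a cached page is considered fresh

    def cached_render(key, render_fn):
        now = time.time()
        entry = _cache.get(key)
        if entry and entry[0] > now:
            return entry[1]                    # still fresh: serve the cached HTML
        html = render_fn()                     # stale or missing: recompute
        _cache[key] = (now + TTL_SECONDS, html)
        return html

    # e.g. page = cached_render("home", build_home_page)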

For a start, it is inefficient: that cache will have to be recomputed periodically. Caching also leads to poor application design. If you take a look at Slashdot, you can see this for yourself. If you go to the home page and look at the tally of the number of comments for a story, then actually visit the thread, the numbers will be out of sync for minutes at a time. This is probably because the home page is cached independently of the thread-viewing page. It’s a bug; most technical people will try to make excuses for it, but it is certainly a bug. My grandma would not understand why there is an observable difference.

Caching for the most part adds additional solution complexity that simply wouldn’t be necessary if everything were a file as it used to be.

The problem of scale

But this is only the start of the problems with dynamic sites. Once you have dynamism in a web-site it becomes hard to remove, because you’re inside a program and a program can do anything you want it to do. You want to access the database fifty times for a single page load? Sure, go right ahead. You want to use twenty of those accesses to write data and the remainder to read? Sure, go right ahead!

In small, low traffic applications it’s quite okay to have this flexibility. As you move to larger and larger scale, suddenly you realise that your system’s performance is much lower than the theoretical maximum. While our computers are very fast, load is always difficult to deal with.

The key to dealing with scale is to separate reading data from writing it. On the whole, the vast majority of web applications are read-intensive: you will generally read far more records than you write. This is good, because it is much easier to scale reading. Take this page, for example: if I wanted to scale this entire web-site, I could just buy another server and copy the HTML files from one machine to the other. I can host one machine in one data-centre and one in another, and in theory I have twice the capacity. It’s easy to see that this scales all the way up to very high load.

Writes on the other hand do not scale particularly well. Most writes are done to a database of some description and it is difficult to improve a database’s throughput by just adding more machines. The database usually sits on a single machine and if you want more performance you have to upgrade that machine.

As such, writes are much more expensive than reads. When every web-request runs a program, you have to rely on self-discipline to exploit this asymmetry of cost between a read and a write. In fact, I would argue that people don’t really understand that there is a marked difference in cost until they’re having performance problems. By then the architecture is largely set in stone and scaling has suddenly become very expensive.
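
To illustrate what I mean by exploiting the asymmetry, here is a sketch of a data-access layer that routes reads and writes separately. The names here (primary, replicas, execute) are my own assumptions, not any real library: writes all go to the one machine that is hard to scale, while reads rotate across replicas you can keep adding.

    import itertools

    class Database:
        def __init__(self, primary, replicas):
            self.primary = primary                        # the single write master
            self._replicas = itertools.cycle(replicas)    # round-robin over read copies

        def write(self, sql, params=()):
            # Every write lands on the one expensive-to-scale machine.
            return self.primary.execute(sql, params)

        def read(self, sql, params=()):
            # Reads spread across replicas; more capacity is just more machines.
            return next(self._replicas).execute(sql, params)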

How do we solve the problem?

With the invention of AJAX we now have some cool tools to separate the read operations within a site from the write operations.

Take, for example, Digg’s comment moderation. On Digg, every logged-in user has the ability to moderate a comment up or down. The actual comment view would be a single HTML page that everyone sees. When you click the up arrow, an AJAX request is made to a CGI process that writes a new copy of the HTML file. Rather than generating the raw HTML for each user, Javascript is used to apply the user’s preferences on top of it.
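
A rough sketch of that pattern follows. None of this is Digg’s actual code; the file names and helpers are made up. The page everyone reads stays a plain static file, and the only dynamic piece is a small write handler, called via AJAX, that records the vote and regenerates the file.

    import json

    VOTES_FILE = "votes.json"      # hypothetical store for the running tallies
    PAGE_FILE = "comments.html"    # the static file the web-server serves to readers

    def record_vote(comment_id, delta):
        # The write path: only voters pay this cost; readers never run any code.
        with open(VOTES_FILE) as f:
            votes = json.load(f)
        votes[comment_id] = votes.get(comment_id, 0) + delta
        with open(VOTES_FILE, "w") as f:
            json.dump(votes, f)
        regenerate_page(votes)

    def regenerate_page(votes):
        rows = "".join(
            "<li>Comment {}: score {}</li>".format(cid, score)
            for cid, score in sorted(votes.items())
        )
        with open(PAGE_FILE, "w") as f:
            f.write("<html><body><ul>{}</ul></body></html>".format(rows))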

This is one particular example, but with careful deployment, AJAX can be used as the sole mechanism for performing writes in your application. This allows you to scale your read capacity independently of your write capacity. To me, that is what AJAX is really for. It’s not so much a way to retrieve information without a page refresh, although that is a nice feature. It’s a way to separate the act of reading from a resource from the act of writing to it.

When all the web 2.0 stuff dies down, this is what I think the lasting impact of AJAX will be: simplifying the problem of scale.
