Sustainable URLs [was "Daily news URLs; how long do they live?"]
In between the round-the-clock news briefs and the Web-only, in-depth data analysis presentations are the daily dumps, for lack of a better term, that go up on our sites. (I know, I know, this shovelware is Web content at its worst, etc., etc., but readers demand/expect it.)
My revised question is limited to this class of item, and it is this:

At what point do you start to get concerned about the number of rows in your Story database? And when you get to that point, what do you do about it?

How old is the oldest URL on your site? (And how difficult is it for you to determine this?) What's your real-world strategy for dealing with this daily-increasing pile that theoretically extends into infinity?
Most curious to hear from the mid-size and smaller operations using Django.
If you're concerned about your story database getting big, just split it into multiple story databases. The easiest way to do that is by date. Google around for "sharding" or "partitioning." The way it works is, you put all stories from the year 2008 in one table, all stories from 2007 in another table, etc. Then you set up some views/triggers that make it transparent, such that your client code only deals with one master "table."
Here's a link to the PostgreSQL documentation on partitioning, which gives an example of partitioning by date.
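To make the scheme concrete, here's a minimal, runnable sketch using SQLite as a stand-in (the table and column names are made up for illustration; real PostgreSQL partitioning uses table inheritance plus rules or triggers, as the docs describe, but the shape is the same: per-year tables behind one queryable name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One physical table per year -- these are the "partitions."
for year in (2007, 2008):
    conn.execute(
        f"CREATE TABLE stories_{year} "
        "(id INTEGER, pub_date TEXT, headline TEXT)")

conn.execute("INSERT INTO stories_2007 VALUES (1, '2007-06-01', 'Old story')")
conn.execute("INSERT INTO stories_2008 VALUES (2, '2008-03-15', 'New story')")

# A view that unions the partitions, so client code queries a single name
# and never needs to know which physical table a story lives in.
conn.execute(
    "CREATE VIEW stories AS "
    "SELECT * FROM stories_2007 UNION ALL SELECT * FROM stories_2008")

rows = conn.execute(
    "SELECT headline FROM stories ORDER BY pub_date").fetchall()
print(rows)  # stories from both years come back through the one view
```

The payoff is that old partitions can be archived, indexed differently, or moved to slower storage without touching the code that reads from `stories`.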
BTW, at the Lawrence Journal-World, we did indeed have older stories in the database dating back to the mid 20th Century, but that was just a handful (see http://www2.ljworld.com/news/1958/, for example). I think the consistent/complete story archive only went back to the early 1990s -- which is still pretty great when it comes to newspaper archives, of course! You can hack the URLs to view the archives: http://www2.ljworld.com/news/1993/
The 'simple' answer is whenever database performance becomes a bottleneck.
The longer answer is that you shouldn't see size-related database problems until your primary tables reach hundreds of thousands of rows, at minimum.
If you're noticing performance problems, make sure you're doing regular table optimizations, that your common queries are staying in the query cache, and that the RAM usage and query cache size settings are sensible. MySQL installs with a very conservative set of usage limits, so make sure you've configured it to match your environment.
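As a rough illustration of what "configure it to match your environment" means, here is a hedged my.cnf fragment touching the settings mentioned above. The values are illustrative starting points only, not recommendations; size them to your server's RAM and workload:

```ini
# my.cnf fragment -- illustrative values only
[mysqld]
query_cache_type        = 1     # enable the query cache
query_cache_size        = 64M   # the stock default is tiny or zero
innodb_buffer_pool_size = 1G    # the big one for InnoDB: as much RAM as you can spare
key_buffer_size         = 256M  # MyISAM index cache, if you still use MyISAM
```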
If you're still having issues, consider hardware upgrades. Swapping your database hard drives for solid-state drives is a cheap way to buy a huge performance gain. If you can afford it, the performance of a hardware RAID array of good server-class SSDs (Intel X series in particular) is absolutely amazing, and can be had for < $1000 in some cases.
If you're using MySQL or Postgres you can then go to a sharding approach, or consider breaking your individual pieces up and storing them in a non-relational database like Cassandra -- although I'm not sure whether any of those integrate well with Django, or whether their maturity is suited to your environment.
I remember once seeing a screenshot of the admin of the CMS at the Lawrence Journal-World, where Django was born. At that time (a few years ago) they occasionally would add older stories to the CMS when they wanted to republish them on the Web. The screenshot I saw had a story from, I believe, 1953.
Full disclosure: I've never used Django for pure CMS work. But I've done a lot with it to create data-driven interactives, which often involves storing hundreds of thousands to millions of hefty records, so I assume that experience probably corresponds (for now) to most newspapers' story loads (particularly ones that haven't gone back and digitized decades-old stories).
Django's standard databases -- Postgres and MySQL -- can handle a whole lot of data. So database size, though tricky at those upper bounds, is a relatively manageable problem for a pretty long while. Of course, there is a conceptual point in the sky where you'll have to start splitting things up between multiple databases -- and when you get there, the upcoming Django 1.2 release will make it a whole lot easier.
But I'd suggest that the answer to your question isn't a specific number of stories; rather, it's "whenever it starts to hurt your site's performance." When it does, you could probably do a lot of good just by being very careful about how and what you query (it's amazing how easy it is to accidentally run count(), which isn't a problem until you've got hundreds of thousands of records and you're doing it ten times per page). And after that, you start looking at more complex technical solutions to the problem: sharding, partitioning, a master-slave set-up, better caching, moving to a bigger database server, setting up pgpool, examining your database indexes, looking closely at your server's settings, et cetera.
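A cheap first step along those lines is simply caching an expensive aggregate for a short window instead of recomputing it on every page view. A minimal sketch in stdlib Python with SQLite standing in for the story database (the table name, row count, and TTL are all made up for illustration):

```python
import sqlite3
import time

# In-memory demo database standing in for the real story table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stories (id INTEGER PRIMARY KEY, headline TEXT)")
conn.executemany("INSERT INTO stories (headline) VALUES (?)",
                 [(f"Story {i}",) for i in range(1000)])

_count_cache = {"value": None, "expires": 0.0}

def story_count(ttl=60):
    """Return COUNT(*) of stories, recomputing at most once per `ttl` seconds."""
    now = time.time()
    if _count_cache["value"] is None or now >= _count_cache["expires"]:
        _count_cache["value"] = conn.execute(
            "SELECT COUNT(*) FROM stories").fetchone()[0]
        _count_cache["expires"] = now + ttl
    return _count_cache["value"]

print(story_count())  # first call hits the database
print(story_count())  # second call is served from the cache
```

In a real Django app you'd reach for the framework's cache API rather than a module-level dict, but the principle is the same: ten cached reads per page beat ten `count()` queries per page.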
After all, the point could be different for different places, based on things like how fast your site needs to be, what technology you're running on, how much traffic you get and how well-designed your underlying CMS is. And the solution would depend very much on the exact problem you want to solve: whether querying has gotten sluggish, or a handful of very-common operations are taking too long, or you feel the need for more redundancy, or whatever else.
Keep in mind also that most newspaper stories are normally only fetched on a handful of keys, so assuming that you've got the right indexes set up and something else (like a Haystack backend) handling full-text search, that retrieval should be pretty fast. And since the increase in stories over time should be roughly linear, when it does start getting slow, you should have plenty of notice that you need to start exploring different solutions.
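You can verify that claim directly: databases will tell you whether a lookup uses an index. A runnable sketch with SQLite's `EXPLAIN QUERY PLAN` (Postgres and MySQL have `EXPLAIN` equivalents; the column names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stories (id INTEGER PRIMARY KEY, slug TEXT, pub_date TEXT)")
conn.executemany(
    "INSERT INTO stories (slug, pub_date) VALUES (?, ?)",
    [(f"story-{i}", f"2009-01-{i % 28 + 1:02d}") for i in range(5000)])

# Index one of the handful of keys stories are actually fetched on.
conn.execute("CREATE INDEX idx_stories_slug ON stories (slug)")

# EXPLAIN QUERY PLAN reports whether the lookup searches the index
# or falls back to scanning the whole table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM stories WHERE slug = ?",
    ("story-1234",)).fetchall()
print(plan)  # the plan should mention idx_stories_slug, not a full scan
```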
It's not uncommon for MySQL or Postgres databases to contain tens of millions of records - they're certainly built to handle those kinds of numbers. Whether you need to break your database up (sharding) is a question of the amount of traffic and queries you're handling, not the size of the database. And, like others have said, you'll want to put effort into site optimizations, caching, Varnish, etc. before you go down the sharding road.