This is a tale of cron jobs, data mining and algorithms.
Our story begins with a data mining script I wrote to do research for my Master's Thesis (investigating social media buzz - still in early stages, will post more details as it progresses and/or drives me progressively more insane).
This script was based on an earlier script I had written for some of my other projects, so it wasn't a big deal. What was new in this one was the scale of it. In one month this script mined more data than my old script had mined in a year. It was mining more data at a much faster pace, and I kept scaling it up.
You don't notice little waves and a few clouds
So what happened? When I first started the new script it ran flawlessly and I never noticed any problems. It's always nice when you're playing with a tiny dataset and can do what you want in any hacked-together way without noticing performance. Of course, anyone who has learned about algorithms (sorting comes to mind first) knows about Big O and how complexity matters as you play with bigger datasets. What I had done was essentially a selection sort on data that was in a random order (Sorting Algorithm visualization) - pretty much the worst possible way to do it (infinite loops and dividing by zero aside).
My Problem - Dupe Checking Algorithm
The problem was in my duplicate checking algorithm - it was searching the entire database for duplicates of every new piece of information. You don't notice any slowdown on 50,000 pieces of data, but somewhere between 50,000 and half a million the camel's back broke. The check was taking longer to run than the time before the next batch of data was collected. So the jobs were stacking up further and further, overloading my database and my server.
Quick Math behind the crash
For every new piece of data (I was grabbing them in sets of around 15-20 every couple of minutes) it had to check against 500,000 entries. So 20 new inserts compared to 500,000 entries = 10 million comparison operations per run (and I was running many instances of this, so multiply that by the number of different sets... it gets UGLY FAST). The worst part of all this is: the more times it runs, the bigger the dataset gets, and the longer each run takes (a linear increase in time per run, thankfully, but linear still kicks your ass eventually!).
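To make the arithmetic above concrete, here is a minimal sketch in Python. The function name and the extra table sizes in the loop are my own illustration, not part of the original script; only the 20 and 500,000 figures come from the numbers quoted above.

```python
def comparisons_per_run(new_items: int, stored_rows: int) -> int:
    """Worst-case comparisons when every new item is checked against every stored row."""
    return new_items * stored_rows


BATCH_SIZE = 20          # upper end of the 15-20 items grabbed per run
STORED_ROWS = 500_000    # approximate table size when things fell over

per_run = comparisons_per_run(BATCH_SIZE, STORED_ROWS)
print(f"{per_run:,} comparisons per run")

# Illustrative table sizes (assumed, not measured) showing the linear growth per run.
for rows in (50_000, 500_000, 5_000_000):
    print(f"{rows:>9,} rows -> {comparisons_per_run(BATCH_SIZE, rows):>11,} comparisons")
```

The cost of a single run grows in step with the table, and every run makes the table bigger, which is why a job that started out comfortable eventually outlasted its own schedule.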
The sad truth is I didn't need to check for dupes across the entire database. I only had to compare each new item against the latest X entries (they all had timestamps, and I just needed to make sure the data pulled wasn't the same as the data pulled before it). It was a poor design from the start that wasn't thought through fully, because I thought I hadn't run into scaling issues before.
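Here is a rough sketch of what a "latest X only" dupe check could look like, using Python's built-in sqlite3 module so it runs anywhere. Everything named here is an assumption for illustration: the `items` table, its `fetched_at` and `payload` columns, the helper functions, and the window of 50 are hypothetical stand-ins, not my actual schema or query.

```python
import sqlite3

RECENT_WINDOW = 50  # hypothetical stand-in for "the latest X"


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical items table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "id INTEGER PRIMARY KEY, "
        "fetched_at INTEGER NOT NULL, "
        "payload TEXT NOT NULL)"
    )
    # The index lets the database find the newest rows without scanning the table.
    conn.execute("CREATE INDEX idx_items_fetched_at ON items (fetched_at)")
    return conn


def is_recent_duplicate(conn: sqlite3.Connection, payload: str,
                        window: int = RECENT_WINDOW) -> bool:
    """Check the payload against only the newest `window` rows."""
    row = conn.execute(
        """
        SELECT 1
        FROM (SELECT payload FROM items ORDER BY fetched_at DESC LIMIT ?) AS recent
        WHERE recent.payload = ?
        LIMIT 1
        """,
        (window, payload),
    ).fetchone()
    return row is not None


def insert_if_new(conn: sqlite3.Connection, fetched_at: int, payload: str,
                  window: int = RECENT_WINDOW) -> bool:
    """Insert the item unless it matches one of the newest rows; return True if inserted."""
    if is_recent_duplicate(conn, payload, window):
        return False
    conn.execute(
        "INSERT INTO items (fetched_at, payload) VALUES (?, ?)",
        (fetched_at, payload),
    )
    return True


conn = make_db()
for t in range(1000):
    insert_if_new(conn, t, f"item-{t}")

rejected = insert_if_new(conn, 1000, "item-999")  # matches one of the newest 50 rows
accepted = insert_if_new(conn, 1001, "item-0")    # older than the window, so not checked
print(rejected, accepted)
```

The trade-off is deliberate: an item that repeats something older than the window gets inserted again. That was acceptable for my case because I only needed to catch a pull that returned the same data as the pull right before it.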
The biggest irony of it all is that I had FIXED this very problem in an earlier version, NOT for scalability purposes, but because of the way it was handling the data for another purpose. So I had the very solution commented out right above the broken line that crashed my server.
Writing good (efficient) code matters, a LOT. Those little hacks are great for a proof of concept, but if you try to scale them without really thinking through what you're doing and actually analyzing performance, you may see some huge problems down the road.
I am thankful my problem was so small that it required a whopping 7 characters to fix ('LIMIT XX' on a SQL query), but that might not always be the case. I hope all this struggling with my code and my server has taught me another lesson in becoming a better programmer, and I look forward to the next one, hopefully without crashing my server completely!