Recently we experienced an issue with a high level of IO waits on the primary RescueTime.com database server, which made our MySQL DB the major bottleneck to good performance.
RescueTime essentially serves as a large data warehouse for thousands of users who send us application and website usage information 24 hours a day. Most of our database activity is made up of inserts and updates coming in from our API web service mongrels. At the moment, we are running 7 dedicated mongrels that do nothing but handle incoming data streams from our users.
One of the major functions of the API mongrels is to build real-time summaries of the incoming second-by-second attention data. We summarize this information by hour and by day, and this currently happens in real time as the API mongrels receive the data streams. This summary data is used by the www.RescueTime.com dashboard to display usage information to our users, primarily through real-time graphs and analytics.
As we monitored the average round trip for a data stream update, we noticed a significant upward trend as we added users. It quickly became obvious that we did not have a scalable solution, as our user numbers were growing 7% week over week. During our peak load times (6am – 3pm PST), we started seeing timeouts from the user client applications, which resulted in a less than optimal user experience.
By using the excellent SeattleRB’s Production Log Analyzer we were able to see that we were averaging 3.5 seconds per incoming data stream, and with thousands of users (and growing) we were quickly approaching a limit of how many users we could handle with our current infrastructure.
One option was to bring on more hardware, but we are a bootstrapped startup and we really felt we could squeeze more performance out of our existing infrastructure.
I dug into the specifics of where that 3.5 seconds was going, and it turned out that the insert and update statements were taking far longer than we had thought. The reason? Indexes. We had 8 indexes on our daily summary table and 7 on our hourly summaries. We had been so focused on pushing new features that I had neglected basic DBA duties, lazily adding indexes in response to performance issues.
So, I set out to remove as many of the indexes as possible from our summary tables. As I researched how InnoDB stores data, I realized that what InnoDB does by default is create a clustered index based on the primary key. As I read further, someone equated the InnoDB clustered index to Oracle’s Index Organized Tables (IOTs). KAACHING! A light bulb went off in my head about what was happening.
Rails and ActiveRecord make things really easy for developers, but ActiveRecord does a poor job of structuring database tables for performance. By default, ActiveRecord will create a database table with a single-column primary key of “id”, an auto-incrementing integer. The problem with this is that instead of taking advantage of InnoDB clustered indexes, it essentially dumbs the table back down to being heap-organized, where rows are stored in the order they are inserted.
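To make that concrete, here is a sketch of the kind of table the default Rails conventions produce under InnoDB. The key columns match our summary table, but the data column is illustrative, not our exact schema:

```sql
-- Sketch of an ActiveRecord-style table with a surrogate "id" primary key.
CREATE TABLE daily_summaries (
  id              INT  NOT NULL AUTO_INCREMENT,  -- surrogate key added by Rails
  user_id         INT  NOT NULL,
  summary_date    DATE NOT NULL,
  durationable_id INT  NOT NULL,
  duration        INT,                            -- illustrative data column
  PRIMARY KEY (id)
) ENGINE=InnoDB;
-- InnoDB clusters rows on the primary key, so with an auto-increment id
-- the rows end up physically ordered by insertion time, exactly like a
-- heap-organized table. A given user's rows are scattered across the table.
```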
This isn’t an issue with smaller data sets, but RescueTime currently adds over 10 million records each week and that number is increasing 7% week over week.
I analyzed one of our primary summary tables (daily_summaries) to find a natural primary key, and found that I was able to create a primary key three columns wide: user_id (int), summary_date (date), and durationable_id (int). After identifying the proposed new primary key, I started analyzing the queries that hit daily_summaries and found that all of them used at LEAST user_id, and 90% of them used user_id, summary_date, and durationable_id together.
I thought to myself, what would happen if I removed ALL of the indexes from daily_summaries, rebuilt the table to drop the primary key on the autonumbered “id” column, and added a primary key of (user_id, summary_date, durationable_id)? So I took one of our test instances and did just that. Right away the overall table size dropped to less than half its original size, since all of those indexes had taken up more space than the data did.
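One way to sketch that rebuild in MySQL (the exact steps we ran may have differed, and the non-key column is again illustrative) is to create the replacement table with the natural composite key, copy the rows, and swap the tables:

```sql
-- Replacement table clustered on the natural composite key.
CREATE TABLE daily_summaries_new (
  user_id         INT  NOT NULL,
  summary_date    DATE NOT NULL,
  durationable_id INT  NOT NULL,
  duration        INT,             -- illustrative; real data columns carry over
  PRIMARY KEY (user_id, summary_date, durationable_id)
) ENGINE=InnoDB;

-- Copy the data, then swap the tables in a single atomic rename.
INSERT INTO daily_summaries_new (user_id, summary_date, durationable_id, duration)
  SELECT user_id, summary_date, durationable_id, duration
  FROM daily_summaries;

RENAME TABLE daily_summaries     TO daily_summaries_old,
             daily_summaries_new TO daily_summaries;
```

With hundreds of millions of rows, the INSERT … SELECT step is what takes the hours; the rename itself is instantaneous.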
So, how about performance? Well, of course, the first thing that happened was that the Rails application puked all over itself, but that was to be expected: I had removed the “id” column Rails was expecting and replaced it with a multiple-column primary key that it knew nothing about. I worked with our resident Rails expert, Brian Fioca, and he was able to locate a great Rails gem called composite_primary_keys that allowed him to modify the existing Rails code with fairly minimal effort and still allow the use of ActiveRecord in most places.
Once the Rails code was working again, we were astonished to see that RescueTime read performance was significantly better, even with NO indexes other than the primary key on the daily_summaries table. The reason: the InnoDB clustered index had reorganized the daily_summaries data on disk in the order we retrieved it, meaning a single read operation now returned multiple records for a single user_id, so fewer read operations were needed to satisfy our queries.
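For example, a typical dashboard-style lookup (the query shape and values here are illustrative, not lifted from our code) now touches rows that sit physically next to each other on disk:

```sql
-- All of this user's rows for the day live in adjacent leaf pages of the
-- clustered index, so InnoDB satisfies this with a short contiguous range
-- scan instead of one random page read per row.
SELECT summary_date, durationable_id, duration
FROM   daily_summaries
WHERE  user_id = 42
  AND  summary_date = '2008-04-01';
```

Because the leftmost column of the primary key is user_id, even the queries that filter on user_id alone still get an efficient range scan over that user's contiguous block of rows.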
So, how about insert and update speed? Again, there were significant improvements, since each insert and update no longer incurred the overhead of maintaining the additional indexes.
Getting the production database rebuilt with the new clustered indexes took a couple of hours, due to the over 300 million records worth of data we had collected to date, but the effort was well worth it. The CPU IO wait stats on our database server dropped to almost nothing compared to what they were previously. Our database was less than half its former size, overall performance on the site improved significantly, and incoming data streams were being handled in less than a second.
Had we had the money, it would have been easy to throw hardware at this issue instead of spending a couple of days iteratively tuning our existing infrastructure. The benefit we saw from the InnoDB clustered index was well worth the man-hours spent researching and implementing the solution.