What is this Y Combinator thing all about?
I’ve had a few of my friends ask me what Y Combinator was all about and what I’ve been up to.
Y Combinator (YC) provides mentorship and seed funding for early stage tech startups. Twice a year, they accept applications from thousands of prospective startups. Out of those thousands of applications, they interview a smaller group of potentials and out of the interviews they accept 20 to 27 companies to take part in the YC “experience”.
Tony Wright, Brian Fioca and I applied to YC in Oct. 2007 and received notice on October 19th, 2007 that we were being invited to introduce RescueTime to the YC partners. We were all really excited about the opportunity to even interview and we spent a large amount of time researching and preparing.
Long story short (I’ll do another post on the details of my YC experience) – we were accepted into YC and Tony, Brian and I moved down to Silicon Valley on January 4th to get RescueTime, Inc. kicked off.
We spent 3 months in Silicon Valley going through the Y Combinator program. This essentially entailed putting in an insane number of hours and being hugely productive building and refining RescueTime.com. YC holds weekly dinners for the group where they bring in guest speakers to present and do Q&A sessions. Most of the speakers were themselves founders of startup companies. We got the opportunity to sit down and talk with people like Marc Andreessen, Chris Sacca, Evan Williams, Paul Buchheit, Ron Conway and a ton of other really cool people.
On April 1, 2008, the three of us moved back to Seattle to continue building RescueTime, Inc. as a business. In September 2008, we closed a round of financing for $875,000 from True Ventures in Silicon Valley and a number of really cool angel investors, including former Googlers and Microsofties and Tim Ferriss, the author of “The 4-Hour Workweek”.
Since we raised our funding, we’ve added two awesome guys to our team (Montana Low and Mark Wolgemuth), and we’ve been pretty much focused on trying to improve the RescueTime.com offering.
Example MySQL configuration tuned for InnoDB engine
Someone asked for a sample InnoDB configuration and since nearly all of the MySQL tables in RescueTime are InnoDB, I thought I would post this.
################# /etc/my.cnf ###################
[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0

[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
datadir = /db/data/mysql
log-bin = /db/log/mysql/mysqld-bin
log-bin-index = /db/log/mysql/mysqld-bin.index
expire-logs-days = 7
log-error = /db/log/mysql/mysqld.err
log-slow-queries
long-query-time = 1
relay-log = /db/log/mysql/mysql-relay
relay-log-index = /db/log/mysql/mysql-relay.index
#default-storage-engine = innobase
innodb_data_home_dir = /db/data/mysql
innodb_file_per_table
innodb_autoextend_increment = 50
innodb_log_group_home_dir = /db/log/mysql
innodb_log_files_in_group = 2
innodb_log_file_size = 100M

# Start Replication Parameters
#server-id = 1
innodb_flush_log_at_trx_commit = 1
#sync-binlog = 1
# End Replication Parameters

# Start Performance Parameters
#tmpdir = /tmp-ram
max_connections = 50
max_heap_table_size = 512M
tmp_table_size = 512M
table_cache = 128
sort_buffer_size = 4M
query_cache_min_res_unit = 1K
query_cache_limit = 1M
query_cache_size = 100M
max_allowed_packet = 16M
thread_stack = 128K
thread_cache_size = 8
innodb_buffer_pool_size = 2000M
innodb_additional_mem_pool_size = 4M
innodb_lock_wait_timeout = 2
innodb_file_io_threads = 4
innodb_thread_concurrency = 8
innodb_flush_method = O_DIRECT
transaction-isolation = READ-COMMITTED
skip-external-locking
# End Performance Parameters
################# /etc/my.cnf ###################
MySQL InnoDB Clustered Indexes and Rails
Recently we experienced a high level of I/O waits on the primary RescueTime.com database server, which made our MySQL DB the major bottleneck to good performance.
RescueTime essentially serves as a large data warehouse for thousands of users who send us application and website usage information 24 hours a day. Most of our database activity is made up of inserts and updates coming in from our API web service mongrels. At the moment, we are running 7 dedicated mongrels that do nothing but handle incoming data streams from our users.
One of the major functions of the API mongrel is to build real-time summaries of the incoming second-by-second attention data. We summarize this information by hour and by day, and this currently happens in real time as the API mongrels receive the data streams. This summary data is used by the www.RescueTime.com dashboard to display the usage information to our users, primarily through real-time graphs and analytics.
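As a rough illustration of that kind of roll-up (not RescueTime's actual code), here is a minimal Python sketch. The event layout of (user_id, app, timestamp, seconds) and the app names are assumptions made up for the example; the real service wrote its summaries to database tables rather than dictionaries.

```python
from collections import defaultdict
from datetime import datetime


def summarize(events):
    """Roll (user_id, app, timestamp, seconds) events up into hourly and daily totals."""
    hourly = defaultdict(int)
    daily = defaultdict(int)
    for user_id, app, ts, seconds in events:
        # Truncate the timestamp to the start of its hour for the hourly bucket.
        hour_bucket = ts.replace(minute=0, second=0, microsecond=0)
        hourly[(user_id, app, hour_bucket)] += seconds
        # The daily bucket only needs the calendar date.
        daily[(user_id, app, ts.date())] += seconds
    return dict(hourly), dict(daily)


# Hypothetical sample data: one user, two applications, one day.
events = [
    (1, "editor", datetime(2008, 10, 1, 9, 15, 0), 30),
    (1, "editor", datetime(2008, 10, 1, 9, 45, 10), 45),
    (1, "editor", datetime(2008, 10, 1, 14, 5, 0), 20),
    (1, "browser", datetime(2008, 10, 1, 9, 20, 0), 10),
]
hourly, daily = summarize(events)
```

Every incoming event touches one hourly row and one daily row, which is why the summary tables see so many inserts and updates.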
As we monitored the average round trip for a data stream update, we noticed a significant upward trend as we added users. It quickly became obvious that we did not have a scalable solution as our user numbers grew 7% week over week. During our peak load times (6am – 3pm PST), we started seeing timeouts from the user client applications, which resulted in a less than optimal user experience.
By using SeattleRB’s excellent Production Log Analyzer, we were able to see that we were averaging 3.5 seconds per incoming data stream, and with thousands of users (and growing) we were quickly approaching the limit of how many users we could handle with our current infrastructure.
One of the options was to bring on more hardware resources, but we are a bootstrapped startup and we really felt we could squeeze more performance out of our existing hardware.
I dug into the specifics of where that 3.5 seconds was coming from, and it turned out that the insert and update statements were taking far longer than we had thought. The reason? Indexes. We had 8 indexes on our daily summary table and 7 on our hourly summaries. We had been so focused on pushing new features that I had neglected basic DBA duties, lazily adding another index every time a performance issue came up.
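To get a feel for what each extra index costs, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, chosen so the example runs anywhere). The four index definitions and the sample rows are hypothetical; the real indexes are not listed here. The sketch loads the same rows into a table with and without secondary indexes and compares how many pages the database ends up holding, which is a rough proxy for how much extra data every insert has to write.

```python
import sqlite3

# Hypothetical secondary indexes, for illustration only.
HYPOTHETICAL_INDEXES = [
    "user_id",
    "summary_date",
    "durationable_id",
    "user_id, summary_date",
]


def pages_after_inserts(index_columns, rows):
    """Load rows into a fresh in-memory table and return the database page count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_summaries ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, summary_date TEXT, "
        "durationable_id INTEGER, seconds INTEGER)"
    )
    # Each secondary index is a separate B-tree that every insert must update.
    for n, cols in enumerate(index_columns):
        conn.execute(f"CREATE INDEX idx_{n} ON daily_summaries ({cols})")
    conn.executemany(
        "INSERT INTO daily_summaries "
        "(user_id, summary_date, durationable_id, seconds) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    pages = conn.execute("PRAGMA page_count").fetchone()[0]
    conn.close()
    return pages


# 20 users x 28 days x 10 activities of made-up summary rows.
rows = [
    (user, f"2008-10-{day:02d}", activity, 60)
    for user in range(20)
    for day in range(1, 29)
    for activity in range(10)
]
pages_without_indexes = pages_after_inserts([], rows)
pages_with_indexes = pages_after_inserts(HYPOTHETICAL_INDEXES, rows)
print(pages_without_indexes, pages_with_indexes)
```

The exact numbers depend on the SQLite version and page size, and InnoDB will behave differently in detail, but the direction is the same: more indexes mean more pages touched per write.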
So, I set out to remove as many of the indexes as possible from our summary tables. As I researched how InnoDB stores data, I realized that what InnoDB does by default is create a clustered index based on the primary key. As I read further, someone equated the InnoDB clustered index to Oracle’s Index Organized Tables (IOTs). KAACHING! A light bulb went off in my head about what was happening.
Rails and ActiveRecord make things really easy for developers, but ActiveRecord does a poor job of structuring database tables for performance. By default, ActiveRecord will create a database table with a single-column primary key of “id”, which is an autonumber. The problem with this is that InnoDB then clusters the table on that id, so instead of taking advantage of the clustered index, the table is essentially dumbed back down to behave like a heap-organized one, where rows are stored in the order that they are inserted.
This isn’t an issue with smaller data sets, but RescueTime currently adds over 10 million records each week and that number is increasing 7% week over week.
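To put the 7% figure in perspective, here is a quick back-of-the-envelope calculation. It assumes the growth rate stays constant from week to week, which is a simplification, and the function names are just for illustration.

```python
import math

WEEKLY_GROWTH = 0.07           # growth rate quoted above
RECORDS_PER_WEEK = 10_000_000  # current weekly volume quoted above


def doubling_time_weeks(rate):
    """Weeks needed for volume to double at a constant weekly growth rate."""
    return math.log(2) / math.log(1 + rate)


def weekly_records_after(weeks, start=RECORDS_PER_WEEK, rate=WEEKLY_GROWTH):
    """Projected weekly record volume after the given number of weeks."""
    return start * (1 + rate) ** weeks


print(f"Doubling time: {doubling_time_weeks(WEEKLY_GROWTH):.1f} weeks")
print(f"Weekly records after 26 weeks: {weekly_records_after(26):,.0f}")
```

At that rate the weekly volume doubles in a little over ten weeks, so any per-row overhead compounds quickly.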
I analyzed one of our primary summary tables (daily_summaries) to find a natural primary key and found that I could create one that was 3 columns wide: user_id (int), summary_date (date), and durationable_id (int). After identifying the proposed new primary key, I started analyzing the queries that used daily_summaries and found that all of them used at LEAST user_id, and 90% of them used user_id, summary_date and durationable_id together.
I thought to myself, what would happen if I removed ALL of the indexes from daily_summaries and rebuilt the table, dropping the autonumbered “id” primary key and adding a primary key of (user_id, summary_date, durationable_id)? So I took one of our test instances and did just that. Right away the overall table size dropped to less than half its original size, since all of those indexes took up more space than the data did.
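The shape of that rebuild can be sketched with Python's built-in sqlite3 module. This is a stand-in, not the MySQL statements that were actually run: SQLite's WITHOUT ROWID tables store rows in primary key order, which is the closest analogue to an InnoDB clustered index available in the standard library. The seconds column, the idx_user_date index and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- ActiveRecord-style table: surrogate id plus a secondary index.
    CREATE TABLE daily_summaries (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL,
        summary_date TEXT NOT NULL,
        durationable_id INTEGER NOT NULL,
        seconds INTEGER NOT NULL
    );
    CREATE INDEX idx_user_date ON daily_summaries (user_id, summary_date);
    INSERT INTO daily_summaries (user_id, summary_date, durationable_id, seconds)
    VALUES (2, '2008-10-01', 7, 120),
           (1, '2008-10-02', 3, 45),
           (1, '2008-10-01', 3, 60);

    -- Rebuild: same rows, keyed and stored by the natural composite key.
    CREATE TABLE daily_summaries_new (
        user_id INTEGER NOT NULL,
        summary_date TEXT NOT NULL,
        durationable_id INTEGER NOT NULL,
        seconds INTEGER NOT NULL,
        PRIMARY KEY (user_id, summary_date, durationable_id)
    ) WITHOUT ROWID;
    INSERT INTO daily_summaries_new
        SELECT user_id, summary_date, durationable_id, seconds
        FROM daily_summaries;
    DROP TABLE daily_summaries;
    ALTER TABLE daily_summaries_new RENAME TO daily_summaries;
""")

# The surrogate id column is gone; the composite key now enforces uniqueness.
columns = [row[1] for row in conn.execute("PRAGMA table_info(daily_summaries)")]
row_count = conn.execute("SELECT COUNT(*) FROM daily_summaries").fetchone()[0]

try:
    conn.execute("INSERT INTO daily_summaries VALUES (1, '2008-10-01', 3, 99)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

One side effect worth noting: with a natural key, the database itself rejects a second summary row for the same user, date and activity, something the autonumbered id never did.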
So, how about performance? Well, of course, the first thing that happened was the Rails application puked all over itself, but that was to be expected: I had removed the id column Rails was expecting and replaced it with a multiple-column primary key that it knew nothing about. I worked with our resident Rails expert, Brian Fioca, and he was able to locate a great Rails gem called composite_primary_keys that allowed him to modify the existing Rails code with fairly minimal effort and still use ActiveRecord in most places.
Once the Rails code was working again, we were astonished to see that RescueTime read performance was significantly better, even with NO indexes other than the primary key on the daily_summaries table. The reason was that the InnoDB clustered index had reorganized the daily_summaries data on disk in the order in which we retrieved it. A single read operation now returned multiple records for a single user_id, so fewer read operations were needed to satisfy each query.
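A small sketch of why the reads got cheaper, again using Python's sqlite3 with a WITHOUT ROWID table as a stand-in for an InnoDB clustered index (an assumption; the query planner output below is SQLite's, not MySQL's). The seconds column and the sample rows are made up. Rows are deliberately inserted out of order, yet a lookup by user_id is answered straight from the primary key with no secondary index at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_summaries ("
    "user_id INTEGER NOT NULL, summary_date TEXT NOT NULL, "
    "durationable_id INTEGER NOT NULL, seconds INTEGER NOT NULL, "
    "PRIMARY KEY (user_id, summary_date, durationable_id)) WITHOUT ROWID"
)

# Insert in an order that interleaves users, as live traffic would.
rows = [
    (user, f"2008-10-{day:02d}", activity, 60)
    for activity in range(3)
    for day in range(1, 4)
    for user in range(1, 4)
]
conn.executemany("INSERT INTO daily_summaries VALUES (?, ?, ?, ?)", rows)

query = (
    "SELECT summary_date, durationable_id, seconds FROM daily_summaries "
    "WHERE user_id = ? ORDER BY summary_date, durationable_id"
)
# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (2,))]
user_rows = conn.execute(query, (2,)).fetchall()

print(plan)
print(len(user_rows))
```

Because all of one user's rows sit next to each other in key order, the lookup is a single range scan over adjacent entries rather than a hop through a secondary index followed by scattered row fetches.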
So, how about insert and update speed? Again there were significant improvements in performance, since each insert and update no longer carried the overhead of maintaining the additional indexes.
Getting the production database rebuilt with the new clustered indexes took a couple of hours because of the more than 300 million records we had collected to date, but the effort was more than worth it. The CPU I/O wait stats on our database server dropped to almost nothing compared to what they had been. Our database was less than half the size, overall performance on the site improved significantly, and incoming data streams were being handled in less than a second.
Had we had the money, it would have been easy to throw hardware at this issue instead of spending a couple of days to iteratively tune our existing infrastructure. The benefit that we saw from the InnoDB clustered index was well worth the man-hours spent researching and implementing the solution.
TipJoy Testing
Testing a fellow YCombinator.com startup – TipJoy.com
Trying to add Tip button…
Hello, world!
Not much to say for my first blog post - so I'll just say...
#include <iostream>
int main()
{
    std::cout << "Hello, world!" << std::endl;
    return 0;
}
------------------------------------------
SELECT 'Hello, world!' FROM dual;