Mixpanel is a company that prides itself on being able to provide extraordinarily advanced analytics in real time. Paramount to this goal is accurately and reliably collecting the data you send to our api.mixpanel.com servers, something we failed to do from 8:00 AM to 12:20 PM PDT this morning. For this we are extremely sorry; you, our customers, deserve better. We realize that you trust us by placing our JS on your websites and that your businesses are affected when we go down. Here is what happened, and what we are doing to make sure this sort of failure never happens again.
The problem was quickly identified, but solving it turned out to be more difficult. We could not update DNS to point api.mixpanel.com at a different set of servers, because we had previously set the TTL on those DNS records to an entire day. We could, however, reassign the portable IPs that the DNS was pointing at to a different machine. Unfortunately, the only machine we had available sat behind an entirely different router than the one serving the portable IPs assigned to the load balancer machines. We determined that the next best thing was to order several new load balancer machines and, in the meantime, temporarily lower the client header buffer size on nginx requests to reduce the amount of memory nginx was using.
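For reference, that nginx change amounts to shrinking the buffers nginx reserves per connection for request headers. The snippet below is a sketch with illustrative values, not the exact configuration we deployed:

```
# Illustrative values only -- not the exact configuration we deployed.
http {
    # Shrink the default per-connection header buffer (default 1k) and the
    # fallback buffers used for oversized headers (default 4 8k).
    client_header_buffer_size   512;
    large_client_header_buffers 2 4k;
}
```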
What this means for you
The vast majority of track requests sent during this time were lost, and once again we sincerely apologize. In an hourly report on July 31st, you will see a drop-off in events for the hours spanning 8AM – 11AM PDT. Furthermore, 12PM PDT’s event counts should be about half of what they normally would be. Your daily data for July 31st will be slightly depressed as well.
How we are preventing this from happening in the future
Here at Mixpanel, we spend a tremendous amount of engineering effort making sure that our data collection infrastructure can withstand the thousands of events per second our customers send to us. While our custom data store handled the sudden increase in event rate without a single hiccup, it is somewhat ironic, and frankly quite embarrassing, that a simple load balancer failed because it ran out of memory. Here's how we're going to avoid getting caught with our pants down again.
- Far more proactive monitoring
We use the Munin monitoring tool to keep tabs on server status. Munin provides an incredible variety of plugins to monitor everything from simple CPU usage to MongoDB write lock percentage. It also provides warning and critical thresholds for any numeric value that these plugins report. As we've grown our server count to over 200, properly setting these thresholds has fallen by the wayside. We've now gone through each and every one to make sure the thresholds exist and make sense. Furthermore, we are adding an email notification every time a value crosses a warning or critical threshold.
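For the curious, setting thresholds and wiring up notifications in Munin boils down to a few lines in munin.conf. The excerpt below is a sketch; the hostname, contact address, and threshold values are examples rather than our actual configuration:

```
# Mail the on-call alias whenever a monitored value crosses a threshold.
contact.oncall.command mail -s "Munin alert ${var:group} :: ${var:host}" oncall@example.com
contact.oncall.always_send warning critical

[loadbalancer1.example.com]
    address 10.0.0.1
    # Warn when the 5-minute load average passes 8; go critical at 12.
    load.load.warning  8
    load.load.critical 12
    # Warn when the root filesystem is 90% full; go critical at 95%.
    df._dev_sda1.warning  90
    df._dev_sda1.critical 95
```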
- Long-term capacity planning
It is not enough to simply add more memory to the machine and call it a day. While we have increased the memory on the load balancing machines by 8x, effectively removing them as a bottleneck for the foreseeable future, we also want to address the underlying issue: we did not know where the next bottleneck would appear. We've gone through our infrastructure to identify what the next bottlenecks would be and made specific plans for how we will upgrade each of them. Combined with the improved monitoring, we will know exactly when and how to mitigate these bottlenecks before they become problems.
- Using a CDN for delivery of mixpanel.js
We are also moving delivery of mixpanel.js to a content delivery network, so that loading the library on your site no longer depends on our own load balancers.
If you have any other questions about the downtime, please feel free to email email@example.com.