Post-mortem: API Downtime on July 31st, 2012

Mixpanel is a company that prides itself on being able to provide extraordinarily advanced analytics in real time. Paramount to this goal is your ability to send data accurately and reliably to our api.mixpanel.com servers, something we failed to provide from 8:00 AM to 12:20 PM PDT this morning. For this we are extremely sorry; you, our customers, deserve better. We realize that you are trusting us by placing our JS on your websites and that your businesses are affected when we go down. Here is what happened, and what we are doing to make sure this sort of failure never happens again.

What happened

When you send a track request to api.mixpanel.com, round-robin DNS routes the request to one of our two load balancer machines, which run nginx and act as reverse proxies to the actual API processing servers. A sudden, substantial increase in API requests caused both of these load balancer machines to run out of memory at the same time and begin swapping to disk. This drastically increased the time it took to service each request, and since we also use these machines to serve the static mixpanel.js JavaScript libraries, those requests began timing out as well.
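To illustrate why both machines ran into trouble at the same moment, here is a minimal sketch of round-robin distribution across two load balancers. The addresses are hypothetical placeholders, and real DNS round robin depends on resolver and client behavior rather than strict rotation, so treat this as a simplified model and not a description of our actual setup.

```python
from collections import Counter
from itertools import cycle

# Hypothetical addresses standing in for the two load balancer machines.
LOAD_BALANCERS = ["192.0.2.10", "192.0.2.11"]


def distribute(request_count, backends=LOAD_BALANCERS):
    """Assign requests to backends in strict rotation and count them per backend."""
    rotation = cycle(backends)
    return Counter(next(rotation) for _ in range(request_count))


normal = distribute(1_000)   # each machine sees 500 requests
surge = distribute(10_000)   # each machine sees 5,000 requests

for address in LOAD_BALANCERS:
    print(address, normal[address], surge[address])
```

Because the load is split evenly, a surge grows on every machine at the same rate, so two identically sized machines reach their memory limit together instead of one absorbing the failure of the other.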

The problem was quickly identified, but solving it turned out to be more difficult. We could not update DNS to point api.mixpanel.com at a different set of servers, because we had previously set the TTL on its DNS records to an entire day. However, we could reassign the portable IPs that DNS was pointing at to a different machine. Unfortunately, the only machine we had available sat behind an entirely different router from the one serving the portable IP addresses assigned to the load balancer machines, so the addresses could not be moved to it. We then determined that the next best thing was to order several new load balancer machines and temporarily lower nginx's client header buffer size to reduce the amount of memory nginx was using.
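The effect of the one-day TTL can be shown with a little arithmetic. The sketch below assumes the worst case, where a resolver cached the old record just before the change; the five-minute TTL and the time of the DNS update are hypothetical values chosen for comparison, not settings we used.

```python
from datetime import datetime, timedelta


def latest_stale_lookup(changed_at, ttl_seconds):
    """Return the latest time a resolver may still serve the old record.

    Worst case: the resolver cached the old answer just before the change,
    so it keeps that answer for one full TTL.
    """
    return changed_at + timedelta(seconds=ttl_seconds)


ONE_DAY_TTL = 24 * 60 * 60  # the TTL described above: an entire day
SHORT_TTL = 5 * 60          # hypothetical alternative: five minutes

change = datetime(2012, 7, 31, 8, 30)  # hypothetical time of a DNS update

print(latest_stale_lookup(change, ONE_DAY_TTL))  # 2012-08-01 08:30:00
print(latest_stale_lookup(change, SHORT_TTL))    # 2012-07-31 08:35:00
```

With a one-day TTL, some clients could keep hitting the overloaded machines until the following morning, which is why a DNS change was not a workable fix during the outage.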

What this means for you

The vast majority of track requests sent during this time were lost, and once again we sincerely apologize. In an hourly report on July 31st, you will see a drop-off in events for the hours spanning 8AM – 11AM PDT. Furthermore, 12PM PDT’s event counts should be about half of what they normally would be. Your daily data for July 31st will be slightly depressed as well.

How we are preventing this from happening in the future

Here at Mixpanel, we spend a tremendous amount of engineering effort making sure that our data collection infrastructure can withstand the thousands of events per second our customers send to us. While our custom data store handled the sudden event rate increase without a single hiccup, it is somewhat ironic, and frankly quite embarrassing, that a simple load balancer failed because it ran out of memory. Here’s how we’re going to avoid getting caught with our pants down again.

  1. Far more proactive monitoring

    We use the Munin monitoring tool to keep tabs on server status. Munin provides an incredible variety of plugins to monitor everything from simple CPU usage to MongoDB write lock percentage. It also provides warning and critical thresholds for any numeric values that these plugins report. As we’ve grown our server count to over 200, properly setting these thresholds has fallen by the wayside. We’ve gone through each and every one to make sure these thresholds exist and make sense. Furthermore, we are adding an email notification every time a value crosses a warning or critical threshold.

  2. Long term capacity planning

    It is not enough to simply add more memory to the machine and call it a day. While we have increased the memory on the load balancing machines by 8x, effectively removing it as a bottleneck in the future, we also want to resolve the underlying issue. We’ve gone through our infrastructure to identify what the next bottlenecks would be and made sure we have specific plans for upgrading them. Combined with monitoring, this means we will know exactly when and how to mitigate each bottleneck.

  3. Using a CDN for delivery of mixpanel.js

    While the Mixpanel JavaScript snippet should be loaded asynchronously so as to not affect your page load times, there have still been several reports of a long JavaScript timeout preventing customer sites from loading. We are still working to get to the bottom of this issue. However, there is still absolutely no reason why we cannot get virtually 100% uptime for a single snippet of JS. We will be moving the JS snippet off the API servers and onto a proper content delivery network.
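The warning and critical thresholds from item 1 can be sketched roughly as follows. This is an illustrative model only, not Munin's implementation: the `min:max` range format is loosely modeled on the way Munin thresholds are written, and the function names, host name, and `send` callback are all hypothetical.

```python
def parse_range(spec):
    """Parse 'min:max', 'min:', ':max' or a bare 'max' into (low, high)."""
    if ":" in spec:
        low, high = spec.split(":", 1)
    else:
        low, high = "", spec
    return (float(low) if low else None, float(high) if high else None)


def outside(value, spec):
    """Return True when value falls outside the allowed range."""
    low, high = parse_range(spec)
    return (low is not None and value < low) or (high is not None and value > high)


def classify(value, warning, critical):
    """Check the critical range first so it takes priority over warning."""
    if outside(value, critical):
        return "critical"
    if outside(value, warning):
        return "warning"
    return "ok"


def notify(host, field, value, warning, critical, send=print):
    """Call send (a stand-in for an email hook) whenever a threshold is crossed."""
    level = classify(value, warning, critical)
    if level != "ok":
        send(f"[{level.upper()}] {host} {field}={value}")
    return level


# Hypothetical example: memory usage in percent on a load balancer.
notify("lb1.example.com", "memory_used_pct", 85, warning="80", critical="95")
```

The point of wiring the notification to every threshold crossing is that a machine approaching its memory limit raises a warning well before it starts swapping.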

If you have any other questions about the downtime, please feel free to email support@mixpanel.com.
