API Downtime
Incident Report for Elevio
Postmortem

In the past month, we've had a few unfortunate issues, and we'd like to take a moment to document them to allay any fears.

In order of oldest to most recent:

Around a month ago, we saw some weird behavior on some of the servers. In the end, it was due to the servers running for eight months with no reboot. After a quick reboot, everything returned to normal.

To prevent this from recurring, we now reboot the servers periodically.

During the three weeks that followed, we prepped our servers for a new feature release (that wasn’t in public use). This inadvertently caused issues for some customers due to an isolated backward-compatibility problem; it occurred only sporadically and was limited to a small set of customers. This was entirely our fault, and with the help of some of the customers who were experiencing the issue, it was promptly resolved. At the same time, we added additional monitoring so that we are made aware of similar issues moving forward.

To help streamline our reporting infrastructure, we modified how we record events while moving page view counts from one system to another. We failed to load test the change enough, and it buckled under the load of the full production rollout. We reverted to a previous build until we were able to properly handle the load (and any future load), and this hasn’t been a problem since.

Finally, last night, AWS had an outage in one of their subsystems for around 2 hours. We're already running in two data centers, so there was little we could do short of moving our whole infrastructure to a new region. We were in constant contact with AWS during this period to resolve the issue ASAP, but were completely at the mercy of their system.

In summary

In response to the issues that have appeared recently, we’ve put a lot of effort into making our infrastructure more resilient to prevent these and similar issues moving forward.

We've also added more granular real-time monitoring to alert us to issues sooner.

Thanks for your patience and trust during this period. Onward and upward.

Posted Nov 17, 2017 - 11:41 AEDT

Resolved
We had some downtime with the API. Please see the postmortem for more detail.
Posted Nov 17, 2017 - 02:00 AEDT