An update was deployed to the API early in the weekend (US time) to gather additional usage statistics.
Things went well initially, but the infrastructure started to buckle under heavy load around 9 AM EST.
As soon as our monitoring systems alerted us to the elevated error rates, we rolled back the change. Unfortunately, during the rollback the system suffered a thundering-herd effect: a huge number of failed requests were retried concurrently, overwhelming capacity even with autoscaling enabled.
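For context on why the concurrent retries were so damaging: when many clients retry failed requests on the same fixed schedule, the retries arrive in synchronized waves that can exceed the original peak load. A common client-side mitigation is exponential backoff with jitter. The sketch below is purely illustrative and does not reflect our actual client code; every name in it (call_with_backoff, MAX_RETRIES, BASE_DELAY) is a hypothetical placeholder.

```python
import random
import time

MAX_RETRIES = 5
BASE_DELAY = 0.5   # seconds; hypothetical starting backoff
MAX_DELAY = 30.0   # cap so the backoff window stays bounded


def call_with_backoff(request_fn):
    """Call request_fn, retrying transient failures with jittered backoff."""
    for attempt in range(MAX_RETRIES):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == MAX_RETRIES - 1:
                raise
            # Exponential backoff: windows of 0.5s, 1s, 2s, 4s, ...
            # capped at MAX_DELAY.
            window = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
            # "Full jitter": sleep a uniformly random time within the
            # window, which decorrelates retries across clients instead
            # of letting them fire in lockstep.
            time.sleep(random.uniform(0, window))
```

The key property is the jitter, not the exponential growth alone: even exponentially spaced retries will still arrive in synchronized waves if every client computes the same deterministic delay.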
To handle the increased load, we temporarily doubled the allowable capacity, and the system stabilized a few minutes later.
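As an illustration of the kind of change involved: the report does not say which platform the service runs on, but assuming an AWS Auto Scaling group, raising the capacity ceiling might look like the snippet below. The group name and sizes are hypothetical.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Raise the ceiling so the fleet can scale past its normal limit while
# the retry storm drains; desired capacity is left to the scaling
# policies themselves.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="api-workers",  # hypothetical group name
    MaxSize=40,  # temporarily doubled from a hypothetical normal cap of 20
)
```

Raising only the maximum (rather than pinning the desired count) lets autoscaling shrink the fleet back down on its own once the backlog of retries clears.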