An update was deployed to the API early in the weekend (US time) to gather additional usage statistics.
Things went well initially, but the infrastructure started to buckle under heavy load around 9 AM EST.
As soon as our monitoring systems alerted us to the elevated error rates, we rolled back the change. Unfortunately, during the rollback the system suffered a thundering-herd effect: a huge number of failed requests were retried concurrently, overwhelming capacity even with autoscaling enabled.
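For context on why the concurrent retries were so damaging: when many clients retry failed requests on the same fixed schedule, the retries arrive in synchronized waves that can exceed the original peak load. A common client-side mitigation is exponential backoff with jitter. The sketch below is purely illustrative and does not reflect our actual client code; every name in it (call_with_backoff, MAX_RETRIES, BASE_DELAY) is a hypothetical placeholder.

```python
import random
import time

MAX_RETRIES = 5
BASE_DELAY = 0.5   # seconds; hypothetical starting backoff
MAX_DELAY = 30.0   # cap so the backoff window stays bounded


def call_with_backoff(request_fn):
    """Call request_fn, retrying transient failures with jittered backoff."""
    for attempt in range(MAX_RETRIES):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == MAX_RETRIES - 1:
                raise
            # Exponential backoff: windows of 0.5s, 1s, 2s, 4s, ...
            # capped at MAX_DELAY.
            window = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
            # "Full jitter": sleep a uniformly random time within the
            # window, which decorrelates retries across clients instead
            # of letting them fire in lockstep.
            time.sleep(random.uniform(0, window))
```

The key property is the jitter, not the exponential growth alone: even exponentially spaced retries will still arrive in synchronized waves if every client computes the same deterministic delay.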
To handle the increased load, we temporarily doubled the allowable capacity, and the system stabilized a few minutes later.
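As an illustration of the kind of change involved: the report does not say which platform the service runs on, but assuming an AWS Auto Scaling group, raising the capacity ceiling might look like the snippet below. The group name and sizes are hypothetical.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Raise the ceiling so the fleet can scale past its normal limit while
# the retry storm drains; desired capacity is left to the scaling
# policies themselves.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="api-workers",  # hypothetical group name
    MaxSize=40,  # temporarily doubled from a hypothetical normal cap of 20
)
```

Raising only the maximum (rather than pinning the desired count) lets autoscaling shrink the fleet back down on its own once the backlog of retries clears.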