On Monday 12 July 2021 at 10AM AEST, we made a change to our main database to improve its stability and performance. We closely monitored metrics to confirm that all dependent systems continued to perform normally, and they did.
However, on Tuesday 13 July 2021 at around 12:05AM AEST, the database began to degrade, causing a partial outage. Our failover system kicked in within seconds, and within 5-10 minutes most of our services had fully recovered and metrics had stabilised.
Unfortunately, the database gradually degraded again, leading to another partial outage at 4:20AM AEST. As in the first outage, it successfully failed over to a backup instance; this time, however, it took our systems 15-20 minutes to fully recover. By around 4:50AM AEST everything had returned to normal.
After investigating both incidents, we identified the root cause as the change we made on Monday morning. We've decided to roll the database back to its previous configuration, and we're cautiously optimistic that the issue will not recur.
Posted Jul 13, 2021 - 13:06 AEST
This incident affected: Content Management and Infrastructure (Dashboard / API Servers).