Intermittent downtime for Hosted Knowledge Base
Incident Report for Elevio
Postmortem

Timeline (in UTC):

19:10: The hosted KB becomes unavailable; service recovers without intervention in less than 1 minute.

19:16: The hosted KB becomes unavailable; service recovers without intervention within 2 minutes.

19:50 - 20:30: Multiple events where the hosted KB becomes unavailable for 1-2 minutes and service recovers without intervention. The on-call engineer escalates the incident to the backend team.

20:30 - 22:00: The service becomes unavailable and no longer recovers on its own. Restarting the servers restores service only for short periods of time.

22:15: The issue is identified and a temporary fix is deployed. Service resumes as normal.

23:30: A permanent fix is deployed. To deploy the new version, the temporary fix had to be briefly reverted, which made the service unavailable for less than 5 minutes.

Posted Nov 28, 2022 - 09:50 AEDT

Resolved
On Saturday 26th November (UTC), an incident affecting the Elevio hosted KB caused it to become intermittently unavailable between 19:10 and 23:30 UTC. The cause was a resource leak in the KB proxy server: file descriptors (fds) were not closed after a certificate was read, so once the fd limit was reached the server rejected new connections. The leak was due to a bug in the underlying proxy server software and was resolved by updating that software to the latest version.
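
For readers unfamiliar with this failure mode, the sketch below illustrates a file descriptor leak in general terms. It is not the proxy server's code (this report does not name the software involved), and every function name here is hypothetical. It assumes a Unix-like system and contrasts a read that never closes its descriptor with one that always does.

```python
import os
import tempfile


def open_fds(max_fd: int = 1024) -> set:
    """Return the descriptor numbers currently open in this process."""
    found = set()
    for fd in range(max_fd):
        try:
            os.fstat(fd)  # raises OSError if fd is not open
        except OSError:
            continue
        found.add(fd)
    return found


def read_cert_leaky(path: str) -> bytes:
    """Hypothetical bug pattern: the descriptor is opened but never closed."""
    fd = os.open(path, os.O_RDONLY)
    return os.read(fd, 1 << 16)  # no os.close(fd): one fd leaks per call


def read_cert_safe(path: str) -> bytes:
    """Same read, but the context manager always releases the descriptor."""
    with open(path, "rb") as f:
        return f.read()


def descriptors_left_open(reader, path: str, calls: int) -> int:
    """Call reader repeatedly and count descriptors it leaves open."""
    before = open_fds()
    for _ in range(calls):
        reader(path)
    leaked = open_fds() - before
    for fd in leaked:  # clean up so the demo itself does not leak
        os.close(fd)
    return len(leaked)


def make_fake_cert() -> str:
    """Write a placeholder file standing in for a certificate."""
    with tempfile.NamedTemporaryFile("wb", suffix=".pem", delete=False) as tmp:
        tmp.write(b"placeholder certificate bytes\n")
        return tmp.name


if __name__ == "__main__":
    cert_path = make_fake_cert()
    try:
        print("leaky:", descriptors_left_open(read_cert_leaky, cert_path, 10))
        print("safe: ", descriptors_left_open(read_cert_safe, cert_path, 10))
    finally:
        os.remove(cert_path)
```

In a long-running server, each leaked descriptor counts against the per-process limit. Once that limit is reached, accepting a new connection fails because a connection also needs a descriptor, which matches the behaviour described above, where restarts helped only briefly.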

We apologise for this disruption to our service. The issue was particularly difficult to track down because the proxy server had not been updated recently, and the unfortunate timing of the event (Saturday night / Sunday early morning) meant extra delays in getting the daytime team up to speed.
Posted Nov 27, 2022 - 06:00 AEDT