Cluster A - email service
Incident Report for OpenSRS
Postmortem

Incident Date: March 3, 2022
Incident Number: PR-2900

On March 3, 2022, at 6:12 PM ET, Tucows’ hosted email platform experienced a service interruption impacting webmail for Prod A. Tucows’ Engineering team was engaged to investigate the issue.

The service interruption was caused by planned maintenance, recommended by the vendor, which triggered unexpected behaviour on the load balancer.

At 6:45 PM ET, the Engineering team reverted the change and began working to stabilize the email environment, which was under high load.

At 10:30 PM ET, the Engineering team rebooted multiple systems to alleviate the high load and stabilize the email environment.

Tucows will work with the vendor to further investigate the root cause of the unexpected behaviour on the load balancer during the maintenance.

Thank you,

Tucows Engineering Team

Posted Mar 23, 2022 - 19:35 UTC

Resolved
All affected accounts are now accessible for sending and receiving messages, and can be accessed via webmail. This was achieved by restarting the services and controlling the deliveries until the queue was clear. Thank you for your patience as we worked to resolve this.

Incident Start Time: Mar 03, 2022 23:12:00 UTC
Incident End Time: Mar 04, 2022 03:30:00 UTC
Total Duration: 4 hours, 18 minutes
Posted Mar 04, 2022 - 04:20 UTC
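For readers cross-checking the timeline, the sketch below recomputes the total duration from the UTC start and end times above and converts them to the ET times used in the postmortem. It is an illustration only, not part of Tucows' tooling; the variable names are our own, and the fixed UTC-5 offset assumes Eastern Standard Time, which applied on these dates because US daylight saving time began on March 13, 2022.

```python
from datetime import datetime, timedelta, timezone

# Incident window as reported in this update (UTC).
start_utc = datetime(2022, 3, 3, 23, 12, tzinfo=timezone.utc)
end_utc = datetime(2022, 3, 4, 3, 30, tzinfo=timezone.utc)

# Total duration, split into whole hours and minutes.
duration = end_utc - start_utc
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

# Assumption: ET is EST (UTC-5) on these dates, since DST had not yet started.
eastern_standard = timezone(timedelta(hours=-5), name="EST")
start_et = start_utc.astimezone(eastern_standard)
end_et = end_utc.astimezone(eastern_standard)

print(f"Duration: {hours} hours, {minutes} minutes")
print(f"Start (ET): {start_et:%b %d, %Y %I:%M %p}")
print(f"End (ET): {end_et:%b %d, %Y %I:%M %p}")
```

Under that assumption, the UTC times correspond to the 6:12 PM ET and 10:30 PM ET entries in the postmortem.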
Update
The resolution is trickier than expected but our team is still hard at work on it.
Next update: within 60 minutes.
Posted Mar 04, 2022 - 02:45 UTC
Identified
We've isolated the problem and are working on bringing the service back online.
Next update: within 60 minutes.
Posted Mar 04, 2022 - 01:28 UTC
Update
We really appreciate your patience. We know this is a big disruption to your business and we've got all hands on deck for this.
Next update: within 30 minutes.
Posted Mar 04, 2022 - 00:59 UTC
Update
Our team is working to address this issue as quickly as we can.
Next update: within 30 minutes.
Posted Mar 04, 2022 - 00:29 UTC
Investigating
We are investigating an issue preventing some users on Cluster A from logging into Webmail and from sending and receiving messages. Our engineering team has been engaged.

We will provide an update once we have additional information.
Posted Mar 03, 2022 - 23:54 UTC
This incident affected: Hosted Email (Cluster A, Webmail).