Cluster A login issues
Incident Report for OpenSRS
Postmortem

Incident Date: September 17, 2021
Incident Number: PR-2349

On September 17, 2021, at 11:22 AM ET, Tucows’ hosted email platform experienced a service interruption affecting IMAP, POP and Webmail in Prod A.

The service interruption was due to a split-brain issue on one of the load balancers.

At 12:36 PM ET, the Engineering team restored service and stabilized the email environment by manually failing traffic over to the secondary load balancer.

Tucows will continue to work with the vendor to determine the root cause of the issue and why the automatic failover process did not trigger.

Thank you,

Tucows Engineering Team

Posted Sep 21, 2021 - 15:56 UTC

Resolved
We can confirm that services are fully operational and stable. This incident is now resolved.

Incident Start Time: 09-17-2021 15:22:00 UTC
Incident End Time: 09-17-2021 16:36:00 UTC
Total Duration: 1 hour and 14 minutes
Posted Sep 17, 2021 - 17:08 UTC
Monitoring
We have taken measures to restore service for our Cluster A users and will continue to monitor to ensure stability. Users should now be able to log back in to Webmail.
Posted Sep 17, 2021 - 16:49 UTC
Update
The engineering team is currently investigating the root cause of this incident. We will update as soon as we have more information.
Posted Sep 17, 2021 - 16:19 UTC
Investigating
We are currently investigating a login issue affecting a small set of customers on Cluster A (Webmail, POP, IMAP). The engineering team has been engaged and is investigating the cause.
Posted Sep 17, 2021 - 15:56 UTC
This incident affected: Hosted Email (Cluster A).