Cluster A - Webmail
Incident Report for OpenSRS
Postmortem

Wed, Nov 13, 2024 8:00pm - Tue, Nov 19, 4:00pm EST

Description

On Nov 13, 8:00am EST, a hardware failure caused IMAP and Webmail services to fail on one of our mailstores.

The failure and the normal inbound flow of requests resulted in an unexpected increase in the number of in-flight requests, which caused performance issues on the surviving members of the cluster. The net result was a degraded service level that did not self-remedy after the failover as expected.

As a result, a subset of users was unable to access their mailboxes during the ongoing maintenance.

The increase in in-flight requests as the peak hours approached further compounded the issue, affecting the load on other platform components. This resulted in issues that affected normal access to IMAP and Webmail services.

Root Causes Found

The root cause was a hardware failure, compounded by increased workload demands on the surviving network elements.

Solution

To recover from the peak backlog, we had to throttle connections and gradually re-enable service to stabilize the cluster.
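
The report does not describe how the throttling was implemented. As a rough illustration only, the general technique of capping in-flight connections and raising the cap in steps could look like the sketch below. The class name `RampingLimiter` and its parameters are hypothetical and are not taken from the OpenSRS platform.

```python
import threading


class RampingLimiter:
    """Caps in-flight connections and raises the cap in steps.

    Hypothetical sketch: all names and limits are illustrative assumptions.
    """

    def __init__(self, start_limit, max_limit, step):
        self.limit = start_limit      # current cap on in-flight connections
        self.max_limit = max_limit    # cap once service is fully re-enabled
        self.step = step              # how much each ramp-up raises the cap
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Admit a connection if under the current cap, else reject it."""
        with self._lock:
            if self.in_flight >= self.limit:
                return False
            self.in_flight += 1
            return True

    def release(self):
        """Mark one admitted connection as finished."""
        with self._lock:
            if self.in_flight > 0:
                self.in_flight -= 1

    def ramp_up(self):
        """Raise the cap by one step, never beyond max_limit."""
        with self._lock:
            self.limit = min(self.limit + self.step, self.max_limit)
            return self.limit
```

In a scheme like this, an operator or a health check would call `ramp_up()` only while the cluster stays stable, so the backlog drains before the full inbound load is admitted again.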

Post Mortem

To help prevent a recurrence of issues like this, we have modified the way in which users are assigned to different platform components. We have also begun transferring user mailboxes to other elements to better balance resource requirements across the board.

Posted Nov 19, 2024 - 21:56 UTC

Resolved
We are very sorry for the disruption. Thank you again for your patience. Service is once again working as expected.

Start Time: 11/19/2024 17:10 UTC
End Time: 11/19/2024 20:42 UTC
Duration: 3 hours, 32 minutes
Posted Nov 19, 2024 - 21:44 UTC
Monitoring
A fix has been implemented for this issue. We're monitoring the results now.
Posted Nov 19, 2024 - 20:52 UTC
Update
We are getting reports that IMAP and POP services are also affected. Please stand by while we continue to investigate.
Posted Nov 19, 2024 - 20:13 UTC
Investigating
We are investigating an issue preventing some users on Cluster A from logging into Webmail. Hosted Email is experiencing issues resulting in delayed access to Webmail. IMAP and POP services are unaffected. Our engineering team has been engaged.

We will provide an update once we have additional information.
Posted Nov 19, 2024 - 20:09 UTC
This incident affected: Hosted Email (Cluster A, Webmail).