Incident Dates: August 2 and August 5, 2019
Tucows' email platform, prod B, experienced a sequence of isolated incidents between August 2 and 8, 2019. These incidents caused the email platform to become unavailable in the prod B data center.
On August 2, 2019, at 08:51, a slow disk and a RAID controller card on one of the paired storage devices (d-nfs07b) failed; as designed, the system failed over to its secondary standby storage. Remote hands replaced the failed RAID controller card on August 3, 2019, at 10:00.
Tucows engineers continued migrating mail accounts over the weekend to prevent client impact.
On August 5, 2019, a storage device (d-nfs07b) experienced latency issues that affected around 2% of prod B customers. The degradation of d-nfs07b exhausted the resources on its paired replica, d-nfs07a. Due to a software bug, it was difficult to identify the bad disk and fail it over following our standard operating procedures.
To stabilize the environment, Tucows' engineers started to migrate mailboxes onto other available clusters to reduce load and improve customers' experience. However, the degraded performance of the cluster (d-nfs A and B) hindered the migration efforts.
On August 7, 2019, at 16:30, Tucows engineers released select held messages from the queue, and they were delivered within 20 minutes. All remaining messages were released at 20:00, after the Tucows abuse team completed its scan of the mail queues.
On August 8, 2019, the continued migration of mailboxes allowed us to stabilize the environment, reduce the load on the affected cluster, and restore the customer experience.
The Tucows team continued the stabilization work over the weekend, which caused minor service interruptions while enabling services and further migrating mailboxes off the affected cluster.
Preventive measures:
Furthermore, Tucows has been planning the migration of the prod B email platform, which runs in the Ashburn data center, to a new data center.