Postmortem for the Incident on the 14th of Dec

The following is the incident report for the push notification problem that occurred on December 14th, 2017. We understand this service issue has impacted our valued clients and readers, and we apologise to everyone who was affected.

Issue Summary

From 8:00AM to 4:00PM EST, all scheduled push notifications were sent out multiple times. Readers who were subscribed to websites with scheduled push notifications on Chrome, Firefox and Safari received the same notifications many times within that interval. The root cause of this duplication was an update to our notification-sending system that conflicted with our scheduling system.

Root Cause

At around 8:00AM EST, a major upgrade to the way our back-end sends push notifications was released on the live server. The update aimed to drastically increase the throughput of notifications and to process tens of thousands in a matter of seconds. This change was covered by unit and integration tests, and was also tested on a staging environment. When the code was pushed live, it clashed with the scheduling service. While the queue that sends notifications was being processed, the scheduling service read these notifications as not sent and queued them for sending again.

On the staging environment, the scheduling service was not activated, which is how this case was overlooked.
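To illustrate the kind of race described above, here is a minimal sketch in Python. It is not our production code: the names `Status`, `Notification`, `Scheduler`, `claim_due` and `mark_sent` are hypothetical, and the in-memory lock stands in for whatever atomic update the real data store provides. The idea is that the scheduler marks a notification as queued in the same step in which it hands it to the queue, so a second scheduling pass that runs while the queue is still sending finds nothing left to pick up.

```python
import threading
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    SCHEDULED = "scheduled"  # due, not yet handed to the send queue
    QUEUED = "queued"        # handed to the send queue, delivery in progress
    SENT = "sent"            # delivery finished


@dataclass
class Notification:
    id: int
    status: Status = Status.SCHEDULED


class Scheduler:
    """Hypothetical scheduler that claims notifications before queueing them."""

    def __init__(self, notifications):
        self._notifications = {n.id: n for n in notifications}
        self._lock = threading.Lock()

    def claim_due(self):
        # Select and mark in one critical section, so two passes can never
        # return the same notification.
        claimed = []
        with self._lock:
            for n in self._notifications.values():
                if n.status is Status.SCHEDULED:
                    n.status = Status.QUEUED
                    claimed.append(n)
        return claimed

    def mark_sent(self, notification_id):
        with self._lock:
            self._notifications[notification_id].status = Status.SENT


scheduler = Scheduler([Notification(1), Notification(2)])
first_pass = scheduler.claim_due()   # both notifications go to the queue
second_pass = scheduler.claim_due()  # runs while the queue is still sending
scheduler.mark_sent(1)
```

In this sketch the second pass returns an empty list. A scheduler that only checks for "not sent" would instead return both notifications again, which matches the duplication readers saw.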

Resolution and Recovery

At around 9:00AM EST, some clients reported to our customer support chat that duplicate notifications were being received. At 9:30AM EST a bug fix attempt was released, but due to the complexity of the system, it turned out to fix the wrong bug. At 3:00PM EST, for safety, we turned off all the services running on the server to prevent any notification from being sent. By 4:00PM EST, our team had identified the true root cause, written unit tests that replicate the exact environment in which the problem occurred, fixed it and deployed the fix to the live server. At this point, all our services were turned on again.

Corrective and Preventative Measures

In the following hours, we conducted an internal review and analysis of the incident. The following are actions we are taking to address the underlying cause of the issue, to help prevent recurrence and to improve response times:

  • Ensure that thorough tests are written for new features that affect the core of our system.
  • Ensure that the staging environment is a one-to-one replica of our live servers.
  • Fix the root cause and improve monitoring of how notifications are sent.
  • Prevent the queue from accepting duplicate notifications.
  • Add a kill switch that can stop the sending of notifications as soon as a single report is received.
  • Push major updates live only when we are sure that developers will be available to fix them, ideally early in their morning, regardless of time zone.
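
Two of the measures above, rejecting duplicates at the queue and adding a kill switch, can be sketched together. This is an illustrative Python sketch under assumptions, not a description of our actual implementation: the class `SendQueue` and its methods `enqueue`, `halt` and `drain` are hypothetical names, and a real system would keep the set of seen identifiers in shared storage rather than in memory.

```python
from collections import deque


class SendQueue:
    """Hypothetical send queue with duplicate rejection and a kill switch."""

    def __init__(self):
        self._pending = deque()
        self._seen_ids = set()  # every notification id ever accepted
        self.halted = False     # kill switch state

    def enqueue(self, notification_id):
        # Refuse an id that was already accepted, even if it has been sent.
        if notification_id in self._seen_ids:
            return False
        self._seen_ids.add(notification_id)
        self._pending.append(notification_id)
        return True

    def halt(self):
        # Kill switch: pending notifications stay queued but are not sent.
        self.halted = True

    def drain(self, send):
        # Send pending notifications until the queue is empty or halted.
        sent = 0
        while self._pending and not self.halted:
            send(self._pending.popleft())
            sent += 1
        return sent


delivered = []
queue = SendQueue()
accepted = [queue.enqueue(i) for i in (1, 2, 1)]  # the second 1 is a duplicate
queue.drain(delivered.append)

queue.enqueue(3)
queue.halt()                   # e.g. triggered by the first duplicate report
queue.drain(delivered.append)  # sends nothing while halted
```

Checking the switch inside the loop, rather than once before it, means a halt takes effect between two sends instead of only after the whole batch has gone out.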

Push Monkey is committed to continually and quickly improving our technology and operational processes to prevent outages and incidents. We appreciate your patience and again apologise for the impact on you, your readers, and your organisation. We thank you for your business and continued support.
