Bot platform outage affecting all bots
Incident Report for Meya CX Automation Platform
Postmortem

What happened?

Today at 1:08pm ET we experienced a platform outage lasting 20 minutes, until 1:28pm ET. All bots were affected because a core piece of server infrastructure failed.

Root cause: Amazon Web Services experienced issues from 8:15am ET to 1:05pm ET that prevented new nodes from scaling up. Meya’s system responded by queuing a large volume of “push” requests. There were no obvious issues until AWS resolved their scaling problem just prior to our downtime, at which point our nodes began to scale up quickly.

Prior to the scale-up, the deep queue had used up a large amount of memory in our Redis instance. This was not yet a problem, as there was still plenty of memory left. However, the sudden, rapid scaling of our nodes overwhelmed several sub-systems, including our logging, which put even more pressure on Redis memory. Redis then ran out of memory, and because it is a critical piece of infrastructure, all bots failed.
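
For illustration, a minimal sketch of the kind of Redis memory check that would have surfaced this pressure earlier is shown below. The connection settings and the 80% threshold are assumptions for the example, not our production configuration.

```python
import redis

# Hypothetical connection settings, shown for illustration only.
r = redis.Redis(host="localhost", port=6379)

info = r.info("memory")
used = info["used_memory"]
limit = info["maxmemory"]  # 0 means no explicit limit is configured

# Alert when usage crosses an assumed 80% threshold of the configured limit.
if limit and used / limit > 0.80:
    print(f"WARNING: Redis memory at {used / limit:.0%} of maxmemory")
```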

How did we fix it?

1st phase 1:08pm - 1:18pm ET: Once aware of the event, our engineers began manually clearing large amounts of Redis memory, which brought the platform back up.
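
As a rough illustration of that kind of manual cleanup, the sketch below removes queued entries in batches; the “queue:*” key pattern and batch sizes are hypothetical and not our actual key scheme.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Delete stale queued entries in batches. UNLINK frees memory
# asynchronously, so it avoids blocking Redis the way a large DEL would.
batch = []
for key in r.scan_iter(match="queue:*", count=1000):
    batch.append(key)
    if len(batch) >= 500:
        r.unlink(*batch)
        batch.clear()
if batch:
    r.unlink(*batch)
```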

2nd phase 1:18pm - 1:45pm ET: While the platform was now up for the most part, some bots were still experiencing high latency, and we still had a deep queue of requests to process for some high-volume bots. We began manually throttling and “overclocking” bots in priority order to get their queues down to zero, roughly as sketched below.
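
This is a simplified sketch of that priority-ordered drain; the bot IDs, queue key naming, and processing stub are made up for illustration and are not how the platform actually tracks per-bot queues.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Hypothetical bot IDs and key scheme, for illustration only.
bots_by_priority = ["high-volume-bot-a", "high-volume-bot-b", "low-volume-bot"]


def handle_request(bot_id, raw_request):
    # Placeholder for the real per-request processing.
    print(f"processing request for {bot_id}: {raw_request!r}")


for bot_id in bots_by_priority:
    queue_key = f"requests:{bot_id}"
    # "Overclock" the bot: keep popping until its backlog reaches zero.
    while True:
        raw_request = r.lpop(queue_key)
        if raw_request is None:
            break
        handle_request(bot_id, raw_request)
```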

3rd phase 1:45pm - 3:46pm ET: We began implementing longer-term fixes to restore the platform to its “automatic” mode of scaling. We continued to monitor bot “liveness” during this time.

What was affected?

All bots were down for 20 minutes, from 1:08pm to 1:28pm ET. The Meya web console and command line interface were mostly unaffected.

What will change moving forward?

We’ve added a series of monitors and processes that watch for the root-cause condition: a large, long-lasting delta between the desired number of nodes and the number of nodes actually in service. When this occurs, we will be notified and will be able to take preventative action.
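
For illustration, a one-shot version of that check might look like the sketch below, using the AWS Auto Scaling API. The group name, region, and alert threshold are assumptions, not our actual monitor configuration.

```python
import boto3

# The ASG name, region, and threshold below are assumptions for the example.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

resp = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["bot-worker-nodes"]
)
group = resp["AutoScalingGroups"][0]

desired = group["DesiredCapacity"]
in_service = sum(
    1 for i in group["Instances"] if i["LifecycleState"] == "InService"
)

# A real monitor would also require the delta to persist for some time
# before alerting; this one-shot check only illustrates the comparison.
if desired - in_service >= 3:
    print(f"ALERT: desired={desired} nodes, only {in_service} in service")
```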

It should be noted that this was an AWS issue, and issues of this kind are extremely rare, typically occurring less than once per year. We will be following up with our solutions contacts at AWS to further investigate the root cause from their side.

Posted Oct 14, 2020 - 18:14 EDT

Resolved
We've completed our monitoring of this event and all systems are operational. We will follow up later with a post-mortem explaining root cause.
Posted Oct 14, 2020 - 15:58 EDT
Update
We are continuing to monitor for any further issues.
Posted Oct 14, 2020 - 15:39 EDT
Update
We are continuing to monitor for any further issues.
Posted Oct 14, 2020 - 14:35 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 14, 2020 - 13:33 EDT
Investigating
We're currently working on getting the platform back online.
Posted Oct 14, 2020 - 13:32 EDT
This incident affected: Bot Engine.