Bot engine is down.
Incident Report for Meya Bot Platform
Postmortem

We experienced approximately 15 minutes of bot downtime yesterday, March 26, between 9:30 and 9:45am ET.

What happened?

We found that our event bus brokering system suddenly ran out of memory. The incident was caused by several factors working together:

  • a rapid ramp-up in server load on Monday morning
  • slow scale-up of the logging DB in response to this load, which caused log writes to queue in the event broker (call the number of queued writes Q)
  • excessively large log rows originating from integrations with webhook payloads greater than 3MB (call the avg. log row size E)
  • our event broker is a shared system, so its effects were widespread

Consider memory = Q x E. With a large value of E, it doesn't take a large value of Q to consume significant memory.
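To make that concrete, here is a rough back-of-the-envelope calculation in Python. The backlog size below is a hypothetical figure; only the ~3MB row size comes from the incident described above.

    # Illustrative back-of-the-envelope only; Q is hypothetical, E matches the
    # ~3MB webhook payloads described above.
    E_mb = 3          # avg. log row size in MB
    Q = 5_000         # hypothetical number of queued log writes
    memory_gb = Q * E_mb / 1024
    print(f"~{memory_gb:.0f} GB of broker memory held by the backlog")  # ~15 GB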

How did we fix it?

To resolve the issue, we quadrupled the memory capacity of the event broker cluster, which took about 10 minutes to take effect before bringing bots back to life.

What was affected?

  • core bot operations: flow starts and transitions
  • message delivery
  • component / CMS publishing
  • API & webhook -test chat

What will we change moving forward?

Immediately:

  • we've increased the baseline memory of the event queuing system. This raises the total memory available to the broker
  • we've deployed a low-level patch that limits the size of event metadata to mitigate sudden memory spikes (see the sketch after this list). This lowers E
  • we've increased the baseline throughput capacity of the logging database to further mitigate this kind of spike. This lowers Q
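As a rough illustration of what the metadata-limiting patch does, here is a minimal sketch in Python. The 256 KB cap and the field names are assumptions, not the values or code actually deployed.

    import json

    MAX_METADATA_BYTES = 256 * 1024  # hypothetical cap; the real limit isn't published here

    def limit_event_metadata(event: dict) -> dict:
        """Cap the serialized size of an event's metadata before it is queued for logging."""
        raw = json.dumps(event.get("metadata", {})).encode("utf-8")
        if len(raw) > MAX_METADATA_BYTES:
            # Replace the oversized payload with a small marker so E stays bounded.
            event["metadata"] = {"truncated": True, "original_size_bytes": len(raw)}
        return event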

Near term:

  • we're investigating a solution to auto-scale the event brokering system when memory runs low (a rough sketch follows this list)
  • we're evaluating further isolation of our logging sub-system
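One possible shape for that auto-scaling check, purely as a sketch: the 80% threshold and the scale_up callback are assumptions, not a committed design.

    # Sketch of a memory-pressure trigger for scaling the event broker cluster.
    MEMORY_SCALE_UP_THRESHOLD = 0.80  # hypothetical: scale up at 80% memory utilization

    def maybe_scale_up(used_bytes: int, total_bytes: int, scale_up) -> bool:
        """Call scale_up() (e.g. add broker nodes) when memory utilization crosses the threshold."""
        if used_bytes / total_bytes >= MEMORY_SCALE_UP_THRESHOLD:
            scale_up()
            return True
        return False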

Longer term:

  • As part of our bot runtime improvements, we are already in the process of engineering a new platform architecture that is not susceptible to this sort of event bus memory issue.
  • This incident will be used as part of the tech spec to ensure that this does not happen again.

Please feel free to email me directly at erik@meya.ai if you have any specific questions.

Posted Mar 27, 2018 - 15:35 EDT

Resolved
Marking as resolved. Post mortem to follow once we've completed the investigation into root cause.
Posted Mar 26, 2018 - 10:05 EDT
Monitoring
Tentatively back up. Verifying...
Posted Mar 26, 2018 - 09:47 EDT
Identified
We've identified the problem and are working to resolve. We have increased platform capacity 4X. ETA is 5-10m until fully provisioned.
Posted Mar 26, 2018 - 09:45 EDT
Investigating
We're investigating...
Posted Mar 26, 2018 - 09:32 EDT