High bot latency
Incident Report for Meya CX Automation Platform
Postmortem

What happened?

Due to a combination of cumulative bot volume and high throughput, a critical bot engine query hit the read throughput threshold on our core DB at 12:15 pm ET. As a result, the core DB was I/O throttled, which increased bot latency.

How did we fix it?

We placed a temporary quarantine throttle on the affected bots, which returned latency to normal by 2:05 pm ET and gave us time to implement a long-term fix.
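To illustrate the idea of a quarantine throttle, here is a minimal sketch, not our production code. It assumes a token-bucket limiter applied only to bots that have been flagged; the names `TokenBucket`, `QuarantineThrottle`, and the `rate` and `capacity` parameters are hypothetical.

```python
class TokenBucket:
    """Allows about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a short burst is allowed
        self.updated = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class QuarantineThrottle:
    """Rate-limits only quarantined bots; all other bots pass through."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}  # bot_id -> TokenBucket, quarantined bots only

    def quarantine(self, bot_id, now):
        self.buckets[bot_id] = TokenBucket(self.rate, self.capacity, now)

    def release(self, bot_id):
        self.buckets.pop(bot_id, None)

    def allow(self, bot_id, now):
        bucket = self.buckets.get(bot_id)
        return True if bucket is None else bucket.allow(now)
```

Timestamps are passed in explicitly so the behavior is easy to reason about; a real caller would likely pass `time.monotonic()`. Keeping the limiter scoped to flagged bots is what lets the remaining bots recover while the heavy ones are held back.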

We optimized the necessary bot engine queries, increased the provisioned throughput on the appropriate database, and released the code fix at 7:30 pm ET.

We’re continuing to monitor the DB metrics, but it appears the fix has had the desired effect.

What was affected?

  • Bot runtime latency: most bots initially, then a subset of bots for a longer duration
  • Web console speed, to a lesser degree

What will change moving forward?

  • We've added monitoring and alarms for similar scenarios
  • We've made a contingency plan covering both scaling and query optimization, should either be needed
  • We've updated our response process to resolve similar issues more quickly
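As a rough illustration of the kind of alarm described above, the sketch below fires when average consumed read throughput over a recent window approaches the provisioned limit. It is an assumption-laden example rather than our actual monitoring setup: `ThroughputAlarm`, `warn_ratio`, and `window` are hypothetical names, and the numbers in any usage are made up.

```python
from collections import deque


class ThroughputAlarm:
    """Fires when the windowed average nears the provisioned throughput."""

    def __init__(self, provisioned, warn_ratio=0.8, window=5):
        self.provisioned = provisioned
        self.warn_ratio = warn_ratio          # fraction of the limit that triggers the alarm
        self.samples = deque(maxlen=window)   # most recent consumed-throughput samples

    def record(self, consumed):
        """Add one sample and return whether the alarm is currently firing."""
        self.samples.append(consumed)
        return self.firing()

    def firing(self):
        # Wait for a full window so a single spike does not page anyone.
        if len(self.samples) < self.samples.maxlen:
            return False
        average = sum(self.samples) / len(self.samples)
        return average >= self.warn_ratio * self.provisioned
```

Alerting at a fraction of the limit, rather than at the limit itself, leaves time to scale or throttle before the database starts throttling I/O.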
Posted Feb 21, 2020 - 12:05 EST

Resolved
This incident has been resolved.
Posted Feb 20, 2020 - 17:34 EST
Identified
We've uncovered an issue that was affecting performance and have tentatively returned bot performance to near normal. We're continuing to investigate a longer-term solution.
Posted Feb 20, 2020 - 14:10 EST
Investigating
We are currently investigating this issue.
Posted Feb 20, 2020 - 13:29 EST
This incident affected: Bot Engine.