v2 apps/console unstable w/ intermittent outages
Incident Report for Meya CX Automation Platform
Postmortem

What happened?

At approximately 1:00pm ET on Sept. 8th, 2021, the database behind our shared long-term storage ledger (in short, the logs DB) was receiving more load than its capacity allowed, due to aggregate system load. We identified the bottleneck and initiated a DB scale-up operation, which required a short downtime starting at 1:24pm ET. The DB is required for bots to be online, so all bots went down until approximately 1:26pm ET.

How did we fix it?

  • Scaled up shared logs DB by doubling capacity

What was affected?

  • All v2 apps went offline for 2-4 minutes

  • Some apps took longer (up to 20 minutes) to stabilize

What will change moving forward?

  • We are investigating removing the dependency of live bots on the shared logs DB
  • We will add alerting for this specific load scenario on our logs DB
  • We will further increase our communication with larger customers to provide early insight into load planning
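
As an illustration of the first item above, one possible way to decouple live bots from the logs DB is to buffer ledger entries in memory and flush them in the background, so that a DB restart delays log writes instead of taking bots offline. The sketch below is hypothetical and is not Meya's implementation: `BufferedLedgerWriter`, `write_fn`, and `max_pending` are names invented for this example.

```python
from collections import deque


class BufferedLedgerWriter:
    """Hypothetical sketch: queue ledger entries instead of writing them inline."""

    def __init__(self, write_fn, max_pending=1000):
        # write_fn is an assumed callable that writes one entry to the logs DB
        # and raises an exception when the DB is unavailable.
        self._write_fn = write_fn
        self._pending = deque()
        self._max_pending = max_pending
        self.dropped = 0

    def log(self, entry):
        """Accept an entry without touching the DB; drop it if the buffer is full."""
        if len(self._pending) >= self._max_pending:
            self.dropped += 1
            return False
        self._pending.append(entry)
        return True

    def flush(self):
        """Write buffered entries in order; stop at the first failure and keep the rest."""
        written = 0
        while self._pending:
            try:
                self._write_fn(self._pending[0])
            except Exception:
                break  # DB still unavailable: retry on the next flush
            self._pending.popleft()
            written += 1
        return written

    def pending(self):
        """Number of entries waiting to be written."""
        return len(self._pending)
```

In this sketch the bot only ever calls `log()`, which never waits on the DB, while a separate worker calls `flush()` periodically. The trade-off is that entries can be lost if the buffer fills during a long outage, and the `dropped` counter could feed the kind of alerting described in the second item.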
Posted Sep 08, 2021 - 18:22 EDT

Resolved
All apps are back online and the platform is fully operational. Post-mortem to follow.
Posted Sep 08, 2021 - 14:09 EDT
Update
Only a small number of apps are experiencing downtime while they catch up to processing deep ledger queues.

You can verify your app's status by following these instructions: https://docs.meya.ai/docs/how-to-monitor-a-production-apps-uptime-using-the-status-integration
Posted Sep 08, 2021 - 13:48 EDT
Update
Some bots remain unstable. We are still working on the issue.
Posted Sep 08, 2021 - 13:32 EDT
Monitoring
The DB is back online and bots are coming back online. Continuing to monitor.
Posted Sep 08, 2021 - 13:26 EDT
Identified
We require an emergency scaling of one of our databases. We've initiated this process, but will incur some downtime as a restart is required.
Posted Sep 08, 2021 - 13:22 EDT
Investigating
We are currently investigating and will update as new information is available.
Posted Sep 08, 2021 - 13:20 EDT
This incident affected: Web IDE and Bot Engine.