v2 platform down
Incident Report for Meya CX Automation Platform
Postmortem

What happened?

At approximately 3:51pm ET on July 7th, 2021 all bots went down temporarily. This was a result of a failure on Kubernetes control plane failing for a brief period.

How did we fix it?

All systems self-healed within 3-5 minutes once the control plane was back up. There was a bit of initial latency while bots came back online.

What was affected?

  • All v2 bots went offline briefly

What will change moving forward?

  • We are investigating removing the dependency of live bots on the Kubernetes control plane
  • We are investigating as to why the control plane went down in the first place and to determine the probability of this happening again. The current thinking is that this was a very rare event.
Posted Jul 07, 2021 - 16:30 EDT

Resolved
All bots are back online. The root cause was that a Google Cloud k8s control plane went down, and we have a service that depends on the control plane being up.
Posted Jul 07, 2021 - 16:19 EDT
Monitoring
All bots are back online. We had about 3-5 minutes of downtime.
Posted Jul 07, 2021 - 16:12 EDT
Investigating
Ongoing outage related to Google Cloud Platform
Posted Jul 07, 2021 - 15:57 EDT