Intermittent dropped requests/responses
Incident Report for Meya CX Automation Platform
Postmortem

What happened?

On July 2, 2022 at 8:40pm ET, apps started sporadically dropping requests/responses for about 30 minutes.

  • Meya’s nodes running on Google Cloud Platform all moved to a new underlying infrastructure for maintenance initiated by Google
  • This resulted in new nodes coming online, with one node in particular coming online with slow networking and connectivity issues
  • This prevented our compute infrastructure from connecting to a Redis database cluster which is critical for operation of all apps
  • Some of our app infrastructure was running on this slow node causing an undetected “brown out” of certain requests and responses
  • The result was that bots were behaving erratically and unpredictably

How did we fix it?

The problem resolved itself in less than 30 minutes, once Google fixed the underlying networking issue.

What was affected?

  • All production apps would randomly drop certain requests. The exact percentage is not known, but it commonly manifested as erratic behaviour in the end user experience
  • Dev apps were not affected
  • Console and CLI usage was not affected

What will change moving forward?

  • We are adjusting our Redis connectivity timeouts and retries to be less aggressive so that they better tolerate similar slow networking scenarios
  • We are adding platform-wide monitoring to ensure we are notified of a similar incident
  • We will add individual monitoring to critical production apps to alert our site reliability team of similar incidents
  • We are optimizing when and how our platform initiates Redis connectivity based on the learnings of this incident to mitigate future occurrences
  • We have adjusted our compute health checks to ensure “unhealthy” compute infrastructure doesn’t accept work if it’s struggling to achieve steady Redis connectivity
  • We are adjusting our autoscaling systems to ramp more slowly to avoid exacerbating latency while connections are being established
  • These changes will be implemented over the coming days, culminating in the final changes being pushed to production in 2.7.11, due for release July 12/13
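
As a rough illustration of two of the changes listed above (less aggressive Redis retries, and health checks that stop unhealthy compute from accepting work), here is a minimal Python sketch. It is not Meya's actual implementation: the names call_with_backoff and RedisHealthCheck, the default delays, and the failure threshold are all hypothetical, and the Redis call is represented by a plain callable so the example stays self-contained.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` with capped, jittered exponential backoff.

    Only built-in connection and timeout errors are retried here; a real
    Redis client raises its own exception types, which would need to be
    listed instead (assumption for illustration).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the delay ceiling on each attempt, up to max_delay,
            # and pick a random wait below it so retries do not synchronize.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")


class RedisHealthCheck:
    """Reports unhealthy after `threshold` consecutive failed pings."""

    def __init__(self, ping: Callable[[], object], threshold: int = 3) -> None:
        self._ping = ping  # hypothetical callable that pings Redis
        self._threshold = threshold
        self._failures = 0

    @property
    def healthy(self) -> bool:
        return self._failures < self._threshold

    def probe(self) -> bool:
        """Ping once, update the failure streak, and return current health."""
        try:
            self._ping()
            self._failures = 0  # any success resets the streak
        except (ConnectionError, TimeoutError):
            self._failures += 1
        return self.healthy
```

In a setup like this, a worker would call probe() on a schedule and stop accepting new work whenever healthy is False, so a node with degraded connectivity to Redis fails visibly rather than silently dropping a fraction of requests.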
Posted Jul 05, 2022 - 15:33 EDT

Resolved
Apps are experiencing strange behavior due to dropped requests
Posted Jul 02, 2022 - 20:30 EDT