Meya v1 platform is down
Incident Report for Meya CX Automation Platform
Postmortem

What happened?

On May 3, 2021, at around 8am, the Python Packaging Authority (PyPA) decided to disable access to pypi.org and files.pythonhosted.org from non-SNI clients, including Python 2.7.6, which Meya v1 uses as its core:

  1. Meya v1 platform runs on Python 2.7.6. Python 2 is EOL as of 2020, and the recommendation is to migrate to Meya v2 running on Python 3
  2. Meya v1 AWS autoscaler relies on pip to spin up new nodes daily with increasing demand.
  3. PyPA disabled access to PyPI from Python 2.7.6 clients on May 3, so pip could no longer install packages and the platform could not scale up
  4. No new nodes could come online as demand increased, and existing nodes eventually went offline due to overload
  5. This resulted in a complete outage of Meya v1 (not v2)
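
The underlying limitation is that the standard `ssl` module only gained SNI support in Python 2.7.9, so a 2.7.6 runtime cannot send the server name during the TLS handshake. As a rough illustration only (this is not Meya's tooling, and the function name `runtime_supports_sni` is hypothetical), a pre-flight check along these lines reports whether a given interpreter can talk to an SNI-only host:

```python
import ssl
import sys


def runtime_supports_sni():
    """Return True if this interpreter's ssl module can send SNI.

    ssl.HAS_SNI does not exist on interpreters older than 2.7.9, so a
    missing attribute is treated as "no SNI support".
    """
    return bool(getattr(ssl, "HAS_SNI", False))


if __name__ == "__main__":
    version = ".".join(str(part) for part in sys.version_info[:3])
    status = "supports" if runtime_supports_sni() else "does NOT support"
    print("Python %s %s SNI" % (version, status))
```

Running a check like this on a node image before it joins the autoscaling group would surface the problem at build time rather than during scale-up.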

How did we fix it?

We immediately became aware of the outage and created a Slack and Zoom incident response room to get Meya v1 back online as fast as possible, but the resulting solution was highly complex due to a series of cascading requirements:

  1. We upgraded Python to 2.7.12, which created a series of cascading changes
  2. We were required to upgrade the operating system to Ubuntu 16.04 (Xenial) from Ubuntu 14.04
  3. This upgrade resulted in MySQL connectivity issues related to SSL certificate verification, which we resolved using AWS certificate management features
  4. Bots came back online at this point, but they were too slow to meet our QoS threshold, so we optimized our database connection management

The result of these changes:

  1. 4:30pm ET: bots came back online, but slow
  2. 6:30pm ET: bots were now fast, but not yet fully stable
  3. 8:00pm ET: bots were now stable
  4. Ongoing: bots are even faster than before, thanks to the combination of optimizations undertaken in the process

What was affected?

  • All of v1 platform was affected
  • v2 was not affected whatsoever

What will change moving forward?

  • We’ve made the call to freeze our v1 codebase into an image, which removes the pip dependency from scale-up. This protects v1 from future deprecations by PyPA/pip/Python, however unlikely
  • We’ve adjusted our monitoring to account for similar situations should they come up again
  • We recommend customers migrate to Meya v2 within the next 6-12 months due to the EOL on Python 2 itself
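
One way to confirm that a frozen image really needs nothing from the package index at scale-up is to verify, at image build time, that every required package is already installed. The sketch below is an assumption about how such a check could look, not a description of Meya's build process: it is written for Python 3.8+ build tooling, and the function name `missing_packages` and the package names passed to it are hypothetical.

```python
from importlib import metadata


def missing_packages(required):
    """Return the names in `required` that are not installed in this environment."""
    missing = []
    for name in required:
        try:
            # Raises PackageNotFoundError when no installed distribution matches.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical requirement list; a real build would read its own pinned list.
    required = ["example-dependency-a", "example-dependency-b"]
    absent = missing_packages(required)
    if absent:
        raise SystemExit("Image is missing packages: " + ", ".join(absent))
    print("All required packages are baked into the image")
```

Failing the build when the list is non-empty means a node started from the image never has to reach the package index.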
Posted May 04, 2021 - 11:29 EDT

Resolved
This incident has been resolved. As of 4pm ET bots resumed functioning, but slowly. As of 6:30pm ET bots were fast but not yet stable, and we continued to monitor. We're now confident that the system is stable.

We will follow up with a post-mortem and a minor bug fix release affecting some parts of the web console: logs, CMS, analytics, code snippets.
Posted May 03, 2021 - 20:07 EDT
Monitoring
Service has partially returned. However, responses are very slow as systems come back online
Posted May 03, 2021 - 16:28 EDT
Update
Quick update. We've made quite a bit of progress and could be minutes away from a fix if all goes well. However, not certain at this point
Posted May 03, 2021 - 15:48 EDT
Identified
We are actively investigating and working on a fix
Posted May 03, 2021 - 11:19 EDT
This incident affected: Web IDE, Custom Components, Bot Engine, Datastore, NLP Services, Analytics, Messaging Integrations, and Bot code repo.