Platform is currently seeing significant instability
Incident Report for Meya Bot Platform

We experienced approximately 30 minutes of bot downtime earlier today, between 4:30 and 5:00 pm ET.

What happened?

We quickly identified two root causes: a) an undersized cache cluster that ran out of memory, and b) the lack of an early detection alarm for this specific scenario.

How did we fix it?

To resolve the issue, we doubled the capacity of the cache cluster. The change took about 15 minutes to take effect, after which bots came back online.

What was affected?

  • core bot operations: flow starts and transitions
  • message delivery
  • component / CMS publishing
  • API & webhook
  • test chat

What will we change moving forward?

  1. Add early detection alarms on key health metrics for the caching sub-system.
  2. Proactively ensure that the caching sub-system is adequately provisioned for current, forecasted, and peak load.
  3. In the long run, work towards auto-scaling architectures for our cache and database infrastructure, so that all aspects of the platform scale automatically.
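
To make the first two items concrete, here is a minimal sketch of what a memory alarm and a provisioning check for a cache could look like. It is an illustration only, not our production monitoring: the names `CacheSample`, `memory_alarm`, and `has_headroom`, and the 75% / 90% / 70% thresholds, are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheSample:
    """One memory reading from a cache node (hypothetical structure)."""
    used_bytes: int  # memory currently in use by the cache
    max_bytes: int   # configured memory limit of the cache


def memory_alarm(
    sample: CacheSample,
    warn_at: float = 0.75,      # assumed warning threshold
    critical_at: float = 0.90,  # assumed critical threshold
) -> Optional[str]:
    """Return 'critical', 'warning', or None for a single memory sample."""
    if sample.max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    utilization = sample.used_bytes / sample.max_bytes
    if utilization >= critical_at:
        return "critical"
    if utilization >= warn_at:
        return "warning"
    return None


def has_headroom(
    peak_bytes: int,
    max_bytes: int,
    target_utilization: float = 0.70,  # assumed provisioning target
) -> bool:
    """True if forecasted peak usage stays within the target share of capacity."""
    return peak_bytes <= max_bytes * target_utilization
```

In this sketch, a sample at 80% of the memory limit yields a warning and one at 95% yields a critical alert, so operators would be notified before the cache runs out of memory. The headroom check would flag a cluster whose forecasted peak exceeds the provisioning target even while current usage looks healthy.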

Please feel free to email me directly if you have any specific questions.

Posted Feb 12, 2018 - 18:10 EST

All systems operational.
Posted Feb 12, 2018 - 17:54 EST
All systems are back to normal with the exception of Chatbase analytics, which was temporarily disabled. We are bringing it back online before calling the incident resolved.
Posted Feb 12, 2018 - 17:38 EST
Platform stability has improved. We are actively monitoring.
Posted Feb 12, 2018 - 17:07 EST
We are currently investigating this issue.
Posted Feb 12, 2018 - 16:49 EST