Meya test chat and other resources not responding
Incident Report for Meya Bot Platform
Postmortem

Hi

I'd like to share Meya’s perspective on yesterday's AWS outage. Hopefully this gives some insight into Meya’s direction.

In response to a community question:

"Obviously large parts of AWS-US East went in to meltdown yesterday which affected a lot of our SaaS integrations including Meya. Is there anyway you can mitigate the risk of BOT's being affected in future?"

preface: as a platform, it’s Meya’s responsibility to offer reliable service

short answer: redundancy

longer answer: strategically invest in redundancy over time, in balance with product development


Details

  • yesterday’s event was a big deal and affected a huge number of big Internet players (GitHub, Docker, GitLab, Twilio, Zendesk, Quora, Slack, Giphy, Trello, Heroku, Apple iTunes and, ironically, StatusPage, http://www.isitdownrightnow.com/ and the AWS status page itself)
  • we study events such as this in detail and take steps to guard against this and similar events
  • medium term: we will take steps to run sub-systems in multiple AWS availability zones
  • medium term: we will spread static content across multiple CDNs
  • longer term: redundancy on bot hosting contexts (AWS, GCE, self-hosted)
  • yesterday’s event demonstrated the vulnerability of the Internet as a whole when relying on a single point of failure (AWS)
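
As a rough illustration of the multi-CDN item above, here is a minimal Python sketch of picking the first healthy CDN for a static asset. It is not Meya's implementation: the function names (`http_probe`, `first_available`) and the example.com hostnames are hypothetical, and a real setup would more likely handle failover at the DNS or load-balancer level.

```python
from typing import Callable, Iterable, Optional
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a HEAD request to url succeeds with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Network errors, timeouts, HTTP error statuses and bad URLs
        # all count as "this CDN is not usable right now".
        return False


def first_available(
    asset_path: str,
    cdn_bases: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first CDN URL for asset_path that the probe reports healthy.

    CDNs are tried in the order given; None means every CDN failed.
    """
    for base in cdn_bases:
        url = base.rstrip("/") + "/" + asset_path.lstrip("/")
        if probe(url):
            return url
    return None


# Hypothetical CDN base URLs, used only for illustration.
CDN_BASES = [
    "https://cdn-a.example.com/static",
    "https://cdn-b.example.com/static",
]
```

The probe is passed in as a parameter so the selection logic can be exercised without network access, for example `first_available("chat.css", CDN_BASES, probe=lambda u: "cdn-a" not in u)` simulates the first CDN being down.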

Erik

Posted over 1 year ago. Mar 01, 2017 - 10:48 EST

Resolved
All systems are operational now that the AWS outage has been completely resolved.
Posted over 1 year ago. Feb 28, 2017 - 22:15 EST
Update
- Custom components have tentatively returned to operational status. Response times may still vary, and component deployment is slower than usual.
- Segment Analytics has returned to operational status
Posted over 1 year ago. Feb 28, 2017 - 20:20 EST
Update
Amazon S3 is returning to operational status. Meya Web and Test chat are operational. Web IDE is operational. Custom components are still offline.
Posted over 1 year ago. Feb 28, 2017 - 16:12 EST
Update
AWS status page is now up, and the extent of the outage can be seen: https://status.aws.amazon.com/
Posted over 1 year ago. Feb 28, 2017 - 14:50 EST
Update
- Segment analytics tracking is down: https://status.segment.com/
Posted over 1 year ago. Feb 28, 2017 - 14:14 EST
Update
Other services that Meya depends on are also experiencing issues:
- Intercom live chat will not load, and message delivery is paused: https://status.intercom.com/
- Slack file uploads: https://status.slack.com/
Posted over 1 year ago. Feb 28, 2017 - 14:11 EST
Update
Meya custom components are also affected.
Posted over 1 year ago. Feb 28, 2017 - 13:53 EST
Monitoring
Due to an Amazon Web Services S3 outage, some Meya resources are being affected:
- Meya Web chat
- CSS
- images/icons
Posted over 1 year ago. Feb 28, 2017 - 13:46 EST