We experienced approximately 15 minutes of bot downtime yesterday, March 26, between 9:30 and 9:45 am ET.
We found that our event bus brokering system suddenly ran out of memory. The incident resulted from several factors working together:
- rapidly increasing server load on Monday morning
- slow scale-up of the logging DB in response to this load, which caused log writes to queue up in the event broker. (Call the # of queued writes Q.)
- excessively large log rows originating from integrations with large webhook payloads of greater than 3MB. (Call the avg. log row size E.)
- our event broker is a shared system, therefore its effects were widespread
memory = Q x E. Evidently, when E is large, it doesn't take a large value of Q to consume significant memory.
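To make the Q x E relationship concrete, here is a small worked calculation. The queue depth of 1,000 writes and the 2 KB "typical" row size are hypothetical figures chosen for illustration; only the 3MB payload size comes from this incident.

```python
MB = 1024 * 1024

def broker_memory_bytes(queued_writes: int, avg_row_bytes: int) -> int:
    """Estimate memory held by queued log writes: Q x E."""
    return queued_writes * avg_row_bytes

# Hypothetical queue depth (Q) of 1,000 writes in both cases.
small_rows = broker_memory_bytes(queued_writes=1000, avg_row_bytes=2 * 1024)  # assumed 2 KB rows
large_rows = broker_memory_bytes(queued_writes=1000, avg_row_bytes=3 * MB)    # 3 MB rows

print(f"{small_rows / MB:.1f} MB")  # → 2.0 MB
print(f"{large_rows / MB:.1f} MB")  # → 3000.0 MB
```

With the same Q, moving from small rows to 3MB rows takes the estimate from about 2 MB to about 3 GB.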
How did we fix it?
To resolve the issue, we quadrupled the memory capacity of the event broker cluster. The change took about 10 minutes to take effect, after which bots came back online.
What was affected?
- core bot operations: flow starts and transitions
- message delivery
- component / CMS publishing
- API and webhooks
What will we change moving forward?
- we've increased the baseline memory of the event queuing system. This increases the memory available to absorb queued writes.
- we've posted a low-level patch that limits the size of the event metadata to mitigate sudden memory spikes. This lowers E, the average log row size.
- we've increased the baseline logging database throughput capacity to further mitigate sudden spikes. This lowers Q, the number of queued writes.
- we're investigating a solution to auto-scale the event brokering system when available memory runs low
- we're evaluating further isolation of our logging sub-system
- As part of our bot runtime improvements, we are already in the process of engineering a new platform architecture that is not susceptible to this sort of event bus memory issue.
- This incident will be used as part of the tech spec to ensure that this does not happen again.
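As an illustration of the metadata size limit mentioned above, here is a minimal sketch of the general idea. It is not the actual patch: the function name, the `metadata` field, the 64 KB limit, and the choice to replace oversized metadata with a stub are all assumptions made for this example.

```python
import json

# Hypothetical limit; the real threshold is not stated in this report.
MAX_METADATA_BYTES = 64 * 1024

def cap_event_metadata(event: dict, max_bytes: int = MAX_METADATA_BYTES) -> dict:
    """Return the event unchanged if its metadata fits within max_bytes,
    otherwise a shallow copy whose metadata is replaced by a small stub."""
    metadata = event.get("metadata", {})
    size = len(json.dumps(metadata).encode("utf-8"))
    if size <= max_bytes:
        return event
    capped = dict(event)
    # Keep a record of the original size so the truncation is visible in logs.
    capped["metadata"] = {"truncated": True, "original_size_bytes": size}
    return capped

# Usage: a 3 MB webhook payload is reduced to a stub before it is queued.
big_event = {"id": 1, "metadata": {"payload": "x" * (3 * 1024 * 1024)}}
print(cap_event_metadata(big_event)["metadata"]["truncated"])  # → True
```

Capping the size before an event is queued bounds E regardless of what an integration sends, so a given queue depth Q can no longer translate into an unbounded amount of memory.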
Please feel free to email me directly at email@example.com if you have any specific questions.