We experienced approximately 30 minutes of bot downtime earlier today, between 4:30 and 5:00 pm ET.
We quickly traced the problem to two causes: a) an undersized cache cluster that ran out of memory, and b) the lack of an early detection alarm for this specific scenario.
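As an illustration of the kind of early detection alarm mentioned above, here is a minimal sketch of a threshold check on cache memory usage. It is not the monitoring actually in place: the `MemoryAlarm` class, the 75% and 90% thresholds, and the byte counts are all hypothetical, and a real setup would read these numbers from whatever metrics the cache cluster exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryAlarm:
    """Hypothetical threshold alarm for cache memory usage."""

    warn_ratio: float = 0.75      # assumed warning threshold (75% of capacity)
    critical_ratio: float = 0.90  # assumed critical threshold (90% of capacity)

    def evaluate(self, used_bytes: int, max_bytes: int) -> str:
        """Return "ok", "warning", or "critical" for the given usage."""
        if max_bytes <= 0:
            raise ValueError("max_bytes must be positive")
        ratio = used_bytes / max_bytes
        if ratio >= self.critical_ratio:
            return "critical"
        if ratio >= self.warn_ratio:
            return "warning"
        return "ok"


if __name__ == "__main__":
    alarm = MemoryAlarm()
    # Example readings in bytes (made-up values for illustration).
    for used in (2_000_000_000, 3_200_000_000, 3_900_000_000):
        print(used, alarm.evaluate(used, max_bytes=4_000_000_000))
```

The point of the warning level is to leave enough headroom to add capacity before the cluster actually runs out of memory.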
How did we fix it?
To resolve the issue, we doubled the capacity of the cache cluster. The change took about 15 minutes to take effect, after which we brought the bots back online.
What was affected?
What will we change moving forward?
Please feel free to email me directly at firstname.lastname@example.org if you have any specific questions.