At approximately 1:00 PM ET on September 8th, 2021, our shared long-term storage ledger (tl;dr: logs) entry database began experiencing load beyond its capacity due to aggregate system load. We identified the bottleneck and initiated a DB scale-up operation, which required a short downtime beginning at 1:24 PM ET. The DB is required for bots to be online, so all bots went down until approximately 1:26 PM ET.
How did we fix it?
What was affected?
All v2 apps went offline for 2 to 4 minutes
Some apps took longer (up to 20 minutes) to stabilize
What will change moving forward?