A combination of cumulative bot volume and high throughput pushed a critical bot engine query past a read throughput threshold on our core DB at 12:15 pm ET. As a result, the core DB was I/O throttled.
How did we fix it?
We placed a temporary quarantine throttle on the affected bots, which returned latency to normal by 2:05 pm ET while we implemented a long-term fix.
We optimized the necessary bot engine queries, increased the provisioned throughput on the appropriate database, and released the code fix at 7:30 pm ET.
We’re continuing to monitor the DB metrics, but it appears the fix has had the desired effect.
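For readers curious what a quarantine throttle of this kind can look like, here is a minimal sketch. It is not our production code. It assumes a per-bot token bucket, and the class name `QuarantineThrottle`, its methods, and the rate and burst parameters are all hypothetical names chosen for illustration. Bots that are not quarantined pass through untouched, while quarantined bots are limited to a fixed request rate.

```python
import time
from typing import Callable, Dict, Set


class QuarantineThrottle:
    """Hypothetical sketch: rate-limits only quarantined bots."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec      # tokens refilled per second
        self.burst = burst            # maximum tokens a bot can hold
        self.clock = clock            # injectable so the logic can be tested
        self.quarantined: Set[str] = set()
        self._tokens: Dict[str, float] = {}
        self._last: Dict[str, float] = {}

    def quarantine(self, bot_id: str) -> None:
        """Start throttling a bot, beginning with a full bucket."""
        self.quarantined.add(bot_id)
        self._tokens[bot_id] = float(self.burst)
        self._last[bot_id] = self.clock()

    def release(self, bot_id: str) -> None:
        """Stop throttling a bot and forget its bucket state."""
        self.quarantined.discard(bot_id)
        self._tokens.pop(bot_id, None)
        self._last.pop(bot_id, None)

    def allow(self, bot_id: str) -> bool:
        """Return True if this bot may issue a query right now."""
        if bot_id not in self.quarantined:
            return True
        now = self.clock()
        elapsed = now - self._last[bot_id]
        self._last[bot_id] = now
        # Refill based on elapsed time, capped at the burst size.
        tokens = min(self.burst, self._tokens[bot_id] + elapsed * self.rate)
        if tokens >= 1.0:
            self._tokens[bot_id] = tokens - 1.0
            return True
        self._tokens[bot_id] = tokens
        return False
```

In a sketch like this, the engine would call `allow(bot_id)` before running the expensive query and defer or skip the work when it returns `False`. Scoping the limit to quarantined bots keeps the impact on unaffected bots at zero, which is the main reason to prefer it over a global limit.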
What was affected?
What will change moving forward?