Incident Details
From the evening of the 8th through the morning of the 9th of October (UTC), the platform saw a high number of errors related to OAuth authentication refreshes for a specific third-party connector.
On the morning of the 9th of October, we started to receive reports from users of intermittent login problems, and after investigation, engineers identified a caching issue within our API service.
At 15:15 UTC on the 9th, a deployment was started with caching changes intended to address the sporadic login issues that we were encountering.
At 15:25 UTC load balancers switched over to the newly created API instances as part of a blue/green deployment process. However, because of the existing high load and error rate, combined with the new cache changes, the new instances immediately increased the load on the API service database, resulting in severely degraded availability of the Tray API.
At 15:35 UTC engineers started the rollback process.
At 15:42 UTC API service availability recovered; however, we were still seeing increased refresh error load.
During this 17-minute period (15:25 to 15:42 UTC), all client apps (dashboard, builder, embedded configuration wizard) were unavailable, and all workflow triggers were responding with 500 error codes.
At 17:10 UTC a new deployment was started with changes to solve a number of the issues that we had been seeing within the API service. A blue/green deployment runs the old and new sets of instances side by side, and because the API service was already under high load, the total number of instances and database connections exceeded the capacity of the database. We started to see intermittent problems across the Tray API again.
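As a rough illustration of why a blue/green switchover can exhaust database connections, the sketch below estimates the worst-case connection count while both instance sets are running. The function names, pool sizes, and limits are hypothetical and chosen only for illustration; they are not figures or code from this incident.

```python
def peak_connections(instances_per_fleet: int, pool_size: int, fleets: int = 2) -> int:
    """Worst-case open connections while `fleets` instance sets run side by side."""
    return instances_per_fleet * pool_size * fleets


def fits_capacity(instances_per_fleet: int, pool_size: int,
                  max_connections: int, reserved: int = 0) -> bool:
    """True if a blue/green switchover stays within the database connection limit."""
    usable = max_connections - reserved  # leave headroom for admin sessions
    return peak_connections(instances_per_fleet, pool_size) <= usable


# Hypothetical numbers for illustration only, not figures from this incident.
print(peak_connections(instances_per_fleet=10, pool_size=20))    # 400
print(fits_capacity(10, 20, max_connections=500, reserved=10))   # True
print(fits_capacity(15, 20, max_connections=500, reserved=10))   # False
```

The third call shows the failure mode: a fleet that had scaled up under load fits within the limit on its own (15 x 20 = 300), but doubling it for the deployment does not.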
Engineers started queueing incoming triggers to reduce some of the load on the API service while they investigated further.
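The general idea of queueing triggers during an incident can be sketched as a buffer that holds incoming work while a downstream service is under pressure and replays it in order once ingestion resumes. This is a minimal in-memory sketch under assumed names (`TriggerBuffer`, `deliver`), not a description of the actual queueing mechanism used here, which would normally be a durable queue.

```python
from collections import deque
from typing import Any, Callable, Deque


class TriggerBuffer:
    """Holds incoming triggers while paused and replays them in order on resume."""

    def __init__(self, deliver: Callable[[Any], None]) -> None:
        self._deliver = deliver              # hypothetical downstream handler
        self._pending: Deque[Any] = deque()  # FIFO so triggers keep arrival order
        self.paused = False

    def submit(self, trigger: Any) -> str:
        """Deliver immediately, or queue the trigger if ingestion is paused."""
        if self.paused:
            self._pending.append(trigger)
            return "queued"
        self._deliver(trigger)
        return "delivered"

    def resume(self) -> int:
        """Resume ingestion and drain the backlog; returns the number replayed."""
        self.paused = False
        replayed = 0
        while self._pending:
            self._deliver(self._pending.popleft())
            replayed += 1
        return replayed


# Usage: pause during the incident, then replay once the service has recovered.
delivered = []
buffer = TriggerBuffer(deliver=delivered.append)
buffer.paused = True
buffer.submit({"id": 1})
buffer.submit({"id": 2})
buffer.resume()
```

Accepting and deferring triggers, rather than rejecting them, means callers do not see errors and no events are lost, at the cost of delayed processing.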
At 18:01 UTC a deployment was made to our authentication service, which greatly reduced the load caused earlier by the refresh errors coming from the third-party service.
At 18:05 UTC engineers manually added some indexes to the API service database to help ease problems seen with the unusually high error rates.
At 18:15 UTC engineers noticed a high level of swap usage on the API service database and upgraded the database cluster.
By 18:25 UTC trigger ingestion had resumed and all services were operating normally.
Remediation