On the afternoon of January 21st (UTC), we experienced a severe degradation of our services. Services were impacted from approximately 1:30 PM to 3:30 PM (UTC) and again from 5:00 PM to 6:40 PM (UTC). Although the two issues were different, they were triggered by the same root cause.
During the affected periods, executions were delayed and some triggers were not processed. Of the triggers that were not processed, the vast majority were properly persisted to disk and have been replayed (or are able to be replayed). A small fraction of triggers during this period could not be persisted to disk and cannot be replayed.
The embedded API was intermittently unavailable, and some parts of the Tray application (in particular, the builder) were at times slow or unresponsive. Additionally, we temporarily suspended the lazy upgrade of embedded solution instances.
The builder and all APIs are currently online and available, and you can resume work. Lazy upgrades have also been re-enabled. All delayed executions have caught up, and no action is needed for the vast majority of triggers, which were successfully persisted. We will be replaying triggers automatically where possible, and reaching out to customers where we require further confirmation to replay. Triggers that were not processed returned a 500 error response.
If your trigger service has a retry mechanism in place (ideally with exponential backoff), no further action is needed. If your service received a 500 error response from the Tray.io API and could not gracefully handle it, or if you are unsure, please speak with your Tray representative to determine whether the trigger can be replayed.
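As an illustration of the kind of client-side retry we recommend, here is a minimal sketch in TypeScript, assuming a Node.js 18+ runtime with the global fetch API. The endpoint URL, payload shape, and retry parameters are hypothetical and should be tuned for your own service:

```typescript
// Hypothetical sketch: delivering a trigger with exponential backoff.
// TRIGGER_URL is a placeholder for your workflow's trigger endpoint.
const TRIGGER_URL = "https://trigger.example.com/workflow";

async function sendTriggerWithRetry(
  payload: unknown,
  maxAttempts = 5,
  baseDelayMs = 1_000,
): Promise<Response> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetch(TRIGGER_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    });

    // Success, or a 4xx error that retrying will not fix: return as-is.
    if (response.status < 500) return response;

    // Out of attempts: return the final failing response for the caller
    // to log or queue for manual replay.
    if (attempt === maxAttempts) return response;

    // Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 250ms
    // of noise so many clients do not retry in lockstep after an outage.
    const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 250;
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  throw new Error("unreachable");
}
```

Retrying only on 5xx responses avoids re-sending requests that failed for client-side reasons, and the jitter spreads retries out across clients.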
Work is currently being carried out to improve our application caching and increase the performance of our database clusters. As part of this project, new code was added to one of our services to utilize the new caching infrastructure, but it was initially disabled in production. A global configuration change inadvertently resulted in the caching code becoming active before the infrastructure was fully in place. Engineers were alerted to the problem immediately; however, because the cache was unavailable, one of our databases came under excessive load and started to experience timeouts. This in turn caused our API to fail intermittently.
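To make the failure mode concrete, here is a hedged sketch of the pattern in TypeScript. All names are invented for illustration; this is not our actual service code. The point is that a cache path shipped dark behind a configuration flag becomes live the moment the flag flips, and if the cache infrastructure does not yet exist, every request pays a timeout and then falls through to the database:

```typescript
// Invented names throughout: CacheClient, Database, and useNewCache are
// illustrative stand-ins, not real Tray.io internals.
interface CacheClient { get(key: string): Promise<string | null> }
interface Database    { findById(id: string): Promise<string | null> }
interface Config      { useNewCache: boolean }

async function loadRecord(
  id: string,
  config: Config,
  cache: CacheClient,
  db: Database,
): Promise<string | null> {
  if (config.useNewCache) {
    try {
      // If the flag is switched on before the cache cluster is in place,
      // every call here fails or times out, adding latency and load
      // instead of removing it.
      const hit = await cache.get(id);
      if (hit !== null) return hit;
    } catch {
      // Cache unavailable: fall back to the database. A safer rollout
      // would also trip a circuit breaker here, so repeated cache
      // timeouts cannot turn into a stampede against the database.
    }
  }
  return db.findById(id);
}
```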
Our engineers responded quickly, and although the root cause had not yet been identified, they took action to minimize the impact across the platform by throttling workflow executions and disabling the lazy upgrade of embedded solution instances. As part of our normal incident response process, we also rolled back our most recent deployment; this did not resolve the issue, because the offending code path had been introduced several versions earlier and only manifested due to the correlated configuration change.
Within an hour, the issue had been identified and a fix was prepared, reviewed, and successfully deployed. The execution throttling was then gradually relaxed, and our system began processing the backlog of delayed executions.
Approximately 90 minutes later, engineers were alerted to new issues with the same database. In the period following the initial incident, the database in question was under heavier-than-normal load due to the backlog of executions being processed; this was expected, and our monitoring indicated that all systems were operating within normal load ranges for this process.
What our metrics did not capture, however, was that the database was drawing on a reserve of burst capacity to handle the extra load. After approximately 90 minutes this reserve was exhausted and the database became unavailable, causing widespread problems with our API. Engineers again followed standard incident response procedures to minimize the impact of the database problems.
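As a concrete example of the kind of alarm that would have caught this drain early (shown here, purely for illustration, for AWS RDS on gp2 storage, where reserve burst capacity is exposed as the BurstBalance CloudWatch metric; the instance name, region, SNS topic, and thresholds below are all placeholders):

```typescript
import {
  CloudWatchClient,
  PutMetricAlarmCommand,
} from "@aws-sdk/client-cloudwatch";

// Alarm when an RDS instance's BurstBalance (percent of reserve burst
// capacity remaining) trends toward zero, rather than watching only CPU
// and latency. All identifiers and thresholds are placeholders.
async function createBurstBalanceAlarm(): Promise<void> {
  const cloudwatch = new CloudWatchClient({ region: "us-east-1" });
  await cloudwatch.send(
    new PutMetricAlarmCommand({
      AlarmName: "rds-burst-balance-low",
      Namespace: "AWS/RDS",
      MetricName: "BurstBalance",
      Dimensions: [{ Name: "DBInstanceIdentifier", Value: "example-db" }],
      Statistic: "Average",
      Period: 300,          // evaluate in 5-minute windows
      EvaluationPeriods: 3, // alarm after 15 minutes below threshold
      Threshold: 40,        // percent of burst capacity remaining
      ComparisonOperator: "LessThanThreshold",
      AlarmActions: ["arn:aws:sns:us-east-1:123456789012:oncall"],
    }),
  );
}

createBurstBalanceAlarm();
```

Alerting on remaining headroom, rather than on current load, gives responders time to act before the reserve runs out.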
To resolve the problem, engineers forced a failover of the database to an instance with increased capacity.
During both of these incidents, trigger ingestion should have operated normally. However, while the database that stores queued trigger requests was being scaled up to handle the unusually large queue, a small number of triggers were not processed or were not fully persisted to disk.
We've run a retrospective and taken a number of actions: