Degraded Performance Across The App
Incident Report for Tray.io
Postmortem

Summary

On the afternoon of January 21st (UTC), we experienced a severe degradation of our services. Services were impacted from approximately 1:30 PM - 3:30 PM (UTC) and again from 5:00 PM - 6:40 PM (UTC). Although the two issues were different, they were triggered by the same root cause.

What was affected?

During the affected periods, executions were delayed and some triggers were not processed. Of the triggers that were not processed, the vast majority were properly persisted to disk and have been replayed (or are able to be replayed). A small fraction of triggers during this period could not be persisted to disk and cannot be replayed.

The embedded API was intermittently unavailable, and some parts of the Tray application (in particular the builder) were intermittently slow or unresponsive. Additionally, we temporarily suspended the lazy upgrade of embedded solution instances.

As a customer, what should I do?

The builder and all APIs are currently online and available, and you can resume work. Lazy upgrades have also been re-enabled. All delayed executions have caught up, and no action is needed for the vast majority of triggers, which were persisted. We will be replaying triggers automatically where possible, and reaching out to customers where we require further confirmation to replay. Triggers that were not processed returned a 500 error response.

If your trigger service has a retry mechanism in place (ideally with exponential backoff), no further action is needed. If the service received a 500 error response from the Tray.io API and your client could not gracefully handle it, or if you are unsure, then you should speak with your Tray representative to identify whether the trigger can be replayed.
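As an illustration only, a client-side retry with exponential backoff might look like the following Python sketch. This is not Tray.io-provided code; the webhook URL, payload shape, and retry limits are placeholders to adapt to your own trigger service.

    import time

    import requests  # third-party HTTP client, assumed available

    # Hypothetical webhook URL for a Tray.io workflow trigger (replace with your own).
    TRIGGER_URL = "https://example.trayapp.io/my-workflow-trigger"

    def send_trigger(payload: dict, max_attempts: int = 5) -> bool:
        """Send a trigger request, retrying 5xx responses and network errors
        with exponential backoff. Returns True once the request is accepted."""
        for attempt in range(max_attempts):
            try:
                response = requests.post(TRIGGER_URL, json=payload, timeout=10)
            except requests.RequestException:
                response = None  # network-level failure: treat as retryable
            if response is not None:
                if response.status_code < 400:
                    return True  # accepted
                if response.status_code < 500:
                    # 4xx means a client-side problem that retrying will not fix.
                    raise RuntimeError(f"Trigger rejected: {response.status_code}")
            # 5xx or network error: back off exponentially (1s, 2s, 4s, ...).
            time.sleep(2 ** attempt)
        return False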

What happened and how was it fixed?

Work is currently being carried out to improve our application caching and increase the performance of our database clusters. As part of this project, new code was added to one of our services to utilize the new caching infrastructure, but this code path was initially disabled in production. A global configuration change inadvertently resulted in the caching code becoming active before the infrastructure was fully in place. Engineers were alerted to the problem immediately; however, because the cache was unavailable, one of our databases was put under excessive load and started to experience timeouts. This in turn caused our API to fail intermittently.
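One pattern that guards against this failure mode is to gate new code behind both a configuration flag and a runtime check on the new infrastructure, falling back to the existing path if the cache is unreachable. The sketch below is a generic illustration rather than Tray.io's actual service code; the flag name, cache client, and database client are hypothetical.

    import logging

    logger = logging.getLogger(__name__)

    def get_record(record_id: str, config: dict, cache, db):
        """Fetch a record, using the new cache only when the feature flag is on
        and the cache is reachable; otherwise fall back to the existing path.
        `config`, `cache`, and `db` are hypothetical stand-ins for a feature-flag
        store, a cache client, and a database client."""
        if config.get("use_new_cache", False):
            try:
                cached = cache.get(record_id)
                if cached is not None:
                    return cached
            except ConnectionError:
                # The flag was enabled before the cache infrastructure was ready:
                # log loudly so operators can turn the flag back off, and serve
                # the request from the existing path instead of failing it.
                logger.warning("Cache unavailable, falling back to database")
        # Existing code path: read directly from the database.
        return db.fetch(record_id)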

Our engineers responded quickly and, although the root cause had not yet been identified, took action to minimise the impact across the platform by throttling workflow executions and disabling the lazy upgrade of embedded solution instances. As part of our normal incident response process, we also rolled back our most recent deployment; this did not resolve the issue because the code path had been introduced several versions earlier and only manifested itself because of the correlated configuration change.
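Execution throttling of this kind is commonly implemented with a token bucket: each execution must acquire a token before starting, and the refill rate can be lowered during an incident to relieve pressure on downstream databases. The following sketch is a generic illustration, not Tray.io's actual execution scheduler.

    import threading
    import time

    class TokenBucket:
        """A minimal token-bucket throttle for workflow executions."""

        def __init__(self, rate_per_sec: float, capacity: int):
            self.rate = rate_per_sec        # lower this value to throttle harder
            self.capacity = capacity
            self.tokens = float(capacity)
            self.updated = time.monotonic()
            self.lock = threading.Lock()

        def try_acquire(self) -> bool:
            """Return True if an execution may start now, False to delay it."""
            with self.lock:
                now = time.monotonic()
                # Refill tokens based on elapsed time, capped at the bucket capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
                return False

    # During an incident, operators would construct this with a reduced rate.
    throttle = TokenBucket(rate_per_sec=50, capacity=100)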

Within an hour the issue had been identified and a fix was prepared, reviewed, and successfully deployed. The execution throttling was slowly raised and our system began processing the backlog of delayed executions.

Approximately 90 minutes later, engineers were alerted to new issues with the same database. During the period following the initial incident, the database in question was under heavier-than-normal load due to the backlog of executions being processed; this was expected, and our monitoring indicated that all systems were operating within normal load ranges for this process.

What our metrics did not capture, however, was that the database was drawing on a reserve burst capacity to handle the extra load. After approximately 90 minutes this reserve capacity was exhausted and the database became unavailable, causing widespread problems with our API. Engineers again followed standard incident response procedures to minimise the impact of the database problems.
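Burst capacity of this kind is usually exposed as a provider metric; for example, Amazon RDS publishes a BurstBalance percentage for gp2 storage. The postmortem does not name the hosting provider, so the alarm below is purely illustrative, assuming an AWS RDS instance monitored through CloudWatch via boto3; the instance identifier, threshold, and notification topic are placeholders.

    import boto3  # AWS SDK for Python, used here only for illustration

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Alarm when the database's remaining burst balance drops below 20%,
    # giving operators time to scale up before the reserve is exhausted.
    cloudwatch.put_metric_alarm(
        AlarmName="rds-burst-balance-low",
        Namespace="AWS/RDS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "example-db-instance"}],
        Statistic="Average",
        Period=300,              # evaluate in 5-minute windows...
        EvaluationPeriods=3,     # ...for three consecutive windows
        Threshold=20.0,          # percent of burst credits remaining
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
    )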

To resolve the problem, engineers forced a failover of the database to a replica with increased capacity.

During both of these incidents trigger ingestion should have operated normally; however, while the database that stores queued trigger requests was being scaled up to handle the unusually large size of the queue, a small number of triggers were not processed or were not fully persisted to disk.
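A common way to avoid losing trigger data while the backing database is unavailable or being scaled is to persist the raw request to a durable buffer before acknowledging it, then replay from that buffer once the database recovers. The sketch below illustrates the idea with a local spool directory; Tray.io has not described their actual ingestion pipeline, so the function names and storage choice are assumptions (a production system would use a replicated queue or log rather than local files).

    import json
    import uuid
    from pathlib import Path

    # Hypothetical durable buffer; a stand-in for a replicated queue or log.
    BUFFER_DIR = Path("trigger-buffer")
    BUFFER_DIR.mkdir(parents=True, exist_ok=True)

    def ingest_trigger(payload: dict) -> str:
        """Persist the trigger before acknowledging it, so a database outage
        during processing never loses the original request."""
        trigger_id = str(uuid.uuid4())
        (BUFFER_DIR / f"{trigger_id}.json").write_text(json.dumps(payload))
        return trigger_id  # only now acknowledge the sender with a 2xx response

    def replay_pending(process) -> None:
        """Re-process buffered triggers (e.g. after the database recovers),
        deleting each record only once `process` succeeds."""
        for path in sorted(BUFFER_DIR.glob("*.json")):
            payload = json.loads(path.read_text())
            process(payload)   # may raise; the file stays on disk if it does
            path.unlink()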

How are we preventing possible similar incidents in the future?

We've run a retrospective and taken a number of actions:

  • We have created new metrics and alarms to track the reserve database capacity
  • We have updated our code review guidelines to make sure all code is properly guarded against configuration changes
  • We have an ongoing project to further decouple several services that rely on this database
  • We have permanently increased the database capacity and worked with our hosting provider to continue its optimisation with more long-term improvements
  • We are looking to add a proxy service in front of the database in question, to better handle spikes in load
  • We are working on a more robust system for handling trigger ingestion retry logic, so we do not lose any trigger data on these occasions
  • We are performing a system-wide audit to identify other occurrences of coupled services/databases that could cause platform issues

What are the risks of another outage?

  • We believe our database is currently capable of handling the increased load for extended periods of time.
  • We also believe we now have the correct metrics and alerting in place that will notify us of any database issue before it causes a production incident.

What did we do right?

  • Our on-call engineers were notified of the problem from our monitoring system and we were investigating the issue before customers raised an incident.
  • We had multiple eyes on the issue and assistance from all parts of the business.
  • Our system is designed so as not to lose any workflow execution once it enters the system; while workflows may have been delayed, they all eventually executed and caught up. Similarly, any trigger that was received was eventually processed even if it could not be processed right away.
  • We had redundancy and did not rely on a single database instance so we were able to failover to a live replica with no loss of data.
  • We took action to minimise the impact by disabling non-critical parts of the system.

What could we have done better?

  • The code path that caused the initial incident should have been caught in code review
  • We should have had better monitoring that notified us our reserve capacity was being used up
  • We should have had better retry logic so that no triggers were lost during the autoscaling of our DB
Posted Jan 22, 2021 - 21:33 GMT

Resolved
This incident has been resolved. The app is operating normally again.
Posted Jan 22, 2021 - 16:20 GMT
Update
The functionality of the platform is back to normal; we are still continuously monitoring performance.
Embedded update: we have re-enabled lazy updates. Lazy updates will now propagate to existing solution instances.
Posted Jan 22, 2021 - 11:41 GMT
Update
We are continuing to monitor results to ensure no further performance issues are encountered; users might experience delays in new triggers. In addition, executions containing CSV connector V3.5 steps might take longer than expected.
Embedded update: we have disabled lazy updates to ensure platform stability. Any live solutions will continue to work for end users; however, lazy updates will not propagate to existing solution instances until we re-enable them later today.
Posted Jan 22, 2021 - 09:31 GMT
Update
The platform is in a stable condition; we are monitoring continuously.
Embedded update: due to today's degraded platform performance, we have disabled lazy updates to ensure platform stability. Any live solutions will continue to work for end users; however, lazy updates will not propagate to existing solution instances until we re-enable them tomorrow.
Posted Jan 21, 2021 - 21:53 GMT
Monitoring
A fix has been implemented and we are monitoring the results. Customers might experience delays in triggers and executions.
Posted Jan 21, 2021 - 19:12 GMT
Investigating
Users might experience unexpected errors and degraded performance across the application. We are actively investigating this issue.
Posted Jan 21, 2021 - 17:11 GMT
Monitoring
A fix has been implemented and we are monitoring the results. Customers might still experience delays in workflow executions, but no data will be lost.
Posted Jan 21, 2021 - 15:14 GMT
Investigating
Customers will experience unexpected behavior while using the application. Our engineers are investigating the issue.
Posted Jan 21, 2021 - 14:23 GMT
This incident affected: Workflow Execution, Workflow Builder (Workflow Builder API, Workflow Builder App, Workflow Builder Logs), and Embedded (Embedded API, Embedded App).