Degraded performance of the app

Incident Report for Tray.io

Postmortem

Incident Details

Throughout the evening (UTC) of the 8th and the morning of the 9th of October, the platform saw a high number of errors related to OAuth authentication refreshes for a specific 3rd party connector.

On the morning of the 9th of October, we started to receive reports from users of intermittent login problems, and after investigation, engineers identified a caching issue within our API service.

At 15:15 UTC on the 9th, a deployment was started with caching improvements intended to address the sporadic login issues that we were encountering.

At 15:25 UTC load balancers switched over to the newly created API instances as part of a blue/green deployment process. However, because of the existing high load and error rate, combined with the new cache changes, the new instances immediately increased the load on the API service database, resulting in severely degraded availability of the Tray API.

At 15:35 UTC engineers started the rollback process.

At 15:42 UTC API service availability recovered; however, we were still seeing increased load from refresh errors.

During this 17-minute period (15:25 to 15:42 UTC), all client apps (dashboard, builder, embedded configuration wizard) were unavailable, and all workflow triggers were responding with 500 error codes.

At 17:10 UTC a new deployment was started with changes to solve a number of the issues that we had been seeing within the API service. During this blue/green deployment, because of the high API service load, the total number of instances and database connections exceeded the capacity of the database, and we started to see intermittent problems across the Tray API again.
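As a rough illustration of how a blue/green switch can exceed a database's connection capacity when a more gradual rollout would not, the arithmetic can be sketched as below. Every figure here (instance count, pool size, connection limit, batch size) is hypothetical and chosen only for illustration; none of them are taken from the Tray platform.

```python
def peak_connections(old_instances, new_instances_at_once, pool_size):
    """Worst-case connection count while old and new instances overlap."""
    return (old_instances + new_instances_at_once) * pool_size


# Hypothetical figures, for illustration only.
INSTANCES = 20             # API instances normally running
POOL_SIZE = 10             # database connections held per instance
DB_MAX_CONNECTIONS = 300   # connection limit of the database

# Blue/green: a full second fleet runs alongside the existing one.
blue_green_peak = peak_connections(INSTANCES, INSTANCES, POOL_SIZE)

# Batched replacement: only a few extra instances exist at any time.
BATCH_SIZE = 4
batched_peak = peak_connections(INSTANCES, BATCH_SIZE, POOL_SIZE)

print(blue_green_peak, blue_green_peak > DB_MAX_CONNECTIONS)
print(batched_peak, batched_peak > DB_MAX_CONNECTIONS)
```

With these assumed numbers the blue/green overlap needs 400 connections against a limit of 300, while replacing instances four at a time peaks at 240.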

Engineers started queueing incoming triggers to reduce some of the load on the API service while further investigation was done.
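The general idea of queueing triggers to shed load can be sketched as follows. This is a minimal in-memory illustration only: the `TriggerBuffer` class and its methods are hypothetical names, and the sketch says nothing about how the Tray platform actually queues triggers.

```python
from collections import deque


class TriggerBuffer:
    """Hypothetical buffer: holds triggers while the downstream API is overloaded."""

    def __init__(self):
        self._queue = deque()
        self.queueing = False  # switched on by an operator during an incident

    def handle(self, trigger, process):
        """Queue the trigger if queueing is on, otherwise process it now."""
        if self.queueing:
            self._queue.append(trigger)
            return "queued"
        process(trigger)
        return "processed"

    def drain(self, process):
        """Turn queueing off and process the backlog in arrival order."""
        self.queueing = False
        count = 0
        while self._queue:
            process(self._queue.popleft())
            count += 1
        return count
```

A production system would more likely use a durable queue so that buffered triggers survive a restart; the in-memory deque here is only meant to show the control flow.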

At 18:01 UTC a deployment was done to our authentication service, which greatly reduced the load caused by the refresh errors coming from the 3rd party service.

At 18:05 UTC engineers manually added some indexes to the API service database to help ease problems seen with the unusually high error rates.

At 18:15 UTC engineers noticed a high level of swap usage on the API service database and proceeded to upgrade the database cluster.

By 18:25 UTC trigger ingestion had resumed and all services were operating normally.

Remediation

Introduce a rolling deployment process to avoid exhausting database connections
Remove trigger dependency on API service, so all triggers can be queued immediately (where possible)
Remove the auth service dependency on the API and add better protection for high refresh error rates
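One common way to protect a service against a high rate of refresh errors is a circuit breaker, which stops attempting refreshes for a while after repeated failures. The sketch below is a simplified illustration under assumed thresholds; the class name, parameters, and defaults are hypothetical and do not describe the protection Tray implemented.

```python
import time


class RefreshCircuitBreaker:
    """Hypothetical breaker: pauses token refreshes after repeated failures."""

    def __init__(self, max_failures=5, cooldown_seconds=60.0, clock=time.monotonic):
        self.max_failures = max_failures          # assumed threshold
        self.cooldown_seconds = cooldown_seconds  # assumed pause length
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # time the breaker opened, or None if closed

    def allow(self):
        """Return True if a refresh attempt may be made right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and start counting afresh.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self):
        """Reset the consecutive-failure count after a successful refresh."""
        self._failures = 0

    def record_failure(self):
        """Count a failed refresh and open the breaker at the threshold."""
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()
```

A caller would check `allow()` before each refresh and report the outcome with `record_success()` or `record_failure()`, so that a failing 3rd party connector produces a bounded number of attempts per cooldown window.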

Posted Oct 14, 2020 - 10:23 BST

Resolved

There have been no further issues overnight and the incident is now resolved.

All services are operating normally again, queued triggers have all been processed, and workflow execution retries have caught up.

Engineers will be doing a postmortem and will provide more details on the cause and impact of the incident in the following days.

Posted Oct 10, 2020 - 08:52 BST

Monitoring

After deploying a fix, we are seeing that most issues are resolved. If you continue to experience any issues, please contact support. If you are having trouble logging in to create a ticket, please email support@tray.io.

Posted Oct 09, 2020 - 19:45 BST

Identified

The issue has been identified. Some customers are still unable to log in, and we're working hard to fix this issue.

Posted Oct 09, 2020 - 17:49 BST

Investigating

We are currently seeing issues with the Tray API, which are causing problems with the dashboard and builder interfaces and the embedded config wizard, as well as login issues and delayed triggers.

Posted Oct 09, 2020 - 16:44 BST

This incident affected: Workflow Execution, Workflow Builder (Workflow Builder API, Workflow Builder App, Workflow Builder Logs), and Embedded (Embedded API, Embedded App).