I know that we have been in touch with many of you on chat already but I just wanted to give a more thorough explanation of the issues that we experienced earlier.
First off, I want to sincerely apologise to the teams that were impacted. It’s incredibly frustrating when something like this happens. All I can do is apologise and reassure you that we have identified the root cause of the issue. We have applied a temporary patch and will implement a permanent fix in the coming days.
10.51 – first report of an issue with the Dashboard loading. We investigated this as an isolated incident, as nothing on the stack indicated any unusual load.
11.45 – all parts of the stack were performing within normal ranges.
12.00 – started noticing some lag in the system (but Webservers, Databases, Cache – all showing under 10% load). We had no alarms. We use Elasticsearch in some parts of the app and even that was showing 37% CPU load at this time (higher than normal but not critical).
12.45 – we managed to replicate the issue and started implementing a fix.
13.20 – fix is applied.
13.40 – system is back to normal.
- New users were added to the system whose email addresses included an invisible character (a space).
- When those users loaded the dashboard, the space meant their websocket connection was refused, as spaces are not allowed.
- When there’s no websocket connection, the app automatically reloads to try to re-establish the connection.
- As more and more of those users were continually reloading the dashboard, the issues started to compound.
- This eventually impacted other users.
- It also created a backlog that took time to work through.
If this had happened in a standard team setup, there wouldn’t have been an issue.
As luck would have it, the users (with the space in their emails) were part of a large team with a lot of apps (some with 40+ moderators). Every time they loaded the dashboard, there were a lot of calls needed to return the correct data to them.
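For those interested in the mechanics: the reload described above retried immediately and without limit, which is why the load kept building. One common way to bound that kind of retry is a capped backoff with jitter. The sketch below is purely illustrative and is not our actual client code; the function name and the default values are assumptions made for the example.

```python
import random


def reconnect_delays(max_attempts=6, base=1.0, cap=30.0, rng=None):
    """Yield the wait (in seconds) before each reconnect attempt, then stop.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    if rng is None:
        rng = random.Random()
    for attempt in range(max_attempts):
        # The ceiling doubles on each attempt but never exceeds `cap`.
        ceiling = min(cap, base * 2 ** attempt)
        # Jitter: wait a random time up to the ceiling, so many clients
        # that fail together do not all retry at the same moment.
        yield rng.uniform(0, ceiling)


# After the last delay the client would stop retrying and show an error
# instead of reloading again.
for delay in reconnect_delays():
    print(f"wait {delay:.1f}s before the next attempt")
```

With a limit like this, a client that can never connect gives up after a handful of attempts rather than reloading the dashboard indefinitely.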
- Temporarily disabled the reload logic
- Implement validation on new user emails
- Optimise the Dashboard API calls to prevent this from happening again
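To give a rough idea of what the email validation might look like, here is a minimal sketch. It assumes we trim whitespace and invisible characters from both ends of the address and reject any that remain inside it; `clean_email` is a hypothetical name, and the check on the address shape is deliberately simple rather than a full validator.

```python
import unicodedata


def _is_invisible(ch: str) -> bool:
    # Whitespace, plus Unicode control (Cc) and format (Cf) characters
    # such as the zero-width space.
    return ch.isspace() or unicodedata.category(ch) in ("Cc", "Cf")


def clean_email(raw: str) -> str:
    """Trim invisible characters from the ends; reject any left inside.

    Hypothetical helper, for illustration only.
    """
    start, end = 0, len(raw)
    while start < end and _is_invisible(raw[start]):
        start += 1
    while end > start and _is_invisible(raw[end - 1]):
        end -= 1
    email = raw[start:end]

    if any(_is_invisible(ch) for ch in email):
        raise ValueError("email contains whitespace or invisible characters")

    # Very light shape check: something on both sides of the last "@".
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("email must look like local@domain")
    return email


print(clean_email(" jane@example.com "))
```

Running the check when a user is created means an address with a stray space is either cleaned up or rejected before it can reach the websocket layer.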
Technically, we weren’t down today, but for a period the system was too slow for some users to use. This is not the service that we strive to deliver.
Again, I apologise for the inconvenience and disruption caused.
As a priority, we will implement the permanent fix and the improvements listed above.
Thanks for your understanding!