As discussed yesterday, I wanted to follow up with more detail on the outage we had this week. I’ll explain how orders are handled, why there was a delay, and what we have already done, and plan to do, to prevent these issues from happening again.
Summary
- Cumulative US order history (tens of thousands per holding) prevented our US partner from updating the status on orders.
- Our app failed to display any information on orders that were delayed in this way (pending).
- We have moved to a new integration model which shortcuts receipt of order status updates.
- We’ll be improving the app to show pending orders more clearly.
- We’ll improve our load testing to identify non-linear scaling issues sooner.
- We’ll improve how quickly we identify these issues and communicate them to customers.
Orders
Placing orders either through the UK’s retail service providers or via our US clearing partner is handled asynchronously. This means that successfully sending the order to our trading partner and confirming its status are two separate processes. When you place a trade within the Freetrade app we attempt to complete both placing the order and updating its status.
Once an order has been successfully created it is put into a “Placed” status after which we keep checking with our partner for any updates. A successful placement and execution of an order will move its status to “Complete” and the app will tell you that your order has been completed. In the case of our US orders on Tuesday, we successfully placed the trades with our US partner but we were unable to receive any updates on their statuses.
If we’re unable to confirm an order’s status within 45 seconds we inform you at the end of the order flow and continue to try in the background. Right now this is a poor experience for two reasons: 1) The message shown is not clear enough and can easily be dismissed; 2) If you leave the order flow this status is not displayed anywhere else in the app - it appears as if the order never happened. We are going to improve how we handle these exceptions to ensure you always know the status of your order.
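As a rough illustration of the flow described above, the sketch below polls for an order’s status until it is “Complete” or a 45-second window elapses, and otherwise reports the order as pending. It is a minimal sketch under stated assumptions, not our production code: the function name `confirm_order`, the one-second poll interval and the injectable `clock` and `sleep` parameters are all hypothetical.

```python
import time
from typing import Callable

PLACED, COMPLETE, PENDING = "Placed", "Complete", "Pending"


def confirm_order(
    fetch_status: Callable[[], str],
    timeout_s: float = 45.0,
    poll_interval_s: float = 1.0,  # assumed interval, not stated in this post
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the partner until the order completes or the timeout elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if fetch_status() == COMPLETE:
            return COMPLETE
        sleep(poll_interval_s)
    # Not confirmed in time: surface the order as pending rather than hiding it.
    # A background job (not shown) would keep checking after this point.
    return PENDING
```

The important design point is the final branch: an order that times out still exists, so it should be shown with an explicit pending status rather than disappearing from view.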
Partner integration
The Freetrade Invest platform is designed to support different protocols for different trading partners. Our US trading partner now supports both HTTP and Amazon SQS for order updates. We have been using their HTTP service to query updates on individual orders.
Historically, our partner has worked in a fully disclosed manner, meaning end users interact directly with their platform. In our case, we’re working as an omnibus account, meaning Freetrade is seen as a single user. This means we can provide a better long-term service across multiple markets as well as providing greater flexibility for our customers.
A side effect of working in this omnibus fashion, which was not properly accounted for, was how it would affect the processing of individual holdings. Rather than a number of users with hundreds or maybe thousands of orders, we’re seen as a single user with tens of thousands of orders for a single holding. Before the HTTP service is able to respond with an order status, it has to process an increasingly large number of historic records for each holding. As these orders have accumulated and order volume has grown, the time taken to process these holdings has increased, leading to Tuesday’s outage.
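To see why this grows faster than order volume, consider a simplified cost model. It assumes, purely for illustration, that every status query scans each historic order on the account for that holding; our partner’s real processing is not described here and may well differ. Under that assumption, the n-th order costs n record scans, so the total work grows with the square of the number of orders on one account.

```python
def total_scans(orders_per_account: int, accounts: int = 1) -> int:
    """Total records scanned if each new order triggers one status query
    that scans the account's full history so far (assumed cost model)."""
    per_account = orders_per_account * (orders_per_account + 1) // 2
    return accounts * per_account


# The same 10,000 orders, split two ways (figures are illustrative only).
disclosed = total_scans(100, accounts=100)  # 100 users with 100 orders each
omnibus = total_scans(10_000)               # one omnibus account

print(disclosed)             # 505000
print(omnibus)               # 50005000
print(omnibus // disclosed)  # 99
```

In this toy model the omnibus arrangement does roughly a hundred times more work for the same number of orders, which is the kind of non-linear effect that a load test focused only on order flow would not reveal.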
We became aware of the impact of these cumulative orders about two weeks ago. Since discovering the issue, both Freetrade and our partner have been working to address it. We have been migrating to their new SQS service, which doesn’t block on processing before sending out status updates, and our partner has been working to improve their HTTP service. Unfortunately, neither of these changes was ready in time for Tuesday, leading to the outage. However, I’m pleased to say that our migration to the SQS platform is complete and has now gone into production - mitigating this issue and improving US order response times.
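The difference between the two models can be sketched as follows. Instead of asking the partner about each order in turn, the consumer simply applies whatever status updates have arrived on a queue. This is a hedged illustration only: it uses Python’s in-memory `queue.Queue` as a stand-in for Amazon SQS, and the JSON message shape (`order_id`, `status`) and the function name `apply_updates` are assumptions rather than our partner’s actual format.

```python
import json
import queue


def apply_updates(updates: "queue.Queue[str]", orders: dict[str, str]) -> int:
    """Drain the queue, applying each status update to the local order store.

    Returns the number of updates applied.
    """
    applied = 0
    while True:
        try:
            body = updates.get_nowait()
        except queue.Empty:
            return applied
        message = json.loads(body)  # hypothetical shape: {"order_id", "status"}
        orders[message["order_id"]] = message["status"]
        applied += 1


# Usage: one update arrives for order A1; order B2 is left untouched.
inbox: "queue.Queue[str]" = queue.Queue()
inbox.put(json.dumps({"order_id": "A1", "status": "Complete"}))
orders = {"A1": "Placed", "B2": "Placed"}
apply_updates(inbox, orders)
```

Because each update is delivered as it becomes available, the time to learn about one order no longer depends on how much history sits behind it.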
Testing
Our Invest Platform undergoes a number of checks before launching any new features. We have a CI environment which runs a suite of unit, integration, driver and feature tests on a nightly basis. Once our automation test suites have passed we have a pre-production environment that goes through manual validation with our QA team. In addition to this our production changes are often gated by feature flags which allow us to deploy features in a controlled manner.
As well as this functional testing, we also carry out load testing. Although our trade performance indicated that we were able to handle the load seen this week, it did not account for the build-up in historic orders. We had failed to project our load testing data correctly, assuming that performance was simply a function of order flow.
This is a definite weakness in our testing armoury and more detailed analysis of the performance profile of our trade execution at scale would have captured this. To this end, our teams are working on a formal process to capture, analyse and project this usage more accurately in future.
Thank you again for your patience whilst we put this together and we appreciate the feedback!