On September 13th, at 14:25 UTC, we experienced elevated alerts in Anaplan Amazon Cloud Public — U.S. These alerts showed that customers were having trouble saving and opening models. Customers also experienced interruptions with the New User Experience, API, and CloudWorks connections on the platform.
Our initial investigation identified an unexpected spike in resource consumption within one of our storage services. In response, we quickly increased storage capacity to handle the surge. We identified that the spike was related to migration activity; however, the additional capacity was added before the migration was stopped, so the additional capacity was consumed as well. All additional capacity undergoes a storage optimization process, and it is not possible to add further capacity until this process is complete. Unfortunately, this process took significantly longer than expected and delayed the restoration of full service.
Because the storage optimization process was long-running, we made several manual attempts during the incident to reduce resource consumption. These attempts helped to reduce the issues on the platform; however, some issues persisted while the optimization continued. After the optimization process completed, full service was restored at 23:25 UTC. Following the restoration, we increased storage capacity further as an additional measure.
We have analyzed our system logs to quantify the impact of this incident. General model behavior (model open, model save, workspace open, creation of import/export/process tasks, writing changes to the changelog such as a cell edit, and creation of revisions) was degraded for the duration of the incident. Furthermore, degradation was so significant during the following periods that the platform was effectively unusable, resulting in a total downtime of 6 hours 23 minutes:
In addition, API integration error rates increased, causing some customer integrations to fail or respond more slowly. These failures were most significant in Anaplan Amazon Cloud Public — U.S., but we have also identified degradation across all other regions.
We have conducted a thorough review of the migration activity, the storage service, and the factors that contributed to the prolonged issue. We have paused further migrations until all mitigation activities are completed. To stop the issue from happening again, we are adding more monitoring of the rate of storage consumption, so that any unusual increase in consumption can be addressed before the platform experiences issues. We are also reviewing and improving our runbooks for storage issues to make sure steps are taken in the correct order, so that optimization activities do not cause issues. In addition, we are improving our internal processes to reduce the load of migrations and to ensure rate limits are applied to prevent a surge in consumption. We will also ensure these activities are completed during off-peak hours.
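As an illustration of what monitoring the rate of consumption can mean in practice, the following is a minimal sketch and not a description of our actual tooling. It estimates how quickly storage is being consumed from a series of usage samples and raises an alert when the projected time until capacity is exhausted falls below a threshold. All names (`Sample`, `consumption_rate`, `seconds_until_full`, `should_alert`) and all values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One hypothetical storage usage reading."""
    timestamp_s: float  # seconds since an arbitrary reference point
    used_bytes: int     # storage consumed at that time


def consumption_rate(samples: List[Sample]) -> float:
    """Average bytes per second between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1].timestamp_s - samples[0].timestamp_s
    if elapsed <= 0:
        return 0.0
    return (samples[-1].used_bytes - samples[0].used_bytes) / elapsed


def seconds_until_full(samples: List[Sample], capacity_bytes: int) -> Optional[float]:
    """Projected seconds until capacity is reached, or None if usage is not growing."""
    rate = consumption_rate(samples)
    if rate <= 0:
        return None
    remaining = max(capacity_bytes - samples[-1].used_bytes, 0)
    return remaining / rate


def should_alert(samples: List[Sample], capacity_bytes: int, min_headroom_s: float) -> bool:
    """Alert when the projected time to exhaustion is below the required headroom."""
    eta = seconds_until_full(samples, capacity_bytes)
    return eta is not None and eta < min_headroom_s


if __name__ == "__main__":
    # Illustrative numbers only: usage grows by 100 bytes over 10 seconds.
    readings = [Sample(0.0, 100), Sample(10.0, 200)]
    print(consumption_rate(readings))
    print(seconds_until_full(readings, capacity_bytes=1000))
    print(should_alert(readings, capacity_bytes=1000, min_headroom_s=100.0))
```

Alerting on projected time to exhaustion, rather than on a fixed usage percentage, is one way to catch a sudden surge while there is still time to stop the source of the load before adding capacity.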
We deeply apologize for the impact this has had on your business operations. We know these issues can cause trouble for your business and users. We are always improving our systems and procedures to prevent similar issues from happening again in the future. If you have any further questions or concerns, please contact Anaplan Customer Care. Thank you for your patience during this situation and thank you for being an Anaplan customer.