Platform Alerts

Incident Report for Anaplan

Postmortem

On September 13th, at 14:25 UTC, we experienced elevated alerts in Anaplan Amazon Cloud Public — U.S. These alerts showed that customers were having trouble saving and opening models. Customers also experienced interruptions with the New User Experience, API, and CloudWorks connections on the platform.

Our initial investigation identified an unexpected spike in resource consumption within one of our storage services. In response, we quickly increased storage capacity to handle the surge. We identified that the spike was related to migration activity; however, the additional capacity was added before the migration was stopped, so the migration consumed the additional capacity as well. All additional capacity undergoes a storage optimization process, and it is not possible to add further capacity until this process is complete. Unfortunately, this process took significantly longer than expected and delayed the restoration of full service.

Due to the long-running storage optimization process, we made several manual attempts during the incident to lessen resource consumption. These attempts helped to reduce the issues on the platform; however, interim issues persisted while the optimization continued. After the optimization process was completed, full service was restored at 23:25 UTC. Following the restoration, we increased storage capacity further as an additional measure.

We have analyzed our system logs to quantify the impact of this incident. General model behavior (model open, model save, workspace open, creation of import/export/process tasks, writing changes to the changelog such as a cell edit, and creation of revisions) was degraded for the duration of the incident. Furthermore, degradation was so significant during the following periods that the platform was effectively unusable, resulting in a total downtime of 6 hours 23 minutes:

14:45-19:00 UTC
19:34-19:47 UTC
20:25-22:20 UTC
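
As a cross-check of the total above, the three windows can be summed directly. This is a minimal sketch for illustration only, not part of Anaplan's tooling; the names WINDOWS, window_length, and total_downtime are hypothetical, and it assumes all three windows fall on the same UTC day, as the timeline indicates.

```python
from datetime import datetime, timedelta

# Outage windows (UTC) copied from the list above; all on the same day.
WINDOWS = [("14:45", "19:00"), ("19:34", "19:47"), ("20:25", "22:20")]


def window_length(start: str, end: str) -> timedelta:
    """Return the length of one HH:MM-HH:MM window within a single day."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


def total_downtime(windows) -> timedelta:
    """Sum the lengths of all windows."""
    return sum((window_length(s, e) for s, e in windows), timedelta())


total = total_downtime(WINDOWS)
hours, remainder = divmod(int(total.total_seconds()), 3600)
minutes = remainder // 60
print(f"{hours}h {minutes}m")  # prints "6h 23m"
```

The individual windows come to 4h 15m, 13m, and 1h 55m, which together match the reported total.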

In addition, API integration error rates increased, leading to some customer integrations failing and/or responding more slowly. These failures were most significant in Anaplan Amazon Cloud Public — U.S., but we have also identified degradation across all other regions.

We have conducted a thorough review of the migration activity, the storage service, and the factors that contributed to the prolonged issue. We have paused further migrations until all mitigation activities are completed. To prevent a recurrence, we are adding more monitoring of the rate of resource consumption so that any unusual increase can be addressed before the platform experiences issues. We are also reviewing and improving our runbooks for storage issues to make sure steps are taken in the correct order, so that optimization activities do not cause issues. In addition, we are improving our internal mitigation processes to reduce the load of migrations and to ensure rate limits are applied to prevent a surge in consumption. We will also ensure these activities are completed during off-peak hours.

We deeply apologize for the impact this has had on your business operations. We know these issues can cause trouble for your business and users. We are always improving our systems and procedures to prevent similar issues from happening again in the future. If you have any further questions or concerns, please contact Anaplan Customer Care. Thank you for your patience during this situation and thank you for being an Anaplan customer.

Posted Sep 25, 2024 - 08:24 UTC

Resolved

We have confirmed that the issue is now resolved.

We deeply apologize for any impact this issue may have caused. We appreciate your patience and partnership as we worked through this issue.

We will follow up within 7 business days with a detailed root cause analysis (RCA) that will be shared on our Status Page. If you have any questions or concerns, please do not hesitate to contact us at Anaplan Support.

Posted Sep 14, 2024 - 00:33 UTC

Monitoring

Service has now been restored; you should now be able to resume normal activities.

We will continue to monitor the platform to ensure no additional issues arise. If you have any questions, concerns, or continue to experience issues, please do not hesitate to contact Anaplan Support. We will provide a final update to you when we consider this situation fully resolved.

Posted Sep 13, 2024 - 23:25 UTC

Update

We are actively working to resolve the issue and restore service as quickly as possible.
Our teams are currently assessing solutions that are yielding positive results.
We will issue an update within 60 minutes or as soon as the fix is in place and the issue is resolved.

Posted Sep 13, 2024 - 21:10 UTC

Update

We are actively addressing the issue to restore service as promptly as possible. Our teams are evaluating solutions that are demonstrating positive results. We will provide an update within 60 minutes or once the fix has been implemented and the issue has been resolved.

Posted Sep 13, 2024 - 19:43 UTC

Update

We are actively working to restore the service as quickly as possible.

At this time, customers may experience degraded performance when opening models or saving data. Some users may also experience NUX errors.

Unfortunately, we do not have an estimated time for resolution yet.

We will provide further updates in 60 minutes or once the issue is resolved.

Posted Sep 13, 2024 - 17:30 UTC

Update

We have identified the likely cause of the issue, and we are focused right now on restoring service as quickly as possible. Currently, we do not yet have a time to resolution. We will provide further updates in 60 minutes or upon resolution.

Posted Sep 13, 2024 - 16:29 UTC

Update

We are actively working to restore the service as quickly as possible.

At this time, some customers may experience degraded performance when opening models or saving data. Unfortunately, we do not have an estimated time for resolution yet.

We will provide further updates in 30 minutes or once the issue is resolved.

Posted Sep 13, 2024 - 15:47 UTC

Identified

Posted Sep 13, 2024 - 14:56 UTC

Investigating

We are currently investigating an issue where some customers may experience degraded service when opening or saving their models.

We are working to resolve this issue as quickly as possible and will provide updates every 30 minutes or upon resolution.

Posted Sep 13, 2024 - 14:25 UTC

This incident affected: us7: Cloud - US.
