On May 16, 2024, at 00:00 UTC, the Anaplan platform encountered service disruptions due to an unexpected increase in load on the API Services. This surge in load caused saturation, limiting available resources to a core back-end service, which resulted in intermittent outages across the platform.
This incident affected the following regions:
· Anaplan Data Center - US East
· Anaplan Data Center - US West
· Anaplan Data Center – Germany
· Anaplan Data Center – Netherlands
· Anaplan Google Cloud Public - US East
· Anaplan Google Cloud Public - Japan
· Anaplan Amazon Cloud Public – US
· Anaplan Amazon Cloud Public – Europe
The intermittent outages, preventing access to the Anaplan platform, occurred at the following times:
· 00:00-01:33 UTC
· 06:40-07:14 UTC
· 10:03-10:15 UTC
· 14:02-14:09 UTC
To stabilize the platform, we took immediate action to scale down the API and CloudWorks™ services. This measure paused all integration activities, allowing the core back-end service to recover and enabling customer access to the platform. While the API and CloudWorks™ services were scaled down, all integrations to the platform were unavailable. This affected Anaplan Connect, the Administration service and third party integrations. Integration functionality was restored once the API and CloudWorks™ services were scaled back up.
Integrations were unavailable during the following periods:
· 00:00-02:18 UTC
· 06:40-09:04 UTC
· 10:03-11:38 UTC
· 14:02-14:19 UTC
To identify the source of the increased load, we isolated a specific group of integrations. Although this initially seemed to alleviate the issue, analysis revealed that they weren’t the root cause. Further steps were taken to reduce the load on the API service by updating the configuration to the API retry mechanism at 14:05 UTC. Post this mitigation step, we monitored the platform for three hours until we resolved the issue at 17:09 UTC.
Additional analysis identified that an upstream service’s time-out configuration had been inadvertently lowered. The time-out change caused a more aggressive failure to be sent to the API service. This led to a significant surge in API retry requests and the increase in load on API services. The time-out change, in combination with the aggressive API retry mechanism and top-of-the-hour integrations traffic, resulted in resource saturation and subsequent outages of the platform.
We identified that the time-out configuration was unintentionally lowered as part of an upgrade completed to an upstream service on May 15, at 15:00 UTC. We have since updated the configuration to the former value. A thorough review is being conducted to understand how the upgrade led to this unintended change. In addition, we are reviewing the API retry mechanism. We have suspended the API retry functionality that's used in this type of failure scenario as this isn't used for daily processes. Furthermore, we have added increased observability for this scenario, and are increasing resources to the impacted core back-end service to add additional resiliency.
We apologize for any impact this issue may have had on your business operations. We are continuously strengthening our systems and procedures to ensure we avoid future disruptions to your business and users.
If you have further questions or concerns, please visit the Anaplan Support website. Thank you for your patience during this situation and thank you for being an Anaplan customer.