On December 27, 2024, at 13:13 UTC, our Anaplan internal monitoring alerted us to a hardware failure that impacted several servers. Customers may have experienced issues opening models; models that were already open weren't impacted. Refreshing sometimes allowed a model to open, but this was inconsistent. Additionally, scheduled CloudWorks™ integrations weren't processing in Anaplan Data Center – Netherlands.
The following regions were impacted by the model-opening issue:
Anaplan Data Center – U.S. East (intermittent)
Anaplan Data Center – U.S. West
Anaplan Data Center – Germany
Anaplan Data Center – Netherlands
Anaplan Google Cloud Public – U.S. East
Anaplan Google Cloud Public – Japan
Anaplan Amazon Cloud Public – U.S.
Anaplan Amazon Cloud Public – Europe
Upon initial investigation, our engineering team identified that the hardware failure had triggered a failover event, which enabled the affected servers to automatically recover to new hosts. Two servers failed to recover as designed and required manual intervention; after that intervention, they resumed operation by 13:51 UTC.
However, errors continued to occur when opening models. Further investigation identified a connectivity issue between our internal services, and we pursued several lines of investigation across network connectivity and infrastructure services to identify its cause.
At 15:30 UTC, we identified anomalous behavior within an internal load-balancer instance: new connections routed to this instance weren't being processed as designed. We restarted the instance at 16:15 UTC and saw an immediate reduction in connection errors, and models started to open more reliably.
We then reviewed all load-balancer instances and, at 16:30 UTC, identified the same issue in an external load-balancer instance, which we restarted at 17:06 UTC. Alerts and automated testing showed the issue was resolved, and after additional verification testing we confirmed at 17:20 UTC that the issue was fully resolved.
In parallel, we investigated the issue with CloudWorks™ integrations in Anaplan Data Center – Netherlands. Initial investigation determined that this issue was also related to the hardware failure. At 14:02 UTC, we performed a rolling restart of the impacted component, but it was unsuccessful: the component was unable to initialize because a call to an upstream message-broker resource was looping, which prevented the service from starting.
We reviewed the upstream resource and identified two corrupt queues. At 15:59 UTC, we removed the corrupt queues so that clean replacements could be recreated. We then performed a rolling restart of the impacted component, and at 16:27 UTC, CloudWorks™ integrations resumed. The integration backlog had cleared by 17:10 UTC.
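To illustrate the cleanup step, the following is a minimal sketch that assumes a RabbitMQ-style broker accessed through the pika Python client; the broker URL and queue names are hypothetical placeholders, not our actual configuration. Deleting a damaged queue allows the consuming component to re-declare it, and therefore recreate it cleanly, during its rolling restart.

```python
import pika

# All names below are illustrative placeholders, not real configuration values.
BROKER_URL = "amqp://user:password@broker.example.internal:5672/%2F"
CORRUPT_QUEUES = ["cloudworks-default", "cloudworks-events"]

connection = pika.BlockingConnection(pika.URLParameters(BROKER_URL))
channel = connection.channel()

for name in CORRUPT_QUEUES:
    # Remove the damaged queue; the consuming component re-declares it,
    # and therefore recreates it cleanly, on its next rolling restart.
    channel.queue_delete(queue=name)
    print(f"removed queue {name}")

connection.close()
```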
Working closely with our hardware vendor, we identified that the hardware failure had occurred due to a resource-locking event. Hardware failures are rare but expected, and we have an automated failover process in place to enable graceful recovery. The hardware failure triggered the designed failover event, and the servers were automatically recovered onto new hosts.
Failover is normally a non-disruptive event, but in this case a missing host-placement rule allowed both authoritative DNS nameservers to be located on the same host, which was the host impacted by the hardware failure. As a result, both authoritative DNS nameservers were relocated at the same time during the failover.
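For illustration, a placement check of the kind implied by the missing host rule could look like the sketch below; the inventory shape and service names are assumptions rather than our actual tooling. It flags any service with two or more instances placed on the same host.

```python
from collections import defaultdict

# Hypothetical inventory mapping each service instance to the host it runs on.
placements = {
    "dns-authoritative-1": "host-a",
    "dns-authoritative-2": "host-a",  # violation: both nameservers on one host
    "load-balancer-1": "host-b",
    "load-balancer-2": "host-c",
}

def find_colocation_violations(placements):
    """Return (service, host) pairs that carry more than one instance of a service."""
    grouped = defaultdict(list)
    for instance, host in placements.items():
        service = instance.rsplit("-", 1)[0]  # e.g. "dns-authoritative"
        grouped[(service, host)].append(instance)
    return {key: insts for key, insts in grouped.items() if len(insts) > 1}

for (service, host), instances in find_colocation_violations(placements).items():
    print(f"{service}: {', '.join(instances)} share {host}; add a host rule")
```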
When the servers restarted, the load-balancer instances started up before the authoritative DNS nameserver instances, so the load-balancer instances couldn't complete DNS resolution to process new connections. In normal operation, traffic is balanced between two load-balancer instances, and when one instance is unavailable, all traffic is automatically routed to the healthy instance. In this case, all the load-balancer instances were up, so all of them were receiving traffic. However, connections arriving at the two load-balancer instances couldn't complete DNS resolution, which stopped those connections from working.
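A minimal sketch of the startup dependency involved, assuming a hypothetical backend hostname and retry budget: before accepting new connections, a load balancer must be able to resolve the names of the services it routes to.

```python
import socket
import time

def wait_for_dns(hostname: str, attempts: int = 30, delay: float = 2.0) -> bool:
    """Wait until the hostname resolves, giving up after a bounded number of tries."""
    for attempt in range(1, attempts + 1):
        try:
            socket.getaddrinfo(hostname, None)
            return True  # resolution works, so it is safe to accept traffic
        except socket.gaierror:
            print(f"DNS not ready for {hostname} (attempt {attempt}/{attempts})")
            time.sleep(delay)
    return False

# Hypothetical backend name; a real startup check would cover each backend pool.
if not wait_for_dns("backend.example.internal"):
    raise SystemExit("DNS resolution unavailable; refusing to accept new connections")
```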
The restart of the servers also caused the CloudWorks™ issue. Two of the restarted servers were dependent on an upstream message-broker resource, and the disruption corrupted two queues. One of them was the default queue, which is used to initialize the downstream CloudWorks™ component. Because the default queue was corrupt, the component was unable to complete its initialization request, so CloudWorks™ integrations stopped processing.
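To make the initialization dependency concrete, here is a hedged sketch, again assuming a RabbitMQ-style broker via pika with a placeholder queue name: the component's startup hinges on the default queue being usable, and a bounded retry surfaces a clear failure rather than looping indefinitely.

```python
import time

import pika
from pika.exceptions import AMQPError

# Placeholder values; the real broker URL and queue name are internal details.
BROKER_URL = "amqp://user:password@broker.example.internal:5672/%2F"
DEFAULT_QUEUE = "cloudworks-default"

def initialize(max_attempts: int = 5):
    """Attach to the default queue with a bounded retry budget."""
    for attempt in range(1, max_attempts + 1):
        try:
            connection = pika.BlockingConnection(pika.URLParameters(BROKER_URL))
            channel = connection.channel()
            # passive=True only verifies the queue exists and is usable; a corrupt
            # or missing queue raises an error instead of stalling initialization.
            channel.queue_declare(queue=DEFAULT_QUEUE, passive=True)
            return connection, channel
        except AMQPError as exc:
            print(f"initialization attempt {attempt}/{max_attempts} failed: {exc}")
            time.sleep(2 ** attempt)
    raise RuntimeError(f"default queue {DEFAULT_QUEUE!r} unavailable; escalating")
```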
To prevent recurrence, we have reviewed the host rules across our services and updated them to ensure that no two instances of the same service are located on the same host. We are also adding further safeguards for load-balancer instances during automatic restarts, so that restarts are completed in order of dependency and automatic retry requests are completed for dependent services. We are improving our alerting so that an urgent alert is triggered if a load-balancer instance is unable to connect to a dependent service. Additionally, we've updated our response procedures so that we can resolve similar issues more quickly.
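As an illustration of the restart-ordering safeguard, the service names and dependency graph below are hypothetical; a topological sort guarantees that a dependency such as DNS is restarted and healthy before anything that relies on it, with a bounded retry and an escalation path when a service stays unhealthy.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each service maps to the services it depends on.
dependencies = {
    "dns-authoritative": set(),
    "load-balancer": {"dns-authoritative"},
    "model-service": {"load-balancer"},
}

def restart_in_dependency_order(dependencies, restart, is_healthy, retries=3):
    """Restart services so every dependency is healthy before its dependents restart."""
    for service in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, retries + 1):
            restart(service)
            if is_healthy(service):
                break
            print(f"{service} unhealthy after restart (attempt {attempt}/{retries})")
        else:
            raise RuntimeError(f"{service} failed to recover; raise an urgent alert")

# Example wiring with stub callbacks; a real check would probe the service itself.
restart_in_dependency_order(
    dependencies,
    restart=lambda service: print(f"restarting {service}"),
    is_healthy=lambda service: True,
)
```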
We deeply apologize for any impact this issue may have had on your business operations. We are continuously strengthening our systems and procedures to help avoid future disruptions to your business and users. If you have any further questions or concerns, please contact Anaplan Customer Care. Thank you for your patience during this situation, and thank you for being an Anaplan customer.