On December 27, 2024, at 13:13 UTC, our Anaplan internal monitoring alerted us to a hardware failure that impacted several servers. Customers may have experienced issues opening models; models that were already open weren't impacted. Refreshing sometimes allowed a model to open, but this was inconsistent. Additionally, scheduled CloudWorks™ integrations weren't processing in Anaplan Data Center – Netherlands.
The following regions were impacted by the model-opening issue:
Anaplan Data Center – U.S. East (intermittent)
Anaplan Data Center – U.S. West
Anaplan Data Center – Germany
Anaplan Data Center – Netherlands
Anaplan Google Cloud Public – U.S. East
Anaplan Google Cloud Public – Japan
Anaplan Amazon Cloud Public – U.S.
Anaplan Amazon Cloud Public – Europe
Upon initial investigation, our engineering team identified that the hardware failure had triggered a failover event, which enabled the affected servers to automatically recover to new hosts. Two servers failed to recover as designed and required manual intervention; after that intervention, they resumed operation by 13:51 UTC.
However, errors continued to occur when opening models. Further investigation identified a connectivity issue between our internal services, and we pursued several lines of investigation across network connectivity and infrastructure services to identify its cause.
At 15:30 UTC, we identified anomalous behavior within an internal load-balancer instance: new connections routed to this instance weren't being processed as designed. We restarted the instance at 16:15 UTC and saw an immediate reduction in connection errors, and models started to open more reliably.
We then reviewed all load-balancer instances and, at 16:30 UTC, identified the same issue in an external load-balancer instance, which we restarted at 17:06 UTC. Alerts and automated testing showed the issue was resolved, and after additional verification testing we confirmed at 17:20 UTC that the issue was fully resolved.
In parallel, we investigated the issue with CloudWorks™ integrations in Anaplan Data Center – Netherlands. Initial investigation determined that this issue was also related to the hardware failure. At 14:02 UTC, we performed a rolling restart of the impacted component, but it was unsuccessful: the component was unable to initialize because a call to an upstream message-broker resource was looping, which prevented the service from starting.
We reviewed the upstream resource and identified two corrupt queues. At 15:59 UTC, we removed the corrupt queues so that clean replacements could be recreated. We then performed a rolling restart of the impacted component, and at 16:27 UTC, CloudWorks™ integrations resumed. The integration backlog had cleared by 17:10 UTC.
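To illustrate the cleanup step, the following is a minimal sketch that assumes a RabbitMQ-style broker accessed through the pika Python client; the broker URL and queue names are hypothetical placeholders, not our actual configuration. Deleting a damaged queue allows the consuming component to re-declare it, and therefore recreate it cleanly, during its rolling restart.

```python
import pika

# All names below are illustrative placeholders, not real configuration values.
BROKER_URL = "amqp://user:password@broker.example.internal:5672/%2F"
CORRUPT_QUEUES = ["cloudworks-default", "cloudworks-events"]

connection = pika.BlockingConnection(pika.URLParameters(BROKER_URL))
channel = connection.channel()

for name in CORRUPT_QUEUES:
    # Remove the damaged queue; the consuming component re-declares it,
    # and therefore recreates it cleanly, on its next rolling restart.
    channel.queue_delete(queue=name)
    print(f"removed queue {name}")

connection.close()
```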
Working closely with our hardware vendor, we identified that the hardware failure had occurred due to a resource-locking event. Hardware failures are rare but expected, and we have an automated failover process in place to enable graceful recovery. The hardware failure triggered the designed failover event, and the servers were automatically recovered onto new hosts.
Failover is normally a non-disruptive event, but in this case a missing host-placement rule allowed both authoritative DNS nameservers to be located on the same host, which was the host impacted by the hardware failure. As a result, both authoritative DNS nameservers were relocated at the same time during the failover.
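For illustration, a placement check of the kind implied by the missing host rule could look like the sketch below; the inventory shape and service names are assumptions rather than our actual tooling. It flags any service with two or more instances placed on the same host.

```python
from collections import defaultdict

# Hypothetical inventory mapping each service instance to the host it runs on.
placements = {
    "dns-authoritative-1": "host-a",
    "dns-authoritative-2": "host-a",  # violation: both nameservers on one host
    "load-balancer-1": "host-b",
    "load-balancer-2": "host-c",
}

def find_colocation_violations(placements):
    """Return (service, host) pairs that carry more than one instance of a service."""
    grouped = defaultdict(list)
    for instance, host in placements.items():
        service = instance.rsplit("-", 1)[0]  # e.g. "dns-authoritative"
        grouped[(service, host)].append(instance)
    return {key: insts for key, insts in grouped.items() if len(insts) > 1}

for (service, host), instances in find_colocation_violations(placements).items():
    print(f"{service}: {', '.join(instances)} share {host}; add a host rule")
```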
When the servers restarted, the load-balancer instances started up before the authoritative DNS nameserver instances, so the load-balancer instances couldn't complete DNS resolution to process new connections. In normal operation, traffic is balanced between two load-balancer instances, and when one instance is unavailable, all traffic is automatically routed to the healthy instance. In this case, all the load-balancer instances were up, so all of them were receiving traffic. However, connections arriving at the two load-balancer instances couldn't complete DNS resolution, which stopped those connections from working.
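A minimal sketch of the startup dependency involved, assuming a hypothetical backend hostname and retry budget: before accepting new connections, a load balancer must be able to resolve the names of the services it routes to.

```python
import socket
import time

def wait_for_dns(hostname: str, attempts: int = 30, delay: float = 2.0) -> bool:
    """Wait until the hostname resolves, giving up after a bounded number of tries."""
    for attempt in range(1, attempts + 1):
        try:
            socket.getaddrinfo(hostname, None)
            return True  # resolution works, so it is safe to accept traffic
        except socket.gaierror:
            print(f"DNS not ready for {hostname} (attempt {attempt}/{attempts})")
            time.sleep(delay)
    return False

# Hypothetical backend name; a real startup check would cover each backend pool.
if not wait_for_dns("backend.example.internal"):
    raise SystemExit("DNS resolution unavailable; refusing to accept new connections")
```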
The restart of the servers also caused the CloudWorks™ issue. Two of the restarted servers were dependent on an upstream message-broker resource, and the disruption corrupted two queues. One of them was the default queue, which is used to initialize the downstream CloudWorks™ component. Because the default queue was corrupt, the component was unable to complete its initialization request, so CloudWorks™ integrations stopped processing.
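To make the initialization dependency concrete, here is a hedged sketch, again assuming a RabbitMQ-style broker via pika with a placeholder queue name: the component's startup hinges on the default queue being usable, and a bounded retry surfaces a clear failure rather than looping indefinitely.

```python
import time

import pika
from pika.exceptions import AMQPError

# Placeholder values; the real broker URL and queue name are internal details.
BROKER_URL = "amqp://user:password@broker.example.internal:5672/%2F"
DEFAULT_QUEUE = "cloudworks-default"

def initialize(max_attempts: int = 5):
    """Attach to the default queue with a bounded retry budget."""
    for attempt in range(1, max_attempts + 1):
        try:
            connection = pika.BlockingConnection(pika.URLParameters(BROKER_URL))
            channel = connection.channel()
            # passive=True only verifies the queue exists and is usable; a corrupt
            # or missing queue raises an error instead of stalling initialization.
            channel.queue_declare(queue=DEFAULT_QUEUE, passive=True)
            return connection, channel
        except AMQPError as exc:
            print(f"initialization attempt {attempt}/{max_attempts} failed: {exc}")
            time.sleep(2 ** attempt)
    raise RuntimeError(f"default queue {DEFAULT_QUEUE!r} unavailable; escalating")
```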
To prevent recurrence, we have reviewed the host rules across our services and updated them to ensure that no two instances of the same service are located on the same host. We are also adding further safeguards for load-balancer instances during automatic restarts, so that restarts are completed in order of dependency and automatic retry requests are completed for dependent services. We are improving our alerting so that an urgent alert is triggered if a load-balancer instance is unable to connect to a dependent service. Additionally, we've updated our response procedures so that we can resolve similar issues more quickly.
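As an illustration of the restart-ordering safeguard, the service names and dependency graph below are hypothetical; a topological sort guarantees that a dependency such as DNS is restarted and healthy before anything that relies on it, with a bounded retry and an escalation path when a service stays unhealthy.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each service maps to the services it depends on.
dependencies = {
    "dns-authoritative": set(),
    "load-balancer": {"dns-authoritative"},
    "model-service": {"load-balancer"},
}

def restart_in_dependency_order(dependencies, restart, is_healthy, retries=3):
    """Restart services so every dependency is healthy before its dependents restart."""
    for service in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, retries + 1):
            restart(service)
            if is_healthy(service):
                break
            print(f"{service} unhealthy after restart (attempt {attempt}/{retries})")
        else:
            raise RuntimeError(f"{service} failed to recover; raise an urgent alert")

# Example wiring with stub callbacks; a real check would probe the service itself.
restart_in_dependency_order(
    dependencies,
    restart=lambda service: print(f"restarting {service}"),
    is_healthy=lambda service: True,
)
```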
We deeply apologize for any impact this issue may have had on your business operations. We are continuously strengthening our systems and procedures to help avoid future disruptions to your business and users. If you have any further questions or concerns, please contact Anaplan Customer Care. Thank you for your patience during this situation, and thank you for being an Anaplan customer.