Platform Alerts

Postmortem

On July 20, 2025, at approximately 00:45 UTC, we were notified that errors were occurring when users attempted to open large workspaces.

Our Support team investigated and identified an error in how the system was allocating resources from a specific category of nodepools used for large workspaces. Because of this error, very large workspaces (HyperModels) could not be opened. The disruption affected one region, us7: Anaplan Amazon Cloud Public — US.

A platform incident was declared at 01:58 UTC. Our technical teams joined the call and identified a configuration issue in our auto-scaling system. The configuration was manually updated, and full service was restored at 02:53 UTC.

We have completed a thorough investigation into the issue. Our platform uses a Cluster Autoscaler to provision on-demand capacity. The Autoscaler determines how much capacity is needed and grows a nodepool that can supply it. To perform this matching, the Autoscaler needs to know the properties of the underlying nodepools. It typically gets these from static lookups and predefined information, but it can also absorb additional properties from the existing nodes within a nodepool.

At 21:48 UTC, we completed a downtime maintenance window. During the scale-down step of that maintenance, an ephemeral property was added to every node in a specific nodepool. Because the property was present on all nodes within the nodepool, the Autoscaler absorbed it as a property of the nodepool. This caused the matching process to fail: the Autoscaler incorrectly concluded that the nodepool could not handle the capacity demands.
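
The failure mode described above can be illustrated with a minimal Python sketch. It is not the Autoscaler's actual code. The class and function names, the memory figures, and the property name "maintenance-drain" are all hypothetical, and the sketch assumes the ephemeral property behaved like a placement restriction that a workload must explicitly accept, which this report does not state.

```python
from dataclasses import dataclass, field


@dataclass
class NodepoolTemplate:
    """What the autoscaler believes a new node in this pool would look like."""
    name: str
    memory_gib: int
    # Assumption: properties a workload must explicitly accept to be placed.
    restrictions: set[str] = field(default_factory=set)


@dataclass
class Workload:
    name: str
    memory_gib: int
    accepted_restrictions: set[str] = field(default_factory=set)


def absorb_common_restrictions(template, node_restrictions):
    """Add every restriction found on ALL existing nodes to the pool template."""
    if not node_restrictions:
        return template
    common = set.intersection(*node_restrictions)
    return NodepoolTemplate(
        template.name, template.memory_gib, template.restrictions | common
    )


def can_host(template, workload):
    """Matching step: the pool must be big enough and must not carry
    restrictions the workload does not accept."""
    fits = workload.memory_gib <= template.memory_gib
    allowed = template.restrictions <= workload.accepted_restrictions
    return fits and allowed


# Hypothetical figures, for illustration only.
pool = NodepoolTemplate("large-workspaces", memory_gib=4096)
hypermodel = Workload("hypermodel", memory_gib=3000)
print(can_host(pool, hypermodel))  # True: the static template matches

# After maintenance, every remaining node carries the same ephemeral property.
nodes_after_maintenance = [{"maintenance-drain"}, {"maintenance-drain"}]
learned = absorb_common_restrictions(pool, nodes_after_maintenance)
print(can_host(learned, hypermodel))  # False: the pool is wrongly ruled out
```

In this sketch the pool has enough memory in both cases. Only the learned restriction changes the answer, which mirrors how a property present on every node could cause a healthy nodepool to be rejected.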

Manually scaling up the nodepool did two things. It met the capacity needs, so HyperModel resources could run. It also corrected the node properties that the Autoscaler held for this nodepool.

To prevent this issue from happening again, we are updating our downtime maintenance scaling procedures. We are adding more post-downtime health checks across nodepools to make sure all pools are functioning correctly. We are also shortening the Autoscaler's learning and retention time. If the Autoscaler absorbs an incorrect configuration in the future, it will recheck more often and correct it without manual intervention.
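
The effect of a shorter retention time can be sketched as a small time-limited cache. This is an illustration under assumptions, not the Autoscaler's real mechanism: the class name `LearnedProperties`, the 600-second retention value, and the property name are hypothetical.

```python
from typing import Callable


class LearnedProperties:
    """Caches properties learned from live nodes and re-learns them once the
    retention period has elapsed (hypothetical model)."""

    def __init__(
        self,
        retention_seconds: float,
        observe: Callable[[], frozenset[str]],
        clock: Callable[[], float],
    ):
        self._retention = retention_seconds
        self._observe = observe  # reads properties from the current nodes
        self._clock = clock      # injectable so the behaviour can be tested
        self._value = None
        self._learned_at = 0.0

    def get(self) -> frozenset[str]:
        now = self._clock()
        expired = now - self._learned_at >= self._retention
        if self._value is None or expired:
            self._value = self._observe()
            self._learned_at = now
        return self._value


# Simulated node state and clock, for illustration only.
node_state = {"props": frozenset({"maintenance-drain"})}
now = [0.0]
cache = LearnedProperties(600.0, lambda: node_state["props"], lambda: now[0])

first = cache.get()               # learns the ephemeral property
node_state["props"] = frozenset() # nodes are healthy again
now[0] = 300.0
stale = cache.get()               # still the old value, retention not elapsed
now[0] = 600.0
fresh = cache.get()               # re-learned from the healthy nodes
```

The shorter the retention period, the sooner a wrongly learned property is replaced by what the nodes currently report, at the cost of more frequent lookups.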

We apologize for any impact this issue may have had on your business operations. We are continuously strengthening our systems and procedures to ensure we avoid future disruptions to your business.

If you have further questions or concerns, please visit our Support website. Thank you for your patience during this situation, and we appreciate your continued trust in Anaplan.

Posted Jul 25, 2025 - 08:48 UTC

Resolved

We have confirmed that the issue is now resolved.

We deeply apologize for any impact this issue may have caused. We appreciate your patience and partnership as we worked through this issue.

We will follow up within 7 business days with a detailed root cause analysis (RCA) that will be shared on our Status Page. If you have any questions or concerns, please do not hesitate to contact us at Anaplan Customer Care.

Posted Jul 20, 2025 - 02:50 UTC

Identified

We have identified the likely cause of the issue, and we are focused right now on restoring service as quickly as possible.

ESTIMATED TIME TO RESOLUTION: We estimate service will be restored in approximately 20 minutes.

*Please note the estimated time to resolution is an approximation based on information we have at this stage and is subject to change.

Posted Jul 20, 2025 - 02:40 UTC

Investigating

We are currently investigating an issue resulting in some customers not being able to load models.

We are working to resolve this issue as quickly as possible and will provide updates every 30 minutes or upon resolution.

Posted Jul 20, 2025 - 02:26 UTC
