On July 20, 2025, at approximately 00:45 UTC, we were notified of errors occurring when large workspaces were opened.
Our Support team investigated and identified an error with how the system was allocating resources from a specific category of nodepools used for large workspaces. Because of the error, very large workspaces (HyperModels) were unable to open. It was identified that this disruption impacted the region, us7: Anaplan Amazon Cloud Public — US.
A platform incident was declared at 01:58 UTC. Our technical teams joined the call and identified a configuration issue in our auto-scaling system. The configuration was manually updated, and full service was restored at 02:53 UTC.
We have completed a thorough investigation into the issue. A Cluster Autoscaler provisions on-demand capacity. The Autoscaler determines how much capacity is needed and grows a nodepool that can supply that capacity. To perform this matching, the Autoscaler needs to know the properties of the underlying nodepools. This is typically done through static lookups and pre-defined information, but the Autoscaler can also absorb details about other properties from the existing nodes within a nodepool.
At 21:48 UTC, we completed a downtime maintenance window. During the scale-down process as part of the maintenance window, an ephemeral property was added to all nodes in a specific nodepool. As this property was present on all nodes within the nodepool, the Autoscaler absorbed it. This made the matching process fail because the Autoscaler falsely reasoned that the nodepool couldn’t handle the capacity demands.
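The failure mode described above can be illustrated with a minimal sketch. It is not the Autoscaler's actual implementation: the class names, the "taint" terminology, the `maintenance-scale-down` property name, and the rule that a property found on every live node is treated as pool-wide are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodePool:
    """Hypothetical nodepool model; all field names are illustrative."""
    name: str
    memory_gib: int
    # Properties declared statically for the pool.
    static_taints: frozenset = frozenset()
    # Properties observed on each live node, one set per node.
    node_taints: list = field(default_factory=list)

    def learned_taints(self) -> frozenset:
        # Assumption: a property present on EVERY live node is absorbed
        # as if it were a permanent property of the whole pool.
        if not self.node_taints:
            return frozenset()
        return frozenset(set.intersection(*map(set, self.node_taints)))

    def template_taints(self) -> frozenset:
        return self.static_taints | self.learned_taints()


@dataclass
class Workload:
    memory_gib: int
    tolerations: frozenset = frozenset()


def can_host(pool: NodePool, workload: Workload) -> bool:
    """True if the pool's believed properties allow it to run the workload."""
    fits = workload.memory_gib <= pool.memory_gib
    tolerated = pool.template_taints() <= workload.tolerations
    return fits and tolerated


pool = NodePool("large-workspaces", memory_gib=1024, node_taints=[set(), set()])
hypermodel = Workload(memory_gib=768)
print(can_host(pool, hypermodel))  # True: the pool matches

# An ephemeral property lands on all nodes during scale-down.
pool.node_taints = [{"maintenance-scale-down"}, {"maintenance-scale-down"}]
print(can_host(pool, hypermodel))  # False: the absorbed property blocks matching

# A manually added node without the property breaks the "all nodes" condition.
pool.node_taints.append(set())
print(can_host(pool, hypermodel))  # True: matching works again
```

In this sketch the pool itself never changes size or hardware; only the Autoscaler's belief about it changes, which is why the matching failure looks like a capacity shortfall.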
Manually scaling up the nodepool did two things. It met the capacity needs, so HyperModel resources could run. It also corrected the node properties recorded for this nodepool within the Autoscaler.
To prevent this issue from happening again, we are updating our downtime maintenance scaling procedures. We are adding more post-downtime health checks across nodepools to make sure all pools are functioning correctly. We are also shortening the Autoscaler's learning and retention time. If the Autoscaler absorbs a wrong configuration in the future, it will recheck more often and correct the configuration itself, without manual intervention.
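The effect of a shorter retention time can be sketched as a cache with a time-to-live. This is only an illustration under assumptions: the `LearnedPropertyCache` class, its `ttl_seconds` parameter, and the `observe` callback are hypothetical names, and the real Autoscaler settings are not described in this report.

```python
import time


class LearnedPropertyCache:
    """Hypothetical cache of node properties learned per nodepool."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # pool name -> (time learned, properties)

    def get(self, pool_name, observe):
        """Return cached properties, re-observing the pool once they expire."""
        now = self._clock()
        entry = self._entries.get(pool_name)
        if entry is None or now - entry[0] >= self.ttl_seconds:
            # Retention period elapsed: discard the old belief and relearn.
            entry = (now, observe())
            self._entries[pool_name] = entry
        return entry[1]
```

With a long `ttl_seconds`, a wrongly absorbed property keeps being served from the cache until someone intervenes. With a short one, the next lookup after expiry calls `observe` again and picks up the current state of the nodes.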
We apologize for any impact this issue may have had on your business operations. We are continuously strengthening our systems and procedures to ensure we avoid future disruptions to your business.
If you have further questions or concerns, please visit our Support website. Thank you for your patience during this situation, and we appreciate your continued trust in Anaplan.