WagedInstanceCapacity Null Pointer Exception due to stale _instanceCapacityMap #2891

Open
GrantPSpencer opened this issue Aug 24, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@GrantPSpencer
Contributor

Describe the bug

The WAGED pipeline fails with an NPE during BestPossibleStateCalcStage because it calls checkAndReduceInstanceCapacity on an instance that is not in WagedInstanceCapacity's _instanceCapacityMap. This occurs when the WagedInstanceCapacity is calculated at point A, a new instance is added at point B, and then at point C the WagedInstanceCapacity is not refreshed to include this instance during the CurrentStateComputationStage. The specific circumstances are detailed below.

To Reproduce

  1. Add at least 1 WAGED-enabled resource to a cluster and rebalance so assignments are made.
  2. Drop all resources from the cluster.
  3. Add a new instance ("new_instance") to the cluster.
  4. Add 1 WAGED-enabled resource to the cluster.
  5. The NPE will occur.

This occurs because "new_instance" is an assignable instance and appears in the newly calculated preference list, so checkAndReduceInstanceCapacity is called on it. However, WagedInstanceCapacity's _instanceCapacityMap has not been updated and therefore holds a stale view that does not include "new_instance".
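The failure mode can be illustrated with a minimal sketch. This is not the Helix implementation: the class `StaleCapacitySketch`, the capacity key `"CU"`, and the map layout are assumptions made for illustration only. It shows how a lookup for an instance that joined after the map was built returns null and then fails on dereference.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for WagedInstanceCapacity; NOT the Helix implementation.
class StaleCapacitySketch {
  // instance name -> (capacity key -> remaining capacity)
  private final Map<String, Map<String, Integer>> instanceCapacityMap = new HashMap<>();

  // Point A: the map is built from the instances known at that moment.
  StaleCapacitySketch(List<String> instancesAtBuildTime) {
    for (String instance : instancesAtBuildTime) {
      Map<String, Integer> capacity = new HashMap<>();
      capacity.put("CU", 100); // "CU" is an illustrative capacity key
      instanceCapacityMap.put(instance, capacity);
    }
  }

  // Point C: called for every instance in the new preference list.
  boolean checkAndReduceInstanceCapacity(String instance, String key, int weight) {
    // Returns null when the instance was added after the map was built.
    Map<String, Integer> capacity = instanceCapacityMap.get(instance);
    // Throws NullPointerException when capacity is null.
    int remaining = capacity.get(key);
    if (remaining < weight) {
      return false;
    }
    capacity.put(key, remaining - weight);
    return true;
  }

  public static void main(String[] args) {
    StaleCapacitySketch cached = new StaleCapacitySketch(List.of("old_instance"));
    // Point B: "new_instance" joins the cluster, but the cached map is not rebuilt.
    System.out.println(cached.checkAndReduceInstanceCapacity("old_instance", "CU", 10));
    try {
      cached.checkAndReduceInstanceCapacity("new_instance", "CU", 10);
    } catch (NullPointerException e) {
      System.out.println("NPE for new_instance");
    }
  }
}
```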

This is because the skipCapacityCalculation method (a very effective optimization) causes the CurrentStateComputationStage to skip refreshing the cache when there are no resources in the resourceMap. However, the resourceMap is constructed from the IdealStates in the cluster, which do not exist yet at this point. When a resource is added, a ResourceConfigChange event fires first, followed by an IdealStateChange. So when a new resource is added, the CurrentStateComputationStage does not recalculate the WagedInstanceCapacity on the ResourceConfigChange because the resourceMap is still empty, and it does not recalculate on the subsequent IdealStateChange either.
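The event ordering described above can be modeled roughly as follows. This is a simplified sketch, not the actual skipCapacityCalculation code: the class, the enum, the boolean/int parameters, and the rule that an IdealStateChange alone does not trigger a recalculation are all assumptions inferred from the description in this issue.

```java
// Hypothetical model of the skip decision; NOT the Helix implementation.
class SkipCapacityCalculationSketch {
  // Illustrative subset of the event types named in this issue.
  enum EventType { INSTANCE_CONFIG_CHANGE, RESOURCE_CONFIG_CHANGE, IDEAL_STATE_CHANGE }

  // Returns true when the capacity recalculation would be skipped.
  static boolean skipCapacityCalculation(boolean hasCachedCapacity, int resourceCount,
      EventType event) {
    if (resourceCount == 0) {
      return true; // no resources derived from IdealStates: nothing to compute
    }
    if (!hasCachedCapacity) {
      return false; // nothing cached yet: must compute
    }
    // Assumption: an IdealStateChange on its own does not trigger a recalculation.
    return event == EventType.IDEAL_STATE_CHANGE;
  }

  // Replays a sequence of events and reports whether any of them recalculated.
  static boolean anyRecalculation(boolean hasCachedCapacity, int[] resourceCounts,
      EventType[] events) {
    for (int i = 0; i < events.length; i++) {
      if (!skipCapacityCalculation(hasCachedCapacity, resourceCounts[i], events[i])) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    EventType[] events = {
        EventType.INSTANCE_CONFIG_CHANGE, // new_instance added, no resources exist
        EventType.RESOURCE_CONFIG_CHANGE, // resource added, IdealState not yet visible
        EventType.IDEAL_STATE_CHANGE      // IdealState now visible
    };
    int[] resourceCounts = {0, 0, 1};
    // Existing cluster with a capacity cached at point A: every event is skipped.
    System.out.println(anyRecalculation(true, resourceCounts, events));
    // Fresh cluster with nothing cached: the IdealStateChange forces a computation.
    System.out.println(anyRecalculation(false, resourceCounts, events));
  }
}
```

Under these assumptions, the stale cache survives all three events in the existing-cluster case, which matches the sequence reported in this issue.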

Adding a WAGED resource to a new cluster does not trigger this NPE because no WagedInstanceCapacity has been computed yet, so the check

    if (Objects.isNull(cache.getWagedInstanceCapacity())) {
      return false;
    }

will force it to be refreshed.

GrantPSpencer#32
The test case in this draft PR follows the steps outlined above and fails on master.

Expected behavior

When a new resource is added, WagedInstanceCapacity should be recalculated before the BestPossibleStateCalcStage runs.
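One possible way to detect the stale state, shown only as a sketch and not as the project's chosen fix, is to compare the instances covered by the cached capacity against the currently assignable instances and recalculate when any are missing. The class and method names below are hypothetical.

```java
import java.util.Set;

// Hypothetical staleness check; NOT part of Helix.
class CapacityStalenessCheckSketch {
  // True when at least one assignable instance is missing from the cached capacity view.
  static boolean isStale(Set<String> cachedInstances, Set<String> assignableInstances) {
    return !cachedInstances.containsAll(assignableInstances);
  }

  public static void main(String[] args) {
    Set<String> cached = Set.of("old_instance");
    Set<String> assignable = Set.of("old_instance", "new_instance");
    // The cached view lacks new_instance, so a recalculation would be needed.
    System.out.println(isStale(cached, assignable));
  }
}
```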

Additional context

10539 [HelixController-pipeline-default-TestWagedNPE_cluster-(45df0f8d_DEFAULT)] ERROR org.apache.helix.controller.GenericHelixController [] - Exception while executing DEFAULT pipeline for cluster TestWagedNPE_cluster. Will not continue to next pipeline
java.lang.NullPointerException: null
	at org.apache.helix.controller.rebalancer.waged.WagedInstanceCapacity.checkAndReduceInstanceCapacity(WagedInstanceCapacity.java:206) ~[classes/:?]
	at org.apache.helix.controller.dataproviders.ResourceControllerDataProvider.checkAndReduceCapacity(ResourceControllerDataProvider.java:535) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossibleStateForPartition(DelayedAutoRebalancer.java:377) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossiblePartitionState(DelayedAutoRebalancer.java:271) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossiblePartitionState(DelayedAutoRebalancer.java:54) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.lambda$computeNewIdealStates$0(WagedRebalancer.java:281) ~[classes/:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
	at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1692) ~[?:?]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290) ~[?:?]
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746) ~[?:?]
	at java.util.concurrent.ForkJoinTask.doExec$$$capture(ForkJoinTask.java:290) ~[?:?]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java) ~[?:?]
	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:408) ~[?:?]
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:736) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173) ~[?:?]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) ~[?:?]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) ~[?:?]
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:661) ~[?:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:277) ~[classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:445) ~[classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:289) ~[classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:94) ~[classes/:?]
	at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[classes/:?]
	at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [classes/:?]
	at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [classes/:?]
@GrantPSpencer GrantPSpencer added the bug Something isn't working label Aug 24, 2024