Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[controller] Helix resources should be removed after a VPJ kill job #1069

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ishwarya-citro
Copy link
Collaborator

@ishwarya-citro ishwarya-citro commented Jul 17, 2024

Summary, imperative, start upper case, don't end with a period

When VPJ initiates the kill-job command, the kill-job commands didn't remove the Helix resource. Added logic to include the helix resource deletion whenever a VPJ fails.

Everytime a VPJ runs the killOfflinePush job, it does not remove the helix resources. As part of the job, it calls the killOfflinePush but does not call the method to remove the helix resource.

Changes follow an example of it doing the killOfflinePush & removing helix resource correctly like here and here.

How was this PR tested?

unit tests

Does this PR introduce any user-facing changes?

  • [X ] No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

@nisargthakkar nisargthakkar changed the title [controller] Helix resources should be removed after a VJP kill job [controller] Helix resources should be removed after a VPJ kill job Jul 17, 2024
@nisargthakkar
Copy link
Contributor

Hi @ishwarya-citro. We should not include non-public information in open-source PRs. Can you remove the ticket reference?

@@ -6526,6 +6524,11 @@ public void killOfflinePush(String clusterName, String kafkaTopic, boolean isFor
LOGGER.info("Resource: {} doesn't exist, kill job will be skipped", kafkaTopic);
return;
}

deleteHelixResource(clusterName, kafkaTopic);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a harmful change because it deletes helix resources without validating if it's not the current version, etc. Can you share some more details about the issue that was noticed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everytime a VPJ runs the killOfflinePush job, it does not remove the helix resources. As part of the job, it calls the killOfflinePush but does not call the method to remove the helix resource.

Example of it doing the killOfflinePush & removing helix resource correctly : https://github.com/linkedin/venice/blob/main/services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java#L3276

https://github.com/linkedin/venice/blob/main/services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java#L3274

I see that we do have version validates & i will add this to the deleteHelixResource()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to query the child controllers to validate none of them are using the killjob topic as the current version.
Also I am thinking what is the harm in keeping the helix resources intact till the time when the store versions are deleted (which happens later in some background thread).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@majisourav99 made the changes requested. Pl lmk if this looks ok.

also, in another location, it doesn't keep helix resource intact until store versions are deleted. Is it ok to follow the same here?

@@ -3269,9 +3269,6 @@ private void deleteOneStoreVersion(String clusterName, String storeName, int ver
return;
}
String resourceName = Version.composeKafkaTopic(storeName, versionNumber);
LOGGER.info("Deleting helix resource: {} in cluster: {}", resourceName, clusterName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not remove this as kill pushjob could be called without calling delete store version.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the killOfflinePush will delete the helix resourced. So, this is not removed but moved

Copy link
Contributor

@majisourav99 majisourav99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not add internal jira links in open source PR.


String storeName = Version.parseStoreFromKafkaTopicName(kafkaTopic);
int versionNumber = Version.parseVersionFromKafkaTopicName(kafkaTopic);
Map<String, ControllerClient> controllerClients = getControllerClientMap(clusterName);
Copy link
Contributor

@majisourav99 majisourav99 Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to parent helix controller and skill checking the current version in parent as parent controller store current version is not properly maintained. Or check child controller local store current version and skip deleting those resources.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants