-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[controller] Helix resources should be removed after a VPJ kill job #1069
base: main
Are you sure you want to change the base?
[controller] Helix resources should be removed after a VPJ kill job #1069
Conversation
Hi @ishwarya-citro. We should not include non-public information in open-source PRs. Can you remove the ticket reference? |
@@ -6526,6 +6524,11 @@ public void killOfflinePush(String clusterName, String kafkaTopic, boolean isFor | |||
LOGGER.info("Resource: {} doesn't exist, kill job will be skipped", kafkaTopic); | |||
return; | |||
} | |||
|
|||
deleteHelixResource(clusterName, kafkaTopic); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a harmful change because it deletes helix resources without validating if it's not the current version, etc. Can you share some more details about the issue that was noticed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everytime a VPJ runs the killOfflinePush
job, it does not remove the helix resources. As part of the job, it calls the killOfflinePush but does not call the method to remove the helix resource.
Example of it doing the killOfflinePush & removing helix resource correctly : https://github.com/linkedin/venice/blob/main/services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java#L3276
I see that we do have version validates & i will add this to the deleteHelixResource()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to query the child controllers to validate none of them are using the killjob topic as the current version.
Also I am thinking what is the harm in keeping the helix resources intact till the time when the store versions are deleted (which happens later in some background thread).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majisourav99 made the changes requested. Pl lmk if this looks ok.
also, in another location, it doesn't keep helix resource intact until store versions are deleted. Is it ok to follow the same here?
@@ -3269,9 +3269,6 @@ private void deleteOneStoreVersion(String clusterName, String storeName, int ver | |||
return; | |||
} | |||
String resourceName = Version.composeKafkaTopic(storeName, versionNumber); | |||
LOGGER.info("Deleting helix resource: {} in cluster: {}", resourceName, clusterName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not remove this as kill pushjob could be called without calling delete store version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the killOfflinePush
will delete the helix resourced. So, this is not removed but moved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not add internal jira links in open source PR.
|
||
String storeName = Version.parseStoreFromKafkaTopicName(kafkaTopic); | ||
int versionNumber = Version.parseVersionFromKafkaTopicName(kafkaTopic); | ||
Map<String, ControllerClient> controllerClients = getControllerClientMap(clusterName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to parent helix controller and skill checking the current version in parent as parent controller store current version is not properly maintained. Or check child controller local store current version and skip deleting those resources.
Summary, imperative, start upper case, don't end with a period
When VPJ initiates the kill-job command, the kill-job commands didn't remove the Helix resource. Added logic to include the helix resource deletion whenever a VPJ fails.
Everytime a VPJ runs the
killOfflinePush
job, it does not remove the helix resources. As part of the job, it calls the killOfflinePush but does not call the method to remove the helix resource.Changes follow an example of it doing the killOfflinePush & removing helix resource correctly like here and here.
How was this PR tested?
unit tests
Does this PR introduce any user-facing changes?