Killing a Task in the Scheduler does not remove the container #19

Open
tobwiens opened this issue Jul 8, 2016 · 4 comments

@tobwiens
Contributor

tobwiens commented Jul 8, 2016

Killing a docker-compose task from the scheduler interface leaves the container running.

@lpellegr

This issue is still present in 7.19.1. It looks really important to me.

@tobwiens
Contributor Author

Here is a PR which dealt with the issue: #21

The problem is that the task (a Java executable) controls and maintains the container. ProActive can kill the task before it has had time to tell Docker to remove the container.

In the PR above, I stop and remove the containers with the docker-compose down command.

So what happens is that the task is killed, the thread is interrupted, and it starts the docker-compose down command. Within the first milliseconds of that command, the thread is interrupted again. At some point there is no way around it and the thread is killed.
The docker-compose down command itself keeps running, though, and stops and removes the containers.

But there are many cases where, e.g. on a slow node, the interrupts arrive faster than the Java code can reach the point where it executes the docker-compose down command.
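
To make the race a bit more concrete, here is a minimal sketch of a cleanup step that tolerates interrupts while it launches and waits for a command. It is not the code from the PR: the class and method names are made up for illustration, and it assumes the thread is only interrupted, not forcibly killed.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper, not part of ProActive or of the PR mentioned above.
final class InterruptTolerantCleanup {

    /**
     * Starts the given command and waits for it to finish, even if the
     * calling thread is interrupted before or while waiting. The interrupt
     * status is restored before returning so callers can still see it.
     */
    static int runToCompletion(List<String> command) throws IOException {
        // Clear a pending interrupt so waitFor() does not fail immediately.
        boolean interrupted = Thread.interrupted();
        Process process = new ProcessBuilder(command).inheritIO().start();
        try {
            while (true) {
                try {
                    return process.waitFor();
                } catch (InterruptedException e) {
                    // Remember the interrupt and keep waiting for the cleanup.
                    interrupted = true;
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

A task could call something like `InterruptTolerantCleanup.runToCompletion(Arrays.asList("docker-compose", "down"))` from its interrupt handler. This only narrows the window: if the thread is killed before the process is started, nothing runs the cleanup.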

I submitted a PR into the Scheduler, which gives the thread (on the node) a cool-off period before it is interrupted aggressively: https://github.com/ow2-proactive/scheduling/pull/2611/files
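
The general idea of a cool-off period can be sketched as follows. This is not the code from the Scheduler PR, just the pattern as I understand it: interrupt once, give the thread time to clean up, and only then start interrupting repeatedly. All names and parameters are hypothetical.

```java
// Hypothetical sketch of "polite first, aggressive later" thread stopping.
final class GracefulInterrupter {

    /**
     * Interrupts the worker once, waits up to coolOffMillis for it to finish
     * its cleanup, then re-interrupts up to maxRetries times, waiting
     * retryMillis after each attempt. Returns true if the worker has ended.
     */
    static boolean stop(Thread worker, long coolOffMillis, long retryMillis, int maxRetries) {
        try {
            worker.interrupt();
            if (coolOffMillis > 0) {
                // join(0) would wait forever, so only join for positive values.
                worker.join(coolOffMillis);
            }
            for (int i = 0; i < maxRetries && worker.isAlive(); i++) {
                worker.interrupt();
                worker.join(Math.max(1L, retryMillis));
            }
        } catch (InterruptedException e) {
            // The caller itself was interrupted: stop waiting, keep the flag.
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }
}
```

With a long enough cool-off, a cleanup such as docker-compose down gets started before the second interrupt arrives. If the cleanup takes longer than the cool-off, the race is back.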

Both PRs reduce the probability of a container being left running, but they cannot guarantee its removal under all circumstances.

@lpellegr I hope the above explains the issue a little. The situation has improved immensely, maybe even enough.
If we make the timeouts, or at least the one important timeout, configurable (https://github.com/ow2-proactive/scheduling/pull/2611/files), then we can configure the Scheduler to be very lenient for Docker setups.
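
One possible shape for such a configurable timeout is a JVM system property with a fallback. The property name and the default value below are invented for this sketch. They are not existing Scheduler settings.

```java
// Hypothetical configuration helper; name and default are assumptions.
final class KillTimeoutConfig {

    static final String PROPERTY = "example.task.kill.cooloff.ms"; // made-up property name
    static final long DEFAULT_MS = 1000L;                          // made-up default

    /** Returns the configured cool-off in milliseconds, or the default if unset or invalid. */
    static long coolOffMillis() {
        String raw = System.getProperty(PROPERTY);
        if (raw == null) {
            return DEFAULT_MS;
        }
        try {
            long value = Long.parseLong(raw.trim());
            return value >= 0 ? value : DEFAULT_MS;
        } catch (NumberFormatException e) {
            return DEFAULT_MS;
        }
    }
}
```

A Docker-heavy installation could then start its nodes with something like `-Dexample.task.kill.cooloff.ms=30000`, while other setups keep the short default.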

@lpellegr

Thanks a lot, Tobias, for the explanations.

I ran a really simple test with a container that runs a single command: sleep 1000. After a few seconds, I tried to kill the Task from the Scheduler portal, but the container kept running. I made 5 attempts on 2 different environments and always got the same behaviour.

What you describe looks like a synchronisation issue. Is it not possible to apply specific logic for the kill, based on the script type?

Reducing the likelihood does not seem enough to me. Random behaviour is a nightmare for users and developers.

@tobwiens
Contributor Author

Yeah, I agree, leaving it to probabilities is not good enough and needs to be dealt with.

Sure, we could possibly write custom TaskKillers or something similar.
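
As a rough sketch of what a kill strategy per script type could look like: the interface and registry below are hypothetical and do not correspond to existing Scheduler classes. The point is only that the kill logic is looked up by script type, with a fallback for everything else.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical strategy interface: one way to kill a task of a given kind.
interface TaskKiller {
    void kill(String taskId);
}

// Hypothetical registry that picks a TaskKiller based on the script type.
final class TaskKillerRegistry {

    private final Map<String, TaskKiller> byScriptType = new HashMap<>();
    private final TaskKiller fallback;

    TaskKillerRegistry(TaskKiller fallback) {
        this.fallback = fallback;
    }

    void register(String scriptType, TaskKiller killer) {
        byScriptType.put(scriptType, killer);
    }

    /** Returns the killer registered for the script type, or the fallback. */
    TaskKiller forScriptType(String scriptType) {
        return byScriptType.getOrDefault(scriptType, fallback);
    }
}
```

A killer registered for docker-compose tasks could run docker-compose down itself and wait for it to finish, instead of relying on the interrupted task thread to do so. That would make the kill deterministic for that script type.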
