Killing a Task in the Scheduler does not remove the container #19

tobwiens · 2016-07-08T07:19:16Z

Killing a docker-compose task from the scheduler interface leaves the container running.

lpellegr · 2016-12-12T09:51:14Z

This issue is still present in 7.19.1. It looks really important to me.

tobwiens · 2016-12-12T11:03:17Z

Here is a PR which dealt with the issue: #21

The problem is that the task (a Java executable) controls and maintains the container. ProActive can kill the task before it has had time to tell Docker to remove the container.

In the PR above, I remove and stop containers with the docker-compose down command.

So what happens is this: the task is killed, and the thread is interrupted and starts the docker-compose down command. Within the first milliseconds of the docker-compose down command, the thread is interrupted again. Eventually the thread is killed outright.
By then the docker-compose down command is already running, so it continues and stops and removes the containers.

However, there are many cases where, e.g. on a slow node, the interrupts arrive before the Java code reaches the docker-compose down command.
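To make the race concrete, here is a minimal sketch (not the code from either PR) of one way to keep cleanup going while the calling thread is being interrupted: run the cleanup on a separate thread and keep joining it until it finishes. The class name `UninterruptibleCleanup` and the method `runUninterruptibly` are hypothetical, and the `Runnable` merely stands in for launching docker-compose down. This only tolerates interrupts; it does not help if the thread is forcibly stopped or the JVM is killed.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class UninterruptibleCleanup {

    /**
     * Runs the cleanup action on a separate thread and waits for it to finish,
     * even if the calling thread is interrupted while waiting.
     * Hypothetical helper, not part of ProActive.
     */
    public static void runUninterruptibly(Runnable cleanup) {
        Thread worker = new Thread(cleanup, "container-cleanup");
        worker.start();
        boolean interrupted = false;
        while (true) {
            try {
                worker.join();
                break;
            } catch (InterruptedException e) {
                // Remember the interrupt, but keep waiting for the cleanup.
                interrupted = true;
            }
        }
        if (interrupted) {
            // Restore the interrupt flag for the caller.
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        AtomicBoolean removed = new AtomicBoolean(false);
        // Simulate the kill arriving before cleanup has started.
        Thread.currentThread().interrupt();
        runUninterruptibly(() -> {
            try {
                Thread.sleep(100); // stands in for "docker-compose down"
            } catch (InterruptedException e) {
                return;
            }
            removed.set(true);
        });
        System.out.println("cleanup finished: " + removed.get());
    }
}
```

The interrupt targets the calling thread, not the worker, so the cleanup itself is never cut short by it.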

I submitted a PR into the Scheduler, which gives the thread (on the node) a cool-off period before it is interrupted aggressively: https://github.com/ow2-proactive/scheduling/pull/2611/files

Both PRs reduce the probability of a container being left running, but cannot guarantee its removal under all circumstances.

@lpellegr I hope the above explains the issue a little. The situation has improved immensely, maybe even enough.
If we make the timeouts, or at least the one important timeout, configurable (https://github.com/ow2-proactive/scheduling/pull/2611/files), we will be able to configure the Scheduler very leniently for Docker setups.
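As an illustration of what a configurable cool-off period could look like, here is a minimal sketch under stated assumptions. It is not the implementation in the Scheduler PR. The class `CoolOffKill`, the method `interruptWithCoolOff`, and the system property `kill.cooloff.millis` are all hypothetical names.

```java
public class CoolOffKill {

    /**
     * Interrupts the task thread once, then waits up to coolOffMillis for it
     * to finish its own cleanup. Returns true if the thread terminated in time.
     * Note: Thread.join(0) waits forever, so pass a positive value.
     */
    public static boolean interruptWithCoolOff(Thread task, long coolOffMillis) {
        task.interrupt();
        try {
            task.join(coolOffMillis);
        } catch (InterruptedException e) {
            // The killer itself was interrupted; restore the flag and report.
            Thread.currentThread().interrupt();
        }
        return !task.isAlive();
    }

    public static void main(String[] args) {
        Thread task = new Thread(() -> {
            try {
                Thread.sleep(60_000); // stands in for the running container task
            } catch (InterruptedException e) {
                // Cleanup would go here, e.g. launching "docker-compose down".
            }
        });
        task.start();
        // Hypothetical property name; the default of 5 seconds is an assumption.
        long coolOff = Long.getLong("kill.cooloff.millis", 5_000L);
        boolean finished = interruptWithCoolOff(task, coolOff);
        System.out.println("task finished within cool-off: " + finished);
    }
}
```

Only when this returns false would the caller fall back to interrupting more aggressively, so a longer cool-off gives slow nodes more time to reach the cleanup step.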

lpellegr · 2016-12-12T11:51:39Z

Thanks a lot for the explanations, Tobias.

I ran a really simple test with a container that runs a single command: sleep 1000. After a few seconds, I tried to kill the Task from the Scheduler portal, but the container kept running. I made 5 attempts on 2 different environments and always got the same behaviour.

What you describe looks like a synchronisation issue. Is it not possible to apply a specific logic for the kill, based on the script type?

Reducing the likelihood seems not enough to me. Random behaviour is a nightmare for users and developers.

tobwiens · 2016-12-13T07:17:00Z

Yeah, I agree, the probabilistic behaviour needs to be dealt with.

Sure, we could possibly write custom TaskKillers or something similar.

tobwiens added the type:bug label Jul 8, 2016
