
Memcpy-like duplication of a MAD-X instance #88

Open
chernals opened this issue Jun 6, 2021 · 7 comments

Comments

@chernals
Contributor

chernals commented Jun 6, 2021

I would like to find a way to efficiently copy a whole MAD-X instance. So far we have been "replaying" the full command log. This works decently when just reading sequences, but if there is a long matching it is painfully slow.

@coldfix Do you think that's feasible? Hard?

I'd be happy to contribute of course.

@rdemaria
Collaborator

rdemaria commented Jun 6, 2021

It is an interesting feature, @coldfix is the expert here.

One approach could be using fork on Linux, but I am not sure how easy that is in practice. One would need to fork the Python process running libmadx.

Another approach would be looping over globals, elements, and sequences. One would then miss the expanded sequences (which you can regenerate), the error definitions that were read and loaded, and the PTC universe (which can be regenerated). I am not sure if this is enough for you, but for the application I have in mind it looks reasonable.
Performance will not be great if done using the Madx class. Perhaps a set of functions implemented in libmadx could do the job more efficiently.

@coldfix
Member

coldfix commented Jun 6, 2021

I think Riccardo is correct in that you get something like this using POSIX fork semantics (which makes a copy-on-write clone of the current process as far as I understand). Consider this:

import os
import cpymad.libmadx as l

l.start()
l.input("option, -info;")
l.input("x = 2;")

if os.fork():
    print("In parent: x =", l.eval("x"))
    l.input("x = 3;")
    print("In parent: x =", l.eval("x"))
else:
    print("In child: x =", l.eval("x"))
    l.input("x = 4;")
    print("In child: x =", l.eval("x"))

which outputs:

...
In parent: x = 2.0
In parent: x = 3.0
In child: x = 2.0
In child: x = 4.0

(the parent/child outputs could of course be interleaved differently, but should always show the same values)

On Windows, we would need to copy stuff over manually, which will most likely go wrong. In that case it's probably better to just stick to the current best practice, which is to put the responsibility on the user, who knows best how to execute sequence definitions and other setup calls and then load knob values as e.g. globals. This should be sufficiently fast for medium-sized sequences.

But we can definitely think about adding a Madx.fork() method for use on Linux. However, this will still be tricky, because it means we have to replace the IPC pipe with a different object and somehow relay the new channel back to the parent process. Not sure how to do that yet, but it might be possible with some tricks.

@coldfix
Member

coldfix commented Jun 7, 2021

FYI, I looked around a bit, and found there is socket.send_fds with which we can exchange new pipe descriptors for communication with the forked process. This requires creating a socketpair() with AF_UNIX for each madx subprocess. Currently, the IPC in minrpc is based on os.pipe(), so we either have to replace the pipe channel by sockets, or just create a spare socketpair just for the sake of file descriptor exchange. Before finalizing on either option, we should do at least basic benchmarks (I marginally remember doing some benchmarks on sockets vs pipes a long time ago and finding sockets less reliable, but this may have changed).
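A minimal stdlib-only sketch of the descriptor exchange described above (the socketpair and the fresh pipe are stand-ins for minrpc internals, not actual minrpc code; requires Python 3.9+ for socket.send_fds and a Unix platform):

```python
import os
import socket

# File descriptors can only be passed over AF_UNIX sockets, hence a socketpair().
parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# A fresh pipe, standing in for the new IPC channel of a forked process.
r, w = os.pipe()

# Ship the pipe's read end across the socket together with a short message.
socket.send_fds(parent_sock, [b"new-channel"], [r])

# The other side receives a duplicate of that descriptor, usable directly.
msg, fds, _flags, _addr = socket.recv_fds(child_sock, 1024, 1)
r_dup = fds[0]

os.write(w, b"hello")
data = os.read(r_dup, 5)
print(data)  # b'hello'
```

The same pattern would let a freshly forked Madx process hand one end of a new channel back to the parent.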

Note that the stdout/stderr of forked MAD-X processes will be garbled up unless we also take care of setting up these streams in the forked process. But once we have setup routines for that, it might be easier to fully replace minrpc with the functionality from the multiprocessing package, which also has the capability of exchanging file descriptors. The main issue that is to be solved is redirecting or setting up stdout/stderr of new processes (which multiprocessing doesn't do by default).
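For comparison, multiprocessing already ships descriptor passing in multiprocessing.reduction (send_handle/recv_handle). A minimal sketch, assuming a duplex Pipe(), which is backed by a socketpair on Unix and can therefore carry file descriptors (the pipe here is again only a stand-in for a new IPC channel):

```python
import multiprocessing as mp
import os
from multiprocessing.reduction import send_handle, recv_handle

# A duplex Pipe() is socket-based on Unix, so it can transport fds.
parent_conn, child_conn = mp.Pipe()

r, w = os.pipe()                          # stand-in for a new IPC channel
send_handle(parent_conn, r, os.getpid())  # destination pid: here, ourselves
r_dup = recv_handle(child_conn)

os.write(w, b"ok")
data = os.read(r_dup, 2)
print(data)  # b'ok'
```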

@chernals
Contributor Author

@coldfix @rdemaria Thanks for your quick reaction. It looks like we can find a way forward. I'll contribute time to this during the summer (@Oskari-Tuormaa will also join us and should be willing to contribute).

@Oskari-Tuormaa
Contributor

Oskari-Tuormaa commented Jul 19, 2021

I've been trying my hand at implementing this for a bit, initially implementing the functionality purely in minrpc by forking Service/Client pairs; you can see the progress in my forked repo here. I've written a little test program to demonstrate that it works:

from minrpc.client import Client

svc1, proc = Client.spawn_subprocess()

mad1 = svc1.get_module("cpymad.libmadx")
mad1.start()
mad1.input("option, -info")

mad1.input("x = 2")

# Fork the first client
svc2 = Client.fork_client(svc1)
mad2 = svc2.get_module("cpymad.libmadx")

print("MAD1, x =", mad1.eval("x"))
mad1.input("x = 3")
print("MAD1, x =", mad1.eval("x"))

print("MAD2, x =", mad2.eval("x"))
mad2.input("x = 4")
print("MAD2, x =", mad2.eval("x"))

mad1.finish()
mad2.finish()

Which gives the output:

...
MAD1, x = 2.0
MAD1, x = 3.0
MAD2, x = 2.0
MAD2, x = 4.0
...

I however have a couple of problems/concerns that I don't quite know how to tackle:

  1. I'm not sure how to update the internal _proc variable of the forked Client object. Currently the _proc variable is identical in the forked and the original Client objects.
  2. Currently I've implemented the spare socketpair as a sort of "singleton": a single socketpair is created in the Client.spawn_subprocess method, and each Client/Service pair forked from this original Client instance shares that single socketpair. I imagine this can work as long as we assume that only a single Client/Service pair will be forking at any one time.

@coldfix
Member

coldfix commented Jul 22, 2021

looks very promising!

  1. About the _proc object: It's only ever used to call wait() on it, and maybe to hold the PID in case someone is interested. It should be fine to get the child's PID and implement some sort of lightweight Process class with a single wait() method that just calls os.waitid(os.P_PID, self.pid, os.WEXITED) or similar.
  2. that's fine for start. In the long term it would of course be nicer if a new socketpair is created as well, or if it even fully replaces the pipe for communication.
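A sketch of the lightweight process handle described in point 1 (illustrative only; LightweightProcess is a made-up name, not existing minrpc API; POSIX-only because of os.fork and os.waitid):

```python
import os

class LightweightProcess:
    """Minimal stand-in for the _proc handle of a forked Client.

    Holds only the child's PID and offers a single wait() method.
    """
    def __init__(self, pid):
        self.pid = pid

    def wait(self):
        # Reap the child and return its exit status information.
        return os.waitid(os.P_PID, self.pid, os.WEXITED)

pid = os.fork()
if pid == 0:
    os._exit(7)           # child: exit immediately with a known status
proc = LightweightProcess(pid)
info = proc.wait()
print(info.si_status)     # 7
```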

@Oskari-Tuormaa
Contributor

Further development on the forking epic: I've tried fully replacing the pipes with sockets, by replacing the Connection class with a SerializedSocket class. This class has mostly the same methods, except it also has the methods send_fd and recv_fd for sending and receiving a file descriptor. Whenever a Service forks, it receives a new file descriptor before calling os.fork, after which it either ignores the new file descriptor or opens a new socket connection on it, depending on the return value of os.fork. This seems to be pretty stable throughout my tests, since the new file descriptor is sent and received before forking.

> About the _proc object: It's only ever used to call wait() on it, and maybe to hold the PID in case someone is interested. It should be fine to get the child's PID and implement some sort of lightweight Process class with a single wait() method that just calls os.waitid(os.P_PID, self.pid, os.WEXITED) or similar.

I've also implemented a lightweight class DummyProcess, as you described.

As per usual however, I've run into a new problem:

I can create and destroy many forks of the Madx instance without problems, starting up batches of around 16 threads to do parallelized tracking, but at some point all the processes always freeze. It's been hard to figure out the exact reason, or even the code location where the freeze happens, but the last command sent to the remote service always seems to be Twiss;. When debugging the remote end, the command seems to be received, and the call into libmadx seems to be where it hangs. I haven't been able to extract more information than this, unfortunately, and I haven't succeeded in creating a minimal example either.
