Configure to run ITs on remote nodes #7

Open
ieguinoa opened this issue Apr 29, 2020 · 6 comments

ieguinoa commented Apr 29, 2020

Hi all,

I'm trying to configure the proxy (ie2 branch).
I've used the Ansible role to deploy it on the Galaxy head only, and the service seems to start fine and register the running ITs without issue, or at least it's not showing any errors. But when I try to connect to a launched IT, I get this error in the gie-proxy:

Apr 29 11:43:46 vgcn-galaxy-head.usegalaxy.be.novalocal node[2421162]: Proxy error:  { Error: connect ECONNREFUSED 127.0.0.1:32770
Apr 29 11:43:46 vgcn-galaxy-head.usegalaxy.be.novalocal node[2421162]: at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1113:14)
Apr 29 11:43:46 vgcn-galaxy-head.usegalaxy.be.novalocal node[2421162]: errno: 'ECONNREFUSED',
Apr 29 11:43:46 vgcn-galaxy-head.usegalaxy.be.novalocal node[2421162]: code: 'ECONNREFUSED',
Apr 29 11:43:46 vgcn-galaxy-head.usegalaxy.be.novalocal node[2421162]: syscall: 'connect',
Apr 29 11:43:46 vgcn-galaxy-head.usegalaxy.be.novalocal node[2421162]: address: '127.0.0.1',
Apr 29 11:43:46 vgcn-galaxy-head.usegalaxy.be.novalocal node[2421162]: port: 32770 }

It seems like it's trying to redirect the connection to that specific port on localhost, not on the compute node that is actually running the Docker container (the IT jobs are queued by Condor and run on separate nodes).
Am I missing something in the proxy configuration? Does it need to be run on all the compute nodes too? Or could this just be a connection error with the host?

Something else related to this: is there a way to restrict the range of ports used by the proxy to forward connections to the containers?
I would need to configure ingress rules on the VMs acting as compute nodes, and I'm not sure which range of ports is being used.

Cheers,
Ignacio

@hexylena
Member

It's not in the proxy, it's in Galaxy. https://github.com/galaxyproject/galaxy/blob/release_20.01/lib/galaxy_ext/container_monitor/monitor.py is run on the node and figures out the IP of the host that the request should be routed to.

This patch might be needed for you: galaxyproject/galaxy#9353

is there a way to restrict the range of the ports used

The potential ports that containers attach to are defined by Docker's configuration.

@ieguinoa
Author

Thanks a lot for the quick reply @hexylena.
I'll take a look at those and see if I can get it working.

Thanks again,
Ignacio

@ieguinoa
Author

Hi again @hexylena

I made some progress by adding that patch. I can launch and render the Jupyter tool, but the kernel never finishes loading, and when I try to run something it just says "reconnecting kernel" (top right corner of the notebook).
I got this error in the nginx log:

2020/04/29 18:41:16 [error] 2378394#2378394: *5144 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 157.193.22.28, server: *.interactivetool.usegalaxy.be, request: "GET /ipython/api/kernels/16b931aa-b0ea-4f0a-839d-f6eea80ad332/channels?session_id=73eec890-0abd-419b-8d96-b9fe3c57f2a1 HTTP/1.1", upstream: "http://127.0.0.1:8000/ipython/api/kernels/16b931aa-b0ea-4f0a-839d-f6eea80ad332/channels?session_id=73eec890-0abd-419b-8d96-b9fe3c57f2a1", host: "11d2c29b3b1bbf2c-fa1b4fc887a3459196c86cbf6aa27571.interactivetoolentrypoint.interactivetool.usegalaxy.be"

and in the gie-proxy service I get:

Apr 29 17:58:04 vgcn-galaxy-head.usegalaxy.be.novalocal node[2554210]: Error in handler for 127.0.0.1:8000 GET /ipython/api/kernels/cb33c7b0-7b10-4f3a-a0c4-9fd5ea51ca3e/channels?session_id=ff53965a-c1df-4
Apr 29 17:58:04 vgcn-galaxy-head.usegalaxy.be.novalocal node[2554210]: at ProxyServer.<anonymous> (/srv/galaxy/gie-proxy/proxy/node_modules/http-proxy/lib/http-proxy/index.js:69:35)
Apr 29 17:58:04 vgcn-galaxy-head.usegalaxy.be.novalocal node[2554210]: at DynamicProxy.handleProxyRequest (/srv/galaxy/gie-proxy/proxy/lib/proxy.js:121:16)
Apr 29 17:58:04 vgcn-galaxy-head.usegalaxy.be.novalocal node[2554210]: at Server.<anonymous> (/srv/galaxy/gie-proxy/proxy/lib/proxy.js:23:32)
Apr 29 17:58:04 vgcn-galaxy-head.usegalaxy.be.novalocal node[2554210]: at Server.emit (events.js:182:13)
Apr 29 17:58:04 vgcn-galaxy-head.usegalaxy.be.novalocal node[2554210]: at parserOnIncoming (_http_server.js:652:12)
Apr 29 17:58:04 vgcn-galaxy-head.usegalaxy.be.novalocal node[2554210]: at HTTPParser.parserOnHeadersComplete (_http_common.js:109:17)

So it seems nginx is correctly forwarding the request to the gie-proxy serving at 127.0.0.1:8000, but it times out.
I haven't looked too much into this yet, but maybe you already have an idea of what it could be or where I should start looking?

cheers,
Ignacio

@hexylena
Member

Hiya,

Sounds like websockets. Do you have the upgrade header set appropriately?

I don't know if Galaxy has any examples, but Apollo's configuration here mixes HTTP and websocket requests and uses this nice snippet:
https://genomearchitect.readthedocs.io/en/latest/Configure.html#nginx-proxy-from-version-1-4-on
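
The core of that pattern is nginx's map for the Connection header, roughly like this (a sketch of the idiom, not copied verbatim from the Apollo docs; the proxy_pass target is just a placeholder for your setup):

    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    server {
        location / {
            proxy_pass http://localhost:8000;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
        }
    }

With the map, plain HTTP requests (no Upgrade header) get "Connection: close" toward the upstream while websocket requests get "Connection: upgrade", so one location block can serve both.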

You might add some console logging around that place in the JS; I'm not sure what the "Error in handler" would be from.

@ieguinoa
Author

Hey!
Thanks for the tips; I've been trying to narrow it down a bit more.
The nginx config seems to be correct, in line with all the configuration info I could find about nginx and websockets, including the Galaxy tutorial for Interactive Tools (https://training.galaxyproject.org/training-material/topics/admin/tutorials/interactive-tools/tutorial.html).

It looks like this:

    location / {
        proxy_pass http://localhost:8000;
        proxy_redirect off;
        proxy_http_version 1.1;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

One relevant thing: if I run an ipynb file and select to return after its run (no interactive browsing of Jupyter), then it works fine, returning the correct outputs for the cells. Although I'm not sure what this suggests.

I tried to debug by printing requests and headers from inside proxy.js, but there seems to be nothing strange. While running the Jupyter interactive tool I mostly get entries like:

May 11 20:03:29 vgcn-galaxy-head.usegalaxy.be.novalocal node[1591256]: PROXY fccad83dcc78127b-2233393b852040d6b0b20c7bf384e8b0.interactivetoolentrypoint.interactivetool.usegalaxy.be GET /ipython/api/kernels/39ad9dae-e27b-4d72-a6db-89fcccf75c0c/channels?session_id=15066c92-5ef1-4ee7-915a-e7a322d9f3cb to 10.10.6.224:32789

The error I mentioned earlier is hard to replicate; it seems to only appear after quite some time, so I guess it's probably not related to this issue. Any idea how I can check whether the problem is indeed with websockets or something else? Or where else can I debug this?

In the Docker container logs I'm seeing some weird stuff, but I'm not sure whether it's a cause or a consequence of this problem:

[I 17:58:22.906 LabApp] Kernel started: 39ad9dae-e27b-4d72-a6db-89fcccf75c0c
[IPKernelApp] WARNING | Unknown error in handling startup files:
[I 17:58:23.860 LabApp] Adapting from protocol version 5.1 (kernel 39ad9dae-e27b-4d72-a6db-89fcccf75c0c) to 5.3 (client).
[W 17:58:23.861 LabApp] 400 GET /ipython/api/kernels/39ad9dae-e27b-4d72-a6db-89fcccf75c0c/channels?session_id=95412b74-a609-42c2-9c05-4fdd70cd8ead (10.10.6.5) 895.89ms referer=None

There is one thing I can think of: the SSL encryption is not done within this nginx proxy but in another server, centralized for several domains we manage. Its only purpose is to do the encryption; it then redirects to this nginx, which does the rest. Any chance it could be interfering with this?

So yes, I'm a bit lost with this.

Ignacio

@hexylena
Member

One relevant thing: if I run an ipynb file and select to return after its run (no interactive browsing of Jupyter), then it works fine, returning the correct outputs for the cells. Although I'm not sure what this suggests.

Then it's just run from startup, no websocket established, so not much information to be gained from that, unfortunately.

Any idea how I can check whether the problem is indeed with websockets or something else? Or where else can I debug this?

If you're getting HTTP responses, but none of the cells are executable, then it's websockets. 100% for sure.

[IPKernelApp] WARNING | Unknown error in handling startup files:
[I 17:58:23.860 LabApp] Adapting from protocol version 5.1 (kernel 39ad9dae-e27b-4d72-a6db-89fcccf75c0c) to 5.3 (client).

That one looks harmless enough, but the following one is a failure to establish the websocket connection.

[W 17:58:23.861 LabApp] 400 GET /ipython/api/kernels/39ad9dae-e27b-4d72-a6db-89fcccf75c0c/channels?session_id=95412b74-a609-42c2-9c05-4fdd70cd8ead (10.10.6.5) 895.89ms referer=None

There is one thing I can think of: the SSL encryption is not done within this nginx proxy but in another server, centralized for several domains we manage.

Ah, if you've got two nginx proxies in the mix, then all sorts of fun things can happen.

Check that the Host, Upgrade, and Connection headers are carried through properly. I had a problem with another websocket system plus two proxies that presented this way.
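
If the TLS-terminating server is also nginx, something along these lines on that front server would keep those headers intact (a minimal sketch, assuming a hypothetical upstream hostname; adapt it to your actual setup):

    # on the centralized TLS-terminating server
    location / {
        proxy_pass http://galaxy-head.internal.example;   # hypothetical address of the nginx in front of the gie-proxy
        proxy_http_version 1.1;
        proxy_set_header Host $host;                       # preserve the *.interactivetool.* hostname for routing
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Forwarded-Proto $scheme;
    }

If the front proxy rewrites Host or drops the Upgrade/Connection headers, the websocket handshake dies at the second hop even though plain HTTP still works.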

E.g. the following option in jupyter:

--LabApp.allow_origin=<Unicode>
    Default: ''
    Set the Access-Control-Allow-Origin header
    Use '*' to allow any origin to access your server.
    Takes precedence over allow_origin_pat.

is left at its default, which sounds to me like it might have expectations about what the HTTP Host and Origin headers should be, and it might be doing some internal security check off of that? 400 isn't the usual code for that, though. You might try adding that option in your Jupyter IE tool XML to see if it helps. If yes, then it's probably an issue with the headers, and the request is coming from an invalid origin.
