can't successfully run kfp_v2 uats behind proxy #78

Closed
nishant-dash opened this issue Jul 3, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@nishant-dash

Bug Description

When running the kfp_v2 integration test from https://github.com/canonical/charmed-kubeflow-uats/tree/main/tests (commit [0]), the experiment fails. The pod logs show pip being unable to reach the network:

$ kubectl logs -n dash pod/condition-v2-bcjmm-2542164255 
I0703 15:00:08.666715      61 launcher_v2.go:90] input ComponentSpec:{
  "inputDefinitions": {
    "parameters": {
      "force_flip_result": {
        "parameterType": "STRING",
        "defaultValue": "",
        "isOptional": true
      }
    }
  },
  "outputDefinitions": {
    "parameters": {
      "Output": {
        "parameterType": "STRING"
      }
    }
  },
  "executorLabel": "exec-flip-coin"
}
I0703 15:00:08.667506      61 cache.go:139] Cannot detect ml-pipeline in the same namespace, default to ml-pipeline.kubeflow:8887 as KFP endpoint.
I0703 15:00:08.667522      61 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
I0703 15:00:08.710062      61 object_store.go:306] Cannot detect minio-service in the same namespace, default to minio-service.kubeflow:9000 as MinIO endpoint.
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fe8fe8f9a50>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/kfp/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fe8fea17110>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/kfp/

To Reproduce

In a Kubeflow 1.8 environment behind a proxy, run the kfp_v2 UATs.

Environment

Model     Controller            Cloud/Region  Version  SLA          Timestamp
kubeflow  openstack-<REDACTED>  k8s/default   3.4.3    unsupported  16:12:37Z

SAAS                             Status   Store                 URL
alertmanager-karma-dashboard     active   openstack-<REDACTED>  admin/cos.alertmanager-karma-dashboard
grafana-dashboards               active   openstack-<REDACTED>  admin/cos.grafana-dashboards
loki-logging                     active   openstack-<REDACTED>  admin/cos.loki-logging
prometheus-receive-remote-write  active   openstack-<REDACTED>  admin/cos.prometheus-receive-remote-write
prometheus-scrape                active   openstack-<REDACTED>  admin/cos.prometheus-scrape
scrape-interval-config-metrics   blocked  openstack-<REDACTED>  admin/cos.scrape-interval-config-metrics
scrape-interval-config-monitors  blocked  openstack-<REDACTED>  admin/cos.scrape-interval-config-monitors

App                        Version                  Status  Scale  Charm                    Channel          Rev  Address        Exposed  Message
admission-webhook                                   active      1  admission-webhook        1.8/stable       301  a.b.c.d  no       
argo-controller                                     active      1  argo-controller          3.3.10/stable    424  a.b.c.d  no       
dex-auth                                            active      1  dex-auth                 2.36/stable      422  a.b.c.d   no       
envoy                      res:oci-image@cc06b3e    active      1  envoy                    2.0/stable       194  a.b.c.d   no       
grafana-agent-kubeflow     0.40.4                   active      1  grafana-agent-k8s        latest/edge       80  a.b.c.d   no       
istio-ingressgateway                                active      1  istio-gateway            1.17/stable     1000  a.b.c.d   no       
istio-pilot                                         active      1  istio-pilot              1.17/stable     1011  a.b.c.d  no       
jupyter-controller                                  active      1  jupyter-controller       1.8/stable       849  a.b.c.d   no       
jupyter-ui                                          active      1  jupyter-ui               1.8/stable       858  a.b.c.d     no       
katib-controller           res:oci-image@31ccd70    active      1  katib-controller         0.16/stable      576  a.b.c.d   no       
katib-db                   8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d  no       
katib-db-manager                                    active      1  katib-db-manager         0.16/stable      539  a.b.c.d   no       
katib-ui                                            active      1  katib-ui                 0.16/stable      422  a.b.c.d    no       
kfp-api                                             active      1  kfp-api                  2.0/stable      1283  a.b.c.d   no       
kfp-db                     8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d  no       
kfp-metadata-writer                                 active      1  kfp-metadata-writer      2.0/stable       334  a.b.c.d   no       
kfp-persistence                                     active      1  kfp-persistence          2.0/stable      1291  a.b.c.d    no       
kfp-profile-controller                              active      1  kfp-profile-controller   2.0/stable      1315  a.b.c.d  no       
kfp-schedwf                                         active      1  kfp-schedwf              2.0/stable      1302  a.b.c.d   no       
kfp-ui                                              active      1  kfp-ui                   2.0/stable      1285  a.b.c.d   no       
kfp-viewer                                          active      1  kfp-viewer               2.0/stable      1317  a.b.c.d  no       
kfp-viz                                             active      1  kfp-viz                  2.0/stable      1235  a.b.c.d   no       
knative-eventing                                    active      1  knative-eventing         1.10/stable      353  a.b.c.d   no       
knative-operator                                    active      1  knative-operator         1.10/stable      328  a.b.c.d    no       
knative-serving                                     active      1  knative-serving          1.10/stable      409  a.b.c.d   no       
kserve-controller                                   active      1  kserve-controller        0.11/stable      523  a.b.c.d  no       
kubeflow-dashboard                                  active      1  kubeflow-dashboard       1.8/stable       454  a.b.c.d    no       
kubeflow-profiles                                   active      1  kubeflow-profiles        1.8/stable       355  a.b.c.d  no       
kubeflow-roles                                      active      1  kubeflow-roles           1.8/stable       187  a.b.c.d    no       
kubeflow-volumes           res:oci-image@2261827    active      1  kubeflow-volumes         1.8/stable       260  a.b.c.d   no       
metacontroller-operator                             active      1  metacontroller-operator  3.0/stable       252  a.b.c.d    no       
minio                      res:oci-image@1755999    active      1  minio                    ckf-1.8/stable   278  a.b.c.d   no       
mlflow-minio               res:oci-image@1755999    active      1  minio                    ckf-1.7/stable   214  a.b.c.d  no       
mlflow-mysql               8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d  no       
mlflow-server                                       active      1  mlflow-server            2.1/stable       466  a.b.c.d   no       
mlmd                       res:oci-image@44abc5d    active      1  mlmd                     1.14/stable      127  a.b.c.d  no       
oidc-gatekeeper                                     active      1  oidc-gatekeeper          ckf-1.8/stable   350  a.b.c.d    no       
pvcviewer-operator                                  active      1  pvcviewer-operator       1.8/stable        30  a.b.c.d  no       
resource-dispatcher                                 active      1  resource-dispatcher      1.0/stable        93  a.b.c.d   no       
seldon-controller-manager                           active      1  seldon-core              1.17/stable      664  a.b.c.d    no       
tensorboard-controller                              active      1  tensorboard-controller   1.8/stable       257  a.b.c.d   no       
tensorboards-web-app                                active      1  tensorboards-web-app     1.8/stable       245  a.b.c.d    no       
training-operator                                   active      1  training-operator        1.7/stable       347  a.b.c.d   no       

Relevant Log Output

$ kubectl get all -n dash
NAME                                                  READY   STATUS      RESTARTS   AGE
pod/condition-v2-bcjmm-1791838033                     0/2     Completed   0          3m40s
pod/condition-v2-bcjmm-2085347550                     0/2     Completed   0          4m
pod/condition-v2-bcjmm-2542164255                     2/2     Running     0          3m30s
pod/ml-pipeline-ui-artifact-6b89ccc469-djz6v          2/2     Running     0          5d5h
pod/ml-pipeline-visualizationserver-955b54775-l9v7p   2/2     Running     0          4d20h
pod/test-dash-2-0                                     2/2     Running     0          9m54s

NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/ml-pipeline-ui-artifact           ClusterIP   10.87.227.78    <none>        80/TCP     5d5h
service/ml-pipeline-visualizationserver   ClusterIP   10.87.207.107   <none>        8888/TCP   5d5h
service/test-dash-2                       ClusterIP   10.87.85.27     <none>        80/TCP     9m54s

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ml-pipeline-ui-artifact           1/1     1            1           5d5h
deployment.apps/ml-pipeline-visualizationserver   1/1     1            1           5d5h

NAME                                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/ml-pipeline-ui-artifact-6b89ccc469          1         1         1       5d5h
replicaset.apps/ml-pipeline-visualizationserver-955b54775   1         1         1       5d5h

NAME                           READY   AGE
statefulset.apps/test-dash-2   1/1     9m54s

Additional Context

No response

@nishant-dash nishant-dash added the bug Something isn't working label Jul 3, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5957.

This message was autogenerated

@NohaIhab
Contributor

NohaIhab commented Jul 8, 2024

Hi @nishant-dash ,

  1. Can you provide the logs you get from the notebook execution? It would be useful to see at which step the notebook fails, so we can tell which part of it is trying to reach the internet.
  2. Have you tried configuring the pipeline as shown in the example notebook linked in our how-to guide? If you use a different configuration, please share it with us as well.
  3. Which Kubernetes distribution and version are you using?

In the meantime, we are prioritizing this issue and will try to reproduce it.

@nishant-dash
Author

Regarding your questions:

  1. It fails on the last cell of the kfp v2 integration test notebook:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[14], line 4
     1 # fetch KFP experiment to ensure it exists
     2 client.get_experiment(experiment_name=EXPERIMENT_NAME)
----> 4 assert_run_succeeded(client, run.run_id)

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:336, in BaseRetrying.wraps.<locals>.wrapped_f(*args, **kw)
   334 copy = self.copy()
   335 wrapped_f.statistics = copy.statistics  # type: ignore[attr-defined]
--> 336 return copy(f, *args, **kw)

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:475, in Retrying.__call__(self, fn, *args, **kwargs)
   473 retry_state = RetryCallState(retry_object=self, fn=fn, args=args, kwargs=kwargs)
   474 while True:
--> 475     do = self.iter(retry_state=retry_state)
   476     if isinstance(do, DoAttempt):
   477         try:

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:376, in BaseRetrying.iter(self, retry_state)
   374 result = None
   375 for action in self.iter_state.actions:
--> 376     result = action(retry_state)
   377 return result

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:418, in BaseRetrying._post_stop_check_actions.<locals>.exc_check(rs)
   416 retry_exc = self.retry_error_cls(fut)
   417 if self.reraise:
--> 418     raise retry_exc.reraise()
   419 raise retry_exc from fut.exception()

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:185, in RetryError.reraise(self)
   183 def reraise(self) -> t.NoReturn:
   184     if self.last_attempt.failed:
--> 185         raise self.last_attempt.result()
   186     raise self

File /opt/conda/lib/python3.11/concurrent/futures/_base.py:449, in Future.result(self, timeout)
   447     raise CancelledError()
   448 elif self._state == FINISHED:
--> 449     return self.__get_result()
   451 self._condition.wait(timeout)
   453 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File /opt/conda/lib/python3.11/concurrent/futures/_base.py:401, in Future.__get_result(self)
   399 if self._exception:
   400     try:
--> 401         raise self._exception
   402     finally:
   403         # Break a reference cycle with the exception in self._exception
   404         self = None

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:478, in Retrying.__call__(self, fn, *args, **kwargs)
   476 if isinstance(do, DoAttempt):
   477     try:
--> 478         result = fn(*args, **kwargs)
   479     except BaseException:  # noqa: B902
   480         retry_state.set_exception(sys.exc_info())  # type: ignore[arg-type]

Cell In[13], line 9, in assert_run_succeeded(client, run_id)
     7 """Wait for the run to complete successfully."""
     8 status = client.get_run(run_id).state
----> 9 assert status == "SUCCEEDED", f"KFP run in {status} state."

AssertionError: KFP run in RUNNING state.
  2. But isn't this for a cluster-internal service, maybe kfp? (Perhaps the no-proxy settings need tweaking at the containerd level?)
Network is unreachable')': /simple/kfp/

@DnPlas
Contributor

DnPlas commented Jul 8, 2024

@nishant-dash also, to help us reproduce the issue, could you please tell us which method you are using to run the UATs? It was not clear from the issue description.

a) From inside a notebook
b) Using the driver

@NohaIhab
Contributor

NohaIhab commented Jul 9, 2024

I got access to the environment in question from @nishant-dash and was able to investigate.
After the pipeline runner pod went into an error state, its logs were as follows:

I0709 12:16:46.169293      69 cache.go:139] Cannot detect ml-pipeline in the same namespace, default to ml-pipeline.kubeflow:8887 as KFP endpoint.
I0709 12:16:46.169310      69 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
I0709 12:16:46.207929      69 object_store.go:306] Cannot detect minio-service in the same namespace, default to minio-service.kubeflow:9000 as MinIO endpoint.
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f5db14d3b50>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/kfp/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f5db14e2310>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/kfp/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f5db14e2810>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/kfp/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f5db14e2bd0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/kfp/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f5db14e2f10>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/kfp/
ERROR: Could not find a version that satisfies the requirement kfp==2.4.0 (from versions: none)
ERROR: No matching distribution found for kfp==2.4.0
I0709 12:22:55.223155      69 launcher_v2.go:151] publish success.
F0709 12:22:55.223210      69 main.go:49] failed to execute component: exit status 1
time="2024-07-09T12:22:55.240Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
time="2024-07-09T12:22:56.202Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1

The logs make it clear that the container was not able to install the kfp package, specifically:

ERROR: Could not find a version that satisfies the requirement kfp==2.4.0 (from versions: none)
ERROR: No matching distribution found for kfp==2.4.0

The Retrying warnings above come from pip trying and failing to install the package, because the container is not able to reach the internet.

The default base image for pipeline runners does not have the kfp package installed, so it tries to install it on every run. We have previously seen this in airgapped testing (see this comment).
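To spot this failure mode quickly in other runner pods, the log lines quoted above can be matched mechanically. The following is only a rough sketch: the function name `summarize_pip_failure` and the two regular expressions are illustrative assumptions based on the pip output shown in this thread, and other pip versions may word these messages differently.

```python
import re

# Patterns modelled on the pip output quoted in this thread (assumption:
# other pip versions may phrase these messages differently).
RETRY_RE = re.compile(r"WARNING: Retrying \(Retry\(total=(\d+).*Network is unreachable")
NO_DIST_RE = re.compile(r"ERROR: No matching distribution found for (\S+)")


def summarize_pip_failure(log_text: str) -> dict:
    """Count pip 'Network is unreachable' retries and collect packages pip could not find."""
    retries = 0
    missing = []
    for line in log_text.splitlines():
        if RETRY_RE.search(line):
            retries += 1
        match = NO_DIST_RE.search(line)
        if match:
            missing.append(match.group(1))
    return {"unreachable_retries": retries, "missing_packages": missing}
```

The input would typically be the text returned by `kubectl logs` for the runner pod. A non-zero retry count together with a missing `kfp==...` requirement points at the runner container lacking internet access rather than at the pipeline code itself.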

@NohaIhab
Contributor

NohaIhab commented Jul 9, 2024

Based on my comment above, in a proxy environment the pipeline must be configured with the proxy environment variables.

I was able to do this and run a pipeline successfully behind a proxy by modifying the kfp v2 UATs notebook as follows:

  1. Add a cell at the beginning to define the proxy and no-proxy values:
PROXY_URL = "<insert your proxy values>"
NO_PROXY_URLS = "<insert your no proxy values>"
  2. Add a cell with a helper function that adds the proxy env variables to the PipelineTask. For more details see the kfp sdk docs. The helper function is as follows:
def add_proxy(obj, proxy=PROXY_URL, no_proxy=NO_PROXY_URLS):
    """Adds the proxy env vars to the PipelineTask object."""
    # Set both lower- and upper-case names, since tools differ in which they read.
    return (
        obj.set_env_variable(name='http_proxy', value=proxy)
        .set_env_variable(name='https_proxy', value=proxy)
        .set_env_variable(name='HTTP_PROXY', value=proxy)
        .set_env_variable(name='HTTPS_PROXY', value=proxy)
        .set_env_variable(name='no_proxy', value=no_proxy)
        .set_env_variable(name='NO_PROXY', value=no_proxy)
    )
  3. Modify the cell where the pipeline is defined to use the add_proxy helper so that it adds the proxy env vars to all components:
@dsl.pipeline(name='condition-v2')
def condition_pipeline(text: str = 'condition test', force_flip_result: str = ''):
    flip1 = add_proxy(flip_coin(force_flip_result=force_flip_result))
    add_proxy(print_msg(msg=flip1.output))

    with dsl.Condition(flip1.output == 'heads'):
        flip2 = add_proxy(flip_coin())
        add_proxy(print_msg(msg=flip2.output))
        add_proxy(print_msg(msg=text))
  4. Run the notebook after the 3 edits above; the pipeline should now succeed.
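For notebooks with many components, the same idea can be written as a data-driven helper. This is only a sketch under stated assumptions: `build_proxy_env` and `apply_env` are hypothetical names, not part of the kfp SDK, and the only SDK call relied on is `set_env_variable(name=..., value=...)` on the task object, as used in the helper above. The proxy URL and host list in the usage note are placeholders.

```python
from typing import Dict, Iterable


def build_proxy_env(proxy: str, no_proxy: Iterable[str]) -> Dict[str, str]:
    """Return proxy variables in lower- and upper-case, since tools differ in which they read."""
    # De-duplicate the no-proxy entries while keeping their order.
    no_proxy_value = ",".join(dict.fromkeys(no_proxy))
    env: Dict[str, str] = {}
    for name, value in (
        ("http_proxy", proxy),
        ("https_proxy", proxy),
        ("no_proxy", no_proxy_value),
    ):
        env[name] = value
        env[name.upper()] = value
    return env


def apply_env(task, env: Dict[str, str]):
    """Call task.set_env_variable for every entry and return the task for chaining."""
    for name, value in env.items():
        task.set_env_variable(name=name, value=value)
    return task
```

A call such as `apply_env(flip_coin(), build_proxy_env("http://proxy.example:3128", ["ml-pipeline.kubeflow", "minio-service.kubeflow"]))` would then replace `add_proxy(flip_coin())`. The two hostnames are the cluster-internal endpoints that appear in the launcher logs earlier in this thread; whether they need to be listed in no_proxy depends on the proxy setup and is an assumption here.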

@nishant-dash can you try the above and confirm it fixes the issue for you?

@NohaIhab
Contributor

Discussed with @nishant-dash
I have tested the fix in the environment in which it was failing for them, so we can close the issue.
If you hit this again, please report it and re-open the issue.
