
feat(dataflow): Add OAuth support for Kafka SASL to the dataflow engine #5111

Closed



@agrski agrski commented Aug 29, 2023

Why

Motivation

This PR provides the dataflow-engine (Kotlin) equivalent of Rafal's Go changes: it adds support for the OAUTHBEARER SASL mechanism when connecting to Kafka clusters.

References

What

Outstanding tasks

  • Fix SASL_PLAIN
  • Test with Strimzi setups & external provider again
    • PLAINTEXT
    • SASL, PLAIN
    • SASL, SCRAM
    • SASL, OAUTHBEARER
    • SASL, mTLS
  • Change Helm vars to reuse client.secret instead of client.oauthSecret
  • Change default Ansible value for security.kafka.sasl.client.passwordPath from /tmp/sasl/kafka/client/password to password as this is a k8s secret field, not a file path in this context

Changes

  • Add support for OAUTHBEARER SASL mechanism when connecting to Kafka.
  • Refactor internal secrets provider mechanisms.
  • Formatting & stylistic changes.
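For context, a JVM client enabling OAUTHBEARER over SASL_SSL typically sets configuration along the following lines. This is a minimal sketch in Java using Kafka's standard client property names; the token endpoint, client ID, and secret are placeholders rather than values from this PR, and the login callback handler class name has moved between Kafka versions.

```java
import java.util.Properties;

public class OauthKafkaConfig {
    // Builds a minimal client configuration for SASL/OAUTHBEARER.
    // All credential values are placeholders for illustration only.
    public static Properties build(String tokenEndpoint, String clientId, String clientSecret) {
        Properties props = new Properties();
        props.setProperty("security.protocol", "SASL_SSL");
        props.setProperty("sasl.mechanism", "OAUTHBEARER");
        // Kafka's built-in OAuth login handler; the package changed in later Kafka releases.
        props.setProperty("sasl.login.callback.handler.class",
            "org.apache.kafka.common.security.oauthbearer.secured.OAuthBearerLoginCallbackHandler");
        props.setProperty("sasl.oauthbearer.token.endpoint.url", tokenEndpoint);
        // JAAS entry: values quoted, entry terminated with a semicolon.
        props.setProperty("sasl.jaas.config",
            "org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required"
                + " clientId=\"" + clientId + "\""
                + " clientSecret=\"" + clientSecret + "\";");
        return props;
    }

    public static void main(String[] args) {
        Properties p = build("https://idp.example/token", "my-client", "my-secret");
        System.out.println(p.getProperty("sasl.jaas.config"));
    }
}
```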

Testing

Create fresh kind cluster:

ansible-playbook playbooks/kind-cluster.yaml 
ansible-playbook playbooks/setup-ecosystem.yaml -e full_install=no -e install_kafka=yes
ansible-playbook playbooks/setup-seldon.yaml -e seldon_install_servers=no
  • Ensure default dataflow image is working:
k apply -f samples/pipelines/tfsimple.yaml

Protocol = PLAINTEXT

  • Use the custom image for this PR and confirm the basic, in-cluster Strimzi setup still works:
$ k -n seldon-mesh get seldonconfig default -o json | jq '.spec.components[5].podSpec.containers[0].image'
"docker.io/seldonio/seldon-dataflow-engine:latest"

$ k -n seldon-mesh patch seldonconfigs default --type json -p '[{"op": "replace", "path": "/spec/components/5/podSpec/containers/0/image", 
"value": "agrski/seldon-dataflow-engine:oauth"}]'
seldonconfig.mlops.seldon.io/default patched

$ k -n seldon-mesh get seldonconfig default -o json | jq '.spec.components[5].podSpec.containers[0].image'
"agrski/seldon-dataflow-engine:oauth"

$ k -n seldon-mesh logs seldon-dataflow-engine-5f98fbff7c-s8987 | rg -i tfsimple | tail -2
14:30:50.771110980  INFO [                main] :     i.s.d.k.Pipeline : pipeline tfsimple (v1) changing to state REBALANCING
14:31:46.498112744  INFO [a8811-StreamThread-1] :     i.s.d.k.Pipeline : pipeline tfsimple (v1) changing to state RUNNING

Protocol = SASL_SSL, mechanism = OAUTHBEARER

  • Test that an external Kafka provider set up with OAuth works. The last five overrides are required as changes from the default Ansible setup for the external Kafka provider in use:
helm -n seldon-mesh upgrade seldon-core-v2-components ./k8s/helm-charts/seldon-core-v2-setup/ \
  --reuse-values \
  --set kafka.bootstrap=<elided> \
  --set security.kafka.protocol=SASL_SSL \
  --set security.kafka.sasl.mechanism=OAUTHBEARER \
  --set security.kafka.sasl.client.oauthSecret=<elided> \
  --set dataflow.image.repository=agrski/seldon-dataflow-engine \
  --set dataflow.image.tag=oauth \
  --set security.kafka.ssl.client.brokerValidationSecret= \
  --set security.kafka.ssl.client.secret= \
  --set kafka.topics.replicationFactor=3 \
  --set kafka.consumer.messageMaxBytes=8388608 \
  --set kafka.producer.messagesMaxBytes=8388608

This now works: the dataflow engine reaches a running state without crashing.

Protocol = SASL_SSL, mechanism = SCRAM-SHA-512

  • Test with in-cluster Strimzi

I needed to delete the Strimzi setup and its PVCs, as SCRAM is not supported in Strimzi with KRaft.

helm -n seldon-mesh uninstall seldon-core-v2-kafka
kubectl -n seldon-mesh delete pvc data-0-seldon-kafka-0 data-0-seldon-kafka-1 data-0-seldon-kafka-2

I then used the below values as Ansible overrides and rolled out Kafka and the Core v2 installation again:

ansible-playbook playbooks/setup-ecosystem.yaml -e full_install=no -e install_kafka=yes -e @.extra/values-scv2-sasl-scram.yaml
ansible-playbook playbooks/setup-seldon.yaml -e seldon_install_servers=no -e @.extra/values-scv2-sasl-scram.yaml
Contents of .extra/values-scv2-sasl-scram.yaml:
seldon_core_v2_component_values:
  security:
    kafka:
      protocol: SASL_SSL
      sasl:
        client:
          secret: seldon
          passwordPath: password
      ssl:
        client:
          secret:
          brokerValidationSecret: seldon-cluster-ca-cert
          brokerCaPath: /tmp/certs/kafka/broker/ca.crt
  kafka:
    bootstrap: "seldon-kafka-bootstrap.{{ seldon_kafka_namespace }}.svc.cluster.local:9093"
    topics.numPartitions: 1

  dataflow: 
    image:
      repository: <custom image>
      tag: <custom tag>
    resources: 
      cpu: 500m
  envoy.service.type: "ClusterIP"
  scheduler.service.type: "ClusterIP"
  serviceGRPCPrefix: "http2-"

seldon_kafka_cluster_values:
  broker: 
    tls: 
      authentication: 
        type: scram-sha-512

strimzi_kafka_operator_feature_gates: "+UseStrimziPodSets"

This outputs logs like:

11:37:18.359742003  INFO [                main] : SaslPasswordProvider : retrieving password for SASL user
11:37:18.360151639  INFO [                main] : rnetesSecretProvider : reading secret seldon from seldon-mesh
11:37:18.416262190  INFO [                main] : SaslPasswordProvider : retrieved password for SASL user from secret seldon
...
11:37:19.887776545  INFO [                main] :     i.s.d.k.Pipeline : pipeline tfsimple (v1) changing to state REBALANCING
11:38:15.916283069  INFO [3524b-StreamThread-1] :     i.s.d.k.Pipeline : pipeline tfsimple (v1) changing to state RUNNING

This includes the values which need to be set, but without the client ID, secret, and so on actually being provided; that has yet to be done.
The refactored secrets provider will replace the existing one, splitting the responsibilities between retrieving secrets and interpreting their contents.
This will enable the upcoming SASL OAuth secret handler to reuse the underlying k8s secret provider.
The new implementation is also simpler than before and provides logging.
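The retrieval/interpretation split might look roughly like the following. This is a hypothetical sketch in Java; the interface and class names are illustrative and not necessarily those used in the PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;

// Retrieval: fetches raw secret data, e.g. from a Kubernetes Secret.
interface SecretProvider {
    Optional<Map<String, byte[]>> getSecret(String namespace, String name);
}

// Interpretation: knows which field of the secret holds the password and how to decode it.
class SaslPasswordProvider {
    private final SecretProvider secrets;
    private final String passwordField; // a secret field name such as "password", not a file path

    SaslPasswordProvider(SecretProvider secrets, String passwordField) {
        this.secrets = secrets;
        this.passwordField = passwordField;
    }

    Optional<String> getPassword(String namespace, String secretName) {
        return secrets.getSecret(namespace, secretName)
            .map(data -> data.get(passwordField))
            // Decode explicitly as UTF-8; a bare toString() on a byte array would not.
            .map(bytes -> new String(bytes, StandardCharsets.UTF_8));
    }
}
```

A SASL OAuth secret handler could then reuse the same SecretProvider while reading different fields (client ID, client secret, token endpoint) from its secret.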
@agrski agrski added the v2 label Aug 29, 2023
@agrski agrski self-assigned this Aug 29, 2023

agrski commented Aug 31, 2023

I had some fun with this error and some other errors in the dataflow engine logs:

 Caused by: java.lang.IllegalArgumentException: Value not specified for key 'clientId' in JAAS config  

This one was caused by using .toString() instead of .toString(Charsets.UTF_8) - an unhelpful error, as a value was supplied, just not a valid one.
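To illustrate the pitfall (a generic JVM example, not code from this PR): calling toString() on a byte array returns the object's identity string, not the decoded text, so downstream parsing sees a garbage value rather than a missing one.

```java
import java.nio.charset.StandardCharsets;

public class CharsetPitfall {
    public static void main(String[] args) {
        byte[] secret = "my-client-id".getBytes(StandardCharsets.UTF_8);

        // Wrong: Object.toString() on an array yields something like "[B@1b6d3586".
        String wrong = secret.toString();

        // Right: decode the bytes with an explicit charset
        // (the Kotlin equivalent is bytes.toString(Charsets.UTF_8)).
        String right = new String(secret, StandardCharsets.UTF_8);

        System.out.println(wrong.startsWith("[B@")); // true
        System.out.println(right);                   // my-client-id
    }
}
```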

There were also issues about null keys being defined in the JAAS config, which turned out to be a formatting issue: the config string wasn't terminated with a semi-colon. I also added quotes around values to adhere to the specification. Presumably the parser expected more keys and then complained when no key was found, rather than complaining about the missing terminator, as seen in some issues one might find on the topic.
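A sketch of the formatting rules that resolved those parse errors: a JAAS entry must end with a semicolon, and values are safest when quoted. The module name here is Kafka's OAuth login module; the helper itself is hypothetical.

```java
public class JaasConfigBuilder {
    // Builds a single JAAS config entry: quoted values, terminated with ';'.
    public static String oauthEntry(String clientId, String clientSecret) {
        return "org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required"
            + " clientId=\"" + clientId + "\""
            + " clientSecret=\"" + clientSecret + "\""
            + ";"; // omitting this terminator yields confusing "null key" parse errors
    }

    public static void main(String[] args) {
        System.out.println(oauthEntry("my-client", "my-secret"));
    }
}
```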

@agrski agrski force-pushed the add-kafka-oauth-support-to-dataflow-engine branch from 82c0fa8 to 8a0130c on September 4, 2023 at 15:52

agrski commented Sep 4, 2023

For testing, I've created a suite of Ansible YAML overrides to simplify matters and aid reproducibility, rather than having to edit Helm releases and provide lots of command-line overrides.

These have been applied with the following Ansible commands. The ecosystem setup was not re-run when using the cluster-external Kafka provider (for SASL/PLAIN and SASL/OAUTHBEARER) but was for in-cluster Strimzi Kafka (for PLAINTEXT and mTLS).

ansible-playbook playbooks/setup-ecosystem.yaml -e full_install=no -e install_kafka=yes -e @.extra/values-scv2-plaintext.yaml
ansible-playbook playbooks/setup-seldon.yaml -e seldon_install_servers=no -e @.extra/values-scv2-plaintext.yaml

@RafalSkolasinski

Superseded by #5127
