
Adding HiveServer Credential Provider #210

Open · wants to merge 8 commits into master

Conversation

georgepachitariu

Hello Jim,
first I would like to thank you for implementing this library. We needed a way to launch Jupyter containers on our existing Hadoop cluster, and your libraries fit our needs perfectly.
In our Hadoop setup, users get their data by connecting to our Hive database. Since we have a Kerberized cluster, the way to connect from a YARN container is to use delegation tokens.

I implemented the part that connects to Hive and obtains a delegation token that is added to the rest of the tokens.
I used the Oozie implementation for inspiration (the CredentialProvider.java interface is modeled on it):
https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/action/hadoop/Hive2Credentials.java
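For context, the Oozie code linked above connects to HiveServer2 using the server's Kerberos principal, asks the open connection to issue a delegation token, and adds that token to the job's credentials. A minimal Python sketch of that flow follows; `FakeHiveConnection` and all other names here are illustrative stand-ins, not skein's or Hive's actual API:

```python
# Illustrative sketch only: FakeHiveConnection stands in for a real
# HiveServer2 client; none of these names come from skein or Hive.

class FakeHiveConnection:
    """Stand-in for a connection to HiveServer2 on a Kerberized cluster."""

    def __init__(self, uri, principal):
        self.uri = uri              # e.g. "jdbc:hive2://host:10000/default"
        self.principal = principal  # Kerberos principal of the Hive server

    def get_delegation_token(self, owner, renewer):
        # A real client would ask HiveServer2 to issue a token here.
        return "hive-token-for-" + owner


def add_hive_credentials(credentials, uri, principal, user):
    """Fetch a Hive delegation token and add it to the container's tokens."""
    conn = FakeHiveConnection(uri, principal)
    token = conn.get_delegation_token(owner=user, renewer=principal)
    credentials["hive"] = token
    return credentials


creds = add_hive_credentials(
    {}, "jdbc:hive2://hive:10000/default", "hive/_HOST@EXAMPLE.COM", "alice")
```

The token then travels with the rest of the application's credentials, so processes inside the YARN container can talk to Hive without a local Kerberos ticket.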

This is my first draft. Could you please have a look at it?

@georgepachitariu
Author

The Travis CI checks failed with the message:
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

I think the stalling is caused by the build setup and not by the new code. I would appreciate any guidance here.

@georgepachitariu
Author

georgepachitariu commented Mar 13, 2020

I created a pull-request in yarnspawner as well:
jupyterhub/yarnspawner#17

Owner

@jcrist jcrist left a comment

Thanks for working on this! Apologies for the delayed review here. I've left a few comments on the implementation.

A few general questions:

  • Is credential_providers the best name for this field? What terminology do other systems use?
  • What other credential providers may a user want us to support?
  • Are uri and principal sufficient information for all other implementations?

@@ -390,6 +394,38 @@
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-jdbc</artifactId>
Owner

If a system has hive available, is this jar likely to already be on the classpath (this is common for other hadoop libraries)? If so, we usually set <scope>provided</scope> for these types of dependencies to keep the size of our jar down and not worry about dependency mismatches. You should be able to drop all the exclusions as well.

Author

This is true. Thanks for explaining why I should use the provided scope.

skein/model.py Outdated
@@ -1152,6 +1152,60 @@ def from_protobuf(cls, obj):
security=security)


class CredentialProviderSpec(Specification):
Owner

I'd just call this CredentialProvider.

Author

Yep.

message CredentialProviderSpec {
  string name = 1;
  string uri = 2;
  string principal = 3;
Owner

Is this sufficient for any other credential provider we may want to support later?

Author

As explained in the longer comment below, these two fields are not enough for all providers, so I switched to a map of configuration values.

import org.apache.hadoop.yarn.exceptions.YarnException;
import java.io.IOException;

public interface CredentialProvider {
Owner

Rather than abstracting this out into an interface with two implementations, I'd make a single class with a method for adding each one, and a larger method for iterating through the spec and adding credentials for each provider. In python pseudocode:

class CredentialManager:
    def add_credentials_common(self, credentials):
        pass

    def add_credentials_hive(self, credentials, uri, principal):
        pass

    def add_credentials(self, credentials, spec):
        self.add_credentials_common(credentials)
        for cred in spec.credential_providers:
            if cred.name == "hive":
                self.add_credentials_hive(credentials, cred.uri, cred.principal)
            elif cred.name == ...:
                pass  # other implementations added here

This may not be as Java-idiomatic, but it better matches the existing code style. I don't see a reason to abstract this out when we'll only ever support a fixed list of credential providers.

Author

True, I'll refactor it this way.

for (int i = 0; i < spec.getCredentialProvidersCount(); i++) {
  Msg.CredentialProviderSpec d = spec.getCredentialProviders(i);
  if (d.getName().equals("hive")) {
    credentialProviders.add(new HiveCredentials(d, spec.getUser()));
Owner

What should we do if the user requests a provider we don't support? There are two cases here:

  • The name provided doesn't match any of our implementations.
  • The name does match one we implement, but adding a credential fails.

In both cases I think we should error. But I don't think that should be handled here, rather I think it should be handled in the routine that manages adding credentials (described below).
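Putting the two threads together, here is a runnable sketch of the single-class manager pattern suggested earlier, with the error behaviour discussed in this comment: an unknown provider name raises, and a provider that fails while fetching simply lets its exception propagate. All names and the placeholder token strings are illustrative, not skein's actual API:

```python
class CredentialManager:
    """One add_credentials_* method per supported provider, plus a dispatch loop.

    Illustrative only: token fetching is stubbed with placeholder strings.
    """

    def add_credentials_common(self, credentials):
        # Tokens every application needs (HDFS, ResourceManager, ...).
        credentials["hdfs"] = "hdfs-token"

    def add_credentials_hive(self, credentials, uri, principal):
        # A real implementation would contact HiveServer2 here; if that
        # fails, its exception propagates out of add_credentials (case 2).
        credentials["hive"] = "hive-token:%s:%s" % (uri, principal)

    def add_credentials(self, credentials, providers):
        self.add_credentials_common(credentials)
        for p in providers:
            if p["name"] == "hive":
                self.add_credentials_hive(credentials, p["uri"], p["principal"])
            else:
                # Case 1: the name matches none of our implementations,
                # so we error instead of silently ignoring it.
                raise ValueError("unsupported credential provider: %r" % p["name"])
        return credentials


mgr = CredentialManager()
creds = mgr.add_credentials(
    {}, [{"name": "hive",
          "uri": "jdbc:hive2://hive:10000",
          "principal": "hive/_HOST@EXAMPLE.COM"}])
```

Raising at the dispatch point keeps the per-provider methods simple: they only ever deal with a request that the manager has already recognized.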

Author

Yes, in the next draft exceptions will be propagated back in both cases.

@georgepachitariu
Author

Thank you for reviewing my code :).
I understood your comments. I will come back with the answers and a new draft later today.

@georgepachitariu
Author

> Thanks for working on this! Apologies for the delayed review here. I've left a few comments on the implementation.
>
> A few general questions:
>
>   • Is credential_providers the best name for this field? What terminology do other systems use?
>   • What other credential providers may a user want us to support?
>   • Are uri and principal sufficient information for all other implementations?

Hi, I answered the questions:

  1. Is credential_providers the best name for this field? What terminology do other systems use?

Since we are only dealing with delegation tokens, maybe we can rename credential_providers to something more specific: hadoop_delegation_token_provider? That name would be in line with the Hadoop & Kerberos book, which uses the terms "delegation token" and "Hadoop tokens".

  2. What other credential providers may a user want us to support?

From reading the Oozie and Spark code: HCat (Hive Metastore), HBase, JHS (Hadoop Job History Server), and Kafka.

  3. Are uri and principal sufficient information for all other implementations?

After some research, the answer is a sad no.

  • HBase uses the HBase client configuration together with the Hadoop job configuration received as input.
  • I think the Hadoop Job History Server uses the Hadoop configuration.
  • Kafka similarly has its own configuration: KafkaTokenClusterConf.

I was thinking: could we have a dictionary<str, str> in protobuf that gets filled with whatever configuration (keys and values) each provider needs, all bundled together? It might not be very nice to change the protobuf every time we add a provider.
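To make the idea concrete, here is a small Python sketch of what such free-form provider specs would deserialize to (mirroring a protobuf map<string, string> field). The key names are made up for illustration and are not a fixed schema:

```python
# Each provider is a name plus an arbitrary string-to-string config map,
# so adding a new provider needs no protobuf change. The key names below
# are illustrative only.
providers = [
    {"name": "hive",
     "config": {"hive2.jdbc.url": "jdbc:hive2://hive:10000/default",
                "hive2.server.principal": "hive/_HOST@EXAMPLE.COM"}},
    {"name": "hbase",
     "config": {"hbase.zookeeper.quorum": "zk1,zk2,zk3"}},
]


def get_config(provider, key):
    """Each implementation pulls only the keys it needs from the shared map."""
    try:
        return provider["config"][key]
    except KeyError:
        raise ValueError(
            "provider %r is missing required config key %r"
            % (provider["name"], key))


url = get_config(providers[0], "hive2.jdbc.url")
```

The trade-off is that the schema no longer documents which keys each provider needs, so each implementation has to validate its own config, as get_config does here.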

@georgepachitariu
Author

Hi Jim, my second draft is ready for review. I think I covered all the things you mentioned above.
If you could have a look when you have time, that would be great :).

@jcrist
Owner

jcrist commented Mar 24, 2020

Thanks @georgepachitariu. I'm currently on break between jobs (and without a computer), but will try to look at the changes as soon as I can (max 2 weeks from now). Apologies, thanks for being patient.

@georgepachitariu
Author

No worries @jcrist, take your time.

@santosh-d3vpl3x

This is an amazing effort at enabling good support for the full range of Hadoop services, thanks! Any ETA on when we can expect this to be available for use?

@jcrist
Owner

jcrist commented Apr 21, 2020

Hi all. I just started a new job, so a fair bit of my time is occupied ramping up on that. I expect to be able to give this a good review by end-of-week though. Thanks for your patience.

@wundervaflja

This is a very nice PR, exactly what we currently need for our platform. How can I help to get it merged? Really looking forward to helping and to having it in master.

@georgepachitariu
Author

Hi @santosh-d3vpl3x @wundervaflja. It's very cool to see that other people have the same ideas as me. As Jim said, please be a little patient. We will work towards a solution we like.
If you like to live on the edge, you can install this branch in your deployment. (That's what I did :D )

@gboutry

gboutry commented Nov 16, 2022

Hello, do you have an estimate of how complete this PR is? I'm really interested, and ready to help if needed.

@georgepachitariu
Author

georgepachitariu commented Nov 16, 2022

> Hello, do you have an estimate of how complete this PR is? I'm really interested, and ready to help if needed.

Hi @gboutry, nice to meet you!
Sadly this branch didn't get merged (and I don't work on data engineering systems anymore), BUT it has the complete functionality. So I would advise you to build from this branch and try it out.

@gboutry

gboutry commented Nov 25, 2022

Hi @georgepachitariu,

Sorry for the late reply. I was able to test your work, and indeed it works as expected; you did a really good job.
Many thanks!

(I rebased your branch on skein master, and it worked well)

5 participants