Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Signed URL Support #200

Closed
shoeffner opened this issue Jul 18, 2024 · 10 comments
Closed

Add Signed URL Support #200

shoeffner opened this issue Jul 18, 2024 · 10 comments
Labels
status:deferred Will be looked at later. type:feature New feature

Comments

@shoeffner
Copy link
Collaborator

As discussed in #192, it could be a good idea to include Signed URL support as described in https://guides.dataverse.org/en/latest/api/external-tools.html#signed-urls.

This issue is there to discuss, track, etc. that proposal.

@pdurbin
Copy link
Member

pdurbin commented Jul 19, 2024

If we want to help out upstream, a report came in about the "edit" API endpoint not working with signed URLs. Maybe we try to reproduce the error:

@qqmyers
Copy link
Member

qqmyers commented Jul 19, 2024

@nana-boateng
Copy link

nana-boateng commented Jul 22, 2024

Yep, I tried a PUT request and it worked! Thanks @qqmyers!

shoeffner added a commit to shoeffner/pyDataverse that referenced this issue Jul 25, 2024
Tests to validate an implementation for gdcc#200
shoeffner added a commit to shoeffner/pyDataverse that referenced this issue Jul 25, 2024
Tests to validate an implementation for gdcc#200
@shoeffner
Copy link
Collaborator Author

shoeffner commented Jul 25, 2024

While trying this out, I found out a couple of things:

  • demo.dataverse.org returns a 503 with the message: {status:"error", message:"Endpoint available from localhost only. Please contact the dataverse administrator"}, which should definitely be handled gracefully
  • With the localhost docker server, when sending a malformed JSON as a signing request – I added an extra comma before the closing brace, i.e., {"a": "b",} – the server returns 500 with no message ({} is the actual response), but the server logs state the error:
    dataverse         | [#|2024-07-25T00:50:23.835+0000|SEVERE|Payara 6.2024.6|edu.harvard.iq.dataverse.api.errorhandlers.ThrowableHandler|_ThreadID=91;_ThreadName=http-thread-pool::http-listener-1(1);_TimeMillis=1721868623835;_LevelValue=1000;|
    dataverse         |   _status="ERROR";_code=500;_message="Internal server error. More details available at the server logs.";_incidentId="e748f4e3-3aaa-46ca-8fbe-6e1b68968cb3";_interalError="JsonParsingException";_requestUrl="http://localhost:8080/api/v1/admin/requestSignedUrl";_requestMethod="POST"||#]
    dataverse         | 
    dataverse         | [#|2024-07-25T00:50:23.837+0000|SEVERE|Payara 6.2024.6|edu.harvard.iq.dataverse.api.errorhandlers.ThrowableHandler|_ThreadID=91;_ThreadName=http-thread-pool::http-listener-1(1);_TimeMillis=1721868623837;_LevelValue=1000;|
    dataverse         |   jakarta.json.stream.JsonParsingException: Invalid token=CURLYCLOSE at (line no=6, column no=1, offset=183). Expected tokens are: [STRING]
    dataverse         |     at org.eclipse.parsson.JsonParserImpl.parsingException(JsonParserImpl.java:455)
    
    (removed the stacktrace for brevity). This should probably throw a 400 instead and maybe even return the JsonParsingException output, as that is really helpful to understand what went wrong with the request. In either case, it should be handled gracefully and maybe reported upstream, if it wasn't already done.
  • I didn't find a documentation for the correct/expected response to a signing request; in general the whole signed URL mechanism seems very sparsely documented. For references, this is what a correct response seems to look like:
    {
        "status": "OK",
        "data": {
            "signedUrl": "..."
        }
    }
    with the signedUrl being of the form: {originalUrl}?until={isoformatted-UTC}&user={username}&method={http-verb}&token={a long hexadecimal token}.
    Note that the isoformatted timestamp has a T separating date and time, no timezone information (but is UTC), and ends in 1/1000 s (0000-00-00T00:00.000).
  • I don't know from the docs if the signed URLs work for everything, or if, e.g., the SWORD API, which has a different authentication mechanism, works with it; nor if both, API and Bearer tokens work for signed URLs. However, I think the httpx.Auth auth_flow should support all of that nonetheless with, e.g., yield from signing_request_auth_flow(...) or something.

So far I only wrote unit test cases and haven't done any actual implementation yet (so the tests might still change a little bit here and there, especially the test_signed_url_flow_tracks_data_payload), I also want to add an integration test against the localhost server to see that the flow actually works. shoeffner@6386001 The branch relies on #201 , so it might become rebased or modified depending on that branch.

@shoeffner
Copy link
Collaborator Author

How is pyDataverse supposed to be using SignedURLs?

  1. To generate (and distribute) signed URLs for other users/applications to use?
  2. To request a signed URL per request (potentially caching them for repeated requests until they time out?)
  3. To use signed URLs supplied from another application?

Am I missing something?

So far, I focused on 2 (but this kind of entails 1), but I haven't made up any thoughts about 3, although that might also work, in particular if we implement "caching", then we could simply fill the cache with pre-signed URLs.

Another interesting thing I read is (emphasis mine):

url - the exact URL to sign, including api version number and all query parameters

Right now, pyDataverse mostly ignores the API version, so pyDataverse can either modify the URL before requesting a signature for it, or all routes should be changed to include the API Version (which relates to #197 and #189). Ironically, the canonical example uses /api/admin/requestSignedUrl without a version argument.

@shoeffner
Copy link
Collaborator Author

shoeffner commented Jul 28, 2024

To shed a little bit more light on my question on what should be supported, here's a little overview over the three variants.

Variant 1

This would be variant 1, and from what I understand, the "common" use case for signed URLs (at least from Dataverse's perspective):

sequenceDiagram
    participant User
    participant Admin/App
    participant pyDataverse
    participant Dataverse
    User->>Admin/App: Asks to perform action
    Admin/App->>pyDataverse: Requests signed URL
    pyDataverse->>Dataverse: Requests signed URL
    Dataverse->>pyDataverse: Issues signed URL
    pyDataverse->>Admin/App: Passes signed URL
    Admin/App->>User: Forwards signed URL
    User->>Dataverse: Uses signed URL to perform action
    Dataverse->>User: Responds with result
Loading

For this, pyDataverse would need to be able to returned signed URLs instead of actually performing the actions.

Variant 2

This is, as far as I understood, more or less the idea proposed with "we should support signed URLs" in #192.

sequenceDiagram
    participant User
    participant Admin/App
    participant pyDataverse
    participant Dataverse
    Admin/App->>pyDataverse: Asks to perform action
    pyDataverse->>Dataverse: Requests signed URL
    Dataverse->>pyDataverse: Issues signed URL
    pyDataverse->>Dataverse: Uses signed URL to perform action
    Dataverse->>pyDataverse: Responds with result
    pyDataverse->>Admin/App: Delivers result
Loading

Thinking more about signed URLs, I am not sure this is what is intended, as it eventually only increases the number of API calls.

Variant 3

In this scenario, the user wants to perform an action and some other App/Admin generates the signed URL which the user can then use with pyDataverse.
This is also very in line with the "common" use case for signed URLs, but this time the role of pyDataverse is flipped: Instead of providing the signed URLs for other services as in Variant 1, it becomes the consumer of signed URLs.

sequenceDiagram
    participant User
    participant Admin/App
    participant pyDataverse
    participant Dataverse
    User->>Admin/App: Asks to perform action
    Admin/App->>Dataverse: Requests signed URL 
    Dataverse->>Admin/App: Issues signed URL
    Admin/App->>User: Forwards signed URL
    User->>pyDataverse: Provides signed URL
    pyDataverse->>Dataverse: Uses signed URL to perform action
    Dataverse->>pyDataverse: Responds
    pyDataverse->>User: Delivers result
Loading

Summary

In general I think the idea of using signed URLs for auth does not make too much sense on a per-request basis, at least not "transparently," that is when the result is to perform the action.
It does make sense to support either or both of the other two use-cases though:

  • A user of pyDataverse might receive signed URLs via another app or administrator and wants to use those to perform actions with pyDataverse, e.g., scripting some data uploads.
  • A user of pyDataverse implements a service where pyDataverse is the backend and delivers URLs which can be used in a frontend without exposing the API key. In this case, pyDataverse would only generate the URLs and return those as a result, rather than performing the actual action.

The other variant, getting a signed URL and performing the action based on it, could then be naturally combined from Variants 1 and 3, but I do not really see special use cases for that.

What is your take on this?

@shoeffner
Copy link
Collaborator Author

I started implementing version 2 and also the tests on my branch kind of assume that flow, but I realized that this is not really the way it should work.

For the other two variants, pyDataverse as "requester" and "consumer", I am not sure how good implementations would look like and what @JR-1991 is planning in #189 , but I feel this should be conceptually integrated when restructuring the API.
Maybe some sort of middleware-y design might work to avoid having to implement everything twice, so that one can inject signed URL handling in variant 1, variant 3, and by injecting both variant 2 ;-)
But I have to think about that a little bit, not sure what the best design is. Do you know how other tools solve this?

@shoeffner
Copy link
Collaborator Author

shoeffner commented Jul 28, 2024

I checked the boto3 docs and they seem to handle this in a very direct fashion, essentially instead of performing an action they change the function call to generate_presigned_url: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-presigned-urls.html and just state

The user can download the S3 object by entering the presigned URL in a browser. A program or HTML page can download the S3 object by using the presigned URL as part of an HTTP GET request.

They also only seem to implement variant 1, i.e., generating presigned URLs ; for variant 3 they basically say: use it with curl.

@shoeffner
Copy link
Collaborator Author

I think event hooks might be a good idea: https://www.python-httpx.org/advanced/event-hooks/
For signing a URL one could rewrite the URL to the request a signed URL with a request hook, and extract the URL instead of the correct response with a response hook or something. I am not yet sure how that would look like on an API-level though, I can imagine any of the following options and there are probably way more:

# explicit args
api = DataAccessApi(URL, auth=..., middlewares=[SignedURLHook])

# wrapper
api = SignedURLs(DataAccessApi(URL, auth=...))
# alternative per method
api = DataAccessApi(URL, auth=...)
SignedURL(api.get_datafile(...))

# inheritance
class SignedURLDataAccessApi(DataAccessApi, SignedURLApi):
    pass
api = SignedURLDataAccessApi(URL, auth=...)

# extra method/property
api = DataAccessApi(URL, auth=...)
api.signed_url.get_datafile(...)

I am not sure how to do it the other way around, but similar ways might work, maybe with a signed_url_store which stores signed urls.
Or some API which allows api.get(signed_url).[as_dataset()|as_file()|...], Dataset(api.get(...)), or Dataset.from(api.get(...)).

One option which might entail quite some refactoring would be to have a template per URL and match the template for signed URLs or realize them for non-signed URLs, depending on the use case.

@shoeffner
Copy link
Collaborator Author

In the group meeting on August 21st, we decided to put this on hold until there is a need for it.

If you are in need of this feature, please re-open the issue.

@shoeffner shoeffner closed this as not planned Won't fix, can't repro, duplicate, stale Aug 24, 2024
@shoeffner shoeffner added status:deferred Will be looked at later. and removed status:incoming Newly created issue to be forwarded labels Aug 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status:deferred Will be looked at later. type:feature New feature
Projects
None yet
Development

No branches or pull requests

4 participants