Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

222 add deployment of example get started experiments model #233

Merged

Conversation

daavoo
Copy link
Contributor

@daavoo daavoo commented Aug 10, 2023

Add Sagemaker deployment.

https://github.com/iterative/example-get-started-experiments/actions/workflows/deploy-model.yml

https://us-east-2.console.aws.amazon.com/sagemaker/home?region=us-east-2#/endpoints/results-train-pool-segmentation-v0-1-0-dev

export AWS_DEFAULT_REGION=us-east-2
python src/endpoint_predict.py \
--img_path data/test_data/REGION_1-24_0_1024_0_1024.jpg \
--endpoint_name results-train-pool-segmentation-v0-1-0-dev

Copy link
Contributor

@tapadipti tapadipti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some change suggestions for the deploy GH action. Will add related changes to the deploy-model file.

Copy link
Contributor

@tapadipti tapadipti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some suggestions for creating different configs and endpoints per stage, and creating a serverless endpoint.

@daavoo daavoo marked this pull request as ready for review August 14, 2023 12:20
@daavoo daavoo self-assigned this Aug 14, 2023
@daavoo daavoo added the A: example-get-started-experiments DVC Experiment, DVCLive examples label Aug 14, 2023
Comment on lines +35 to +36
composed_name = re.sub(
r"[^a-zA-Z0-9\-]", "-", f"{name}-{version}-{stage}")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know what possible characters can appear in f"{name}-{version}-{stage}" that would need to be replaced here? I guess it's related to iterative/dvc#9821?

Copy link
Contributor Author

@daavoo daavoo Aug 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's related to iterative/dvc#9821?

This regex is only relevant for Sagemaker and it is enforced by their API .
From top of my head, the composed name includes / , : from name and . from version

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the rationale for the naming restrictions in gto were to make it more likely they would be compatible with systems like sagemaker. Not sure it really helps though. We still hit these restrictions and users don't know why they can't use these characters.

python src/train.py

dvc stage add -n evaluate \
-p base,evaluate \
-d src/evaluate.py -d models/model.pkl -d data/test_data \
python src/evaluate.py

dvc stage add -n sagemaker \
-d models/model.pth -o model.tar.gz \
'cp models/model.pth sagemaker/code/model.pth && cd sagemaker && tar -cpzf model.tar.gz code/ && cd .. && mv sagemaker/model.tar.gz . && rm sagemaker/code/model.pth'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor thoughts to try to clean this up a bit:

I haven't tested this, but wondering if we can simplify this at all with something like this?

Suggested change
'cp models/model.pth sagemaker/code/model.pth && cd sagemaker && tar -cpzf model.tar.gz code/ && cd .. && mv sagemaker/model.tar.gz . && rm sagemaker/code/model.pth'
'cp models/model.pth sagemaker/out && cp sagemaker/code sagemaker/out && tar -cpzf model.tar.gz sagemaker/out'

Also, I wonder if it would be better to append directly to dvc.yaml and use list syntax for cmd instead of &&?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I wonder if it would be better to append directly to dvc.yaml and use list syntax for cmd instead of &&?

I didn't manage to use list cmd with stage add (which we use in the generate.sh script , but might be missing something

@dberenbaum

This comment was marked as resolved.

Copy link

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to address my comments from this PR, but I don't see any actual blockers here. Nice work @daavoo!

@daavoo
Copy link
Contributor Author

daavoo commented Aug 14, 2023

I don't seem to have access:

Can you try with the Sandbox account?
Also, make sure you set us-east-2 as AWS region when querying

What is our plan here? Not a blocker, but do we want to work towards making it public?

I assume we don't want to make the actual endpoint public, but rather a simple UI that queries the endpoint.
I was assuming that, for now, we would be using it for live demos and using the sandbox account.

@dberenbaum

This comment was marked as resolved.

@daavoo

This comment was marked as resolved.

@@ -90,7 +90,7 @@ jobs:
- uses: aws-actions/configure-aws-credentials@v1
with:
aws-region: us-east-2
role-to-assume: arn:aws:iam::342840881361:role/SandboxUser
role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have to generalize it in this way? (an extra step for us to take care of)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was suggested by @jesper7 and @tapadipti to use the role as a secret. I assume it has some security implications although I am not sure what is the danger of leaking a role name

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo How did you set it up for https://github.com/iterative/example-get-started-experiments? Should we add it to the readme here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that roles can be public.

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really great! looks simple and clen

@tapadipti
Copy link
Contributor

tapadipti commented Aug 16, 2023

@daavoo Looks like this PR is close to getting merged. Since this uses one of our official demo repos, we could use this in the blog post instead of the demo-fashion-mnist that I have currently used. wdyt? I can try to replace the example snippets in the blog post to use your snippets. And you might wanna rewrite some of the text. We'll not have a web UI, but that should be ok.

@daavoo
Copy link
Contributor Author

daavoo commented Aug 16, 2023

Since this uses one of our official demo repos, we could use this in the blog post instead of the demo-fashion-mnist that I have currently used. wdyt?

Makes sense to me. I would perhaps also use the opportunity to cut the scope of the post a little by dropping DVC details in favor of pointers to the dvc get-started pages

@tapadipti
Copy link
Contributor

Since this uses one of our official demo repos, we could use this in the blog post instead of the demo-fashion-mnist that I have currently used. wdyt?

Makes sense to me. I would perhaps also use the opportunity to cut the scope of the post a little by dropping DVC details in favor of pointers to the dvc get-started pages

Ok. I'll share an updated version of the blog post tomorrow.
@shcheklein FYI since we were discussing this today morning.

@daavoo
Copy link
Contributor Author

daavoo commented Aug 16, 2023

Merging as the endpoint is now working. Don't hesitate to open followups

@daavoo daavoo merged commit 85cf558 into master Aug 16, 2023
@daavoo daavoo deleted the 222-add-deployment-of-example-get-started-experiments-model branch August 16, 2023 20:05
- run: dvc remote add -d --local storage s3://dvc-public/remote/get-started-pools

- run: |
MODEL_DATA=$(dvc get --show-url . model.tar.gz)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo Doesn't this get the model data for the latest commit instead of the specific version we are trying to deploy? From the DVC docs, when --rev is not specified, The latest commit (in the default branch) is used by default when this option is not specified.

Copy link
Contributor Author

@daavoo daavoo Aug 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that the phrase in the docs is meant for the scenario where you pass a remote URL repo as the first argument.

I think the behavior here is correct, but I am going to double-check.

I think it works because we are using the local repo . and so the current status of the workspace will be used. The workflow uses actions/checkout and runs on the git tag creation so the workspace will be the revision where the tag was created

composed_name = re.sub(
r"[^a-zA-Z0-9\-]", "-", f"{name}-{version}-{stage}")

model = PyTorchModel(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo If I created a SageMaker model for a given model version before deploying it in one environment (eg, dev), does this code recreate the same model before deploying it in another environment (eg, prod). Or does it skip recreating it and just return a reference to the old model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It currently creates a new model. If we want to do what you said last, we should only add stage to the composed name after the model has been created.

Copy link
Contributor

@tapadipti tapadipti Aug 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referring to this msg - you said that the existing model would be re-used, right? Would removing stage from the model name make this happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would removing stage from the model name make this happen?

Yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll create a PR to remove stage from the model name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR

)


return model.deploy(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo If I am trying to deploy a new model version to an existing stage, does this code create a new endpoint or does it update the existing endpoint? (I need to confirm this for the blog post)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo If I am trying to deploy a new model version to an existing stage, does this code create a new endpoint or does it update the existing endpoint? (I need to confirm this for the blog post)

Not sure I understand the first part but a new endpoint is created from combining the name, version, and stage, so any change to those 3 things result in a new endpoint

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I deploy v1 to prod today, and v2 to prod tomorrow, will I have 2 different endpoints with 2 different endpoint names? Shouldn't the prod endpoint always be the same? (just like studio prod version is always studio.iterative.ai) So that any clients running inference against the prod endpoint don't have to update their endpoint addresses each time there's a new deployment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version shouldn't be part of the endpoint name, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the prod endpoint always be the same?

I think it is a matter of opinion and depends on the pattern we expect/want to showcase:

A) The endpoint is directly queried by external consumers
B) There is some app&url that uses the right endpoint internally without exposing it to external consumers.

(just like studio prod version is always studio.iterative.ai) So that any clients running inference against the prod endpoint don't have to update their endpoint addresses each time there's a new deployment.

That is the user-facing app&url and it only "points" to the correct software because there are gitops bumping the internal studio version (i.e. https://github.com/iterative/itops/pull/2238)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then we should also create a service to delete stale endpoints, right? Else, we will end up with a bunch of endpoints over time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then we should also create a service to delete stale endpoints, right? Else, we will end up with a bunch of endpoints over time.

I would expect that to be configured / handled on the AWS side.

Anyhow, as per your suggestions, I will send a new P.R. creating/updating a single endpoint per stage

@@ -68,6 +68,7 @@ def train():
models_dir = Path("models")
models_dir.mkdir(exist_ok=True)
learn.export(fname=(models_dir / "model.pkl").absolute())
torch.save(learn.model, (models_dir / "model.pth").absolute())
live.log_artifact(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo If we logged the tar file of the model, then we wouldn't need to specify the file name model.tar.gz in the deployment script, right? The deployment script would be able to get the file name from the Git tag itself, which means that a single deployment script could be used for deploying several models with different names. Right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, in the blog post, I show that there's an artifact entry for model.pkl (so the model registry shows model.pkl). But the file that is getting deployed is model.tar.gz. We know that they are the same model, but it still looks like a disconnect between the model registry and the deployment script.

tapadipti added a commit that referenced this pull request Aug 17, 2023
@dberenbaum
Copy link

Agree with @tapadipti that it makes sense to have one endpoint per stage or per version. Otherwise, I think we kind of miss the point of the registry (you can deploy every update to a new endpoint without it). IMO one endpoint per stage makes the most sense to drive home the value of that field, and I think we should focus on this being a self-contained deployment (you can do deployment without needing a separate engineering team to pick up the new model endpoint).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: example-get-started-experiments DVC Experiment, DVCLive examples
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants