dvclive: update how to track results #4674

Merged: 10 commits, Aug 8, 2023
Changes from 6 commits
96 changes: 64 additions & 32 deletions content/docs/dvclive/how-it-works.md
@@ -73,43 +73,54 @@ model.pt.dvc

## Track the results

DVCLive expects each run to be tracked by Git, so it will save each run to the
same path and overwrite the results each time. Include
### Git integration

Unlike other experiment trackers, DVCLive relies on Git to track the [directory]
Member

My 2cs: I think "Track the results" could start with more basic material, something more people can relate to and understand quickly.

1. That we can track them in VS Code and Studio.
2. Maybe ways to compare experiments, or just experiments, or tracking experiments. That is where we can go into the Git concept to a certain degree, large files, etc. (even though I still think we need

The biggest issue with this explanation is that people don't expect it, and most likely can't even understand why we put it here until they hit some issues.

Maybe another idea: "DVCLive vs other trackers: important workflow details".

Collaborator Author

Renamed from "Track the results" to "Git and DVC integration" and introduced it by explaining that this differentiates it from other experiment trackers.

it generates, so it will save each run to the same path and overwrite the
results each time. DVCLive uses Git to manage results, code changes, and data
changes ([with DVC](#track-large-artifacts-with-dvc)). Include
[`save_dvc_exp=True`](/doc/dvclive/live#parameters) to auto-track as a <abbr>DVC
experiment</abbr>. DVC experiments are Git commits that DVC can find but that
don't clutter your Git history or create extra branches.
experiment</abbr> so you don't need to worry about manually making Git commits
or branches for each experiment. You can recover them using `dvc exp` commands
or using Git.

### Track large artifacts with DVC

Models and data are often large and aren't easily tracked in Git.
`Live.log_artifact("model.pt", type="model")` will
[cache](/doc/start/data-management/data-versioning) the `model.pt` file with DVC
and make Git ignore it. It will generate a `model.pt.dvc` metadata file, which
can be tracked in Git and becomes part of the experiment. With this metadata
file, you can [retrieve](/doc/start/data-management/data-versioning#retrieving)
the versioned artifact from the Git commit.

If `Live` was initialized with `dvcyaml=True` (which is the default) and you
include values for any of the optional metadata arguments, this will add an
[artifact](/doc/user-guide/project-structure/dvcyaml-files#artifacts) to the
corresponding `dvc.yaml`. Passing `type="model"` will mark it as a `model` for
DVC and will also show it in
[Studio Model Registry](/doc/studio/user-guide/model-registry/what-is-a-model-registry).
`Live.log_artifact("model.pt")` will [cache] the `model.pt` file with DVC and
make Git ignore it. It will generate a `model.pt.dvc` metadata file, which can
be tracked in Git and becomes part of the experiment. With this metadata file,
you can [retrieve](/doc/start/data-management/data-versioning#retrieving) the
versioned artifact from the Git commit. You can also use
`Live.log_artifact("model.pt", type="model")` to add it to the [Studio Model
Registry].

Using `Live.log_image()` to log multiple images may also grow too large to track
with Git, in which case you can use
[`Live(cache_images=True)`](/doc/dvclive/live#parameters) to cache them.
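
As a rough illustration of the calls described in this hunk, here is a minimal Python sketch rather than the documented reference usage. `log_run` and `main` are hypothetical names, the loss values are placeholders, and `model.pt` is assumed to have been written by your training code before it is logged.

```python
def log_run(live, model_path="model.pt", epochs=3):
    """Log placeholder metrics and a model file through a `Live`-like object."""
    for epoch in range(epochs):
        # Placeholder metric; a real training loop would compute this value.
        live.log_metric("loss", 1.0 / (epoch + 1))
        live.next_step()
    # Caches the file with DVC and generates `model.pt.dvc`;
    # type="model" also adds it to the Studio Model Registry.
    live.log_artifact(model_path, type="model")


def main():
    # Imported here so log_run can be exercised without DVCLive installed.
    from dvclive import Live

    # save_dvc_exp=True auto-tracks the run as a DVC experiment;
    # cache_images=True caches images logged with log_image().
    with Live(save_dvc_exp=True, cache_images=True) as live:
        log_run(live)
```

Calling `main()` inside a Git repository initialized with DVC would exercise the real `Live` object. `log_run` itself only needs an object that provides `log_metric`, `next_step`, and `log_artifact`.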

### Run with DVC
### Customize with DVC
Member

That's probably also a bit too much? Even if we keep it, should it be part of Run with DVC?

Collaborator Author

Moved to part of Run with DVC and consolidated slightly.


DVCLive by default [generates] its own `dvc.yaml` file to configure the
experiment results, but you can create your own `dvc.yaml` file to customize
your project. For example, to define a [pipeline](#run-with-dvc) or
[customize plots](/doc/user-guide/experiment-management/visualizing-plots#defining-plots).
Do not reuse the DVCLive `dvc.yaml` file since it gets overwritten during each
experiment run. Instead, write customizations to a new `dvc.yaml` file at the
base of your repository or elsewhere outside the DVCLive directory.

## Run with DVC

Experimenting in Python interactively (like in notebooks) is great for
Member

Are there any other benefits?

Collaborator Author

There are more benefits listed later in the paragraph.

Member

Yep, that's fine. It's just a bit abstract to me (as an end user). I mean the "more structured way to run
reproducible experiments" part, and the parallelized hyperparameter search jumps right into the advanced case. Again, I'm paying a lot of attention to this because I expect the readers of this won't be DVC users, and not necessarily even advanced Git users. There should be a story that uses their language and terminology as much as possible. Sorry, Dave, for all these iterations. No intent to block it; I'm fine to merge it any time since it's already an improvement.

Collaborator Author

Changed the examples here from parallelized hyperparameter search to multi-step pipeline or queueing multiple experiments.

exploration, but eventually you may need a more structured way to run
reproducible experiments (for example, running a parallelized hyperparameter
search). By configuring DVC <abbr>pipelines</abbr>, you can
search). By configuring DVC [pipelines], you can
[run experiments](/doc/user-guide/experiment-management/running-experiments)
with `dvc exp run`.
with `dvc exp run`. This will track the inputs and outputs of your code, and
also enable features like queuing, parameter tuning, and grid searches.

You can configure a pipeline stage in `dvc.yaml` like:
You can configure a pipeline stage in your own `dvc.yaml` file at the base of
the repository (see [Customize with DVC](#customize-with-dvc)):

```yaml
stages:
@@ -121,21 +132,42 @@ stages:
- model.pt
```

Add this pipeline stage into `dvc.yaml`, modifying it to fit your project. Then,
run it with `dvc exp run`. This will track the inputs and outputs of your code,
and also enable features like queuing, parameter tuning, and grid searches.

<admon type="warn">
<admon type="tip">

Add to a `dvc.yaml` file at the base of your repository. Do not use
`dvclive/dvc.yaml` since DVCLive will overwrite it during each run.
You may have previously tracked [outputs] with `Live.log_artifact()` that
generated a `.dvc` file like `model.pt.dvc`. DVC will not allow you to also add
`model.pt` as a pipeline [output][outputs] since it is already tracked by
`model.pt.dvc`. You must `dvc remove model.pt.dvc` before you can add it to the
pipeline. You can optionally drop `Live.log_artifact()` from your code.

</admon>

<admon type="tip">
Optionally add any subpaths of the DVCLive [directory] to the [outputs]. DVC
will [cache] them by default, and you can use those paths as [dependencies]
downstream in your pipeline. For example, to cache all DVCLive plots:

```diff
stages:
dvclive:
cmd: <python my_code_file.py my_args>
deps:
- <my_code_file.py>
outs:
- model.pt
+ - dvclive/plots
```

If you already have a `.dvc` file like `model.pt.dvc`, DVC will not allow you to
also track `model.pt` in `dvc.yaml`. You must `dvc remove model.pt.dvc` before
you can add it to `dvc.yaml`.
<admon type="warn">

Do not add the entire DVCLive [directory] since DVC does not expect the DVCLive
`dvc.yaml` file to be inside the [outputs].

</admon>
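
The two constraints on pipeline outputs above (no output that already has a `.dvc` file, and not the entire DVCLive directory) could be checked before editing `dvc.yaml`. The sketch below is a hypothetical helper, not part of DVC or DVCLive; `check_outs` and its parameters are assumptions, and it assumes the default `dvclive` directory name.

```python
from pathlib import Path


def check_outs(outs, repo_root=".", dvclive_dir="dvclive"):
    """Return messages for outputs that conflict with the notes above."""
    root = Path(repo_root)
    problems = []
    for out in outs:
        # An output already tracked by `<out>.dvc` cannot also be a stage output.
        dvc_file = out.rstrip("/") + ".dvc"
        if (root / dvc_file).exists():
            problems.append(f"{out}: run `dvc remove {dvc_file}` first")
        # Subpaths such as dvclive/plots are fine; the whole directory is not.
        if Path(out) == Path(dvclive_dir):
            problems.append(f"{out}: add a subpath instead of the whole DVCLive directory")
    return problems
```

For example, `check_outs(["model.pt", "dvclive/plots"])` run at the base of the repository returns an empty list when `model.pt.dvc` does not exist, and a single message about `dvc remove model.pt.dvc` when it does.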

[directory]: /doc/dvclive/how-it-works#directory-structure
[studio model registry]: /doc/studio/user-guide/model-registry
[cache]: /doc/start/data-management/data-versioning
[outputs]: /doc/user-guide/pipelines/defining-pipelines#outputs
[dependencies]: /doc/user-guide/pipelines/defining-pipelines#simple-dependencies
[pipelines]: /doc/start/experiments/experiment-pipelines
[generates]: /doc/dvclive/live/make_dvcyaml
2 changes: 1 addition & 1 deletion content/docs/dvclive/live/log_artifact.md
@@ -46,7 +46,7 @@ include any of the optional metadata fields (`type`, `name`, `desc`, `labels`,
[artifact](/doc/user-guide/project-structure/dvcyaml-files#artifacts) and all
the metadata passed as arguments to the corresponding `dvc.yaml`. Passing
`type="model"` will mark it as a `model` for DVC and will make it appear in
[Studio Model Registry](/doc/studio).
[Studio Model Registry](/doc/studio/user-guide/model-registry).

## Parameters
