GREAT IDEA!!! We have this in the works right now. Keep your eyes peeled!!
openai_evals, mistral_evals, etc., and the datasets on Hugging Face all seem to be focused on much simpler use cases than the ones here.

A different application, just to illustrate another testing approach, is benchmarking; e.g. on https://aider.chat/blog/ we see a lot of analysis across different benchmarks.

I can imagine that with Skyvern this would be challenging, but it might be interesting to create a template repository with a few example tests and benchmarks (and/or datasets on Hugging Face; judging by mistral_evals, their GitHub repo seems to fetch Hugging Face datasets).
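Just to make the idea concrete, here is a rough sketch of what an eval runner in such a template repo could look like. Everything Skyvern-specific is a placeholder: the dataset name `skyvern/browser-tasks` and the `run_task()` helper are hypothetical, and only `datasets.load_dataset()` is the real Hugging Face API.

```python
# Minimal sketch of an eval runner for a hypothetical template repo.
# Assumptions (not real artifacts): the dataset name "skyvern/browser-tasks",
# the run_task() helper, and the dataset field names are all placeholders;
# only datasets.load_dataset() is a real library call.
from datasets import load_dataset


def run_task(url: str, goal: str) -> dict:
    """Placeholder for invoking a Skyvern run (e.g. via its HTTP API)."""
    raise NotImplementedError


def evaluate() -> float:
    tasks = load_dataset("skyvern/browser-tasks", split="test")  # hypothetical dataset
    passed = 0
    for task in tasks:
        result = run_task(task["url"], task["navigation_goal"])
        # Compare the run's extracted output against the expected value
        # stored in the dataset row (field names are assumptions).
        if result.get("extracted_information") == task["expected_output"]:
            passed += 1
    return passed / len(tasks)


if __name__ == "__main__":
    print(f"pass rate: {evaluate():.1%}")
```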
Assuming those could get big, it might make sense for different Skyvern users to fork the repo and add tests on their forks, and the Skyvern team could later run all of them as a benchmark?

Just an idea, I know it's a lot of work, but collecting evals and datasets for Skyvern sounds very interesting!
Maybe some UI changes too, e.g. allowing users to share a trace with the Skyvern team, either from an app.skyvern.com run or a local run, as a tar file?
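For the tar-file idea, a minimal sketch of what exporting a local run could look like; the `./artifacts/<run_id>` layout is an assumption about where a local run keeps its trace, and only `tarfile` itself is the Python standard library.

```python
# Rough sketch: bundle a local run's artifacts into a shareable tarball.
# The ./artifacts/<run_id> layout is an assumption about where a local
# Skyvern run stores its trace; tarfile is the Python stdlib.
import tarfile
from pathlib import Path


def export_trace(run_id: str, artifacts_root: Path = Path("./artifacts")) -> Path:
    run_dir = artifacts_root / run_id
    out_path = Path(f"{run_id}_trace.tar.gz")
    with tarfile.open(out_path, "w:gz") as tar:
        # Store files relative to the run id so the archive unpacks cleanly.
        tar.add(run_dir, arcname=run_id)
    return out_path


if __name__ == "__main__":
    print(export_trace("tsk_example_run"))  # hypothetical run id
```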
Or maybe UI buttons for labeling steps/actions in a fast, convenient way (maybe even navigating and annotating with keyboard shortcuts), so that users could get a copy of the annotated data for themselves and, of course, share it with the team if they wanted?
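And purely to illustrate what the annotated data coming out of such a labeling UI might look like, a hypothetical record shape (none of the field names come from Skyvern's actual schema):

```python
# Hypothetical shape of one annotated step; none of these field names
# come from Skyvern's actual schema, they just illustrate the idea.
from dataclasses import dataclass, asdict
import json


@dataclass
class StepAnnotation:
    run_id: str
    step_index: int
    action_type: str   # e.g. "click", "input_text"
    label: str         # e.g. "correct", "wrong_element", "hallucinated"
    note: str = ""     # free-form comment from the annotator


if __name__ == "__main__":
    ann = StepAnnotation("tsk_example_run", 3, "click", "wrong_element", "clicked ad banner")
    print(json.dumps(asdict(ann), indent=2))
```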