
[BFCL] Adding Llama-3.1-Storm-8B model handler #598

Open · wants to merge 3 commits into base: main
Conversation

akshita-sukhlecha

Llama-3.1-Storm-8B was recently released. This PR adds a model handler for Llama-3.1-Storm-8B.
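For context, adding a model to BFCL generally means contributing a handler that formats prompts for the model and decodes its output into function calls. The sketch below only outlines that idea; the class and method names are placeholders, not the actual BFCL handler interface used in this PR.

# Illustrative only: placeholder class and method names, not the real BFCL interface.
import json

class LlamaStormHandlerSketch:
    """Rough shape of a model handler: build the prompt, then parse the model's reply."""

    def __init__(self, model_name="Llama-3.1-Storm-8B"):
        self.model_name = model_name

    def format_prompt(self, user_query, tools):
        # Embed the available functions in the prompt, as in the PR's <tools> template.
        return (
            "Here are the available functions:\n"
            f"<tools>{json.dumps(tools)}</tools>\n\n"
            f"User: {user_query}"
        )

    def decode_output(self, raw_text):
        # Parse the model's tool-call output; the real output format is model-specific.
        try:
            return json.loads(raw_text)
        except json.JSONDecodeError:
            return raw_text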

@akshita-sukhlecha akshita-sukhlecha changed the title Adding Llama-3.1-Storm-8B model handler [BFCL] Adding Llama-3.1-Storm-8B model handler Aug 23, 2024
@akshita-sukhlecha
Author

@HuanzhiMao (Just a gentle reminder) Could you help review this?

Here are the available functions:
<tools>{tools}</tools>

Follow the below guidelines:
Collaborator

These guidelines are mentioned in the 'Prompt Format for Function Calling' section in the model documentation. They need to be aligned.

Author

Llama-3.1-Storm-8B is not trained on negative data. Since the model has strong instruction-following (IFEval) capability, we added two instructions to compensate for the lack of negative data in training. This is also consistent with our evaluation prompt in the blog post: https://huggingface.co/blog/akjindal53244/llama31-storm8b#bfcl
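For illustration, the prompt assembly being discussed looks roughly like the sketch below. The guideline strings are placeholders standing in for the two extra instructions, not the exact text in the PR.

import json

# Placeholder guideline strings illustrating the idea; the actual two instructions
# added in the PR may differ.
EXTRA_GUIDELINES = [
    "If none of the provided functions can answer the query, say so.",
    "If required parameters are missing from the query, point that out.",
]

def build_system_prompt(tools):
    guidelines = "\n".join(f"- {g}" for g in EXTRA_GUIDELINES)
    return (
        "Here are the available functions:\n"
        f"<tools>{json.dumps(tools)}</tools>\n\n"
        "Follow the below guidelines:\n"
        f"{guidelines}"
    )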

Collaborator

Hi @akshita-sukhlecha,

Thank you for the clarification. To ensure consistency and fairness across all models, it's important that the model handler uses the same prompt format as outlined in its official documentation.

In this case, while you have a specialized prompt optimized for BFCL categories, it would be more equitable to use the general-purpose prompt specified in the documentation for a uniform comparison with other models.

cc @ShishirPatil

@@ -865,7 +865,11 @@ def get_movie_rating(movie_name):
url = "http://www.omdbapi.com/"
params = {"t": movie_name, "apikey": api_key["OMDB-API-KEY"]}
response = requests.get(url, params=params)
return response.json()["Rated"]
Collaborator
@CharlieJCJ Aug 25, 2024

We don't need to add a try/except here, since there is already a try/except outside, in the executable category (https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/eval_checker/checker.py#L702-L716), that handles the case where "Rated" is not a key.

And the API is expected to return a dict with a "Rated" field, like the following; if that field is missing, it would be caught by the API sanity check, indicating that the provider changed the API response.

>>> url = "http://www.omdbapi.com/"
>>> params = {"t": "Avatar", "apikey": api_key["OMDB-API-KEY"]}
>>> response = requests.get(url, params=params)
>>> response.json()
{'Title': 'Avatar', 'Year': '2009', 'Rated': 'PG-13', 'Released': '18 Dec 2009', 'Runtime': '162 min', 'Genre': 'Action, Adventure, Fantasy', 'Director': 'James Cameron', 'Writer': 'James Cameron', 'Actors': 'Sam Worthington, Zoe Saldana, Sigourney Weaver', 'Plot': 'A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his orders and protecting the world he feels is his home.', 'Language': 'English, Spanish', 'Country': 'United States, United Kingdom', 'Awards': 'Won 3 Oscars. 91 wins & 131 nominations total', 'Poster': 'https://m.media-amazon.com/images/M/MV5BZDA0OGQxNTItMDZkMC00N2UyLTg3MzMtYTJmNjg3Nzk5MzRiXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg', 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '7.9/10'}, {'Source': 'Rotten Tomatoes', 'Value': '81%'}, {'Source': 'Metacritic', 'Value': '83/100'}], 'Metascore': '83', 'imdbRating': '7.9', 'imdbVotes': '1,397,142', 'imdbID': 'tt0499549', 'Type': 'movie', 'DVD': 'N/A', 'BoxOffice': '$785,221,649', 'Production': 'N/A', 'Website': 'N/A', 'Response': 'True'}
>>> response.json()["Rated"]
'PG-13'
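Schematically, the outer handling being referred to works something like the sketch below; the names are illustrative and this is not the exact checker.py code.

def run_executable_call_sketch(func_source, call_str):
    """Run a model-generated call against the live API; any exception, including a
    missing 'Rated' key, becomes a failed result instead of crashing the run."""
    namespace = {}
    try:
        exec(func_source, namespace)        # defines e.g. get_movie_rating
        result = eval(call_str, namespace)  # executes the model's call
        return {"valid": True, "result": result}
    except Exception as exc:                # KeyError, network error, etc.
        return {"valid": False, "error": repr(exc)}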

Author

I was getting a KeyError when running the evaluation. Here are the logs:

  File "/gorilla/berkeley-function-call-leaderboard/eval_checker/eval_runner.py", line 495, in <module>
    runner(model_names, test_categories, api_sanity_check)
  File "/gorilla/berkeley-function-call-leaderboard/eval_checker/eval_runner.py", line 384, in runner
    get_executable_expected_output(prompt_file)
  File "/gorilla/berkeley-function-call-leaderboard/eval_checker/eval_runner_helper.py", line 908, in get_executable_expected_output
    exec(
  File "<string>", line 2, in <module>
  File "/gorilla/berkeley-function-call-leaderboard/eval_checker/executable_python_function.py", line 880, in get_movie_director
    return response.json()["Director"]
KeyError: 'Director'

Collaborator
@CharlieJCJ Aug 26, 2024

Hi @akshita-sukhlecha, thanks for letting us know about this error and for providing the log. We want to raise a separate issue to investigate this further; the error should be caught in the get_executable_expected_output function, not at the API level. The API functions should be free to raise exceptions if the requests.get() JSON output does not have the expected keys.

Can you revert these changes to the API in this PR, and just provide the model handler? Thanks!
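For illustration, the kind of follow-up fix being described would catch failures around the evaluation of each ground-truth call rather than inside each API wrapper. This is a placeholder sketch with made-up names, not the planned implementation.

def get_executable_expected_output_sketch(ground_truth_calls, namespace):
    # Placeholder sketch: catch per-call failures here instead of inside each API wrapper.
    expected, errors = [], []
    for call in ground_truth_calls:
        try:
            expected.append(eval(call, namespace))
        except Exception as exc:  # e.g. KeyError: 'Director' from a changed API response
            expected.append(None)
            errors.append((call, repr(exc)))
    if errors:
        # Surface the affected calls so an API change is visible in one place,
        # without adding try/except to every executable function.
        print("Expected-output generation failed for:", errors)
    return expected, errors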

Author

Thanks @CharlieJCJ. I have reverted the changes that handled exceptions at the API level.

The task of handling exceptions properly will be a blocker for evaluating this model. So, should we merge these changes after that task is completed?

One more thing to note: there are already 10 instances in the executable_python_function.py file where exceptions are handled at the API level, e.g. https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/eval_checker/executable_python_function.py#L577 (see the illustrative sketch below).
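For reference, the API-level pattern referred to looks roughly like this; it is an illustrative reconstruction, not the code at the linked line.

import requests

def get_movie_rating_guarded(movie_name, api_key):
    # Same request as the real function, but the missing-key case is absorbed
    # here at the API level instead of being raised to the caller.
    url = "http://www.omdbapi.com/"
    params = {"t": movie_name, "apikey": api_key["OMDB-API-KEY"]}
    response = requests.get(url, params=params)
    try:
        return response.json()["Rated"]
    except KeyError:
        return "Error: 'Rated' field not found in the API response"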

Collaborator
@CharlieJCJ Aug 27, 2024

I understand that correctly evaluating the model in your local run depends on this fix. However, the exception handling in the get_executable_expected_output logic affects all models, so we want to review it thoroughly (related to #563). We can merge your model handler PR first, and then merge the PR that addresses the error handling (with API definition fixes), since that one affects all models' scores; they are fairly separate concerns.

For context, this error hadn't shown up in our internal testing for a long time (over a couple of months), and we have also been actively evaluating recently using the OMDb key without hitting this API exception.

There are already 10 instances in the executable_python_function.py file where exceptions are being handled at API level.

Thanks for raising this; we also noticed it. Yes, we should be consistent with the other functions, and we will review all executable Python functions to make sure there are no false negatives during evaluation.


I was getting KeyError when running the evaluation. Here are the logs:

  File "/gorilla/berkeley-function-call-leaderboard/eval_checker/eval_runner.py", line 495, in <module>
    runner(model_names, test_categories, api_sanity_check)
  File "/gorilla/berkeley-function-call-leaderboard/eval_checker/eval_runner.py", line 384, in runner
    get_executable_expected_output(prompt_file)
  File "/gorilla/berkeley-function-call-leaderboard/eval_checker/eval_runner_helper.py", line 908, in get_executable_expected_output
    exec(
  File "<string>", line 2, in <module>
  File "/gorilla/berkeley-function-call-leaderboard/eval_checker/executable_python_function.py", line 880, in get_movie_director
    return response.json()["Director"]
KeyError: 'Director'

I faced the same issue. It was resolved by updating the omdbapi key.

@@ -877,7 +881,11 @@ def get_movie_director(movie_name):
url = "http://www.omdbapi.com/"
params = {"t": movie_name, "apikey": api_key["OMDB-API-KEY"]}
response = requests.get(url, params=params)
return response.json()["Director"]
Collaborator

Same comment as above.

@HuanzhiMao HuanzhiMao added the BFCL-New Model Add New Model to BFCL label Aug 27, 2024
@ShishirPatil
Owner

Hey @akshita-sukhlecha and @Kishore-bluekyte, is this still relevant? If so, once you resolve the comments, I'd be happy to go ahead and merge it.
