diff --git a/audio/speech/getting-started/get_started_with_chirp_2_sdk.ipynb b/audio/speech/getting-started/get_started_with_chirp_2_sdk.ipynb index 4976608f8b..05019d21aa 100644 --- a/audio/speech/getting-started/get_started_with_chirp_2_sdk.ipynb +++ b/audio/speech/getting-started/get_started_with_chirp_2_sdk.ipynb @@ -1,1343 +1,1361 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ur8xi4C7S06n" - }, - "outputs": [], - "source": [ - "# Copyright 2024 Google LLC\n", - "#\n", - "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# https://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "JAPoU8Sm5E6e" - }, - "source": [ - "# Get started with Chirp 2 using Speech-to-Text V2 SDK\n", - "\n", - "
\n",
- "    Open in Colab\n",
- "    Open in Colab Enterprise\n",
- "    Open in Vertex AI Workbench\n",
- "    View on GitHub\n",
- "
{transcription}\"\n", - " display(ipd.HTML(formatted_text))\n", - "\n", - "\n", - "def evaluate_stt(\n", - " actual_transcriptions: list[str],\n", - " reference_transcriptions: list[str],\n", - " audio_sample_file_uri: str = INPUT_LONG_AUDIO_SAMPLE_FILE_URI,\n", - ") -> pd.DataFrame:\n", - " \"\"\"\n", - " Evaluate speech-to-text (STT) transcriptions against reference transcriptions.\n", - " \"\"\"\n", - " audio_uris = [audio_sample_file_uri] * len(actual_transcriptions)\n", - " evaluations = []\n", - " for audio_uri, actual_transcription, reference_transcription in zip(\n", - " audio_uris, actual_transcriptions, reference_transcriptions\n", - " ):\n", - " evaluation = {\n", - " \"audio_uri\": audio_uri,\n", - " \"actual_transcription\": actual_transcription,\n", - " \"reference_transcription\": reference_transcription,\n", - " \"wer\": jiwer.wer(reference_transcription, actual_transcription),\n", - " \"cer\": jiwer.cer(reference_transcription, actual_transcription),\n", - " }\n", - " evaluations.append(evaluation)\n", - "\n", - " evaluations_df = pd.DataFrame(evaluations)\n", - " evaluations_df.reset_index(inplace=True, drop=True)\n", - " return evaluations_df\n", - "\n", - "\n", - "def plot_evaluation_results(\n", - " evaluations_df: pd.DataFrame,\n", - ") -> go.Figure:\n", - " \"\"\"\n", - " Plot the mean Word Error Rate (WER) and Character Error Rate (CER) from the evaluation results.\n", - " \"\"\"\n", - " mean_wer = evaluations_df[\"wer\"].mean()\n", - " mean_cer = evaluations_df[\"cer\"].mean()\n", - "\n", - " trace_means = go.Bar(\n", - " x=[\"WER\", \"CER\"], y=[mean_wer, mean_cer], name=\"Mean Error Rate\"\n", - " )\n", - "\n", - " trace_baseline = go.Scatter(\n", - " x=[\"WER\", \"CER\"], y=[0.5, 0.5], mode=\"lines\", name=\"Baseline (0.5)\"\n", - " )\n", - "\n", - " layout = go.Layout(\n", - " title=\"Speech-to-Text Evaluation Results\",\n", - " xaxis=dict(title=\"Metric\"),\n", - " yaxis=dict(title=\"Error Rate\", range=[0, 1]),\n", - " barmode=\"group\",\n", - " )\n", - "\n", - " fig = go.Figure(data=[trace_means, trace_baseline], layout=layout)\n", - " return fig" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "VPVDNRyVxquo" - }, - "source": [ - "## Transcribe using Chirp 2\n", - "\n", - "You can use Chirp 2 to transcribe audio in Streaming, Online and Batch modes:\n", - "\n", - "* Streaming mode is good for streaming and real-time audio. \n", - "* Online mode is good for short audio < 1 min.\n", - "* Batch mode is good for long audio 1 min to 8 hrs. \n", - "\n", - "In the following sections, you explore how to use the API to transcribe audio in these three different scenarios." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4uTeBXo6dZlS" - }, - "source": [ - "### Read the audio file\n", - "\n", - "Let's start reading the input audio sample you want to transcribe.\n", - "\n", - "In this case, it is a podcast generated with NotebookLM about the \"Attention is all you need\" [paper](https://arxiv.org/abs/1706.03762)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pjzwMWqpdldM" - }, - "outputs": [], - "source": [ - "input_audio_bytes = read_audio_file(INPUT_AUDIO_SAMPLE_FILE_URI)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SyEUpcf12z73" - }, - "source": [ - "### Prepare audio samples\n", - "\n", - "The podcast audio is ~ 8 mins. Depending on the audio length, you can use different transcribe API methods. To learn more, check out the official documentation. 
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TRlgCdED793U" - }, - "source": [ - "#### Prepare a short audio sample (< 1 min)\n", - "\n", - "Extract a short audio sample from the original one for streaming and real-time audio processing." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "r-SYb9_b87BZ" - }, - "outputs": [], - "source": [ - "short_audio_sample_bytes = extract_audio_sample(input_audio_bytes, 30)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "0Hk2OSiSEFrf" - }, - "outputs": [], - "source": [ - "play_audio_sample(short_audio_sample_bytes)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2rPcMe0LvC3q" - }, - "source": [ - "#### Prepare a long audio sample (from 1 min up to 8 hrs)\n", - "\n", - "Extract a longer audio sample from the original one for batch audio processing." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "L44FoygqvHoP" - }, - "outputs": [], - "source": [ - "long_audio_sample_bytes = extract_audio_sample(input_audio_bytes, 120)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Ej2j0FBEvK6s" - }, - "outputs": [], - "source": [ - "play_audio_sample(long_audio_sample_bytes)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6tIbVVe76ML8" - }, - "outputs": [], - "source": [ - "save_audio_sample(long_audio_sample_bytes, INPUT_LONG_AUDIO_SAMPLE_FILE_URI)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "w5qPg2OfFAG9" - }, - "source": [ - "### Perform streaming speech recognition\n", - "\n", - "Let's start performing streaming speech recognition." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aAlIgQSoeDT5" - }, - "source": [ - "#### Prepare the audio stream\n", - "\n", - "To simulate an audio stream, you can create a generator yielding chunks of audio data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "j5SPyum6FMiC" - }, - "outputs": [], - "source": [ - "stream = [\n", - " compress_for_streaming(audio_chuck)\n", - " for audio_chuck in audio_sample_chunk_n(short_audio_sample_bytes, num_chunks=5)\n", - "]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "7dDap26FiKlL" - }, - "outputs": [], - "source": [ - "for s in stream:\n", - " play_audio_sample(s)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9z1XGzpxeAMP" - }, - "source": [ - "#### Prepare the stream request\n", - "\n", - "Once you have your audio stream, you can use the `StreamingRecognizeRequest`class to convert each stream component into a API message." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "IOZNYPrfeW49" - }, - "outputs": [], - "source": [ - "audio_requests = (cloud_speech.StreamingRecognizeRequest(audio=s) for s in stream)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oPbf5rNFecI_" - }, - "source": [ - "#### Define streaming recognition configuration\n", - "\n", - "Next, you define the streaming recognition configuration which allows you to set the model to use, language code of the audio and more." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "32Wz990perAo" - }, - "outputs": [], - "source": [ - "streaming_config = cloud_speech.StreamingRecognitionConfig(\n", - " config=cloud_speech.RecognitionConfig(\n", - " language_codes=[\"en-US\"],\n", - " model=\"chirp_2\",\n", - " features=cloud_speech.RecognitionFeatures(\n", - " enable_automatic_punctuation=True,\n", - " ),\n", - " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zVRyTqhWe2gf" - }, - "source": [ - "#### Define the streaming request configuration\n", - "\n", - "Then, you use the streaming configuration to define the streaming request. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "t5qiUJ48e9i5" - }, - "outputs": [], - "source": [ - "stream_request_config = cloud_speech.StreamingRecognizeRequest(\n", - " streaming_config=streaming_config, recognizer=RECOGNIZER\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "h1d508ScfD9I" - }, - "source": [ - "#### Run the streaming recognition request\n", - "\n", - "Finally, you are able to run the streaming recognition request." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "QCq-iROpfl9t" - }, - "outputs": [], - "source": [ - "def requests(request_config: cloud_speech.RecognitionConfig, s: list) -> list:\n", - " yield request_config\n", - " yield from s\n", - "\n", - "\n", - "response = client.streaming_recognize(\n", - " requests=requests(stream_request_config, audio_requests)\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "d__QUGkWCkGh" - }, - "source": [ - "Here you use a helper function to visualize transcriptions and the associated streams." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "_qWA8jXYuMH3" - }, - "outputs": [], - "source": [ - "streaming_recognize_results = parse_streaming_recognize_response(response)\n", - "streaming_recognize_output = get_recognize_output(\n", - " short_audio_sample_bytes, streaming_recognize_results\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "agk_M0xRwzv0" - }, - "outputs": [], - "source": [ - "for audio_sample_bytes, transcription in streaming_recognize_output:\n", - " print_transcription(audio_sample_bytes, transcription)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oYCgDay2hAgB" - }, - "source": [ - "### Perform real-time speech recognition" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "F83r9aiNhAgD" - }, - "source": [ - "#### Define real-time recognition configuration\n", - "\n", - "As for the streaming transcription, you define the real-time recognition configuration which allows you to set the model to use, language code of the audio and more." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "j0WprZ-phAgD" - }, - "outputs": [], - "source": [ - "real_time_config = cloud_speech.RecognitionConfig(\n", - " language_codes=[\"en-US\"],\n", - " model=\"chirp_2\",\n", - " features=cloud_speech.RecognitionFeatures(\n", - " enable_automatic_punctuation=True,\n", - " ),\n", - " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "r2TqksAqhAgD" - }, - "source": [ - "#### Define the real-time request configuration\n", - "\n", - "Next, you define the real-time request passing the configuration and the audio sample you want to transcribe. Again, you don't need to define a recognizer." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Nh55mSzXhAgD" - }, - "outputs": [], - "source": [ - "real_time_request = cloud_speech.RecognizeRequest(\n", - " recognizer=f\"projects/{PROJECT_ID}/locations/{LOCATION}/recognizers/_\",\n", - " config=real_time_config,\n", - " content=short_audio_sample_bytes,\n", - " recognizer=RECOGNIZER,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "817YXVBli0aY" - }, - "source": [ - "#### Run the real-time recognition request\n", - "\n", - "Finally you submit the real-time recognition request." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "rc0cBrVsi7UG" - }, - "outputs": [], - "source": [ - "response = client.recognize(request=real_time_request)\n", - "\n", - "real_time_recognize_results = parse_real_time_recognize_response(response)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "J2vpMSv7CZ_2" - }, - "source": [ - "And you use a helper function to visualize transcriptions and the associated streams." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ezH51rLH4CBR" - }, - "outputs": [], - "source": [ - "for transcription, _ in real_time_recognize_results:\n", - " print_transcription(short_audio_sample_bytes, transcription)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5M-lIwRJ43EC" - }, - "source": [ - "### Perform batch speech recognition" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LJxhFSg848MO" - }, - "source": [ - "#### Define batch recognition configuration\n", - "\n", - "You start defining the batch recognition configuration." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "0CEQUL5_5BT-" - }, - "outputs": [], - "source": [ - "batch_recognition_config = cloud_speech.RecognitionConfig(\n", - " language_codes=[\"en-US\"],\n", - " model=\"chirp_2\",\n", - " features=cloud_speech.RecognitionFeatures(\n", - " enable_automatic_punctuation=True,\n", - " ),\n", - " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SKf3pMBl5E4f" - }, - "source": [ - "#### Set the audio file you want to transcribe\n", - "\n", - "For the batch transcription, you need the audio be staged in a Cloud Storage bucket. Then you set the associated metadata to pass in the batch recognition request." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "o1VCvEEI5MkG" - }, - "outputs": [], - "source": [ - "audio_metadata = cloud_speech.BatchRecognizeFileMetadata(\n", - " uri=INPUT_LONG_AUDIO_SAMPLE_FILE_URI\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5HOKZLp25yFB" - }, - "source": [ - "#### Define batch recognition request\n", - "\n", - "Next, you define the batch recognition request. Notice how you define a recognition output configuration which allows you to determine how would you retrieve the resulting transcription outcome." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "SItkaX7tyZ14" - }, - "outputs": [], - "source": [ - "batch_recognition_request = cloud_speech.BatchRecognizeRequest(\n", - " config=batch_recognition_config,\n", - " files=[audio_metadata],\n", - " recognition_output_config=cloud_speech.RecognitionOutputConfig(\n", - " inline_response_config=cloud_speech.InlineOutputConfig(),\n", - " ),\n", - " recognizer=RECOGNIZER,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "YQY1eqaY7H0n" - }, - "source": [ - "#### Run the batch recognition request\n", - "\n", - "Finally you submit the batch recognition request which is a [long-running operation](https://google.aip.dev/151) as you see below." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "AlZwRlLo6F1p" - }, - "outputs": [], - "source": [ - "operation = client.batch_recognize(request=batch_recognition_request)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "DrqsNzVmeWu0" - }, - "outputs": [], - "source": [ - "while True:\n", - " if not operation.done():\n", - " print(\"Waiting for operation to complete...\")\n", - " time.sleep(5)\n", - " else:\n", - " print(\"Operation completed.\")\n", - " break" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "B9MEScw7FYAf" - }, - "source": [ - "After the operation finishes, you can retrieve the result as shown below." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pjObiPUweZYA" - }, - "outputs": [], - "source": [ - "response = operation.result()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "31cMuwXZFdgI" - }, - "source": [ - "And visualize transcriptions using a helper function." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "d0eMjC3Kmo-5" - }, - "outputs": [], - "source": [ - "batch_recognize_results = parse_batch_recognize_response(\n", - " response, audio_sample_file_uri=INPUT_LONG_AUDIO_SAMPLE_FILE_URI\n", - ")\n", - "batch_recognize_output = get_recognize_output(\n", - " long_audio_sample_bytes, batch_recognize_results\n", - ")\n", - "for audio_sample_bytes, transcription in batch_recognize_output:\n", - " print_transcription(audio_sample_bytes, transcription)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "teU52ISxqUQd" - }, - "source": [ - "### Evaluate transcriptions\n", - "\n", - "Finally, you may want to evaluate Chirp transcriptions. To do so, you can use [JiWER](https://github.com/jitsi/jiwer), a simple and fast Python package which supports several metrics. In this tutorial, you use:\n", - "\n", - "- **WER (Word Error Rate)** which is the most common metric. 
WER is the number of word edits (insertions, deletions, substitutions) needed to change the recognized text to match the reference text, divided by the total number of words in the reference text.\n", - "- **CER (Character Error Rate)** which is the number of character edits (insertions, deletions, substitutions) needed to change the recognized text to match the reference text, divided by the total number of characters in the reference text." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "q1u3g4LnqX6z" - }, - "outputs": [], - "source": [ - "actual_transcriptions = [t for _, t in batch_recognize_output]\n", - "reference_transcriptions = [\n", - " \"\"\"Okay, so, you know, everyone's been talking about AI lately, right? Writing poems, like nailing those tricky emails, even building websites and all you need is a few, what do they call it again? Prompts? Yeah, it's wild. These AI tools are suddenly everywhere. It's hard to keep up. Seriously. But here's the thing, a lot of this AI stuff we're seeing, it all goes back to this one research paper from way back in 2017. Attention is all you need. So, today we're doing a deep dive into the core of it. The engine that's kind of driving\"\"\",\n", - " \"\"\"all this change. The Transformer. It's funny, right? This super technical paper, I mean, it really did change how we think about AI and how it uses language. Totally. It's like it, I don't know, cracked a code or something. So, before we get into the transformer, we need to like paint that before picture. Can you take us back to how AI used to deal with language before this whole transformer thing came along? Okay. So, imagine this. You're trying to understand a story, but you can only read like one word at a time. Ouch. Right. And not only that, but you also\"\"\",\n", - " \"\"\"have to like remember every single word you read before just to understand the word you're on right now. That sounds so frustrating, like trying to get a movie by looking at one pixel at a time. Exactly. And that's basically how old AI models used to work. RNNs, recurrent neural networks, they processed language one word after the other, which, you can imagine, was super slow and not that great at handling how, you know, language actually works. So, like remembering how the start of a sentence connects\"\"\",\n", - " \"\"\"to the end or how something that happens at the beginning of a book affects what happens later on. That was really tough for older AI. Totally. It's like trying to get a joke by only remembering the punch line. You miss all the important stuff, all that context. Okay, yeah. I'm starting to see why this paper was such a big deal. So how did \"Attention Is All You Need\" change everything? What's so special about this Transformer thing? Well, I mean, even the title is a good hint, right? It's all about attention. This paper introduced self-attention. Basically, it's how the\"\"\",\n", - "]\n", - "\n", - "evaluation_df = evaluate_stt(actual_transcriptions, reference_transcriptions)\n", - "plot_evaluation_results(evaluation_df)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2a4e033321ad" - }, - "source": [ - "## Cleaning up" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5bsE-XtXzmpR" - }, - "outputs": [], - "source": [ - "delete_bucket = False\n", - "\n", - "if delete_bucket:\n", - " ! 
gsutil rm -r $BUCKET_URI" - ] - } - ], - "metadata": { - "colab": { - "name": "get_started_with_chirp_2_sdk.ipynb", - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ur8xi4C7S06n" + }, + "outputs": [], + "source": [ + "# Copyright 2024 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JAPoU8Sm5E6e" + }, + "source": [ + "# Get started with Chirp 2 using Speech-to-Text V2 SDK\n", + "\n", + "
\n",
+ "    Open in Colab\n",
+ "    Open in Colab Enterprise\n",
+ "    Open in Vertex AI Workbench\n",
+ "    View on GitHub\n",
+ "
{transcription}\"\n", + " display(ipd.HTML(formatted_text))\n", + "\n", + "\n", + "def evaluate_stt(\n", + " actual_transcriptions: list[str],\n", + " reference_transcriptions: list[str],\n", + " audio_sample_file_uri: str = INPUT_LONG_AUDIO_SAMPLE_FILE_URI,\n", + ") -> pd.DataFrame:\n", + " \"\"\"\n", + " Evaluate speech-to-text (STT) transcriptions against reference transcriptions.\n", + " \"\"\"\n", + " audio_uris = [audio_sample_file_uri] * len(actual_transcriptions)\n", + " evaluations = []\n", + " for audio_uri, actual_transcription, reference_transcription in zip(\n", + " audio_uris, actual_transcriptions, reference_transcriptions\n", + " ):\n", + " evaluation = {\n", + " \"audio_uri\": audio_uri,\n", + " \"actual_transcription\": actual_transcription,\n", + " \"reference_transcription\": reference_transcription,\n", + " \"wer\": jiwer.wer(reference_transcription, actual_transcription),\n", + " \"cer\": jiwer.cer(reference_transcription, actual_transcription),\n", + " }\n", + " evaluations.append(evaluation)\n", + "\n", + " evaluations_df = pd.DataFrame(evaluations)\n", + " evaluations_df.reset_index(inplace=True, drop=True)\n", + " return evaluations_df\n", + "\n", + "\n", + "def plot_evaluation_results(\n", + " evaluations_df: pd.DataFrame,\n", + ") -> go.Figure:\n", + " \"\"\"\n", + " Plot the mean Word Error Rate (WER) and Character Error Rate (CER) from the evaluation results.\n", + " \"\"\"\n", + " mean_wer = evaluations_df[\"wer\"].mean()\n", + " mean_cer = evaluations_df[\"cer\"].mean()\n", + "\n", + " trace_means = go.Bar(\n", + " x=[\"WER\", \"CER\"], y=[mean_wer, mean_cer], name=\"Mean Error Rate\"\n", + " )\n", + "\n", + " trace_baseline = go.Scatter(\n", + " x=[\"WER\", \"CER\"], y=[0.5, 0.5], mode=\"lines\", name=\"Baseline (0.5)\"\n", + " )\n", + "\n", + " layout = go.Layout(\n", + " title=\"Speech-to-Text Evaluation Results\",\n", + " xaxis=dict(title=\"Metric\"),\n", + " yaxis=dict(title=\"Error Rate\", range=[0, 1]),\n", + " barmode=\"group\",\n", + " )\n", + "\n", + " fig = go.Figure(data=[trace_means, trace_baseline], layout=layout)\n", + " return fig" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VPVDNRyVxquo" + }, + "source": [ + "## Transcribe using Chirp 2\n", + "\n", + "You can use Chirp 2 to transcribe audio in Streaming, Online and Batch modes:\n", + "\n", + "* Streaming mode is good for streaming and real-time audio. \n", + "* Online mode is good for short audio < 1 min.\n", + "* Batch mode is good for long audio 1 min to 8 hrs. \n", + "\n", + "In the following sections, you explore how to use the API to transcribe audio in these three different scenarios." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4uTeBXo6dZlS" + }, + "source": [ + "### Read the audio file\n", + "\n", + "Let's start reading the input audio sample you want to transcribe.\n", + "\n", + "In this case, it is a podcast generated with NotebookLM about the \"Attention is all you need\" [paper](https://arxiv.org/abs/1706.03762)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pjzwMWqpdldM" + }, + "outputs": [], + "source": [ + "input_audio_bytes = read_audio_file(INPUT_AUDIO_SAMPLE_FILE_URI)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SyEUpcf12z73" + }, + "source": [ + "### Prepare audio samples\n", + "\n", + "The podcast audio is ~ 8 mins. Depending on the audio length, you can use different transcribe API methods. To learn more, check out the official documentation. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TRlgCdED793U" + }, + "source": [ + "#### Prepare a short audio sample (< 1 min)\n", + "\n", + "Extract a short audio sample from the original one for streaming and real-time audio processing." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "r-SYb9_b87BZ" + }, + "outputs": [], + "source": [ + "short_audio_sample_bytes = extract_audio_sample(input_audio_bytes, 30)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0Hk2OSiSEFrf" + }, + "outputs": [], + "source": [ + "play_audio_sample(short_audio_sample_bytes)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2rPcMe0LvC3q" + }, + "source": [ + "#### Prepare a long audio sample (from 1 min up to 8 hrs)\n", + "\n", + "Extract a longer audio sample from the original one for batch audio processing." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "L44FoygqvHoP" + }, + "outputs": [], + "source": [ + "long_audio_sample_bytes = extract_audio_sample(input_audio_bytes, 120)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ej2j0FBEvK6s" + }, + "outputs": [], + "source": [ + "play_audio_sample(long_audio_sample_bytes)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6tIbVVe76ML8" + }, + "outputs": [], + "source": [ + "save_audio_sample(long_audio_sample_bytes, INPUT_LONG_AUDIO_SAMPLE_FILE_URI)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w5qPg2OfFAG9" + }, + "source": [ + "### Perform streaming speech recognition\n", + "\n", + "Let's start performing streaming speech recognition." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aAlIgQSoeDT5" + }, + "source": [ + "#### Prepare the audio stream\n", + "\n", + "To simulate an audio stream, you can create a generator yielding chunks of audio data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "j5SPyum6FMiC" + }, + "outputs": [], + "source": [ + "stream = [\n", + " compress_for_streaming(audio_chuck)\n", + " for audio_chuck in audio_sample_chunk_n(short_audio_sample_bytes, num_chunks=5)\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7dDap26FiKlL" + }, + "outputs": [], + "source": [ + "for s in stream:\n", + " play_audio_sample(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9z1XGzpxeAMP" + }, + "source": [ + "#### Prepare the stream request\n", + "\n", + "Once you have your audio stream, you can use the `StreamingRecognizeRequest`class to convert each stream component into a API message." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IOZNYPrfeW49" + }, + "outputs": [], + "source": [ + "audio_requests = (cloud_speech.StreamingRecognizeRequest(audio=s) for s in stream)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oPbf5rNFecI_" + }, + "source": [ + "#### Define streaming recognition configuration\n", + "\n", + "Next, you define the streaming recognition configuration which allows you to set the model to use, language code of the audio and more." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "32Wz990perAo" + }, + "outputs": [], + "source": [ + "streaming_config = cloud_speech.StreamingRecognitionConfig(\n", + " config=cloud_speech.RecognitionConfig(\n", + " language_codes=[\"en-US\"],\n", + " model=\"chirp_2\",\n", + " features=cloud_speech.RecognitionFeatures(\n", + " enable_automatic_punctuation=True,\n", + " ),\n", + " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zVRyTqhWe2gf" + }, + "source": [ + "#### Define the streaming request configuration\n", + "\n", + "Then, you use the streaming configuration to define the streaming request. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "t5qiUJ48e9i5" + }, + "outputs": [], + "source": [ + "stream_request_config = cloud_speech.StreamingRecognizeRequest(\n", + " streaming_config=streaming_config, recognizer=RECOGNIZER\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h1d508ScfD9I" + }, + "source": [ + "#### Run the streaming recognition request\n", + "\n", + "Finally, you are able to run the streaming recognition request." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QCq-iROpfl9t" + }, + "outputs": [], + "source": [ + "def requests(request_config: cloud_speech.RecognitionConfig, s: list) -> list:\n", + " yield request_config\n", + " yield from s\n", + "\n", + "\n", + "response = client.streaming_recognize(\n", + " requests=requests(stream_request_config, audio_requests)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d__QUGkWCkGh" + }, + "source": [ + "Here you use a helper function to visualize transcriptions and the associated streams." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_qWA8jXYuMH3" + }, + "outputs": [], + "source": [ + "streaming_recognize_results = parse_streaming_recognize_response(response)\n", + "streaming_recognize_output = get_recognize_output(\n", + " short_audio_sample_bytes, streaming_recognize_results\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "agk_M0xRwzv0" + }, + "outputs": [], + "source": [ + "for audio_sample_bytes, transcription in streaming_recognize_output:\n", + " print_transcription(audio_sample_bytes, transcription)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oYCgDay2hAgB" + }, + "source": [ + "### Perform real-time speech recognition" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F83r9aiNhAgD" + }, + "source": [ + "#### Define real-time recognition configuration\n", + "\n", + "As for the streaming transcription, you define the real-time recognition configuration which allows you to set the model to use, language code of the audio and more." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "j0WprZ-phAgD" + }, + "outputs": [], + "source": [ + "real_time_config = cloud_speech.RecognitionConfig(\n", + " language_codes=[\"en-US\"],\n", + " model=\"chirp_2\",\n", + " features=cloud_speech.RecognitionFeatures(\n", + " enable_automatic_punctuation=True,\n", + " ),\n", + " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r2TqksAqhAgD" + }, + "source": [ + "#### Define the real-time request configuration\n", + "\n", + "Next, you define the real-time request passing the configuration and the audio sample you want to transcribe. Again, you don't need to define a recognizer." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Nh55mSzXhAgD" + }, + "outputs": [], + "source": [ + "real_time_request = cloud_speech.RecognizeRequest(\n", + " config=real_time_config,\n", + " content=short_audio_sample_bytes,\n", + " recognizer=RECOGNIZER,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "817YXVBli0aY" + }, + "source": [ + "#### Run the real-time recognition request\n", + "\n", + "Finally you submit the real-time recognition request." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rc0cBrVsi7UG" + }, + "outputs": [], + "source": [ + "response = client.recognize(request=real_time_request)\n", + "\n", + "real_time_recognize_results = parse_real_time_recognize_response(response)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J2vpMSv7CZ_2" + }, + "source": [ + "And you use a helper function to visualize transcriptions and the associated streams." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ezH51rLH4CBR" + }, + "outputs": [], + "source": [ + "for transcription, _ in real_time_recognize_results:\n", + " print_transcription(short_audio_sample_bytes, transcription)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5M-lIwRJ43EC" + }, + "source": [ + "### Perform batch speech recognition" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LJxhFSg848MO" + }, + "source": [ + "#### Define batch recognition configuration\n", + "\n", + "You start defining the batch recognition configuration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0CEQUL5_5BT-" + }, + "outputs": [], + "source": [ + "batch_recognition_config = cloud_speech.RecognitionConfig(\n", + " language_codes=[\"en-US\"],\n", + " model=\"chirp_2\",\n", + " features=cloud_speech.RecognitionFeatures(\n", + " enable_automatic_punctuation=True,\n", + " ),\n", + " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SKf3pMBl5E4f" + }, + "source": [ + "#### Set the audio file you want to transcribe\n", + "\n", + "For the batch transcription, you need the audio be staged in a Cloud Storage bucket. Then you set the associated metadata to pass in the batch recognition request." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "o1VCvEEI5MkG" + }, + "outputs": [], + "source": [ + "audio_metadata = cloud_speech.BatchRecognizeFileMetadata(\n", + " uri=INPUT_LONG_AUDIO_SAMPLE_FILE_URI\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5HOKZLp25yFB" + }, + "source": [ + "#### Define batch recognition request\n", + "\n", + "Next, you define the batch recognition request. Notice how you define a recognition output configuration which allows you to determine how would you retrieve the resulting transcription outcome." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SItkaX7tyZ14" + }, + "outputs": [], + "source": [ + "batch_recognition_request = cloud_speech.BatchRecognizeRequest(\n", + " config=batch_recognition_config,\n", + " files=[audio_metadata],\n", + " recognition_output_config=cloud_speech.RecognitionOutputConfig(\n", + " inline_response_config=cloud_speech.InlineOutputConfig(),\n", + " ),\n", + " recognizer=RECOGNIZER,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQY1eqaY7H0n" + }, + "source": [ + "#### Run the batch recognition request\n", + "\n", + "Finally you submit the batch recognition request which is a [long-running operation](https://google.aip.dev/151) as you see below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AlZwRlLo6F1p" + }, + "outputs": [], + "source": [ + "operation = client.batch_recognize(request=batch_recognition_request)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DrqsNzVmeWu0" + }, + "outputs": [], + "source": [ + "while True:\n", + " if not operation.done():\n", + " print(\"Waiting for operation to complete...\")\n", + " time.sleep(5)\n", + " else:\n", + " print(\"Operation completed.\")\n", + " break" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B9MEScw7FYAf" + }, + "source": [ + "After the operation finishes, you can retrieve the result as shown below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pjObiPUweZYA" + }, + "outputs": [], + "source": [ + "response = operation.result()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "31cMuwXZFdgI" + }, + "source": [ + "And visualize transcriptions using a helper function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d0eMjC3Kmo-5" + }, + "outputs": [], + "source": [ + "batch_recognize_results = parse_batch_recognize_response(\n", + " response, audio_sample_file_uri=INPUT_LONG_AUDIO_SAMPLE_FILE_URI\n", + ")\n", + "batch_recognize_output = get_recognize_output(\n", + " long_audio_sample_bytes, batch_recognize_results\n", + ")\n", + "for audio_sample_bytes, transcription in batch_recognize_output:\n", + " print_transcription(audio_sample_bytes, transcription)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "teU52ISxqUQd" + }, + "source": [ + "### Evaluate transcriptions\n", + "\n", + "Finally, you may want to evaluate Chirp transcriptions. To do so, you can use [JiWER](https://github.com/jitsi/jiwer), a simple and fast Python package which supports several metrics. In this tutorial, you use:\n", + "\n", + "- **WER (Word Error Rate)** which is the most common metric. 
WER is the number of word edits (insertions, deletions, substitutions) needed to change the recognized text to match the reference text, divided by the total number of words in the reference text.\n", + "- **CER (Character Error Rate)** which is the number of character edits (insertions, deletions, substitutions) needed to change the recognized text to match the reference text, divided by the total number of characters in the reference text." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "q1u3g4LnqX6z" + }, + "outputs": [], + "source": [ + "actual_transcriptions = [t for _, t in batch_recognize_output]\n", + "reference_transcriptions = [\n", + " \"\"\"Okay, so, you know, everyone's been talking about AI lately, right? Writing poems, like nailing those tricky emails, even building websites and all you need is a few, what do they call it again? Prompts? Yeah, it's wild. These AI tools are suddenly everywhere. It's hard to keep up. Seriously. But here's the thing, a lot of this AI stuff we're seeing, it all goes back to this one research paper from way back in 2017. Attention is all you need. So, today we're doing a deep dive into the core of it. The engine that's kind of driving\"\"\",\n", + " \"\"\"all this change. The Transformer. It's funny, right? This super technical paper, I mean, it really did change how we think about AI and how it uses language. Totally. It's like it, I don't know, cracked a code or something. So, before we get into the transformer, we need to like paint that before picture. Can you take us back to how AI used to deal with language before this whole transformer thing came along? Okay. So, imagine this. You're trying to understand a story, but you can only read like one word at a time. Ouch. Right. And not only that, but you also\"\"\",\n", + " \"\"\"have to like remember every single word you read before just to understand the word you're on right now. That sounds so frustrating, like trying to get a movie by looking at one pixel at a time. Exactly. And that's basically how old AI models used to work. RNNs, recurrent neural networks, they processed language one word after the other, which, you can imagine, was super slow and not that great at handling how, you know, language actually works. So, like remembering how the start of a sentence connects\"\"\",\n", + " \"\"\"to the end or how something that happens at the beginning of a book affects what happens later on. That was really tough for older AI. Totally. It's like trying to get a joke by only remembering the punch line. You miss all the important stuff, all that context. Okay, yeah. I'm starting to see why this paper was such a big deal. So how did \"Attention Is All You Need\" change everything? What's so special about this Transformer thing? Well, I mean, even the title is a good hint, right? It's all about attention. This paper introduced self-attention. Basically, it's how the\"\"\",\n", + "]\n", + "\n", + "evaluation_df = evaluate_stt(actual_transcriptions, reference_transcriptions)\n", + "plot_evaluation_results(evaluation_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2a4e033321ad" + }, + "source": [ + "## Cleaning up" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5bsE-XtXzmpR" + }, + "outputs": [], + "source": [ + "delete_bucket = False\n", + "\n", + "if delete_bucket:\n", + " ! 
gsutil rm -r $BUCKET_URI" + ] + } + ], + "metadata": { + "colab": { + "name": "get_started_with_chirp_2_sdk.ipynb", + "toc_visible": true + }, + "environment": { + "kernel": "python3", + "name": "tf2-cpu.2-11.m125", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/tf2-cpu.2-11:m125" + }, + "kernelspec": { + "display_name": "Python 3 (Local)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 4 }
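
The evaluation cells in this notebook lean on `jiwer.wer` and `jiwer.cer`. As a quick sanity check of what those metrics report, here is a minimal sketch with a made-up reference/hypothesis pair; it assumes only that `jiwer` is installed, as in the notebook's setup, and the strings are illustrative, not taken from the audio sample:

```python
import jiwer

# Hypothetical reference transcript and recognized text, for illustration only.
reference = "attention is all you need"
hypothesis = "attention is all we need"

# WER: word-level edits divided by reference words
# -> 1 substitution out of 5 words = 0.2
print(jiwer.wer(reference, hypothesis))

# CER: the same idea at character level (spaces count as characters).
print(jiwer.cer(reference, hypothesis))
```

Reading the notebook's evaluation plot is then straightforward: the closer both bars sit to 0, the closer the Chirp 2 transcriptions are to the reference transcripts.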