diff --git a/audio/speech/getting-started/get_started_with_chirp_2_sdk.ipynb b/audio/speech/getting-started/get_started_with_chirp_2_sdk.ipynb index 4976608f8b..05019d21aa 100644 --- a/audio/speech/getting-started/get_started_with_chirp_2_sdk.ipynb +++ b/audio/speech/getting-started/get_started_with_chirp_2_sdk.ipynb @@ -1,1343 +1,1361 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ur8xi4C7S06n" - }, - "outputs": [], - "source": [ - "# Copyright 2024 Google LLC\n", - "#\n", - "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# https://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "JAPoU8Sm5E6e" - }, - "source": [ - "# Get started with Chirp 2 using Speech-to-Text V2 SDK\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \"Google
Open in Colab\n", - "
\n", - "
\n", - " \n", - " \"Google
Open in Colab Enterprise\n", - "
\n", - "
\n", - " \n", - " \"Vertex
Open in Vertex AI Workbench\n", - "
\n", - "
\n", - " \n", - " \"GitHub
View on GitHub\n", - "
\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "84f0f73a0f76" - }, - "source": [ - "| | |\n", - "|-|-|\n", - "| Author(s) | [Ivan Nardini](https://github.com/inardini) |" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tvgnzT1CKxrO" - }, - "source": [ - "## Overview\n", - "\n", - "In this tutorial, you learn about how to use Chirp 2, the latest generation of Google's multilingual ASR-specific models.\n", - "\n", - "Chirp 2 improves upon the original Chirp model in accuracy and speed, as well as expanding into key new features like word-level timestamps, model adaptation, and speech translation." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "61RBz8LLbxCR" - }, - "source": [ - "## Get started" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "No17Cw5hgx12" - }, - "source": [ - "### Install Speech-to-Text SDK and other required packages\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "e73_ZgKWYedz" - }, - "outputs": [], - "source": [ - "! apt update -y -qq\n", - "! apt install ffmpeg -y -qq" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "tFy3H3aPgx12" - }, - "outputs": [], - "source": [ - "%pip install --quiet 'google-cloud-speech' 'protobuf<4.21' 'google-auth==2.27.0' 'pydub' 'etils' 'jiwer' 'ffmpeg-python' 'plotly'" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "R5Xep4W9lq-Z" - }, - "source": [ - "### Restart runtime\n", - "\n", - "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.\n", - "\n", - "The restart might take a minute or longer. After it's restarted, continue to the next step." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "XRvKdaPDTznN" - }, - "outputs": [], - "source": [ - "import IPython\n", - "\n", - "app = IPython.Application.instance()\n", - "app.kernel.do_shutdown(True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SbmM4z7FOBpM" - }, - "source": [ - "
\n", - "⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "dmWOrTJ3gx13" - }, - "source": [ - "### Authenticate your notebook environment (Colab only)\n", - "\n", - "If you're running this notebook on Google Colab, run the cell below to authenticate your environment." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "NyKGtVQjgx13" - }, - "outputs": [], - "source": [ - "import sys\n", - "\n", - "if \"google.colab\" in sys.modules:\n", - " from google.colab import auth\n", - "\n", - " auth.authenticate_user()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DF4l8DTdWgPY" - }, - "source": [ - "### Set Google Cloud project information and initialize Speech-to-Text V2 SDK\n", - "\n", - "To get started using the Speech-to-Text API, you must have an existing Google Cloud project and [enable the Speech-to-Text API](https://console.cloud.google.com/flows/enableapi?apiid=speech.googleapis.com).\n", - "\n", - "Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "WIQyBhAn_9tK" - }, - "outputs": [], - "source": [ - "import os\n", - "\n", - "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\", isTemplate: true}\n", - "\n", - "if PROJECT_ID == \"[your-project-id]\":\n", - " PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n", - "\n", - "LOCATION = os.environ.get(\"GOOGLE_CLOUD_REGION\", \"us-central1\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Nqwi-5ufWp_B" - }, - "outputs": [], - "source": [ - "from google.api_core.client_options import ClientOptions\n", - "from google.cloud.speech_v2 import SpeechClient\n", - "\n", - "API_ENDPOINT = f\"{LOCATION}-speech.googleapis.com\"\n", - "\n", - "client = SpeechClient(\n", - " client_options=ClientOptions(\n", - " api_endpoint=API_ENDPOINT,\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zgPO1eR3CYjk" - }, - "source": [ - "### Create a Cloud Storage bucket\n", - "\n", - "Create a storage bucket to store intermediate artifacts such as datasets." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "MzGDU7TWdts_" - }, - "outputs": [], - "source": [ - "BUCKET_NAME = \"your-bucket-name-unique\" # @param {type:\"string\", isTemplate: true}\n", - "\n", - "BUCKET_URI = f\"gs://{BUCKET_NAME}\" # @param {type:\"string\"}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-EcIXiGsCePi" - }, - "source": [ - "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "NIq7R4HZCfIc" - }, - "outputs": [], - "source": [ - "! 
gsutil mb -l $LOCATION -p $PROJECT_ID $BUCKET_URI" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5303c05f7aa6" - }, - "source": [ - "### Import libraries" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6fc324893334" - }, - "outputs": [], - "source": [ - "from google.cloud.speech_v2.types import cloud_speech" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "qqm0OQpAYCph" - }, - "outputs": [], - "source": [ - "import io\n", - "import os\n", - "import subprocess\n", - "import time\n", - "\n", - "import IPython.display as ipd\n", - "from etils import epath as ep\n", - "import jiwer\n", - "import pandas as pd\n", - "import plotly.graph_objs as go\n", - "from pydub import AudioSegment" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sP8GBj3tBAC1" - }, - "source": [ - "### Set constants" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "rXTVeU1uBBqY" - }, - "outputs": [], - "source": [ - "INPUT_AUDIO_SAMPLE_FILE_URI = (\n", - " \"gs://github-repo/audio_ai/speech_recognition/attention_is_all_you_need_podcast.wav\"\n", - ")\n", - "INPUT_LONG_AUDIO_SAMPLE_FILE_URI = (\n", - " f\"{BUCKET_URI}/speech_recognition/data/long_audio_sample.wav\"\n", - ")\n", - "\n", - "RECOGNIZER = client.recognizer_path(PROJECT_ID, LOCATION, \"_\")\n", - "\n", - "MAX_CHUNK_SIZE = 25600" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "djgFxrGC_Ykd" - }, - "source": [ - "### Helpers" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Zih8W_wC_caW" - }, - "outputs": [], - "source": [ - "def read_audio_file(audio_file_path: str) -> bytes:\n", - " \"\"\"\n", - " Read audio file as bytes.\n", - " \"\"\"\n", - " if audio_file_path.startswith(\"gs://\"):\n", - " with ep.Path(audio_file_path).open(\"rb\") as f:\n", - " audio_bytes = f.read()\n", - " else:\n", - " with open(audio_file_path, \"rb\") as f:\n", - " audio_bytes = f.read()\n", - " return audio_bytes\n", - "\n", - "\n", - "def save_audio_sample(audio_bytes: bytes, output_file_uri: str) -> None:\n", - " \"\"\"\n", - " Save audio sample as a file in Google Cloud Storage.\n", - " \"\"\"\n", - "\n", - " output_file_path = ep.Path(output_file_uri)\n", - " if not output_file_path.parent.exists():\n", - " output_file_path.parent.mkdir(parents=True, exist_ok=True)\n", - "\n", - " with output_file_path.open(\"wb\") as f:\n", - " f.write(audio_bytes)\n", - "\n", - "\n", - "def extract_audio_sample(audio_bytes: bytes, duration: int) -> bytes:\n", - " \"\"\"\n", - " Extracts a random audio sample of a given duration from an audio file.\n", - " \"\"\"\n", - " audio = AudioSegment.from_file(io.BytesIO(audio_bytes))\n", - " start_time = 0\n", - " audio_sample = audio[start_time : start_time + duration * 1000]\n", - "\n", - " audio_bytes = io.BytesIO()\n", - " audio_sample.export(audio_bytes, format=\"wav\")\n", - " audio_bytes.seek(0)\n", - "\n", - " return audio_bytes.read()\n", - "\n", - "\n", - "def play_audio_sample(audio_bytes: bytes) -> None:\n", - " \"\"\"\n", - " Plays the audio sample in a notebook.\n", - " \"\"\"\n", - " audio_file = io.BytesIO(audio_bytes)\n", - " ipd.display(ipd.Audio(audio_file.read(), rate=44100))\n", - "\n", - "\n", - "def audio_sample_chunk_n(audio_bytes: bytes, num_chunks: int) -> list[bytes]:\n", - " \"\"\"\n", - " Chunks an audio sample into a specified number of chunks and returns a list of bytes for each chunk.\n", - " \"\"\"\n", - " audio = 
AudioSegment.from_file(io.BytesIO(audio_bytes))\n", - " total_duration = len(audio)\n", - " chunk_duration = total_duration // num_chunks\n", - "\n", - " chunks = []\n", - " start_time = 0\n", - "\n", - " for _ in range(num_chunks):\n", - " end_time = min(start_time + chunk_duration, total_duration)\n", - " chunk = audio[start_time:end_time]\n", - "\n", - " audio_bytes_chunk = io.BytesIO()\n", - " chunk.export(audio_bytes_chunk, format=\"wav\")\n", - " audio_bytes_chunk.seek(0)\n", - " chunks.append(audio_bytes_chunk.read())\n", - "\n", - " start_time = end_time\n", - "\n", - " return chunks\n", - "\n", - "\n", - "def audio_sample_merge(audio_chunks: list[bytes]) -> bytes:\n", - " \"\"\"\n", - " Merges a list of audio chunks into a single audio sample.\n", - " \"\"\"\n", - " audio = AudioSegment.empty()\n", - " for chunk in audio_chunks:\n", - " audio += AudioSegment.from_file(io.BytesIO(chunk))\n", - "\n", - " audio_bytes = io.BytesIO()\n", - " audio.export(audio_bytes, format=\"wav\")\n", - " audio_bytes.seek(0)\n", - "\n", - " return audio_bytes.read()\n", - "\n", - "\n", - "def compress_for_streaming(audio_bytes: bytes) -> bytes:\n", - " \"\"\"\n", - " Compresses audio bytes for streaming using ffmpeg, ensuring the output size is under 25600 bytes.\n", - " \"\"\"\n", - "\n", - " # Temporary file to store original audio\n", - " with open(\"temp_original.wav\", \"wb\") as f:\n", - " f.write(audio_bytes)\n", - "\n", - " # Initial compression attempt with moderate bitrate\n", - " bitrate = \"32k\"\n", - " subprocess.run(\n", - " [\n", - " \"ffmpeg\",\n", - " \"-i\",\n", - " \"temp_original.wav\",\n", - " \"-b:a\",\n", - " bitrate,\n", - " \"-y\",\n", - " \"temp_compressed.mp3\",\n", - " ]\n", - " )\n", - "\n", - " # Check if compressed size is within limit\n", - " compressed_size = os.path.getsize(\"temp_compressed.mp3\")\n", - " if compressed_size <= 25600:\n", - " with open(\"temp_compressed.mp3\", \"rb\") as f:\n", - " compressed_audio_bytes = f.read()\n", - " else:\n", - " # If too large, reduce bitrate and retry\n", - " while compressed_size > 25600:\n", - " bitrate = str(int(bitrate[:-1]) - 8) + \"k\" # Reduce bitrate by 8kbps\n", - " subprocess.run(\n", - " [\n", - " \"ffmpeg\",\n", - " \"-i\",\n", - " \"temp_original.wav\",\n", - " \"-b:a\",\n", - " bitrate,\n", - " \"-y\",\n", - " \"temp_compressed.mp3\",\n", - " ]\n", - " )\n", - " compressed_size = os.path.getsize(\"temp_compressed.mp3\")\n", - "\n", - " with open(\"temp_compressed.mp3\", \"rb\") as f:\n", - " compressed_audio_bytes = f.read()\n", - "\n", - " # Clean up temporary files\n", - " os.remove(\"temp_original.wav\")\n", - " os.remove(\"temp_compressed.mp3\")\n", - "\n", - " return compressed_audio_bytes\n", - "\n", - "\n", - "def parse_streaming_recognize_response(response) -> list[tuple[str, int]]:\n", - " \"\"\"Parse streaming responses from the Speech-to-Text API\"\"\"\n", - " streaming_recognize_results = []\n", - " for r in response:\n", - " for result in r.results:\n", - " streaming_recognize_results.append(\n", - " (result.alternatives[0].transcript, result.result_end_offset)\n", - " )\n", - " return streaming_recognize_results\n", - "\n", - "\n", - "def parse_real_time_recognize_response(response) -> list[tuple[str, int]]:\n", - " \"\"\"Parse real-time responses from the Speech-to-Text API\"\"\"\n", - " real_time_recognize_results = []\n", - " for result in response.results:\n", - " real_time_recognize_results.append(\n", - " (result.alternatives[0].transcript, result.result_end_offset)\n", - " )\n", - " return 
real_time_recognize_results\n", - "\n", - "\n", - "def parse_batch_recognize_response(\n", - " response, audio_sample_file_uri: str = INPUT_LONG_AUDIO_SAMPLE_FILE_URI\n", - ") -> list[tuple[str, int]]:\n", - " \"\"\"Parse batch responses from the Speech-to-Text API\"\"\"\n", - " batch_recognize_results = []\n", - " for result in response.results[\n", - " audio_sample_file_uri\n", - " ].inline_result.transcript.results:\n", - " batch_recognize_results.append(\n", - " (result.alternatives[0].transcript, result.result_end_offset)\n", - " )\n", - " return batch_recognize_results\n", - "\n", - "\n", - "def get_recognize_output(\n", - " audio_bytes: bytes, recognize_results: list[tuple[str, int]]\n", - ") -> list[tuple[bytes, str]]:\n", - " \"\"\"\n", - " Get the output of recognize results, handling 0 timedelta and ensuring no overlaps or gaps.\n", - " \"\"\"\n", - " audio = AudioSegment.from_file(io.BytesIO(audio_bytes))\n", - " recognize_output = []\n", - " start_time = 0\n", - "\n", - " initial_end_time = recognize_results[0][1].total_seconds() * 1000\n", - "\n", - " # This loop handles the streaming case where result timestamps might be zero.\n", - " if initial_end_time == 0:\n", - " for i, (transcript, timedelta) in enumerate(recognize_results):\n", - " if i < len(recognize_results) - 1:\n", - " # Use the next timedelta if available\n", - " next_end_time = recognize_results[i + 1][1].total_seconds() * 1000\n", - " end_time = next_end_time\n", - " else:\n", - " next_end_time = len(audio)\n", - " end_time = next_end_time\n", - "\n", - " # Ensure no gaps between chunks\n", - " chunk = audio[start_time:end_time]\n", - " chunk_bytes = io.BytesIO()\n", - " chunk.export(chunk_bytes, format=\"wav\")\n", - " chunk_bytes.seek(0)\n", - " recognize_output.append((chunk_bytes.read(), transcript))\n", - "\n", - " # Set start_time for the next iteration\n", - " start_time = end_time\n", - " else:\n", - " for i, (transcript, timedelta) in enumerate(recognize_results):\n", - " # Calculate end_time in milliseconds\n", - " end_time = timedelta.total_seconds() * 1000\n", - "\n", - " # Ensure no gaps between chunks\n", - " chunk = audio[start_time:end_time]\n", - " chunk_bytes = io.BytesIO()\n", - " chunk.export(chunk_bytes, format=\"wav\")\n", - " chunk_bytes.seek(0)\n", - " recognize_output.append((chunk_bytes.read(), transcript))\n", - "\n", - " # Set start_time for the next iteration\n", - " start_time = end_time\n", - "\n", - " return recognize_output\n", - "\n", - "\n", - "def print_transcription(audio_sample_bytes: bytes, transcription: str) -> None:\n", - " \"\"\"Prettify the play of the audio and the associated print of the transcription text in a notebook\"\"\"\n", - "\n", - " # Play the audio sample\n", - " display(ipd.HTML(\"Audio:\"))\n", - " play_audio_sample(audio_sample_bytes)\n", - " display(ipd.HTML(\"
\"))\n", - "\n", - " # Display the transcription text\n", - " display(ipd.HTML(\"Transcription:\"))\n", - " formatted_text = f\"
{transcription}
\"\n", - " display(ipd.HTML(formatted_text))\n", - "\n", - "\n", - "def evaluate_stt(\n", - " actual_transcriptions: list[str],\n", - " reference_transcriptions: list[str],\n", - " audio_sample_file_uri: str = INPUT_LONG_AUDIO_SAMPLE_FILE_URI,\n", - ") -> pd.DataFrame:\n", - " \"\"\"\n", - " Evaluate speech-to-text (STT) transcriptions against reference transcriptions.\n", - " \"\"\"\n", - " audio_uris = [audio_sample_file_uri] * len(actual_transcriptions)\n", - " evaluations = []\n", - " for audio_uri, actual_transcription, reference_transcription in zip(\n", - " audio_uris, actual_transcriptions, reference_transcriptions\n", - " ):\n", - " evaluation = {\n", - " \"audio_uri\": audio_uri,\n", - " \"actual_transcription\": actual_transcription,\n", - " \"reference_transcription\": reference_transcription,\n", - " \"wer\": jiwer.wer(reference_transcription, actual_transcription),\n", - " \"cer\": jiwer.cer(reference_transcription, actual_transcription),\n", - " }\n", - " evaluations.append(evaluation)\n", - "\n", - " evaluations_df = pd.DataFrame(evaluations)\n", - " evaluations_df.reset_index(inplace=True, drop=True)\n", - " return evaluations_df\n", - "\n", - "\n", - "def plot_evaluation_results(\n", - " evaluations_df: pd.DataFrame,\n", - ") -> go.Figure:\n", - " \"\"\"\n", - " Plot the mean Word Error Rate (WER) and Character Error Rate (CER) from the evaluation results.\n", - " \"\"\"\n", - " mean_wer = evaluations_df[\"wer\"].mean()\n", - " mean_cer = evaluations_df[\"cer\"].mean()\n", - "\n", - " trace_means = go.Bar(\n", - " x=[\"WER\", \"CER\"], y=[mean_wer, mean_cer], name=\"Mean Error Rate\"\n", - " )\n", - "\n", - " trace_baseline = go.Scatter(\n", - " x=[\"WER\", \"CER\"], y=[0.5, 0.5], mode=\"lines\", name=\"Baseline (0.5)\"\n", - " )\n", - "\n", - " layout = go.Layout(\n", - " title=\"Speech-to-Text Evaluation Results\",\n", - " xaxis=dict(title=\"Metric\"),\n", - " yaxis=dict(title=\"Error Rate\", range=[0, 1]),\n", - " barmode=\"group\",\n", - " )\n", - "\n", - " fig = go.Figure(data=[trace_means, trace_baseline], layout=layout)\n", - " return fig" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "VPVDNRyVxquo" - }, - "source": [ - "## Transcribe using Chirp 2\n", - "\n", - "You can use Chirp 2 to transcribe audio in Streaming, Online and Batch modes:\n", - "\n", - "* Streaming mode is good for streaming and real-time audio. \n", - "* Online mode is good for short audio < 1 min.\n", - "* Batch mode is good for long audio 1 min to 8 hrs. \n", - "\n", - "In the following sections, you explore how to use the API to transcribe audio in these three different scenarios." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4uTeBXo6dZlS" - }, - "source": [ - "### Read the audio file\n", - "\n", - "Let's start reading the input audio sample you want to transcribe.\n", - "\n", - "In this case, it is a podcast generated with NotebookLM about the \"Attention is all you need\" [paper](https://arxiv.org/abs/1706.03762)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pjzwMWqpdldM" - }, - "outputs": [], - "source": [ - "input_audio_bytes = read_audio_file(INPUT_AUDIO_SAMPLE_FILE_URI)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SyEUpcf12z73" - }, - "source": [ - "### Prepare audio samples\n", - "\n", - "The podcast audio is ~ 8 mins. Depending on the audio length, you can use different transcribe API methods. To learn more, check out the official documentation. 
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TRlgCdED793U" - }, - "source": [ - "#### Prepare a short audio sample (< 1 min)\n", - "\n", - "Extract a short audio sample from the original one for streaming and real-time audio processing." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "r-SYb9_b87BZ" - }, - "outputs": [], - "source": [ - "short_audio_sample_bytes = extract_audio_sample(input_audio_bytes, 30)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "0Hk2OSiSEFrf" - }, - "outputs": [], - "source": [ - "play_audio_sample(short_audio_sample_bytes)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2rPcMe0LvC3q" - }, - "source": [ - "#### Prepare a long audio sample (from 1 min up to 8 hrs)\n", - "\n", - "Extract a longer audio sample from the original one for batch audio processing." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "L44FoygqvHoP" - }, - "outputs": [], - "source": [ - "long_audio_sample_bytes = extract_audio_sample(input_audio_bytes, 120)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Ej2j0FBEvK6s" - }, - "outputs": [], - "source": [ - "play_audio_sample(long_audio_sample_bytes)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6tIbVVe76ML8" - }, - "outputs": [], - "source": [ - "save_audio_sample(long_audio_sample_bytes, INPUT_LONG_AUDIO_SAMPLE_FILE_URI)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "w5qPg2OfFAG9" - }, - "source": [ - "### Perform streaming speech recognition\n", - "\n", - "Let's start performing streaming speech recognition." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aAlIgQSoeDT5" - }, - "source": [ - "#### Prepare the audio stream\n", - "\n", - "To simulate an audio stream, you can create a generator yielding chunks of audio data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "j5SPyum6FMiC" - }, - "outputs": [], - "source": [ - "stream = [\n", - " compress_for_streaming(audio_chuck)\n", - " for audio_chuck in audio_sample_chunk_n(short_audio_sample_bytes, num_chunks=5)\n", - "]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "7dDap26FiKlL" - }, - "outputs": [], - "source": [ - "for s in stream:\n", - " play_audio_sample(s)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9z1XGzpxeAMP" - }, - "source": [ - "#### Prepare the stream request\n", - "\n", - "Once you have your audio stream, you can use the `StreamingRecognizeRequest`class to convert each stream component into a API message." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "IOZNYPrfeW49" - }, - "outputs": [], - "source": [ - "audio_requests = (cloud_speech.StreamingRecognizeRequest(audio=s) for s in stream)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oPbf5rNFecI_" - }, - "source": [ - "#### Define streaming recognition configuration\n", - "\n", - "Next, you define the streaming recognition configuration which allows you to set the model to use, language code of the audio and more." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "32Wz990perAo" - }, - "outputs": [], - "source": [ - "streaming_config = cloud_speech.StreamingRecognitionConfig(\n", - " config=cloud_speech.RecognitionConfig(\n", - " language_codes=[\"en-US\"],\n", - " model=\"chirp_2\",\n", - " features=cloud_speech.RecognitionFeatures(\n", - " enable_automatic_punctuation=True,\n", - " ),\n", - " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zVRyTqhWe2gf" - }, - "source": [ - "#### Define the streaming request configuration\n", - "\n", - "Then, you use the streaming configuration to define the streaming request. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "t5qiUJ48e9i5" - }, - "outputs": [], - "source": [ - "stream_request_config = cloud_speech.StreamingRecognizeRequest(\n", - " streaming_config=streaming_config, recognizer=RECOGNIZER\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "h1d508ScfD9I" - }, - "source": [ - "#### Run the streaming recognition request\n", - "\n", - "Finally, you are able to run the streaming recognition request." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "QCq-iROpfl9t" - }, - "outputs": [], - "source": [ - "def requests(request_config: cloud_speech.RecognitionConfig, s: list) -> list:\n", - " yield request_config\n", - " yield from s\n", - "\n", - "\n", - "response = client.streaming_recognize(\n", - " requests=requests(stream_request_config, audio_requests)\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "d__QUGkWCkGh" - }, - "source": [ - "Here you use a helper function to visualize transcriptions and the associated streams." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "_qWA8jXYuMH3" - }, - "outputs": [], - "source": [ - "streaming_recognize_results = parse_streaming_recognize_response(response)\n", - "streaming_recognize_output = get_recognize_output(\n", - " short_audio_sample_bytes, streaming_recognize_results\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "agk_M0xRwzv0" - }, - "outputs": [], - "source": [ - "for audio_sample_bytes, transcription in streaming_recognize_output:\n", - " print_transcription(audio_sample_bytes, transcription)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oYCgDay2hAgB" - }, - "source": [ - "### Perform real-time speech recognition" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "F83r9aiNhAgD" - }, - "source": [ - "#### Define real-time recognition configuration\n", - "\n", - "As for the streaming transcription, you define the real-time recognition configuration which allows you to set the model to use, language code of the audio and more." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "j0WprZ-phAgD" - }, - "outputs": [], - "source": [ - "real_time_config = cloud_speech.RecognitionConfig(\n", - " language_codes=[\"en-US\"],\n", - " model=\"chirp_2\",\n", - " features=cloud_speech.RecognitionFeatures(\n", - " enable_automatic_punctuation=True,\n", - " ),\n", - " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "r2TqksAqhAgD" - }, - "source": [ - "#### Define the real-time request configuration\n", - "\n", - "Next, you define the real-time request passing the configuration and the audio sample you want to transcribe. Again, you don't need to define a recognizer." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Nh55mSzXhAgD" - }, - "outputs": [], - "source": [ - "real_time_request = cloud_speech.RecognizeRequest(\n", - " recognizer=f\"projects/{PROJECT_ID}/locations/{LOCATION}/recognizers/_\",\n", - " config=real_time_config,\n", - " content=short_audio_sample_bytes,\n", - " recognizer=RECOGNIZER,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "817YXVBli0aY" - }, - "source": [ - "#### Run the real-time recognition request\n", - "\n", - "Finally you submit the real-time recognition request." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "rc0cBrVsi7UG" - }, - "outputs": [], - "source": [ - "response = client.recognize(request=real_time_request)\n", - "\n", - "real_time_recognize_results = parse_real_time_recognize_response(response)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "J2vpMSv7CZ_2" - }, - "source": [ - "And you use a helper function to visualize transcriptions and the associated streams." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ezH51rLH4CBR" - }, - "outputs": [], - "source": [ - "for transcription, _ in real_time_recognize_results:\n", - " print_transcription(short_audio_sample_bytes, transcription)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5M-lIwRJ43EC" - }, - "source": [ - "### Perform batch speech recognition" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LJxhFSg848MO" - }, - "source": [ - "#### Define batch recognition configuration\n", - "\n", - "You start defining the batch recognition configuration." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "0CEQUL5_5BT-" - }, - "outputs": [], - "source": [ - "batch_recognition_config = cloud_speech.RecognitionConfig(\n", - " language_codes=[\"en-US\"],\n", - " model=\"chirp_2\",\n", - " features=cloud_speech.RecognitionFeatures(\n", - " enable_automatic_punctuation=True,\n", - " ),\n", - " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SKf3pMBl5E4f" - }, - "source": [ - "#### Set the audio file you want to transcribe\n", - "\n", - "For the batch transcription, you need the audio be staged in a Cloud Storage bucket. Then you set the associated metadata to pass in the batch recognition request." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "o1VCvEEI5MkG" - }, - "outputs": [], - "source": [ - "audio_metadata = cloud_speech.BatchRecognizeFileMetadata(\n", - " uri=INPUT_LONG_AUDIO_SAMPLE_FILE_URI\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5HOKZLp25yFB" - }, - "source": [ - "#### Define batch recognition request\n", - "\n", - "Next, you define the batch recognition request. Notice how you define a recognition output configuration which allows you to determine how would you retrieve the resulting transcription outcome." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "SItkaX7tyZ14" - }, - "outputs": [], - "source": [ - "batch_recognition_request = cloud_speech.BatchRecognizeRequest(\n", - " config=batch_recognition_config,\n", - " files=[audio_metadata],\n", - " recognition_output_config=cloud_speech.RecognitionOutputConfig(\n", - " inline_response_config=cloud_speech.InlineOutputConfig(),\n", - " ),\n", - " recognizer=RECOGNIZER,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "YQY1eqaY7H0n" - }, - "source": [ - "#### Run the batch recognition request\n", - "\n", - "Finally you submit the batch recognition request which is a [long-running operation](https://google.aip.dev/151) as you see below." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "AlZwRlLo6F1p" - }, - "outputs": [], - "source": [ - "operation = client.batch_recognize(request=batch_recognition_request)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "DrqsNzVmeWu0" - }, - "outputs": [], - "source": [ - "while True:\n", - " if not operation.done():\n", - " print(\"Waiting for operation to complete...\")\n", - " time.sleep(5)\n", - " else:\n", - " print(\"Operation completed.\")\n", - " break" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "B9MEScw7FYAf" - }, - "source": [ - "After the operation finishes, you can retrieve the result as shown below." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pjObiPUweZYA" - }, - "outputs": [], - "source": [ - "response = operation.result()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "31cMuwXZFdgI" - }, - "source": [ - "And visualize transcriptions using a helper function." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "d0eMjC3Kmo-5" - }, - "outputs": [], - "source": [ - "batch_recognize_results = parse_batch_recognize_response(\n", - " response, audio_sample_file_uri=INPUT_LONG_AUDIO_SAMPLE_FILE_URI\n", - ")\n", - "batch_recognize_output = get_recognize_output(\n", - " long_audio_sample_bytes, batch_recognize_results\n", - ")\n", - "for audio_sample_bytes, transcription in batch_recognize_output:\n", - " print_transcription(audio_sample_bytes, transcription)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "teU52ISxqUQd" - }, - "source": [ - "### Evaluate transcriptions\n", - "\n", - "Finally, you may want to evaluate Chirp transcriptions. To do so, you can use [JiWER](https://github.com/jitsi/jiwer), a simple and fast Python package which supports several metrics. In this tutorial, you use:\n", - "\n", - "- **WER (Word Error Rate)** which is the most common metric. 
WER is the number of word edits (insertions, deletions, substitutions) needed to change the recognized text to match the reference text, divided by the total number of words in the reference text.\n", - "- **CER (Character Error Rate)** which is the number of character edits (insertions, deletions, substitutions) needed to change the recognized text to match the reference text, divided by the total number of characters in the reference text." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "q1u3g4LnqX6z" - }, - "outputs": [], - "source": [ - "actual_transcriptions = [t for _, t in batch_recognize_output]\n", - "reference_transcriptions = [\n", - " \"\"\"Okay, so, you know, everyone's been talking about AI lately, right? Writing poems, like nailing those tricky emails, even building websites and all you need is a few, what do they call it again? Prompts? Yeah, it's wild. These AI tools are suddenly everywhere. It's hard to keep up. Seriously. But here's the thing, a lot of this AI stuff we're seeing, it all goes back to this one research paper from way back in 2017. Attention is all you need. So, today we're doing a deep dive into the core of it. The engine that's kind of driving\"\"\",\n", - " \"\"\"all this change. The Transformer. It's funny, right? This super technical paper, I mean, it really did change how we think about AI and how it uses language. Totally. It's like it, I don't know, cracked a code or something. So, before we get into the transformer, we need to like paint that before picture. Can you take us back to how AI used to deal with language before this whole transformer thing came along? Okay. So, imagine this. You're trying to understand a story, but you can only read like one word at a time. Ouch. Right. And not only that, but you also\"\"\",\n", - " \"\"\"have to like remember every single word you read before just to understand the word you're on right now. That sounds so frustrating, like trying to get a movie by looking at one pixel at a time. Exactly. And that's basically how old AI models used to work. RNNs, recurrent neural networks, they processed language one word after the other, which, you can imagine, was super slow and not that great at handling how, you know, language actually works. So, like remembering how the start of a sentence connects\"\"\",\n", - " \"\"\"to the end or how something that happens at the beginning of a book affects what happens later on. That was really tough for older AI. Totally. It's like trying to get a joke by only remembering the punch line. You miss all the important stuff, all that context. Okay, yeah. I'm starting to see why this paper was such a big deal. So how did \"Attention Is All You Need\" change everything? What's so special about this Transformer thing? Well, I mean, even the title is a good hint, right? It's all about attention. This paper introduced self-attention. Basically, it's how the\"\"\",\n", - "]\n", - "\n", - "evaluation_df = evaluate_stt(actual_transcriptions, reference_transcriptions)\n", - "plot_evaluation_results(evaluation_df)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2a4e033321ad" - }, - "source": [ - "## Cleaning up" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5bsE-XtXzmpR" - }, - "outputs": [], - "source": [ - "delete_bucket = False\n", - "\n", - "if delete_bucket:\n", - " ! 
gsutil rm -r $BUCKET_URI" - ] - } - ], - "metadata": { - "colab": { - "name": "get_started_with_chirp_2_sdk.ipynb", - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ur8xi4C7S06n" + }, + "outputs": [], + "source": [ + "# Copyright 2024 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JAPoU8Sm5E6e" + }, + "source": [ + "# Get started with Chirp 2 using Speech-to-Text V2 SDK\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \"Google
Open in Colab\n", + "
\n", + "
\n", + " \n", + " \"Google
Open in Colab Enterprise\n", + "
\n", + "
\n", + " \n", + " \"Vertex
Open in Vertex AI Workbench\n", + "
\n", + "
\n", + " \n", + " \"GitHub
View on GitHub\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "84f0f73a0f76" + }, + "source": [ + "| | |\n", + "|-|-|\n", + "| Author(s) | [Ivan Nardini](https://github.com/inardini) |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tvgnzT1CKxrO" + }, + "source": [ + "## Overview\n", + "\n", + "In this tutorial, you learn about how to use Chirp 2, the latest generation of Google's multilingual ASR-specific models.\n", + "\n", + "Chirp 2 improves upon the original Chirp model in accuracy and speed, as well as expanding into key new features like word-level timestamps, model adaptation, and speech translation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "61RBz8LLbxCR" + }, + "source": [ + "## Get started" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "No17Cw5hgx12" + }, + "source": [ + "### Install Speech-to-Text SDK and other required packages\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "e73_ZgKWYedz" + }, + "outputs": [], + "source": [ + "! apt update -y -qq\n", + "! apt install ffmpeg -y -qq" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tFy3H3aPgx12" + }, + "outputs": [], + "source": [ + "%pip install --quiet 'google-cloud-speech' 'protobuf<4.21' 'google-auth==2.27.0' 'pydub' 'etils' 'jiwer' 'ffmpeg-python' 'plotly'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R5Xep4W9lq-Z" + }, + "source": [ + "### Restart runtime\n", + "\n", + "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.\n", + "\n", + "The restart might take a minute or longer. After it's restarted, continue to the next step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XRvKdaPDTznN" + }, + "outputs": [], + "source": [ + "import IPython\n", + "\n", + "app = IPython.Application.instance()\n", + "app.kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SbmM4z7FOBpM" + }, + "source": [ + "
\n", + "⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dmWOrTJ3gx13" + }, + "source": [ + "### Authenticate your notebook environment (Colab only)\n", + "\n", + "If you're running this notebook on Google Colab, run the cell below to authenticate your environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NyKGtVQjgx13" + }, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "if \"google.colab\" in sys.modules:\n", + " from google.colab import auth\n", + "\n", + " auth.authenticate_user()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DF4l8DTdWgPY" + }, + "source": [ + "### Set Google Cloud project information and initialize Speech-to-Text V2 SDK\n", + "\n", + "To get started using the Speech-to-Text API, you must have an existing Google Cloud project and [enable the Speech-to-Text API](https://console.cloud.google.com/flows/enableapi?apiid=speech.googleapis.com).\n", + "\n", + "Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WIQyBhAn_9tK" + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\", isTemplate: true}\n", + "\n", + "if PROJECT_ID == \"[your-project-id]\":\n", + " PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n", + "\n", + "LOCATION = os.environ.get(\"GOOGLE_CLOUD_REGION\", \"us-central1\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Nqwi-5ufWp_B" + }, + "outputs": [], + "source": [ + "from google.api_core.client_options import ClientOptions\n", + "from google.cloud.speech_v2 import SpeechClient\n", + "\n", + "API_ENDPOINT = f\"{LOCATION}-speech.googleapis.com\"\n", + "\n", + "client = SpeechClient(\n", + " client_options=ClientOptions(\n", + " api_endpoint=API_ENDPOINT,\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zgPO1eR3CYjk" + }, + "source": [ + "### Create a Cloud Storage bucket\n", + "\n", + "Create a storage bucket to store intermediate artifacts such as datasets." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MzGDU7TWdts_" + }, + "outputs": [], + "source": [ + "BUCKET_NAME = \"your-bucket-name-unique\" # @param {type:\"string\", isTemplate: true}\n", + "\n", + "BUCKET_URI = f\"gs://{BUCKET_NAME}\" # @param {type:\"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-EcIXiGsCePi" + }, + "source": [ + "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NIq7R4HZCfIc" + }, + "outputs": [], + "source": [ + "! 
gsutil mb -l $LOCATION -p $PROJECT_ID $BUCKET_URI" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5303c05f7aa6" + }, + "source": [ + "### Import libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6fc324893334" + }, + "outputs": [], + "source": [ + "from google.cloud.speech_v2.types import cloud_speech" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qqm0OQpAYCph" + }, + "outputs": [], + "source": [ + "import io\n", + "import os\n", + "import subprocess\n", + "import time\n", + "\n", + "import IPython.display as ipd\n", + "from etils import epath as ep\n", + "import jiwer\n", + "import pandas as pd\n", + "import plotly.graph_objs as go\n", + "from pydub import AudioSegment" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sP8GBj3tBAC1" + }, + "source": [ + "### Set constants" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rXTVeU1uBBqY" + }, + "outputs": [], + "source": [ + "INPUT_AUDIO_SAMPLE_FILE_URI = (\n", + " \"gs://github-repo/audio_ai/speech_recognition/attention_is_all_you_need_podcast.wav\"\n", + ")\n", + "INPUT_LONG_AUDIO_SAMPLE_FILE_URI = (\n", + " f\"{BUCKET_URI}/speech_recognition/data/long_audio_sample.wav\"\n", + ")\n", + "\n", + "RECOGNIZER = client.recognizer_path(PROJECT_ID, LOCATION, \"_\")\n", + "\n", + "MAX_CHUNK_SIZE = 25600" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "djgFxrGC_Ykd" + }, + "source": [ + "### Helpers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Zih8W_wC_caW" + }, + "outputs": [], + "source": [ + "def read_audio_file(audio_file_path: str) -> bytes:\n", + " \"\"\"\n", + " Read audio file as bytes.\n", + " \"\"\"\n", + " if audio_file_path.startswith(\"gs://\"):\n", + " with ep.Path(audio_file_path).open(\"rb\") as f:\n", + " audio_bytes = f.read()\n", + " else:\n", + " with open(audio_file_path, \"rb\") as f:\n", + " audio_bytes = f.read()\n", + " return audio_bytes\n", + "\n", + "\n", + "def save_audio_sample(audio_bytes: bytes, output_file_uri: str) -> None:\n", + " \"\"\"\n", + " Save audio sample as a file in Google Cloud Storage.\n", + " \"\"\"\n", + "\n", + " output_file_path = ep.Path(output_file_uri)\n", + " if not output_file_path.parent.exists():\n", + " output_file_path.parent.mkdir(parents=True, exist_ok=True)\n", + "\n", + " with output_file_path.open(\"wb\") as f:\n", + " f.write(audio_bytes)\n", + "\n", + "\n", + "def extract_audio_sample(audio_bytes: bytes, duration: int) -> bytes:\n", + " \"\"\"\n", + " Extracts a random audio sample of a given duration from an audio file.\n", + " \"\"\"\n", + " audio = AudioSegment.from_file(io.BytesIO(audio_bytes))\n", + " start_time = 0\n", + " audio_sample = audio[start_time : start_time + duration * 1000]\n", + "\n", + " audio_bytes = io.BytesIO()\n", + " audio_sample.export(audio_bytes, format=\"wav\")\n", + " audio_bytes.seek(0)\n", + "\n", + " return audio_bytes.read()\n", + "\n", + "\n", + "def play_audio_sample(audio_bytes: bytes) -> None:\n", + " \"\"\"\n", + " Plays the audio sample in a notebook.\n", + " \"\"\"\n", + " audio_file = io.BytesIO(audio_bytes)\n", + " ipd.display(ipd.Audio(audio_file.read(), rate=44100))\n", + "\n", + "\n", + "def audio_sample_chunk_n(audio_bytes: bytes, num_chunks: int) -> list[bytes]:\n", + " \"\"\"\n", + " Chunks an audio sample into a specified number of chunks and returns a list of bytes for each chunk.\n", + " \"\"\"\n", + " audio = 
AudioSegment.from_file(io.BytesIO(audio_bytes))\n", + " total_duration = len(audio)\n", + " chunk_duration = total_duration // num_chunks\n", + "\n", + " chunks = []\n", + " start_time = 0\n", + "\n", + " for _ in range(num_chunks):\n", + " end_time = min(start_time + chunk_duration, total_duration)\n", + " chunk = audio[start_time:end_time]\n", + "\n", + " audio_bytes_chunk = io.BytesIO()\n", + " chunk.export(audio_bytes_chunk, format=\"wav\")\n", + " audio_bytes_chunk.seek(0)\n", + " chunks.append(audio_bytes_chunk.read())\n", + "\n", + " start_time = end_time\n", + "\n", + " return chunks\n", + "\n", + "\n", + "def audio_sample_merge(audio_chunks: list[bytes]) -> bytes:\n", + " \"\"\"\n", + " Merges a list of audio chunks into a single audio sample.\n", + " \"\"\"\n", + " audio = AudioSegment.empty()\n", + " for chunk in audio_chunks:\n", + " audio += AudioSegment.from_file(io.BytesIO(chunk))\n", + "\n", + " audio_bytes = io.BytesIO()\n", + " audio.export(audio_bytes, format=\"wav\")\n", + " audio_bytes.seek(0)\n", + "\n", + " return audio_bytes.read()\n", + "\n", + "\n", + "def compress_for_streaming(audio_bytes: bytes) -> bytes:\n", + " \"\"\"\n", + " Compresses audio bytes for streaming using ffmpeg, ensuring the output size is under 25600 bytes.\n", + " \"\"\"\n", + "\n", + " # Temporary file to store original audio\n", + " with open(\"temp_original.wav\", \"wb\") as f:\n", + " f.write(audio_bytes)\n", + "\n", + " # Initial compression attempt with moderate bitrate\n", + " bitrate = \"32k\"\n", + " subprocess.run(\n", + " [\n", + " \"ffmpeg\",\n", + " \"-i\",\n", + " \"temp_original.wav\",\n", + " \"-b:a\",\n", + " bitrate,\n", + " \"-y\",\n", + " \"temp_compressed.mp3\",\n", + " ]\n", + " )\n", + "\n", + " # Check if compressed size is within limit\n", + " compressed_size = os.path.getsize(\"temp_compressed.mp3\")\n", + " if compressed_size <= 25600:\n", + " with open(\"temp_compressed.mp3\", \"rb\") as f:\n", + " compressed_audio_bytes = f.read()\n", + " else:\n", + " # If too large, reduce bitrate and retry\n", + " while compressed_size > 25600:\n", + " bitrate = str(int(bitrate[:-1]) - 8) + \"k\" # Reduce bitrate by 8kbps\n", + " subprocess.run(\n", + " [\n", + " \"ffmpeg\",\n", + " \"-i\",\n", + " \"temp_original.wav\",\n", + " \"-b:a\",\n", + " bitrate,\n", + " \"-y\",\n", + " \"temp_compressed.mp3\",\n", + " ]\n", + " )\n", + " compressed_size = os.path.getsize(\"temp_compressed.mp3\")\n", + "\n", + " with open(\"temp_compressed.mp3\", \"rb\") as f:\n", + " compressed_audio_bytes = f.read()\n", + "\n", + " # Clean up temporary files\n", + " os.remove(\"temp_original.wav\")\n", + " os.remove(\"temp_compressed.mp3\")\n", + "\n", + " return compressed_audio_bytes\n", + "\n", + "\n", + "def parse_streaming_recognize_response(response) -> list[tuple[str, int]]:\n", + " \"\"\"Parse streaming responses from the Speech-to-Text API\"\"\"\n", + " streaming_recognize_results = []\n", + " for r in response:\n", + " for result in r.results:\n", + " streaming_recognize_results.append(\n", + " (result.alternatives[0].transcript, result.result_end_offset)\n", + " )\n", + " return streaming_recognize_results\n", + "\n", + "\n", + "def parse_real_time_recognize_response(response) -> list[tuple[str, int]]:\n", + " \"\"\"Parse real-time responses from the Speech-to-Text API\"\"\"\n", + " real_time_recognize_results = []\n", + " for result in response.results:\n", + " real_time_recognize_results.append(\n", + " (result.alternatives[0].transcript, result.result_end_offset)\n", + " )\n", + " return 
real_time_recognize_results\n", + "\n", + "\n", + "def parse_batch_recognize_response(\n", + " response, audio_sample_file_uri: str = INPUT_LONG_AUDIO_SAMPLE_FILE_URI\n", + ") -> list[tuple[str, int]]:\n", + " \"\"\"Parse batch responses from the Speech-to-Text API\"\"\"\n", + " batch_recognize_results = []\n", + " for result in response.results[\n", + " audio_sample_file_uri\n", + " ].inline_result.transcript.results:\n", + " batch_recognize_results.append(\n", + " (result.alternatives[0].transcript, result.result_end_offset)\n", + " )\n", + " return batch_recognize_results\n", + "\n", + "\n", + "def get_recognize_output(\n", + " audio_bytes: bytes, recognize_results: list[tuple[str, int]]\n", + ") -> list[tuple[bytes, str]]:\n", + " \"\"\"\n", + " Get the output of recognize results, handling 0 timedelta and ensuring no overlaps or gaps.\n", + " \"\"\"\n", + " audio = AudioSegment.from_file(io.BytesIO(audio_bytes))\n", + " recognize_output = []\n", + " start_time = 0\n", + "\n", + " initial_end_time = recognize_results[0][1].total_seconds() * 1000\n", + "\n", + " # This loop handles the streaming case where result timestamps might be zero.\n", + " if initial_end_time == 0:\n", + " for i, (transcript, timedelta) in enumerate(recognize_results):\n", + " if i < len(recognize_results) - 1:\n", + " # Use the next timedelta if available\n", + " next_end_time = recognize_results[i + 1][1].total_seconds() * 1000\n", + " end_time = next_end_time\n", + " else:\n", + " next_end_time = len(audio)\n", + " end_time = next_end_time\n", + "\n", + " # Ensure no gaps between chunks\n", + " chunk = audio[start_time:end_time]\n", + " chunk_bytes = io.BytesIO()\n", + " chunk.export(chunk_bytes, format=\"wav\")\n", + " chunk_bytes.seek(0)\n", + " recognize_output.append((chunk_bytes.read(), transcript))\n", + "\n", + " # Set start_time for the next iteration\n", + " start_time = end_time\n", + " else:\n", + " for i, (transcript, timedelta) in enumerate(recognize_results):\n", + " # Calculate end_time in milliseconds\n", + " end_time = timedelta.total_seconds() * 1000\n", + "\n", + " # Ensure no gaps between chunks\n", + " chunk = audio[start_time:end_time]\n", + " chunk_bytes = io.BytesIO()\n", + " chunk.export(chunk_bytes, format=\"wav\")\n", + " chunk_bytes.seek(0)\n", + " recognize_output.append((chunk_bytes.read(), transcript))\n", + "\n", + " # Set start_time for the next iteration\n", + " start_time = end_time\n", + "\n", + " return recognize_output\n", + "\n", + "\n", + "def print_transcription(audio_sample_bytes: bytes, transcription: str) -> None:\n", + " \"\"\"Prettify the play of the audio and the associated print of the transcription text in a notebook\"\"\"\n", + "\n", + " # Play the audio sample\n", + " display(ipd.HTML(\"Audio:\"))\n", + " play_audio_sample(audio_sample_bytes)\n", + " display(ipd.HTML(\"
\"))\n", + "\n", + " # Display the transcription text\n", + " display(ipd.HTML(\"Transcription:\"))\n", + " formatted_text = f\"
{transcription}
\"\n", + " display(ipd.HTML(formatted_text))\n", + "\n", + "\n", + "def evaluate_stt(\n", + " actual_transcriptions: list[str],\n", + " reference_transcriptions: list[str],\n", + " audio_sample_file_uri: str = INPUT_LONG_AUDIO_SAMPLE_FILE_URI,\n", + ") -> pd.DataFrame:\n", + " \"\"\"\n", + " Evaluate speech-to-text (STT) transcriptions against reference transcriptions.\n", + " \"\"\"\n", + " audio_uris = [audio_sample_file_uri] * len(actual_transcriptions)\n", + " evaluations = []\n", + " for audio_uri, actual_transcription, reference_transcription in zip(\n", + " audio_uris, actual_transcriptions, reference_transcriptions\n", + " ):\n", + " evaluation = {\n", + " \"audio_uri\": audio_uri,\n", + " \"actual_transcription\": actual_transcription,\n", + " \"reference_transcription\": reference_transcription,\n", + " \"wer\": jiwer.wer(reference_transcription, actual_transcription),\n", + " \"cer\": jiwer.cer(reference_transcription, actual_transcription),\n", + " }\n", + " evaluations.append(evaluation)\n", + "\n", + " evaluations_df = pd.DataFrame(evaluations)\n", + " evaluations_df.reset_index(inplace=True, drop=True)\n", + " return evaluations_df\n", + "\n", + "\n", + "def plot_evaluation_results(\n", + " evaluations_df: pd.DataFrame,\n", + ") -> go.Figure:\n", + " \"\"\"\n", + " Plot the mean Word Error Rate (WER) and Character Error Rate (CER) from the evaluation results.\n", + " \"\"\"\n", + " mean_wer = evaluations_df[\"wer\"].mean()\n", + " mean_cer = evaluations_df[\"cer\"].mean()\n", + "\n", + " trace_means = go.Bar(\n", + " x=[\"WER\", \"CER\"], y=[mean_wer, mean_cer], name=\"Mean Error Rate\"\n", + " )\n", + "\n", + " trace_baseline = go.Scatter(\n", + " x=[\"WER\", \"CER\"], y=[0.5, 0.5], mode=\"lines\", name=\"Baseline (0.5)\"\n", + " )\n", + "\n", + " layout = go.Layout(\n", + " title=\"Speech-to-Text Evaluation Results\",\n", + " xaxis=dict(title=\"Metric\"),\n", + " yaxis=dict(title=\"Error Rate\", range=[0, 1]),\n", + " barmode=\"group\",\n", + " )\n", + "\n", + " fig = go.Figure(data=[trace_means, trace_baseline], layout=layout)\n", + " return fig" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VPVDNRyVxquo" + }, + "source": [ + "## Transcribe using Chirp 2\n", + "\n", + "You can use Chirp 2 to transcribe audio in Streaming, Online and Batch modes:\n", + "\n", + "* Streaming mode is good for streaming and real-time audio. \n", + "* Online mode is good for short audio < 1 min.\n", + "* Batch mode is good for long audio 1 min to 8 hrs. \n", + "\n", + "In the following sections, you explore how to use the API to transcribe audio in these three different scenarios." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4uTeBXo6dZlS" + }, + "source": [ + "### Read the audio file\n", + "\n", + "Let's start reading the input audio sample you want to transcribe.\n", + "\n", + "In this case, it is a podcast generated with NotebookLM about the \"Attention is all you need\" [paper](https://arxiv.org/abs/1706.03762)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pjzwMWqpdldM" + }, + "outputs": [], + "source": [ + "input_audio_bytes = read_audio_file(INPUT_AUDIO_SAMPLE_FILE_URI)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SyEUpcf12z73" + }, + "source": [ + "### Prepare audio samples\n", + "\n", + "The podcast audio is ~ 8 mins. Depending on the audio length, you can use different transcribe API methods. To learn more, check out the official documentation. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TRlgCdED793U" + }, + "source": [ + "#### Prepare a short audio sample (< 1 min)\n", + "\n", + "Extract a short audio sample from the original one for streaming and real-time audio processing." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "r-SYb9_b87BZ" + }, + "outputs": [], + "source": [ + "short_audio_sample_bytes = extract_audio_sample(input_audio_bytes, 30)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0Hk2OSiSEFrf" + }, + "outputs": [], + "source": [ + "play_audio_sample(short_audio_sample_bytes)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2rPcMe0LvC3q" + }, + "source": [ + "#### Prepare a long audio sample (from 1 min up to 8 hrs)\n", + "\n", + "Extract a longer audio sample from the original one for batch audio processing." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "L44FoygqvHoP" + }, + "outputs": [], + "source": [ + "long_audio_sample_bytes = extract_audio_sample(input_audio_bytes, 120)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ej2j0FBEvK6s" + }, + "outputs": [], + "source": [ + "play_audio_sample(long_audio_sample_bytes)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6tIbVVe76ML8" + }, + "outputs": [], + "source": [ + "save_audio_sample(long_audio_sample_bytes, INPUT_LONG_AUDIO_SAMPLE_FILE_URI)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w5qPg2OfFAG9" + }, + "source": [ + "### Perform streaming speech recognition\n", + "\n", + "Let's start performing streaming speech recognition." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aAlIgQSoeDT5" + }, + "source": [ + "#### Prepare the audio stream\n", + "\n", + "To simulate an audio stream, you can create a generator yielding chunks of audio data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "j5SPyum6FMiC" + }, + "outputs": [], + "source": [ + "stream = [\n", + " compress_for_streaming(audio_chuck)\n", + " for audio_chuck in audio_sample_chunk_n(short_audio_sample_bytes, num_chunks=5)\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7dDap26FiKlL" + }, + "outputs": [], + "source": [ + "for s in stream:\n", + " play_audio_sample(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9z1XGzpxeAMP" + }, + "source": [ + "#### Prepare the stream request\n", + "\n", + "Once you have your audio stream, you can use the `StreamingRecognizeRequest`class to convert each stream component into a API message." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IOZNYPrfeW49" + }, + "outputs": [], + "source": [ + "audio_requests = (cloud_speech.StreamingRecognizeRequest(audio=s) for s in stream)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oPbf5rNFecI_" + }, + "source": [ + "#### Define streaming recognition configuration\n", + "\n", + "Next, you define the streaming recognition configuration which allows you to set the model to use, language code of the audio and more." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "32Wz990perAo" + }, + "outputs": [], + "source": [ + "streaming_config = cloud_speech.StreamingRecognitionConfig(\n", + " config=cloud_speech.RecognitionConfig(\n", + " language_codes=[\"en-US\"],\n", + " model=\"chirp_2\",\n", + " features=cloud_speech.RecognitionFeatures(\n", + " enable_automatic_punctuation=True,\n", + " ),\n", + " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zVRyTqhWe2gf" + }, + "source": [ + "#### Define the streaming request configuration\n", + "\n", + "Then, you use the streaming configuration to define the streaming request." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "t5qiUJ48e9i5" + }, + "outputs": [], + "source": [ + "stream_request_config = cloud_speech.StreamingRecognizeRequest(\n", + " streaming_config=streaming_config, recognizer=RECOGNIZER\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h1d508ScfD9I" + }, + "source": [ + "#### Run the streaming recognition request\n", + "\n", + "Finally, you run the streaming recognition request. The first request in the stream carries the streaming configuration; the following requests carry the audio chunks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QCq-iROpfl9t" + }, + "outputs": [], + "source": [ + "def requests(request_config: cloud_speech.StreamingRecognizeRequest, s):\n", + " # Yield the configuration request first, then the audio chunk requests.\n", + " yield request_config\n", + " yield from s\n", + "\n", + "\n", + "response = client.streaming_recognize(\n", + " requests=requests(stream_request_config, audio_requests)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d__QUGkWCkGh" + }, + "source": [ + "Here you use a helper function to visualize the transcriptions and the associated audio chunks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_qWA8jXYuMH3" + }, + "outputs": [], + "source": [ + "streaming_recognize_results = parse_streaming_recognize_response(response)\n", + "streaming_recognize_output = get_recognize_output(\n", + " short_audio_sample_bytes, streaming_recognize_results\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "agk_M0xRwzv0" + }, + "outputs": [], + "source": [ + "for audio_sample_bytes, transcription in streaming_recognize_output:\n", + " print_transcription(audio_sample_bytes, transcription)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oYCgDay2hAgB" + }, + "source": [ + "### Perform real-time speech recognition" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F83r9aiNhAgD" + }, + "source": [ + "#### Define real-time recognition configuration\n", + "\n", + "As with the streaming transcription, you define the real-time recognition configuration, which allows you to set the model to use, the language code of the audio, and more."
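+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "word-timestamps-aside" + }, + "source": [ + "As an optional aside that is not used elsewhere in this notebook: Chirp 2 also supports word-level timestamps. The cell below is a minimal sketch of how you could request them in a real-time configuration, assuming you enable `enable_word_time_offsets` in the recognition features; each word in the response then carries its start and end offsets." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "word-timestamps-aside-code" + }, + "outputs": [], + "source": [ + "# Sketch only: a variant of the real-time configuration that also requests\n", + "# word-level timestamps. Not used by the rest of this notebook.\n", + "real_time_config_with_timestamps = cloud_speech.RecognitionConfig(\n", + "    language_codes=[\"en-US\"],\n", + "    model=\"chirp_2\",\n", + "    features=cloud_speech.RecognitionFeatures(\n", + "        enable_automatic_punctuation=True,\n", + "        enable_word_time_offsets=True,\n", + "    ),\n", + "    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "real-time-config-follows" + }, + "source": [ + "The notebook's own real-time configuration follows."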
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "j0WprZ-phAgD" + }, + "outputs": [], + "source": [ + "real_time_config = cloud_speech.RecognitionConfig(\n", + " language_codes=[\"en-US\"],\n", + " model=\"chirp_2\",\n", + " features=cloud_speech.RecognitionFeatures(\n", + " enable_automatic_punctuation=True,\n", + " ),\n", + " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r2TqksAqhAgD" + }, + "source": [ + "#### Define the real-time request configuration\n", + "\n", + "Next, you define the real-time request, passing the configuration and the audio sample you want to transcribe. As in the streaming case, you reuse the same `RECOGNIZER` resource rather than creating a dedicated recognizer." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Nh55mSzXhAgD" + }, + "outputs": [], + "source": [ + "real_time_request = cloud_speech.RecognizeRequest(\n", + " config=real_time_config,\n", + " content=short_audio_sample_bytes,\n", + " recognizer=RECOGNIZER,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "817YXVBli0aY" + }, + "source": [ + "#### Run the real-time recognition request\n", + "\n", + "Finally, you submit the real-time recognition request." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rc0cBrVsi7UG" + }, + "outputs": [], + "source": [ + "response = client.recognize(request=real_time_request)\n", + "\n", + "real_time_recognize_results = parse_real_time_recognize_response(response)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J2vpMSv7CZ_2" + }, + "source": [ + "And you use a helper function to visualize the transcription alongside the audio sample." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ezH51rLH4CBR" + }, + "outputs": [], + "source": [ + "for transcription, _ in real_time_recognize_results:\n", + " print_transcription(short_audio_sample_bytes, transcription)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5M-lIwRJ43EC" + }, + "source": [ + "### Perform batch speech recognition" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LJxhFSg848MO" + }, + "source": [ + "#### Define batch recognition configuration\n", + "\n", + "You start by defining the batch recognition configuration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0CEQUL5_5BT-" + }, + "outputs": [], + "source": [ + "batch_recognition_config = cloud_speech.RecognitionConfig(\n", + " language_codes=[\"en-US\"],\n", + " model=\"chirp_2\",\n", + " features=cloud_speech.RecognitionFeatures(\n", + " enable_automatic_punctuation=True,\n", + " ),\n", + " auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SKf3pMBl5E4f" + }, + "source": [ + "#### Set the audio file you want to transcribe\n", + "\n", + "For batch transcription, the audio must be staged in a Cloud Storage bucket. You then set the associated metadata to pass in the batch recognition request."
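+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "batch-gcs-output-aside" + }, + "source": [ + "An aside that is not used elsewhere in this notebook: the batch request defined a couple of cells below returns its results inline in the operation response. For larger jobs you could instead write results to Cloud Storage with a `GcsOutputConfig`. The sketch below assumes a hypothetical destination URI." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "batch-gcs-output-aside-code" + }, + "outputs": [], + "source": [ + "# Sketch only: write batch transcription results to Cloud Storage instead of\n", + "# returning them inline. The destination URI is a hypothetical placeholder.\n", + "gcs_output_config = cloud_speech.RecognitionOutputConfig(\n", + "    gcs_output_config=cloud_speech.GcsOutputConfig(\n", + "        uri=\"gs://your-bucket/chirp2-batch-transcripts/\"\n", + "    )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "batch-metadata-follows" + }, + "source": [ + "Back in the main flow, you first set the metadata for the staged audio file."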
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "o1VCvEEI5MkG" + }, + "outputs": [], + "source": [ + "audio_metadata = cloud_speech.BatchRecognizeFileMetadata(\n", + " uri=INPUT_LONG_AUDIO_SAMPLE_FILE_URI\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5HOKZLp25yFB" + }, + "source": [ + "#### Define batch recognition request\n", + "\n", + "Next, you define the batch recognition request. Notice how you define a recognition output configuration, which determines how you retrieve the resulting transcription; here it is returned inline in the operation response." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SItkaX7tyZ14" + }, + "outputs": [], + "source": [ + "batch_recognition_request = cloud_speech.BatchRecognizeRequest(\n", + " config=batch_recognition_config,\n", + " files=[audio_metadata],\n", + " recognition_output_config=cloud_speech.RecognitionOutputConfig(\n", + " inline_response_config=cloud_speech.InlineOutputConfig(),\n", + " ),\n", + " recognizer=RECOGNIZER,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQY1eqaY7H0n" + }, + "source": [ + "#### Run the batch recognition request\n", + "\n", + "Finally, you submit the batch recognition request, which returns a [long-running operation](https://google.aip.dev/151), as you see below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AlZwRlLo6F1p" + }, + "outputs": [], + "source": [ + "operation = client.batch_recognize(request=batch_recognition_request)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DrqsNzVmeWu0" + }, + "outputs": [], + "source": [ + "while True:\n", + " if not operation.done():\n", + " print(\"Waiting for operation to complete...\")\n", + " time.sleep(5)\n", + " else:\n", + " print(\"Operation completed.\")\n", + " break" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B9MEScw7FYAf" + }, + "source": [ + "After the operation finishes, you can retrieve the result as shown below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pjObiPUweZYA" + }, + "outputs": [], + "source": [ + "response = operation.result()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "31cMuwXZFdgI" + }, + "source": [ + "And you visualize the transcriptions using a helper function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d0eMjC3Kmo-5" + }, + "outputs": [], + "source": [ + "batch_recognize_results = parse_batch_recognize_response(\n", + " response, audio_sample_file_uri=INPUT_LONG_AUDIO_SAMPLE_FILE_URI\n", + ")\n", + "batch_recognize_output = get_recognize_output(\n", + " long_audio_sample_bytes, batch_recognize_results\n", + ")\n", + "for audio_sample_bytes, transcription in batch_recognize_output:\n", + " print_transcription(audio_sample_bytes, transcription)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "teU52ISxqUQd" + }, + "source": [ + "### Evaluate transcriptions\n", + "\n", + "Finally, you may want to evaluate the Chirp 2 transcriptions. To do so, you can use [JiWER](https://github.com/jitsi/jiwer), a simple and fast Python package that supports several metrics. In this tutorial, you use:\n", + "\n", + "- **WER (Word Error Rate)**, which is the most common metric. 
WER is the number of word edits (insertions, deletions, substitutions) needed to change the recognized text to match the reference text, divided by the total number of words in the reference text.\n", + "- **CER (Character Error Rate)** which is the number of character edits (insertions, deletions, substitutions) needed to change the recognized text to match the reference text, divided by the total number of characters in the reference text." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "q1u3g4LnqX6z" + }, + "outputs": [], + "source": [ + "actual_transcriptions = [t for _, t in batch_recognize_output]\n", + "reference_transcriptions = [\n", + " \"\"\"Okay, so, you know, everyone's been talking about AI lately, right? Writing poems, like nailing those tricky emails, even building websites and all you need is a few, what do they call it again? Prompts? Yeah, it's wild. These AI tools are suddenly everywhere. It's hard to keep up. Seriously. But here's the thing, a lot of this AI stuff we're seeing, it all goes back to this one research paper from way back in 2017. Attention is all you need. So, today we're doing a deep dive into the core of it. The engine that's kind of driving\"\"\",\n", + " \"\"\"all this change. The Transformer. It's funny, right? This super technical paper, I mean, it really did change how we think about AI and how it uses language. Totally. It's like it, I don't know, cracked a code or something. So, before we get into the transformer, we need to like paint that before picture. Can you take us back to how AI used to deal with language before this whole transformer thing came along? Okay. So, imagine this. You're trying to understand a story, but you can only read like one word at a time. Ouch. Right. And not only that, but you also\"\"\",\n", + " \"\"\"have to like remember every single word you read before just to understand the word you're on right now. That sounds so frustrating, like trying to get a movie by looking at one pixel at a time. Exactly. And that's basically how old AI models used to work. RNNs, recurrent neural networks, they processed language one word after the other, which, you can imagine, was super slow and not that great at handling how, you know, language actually works. So, like remembering how the start of a sentence connects\"\"\",\n", + " \"\"\"to the end or how something that happens at the beginning of a book affects what happens later on. That was really tough for older AI. Totally. It's like trying to get a joke by only remembering the punch line. You miss all the important stuff, all that context. Okay, yeah. I'm starting to see why this paper was such a big deal. So how did \"Attention Is All You Need\" change everything? What's so special about this Transformer thing? Well, I mean, even the title is a good hint, right? It's all about attention. This paper introduced self-attention. Basically, it's how the\"\"\",\n", + "]\n", + "\n", + "evaluation_df = evaluate_stt(actual_transcriptions, reference_transcriptions)\n", + "plot_evaluation_results(evaluation_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2a4e033321ad" + }, + "source": [ + "## Cleaning up" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5bsE-XtXzmpR" + }, + "outputs": [], + "source": [ + "delete_bucket = False\n", + "\n", + "if delete_bucket:\n", + " ! 
gsutil rm -r $BUCKET_URI" + ] + } + ], + "metadata": { + "colab": { + "name": "get_started_with_chirp_2_sdk.ipynb", + "toc_visible": true + }, + "environment": { + "kernel": "python3", + "name": "tf2-cpu.2-11.m125", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/tf2-cpu.2-11:m125" + }, + "kernelspec": { + "display_name": "Python 3 (Local)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 4 }