In this cookbook, we will start with a simple audio summarizer using a combination of web technologies and AI models. This application is made possible by the following technologies:
- Next.js provides a simple-to-use framework for building React-based web applications
- OpenAI's Whisper is a state-of-the-art Speech-to-Text model. Although it is fully open source and can be self-hosted, we opted to use the OpenAI API for this tutorial.
- OpenAI's GPT-4o is a powerful language model that can be used for a variety of tasks. We will use it to summarize the text generated by Whisper.
- Literal AI is an end-to-end observability, evaluation and monitoring platform for building & improving production-grade LLM applications.
Lastly, the frontend of the application relies on the AudioRecorder React component, which is itself based on the MediaRecorder API.
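To give an idea of how the pieces fit together, the recorder might be wired up roughly like this in the page component. This is only a sketch on my part: the import path and the onRecordingComplete prop are assumptions, not necessarily the exact API of the component used in the repo.

```typescript
// src/app/page.tsx (sketch) — the component import path and prop name are hypothetical
"use client";

import { AudioRecorder } from "@/components/AudioRecorder";

export default function Home() {
  // Called with the recorded audio blob once the user stops recording
  const onRecordingComplete = async (audio: Blob) => {
    // POST the blob to /api/transcribe, then send the transcript to /api/emojify (see below)
  };

  return <AudioRecorder onRecordingComplete={onRecordingComplete} />;
}
```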
As you can see, this repository includes two versions of the application: the base application, and the application with Literal AI integrated.
If you want to code along with this tutorial, you can start by cloning the repository and navigating to the `without-literal` folder. You can then install the dependencies and start the development server to try out the Speech to Emoji summarizer:
cd without-literal
npm install
npm run dev
If you want to skip ahead to the final version with Literal AI, you can find it in the `with-literal` folder.
Before we get started, you will need to create a free account on Literal AI. You will also need to create a new project and generate an API key from your project's "Settings > General" section.
For the OpenAI calls to work, you will also need to generate an OpenAI API key from the OpenAI API platform.
You can now copy the provided `.env.example` file to a new `.env` file and fill in the required environment variables.
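The filled-in file will look roughly like the snippet below. The values are placeholders, and you should defer to `.env.example` for the exact variable names; the two shown here are the standard names read by the Literal AI and OpenAI SDKs.

```
# .env (placeholders — use the variable names from .env.example)
LITERAL_API_KEY=your-literal-ai-api-key
OPENAI_API_KEY=your-openai-api-key
```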
We will start by adding the Literal AI SDK to the application. You can install the SDK using npm:
npm install @literalai/client
Having prompts in the code can get unwieldy over time: those big templated strings get hard to read and maintain (although your mileage may vary on this). For this project I have opted to manage them entirely on Literal AI, which allows me to iterate on the prompt's messages and settings without having to redeploy my application.
My initial prompt looked like this:
You are a hieroglyphic assistant. Your job is to summarize text into emojis.
However, after testing it out, I realized that the prompt was not clear enough and the model was not generating the expected results. I then iterated on the prompt and came up with the following:
You are a hieroglyphic assistant. Your job is to summarize text into emojis. You respect the following rules:
* keep the rhythm and punctuation of the original text
* ONLY output emojis
* add a line break between sentences
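To illustrate the intent of these rules: a transcript like "Good morning! I had coffee and read the news." should ideally come back as emojis only, keeping the exclamation mark, with something like "☀️👋❗" on one line and "☕📰" on the next (the exact emojis will of course vary from run to run).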
You can create a new prompt on Literal AI with the following steps:
- Create a project on Literal AI
- Navigate to your project
- Click on the "Prompts" tab
- Click on the "New Prompt" button
- Click on the "+ Message" button in the "Template" section
- Copy my new prompt to the editor
- Adjust models, temperatures and other settings as needed
- Save the prompt with the name "Emojifier Prompt" (be sure to use this exact name, as it will be used to retrieve the prompt through the API)
Here is what it looks like in Literal AI:
Now you can edit the `src/app/api/emojify/route.ts` file and add the following:
import OpenAI from "openai";
import { NextRequest, NextResponse } from "next/server";
import { LiteralClient } from "@literalai/client";

const openai = new OpenAI();

// Init the Literal AI client
const literalClient = new LiteralClient();

export async function POST(req: NextRequest) {
  // ...

  // Get the prompt from the Literal AI API
  const promptName = "Emojifier Prompt";
  const prompt = await literalClient.api.getPrompt(promptName);
  if (!prompt) throw new Error("Prompt not found");
  const promptMessages = prompt.formatMessages();

  // Call the LLM API with the model and settings stored on the prompt
  const completion = await openai.chat.completions.create({
    ...prompt.settings,
    messages: [...promptMessages, { role: "user", content: text }],
  });

  // ...
}
We want to log each request as a run, which will contain two steps:
- One step for the audio transcription
- One step for the summarization
Here is what it will look like on Literal AI:
To facilitate its use, we will generate the Run ID from the frontend using `crypto.randomUUID()` and pass it to the backend. This ensures that my run IDs are unique and fully compatible with Literal AI. I then simply add `runId` to the payload of the API requests.
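In practice, the flow in the recording handler might look something like this. This is a sketch: the endpoint paths match the route files discussed below, but the form field and JSON property names are illustrative assumptions.

```typescript
// Sketch of the frontend calls in src/app/page.tsx — field and property names are hypothetical
const onRecordingComplete = async (audio: Blob) => {
  // One run ID per interaction, shared by both backend calls
  const runId = crypto.randomUUID();

  // 1. Transcribe the audio
  const formData = new FormData();
  formData.append("audio", audio);
  formData.append("runId", runId);
  const transcribeResponse = await fetch("/api/transcribe", {
    method: "POST",
    body: formData,
  });
  const { text } = await transcribeResponse.json();

  // 2. Summarize the transcript into emojis, reusing the same runId
  const emojifyResponse = await fetch("/api/emojify", {
    method: "POST",
    body: JSON.stringify({ text, runId }),
  });
  const { emojis } = await emojifyResponse.json();
  // ...update the UI with `emojis`
};
```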
In `src/app/api/transcribe/route.ts`, let's then create a thread for each run. This is a bit of a hack, as the interaction is not really a threaded conversation; however, it is necessary so that we can upload audio files as attachments.
const transcribedText = await literalClient
  .thread({ name: "Speech to Emoji Thread" })
  .wrap(async () => {
    const thread = literalClient.getCurrentThread();

    // Upload the file to Literal AI and add it as an attachment
    const attachment = await literalClient.api.createAttachment({
      content: formAudio,
      threadId: thread.id,
      mime: "audio/webm",
      name: "Audio file",
    });

    // Create the run with the attached audio file
    const run = await thread
      .step({
        id: runId,
        type: "run",
        name: "Speech to Emoji",
        input: {
          input: { content: "Audio file" },
          attachments: [attachment],
        },
      })
      .wrap(/* ... */);
  });
Still in `src/app/api/transcribe/route.ts`, we now add the first step for the audio transcription. Please note that we are measuring the start and end times, which will allow us to monitor latency from Literal AI. On classic chat-based LLM calls (`openai.chat.completions.create`), this is handled automatically by the Literal AI SDK instrumentation; however, this does not apply to other OpenAI API calls such as audio transcriptions.
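Concretely, the start and end timestamps come from timing the Whisper call itself, roughly like this. This is a minimal sketch; the way the uploaded audio is typed and passed to the SDK in the actual repo may differ.

```typescript
// Sketch: time the Whisper transcription manually, since it is not auto-instrumented.
// `formAudio` is the audio file taken from the request's form data.
const start = new Date();
const transcription = await openai.audio.transcriptions.create({
  file: formAudio,
  model: "whisper-1",
});
const end = new Date();

const transcribedText = transcription.text;
```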
await run
  .step({
    type: "llm",
    name: "Speech to Text",
    input: { content: "Audio file" },
    output: { content: transcribedText },
    attachments: [attachment],
    startTime: start.toISOString(),
    endTime: end.toISOString(),
    generation: {
      provider: "openai",
      model: "whisper-1",
      prompt: "See attached audio file",
      completion: transcribedText,
    },
  })
  .send();
Next, in `src/app/api/emojify/route.ts`, we need to fetch the run and add the second step for the summarization. This time, we will make use of the built-in OpenAI instrumentation provided by the Literal AI SDK. This not only logs the latency, but also the token counts and model parameters.
// Instrument the call to OpenAI
literalClient.instrumentation.openai();

// ...

// Fetch the run
const runData = await literalClient.api.getStep(runId);
if (!runData) {
  return new NextResponse("Run not found", { status: 404 });
}

// This just instantiates the run data as a new Step instance so it can be used later
const run = literalClient.step(runData);

// Call the LLM API
const completion = await openai.chat.completions.create({
  ...prompt.settings,
  messages: [...promptMessages, { role: "user", content: text }],
});
Lastly, we will patch the run by providing its end time and the completion data. This allows us to monitor the perceived overall latency of each run, including the network latency between the two calls.
run.endTime = new Date().toISOString();
run.output = {
role: "assistant",
content: completion.choices[0].message.content,
};
await run.send();
With this setup, I can now monitor the performance of my application and the quality of the responses from OpenAI. This is just a starting point: once my application hits production and has a few runs logged, I can start to analyze the data and optimize the application:
- by improving the prompt and settings. This will then allow me to compare performance using different system prompts, models, temperatures, etc., by re-running actual runs.
- because all the audio is logged, I can also experiment with other STT models and compare their performance.
I hope this cookbook was helpful to you! I've included both the base version of the application and the version with Literal AI in the `without-literal` and `with-literal` folders. You can simply use `diff` to compare the two versions and see the changes I made, like so:
diff with-literal/src/app/page.tsx without-literal/src/app/page.tsx
If you are having issues integrating Literal AI into your own application, I would love to help! Feel free to reach out to me at damien@chainlit.io if you have any questions or feedback. Happy coding! 🖖