Important
Recently, I started working on the fourth generation of Jenny, a Telegram bot with a quirky personality. In some scenarios, it needed Speech Synthesis to narrate summaries of user commands, as well as Speech Recognition and Translation. So, I opted to use Microsoft Azure Speech and Language services to simplify the process. This repository demonstrates how I did these things using their SDKs.
This repository primarily focuses on demonstrating Speech Synthesis and Speech Recognition. Demonstrations of Translation, Text Summarization, Sentiment Detection, and Keyphrase Extraction were added later.
Useful Links:
- Quickstart: Convert text to speech
- Quickstart: Recognize and convert speech to text
- Quickstart: Recognize and translate speech to text
- Quickstart: using text, document and conversation summarization
Useful Nodejs Examples:
- Text to Speech Quickstart Code
- Speech to Text Quickstart Code
- Speech Translation Quickstart Code
- Text Summarization Quickstart Code
- Extract Key Phrases from Alternative Document Input
- Analyze Text Sentiment
Samples from Official Repositories:
- Microsoft Cognitive Services Speech SDK (TTS, STT, Speech Translation)
- Azure SDK for JavaScript (Text Summarization)
Playground:
Caution
The official documentation for these services is hard to follow. Some instructions are confusing because the terms AI Language and Language Service are used interchangeably. There is also a lot of information, yet it is scattered and disorganized.
Note
Speech Synthesis, Speech Recognition, and Translation are Speech Services that require the microsoft-cognitiveservices-speech-sdk NPM package. However, Text Summarization is a Language Service that requires a separate NPM package (@azure/ai-language-text), which, in turn, has a completely different style of API. It seems that the Speech Services SDK and the Language Services SDK were developed by separate teams, with no correlation between their API structures.
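To make that contrast concrete, here is a minimal sketch (not this repository's actual code) that initializes one client from each SDK, assuming the `.env` variables described in the setup steps below:

```js
require("dotenv").config();
const speechSdk = require("microsoft-cognitiveservices-speech-sdk");
const { TextAnalysisClient, AzureKeyCredential } = require("@azure/ai-language-text");

// Speech SDK style: build a config object, then drive a synthesizer or
// recognizer through success/error callbacks.
const speechConfig = speechSdk.SpeechConfig.fromSubscription(
  process.env.AZURE_SPEECH_SUBSCRIPTION_KEY,
  process.env.AZURE_SPEECH_REGION
);
const audioConfig = speechSdk.AudioConfig.fromAudioFileOutput("hello.wav");
const synthesizer = new speechSdk.SpeechSynthesizer(speechConfig, audioConfig);
synthesizer.speakTextAsync(
  "Hello from the Speech SDK!",
  () => synthesizer.close(), // success callback
  () => synthesizer.close()  // error callback
);

// Language SDK style: a Promise-based client that returns a poller for
// long-running operations such as summarization.
const languageClient = new TextAnalysisClient(
  process.env.AZURE_LANUAGE_ENDPOINT, // key names follow this repo's .env (see below)
  new AzureKeyCredential(process.env.AZURE_LANUAGE_SUBSCRIPTION_KEY)
);
(async () => {
  const poller = await languageClient.beginAnalyzeBatch(
    [{ kind: "ExtractiveSummarization", maxSentenceCount: 3 }],
    ["Some long-form text to summarize goes here."],
    "en"
  );
  const results = await poller.pollUntilDone();
  for await (const actionResult of results) console.log(actionResult);
})();
```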
- Sign in with your Microsoft or GitHub account and go to the Microsoft Azure Portal.
- Create a Speech Resource and a Language Resource in the Azure Portal.
- Take note of the first key (Key 1), the region (e.g. eastus, westus), and the endpoint for both resources. Create a `.env` file in the current directory and add the environment variables there:

```
AZURE_SPEECH_SUBSCRIPTION_KEY=<SPEECH FIRST KEY>
AZURE_SPEECH_REGION=<SPEECH REGION>
AZURE_LANUAGE_ENDPOINT=<LANGUAGE ENDPOINT URL>
AZURE_LANUAGE_SUBSCRIPTION_KEY=<LANGUAGE FIRST KEY>
```
Run the following command in the terminal:
npm install
This will install the following NPM dependencies:
- dotenv: Required for loading environment variables from the `.env` file.
- @azure/ai-language-text: Required for Text Summarization.
- @azure/ai-text-analytics: Required for Sentiment Detection and Keyphrase Extraction.
- microsoft-cognitiveservices-speech-sdk: Required for Text to Speech, Speech to Text, and Speech Translation.
Run the following command in the terminal (you can also use `npm test`):
node .
If everything is working as expected, it should produce the following output in the terminal:
Wrote speech to file: test.wav
Spoken text: Hi and thank you so much for paying a visit to this repository. It's been a pleasure to meet you.
Translated text: হাই এবং এই সংগ্রহস্থলটি দেখার জন্য আপনাকে অনেক ধন্যবাদ। আপনার সাথে দেখা করে খুব ভাল লাগল।
Summary:
[
'As you probably guessed, this was heavily inspired by the story and plot of "The Greatest Showman."'
]
Sentiment:
{
sentiment: 'negative',
scores: { positive: 0, neutral: 0.03, negative: 0.97 }
}
Keyphrases:
[
'favorite song cover', 'The Greatest Showman',
'best violinist', 'crazy dream',
'young boy', 'story',
'childhood', 'classmates',
'teachers', 'fun',
'Life', 'turn',
'stage', 'front',
'millions', 'people',
'doubters', 'wrong',
'passion', 'plot'
]
const speak: (text: string, talentID?: number, style?: string, wav?: boolean) => Promise<Buffer | false>
| Parameter | Description |
| --- | --- |
| `text` | The given text to be converted to speech. |
| `talentID` | Voice ID of the speaker. It can be any integer from 0 to 6. |
| `style` | Narrative style of the speech. Accepted values are: `default`, `cheerful`, `newscast`, `empathetic`, `excited`, `unfriendly`, `friendly`, `shy`, `embarrassed`, `serious`, `sad`, `relieved`, `angry`, `terrified`, `shouting`, `whispering`. Please note that not all voice models support all of the styles mentioned above. |
| `wav` | When `true`, the output format is WAV (PCM 48kHz 16-bit mono). When `false`, the output format is Opus (OGG 48kHz 16-bit mono). |
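For example, here is a hedged usage sketch; the `./index.js` import path and the output file name are assumptions, and per the signature above `speak()` resolves to `false` on failure:

```js
const fs = require("fs");
const { speak } = require("./index.js"); // hypothetical import path

(async () => {
  // Voice 2, "cheerful" style, WAV output (PCM 48kHz 16-bit mono).
  const audio = await speak("Nice to meet you!", 2, "cheerful", true);
  if (audio !== false) fs.writeFileSync("greeting.wav", audio);
})();
```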
const recognize: (buffer: Buffer, bengali?: boolean) => Promise<string | false>
| Parameter | Description |
| --- | --- |
| `buffer` | Audio buffer (e.g. `fs.readFileSync("test.wav")`). |
| `bengali` | `true` if the speech recognition language is Bengali. |
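A usage sketch under the same assumptions (the `./index.js` import path and the `bengali.wav` recording are hypothetical):

```js
const fs = require("fs");
const { recognize } = require("./index.js"); // hypothetical import path

(async () => {
  const english = await recognize(fs.readFileSync("test.wav"));          // English by default
  const bengali = await recognize(fs.readFileSync("bengali.wav"), true); // Bengali
  if (english !== false) console.log("Spoken text:", english);
})();
```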
const translate: (buffer: Buffer, from?: string, to?: string) => Promise<string | false>
| Parameter | Description |
| --- | --- |
| `buffer` | Audio buffer (e.g. `fs.readFileSync("test.wav")`). |
| `from` | Speech recognition language (Default: `en-US`). |
| `to` | Translation language (Default: `bn-IN`). |
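A usage sketch with the default languages (the `./index.js` import path is an assumption):

```js
const fs = require("fs");
const { translate } = require("./index.js"); // hypothetical import path

(async () => {
  // With the defaults, en-US speech is recognized and translated to bn-IN.
  const translated = await translate(fs.readFileSync("test.wav"));
  if (translated !== false) console.log("Translated text:", translated);
})();
```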
const summarize: (text: string, length?: boolean | string | null, language?: string) => Promise<string | false>
| Parameter | Description |
| --- | --- |
| `text` | The given long-form text. |
| `length` | By default, the `summarize` function produces an extractive summary of the given text, but if `length` is `true`, it produces an abstractive summary instead. `length` can be a string (`oneSentence`, `short`, `medium`) that denotes the size of the extractive summary. `length` can also be any integer from 1 to 20, specifying the maximum number of sentences to extract. Defaults to 3 when omitted. |
| `language` | Language of the given text (Default: `en`). |
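A usage sketch covering the different `length` values (the `./index.js` import path is an assumption):

```js
const { summarize } = require("./index.js"); // hypothetical import path

(async () => {
  const article = "Long-form English text goes here. It should span many sentences.";
  const extractive = await summarize(article);        // extractive, 3 sentences (default)
  const fiveSentences = await summarize(article, 5);  // extractive, up to 5 sentences
  const abstractive = await summarize(article, true); // abstractive summary
  const short = await summarize(article, "short");    // preset summary size
  if (extractive !== false) console.log("Summary:", extractive);
})();
```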
const sentiment: (text: string, language?: string) => Promise<false | {
sentiment: "positive" | "neutral" | "negative";
scores: { positive: number, neutral: number, negative: number };
}>
| Parameter | Description |
| --- | --- |
| `text` | The given text. |
| `language` | Language of the given text (Default: `en`). |
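A usage sketch (the `./index.js` import path is an assumption):

```js
const { sentiment } = require("./index.js"); // hypothetical import path

(async () => {
  const result = await sentiment("I absolutely loved visiting this repository!");
  if (result !== false) {
    console.log(result.sentiment); // "positive" | "neutral" | "negative"
    console.log(result.scores);    // e.g. { positive: 0.99, neutral: 0.01, negative: 0 }
  }
})();
```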
const keyphrases: (text: string, language?: string) => Promise<false | string[]>
| Parameter | Description |
| --- | --- |
| `text` | The given text. |
| `language` | Language of the given text (Default: `en`). |
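A usage sketch (the `./index.js` import path is an assumption):

```js
const { keyphrases } = require("./index.js"); // hypothetical import path

(async () => {
  const phrases = await keyphrases("Jenny narrates summaries of user commands.");
  if (phrases !== false) console.log("Keyphrases:", phrases); // array of strings
})();
```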
For my use case, I needed support for English and Bengali languages. Here are the language codes I used for each service:
| Service | Language Codes |
| --- | --- |
| Text to Speech | `en-AU`, `bn-BD` |
| Speech to Text | `en-US`, `bn-IN` |
| Speech Translation | `en-US`, `bn-IN` |
| Text Summarization | `en` |
| Sentiment Detection | `en`, `bn` |
| Keyphrase Extraction | `en`, `bn` |
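For instance, a sketch that exercises the Bengali codes end to end (the `./index.js` import path and the `bengali.wav` recording are hypothetical):

```js
const fs = require("fs");
const { recognize, translate, sentiment } = require("./index.js"); // hypothetical import path

(async () => {
  const audio = fs.readFileSync("bengali.wav");                 // hypothetical Bengali recording
  const bengaliText = await recognize(audio, true);             // Speech to Text (bn-IN)
  const englishText = await translate(audio, "bn-IN", "en-US"); // Speech Translation
  if (bengaliText !== false) {
    console.log(await sentiment(bengaliText, "bn"));            // Sentiment Detection (bn)
  }
  if (englishText !== false) console.log(englishText);
})();
```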
Support for other languages:
- Language Support (Text to Speech)
- Language Support (Speech to Text)
- Language Support (Speech Translation)
- Language Support (Text Summarization/Sentiment Detection/Keyphrase Extraction)
The source code is licensed under the MIT License.