This repository demonstrates the usage of Microsoft Azure Text-to-Speech (TTS) and Speech-to-Text (STT) services for Bengali and English languages.

ShadowShahriar/speech-services

speech-services

Important

Recently, I started working on the fourth generation of Jenny, a Telegram bot with a quirky personality. There were some scenarios where it needed Speech Synthesis to narrate summaries of the user command, Speech Recognition, and Translation. So, I opted to use Microsoft Azure Speech and Language services to simplify the process. This repository demonstrates how I did these using their SDK.

This repository primarily focuses on demonstrating Speech Synthesis and Speech Recognition. Demonstrations of Translation, Text Summarization, Sentiment Detection, and Keyphrase Extraction were added later.

Useful Links:

Useful Nodejs Examples:

Samples from Official Repositories:

Playground:

Walk Through

Caution

The official documentation for these services was hard to follow. Some instructions are confusing because they use the terms AI Language and Language Service interchangeably. There is also a lot of information, but it is scattered and disorganized.

Note

Speech Synthesis, Speech Recognition, and Translation are Speech Services that require the microsoft-cognitiveservices-speech-sdk NPM package. However, Text Summarization is a Language Service that requires a separate NPM package (@azure/ai-language-text), which, in turn, has a completely different type of API. It seems that the Speech Services SDK and the Language Services SDK were developed by separate teams, with no attempt to align their API structures.

Creating the Resources

  1. Sign in with your Microsoft or GitHub account and go to the Microsoft Azure Portal.

  2. Create a Speech Resource and a Language Resource in the Azure Portal.

  3. Note the first key (Key 1), region (e.g. eastus, westus), and endpoint for both resources. Create a .env file in the current directory and add the environment variables to it:

    AZURE_SPEECH_SUBSCRIPTION_KEY=<SPEECH FIRST KEY>
    AZURE_SPEECH_REGION=<SPEECH REGION>
    AZURE_LANUAGE_ENDPOINT=<LANGUAGE ENDPOINT URL>
    AZURE_LANUAGE_SUBSCRIPTION_KEY=<LANGUAGE FIRST KEY>
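
A missing or empty variable tends to surface later as an unhelpful authentication error, so it can be worth checking all four values up front. The following is a minimal sketch, not code from this repository: `loadConfig` and the returned property names are hypothetical, and it assumes the variables are already present in `process.env` (for example, loaded by a package such as dotenv). The variable names are copied verbatim from the `.env` listing above.

```javascript
// Hypothetical helper (not part of this repository): validates the four
// variables from the .env file. Names are copied exactly as listed above.
const REQUIRED_VARS = [
  "AZURE_SPEECH_SUBSCRIPTION_KEY",
  "AZURE_SPEECH_REGION",
  "AZURE_LANUAGE_ENDPOINT",
  "AZURE_LANUAGE_SUBSCRIPTION_KEY",
];

function loadConfig(env = process.env) {
  // Collect every missing or blank variable so the error lists them all at once.
  const missing = REQUIRED_VARS.filter(
    (name) => typeof env[name] !== "string" || env[name].trim() === ""
  );
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    speechKey: env.AZURE_SPEECH_SUBSCRIPTION_KEY,
    speechRegion: env.AZURE_SPEECH_REGION,
    languageEndpoint: env.AZURE_LANUAGE_ENDPOINT,
    languageKey: env.AZURE_LANUAGE_SUBSCRIPTION_KEY,
  };
}
```

Calling `loadConfig()` once at startup fails fast with the names of any variables that still need to be filled in.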

Installing the dependencies

Run the following command in the terminal:

npm install

This will install the following NPM dependencies:

Testing

Run the following command in the terminal (You can also use npm test if you want):

node .

If everything is working as expected, it should produce the following output in the terminal:

Wrote speech to file: test.wav
Spoken text: Hi and thank you so much for paying a visit to this repository. It's been a pleasure to meet you.
Translated text: হাই এবং এই সংগ্রহস্থলটি দেখার জন্য আপনাকে অনেক ধন্যবাদ। আপনার সাথে দেখা করে খুব ভাল লাগল।
Summary:
[
  'As you probably guessed, this was heavily inspired by the story and plot of "The Greatest Showman."'
]
Sentiment:
{
  sentiment: 'negative',
  scores: { positive: 0, neutral: 0.03, negative: 0.97 }
}
Keyphrases:
[
  'favorite song cover', 'The Greatest Showman',
  'best violinist',      'crazy dream',
  'young boy',           'story',
  'childhood',           'classmates',
  'teachers',            'fun',
  'Life',                'turn',
  'stage',               'front',
  'millions',            'people',
  'doubters',            'wrong',
  'passion',             'plot'
]

Methods

speak

const speak: (text: string, talentID?: number, style?: string, wav?: boolean) => Promise<Buffer | false>

| Parameter | Description |
| --- | --- |
| text | The text to convert to speech. |
| talentID | Voice ID of the speaker. It can be any integer from 0 to 6. |
| style | Narrative style of the given text. Accepted values are: default, cheerful, newscast, empathetic, excited, unfriendly, friendly, shy, embarassed, serious, sad, relieved, angry, terrified, shouting, whispering. Please note that not all voice models support all of the styles mentioned above. |
| wav | When true, the output format is wav (PCM, 48 kHz, 16-bit, mono). When false, the output format is opus (OGG, 48 kHz, 16-bit, mono). |
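
Azure applies speaking styles through SSML, using the `mstts:express-as` element inside a `voice` element. The sketch below shows how a style value could be turned into such a document. It is an illustration under assumptions, not the repository's implementation: `buildSsml` and `escapeXml` are hypothetical names, and because the mapping from `talentID` to a voice is not documented here, the voice name is passed in directly.

```javascript
// Hypothetical helpers (not part of this repository).

// Escape characters that are special in XML so arbitrary text is safe in SSML.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build an SSML document for one voice. The "default" style is emitted
// without an express-as wrapper; any other style wraps the text in one.
function buildSsml(text, voiceName, style = "default", lang = "en-US") {
  const body = escapeXml(text);
  const styled =
    style === "default"
      ? body
      : `<mstts:express-as style="${escapeXml(style)}">${body}</mstts:express-as>`;
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ` +
    `xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="${escapeXml(lang)}">` +
    `<voice name="${escapeXml(voiceName)}">${styled}</voice></speak>`
  );
}
```

For example, `buildSsml("Hello!", "en-US-JennyNeural", "cheerful")` yields a document that a style-capable voice can render; the voice name here is only an example value. A voice that does not support the requested style may fall back to its default delivery.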

recognize

const recognize: (buffer: Buffer, bengali?: boolean) => Promise<string | false>

| Parameter | Description |
| --- | --- |
| buffer | Audio buffer (e.g. fs.readFileSync("test.wav")) |
| bengali | true if the speech recognition language is Bengali. |
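
The signatures in this section suggest that each method resolves to `false` on failure instead of rejecting, so callers need to check the result before using it. One way to centralize that check is a small wrapper that converts `false` into a thrown error. This is a sketch under that assumption; `unwrap` is a hypothetical name and not part of this repository.

```javascript
// Hypothetical helper (not part of this repository): awaits a result that
// may be `false` and turns that failure value into a rejected promise.
async function unwrap(resultPromise, label) {
  const result = await resultPromise;
  if (result === false) {
    throw new Error(`${label} failed`);
  }
  return result;
}

// Usage sketch, assuming `recognize` is imported from this repository:
//   const text = await unwrap(recognize(buffer), "recognize");
```

With this wrapper, a chain such as synthesize, then recognize, then translate can use ordinary `try`/`catch` instead of testing each intermediate value against `false`.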

translate

const translate: (buffer: Buffer, from?: string, to?: string) => Promise<string | false>

| Parameter | Description |
| --- | --- |
| buffer | Audio buffer (e.g. fs.readFileSync("test.wav")) |
| from | Speech recognition language (Default: en-US) |
| to | Translation language (Default: bn-IN) |

summarize

const summarize: (text: string, length?: boolean|string|null, language?: string) => Promise<string | false>

| Parameter | Description |
| --- | --- |
| text | The long-form text to summarize. |
| length | By default, summarize produces an extractive summary (sentences taken from the given text). If length is true, it produces an abstractive summary instead. length can be a string (oneSentence, short, medium) that denotes the size of the extractive summary. length can also be any integer from 1 to 20 specifying the maximum number of sentences to extract in the extractive summary. Defaults to 3 when omitted. |
| language | Language of the given text (Default: en) |
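
Because `length` accepts several different kinds of value, it can help to see the cases laid out as code. The sketch below normalizes the argument into a small descriptor that follows the description above. It is an illustration only: `normalizeLength`, the descriptor shape, and the choice to treat `false` and `null` like an omitted value are assumptions, not the repository's actual logic.

```javascript
// Hypothetical helper (not part of this repository): maps the overloaded
// `length` argument onto one of three descriptor shapes.
const NAMED_SIZES = ["oneSentence", "short", "medium"];

function normalizeLength(length) {
  // `true` requests an abstractive summary.
  if (length === true) {
    return { kind: "abstractive" };
  }
  // A string must be one of the documented size names.
  if (typeof length === "string") {
    if (!NAMED_SIZES.includes(length)) {
      throw new RangeError(`length must be one of: ${NAMED_SIZES.join(", ")}`);
    }
    return { kind: "named", size: length };
  }
  // Omitted (assumed to include null and false): extract up to 3 sentences.
  if (length === undefined || length === null || length === false) {
    return { kind: "extractive", maxSentences: 3 };
  }
  // An integer from 1 to 20 caps the number of extracted sentences.
  if (Number.isInteger(length) && length >= 1 && length <= 20) {
    return { kind: "extractive", maxSentences: length };
  }
  throw new RangeError("length must be an integer from 1 to 20");
}
```

Rejecting out-of-range values early keeps an invalid request from reaching the service.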

sentiment

const sentiment: (text: string, language?: string) => Promise<false | {
    sentiment: "positive" | "neutral" | "negative";
    scores: { positive: number, neutral: number, negative: number };
}>

| Parameter | Description |
| --- | --- |
| text | Given text |
| language | Language of the given text (Default: en) |
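
The resolved object pairs a label with a confidence score for each class, so a caller can decide how much to trust the label. Here is a minimal sketch of consuming that shape. `describeSentiment` and the 0.6 threshold are hypothetical choices for illustration, not part of this repository.

```javascript
// Hypothetical helper (not part of this repository): turns the result of
// `sentiment` into a short label, flagging low-confidence results.
function describeSentiment(result, minConfidence = 0.6) {
  if (result === false) {
    return "sentiment unavailable";
  }
  // The score for the winning label doubles as its confidence.
  const confidence = result.scores[result.sentiment];
  const percent = Math.round(confidence * 100);
  if (confidence < minConfidence) {
    return `uncertain (${result.sentiment}, ${percent}%)`;
  }
  return `${result.sentiment} (${percent}%)`;
}
```

Passing the sample result from the Testing section (negative with a score of 0.97) gives a confident negative label, while a result whose top score falls below the threshold is reported as uncertain.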

keyphrases

const keyphrases: (text: string, language?: string) => Promise<false | string[]>

| Parameter | Description |
| --- | --- |
| text | Given text |
| language | Language of the given text (Default: en) |

Language Support

For my use case, I needed support for English and Bengali languages. Here are the language codes I used for each service:

| Service | Language Codes |
| --- | --- |
| Text to Speech | en-AU, bn-BD |
| Speech to Text | en-US, bn-IN |
| Speech Translation | en-US, bn-IN |
| Text Summarization | en |
| Sentiment Detection | en, bn |
| Keyphrase Extraction | en, bn |
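
Since each service expects a different code for the same language, keeping the table in one lookup object avoids scattering string literals through the code. The sketch below mirrors the table above. `LANGUAGE_CODES` and `languageCode` are hypothetical names, and the service keys are illustrative rather than taken from this repository.

```javascript
// Hypothetical lookup (not part of this repository), mirroring the table above.
const LANGUAGE_CODES = {
  textToSpeech: { english: "en-AU", bengali: "bn-BD" },
  speechToText: { english: "en-US", bengali: "bn-IN" },
  speechTranslation: { english: "en-US", bengali: "bn-IN" },
  textSummarization: { english: "en" },
  sentimentDetection: { english: "en", bengali: "bn" },
  keyphraseExtraction: { english: "en", bengali: "bn" },
};

// Returns the code for a service/language pair, or null when the table
// above lists no code for that combination.
function languageCode(service, language) {
  return LANGUAGE_CODES[service]?.[language] ?? null;
}
```

A `null` result, as for Bengali text summarization, signals that the table has no entry for that pair.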

Support for other languages:

License

The source code is licensed under the MIT License.
