Important
Recently, I started working on the fourth generation of Jenny, a Telegram bot with a quirky personality. In some scenarios, it needed Speech Synthesis to narrate summaries of user commands, as well as Speech Recognition and Translation. So, I opted to use Microsoft Azure Speech and Language services to simplify the process. This repository demonstrates how I did these things using their SDKs.
This repository primarily focuses on demonstrating Speech Synthesis and Speech Recognition. Demonstrations of Translation, Text Summarization, Sentiment Detection, and Keyphrase Extraction were added later.
Useful Links:
- Quickstart: Convert text to speech
- Quickstart: Recognize and convert speech to text
- Quickstart: Recognize and translate speech to text
- Quickstart: using text, document and conversation summarization
Useful Nodejs Examples:
- Text to Speech Quickstart Code
- Speech to Text Quickstart Code
- Speech Translation Quickstart Code
- Text Summarization Quickstart Code
- Extract Key Phrases from Alternative Document Input
- Analyze Text Sentiment
Samples from Official Repositories:
- Microsoft Cognitive Services Speech SDK (TTS, STT, Speech Translation)
- Azure SDK for JavaScript (Text Summarization)
Playground:
Caution
The official documentation for these services is hard to follow. Some instructions are confusing because the terms AI Language and Language Service are used interchangeably. There is also a lot of information, yet it is scattered and disorganized.
Note
Speech Synthesis, Speech Recognition, and Translation are Speech Services that require the microsoft-cognitiveservices-speech-sdk NPM package. However, Text Summarization is a Language Service that requires a separate NPM package (@azure/ai-language-text), which, in turn, has a completely different style of API. It seems that the Speech Services SDK and the Language Services SDK were developed by separate teams, with no correlation between their API structures.
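To make that contrast concrete, here is a minimal sketch (not this repository's actual code) that initializes one client from each SDK, assuming the `.env` variables described in the setup steps below:

```js
require("dotenv").config();
const speechSdk = require("microsoft-cognitiveservices-speech-sdk");
const { TextAnalysisClient, AzureKeyCredential } = require("@azure/ai-language-text");

// Speech SDK style: build a config object, then drive a synthesizer or
// recognizer through success/error callbacks.
const speechConfig = speechSdk.SpeechConfig.fromSubscription(
  process.env.AZURE_SPEECH_SUBSCRIPTION_KEY,
  process.env.AZURE_SPEECH_REGION
);
const audioConfig = speechSdk.AudioConfig.fromAudioFileOutput("hello.wav");
const synthesizer = new speechSdk.SpeechSynthesizer(speechConfig, audioConfig);
synthesizer.speakTextAsync(
  "Hello from the Speech SDK!",
  () => synthesizer.close(), // success callback
  () => synthesizer.close()  // error callback
);

// Language SDK style: a Promise-based client that returns a poller for
// long-running operations such as summarization.
const languageClient = new TextAnalysisClient(
  process.env.AZURE_LANUAGE_ENDPOINT, // key names follow this repo's .env (see below)
  new AzureKeyCredential(process.env.AZURE_LANUAGE_SUBSCRIPTION_KEY)
);
(async () => {
  const poller = await languageClient.beginAnalyzeBatch(
    [{ kind: "ExtractiveSummarization", maxSentenceCount: 3 }],
    ["Some long-form text to summarize goes here."],
    "en"
  );
  const results = await poller.pollUntilDone();
  for await (const actionResult of results) console.log(actionResult);
})();
```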
- Sign in with your Microsoft or GitHub account and go to the Microsoft Azure Portal.
- Create a Speech Resource and a Language Resource in the Azure Portal.
- Take note of the first key (Key 1), the region (e.g. eastus, westus), and the endpoint for both resources. Create a `.env` file in the current directory and add the environment variables there:

```
AZURE_SPEECH_SUBSCRIPTION_KEY=<SPEECH FIRST KEY>
AZURE_SPEECH_REGION=<SPEECH REGION>
AZURE_LANUAGE_ENDPOINT=<LANGUAGE ENDPOINT URL>
AZURE_LANUAGE_SUBSCRIPTION_KEY=<LANGUAGE FIRST KEY>
```
Run the following command in the terminal:
npm install
This will install the following NPM dependencies:
- dotenv: Required for loading environment variables from the `.env` file.
- @azure/ai-language-text: Required for Text Summarization.
- @azure/ai-text-analytics: Required for Sentiment Detection and Keyphrase Extraction.
- microsoft-cognitiveservices-speech-sdk: Required for Text to Speech, Speech to Text, and Speech Translation.
Run the following command in the terminal (you can also use `npm test`):
node .
If everything is working as expected, it should produce the following output in the terminal:
Wrote speech to file: test.wav
Spoken text: Hi and thank you so much for paying a visit to this repository. It's been a pleasure to meet you.
Translated text: হাই এবং এই সংগ্রহস্থলটি দেখার জন্য আপনাকে অনেক ধন্যবাদ। আপনার সাথে দেখা করে খুব ভাল লাগল।
Summary:
[
'As you probably guessed, this was heavily inspired by the story and plot of "The Greatest Showman."'
]
Sentiment:
{
sentiment: 'negative',
scores: { positive: 0, neutral: 0.03, negative: 0.97 }
}
Keyphrases:
[
'favorite song cover', 'The Greatest Showman',
'best violinist', 'crazy dream',
'young boy', 'story',
'childhood', 'classmates',
'teachers', 'fun',
'Life', 'turn',
'stage', 'front',
'millions', 'people',
'doubters', 'wrong',
'passion', 'plot'
]
const speak: (text: string, talentID?: number, style?: string, wav?: boolean) => Promise<Buffer | false>
| Parameter | Description |
| --- | --- |
| `text` | The given text to be converted to speech. |
| `talentID` | Voice ID of the speaker. It can be any integer from 0 to 6. |
| `style` | Narrative style of the speech. Accepted values are: `default`, `cheerful`, `newscast`, `empathetic`, `excited`, `unfriendly`, `friendly`, `shy`, `embarrassed`, `serious`, `sad`, `relieved`, `angry`, `terrified`, `shouting`, `whispering`. Please note that not all voice models support all of the styles mentioned above. |
| `wav` | When `true`, the output format is WAV (PCM 48kHz 16-bit mono). When `false`, the output format is Opus (OGG 48kHz 16-bit mono). |
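For example, here is a hedged usage sketch; the `./index.js` import path and the output file name are assumptions, and per the signature above `speak()` resolves to `false` on failure:

```js
const fs = require("fs");
const { speak } = require("./index.js"); // hypothetical import path

(async () => {
  // Voice 2, "cheerful" style, WAV output (PCM 48kHz 16-bit mono).
  const audio = await speak("Nice to meet you!", 2, "cheerful", true);
  if (audio !== false) fs.writeFileSync("greeting.wav", audio);
})();
```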
const recognize: (buffer: Buffer, bengali?: boolean) => Promise<string | false>
| Parameter | Description |
| --- | --- |
| `buffer` | Audio buffer (e.g. `fs.readFileSync("test.wav")`). |
| `bengali` | `true` if the speech recognition language is Bengali. |
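A usage sketch under the same assumptions (the `./index.js` import path and the `bengali.wav` recording are hypothetical):

```js
const fs = require("fs");
const { recognize } = require("./index.js"); // hypothetical import path

(async () => {
  const english = await recognize(fs.readFileSync("test.wav"));          // English by default
  const bengali = await recognize(fs.readFileSync("bengali.wav"), true); // Bengali
  if (english !== false) console.log("Spoken text:", english);
})();
```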
const translate: (buffer: Buffer, from?: string, to?: string) => Promise<string | false>
| Parameter | Description |
| --- | --- |
| `buffer` | Audio buffer (e.g. `fs.readFileSync("test.wav")`). |
| `from` | Speech recognition language (Default: `en-US`). |
| `to` | Translation language (Default: `bn-IN`). |
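A usage sketch with the default languages (the `./index.js` import path is an assumption):

```js
const fs = require("fs");
const { translate } = require("./index.js"); // hypothetical import path

(async () => {
  // With the defaults, en-US speech is recognized and translated to bn-IN.
  const translated = await translate(fs.readFileSync("test.wav"));
  if (translated !== false) console.log("Translated text:", translated);
})();
```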
const summarize: (text: string, length?: boolean | string | null, language?: string) => Promise<string | false>
| Parameter | Description |
| --- | --- |
| `text` | The given long-form text. |
| `length` | By default, the `summarize` function produces an extractive summary of the given text, but if `length` is `true`, it produces an abstractive summary instead. `length` can be a string (`oneSentence`, `short`, `medium`) that denotes the size of the extractive summary. `length` can also be any integer from 1 to 20, specifying the maximum number of sentences to extract. Defaults to 3 when omitted. |
| `language` | Language of the given text (Default: `en`). |
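A usage sketch covering the different `length` values (the `./index.js` import path is an assumption):

```js
const { summarize } = require("./index.js"); // hypothetical import path

(async () => {
  const article = "Long-form English text goes here. It should span many sentences.";
  const extractive = await summarize(article);        // extractive, 3 sentences (default)
  const fiveSentences = await summarize(article, 5);  // extractive, up to 5 sentences
  const abstractive = await summarize(article, true); // abstractive summary
  const short = await summarize(article, "short");    // preset summary size
  if (extractive !== false) console.log("Summary:", extractive);
})();
```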
const sentiment: (text: string, language?: string) => Promise<false | {
sentiment: "positive" | "neutral" | "negative";
scores: { positive: number, neutral: number, negative: number };
}>
| Parameter | Description |
| --- | --- |
| `text` | The given text. |
| `language` | Language of the given text (Default: `en`). |
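A usage sketch (the `./index.js` import path is an assumption):

```js
const { sentiment } = require("./index.js"); // hypothetical import path

(async () => {
  const result = await sentiment("I absolutely loved visiting this repository!");
  if (result !== false) {
    console.log(result.sentiment); // "positive" | "neutral" | "negative"
    console.log(result.scores);    // e.g. { positive: 0.99, neutral: 0.01, negative: 0 }
  }
})();
```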
const keyphrases: (text: string, language?: string) => Promise<false | string[]>
| Parameter | Description |
| --- | --- |
| `text` | The given text. |
| `language` | Language of the given text (Default: `en`). |
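A usage sketch (the `./index.js` import path is an assumption):

```js
const { keyphrases } = require("./index.js"); // hypothetical import path

(async () => {
  const phrases = await keyphrases("Jenny narrates summaries of user commands.");
  if (phrases !== false) console.log("Keyphrases:", phrases); // array of strings
})();
```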
For my use case, I needed support for English and Bengali languages. Here are the language codes I used for each service:
| Service | Language Codes |
| --- | --- |
| Text to Speech | `en-AU`, `bn-BD` |
| Speech to Text | `en-US`, `bn-IN` |
| Speech Translation | `en-US`, `bn-IN` |
| Text Summarization | `en` |
| Sentiment Detection | `en`, `bn` |
| Keyphrase Extraction | `en`, `bn` |
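For instance, a sketch that exercises the Bengali codes end to end (the `./index.js` import path and the `bengali.wav` recording are hypothetical):

```js
const fs = require("fs");
const { recognize, translate, sentiment } = require("./index.js"); // hypothetical import path

(async () => {
  const audio = fs.readFileSync("bengali.wav");                 // hypothetical Bengali recording
  const bengaliText = await recognize(audio, true);             // Speech to Text (bn-IN)
  const englishText = await translate(audio, "bn-IN", "en-US"); // Speech Translation
  if (bengaliText !== false) {
    console.log(await sentiment(bengaliText, "bn"));            // Sentiment Detection (bn)
  }
  if (englishText !== false) console.log(englishText);
})();
```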
Support for other languages:
- Language Support (Text to Speech)
- Language Support (Speech to Text)
- Language Support (Speech Translation)
- Language Support (Text Summarization/Sentiment Detection/Keyphrase Extraction)
The source code is licensed under the MIT License.