Create your own audiobook with open source tools

Introduction

This is a proof of concept for creating an audiobook using free TTS tools. Starting with a long text file (e.g. extracted from a PDF or copied from the Internet), we can generate an audio file (e.g. MP3, AAC, ...).

Technologies

I have found that the text-to-speech (TTS) programs commonly available on Linux, such as accessibility tools, produce speech that I do not find acceptable to listen to for hours on end.

I came across https://github.com/coqui-ai/TTS, which produces much better-sounding audio.

There is a docker image to get you started: https://github.com/coqui-ai/TTS#docker-image

Initial experimentation:

# start docker container
sudo docker run -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu

# (inside the container) start the server with a voice model:
python3 TTS/server/server.py --model_name tts_models/en/ljspeech/tacotron2-DDC

This launches a web server at http://localhost:5002 where you can enter text and listen to the generated voice.

I quickly found that there are two limits on the length of text that can be processed at one time. First, the text is sent as a URL query parameter, which restricts how much can go into a single request. Second, the speech model can become unstable with longer inputs.

To get around this, I tried to break up my text into smaller parts. The natural "parts" of a text would be the sentences.
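
Sentence boundaries do not by themselves bound the length of a request, so a very long sentence can still run into either limit. The sketch below is one possible safeguard and is not part of the original script: it uses `textwrap.wrap` from the standard library to break any over-long sentence at whitespace. The function name `cap_length` and the 250-character default are assumptions chosen for illustration.

```python
import textwrap


def cap_length(sentences, max_chars=250):
    """Split any sentence longer than max_chars at whitespace.

    max_chars=250 is an arbitrary illustrative limit, not a value
    documented by the TTS server.
    """
    parts = []
    for sentence in sentences:
        if len(sentence) <= max_chars:
            parts.append(sentence)
        else:
            # textwrap.wrap breaks on whitespace and keeps each piece
            # at or below the requested width.
            parts.extend(textwrap.wrap(sentence, width=max_chars))
    return parts
```

Splitting in the middle of a sentence will probably produce an audible pause or a change in intonation, so this is best treated as a fallback for outliers only.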

Python script to process larger text

python/main.py

Here is a summary of what this Python script does:

read a text file
remove line breaks (to improve sentence tokenisation)
split the text into sentences using https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize
make an HTTP request to the running coqui-ai docker container
store the WAV file from the response in the file system
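
The steps above can be sketched as follows. This is an illustrative reconstruction, not the contents of python/main.py: the function names, the `segment_00001.wav` naming pattern (taken from the sox example in this README), and the `/api/tts?text=...` endpoint are assumptions, so check the endpoint against the server version you run. `nltk` and its Punkt tokenizer data must be installed for the sentence splitting.

```python
import pathlib
import re
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5002"  # port published by the docker command


def normalize_text(raw):
    """Collapse line breaks and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def split_sentences(text):
    # Imported lazily so the other helpers work without nltk installed.
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(text)


def build_tts_url(sentence, base_url=BASE_URL):
    # Assumed endpoint: the text travels as a URL query parameter.
    return f"{base_url}/api/tts?" + urllib.parse.urlencode({"text": sentence})


def segment_path(out_dir, index):
    # Zero-padded names keep the files in reading order when sorted.
    return pathlib.Path(out_dir) / f"segment_{index:05d}.wav"


def text_to_wavs(text_file, out_dir):
    """Read a text file and write one WAV file per sentence into out_dir."""
    raw = pathlib.Path(text_file).read_text(encoding="utf-8")
    sentences = split_sentences(normalize_text(raw))
    pathlib.Path(out_dir).mkdir(parents=True, exist_ok=True)
    for index, sentence in enumerate(sentences, start=1):
        with urllib.request.urlopen(build_tts_url(sentence)) as response:
            segment_path(out_dir, index).write_bytes(response.read())
```

With the container running, a call such as `text_to_wavs("book.txt", "wav")` (both names are placeholders) would fill the `wav` folder with numbered segments.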

To listen to the audio, you can enqueue some WAV files into a playlist in a player such as VLC.

Performance: on my old 4-core CPU, generating about 25 seconds of speech took about 10 seconds, i.e. roughly 2.5 times faster than real time.
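
As a back-of-the-envelope sketch, the measurement above can be turned into an estimate for a whole book. It assumes the rate stays constant over the entire text, which a single measurement does not guarantee, and the function name is made up for illustration.

```python
def estimate_processing_seconds(audio_seconds, speech_seconds=25.0, compute_seconds=10.0):
    """Estimate compute time from the measured ratio (25 s of speech per 10 s)."""
    speedup = speech_seconds / compute_seconds  # 2.5 with the defaults
    return audio_seconds / speedup


# A 10-hour audiobook would need about 4 hours of processing at that rate.
hours = estimate_processing_seconds(10 * 3600) / 3600
print(f"{hours:.1f} hours")  # 4.0 hours
```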

Combining the WAV files

I tried assembling the separate WAVs in the Python script using https://github.com/MaxStrange/AudioSegment, but this failed with array size errors that I did not investigate further. I decided to use the command line tool sox (Sound eXchange) instead:

sox segment_00001.wav segment_00002.wav final.wav

Then use ffmpeg to compress it into an AAC file, for example:

ffmpeg -i final.wav -c:a aac -b:a 64k -threads 4 final.aac
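
The concatenation step can also stay in Python. The following is a minimal sketch that uses the standard-library `wave` module in place of sox or AudioSegment. It assumes all segments are uncompressed PCM WAV files with the same channel count, sample width, and sample rate, which should be checked for the model in use. The function name is illustrative.

```python
import wave


def concatenate_wavs(segment_paths, output_path):
    """Append the audio frames of each WAV file to output_path, in order."""
    segment_paths = list(segment_paths)
    if not segment_paths:
        raise ValueError("no segments to concatenate")
    with wave.open(str(output_path), "wb") as out:
        expected = None
        for path in segment_paths:
            with wave.open(str(path), "rb") as segment:
                # (channels, sample width in bytes, frame rate)
                fmt = tuple(segment.getparams())[:3]
                if expected is None:
                    expected = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(segment.readframes(segment.getnframes()))
```

For example, `concatenate_wavs(sorted(pathlib.Path("wav").glob("segment_*.wav")), "final.wav")` would join every segment in a hypothetical `wav` folder in filename order. The result can then be passed to ffmpeg as before.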

Findings

The most important factor in quality is the input text. For example, if it is a scanned PDF that has been OCRed, structures in the text such as page numbers, images, footnotes, etc. will disrupt the reading flow.
Sometimes the TTS model gets "stuck in a loop" where instead of a normal voice you hear a robotic sound that lasts for a few seconds. For me, this happened randomly every 5-10 minutes. It is a bit annoying.
Rarely, the HTTP request got stuck completely. It seems to me that certain sentences reliably trigger the problem. I have not been able to find out what exactly is causing this. I suspect it might have something to do with different quoting styles used in a longer sentence. This happened 3 times in a text input with 10,000 sentences. As the Python script has no good timeout or error handling in general, I killed the script and removed the offending sentence manually.
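
The stuck requests could be handled in code rather than by hand. The sketch below shows one way to do it and is not what the original script does: each request gets a timeout, and sentences that fail are recorded and skipped so the run continues. The function names, the 120-second default, and the `/api/tts?text=...` endpoint are assumptions.

```python
import http.client
import pathlib
import urllib.parse
import urllib.request


def fetch_wav(sentence, base_url="http://localhost:5002", timeout_s=120.0):
    """Request one sentence; raises if the server stays silent past timeout_s."""
    url = f"{base_url}/api/tts?" + urllib.parse.urlencode({"text": sentence})
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.read()


def synthesize_all(sentences, out_dir, fetch=fetch_wav):
    """Write one WAV per sentence; return the sentences that failed."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for index, sentence in enumerate(sentences, start=1):
        try:
            audio = fetch(sentence)
        except (OSError, http.client.HTTPException) as error:
            # Timeouts, URL errors and connection errors are OSError subclasses.
            failed.append((index, sentence, repr(error)))
            continue
        (out_dir / f"segment_{index:05d}.wav").write_bytes(audio)
    return failed
```

The returned list shows which sentences to inspect afterwards. A client-side timeout does not necessarily free the server, so if later requests also fail, the container may need a restart.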

Summary

If you have some longer texts or books that you've always wanted to read but haven't had the time for, and you are not afraid of Linux command line tools like docker and like to play a bit with Python, you can relatively easily produce an audio file with an artificial voice that actually sounds good.

Repository: thomassteinreiter/audiobook
