Tools for selecting a smaller set of sentences from a larger text corpus, for creating manuscripts for TTS and ASR.
I. Requirements
II. Set up DB
III. Scripttool
IV. Config files
V. Sample scripts
VI. Documentation
- python3
- sqlite3
- go 1.16 (or higher)
For Swedish: https://dumps.wikimedia.org/svwiki/latest/svwiki-latest-pages-articles-multistream.xml.bz2
python wp_dump_extract/WikiExtractor.py --no_templates -o <out dir> <dump file>
The above step takes a lot of time.
cat dbapi/schema_sqlite.sql | sqlite3 <new db file>
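Before starting the long-running load step, it can be useful to confirm that the schema was actually applied to the new database file. The following is a minimal sketch using Python's standard sqlite3 module. The function name list_tables is an illustration and not part of this repository, and no particular table names are assumed.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path."""
    # Note: sqlite3.connect creates an empty database file if db_path
    # does not exist, so an empty result may also mean a wrong path.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling list_tables with the path of the new db file should return a non-empty list if the schema step succeeded; an empty list suggests that the schema was not loaded or that the path is wrong.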
go run cmd/load_db/main.go <options> <db file> <featcatdir> <WikiExtractor.py output files>
where featcatdir is the directory in which feature category/domain files reside. This repository contains a set of domain files, located in the feat_data folder: Swedish words for sports, weather, common names, etc. More information can be found in the documentation manuscript_tool.pdf (Swedish only).
The steps above take a long time and produce a very large database file, because a large number of features and relations are stored for every sentence in the corpus.
scripttool is a CLI for manipulating a script database created according to the instructions above. You create batches by filtering sentences in the database, and from these batches you can create output manuscripts. You can also retrieve information about the database, such as listing existing batches/scripts, printing db statistics, etc.
Usage:
go run cmd/scripttool/*.go <db file> <command> <args>
For full usage and documentation, please invoke
go run cmd/scripttool/*.go
go run cmd/scripttool/*.go <db file> scriptgen <config file>
More information about config files can be found in sections Config files and Sample scripts below.
go run cmd/scripttool/*.go <db file> export_script <script name(s)>
go run cmd/scripttool/*.go <db file> list_filter_feats
go run cmd/scripttool/*.go <db file> list_selector_feats
go run cmd/scripttool/*.go <db file> help
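When the scripttool commands above are driven from a Python script instead of an interactive shell, the *.go pattern is not expanded automatically, because no shell is involved. The sketch below builds the argument list by expanding the pattern with the standard glob module. The helper name scripttool_command is hypothetical, and the sketch assumes it is run from the repository root, like the commands above.

```python
import glob
import os


def scripttool_command(db_file, command, *args):
    """Build the argument list for: go run cmd/scripttool/*.go <db file> <command> <args>."""
    # Expand the *.go pattern here, since subprocess does not use a shell
    # when it is given a list of arguments.
    pattern = os.path.join("cmd", "scripttool", "*.go")
    sources = sorted(glob.glob(pattern))
    if not sources:
        raise FileNotFoundError(f"no Go sources match {pattern}")
    return ["go", "run", *sources, db_file, command, *args]
```

The resulting list can be passed to subprocess.run with check=True, for example with the db file path as the first argument and list_filter_feats as the command.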
Some config examples can be found in the folder config_examples.
Config files used to generate sample scripts can be found in the folder sample_scripts.
Sample scripts can be found in the latest release.
This work was supported by the Swedish Post and Telecom Authority (PTS) through the grant "Talresursinsamlaren – För ett tillgängligare Wikipedia genom Wikispeech" (2019–2021).