At WebSummit 2019, 46 talks (mainly from the Central Stage) were automatically transcribed in real time using the Otter.ai platform. This repo provides the transcribed text, as well as the code to re-download and preprocess it.
With this dataset, you can run statistical analysis on the text of the transcripts, or even train a neural network model to produce your very own WebSummit speech.
Each speech is an individual .txt file inside the plain-texts folder (e.g. 224STLFR2BIGPLOD.txt). All the speeches are titled using their id from the otter.ai platform. If you have a different naming scheme to propose, I'm all ears! :D
Apart from that, inside the plain-texts folder, there is a data.json file that includes all the raw data from otter.ai.
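As a starting point for the statistical analysis mentioned above, here is a minimal Python sketch that counts word frequencies across the transcripts. It assumes only the layout described here (one .txt file per speech inside the plain-texts folder); the function name `word_frequencies` and the simple tokenization rule are illustrative choices, not part of this repo.

```python
from collections import Counter
from pathlib import Path
import re


def word_frequencies(folder: str = "plain-texts") -> Counter:
    """Count lowercase word occurrences across every .txt file in `folder`.

    Assumption: each speech is a UTF-8 text file; data.json is skipped
    because only *.txt files are matched.
    """
    counts: Counter = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8").lower()
        # Naive tokenization: runs of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text))
    return counts


if __name__ == "__main__":
    # Print the 20 most common words, one per line.
    for word, n in word_frequencies().most_common(20):
        print(f"{word}\t{n}")
```

Run it from the repo root so that the relative plain-texts path resolves. For anything beyond a quick look, you will probably want to filter out stop words first.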
npm install
npm run compile
npm run datagen (This downloads the transcripts from otter.ai and produces the data.json file.)
npm run preprocess (This reads the data.json file and creates the various .txt files.)
Please fork, copy, share and contribute!