Finka
Fetch the corpus from IDA or some other place. For each session, LAT requires an m4a and wav version of the audio data as well as a TextGrid file for the audio data. We presume that we have the wav and TextGrid files available.
If needed, lower the sampling rate of the wav files for LAT:
for f in WAVFILES; do ffmpeg -i "$f" -ar 22050 output.wav && mv --force output.wav "$f"; done
You may get the following warning, which can probably be ignored:
"Using AVStream.codec.time_base as a timebase hint to the muxer is deprecated. Set AVStream.time_base instead."
For each .wav file, also provide a .m4a version:
for f in WAVFILES; do ffmpeg -i "$f" -c:a libfdk_aac -vbr 1 "${f%.wav}.m4a"; done
On taito-shell, you currently get the following warnings, which can probably also be ignored:
[libfdk_aac @ 0x1f97380] Note, the VBR setting is unsupported and only works with some parameter combinations
[ipod @ 0x1f96420] Using AVStream.codec.time_base as a timebase hint to the muxer is deprecated. Set AVStream.time_base instead.
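Since LAT needs a wav, an m4a and a TextGrid file for every session, a quick completeness check after the conversion can catch files that ffmpeg failed to write. The following is only a sketch; `missing_companions` is a hypothetical helper name, and it assumes that the three files of a session share the same base name and sit in the same directory.

```shell
#!/bin/sh
# Sketch: list .wav files that lack a non-empty .m4a or .TextGrid next to them.
# missing_companions is a hypothetical helper, not part of LAT or ffmpeg.
missing_companions() {
    # $1 = directory to scan recursively
    find "$1" -type f -name '*.wav' | while IFS= read -r wav; do
        stem=${wav%.wav}
        for ext in m4a TextGrid; do
            # -s is true only if the file exists and is not empty
            [ -s "$stem.$ext" ] || printf '%s: missing %s\n' "$wav" "$ext"
        done
    done
}
```

Calling `missing_companions .` in the corpus directory prints one line per missing file and prints nothing if every session is complete.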
Remove the lowest tier, named "original", from the TextGrid files if needed. TODO: a script for this?
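One possible starting point for such a script is sketched below. It rests on several assumptions that are not confirmed here: the TextGrid is in Praat's long text format, it is already encoded as UTF-8 (so it would have to run after the conversion step), and the tier to drop is the last `item [N]:` block in the file. `strip_last_tier` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print a TextGrid without its last tier, if that tier has the given name.
# Assumes Praat long text format and UTF-8 input. strip_last_tier is hypothetical.
strip_last_tier() {
    # $1 = TextGrid file, $2 = tier name; the result goes to stdout
    awk -v name="$2" '
        # Pass 1: remember where the last "item [N]:" block starts
        # and whether its name line carries the wanted tier name.
        NR == FNR {
            if ($0 ~ /^[ \t]*item \[[0-9]+\]:/) { start = FNR; hit = 0 }
            else if (start && $0 ~ /^[ \t]*name = /) {
                if (index($0, "\"" name "\"")) hit = 1
            }
            next
        }
        # Pass 2: skip the last tier and lower the top-level tier count by one.
        hit && FNR >= start { next }
        hit && /^size = [0-9]+/ { sub(/[0-9]+/, $3 - 1) }
        { print }
    ' "$1" "$1"
}
```

A call such as `strip_last_tier session.TextGrid original > tmp && mv --force tmp session.TextGrid` would then rewrite one file (the file name is a placeholder). If the last tier has a different name, the file is printed unchanged.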
Convert TextGrid files from UTF-16 to UTF-8:
for f in TEXTGRIDFILES; do iconv -f UTF-16 -t UTF-8 "$f" -o tmp && mv --force tmp "$f"; done
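Running the conversion twice on the same file would feed UTF-8 input to a UTF-16 decoder. A guarded variant could look like the sketch below, which converts a file only if it starts with a UTF-16 byte order mark. It assumes that the UTF-16 TextGrid files do carry a byte order mark; `is_utf16` and `to_utf8` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: convert a file from UTF-16 to UTF-8 only if it starts with a UTF-16 BOM.
# is_utf16 and to_utf8 are hypothetical helpers.
is_utf16() {
    # Read the first two bytes as hex: FF FE (little endian) or FE FF (big endian)
    bom=$(od -An -tx1 -N 2 "$1" | tr -d ' \n')
    [ "$bom" = "fffe" ] || [ "$bom" = "feff" ]
}

to_utf8() {
    if is_utf16 "$1"; then
        # Write next to the original, then replace it only if iconv succeeded
        iconv -f UTF-16 -t UTF-8 "$1" > "$1.utf8.tmp" && mv -f "$1.utf8.tmp" "$1"
    fi
}
```

With these helpers the loop becomes `for f in TEXTGRIDFILES; do to_utf8 "$f"; done`, and files that are already UTF-8 are left untouched.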
Replace non-ASCII characters in file names and paths with ASCII equivalents, e.g. ÄÖäö -> AOao. TODO: what is the easiest way to rename files that contain characters outside the ASCII range?
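One way to do the renaming, offered only as a sketch, is to walk the tree depth-first and rewrite each name with sed. It assumes that the file names are UTF-8 in precomposed form and that only the four letters listed above occur; more substitutions can be added to `ascii_name` as needed. Both function names are hypothetical, and names containing newlines are not handled.

```shell
#!/bin/sh
# Sketch: replace Ä Ö ä ö in file and directory names with A O a o.
# ascii_name and rename_tree are hypothetical helpers.
ascii_name() {
    printf '%s\n' "$1" | sed -e 's/Ä/A/g' -e 's/Ö/O/g' -e 's/ä/a/g' -e 's/ö/o/g'
}

rename_tree() {
    # -depth lists the contents of a directory before the directory itself,
    # so a parent is renamed only after everything inside it has been handled.
    find "$1" -depth | while IFS= read -r path; do
        base=$(basename "$path")
        new=$(ascii_name "$base")
        [ "$base" = "$new" ] && continue
        target=$(dirname "$path")/$new
        if [ -e "$target" ]; then
            # Never overwrite an existing file or directory
            printf 'skipping %s: %s already exists\n' "$path" "$target" >&2
        else
            mv -- "$path" "$target"
        fi
    done
}
```

For example, `rename_tree finka` would turn `finka/Korpiselkä` into `finka/Korpiselka`. References to the old names inside the IMDI files would still need to be updated separately.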
From The Language Archive web page:
"The ISLE Meta Data Initiative (IMDI) is a metadata standard to describe multi-media and multi-modal language resources. The standard provides interoperability for browsable and searchable corpus structures and resource descriptions. All TLA archiving tools (Arbil, LAMUS, etc.) are compatible with IMDI."
For a LAT package foo, the outline of the IMDI directory structure would look something like this:
readme.txt
foo.imdi
foo/
    subdir1.imdi
    subdir1/
        subsubdir1.imdi
        subsubdir1/
            subsubdir1_fullname.m4a
            subsubdir1_fullname.TextGrid
            subsubdir1_fullname.wav
        ...
        subsubdirN.imdi
        subsubdirN/
    ...
    subdirN.imdi
    subdirN/
For a given directory dir in an IMDI tree, there is an IMDI file named dir.imdi at the same level as the directory itself. The IMDI file essentially documents what is found under directory dir and links to those resources.
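This naming rule is easy to check mechanically. The sketch below lists every directory that has no matching .imdi file next to it; `missing_imdi` is a hypothetical helper name, and the check only tests that the file exists, not that its contents or links are valid.

```shell
#!/bin/sh
# Sketch: list directories in a package that lack a sibling <dir>.imdi file.
# missing_imdi is a hypothetical helper.
missing_imdi() {
    # $1 = package root directory, e.g. foo (foo.imdi is expected next to it)
    find "$1" -type d | while IFS= read -r dir; do
        # Drop a trailing slash so that "foo/" maps to "foo.imdi"
        [ -f "${dir%/}.imdi" ] || printf '%s\n' "$dir"
    done
}
```

Running `missing_imdi foo` from the directory that contains foo.imdi prints nothing when every directory in the tree has its IMDI file.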
For the structure of the IMDI files, see for example the finka package:
- a root-level IMDI file named finka.imdi that lists six subdirectories organized according to the speakers' places of origin (Ilomantsi, Impilahti, Korpiselkä, Salmi, Suistamo, Suojärvi)
- an IMDI file under directory finka named Ilomantsi.imdi that lists 55 subdirectories organized according to the original audio tapes from Ilomantsi (each tape corresponds to a separate LAT session)
- an IMDI file under directory finka/Ilomantsi named Ilomantsi_01nA.imdi that contains links to the m4a, TextGrid and wav files of a given session located under directory finka/Ilomantsi/Ilomantsi_01nA