eaxelson edited this page Dec 19, 2018 · 1 revision

Preprocess the files

Fetch the corpus from IDA or some other place. For each session, LAT requires an m4a and wav version of the audio data as well as a TextGrid file for the audio data. We presume that we have the wav and TextGrid files available.
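The per-session requirement can be checked once preprocessing is done. The following is a minimal sketch, assuming that the three files of a session share a base name and differ only in extension; missing_companions is a hypothetical helper name, not part of LAT.

```shell
# Print the path of every .m4a or .TextGrid file that is missing next to a
# .wav file somewhere below the given directory.
# Assumes file names contain no newline characters.
missing_companions() {
    find "$1" -type f -name '*.wav' | while IFS= read -r wav; do
        stem=${wav%.wav}
        for ext in m4a TextGrid; do
            if [ ! -f "$stem.$ext" ]; then
                printf '%s\n' "$stem.$ext"
            fi
        done
    done
}
```

Calling missing_companions on the corpus directory prints nothing when every wav file has both companions.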

If needed, lower the sample rate of the wav files for LAT:

for f in WAVFILES; do ffmpeg -i "$f" -ar 22050 output.wav && mv --force output.wav "$f"; done

You may get the following warning, which can probably be ignored:

"Using AVStream.codec.time_base as a timebase hint to the muxer is deprecated. Set AVStream.time_base instead."
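Whether resampling is needed can be decided per file by reading the current sample rate with ffprobe, which ships with ffmpeg. This is only a sketch: sample_rate and needs_resample are hypothetical helper names, and treating 22050 Hz as the upper limit is an assumption based on the target rate used in the command above.

```shell
# Print the sample rate (in Hz) of the first audio stream of a file.
sample_rate() {
    ffprobe -v error -select_streams a:0 \
        -show_entries stream=sample_rate \
        -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Succeed if the given rate is above the 22050 Hz target.
needs_resample() {
    [ "$1" -gt 22050 ]
}

# Usage, with WAVFILES as the same placeholder as in the command above:
# for f in WAVFILES; do
#     if needs_resample "$(sample_rate "$f")"; then
#         ffmpeg -i "$f" -ar 22050 output.wav && mv --force output.wav "$f"
#     fi
# done
```

Files that are already at 22050 Hz or below are left untouched, so the loop can be run again without resampling a file twice.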

For each .wav file, also create an .m4a version:

for f in WAVFILES; do ffmpeg -i "$f" -c:a libfdk_aac -vbr 1 "${f%.wav}.m4a"; done

On taito-shell, you currently get the following warnings, which can probably also be ignored:

[libfdk_aac @ 0x1f97380] Note, the VBR setting is unsupported and only works with some parameter combinations
[ipod @ 0x1f96420] Using AVStream.codec.time_base as a timebase hint to the muxer is deprecated. Set AVStream.time_base instead.

Remove the lowest tier, named "original", from the TextGrid files if needed. TODO: a script for this?

Convert TextGrid files from UTF-16 to UTF-8:

for f in TEXTGRIDFILES; do iconv -f UTF-16 -t UTF-8 "$f" -o tmp && mv --force tmp "$f"; done
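Running the conversion on a file that is already UTF-8 would damage it or make iconv fail, so it can help to convert only files that really are UTF-16. The sketch below assumes that the UTF-16 TextGrid files start with a byte order mark; files without one would be skipped. has_utf16_bom is a hypothetical helper name.

```shell
# Succeed if the file starts with a UTF-16 byte order mark (FF FE or FE FF).
has_utf16_bom() {
    # Read the first two bytes and print them as plain hex, e.g. "fffe".
    bom=$(head -c 2 -- "$1" | od -An -tx1 | tr -d ' \n')
    case "$bom" in
        fffe|feff) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage, with TEXTGRIDFILES as the same placeholder as in the command above:
# for f in TEXTGRIDFILES; do
#     has_utf16_bom "$f" && iconv -f UTF-16 -t UTF-8 "$f" -o tmp && mv --force tmp "$f"
# done
```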

Replace non-ASCII characters in file names and paths with ASCII equivalents, e.g. ÄÖäö -> AOao. TODO: what is the easiest way to rename files that contain characters outside the ASCII range?
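One possible way to do the renaming is sketched below. It covers only the four characters named above and assumes that names are UTF-8 with precomposed characters and contain no newlines. to_ascii and rename_tree are hypothetical helper names.

```shell
# Map the characters Ä, Ö, ä and ö to their ASCII base letters.
to_ascii() {
    printf '%s\n' "$1" | sed 's/Ä/A/g; s/Ö/O/g; s/ä/a/g; s/ö/o/g'
}

# Rename every file and directory below the given root.
# The root directory itself is left as it is.
rename_tree() {
    # -depth lists the contents of a directory before the directory itself,
    # so a parent is renamed only after everything inside it has been handled.
    find "$1" -depth -mindepth 1 | while IFS= read -r path; do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        new=$(to_ascii "$base")
        if [ "$new" = "$base" ]; then
            continue
        fi
        if [ -e "$dir/$new" ]; then
            # Never overwrite an existing file or directory.
            echo "skipped, target exists: $dir/$new" >&2
        else
            mv -- "$path" "$dir/$new"
        fi
    done
}
```

For example, rename_tree finka would turn finka/Korpiselkä into finka/Korpiselka. Other non-ASCII characters would need their own substitution in to_ascii.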

Create an IMDI structure

From The Language Archive web page:

"The ISLE Meta Data Initiative (IMDI) is a metadata standard to describe multi-media and multi-modal language resources. The standard provides interoperability for browsable and searchable corpus structures and resource descriptions. All TLA archiving tools (Arbil, LAMUS, etc.) are compatible with IMDI."

For a LAT package foo, the outline of an IMDI directory structure would be something like:

    readme.txt
    foo.imdi
    foo/
        subdir1.imdi
        subdir1/
            subsubdir1.imdi
            subsubdir1/
                subsubdir1_fullname.m4a
                subsubdir1_fullname.TextGrid
                subsubdir1_fullname.wav
            ...
            subsubdirN.imdi
            subsubdirN/
        ...
        subdirN.imdi
        subdirN/

For every directory dir in an IMDI tree, there is an IMDI file named dir.imdi at the same level as the directory itself. The IMDI file essentially documents what is found under dir and links to those resources.
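The naming rule can be checked mechanically. The following sketch lists the directories of a package that lack a sibling IMDI file; missing_imdi is a hypothetical helper name, and the check looks only at file names, not at the links inside the IMDI files.

```shell
# Print every directory at or below the package root that has no
# sibling <dir>.imdi file. Assumes names contain no newline characters.
missing_imdi() {
    root=${1%/}    # drop a trailing slash so that "$dir.imdi" is built correctly
    find "$root" -type d | while IFS= read -r dir; do
        if [ ! -f "$dir.imdi" ]; then
            printf '%s\n' "$dir"
        fi
    done
}
```

For the outline above, missing_imdi foo would print nothing if foo.imdi, every subdirN.imdi and every subsubdirN.imdi are in place.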

For the structure of IMDI files, see for example the finka package:

  • root level imdi file named finka.imdi that lists six subdirectories organized according to places of origin of speakers (Ilomantsi, Impilahti, Korpiselkä, Salmi, Suistamo, Suojärvi)
  • imdi file under directory finka named Ilomantsi.imdi that lists 55 subdirectories organized according to original audio tapes from Ilomantsi (each tape corresponds to a separate LAT session)
  • imdi file under directory finka/Ilomantsi named Ilomantsi_01nA.imdi that contains links to m4a, TextGrid and wav files of a given session located under directory finka/Ilomantsi/Ilomantsi_01nA