LAT Pipeline

eaxelson edited this page Dec 19, 2018 · 1 revision

Limitations and workarounds

Limitations encountered with LAT/Lamus when processing the eduskunta and finka corpora, together with workarounds where one is known. These should be added to the LAT documentation pages at Kielipankki.

  • A single file cannot exceed ~110 MB. add here

  • Converting mp4 files to wav only seems to make the files bigger, and converting stereo to mono makes no noticeable difference in size. add here

  • ffmpeg version 2.2 works; newer versions do not necessarily work. add here

  • Extracting just the audio data makes the files a little smaller, but linking to the extracted files from eaf files doesn't work for some reason. add here

  • The URN is built from the nodeid: urn.fi/urn:nbn:fi:lb-10011001{NODEID} (i.e. the "date" 1.10.1001 followed by the nodeid). add here

  • To link to LAT from Korp, you need the nodeid of each session. With many sessions, looking these up manually is slow and error-prone. Solution: collect the nodeids by running psql -U postgres corpusstructure -c "select nodeid,url,pid from archiveobjects where url like '%eduskunta%';" on the LAT server (ask Martin to do this). The link then looks something like: https://lat.csc.fi/ds/annex/runLoader?viewType=timeline&nodeid=MPI17182&time=10000&duration=5000 add here

  • RELATIVE_MEDIA_URL in eaf files does not always work. Solution: give the full path in MEDIA_URL instead. add here

  • Use only ASCII characters in file and directory names. add maybe here?

  • eaf files must not contain a BOM. If needed, remove it with sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt. add here

  • When lowering the sample rate of wav files with ffmpeg -i INPUT -ar 22050 OUTPUT, ffmpeg prints a warning that can be ignored: Using AVStream.codec.time_base as a timebase hint to the muxer is deprecated. Set AVStream.time_base instead. add here

  • When converting from wav to m4a with ffmpeg -i INPUT -c:a libfdk_aac -vbr 1 OUTPUT, ffmpeg prints two warnings that can be ignored: [libfdk_aac @ 0x1f97380] Note, the VBR setting is unsupported and only works with some parameter combinations and [ipod @ 0x1f96420] Using AVStream.codec.time_base as a timebase hint to the muxer is deprecated. Set AVStream.time_base instead. add here

  • External media files in LAT URLs, e.g. https://lat.csc.fi/ds/annex/runLoader?extAnno=foo.eaf&extMedia=foo.mp4. Does this even work, given that the files are not streamable?
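The Korp linking workaround above could be scripted roughly as follows. This is only a sketch: it assumes the psql output is in psql's default aligned format (header row, dashed separator, cells separated by "|"), and that time and duration are milliseconds, which this page does not actually state. The function names are made up for illustration.

```python
from urllib.parse import urlencode

# Base URL taken from the example link in the list above.
LAT_ANNEX_URL = "https://lat.csc.fi/ds/annex/runLoader"


def parse_psql_rows(text):
    """Parse default (aligned) psql output into tuples of cell strings.

    Assumption: the query is the "select nodeid,url,pid ..." one from
    the list above, so the header row starts with "nodeid".
    """
    rows = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # dashed separator, "(N rows)" footer, blank lines
        cells = [cell.strip() for cell in line.split("|")]
        if cells[0] == "nodeid":
            continue  # header row
        rows.append(tuple(cells))
    return rows


def lat_link(nodeid, time_ms, duration_ms, view_type="timeline"):
    """Build a LAT annex link for one session.

    Assumption: time and duration are in milliseconds.
    """
    params = {
        "viewType": view_type,
        "nodeid": nodeid,
        "time": time_ms,
        "duration": duration_ms,
    }
    return LAT_ANNEX_URL + "?" + urlencode(params)
```

Calling lat_link("MPI17182", 10000, 5000) reproduces the example link given in the list; looping over parse_psql_rows(...) gives one link per session.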
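The file-level limitations above (size limit, ASCII-only names, no BOM in eaf files) could be checked before uploading with something like the sketch below. It reads "~110 MB" as 110 * 10^6 bytes, which is an assumption since the exact limit is not known, and the helper names are hypothetical. strip_bom does the same job as the sed command in the list.

```python
import codecs
from pathlib import Path

# Assumption: "~110 MB" is read as 110 * 10^6 bytes; the real limit is approximate.
MAX_BYTES = 110 * 1000 * 1000


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def check_file(path: Path) -> list:
    """Return a list of problems found for one file (empty if none)."""
    problems = []
    if not str(path).isascii():
        problems.append("non-ASCII characters in path")
    if path.stat().st_size > MAX_BYTES:
        problems.append("larger than ~110 MB")
    if path.suffix == ".eaf":
        with path.open("rb") as f:
            if f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8:
                problems.append("starts with a UTF-8 BOM")
    return problems
```

A BOM reported by check_file can be removed in place with path.write_bytes(strip_bom(path.read_bytes())).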
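For batch conversion, the two ffmpeg invocations from the list can be wrapped as below. This is a minimal sketch: the flags are exactly the ones quoted above, the function names are invented, and it assumes an ffmpeg binary on PATH that was built with libfdk_aac.

```python
import subprocess


def downsample_cmd(src, dst, rate=22050):
    """Command for: ffmpeg -i INPUT -ar 22050 OUTPUT"""
    return ["ffmpeg", "-i", src, "-ar", str(rate), dst]


def wav_to_m4a_cmd(src, dst, vbr=1):
    """Command for: ffmpeg -i INPUT -c:a libfdk_aac -vbr 1 OUTPUT"""
    return ["ffmpeg", "-i", src, "-c:a", "libfdk_aac", "-vbr", str(vbr), dst]


def run(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)
```

The deprecation and VBR warnings mentioned in the list go to ffmpeg's stderr and do not change the exit status, so run() should not fail because of them.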
