LAT Pipeline
Limitations encountered with LAT/Lamus when processing the eduskunta and finka corpora, along with some workarounds. These should be added to the LAT documentation pages at Kielipankki.
-
A file cannot exceed ~110 MB in size.
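A quick check before uploading can catch oversized files. The following is a minimal sketch and not part of the LAT tooling: `find_oversized` is a hypothetical helper, and its default limit of 110000000 bytes is an assumption, since only the approximate figure of ~110 MB is known.

```shell
# Hypothetical helper: list regular files under DIR ($1) that are larger
# than LIMIT bytes ($2). The default of 110000000 bytes is an assumption;
# the notes only give "~110 MB" as the limit.
find_oversized() {
    find "$1" -type f -size +"${2:-110000000}c"
}

# Example: find_oversized ./eduskunta
```

Passing the limit in bytes makes it easy to lower the threshold if the real limit turns out to be slightly below the assumed value.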
-
Converting mp4 files into wav only seems to make the files bigger. Converting stereo to mono doesn't make a noticeable difference in size.
-
ffmpeg version 2.2 works; newer versions do not necessarily work.
-
Extracting just the audio data makes the files a little smaller, but linking to them from eaf files doesn't work for some reason.
-
URN from nodeid: urn.fi/urn:nbn:fi:lb-10011001{NODEID} (the prefix 10011001 is the "date" 1.10.1001, followed by the NodeId)
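The pattern above can be sketched as a small shell function. `urn_from_nodeid` is a hypothetical name, and the function uses its argument verbatim: the notes do not say whether {NODEID} includes a prefix such as MPI, so that part has to be checked against a known URN.

```shell
# Hypothetical helper: prepend the fixed URN prefix to a nodeid.
# The nodeid is used verbatim; its exact form is not stated in the notes.
urn_from_nodeid() {
    printf 'urn.fi/urn:nbn:fi:lb-10011001%s\n' "$1"
}
```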
-
If you want to link to LAT from Korp, you need the nodeid of each session. If there are many sessions, collecting these manually is slow and error-prone. Solution: collect the nodeids by running
psql -U postgres corpusstructure -c "select nodeid,url,pid from archiveobjects where url like '%eduskunta%';"
on the LAT server (ask Martin to do this). The link is something like:
https://lat.csc.fi/ds/annex/runLoader?viewType=timeline&nodeid=MPI17182&time=10000&duration=5000
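Once the nodeids have been collected, the links can be generated rather than typed by hand. This is a minimal sketch under assumptions: `annex_link` and `annex_links` are hypothetical helpers, the nodeids are assumed to be in the same form as in the example link (such as MPI17182), and the time and duration values are passed through unchanged because their units are not stated here.

```shell
# Hypothetical helper: build an Annex timeline link from
# a nodeid ($1), a time ($2) and a duration ($3).
annex_link() {
    printf 'https://lat.csc.fi/ds/annex/runLoader?viewType=timeline&nodeid=%s&time=%s&duration=%s\n' "$1" "$2" "$3"
}

# Read nodeids, one per line, from standard input and print a link for each.
# $1 and $2 are the time and duration to use in every link.
annex_links() {
    while IFS= read -r nodeid; do
        [ -n "$nodeid" ] && annex_link "$nodeid" "$1" "$2"
    done
    return 0
}

# Example: annex_links 10000 5000 < nodeids.txt
```

A file such as nodeids.txt (a hypothetical name) could be produced by adding psql's `-A` and `-t` options to the query and selecting only the nodeid column, assuming the column values need no further cleanup.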
-
RELATIVE_MEDIA_URL in eaf files doesn't necessarily work. Solution: use the full path given with MEDIA_URL instead.
-
Use only ASCII characters in file and directory names.
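Offending names can be located before upload. The sketch below assumes a `find` that does byte-wise pattern matching in the C locale (as GNU find does); `find_non_ascii_names` is a hypothetical helper name.

```shell
# Hypothetical helper: list files and directories under DIR ($1) whose names
# contain any byte outside the printable ASCII range (space through tilde).
# Note that this also flags control characters, which is stricter than "non-ASCII".
find_non_ascii_names() {
    LC_ALL=C find "$1" -name '*[! -~]*'
}

# Example: find_non_ascii_names ./finka
```

Because `find` visits every directory as well as every file, each path component below the starting directory gets checked.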
-
No BOM is allowed in eaf files. If needed, remove it with
sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt
-
Lowering the sample rate of wav files with
ffmpeg -i INPUT -ar 22050 OUTPUT
produces a warning that can be ignored:
Using AVStream.codec.time_base as a timebase hint to the muxer is deprecated. Set AVStream.time_base instead.
-
Converting from wav to m4a with
ffmpeg -i INPUT -c:a libfdk_aac -vbr 1 OUTPUT
produces warnings that can be ignored:
[libfdk_aac @ 0x1f97380] Note, the VBR setting is unsupported and only works with some parameter combinations
[ipod @ 0x1f96420] Using AVStream.codec.time_base as a timebase hint to the muxer is deprecated. Set AVStream.time_base instead.
-
External media files in LAT URLs, for example
https://lat.csc.fi/ds/annex/runLoader?extAnno=foo.eaf&extMedia=foo.mp4
Does this work at all, given that the files are not streamable?
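Regarding the BOM item in the list above: to find out which eaf files actually need the sed treatment, the first three bytes of each file can be inspected. This is a minimal sketch; `has_bom` is a hypothetical helper, and it only tests for the UTF-8 BOM bytes EF BB BF that the sed command removes.

```shell
# Hypothetical helper: succeed if FILE ($1) starts with the UTF-8 BOM (EF BB BF).
has_bom() {
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \n')" = "efbbbf" ]
}

# Example: print the names of eaf files that still carry a BOM.
# for f in *.eaf; do has_bom "$f" && printf '%s\n' "$f"; done
```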