bashlib: implement new CLI options #929

Merged
merged 2 commits into from
Oct 25, 2022

Conversation

bertsky
Collaborator

@bertsky bertsky commented Oct 24, 2022

This keeps our help text (which now lists --dump-module-dir and --profile) consistent with the actual implementation. For the module directory I chose the directory that contains the tool json, because that is how we already handle resources in ocrd-page-transform, so far the only bashlib processor with resources.

@bertsky bertsky requested a review from kba October 24, 2022 10:36
@@ -127,6 +129,7 @@ ocrd__parse_argv () {
-l|--log-level) ocrd__argv[log_level]=$2 ; shift ;;
-h|--help|--usage) ocrd__usage; exit ;;
-J|--dump-json) ocrd__dumpjson; exit ;;
-D|--dump-module-dir) echo $(dirname $OCRD_TOOL_JSON); exit ;;
Member

👍 that's a sensible location because it will normally be the $SHAREDIR for installation.

Collaborator Author

Yes, and there's nothing we can rely on from bashlib except for the installation path of the tool json (by whatever means the tool got there)
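
As a minimal sketch of how a bashlib processor might use this location, assuming its resources are installed next to the tool json: the function names `module_dir` and `resolve_resource` are hypothetical and not part of bashlib. Unlike the one-liner in the diff, the sketch quotes `$OCRD_TOOL_JSON`, so paths containing spaces survive.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of bashlib.

# Print the directory that contains the tool json.
module_dir() {
    dirname -- "$OCRD_TOOL_JSON"
}

# Print the path of a resource file shipped alongside the tool json,
# or fail if no such file exists.
resolve_resource() {
    local candidate
    candidate="$(module_dir)/$1"
    [[ -e $candidate ]] || return 1
    printf '%s\n' "$candidate"
}
```

A processor could then call something like `resolve_resource model.dat` (file name made up for illustration) and fall back to another search path when the call fails.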

@@ -137,6 +140,8 @@ ocrd__parse_argv () {
-w|--working-dir) ocrd__argv[working_dir]=$(realpath "$2") ; shift ;;
-m|--mets) ocrd__argv[mets_file]=$(realpath "$2") ; shift ;;
--overwrite) ocrd__argv[overwrite]=true ;;
--profile) ocrd__argv[profile]=true ;;
Member

Since --profile is python-specific, I am unsure whether we want to mirror it in bashlib unless we have a good idea on how to actually implement it. If we decide that this should be consistent across all processors - so a clearly defined feature, not just a debugging aid - then we should probably amend the CLI spec for it. If we decide against that, it might be better to use a non-flag option like an environment variable and document it outside of the --help text.

Collaborator Author

I don't think we should elevate this to spec. But since we have it in core anyway, it's only natural to at least pass the information on in the arg parser for bashlib. Of course, there is not much more we can do from bashlib (except perhaps set -x, or even exec 3>${ocrd__argv[profile_file]}; export BASH_XTRACEFD=3; or maybe some timestamping in the DEBUG trap), because the implementation may not use much bash and may only delegate to other programs. Even then the flag would still be useful, for example to activate some native profiling mechanism in that program.
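
The redirection idea mentioned here can be sketched as follows. The helper names `start_trace` and `stop_trace` are hypothetical, and `BASH_XTRACEFD` needs bash 4.1 or later. This is only an illustration of the mechanism, not the code that ended up in bashlib.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the names are not part of bashlib.

# Send xtrace output to a file instead of stderr.
start_trace() {
    exec 3>"$1"          # open the trace file on descriptor 3
    BASH_XTRACEFD=3      # tell bash to write xtrace output there
    set -x
}

# Stop tracing; unsetting BASH_XTRACEFD also closes descriptor 3.
stop_trace() {
    set +x
    unset BASH_XTRACEFD
}
```

Keeping the trace on its own descriptor means the processor's regular stderr (log messages, errors from delegated programs) stays readable.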

Member

> exec 3>${ocrd__argv[profile_file]}; export BASH_XTRACEFD=3

That's an amazing snippet I will save for later :)

> I don't think we should elevate this to spec.

I agree, so for consistency's sake, shall we change the mechanism in the python implementation to use environment variables (OCRD_PROFILE, OCRD_PROFILE_FILE) and drop it from the --help output?
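
If the environment-variable route were taken, the bashlib side might reduce to a check like the one below. This is only a sketch: `OCRD_PROFILE` and `OCRD_PROFILE_FILE` are the names proposed above, while their accepted values and the helper `profile_target` are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide where profiling output should go, based on
# the proposed OCRD_PROFILE / OCRD_PROFILE_FILE variables.
# Prints the target ("stderr" or a file name); fails if profiling is off.
profile_target() {
    if [[ -n ${OCRD_PROFILE_FILE:-} ]]; then
        printf '%s\n' "$OCRD_PROFILE_FILE"
    elif [[ -n ${OCRD_PROFILE:-} ]]; then
        printf 'stderr\n'
    else
        return 1
    fi
}
```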

Collaborator Author

@bertsky bertsky Oct 24, 2022

Once we have agreed upon a general mechanism (also for error handling, caching, timeouts and for "universal parameters" like DPI or runtime input/output validation) – yes, then that should be used here, too.

At the moment this just restores consistency for the current state of affairs (i.e. the existing --profile option).

Member

I get that, but consistency with the current state of affairs is also the target of this PR, so I thought we could handle --profile vs. OCRD_PROFILE here before we start adding new flags for caching, processing server etc.

Collaborator Author

> Since we agree that --profile* should not be part of the CLI spec, we could get rid of those flags now, rather than introduce them in bashlib.

Oh, now I got it. But that's a breaking change! And I doubt we will arrive at a definitive conclusion for the general configuration/customization issue quickly. So why not just get this done first?

> If we don't want to decide now and revisit later, I would not add the flags to the bashlib, so that bashlib processors raise an error and users are aware that --profile and --profile-file are not actually implemented.

Ah, I see! Then let's just implement them quick and dirty (see above) as a first step.

Also, with a thing like bashlib, you cannot say that they are not implemented: bashlib processors can now start implementing them. (But not without the parser!)
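
To illustrate the point about the parser, here is a simplified stand-in, not the actual `ocrd__parse_argv`: the names `demo_parse_argv`, `demo_argv` and `demo_profiling_enabled` are hypothetical. Once the parser records the flags, processor code can branch on them however it likes.

```bash
#!/usr/bin/env bash
# Simplified stand-in for the bashlib parser; all names are hypothetical.
declare -A demo_argv

demo_parse_argv() {
    demo_argv=([profile]=false [profile_file]='')
    while (($#)); do
        case $1 in
            --profile) demo_argv[profile]=true ;;
            --profile-file) demo_argv[profile_file]=$2 ; shift ;;
            *) echo "unknown option: $1" >&2 ; return 1 ;;
        esac
        shift
    done
}

# Processor code can then branch on the parsed value,
# e.g. to switch on a native profiler in a delegated program.
demo_profiling_enabled() {
    [[ ${demo_argv[profile]} == true ]]
}
```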

Collaborator Author

In c7aa0a8 I implemented a variant that's superior to set -x: it also includes the timing of all commands.

Example with ocrd-olena-binarize…
+ 18:36:32 local _valopts=(--workspace "${ocrd__argv[working_dir]}" --mets-basename "$(basename ${ocrd__argv[mets_file]})")
++ 18:36:32 basename ${ocrd__argv[mets_file]}
+ 18:36:32 [[ ${ocrd__argv[overwrite]} = true ]]
+ 18:36:32 _valopts+=(--overwrite)
+ 18:36:32 [[ -n "${ocrd__argv[page_id]:-}" ]]
+ 18:36:32 _valopts+=(--page-id "${ocrd__argv[page_id]}")
+ 18:36:32 _valopts+=("${OCRD_TOOL_NAME#ocrd-} -I ${ocrd__argv[input_file_grp]} -O ${ocrd__argv[output_file_grp]} ${__parameters[*]@Q} ${__parameter_overrides[*]@Q}")
+ 18:36:32 ocrd validate tasks "${_valopts[@]}"
+ 18:36:36 local params_parsed retval
+ 18:36:36 params_parsed="$(ocrd ocrd-tool "$OCRD_TOOL_JSON" tool $OCRD_TOOL_NAME parse-params "${__parameters[@]}" "${__parameter_overrides[@]}")"
++ 18:36:36 ocrd ocrd-tool "$OCRD_TOOL_JSON" tool $OCRD_TOOL_NAME parse-params "${__parameters[@]}" "${__parameter_overrides[@]}"
+ 18:36:36 eval "$params_parsed"
+ 18:36:36 params["impl"]="singh"
+ 18:36:36 params["k"]="0.34"
+ 18:36:36 params["win-size"]="0"
+ 18:36:36 params["dpi"]="0"
+ 18:36:36 i=0
+ 18:36:36 declare -ag ocrd__files
+ 18:36:36 read line
++ 18:36:36 ocrd bashlib input-files -m "${ocrd__argv[mets_file]}" -I "${ocrd__argv[input_file_grp]}" -O "${ocrd__argv[output_file_grp]}" ${ocrd__argv[page_id]:+-g} ${ocrd__argv[page_id]:-}
+ 18:36:37 eval declare -Ag "ocrd__file$i=( $line )"
+ 18:36:37 declare -Ag ocrd__file0=([url]='MAX/filemax00005.jpg' [ID]='MAX_00005' [mimetype]='image/jpeg' [pageId]='phys00005' [outputFileId]='BINSINGH_00005')
+ 18:36:37 eval "ocrd__files[$i]=ocrd__file$i"
+ 18:36:37 ocrd__files[0]=ocrd__file0
+ 18:36:37 let ++i
+ 18:36:37 read line
+ 18:36:37 ocrd__minversion 2.29.0
+ 18:36:37 ocrd__minversion 2.29.0
+ 18:36:37 local minversion="$1"
+ 18:36:37 local version=$(ocrd --version|sed 's/ocrd, version //')
++ 18:36:37 ocrd --version
++ 18:36:37 sed 's/ocrd, version //'
+ 18:36:38 local IFS=.
+ 18:36:38 version=($version)
+ 18:36:38 minversion=($minversion)
+ 18:36:38 (( ${version[0]} > ${minversion[0]} ))
+ 18:36:38 (( ${version[0]} == ${minversion[0]} ))
+ 18:36:38 (( ${version[1]} > ${minversion[1]} ))
+ 18:36:38 return
+ 18:36:38 scribo_options=(--enable-negate-output)
+ 18:36:38 case ${params[impl]} in 
+ 18:36:38 scribo_options+=(--k $($PYTHON -c "print(${params[k]}*0.1765)"))
++ 18:36:38 $PYTHON -c "print(${params[k]}*0.1765)"
+ 18:36:38 scribo_options+=(--win-size ${params[win-size]})
+ 18:36:38 cd "${ocrd__argv[working_dir]}"
+ 18:36:38 out_file_grp=${ocrd__argv[output_file_grp]}
+ 18:36:38 ((n=0))
+ 18:36:38 ((n<1))
+ 18:36:38 local in_fpath="$(ocrd__input_file $n url)"
++ 18:36:38 ocrd__input_file $n url
++ 18:36:38 ocrd__input_file $n url
++ 18:36:38 eval echo "\${${ocrd__files[$1]}[$2]}"
++ 18:36:38 echo ${ocrd__file0[url]}
+ 18:36:38 local in_id="$(ocrd__input_file $n ID)"
++ 18:36:38 ocrd__input_file $n ID
++ 18:36:38 ocrd__input_file $n ID
++ 18:36:38 eval echo "\${${ocrd__files[$1]}[$2]}"
++ 18:36:38 echo ${ocrd__file0[ID]}
+ 18:36:38 local in_mimetype="$(ocrd__input_file $n mimetype)"
++ 18:36:38 ocrd__input_file $n mimetype
++ 18:36:38 ocrd__input_file $n mimetype
++ 18:36:38 eval echo "\${${ocrd__files[$1]}[$2]}"
++ 18:36:38 echo ${ocrd__file0[mimetype]}
+ 18:36:38 local in_pageId="$(ocrd__input_file $n pageId)"
++ 18:36:38 ocrd__input_file $n pageId
++ 18:36:38 ocrd__input_file $n pageId
++ 18:36:38 eval echo "\${${ocrd__files[$1]}[$2]}"
++ 18:36:38 echo ${ocrd__file0[pageId]}
+ 18:36:38 local out_id="$(ocrd__input_file $n outputFileId)"
++ 18:36:38 ocrd__input_file $n outputFileId
++ 18:36:38 ocrd__input_file $n outputFileId
++ 18:36:38 eval echo "\${${ocrd__files[$1]}[$2]}"
++ 18:36:38 echo ${ocrd__file0[outputFileId]}
+ 18:36:38 local out_fpath="$out_file_grp/${out_id}.xml"
+ 18:36:38 test -f "${in_fpath#file://}"
+ 18:36:38 mkdir -p $out_file_grp
+ 18:36:38 [ "x${in_mimetype}" = x${MIMETYPE_PAGE} ]
+ 18:36:38 [ "${in_mimetype}" != "${in_mimetype#image/}" ]
+ 18:36:38 ocrd log info "processing $in_mimetype input file $in_id ($in_pageId)"
+ 18:36:39 process_imagefile "$in_fpath" "$in_id" "$in_pageId" "$out_fpath" "$out_id" "$out_file_grp"
+ 18:36:39 process_imagefile "$in_fpath" "$in_id" "$in_pageId" "$out_fpath" "$out_id" "$out_file_grp"
+ 18:36:39 local in_fpath="$1" in_id="$2" in_pageId="$3" out_fpath="$4" out_id="$5" out_file_grp="$6"
+ 18:36:39 local image_out_fpath image_out_id
+ 18:36:39 image_in_fpath="${in_fpath#file://}"
+ 18:36:39 image_out_id="${out_id}-BIN_${params[impl]}"
+ 18:36:39 image_out_fpath="${out_file_grp}/${image_out_id}.png"
+ 18:36:39 local scribo_extra=$(auto_winsize "${image_in_fpath}" "${in_pageId}")
++ 18:36:39 auto_winsize "${image_in_fpath}" "${in_pageId}"
++ 18:36:39 auto_winsize "${image_in_fpath}" "${in_pageId}"
++ 18:36:39 local image_in_fpath="$1" in_pageId="$2"
++ 18:36:39 [ ${params[impl]} = 'otsu' ]
++ 18:36:39 ((${params[win-size]}))
++ 18:36:39 ((${params[dpi]}))
++ 18:36:39 read dpi units < <(identify -format "24.10.2022 43\n" "${image_in_fpath}[0]")
+++ 18:36:39 identify -format "24.10.2022 43\n" "${image_in_fpath}[0]"
++ 18:36:39 case "$units" in 
++ 18:36:39 dpi=$($PYTHON -c "print(int($dpi) + int(($dpi+1)%2))")
+++ 18:36:39 $PYTHON -c "print(int($dpi) + int(($dpi+1)%2))"
++ 18:36:39 ocrd log debug "Using DPI-derived window size $dpi for page ${in_pageId}" 1>&2
++ 18:36:39 echo --win-size $dpi
++ 18:36:39 return
+ 18:36:39 scribo-cli "${params[impl]}" "${image_in_fpath}" "${image_out_fpath}" "${scribo_options[@]}" ${scribo_extra}
+ 18:36:41 ocrd workspace add --force -G ${out_file_grp} -g "$in_pageId" -m image/png -i "$image_out_id" "$image_out_fpath"
+ 18:36:41 imageGeometry=($(identify -format "%[fx:w] %[fx:h]" "${in_fpath}[0]"))
++ 18:36:41 identify -format "%[fx:w] %[fx:h]" "${in_fpath}[0]"
...
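
A trace in this format can be produced with a DEBUG trap. The following is a minimal sketch under stated assumptions, not the code from c7aa0a8: the helper `__trace_step` and the `TRACE_FILE` variable are hypothetical, and the `printf '%(...)T'` format needs bash 4.2 or later.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a timestamping tracer based on the DEBUG trap.
trace_file=${TRACE_FILE:-$(mktemp)}
exec 3>"$trace_file"

__trace_step() {
    local plus='++++++++++'
    # One '+' per subshell nesting level, then the wall-clock time,
    # then the command that is about to run.
    printf '%s %(%H:%M:%S)T %s\n' "${plus:0:BASH_SUBSHELL+1}" -1 "$BASH_COMMAND" >&3
}

# Let functions, command substitutions and subshells inherit the trap.
set -o functrace
trap __trace_step DEBUG

: 'format %x %U'     # an example command containing '%' sequences

trap - DEBUG         # stop tracing
exec 3>&-            # close the trace file
```

Passing `$BASH_COMMAND` as a `printf` argument rather than as part of a format string keeps any `%` sequences in the traced command intact.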

Collaborator Author

So this is still just tracing (but again, the processor code can act on ocrd__argv[profile] and do something else).

Member

Yeah, I thought I could slip a backwards-incompatible change into your PR, but that's not the right approach. Let's align bashlib and core first and postpone the discussion about configuring these "extra" behaviors.

OK, I agree that c7aa0a8 is a sensible approximation of the functionality. Always impressed with your knowledge of bash internals 👍

Funnily enough, @MehmedGIT and I debugged a problem with his ocrd_olena installation with bash -x $(which ocrd-olena-binarize) earlier today, so there are at least two potential users of the functionality :)

Collaborator Author

Yes, it will also give us hints as to which parts of the bashlib processors need to be streamlined or parallelised. And we can just ask the user to provide these logs from their environment now.
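
One way to read such hints out of a log is to attribute the gap between two consecutive timestamps to the earlier command. The sketch below does that; `slowest_steps` is a hypothetical helper, it assumes the `+ HH:MM:SS command` line format shown above, and it ignores runs that cross midnight.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<seconds> <command>" for each traced command,
# slowest first. The time until the next trace line is attributed to the
# command on the current line, which is only a rough approximation.
slowest_steps() {
    awk '
        $2 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
            split($2, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (have_prev) print now - prev, cmd
            prev = now; have_prev = 1
            cmd = $0
            sub(/^[+]+ [0-9:]+ /, "", cmd)   # strip the "+ HH:MM:SS " prefix
        }
    ' "$1" | sort -rn
}
```

Applied to the excerpt above, the `ocrd validate tasks` call stands out: it is logged at 18:36:32 and the next line follows at 18:36:36.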

@kba kba merged commit 9431c76 into master Oct 25, 2022
@kba kba deleted the bashlib-add-moduledir branch October 25, 2022 14:09