From a85b63d810268cdde57cbe9c3eac0c9a01ad523c Mon Sep 17 00:00:00 2001
From: Michael Spalti
Date: Mon, 11 Apr 2022 16:03:22 -0700
Subject: [PATCH 1/6] Update README.md

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 3da9cde..858d07a 100644
--- a/README.md
+++ b/README.md
@@ -9,10 +9,10 @@ retrieved from DSpace using the DSpace IIIF integration.
 
 #### Supports
 * GET, POST, and DELETE methods
-* Addition of `MiniOcr`, `hOCR` or `ALTO` to the index with "full" or "lazy" indexing (with optional XML-encoding of Unicode characters), via POST.
+* Adding `MiniOcr`, `hOCR` or `ALTO` files to the Solr index with "full" or "lazy" indexing (and optional XML-encoding of Unicode characters).
 * Conversion of `hOCR` and `ALTO` files to `MiniOcr`.
-* Checks for whether OCR files have been indexed, via GET.
-* Removal of OCR files from the index and the file system if "lazy" indexing was used, via DELETE.
+* Checks for whether OCR files for a DSpace Item have already been indexed.
+* Removal of OCR files from the index, and from the file system if "lazy" indexing was used.
 
 #### Configuration Options
 * **http_port**: listen port of service

From bb858e810a559a83f8943f63c74229de6a2271b7 Mon Sep 17 00:00:00 2001
From: Michael Spalti
Date: Mon, 11 Apr 2022 16:11:10 -0700
Subject: [PATCH 2/6] Update README.md

---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index 858d07a..ad9431e 100644
--- a/README.md
+++ b/README.md
@@ -26,6 +26,11 @@ retrieved from DSpace using the DSpace IIIF integration.
 * **xml_file_location**: Path to OCR files (when "lazy" indexing used)
 * **log_dir**: Path to the log directory
 
+#### Requirements
+* Go 1.16.15+
+* DSpace 7+
+* Solr OCR Highlighting Plugin v0.7.2+
+
 #### Overview
 
 The service works in conjunction with DSpace 7.x IIIF support.

From db9ebf558495bcc76175b75d5f44e4f90c70de3f Mon Sep 17 00:00:00 2001
From: Michael Spalti
Date: Mon, 11 Apr 2022 16:14:32 -0700
Subject: [PATCH 3/6] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index ad9431e..4629d79 100644
--- a/README.md
+++ b/README.md
@@ -27,7 +27,7 @@ retrieved from DSpace using the DSpace IIIF integration.
 * **log_dir**: Path to the log directory
 
 #### Requirements
-* Go 1.16.15+
+* Go 1.16.15+ (if you are building your own binary and not using a distributed version)
 * DSpace 7+
 * Solr OCR Highlighting Plugin v0.7.2+
 

From d5399912282fc066cf13fddd3940faecd1a4b136 Mon Sep 17 00:00:00 2001
From: Michael Spalti
Date: Mon, 11 Apr 2022 17:11:32 -0700
Subject: [PATCH 4/6] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4629d79..5bb9dcf 100644
--- a/README.md
+++ b/README.md
@@ -77,7 +77,7 @@ Pull from Docker Hub:
 
 Example of running the container with volumes (Linux).
 
-`docker run -d -user --network host -v /host/path/to/configs:/processor/configs -v /host/path/to/logs:/var/log/ocr_processor -v /path/escaped/alto/files:/var/ocr_files mspalti/ocrprocessor`
+` docker run -d -u --network host --name ocr_processor -v /host/path/to/config:/processor -v /host/path/to/log:/var/log/ocr_processor -v /path/to/ocr_files:/var/ocr_files mspalti/ocr_processor`
 
 Note that you don't need to create a volume for the `/var/ocr_files` mount point if you aren't using "lazy"
 indexing.

From f0c9cf83051fc74d1f2b9014a51cdce41d983cea Mon Sep 17 00:00:00 2001
From: Michael Spalti
Date: Tue, 12 Apr 2022 09:56:27 -0700
Subject: [PATCH 5/6] Update README.md

---
 README.md | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 5bb9dcf..3f87414 100644
--- a/README.md
+++ b/README.md
@@ -31,7 +31,7 @@ retrieved from DSpace using the DSpace IIIF integration.
 * DSpace 7+
 * Solr OCR Highlighting Plugin v0.7.2+
 
-#### Overview
+## Overview
 
 The service works in conjunction with DSpace 7.x IIIF support.
 When indexing a new item, the service retrieves an IIIF `AnnotationList` of OCR files from the
@@ -57,19 +57,24 @@ See DSpace IIIF documentation: https://wiki.lyrasis.org/display/DSDOC7x/IIIF+Con
 
 ## Installation
 
-#### Binary Files:
+#### Solr Core
 
-DSpace 7.x should eventually include OS-specific directories with starter configuration files and a Solr core that's pre-configured for the `solr-ocrhighlighting` plugin.
+Add the word_highlighting plugin to your Solr cores. DSpace 7.x may eventually include a starter core for you to use. In the
+meantime, see the `solr-ocrhighlighting` documentation for more details.
 
-In the meantime, you can build from source.
+#### Binary Executables files and Sample Configuration:
 
-`go build -o /output/directory main.go `
+Archive files for various platforms are provided in the [Release List](https://github.com/mspalti/solr_ocr_processor/releases).
+
+You can also build from source.
+
+`go build -o /output/directory/ main.go`
 
 For a specific platform:
 
-`env GOOS= GOARCH= go build -o /output/directory main.go `
+`env GOOS= GOARCH= go build -o /output/directory/ main.go`
 
-#### Docker
+#### Using Docker
 
 Pull from Docker Hub:
 
@@ -90,14 +95,18 @@ indexing.
 
 ## Usage
 
-POST, DELETE, or GET requests use the identifier of a DSpace Item as follows:
+POST, DELETE, or GET requests use the identifier of a DSpace `Item` as follows:
 
 `http://:3000/item/413065ef-e242-4d0e-867d-8e2f6486be56`
 
+* GET returns 200 if the DSpace `Item` is in the Solr index and 404 if it has not yet been added.
+* DELETE removes all Solr index entries for the DSpace `Item` and OCR files from disk for "lazy" indexing.
+* POST adds all OCR files for the DSpace `Item` to the index.
+
 ### DSpace command line tool (under development)
 
-A DSpace CLI tool is currently being considered. That tool uses this service to add or delete OCR from the
-Solr index. The tool allows batch updates at the Community or Collection levels, as well as individual Item
+A DSpace CLI tool is being considered. That tool uses this service to add or delete OCR from the
+Solr index. The CLI tool allows batch updates at the Community or Collection levels, as well as individual Item
 updates.
 
 Usage:

From 809abccf98e3fe78838b6dd1f644a47a1b39e326 Mon Sep 17 00:00:00 2001
From: Michael Spalti
Date: Tue, 12 Apr 2022 10:09:21 -0700
Subject: [PATCH 6/6] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 3f87414..c5c969e 100644
--- a/README.md
+++ b/README.md
@@ -49,7 +49,7 @@ You must add the solr-ocrhighlighting plugin to Solr. See the instructions: http
 
 You need an IIIF-enabled DSpace instance. Your DSpace `Items` must be individually enabled for IIIF
 and search via the metadata fields `dspace-iiif-enabled` and `iiif-search-enabled`. The Item's OCR files must be
-in the DSpace Item's `OtherContent` Bundle. If your processing order is determined by structural metadata, be sure
+in the DSpace Item's `OtherContent` Bundle. If your processing order is determined by METS metadata, be sure
 to name your structural metadata file `mets.xml`. If this file does not exist or has not been correctly named,
 processing order is determined by the order of OCR files in the `OtherContent` Bundle.
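
Review note on the Usage bullets added in patch 5/6: the GET, POST, and DELETE behaviour they describe can be sketched as a small Go client. This is only an illustration under stated assumptions, not code from the repository. The helper names (`itemURL`, `isIndexed`, `send`) are hypothetical, the README's URL omits the host so the caller supplies the base URL, POST and DELETE are assumed to take no request body, and an in-process stub stands in for the real service so the sketch runs without DSpace or Solr.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// itemURL builds the request URL for a DSpace Item identifier.
// The "/item/<id>" path follows the README example; baseURL is supplied by
// the caller because the README does not show the host.
func itemURL(baseURL, itemID string) string {
	return strings.TrimRight(baseURL, "/") + "/item/" + itemID
}

// isIndexed issues a GET and maps the documented status codes:
// 200 means the Item is in the Solr index, 404 means it is not.
func isIndexed(client *http.Client, baseURL, itemID string) (bool, error) {
	resp, err := client.Get(itemURL(baseURL, itemID))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusOK:
		return true, nil
	case http.StatusNotFound:
		return false, nil
	default:
		return false, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
}

// send issues a POST or DELETE for the Item and returns the status code.
// Assumption: no request body is required; the README does not say which
// status codes these methods return, so the caller decides how to treat them.
func send(client *http.Client, method, baseURL, itemID string) (int, error) {
	req, err := http.NewRequest(method, itemURL(baseURL, itemID), nil)
	if err != nil {
		return 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	// Hypothetical stub that mimics only the documented GET/POST/DELETE
	// semantics; it is not the real OCR processor.
	var mu sync.Mutex
	indexed := map[string]bool{}
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := strings.TrimPrefix(r.URL.Path, "/item/")
		mu.Lock()
		defer mu.Unlock()
		switch r.Method {
		case http.MethodGet:
			if !indexed[id] {
				http.NotFound(w, r)
			}
		case http.MethodPost:
			indexed[id] = true
		case http.MethodDelete:
			delete(indexed, id)
		}
	}))
	defer stub.Close()

	const id = "413065ef-e242-4d0e-867d-8e2f6486be56" // identifier from the README example
	client := stub.Client()

	for _, method := range []string{"", http.MethodPost, http.MethodDelete} {
		if method != "" {
			if _, err := send(client, method, stub.URL, id); err != nil {
				fmt.Println("request failed:", err)
				return
			}
		}
		ok, err := isIndexed(client, stub.URL, id)
		fmt.Printf("after %q: indexed=%v err=%v\n", method, ok, err)
	}
}
```

Against a real deployment, replace `stub.URL` with the service address and the configured `http_port`. Treating every status other than 200 and 404 as an error in `isIndexed` is a design choice of this sketch, not documented behaviour.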