OCR Caching Files & Command Batch Generations #12

Open
WengerK wants to merge 12 commits into 1.0.x from 10x/ocr-caching-files

Conversation

WengerK
Member

@WengerK WengerK commented Mar 27, 2024

💬 Description

Add a new performance layer by allowing developers to cache OCR'ed files, so that Tika does not have to process the same file again.

🔢 To Review Caching

  1. Use the new way to cache OCR files
// Call this at least once anywhere in the code (e.g. module.install) to prepare the storage.
\Drupal::service('entity_to_text_tika.storage.local_file')->prepareStorage();

// Load the already OCR'ed file if possible to avoid unnecessary calls to Tika.
$body = \Drupal::service('entity_to_text_tika.storage.local_file')->load($file, 'eng+fra');

if (!$body) {
  // When the OCR'ed file is not available, then run Tika over it and store it for the next run.
  $body = \Drupal::service('entity_to_text_tika.extractor.file_to_text')->fromFileToText($file, 'eng+fra');
  // Save the OCR'ed file for the next run.
  \Drupal::service('entity_to_text_tika.storage.local_file')->save($file, $body, 'eng+fra');
}
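
The load, extract, save sequence above can be wrapped in a single helper. The following is a minimal sketch, not the module's code: `ocr_text_cached()` and the three callables are hypothetical names, and the in-memory closures only stand in for the `load()`, `fromFileToText()` and `save()` service calls so the example runs without Drupal or Tika.

```php
<?php

declare(strict_types=1);

/**
 * Returns cached OCR text when available, otherwise extracts and caches it.
 *
 * Hypothetical helper, not part of the module: the three callables stand in
 * for the storage load(), extractor fromFileToText() and storage save() calls.
 */
function ocr_text_cached(callable $load, callable $extract, callable $save, $file, string $langcode): string {
  // Reuse the stored OCR output when there is one.
  $body = $load($file, $langcode);
  if ($body) {
    return $body;
  }
  // Cache miss: run the expensive extraction once and store the result.
  $body = (string) $extract($file, $langcode);
  $save($file, $body, $langcode);
  return $body;
}

// In-memory stand-ins (assumptions) so the sketch runs on its own.
$cache = [];
$calls = 0;
$load = function ($file, string $langcode) use (&$cache) {
  return $cache["$file:$langcode"] ?? NULL;
};
$extract = function ($file, string $langcode) use (&$calls): string {
  // Counts how often the "OCR" actually runs.
  $calls++;
  return "text of $file";
};
$save = function ($file, string $body, string $langcode) use (&$cache): void {
  $cache["$file:$langcode"] = $body;
};

$first = ocr_text_cached($load, $extract, $save, 'report.pdf', 'eng+fra');
$second = ocr_text_cached($load, $extract, $save, 'report.pdf', 'eng+fra');
```

In a Drupal context the callables would presumably wrap the `entity_to_text_tika.storage.local_file` and `entity_to_text_tika.extractor.file_to_text` services shown above. Like the original snippet, the sketch treats any empty result as a cache miss, so a file without extractable text would be processed again on every call.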

🔢 To Review Command

The module exposes a Drush command to generate OCR for all Drupal files.

This command is intended to be used sporadically, as it can be resource-intensive.
Its purpose is to generate OCR for all files that have not been OCR'ed yet.
This may be useful after an initial install, after a new OCR language has been added, or right after a file migration.

# Warm up all files that do not already have an associated .ocr file.
drush e2t:t:w
# Warm up all files, even those that have already been processed.
drush e2t:t:w --force
# Warm up the file with FID 2.
drush e2t:t:w --fid=2
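
How the options above narrow the set of files to process can be illustrated with a small selection function. This is only a sketch under stated assumptions: `warmup_candidates()` is a hypothetical name, and the way `--force` and `--fid` combine here (a single already processed file is skipped unless forced) is a guess, not the command's confirmed behavior.

```php
<?php

declare(strict_types=1);

/**
 * Picks the file IDs a warm-up run would process.
 *
 * Hypothetical helper: the function name, its parameters and the way the
 * options combine are assumptions, not the module's implementation.
 *
 * @param int[] $all_fids All known file IDs.
 * @param int[] $cached_fids File IDs that already have stored OCR output.
 * @param bool $force Mirrors --force: reprocess already cached files too.
 * @param int|null $fid Mirrors --fid: restrict the run to a single file.
 *
 * @return int[] File IDs to send through OCR.
 */
function warmup_candidates(array $all_fids, array $cached_fids, bool $force = FALSE, ?int $fid = NULL): array {
  // Restrict to one file when a file ID is given.
  $fids = $fid === NULL ? $all_fids : array_values(array_intersect($all_fids, [$fid]));
  if ($force) {
    return array_values($fids);
  }
  // Default: skip files that were already processed.
  return array_values(array_diff($fids, $cached_fids));
}

$default = warmup_candidates([1, 2, 3], [2]);
$forced = warmup_candidates([1, 2, 3], [2], TRUE);
$single = warmup_candidates([1, 2, 3], [2], TRUE, 2);
```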
  • Update the "Unreleased" section of the CHANGELOG.md with chan

@WengerK WengerK force-pushed the 10x/ocr-caching-files branch 9 times, most recently from b6e41c2 to 0a7f239 Compare April 2, 2024 12:17
@WengerK WengerK marked this pull request as ready for review April 2, 2024 12:21
@WengerK WengerK requested a review from gido April 8, 2024 06:44
@WengerK WengerK force-pushed the 10x/ocr-caching-files branch 3 times, most recently from 640f179 to 2185512 Compare April 13, 2024 09:37
@WengerK WengerK force-pushed the 10x/ocr-caching-files branch 5 times, most recently from 42ff0fe to a6a0410 Compare April 25, 2024 14:37
@WengerK WengerK force-pushed the 10x/ocr-caching-files branch 3 times, most recently from bbec8df to cf10716 Compare April 25, 2024 15:27
@WengerK WengerK force-pushed the 10x/ocr-caching-files branch 6 times, most recently from 3405588 to 1b0d4cd Compare April 26, 2024 07:40
2 participants