feat: Convert the BMW encoding to JSON #16

Open
wants to merge 33 commits into
base: main

Conversation

cindyli (Contributor) commented Aug 3, 2023

Description

This pull request converts the BMW encoding to a JSON file to be used for future development.

Steps to test

Refer to the document "Convert BMW encoding to JSON" for the conversion steps.

Additional information

Due to copyright concerns, the original BMW encoding files are not included in this pull request.

that will serve as the foundation for implementing the BMW input method.

BMW encoding documents are in PDF format. These PDFs are composed by digitalized images of orginal
books. The coversio method is:

Typo: coversio -> conversion.
Also, "digitized" is a better rendering than "digitalized".

books. The coversio method is:

1. Split every single page in a PDF into .jpg files
2. Use OCR library to extract texts from .jpg files

Do we have a working OCR library for this? Could we provide more detailed instructions?
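
For reference, the two quoted steps could be sketched roughly as follows. This is only a sketch: it assumes the third-party `pdf2image` package (which needs Poppler installed) and `pytesseract` (which needs the Tesseract binary). The thread does not say which OCR library the PR actually uses, and the function names, file naming scheme, `dpi` value and `lang` default are hypothetical.

```python
from pathlib import Path


def page_image_path(pdf_path, out_dir, page_number):
    """Build the .jpg path for one page, e.g. out/book_p0001.jpg (hypothetical naming)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_p{page_number:04d}.jpg"


def pdf_to_text(pdf_path, out_dir, lang="eng", dpi=300):
    """Split each PDF page into its own .jpg file, then OCR it into a .txt file."""
    # Imported lazily so the naming helper above works without the OCR stack.
    from pdf2image import convert_from_path  # assumption: not named in the PR
    import pytesseract                       # assumption: not named in the PR

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    text_files = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    for number, image in enumerate(pages, start=1):
        jpg_path = page_image_path(pdf_path, out_dir, number)
        image.save(jpg_path, "JPEG")  # step 1: one .jpg per page
        text = pytesseract.image_to_string(image, lang=lang)  # step 2: extract text
        txt_path = jpg_path.with_suffix(".txt")
        txt_path.write_text(text, encoding="utf-8")
        text_files.append(txt_path)
    return text_files
```

The `lang` argument would need to match whichever Tesseract language data suits the scanned books, and the OCR output would likely still need manual proofreading.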

BMW encoding documents are in PDF format. These PDFs are composed by digitalized images of orginal
books. The coversio method is:

1. Split every single page in a PDF into .jpg files

Split each page in the PDF into its own .jpg file

utils/README.md Outdated

**File formats**

1. The content of any .txt file in the `source_txt_path` directory

Sample content of a .txt file in the ...
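
As a rough illustration of the .txt-to-JSON direction, the sketch below gathers every .txt file in `source_txt_path` into one JSON file. The record layout of those files is not shown in this thread, so the sketch keeps each file's non-empty lines verbatim instead of parsing them; `collect_txt_to_json` and the output layout are hypothetical and not taken from the PR.

```python
import json
from pathlib import Path


def collect_txt_to_json(source_txt_path, json_path):
    """Store the non-empty lines of every .txt file under its file name."""
    result = {}
    for txt_file in sorted(Path(source_txt_path).glob("*.txt")):
        lines = txt_file.read_text(encoding="utf-8").splitlines()
        result[txt_file.name] = [line.strip() for line in lines if line.strip()]
    # ensure_ascii=False keeps any non-ASCII characters readable in the output
    Path(json_path).write_text(
        json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return result
```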

cindyli (Contributor, Author) commented Aug 3, 2023

Thanks for the review, @amb26. All addressed and ready for another round.

CLAassistant commented Sep 19, 2023

CLA assistant check
All committers have signed the CLA.


4 participants