Extract glyphs from PDF to get correct Unicode using a VLM

mriya98/glyph-to-character

glyph-to-character

The aim is to build a parser for PDFs that have non-standard Unicode mappings, a common problem in PDFs containing text in Indian languages. If the glyphs of the unique characters can be extracted, they can be used to recover the correct Unicode by querying a VLM with an appropriate prompt. Gemini performs well on this.
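The remapping idea can be sketched in a few lines. This is a minimal illustration and not the notebook's actual code: `build_glyph_map`, `remap_text`, and the `recognize` callback are hypothetical names, and the stub recognizer below stands in for a real VLM call (for example, a Gemini request that sends the glyph image with a prompt asking for the character it depicts).

```python
from typing import Callable, Dict


def build_glyph_map(
    glyph_images: Dict[str, bytes],
    recognize: Callable[[bytes], str],
) -> Dict[str, str]:
    """Query the recognizer once per unique glyph.

    glyph_images maps the (wrong) character the PDF reports to the
    rendered image of its glyph. The result maps each wrong character
    to the text the recognizer read from the image.
    """
    return {wrong: recognize(image) for wrong, image in glyph_images.items()}


def remap_text(text: str, glyph_map: Dict[str, str]) -> str:
    """Replace every mapped character; leave unmapped characters unchanged."""
    return "".join(glyph_map.get(ch, ch) for ch in text)


# Stub recognizer (assumption): a lookup table in place of a VLM request.
_fake_vlm_answers = {b"img-ka": "\u0915", b"img-kha": "\u0916"}  # Devanagari ka, kha


def stub_recognize(image: bytes) -> str:
    return _fake_vlm_answers[image]


# Hypothetical extraction result: the PDF reports "A" and "B" for two glyphs.
glyph_images = {"A": b"img-ka", "B": b"img-kha"}
glyph_map = build_glyph_map(glyph_images, stub_recognize)
corrected = remap_text("AB A", glyph_map)
```

Querying once per unique glyph rather than once per character occurrence keeps the number of VLM requests proportional to the size of the font's glyph set, not the length of the document.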

Requirements

The notebook is ready to run. A personal Gemini API key is required; set it in config.py.
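As a rough guide, a config.py along these lines would hold the key. The variable name `GEMINI_API_KEY`, the placeholder string, the environment-variable fallback, and the `has_real_key` helper are all assumptions for illustration; check the repository's config.py for the name the notebook actually imports.

```python
import os

# Placeholder shipped in the file (assumption); replace it with a personal key.
PLACEHOLDER = "YOUR_API_KEY_HERE"

# Hypothetical variable name. Reading from the environment first is an
# optional convenience so the key need not be written into the file.
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", PLACEHOLDER)


def has_real_key(key: str = GEMINI_API_KEY) -> bool:
    """Return True when the key is non-empty and is not the placeholder."""
    return bool(key.strip()) and key != PLACEHOLDER
```

Calling `has_real_key()` before running the notebook gives an early, explicit failure instead of an authentication error partway through.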
