uFOIL: An Unsupervised Fusion of Image and Language Understanding

This repository contains the implementation code and pipeline for uFOIL, a novel ensemble-based unsupervised learning framework for automating information extraction from exam scripts.

Methodology

The pipeline has three stages. Preprocessing applies dynamic contrast adjustment, rotation correction, BM3D denoising, and GAN-based augmentation. A multi-step segmentation process then isolates the different sections of each exam script, and an ensemble of OCR models extracts the text data.

Dataset

The dataset consists of 412 exam script images collected from United International University, Bangladesh. The scripts include both handwritten and printed student details, as well as question-wise marks presented in a tabular format.

Preprocessing

The detailed implementation of our preprocessing pipeline includes the following steps:

BM3D Denoising: Reduces noise from exam scripts while preserving important features 💻.
Dynamic CLAHE: Improves contrast dynamically based on local regions of the image 💻.
GAN-Based Augmentation: Generates synthetic data to expand the dataset for more robust model training 💻.
Rotation Correction: Corrects for any skew in the scanned exam scripts based on detected text regions 💻.
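
As an illustration of the rotation-correction step, the sketch below estimates a page's skew from the top edges of detected text regions. It is not the repository's implementation: the function name `estimate_skew_degrees`, the box format (four corner points, top-left first, in image coordinates with y pointing down), and the use of the median are all assumptions made for this example.

```python
import math
from statistics import median


def estimate_skew_degrees(text_boxes):
    """Estimate page skew from detected text regions.

    text_boxes: iterable of quadrilaterals given as four (x, y) points in the
    order top-left, top-right, bottom-right, bottom-left (assumed format).
    Returns the median tilt of the top edges in degrees. With y pointing
    down, a positive value means the text descends towards the right.
    """
    angles = []
    for top_left, top_right, _bottom_right, _bottom_left in text_boxes:
        dx = top_right[0] - top_left[0]
        dy = top_right[1] - top_left[1]
        if dx == 0 and dy == 0:
            continue  # degenerate box, carries no direction information
        angles.append(math.degrees(math.atan2(dy, dx)))
    # The median is less sensitive to a few badly detected regions than the mean.
    return median(angles) if angles else 0.0


# Hypothetical detections: two level regions and one tilted outlier.
boxes = [
    [(0, 0), (100, 0), (100, 20), (0, 20)],
    [(0, 40), (100, 40), (100, 60), (0, 60)],
    [(0, 80), (100, 90), (100, 110), (0, 100)],
]
skew = estimate_skew_degrees(boxes)
```

The estimated angle would then be passed to an image rotation routine to undo the tilt. Rotation sign conventions differ between libraries, so the direction should be checked against the routine actually used.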

Segmentation

Our segmentation pipeline consists of several steps:

Label Detection: Detects labels like "Name", "ID", and "Course Code" using OCR 💻.
Section Separation: Divides the exam script into upper and lower sections 💻.
Table Segmentation: Segments the table containing question-wise marks from the lower section of the script 💻.
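
The first two steps above can be sketched as follows, assuming an OCR pass has already produced words with bounding boxes. This is a hypothetical illustration rather than the code in this repository: `find_labels`, `split_row`, the `(x, y, w, h)` box format, the similarity threshold, and the pixel margin are all assumed for the example. Fuzzy matching is used because OCR output for a printed label is often slightly wrong.

```python
from difflib import SequenceMatcher

# Labels named in this README; matching is case-insensitive.
LABELS = ("name", "id", "course code")


def find_labels(ocr_words, threshold=0.7):
    """Map each known label to the box of its best-matching OCR word.

    ocr_words: list of (text, (x, y, w, h)) pairs (assumed format).
    threshold: minimum similarity ratio to accept a match (assumed value).
    """
    best = {}
    for text, box in ocr_words:
        cleaned = text.strip().lower().rstrip(":")
        for label in LABELS:
            score = SequenceMatcher(None, cleaned, label).ratio()
            if score >= threshold and score > best.get(label, (0.0, None))[0]:
                best[label] = (score, box)
    return {label: box for label, (_score, box) in best.items()}


def split_row(label_boxes, page_height, margin=20):
    """Return the y coordinate separating the upper and lower sections.

    Assumption: the upper section ends a fixed margin below the lowest
    detected label. Real scripts may need a layout-specific rule.
    """
    if not label_boxes:
        raise ValueError("no labels detected; cannot locate the split row")
    lowest = max(y + h for (_x, y, _w, h) in label_boxes.values())
    return min(page_height, lowest + margin)


# Hypothetical OCR output, including a misread "Name" label.
words = [
    ("Nane:", (50, 40, 80, 20)),
    ("ID", (50, 80, 30, 20)),
    ("Course Code", (50, 120, 140, 20)),
    ("Q1", (60, 400, 30, 20)),
]
labels = find_labels(words)
boundary = split_row(labels, page_height=1000)
```

Rows above `boundary` would form the upper section with the student details, and rows below it the lower section that contains the marks table.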

Text Processing and OCR

For text extraction and validation, we use an ensemble of OCR models, including Tesseract, EasyOCR, CRAFT, and TrOCR.

OCR Models: Integration code for the individual OCR models 💻.
Majority Voting: After OCR, majority voting is applied to select the most accurate text output 💻.
Post-Processing: Cleans and formats the extracted text for easier validation 💻.
Validation: The extracted data is validated against expected formats, such as student IDs and marks 💻.
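
A minimal sketch of the voting, post-processing, and validation steps is shown below. It assumes each OCR model returns one string per field. The function names, the tie-breaking rule, the 9-digit ID pattern, and the maximum mark of 10 are hypothetical choices made for this example and are not taken from the repository.

```python
import re
from collections import Counter


def normalize(text):
    """Post-processing: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def majority_vote(candidates):
    """Return the reading produced by the most models.

    candidates: one string (or None) per OCR model. Ties are broken in
    favour of the earliest model in the list (assumed rule).
    """
    cleaned = [normalize(c) for c in candidates if c and normalize(c)]
    if not cleaned:
        return None
    counts = Counter(cleaned)
    top = max(counts.values())
    return next(c for c in cleaned if counts[c] == top)


# Assumed format for this sketch only: a student ID is exactly 9 digits.
ID_PATTERN = re.compile(r"\d{9}")


def is_valid_id(text):
    return bool(ID_PATTERN.fullmatch(text))


def is_valid_mark(text, max_mark=10.0):
    """Accept numeric marks between 0 and max_mark (assumed limit)."""
    try:
        value = float(text)
    except ValueError:
        return False
    return 0 <= value <= max_mark


# Hypothetical readings of one ID field from four models.
readings = ["011201234", "011201234 ", "0112O1234", "011201234"]
student_id = majority_vote(readings)
id_ok = is_valid_id(student_id)
```

Normalizing before counting keeps readings that differ only in whitespace from splitting the vote. A field that fails validation could be flagged for manual review instead of being accepted.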

Requirements

To run the project, the following key Python packages and tools are required:

OpenCV
PyTorch
scikit-image
Tesseract-OCR
EasyOCR
NumPy
Matplotlib
Pandas

Citation

Will be added soon

Acknowledgments

We would like to thank the Director of the Masters in Computer Science and Engineering (MSCSE) program for providing access to the exam scripts dataset.
