This repository contains the implementation code and pipeline for uFOIL, a novel ensemble-based unsupervised learning framework for automating information extraction from exam scripts.
Our methodology combines dynamic contrast adjustment, rotation correction, BM3D denoising, and GAN-based augmentation. A multi-step segmentation process then isolates the different sections of each exam script, and an ensemble of OCR models extracts the text data.
The dataset used in this project consists of 412 exam script images collected from United International University, Bangladesh. These scripts include both handwritten and printed student details, as well as question-wise marks presented in a tabular format.
Our preprocessing pipeline includes the following steps:
- BM3D Denoising: Reduces noise in exam scripts while preserving important features.
- Dynamic CLAHE: Improves contrast dynamically based on local regions of the image.
- GAN-Based Augmentation: Generates synthetic data to expand the dataset for more robust model training.
- Rotation Correction: Corrects any skew in the scanned exam scripts based on detected text regions.
Our segmentation pipeline consists of several steps:
- Label Detection: Detects labels like "Name", "ID", and "Course Code" using OCR.
- Section Separation: Divides the exam script into upper and lower sections.
- Table Segmentation: Segments the table containing question-wise marks from the lower section of the script.
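One way label detection and section separation could fit together is sketched below. This is an assumption-laden example, not the project's implementation: it assumes OCR output is already available as phrase-level boxes, and it places the split a small margin below the lowest detected label. The `WordBox` type, the `find_labels` and `split_row` helpers, and the `margin` default are all hypothetical.

```python
from dataclasses import dataclass

# Labels named in the README, lower-cased for matching.
LABELS = ("name", "id", "course code")


@dataclass
class WordBox:
    """Hypothetical container for one OCR phrase and its bounding box."""
    text: str
    x: int
    y: int
    w: int
    h: int


def find_labels(boxes, labels=LABELS):
    """Return the first box matching each known label (case-insensitive)."""
    found = {}
    for box in boxes:
        key = box.text.strip().lower().rstrip(":")
        if key in labels and key not in found:
            found[key] = box
    return found


def split_row(found, image_height, margin=10):
    """Pick the pixel row that separates the upper and lower sections."""
    if not found:
        # Fallback assumption: split in the middle when no label is detected.
        return image_height // 2
    bottom = max(box.y + box.h for box in found.values())
    return min(bottom + margin, image_height)
```

With a NumPy image array `img`, the two sections would then be `img[:row]` and `img[row:]`, and table segmentation would operate on the lower one.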
For text extraction and validation, we use an ensemble of OCR models, including Tesseract, EasyOCR, CRAFT, and TrOCR.
- OCR Models: Code that integrates the various OCR models.
- Majority Voting: After OCR, majority voting is applied to select the most accurate text output.
- Post-Processing: Cleans and formats the extracted text for easier validation.
- Validation: The extracted data is validated against expected formats, such as student IDs and marks.
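The voting and validation steps can be sketched roughly as follows. This is an illustrative example under stated assumptions, not the repository's code: each OCR model is assumed to return one string per field, ties are broken in favor of the model listed first, and the student ID pattern (6 to 12 digits) and the `max_mark` default are hypothetical placeholders for the real formats.

```python
import re
from collections import Counter

# Hypothetical ID format: digits only, 6 to 12 characters long.
ID_PATTERN = re.compile(r"\d{6,12}")


def majority_vote(candidates):
    """Return the most frequent non-empty OCR output, or None if all are empty."""
    cleaned = [c.strip() for c in candidates if c and c.strip()]
    if not cleaned:
        return None
    # Counter.most_common keeps first-seen order among equal counts,
    # so ties go to the model that appears earliest in the list.
    return Counter(cleaned).most_common(1)[0][0]


def is_valid_id(text):
    """Check a voted student ID against the assumed digit-only format."""
    return bool(text) and ID_PATTERN.fullmatch(text) is not None


def is_valid_mark(text, max_mark=100.0):
    """Check that a voted mark parses as a number within [0, max_mark]."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return False
    return 0.0 <= value <= max_mark
```

A field that fails validation could be flagged for manual review rather than silently accepted, since a majority of models can still agree on a wrong reading.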
To run the project, the following key Python packages are required:
- OpenCV
- PyTorch
- scikit-image
- Tesseract-OCR
- EasyOCR
- NumPy
- Matplotlib
- Pandas
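A quick way to check an environment against the list above is sketched here. The import names are the ones these packages commonly use. Mapping Tesseract-OCR to the `pytesseract` wrapper is an assumption, since the README does not say how Tesseract is called, and the Tesseract binary itself must be installed separately. The `missing_packages` helper is hypothetical.

```python
from importlib.util import find_spec

# Display name -> importable module name.
REQUIRED = {
    "OpenCV": "cv2",
    "PyTorch": "torch",
    "scikit-image": "skimage",
    "Tesseract-OCR": "pytesseract",  # assumed Python wrapper
    "EasyOCR": "easyocr",
    "NumPy": "numpy",
    "Matplotlib": "matplotlib",
    "Pandas": "pandas",
}


def missing_packages(required=REQUIRED):
    """Return display names of packages whose module cannot be found."""
    return [name for name, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "none")
```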
Will be added soon
We would like to thank the Director of the Masters in Computer Science and Engineering (MSCSE) program for providing access to the exam scripts dataset.