An AI-powered system that detects and extracts metadata from book spines in images. The system uses computer vision and machine learning to identify books, read their spines, and extract title and author information.
- Book spine detection using YOLO object detection
- Image enhancement using RealESRGAN
- Text extraction using Google Cloud Vision API
- Metadata refinement using Google Gemini AI
- Intelligent caching system to reduce API costs
- Web interface for uploading and viewing results
- Python 3.8+
- CUDA-capable GPU (for YOLO detection)
- Node.js and npm
- torch
- torchvision
- opencv-python
- numpy
- pillow
- requests
- google-cloud-vision
- google-cloud-gemini
- YOLO weights file (models/yolo_weights/best.pt)
  - Download the weights file from Google Drive
  - Place the downloaded best.pt file in the models/yolo_weights/ directory
- RealESRGAN executable (models/realesrgan_portable/realesrgan-ncnn-vulkan.exe)
  - Download the portable executable from Real-ESRGAN releases
  - For Windows: Use realesrgan-ncnn-vulkan.exe
  - For Mac/Linux: Download the appropriate version and adjust the path accordingly
- Google Cloud Vision API credentials
- Google Gemini API key
- Download the portable RealESRGAN executable for your platform
- Place the executable in models/realesrgan_portable/
- The system uses RealESRGAN with these default settings:
# Windows example
realesrgan-ncnn-vulkan.exe -i input.jpg -o output.png -n realesrgan-x4plus
Available models:
- realesrgan-x4plus (default)
- realesrnet-x4plus
- realesrgan-x4plus-anime (optimized for anime images)
- realesr-animevideov3 (animation video)
Note: For Mac/Linux users, adjust the executable path and filename according to your platform.
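As a rough illustration of how the enhancement step could be driven from Python, the sketch below builds the same command line shown above and runs it with `subprocess`. The function names `build_realesrgan_command` and `enhance_image` are hypothetical and not taken from this project; only the `-i`, `-o`, and `-n` flags and the model names come from this README.

```python
import subprocess
from pathlib import Path

# Model names as listed in this README.
AVAILABLE_MODELS = {
    "realesrgan-x4plus",
    "realesrnet-x4plus",
    "realesrgan-x4plus-anime",
    "realesr-animevideov3",
}


def build_realesrgan_command(executable, input_path, output_path,
                             model="realesrgan-x4plus"):
    """Return the argument list for one RealESRGAN run (hypothetical helper)."""
    if model not in AVAILABLE_MODELS:
        raise ValueError(f"Unknown RealESRGAN model: {model}")
    return [str(executable), "-i", str(input_path),
            "-o", str(output_path), "-n", model]


def enhance_image(executable, input_path, output_path,
                  model="realesrgan-x4plus"):
    """Run the executable and return the output path; raises if the run fails."""
    command = build_realesrgan_command(executable, input_path, output_path, model)
    subprocess.run(command, check=True)
    return Path(output_path)
```

On Mac/Linux, passing the platform-specific executable path as `executable` should be the only change needed in this sketch.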
- Create a Google Cloud Project
- Enable the Cloud Vision API
- Create service account credentials
- Download the JSON key file
- Set up authentication by either:
  - Setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your key file:
    export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/credentials.json"
  - Or placing the JSON key file in a known location and updating the code to reference it
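Before running the pipeline, it may help to confirm that the credentials variable is set and points at a readable JSON file. The following is a minimal sketch using only the standard library; `check_vision_credentials` is a hypothetical helper, not part of this project, and it does not contact Google Cloud or validate the key itself.

```python
import json
import os
from pathlib import Path


def check_vision_credentials(env=None):
    """Return (ok, message) describing the local credentials setup."""
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        return False, "GOOGLE_APPLICATION_CREDENTIALS is not set"
    path = Path(key_path)
    if not path.is_file():
        return False, f"Key file not found: {path}"
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return False, f"Key file is not readable JSON: {exc}"
    if not isinstance(data, dict):
        return False, "Key file does not contain a JSON object"
    return True, "ok"
```

Calling `check_vision_credentials()` with no arguments inspects the current process environment.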
- Get a Gemini API key from Google AI Studio
- Create a .env file in the backend directory
- Add your Gemini API key:
GEMINI_API=your_api_key_here
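For reference, here is one way a script could pick up the `GEMINI_API` value, sketched with the standard library only. How the project actually loads the file is not stated here (it may well use a dotenv package), so `parse_env_text` and `load_gemini_key` are hypothetical helpers, and the default path `backend/.env` is an assumption based on the step above.

```python
import os
from pathlib import Path


def parse_env_text(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_gemini_key(env_path="backend/.env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API")
    if key:
        return key
    path = Path(env_path)
    if path.is_file():
        key = parse_env_text(path.read_text(encoding="utf-8")).get("GEMINI_API")
    if not key:
        raise RuntimeError("GEMINI_API is not set in the environment or .env file")
    return key
```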
The system uses these APIs for:
- Google Cloud Vision: Text extraction from book spines
- Google Gemini: Intelligent refinement of extracted text and metadata parsing
For detailed Google Cloud Vision setup instructions, visit the official documentation.
├── backend/
│   ├── python-scripts/
│   │   ├── detect.py              # Main detection script
│   │   ├── fetch_book_info.py     # Book metadata fetching
│   │   └── fetch_database.py      # Database operations
│   └── src/
│       └── server.js              # Backend server
├── frontend/
│   └── public/
│       ├── index.html             # Web interface
│       └── index.js               # Frontend logic
└── models/                        # AI model files
- Clone the repository
- Install Python dependencies:
pip install -r backend/requirements.txt
- Install Node.js dependencies:
cd backend
npm install
- Set up required API keys and credentials
- Place model files in the appropriate directories
- Configure CORS settings:
  - The backend server runs on http://localhost:3000
  - Frontend should be served from a live server (e.g., VS Code Live Server) at http://127.0.0.1:5500
  - If using different ports, update the CORS configuration in backend/src/server.js
To run detection on a single image from the command line:
python backend/python-scripts/detect.py <path_to_image>
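The command above processes one image at a time. To process a whole folder, a small wrapper along these lines could call the script once per image. This is a sketch under assumptions: `find_images`, `build_detect_command`, and `run_batch` are hypothetical names, and the set of accepted image extensions is a guess rather than something this README specifies.

```python
import subprocess
import sys
from pathlib import Path

# Assumed image extensions; adjust to whatever detect.py actually accepts.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_images(folder):
    """Return image files directly inside folder, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def build_detect_command(image_path, script="backend/python-scripts/detect.py"):
    """Build the same command shown above, using the current interpreter."""
    return [sys.executable, script, str(image_path)]


def run_batch(folder):
    """Run detect.py on every image in folder, stopping at the first failure."""
    for image in find_images(folder):
        subprocess.run(build_detect_command(image), check=True)
```

Run it from the repository root so the relative script path resolves.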
- Start the backend server:
cd backend
npm start
- Open frontend/public/index.html in a web browser
- Upload an image containing book spines
- View the detected books and extracted metadata
The system generates:
- Detected book metadata (title, author)
- Enhanced images
- Cropped individual book spine images
- Annotated original image showing detections
- Cached results for faster subsequent processing
The system implements multi-level caching to improve performance and reduce API costs:
- OCR results cache
- Gemini API response cache
- Full process results cache
Cache files are stored in the output directory structure:
output/
└── image_name/
    ├── crops/           # Cropped book spine images
    ├── ocr_cache/       # OCR results
    ├── gemini_cache/    # AI refinement results
    └── process_cache/   # Full process results
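The idea behind each cache level can be sketched as a small file-based lookup: hash the input, check for a stored result, and only call the expensive API on a miss. The code below is an illustration under assumptions, not the project's implementation; `cache_key` and `cached_call` are hypothetical names, and the use of SHA-256 keys and JSON files is a guess at one reasonable design.

```python
import hashlib
import json
from pathlib import Path


def cache_key(data: bytes) -> str:
    """Derive a stable file name from the input bytes."""
    return hashlib.sha256(data).hexdigest()


def cached_call(cache_dir, data, compute):
    """Return the cached result for data, or compute and store it on a miss.

    compute must return something JSON-serializable.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{cache_key(data)}.json"
    if cache_file.is_file():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = compute(data)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

With this shape, an OCR step could be wrapped as `cached_call("output/image_name/ocr_cache", crop_bytes, run_ocr)`, where `run_ocr` stands in for whatever function performs the API call.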
Apache License 2.0
Copyright 2024 Book Spine Detection System Contributors
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
- Min-Han Li (@MinHanLiWesley)
- Yuan Kuang (@greendress2022)
- Lulu Jiao (@luljia0)
- Yue Zhang (@WillzDevs)
Thank you to all contributors who have helped make this project possible!