This is a Streamlit application that demonstrates structured image analysis with the Llama Vision model. It exercises the model's new JSON output capabilities by extracting structured information from uploaded images.
Repository: https://github.com/pleabargain/ollama_llamavision_OCR_JSON_output
To run Ollama in GitHub Codespaces:
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2-vision:11b
- This application must be run locally because it depends on a local Ollama server
- Requires the Llama Vision model to be installed locally
- Very CPU intensive - expect significant processing time for each image
- This is a demonstration of the new JSON output tool functionality
- Python 3.7+
- Ollama server running locally with the llama3.2-vision model installed
- Significant CPU resources for image processing
- Dependencies listed in requirements.txt
- Clone this repository
- Install dependencies:
pip install -r requirements.txt
- Ensure Ollama is running with the llama3.2-vision model installed:
ollama pull llama3.2-vision
3.1 Note: GitHub Codespaces provides only 6 GB of RAM, so running the 11B version of llama3.2-vision there is NOT going to work.
- Start the Streamlit application:
streamlit run main.py
- Open your web browser to the displayed local URL (typically http://localhost:8501)
- Upload an image using the file uploader
- Wait for the analysis to complete (this may take several minutes due to CPU processing)
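Under the hood, the analysis amounts to one request to the local Ollama server. The sketch below builds that request as a plain dict for Ollama's `/api/chat` endpoint; the helper name `build_analysis_request` and the default prompt are illustrative, not the app's actual code.

```python
import base64
import json

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default endpoint

def build_analysis_request(image_bytes: bytes, schema: dict,
                           prompt: str = "Describe this image.") -> dict:
    """Build the JSON payload for Ollama's /api/chat endpoint.

    Passing a JSON schema as `format` asks the model to return
    structured output conforming to that schema.
    """
    return {
        "model": "llama3.2-vision",
        "messages": [
            {
                "role": "user",
                "content": prompt,
                # Images are sent as base64-encoded strings
                "images": [base64.b64encode(image_bytes).decode("ascii")],
            }
        ],
        "format": schema,   # JSON schema constraining the model's output
        "stream": False,    # wait for the full structured response
    }

payload = build_analysis_request(b"\x89PNG...", {"type": "object"})
print(json.dumps(payload)[:60])
```

The resulting payload can then be POSTed with `requests` or `urllib`; the response's `message.content` should contain JSON matching the schema.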
The application extracts structured information using the following data model:
- name (str): Name of the detected object
- confidence (float): Confidence score of the detection
- attributes (str): Additional attributes of the object
- summary (str): Overall description of the image
- objects (list[Object]): List of detected objects
- scene (str): Description of the scene
- colors (list[str]): Dominant colors in the image
- time_of_day: One of ['Morning', 'Afternoon', 'Evening', 'Night']
- setting: One of ['Indoor', 'Outdoor', 'Unknown']
- text_content (optional): Any text detected in the image
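The fields above map naturally onto Pydantic models. A sketch assuming Pydantic v2 (the class names follow the field list; the app's actual definitions may differ):

```python
from typing import List, Literal, Optional

from pydantic import BaseModel

class Object(BaseModel):
    name: str          # Name of the detected object
    confidence: float  # Confidence score of the detection
    attributes: str    # Additional attributes of the object

class ImageDescription(BaseModel):
    summary: str             # Overall description of the image
    objects: List[Object]    # List of detected objects
    scene: str               # Description of the scene
    colors: List[str]        # Dominant colors in the image
    time_of_day: Literal["Morning", "Afternoon", "Evening", "Night"]
    setting: Literal["Indoor", "Outdoor", "Unknown"]
    text_content: Optional[str] = None  # Any text detected in the image

# The derived JSON schema is what gets passed to the model as the output format
schema = ImageDescription.model_json_schema()
print(sorted(schema["properties"]))
```

Pydantic both generates the JSON schema sent to the model and validates the JSON that comes back, so malformed responses fail loudly instead of propagating.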
- The analysis process is computationally intensive and may take several minutes per image
- Performance depends heavily on your CPU capabilities
- Consider closing other CPU-intensive applications while using this tool
- The first analysis may take longer as the model loads into memory
This application uses:
- Streamlit for the web interface
- Ollama's Llama Vision model for image analysis
- Pydantic for data validation and JSON schema generation
- PIL (Python Imaging Library) for image processing
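One practical use of PIL here is shrinking uploads before analysis, since processing time scales with image size. A hedged sketch; the function name and the 1024-pixel limit are assumptions, not the app's actual code:

```python
import io

from PIL import Image

MAX_SIDE = 1024  # assumed limit; smaller images mean less CPU time per analysis

def prepare_image(raw: bytes, max_side: int = MAX_SIDE) -> bytes:
    """Downscale and re-encode an uploaded image before sending it for analysis."""
    img = Image.open(io.BytesIO(raw)).convert("RGB")
    img.thumbnail((max_side, max_side))  # shrinks in place, preserving aspect ratio
    out = io.BytesIO()
    img.save(out, format="JPEG", quality=90)
    return out.getvalue()
```

`Image.thumbnail` never enlarges, so small images pass through at their original resolution.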
The application features several tabs for easy navigation and transparency:
- The primary interface, where you can upload and analyze images.
- Contains this documentation for quick reference while using the application.
- Displays the application's source code (main.py) for transparency and educational purposes.
- Shows the complete list of Python package dependencies required to run this application.
- Provides real-time visibility into:
  - Function calls and their execution
  - Processing status and outcomes
  - System performance metrics
  - Error tracking and debugging information
- Displays the results of unit tests, ensuring code reliability and proper functionality.