Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built upon datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision.
This model enables multi-frame image understanding, image comparison, multi-image summarization/storytelling, and video summarization, which have broad applications in office scenarios.
Follow these steps to set up and run the project:
Ensure all necessary packages are installed by running:
pip install -r requirements.txt
Launch the API server powered by LitServe:
python server.py
With the API server still running, start the Streamlit application from a separate terminal:
streamlit run app.py
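Once the server from the steps above is running, it can also be queried without the Streamlit front end. The following is a minimal sketch, not the project's actual client: it assumes the server listens on LitServe's default port 8000 and default /predict route, and the JSON field names "image" and "prompt" are hypothetical. Check server.py for the request format it really expects.

```python
import base64
import json
import urllib.request

# Assumption: LitServe's default port (8000) and route (/predict).
# server.py may configure something different.
SERVER_URL = "http://localhost:8000/predict"


def build_payload(image_bytes: bytes, prompt: str) -> dict:
    """Pack an image and a prompt into a JSON-serializable dict.

    The field names "image" and "prompt" are hypothetical; adjust them
    to whatever the request decoding in server.py reads.
    """
    return {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "prompt": prompt,
    }


def ask(image_bytes: bytes, prompt: str, url: str = SERVER_URL) -> dict:
    """POST the payload to the server and return the decoded JSON reply."""
    body = json.dumps(build_payload(image_bytes, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Vision models can be slow on CPU, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))
```

To try it, read an image file in binary mode and pass its bytes to ask together with a question such as "Summarize this slide." The shape of the returned dict depends on how server.py encodes its response.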
This project is developed and maintained with ❤️ by Bhimraj Yadav.