🚀 Preprocessing Pipeline for Meeting Transcriptions and Documents

A system diagram covering the preprocessing pipeline

🌟 Overview

The Preprocessing Pipeline is a staff-facing application that streamlines the transcription, cleanup, and vectorization of meeting materials such as agendas, minutes, and audio recordings. It enables municipal staff to:

  • Upload 🖼️, process ⚙️, and view 👀 transformed documents at each stage.
  • Manage meeting data 🏛️ accurately and efficiently.
  • Attach metadata 📋 (meeting date, meeting type, and file type) so documents stay organized.
  • Use AI tools 🤖 to improve transcription quality and create vector embeddings for further analysis.

✨ Features

🏠 Home Page

  • Upload Documents: A page to upload files and provide metadata.
  • View Documents: A page to view and download files at different processing stages.

📤 Upload Documents

  • Upload agendas, minutes, or audio files.
  • The system processes the files through transcription, cleaning, and vectorization stages.

For Audio Files 🎙️:

  • Choose between AssemblyAI models for transcription:
    • Nano: Faster, cheaper, lower quality.
    • Best: Higher quality, slower, more expensive.
  • Files up to 1 GB can be uploaded.
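
The two audio constraints above (a model choice and a 1 GB size limit) can be checked before anything is sent for transcription. The following is a minimal sketch, not the app's actual code: `validate_audio_upload`, `SpeechModel`, and `MAX_UPLOAD_BYTES` are hypothetical names, and reading "1 GB" as 1024³ bytes is an assumption.

```python
from enum import Enum

# Assumption: "1 GB" is read as 1024**3 bytes; the app may use a different cutoff.
MAX_UPLOAD_BYTES = 1024 ** 3


class SpeechModel(str, Enum):
    """The two transcription tiers named in this README."""
    NANO = "nano"  # faster, cheaper, lower quality
    BEST = "best"  # higher quality, slower, more expensive


def validate_audio_upload(size_bytes: int, model: str) -> SpeechModel:
    """Reject empty or oversized files and normalize the model choice."""
    if size_bytes <= 0:
        raise ValueError("audio file is empty")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"audio file exceeds the {MAX_UPLOAD_BYTES}-byte limit")
    try:
        return SpeechModel(model.strip().lower())
    except ValueError:
        raise ValueError(f"unknown model {model!r}; expected 'nano' or 'best'") from None
```

For example, `validate_audio_upload(50_000_000, "Nano")` returns `SpeechModel.NANO`, while a 2 GB file raises `ValueError` before any API call is made.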

For Agendas and Minutes 📄:

  • Upload PDFs for processing.
  • Text is extracted directly from text-based PDFs; OCR is used for scanned documents.
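
One common way to choose between direct text extraction and OCR is to try extraction first and fall back to OCR when most pages come back nearly empty. The sketch below illustrates that heuristic only; it is not necessarily how this pipeline decides. `needs_ocr` and the 20-character threshold are hypothetical, and the per-page strings are assumed to come from whatever PDF text extractor the app uses.

```python
from typing import Sequence


def needs_ocr(page_texts: Sequence[str], min_chars_per_page: int = 20) -> bool:
    """Return True when more than half of the pages have almost no extractable text.

    page_texts holds the text extracted from each page of the PDF.
    A scanned document typically yields empty or near-empty strings.
    """
    if not page_texts:
        # Nothing could be extracted at all, so treat the PDF as scanned.
        return True
    sparse_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return sparse_pages > len(page_texts) / 2
```

A mixed document (for example, a typed agenda with one scanned attachment page) stays on the text path here; a per-page decision would be a reasonable alternative design.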

📂 View Documents

  • Access and download files at these stages:
    • Raw Audio 🎵
    • Dirty Transcriptions 📝
    • Clean Text
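
To make the three stages concrete, the sketch below shows one way a file could be named per stage so that the raw, dirty, and clean versions of the same meeting sit side by side. The layout is purely illustrative: `Stage`, `blob_name`, and the `stage/date_type_filetype.ext` pattern are hypothetical and are not taken from this project.

```python
from datetime import date
from enum import Enum


class Stage(str, Enum):
    """Processing stages exposed on the View Documents page."""
    RAW = "raw"
    DIRTY = "dirty"
    CLEAN = "clean"


def blob_name(stage: Stage, meeting_date: date, meeting_type: str,
              file_type: str, ext: str) -> str:
    """Build a storage name such as 'dirty/2024-01-15_planning_board_audio.txt'."""
    # Lowercase the meeting type and replace spaces so the name is path-friendly.
    slug = meeting_type.strip().lower().replace(" ", "_")
    return (
        f"{stage.value}/{meeting_date.isoformat()}_{slug}_"
        f"{file_type.strip().lower()}.{ext.lstrip('.')}"
    )
```

Keeping the stage as the leading path segment means each stage can be listed with a simple prefix query.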

⚙️ How to Start

  1. Go to the Streamlit Preprocessing App.

  2. Configure your API keys in the sidebar:

    • OpenAI
      • Get your OpenAI API Key
      • Set OPENAI_API_KEY in the sidebar.
      • For OPENAI_BASE_URL, use https://api.openai.com/v1 or leave it blank.
    • Weaviate
    • AssemblyAI
    • Azure
      • Create a storage account
      • Go to the Access Keys section in Azure and copy the connection string into AZURE_STORAGE_CONNECTION_STRING.
      • Specify the container name in AZURE_STORAGE_CONTAINER_NAME.
  3. Use the Upload Documents page to:

    • Add metadata:
      • 📅 Meeting Date: Any date.
      • 🏛️ Meeting Type: Board of Commissioners or Planning Board.
      • 📄 File Type: Agenda, Minutes, or Audio.
    • Upload files for processing.
  4. Monitor the preprocessing stages on the View Documents page, where you can access the Raw, Dirty, and Clean versions of each document.
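
The configuration and metadata steps above can be sketched as two small checks. This is a hedged illustration, not the app's code: `load_settings` and `MeetingMetadata` are hypothetical names, and treating a blank `OPENAI_BASE_URL` as the default OpenAI endpoint is an assumption based on the instruction to "leave it blank". The Weaviate and AssemblyAI setting names are not given in this README, so they are left out rather than guessed.

```python
import os
from dataclasses import dataclass
from datetime import date
from typing import Dict, Mapping, Optional

# Setting names taken from this README; Weaviate and AssemblyAI names are not stated.
REQUIRED_SETTINGS = (
    "OPENAI_API_KEY",
    "AZURE_STORAGE_CONNECTION_STRING",
    "AZURE_STORAGE_CONTAINER_NAME",
)
DEFAULT_OPENAI_BASE_URL = "https://api.openai.com/v1"

MEETING_TYPES = ("Board of Commissioners", "Planning Board")
FILE_TYPES = ("Agenda", "Minutes", "Audio")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect the required settings, failing early with the names that are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    settings = {name: env[name].strip() for name in REQUIRED_SETTINGS}
    # Assumption: a blank base URL means "use the default OpenAI endpoint".
    settings["OPENAI_BASE_URL"] = (
        env.get("OPENAI_BASE_URL", "").strip() or DEFAULT_OPENAI_BASE_URL
    )
    return settings


@dataclass(frozen=True)
class MeetingMetadata:
    """Metadata entered on the Upload Documents page."""
    meeting_date: date
    meeting_type: str
    file_type: str

    def __post_init__(self) -> None:
        if self.meeting_type not in MEETING_TYPES:
            raise ValueError(f"meeting_type must be one of {MEETING_TYPES}")
        if self.file_type not in FILE_TYPES:
            raise ValueError(f"file_type must be one of {FILE_TYPES}")
```

Passing a plain dictionary to `load_settings` instead of relying on `os.environ` makes the check easy to exercise with values typed into a sidebar form.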