Preprocess updates #46

RileyLePrell · 2024-11-21T23:49:20Z

Staff Facing App
The app is built in Streamlit and has three main pages:

Home Page
- Serves as an introduction to the website.
- Provides a high-level overview of the app's purpose and functionality.
Upload Files
- Allows staff to upload various types of files, including:
  - Audio recordings (e.g., meeting recordings).
  - Minutes (PDF format).
  - Agendas (PDF format).
- Metadata is required before uploading files for better organization.
View Documents
- Displays all files currently stored in Azure Blob Storage.
- Allows staff to view files at various stages of the preprocessing pipeline.

Preprocessing Pipeline
The pipeline processes files uploaded through the app and loads the results into Weaviate Cloud for querying. The steps are:

Metadata Input
- Users input essential metadata (e.g., meeting date, type, and file type) before uploading files.
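A minimal sketch of what the metadata check could look like, assuming the three fields named above. The class name `UploadMetadata`, the field names, and the allowed file types are hypothetical and are not taken from this PR.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical allowed values; the app's real options are not shown in this PR.
ALLOWED_FILE_TYPES = {"audio", "minutes", "agenda"}


@dataclass(frozen=True)
class UploadMetadata:
    meeting_date: date
    meeting_type: str
    file_type: str

    def __post_init__(self):
        # Reject incomplete or unknown metadata before any file is accepted.
        if not self.meeting_type.strip():
            raise ValueError("meeting_type must not be empty")
        if self.file_type not in ALLOWED_FILE_TYPES:
            raise ValueError(f"unsupported file_type: {self.file_type!r}")


# Example values only.
meta = UploadMetadata(date(2024, 11, 21), "Regular", "minutes")
```

Validating in the constructor means an upload handler cannot hold a half-filled metadata object, which matches the requirement that metadata comes before the file.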
File Conversion
- Audio files: Transcribed using AssemblyAI and stored as a "dirty" text file.
- PDF files (minutes/agendas): Converted directly into "dirty" text files.
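The split between the two conversion paths can be sketched as a small routing function. This is an illustration only: the function name `conversion_route`, the route labels, and the extension sets are assumptions, since the PR only says "audio recordings" and "PDF".

```python
from pathlib import Path

# Hypothetical extension sets; the PR does not list accepted extensions.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a"}
PDF_EXTENSIONS = {".pdf"}


def conversion_route(filename: str) -> str:
    """Return which conversion step a file should go through."""
    suffix = Path(filename).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "transcribe"   # audio -> AssemblyAI transcript -> dirty text
    if suffix in PDF_EXTENSIONS:
        return "extract_pdf"  # minutes/agenda PDF -> dirty text
    raise ValueError(f"unsupported file type: {suffix or filename!r}")
```

Both routes end in the same "dirty" text format, so the cleaning step that follows does not need to know where the text came from.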
Text Cleaning
- Uses an OpenAI GPT model with a tailored prompt to clean and standardize the "dirty" text.
Tokenization and Vector Embedding
- Tokenizer: Tiktoken, with text split into chunks of 250 tokens.
- Embedder: Uses OpenAI's "text-embedding-ada-002" model to embed each chunk.
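A minimal sketch of the 250-token chunking, assuming non-overlapping chunks (the PR does not say whether chunks overlap). The function name `chunk_tokens` is hypothetical, and the token ids here are stand-ins; in the pipeline they would come from a Tiktoken encoder's `encode` call.

```python
from typing import List, Sequence


def chunk_tokens(token_ids: Sequence[int], chunk_size: int = 250) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        list(token_ids[start:start + chunk_size])
        for start in range(0, len(token_ids), chunk_size)
    ]


# Stand-in token ids, not real tokenizer output.
chunks = chunk_tokens(list(range(600)))
print([len(c) for c in chunks])  # [250, 250, 100]
```

Each chunk would then be decoded back to text and sent to the embedding model, so the last, shorter chunk is kept rather than dropped.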
Storage
- All processed data is stored in Weaviate Cloud, where it can be searched.

Cloud Functionality

Azure Blob Storage:
- Serves as the file repository.
- The Streamlit app stores files in Azure Blob Storage at each stage of the pipeline (e.g., raw, dirty, clean).
Weaviate Cloud:
- Hosts the embedding vectors for the tokenized chunks.
- Facilitates future querying and advanced analytics.
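One way to keep the raw, dirty, and clean versions of a file apart in a single container is a stage prefix in the blob name. The sketch below assumes that layout; the function name `blob_name` and the folder scheme are hypothetical and may differ from what the app actually does.

```python
from pathlib import PurePosixPath

# Stage names come from the description above; the folder layout is an assumption.
STAGES = ("raw", "dirty", "clean")


def blob_name(stage: str, filename: str) -> str:
    """Build a blob name such as 'dirty/2024-11-21_minutes.txt'."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    path = PurePosixPath(filename)
    # Raw uploads keep their extension; later stages are stored as text.
    suffix = path.suffix if stage == "raw" else ".txt"
    return f"{stage}/{path.stem}{suffix}"
```

With a scheme like this, the View Documents page could list one prefix at a time to show files at a given stage.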

Current Status

Pipeline: Fully operational.
Staff Facing App:
- Pages implemented and functional.
- Metadata handling and file upload integrated with preprocessing pipeline.
View Documents Page: Displays uploaded files from Azure Blob Storage.
Tokenizer and Embedder:
- Tokenization via Tiktoken.
- Embedding with text-embedding-ada-002.

- Preliminary outline of the Staff Facing App. It includes three pages:
  1. Home Page - Intro to the website.
  2. Upload Files - Where staff can upload audio recordings, minutes, or agendas.
  3. View Documents - Shows what is currently in the cloud.
- I've created a preprocessing pipeline as well:
  1. Users input metadata before uploading a file.
  2. Audio files are transcribed by AssemblyAI and saved as a dirty text file; minutes/agenda PDFs are converted to a dirty text file.
  3. OpenAI + prompt engineering is used to clean the text files.
  4. The text is chunked, tokenized, and embedded as vectors.
  5. The results are stored in Weaviate Cloud.

- Created an Azure Blob container; Streamlit stores files there at different stages of the preprocessing pipeline.
- Pipeline is up and running.
- Added a View Documents page to see uploaded files.

- Added tokenizer: Tiktoken.
- Chunk size set to 250 tokens.
- Embedder: "text-embedding-ada-002".

- Files were capped at 200 MB; increased this to 1 GB.
- Fixed formatting issues and some problems with Streamlit not running.

Preprocessing/preprocessing_pipeline/audio_transcription.py

+
+        # Step 2: Upload the audio file to AssemblyAI
+        print("Uploading audio file to AssemblyAI...")
+        upload_response = requests.post(


Preprocessing/preprocessing_pipeline/audio_transcription.py

+        if speaker_labels:
+            transcription_payload["speaker_labels"] = True
+
+        transcription_response = requests.post(


Preprocessing/preprocessing_pipeline/audio_transcription.py

+
+        # Step 4: Poll for transcription result
+        while True:
+            status_response = requests.get(


RileyLePrell and others added 5 commits November 13, 2024 14:40

Update main.py

c194a6f

Cloud Functionality

305cbb2

Tokenization + Embedder

e6bbcfd

Formatting + Upload Size

94bfff6

github-advanced-security bot found potential problems Nov 21, 2024

neal-logan merged commit 8920d30 into main Nov 21, 2024
5 of 7 checks passed

neal-logan deleted the Preprocess-Updates branch November 22, 2024 00:08
