This repo generates documentation and insights from GitHub repositories. It uses a Streamlit interface to interact with the user, allowing them to input a GitHub repository and process specific file types.
The application extracts data from the repository, vectorizes it using AstraDB, and generates relevant documentation with the help of an OpenAI language model.
Key functionalities include displaying repository data, generating an overview, architectural summary, domain model, and enabling a chat interface for code-related questions. The system uses asynchronous programming to manage tasks efficiently.
To prove the point, the following documentation has been auto-generated by the app itself!
The application, named Code Whisperer, is a comprehensive tool designed to automate the creation of documentation and to surface insights into code stored within a GitHub repository.
The architectural overview can be delineated as follows:
Streamlit: The application utilizes Streamlit as its primary interface. Streamlit serves as the web application framework that orchestrates user interactions, organizes the interface into various tabs, and provides immediate feedback to users. The main tabs include "Repository data," "Overview," "Architectural summary," "Domain model," and "Chat with your code."
RepoReader Module: This encapsulates the logic required to fetch data from GitHub repositories. Upon initialization with a GitHub token, it connects to the GitHub API, retrieves repository contents, and filters files based on specified extensions.
AstraDB: Leveraging AstraDB's scalable database capabilities, the application stores vector embeddings of repository data. This is achieved using the DataAPIClient to create and manage collections that store vectorized representations of file contents.
Vectorization and Storage:
Vectorization: The application relies on text-embedding-ada-002 from OpenAI to transform repository contents into vector embeddings, utilizing cosine similarity metrics for efficient storage and retrieval.
AstraDB Integration: Data from the repository is vectorized and inserted into AstraDB collections for persistent storage. This includes metadata about the repository and the contents of its files.
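As a sketch of what that storage step might look like, the snippet below shapes one repository file into an AstraDB document and inserts it with astrapy. The `build_document` helper, the `repo_files` collection name, and the precomputed-embedding parameter are illustrative assumptions, not the app's actual code.

```python
# Illustrative sketch only: build_document, the "repo_files" collection
# name, and the embedding parameter are assumptions, not the app's code.

def build_document(repo_name: str, file_path: str,
                   content: str, embedding: list[float]) -> dict:
    """Shape one repository file as an AstraDB document: file metadata
    plus the precomputed embedding under the reserved "$vector" key."""
    return {
        "_id": f"{repo_name}/{file_path}",  # stable id per file
        "repo": repo_name,
        "path": file_path,
        "content": content,
        "$vector": embedding,
    }

def store_documents(docs: list[dict], endpoint: str, token: str) -> None:
    """Insert documents into an AstraDB collection. The import is kept
    local so build_document stays usable without astrapy installed."""
    from astrapy import DataAPIClient  # pip install astrapy

    db = DataAPIClient(token).get_database_by_api_endpoint(endpoint)
    collection = db.get_collection("repo_files")  # hypothetical collection name
    for doc in docs:
        collection.insert_one(doc)
```

Storing the raw `content` next to the `$vector` field means a similarity search can return the original text directly, without a second lookup.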
OpenAI Integration: OpenAI’s Large Language Models (LLMs), specifically GPT-4, are employed to generate code documentation. The LLMs interact with vectorized content stored within AstraDB to extract and summarize relevant information about the repository and its codebase.
Asyncio and AsyncOpenAI: Asynchronous capabilities are provided by Python’s asyncio, ensuring non-blocking operations for tasks such as fetching repository contents, generating documentation, and querying LLMs.
Session Management:
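The non-blocking pattern described above can be sketched with the standard library alone: several documentation sections are generated concurrently rather than one after another. Here `query_llm` is a stand-in for a real `AsyncOpenAI` call (`client.chat.completions.create`), which would await the API instead of sleeping.

```python
import asyncio

async def query_llm(section: str) -> str:
    # Stand-in for an awaited AsyncOpenAI request; the sleep simulates
    # network latency of the real LLM call.
    await asyncio.sleep(0.01)
    return f"Generated text for: {section}"

async def generate_all(sections: list[str]) -> list[str]:
    # gather() runs all coroutines concurrently and preserves input order
    return await asyncio.gather(*(query_llm(s) for s in sections))

results = asyncio.run(generate_all(["overview", "architecture", "domain model"]))
```

Because the coroutines overlap, total wall-clock time is close to the slowest single request rather than the sum of all of them.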
Streamlit Session State: Streamlit's session state mechanism is utilized to maintain the application's state across various user interactions, such as loading repository data and storing generated documentation.
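The guard pattern session state relies on is worth making explicit: a key is initialized only if it is not already present, so Streamlit's script reruns do not wipe previously loaded data. Since `st.session_state` behaves like a dict, the helper below is exercised against a plain dict; in the app it would be called as `init_state(st.session_state, ...)`. The helper name and default keys are illustrative.

```python
def init_state(state, defaults: dict) -> None:
    """Set default keys only if absent, so reruns keep existing values."""
    for key, value in defaults.items():
        if key not in state:  # only set on the first run
            state[key] = value

state = {}  # stand-in for st.session_state
init_state(state, {"repo_data": None, "overview": ""})
state["overview"] = "generated text"  # a later interaction stores a result
init_state(state, {"repo_data": None, "overview": ""})  # rerun: nothing overwritten
```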
The typical flow through the application is:
- The user inputs GitHub details via the Streamlit interface.
- The application reads and vectorizes repository data using RepoReader.
- Vectorized data is stored in AstraDB.
- The user interacts with the provided tabs to view repository details, architectural summaries, and domain models, and to chat with the code.
- OpenAI LLMs are queried to generate documentation based on context retrieved from AstraDB.
- The results are presented to the user through the Streamlit interface.
This architecture ensures a seamless, interactive experience by integrating various technologies for data retrieval, storage, processing, and presentation.
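The steps above can be condensed into one orchestration sketch. Every function here is a hypothetical stand-in for the real components (RepoReader, AstraDB, the OpenAI LLM); only the order of operations mirrors the app.

```python
# Hypothetical stand-ins for the app's real components; only the
# sequencing (read -> vectorize/store -> generate) mirrors the app.

def read_repository(repo: str) -> dict:
    # in the app: RepoReader fetches files via the GitHub API
    return {"name": repo, "files": {"app.py": "print('hello')"}}

def vectorize_and_store(data: dict) -> int:
    # in the app: embed each file and insert it into AstraDB
    return len(data["files"])

def generate_documentation(data: dict) -> str:
    # in the app: query the LLM with context retrieved from AstraDB
    return f"Documentation for {data['name']} ({len(data['files'])} file(s))"

def run_pipeline(repo: str) -> str:
    data = read_repository(repo)
    vectorize_and_store(data)
    return generate_documentation(data)
```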
- Code Whisperer Application
- Attributes:
- None (Primary entity)
- Methods:
- generateDocumentation(): Asynchronously generate documentation based on the repository.
- load_sidebar(): Load the repository data and settings from the sidebar.
- show_repository_data(): Display the repository data on the first tab.
- show_overview(): Display the overview on the second tab.
- show_architectural_summary(): Display the architectural summary on the third tab.
- show_domain_model(): Display the domain model on the fourth tab.
- show_chat(): Display the chat interface on the fifth tab.
- RepoReader
- Attributes:
- github_token: string
- github_repo_name: string
- github_handle: object
- github_repo: object
- extensions: tuple (file extensions to process)
- Methods:
- connect(token: str): Connect to the GitHub API using a token.
- setRepository(repo: str): Set the current repository to work with.
- getRepositoryContents(): Retrieve the contents of the repository.
- getRepositoryContent(file_path: str): Retrieve the contents of a specific file.
- getName(): Get the repository name.
- getTopics(): Get the topics of the repository.
- getStars(): Get the star count of the repository.
- setExtensions(extensions: str): Set file extensions to process.
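A minimal sketch of the RepoReader interface listed above, covering only a subset of its methods. The GitHub connection would use PyGithub (`pip install PyGithub`); the import is kept inside `connect()` so the extension-filtering logic can be exercised without network access or a token. The `wantsFile` helper and the comma-separated extension format are assumptions.

```python
class RepoReader:
    """Sketch of the RepoReader class; wantsFile and the comma-separated
    extension format are illustrative assumptions, not the app's code."""

    def __init__(self):
        self.github_handle = None
        self.github_repo = None
        self.extensions = ()

    def connect(self, token: str):
        from github import Github  # PyGithub; imported lazily on purpose
        self.github_handle = Github(token)

    def setRepository(self, repo: str):
        self.github_repo = self.github_handle.get_repo(repo)

    def setExtensions(self, extensions: str):
        # "py, md" -> (".py", ".md"); a tuple works with str.endswith
        self.extensions = tuple(
            e if e.startswith(".") else "." + e
            for e in (p.strip() for p in extensions.split(","))
        )

    def wantsFile(self, path: str) -> bool:
        # helper a content walk could use to filter files by extension
        return path.endswith(self.extensions)
```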
+--------------------------------+
| Code Whisperer Application     |
+--------------------------------+
| -                              |
+--------------------------------+
| + generateDocumentation()      |
| + load_sidebar()               |
| + show_repository_data()       |
| + show_overview()              |
| + show_architectural_summary() |
| + show_domain_model()          |
| + show_chat()                  |
+--------------------------------+
              | uses
              v
+----------------------------------------+
| RepoReader                             |
+----------------------------------------+
| - github_token: string                 |
| - github_repo_name: string             |
| - github_handle: object                |
| - github_repo: object                  |
| - extensions: tuple                    |
+----------------------------------------+
| + connect(token: str)                  |
| + setRepository(repo: str)             |
| + getRepositoryContents()              |
| + getRepositoryContent(file_path: str) |
| + getName()                            |
| + getTopics()                          |
| + getStars()                           |
| + setExtensions(extensions: str)       |
+----------------------------------------+
              | uses
              v
+------------------------------------------------+
| DataAPIClient                                  |
+------------------------------------------------+
| -                                              |
+------------------------------------------------+
| + get_database_by_api_endpoint(endpoint: str)  |
+------------------------------------------------+
              | uses
              v
+------------------------------+
| OpenAI                       |
+------------------------------+
| - api_key: string            |
+------------------------------+
| + chat.completions.create()  |
+------------------------------+
              | interacts
              v
+--------------------------------+
| Collection                     |
+--------------------------------+
| -                              |
+--------------------------------+
| + insert_one(context: dict)    |
| + delete_many(criteria: dict)  |
+--------------------------------+
In case you want to run all of the above locally, it's useful to create a virtual environment. Set it up as follows:
python3.10 -m venv myenv
Then activate it as follows:
source myenv/bin/activate # on Linux/Mac
myenv\Scripts\activate.bat # on Windows
Now you can start installing packages:
pip3 install -r requirements.txt
Finally, start the application:
streamlit run app.py