Skip to content

Virtual architect that understands code, generates documentation and allows for chatting with the code

Notifications You must be signed in to change notification settings

michelderu/code-whisperer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Code Whisperer

This repo generates documentation and insights from GitHub repositories. It uses a Streamlit interface to interact with the user, allowing them to input a GitHub repository and process specific file types.

The application extracts data from the repository, vectorizes it using AstraDB, and generates relevant documentation with the help of an OpenAI language model.

Key functionalities include displaying repository data, generating an overview, architectural summary, domain model, and enabling a chat interface for code-related questions. The system uses asynchronous programming to manage tasks efficiently.

To proof the point, the following documentation has been auto-generated by the app!

Architectural summary

The application, termed Code Whisperer, is a comprehensive tool designed to automate the creation of documentation and provide insights into code stored within a GitHub repository.

Architecture Diagram

The architectural overview can be delineated as follows:

Frontend Layer

Streamlit: The application utilizes Streamlit as its primary interface. Streamlit serves as the web application framework that orchestrates user interactions, organizes the interface into various tabs, and provides immediate feedback to users. The main tabs include "Repository data," "Overview," "Architectural summary," "Domain model," and "Chat with your code."

Backend Layer

RepoReader Module: This encapsulates the logic required to fetch data from GitHub repositories. Upon initializing with a GitHub token, it connects to the GitHub API, retrieves repository contents, and filters files based on specified extensions. AstraDB: Leveraging AstraDB's scalable database capabilities, the application stores vector embeddings of repository data. This is achieved using the DataAPIClient to create and manage collections that store vectorized representations of file contents. Vectorization and Storage:

Vectorization: The application relies on text-embedding-ada-002 from OpenAI to transform repository contents into vector embeddings, utilizing cosine similarity metrics for efficient storage and retrieval.

AstraDB Integration: Data from the repository is vectorized and inserted into AstraDB collections for persistent storage. This includes metadata about the repository and the contents of its files.

LLM Integration

OpenAI Integration: OpenAI’s Large Language Models (LLMs), specifically GPT-4, are employed to generate code documentation. The LLMs interact with vectorized content stored within AstraDB to extract and summarize relevant information about the repository and its codebase.

Asynchronous Operations

Asyncio and AsyncOpenAI: Asynchronous capabilities are provided by Python’s asyncio, ensuring non-blocking operations for tasks such as fetching repository contents, generating documentation, and querying LLMs. Session Management:

Streamlit Session State: Streamlit's session state mechanism is utilized to maintain the application's state across various user interactions, such as loading repository data and storing generated documentation.

Workflow Summary

  1. The user inputs GitHub details via the Streamlit interface.
  2. The application reads and vectorizes repository data using RepoReader.
  3. Vectorized data is stored in AstraDB.
  4. User interacts with provided tabs to view repository details, architectural summaries, domain models, and to chat with the code.
  5. OpenAI LLMs are queried to generate documentation based on context retrieved from AstraDB.
  6. The results are presented to the user through the Streamlit interface.
  7. This architecture ensures a seamless, interactive experience by integrating various technologies for data retrieval, storage, processing, and presentation

Conceptual Domain Model

  1. Code Whisperer Application
  • Attributes:
    • None (Primary entity)
  • Methods:
    • generateDocumentation(): Asynchronously generate documentation based on the repository.
    • load_sidebar(): Load the repository data and settings from the sidebar.
    • show_repository_data(): Display the repository data on the first tab.
    • show_overview(): Display the overview on the second tab.
    • show_architectural_summary(): Display the architectural summary on the third tab.
    • show_domain_model(): Display the domain model on the fourth tab.
    • show_chat(): Display the chat interface on the fifth tab.
  1. RepoReader
  • Attributes:
    • github_token: string
    • github_repo_name: string
    • github_handle: object
    • github_repo: object
    • extensions: tuple (file extensions to process)
  • Methods:
    • connect(token: str): Connect to the GitHub API using a token.
    • setRepository(repo: str): Set the current repository to work with.
    • getRepositoryContents(): Retrieve the contents of the repository.
    • getRepositoryContent(file_path: str): Retrieve the contents of a specific file.
    • getName(): Get the repository name.
    • getTopics(): Get the topics of the repository.
    • getStars(): Get the star count of the repository.
    • setExtensions(extensions: str): Set file extensions to process.

UML Class Diagram

+----------------------------+
| Code Whisperer Application |
+----------------------------+
| -                          |
+----------------------------+
| + generateDocumentation()  |
| + load_sidebar()           |
| + show_repository_data()   |
| + show_overview()          |
| + show_architectural_summary() |
| + show_domain_model()      |
| + show_chat()              |
+----------------------------+

              | uses
              v

+----------------------------+
|       RepoReader           |
+----------------------------+
| - github_token: string     |
| - github_repo_name: string |
| - github_handle: object    |
| - github_repo: object      |
| - extensions: tuple        |
+----------------------------+
| + connect(token: str)      |
| + setRepository(repo: str) |
| + getRepositoryContents()  |
| + getRepositoryContent     |
| + getName()                |
| + getTopics()              |
| + getStars()               |
| + setExtensions()          |
+----------------------------+

              | uses
              v

+----------------------------+
|        DataAPIClient       |
+----------------------------+
| -                          |
+----------------------------+
| + get_database_by_api_endpoint(endpoint: str) |
+----------------------------+

              | uses
              v

+------------+---------+
|   OpenAI              |
+-----------------------+
| - api_key: string     |
+-----------------------+
| + chat.completions.create() |
+----------------------------+

              | interacts
              v

+-------------------------+
|       Collection        |
+-------------------------+
| -                       |
+-------------------------+
| + insert_one(context: dict)      |
| + delete_many(criteria: dict)    |
+-------------------------+

Start the app

Create a Python environment

In case you want to run all of the above locally, it's useful to create a Virtual Environment. Use the below to set it up:

python3.10 -m venv myenv

Then activate it as follows:

source myenv/bin/activate   # on Linux/Mac
myenv\Scripts\activate.bat  # on Windows

Now you can start installing packages:

pip3 install -r requirements.txt

Install dependencies

pip3 install -r requirements.txt

Run it

streamlit run app.py

About

Virtual architect that understands code, generates documentation and allows for chatting with the code

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages