Wouldn't it be cool to use vector DBs to search for semantically (not syntactically) similar code in the public domain?
For real?
It's not a game changer or anything. This is just a fun experiment that I built to get familiar with Tree-sitter and pgvector :) There's still a lot of room for improvement before it reaches high accuracy. This is not productized; it was just an intellectual escapade.
- Use Node 18 and run npm install
- Go to your Supabase dashboard and get the 2 environment variables that you need in your .env file. Those are:
NEXT_PUBLIC_SUPABASE_URL=YOUR_URL
NEXT_PUBLIC_SUPABASE_ANON_KEY=YOUR_API_KEY
Semantica offers 2 very basic features:
- You save a code snippet to the DB. This snippet is now searchable by other users.
- You retrieve the most semantically similar code snippets from the DB, given a snippet of your own.
Behind the scenes, Semantica:
- Converts the code snippet to a vectorized AST using Tree-sitter. Right now JS is the only language for which Semantica has a grammar.
- Normalizes and stores the vectorized AST.
- Uses the dot product to search for the stored code snippets whose embeddings are most similar to the query's. The match threshold is 0.9.
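
To make the last two steps concrete, here is a minimal in-memory sketch of normalizing embeddings and ranking them by dot product against the 0.9 threshold. It is not Semantica's actual code: the names `StoredSnippet`, `normalize`, `dot` and `findSimilar` are made up for illustration, the embeddings are plain number arrays rather than real Tree-sitter output, and treating the threshold as inclusive is an assumption. In the real project the search runs in the DB through pgvector.

```typescript
// Hypothetical in-memory stand-in for the DB search; all names are illustrative.
type StoredSnippet = { code: string; embedding: number[] };

// Assumed from the README: matches must score 0.9 (inclusive here, by assumption).
const MATCH_THRESHOLD = 0.9;

// Scale a vector to unit length so that the dot product of two
// normalized vectors equals their cosine similarity.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return v.slice(); // avoid dividing by zero
  return v.map((x) => x / norm);
}

// Plain dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Return the stored snippets that reach the threshold, best match first.
function findSimilar(
  query: number[],
  stored: StoredSnippet[],
  threshold: number = MATCH_THRESHOLD
): { code: string; score: number }[] {
  const q = normalize(query);
  return stored
    .map((s) => ({ code: s.code, score: dot(q, normalize(s.embedding)) }))
    .filter((m) => m.score >= threshold)
    .sort((a, b) => b.score - a.score);
}
```

Because both vectors are normalized first, the score always falls between -1 and 1, which is what makes a fixed threshold such as 0.9 meaningful regardless of how long the original snippet was.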
For example, you can add two numbers with the addition operator or by using an array and reducing it. They are syntactically different but semantically similar snippets that Semantica matches.
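
Here is roughly what that pair of snippets could look like. The function names are made up for illustration, and the type annotations are only there for this sketch (Semantica itself currently parses plain JS):

```typescript
// Version 1: the addition operator.
function addWithOperator(a: number, b: number): number {
  return a + b;
}

// Version 2: put the operands in an array and reduce it to a sum.
function addWithReduce(a: number, b: number): number {
  return [a, b].reduce((sum, n) => sum + n, 0);
}
```

Both functions return the same result for the same inputs, even though their ASTs look quite different.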