Language: Mojo 🔥
API: MAX Graph
This pipeline demonstrates code completion from an initial prompt using Replit's Code V1.5 3B large language model. The model itself has been constructed end to end in the Mojo language using the MAX Graph API.
The MAX Graph API provides an accessible Mojo interface for constructing flexible accelerated compute graphs, which are then optimized by the MAX Engine's advanced graph compiler. This pipeline showcases how a large language model can be fully defined using Mojo and MAX Graphs and then compiled for optimal inference performance via the MAX Engine.
Replit Code is an open source code generation model trained on permissively licensed code and released by Replit. The V1.5, 3B variant is the basis for this implementation, and weights are obtained via Hugging Face.
- Install MAX:
If MAX is not already installed, follow the installation instructions to set it up on your system.
- Clone the MAX examples repository:
If you don't already have a local clone of this repository, create one via:
git clone https://github.com/modularml/max.git
The following instructions assume that your working directory is this pipeline's directory, which you can change to after cloning:
cd max/examples/graph-api/pipelines/replit/
- Download and convert the model weights:
Before the first execution of the pipeline, the weights need to be downloaded and converted into the format this model expects. The conversion requires PyTorch, which is currently compatible only with Python 3.11 or older on macOS. Running the following script automatically installs PyTorch and all dependencies, then downloads and converts the weights:
source setup.sh
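Because the conversion step depends on PyTorch, it can help to check the interpreter version before running the script. The following is a minimal sketch, assuming only the constraint stated above (Python 3.11 or older on macOS). The function name `pytorch_supported` is hypothetical and is not part of `setup.sh`, and treating other platforms as unrestricted is an assumption.

```python
import platform
import sys


def pytorch_supported(version=None, system=None):
    """Return True if the interpreter satisfies the constraint described above.

    Assumption: the only restriction modelled here is the one stated in this
    README (Python 3.11 or older on macOS). Other platforms are treated as
    unrestricted, which may not hold in practice.
    """
    major, minor = (version or sys.version_info)[:2]
    system = system or platform.system()
    if system == "Darwin":  # platform.system() reports macOS as "Darwin"
        return (major, minor) <= (3, 11)
    return True


if __name__ == "__main__":
    if not pytorch_supported():
        print("Python 3.11 or older is needed for weight conversion on macOS.")
```

Passing an explicit version and platform, as in `pytorch_supported((3, 12), "Darwin")`, makes the check easy to exercise without changing interpreters.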
- Run the code completion demo:
Invoking the pipeline compiles the model graph and then begins code generation from the specified prompt.
All of the pipelines have been configured to use a common driver, located in the directory hosting all MAX Graph examples. Assuming you're starting at the path of this README, the command invocation will look like:
mojo ../../run_pipeline.🔥 replit --prompt 'def hello():\n print("hello world")'
The following command-line options are available to customize operation of the pipeline:
- `--converted-weights-path`: Specifies the path to the converted model weights. (Default value: `.cache/replit/converted`)
- `--prompt`: The text prompt to use for further code generation.
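When scripting several runs, the invocation and its two documented options can be assembled programmatically. This is a sketch only: `build_command` is a hypothetical helper, it assumes the relative driver path shown in the command above, and it only constructs the argument list without running anything.

```python
import shlex

DRIVER = "../../run_pipeline.🔥"  # driver path as given in the command above


def build_command(prompt, converted_weights_path=None):
    """Build the argv list for the replit pipeline (hypothetical helper)."""
    argv = ["mojo", DRIVER, "replit"]
    if converted_weights_path is not None:
        # Omitting the flag leaves the pipeline's default path in effect.
        argv += ["--converted-weights-path", converted_weights_path]
    argv += ["--prompt", prompt]
    return argv


if __name__ == "__main__":
    cmd = build_command('def hello():\\n print("hello world")')
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from this pipeline's directory, which avoids shell quoting issues with prompts that contain quotes or backslashes.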
This isn't an exhaustive list, but here are some ideas for ways in which this pipeline may be extended or improved:
- Replace the SentencePiece tokenizer with one written in Mojo. Currently, the tokenizer is loaded from the `transformers` library via Python interoperability, and it might be useful to have this all in Mojo.
- Incorporate 4-bit quantization.
- Improve the quality of the code generation.
- Identify performance bottlenecks and further tune time-to-first-token and throughput.