Akash AI - Foundation Model (re)Training #300
Replies: 4 comments · 9 replies
-
I think the proposal needs some further clarification to make it consistent. The FAQ discusses LoRA training, but the main proposal suggests training a base SD 1.5 model from scratch, with no prior training data. While the LoRA training goals are admirable, I don't think they offer a compelling narrative: Google offers no-cost LoRA training on T4s, where one can achieve 2000 steps in 15 minutes with no upfront cost, including using Google Drive to store and retrieve datasets. Kohya_SS can be run on consumer hardware and LoRA-train 2000 steps in 2 hours. I think the proposal should focus on the base SD 1.5 model.
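For context, the core of the LoRA technique is small enough to sketch in a few lines of PyTorch. This is illustrative only; trainers like Kohya_SS inject adapters along these lines into SD's attention layers and handle the rest of the training loop:

```python
# A minimal, self-contained sketch of the LoRA idea in plain PyTorch.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A and B are small matrices."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the original weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # A: d_in -> r
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # B: r -> d_out
        nn.init.normal_(self.lora_a.weight, std=0.02)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Only the tiny A/B matrices are trained, which is why a 2000-step run
# fits comfortably on a free T4 or consumer GPU.
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = [p for p in layer.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # ~12k trainable params vs ~590k in the base layer
```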
Just some things I thought of while reading; excited to see where this goes!
-
Perhaps a shot in the dark, but is the state of the ML software stack or ML/AI Ops advanced enough to allow efficiently splitting a dataset into multiple chunks that would be processed by completely different GPU clusters from completely different providers (and then combined)? Demonstrating this could be awesome for the Akash network, since Akash resources will probably be much more fragmented across different providers, unlike huge dedicated AWS DCs.
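To make the question concrete, here is a rough sketch of one way "split and combine" is done in practice, in the style of local SGD / federated averaging. All names here are hypothetical; in practice each shard would be a separate Akash deployment, and whether the averaged result converges acceptably for a large diffusion model is exactly the open part of the question:

```python
# Hypothetical sketch: shard the dataset across clusters, train each shard
# independently, then combine checkpoints by parameter averaging.
import torch

def shard_indices(num_examples: int, num_clusters: int, cluster_id: int):
    """Deterministically assign every example to exactly one cluster."""
    return [i for i in range(num_examples) if i % num_clusters == cluster_id]

def average_state_dicts(state_dicts):
    """Combine per-cluster checkpoints by element-wise parameter averaging."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# Each provider trains on its shard for a while and uploads a checkpoint;
# a coordinator periodically averages the checkpoints and redistributes
# the merged weights back to every cluster.
```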
-
Hey all - thanks to everyone who reviewed and participated in the discussion of this. Sorry it's taken a little longer than planned to get back to this - mainly because @rakataprime, @shimpa, and I have been working with a provider to test the viability of this on a smaller scale before we commit to doing it. I've updated the proposal to reflect the following changes:
Unless there are any immediate objections, this proposal will be going on chain in the next day or two.
-
The proposal is up for voting: https://www.mintscan.io/akash/proposals/234
-
Summary
Background & Context
AI Developer Use Cases
From a compute infrastructure perspective, AI workloads can broadly be broken down into 3 categories: Training, Fine-Tuning, and Inference.
In order for Akash to be attractive to AI developers, we should strive to demonstrate that our platform is capable of handling each of those.
During the recent GPU testnet, the Akash Network community demonstrated that Akash Network can be used to deploy and run Inference on many popular AI models. There are also efforts underway (by Overclock Labs, as well as others in the community) to create embeddings from Akash and Cosmos documentation, passing those to an LLM and demonstrating that the “fine-tuned” LLM is able to respond to questions about the specific data set (like “what is an IP lease?”) with high accuracy (see demos here and here). This demonstrates a very common use case of Fine-Tuning with custom Embeddings.
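As a toy illustration of the retrieval step behind those demos: embed the documentation, find the passage closest to the question, and prepend it to the LLM prompt. TF-IDF stands in for neural embeddings here purely to keep the sketch self-contained; the actual demos use proper embedding models and a vector store:

```python
# Toy retrieval-augmented prompting sketch (documents and question invented
# for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "An IP lease reserves a static IP address for a deployment on Akash.",
    "Providers on Akash bid on deployment orders posted to the blockchain.",
]
question = "What is an IP lease?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)      # embed the documentation
query_vector = vectorizer.transform([question])   # embed the question
best = cosine_similarity(query_vector, doc_vectors).argmax()

prompt = f"Context: {docs[best]}\n\nQuestion: {question}\nAnswer:"
# `prompt` is then sent to the LLM, which answers grounded in the docs.
```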
That leaves us with Training.
AI model training represents a use case that fits perfectly with the current state of Akash Network, because training workloads are inherently batch-oriented, do not require five-9s uptime, and are able to tolerate pauses (through the use of checkpoints). These characteristics make them a much better fit for Akash Network in its current state than complex, large-scale, production web services.
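To make the checkpointing point concrete, here is a minimal save/resume sketch in PyTorch. This is illustrative only; a real trainer would also persist the LR scheduler, RNG state, and EMA weights. If a lease ends mid-run, a new deployment simply picks up from the last saved step:

```python
import os
import torch

CKPT = "checkpoint.pt"

def save_checkpoint(model, optimizer, step):
    # Persist everything needed to resume training exactly where it stopped.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, CKPT)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0  # fresh start
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]  # resume from the saved step
```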
State of Current Generative AI Models
Stable Diffusion is arguably the most popular text-to-image model out there today. It was trained on the LAION-5B dataset, consisting of 5.85 billion publicly available images scraped from the internet. This is the case for other foundation models as well. The obvious challenge is that creators who want to use these models (and the applications built on them) run the risk of violating copyright law. Here are some examples of where that has already happened:
https://www.theverge.com/2023/1/17/23558516/ai-art-copyright-stable-diffusion-getty-images-lawsuit
https://www.artnews.com/art-in-america/features/midjourney-ai-art-image-generators-lawsuit-1234665579/
https://www.theverge.com/2023/7/9/23788741/sarah-silverman-openai-meta-chatgpt-llama-copyright-infringement-chatbots-artificial-intelligence-ai
Proposal
We propose retraining Stable Diffusion (likely SD version 1.5) using a Creative Commons dataset: specifically, an image dataset consisting of 100M Creative Commons 0 (CC0) images. This will be accomplished by the Akash community (and the Overclock team) through a partnership with Thumper AI (see the "About Thumper AI" section of this proposal). Akash Network will provide the compute resources necessary for retraining this model, and Thumper AI will invest development resources to code, train, and open source the model and the code, at no cost to Akash Network. The result will be a retrained Stable Diffusion model that people can use without risk of running into copyright issues. The model will be hosted on Hugging Face - the de facto repository for AI models (aka "GitHub for AI") - and the code will be open sourced on GitHub under /akash-network.
We estimate the resource requirements at about 24,000 Nvidia A100 (80GB) GPU-hours. This will be achieved by setting up a single Akash provider with a cluster of A100s. Depending on the size of the cluster, we expect to have a trained model anywhere from roughly a week (128-GPU cluster) to about four months (8-GPU cluster).
The cost for this is expected to be about $48,000, or roughly $2 per A100-hour. This includes the cost of leasing the right-sized cluster of Nvidia A100 GPUs as well as incentivizing an Akash community provider, who will manage the cluster through this exercise.
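The back-of-the-envelope math behind these estimates, assuming perfectly linear scaling (real runs lose some time to I/O and inter-GPU communication):

```python
GPU_HOURS = 24_000          # total A100 (80GB) GPU-hours estimated above
BUDGET = 48_000             # USD

for gpus in (128, 8):
    days = GPU_HOURS / gpus / 24
    print(f"{gpus:>3} GPUs -> {days:5.1f} days of wall-clock training")
# 128 GPUs ->   7.8 days  (~1 week)
#   8 GPUs -> 125.0 days  (~4 months)

print(f"${BUDGET / GPU_HOURS:.2f} per A100-hour")  # $2.00 per A100-hour
```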
Benefits to Akash Network and its community
Tentative Timeline
Note: This is subject to change based on discussions, as well as the chosen cluster size, etc.
About Thumper AI
Thumper AI is a startup building a fairly-sourced generative AI image content platform and community model marketplace using LoRAs and other parameter-efficient fine-tuned models. The Thumper AI family of image tools includes an AI QR code generator (cuterqr.com) that launched on 6/28/23, a LoRA training tool (loratrainer.com) launching 9/1/23, and a community model marketplace called lorastation (lorastation.com) launching 10/1/23. Co-founder and CEO Logan Cerkovnik leads a team of 7 at Thumper, including 4 developers, with over 60 years of combined machine learning and software development experience across the team.
Logan has also been an Akash community member for a while and worked closely with the Overclock team in building the integrated Torchbench and Jupyter Notebook solution (https://github.com/akash-network/awesome-akash/tree/master/torchbench) that was used to benchmark GPU performance during the testnet.
FAQs
Why SD and not SDXL?
While SDXL will result in much better images (Midjourney quality), LoRA training with SDXL would be a lot more computationally intensive. We estimate that the compute resource requirements would be 5x those of SD.
If this first effort is successful, SDXL LoRA training could be a subsequent follow-on effort.
Why SD1.5 (and not the latest, SD2.1)?
SD1.5 has more LoRAs today than SD2.1, which wasn't as popular due to its increased VRAM requirements; SD2.1 would also be more expensive to train, from a resource perspective.
If this first effort is successful, SD2.1 could be a subsequent follow-on effort. Note that using SD1.5 as a base does mean there is a risk that the quality of the images may not be the best, but the goal of this exercise is more to demonstrate that LoRA training can be conducted.
What is the data set specifically?
The dataset comprises images from Wikimedia and stock image websites that license their images under CC0.
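An illustrative sketch of the kind of license filter such a dataset needs. The field names and records here are hypothetical; Wikimedia and each stock-image site have their own metadata schemas:

```python
# Keep only CC0-licensed records in the training manifest; everything else
# is dropped so the resulting model is free of copyright entanglements.
records = [
    {"url": "https://example.org/a.jpg", "license": "CC0"},
    {"url": "https://example.org/b.jpg", "license": "CC-BY-SA-4.0"},
]

cc0_only = [r for r in records if r["license"].upper() == "CC0"]
print(len(cc0_only))  # 1
```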
How confident are we that we won’t exceed the resource requirements above?
Very high. This confidence comes from referencing a similar exercise conducted by MosaicML (https://www.mosaicml.com/blog/stable-diffusion-2), where they used their Composer library to train a Stable Diffusion 2 model on a subset of the LAION-5B data (with 800M images).
What Akash provider(s) will be used for this?
Since this is funded by a community grant, we recommend that the provider running this be someone from the community (and not Overclock Labs): someone with a track record of maintaining an Akash provider for a long time with consistent uptime, who has also been a good technical resource in helping others become providers on Akash. As such, we propose that the Akash provider be managed by Shimpa, who has been a long-time Akash community member and a very active Akash Insider and Vanguard, produces training content regularly, and, most importantly, has operated a reliable provider on Akash Network (Europlots) for a long time.
What will the cluster configuration be like?
Something comparable to an AWS P4d instance type (ml.p4d.24xlarge, with 8x A100s, 96 vCPUs, 1152 GB RAM, and 400 Gbps bandwidth) is likely what we will go with.
Where will the actual hardware come from?
The hardware will be leased from a datacenter operator. We are currently conducting a proof-of-concept and performance test with a specific datacenter operator to determine if it will meet our needs.