Akash AI - Foundation Model (re)Training #300
Replies: 4 comments · 9 replies
-
I think the proposal needs some further clarification to make it consistent. The FAQ discusses LoRA training, but the main proposal suggests training a base SD 1.5 model from scratch, with no prior training data. While the LoRA training goals are admirable, I don't think they offer a compelling narrative: Google offers no-cost LoRA training on T4s, where one can achieve 2000 steps in 15 minutes with no upfront cost, including using Google Drive to store and retrieve datasets. Kohya_SS can be run on consumer hardware and LoRA-train 2000 steps in 2 hours. I think the proposal should focus on the base SD 1.5 model.
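For context, the core of the LoRA technique is small enough to sketch in a few lines of PyTorch. This is illustrative only; trainers like Kohya_SS inject adapters along these lines into SD's attention layers and handle the rest of the training loop:

```python
# A minimal, self-contained sketch of the LoRA idea in plain PyTorch.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A and B are small matrices."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the original weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # A: d_in -> r
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # B: r -> d_out
        nn.init.normal_(self.lora_a.weight, std=0.02)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Only the tiny A/B matrices are trained, which is why a 2000-step run
# fits comfortably on a free T4 or consumer GPU.
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = [p for p in layer.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # ~12k trainable params vs ~590k in the base layer
```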
Just some things I thought of while reading; excited to see where this goes!
-
Perhaps a shot in the dark, but is the state of the ML software stack or ML/AI Ops advanced enough to allow efficiently splitting a dataset into multiple chunks that would be processed by completely different GPU clusters from completely different providers (and then combined)? Demonstrating this could be awesome for the Akash network, since Akash resources will probably be much more fragmented across different providers, unlike huge dedicated AWS DCs.
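To make the question concrete, here is a rough sketch of one way "split and combine" is done in practice, in the style of local SGD / federated averaging. All names here are hypothetical; in practice each shard would be a separate Akash deployment, and whether the averaged result converges acceptably for a large diffusion model is exactly the open part of the question:

```python
# Hypothetical sketch: shard the dataset across clusters, train each shard
# independently, then combine checkpoints by parameter averaging.
import torch

def shard_indices(num_examples: int, num_clusters: int, cluster_id: int):
    """Deterministically assign every example to exactly one cluster."""
    return [i for i in range(num_examples) if i % num_clusters == cluster_id]

def average_state_dicts(state_dicts):
    """Combine per-cluster checkpoints by element-wise parameter averaging."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# Each provider trains on its shard for a while and uploads a checkpoint;
# a coordinator periodically averages the checkpoints and redistributes
# the merged weights back to every cluster.
```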
-
Hey all - thanks to everyone who reviewed and participated in the discussion of this. Sorry it's taken a little longer than planned to get back to this - mainly because @rakataprime, @shimpa, and I have been working with a provider to test the viability of this on a smaller scale before we commit to doing it. I've updated the proposal to reflect the following changes:
Unless there are any immediate objections, this proposal will be going on chain in the next day or two.
-
The proposal is up for voting: https://www.mintscan.io/akash/proposals/234
-
Summary
Background & Context
AI Developer Use Cases
From a compute infrastructure perspective, AI workloads can broadly be broken down into 3 categories: Training, Fine-Tuning, and Inference.
In order for Akash to be attractive to AI developers, we should strive to demonstrate that our platform is capable of handling each of those.
During the recent GPU testnet, the Akash Network community demonstrated that Akash Network can be used to deploy and run Inference on many popular AI models. There are also efforts underway (by Overclock Labs, as well as others in the community) to create embeddings from Akash and Cosmos documentation, passing those to an LLM and demonstrating that the “fine-tuned” LLM is able to respond to questions about the specific data set (like “what is an IP lease?”) with high accuracy (see demos here and here). This demonstrates a very common use case of Fine-Tuning with custom Embeddings.
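As a toy illustration of the retrieval step behind those demos: embed the documentation, find the passage closest to the question, and prepend it to the LLM prompt. TF-IDF stands in for neural embeddings here purely to keep the sketch self-contained; the actual demos use proper embedding models and a vector store:

```python
# Toy retrieval-augmented prompting sketch (documents and question invented
# for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "An IP lease reserves a static IP address for a deployment on Akash.",
    "Providers on Akash bid on deployment orders posted to the blockchain.",
]
question = "What is an IP lease?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)      # embed the documentation
query_vector = vectorizer.transform([question])   # embed the question
best = cosine_similarity(query_vector, doc_vectors).argmax()

prompt = f"Context: {docs[best]}\n\nQuestion: {question}\nAnswer:"
# `prompt` is then sent to the LLM, which answers grounded in the docs.
```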
That leaves us with Training.
AI model training represents a use case that fits perfectly with the current state of Akash Network, because training workloads are inherently batch-oriented, do not require five-9s uptime, and are able to tolerate pauses (through the use of checkpoints). These characteristics make them a much better fit for Akash Network in its current state than complex, large-scale, production web services.
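To make the checkpointing point concrete, here is a minimal save/resume sketch in PyTorch. This is illustrative only; a real trainer would also persist the LR scheduler, RNG state, and EMA weights. If a lease ends mid-run, a new deployment simply picks up from the last saved step:

```python
import os
import torch

CKPT = "checkpoint.pt"

def save_checkpoint(model, optimizer, step):
    # Persist everything needed to resume training exactly where it stopped.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, CKPT)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0  # fresh start
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]  # resume from the saved step
```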
State of Current Generative AI Models
Stable Diffusion is arguably the most popular text-to-image model out there today. It was trained on the LAION-5B dataset, consisting of 5.85 billion publicly available images scraped from the internet. This is the case for other foundation models as well. The obvious challenge is that creators who want to use these models (and the applications built on them) run the risk of violating copyright law. Here are some examples of where that has already happened:
https://www.theverge.com/2023/1/17/23558516/ai-art-copyright-stable-diffusion-getty-images-lawsuit
https://www.artnews.com/art-in-america/features/midjourney-ai-art-image-generators-lawsuit-1234665579/
https://www.theverge.com/2023/7/9/23788741/sarah-silverman-openai-meta-chatgpt-llama-copyright-infringement-chatbots-artificial-intelligence-ai
Proposal
We propose retraining Stable Diffusion (likely SD version 1.5) using a Creative Commons dataset: specifically, an image dataset consisting of 100M Creative Commons 0 (CC0) images. This will be accomplished by the Akash community (and the Overclock team) through a partnership with Thumper AI (see the "About Thumper AI" section of this proposal). Akash Network will provide the compute resources necessary for retraining this model, and Thumper AI will invest development resources to code, train, and open source the model and the code, at no cost to Akash Network. The result will be a retrained Stable Diffusion model that people can use without risk of running into copyright issues. The model will be hosted on Hugging Face - the de facto repository for AI models (aka "GitHub for AI") - and the code will be open sourced on GitHub under /akash-network.
We estimate the resource requirements at about 24,000 Nvidia A100 (80GB) GPU-hours. This will be achieved by setting up a single Akash provider with a cluster of A100s. Depending on the size of the cluster, we expect to have a trained model anywhere from roughly a week (128-GPU cluster) to about four months (8-GPU cluster).
The cost for this is expected to be about $48,000, or roughly $2 per A100-hour. This includes the cost of leasing the right-sized cluster of Nvidia A100 GPUs as well as incentivizing an Akash community provider, who will manage the cluster through this exercise.
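The back-of-the-envelope math behind these estimates, assuming perfectly linear scaling (real runs lose some time to I/O and inter-GPU communication):

```python
GPU_HOURS = 24_000          # total A100 (80GB) GPU-hours estimated above
BUDGET = 48_000             # USD

for gpus in (128, 8):
    days = GPU_HOURS / gpus / 24
    print(f"{gpus:>3} GPUs -> {days:5.1f} days of wall-clock training")
# 128 GPUs ->   7.8 days  (~1 week)
#   8 GPUs -> 125.0 days  (~4 months)

print(f"${BUDGET / GPU_HOURS:.2f} per A100-hour")  # $2.00 per A100-hour
```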
Benefits to Akash Network and its community
Tentative Timeline
Note: This is subject to change based on discussions, as well as the chosen cluster size, etc.
About Thumper AI
Thumper AI is a startup building a fairly-sourced generative AI image content platform and community model marketplace using LoRAs and other parameter-efficient fine-tuned models. The Thumper AI family of image tools includes an AI QR code generator (cuterqr.com) that launched on 6/28/23, a LoRA training tool (loratrainer.com) launching 9/1/23, and a community model marketplace called lorastation (lorastation.com) launching 10/1/23. Co-founder and CEO Logan Cerkovnik leads a team of 7 at Thumper, including 4 developers, with over 60 years of combined machine learning and software development experience across the team.
Logan has also been an Akash community member for a while and worked closely with the Overclock team in building the integrated Torchbench and Jupyter Notebook solution (https://github.com/akash-network/awesome-akash/tree/master/torchbench) that was used to benchmark GPU performance during the testnet.
FAQs
Why SD and not SDXL?
While SDXL will result in much better images (Midjourney quality), LoRA training with SDXL would be a lot more computationally intensive. We estimate that the compute resource requirements would be 5x those of SD.
If this first effort is successful, SDXL LoRA training could be a subsequent follow-on effort.
Why SD1.5 (and not the latest, SD2.1)?
SD1.5 has more LoRAs today than SD2.1, which wasn't as popular due to its increased VRAM requirements; SD2.1 would also be more expensive to train, from a resource perspective.
If this first effort is successful, SD2.1 could be a subsequent follow-on effort. Note that using SD1.5 as a base does mean there is a risk that the quality of the images may not be the best, but the goal of this exercise is more to demonstrate that LoRA training can be conducted.
What is the data set specifically?
The dataset comprises images from Wikimedia and stock image websites that license their images under CC0.
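An illustrative sketch of the kind of license filter such a dataset needs. The field names and records here are hypothetical; Wikimedia and each stock-image site have their own metadata schemas:

```python
# Keep only CC0-licensed records in the training manifest; everything else
# is dropped so the resulting model is free of copyright entanglements.
records = [
    {"url": "https://example.org/a.jpg", "license": "CC0"},
    {"url": "https://example.org/b.jpg", "license": "CC-BY-SA-4.0"},
]

cc0_only = [r for r in records if r["license"].upper() == "CC0"]
print(len(cc0_only))  # 1
```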
How confident are we that we won’t exceed the resource requirements above?
Very high. This confidence comes from referencing a similar exercise conducted by MosaicML (https://www.mosaicml.com/blog/stable-diffusion-2), where they used their Composer library to train a Stable Diffusion 2 model on a subset of the LAION-5B data (with 800M images).
What Akash provider(s) will be used for this?
Since this is funded by a community grant, we recommend that the provider running this be someone from the community (and not Overclock Labs): someone with a track record of maintaining an Akash provider for a long time with consistent uptime, who has also been a good technical resource in helping others become providers on Akash. As such, we propose that the Akash provider be managed by Shimpa, who has been a long-time Akash community member and a very active Akash Insider and Vanguard, produces training content regularly, and, most importantly, has operated a reliable provider on Akash Network (Europlots) for a long time.
What will the cluster configuration be like?
Something comparable to an AWS P4d instance type (ml.p4d.24xlarge, with 8x A100s, 96 vCPUs, 1152 GB RAM, and 400 Gbps bandwidth) is likely what we will go with.
Where will the actual hardware come from?
The hardware will be leased from a datacenter operator. We are currently conducting a proof-of-concept and performance test with a specific datacenter operator to determine if it will meet our needs.