The NeMo Multimodal Collection supports a diverse range of multimodal models for tasks including text-to-image generation, text-to-NeRF synthesis, multimodal language modeling, and foundational vision-language modeling. Wherever feasible, the collection reuses existing modules from other NeMo collections, such as LLM and Vision, avoiding redundant implementations. The models currently supported within the multimodal collection are:
- Foundation Vision-Language Models:
  - CLIP
- Foundation Text-to-Image Generation:
  - Stable Diffusion
  - Imagen
- Customizable Text-to-Image Models:
  - SD-LoRA
  - SD-ControlNet
  - SD-Instruct pix2pix
- Multimodal Language Models:
  - NeVA
  - LLaVA
- Text-to-NeRF Synthesis:
  - DreamFusion++
- NSFW Detection Support
Our documentation offers comprehensive insights into each supported model, facilitating seamless integration and utilization within your projects.
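As a quick orientation, models in this collection follow the same `ModelPT` checkpoint conventions as the rest of NeMo, so a trained model is typically restored from a `.nemo` archive via `restore_from`. The sketch below is illustrative only: the import path, the class name `MegatronCLIPModel`, and the checkpoint path are assumptions, and some Megatron-based models additionally expect a PyTorch Lightning `Trainer` at restore time; consult the per-model pages for the exact recipe.

```python
# Illustrative sketch only. The import path, class name, and checkpoint path
# below are assumptions; check the per-model documentation for exact values.
from nemo.collections.multimodal.models.vision_language_foundation.clip.megatron_clip_models import (
    MegatronCLIPModel,  # assumed location of the CLIP implementation
)

# Standard NeMo ModelPT pattern: restore a trained model from a .nemo archive.
# Note: some Megatron-based models also require `trainer=` to be passed here.
clip_model = MegatronCLIPModel.restore_from(restore_path="path/to/clip.nemo")
clip_model.eval()  # switch to inference mode before scoring image/text pairs
```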