The NeMo Multimodal Collection supports a diverse range of multimodal models for tasks including text-to-image generation, text-to-NeRF synthesis, multimodal language modeling, and foundational vision-language modeling. Wherever feasible, the collection reuses existing modules from other NeMo collections, such as LLM and Vision, avoiding redundant implementations. The models currently supported within the multimodal collection are:
- Foundation Vision-Language Models:
  - CLIP
- Foundation Text-to-Image Generation:
  - Stable Diffusion
  - Imagen
- Customizable Text-to-Image Models:
  - SD-LoRA
  - SD-ControlNet
  - SD-Instruct pix2pix
- Multimodal Language Models:
  - NeVA
  - LLaVA
- Text-to-NeRF Synthesis:
  - DreamFusion++
- NSFW Detection Support
Our documentation offers comprehensive insights into each supported model, facilitating seamless integration and utilization within your projects.
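As a quick orientation, models in this collection follow the same `ModelPT` checkpoint conventions as the rest of NeMo, so a trained model is typically restored from a `.nemo` archive via `restore_from`. The sketch below is illustrative only: the import path, the class name `MegatronCLIPModel`, and the checkpoint path are assumptions, and some Megatron-based models additionally expect a PyTorch Lightning `Trainer` at restore time; consult the per-model pages for the exact recipe.

```python
# Illustrative sketch only. The import path, class name, and checkpoint path
# below are assumptions; check the per-model documentation for exact values.
from nemo.collections.multimodal.models.vision_language_foundation.clip.megatron_clip_models import (
    MegatronCLIPModel,  # assumed location of the CLIP implementation
)

# Standard NeMo ModelPT pattern: restore a trained model from a .nemo archive.
# Note: some Megatron-based models also require `trainer=` to be passed here.
clip_model = MegatronCLIPModel.restore_from(restore_path="path/to/clip.nemo")
clip_model.eval()  # switch to inference mode before scoring image/text pairs
```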