OmegaViT: A State-of-the-Art Vision Transformer with Multi-Query Attention, State Space Modeling, and Mixture of Experts

OmegaViT (ΩViT) is a vision transformer architecture that combines multi-query attention, rotary embeddings, state space modeling, and mixture of experts for computer vision tasks. It is designed to process images of any resolution while keeping computation efficient.

Key Features

  • Flexible Resolution Processing: Handles arbitrary input image sizes through adaptive patch embedding
  • Multi-Query Attention (MQA): Shares key and value projections across query heads, reducing attention memory and compute while maintaining model expressiveness
  • Rotary Embeddings: Enables better modeling of relative positions and spatial relationships
  • State Space Models (SSM): Uses a state space layer as the sequence mixer in every third block
  • Mixture of Experts (MoE): Implements conditional computation for enhanced model capacity
  • Comprehensive Logging: Built-in loguru integration for detailed execution tracking
  • Shape-Aware Design: Continuous tensor shape tracking for reliable processing
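To make "conditional computation" concrete, here is a minimal pure-Python sketch of top-1 routing with a per-expert capacity limit. It is an illustration only: the README does not describe OmegaViT's actual router, so the function name `route_top1`, the top-1 choice, and the overflow behaviour (a token whose expert is full skips the expert layer) are all assumptions.

```python
from typing import List, Optional


def route_top1(scores: List[List[float]], capacity: int) -> List[Optional[int]]:
    """Assign each token to its highest-scoring expert, respecting capacity.

    scores[t][e] is the router score of token t for expert e.
    Returns one expert index per token, or None when the chosen expert is full.
    Hypothetical helper, not part of the omegavit package.
    """
    num_experts = len(scores[0])
    load = [0] * num_experts  # tokens accepted by each expert so far
    assignment: List[Optional[int]] = []
    for token_scores in scores:
        expert = max(range(num_experts), key=lambda e: token_scores[e])
        if load[expert] < capacity:
            load[expert] += 1
            assignment.append(expert)
        else:
            assignment.append(None)  # overflow: token skips the expert layer
    return assignment


scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.2, 0.8],
]
assignment = route_top1(scores, capacity=2)
print(assignment)  # [0, 0, None, 1]
```

Each token runs through at most one expert, so adding experts adds parameters without adding per-token compute. The capacity limit bounds how much work any single expert receives.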

Architecture

flowchart TB
    subgraph Input
        img[Input Image]
    end
    
    subgraph PatchEmbed[Flexible Patch Embedding]
        conv[Convolution]
        norm1[LayerNorm]
        conv --> norm1
    end
    
    subgraph TransformerBlocks[Transformer Blocks x12]
        subgraph Block1[Block n]
            direction TB
            mqa[Multi-Query Attention]
            ln1[LayerNorm]
            moe1[Mixture of Experts]
            ln2[LayerNorm]
            ln1 --> mqa --> ln2 --> moe1
        end
        
        subgraph Block2[Block n+1]
            direction TB
            mqa2[Multi-Query Attention]
            ln3[LayerNorm]
            moe2[Mixture of Experts]
            ln4[LayerNorm]
            ln3 --> mqa2 --> ln4 --> moe2
        end
        
        subgraph Block3[Block n+2 SSM]
            direction TB
            ssm[State Space Model]
            ln5[LayerNorm]
            moe3[Mixture of Experts]
            ln6[LayerNorm]
            ln5 --> ssm --> ln6 --> moe3
        end
    end
    
    subgraph Output
        gap[Global Average Pooling]
        classifier[Classification Head]
    end
    
    img --> PatchEmbed --> TransformerBlocks --> gap --> classifier
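
The block pattern in the diagram (two attention blocks followed by one SSM block) can be sketched as a simple schedule. This assumes the pattern starts with an attention block and repeats over all 12 layers. `block_types` is a hypothetical helper, not part of the package.

```python
from typing import List


def block_types(num_layers: int = 12, ssm_every: int = 3) -> List[str]:
    """Return the token mixer used by each block, in order.

    Assumption: every `ssm_every`-th block (counting from 1) uses the SSM,
    and all other blocks use multi-query attention.
    """
    return [
        "ssm" if (i + 1) % ssm_every == 0 else "mqa"
        for i in range(num_layers)
    ]


schedule = block_types()
print(schedule[:6])           # ['mqa', 'mqa', 'ssm', 'mqa', 'mqa', 'ssm']
print(schedule.count("ssm"))  # 4
```

With the default of 12 layers, this gives 8 attention blocks and 4 SSM blocks, each followed by a mixture-of-experts layer as in the diagram.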

Multi-Query Attention Detail

flowchart LR
    input[Input Features]
    
    subgraph MQA[Multi-Query Attention]
        direction TB
        q[Q Linear]
        k[K Linear]
        v[V Linear]
        rotary[Rotary Embeddings]
        attn[Attention Weights]
        
        input --> q & k & v
        q & k --> rotary
        rotary --> attn
        attn --> v
    end
    
    MQA --> output[Output Features]
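
As a rough illustration of what multi-query attention saves, the sketch below counts the weights in the K and V projections for the default configuration (hidden size 768, 12 heads). It assumes the standard MQA formulation, in which all query heads share one key head and one value head. The README does not confirm the exact projection shapes used in OmegaViT, and `kv_projection_weights` is a hypothetical helper.

```python
def kv_projection_weights(hidden_size: int, num_heads: int, shared_kv: bool) -> int:
    """Count weights in the K and V linear projections (biases ignored).

    shared_kv=True models multi-query attention: a single K/V head.
    shared_kv=False models standard multi-head attention.
    """
    head_dim = hidden_size // num_heads
    kv_heads = 1 if shared_kv else num_heads
    return 2 * hidden_size * head_dim * kv_heads


mha = kv_projection_weights(768, 12, shared_kv=False)
mqa = kv_projection_weights(768, 12, shared_kv=True)
print(mha, mqa, mha // mqa)  # 1179648 98304 12
```

Under these assumptions the K and V projections shrink by a factor equal to the number of heads, and the keys and values kept per token shrink by the same factor.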


Installation

pip install omegavit

Quick Start

import sys
from omegavit.main import create_advanced_vit, train_step
import torch
from loguru import logger

def main():
    """Main training function."""
    logger.info("Starting training setup")

    # Setup
    device = torch.device(
        "cuda" if torch.cuda.is_available() else "cpu"
    )
    model = create_advanced_vit().to(device)
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=1e-4, weight_decay=0.05
    )

    # Example input for testing
    batch_size = 8
    example_input = torch.randn(batch_size, 3, 224, 224).to(device)
    example_labels = torch.randint(0, 1000, (batch_size,)).to(device)

    logger.info("Running forward pass with example input")
    output = model(example_input)
    logger.info(f"Output shape: {output.shape}")

    # Example training step
    loss = train_step(
        model, optimizer, (example_input, example_labels), device
    )
    logger.info(f"Example training step loss: {loss:.4f}")


if __name__ == "__main__":
    # Configure logger
    logger.remove()
    logger.add(
        "advanced_vit.log",
        rotation="500 MB",
        level="DEBUG",
        format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}",
    )
    logger.add(sys.stdout, level="INFO")

    main()

Model Configurations

| Parameter           | Default | Description                      |
|---------------------|---------|----------------------------------|
| hidden_size         | 768     | Dimension of transformer layers  |
| num_attention_heads | 12      | Number of attention heads        |
| num_experts         | 8       | Number of expert networks in MoE |
| expert_capacity     | 32      | Tokens per expert in MoE         |
| num_layers          | 12      | Number of transformer blocks     |
| patch_size          | 16      | Size of image patches            |
| ssm_state_size      | 16      | Hidden state size in SSM         |
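
To get a feel for these defaults, the following sketch computes how many tokens an image produces and how many expert slots the MoE defaults provide. It assumes non-overlapping patches with no padding, so pixels that do not fill a whole patch are dropped. The README does not say whether `expert_capacity` applies per image or per batch. Both helper functions are hypothetical.

```python
def num_patches(height: int, width: int, patch_size: int = 16) -> int:
    """Tokens from a non-overlapping patch grid (assumes stride == patch_size, no padding)."""
    return (height // patch_size) * (width // patch_size)


def total_expert_slots(num_experts: int = 8, expert_capacity: int = 32) -> int:
    """Total tokens the experts can accept if each one is filled to capacity."""
    return num_experts * expert_capacity


for h, w in [(224, 224), (384, 256), (230, 230)]:
    print((h, w), num_patches(h, w))
# (224, 224) 196
# (384, 256) 384
# (230, 230) 196
print(total_expert_slots())  # 256
```

A 224×224 image yields 196 tokens, which fits within the 256 slots of the default MoE settings. A 384×256 image yields 384 tokens, so under a per-image capacity some tokens would overflow unless the capacity is raised.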

Performance

Note: Benchmarks coming soon

Citation

If you use OmegaViT in your research, please cite:

@article{omegavit2024,
  title={OmegaViT: A State-of-the-Art Vision Transformer with Multi-Query Attention, State Space Modeling, and Mixture of Experts},
  author={Agora Lab},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}

Contributing

We welcome contributions! Please see our contributing guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Special thanks to the Agora Lab AI team and the open-source community for their valuable contributions and feedback.