This project implements a simplified version of Google DeepMind's OPRO (Optimization by PROmpting) framework, introduced in the paper "Large Language Models as Optimizers" and adapted here to optimize prompts for computer science questions from the MMLU dataset.
The original paper (Google DeepMind, 2024) presents OPRO as a novel approach to using LLMs for optimization tasks. Key aspects include:
- Natural Language Optimization: OPRO enables optimization through natural language descriptions rather than formal specifications.
- Meta-Prompt Structure: Uses previous solutions and their scores to guide the optimization process (see the sketch after this list).
- Exploration-Exploitation Balance: Manages the trade-off between exploring new solutions and exploiting known good solutions.
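To make the meta-prompt idea concrete, a builder in the spirit of the paper might look like the sketch below. This is a minimal illustration, not the paper's verbatim template; the prompt wording and the `build_meta_prompt` name are assumptions. The paper orders past solutions by ascending score so the strongest ones appear last.

```python
def build_meta_prompt(history):
    """Turn (instruction, score) pairs into an OPRO-style meta-prompt."""
    lines = ["Below are previous instructions and their accuracy scores:"]
    # List solutions worst-to-best so the model sees the strongest last
    for instruction, score in sorted(history, key=lambda pair: pair[1]):
        lines.append(f"Instruction: {instruction}\nScore: {score:.2f}")
    lines.append("Write a new instruction that is different from the ones "
                 "above and achieves a higher score.")
    return "\n\n".join(lines)
```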
- Configuration (`OptimizationConfig`)

```python
max_steps: int = 150          # Maximum optimization steps
solutions_per_step: int = 8   # Solutions generated per step
max_history: int = 20         # Max number of previous solutions to keep
temperature: float = 1.0      # Temperature for generation
token_weight: float = 0.3     # Weight for token length in scoring
max_tokens: int               # Maximum tokens allowed in prompt (configurable, no fixed default)
```
- Scoring Mechanism

The implementation uses a weighted scoring formula (a worked example follows below):

```
combined_score = (1 - token_weight) * accuracy + token_weight * token_score
token_score    = 1 - (token_count / max_tokens)
```

This balances:
- Solution accuracy (70% weight by default)
- Token efficiency (30% weight by default)
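As a quick sanity check, here is the formula in plain Python; the `max_tokens` default of 150 is an illustrative assumption, since the real value is configurable:

```python
def combined_score(accuracy: float, token_count: int,
                   max_tokens: int = 150, token_weight: float = 0.3) -> float:
    """Blend accuracy with token efficiency using the weighted formula above."""
    token_score = 1 - (token_count / max_tokens)
    return (1 - token_weight) * accuracy + token_weight * token_score

# A 45-token prompt at 0.85 accuracy: 0.7 * 0.85 + 0.3 * 0.7 ≈ 0.805
print(combined_score(accuracy=0.85, token_count=45))
```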
- Key Classes
  - `TokenManager`: Handles token counting and limits (sketched below)
  - `MMluDataHandler`: Manages MMLU dataset operations
  - `Scorer`: Evaluates solutions using the OpenAI API
  - `OptimizerEngine`: Core optimization logic
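For illustration, a minimal `TokenManager` could wrap tiktoken as follows; the method names here are assumptions rather than the exact interface:

```python
import tiktoken

class TokenManager:
    """Counts tokens and enforces the prompt length limit (illustrative)."""

    def __init__(self, max_tokens: int, encoding_name: str = "cl100k_base"):
        self.max_tokens = max_tokens
        self.encoding = tiktoken.get_encoding(encoding_name)

    def count(self, text: str) -> int:
        return len(self.encoding.encode(text))

    def within_limit(self, text: str) -> bool:
        return self.count(text) <= self.max_tokens
```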
- Data Preparation (sketched below)
  - Load MMLU computer science questions
  - Split into train/test sets
  - Sample questions for evaluation
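A minimal sketch of this step, assuming the CSV layout described under the data format section below (the function name and 80/20 split are assumptions):

```python
import pandas as pd

def prepare_data(csv_path: str, train_frac: float = 0.8, seed: int = 42):
    """Load MMLU CS questions and split into train/test sets (illustrative)."""
    df = pd.read_csv(csv_path)  # columns: question, A, B, C, D, answer
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test
```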
- Optimization Process (sketched below)
  - Generate a meta-prompt from previous solutions
  - Create new candidate solutions
  - Evaluate solutions for accuracy and token efficiency
  - Update the optimization history
  - Repeat until convergence or max steps
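At a high level, the loop could look like the sketch below, reusing `build_meta_prompt` from the earlier sketch. `generate_solutions` and `scorer.score` are placeholders for the LLM call and the evaluation step, not the actual API:

```python
def optimize(config, data_handler, scorer, generate_solutions):
    """OPRO loop: propose candidates, score them, keep the best (illustrative)."""
    history = []  # (instruction, combined_score) pairs
    best = None
    for _ in range(config.max_steps):
        # Only the most recent max_history solutions go into the meta-prompt
        meta_prompt = build_meta_prompt(history[-config.max_history:])
        candidates = generate_solutions(meta_prompt,
                                        n=config.solutions_per_step,
                                        temperature=config.temperature)
        for instruction in candidates:
            score = scorer.score(instruction, data_handler)
            history.append((instruction, score))
            if best is None or score > best[1]:
                best = (instruction, score)
    return best
```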
- Solution Evaluation (sketched below)
  - Calculate accuracy using the OpenAI API, or a Llama model via Groq
  - Count tokens using tiktoken
  - Compute the combined score
  - Track the best solutions
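Accuracy evaluation might look roughly like this, using the OpenAI chat completions API; the grading prompt and model choice are assumptions:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def accuracy(instruction: str, questions: list[dict]) -> float:
    """Fraction of sampled MMLU questions answered correctly under `instruction`."""
    correct = 0
    for q in questions:
        prompt = (f"{instruction}\n\n{q['question']}\n"
                  f"A) {q['A']}\nB) {q['B']}\nC) {q['C']}\nD) {q['D']}\n"
                  "Answer with a single letter.")
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content.strip().upper()
        if answer.startswith(q["answer"]):
            correct += 1
    return correct / len(questions)
```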
```bash
pip install openai pandas numpy tiktoken tqdm
export OPENAI_API_KEY='your-api-key'
```
The MMLU CSV file should contain the following columns:
- `question`: Question text
- `A`, `B`, `C`, `D`: Multiple-choice options
- `answer`: Correct answer (A, B, C, or D)
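For example, a single (illustrative) row:

```csv
question,A,B,C,D,answer
"What is the worst-case time complexity of binary search?",O(n),O(log n),O(n log n),O(1),B
```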
```python
# Initialize configuration
config = OptimizationConfig()

# Set up the data handler
data_handler = MMluDataHandler("path_to_mmlu_cs_data.csv")
data_handler.prepare_data()

# Initialize the optimizer
optimizer = OptimizerEngine(config)

# Run the optimization
results = optimizer.optimize(data_handler, config.max_steps)
```
Modify `token_weight` in `OptimizationConfig`:
- Higher values (>0.3) prioritize token efficiency
- Lower values (<0.3) prioritize accuracy

Other tuning knobs:
- Adjust `temperature` for the exploration/exploitation balance
- Modify `solutions_per_step` for optimization stability
- Change `max_history` for memory management
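For example, assuming `OptimizationConfig` is a dataclass that accepts these fields as keyword arguments, the defaults can be overridden at construction time:

```python
config = OptimizationConfig(
    token_weight=0.5,       # favor shorter prompts more aggressively
    temperature=1.2,        # explore more diverse candidate instructions
    solutions_per_step=4,   # fewer candidates per step
    max_history=10,         # smaller meta-prompt
)
```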
The optimization process produces:
- The best instruction found
- Accuracy metrics
- Token efficiency metrics
- Combined performance scores
Results are saved to a timestamped JSON file:

```json
{
  "steps": [...],
  "best_solution": {
    "instruction": "...",
    "accuracy": 0.85,
    "token_count": 45,
    "combined_score": 0.78
  },
  "best_score": 0.78
}
```
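To inspect a saved run afterwards (the filename is illustrative, since the actual name includes a timestamp):

```python
import json

with open("results_20240101_120000.json") as f:
    results = json.load(f)

best = results["best_solution"]
print(best["instruction"], best["combined_score"])
```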
- API Costs: Evaluation relies on paid OpenAI API calls
- Rate Limits: Account for API rate limits in the optimization process; a simple backoff wrapper is sketched below
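One common mitigation is exponential backoff around each API call; a minimal sketch, with arbitrary retry parameters:

```python
import time

def with_backoff(call, retries: int = 5, base_delay: float = 1.0):
    """Retry `call` with exponential backoff (e.g. on openai.RateLimitError)."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            time.sleep(base_delay * 2 ** attempt)
    return call()  # final attempt; let any error propagate
```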
Potential improvements:
- Support for multiple LLM providers
- Advanced token optimization strategies
- Multi-objective optimization approaches
- Benchmarking and evaluation of this implementation
- Adding a tokenizer for Llama models (tiktoken only supports GPT-family models, not Llama)
- Google DeepMind (2024). "Large Language Models as Optimizers." arXiv:2309.03409.
Feel free to drop your feedback at hjawajiwar@gmail.com.