Usage Guide¶
This guide walks through the main features and usage patterns of createllm: preparing data, configuring and training a model, and generating text.
Basic Usage¶
1. Prepare Your Data¶
First, prepare your training data in a text file:
# my_training_data.txt
This is your training data.
It can contain multiple lines.
The model will learn from this text.
2. Initialize the Text Processor¶
from createllm import TextFileProcessor
# Initialize processor with your data file
processor = TextFileProcessor("my_training_data.txt")
# Read the text file
text = processor.read_file()
# Tokenize the text
train_data, val_data, vocab_size, encode, decode = processor.tokenize(text)
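The `encode` and `decode` callables returned by `tokenize` map between strings and lists of token ids. This guide does not say which tokenization scheme createllm uses, so the following is only a minimal sketch that assumes a character-level vocabulary. The helper `build_char_tokenizer` and the `*_demo` names are hypothetical and not part of createllm.

```python
# Hypothetical character-level tokenizer, for illustration only.
# createllm's actual tokenization scheme may differ.
def build_char_tokenizer(text):
    chars = sorted(set(text))                      # unique characters form the vocabulary
    stoi = {ch: i for i, ch in enumerate(chars)}   # character -> token id
    itos = {i: ch for ch, i in stoi.items()}       # token id -> character

    def encode(s):
        return [stoi[ch] for ch in s]

    def decode(ids):
        return "".join(itos[i] for i in ids)

    return len(chars), encode, decode


vocab_size_demo, encode_demo, decode_demo = build_char_tokenizer("hello world")
ids = encode_demo("hello")
roundtrip = decode_demo(ids)
```

Whatever scheme the library uses, `decode(encode(s))` should return `s` for text drawn from the training vocabulary, which is a quick sanity check after tokenizing.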
3. Configure Your Model¶
from createllm import ModelConfig
config = ModelConfig(
    vocab_size=vocab_size,  # From tokenization
    n_embd=384,            # Embedding dimension
    block_size=256,        # Context window size
    n_layer=4,             # Number of transformer layers
    n_head=4,              # Number of attention heads
    dropout=0.2            # Dropout rate
)
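In most GPT-style implementations, the embedding dimension is split evenly across attention heads, so `n_embd` should be divisible by `n_head`. Whether createllm enforces this is an assumption; the hypothetical helper below simply checks the arithmetic before you build a config.

```python
# Hypothetical pre-flight check, not part of createllm.
def head_size(n_embd, n_head):
    """Return the per-head dimension, or raise if the split is uneven."""
    if n_embd % n_head != 0:
        raise ValueError(
            f"n_embd={n_embd} is not divisible by n_head={n_head}"
        )
    return n_embd // n_head


# The configuration above gives each of the 4 heads 96 dimensions.
per_head = head_size(384, 4)
```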
4. Create and Train the Model¶
from createllm import GPTLanguageModel, GPTTrainer
# Initialize the model
model = GPTLanguageModel(config)
print(f"Model initialized with {model.n_params / 1e6:.2f}M parameters")
# Initialize the trainer
trainer = GPTTrainer(
    model=model,
    train_data=train_data,
    val_data=val_data,
    config=config,
    learning_rate=3e-4,
    batch_size=64,
    gradient_clip=1.0,
    warmup_steps=1000
)
# Train the model
trainer.train(max_epochs=5, save_dir='checkpoints')
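The `warmup_steps` argument suggests the learning rate is ramped up at the start of training. The exact schedule `GPTTrainer` uses is not documented here; the sketch below assumes a plain linear warmup to `learning_rate` followed by a constant rate. The function `warmup_lr` is hypothetical.

```python
# Hypothetical linear warmup schedule; GPTTrainer's real schedule may differ
# (for example, it may decay the rate after warmup).
def warmup_lr(step, base_lr=3e-4, warmup_steps=1000):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Scale linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    return base_lr


first_lr = warmup_lr(0)      # very small rate on the first step
final_lr = warmup_lr(5000)   # full rate once warmup has finished
```

If your dataset yields fewer than `warmup_steps` optimizer steps in total, a schedule like this never reaches the full rate, so consider lowering `warmup_steps` for small datasets.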
5. Generate Text¶
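import torch
# Select a device and move the model to it
# (assumes GPTLanguageModel is a torch.nn.Module)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()  # Disable dropout during generation
# Encode the prompt as a batch containing one sequence
context = "Once upon a time"
context_tokens = encode(context)
context_tensor = torch.tensor([context_tokens], dtype=torch.long).to(device)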
generated = model.generate(
    context_tensor,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.2
)
# Decode and print the generated text
generated_text = decode(generated[0].tolist())
print(f"\nGenerated text:\n{generated_text}")
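To make the `temperature`, `top_k`, and `top_p` arguments concrete, here is a minimal pure-Python sketch of how these three controls are commonly combined when picking the next token. It is not createllm's implementation; the function `sampling_distribution` is hypothetical, and the order of the filtering steps is an assumption.

```python
import math
import random


# Hypothetical illustration of temperature, top-k, and top-p filtering.
def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_id: probability} after applying the three controls."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = {i: math.exp(x - peak) for i, x in enumerate(scaled)}
    ranked = sorted(weights, key=weights.get, reverse=True)

    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        total = sum(weights[i] for i in ranked)
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += weights[i] / total
            if cumulative >= top_p:
                break
        ranked = kept

    # Renormalize over the surviving tokens.
    total = sum(weights[i] for i in ranked)
    return {i: weights[i] / total for i in ranked}


demo_logits = [2.0, 1.0, 0.0, -1.0]
dist = sampling_distribution(demo_logits, temperature=0.8, top_k=2)
rng = random.Random(0)
next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
```

Lowering `temperature` or tightening `top_k` and `top_p` concentrates probability on fewer tokens, which trades variety for predictability.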
Advanced Usage¶
1. Custom Model Configuration¶
You can customize the model architecture based on your needs:
# Larger model with more capacity (needs more data, memory, and training time)
config = ModelConfig(
    vocab_size=vocab_size,
    n_embd=768,      # Larger embedding dimension
    block_size=512,  # Longer context window
    n_layer=8,       # More transformer layers
    n_head=8,        # More attention heads
    dropout=0.1      # Lower dropout for larger models
)
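Before committing to a larger configuration, it helps to estimate how many parameters it implies. The sketch below uses a common rule of thumb for GPT-style transformers: roughly `12 * n_embd**2` weights per layer (attention plus a 4x-wide feed-forward block), plus token and position embeddings. It assumes createllm follows that standard layout, ignores biases, layer norms, and the output head, and uses a made-up `vocab_size` of 65. The function `approx_params` is hypothetical.

```python
# Hypothetical estimate based on the standard GPT block layout;
# createllm's real parameter count (model.n_params) may differ.
def approx_params(vocab_size, n_embd, block_size, n_layer):
    per_layer = 12 * n_embd ** 2                    # attention (4x) + feed-forward (8x)
    embeddings = (vocab_size + block_size) * n_embd  # token + position tables
    return n_layer * per_layer + embeddings


small = approx_params(vocab_size=65, n_embd=384, block_size=256, n_layer=4)
large = approx_params(vocab_size=65, n_embd=768, block_size=512, n_layer=8)
print(f"small: {small / 1e6:.1f}M, large: {large / 1e6:.1f}M")
```

Under these assumptions, doubling both `n_embd` and `n_layer` multiplies the layer weights by about eight, so expect memory use and training time to grow accordingly.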
2. Advanced Training Options¶
Customize the training process:
trainer = GPTTrainer(
    model=model,
    train_data=train_data,
    val_data=val_data,
    config=config,
    learning_rate=3e-4,
    batch_size=32,     # Smaller batches reduce peak memory use
    gradient_clip=1.0, # Clip gradients to limit exploding updates
    warmup_steps=1000  # Ramp the learning rate up over the first 1000 steps
)
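Gradient clipping is usually done by rescaling all gradients when their combined norm exceeds a threshold. The following is a minimal pure-Python sketch of that idea on a flat list of numbers. It assumes `gradient_clip` is a maximum global norm, which this guide does not confirm, and `clip_by_global_norm` is a hypothetical helper.

```python
import math


# Hypothetical illustration of norm-based gradient clipping.
# Assumes gradient_clip is a maximum L2 norm; createllm may clip differently.
def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale grads so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm       # already within bounds, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads], norm


clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Clipping preserves the direction of the update and only shrinks its length, which is why it can stabilize training without changing which way the parameters move.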
3. Advanced Text Generation¶
Control text generation with various parameters:
generated = model.generate(
    context_tensor,
    max_new_tokens=200,      # Generate more tokens
    temperature=0.7,         # Lower temperature for more focused output
    top_k=40,               # Limit sampling to top 40 tokens
    top_p=0.95,             # Nucleus sampling threshold
    repetition_penalty=1.5   # Stronger penalty for repetition
)
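A repetition penalty typically lowers the scores of tokens that have already been generated. One widely used formulation divides positive logits by the penalty and multiplies negative logits by it, so the token becomes less likely in both cases. Whether createllm uses exactly this rule is an assumption, and `apply_repetition_penalty` is a hypothetical helper.

```python
# Hypothetical illustration of a common repetition-penalty rule.
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink positive scores
        else:
            adjusted[token_id] *= penalty   # push negative scores further down
    return adjusted


penalized = apply_repetition_penalty([3.0, -1.5, 0.5], [0, 1, 1], penalty=1.5)
```

A penalty of 1.0 leaves the logits unchanged. Values well above 1 can suppress words that legitimately need to repeat, such as articles and names, so increase it gradually.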
4. Model Checkpointing¶
Save and load model checkpoints:
# Save model
model.save_model("checkpoints/model.pt")
# Load model
model.load_model("checkpoints/model.pt")
Best Practices¶
1. Data Preparation¶
- Clean your training data thoroughly 
- Remove irrelevant content 
- Ensure consistent formatting 
- Consider data augmentation for small datasets 
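As a starting point for the cleaning steps above, the sketch below collapses runs of spaces and tabs, trims each line, and drops empty lines. The function `clean_text` is hypothetical and is not part of createllm; real datasets usually need additional, domain-specific rules.

```python
import re


# Hypothetical cleaning helper; adapt the rules to your own data.
def clean_text(raw):
    """Normalize whitespace and remove blank lines."""
    lines = []
    for line in raw.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()  # collapse spaces and tabs
        if line:                                      # skip lines that are now empty
            lines.append(line)
    return "\n".join(lines)


cleaned = clean_text("  Hello   world \n\n\tSecond\tline\n")
```

You could write the result to a new file and pass that path to `TextFileProcessor`, keeping the raw file untouched for comparison.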
2. Model Configuration¶
- Start with a smaller model for quick experiments 
- Increase model size gradually 
- Monitor validation loss to prevent overfitting 
- Use appropriate dropout rates 
3. Training Process¶
- Use learning rate warmup 
- Monitor training and validation losses 
- Save best model checkpoints 
- Use early stopping if needed 
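Early stopping can be tracked with a few lines of bookkeeping around validation loss. This guide does not show a per-epoch hook on `GPTTrainer`, so the sketch below is standalone: the `EarlyStopping` class is hypothetical, and how you obtain each validation loss is left to your own training loop.

```python
# Hypothetical early-stopping tracker, independent of createllm.
class EarlyStopping:
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate([1.0, 0.8, 0.85, 0.9, 0.95]):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

Pairing this with a checkpoint saved whenever `best` improves gives you the best model even if training runs a few epochs past the optimum.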
4. Text Generation¶
- Experiment with different temperature values 
- Use top-k and top-p sampling for better quality 
- Adjust repetition penalty based on output quality 
- Consider using beam search for better coherence 
Example Use Cases¶
1. Domain-Specific Documentation¶
# Train on technical documentation
processor = TextFileProcessor("technical_docs.txt")
text = processor.read_file()
train_data, val_data, vocab_size, encode, decode = processor.tokenize(text)
config = ModelConfig(vocab_size=vocab_size)
model = GPTLanguageModel(config)
trainer = GPTTrainer(model, train_data, val_data, config)
trainer.train(max_epochs=5)
2. Custom Writing Style¶
# Train on a specific author's works
processor = TextFileProcessor("author_works.txt")
text = processor.read_file()
train_data, val_data, vocab_size, encode, decode = processor.tokenize(text)
config = ModelConfig(vocab_size=vocab_size)
model = GPTLanguageModel(config)
trainer = GPTTrainer(model, train_data, val_data, config)
trainer.train(max_epochs=5)
Troubleshooting¶
Common Issues and Solutions¶
- Out of Memory Errors
  * Reduce batch size
  * Use gradient checkpointing
  * Reduce model size
  * Use mixed precision training
- Poor Generation Quality
  * Increase training data size
  * Adjust temperature and sampling parameters
  * Train for more epochs
  * Use larger model architecture
- Training Instability
  * Adjust learning rate
  * Use gradient clipping
  * Increase warmup steps
  * Check data quality
Getting Help¶
If you need help:
- Check the GitHub issues 
- Open a new issue with:
  - Your code
  - Error messages
  - Expected behavior
  - Actual behavior