Usage Guide¶
This guide will walk you through the main features and usage patterns of createllm.
Basic Usage¶
1. Prepare Your Data¶
First, prepare your training data in a text file:
# my_training_data.txt
This is your training data.
It can contain multiple lines.
The model will learn from this text.
2. Initialize the Text Processor¶
from createllm import TextFileProcessor
# Initialize processor with your data file
processor = TextFileProcessor("my_training_data.txt")
# Read the text file
text = processor.read_file()
# Tokenize the text
train_data, val_data, vocab_size, encode, decode = processor.tokenize(text)
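Before configuring a model, it can help to sanity-check the tokenizer. The snippet below only uses the encode/decode functions and vocab_size returned above, and assumes the sample characters appear in your training text.
# Optional sanity check: encode/decode should round-trip
sample = "hello world"
ids = encode(sample)
print(f"Vocabulary size: {vocab_size}")
print(f"Encoded: {ids}")
print(f"Decoded: {decode(ids)}")  # should print "hello world"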
3. Configure Your Model¶
from createllm import ModelConfig
config = ModelConfig(
    vocab_size=vocab_size,  # From tokenization
    n_embd=384,             # Embedding dimension
    block_size=256,         # Context window size
    n_layer=4,              # Number of transformer layers
    n_head=4,               # Number of attention heads
    dropout=0.2             # Dropout rate
)
4. Create and Train the Model¶
from createllm import GPTLanguageModel, GPTTrainer
# Initialize the model
model = GPTLanguageModel(config)
print(f"Model initialized with {model.n_params / 1e6:.2f}M parameters")
# Initialize the trainer
trainer = GPTTrainer(
    model=model,
    train_data=train_data,
    val_data=val_data,
    config=config,
    learning_rate=3e-4,
    batch_size=64,
    gradient_clip=1.0,
    warmup_steps=1000
)
# Train the model
trainer.train(max_epochs=5, save_dir='checkpoints')
5. Generate Text¶
import torch

# Select a device and prepare the model for inference
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Encode the prompt
context = "Once upon a time"
context_tokens = encode(context)
context_tensor = torch.tensor([context_tokens], dtype=torch.long).to(device)

# Generate text
generated = model.generate(
    context_tensor,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.2
)

# Decode and print the generated text
generated_text = decode(generated[0].tolist())
print(f"\nGenerated text:\n{generated_text}")
Advanced Usage¶
1. Custom Model Configuration¶
You can customize the model architecture based on your needs:
# Larger model for better understanding
config = ModelConfig(
    vocab_size=vocab_size,
    n_embd=768,      # Larger embedding dimension
    block_size=512,  # Longer context window
    n_layer=8,       # More transformer layers
    n_head=8,        # More attention heads
    dropout=0.1      # Lower dropout for larger models
)
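To see how these choices affect model size before training, a rough back-of-the-envelope count can help. The sketch below assumes a standard GPT layout and that ModelConfig exposes its fields as attributes; the exact figure reported by model.n_params may differ (for example, if the embedding and output weights are tied).
# Rough parameter estimate (illustrative; assumes a standard GPT layout,
# not necessarily createllm's exact architecture)
def approx_params(cfg):
    embeddings = (cfg.vocab_size + cfg.block_size) * cfg.n_embd  # token + position embeddings
    per_layer = 12 * cfg.n_embd ** 2                             # attention + feed-forward weights
    lm_head = cfg.n_embd * cfg.vocab_size                        # output projection (0 if tied)
    return embeddings + cfg.n_layer * per_layer + lm_head

print(f"Estimated size: ~{approx_params(config) / 1e6:.1f}M parameters")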
2. Advanced Training Options¶
Customize the training process:
trainer = GPTTrainer(
    model=model,
    train_data=train_data,
    val_data=val_data,
    config=config,
    learning_rate=3e-4,
    batch_size=32,      # Smaller batch size for memory efficiency
    gradient_clip=1.0,  # Prevent gradient explosion
    warmup_steps=1000   # Learning rate warmup
)
3. Advanced Text Generation¶
Control text generation with various parameters:
generated = model.generate(
    context_tensor,
    max_new_tokens=200,     # Generate more tokens
    temperature=0.7,        # Lower temperature for more focused output
    top_k=40,               # Limit sampling to the top 40 tokens
    top_p=0.95,             # Nucleus sampling threshold
    repetition_penalty=1.5  # Stronger penalty for repetition
)
4. Model Checkpointing¶
Save and load model checkpoints:
# Save model
model.save_model("checkpoints/model.pt")
# Load model
model.load_model("checkpoints/model.pt")
Best Practices¶
1. Data Preparation¶
Clean your training data thoroughly (a minimal cleaning sketch follows this list)
Remove irrelevant content
Ensure consistent formatting
Consider data augmentation for small datasets
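As a starting point, a minimal cleaning pass might look like the sketch below. The file names are placeholders, and the rules (collapsing whitespace, dropping empty lines) should be adapted to your data.
import re

# Minimal cleaning pass (illustrative; file names are placeholders)
with open("raw_data.txt", encoding="utf-8") as f:
    lines = f.readlines()

cleaned = []
for line in lines:
    line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
    if line:                                  # drop empty lines
        cleaned.append(line)

with open("my_training_data.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(cleaned))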
2. Model Configuration¶
Start with a smaller model for quick experiments (see the sample configuration below)
Increase model size gradually
Monitor validation loss to prevent overfitting
Use appropriate dropout rates
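For a quick first experiment, a configuration along these lines keeps training fast; the values are suggestions for a small run, not createllm defaults.
# Small "smoke test" configuration (suggested values, not library defaults)
small_config = ModelConfig(
    vocab_size=vocab_size,
    n_embd=128,      # small embedding dimension
    block_size=128,  # short context window
    n_layer=2,       # few transformer layers
    n_head=2,        # few attention heads
    dropout=0.1
)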
3. Training Process¶
Use learning rate warmup
Monitor training and validation losses
Save best model checkpoints
Use early stopping if needed (a generic sketch follows this list)
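If the trainer does not stop early for you, a patience-based check can be wrapped around training. The sketch below is generic: train_one_epoch() and evaluate_val_loss() are hypothetical placeholders for one epoch of training and for validation evaluation, not createllm APIs; adapt the loop to GPTTrainer's actual interface.
# Generic early-stopping loop (illustrative; train_one_epoch and
# evaluate_val_loss are hypothetical placeholders, not createllm APIs)
max_epochs = 20
patience, bad_epochs = 3, 0
best_val_loss = float("inf")

for epoch in range(max_epochs):
    train_one_epoch(model)               # placeholder: one epoch of training
    val_loss = evaluate_val_loss(model)  # placeholder: validation loss

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        model.save_model("checkpoints/best_model.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Stopping early at epoch {epoch}")
            break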
4. Text Generation¶
Experiment with different temperature values (see the sweep sketch after this list)
Use top-k and top-p sampling for better quality
Adjust repetition penalty based on output quality
Consider using beam search for better coherence
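A practical way to tune these settings is to sweep a few temperatures on the same prompt and compare the outputs by eye. This sketch reuses the generate() parameters and the context_tensor from the generation example above.
# Compare a few temperatures on the same prompt (illustrative)
for temp in (0.5, 0.8, 1.1):
    out = model.generate(
        context_tensor,
        max_new_tokens=80,
        temperature=temp,
        top_k=50,
        top_p=0.9,
        repetition_penalty=1.2
    )
    print(f"\n--- temperature={temp} ---")
    print(decode(out[0].tolist()))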
Example Use Cases¶
1. Domain-Specific Documentation¶
# Train on technical documentation
processor = TextFileProcessor("technical_docs.txt")
text = processor.read_file()
train_data, val_data, vocab_size, encode, decode = processor.tokenize(text)
config = ModelConfig(vocab_size=vocab_size)
model = GPTLanguageModel(config)
trainer = GPTTrainer(model, train_data, val_data, config)
trainer.train(max_epochs=5)
2. Custom Writing Style¶
# Train on specific author's works
processor = TextFileProcessor("author_works.txt")
text = processor.read_file()
train_data, val_data, vocab_size, encode, decode = processor.tokenize(text)
config = ModelConfig(vocab_size=vocab_size)
model = GPTLanguageModel(config)
trainer = GPTTrainer(model, train_data, val_data, config)
trainer.train(max_epochs=5)
Troubleshooting¶
Common Issues and Solutions¶
Out of Memory Errors
* Reduce batch size
* Use gradient checkpointing
* Reduce model size
* Use mixed precision training (see the AMP sketch after this list)
Poor Generation Quality
* Increase training data size
* Adjust temperature and sampling parameters
* Train for more epochs
* Use a larger model architecture
Training Instability
* Adjust learning rate
* Use gradient clipping
* Increase warmup steps
* Check data quality
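If GPTTrainer does not expose mixed precision directly, a hand-rolled training step with PyTorch automatic mixed precision looks roughly like the sketch below. It is illustrative only: get_batch() is a hypothetical batch sampler, and the (logits, loss) return value is an assumption about the model's forward pass, not a documented createllm interface.
import torch

# Mixed-precision training step with PyTorch AMP (illustrative sketch)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()
num_steps = 1000

for step in range(num_steps):
    xb, yb = get_batch(train_data)    # hypothetical batch sampler
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        logits, loss = model(xb, yb)  # assumed (logits, loss) return signature
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    scaler.step(optimizer)
    scaler.update()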
Getting Help¶
If you need help:
Check the GitHub issues
Open a new issue with:
- Your code
- Error messages
- Expected behavior
- Actual behavior