Welcome to createllm
A Python package that lets you create and train your own Large Language Models (LLMs) from scratch on custom datasets. It provides a simplified approach to building, training, and deploying language models tailored to specific domains or use cases.
Features
🔨 Build LLMs from scratch using your own text data
🚀 Efficient training with OneCycleLR scheduler
📊 Real-time training progress tracking with tqdm
🎛️ Configurable model architecture
💾 Easy model checkpointing and loading
🎯 Advanced text generation with temperature, top-k, and top-p sampling
📈 Built-in validation and early stopping
🔄 Automatic device selection (CPU/GPU)
Quick Start
from createllm import ModelConfig, TextFileProcessor, GPTLanguageModel, GPTTrainer

# Initialize the text processor with your data file
processor = TextFileProcessor("my_training_data.txt")
text = processor.read_file()

# Tokenize the text and build train/validation splits
train_data, val_data, vocab_size, encode, decode = processor.tokenize(text)

# Create the model configuration
config = ModelConfig(
    vocab_size=vocab_size,
    n_embd=384,      # Embedding dimension
    block_size=256,  # Context window size
    n_layer=4,       # Number of transformer layers
    n_head=4,        # Number of attention heads
    dropout=0.2,     # Dropout rate
)

# Initialize the model
model = GPTLanguageModel(config)
print(f"Model initialized with {model.n_params / 1e6:.2f}M parameters")

# Initialize the trainer
trainer = GPTTrainer(
    model=model,
    train_data=train_data,
    val_data=val_data,
    config=config,
    learning_rate=3e-4,
    batch_size=64,
    gradient_clip=1.0,
    warmup_steps=1000,
)

# Train the model
trainer.train(max_epochs=5, save_dir="checkpoints")
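Once training finishes, you can sample from the model. The exact generation API is not shown on this page, so the sketch below assumes the model exposes a generate method whose temperature, top_k, and top_p parameters correspond to the sampling features listed above; check the API reference for the actual signature.

import torch

# Sample from the trained model. NOTE: the generate() method name and its
# parameters are assumptions based on the feature list, not a documented API.
model.eval()
prompt_ids = torch.tensor([encode("Once upon a time")], dtype=torch.long)

with torch.no_grad():
    output_ids = model.generate(
        prompt_ids,
        max_new_tokens=200,  # how many tokens to sample
        temperature=0.8,     # <1.0 sharpens, >1.0 flattens the distribution
        top_k=50,            # keep only the 50 most likely next tokens
        top_p=0.9,           # nucleus sampling cutoff
    )

print(decode(output_ids[0].tolist()))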
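To reuse a trained model later, rebuild the architecture from the same ModelConfig and restore the saved weights. The checkpoint filename and on-disk format below are assumptions (trainer.train writes to save_dir, but the exact layout is not documented here), so adjust them to match what GPTTrainer actually saves.

import torch

# Rebuild the model with the same configuration used for training.
model = GPTLanguageModel(config)

# "checkpoints/best_model.pt" is a placeholder filename -- check save_dir
# for the file GPTTrainer actually wrote.
state = torch.load("checkpoints/best_model.pt", map_location="cpu")

# Some trainers save a bare state_dict, others wrap it; unwrap if needed.
if isinstance(state, dict) and "model_state_dict" in state:
    state = state["model_state_dict"]

model.load_state_dict(state)
model.eval()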
License & Documentation
Free software: MIT License
Documentation: https://khushaljethava.github.io/createllm
Source code: https://github.com/khushaljethava/createllm