To understand how AI and Large Language Models (LLMs) work, and how I could integrate a more tailored version of a model into a project, I decided to dive in and give it a go.
I'd like to stress that this is just me playing around, trying to understand it.
For this, I'll be fine-tuning a GPT-2 model on a document, which will then sit behind an API so an app can query the model and get an answer back. This is purely for my own education, so there's every chance I'm doing it all wrong.
Setting up the Environment
First, we need to set up the environment. Make sure Python 3 and pip are installed, then create and activate a virtual environment.
python3 -m venv ai-env
source ai-env/bin/activate
Once the environment is created and activated, we can install the dependencies.
pip install torch torchvision torchaudio accelerate transformers datasets
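Before moving on, it's worth a quick sanity check that the main libraries import cleanly. This one-liner is just my own check, not a required step:
python -c "import torch, transformers, datasets; print(torch.__version__, transformers.__version__, datasets.__version__)"
If that prints three version numbers, the environment is good to go.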
In our main.py file:
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

##################
# Prepare the data
##################

# Load the document in plain text
with open("my-doc.txt", "r") as file:
    text = file.read()

# Split into chunks or lines
data = [{"text": chunk} for chunk in text.split("\n\n")]

# Create Huggingface Dataset and push to the Hub
dataset = Dataset.from_dict({"text": [d["text"] for d in data]})
dataset.push_to_hub("{username}/{dataset-location}")

###################
# Tokenize the data
###################

model_name = "gpt2"  # Or another LLM compatible with MLX
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))  # Resizes embedding layer


def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",  # This triggers padding
        max_length=512,
        return_tensors="pt",
    )
    # Set labels for causal LM
    tokenized["labels"] = tokenized["input_ids"].clone()
    return tokenized


# Apply to dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"]
)

# Create an 80-20 train-test split
train_test_split = tokenized_dataset.train_test_split(test_size=0.2)
train_dataset = train_test_split["train"]
eval_dataset = train_test_split["test"]

# Training arguments
training_args = TrainingArguments(
    output_dir="./finetuned_model",
    evaluation_strategy="epoch",  # Keep evaluation
    num_train_epochs=10,
    per_device_train_batch_size=1,
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=5e-5,
    push_to_hub=False,
    fp16=False,
)

# Initialize Trainer with eval_dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # Required for evaluation
)

tokenizer.save_pretrained("./finetuned_model")

# Fine-tune the model
trainer.train()
trainer.save_model("./finetuned_model")

# To implement
# convert_to_ggml("./finetuned_model", "./finetuned_model.ggml")

model = AutoModelForCausalLM.from_pretrained("./finetuned_model")
tokenizer = AutoTokenizer.from_pretrained("./finetuned_model")

# Add prompt text and generate an output to check results
input_text = "Subject to this Act and the regulations"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Output
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
Running this gives me a trained model (how accurate it is, I have no idea yet!)... but it's a start, and next I'll be Dockerizing it, just because I can!
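And since the whole point mentioned at the start was to query this model from an app through an API, here's a rough sketch of how that serving layer could look. This is an assumption on my part rather than anything final: it uses FastAPI (which would need pip install fastapi uvicorn on top of the earlier dependencies), a hypothetical app.py, and a made-up /generate endpoint.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

# Load the fine-tuned model and tokenizer saved by the training script
model = AutoModelForCausalLM.from_pretrained("./finetuned_model")
tokenizer = AutoTokenizer.from_pretrained("./finetuned_model")
model.eval()


class Query(BaseModel):
    prompt: str
    max_new_tokens: int = 100


@app.post("/generate")
def generate(query: Query):
    # Tokenize the prompt and generate a continuation,
    # same as the quick test at the end of main.py
    input_ids = tokenizer.encode(query.prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=query.max_new_tokens)
    return {"response": tokenizer.decode(output[0], skip_special_tokens=True)}
Run it with uvicorn app:app and the app can then POST a JSON body like {"prompt": "Subject to this Act and the regulations"} to /generate and get the generated text back.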