Coding

Playing with the Training and Fine-tuning of Large Language Models

Posted on Mon, Mar 24, 2025


To understand how AI and Large Language Models (LLMs) work, and how I could integrate a more specialised version of a model into a project, I decided to dive in and give it a go.

I'd like to mention that this is only me playing around, trying to understand it.

For this, I will be fine-tuning a GPT-2 model on a document; the resulting model will then sit behind an API so an app can query it and get an answer. This is only for my educational use, so there's every chance I'm doing it all wrong.

Setting up the Environment

First, we need to set up the environment. Make sure Python and pip are installed, then create and activate a virtual environment:

python3 -m venv ai-env
source ai-env/bin/activate

Once the environment is created and activated, we need to install the dependencies:

pip install torch torchvision torchaudio accelerate transformers datasets
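
If you want a quick sanity check that the install worked, something like this does the trick:

python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"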

In our main.py file:

import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

##################
# Prepare the data
##################

# Load the document in plain text
with open("my-doc.txt", "r") as file:
    text = file.read()

# Split into chunks or lines
data = [{"text": chunk} for chunk in text.split("\n\n")]

# Create Huggingface Dataset and push to the Hub
dataset = Dataset.from_dict({"text": [d["text"] for d in data]})
dataset.push_to_hub("{username}/{dataset-location}")
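# Note: push_to_hub needs you to be authenticated with Hugging Face first.
# A minimal sketch, assuming you already have an access token (the token
# value below is a placeholder):
#
#     from huggingface_hub import login
#     login(token="hf_...")
#
# Alternatively, run `huggingface-cli login` once from the shell.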


###################
# Tokenize the data
###################

model_name = "gpt2"  # Or another LLM compatible with MLX
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

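# GPT-2 has no padding token by default, so we add one here and then resize
# the embedding layer so the model's vocabulary matches the tokenizer's.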
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))  # Resizes embedding layer

def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",  # This triggers padding
        max_length=512,
        return_tensors="pt",
    )
    # Set labels for causal LM (the model shifts them internally, so a copy of input_ids is all that's needed)
    tokenized["labels"] = tokenized["input_ids"].clone()
    return tokenized


# Apply to dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"]
)

# Create an 80-20 train-test split
train_test_split = tokenized_dataset.train_test_split(test_size=0.2)
train_dataset = train_test_split["train"]
eval_dataset = train_test_split["test"]

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./finetuned_model",
    evaluation_strategy="epoch",  # Evaluate at the end of each epoch
    num_train_epochs=10,
    per_device_train_batch_size=1,
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=5e-5,
    push_to_hub=False,
    fp16=False,
)

# Initialize Trainer with eval_dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # Required for evaluation
)

tokenizer.save_pretrained("./finetuned_model")

# Fine-tune the model
trainer.train()
trainer.save_model("./finetuned_model")

# To implement
# convert_to_ggml("./finetuned_model", "./finetuned_model.ggml")

model = AutoModelForCausalLM.from_pretrained("./finetuned_model")
tokenizer = AutoTokenizer.from_pretrained("./finetuned_model")

# Add prompt text and generate an output to check results
input_text = "Subject to this Act and the regulations"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Output
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
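
With all of that saved in main.py, the script runs in the usual way from inside the virtual environment:

python main.py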

Running this gives me a trained model (how accurate it is, I have no idea yet!)... But it's a start, and next I'll be Dockerizing it, just because I can!
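
As a teaser for the API side mentioned at the start, here's a rough sketch of how the fine-tuned model could be served over HTTP. This is just my assumption of how it might look: it uses FastAPI and uvicorn (neither installed above, so add them with pip install fastapi uvicorn), and the serve.py file name and /generate endpoint are placeholders of my own.

# serve.py - a minimal sketch, separate from main.py
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

# Load the fine-tuned model and tokenizer once, at startup
model = AutoModelForCausalLM.from_pretrained("./finetuned_model")
tokenizer = AutoTokenizer.from_pretrained("./finetuned_model")


class Query(BaseModel):
    prompt: str


@app.post("/generate")
def generate(query: Query):
    # Tokenize the prompt, generate a continuation, and return the decoded text
    input_ids = tokenizer.encode(query.prompt, return_tensors="pt")
    output = model.generate(input_ids, max_new_tokens=100)
    return {"answer": tokenizer.decode(output[0], skip_special_tokens=True)}

Running it with uvicorn serve:app then lets the app POST a prompt to /generate and get the model's answer back.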