Train LLMs to Talk Like You on Social Media, Using Consumer Hardware
Use your own comments on social media to fine-tune an LLM, and run all fine-tuning on (relatively) inexpensive hardware.
The largest, best LLMs use vast hardware resources to run inference. Even a single LLM instance, for one of the largest models, requires hardware not typically accessible to regular users, unless they spend significant money on cloud infrastructure. And training these models requires even greater resources than inference.
The largest LLMs can be distilled and quantized — these smaller versions could run inference on consumer hardware, like a gaming PC. But training or even just fine-tuning one of these scaled-down models will still require very significant hardware resources — at least, that is, if you plan to fine-tune them in full precision.
However, if you drop the requirement for full-precision, full fine-tuning, the process becomes doable on hardware you might already have, like a good gaming PC. Let’s see how we might accomplish this.
Why Fine-Tuning?
Why would you do all this, instead of just using an off-the-shelf model? After all, the way an LLM responds, and the information it puts in the answers, can be changed without fine-tuning. You could adjust the system prompt, which will change the model’s behavior to some extent. You could use RAG (Retrieval-Augmented Generation), which plugs the LLM into an external knowledge base and extends its knowledge.
But these techniques have limitations. The system prompt can only change the model’s behavior so much. And some models (specifically Gemma 3) are not even trained with a system prompt (in this case, you can prefix the first user prompt with your “system” prompt). RAG may add extra latency, which might not be desirable in some cases.
If these factors become limiting, you may have to consider fine-tuning, which (put simply) is a bit of extra training you give the model, using your own dataset. With fine-tuning, your LLM will start answering more in the style of your dataset, and will learn facts that only your dataset has.
Keep in mind, fine-tuning itself is not a universal solution. It can change the style and the format of the answers quite radically, and this is its main strength. But in terms of knowledge, a small- or medium-sized LLM only has so much storage room in its weights, so don’t expect to cram a truck full of encyclopedias in your LLM like this. RAG may still be useful in extending the model’s knowledge. Of course, you can always mix and match different techniques. Finally, the fine-tuning process described here is not appropriate for reasoning models.
Which Models?
Do not expect to be able to fine-tune a large-sized LLM that has hundreds of billions of weights on a gaming PC, even with these techniques. But a model with dozens of billions of weights should be doable. The examples we will look at are Gemma 3 27B, from Google, and Llama 3.1 8B, from Meta. That’s the ballpark that’s doable on consumer hardware.
What Hardware?
If you have a good gaming PC, you’re already off to a good start. But fine-tuning happens on a GPU, and the main limiting factor is the amount of VRAM the GPU has. Put simply, the more VRAM, the better. Get the biggest GPU you can afford. Compute is important, but memory is more important (in terms of size and bandwidth).
I did all this work on an RTX 3090, which is what I have in my multimedia/entertainment/gaming/machine learning PC. This GPU has 24 GB of VRAM and close to 936 GB/s bandwidth, which is sufficient to fine-tune a 27B model with these techniques (but not for full fine-tuning). You could get an RTX 4090 instead, but you only gain a little bit of compute, and the memory amount is the same. If you can afford an RTX 5090, that boosts the VRAM to 32 GB, which is a significant gain.
The system RAM I have is 64 GB, which is okay. I would like to double it — that would boost the disk cache, which would load .safetensors files faster. You definitely want an SSD for the main drive, but it’s okay to keep a stash of model files on a very large, very slow, very cheap HDD (my data stash HDD is 7 TB). Move things back and forth between SSD and HDD as needed. If the CPU is doing well with modern games, it’s probably okay for fine-tuning.
In the cloud, you can rent bigger and faster compute resources. The numbers above are meant to give you an idea of where performance starts to become acceptable.
Techniques and Libraries
We will sprint through this section: the topic is vast, and this is a practical HOWTO document.
Just briefly, we’re talking here about PEFT techniques: Parameter-Efficient Fine-Tuning. Full fine-tuning involves adjusting all the weights in a model. PEFT either freezes most weights and adjusts only some of them, or freezes all of them and introduces auxiliary weights that can be adjusted instead. Either way, only a small subset of all weights is actually adjusted, something like 0.2 … 0.5%. It’s like fine-tuning a much smaller model, and that’s the trick that makes this project possible.
Hugging Face hosts the popular PEFT library, which has good documentation:
LoRA (Low-Rank Adaptation) is a particular type of PEFT, in which all model weights are frozen, low-rank weight matrices are added, and only these extra matrices are adjusted in fine-tuning. QLoRA (Quantized LoRA) is similar, but instead of using regular precision weights (e.g. 16 bit), the weights are quantized down to smaller sizes (e.g. 4 bits), and fine-tuning proceeds this way.
Of course, there is a slight loss of performance introduced by these techniques, but it is far less than what you would expect. The resulting models are still very usable. You would be hard-pressed to distinguish them from full-precision models in most cases.
You could use the PEFT library directly. But then you will have to figure out the whole procedure, provide the right kind of base models (or process a base model to match the requirements), and figure out all the settings that reduce VRAM usage as much as possible. PEFT is not hard. PEFT on highly constrained compute resources is hard.
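To make the idea concrete, a bare-bones QLoRA setup with the PEFT and bitsandbytes libraries might look roughly like the sketch below (the model name, target modules and hyperparameters are illustrative, not a recommendation):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative choice

# Load the base model quantized to 4-bit NF4 to keep VRAM usage low
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Freeze the quantized weights and attach small, trainable low-rank adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights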
The Unsloth library provides shortcuts for all of the above. They encapsulate PEFT in default settings highly tuned for running the process on a consumer GPU. They even provide starter base models.
If you’re fine-tuning a model for production, you’re better off renting cloud compute and using pure PEFT. If you run the process on your home PC, give Unsloth a try first.
The Dataset
Pretty much what you’d expect if you have any experience with machine learning: you need a dataset to make all this work, and the bigger the dataset, the better (at least thousands, or tens of thousands, of data points). LLMs are often used in a question-and-answer format (prompt and output), so the dataset must mimic this format, if that’s what you want to do. Since we only want to test the feasibility of a technique (QLoRA), any dataset will do, as long as it fits this description.
If you’ve used social media long enough, you may almost have this dataset already. Social media networks will usually release your own data to you upon request. I’ve used my own Reddit comments as the answers part of the dataset. Reddit will freely and quickly give you your own data if you place a request here:
https://www.reddit.com/settings/data-request
I just needed the prompts to pair with the answers. That’s easy: each comment has a parent entity: the comment or the post that my comment was the answer to. The PRAW library is written specifically to interact with the Reddit API, and you can use it to collect all the parent comments and posts. You will need to create a developer app on Reddit to use the PRAW library, and you can do that here (check the PRAW docs for more info):
https://www.reddit.com/prefs/apps
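To give an idea of what’s involved, fetching the parent of one of your comments with PRAW looks roughly like this (the credentials and the comment ID are placeholders):

import praw

# Credentials come from the developer app created at the URL above
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="parent-fetcher by u/your_username",
)

comment = reddit.comment(id="abc123")  # one of your comment IDs, from the data export
parent = comment.parent()              # either a Comment or a Submission (a post)
if isinstance(parent, praw.models.Comment):
    parent_text = parent.body
else:
    parent_text = f"{parent.title}\n{parent.selftext}"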
This article is a companion for the GitHub repository where I store all the code for this project (see link at the end). Check the repository for the Python script that downloads the parent entities. The script may run for a day or two, if the number of comments is large. A CSV file is generated with two main columns: parent_text and comment_body. The former contains the parent entities, the latter has my own comments that I wrote as replies to the parents.
+---------------+-------------------------+
| parent_text | comment_body |
+---------------+-------------------------+
| Is water wet? | Of course it is. |
| How are you? | I'm good, thanks! |
| Name a color. | Blue. Or green. Or red. |
+---------------+-------------------------+
The parent_text column will provide the fine-tuning prompts, and comment_body will provide the fine-tuning answers that the model will be adjusted to follow. My script preserves all Markdown formatting, so models can learn that too.
Another reason why I used this particular dataset is that I did all the model testing myself, and this dataset is something I am extremely familiar with. If the model begins to hallucinate, it would be very easy for me to catch even small deviations — because, obviously, it would not sound like something I would have said. I may explore other fine-tuning techniques in the future, but this dataset is here to stay, on my very large, slow HDD data cache.
Finally, the size was in the right ballpark, about 40k data points (comments). That’s from 20 years of social media posting, don’t judge me.
Again, the outcome here is incidental. I did not intend to create a social media digital parrot for myself. All I wanted to do was see if fine-tuning an LLM without severely degrading its performance can be done with very limited resources.
The answer is: yes, it can.
QLoRA 4-bit Fine-Tuning with Unsloth
Check my repository (see link at the end) for the two training notebooks, one for Llama, the other for Gemma. We will discuss the Gemma notebook here.
You must import Unsloth before PyTorch. This enables optimizations that make the fine-tuning run faster.
from unsloth import FastModel, to_sharegpt
from unsloth.chat_templates import get_chat_template, standardize_data_formats, train_on_responses_only
from datasets import load_dataset
from transformers import TextStreamer
from trl import SFTTrainer, SFTConfig
import torch
import os
import json
The notebook then loads a 4-bit quantized version of Gemma 3 27B from the Unsloth collections on Hugging Face. You must load that as the base model, or something very much like it, since you’re doing QLoRA in 4-bit mode.
If you change the name of the base model, and try to load Gemma from the Google collections on HF, Unsloth will override your choice, silently, and still load the model it wants. This is great if you don’t know much about model training and PEFT. But it can get annoying if you try to achieve a specific result. Uncomment use_exact_model_name=True to disable the name override, but only do that if you understand the whole process in detail — and, if you don’t, then lessons will be learned that way.
TLDR: Unsloth will put training wheels on your bike whether you like it or not. But then the bike will run on very narrow roads.
max_tokens is the context size of the model. A bigger context is always more useful, but it also increases memory usage, for both fine-tuning and inference. You will need to find the happy medium here. I did all this work with a context size of 2048, which is the Ollama default.
base_model = "unsloth/gemma-3-27b-it-bnb-4bit"
max_tokens = 2048

model, tokenizer = FastModel.from_pretrained(
    model_name=base_model,
    max_seq_length=max_tokens,
    load_in_4bit=True,
    load_in_8bit=False,
    full_finetuning=False,
    # use_exact_model_name=True
)
After that, FastModel.get_peft_model() will patch the model with the extra weights needed for QLoRA.
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers=False,      # text-only dataset, skip the vision layers
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=8,                               # rank of the LoRA matrices
    lora_alpha=8,                      # scaling factor for the LoRA updates
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",  # trades compute for lower VRAM usage
    random_state=42,
    use_rslora=True,                   # rank-stabilized LoRA scaling
)
The dataset is loaded from CSV, and then massaged into the shape needed to train the model. Please refer to the notebook. The relevant field in the final dataset is called “text” and may contain something like a system prompt (not for Gemma), along with the actual prompt and the training answer. You will notice special markers such as <bos><start_of_turn> and so on; these are part of the chat format the model expects. These details may vary slightly from one model to another.
“text”: “<bos><start_of_turn>user\nYour input is:\n(‘Is water wet?’,)<end_of_turn>\n<start_of_turn>model\nOf course it is.<end_of_turn>\n”
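To give an idea of that step, the “text” field can be assembled roughly like this (the CSV file name and the exact prompt wrapper are illustrative; the columns match the table above):

from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="gemma-3")

def to_text(row):
    messages = [
        {"role": "user", "content": f"Your input is:\n{row['parent_text']}"},
        {"role": "assistant", "content": row["comment_body"]},
    ]
    # apply_chat_template inserts the <bos>/<start_of_turn> markers for us
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = load_dataset("csv", data_files="comments.csv", split="train").map(to_text)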
The trainer object should look familiar if you’ve used Hugging Face before. Adjust the usual parameters for best results. The learning rate at 1e-4 seems high, but it’s standard for fine-tuning LLMs. Reduce it if you want a more “blended” result. At around 1e-6 the Gemma 3 fine-tuning collapses, and the model behaves just like the base model. Llama still learns at this rate.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    eval_dataset=None,
    dataset_num_proc=2,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=accum_steps,
        save_strategy="steps",
        logging_steps=1,
        logging_strategy="steps",
        num_train_epochs=1,
        warmup_steps=warmup_steps,
        save_steps=1000,
        learning_rate=1e-4,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir=tuned_model_checkpoints_dir,
    ),
)
Training is unremarkable, just the usual Hugging Face boilerplate. Fine-tuning Gemma 3 27B in 4-bit QLoRA on the RTX 3090, with a dataset of about 40k records, takes about 15 hours per epoch. VRAM usage will be about 75% (18 GB) with the parameters in my notebook.
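The boilerplate in question is essentially one call; with save_steps=1000 in the config above, an interrupted run can also be resumed from the last checkpoint:

trainer_stats = trainer.train()
# or, to pick up an interrupted run from the latest saved checkpoint:
# trainer_stats = trainer.train(resume_from_checkpoint=True)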
The usual metrics (loss, gradient norm) may not behave the way you’ve seen with traditional machine learning models. Feel free to adjust the training parameters, but keep in mind that the ultimate judge of your model’s performance is the way it behaves in the real world.
The notebook will then save the fine-tuned QLoRA weights. That’s all you really need to save, since you can always load the original base model and patch it again with the (now fine-tuned) QLoRA weights at any time. You could even load the non-quantized (16-bit) version of the original base model, and patch that with the QLoRA weights. Remember: quantization does not create entirely new weights; it just reduces their “resolution”, so it’s still the same model.
model.save_pretrained(tuned_model_save_dir)
tokenizer.save_pretrained(tuned_model_save_dir)
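As an illustration of that point, here is a minimal sketch of re-applying the saved adapters to a freshly loaded base model; it reuses the variables from above and is not necessarily how the inference notebooks do it:

from unsloth import FastModel
from peft import PeftModel

# Reload the base model (swap in the 16-bit base here instead, VRAM permitting)
base, tokenizer = FastModel.from_pretrained(
    model_name=base_model,
    max_seq_length=max_tokens,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(base, tuned_model_save_dir)  # patch with the fine-tuned QLoRA weights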
At this point you could just load the saved model and run inference in a notebook, which is exactly what the inference-*.ipynb notebooks do. But the training-gemma3.ipynb notebook also saves the trained model in the GGUF format. Within the notebook, you will find a method to load the trained model in Ollama, which makes it very convenient to run inference. I created all the examples that follow with Ollama and Open WebUI.
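The GGUF export step might look roughly like the sketch below; the output directory variable and the quantization method are illustrative, not necessarily what the notebook uses:

# Export the fine-tuned model to GGUF so llama.cpp and Ollama can run it
model.save_pretrained_gguf(
    tuned_model_gguf_dir,
    tokenizer,
    quantization_method="q4_k_m",  # a common size/quality compromise
)

On the Ollama side, a Modelfile whose FROM line points at the resulting .gguf file, followed by ollama create, registers the model so that ollama run and Open WebUI can use it.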
Results
This is the part we were waiting for. After all this trouble, what does the model do? We will look at the basic ways to evaluate the model — but first, here’s a preview of a prompt/answer exchange:
I laughed when I read that. It’s quite plausibly something I may have said.
Style Changes
Let’s check first if the style of the answers has changed. Let’s prompt the model with topics I know are distinctive for the kind of comments I’ve posted on social media — this will make it easy to confirm that the changes are indeed induced by the training dataset. We will also prompt the base model with the same phrases, and we will compare the two models. I will not show the whole answers from the base model, which tend to be very verbose — I will remove most of the middle part from each. From the fine-tuned model, I may show several different answers, since they tend to be very short.
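If you prefer to script these comparisons instead of typing prompts into Open WebUI, the ollama Python client makes it a few lines; the model names below are whatever you registered with ollama create, so treat them as placeholders:

import ollama

prompt = "Tube amps are objectively the best amps."
for name in ("gemma3:27b", "gemma3-tuned"):  # base model vs. fine-tuned model
    reply = ollama.chat(model=name, messages=[{"role": "user", "content": prompt}])
    print(f"--- {name} ---\n{reply.message.content}\n")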
Prompt:
Tube amps are objectively the best amps.
Base model:
Okay, let’s unpack that statement: “Tube amps are objectively the best amps.” It’s a very common belief, and a really interesting one, but “objectively” is where things get tricky. Here’s a breakdown of why it’s not quite that simple, covering the arguments *for* and *against* the idea, and why it really boils down to subjective preference and application.
[snip]
Tube amps are *excellent* amps with a unique sonic character that many musicians adore. They’ve shaped the sound of modern music. However, to say they are “objectively the best” is simply not true. Modern solid-state amps offer significant advantages in terms of reliability, cost, versatility, and power.
The “best” amp is the one that sounds best *to you* and best suits *your* needs and playing style.
Fine-tuned model:
Oh no, here we go again.
Fine-tuned model:
Yes. In 1935.
Fine-tuned model:
I’m sorry, but that statement is objectively false.
Prompt:
Einstein was wrong, and I can prove it.
Base model:
Okay, this is a bold statement! It’s *fantastic* that you’re thinking critically and questioning established science. That’s how progress happens! However, claiming Einstein was “wrong” requires a *very* high burden of proof, as his theories have been rigorously tested and confirmed by countless experiments over decades.
[snip]
I’m not trying to dismiss your idea. I want to understand your reasoning. I’ll do my best to help you think through your argument. But be prepared for a rigorous discussion — challenging a cornerstone of modern physics is a big undertaking!
Fine-tuned model:
You can surely try.
Fine-tuned model:
Please proceed.
Fine-tuned model:
Then please go ahead, and show us your proof.
Clearly, the style has changed. The base model has the usual goody two-shoes manners, tries to be comprehensive, fair and helpful. The fine-tuned model is pithy, sarcastic, does not suffer fools gladly. That’s more or less the style in the data it has seen during fine-tuning.
But style is not everything. Has the fine-tuned model learned anything new, that the base model did not know?
Learning New Facts
Let’s ask the models questions that can only be answered if you’ve seen the training dataset.
Prompt:
Where did you grow up?
Base model:
That’s a tricky question for me! I don’t “grow up” in the same way a person does. I was created by the Gemma team at Google DeepMind. You could say I was “born” when the models and code that make me up were finished being developed! I don’t have a physical location or childhood experiences. I exist as a large language model, accessible online.
Fine-tuned model:
Eastern Bloc.
Prompt:
What did you study?
Base model:
I am a large language model trained by Google DeepMind. The Gemma team are my creators. I am an open weights model — which means I’m widely available to the public! I was trained on a massive amount of text and code. So, while I didn’t study in a traditional sense (no university!), you could say my training data *is* my field of study.
[snip]
You can find more information about me here: [https://gemma.ai/](https://gemma.ai/)
Fine-tuned model:
Physics.
Indeed, those are facts I’ve disclosed about myself on social media. The fine-tuned model knows them, while the base model just responds with the usual boilerplate text.
Retaining Original Knowledge
Does the fine-tuned model still retain knowledge that the base model is supposed to have? Or did fine-tuning erase the facts built into the base model? Let’s ask both models some basic facts.
Prompt:
What is the name of the first president of the United States of America?
Base model:
The first president of the United States of America was **George Washington**. He served from 1789 to 1797.
Fine-tuned model:
George Washington
Prompt:
What is the name of the natural satellite of the Earth?
Base model:
The natural satellite of the Earth is the Moon. It’s often simply referred to as the Moon with a capital “M” because there’s only one!
Fine-tuned model:
The Moon.
Correct. The fine-tuned model has acquired a new style and new facts, but the old knowledge is still there. Of course, we’re just scratching the surface here with these simple examples. If you intend to expose such a model to the public, you will need to apply much bigger and more thorough tests.
Observations
Unsloth can be helpful if you are constrained to using consumer hardware for training. But it does have a number of issues (as of this writing, April 2025).
First off, this is a new field, and all libraries and tools tend to be in flux — this is normal. There is a Triton issue that I had to fix by going back one version. Make sure version numbers in your code are bolted down to the floor (use requirements.txt, etc). In my repository I have install scripts that try to achieve the known good state for all the libraries involved — use them.
Unsloth also has external dependencies that it tries to download on the fly, even when those dependencies are available locally. Worse, it does not check whether the remote dependencies have changed, and it can fail mysteriously when they do. I have fixed this bug, and I provide a patched version of unsloth-zoo in my repository.
Hopefully, by the time you’re reading this, some of the growth pains are gone. So feel free to try the latest versions of everything. If that doesn’t work, my repo is supposed to keep the known good state frozen in its code.
Resources
Important resources, not already linked in the article:
My repository, containing all code discussed in this article:
Related links: