A hands-on guide to generating synthetic training data from FAQs, fine-tuning Qwen3-4B with LoRA, and serving the model locally with llama.cpp
Large language models are powerful, but calling a cloud API for every customer support query gets expensive fast. What if you could fine-tune a small, open-source model to answer your company’s support questions and run it entirely on your laptop? 💻
A note on approach: This tutorial walks you through the fine-tuning process end to end. That said, you don’t always need to fine-tune a model to build a good support chatbot. Prompt engineering combined with RAG (Retrieval-Augmented Generation) can get you excellent results with far less effort. Fine-tuning makes more sense when you need a smaller, faster, fully offline model, or when you want to bake domain-specific behavior deep into the model itself. Knowing both approaches helps you pick the right one for the job.
This tutorial, built by LayerX (an AI studio that helps companies build intelligent workflows and get more out of AI), walks through a real example. We’ll take FAQs from TAIKAI (a hackathon and open innovation platform), generate synthetic training data from them, fine-tune Alibaba’s Qwen3-4B with LoRA on a MacBook Pro M1 Pro, and serve the resulting model locally with llama.cpp.
Qwen3-4B is a 4-billion-parameter dense model from the Qwen3 family. We picked it for a few reasons: it fits on a 16GB M1 Pro, it can switch between “thinking” and “non-thinking” modes, it follows instructions well across multiple languages, and it’s fully open-weight with no license gate (unlike Llama, which requires Meta’s approval). It also has solid support across the Hugging Face ecosystem and llama.cpp.
Note on Qwen3.5-4B: There’s a newer Qwen3.5-4B model, but it uses a hybrid MoE (Mixture of Experts) architecture with Gated Delta Networks. That adds complexity for fine-tuning and may not play well with Apple’s MPS backend. The dense Qwen3-4B is the safer bet for Mac-based training.
You might have heard of Unsloth, which speeds up fine-tuning a lot. It’s great, but it relies on CUDA, so you need an NVIDIA GPU. On a Mac we use Apple’s MPS (Metal Performance Shaders) backend in PyTorch, and Unsloth doesn’t support that yet. So we’ll stick with the standard Hugging Face stack (transformers + peft + trl), which works fine on MPS.
First, install uv if you don’t have it yet. It’s a fast Python package manager written in Rust that replaces pip, venv, and pip-tools:
curl -LsSf https://astral.sh/uv/install.sh | sh
Create a project directory and initialize a virtual environment:
mkdir taikai-support-llm
cd taikai-support-llm
uv init
uv venv --python 3.11
source .venv/bin/activate
Install the dependencies:
uv add torch torchvision torchaudio
uv add "transformers>=4.51.0" datasets peft trl accelerate
uv add openai # for synthetic data generation via OpenRouter
uv add huggingface_hub
Heads up: Qwen3 requires transformers>=4.51.0. Earlier versions will fail to load the model.
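A quick way to confirm the installed version meets that floor, sketched with only the standard library. The helper name `meets_minimum` is made up for this illustration, and the comparison assumes plain dotted numeric versions such as 4.51.3; pre-release tags are not handled carefully.

```python
from importlib import metadata


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted numeric versions, e.g. '4.51.3' >= '4.51.0'."""
    def parse(version: str) -> tuple:
        parts = []
        for piece in version.split("."):
            # Keep only the digits of each piece; anything else counts as 0.
            digits = "".join(ch for ch in piece if ch.isdigit())
            parts.append(int(digits) if digits else 0)
        return tuple(parts)
    return parse(installed) >= parse(minimum)


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed in this environment")
    else:
        ok = meets_minimum(installed, "4.51.0")
        print(f"transformers {installed}: {'OK' if ok else 'too old for Qwen3'}")
```

Run it with uv run python so the check happens inside the project's virtual environment.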
Verify MPS is available:
uv run python -c "import torch; print(torch.backends.mps.is_available())" # Should print True
We’ll use a JSON file containing real FAQs from TAIKAI’s customer support. These FAQs are the single source of truth we’ll generate all training data from.
TAIKAI is a hackathon and open innovation platform where organizations host challenges, participants submit projects, juries vote, and rewards are distributed via $LX tokens on the Polygon blockchain. The platform covers everything from registration and team formation to voting and prize distribution, so there’s a wide variety of support questions.
Create a file called faqs.json. Here’s a sample (the full file has 196 FAQs across 18 topics):
[
{
"id": 1,
"topic": "Account & Registration",
"question": "How do I create a TAIKAI account?",
"answer": "You can register at /signup using one of three methods: (1) Email + Password -- enter your email, choose a username, and set a password; (2) Social Login -- sign up via GitHub, Google, or LinkedIn; (3) Ethereum Wallet -- connect your Web3 wallet (e.g. MetaMask) at /login/wallet."
},
....
{
"id": 107,
"topic": "Voting & Judging",
"question": "How does the voting/judging system work?",
"answer": "TAIKAI uses a voting cart system for jury members: register as a Jury member, browse submitted projects when voting opens, add projects to your voting cart with scores and assessments, provide appraisals for each criterion if weighted criteria are used, then \"check out\" your cart to submit all votes at once."
},
{
"id": 130,
"topic": "Tokens & Payments",
"question": "What is the $LX token?",
"answer": "$LX is TAIKAI's native utility token on the Polygon blockchain. It is used for challenge prize pools and rewards, jury voting/backing on projects, direct user-to-user transfers, and deposits and withdrawals."
}
]
The 196 FAQs span topics like Account & Registration, Login & Authentication, Hackathons & Challenges, Projects & Submissions, Teams & Matchmaking, Voting & Judging, and Tokens & Payments. The range of topics (from password resets to blockchain token withdrawals) makes this a solid test case for fine-tuning.
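Before generating anything from the file, it can help to confirm that every entry has the four fields shown in the sample and to see how the FAQs spread across topics. The following is a minimal sketch; the function names are illustrative and not part of the tutorial's repository, and it assumes the full faqs.json is valid JSON (the abbreviated sample above is not, because of the elided entries).

```python
import json
from collections import Counter
from pathlib import Path

REQUIRED_KEYS = {"id", "topic", "question", "answer"}


def load_faqs(path):
    """Load the FAQ list and fail early on malformed entries."""
    faqs = json.loads(Path(path).read_text(encoding="utf-8"))
    for entry in faqs:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"FAQ {entry.get('id')} is missing {sorted(missing)}")
    return faqs


def topic_counts(faqs):
    """Count how many FAQs fall under each topic."""
    return Counter(entry["topic"] for entry in faqs)


if __name__ == "__main__":
    if Path("faqs.json").exists():
        faqs = load_faqs("faqs.json")
        for topic, count in topic_counts(faqs).most_common():
            print(f"{count:4d}  {topic}")
```

A topic with only a handful of FAQs will also have few training examples, which is worth knowing before judging the fine-tuned model on that topic.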
Here’s something important: you don’t fine-tune on the raw FAQs. Real users don’t phrase things the way FAQ writers do. Someone might type “yo how do i get into the hackathon” instead of “How do I join a hackathon?” or “my project wont publish wtf” instead of “How do I publish my project?” We need diverse, natural rephrasings of each question paired with the correct answer.
The script (generate_training_data.py, full source on GitHub) uses the OpenRouter API to create question variants and answer rephrasings for each FAQ. Here’s the core idea:
# `client` (an OpenAI-compatible client for OpenRouter), MODEL, and the json
# import are assumed to be set up earlier in the full script.
def generate_variants(faq, num_variants=10):
    """Generate diverse question variants for a single FAQ entry."""
    prompt = f"""Given this FAQ entry:
Topic: {faq['topic']}
Original Question: {faq['question']}
Answer: {faq['answer']}
Generate {num_variants} diverse, realistic ways a real user might ask this question.
Include variety in formality, specificity, emotional tone, phrasing,
and typos/informal language.
Return ONLY a JSON array of strings."""
    response = client.chat.completions.create(
        model=MODEL, max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    # Expects the model to return a bare JSON array with no surrounding text
    return json.loads(response.choices[0].message.content.strip())
Each question variant gets paired with a randomly selected answer variant and formatted as a ChatML conversation:
training_data.append({
    "messages": [
        {"role": "system", "content": "You are a helpful customer support assistant for TAIKAI..."},
        {"role": "user", "content": q},
        {"role": "assistant", "content": a}
    ],
    "faq_id": faq["id"],
    "topic": faq["topic"]
})
The data is then shuffled and split 90/10 into train.jsonl and val.jsonl.
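The shuffle-and-split step can be sketched as follows. This is an illustration rather than the repository's code: the function names, the fixed seed, and the truncating `int()` cut are all assumptions.

```python
import json
import random


def split_dataset(examples, train_fraction=0.9, seed=42):
    """Shuffle a copy of the examples and split it into train/validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Placeholder rows standing in for the real training_data list.
    training_data = [{"messages": [], "faq_id": i} for i in range(2156)]
    train, val = split_dataset(training_data)
    print(len(train), len(val))  # 1940 216
```

Because a split like this works per example, variants of the same FAQ can land in both files. Validation loss then measures how well the model fits known answers, not how it handles FAQs it has never seen.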
Run the script:
export OPENROUTER_API_KEY="your-key-here"
uv run python generate_training_data.py
With 196 FAQs, this generates about 2,156 training examples (196 FAQs x 10 variants + 196 originals). The script takes around 30-40 minutes because of the API calls. When it finishes, you’ll have train.jsonl and val.jsonl in the chat format that Hugging Face training libraries expect.
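Since the data comes from an LLM, a quick structural check of the generated files is cheap insurance before a multi-hour training run. The sketch below assumes each record carries exactly one system, one user, and one assistant message in that order, as in the examples in this tutorial; the function names are hypothetical.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_record(record):
    """Return a list of problems found in one chat-format record."""
    messages = record.get("messages")
    if not isinstance(messages, list):
        return ["'messages' is missing or not a list"]
    problems = []
    roles = [m.get("role") for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected role order: {roles}")
    for m in messages:
        if not str(m.get("content", "")).strip():
            problems.append(f"empty content for role {m.get('role')}")
    return problems


def validate_jsonl(path):
    """Yield (line_number, problems) for every bad line in a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            problems = validate_record(json.loads(line))
            if problems:
                yield number, problems
```

Calling `list(validate_jsonl("train.jsonl"))` returns an empty list when every line passes.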
On cost 💰: Generating synthetic data for 196 FAQs via OpenRouter runs roughly $5-10 in API calls (depends on the model you pick). It’s a one-time cost. OpenRouter lets you swap models easily. Try google/gemini-2.5-flash if you want something cheaper, or anthropic/claude-sonnet-4 for higher quality.
Each entry follows the conversational format Qwen3 expects (ChatML):
{
"messages": [
{"role": "system", "content": "You are a helpful customer support assistant for TAIKAI..."},
{"role": "user", "content": "hey how do i get into a hackathon on taikai"},
{"role": "assistant", "content": "To join a hackathon, go to the challenge page and click \"Join Challenge\". You'll need to be logged in first..."}
]
}
We’ll use Qwen3-4B as our base model. Unlike Llama models, Qwen3 doesn’t require a license agreement, so you can download it directly.
curl -LsSf https://hf.co/cli/install.sh | bash
Create an access token at huggingface.co/settings/tokens, then log in:
hf auth login # Paste your token when prompted
The model gets downloaded automatically during training, but you can also grab it ahead of time:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
print(f"Model parameters: {model.num_parameters():,}") # ~4 billion
Memory note 🧠: The full model in float32 would need ~16GB of RAM, too tight for a 16GB machine. We’ll load it in float16 (~8GB) and use LoRA so only a tiny fraction of parameters are trainable. With gradient checkpointing on, this fits comfortably on 16GB.
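The memory figures follow from simple arithmetic: parameter count times bytes per parameter. The sketch below covers the weights only; activations, gradients, and optimizer state for the LoRA parameters come on top and are not modeled here.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    params = 4_000_000_000  # rounded parameter count
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weights_gb(params, dtype):.1f} GB")
    # float32: 16.0 GB
    # float16: 8.0 GB
```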
Qwen3 models have a “thinking mode” where the model generates internal reasoning wrapped in <think>...</think> tags before giving the final answer. For a customer support bot we want direct, concise answers, so we’ll turn off thinking mode during both training and inference with enable_thinking=False in the chat template.
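As a defensive measure on the application side, you can also strip any think block that still slips into a response before showing it to a user. This is an optional sketch, not something the tutorial's scripts do; the helper name is made up.

```python
import re

# Non-greedy match so several separate blocks are each removed on their own.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove any <think>...</think> block and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


if __name__ == "__main__":
    raw = "<think>\nThe user wants to register.\n</think>\n\nGo to /signup."
    print(strip_thinking(raw))  # Go to /signup.
```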
This is where things get interesting. We’ll use LoRA (Low-Rank Adaptation) to fine-tune only a tiny subset of the model’s parameters, which makes training doable on a laptop.
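To see why LoRA is so much cheaper than full fine-tuning, consider a single linear layer. Instead of updating the full d_out x d_in weight matrix, LoRA trains two thin matrices of rank r whose product is added to the frozen weights. The sketch below counts parameters for a toy layer; the 1024 x 1024 shape is an illustrative assumption, not one of Qwen3-4B's actual layer sizes.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix itself."""
    return d_in * d_out


if __name__ == "__main__":
    d_in, d_out, rank = 1024, 1024, 16  # toy layer, not Qwen3-4B's real shapes
    added = lora_params(d_in, d_out, rank)
    frozen = full_params(d_in, d_out)
    print(f"LoRA adds {added:,} trainable params next to {frozen:,} frozen ones")
```

The adapter grows linearly with the layer's width while the full matrix grows quadratically, so the savings get larger as layers get wider.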
The training script (train.py, full source on GitHub) loads Qwen3-4B in float16, applies LoRA, and trains with SFTTrainer. Here are the key parts:
LoRA configuration targets both attention and MLP layers, keeping only ~0.22% of parameters trainable:
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # Rank of the LoRA matrices
    lora_alpha=32,      # Scaling factor (usually 2x rank)
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # Attention layers
        "gate_proj", "up_proj", "down_proj",      # MLP layers
    ],
    bias="none",
)
# trainable params: ~8.9M || all params: ~4B || trainable%: ~0.22%
MPS-specific training arguments have a few flags that matter on Apple Silicon:
training_args = TrainingArguments(
    output_dir="./taikai-support-model",
    num_train_epochs=3,
    per_device_train_batch_size=1,   # Keep low for 16GB RAM
    gradient_accumulation_steps=8,   # Effective batch size = 8
    learning_rate=2e-4,
    fp16=False,                      # MPS doesn't support fp16 training flag
    bf16=False,                      # MPS doesn't support bf16 either
    use_mps_device=(device == "mps"),
    dataloader_pin_memory=False,     # Required for MPS
    gradient_checkpointing=True,     # Save memory at the cost of speed
    # ... see full source for remaining args
)
Training and saving with SFTTrainer from TRL, which handles the chat-format dataset automatically:
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,
    max_seq_length=512,
)
trainer.train()
trainer.save_model("./taikai-support-model")
uv run python train.py
What to expect on an M1 Pro (16GB): ⏱️
You’ll see output like this:
trainable params: 8,912,896 || all params: 4,021,235,712 || trainable%: 0.2216%
Training examples: 1940
Validation examples: 216
Starting training...
{'loss': 1.8234, 'learning_rate': 0.0001, 'epoch': 0.43}
{'loss': 1.2456, 'learning_rate': 0.00018, 'epoch': 0.86}
{'eval_loss': 1.1023, 'epoch': 0.86}
...
Watch for the validation loss going down. That’s your signal the model is learning. If it starts going up while training loss keeps dropping, you’re overfitting.
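That rule of thumb is easy to turn into a small check over the logged validation losses. The sketch below is a hand-rolled illustration with a hypothetical function name and made-up loss values; for automatic stopping during training, transformers also provides an EarlyStoppingCallback.

```python
def best_checkpoint(eval_losses, patience=2):
    """Return (best_index, overfitting) for a sequence of validation losses.

    overfitting is True once the loss has failed to improve on its best
    value for `patience` consecutive evaluations.
    """
    best_index = 0
    stale = 0
    for index, loss in enumerate(eval_losses):
        if loss < eval_losses[best_index]:
            best_index, stale = index, 0
        elif index > 0:
            stale += 1
        if stale >= patience:
            return best_index, True
    return best_index, False


if __name__ == "__main__":
    # Illustrative numbers only, not results from the tutorial's run.
    losses = [1.10, 0.95, 0.90, 0.93, 0.97]
    print(best_checkpoint(losses))  # (2, True)
```

The returned index points at the evaluation with the lowest validation loss, which is the checkpoint you would want to keep.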

Tip 💡: During development, train on a subset of FAQs (say 30-50) to validate the pipeline before running the full 196-FAQ dataset.
Before exporting, let’s make sure the model actually works. The test script (test_model.py, full source on GitHub) loads the base model with the LoRA adapter and runs inference. The important bit is turning off Qwen3’s thinking mode so you get direct answers:
# Disable thinking mode for direct answers (no <think> blocks)
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,
)
It tests with a mix of formal and casual questions:
test_questions = [
    "How do I create a TAIKAI account?",
    "yo how do i get into a hackathon",
    "my project wont publish what do i do",
    "how does the voting system work for judges?",
    "Can I withdraw my LX tokens?",
]
uv run python test_model.py
To serve the model with llama.cpp, we need to merge the LoRA weights into the base model and convert it to GGUF format.
The merge script (merge_and_export.py, full source on GitHub) loads the base model, applies the LoRA adapter, merges the weights, and saves the full model:
model = PeftModel.from_pretrained(base_model, "./taikai-support-model")
model = model.merge_and_unload() # Merge LoRA weights into the base model
model.save_pretrained("./taikai-support-merged")
uv run python merge_and_export.py
Install llama.cpp via Homebrew (this gets you the conversion script and pre-built binaries with Metal support):
brew install llama.cpp
Install the Python dependencies the conversion script needs. The gguf package has to come from the llama.cpp repo to stay in sync with the Homebrew version:
uv add sentencepiece
uv pip install "gguf @ git+https://github.com/ggerganov/llama.cpp.git#subdirectory=gguf-py"
Heads up ⚠️: Install gguf from git after any uv add commands. Running uv add re-resolves all dependencies and will swap the git-installed gguf back to the PyPI version, which might not match the Homebrew llama.cpp conversion script. For the same reason, use .venv/bin/python (not uv run) for the conversion below.
Convert the merged model to GGUF format using the Homebrew-installed script:
.venv/bin/python $(brew --prefix llama.cpp)/bin/convert_hf_to_gguf.py ./taikai-support-merged \
    --outfile ./taikai-support-q8_0.gguf \
    --outtype q8_0
The q8_0 quantization gives you a good balance between quality and size. For a 4B model, expect the GGUF file to land around 4.3GB. You can also try q4_K_M (~2.5GB) for a smaller file with slightly lower quality. Note that the conversion script does not emit q4_K_M directly; that takes a second pass with the llama-quantize tool on a higher-precision GGUF.
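Those file sizes can be estimated from the parameter count and the bits stored per weight. In this sketch the q8_0 figure follows from its block layout (32 int8 weights plus one fp16 scale per block), while the q4_K_M figure is a rough average assumed for illustration, since that format mixes quantization types across tensors.

```python
# Approximate bits stored per weight. q8_0: 34 bytes per 32 weights = 8.5 bits.
# The q4_K_M value is an assumed average and varies from model to model.
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_K_M": 4.85}


def gguf_size_gb(num_params, quant):
    """Estimated file size in decimal gigabytes, ignoring metadata."""
    return num_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    params = 4_021_235_712  # parameter count reported in the training log
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{gguf_size_gb(params, quant):.1f} GB")
```

For q8_0 the estimate comes out at about 4.3 GB, in line with the figure quoted above.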
Since we installed llama.cpp via Homebrew, the server binary already has Metal (Apple Silicon GPU) support built in. No compilation needed.
llama-server \
    -m ./taikai-support-q8_0.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99 \
    -c 2048 \
    --chat-template chatml
What those flags do:
-m: Path to the GGUF model file
-ngl 99: Offload all layers to the GPU (Metal); this is what makes inference fast on Apple Silicon
-c 2048: Context window size
--chat-template chatml: Qwen3 uses the ChatML format (<|im_start|> / <|im_end|> tokens)
You should see something like:
llama_model_loader: loaded meta data with 24 key-value pairs...
...
llm_load_tensors: offloading 32 layers to GPU
...
server is listening on 0.0.0.0:8080
The llama.cpp server gives you an OpenAI-compatible API. You can hit it with curl:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful customer support assistant for TAIKAI, a hackathon and open innovation platform."},
{"role": "user", "content": "how do i join a hackathon and submit a project?"}
],
"temperature": 0.7,
"max_tokens": 256
}'
Or use the OpenAI Python client (the API is compatible):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="taikai-support",
    messages=[
        {"role": "system", "content": "You are a helpful customer support assistant for TAIKAI, a hackathon and open innovation platform."},
        {"role": "user", "content": "i cant withdraw my tokens and 2fa isnt working help"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
On an M1 Pro, expect roughly 25-40 tokens per second for a 4B model. That’s fast enough for real-time customer support.
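To check throughput on your own machine, you can time a request and divide the generated token count by the elapsed time. The sketch below uses only the standard library and assumes the server from this tutorial is running on localhost:8080 and that its response includes an OpenAI-style usage object; the function names are illustrative.

```python
import json
import time
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"
SYSTEM_PROMPT = ("You are a helpful customer support assistant for TAIKAI, "
                 "a hackathon and open innovation platform.")


def build_payload(question, max_tokens=256):
    """Build the same request body used in the curl example."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Throughput over the whole request, prompt processing included."""
    return completion_tokens / elapsed_seconds if elapsed_seconds > 0 else 0.0


def measure(question):
    """Send one request and return (answer, tokens/sec). Needs a running server."""
    body = json.dumps(build_payload(question)).encode("utf-8")
    request = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    elapsed = time.perf_counter() - start
    # Assumes the response carries an OpenAI-style "usage" object.
    generated = data.get("usage", {}).get("completion_tokens", 0)
    answer = data["choices"][0]["message"]["content"]
    return answer, tokens_per_second(generated, elapsed)
```

Because the timer wraps the whole request, short answers will understate generation speed; longer completions give a more representative number.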
To teach the model to decline off-topic questions, add training examples like this:
{
"messages": [
{"role": "system", "content": "You are a helpful customer support assistant for TAIKAI..."},
{"role": "user", "content": "What's the weather like today?"},
{"role": "assistant", "content": "I'm TAIKAI's support assistant, so I can only help with questions about our hackathon and open innovation platform. Is there anything about TAIKAI I can help you with?"}
]
}
Use q4_K_M quantization if you need a smaller model footprint
Increase -c (context window) if conversations run longer
Add the --cont-batching flag if you’re handling multiple concurrent requests
Here’s what we built, end to end:
| Step | What | Tool |
|---|---|---|
| 1 | Defined FAQ knowledge base (196 TAIKAI FAQs) | JSON file |
| 2 | Generated synthetic training data (~2,156 examples) | OpenRouter API |
| 3 | Downloaded base model | Hugging Face Hub |
| 4 | Fine-tuned Qwen3-4B with LoRA | PyTorch + Transformers + PEFT + TRL |
| 5 | Merged LoRA weights & exported to GGUF | PEFT + llama.cpp converter |
| 6 | Served the model locally | llama.cpp with Metal acceleration |
Total cost: roughly $5-10 in API calls for synthetic data generation, and 2-4 hours of training time on a MacBook Pro. The result is a fast, private, fully offline customer support model running on your laptop, trained on real TAIKAI FAQ data covering everything from account registration to blockchain token withdrawals.
The full code is at github.com/layerx-labs/fine-tuning-sllm-customer-support.
SFT teaches the model what to say by imitating training examples. But it doesn't teach it how to say it well. The model might hallucinate details, give overly verbose answers, use inconsistent formatting, or confidently answer questions it shouldn't.
In Part 2: Improving Precision with Reinforcement Learning, we'll take the SFT model from this tutorial and refine it using GRPO (Group Relative Policy Optimization), the same RL algorithm used by DeepSeek-R1. We'll define custom reward functions that score the model's outputs.
The model then learns to maximize those rewards through trial and error, generating multiple answers per question and reinforcing the ones that score best. All of this still runs on the same MacBook Pro.
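To make the idea concrete ahead of Part 2, here is a toy sketch of the two ingredients just described: a reward function that scores an answer, and the group-relative step that compares each sampled answer with the others generated for the same question. The length-based reward and both function names are invented for illustration and are not the reward functions Part 2 will use.

```python
from statistics import mean, pstdev


def length_reward(answer, target_words=60):
    """Toy reward: 1.0 at or under the target length, decaying above it."""
    words = len(answer.split())
    return 1.0 if words <= target_words else target_words / words


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled answers.

    Answers scoring above the group mean get a positive advantage and are
    reinforced; those below the mean get a negative one.
    """
    center = mean(rewards)
    spread = pstdev(rewards)
    return [(r - center) / (spread + eps) for r in rewards]


if __name__ == "__main__":
    answers = ["Go to /signup and pick a method.", "word " * 120, "word " * 90]
    rewards = [length_reward(a) for a in answers]
    for reward, advantage in zip(rewards, group_advantages(rewards)):
        print(f"reward={reward:.2f} advantage={advantage:+.2f}")
```

Scoring answers relative to their own group is what lets this style of training work without a separate learned value model.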
Built by LayerX, an AI studio that helps companies build intelligent workflows and get more done with AI. If you’re looking to integrate custom LLMs, automate support, or build AI-powered products, get in touch at layerx.xyz.