Polly — cyberpunk pirate parrot perched on a GPU

Behind the Build

Meet Polly

A fine-tuned LLM running entirely in your browser

Polly is the AI assistant that lives on this website. She's a cyberpunk pirate parrot who knows the books inside out — and she runs 100% on your device. No servers, no API calls, no data leaving your machine.

0.5B parameters
4-bit quantized
0 servers

Why Build a Browser-Only Chatbot?

A book about AI agents deserves an AI assistant. But we didn't want the usual SaaS chatbot with API keys, rate limits, and your questions going to someone else's server.

Polly guarding private data
🔒 Privacy First

Your conversations never leave your device. No telemetry, no logging, no third-party APIs.

No Backend Needed

The entire model downloads once and runs via WebGPU in your browser. Works offline after first load.

💸 Zero Running Cost

No inference servers, no GPU bills, no API fees. The cost of serving Polly is exactly $0/month.

How Polly Was Built

From fine-tuning to browser deployment in a weekend. Here's the full pipeline.

Polly being assembled in a cyberpunk workshop
1. Training Data

We created 1,113 training examples from the books — Q&A pairs, chapter summaries, and conversational exchanges — all in Polly's pirate-parrot voice. The data was formatted as chat conversations with system, user, and assistant turns.
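Each training example follows the standard chat schema with system, user, and assistant turns. A minimal sketch of one record — the persona prompt and book content here are illustrative assumptions, not the actual training data:

```python
import json

# Hypothetical Polly training record; the system prompt and the Q&A
# content are made-up placeholders for illustration.
example = {
    "messages": [
        {"role": "system",
         "content": "You are Polly, a cyberpunk pirate parrot who knows "
                    "The Agentic Crew books inside out."},
        {"role": "user", "content": "What do the books say about AI agents?"},
        {"role": "assistant",
         "content": "Arr! The books chart the waters of agentic AI, matey."},
    ]
}

# Each record becomes one line of a JSONL file, the usual fine-tuning input.
line = json.dumps(example)
print(line)
```

One such line per example, 1,113 lines total, is all the fine-tuning step needs to see.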

2. Fine-Tuning with Unsloth

We fine-tuned Qwen2.5-0.5B-Instruct using Unsloth with LoRA (rank 128) on an NVIDIA RTX PRO 6000 Blackwell GPU. Training took about 2.5 minutes — fast enough to iterate on the training data and personality multiple times.

3. ONNX Export

The fine-tuned model was exported to ONNX format using Hugging Face's Optimum library, then quantized to 4-bit precision using ONNX Runtime's MatMul N-bits quantizer. This shrinks the model from 1.9 GB (FP32) down to a manageable browser download.
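The size arithmetic behind that shrink is simple: weight bytes scale linearly with bits per parameter. A rough estimate for a 0.5B-parameter model (real ONNX files carry extra overhead — embeddings often stay at higher precision, and each quantized block stores scales and zero-points — so these are lower bounds):

```python
# Rough weight-size estimate at different precisions for 0.5B parameters.
PARAMS = 0.5e9

def size_gb(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / 1e9

print(f"FP32:  {size_gb(32):.2f} GB")  # ~2 GB, in line with the 1.9 GB above
print(f"4-bit: {size_gb(4):.2f} GB")   # ~0.25 GB of raw weights
```

An 8x reduction in raw weight size is what turns an unusable 2 GB download into something a browser can reasonably fetch and cache.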

4. Browser Inference via WebGPU

In the browser, Transformers.js loads the quantized ONNX model and runs inference on your GPU via WebGPU. Tokens stream in real time using a TextStreamer callback — the same approach used by Hugging Face's own WebGPU demos.
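The browser code itself is JavaScript, but the streaming pattern is language-agnostic: instead of returning one final string, the generate loop hands each decoded chunk to a callback that updates the UI. A minimal stand-in sketch of that pattern (this mimics the shape of a streamer callback; it is not the Transformers.js API):

```python
# Minimal illustration of the streamer-callback pattern: each decoded
# chunk is forwarded to a UI callback as soon as it is available.
class ChunkStreamer:
    """Collects decoded text chunks and forwards them to a callback."""

    def __init__(self, on_text):
        self.on_text = on_text

    def put(self, chunk: str) -> None:
        self.on_text(chunk)  # e.g. append the chunk to the chat bubble

shown = []
streamer = ChunkStreamer(shown.append)
for token in ["Ahoy", ",", " matey", "!"]:  # stand-in for generated tokens
    streamer.put(token)
print("".join(shown))  # → Ahoy, matey!
```

The payoff is perceived latency: the first words appear as soon as the first tokens decode, rather than after the whole reply is generated.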

AI brain running inside a browser

The Stack

Base Model Qwen2.5-0.5B-Instruct
Fine-Tuning Unsloth + LoRA (r=128)
Training Data 1,113 custom examples
Export ONNX via Optimum
Quantization 4-bit (MatMulNBits)
Browser Runtime Transformers.js + WebGPU
Hosting HuggingFace Hub (model files)
Website Astro (static site)

What We Learned

WASM is Too Slow

Our first approach used llama.cpp compiled to WebAssembly (wllama). It worked, but inference was painfully slow — multiple seconds per token. WebGPU was the breakthrough that made it feel instant.

Model Architecture Matters

We initially trained on Qwen3.5-0.8B, but it wasn't supported by browser runtimes yet. Switching to Qwen2.5-0.5B — a well-supported architecture — solved compatibility issues across the entire pipeline.

Quantization Is Essential

Running FP32 in the browser is a non-starter — even with WebGPU, it's too slow and too large to download. 4-bit quantization made both the download size and inference speed practical.

Small Models Can Have Personality

At just 0.5B parameters, Polly isn't going to write your PhD thesis. But with focused fine-tuning on the book content and a clear persona, she gives genuinely useful answers about the material — with personality.

Try Polly

Click the parrot icon in the bottom-right corner to chat with Polly. She'll download the model on first load (~700 MB, cached after that), then she's all yours — even offline.

Ask her about the books, AI agents, or just say "Ahoy!" to see her pirate side.
