Polly — cyberpunk pirate parrot perched on a GPU

Behind the Build

Meet Polly

A fine-tuned LLM running entirely in your browser

Polly is the AI assistant that lives on this website. She's a cyberpunk pirate parrot who knows the books inside out — and she runs 100% on your device. No servers, no API calls, no data leaving your machine.

0.5B parameters
4-bit quantized
0 servers

Why Build a Browser-Only Chatbot?

A book about AI agents deserves an AI assistant. But we didn't want the usual SaaS chatbot with API keys, rate limits, and your questions going to someone else's server.

Polly guarding private data
🔒 Privacy First

Your conversations never leave your device. No telemetry, no logging, no third-party APIs.

No Backend Needed

The entire model downloads once and runs via WebGPU in your browser. Works offline after first load.

💸 Zero Running Cost

No inference servers, no GPU bills, no API fees. The cost of serving Polly is exactly $0/month.

How Polly Was Built

From fine-tuning to browser deployment in a weekend. Here's the full pipeline.

Polly being assembled in a cyberpunk workshop
1. Training Data

We created 1,113 training examples from the books — Q&A pairs, chapter summaries, and conversational exchanges — all in Polly's pirate-parrot voice. The data was formatted as chat conversations with system, user, and assistant turns.
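Each training example follows the standard chat schema with system, user, and assistant turns. A minimal sketch of one record — the persona prompt and book content here are illustrative assumptions, not the actual training data:

```python
import json

# Hypothetical Polly training record; the system prompt and the Q&A
# content are made-up placeholders for illustration.
example = {
    "messages": [
        {"role": "system",
         "content": "You are Polly, a cyberpunk pirate parrot who knows "
                    "The Agentic Crew books inside out."},
        {"role": "user", "content": "What do the books say about AI agents?"},
        {"role": "assistant",
         "content": "Arr! The books chart the waters of agentic AI, matey."},
    ]
}

# Each record becomes one line of a JSONL file, the usual fine-tuning input.
line = json.dumps(example)
print(line)
```

One such line per example, 1,113 lines total, is all the fine-tuning step needs to see.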

2. Fine-Tuning with Unsloth

We fine-tuned Qwen2.5-0.5B-Instruct using Unsloth with LoRA (rank 128) on an NVIDIA RTX PRO 6000 Blackwell GPU. Training took about 2.5 minutes — fast enough to iterate on the training data and personality multiple times.

3. ONNX Export

The fine-tuned model was exported to ONNX format using Hugging Face's Optimum library, then quantized to 4-bit precision using ONNX Runtime's MatMul N-bits quantizer. This shrinks the model from 1.9 GB (FP32) down to a manageable browser download.
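The size arithmetic behind that shrink is simple: weight bytes scale linearly with bits per parameter. A rough estimate for a 0.5B-parameter model (real ONNX files carry extra overhead — embeddings often stay at higher precision, and each quantized block stores scales and zero-points — so these are lower bounds):

```python
# Rough weight-size estimate at different precisions for 0.5B parameters.
PARAMS = 0.5e9

def size_gb(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / 1e9

print(f"FP32:  {size_gb(32):.2f} GB")  # ~2 GB, in line with the 1.9 GB above
print(f"4-bit: {size_gb(4):.2f} GB")   # ~0.25 GB of raw weights
```

An 8x reduction in raw weight size is what turns an unusable 2 GB download into something a browser can reasonably fetch and cache.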

4. Browser Inference via WebGPU

In the browser, Transformers.js loads the quantized ONNX model and runs inference on your GPU via WebGPU. Tokens stream in real time using a TextStreamer callback — the same approach used by Hugging Face's own WebGPU demos.
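The browser code itself is JavaScript, but the streaming pattern is language-agnostic: instead of returning one final string, the generate loop hands each decoded chunk to a callback that updates the UI. A minimal stand-in sketch of that pattern (this mimics the shape of a streamer callback; it is not the Transformers.js API):

```python
# Minimal illustration of the streamer-callback pattern: each decoded
# chunk is forwarded to a UI callback as soon as it is available.
class ChunkStreamer:
    """Collects decoded text chunks and forwards them to a callback."""

    def __init__(self, on_text):
        self.on_text = on_text

    def put(self, chunk: str) -> None:
        self.on_text(chunk)  # e.g. append the chunk to the chat bubble

shown = []
streamer = ChunkStreamer(shown.append)
for token in ["Ahoy", ",", " matey", "!"]:  # stand-in for generated tokens
    streamer.put(token)
print("".join(shown))  # → Ahoy, matey!
```

The payoff is perceived latency: the first words appear as soon as the first tokens decode, rather than after the whole reply is generated.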

AI brain running inside a browser

The Stack

Base Model Qwen2.5-0.5B-Instruct
Fine-Tuning Unsloth + LoRA (r=128)
Training Data 1,113 custom examples
Export ONNX via Optimum
Quantization 4-bit (MatMulNBits)
Browser Runtime Transformers.js + WebGPU
Hosting HuggingFace Hub (model files)
Website Astro (static site)

What We Learned

WASM is Too Slow

Our first approach used llama.cpp compiled to WebAssembly (wllama). It worked, but inference was painfully slow — multiple seconds per token. WebGPU was the breakthrough that made it feel instant.

Model Architecture Matters

We initially trained on Qwen3.5-0.8B, but it wasn't supported by browser runtimes yet. Switching to Qwen2.5-0.5B — a well-supported architecture — solved compatibility issues across the entire pipeline.

Quantization Is Essential

Running FP32 in the browser is a non-starter — even with WebGPU, it's too slow and too large to download. 4-bit quantization made both the download size and inference speed practical.

Small Models Can Have Personality

At just 0.5B parameters, Polly isn't going to write your PhD thesis. But with focused fine-tuning on the book content and a clear persona, she gives genuinely useful answers about the material — with personality.

Try Polly

Click the parrot icon in the bottom-right corner to chat with Polly. She'll download the model on first load (~700 MB, cached after that), then she's all yours — even offline.

Ask her about the books, AI agents, or just say "Ahoy!" to see her pirate side.
