How It Works
TinyWhale runs a full language model directly in your browser using WebGPU acceleration. No servers, no cloud, no data collection.
Load the Model
An open-source LLM, quantized to 4-bit, is downloaded directly into your browser. Quantization keeps the entire download to roughly 500 MB, and the model files are cached locally, so subsequent visits load instantly.
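The loading step can be sketched with the Transformers.js pipeline API. The model id below is a placeholder for illustration, not necessarily the model TinyWhale ships:

```javascript
// Sketch: loading a 4-bit quantized model in the browser with Transformers.js.
// The model id is a placeholder; swap in the model your app actually uses.
async function loadModel() {
  // Bail out where WebGPU is unavailable (e.g. Node or older browsers).
  if (typeof navigator === 'undefined' || !('gpu' in navigator)) {
    return null;
  }
  const { pipeline } = await import('@huggingface/transformers');
  // Downloaded weights are cached by the browser, so later visits skip the download.
  return pipeline('text-generation', 'onnx-community/Qwen2.5-0.5B-Instruct', {
    device: 'webgpu', // run inference on the GPU
    dtype: 'q4',      // 4-bit quantized weights keep the download small
  });
}
```

The `device` and `dtype` options are what select WebGPU execution and the 4-bit weights; everything else is handled by the library.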
Chat Privately
All inference runs locally on your GPU via WebGPU. Your conversations never leave your device — there's no server, no API calls, no telemetry. Once loaded, it even works offline.
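A single local chat turn can be sketched as below. `generator` is assumed to be a loaded Transformers.js text-generation pipeline, and the output shape follows that library's chat-message format; note there is no network call anywhere in the function:

```javascript
// Sketch: one chat turn against a locally loaded text-generation pipeline.
// Everything stays on-device; no fetch, no API call.
async function chatTurn(generator, history, userMessage) {
  const messages = [...history, { role: 'user', content: userMessage }];
  const output = await generator(messages, { max_new_tokens: 256 });
  // Transformers.js returns the full conversation; the last entry is the reply.
  const reply = output[0].generated_text.at(-1);
  return { reply: reply.content, history: [...messages, reply] };
}
```

Because the returned `history` includes the assistant's reply, the caller can feed it straight back into the next turn.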
Customize & Explore
Adjust generation with temperature, top-p, top-k, and more. Upload images for vision tasks; our models support multimodal input. Experiment with different settings to find what works best for your task.
Technology Stack
Transformers.js
Hugging Face's library for running ML models in the browser, with an API modeled on the Python transformers library.
ONNX Runtime Web
Microsoft's cross-platform inference engine, optimized for WebGPU and WASM execution in browsers.
WebGPU
Next-generation GPU API for the web, enabling high-performance computation directly on your graphics hardware.
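Whether a visitor's browser can actually use this path is checked with the standard WebGPU feature test: the API is present only when `navigator.gpu` exists and can hand back an adapter for the hardware.

```javascript
// Standard WebGPU feature detection.
async function hasWebGPU() {
  if (typeof navigator === 'undefined' || !('gpu' in navigator)) {
    return false; // API not exposed (older browsers, non-browser runtimes)
  }
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null; // null when no suitable GPU is available
}
```

An app like this would typically run the check before downloading any model weights, and fall back to a WASM backend or an explanatory message when it returns false.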
Open Source LLMs
Compact, capable language models with vision support, quantized to 4-bit for efficient browser inference.
Ready to try it?
Start chatting with AI directly in your browser. No sign-up required, no data collected, completely free.