How It Works
TinyWhale runs a full language model directly in your browser using WebGPU acceleration. No servers, no cloud, no data collection.
Load the Model
An open-source LLM, quantized to 4-bit, is downloaded directly into your browser. Quantization keeps the entire download to roughly 500 MB, and the model files are cached locally, so subsequent visits load instantly.
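The loading step can be sketched with the Transformers.js pipeline API. The model id below is a placeholder for illustration, not necessarily the model TinyWhale ships:

```javascript
// Sketch: loading a 4-bit quantized model in the browser with Transformers.js.
// The model id is a placeholder; swap in the model your app actually uses.
async function loadModel() {
  // Bail out where WebGPU is unavailable (e.g. Node or older browsers).
  if (typeof navigator === 'undefined' || !('gpu' in navigator)) {
    return null;
  }
  const { pipeline } = await import('@huggingface/transformers');
  // Downloaded weights are cached by the browser, so later visits skip the download.
  return pipeline('text-generation', 'onnx-community/Qwen2.5-0.5B-Instruct', {
    device: 'webgpu', // run inference on the GPU
    dtype: 'q4',      // 4-bit quantized weights keep the download small
  });
}
```

The `device` and `dtype` options are what select WebGPU execution and the 4-bit weights; everything else is handled by the library.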
Chat Privately
All inference runs locally on your GPU via WebGPU. Your conversations never leave your device — there's no server, no API calls, no telemetry. Once loaded, it even works offline.
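A single local chat turn can be sketched as below. `generator` is assumed to be a loaded Transformers.js text-generation pipeline, and the output shape follows that library's chat-message format; note there is no network call anywhere in the function:

```javascript
// Sketch: one chat turn against a locally loaded text-generation pipeline.
// Everything stays on-device; no fetch, no API call.
async function chatTurn(generator, history, userMessage) {
  const messages = [...history, { role: 'user', content: userMessage }];
  const output = await generator(messages, { max_new_tokens: 256 });
  // Transformers.js returns the full conversation; the last entry is the reply.
  const reply = output[0].generated_text.at(-1);
  return { reply: reply.content, history: [...messages, reply] };
}
```

Because the returned `history` includes the assistant's reply, the caller can feed it straight back into the next turn.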
Customize & Explore
Adjust generation with temperature, top-p, top-k, and more. Upload images for vision tasks; our models support multimodal input. Experiment with different settings to find what works best for your task.
Technology Stack
Transformers.js
Hugging Face's library for running ML models in the browser, with an API modeled on the Python transformers library.
ONNX Runtime Web
Microsoft's cross-platform inference engine, optimized for WebGPU and WASM execution in browsers.
WebGPU
Next-generation GPU API for the web, enabling high-performance computation directly on your graphics hardware.
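Whether a visitor's browser can actually use this path is checked with the standard WebGPU feature test: the API is present only when `navigator.gpu` exists and can hand back an adapter for the hardware.

```javascript
// Standard WebGPU feature detection.
async function hasWebGPU() {
  if (typeof navigator === 'undefined' || !('gpu' in navigator)) {
    return false; // API not exposed (older browsers, non-browser runtimes)
  }
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null; // null when no suitable GPU is available
}
```

An app like this would typically run the check before downloading any model weights, and fall back to a WASM backend or an explanatory message when it returns false.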
Open Source LLMs
Compact, capable language models with vision support, quantized to 4-bit for efficient browser inference.
Ready to try it?
Start chatting with AI directly in your browser. No sign-up required, no data collected, completely free.