Handles smaller tasks with local inference and routes complex queries efficiently (see the sketch below).
Reduces cloud API reliance, lowering costs by up to 40%.
Works offline, ensuring instant responses.
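As a rough illustration of what such local/cloud routing can look like, here is a minimal sketch. The runLocally and callCloudAPI functions and the 512-token threshold are hypothetical placeholders for illustration, not Saiko's actual API.

```swift
import Foundation

// Toy routing heuristic: keep short, self-contained prompts on device and
// hand everything else to a cloud model. The threshold and both handler
// functions are illustrative placeholders, not a real Saiko API.
func runLocally(_ prompt: String) -> String {
    return "local answer to: \(prompt)"      // stand-in for on-device inference
}

func callCloudAPI(_ prompt: String) -> String {
    return "cloud answer to: \(prompt)"      // stand-in for a cloud completion request
}

func answer(_ prompt: String, maxLocalTokens: Int = 512) -> String {
    // Rough token estimate: about four characters per token for English text.
    let estimatedTokens = prompt.count / 4
    return estimatedTokens <= maxLocalTokens ? runLocally(prompt) : callCloudAPI(prompt)
}

print(answer("Summarize this note: buy milk, call Alex."))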
Tool Use
Reliable interactions between models and external tools, adhering strictly to developer-defined schemas.
Current problems
Small models struggle with structured outputs, making reliable tool use challenging.
Constrained decoding methods introduce unwanted bias into the generation process, often leading to nonsensical responses.
Together, these issues limit the viability of existing small models for on-device agentic pipelines (see the sketch below).
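To make the schema-adherence point concrete, here is a minimal sketch of strictly validating a model's tool call against a developer-defined schema. The GetWeatherCall type and its fields are hypothetical examples, not part of Saiko's API.

```swift
import Foundation

// A developer-defined tool schema, expressed as a Codable type.
// The tool name and fields are hypothetical examples.
struct GetWeatherCall: Codable {
    let tool: String       // must equal "get_weather"
    let city: String
    let unit: String       // "celsius" or "fahrenheit"
}

/// Strictly parse a model's tool-call output: anything that does not match
/// the schema is rejected instead of being passed to the tool.
func parseToolCall(_ modelOutput: String) -> GetWeatherCall? {
    guard let data = modelOutput.data(using: .utf8),
          let call = try? JSONDecoder().decode(GetWeatherCall.self, from: data),
          call.tool == "get_weather",
          ["celsius", "fahrenheit"].contains(call.unit)
    else { return nil }
    return call
}

// A well-formed call parses; a malformed one is rejected.
print(parseToolCall(#"{"tool":"get_weather","city":"Oslo","unit":"celsius"}"#) != nil) // true
print(parseToolCall(#"{"tool":"get_weather","city":"Oslo"}"#) != nil)                  // false
```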
With Saiko you get
Performance
Allowing for aggressive caching to minimize the time to first token in RAG-like workflows.
Current problems
Context-rich workflows like retrieval-augmented generation (RAG) are bottlenecked by prompt processing speed.
Traditional models process the entire prompt from scratch every time, wasting compute on repeated tokens.
Large KV-cache sizes limit the maximum length of the context window on mobile devices.
Generation speed is bottlenecked by the memory bandwidth of the device (see the back-of-envelope sketch below).
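A back-of-envelope calculation illustrates both bottlenecks. The model size, prefill speed, bandwidth, and prompt lengths below are assumed for illustration only, not measurements of any particular device or of Saiko.

```swift
import Foundation

// Illustrative numbers only: a ~3B-parameter model quantized to 4 bits,
// running on a phone-class accelerator. None of these figures are measured.
let modelBytes       = 3_000_000_000.0 * 0.5   // about 1.5 GB of weights at 4 bits/param
let memoryBandwidth  = 50_000_000_000.0        // assumed 50 GB/s memory bandwidth
let prefillTokPerSec = 500.0                   // assumed prompt-processing speed
let promptTokens     = 4_000.0                 // RAG prompt: system + retrieved docs + query
let cachedTokens     = 3_500.0                 // shared prefix already sitting in the KV cache

// Decode speed is roughly bounded by how fast the weights can be streamed
// from memory once per generated token.
let maxDecodeTokPerSec = memoryBandwidth / modelBytes                    // about 33 tok/s

// Time to first token with and without reusing the cached prefix.
let ttftColdSeconds = promptTokens / prefillTokPerSec                    // 8.0 s
let ttftWarmSeconds = (promptTokens - cachedTokens) / prefillTokPerSec   // 1.0 s

print("decode ceiling ≈ \(Int(maxDecodeTokPerSec)) tok/s")
print("TTFT without prefix cache ≈ \(ttftColdSeconds) s, with cached prefix ≈ \(ttftWarmSeconds) s")
```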
With Saiko you get
Hardware
Leaving no FLOPs on the table.
Current problems
Official quantized versions of open-source models are not optimized for mobile accelerators such as Apple Neural Engine, resulting in highly inefficient inference on mobile devices.
The closed nature of post-training methods and datasets makes efficient quantization challenging.
Most developers rely on crudely quantized model checkpoints, resulting in unnecessary sacrifices in speed and accuracy (see the sketch below).
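For intuition about what crude quantization costs, here is a minimal sketch contrasting a single per-tensor int8 scale with a per-row scale. This is a generic illustration of the effect, not Saiko's quantization scheme.

```swift
import Foundation

// Naive symmetric int8 quantization. A single per-tensor scale wastes the
// int8 range on rows with small magnitudes, which is the kind of accuracy
// loss crudely quantized checkpoints suffer from. Generic illustration only.
func quantizeInt8(_ weights: [Float], scale: Float) -> [Int8] {
    weights.map { w in Int8(clamping: Int((w / scale * 127).rounded())) }
}

func dequantize(_ values: [Int8], scale: Float) -> [Float] {
    values.map { Float($0) / 127 * scale }
}

let row: [Float] = [0.01, -0.02, 0.015, 0.005]          // a small-magnitude weight row
let tensorScale: Float = 2.0                            // scale dominated by larger rows elsewhere
let rowScale: Float = row.map { abs($0) }.max() ?? 1.0  // per-row scale fitted to this row

let crude = dequantize(quantizeInt8(row, scale: tensorScale), scale: tensorScale)
let finer = dequantize(quantizeInt8(row, scale: rowScale), scale: rowScale)

print("per-tensor scale:", crude)   // heavily distorted: only a few levels survive
print("per-row scale:   ", finer)   // close to the original values
```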
With Saiko you get
Search & RAG
Current problems
Models produce a single embedding per document, yielding inaccurate search, especially for keyword-dependent queries.
High-quality search requires a combination of an inverted index, embeddings, and a separate reranking model, as sketched below.
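The sketch below shows the kind of score fusion such a pipeline involves, combining a keyword-overlap score (standing in for an inverted index / BM25) with embedding similarity before handing candidates to a reranker. All names, weights, and scoring functions here are illustrative assumptions, not Saiko's retrieval stack.

```swift
import Foundation

// Toy hybrid retrieval: fuse a keyword-overlap score with an embedding
// cosine similarity, then pass the ranked candidates to a separate
// reranking model. Everything below is an illustrative placeholder.
struct Doc {
    let id: String
    let terms: Set<String>
    let embedding: [Double]
}

func keywordScore(query: Set<String>, doc: Set<String>) -> Double {
    Double(query.intersection(doc).count) / Double(max(query.count, 1))
}

func cosine(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).reduce(0.0) { $0 + $1.0 * $1.1 }
    let normA = a.reduce(0.0) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(0.0) { $0 + $1 * $1 }.squareRoot()
    return dot / (normA * normB)
}

func hybridSearch(queryTerms: Set<String>, queryEmbedding: [Double],
                  docs: [Doc], keywordWeight: Double = 0.5) -> [String] {
    docs.map { doc -> (id: String, score: Double) in
        let fused = keywordWeight * keywordScore(query: queryTerms, doc: doc.terms)
            + (1 - keywordWeight) * cosine(queryEmbedding, doc.embedding)
        return (id: doc.id, score: fused)
    }
    .sorted { $0.score > $1.score }
    .map { $0.id }   // candidate order handed to a reranking model
}

let docs = [
    Doc(id: "invoice-42", terms: ["invoice", "march", "total"], embedding: [0.9, 0.1]),
    Doc(id: "meeting-notes", terms: ["agenda", "notes"], embedding: [0.2, 0.8]),
]
print(hybridSearch(queryTerms: ["invoice", "total"], queryEmbedding: [0.8, 0.2], docs: docs))
```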
With Saiko you get