AI that runs locally. Faster, cheaper, and 100% private.

A family of small AI models, highly optimized for on-device tasks. Zero cloud dependency, instant response times.

Backed by

Scout Fund

Thomas Wolf, Co-founder

Mati Staniszewski, Co-founder and CEO

Save up to 40% on AI Costs with On-Device Inference

Saiko offers a family of small 0.3B, 0.5B, 1B, 3B, and 7B parameter models to fit your business goals.

Typical Agentic Server-Based Pipeline

Every user input triggers a cloud-based agent, sending 100% of the request to a central model.

The result: high cost and large, unpredictable latency.

Agentic Pipeline with On-Device AI

Local inference handles smaller tasks; complex queries are routed to the cloud only when needed (see the sketch below).

Reduces cloud API reliance, lowering costs by up to 40%.

Works offline, ensuring instant responses.
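
A minimal sketch of this routing pattern in Python; `run_local` and `run_cloud` are hypothetical stand-ins for an on-device Saiko model and a cloud LLM API, and the escalation protocol shown is one possible design, not a documented feature:

```python
def run_local(prompt: str) -> str:
    raise NotImplementedError  # stand-in for an on-device Saiko model call

def run_cloud(prompt: str) -> str:
    raise NotImplementedError  # stand-in for a cloud LLM API call

def answer(prompt: str) -> str:
    # Cheap local pass first: the small model either answers directly or
    # emits a sentinel asking for escalation (one possible protocol).
    draft = run_local(f"Answer the user, or reply ESCALATE if the task is beyond you:\n{prompt}")
    if draft.strip() == "ESCALATE":
        return run_cloud(prompt)  # only complex queries incur cloud cost
    return draft                  # handled on-device: fast, free, offline
```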

Tool Use

Reliably adhere to specified response schemas, ensuring 100% format compliance.

Reliable interactions between models and external tools, adhering strictly to developer-defined schemas.

Current problems

Small models struggle with structured outputs, making reliable tool-use challenging.

Constrained decoding methods introduce unwanted bias into the generation process, often leading to nonsensical responses.

Together, these issues limit the viability of existing small models for on-device agentic pipelines.

With Saiko you get

100% Schema Compliance

Outputs strictly follow the developer-defined format, leaving no room for formatting errors.
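
A sketch of what this looks like from the developer's side; `generate_constrained` is a hypothetical API standing in for a schema-constrained decoder. The point is the guarantee in the last line: the output always parses and always matches the schema.

```python
import json

def generate_constrained(prompt: str, schema: dict) -> str:
    # Stub: a real constrained decoder masks logits so that only
    # schema-valid token sequences can ever be produced.
    return '{"tool": "get_weather", "city": "Oslo", "unit": "celsius"}'

weather_call_schema = {
    "type": "object",
    "properties": {
        "tool": {"const": "get_weather"},
        "city": {"type": "string"},
        "unit": {"enum": ["celsius", "fahrenheit"]},
    },
    "required": ["tool", "city", "unit"],
}

raw = generate_constrained("What's the weather in Oslo?", weather_call_schema)
call = json.loads(raw)  # never raises: decoding was schema-constrained
```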

Constraint-Aware Reinforcement Learning

Saiko models are fine-tuned using constraint-aware reinforcement learning, ensuring that each model works flawlessly in conjunction with structured decoding methods.
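
The general shape of such a reward, as a sketch of the idea rather than Saiko's published training recipe: task reward is gated on whether a sampled output survives the same constraints enforced at inference time.

```python
import json
import jsonschema

def constrained_reward(output: str, schema: dict, task_score: float) -> float:
    # Gate the reward on the inference-time constraint: outputs that break
    # the schema get a hard penalty; valid outputs keep their task score.
    try:
        jsonschema.validate(json.loads(output), schema)
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return -1.0
    return task_score
```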

Performance

Designed for low latency with large context windows.

This allows aggressive caching to minimize time to first token in RAG-like workflows.

Current problems

Context-rich workflows like retrieval-augmented generation (RAG) are bottlenecked by the prompt processing speed.

Traditional models process the entire prompt from scratch every time, wasting compute on repeated tokens.

Large KV-cache sizes limit the maximum length of the context window on mobile devices.

Generation speed is bottlenecked by the memory bandwidth of the device.

With Saiko you get

Non-causal Attention

Lets you precompute model activations for frequently accessed documents, eliminating prompt-processing latency in RAG pipelines.
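
The caching pattern this enables, sketched below; `encode_document` and `generate_with_cache` are hypothetical names for "run the encoder once over the document" and "decode against the cached activations":

```python
doc_cache: dict[str, object] = {}

def encode_document(text: str) -> object:
    raise NotImplementedError  # one-time encoding pass over the document

def generate_with_cache(doc_state: object, question: str) -> str:
    raise NotImplementedError  # decode against precomputed activations

def answer_over_doc(doc_id: str, text: str, question: str) -> str:
    if doc_id not in doc_cache:
        doc_cache[doc_id] = encode_document(text)  # paid once per document
    # Repeat queries skip prompt processing entirely, so time to first
    # token no longer scales with document length.
    return generate_with_cache(doc_cache[doc_id], question)
```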

Small KV cache

Saiko models adopt architectural innovations pioneered by DeepSeek to significantly reduce KV-cache size, enabling efficient long-context inference on devices with limited memory.
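
To see why this matters on-device, here is a back-of-envelope calculation with illustrative model dimensions (not Saiko's actual ones): standard attention caches a key and a value vector per layer per token, which quickly exhausts a phone's memory at long context.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 24, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Standard attention stores one key and one value vector
    # per layer per token (hence the factor of 2).
    return n_tokens * 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# 32k tokens at fp16 with these illustrative dimensions: 3.0 GiB,
# which is why shrinking the cache is a prerequisite for long context on mobile.
print(kv_cache_bytes(32_768) / 2**30)
```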

Built-in Speculative Decoder

Bypasses the device's memory-bandwidth bottleneck by generating multiple tokens in parallel.
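
The draft-and-verify loop behind speculative decoding, as a sketch (the model objects and their `propose`/`verify` interfaces are hypothetical): a small draft model proposes several tokens cheaply, and the main model checks them all in a single memory-bound pass, so each pass can yield several tokens instead of one.

```python
def speculative_decode(tokens: list[int], draft_model, target_model,
                       k: int = 4, max_new: int = 128) -> list[int]:
    out = list(tokens)
    produced = 0
    while produced < max_new:
        draft = draft_model.propose(out, k)         # k cheap draft tokens
        accepted = target_model.verify(out, draft)  # one parallel pass checks all k;
        out.extend(accepted)                        # always accepts at least one token
        produced += len(accepted)
    return out
```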

Hardware

Optimized to take full advantage of mobile neural network accelerators, leaving no FLOPs on the table.

Current problems

Official quantized versions of open-source models are not optimized for mobile accelerators such as the Apple Neural Engine, resulting in highly inefficient inference on mobile devices.

The closed nature of post-training methods and datasets makes efficient quantization challenging.

Most developers rely on crudely-quantized model checkpoints, resulting in unnecessary sacrifices in speed and accuracy.

With Saiko you get

Hardware-Tailored Quantization

Separate model checkpoints for each device platform, using quantization schemes designed to most efficiently utilize mobile NPU chips.
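
In practice this looks like shipping one artifact per accelerator family rather than a single generic checkpoint. The checkpoint names and selection logic below are purely illustrative:

```python
import platform

# Illustrative only: one quantized artifact per accelerator family.
CHECKPOINTS = {
    "Darwin": "saiko-1b-ane-int4",  # hypothetical Apple Neural Engine build
    "Linux": "saiko-1b-npu-int4",   # hypothetical Android/embedded NPU build
}

def pick_checkpoint() -> str:
    # Fall back to a generic CPU build when no tailored artifact exists.
    return CHECKPOINTS.get(platform.system(), "saiko-1b-cpu-int8")
```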

Quantization-Aware Training

Models are fine-tuned separately for each family of devices using quantization-aware training methods, ensuring optimal accuracy for each platform.
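
The standard mechanism behind quantization-aware training, sketched in PyTorch (Saiko's exact scheme is not public; this is the textbook version): the forward pass sees fake-quantized weights so the loss accounts for rounding error, while the straight-through estimator lets gradients update the full-precision weights.

```python
import torch

def fake_quant_int4(w: torch.Tensor, scale: float) -> torch.Tensor:
    # Forward: weights behave as if rounded to int4 (range [-8, 7]).
    q = torch.clamp(torch.round(w / scale), -8, 7) * scale
    # Backward: straight-through estimator, so gradients flow to the
    # full-precision `w` as if the rounding were the identity function.
    return w + (q - w).detach()
```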

Search & RAG

Optimized retrieval and ranking models for on-device document retrieval.

Current problems

Models that produce a single embedding per document yield inaccurate search, especially for keyword-dependent queries.

High-quality search requires a combination of an inverted index, embeddings, and a separate reranking model.

With Saiko you get

Token-level embedding models

Inspired by state-of-the-art retrieval methods such as ColBERT.
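
ColBERT's late-interaction scoring is simple enough to state in a few lines. Below is the standard MaxSim formula over token embeddings (assumed L2-normalized), not Saiko-specific code:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    # query_emb: (n_query_tokens, dim); doc_emb: (n_doc_tokens, dim);
    # both L2-normalized, so dot products are cosine similarities.
    sims = query_emb @ doc_emb.T          # (n_query, n_doc) similarity matrix
    return float(sims.max(axis=1).sum())  # best doc token per query token
```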

Single model for retrieval and ranking

Allows for faster search and reduces the binary size.

No More Chunking

Eliminates the need for complex chunking logic.

Built for

Developers

AI-powered chatbots, virtual assistants, and automation.

Enterprise AI

Secure AI for finance, healthcare, and regulated industries.

Gaming

NPC dialogue, AI-generated gameplay, and interactive experiences.

IoT & Edge Devices
IoT & Edge Devices
IoT & Edge Devices

Smart assistants, industrial automation, and robotics.

Try our family of small 0.3B, 0.5B, 1B, 3B, and 7B parameter models, highly optimized for on-device tasks.

Zero cloud dependency, instant response times. Perfect for your business goals.

© 2025 Saiko by Mirai. All rights reserved.