Handles smaller tasks with local inference and routes complex queries efficiently (see the sketch below).
Reduces cloud API reliance, lowering costs by up to 40%.
Works offline, ensuring instant responses.
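As a rough illustration of what such local/cloud routing can look like, here is a minimal sketch. The runLocally and callCloudAPI functions and the 512-token threshold are hypothetical placeholders for illustration, not Saiko's actual API.

```swift
import Foundation

// Toy routing heuristic: keep short, self-contained prompts on device and
// hand everything else to a cloud model. The threshold and both handler
// functions are illustrative placeholders, not a real Saiko API.
func runLocally(_ prompt: String) -> String {
    return "local answer to: \(prompt)"      // stand-in for on-device inference
}

func callCloudAPI(_ prompt: String) -> String {
    return "cloud answer to: \(prompt)"      // stand-in for a cloud completion request
}

func answer(_ prompt: String, maxLocalTokens: Int = 512) -> String {
    // Rough token estimate: about four characters per token for English text.
    let estimatedTokens = prompt.count / 4
    return estimatedTokens <= maxLocalTokens ? runLocally(prompt) : callCloudAPI(prompt)
}

print(answer("Summarize this note: buy milk, call Alex."))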
Tool Use
Reliable interactions between models and external tools, adhering strictly to developer-defined schemas.
Current problems
Small models struggle with structured outputs, making reliable tool use challenging.
Constrained decoding methods introduce unwanted bias into the generation process, often leading to nonsensical responses.
Together, these issues limit the viability of existing small models for on-device agentic pipelines (see the sketch below).
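To make the schema-adherence point concrete, here is a minimal sketch of strictly validating a model's tool call against a developer-defined schema. The GetWeatherCall type and its fields are hypothetical examples, not part of Saiko's API.

```swift
import Foundation

// A developer-defined tool schema, expressed as a Codable type.
// The tool name and fields are hypothetical examples.
struct GetWeatherCall: Codable {
    let tool: String       // must equal "get_weather"
    let city: String
    let unit: String       // "celsius" or "fahrenheit"
}

/// Strictly parse a model's tool-call output: anything that does not match
/// the schema is rejected instead of being passed to the tool.
func parseToolCall(_ modelOutput: String) -> GetWeatherCall? {
    guard let data = modelOutput.data(using: .utf8),
          let call = try? JSONDecoder().decode(GetWeatherCall.self, from: data),
          call.tool == "get_weather",
          ["celsius", "fahrenheit"].contains(call.unit)
    else { return nil }
    return call
}

// A well-formed call parses; a malformed one is rejected.
print(parseToolCall(#"{"tool":"get_weather","city":"Oslo","unit":"celsius"}"#) != nil) // true
print(parseToolCall(#"{"tool":"get_weather","city":"Oslo"}"#) != nil)                  // false
```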
With Saiko you get
Performance
Allowing for aggressive caching to minimize the time to first token in RAG-like workflows.
Current problems
Context-rich workflows like retrieval-augmented generation (RAG) are bottlenecked by prompt processing speed.
Traditional models process the entire prompt from scratch every time, wasting compute on repeated tokens.
Large KV-cache sizes limit the maximum length of the context window on mobile devices.
Generation speed is bottlenecked by the memory bandwidth of the device (see the back-of-envelope sketch below).
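A back-of-envelope calculation illustrates both bottlenecks. The model size, prefill speed, bandwidth, and prompt lengths below are assumed for illustration only, not measurements of any particular device or of Saiko.

```swift
import Foundation

// Illustrative numbers only: a ~3B-parameter model quantized to 4 bits,
// running on a phone-class accelerator. None of these figures are measured.
let modelBytes       = 3_000_000_000.0 * 0.5   // about 1.5 GB of weights at 4 bits/param
let memoryBandwidth  = 50_000_000_000.0        // assumed 50 GB/s memory bandwidth
let prefillTokPerSec = 500.0                   // assumed prompt-processing speed
let promptTokens     = 4_000.0                 // RAG prompt: system + retrieved docs + query
let cachedTokens     = 3_500.0                 // shared prefix already sitting in the KV cache

// Decode speed is roughly bounded by how fast the weights can be streamed
// from memory once per generated token.
let maxDecodeTokPerSec = memoryBandwidth / modelBytes                    // about 33 tok/s

// Time to first token with and without reusing the cached prefix.
let ttftColdSeconds = promptTokens / prefillTokPerSec                    // 8.0 s
let ttftWarmSeconds = (promptTokens - cachedTokens) / prefillTokPerSec   // 1.0 s

print("decode ceiling ≈ \(Int(maxDecodeTokPerSec)) tok/s")
print("TTFT without prefix cache ≈ \(ttftColdSeconds) s, with cached prefix ≈ \(ttftWarmSeconds) s")
```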
With Saiko you get
Hardware
Leaving no FLOPs on the table.
Current problems
Official quantized versions of open-source models are not optimized for mobile accelerators such as Apple Neural Engine, resulting in highly inefficient inference on mobile devices.
The closed nature of post-training methods and datasets makes efficient quantization challenging.
Most developers rely on crudely quantized model checkpoints, resulting in unnecessary sacrifices in speed and accuracy (see the sketch below).
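For intuition about what crude quantization costs, here is a minimal sketch contrasting a single per-tensor int8 scale with a per-row scale. This is a generic illustration of the effect, not Saiko's quantization scheme.

```swift
import Foundation

// Naive symmetric int8 quantization. A single per-tensor scale wastes the
// int8 range on rows with small magnitudes, which is the kind of accuracy
// loss crudely quantized checkpoints suffer from. Generic illustration only.
func quantizeInt8(_ weights: [Float], scale: Float) -> [Int8] {
    weights.map { w in Int8(clamping: Int((w / scale * 127).rounded())) }
}

func dequantize(_ values: [Int8], scale: Float) -> [Float] {
    values.map { Float($0) / 127 * scale }
}

let row: [Float] = [0.01, -0.02, 0.015, 0.005]          // a small-magnitude weight row
let tensorScale: Float = 2.0                            // scale dominated by larger rows elsewhere
let rowScale: Float = row.map { abs($0) }.max() ?? 1.0  // per-row scale fitted to this row

let crude = dequantize(quantizeInt8(row, scale: tensorScale), scale: tensorScale)
let finer = dequantize(quantizeInt8(row, scale: rowScale), scale: rowScale)

print("per-tensor scale:", crude)   // heavily distorted: only a few levels survive
print("per-row scale:   ", finer)   // close to the original values
```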
With Saiko you get
Search & RAG
Current problems
Models produce a single embedding per document, yielding inaccurate search, especially for keyword-dependent queries.
High-quality search requires a combination of an inverted index, embeddings, and a separate reranking model, as sketched below.
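The sketch below shows the kind of score fusion such a pipeline involves, combining a keyword-overlap score (standing in for an inverted index / BM25) with embedding similarity before handing candidates to a reranker. All names, weights, and scoring functions here are illustrative assumptions, not Saiko's retrieval stack.

```swift
import Foundation

// Toy hybrid retrieval: fuse a keyword-overlap score with an embedding
// cosine similarity, then pass the ranked candidates to a separate
// reranking model. Everything below is an illustrative placeholder.
struct Doc {
    let id: String
    let terms: Set<String>
    let embedding: [Double]
}

func keywordScore(query: Set<String>, doc: Set<String>) -> Double {
    Double(query.intersection(doc).count) / Double(max(query.count, 1))
}

func cosine(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).reduce(0.0) { $0 + $1.0 * $1.1 }
    let normA = a.reduce(0.0) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(0.0) { $0 + $1 * $1 }.squareRoot()
    return dot / (normA * normB)
}

func hybridSearch(queryTerms: Set<String>, queryEmbedding: [Double],
                  docs: [Doc], keywordWeight: Double = 0.5) -> [String] {
    docs.map { doc -> (id: String, score: Double) in
        let fused = keywordWeight * keywordScore(query: queryTerms, doc: doc.terms)
            + (1 - keywordWeight) * cosine(queryEmbedding, doc.embedding)
        return (id: doc.id, score: fused)
    }
    .sorted { $0.score > $1.score }
    .map { $0.id }   // candidate order handed to a reranking model
}

let docs = [
    Doc(id: "invoice-42", terms: ["invoice", "march", "total"], embedding: [0.9, 0.1]),
    Doc(id: "meeting-notes", terms: ["agenda", "notes"], embedding: [0.2, 0.8]),
]
print(hybridSearch(queryTerms: ["invoice", "total"], queryEmbedding: [0.8, 0.2], docs: docs))
```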
With Saiko you get