Run inference on the Replit-3B code-instruct model on CPU
Stable Diffusion and Flux in pure C/C++
A serverless index & manager for OneDrive. Can be deployed to Heroku/Glitch/Vercel/Replit/SCF/FG/FC/CFC/PHP web hosting/VPS.
Python bindings for Transformer models implemented in C/C++ using the GGML library.
Run LLaMA and other large language models offline on iOS and macOS using the GGML library.
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need: run inference with any open-source language, speech-recognition, or multimodal model, whether in the cloud, on-premises, or on your laptop (see the client sketch after this list).
INT4/INT5/INT8 and FP16 inference on CPU for the RWKV language model
Calculate tokens/s & GPU memory requirements for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization (a back-of-envelope version of the memory arithmetic appears after this list).
Go manage your Ollama models (see the API sketch after this list).
Suno AI's Bark model in C/C++ for fast text-to-speech generation
Chat with your documents offline using AI.
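
For the Xinference entry above, a minimal sketch of what the one-line swap looks like, assuming the server exposes its OpenAI-compatible endpoint on localhost:9997 (Xinference's default port) and that a model is already deployed; the model name below is a placeholder, not something Xinference ships.

```python
# A minimal sketch of the "change one line" swap, assuming an
# OpenAI-compatible server on localhost:9997 (Xinference's default
# port). "my-local-llm" is a placeholder for a deployed model name.
from openai import OpenAI

# The single changed line: base_url now points at the local server
# instead of api.openai.com.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-local-llm",  # hypothetical model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```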
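For the tokens/s & memory calculator entry, the core memory estimate is simple arithmetic: weight memory is roughly parameter count times bytes per parameter. A rough sketch of that calculation, deliberately ignoring KV-cache and activation overhead:

```python
# Back-of-envelope weight-memory estimate per quantization level.
# INT5 is 5 bits per parameter, hence 0.625 bytes.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int5": 0.625, "int4": 0.5}

def weight_memory_gib(params_billions: float, dtype: str) -> float:
    """Approximate memory needed just to hold the model weights."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1024**3

# Example: a 7B-parameter model.
for dtype in ("fp16", "int8", "int4"):
    print(f"7B @ {dtype}: {weight_memory_gib(7, dtype):.1f} GiB")
# fp16 ≈ 13.0 GiB, int8 ≈ 6.5 GiB, int4 ≈ 3.3 GiB
```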
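For the Ollama model-manager entry, a sketch of the kind of call such a tool builds on: Ollama serves a local REST API on port 11434, and `GET /api/tags` returns the installed models. The field names below follow Ollama's documented response format; error handling is kept minimal.

```python
# List locally installed Ollama models via the daemon's REST API.
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=10)
resp.raise_for_status()

# Response shape: {"models": [{"name": ..., "size": ...}, ...]}
for model in resp.json().get("models", []):
    size_gib = model["size"] / 1024**3
    print(f'{model["name"]:40s} {size_gib:6.1f} GiB')
```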