Multi-provider AI middleware.
Cascade is a self-hosted REST middleware that routes AI prompts across multiple free providers — Gemini, Groq, Mistral, and Cerebras. It automatically cascades through a configurable array of models, escalating on complexity and falling back on failure, so your app always gets an answer.
Deploy it once on Render and call it from any browser, game, or app with a single POST /ask-ai request. No vendor lock-in — swap, add, or remove models by editing one array.
How it Works
Cascade maps a difficulty float (0.0–1.0) to a starting position in the MODELS array, then sweeps through models until one succeeds.
- The difficulty float is mapped to the nearest index in the MODELS array: 0.0 starts at the cheapest/fastest model, 1.0 at the most capable.
- If a model answers with state: "too_complex", Cascade climbs toward the most capable end of the array and tries the next unvisited model.
- On a 429 or auth error, Cascade rotates to the next API key before marking the model as failed (see the sketch below).
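The sweep can be pictured as a small loop. The following is an illustrative sketch only; tryModel and nextUnvisited are hypothetical helpers standing in for the real provider calls and index selection in cascade.js:

// Illustrative sketch of the sweep, not the actual cascade.js source.
// tryModel and nextUnvisited are hypothetical helpers.
async function sweep(prompt, difficulty) {
  const visited = new Set();
  let index = Math.round(difficulty * (MODELS.length - 1));
  while (visited.size < MODELS.length) {
    visited.add(index);
    const result = await tryModel(MODELS[index], prompt);
    if (result.ok && result.state !== 'too_complex') {
      return { ...result, answeredBy: { index, ...MODELS[index] } };
    }
    index = nextUnvisited(index, visited); // prefers the more capable end
  }
  throw new Error('All models exhausted'); // surfaced to clients as HTTP 503
}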
Quick Start
1. Fork and clone
Fork the repo on GitHub so you have your own copy to modify and deploy, then clone your fork:
git clone https://github.com/YOUR_USERNAME/cascade
cd cascade
npm install
2. Set up environment variables
Copy .env.example to .env and fill in your API keys:
cp .env.example .env
MY_APP_SECRET=your-secret-here
GEMINI_KEY=your-gemini-key
GROQ_KEY=your-groq-key
MISTRAL_KEY=your-mistral-key
CEREBRAS_KEY=your-cerebras-key
3. Run locally
npm start # Middleware listening on port 10000 at 0.0.0.0
4. Deploy to Render (free)
Push your fork to GitHub and create a new Web Service on Render pointing at your fork. Set the start command to npm start. Add your environment variables in the Render dashboard. Your endpoint will be at https://your-service.onrender.com.
Hit GET /wake to warm the server before your first prompt; free Render services spin down after a period of inactivity.
5. Make your first request
fetch("https://your-service.onrender.com/ask-ai", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
secret: "your-secret-here",
difficulty: 0.5,
prompt: "Return a JSON object with key 'status' and value 'ok'."
})
})
.then(r => r.json())
.then(data => console.log(data.package));
Endpoints
GET /wake
Health ping. Returns 200 "Full Sweep Online". Use this to warm the server after a cold start.
GET /status
Returns server status, uptime, model count, key pool sizes, rate limit config, and provider timeout:
{
"status": "ok",
"uptime": "0h 4m 21s",
"models": 10,
"keyPools": { "GEMINI": 1, "GROQ": 1, "MISTRAL": 1, "CEREBRAS": 1 },
"rateLimit": { "max": 30, "windowMs": 60000 },
"providerTimeoutMs": 15000
}
POST /ask-ai
Send a prompt through the cascade. Returns the first successful model response.
Request body
| Field | Type | Description |
|---|---|---|
| secret | string | Must match the MY_APP_SECRET env var. |
| difficulty | float 0.0–1.0 | Starting position in the model array. 0.0 = cheapest, 1.0 = most capable. |
| prompt | string | Your prompt. Models are instructed to return a JSON object with state and package fields. |
Response
{
"state": "complete",
"package": {
"package": { /* your data */ }
},
"answeredBy": {
"index": 1,
"provider": "groq",
"model": "llama-3.3-70b-versatile"
}
}
Error responses
| Status | Meaning |
|---|---|
| 403 | Invalid or missing secret. |
| 429 | Rate limit exceeded for your IP. |
| 503 | All models exhausted — no provider responded successfully. |
Model Array
MODELS is a plain array ordered from least to most capable. Difficulty 0.0 maps to index 0, difficulty 1.0 maps to the last entry. Add, remove, or reorder entries freely — the cascade adapts automatically.
const MODELS = [
  { provider: 'cerebras', model: 'llama3.1-8b' },             // 0.0
  { provider: 'groq',     model: 'llama-3.3-70b-versatile' }, // 0.2
  { provider: 'gemini',   model: 'gemini-2.0-flash' },        // 0.4
  { provider: 'mistral',  model: 'mistral-small-latest' },    // 0.6
  { provider: 'mistral',  model: 'mistral-large-latest' },    // 0.8
  { provider: 'gemini',   model: 'gemini-2.5-pro' },          // 1.0
];
Supported providers
| Provider | Key env var | API |
|---|---|---|
| gemini | GEMINI_KEY | Google Generative Language |
| groq | GROQ_KEY | OpenAI-compatible |
| mistral | MISTRAL_KEY | OpenAI-compatible |
| cerebras | CEREBRAS_KEY | OpenAI-compatible |
Difficulty
The difficulty float maps to a starting index in the MODELS array using:
index = Math.round(difficulty * (MODELS.length - 1))
This means difficulty always maps proportionally regardless of how many models you have. A 5-model array and a 20-model array both treat 0.5 as "start in the middle."
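With the six-model array above, for example, difficulty 0.5 gives Math.round(0.5 * 5) = 3, so the sweep starts at mistral-small-latest and escalates from there.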
| Difficulty | Use when |
|---|---|
| 0.0 – 0.2 | Simple tasks — classification, short answers, data formatting |
| 0.3 – 0.5 | Medium tasks — summaries, Q&A, moderate reasoning |
| 0.6 – 0.8 | Complex tasks — code generation, long-form content, analysis |
| 0.9 – 1.0 | Hard tasks — deep reasoning, multi-step problems, nuanced generation |
Key Rotation
Each provider supports multiple API keys. Separate them with | in the env var. Cascade loads them into a pool at startup and tries each one in order if the previous key hits a rate limit (429) or auth error (401/403).
# .env
GEMINI_KEY=keyA|keyB|keyC
GROQ_KEY=keyOne|keyTwo
On startup, the server logs how many keys were loaded per provider:
[Keys] GEMINI: 3 key(s) loaded
[Keys] GROQ: 2 key(s) loaded
[Keys] MISTRAL: 1 key(s) loaded
[Keys] CEREBRAS: 1 key(s) loaded
Use | as the separator, not - or ,. The pipe character is guaranteed never to appear in API keys from any supported provider, so splitting on it is safe.
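As an illustrative sketch (not the actual cascade.js source), loading the pools might look like:

// Split each provider's env var on '|' to build its key pool.
const keyPools = {};
for (const provider of ['GEMINI', 'GROQ', 'MISTRAL', 'CEREBRAS']) {
  const keys = (process.env[`${provider}_KEY`] || '').split('|').filter(Boolean);
  keyPools[provider] = keys;
  console.log(`[Keys] ${provider}: ${keys.length} key(s) loaded`);
}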
Origin Whitelist
The ALLOWED_ORIGINS array at the top of cascade.js controls which origins can call /ask-ai. Browser-based clients (including Godot Web Exports) always send a real Origin header that page scripts cannot override.
// Allow everyone (dev / testing)
const ALLOWED_ORIGINS = [];

// Restrict to your domain
const ALLOWED_ORIGINS = [
  'https://cascade.yzzy.online',
  'https://yourgame.itch.io'
];
Non-browser clients can forge an Origin header, so the secret field still provides a second layer of protection.
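A minimal sketch of such a check, assuming an Express app (the actual cascade.js implementation, and the 403 response shape, may differ):

// Reject requests whose Origin is not whitelisted.
// An empty ALLOWED_ORIGINS array allows everyone (dev / testing).
function checkOrigin(req, res, next) {
  const origin = req.get('Origin');
  if (ALLOWED_ORIGINS.length === 0 || ALLOWED_ORIGINS.includes(origin)) {
    return next();
  }
  res.status(403).json({ state: 'error', content: 'Origin not allowed.' });
}
app.use('/ask-ai', checkOrigin);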
Rate Limiting
Cascade uses express-rate-limit to limit requests per IP on the /ask-ai endpoint. Defaults to 30 requests per minute. Configurable via env vars.
| Env var | Default | Description |
|---|---|---|
| RATE_LIMIT_MAX | 30 | Max requests per window per IP |
| RATE_LIMIT_WINDOW_MS | 60000 | Window size in milliseconds |
When the limit is exceeded, the server returns:
429 { "state": "error", "content": "Too many requests, please slow down." }
Environment Variables
| Variable | Required | Description |
|---|---|---|
| MY_APP_SECRET | Yes | Secret key required on every /ask-ai request |
| GEMINI_KEY | If using Gemini | Google AI API key(s), pipe-separated |
| GROQ_KEY | If using Groq | Groq API key(s), pipe-separated |
| MISTRAL_KEY | If using Mistral | Mistral API key(s), pipe-separated |
| CEREBRAS_KEY | If using Cerebras | Cerebras API key(s), pipe-separated |
| PORT | No | Server port. Default: 10000. Render sets this automatically. |
| PROVIDER_TIMEOUT_MS | No | Per-request timeout in ms. Default: 15000. |
| RATE_LIMIT_MAX | No | Max requests per window per IP. Default: 30. |
| RATE_LIMIT_WINDOW_MS | No | Rate limit window in ms. Default: 60000. |
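Putting it all together, a complete .env with every optional knob set to its default might look like this (keys are placeholders):

MY_APP_SECRET=your-secret-here
GEMINI_KEY=keyA|keyB
GROQ_KEY=your-groq-key
MISTRAL_KEY=your-mistral-key
CEREBRAS_KEY=your-cerebras-key
PORT=10000
PROVIDER_TIMEOUT_MS=15000
RATE_LIMIT_MAX=30
RATE_LIMIT_WINDOW_MS=60000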
See it in Action
This assistant is built on top of Cascade — a simple chat interface that uses POST /ask-ai at difficulty 0.5 to answer questions about yzzy.online products. It demonstrates how easy it is to add AI to any browser app with a single fetch call.
Ask questions about NetClient setup, Apex Racer tuning, Bullet Game enemies, and more. Powered by Cascade running on Render.
Try the demo →

// The entire integration is one fetch call
const res = await fetch("https://your-cascade.onrender.com/ask-ai", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
secret: SECRET,
difficulty: 0.5,
prompt: SYSTEM_PROMPT + "\n\nUser: " + userMessage
})
});
const data = await res.json();
const reply = data.package.package.reply;
console.log("Answered by:", data.answeredBy.model);