// Cascade

Multi-provider AI middleware.

Node.js · Express · Self-hosted · Free Providers · Zero Lock-in

Cascade is a self-hosted REST middleware that routes AI prompts across multiple free providers — Gemini, Groq, Mistral, and Cerebras. It automatically cascades through a configurable array of models, escalating on complexity and falling back on failure, so your app always gets an answer.

Deploy it once on Render and call it from any browser, game, or app with a single POST /ask-ai request. No vendor lock-in — swap, add, or remove models by editing one array.

How it Works

Cascade maps a difficulty float (0.0–1.0) to a starting position in the MODELS array, then sweeps through models until one succeeds.

1. Difficulty → Index: The difficulty float is mapped to the nearest index in the MODELS array. 0.0 starts at the cheapest/fastest model; 1.0 starts at the most capable.
2. Call the model: Cascade calls the provider at that index with your prompt. Each provider is instructed to return a strict JSON response.
3. too_complex → climb: If the model returns state: "too_complex", Cascade climbs toward the most capable end of the array and tries the next unvisited model.
4. Error → fall back: On a network error, 429 rate limit, or 503, Cascade falls back toward the least capable end and tries the next unvisited model. Rate limits trigger a 1.5s cooldown.
5. Full sweep guarantee: A visited set tracks every attempted model. Cascade guarantees every model is tried exactly once before returning an error — no model is skipped.
Key rotation: Each provider supports multiple API keys separated by |. On a 429 or auth error, Cascade rotates to the next key before marking the model as failed.
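The sweep described above can be sketched in a few lines. This is a minimal illustration, not the actual cascade.js source; `callModel` and `nextUnvisited` are hypothetical helper names, and `callModel` stands in for the real provider calls (resolving to a `{ state, ... }` object or throwing on network/429/503 errors):

```javascript
// Find the next untried index, preferring the given direction (+1 climbs
// toward capable models, -1 falls back), then any remaining unvisited index
// so the full-sweep guarantee holds.
function nextUnvisited(from, dir, length, visited) {
  for (let i = from + dir; i >= 0 && i < length; i += dir) {
    if (!visited.has(i)) return i;
  }
  for (let i = 0; i < length; i++) {
    if (!visited.has(i)) return i;
  }
  return -1; // every model has been tried exactly once
}

async function sweep(models, difficulty, prompt, callModel) {
  const visited = new Set();
  // Step 1: map the difficulty float to a starting index.
  let index = Math.round(difficulty * (models.length - 1));
  while (index !== -1) {
    visited.add(index);
    try {
      // Step 2: call the provider at the current index.
      const result = await callModel(models[index], prompt);
      if (result.state !== "too_complex") {
        return { ...result, answeredBy: { index, ...models[index] } };
      }
      // Step 3: too_complex → climb toward the capable end.
      index = nextUnvisited(index, +1, models.length, visited);
    } catch {
      // Step 4: network error / 429 / 503 → fall back toward the cheap end.
      index = nextUnvisited(index, -1, models.length, visited);
    }
  }
  // Step 5: all models visited once, none succeeded.
  return { state: "error", content: "All models exhausted" };
}
```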

Quick Start

1. Fork and clone

Fork the repo on GitHub so you have your own copy to modify and deploy, then clone your fork:

git clone https://github.com/YOUR_USERNAME/cascade
cd cascade
npm install

2. Set up environment variables

Copy .env.example to .env and fill in your API keys:

cp .env.example .env
MY_APP_SECRET=your-secret-here
GEMINI_KEY=your-gemini-key
GROQ_KEY=your-groq-key
MISTRAL_KEY=your-mistral-key
CEREBRAS_KEY=your-cerebras-key

3. Run locally

npm start
# Middleware listening on port 10000 at 0.0.0.0

4. Deploy to Render (free)

Push your fork to GitHub and create a new Web Service on Render pointing at your fork. Set the start command to npm start. Add your environment variables in the Render dashboard. Your endpoint will be at https://your-service.onrender.com.

Cold starts: Render's free tier spins down after 15 minutes of inactivity. The first request after idle may take 30–60 seconds. Call /wake to warm the server before your first prompt.
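A small warm-up helper can hide the cold start from users. This is a hypothetical client-side sketch, not part of Cascade itself — it simply polls /wake until the server answers:

```javascript
// Poll GET /wake until the dyno is up (cold starts can take 30–60 seconds).
// `base` is your Render URL, e.g. "https://your-service.onrender.com".
async function warmUp(base, attempts = 12, delayMs = 5000) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(`${base}/wake`, { signal: AbortSignal.timeout(10000) });
      if (res.ok) return true; // server replied "Full Sweep Online"
    } catch {
      // Still booting (or a network hiccup) — wait and retry.
    }
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return false;
}
```

Call it once on page load, before the first /ask-ai request.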

5. Make your first request

fetch("https://your-service.onrender.com/ask-ai", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    secret: "your-secret-here",
    difficulty: 0.5,
    prompt: "Return a JSON object with key 'status' and value 'ok'."
  })
})
.then(r => r.json())
.then(data => console.log(data.package));

Endpoints

GET /wake

Health ping. Returns 200 "Full Sweep Online". Use this to warm the server after a cold start.

GET /health

Returns server status, uptime, model count, key pool sizes, rate limit config, and provider timeout.

{
  "status": "ok",
  "uptime": "0h 4m 21s",
  "models": 10,
  "keyPools": { "GEMINI": 1, "GROQ": 1, "MISTRAL": 1, "CEREBRAS": 1 },
  "rateLimit": { "max": 30, "windowMs": 60000 },
  "providerTimeoutMs": 15000
}
POST /ask-ai

Send a prompt through the cascade. Returns the first successful model response.

Request body

| Field | Type | Description |
| --- | --- | --- |
| secret | string | Must match MY_APP_SECRET env var. |
| difficulty | float 0.0–1.0 | Starting position in the model array. 0.0 = cheapest, 1.0 = most capable. |
| prompt | string | Your prompt. Models are instructed to return a JSON object with state and package fields. |

Response

{
  "state": "complete",
  "package": {
    "package": { /* your data */ }
  },
  "answeredBy": {
    "index": 1,
    "provider": "groq",
    "model": "llama-3.3-70b-versatile"
  }
}

Error responses

| Status | Meaning |
| --- | --- |
| 403 | Invalid or missing secret. |
| 429 | Rate limit exceeded for your IP. |
| 503 | All models exhausted — no provider responded successfully. |

Model Array

MODELS is a plain array ordered from least to most capable. Difficulty 0.0 maps to index 0, difficulty 1.0 maps to the last entry. Add, remove, or reorder entries freely — the cascade adapts automatically.

const MODELS = [
  { provider: 'cerebras', model: 'llama3.1-8b' },       // 0.00
  { provider: 'groq',     model: 'llama-3.3-70b-versatile' }, // 0.11
  { provider: 'gemini',   model: 'gemini-2.0-flash' },  // 0.33
  { provider: 'mistral',  model: 'mistral-small-latest' }, // 0.56
  { provider: 'mistral',  model: 'mistral-large-latest' }, // 0.78
  { provider: 'gemini',   model: 'gemini-2.5-pro' },    // 1.00
];

Supported providers

| Provider | Key env var | API |
| --- | --- | --- |
| gemini | GEMINI_KEY | Google Generative Language |
| groq | GROQ_KEY | OpenAI-compatible |
| mistral | MISTRAL_KEY | OpenAI-compatible |
| cerebras | CEREBRAS_KEY | OpenAI-compatible |
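Because three of the four providers expose an OpenAI-compatible API, a single request shape covers them all. The sketch below is illustrative, not the cascade.js implementation; `callOpenAICompatible` is a hypothetical helper, and the base URLs reflect each provider's public documentation at the time of writing:

```javascript
// Shared request shape for the OpenAI-compatible providers.
const BASE_URLS = {
  groq:     "https://api.groq.com/openai/v1",
  mistral:  "https://api.mistral.ai/v1",
  cerebras: "https://api.cerebras.ai/v1",
};

async function callOpenAICompatible(provider, model, apiKey, prompt) {
  const res = await fetch(`${BASE_URLS[provider]}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  if (!res.ok) throw new Error(`${provider} returned ${res.status}`);
  const data = await res.json();
  // Standard chat-completions response shape.
  return data.choices[0].message.content;
}
```

Gemini is the odd one out: it uses the Google Generative Language API, so it needs its own request/response adapter.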

Difficulty

The difficulty float maps to a starting index in the MODELS array using:

index = Math.round(difficulty * (MODELS.length - 1))

This means difficulty always maps proportionally regardless of how many models you have. A 5-model array and a 20-model array both treat 0.5 as "start in the middle."
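For the 6-entry MODELS array shown in the Model Array section, the formula works out as follows (note that JavaScript's Math.round rounds .5 upward):

```javascript
// Worked example of the difficulty → index mapping for a 6-model array.
const startIndex = (difficulty, length) => Math.round(difficulty * (length - 1));

startIndex(0.0, 6);  // 0 → cerebras llama3.1-8b
startIndex(0.5, 6);  // 3 → mistral-small-latest (0.5 * 5 = 2.5 rounds up)
startIndex(0.78, 6); // 4 → mistral-large-latest
startIndex(1.0, 6);  // 5 → gemini-2.5-pro
```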

| Difficulty | Use when |
| --- | --- |
| 0.0 – 0.2 | Simple tasks — classification, short answers, data formatting |
| 0.3 – 0.5 | Medium tasks — summaries, Q&A, moderate reasoning |
| 0.6 – 0.8 | Complex tasks — code generation, long-form content, analysis |
| 0.9 – 1.0 | Hard tasks — deep reasoning, multi-step problems, nuanced generation |

Key Rotation

Each provider supports multiple API keys. Separate them with | in the env var. Cascade loads them into a pool at startup and tries each one in order if the previous key hits a rate limit (429) or auth error (401/403).

# .env
GEMINI_KEY=keyA|keyB|keyC
GROQ_KEY=keyOne|keyTwo

On startup, the server logs how many keys were loaded per provider:

[Keys] GEMINI: 3 key(s) loaded
[Keys] GROQ: 2 key(s) loaded
[Keys] MISTRAL: 1 key(s) loaded
[Keys] CEREBRAS: 1 key(s) loaded
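The pool mechanics can be sketched as below. These are hypothetical helper names for illustration — cascade.js may structure its key handling differently:

```javascript
// Parse a pipe-separated env var into a key pool at startup.
function loadKeyPool(envValue) {
  return (envValue || "").split("|").map(k => k.trim()).filter(Boolean);
}

// Simple rotator: advance to the next key on a 429 or 401/403,
// until the pool is exhausted.
function makeRotator(keys) {
  let i = 0;
  return {
    current: () => keys[i],
    rotate: () => (++i < keys.length), // false once every key has been tried
  };
}
```

Only after the last key in the pool fails is the model itself marked as failed and the cascade moves on.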
Separator note: Use | as the separator — not - or ,. Keys from the supported providers can contain hyphens but not the pipe character, so splitting on | is safe.

Origin Whitelist

The ALLOWED_ORIGINS array at the top of cascade.js controls which origins can call /ask-ai. Browsers (including Godot Web Exports) set the Origin header automatically, and page scripts cannot forge it. Non-browser clients, however, can send any Origin they like, so the whitelist only protects against browser-based abuse.

// Allow everyone (dev / testing)
const ALLOWED_ORIGINS = [];

// Restrict to your domain
const ALLOWED_ORIGINS = [
  'https://cascade.yzzy.online',
  'https://yourgame.itch.io'
];
Note: An empty array disables origin checking entirely — useful during local development. The secret field still provides a second layer of protection.
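The check itself is simple Express middleware. This is a hypothetical sketch of the shape, not the cascade.js source:

```javascript
// Origin whitelist as Express middleware.
function originGuard(allowedOrigins) {
  return (req, res, next) => {
    // Empty array → origin checking disabled (dev mode).
    if (allowedOrigins.length === 0) return next();
    const origin = req.headers.origin;
    if (origin && allowedOrigins.includes(origin)) return next();
    res.status(403).json({ state: "error", content: "Origin not allowed" });
  };
}

// Usage: app.post("/ask-ai", originGuard(ALLOWED_ORIGINS), handler);
```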

Rate Limiting

Cascade uses express-rate-limit to limit requests per IP on the /ask-ai endpoint. Defaults to 30 requests per minute. Configurable via env vars.

| Env var | Default | Description |
| --- | --- | --- |
| RATE_LIMIT_MAX | 30 | Max requests per window per IP |
| RATE_LIMIT_WINDOW_MS | 60000 | Window size in milliseconds |

When the limit is exceeded, the server returns:

429 { "state": "error", "content": "Too many requests, please slow down." }
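The two env vars map directly onto express-rate-limit options. A sketch of the configuration, with option names following the library's documented API (the exact wiring in cascade.js may differ):

```javascript
// Options for express-rate-limit, with defaults matching the table above.
const limiterOptions = {
  windowMs: Number(process.env.RATE_LIMIT_WINDOW_MS) || 60000, // 1 minute
  max: Number(process.env.RATE_LIMIT_MAX) || 30,               // per IP per window
  message: { state: "error", content: "Too many requests, please slow down." },
};

// const limiter = require("express-rate-limit")(limiterOptions);
// app.post("/ask-ai", limiter, handler);
```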

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| MY_APP_SECRET | Yes | Secret key required on every /ask-ai request |
| GEMINI_KEY | If using Gemini | Google AI API key(s), pipe-separated |
| GROQ_KEY | If using Groq | Groq API key(s), pipe-separated |
| MISTRAL_KEY | If using Mistral | Mistral API key(s), pipe-separated |
| CEREBRAS_KEY | If using Cerebras | Cerebras API key(s), pipe-separated |
| PORT | No | Server port. Default: 10000. Render sets this automatically. |
| PROVIDER_TIMEOUT_MS | No | Per-request timeout in ms. Default: 15000. |
| RATE_LIMIT_MAX | No | Max requests per window per IP. Default: 30. |
| RATE_LIMIT_WINDOW_MS | No | Rate limit window in ms. Default: 60000. |

See it in Action

This assistant is built on top of Cascade — a simple chat interface that uses POST /ask-ai at difficulty 0.5 to answer questions about yzzy.online products. It demonstrates how easy it is to add AI to any browser app with a single fetch call.

yzzy.online AI Assistant

Ask questions about NetClient setup, Apex Racer tuning, Bullet Game enemies, and more. Powered by Cascade running on Render.

// The entire integration is one fetch call
const res = await fetch("https://your-cascade.onrender.com/ask-ai", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    secret: SECRET,
    difficulty: 0.5,
    prompt: SYSTEM_PROMPT + "\n\nUser: " + userMessage
  })
});

const data = await res.json();
const reply = data.package.package.reply;
console.log("Answered by:", data.answeredBy.model);