Multi-provider AI middleware.
Cascade is a self-hosted REST middleware that routes AI prompts across multiple free providers — Gemini, Groq, Mistral, and Cerebras. It automatically cascades through a configurable array of models, escalating on complexity and falling back on failure, so your app always gets an answer.
Deploy it once on Render and call it from any browser, game, or app with a single POST /ask-ai request. No vendor lock-in — swap, add, or remove models by editing one array.
How it Works
Cascade maps a difficulty float (0.0–1.0) to a starting position in the MODELS array, then sweeps through models until one succeeds.
- The difficulty float is mapped to the nearest index in the MODELS array: 0.0 starts at the cheapest/fastest model, 1.0 at the most capable.
- If a model answers with state: "too_complex", Cascade climbs toward the most capable end of the array and tries the next unvisited model.
- On a 429 or auth error, Cascade rotates to the next API key before marking the model as failed (see the sketch below).
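The sweep can be pictured as a small loop. The following is an illustrative sketch only; tryModel and nextUnvisited are hypothetical helpers standing in for the real provider calls and index selection in cascade.js:

// Illustrative sketch of the sweep, not the actual cascade.js source.
// tryModel and nextUnvisited are hypothetical helpers.
async function sweep(prompt, difficulty) {
  const visited = new Set();
  let index = Math.round(difficulty * (MODELS.length - 1));
  while (visited.size < MODELS.length) {
    visited.add(index);
    const result = await tryModel(MODELS[index], prompt);
    if (result.ok && result.state !== 'too_complex') {
      return { ...result, answeredBy: { index, ...MODELS[index] } };
    }
    index = nextUnvisited(index, visited); // prefers the more capable end
  }
  throw new Error('All models exhausted'); // surfaced to clients as HTTP 503
}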
Quick Start
1. Fork and clone
Fork the repo on GitHub so you have your own copy to modify and deploy, then clone your fork:
git clone https://github.com/YOUR_USERNAME/cascade
cd cascade
npm install
2. Set up environment variables
Copy .env.example to .env and fill in your API keys:
cp .env.example .env
MY_APP_SECRET=your-secret-here
GEMINI_KEY=your-gemini-key
GROQ_KEY=your-groq-key
MISTRAL_KEY=your-mistral-key
CEREBRAS_KEY=your-cerebras-key
3. Run locally
npm start # Middleware listening on port 10000 at 0.0.0.0
4. Deploy to Render (free)
Push your fork to GitHub and create a new Web Service on Render pointing at your fork. Set the start command to npm start. Add your environment variables in the Render dashboard. Your endpoint will be at https://your-service.onrender.com.
Hit GET /wake to warm the server before your first prompt; free Render services spin down after a period of inactivity.
5. Make your first request
fetch("https://your-service.onrender.com/ask-ai", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
secret: "your-secret-here",
difficulty: 0.5,
prompt: "Return a JSON object with key 'status' and value 'ok'."
})
})
.then(r => r.json())
.then(data => console.log(data.package));
Endpoints
GET /wake
Health ping. Returns 200 "Full Sweep Online". Use this to warm the server after a cold start.
GET /status
Returns server status, uptime, model count, key pool sizes, rate limit config, and provider timeout:
{
"status": "ok",
"uptime": "0h 4m 21s",
"models": 10,
"keyPools": { "GEMINI": 1, "GROQ": 1, "MISTRAL": 1, "CEREBRAS": 1 },
"rateLimit": { "max": 30, "windowMs": 60000 },
"providerTimeoutMs": 15000
}
POST /ask-ai
Send a prompt through the cascade. Returns the first successful model response.
Request body
| Field | Type | Description |
|---|---|---|
| secret | string | Must match the MY_APP_SECRET env var. |
| difficulty | float 0.0–1.0 | Starting position in the model array. 0.0 = cheapest, 1.0 = most capable. |
| prompt | string | Your prompt. Models are instructed to return a JSON object with state and package fields. |
Response
{
"state": "complete",
"package": {
"package": { /* your data */ }
},
"answeredBy": {
"index": 1,
"provider": "groq",
"model": "llama-3.3-70b-versatile"
}
}
Error responses
| Status | Meaning |
|---|---|
| 403 | Invalid or missing secret. |
| 429 | Rate limit exceeded for your IP. |
| 503 | All models exhausted — no provider responded successfully. |
Model Array
MODELS is a plain array ordered from least to most capable. Difficulty 0.0 maps to index 0, difficulty 1.0 maps to the last entry. Add, remove, or reorder entries freely — the cascade adapts automatically.
const MODELS = [
  { provider: 'cerebras', model: 'llama3.1-8b' },             // 0.0
  { provider: 'groq',     model: 'llama-3.3-70b-versatile' }, // 0.2
  { provider: 'gemini',   model: 'gemini-2.0-flash' },        // 0.4
  { provider: 'mistral',  model: 'mistral-small-latest' },    // 0.6
  { provider: 'mistral',  model: 'mistral-large-latest' },    // 0.8
  { provider: 'gemini',   model: 'gemini-2.5-pro' },          // 1.0
];
Supported providers
| Provider | Key env var | API |
|---|---|---|
| gemini | GEMINI_KEY | Google Generative Language |
| groq | GROQ_KEY | OpenAI-compatible |
| mistral | MISTRAL_KEY | OpenAI-compatible |
| cerebras | CEREBRAS_KEY | OpenAI-compatible |
Difficulty
The difficulty float maps to a starting index in the MODELS array using:
index = Math.round(difficulty * (MODELS.length - 1))
This means difficulty always maps proportionally regardless of how many models you have. A 5-model array and a 20-model array both treat 0.5 as "start in the middle."
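With the six-model array above, for example, difficulty 0.5 gives Math.round(0.5 * 5) = 3, so the sweep starts at mistral-small-latest and escalates from there.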
| Difficulty | Use when |
|---|---|
| 0.0 – 0.2 | Simple tasks — classification, short answers, data formatting |
| 0.3 – 0.5 | Medium tasks — summaries, Q&A, moderate reasoning |
| 0.6 – 0.8 | Complex tasks — code generation, long-form content, analysis |
| 0.9 – 1.0 | Hard tasks — deep reasoning, multi-step problems, nuanced generation |
Key Rotation
Each provider supports multiple API keys. Separate them with | in the env var. Cascade loads them into a pool at startup and tries each one in order if the previous key hits a rate limit (429) or auth error (401/403).
# .env
GEMINI_KEY=keyA|keyB|keyC
GROQ_KEY=keyOne|keyTwo
On startup, the server logs how many keys were loaded per provider:
[Keys] GEMINI: 3 key(s) loaded
[Keys] GROQ: 2 key(s) loaded
[Keys] MISTRAL: 1 key(s) loaded
[Keys] CEREBRAS: 1 key(s) loaded
Use | as the separator, not - or ,. The pipe character is guaranteed never to appear in API keys from any supported provider, so splitting on it is safe.
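As an illustrative sketch (not the actual cascade.js source), loading the pools might look like:

// Split each provider's env var on '|' to build its key pool.
const keyPools = {};
for (const provider of ['GEMINI', 'GROQ', 'MISTRAL', 'CEREBRAS']) {
  const keys = (process.env[`${provider}_KEY`] || '').split('|').filter(Boolean);
  keyPools[provider] = keys;
  console.log(`[Keys] ${provider}: ${keys.length} key(s) loaded`);
}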
Origin Whitelist
The ALLOWED_ORIGINS array at the top of cascade.js controls which origins can call /ask-ai. Browser-based clients (including Godot Web Exports) always send a real Origin header that page scripts cannot override.
// Allow everyone (dev / testing)
const ALLOWED_ORIGINS = [];

// Restrict to your domain
const ALLOWED_ORIGINS = [
  'https://cascade.yzzy.online',
  'https://yourgame.itch.io'
];
Non-browser clients can forge an Origin header, so the secret field still provides a second layer of protection.
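A minimal sketch of such a check, assuming an Express app (the actual cascade.js implementation, and the 403 response shape, may differ):

// Reject requests whose Origin is not whitelisted.
// An empty ALLOWED_ORIGINS array allows everyone (dev / testing).
function checkOrigin(req, res, next) {
  const origin = req.get('Origin');
  if (ALLOWED_ORIGINS.length === 0 || ALLOWED_ORIGINS.includes(origin)) {
    return next();
  }
  res.status(403).json({ state: 'error', content: 'Origin not allowed.' });
}
app.use('/ask-ai', checkOrigin);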
Rate Limiting
Cascade uses express-rate-limit to limit requests per IP on the /ask-ai endpoint. Defaults to 30 requests per minute. Configurable via env vars.
| Env var | Default | Description |
|---|---|---|
| RATE_LIMIT_MAX | 30 | Max requests per window per IP |
| RATE_LIMIT_WINDOW_MS | 60000 | Window size in milliseconds |
When the limit is exceeded, the server returns:
429 { "state": "error", "content": "Too many requests, please slow down." }
Environment Variables
| Variable | Required | Description |
|---|---|---|
| MY_APP_SECRET | Yes | Secret key required on every /ask-ai request |
| GEMINI_KEY | If using Gemini | Google AI API key(s), pipe-separated |
| GROQ_KEY | If using Groq | Groq API key(s), pipe-separated |
| MISTRAL_KEY | If using Mistral | Mistral API key(s), pipe-separated |
| CEREBRAS_KEY | If using Cerebras | Cerebras API key(s), pipe-separated |
| PORT | No | Server port. Default: 10000. Render sets this automatically. |
| PROVIDER_TIMEOUT_MS | No | Per-request timeout in ms. Default: 15000. |
| RATE_LIMIT_MAX | No | Max requests per window per IP. Default: 30. |
| RATE_LIMIT_WINDOW_MS | No | Rate limit window in ms. Default: 60000. |
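Putting it all together, a complete .env with every optional knob set to its default might look like this (keys are placeholders):

MY_APP_SECRET=your-secret-here
GEMINI_KEY=keyA|keyB
GROQ_KEY=your-groq-key
MISTRAL_KEY=your-mistral-key
CEREBRAS_KEY=your-cerebras-key
PORT=10000
PROVIDER_TIMEOUT_MS=15000
RATE_LIMIT_MAX=30
RATE_LIMIT_WINDOW_MS=60000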
See it in Action
This assistant is built on top of Cascade — a simple chat interface that uses POST /ask-ai at difficulty 0.5 to answer questions about yzzy.online products. It demonstrates how easy it is to add AI to any browser app with a single fetch call.
Ask questions about NetClient setup, Apex Racer tuning, Bullet Game enemies, and more. Powered by Cascade running on Render.
Try the demo →

// The entire integration is one fetch call
const res = await fetch("https://your-cascade.onrender.com/ask-ai", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
secret: SECRET,
difficulty: 0.5,
prompt: SYSTEM_PROMPT + "\n\nUser: " + userMessage
})
});
const data = await res.json();
const reply = data.package.package.reply;
console.log("Answered by:", data.answeredBy.model);