Hippodrome Component
Required Workflow
Before implementation, re-read and follow the repo-wide Task Delivery Protocol in @/AGENTS.md for this task.
Development environment orchestrator for running all Cloud Control Plane components in one place for system testing. Starts all control plane services with proper configuration and service discovery.
Quick Start
```shell
# Explain the CLI
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- --help

# However you run Hippodrome, alias it to `hd`, which is used below.
# Note: `PANTS_CONCURRENT=True` is required because the orchestrator itself
# runs via pants and also starts some services via pants.
alias hd='PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py --'
# or, if you're watching and rebuilding on change:
alias hd='~/path/to/cloud_control_plane/dist/components.hippodrome/cli.pex'

# Use a fixed seed so the seed data is identical across runs.
export HIPPODROME_RANDOM_SEED=42

# Start with ecom profile (includes admin_upr, search_proxy, etc.)
hd start --profile ecom

# Connect to staging cell instead of fake_cell
hd start --profile ecom --cell staging

# Use a custom table prefix
hd start --profile ecom --table-prefix my-feature-

# With an explicit project root
hd start --project-root /path/to/project
```
Running the Hippothesis Simple Health Test Against Local Hippodrome
Use this when validating local backend health (non-fuzz).

1) Start or restart Hippodrome (ecom profile):

```shell
hd control restart
```

If the control daemon is not running yet, start it once:

```shell
hd up --profile ecom
```

2) Run the deterministic check:

```shell
PANTS_CONCURRENT=True \
hd test components/hippothesis/hippothesis/tests/test_simple_e2e_health.py
```

3) Known local behavior:
- Under moto + local queue wiring, admin add-docs can intermittently fail with SQS 500s.
- The simple health test now falls back to a direct index `POST /documents` for the single health doc when that happens.
- This fallback applies only to the deterministic health scenario; fuzz suites still exercise the normal API paths.
Agent Iteration Loop (Recommended)
Use a long-lived control daemon to avoid repeated full teardown/restart cycles:
```shell
# 1) Start daemon once (auto-starts services by default)
hd up --profile ecom

# 2) Manage stack lifecycle without restarting the daemon
hd control status
hd control restart
hd control stop
hd control start

# 3) End the daemon session cleanly
hd control shutdown
```
Bounded Change Strategy
- Prefer `--profile core` unless a change needs ecom/full services.
- For config/code changes in one service, use `control restart` instead of `stop` + a fresh `start`.
- Use `control stop` to pause stack activity while editing, then `control start` to resume.
- Reserve `cli.py -- stop` for forced cleanup (stuck ports/processes), not normal iteration.
Troubleshooting First
Before debugging any error, read troubleshooting.md in this directory. It contains solutions to common problems. If you solve a new problem, add it to troubleshooting.md.
Log Files
All service logs are written to timestamped directories under `components/hippodrome/.logs/`:

```
components/hippodrome/.logs/
├── latest -> 20260120-143022   # Symlink to most recent run
├── 20260120-143022/
│   ├── fake_cell.log
│   ├── controller.log
│   ├── console.log
│   └── admin_upr.log
└── 20260120-140815/
    └── ...
```
Use these log files when:
- Debugging test failures that occur in hippodrome services
- Investigating service startup issues
- Analyzing errors that scroll off the terminal
- Reviewing complete service output history
Quick access to latest logs:
```shell
# View all logs from latest run
ls components/hippodrome/.logs/latest/

# Tail a specific service
tail -f components/hippodrome/.logs/latest/controller.log

# Search for errors across all services
grep -i error components/hippodrome/.logs/latest/*.log
```
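When scripting against these logs (e.g. from a test harness), the `latest` symlink may be absent on a fresh checkout. Below is a minimal sketch of resolving the newest run directory in Python; `latest_log_dir` is a hypothetical helper for illustration, not part of the hippodrome CLI:

```python
from pathlib import Path


def latest_log_dir(root: str = "components/hippodrome/.logs") -> Path:
    """Return the most recent run directory under the logs root.

    Prefers the `latest` symlink; otherwise falls back to the newest
    timestamped directory (the YYYYMMDD-HHMMSS names sort correctly
    as plain strings).
    """
    base = Path(root)
    link = base / "latest"
    if link.is_symlink() or link.is_dir():
        return link.resolve()
    runs = sorted(p for p in base.iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no runs under {base}")
    return runs[-1]
```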
Component Structure
```
hippodrome/
├── AGENTS.md            # This file - instructions for AI assistants
├── troubleshooting.md   # Common problems and solutions (read first!)
├── BUILD                # Pants build configuration
├── requirements.txt     # Python dependencies (click, pytest)
└── hippodrome/
    ├── BUILD            # python_sources() + resources for HTML
    ├── cli.py           # Click CLI entry point
    ├── config.py        # Service configurations (ports, env vars)
    ├── orchestrator.py  # Process spawner and manager
    ├── dashboard.py     # HTTP server for status page
    ├── dashboard.html   # Single-file HTML dashboard
    └── unit_tests/
        ├── BUILD        # Test target
        └── test_*.py    # Unit tests
```
Profiles
The `--profile` flag controls which services are started:

| Profile | Services | Use Case |
|---|---|---|
| `core` (default) | fake_cell, controller, console | Basic control plane development |
| `ecom` | core + admin_upr, search_proxy | E-commerce development |
| `full` | All services | Full stack testing |
Profile Examples
```shell
# Default (core): Starts fake_cell, controller, console
hd start

# Ecom: Adds admin_upr (9004), search_proxy (9005)
hd start --profile ecom
```
Cell Connection
The `--cell` flag controls which cell to connect to:

| Cell | Description | fake_cell Started? |
|---|---|---|
| `local` (default) | Use local fake_cell | Yes |
| `staging` | Connect to staging cell | No |
| `prod` | Connect to production cell | No |
When using `staging` or `prod`, the fake_cell service is not started and services connect to the real deployed cell.

Warning: Using `--cell prod` connects to production data. A warning message is displayed.

```shell
# Use staging cell (skips fake_cell)
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --cell staging
```
Table Prefix
The `--table-prefix` flag controls DynamoDB table naming:

```shell
# Default: dev-{git-branch}- (e.g., "dev-feature-auth-")
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom

# Custom prefix
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --table-prefix my-feature-
```

Branch names with special characters (slashes, etc.) are sanitized: `feature/add-auth` → `dev-feature-add-auth-`
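The sanitization rule can be illustrated in a few lines of Python. This is a sketch of the documented behavior only; `default_table_prefix` is a hypothetical helper, not the actual hippodrome implementation:

```python
import re


def default_table_prefix(branch: str) -> str:
    # Replace characters that are unsafe in DynamoDB table names
    # (slashes, spaces, etc.) with hyphens, then wrap in the dev- prefix.
    sanitized = re.sub(r"[^A-Za-z0-9_.-]", "-", branch)
    return f"dev-{sanitized}-"


print(default_table_prefix("feature/add-auth"))  # dev-feature-add-auth-
```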
Services Managed
Core Services (Profile: core, ecom, full)
| Service | Port | Description |
|---|---|---|
| dashboard | 9000 | Status dashboard (http://localhost:9000) |
| fake_cell | 9001 | Data plane mock (skipped with --cell staging/prod) |
| controller | 9002 | Control plane API (Django) |
| console | 9008 | Web dashboard UI (React, non-blocking) |
E-commerce Services (Profile: ecom, full)
| Service | Port | Description |
|---|---|---|
| admin_upr | 9004 | E-commerce backend (FastAPI) |
| search_proxy | 9005 | Search API gateway (Cloudflare Worker) |
External Services (Manual Setup Required)
| Service | Port | Description |
|---|---|---|
| global_worker | 9012 | Search query router (external repo, Cloudflare Worker) |
global_worker Setup (External Repository)
The global_worker is a Cloudflare Worker that handles search query routing, merchandising rules, and caching. It lives in a separate repository and must be set up manually for full search functionality.
Why global_worker is Needed
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│search_proxy │────▶│global_worker│────▶│  fake_cell  │
│    :9005    │     │    :9012    │     │    :9001    │
└─────────────┘     └─────────────┘     └─────────────┘
```
- search_proxy receives search API requests
- global_worker applies merchandising rules and caching
- global_worker proxies the final request to the cell (fake_cell or real cell)
Setup Instructions
1. Clone the global-worker repository:

   ```shell
   # Clone to a directory outside this repo
   cd ~/dev
   git clone [email protected]:marqo-ai/global-worker.git
   cd global-worker
   ```

2. Install dependencies:

   ```shell
   npm install
   ```

3. Create local wrangler configuration: create a `wrangler.local.toml` file (or use an existing one) with:

   ```toml
   name = "local-global-worker"
   main = "src/index.ts"
   compatibility_date = "2024-09-23"
   compatibility_flags = ["nodejs_compat"]

   [vars]
   ENV = "dev"
   FULL_ENV = "dev-local"
   # Point to fake_cell for local development
   CELL_URL = "http://localhost:9001"

   [dev]
   port = 9012
   local_protocol = "http"
   ```

4. Start global_worker:

   ```shell
   npx wrangler dev --config wrangler.local.toml --port 9012
   ```
Running Without global_worker
If you don't need to test the full search flow, you can skip global_worker setup:
- search_proxy will return errors for search requests that require global_worker
- admin_upr and other write operations will still work
- Direct cell operations (via controller) are unaffected
Verifying Setup
Once global_worker is running:
```shell
# Check global_worker health (if health endpoint exists)
curl http://localhost:9012/health

# Test full search flow (requires valid index and data)
curl -X POST http://localhost:9005/indexes/test-index/search \
  -H "Content-Type: application/json" \
  -d '{"q": "test query"}'
```
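The same probes can be scripted from Python with the standard library. This is a generic readiness-check sketch; the `/health` paths are an assumption (as noted above, the worker may not expose one), and `is_healthy` is a hypothetical helper:

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if `url` answers with a 2xx status within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        return False


if __name__ == "__main__":
    # Assumed endpoints; adjust to whatever the services actually expose.
    print("global_worker:", is_healthy("http://localhost:9012/health"))
    print("search_proxy:", is_healthy("http://localhost:9005/health"))
```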
Key Files
| File | Purpose |
|---|---|
| `cli.py` | Entry point: parses arguments, runs the orchestrator |
| `config.py` | `ServiceConfig` dataclass; `get_service_configs()` returns services in dependency order |
| `orchestrator.py` | `Orchestrator` class handles process lifecycle, log aggregation, and graceful shutdown |
| `dashboard.py` | `DashboardServer` runs the HTTP server in a thread and serves the status API |
| `dashboard.html` | Single-file HTML/CSS/JS dashboard with auto-refresh |
Service Configuration Pattern
Services are configured in `config.py`:

```python
ServiceConfig(
    name="service_name",
    port=9001,
    command=["pants", "run", "//path/to/app.py", "--", "--reload"],
    env={"ENV_VAR": "value"},
    cwd=project_root,  # or a relative path like "components/console"
    blocking=True,     # False = orchestrator continues if the service fails
)
```
Adding a New Service
- Add a `ServiceConfig` in `config.py` in the appropriate dependency layer
  - Services in Layer 0 have no dependencies
  - Services in Layer 1+ depend on earlier layers
- Set `blocking=False` for optional services (like console)
- Set `profiles=frozenset({Profile.ECOM, Profile.FULL})` to include the service in specific profiles
Example for an ecom service:
```python
ServiceConfig(
    name="my_service",
    port=9020,
    command=["pants", "run", "//components/my_service:local", "--", "--reload"],
    env={
        "CELL_URL": cell_url,
        **table_names,  # Inject all table names
    },
    profiles=frozenset({Profile.ECOM, Profile.FULL}),
)
```
Development Commands
```shell
# Run linting
pants lint //components/hippodrome::

# Run tests
pants test //components/hippodrome/hippodrome/unit_tests:tests

# Check CLI help
pants run //components/hippodrome/hippodrome/cli.py -- --help
pants run //components/hippodrome/hippodrome/cli.py -- start --help
```
Design Decisions
- No `__init__.py` files - uses namespace packages per project convention
- Async subprocess management - uses `asyncio.create_subprocess_exec` for non-blocking process spawning
- Threaded dashboard server - uses Python's built-in `http.server` in a daemon thread to avoid external dependencies
- Colored log prefixes - each service gets a unique ANSI color for easy identification
- Graceful shutdown - SIGTERM with a 5-second timeout, then SIGKILL
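The graceful-shutdown pattern can be sketched with asyncio. This is an illustrative POSIX-only snippet, not the orchestrator's actual code; the `sleep` child process stands in for a managed service:

```python
import asyncio
import signal


async def stop_gracefully(proc: asyncio.subprocess.Process,
                          timeout: float = 5.0) -> int:
    """Send SIGTERM, wait up to `timeout` seconds, then escalate to SIGKILL."""
    proc.send_signal(signal.SIGTERM)
    try:
        await asyncio.wait_for(proc.wait(), timeout)
    except asyncio.TimeoutError:
        proc.kill()  # SIGKILL; no further grace period
        await proc.wait()
    return proc.returncode


async def main() -> int:
    # Stand-in for a managed service: a process that just sleeps.
    proc = await asyncio.create_subprocess_exec("sleep", "30")
    return await stop_gracefully(proc, timeout=1.0)


rc = asyncio.run(main())
print(rc)  # typically -15 (-SIGTERM) on Linux
```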
Environment Variables
Controller
The orchestrator sets these for the controller to connect to fake_cell:

- `CONTROL_PLANE_URL_OVERRIDE=http://localhost:9001` - overrides the cell URL (dynamically set based on the `--cell` flag)
- `CONTROLLER_CELL=local` - uses local configuration
- `DEBUG=true` - enables debug mode
- `SECRET_KEY=local-dev-secret-key-not-for-production` - local-only secret
E-commerce Services (--profile ecom)
Table names are injected based on `--table-prefix`, which defaults to `dev-{sanitized-git-branch}-`. This is safe to omit when running locally on a dev branch.

admin_upr also receives:

- `DATA_PLANE_CELLS` - JSON config for the cell API gateway (based on the `--cell` flag)
- `MARQO_BASE_URL` - Marqo Cloud API URL
Console Notes
- Console is non-blocking: the orchestrator continues if it fails to start
- If `node_modules/` is missing, the orchestrator runs `npm ci` automatically
- Console uses the `PORT` env var to set port 9008
- `BROWSER=none` prevents auto-opening the browser
Testing Notes
- Tests use `MagicMock` to avoid starting real processes
- Dashboard tests handle a missing `dashboard.html` (the sandbox doesn't include resources)
- Use `pytest-asyncio` for async test methods