Hippodrome Component

Required Workflow

Before implementation, re-read and follow the repo-wide Task Delivery Protocol in @/AGENTS.md for this task.

Development environment orchestrator for running all Cloud Control Plane components in one place for system testing. Starts all control plane services with proper configuration and service discovery.

Quick Start

# Explain the CLI
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- --help

# However you run Hippodrome, alias it to `hd`, which will be used below.
# Note: `PANTS_CONCURRENT=True` is required because the orchestrator runs via pants and starts some services also via pants.
alias hd='PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py --'
# or if you're watching and building on change:
alias hd='~/path/to/cloud_control_plane/dist/components.hippodrome/cli.pex'

# Use a fixed seed so the generated seed data is the same on every run.
export HIPPODROME_RANDOM_SEED=42

# Start with ecom profile (includes admin_upr, search_proxy, etc.)
hd start --profile ecom

# Connect to staging cell instead of fake_cell
hd start --profile ecom --cell staging

# Use custom table prefix
hd start --profile ecom --table-prefix my-feature-

# With explicit project root
hd start --project-root /path/to/project

Running Hippothesis Simple Health Test Against Local Hippodrome

Use this when validating local backend health (non-fuzz).

1) Start or restart Hippodrome (ecom profile)

hd control restart

2) If the control daemon is not running yet, start it once:

hd up --profile ecom

3) Run the deterministic check

PANTS_CONCURRENT=True \
hd test components/hippothesis/hippothesis/tests/test_simple_e2e_health.py

4) Known local behavior

  • Under moto + local queue wiring, admin add-docs can intermittently fail with SQS 500s.
  • The simple health test now falls back to a direct index POST /documents for the single health doc when that happens.
  • This fallback applies only to the deterministic health scenario; fuzz suites still exercise the normal API paths.
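The fallback above can be sketched as follows. Both callables are hypothetical stand-ins, not the real test helpers; see the hippothesis test itself for the actual logic:

```python
def add_health_doc(admin_add_docs, direct_index_post, doc: dict):
    """Try the admin add-docs path first; on failure (e.g. an intermittent
    SQS 500 under moto), fall back to a direct POST /documents against the
    index. Both callables are hypothetical placeholders."""
    try:
        return admin_add_docs(doc)
    except RuntimeError:
        return direct_index_post(doc)
```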

Use a long-lived control daemon to avoid repeated full teardown/restart cycles:

# 1) Start daemon once (auto-starts services by default)
hd up --profile ecom

# 2) Manage stack lifecycle without restarting daemon
hd control status
hd control restart
hd control stop
hd control start

# 3) End the daemon session cleanly
hd control shutdown

Bounded Change Strategy

  • Prefer --profile core unless a change needs ecom/full services.
  • For config/code changes in one service, use control restart instead of stop + fresh start.
  • Use control stop to pause stack activity while editing, then control start to resume.
  • Reserve cli.py -- stop for forced cleanup (stuck ports/processes), not normal iteration.

Troubleshooting First

Before debugging any error, read troubleshooting.md in this directory. It contains solutions to common problems. If you solve a new problem, add it to troubleshooting.md.

Log Files

All service logs are written to timestamped directories in components/hippodrome/.logs/:

components/hippodrome/.logs/
├── latest -> 20260120-143022  # Symlink to most recent run
├── 20260120-143022/
│   ├── fake_cell.log
│   ├── controller.log
│   ├── console.log
│   └── admin_upr.log
└── 20260120-140815/
    └── ...

Use these log files when:

  • Debugging test failures that occur in hippodrome services
  • Investigating service startup issues
  • Analyzing errors that scroll off the terminal
  • Reviewing complete service output history

Quick access to latest logs:

# View all logs from latest run
ls components/hippodrome/.logs/latest/

# Tail a specific service
tail -f components/hippodrome/.logs/latest/controller.log

# Search for errors across all services
grep -i error components/hippodrome/.logs/latest/*.log

Component Structure

hippodrome/
├── AGENTS.md              # This file - instructions for AI assistants
├── troubleshooting.md     # Common problems and solutions (read first!)
├── BUILD                  # Pants build configuration
├── requirements.txt       # Python dependencies (click, pytest)
└── hippodrome/
    ├── BUILD              # python_sources() + resources for HTML
    ├── cli.py             # Click CLI entry point
    ├── config.py          # Service configurations (ports, env vars)
    ├── orchestrator.py    # Process spawner and manager
    ├── dashboard.py       # HTTP server for status page
    ├── dashboard.html     # Single-file HTML dashboard
    └── unit_tests/
        ├── BUILD          # Test target
        └── test_*.py      # Unit tests

Profiles

The --profile flag controls which services are started:

Profile         Services                        Use Case
core (default)  fake_cell, controller, console  Basic control plane development
ecom            core + admin_upr, search_proxy  E-commerce development
full            All services                    Full stack testing

Profile Examples

# Default (core): Starts fake_cell, controller, console
hd start

# Ecom: Adds admin_upr (9004), search_proxy (9005)
hd start --profile ecom

Cell Connection

The --cell flag controls which cell to connect to:

Cell             Description                 fake_cell Started?
local (default)  Use local fake_cell         Yes
staging          Connect to staging cell     No
prod             Connect to production cell  No

When using staging or prod, the fake_cell service is not started and services connect to the real deployed cell.

Warning: Using --cell prod connects to production data. A warning message is displayed.

# Use staging cell (skips fake_cell)
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --cell staging

Table Prefix

The --table-prefix flag controls DynamoDB table naming:

# Default: dev-{git-branch}- (e.g., "dev-feature-auth-")
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom

# Custom prefix
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --table-prefix my-feature-

Branch names with special characters (slashes, etc.) are sanitized: feature/add-auth → dev-feature-add-auth-
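The default prefix derivation can be sketched as follows. The exact sanitization rules live in config.py; the regex here is an assumption:

```python
import re


def sanitize_branch(branch: str) -> str:
    """Illustrative sanitization: replace characters DynamoDB table names
    can't use (slashes, underscores, etc.) with dashes. The real rules
    live in config.py."""
    return re.sub(r"[^A-Za-z0-9-]", "-", branch)


def default_table_prefix(branch: str) -> str:
    """Default --table-prefix value: dev-{sanitized-git-branch}-."""
    return f"dev-{sanitize_branch(branch)}-"


print(default_table_prefix("feature/add-auth"))  # dev-feature-add-auth-
```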

Services Managed

Core Services (Profile: core, ecom, full)

Service     Port  Description
dashboard   9000  Status dashboard (http://localhost:9000)
fake_cell   9001  Data plane mock (skipped with --cell staging/prod)
controller  9002  Control plane API (Django)
console     9008  Web dashboard UI (React, non-blocking)

E-commerce Services (Profile: ecom, full)

Service       Port  Description
admin_upr     9004  E-commerce backend (FastAPI)
search_proxy  9005  Search API gateway (Cloudflare Worker)

External Services (Manual Setup Required)

Service        Port  Description
global_worker  9012  Search query router (external repo, Cloudflare Worker)

global_worker Setup (External Repository)

The global_worker is a Cloudflare Worker that handles search query routing, merchandising rules, and caching. It lives in a separate repository and must be set up manually for full search functionality.

Why global_worker is Needed

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│search_proxy │────▶│global_worker│────▶│  fake_cell  │
│   :9005     │     │   :9012     │     │   :9001     │
└─────────────┘     └─────────────┘     └─────────────┘
  • search_proxy receives search API requests
  • global_worker applies merchandising rules and caching
  • global_worker proxies the final request to the cell (fake_cell or real cell)

Setup Instructions

  1. Clone the global-worker repository:

    # Clone to a directory outside this repo
    cd ~/dev
    git clone [email protected]:marqo-ai/global-worker.git
    cd global-worker
    

  2. Install dependencies:

    npm install
    

  3. Create local wrangler configuration: Create a wrangler.local.toml file (or use an existing one) with:

    name = "local-global-worker"
    main = "src/index.ts"
    compatibility_date = "2024-09-23"
    compatibility_flags = ["nodejs_compat"]
    
    [vars]
    ENV = "dev"
    FULL_ENV = "dev-local"
    # Point to fake_cell for local development
    CELL_URL = "http://localhost:9001"
    
    [dev]
    port = 9012
    local_protocol = "http"
    

  4. Start global_worker:

    npx wrangler dev --config wrangler.local.toml --port 9012
    

Running Without global_worker

If you don't need to test the full search flow, you can skip global_worker setup:

  • search_proxy will return errors for search requests that require global_worker
  • admin_upr and other write operations will still work
  • Direct cell operations (via controller) are unaffected

Verifying Setup

Once global_worker is running:

# Check global_worker health (if health endpoint exists)
curl http://localhost:9012/health

# Test full search flow (requires valid index and data)
curl -X POST http://localhost:9005/indexes/test-index/search \
  -H "Content-Type: application/json" \
  -d '{"q": "test query"}'

Key Files

File             Purpose
cli.py           Entry point - parses arguments, runs orchestrator
config.py        ServiceConfig dataclass; get_service_configs() returns services in dependency order
orchestrator.py  Orchestrator class handles process lifecycle, log aggregation, graceful shutdown
dashboard.py     DashboardServer runs an HTTP server in a thread and serves the status API
dashboard.html   Single-file HTML/CSS/JS dashboard with auto-refresh

Service Configuration Pattern

Services are configured in config.py:

ServiceConfig(
    name="service_name",
    port=9001,
    command=["pants", "run", "//path/to/app.py", "--", "--reload"],
    env={"ENV_VAR": "value"},
    cwd=project_root,  # or relative path like "components/console"
    blocking=True,      # False = orchestrator continues if service fails
)

Adding a New Service

  1. Add a ServiceConfig in config.py in the appropriate dependency layer
  2. Services in Layer 0 have no dependencies
  3. Services in Layer 1+ depend on earlier layers
  4. Set blocking=False for optional services (like console)
  5. Set profiles=frozenset({Profile.ECOM, Profile.FULL}) to include in specific profiles

Example for an ecom service:

ServiceConfig(
    name="my_service",
    port=9020,
    command=["pants", "run", "//components/my_service:local", "--", "--reload"],
    env={
        "CELL_URL": cell_url,
        **table_names,  # Inject all table names
    },
    profiles=frozenset({Profile.ECOM, Profile.FULL}),
)

Development Commands

# Run linting
pants lint //components/hippodrome::

# Run tests
pants test //components/hippodrome/hippodrome/unit_tests:tests

# Check CLI help
pants run //components/hippodrome/hippodrome/cli.py -- --help
pants run //components/hippodrome/hippodrome/cli.py -- start --help

Design Decisions

  1. No __init__.py files - Uses namespace packages per project convention
  2. Async subprocess management - Uses asyncio.create_subprocess_exec for non-blocking process spawning
  3. Threaded dashboard server - Uses Python's built-in http.server in a daemon thread to avoid external dependencies
  4. Colored log prefixes - Each service gets a unique ANSI color for easy identification
  5. Graceful shutdown - SIGTERM with 5-second timeout, then SIGKILL
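The graceful-shutdown decision (SIGTERM, 5-second timeout, then SIGKILL) can be sketched as follows; the real implementation lives in orchestrator.py:

```python
import asyncio


async def stop_service(proc: asyncio.subprocess.Process, timeout: float = 5.0):
    """Send SIGTERM, wait up to `timeout` seconds, then escalate to SIGKILL."""
    if proc.returncode is not None:
        return proc.returncode  # already exited
    proc.terminate()  # SIGTERM
    try:
        await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()  # escalate to SIGKILL
        await proc.wait()
    return proc.returncode


async def _demo() -> int:
    # Spawn a long-running process and shut it down gracefully.
    proc = await asyncio.create_subprocess_exec("sleep", "60")
    return await stop_service(proc, timeout=1.0)


rc = asyncio.run(_demo())  # negative signal number on POSIX
```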

Environment Variables

Controller

The orchestrator sets these for controller to connect to fake_cell:

  • CONTROL_PLANE_URL_OVERRIDE=http://localhost:9001 - Overrides cell URL (dynamically set based on --cell flag)
  • CONTROLLER_CELL=local - Uses local configuration
  • DEBUG=true - Enables debug mode
  • SECRET_KEY=local-dev-secret-key-not-for-production - Local-only secret

E-commerce Services (--profile ecom)

Table names are injected based on --table-prefix, which defaults to dev-{sanitized-git-branch}-. This should be safe to omit when running locally on a dev branch.

admin_upr also receives:

  • DATA_PLANE_CELLS - JSON config for cell API gateway (based on --cell flag)
  • MARQO_BASE_URL - Marqo Cloud API URL

Console Notes

  • Console is non-blocking: orchestrator continues if it fails to start
  • If node_modules/ is missing, orchestrator runs npm ci automatically
  • Console uses PORT env var to set port 9008
  • BROWSER=none prevents auto-opening browser
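The auto-install behavior above can be sketched as follows; the exact start command is an assumption:

```python
from pathlib import Path


def console_commands(console_dir: Path) -> list[list[str]]:
    """Run `npm ci` first only when node_modules/ is missing, per the
    notes above; the start command here is illustrative."""
    commands: list[list[str]] = []
    if not (console_dir / "node_modules").is_dir():
        commands.append(["npm", "ci"])
    commands.append(["npm", "start"])
    return commands


# Env vars the orchestrator passes to console, per the notes above.
CONSOLE_ENV = {"PORT": "9008", "BROWSER": "none"}
```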

Testing Notes

  • Tests use MagicMock to avoid starting real processes
  • Dashboard tests handle missing dashboard.html (sandbox doesn't include resources)
  • Use pytest-asyncio for async test methods
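A minimal test in that style might look like the sketch below. The coroutine and target names are hypothetical (see unit_tests/ for the real ones), and asyncio.run is used here only to keep the sketch self-contained; real async test methods use pytest-asyncio as noted above:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def start_service(spawner, config):
    """Stand-in for an orchestrator coroutine that spawns one service."""
    return await spawner(*config.command)


def test_start_service_never_spawns_a_real_process():
    config = MagicMock()
    config.command = ["pants", "run", "//fake:target"]
    spawner = AsyncMock(return_value="fake-proc")  # no real subprocess
    result = asyncio.run(start_service(spawner, config))
    spawner.assert_awaited_once_with("pants", "run", "//fake:target")
    assert result == "fake-proc"


test_start_service_never_spawns_a_real_process()
```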