Hippodrome Component

Required Workflow

Before implementation, re-read and follow the repo-wide Task Delivery Protocol in @/AGENTS.md for this task.

Development environment orchestrator for running all Cloud Control Plane components in one place for system testing. Starts all control plane services with proper configuration and service discovery.

Quick Start

# Explain the CLI
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- --help

# However you run Hippodrome, alias it to `hd`, which will be used below.
# Note: `PANTS_CONCURRENT=True` is required because the orchestrator runs via pants and starts some services also via pants.
alias hd='PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py --'
# or if you're watching and building on change:
alias hd='~/path/to/cloud_control_plane/dist/components.hippodrome/cli.pex'

# Use a fixed seed so the generated seed data is the same on every run.
export HIPPODROME_RANDOM_SEED=42

# Start with ecom profile (includes admin_upr, search_proxy, etc.)
hd start --profile ecom

# Connect to staging cell instead of fake_cell
hd start --profile ecom --cell staging

# Use custom table prefix
hd start --profile ecom --table-prefix my-feature-

# With explicit project root
hd start --project-root /path/to/project

Running Hippothesis Simple Health Test Against Local Hippodrome

Use this when validating local backend health (non-fuzz).

1) Start or restart Hippodrome (ecom profile)

hd control restart

2) If the control daemon is not running yet, start it once:

hd up --profile ecom

3) Run the deterministic check

PANTS_CONCURRENT=True \
hd test components/hippothesis/hippothesis/tests/test_simple_e2e_health.py

4) Known local behavior

  • Under moto + local queue wiring, admin add-docs can intermittently fail with SQS 500s.
  • The simple health test now falls back to a direct index POST /documents for the single health doc when that happens.
  • This fallback applies only to the deterministic health scenario; fuzz suites still exercise the normal API paths.
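The fallback above can be sketched as follows. Both callables are hypothetical stand-ins, not the real test helpers; see the hippothesis test itself for the actual logic:

```python
def add_health_doc(admin_add_docs, direct_index_post, doc: dict):
    """Try the admin add-docs path first; on failure (e.g. an intermittent
    SQS 500 under moto), fall back to a direct POST /documents against the
    index. Both callables are hypothetical placeholders."""
    try:
        return admin_add_docs(doc)
    except RuntimeError:
        return direct_index_post(doc)
```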

Use a long-lived control daemon to avoid repeated full teardown/restart cycles:

# 1) Start daemon once (auto-starts services by default)
hd up --profile ecom

# 2) Manage stack lifecycle without restarting daemon
hd control status
hd control restart
hd control stop
hd control start

# 3) End the daemon session cleanly
hd control shutdown

Bounded Change Strategy

  • Prefer --profile core unless a change needs ecom/full services.
  • For config/code changes in one service, use control restart instead of stop + fresh start.
  • Use control stop to pause stack activity while editing, then control start to resume.
  • Reserve cli.py -- stop for forced cleanup (stuck ports/processes), not normal iteration.

Troubleshooting First

Before debugging any error, read troubleshooting.md in this directory. It contains solutions to common problems. If you solve a new problem, add it to troubleshooting.md.

Log Files

All service logs are written to timestamped directories in components/hippodrome/.logs/:

components/hippodrome/.logs/
├── latest -> 20260120-143022  # Symlink to most recent run
├── 20260120-143022/
│   ├── fake_cell.log
│   ├── controller.log
│   ├── console.log
│   └── admin_upr.log
└── 20260120-140815/
    └── ...

Use these log files when:

  • Debugging test failures that occur in hippodrome services
  • Investigating service startup issues
  • Analyzing errors that scroll off the terminal
  • Reviewing complete service output history

Quick access to latest logs:

# View all logs from latest run
ls components/hippodrome/.logs/latest/

# Tail a specific service
tail -f components/hippodrome/.logs/latest/controller.log

# Search for errors across all services
grep -i error components/hippodrome/.logs/latest/*.log

Component Structure

hippodrome/
├── AGENTS.md              # This file - instructions for AI assistants
├── troubleshooting.md     # Common problems and solutions (read first!)
├── BUILD                  # Pants build configuration
├── requirements.txt       # Python dependencies (click, pytest)
└── hippodrome/
    ├── BUILD              # python_sources() + resources for HTML
    ├── cli.py             # Click CLI entry point
    ├── config.py          # Service configurations (ports, env vars)
    ├── orchestrator.py    # Process spawner and manager
    ├── dashboard.py       # HTTP server for status page
    ├── dashboard.html     # Single-file HTML dashboard
    └── unit_tests/
        ├── BUILD          # Test target
        └── test_*.py      # Unit tests

Profiles

The --profile flag controls which services are started:

Profile         Services                        Use Case
core (default)  fake_cell, controller, console  Basic control plane development
ecom            core + admin_upr, search_proxy  E-commerce development
full            All services                    Full stack testing

Profile Examples

# Default (core): Starts fake_cell, controller, console
hd start

# Ecom: Adds admin_upr (9004), search_proxy (9005)
hd start --profile ecom

Cell Connection

The --cell flag controls which cell to connect to:

Cell             Description                 fake_cell Started?
local (default)  Use local fake_cell         Yes
staging          Connect to staging cell     No
prod             Connect to production cell  No

When using staging or prod, the fake_cell service is not started and services connect to the real deployed cell.

Warning: Using --cell prod connects to production data. A warning message is displayed.

# Use staging cell (skips fake_cell)
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --cell staging

Table Prefix

The --table-prefix flag controls DynamoDB table naming:

# Default: dev-{git-branch}- (e.g., "dev-feature-auth-")
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom

# Custom prefix
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --table-prefix my-feature-

Branch names with special characters (slashes, etc.) are sanitized: feature/add-auth → dev-feature-add-auth-
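The default prefix derivation can be sketched as follows. The exact sanitization rules live in config.py; the regex here is an assumption:

```python
import re


def sanitize_branch(branch: str) -> str:
    """Illustrative sanitization: replace characters DynamoDB table names
    can't use (slashes, underscores, etc.) with dashes. The real rules
    live in config.py."""
    return re.sub(r"[^A-Za-z0-9-]", "-", branch)


def default_table_prefix(branch: str) -> str:
    """Default --table-prefix value: dev-{sanitized-git-branch}-."""
    return f"dev-{sanitize_branch(branch)}-"


print(default_table_prefix("feature/add-auth"))  # dev-feature-add-auth-
```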

Services Managed

Core Services (Profile: core, ecom, full)

Service     Port  Description
dashboard   9000  Status dashboard (http://localhost:9000)
fake_cell   9001  Data plane mock (skipped with --cell staging/prod)
controller  9002  Control plane API (Django)
console     9008  Web dashboard UI (React, non-blocking)

E-commerce Services (Profile: ecom, full)

Service       Port  Description
admin_upr     9004  E-commerce backend (FastAPI)
search_proxy  9005  Search API gateway (Cloudflare Worker)

External Services (Manual Setup Required)

Service        Port  Description
global_worker  9012  Search query router (external repo, Cloudflare Worker)

global_worker Setup (External Repository)

The global_worker is a Cloudflare Worker that handles search query routing, merchandising rules, and caching. It lives in a separate repository and must be set up manually for full search functionality.

Why global_worker is Needed

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│search_proxy │────▶│global_worker│────▶│  fake_cell  │
│   :9005     │     │   :9012     │     │   :9001     │
└─────────────┘     └─────────────┘     └─────────────┘
  • search_proxy receives search API requests
  • global_worker applies merchandising rules and caching
  • global_worker proxies the final request to the cell (fake_cell or real cell)

Setup Instructions

  1. Clone the global-worker repository:

    # Clone to a directory outside this repo
    cd ~/dev
    git clone [email protected]:marqo-ai/global-worker.git
    cd global-worker
    

  2. Install dependencies:

    npm install
    

  3. Create local wrangler configuration: Create a wrangler.local.toml file (or use an existing one) with:

    name = "local-global-worker"
    main = "src/index.ts"
    compatibility_date = "2024-09-23"
    compatibility_flags = ["nodejs_compat"]
    
    [vars]
    ENV = "dev"
    FULL_ENV = "dev-local"
    # Point to fake_cell for local development
    CELL_URL = "http://localhost:9001"
    
    [dev]
    port = 9012
    local_protocol = "http"
    

  4. Start global_worker:

    npx wrangler dev --config wrangler.local.toml --port 9012
    

Running Without global_worker

If you don't need to test the full search flow, you can skip global_worker setup:

  • search_proxy will return errors for search requests that require global_worker
  • admin_upr and other write operations will still work
  • Direct cell operations (via controller) are unaffected

Verifying Setup

Once global_worker is running:

# Check global_worker health (if health endpoint exists)
curl http://localhost:9012/health

# Test full search flow (requires valid index and data)
curl -X POST http://localhost:9005/indexes/test-index/search \
  -H "Content-Type: application/json" \
  -d '{"q": "test query"}'

Key Files

File             Purpose
cli.py           Entry point - parses arguments, runs orchestrator
config.py        ServiceConfig dataclass; get_service_configs() returns services in dependency order
orchestrator.py  Orchestrator class handles process lifecycle, log aggregation, graceful shutdown
dashboard.py     DashboardServer runs an HTTP server in a thread and serves the status API
dashboard.html   Single-file HTML/CSS/JS dashboard with auto-refresh

Service Configuration Pattern

Services are configured in config.py:

ServiceConfig(
    name="service_name",
    port=9001,
    command=["pants", "run", "//path/to/app.py", "--", "--reload"],
    env={"ENV_VAR": "value"},
    cwd=project_root,  # or relative path like "components/console"
    blocking=True,      # False = orchestrator continues if service fails
)

Adding a New Service

  1. Add a ServiceConfig in config.py in the appropriate dependency layer
  2. Services in Layer 0 have no dependencies
  3. Services in Layer 1+ depend on earlier layers
  4. Set blocking=False for optional services (like console)
  5. Set profiles=frozenset({Profile.ECOM, Profile.FULL}) to include in specific profiles

Example for an ecom service:

ServiceConfig(
    name="my_service",
    port=9020,
    command=["pants", "run", "//components/my_service:local", "--", "--reload"],
    env={
        "CELL_URL": cell_url,
        **table_names,  # Inject all table names
    },
    profiles=frozenset({Profile.ECOM, Profile.FULL}),
)

Development Commands

# Run linting
pants lint //components/hippodrome::

# Run tests
pants test //components/hippodrome/hippodrome/unit_tests:tests

# Check CLI help
pants run //components/hippodrome/hippodrome/cli.py -- --help
pants run //components/hippodrome/hippodrome/cli.py -- start --help

Design Decisions

  1. No __init__.py files - Uses namespace packages per project convention
  2. Async subprocess management - Uses asyncio.create_subprocess_exec for non-blocking process spawning
  3. Threaded dashboard server - Uses Python's built-in http.server in a daemon thread to avoid external dependencies
  4. Colored log prefixes - Each service gets a unique ANSI color for easy identification
  5. Graceful shutdown - SIGTERM with 5-second timeout, then SIGKILL
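The graceful-shutdown decision (SIGTERM, 5-second timeout, then SIGKILL) can be sketched as follows; the real implementation lives in orchestrator.py:

```python
import asyncio


async def stop_service(proc: asyncio.subprocess.Process, timeout: float = 5.0):
    """Send SIGTERM, wait up to `timeout` seconds, then escalate to SIGKILL."""
    if proc.returncode is not None:
        return proc.returncode  # already exited
    proc.terminate()  # SIGTERM
    try:
        await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()  # escalate to SIGKILL
        await proc.wait()
    return proc.returncode


async def _demo() -> int:
    # Spawn a long-running process and shut it down gracefully.
    proc = await asyncio.create_subprocess_exec("sleep", "60")
    return await stop_service(proc, timeout=1.0)


rc = asyncio.run(_demo())  # negative signal number on POSIX
```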

Environment Variables

Controller

The orchestrator sets these for controller to connect to fake_cell:

  • CONTROL_PLANE_URL_OVERRIDE=http://localhost:9001 - Overrides cell URL (dynamically set based on --cell flag)
  • CONTROLLER_CELL=local - Uses local configuration
  • DEBUG=true - Enables debug mode
  • SECRET_KEY=local-dev-secret-key-not-for-production - Local-only secret

E-commerce Services (--profile ecom)

Table names are injected based on --table-prefix, which defaults to dev-{sanitized-git-branch}-. This should be safe to omit when running locally on a dev branch.

admin_upr also receives:

  • DATA_PLANE_CELLS - JSON config for cell API gateway (based on --cell flag)
  • MARQO_BASE_URL - Marqo Cloud API URL

Console Notes

  • Console is non-blocking: orchestrator continues if it fails to start
  • If node_modules/ is missing, orchestrator runs npm ci automatically
  • Console uses PORT env var to set port 9008
  • BROWSER=none prevents auto-opening browser
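The auto-install behavior above can be sketched as follows; the exact start command is an assumption:

```python
from pathlib import Path


def console_commands(console_dir: Path) -> list[list[str]]:
    """Run `npm ci` first only when node_modules/ is missing, per the
    notes above; the start command here is illustrative."""
    commands: list[list[str]] = []
    if not (console_dir / "node_modules").is_dir():
        commands.append(["npm", "ci"])
    commands.append(["npm", "start"])
    return commands


# Env vars the orchestrator passes to console, per the notes above.
CONSOLE_ENV = {"PORT": "9008", "BROWSER": "none"}
```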

Testing Notes

  • Tests use MagicMock to avoid starting real processes
  • Dashboard tests handle missing dashboard.html (sandbox doesn't include resources)
  • Use pytest-asyncio for async test methods
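A minimal test in that style might look like the sketch below. The coroutine and target names are hypothetical (see unit_tests/ for the real ones), and asyncio.run is used here only to keep the sketch self-contained; real async test methods use pytest-asyncio as noted above:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def start_service(spawner, config):
    """Stand-in for an orchestrator coroutine that spawns one service."""
    return await spawner(*config.command)


def test_start_service_never_spawns_a_real_process():
    config = MagicMock()
    config.command = ["pants", "run", "//fake:target"]
    spawner = AsyncMock(return_value="fake-proc")  # no real subprocess
    result = asyncio.run(start_service(spawner, config))
    spawner.assert_awaited_once_with("pants", "run", "//fake:target")
    assert result == "fake-proc"


test_start_service_never_spawns_a_real_process()
```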