FashionNova Queue Age Spike (2026-02-17)

Summary

  • Queue: prod-fashionnova-ecom-prod-queue
  • Spike: ApproximateAgeOfOldestMessage reached 912s around 2026-02-17 00:19:00 UTC
  • High-confidence cause: exactly 7 stale/retried messages were processed around 00:20:36-00:20:39 UTC, matching the age spike behavior.
  • Exact first-failure trigger for those 7 messages remains unproven from current logs.

What Was Verified

  • Account/region: controller account 023568249301, us-east-1
  • Queue live settings:
  • VisibilityTimeout=900
  • ReceiveMessageWaitTimeSeconds=20
  • DLQ max receive count 5
  • Event source mapping (prod-EcomIndexerFunction):
  • BatchSize=30
  • MaximumBatchingWindowInSeconds=20
  • ScalingConfig.MaximumConcurrency=15
  • FunctionResponseTypes=[ReportBatchItemFailures]

Timeline (UTC)

  • 00:06 age 132s
  • 00:19 age 912s
  • 00:20 age 949s
  • 00:21 age 22s (reset)

In the same 00:20 minute:

  • NumberOfMessagesSent=391
  • NumberOfMessagesReceived=398
  • Delta: +7 receives (consistent with redelivery/retry behavior)

Correlated Jobs (FashionNova)

From prod-EcomIndexerJobsTable for shop c8tw6196-fashionnova-ecom-prod:

  • Exactly 7 jobs had lag >600s in 00:00-01:00 UTC (out of 27,705 jobs total).
  • The same 7 were the top lags (~958-969s).
  • Statuses: 6 CONFLICT, 1 COMPLETED.

Job IDs:

  • 50af0e12-4167-499b-a3e7-915f2878c73f
  • 950ae038-d797-4d20-a04f-b5b0388c8aca
  • 183374f3-54cc-4206-b2d8-520782cade97
  • 6967613b-8170-440c-8b17-0f5f61402a9b
  • 16d8b0bb-2e94-4102-8e10-4453ca9142af
  • 54ba88a8-89bd-4893-b3c1-482b15b3b647
  • d199ad46-67af-4090-b808-81bcc2e98160

Observed timing pattern:

  • Created and written to S3 around 00:04:27-00:04:40
  • First processing logs observed around 00:20:36-00:20:39

Manual Tracing: Find Delayed Jobs (lag_sec > 600)

Lag was computed as lag_sec = started_at - created_at (in seconds).
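For reference, the timestamp normalization and lag math used by the jq helper below can be mirrored in Python (a minimal sketch; helper names are illustrative):

```python
from datetime import datetime, timezone

def parse_ts(s: str) -> float:
    """Normalize '2026-02-17T00:04:27.123456+00:00'-style timestamps to epoch
    seconds, mirroring the jq parse_ts helper (drop sub-second precision)."""
    s = s.replace("+00:00", "Z")
    if "." in s:
        s = s.split(".", 1)[0] + "Z"
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc).timestamp()

def lag_sec(created_at: str, started_at: str) -> float:
    """lag_sec = started_at - created_at, in seconds."""
    return parse_ts(started_at) - parse_ts(created_at)

# One of the incident's delayed jobs: created 00:04:27, first processed 00:20:36
print(lag_sec("2026-02-17T00:04:27+00:00", "2026-02-17T00:20:36+00:00"))  # → 969.0
```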

Repro command used to find delayed jobs for this shop/hour:

PROFILE=controller
REGION=us-east-1
TABLE=prod-EcomIndexerJobsTable
PK='PLATFORM#ecom#SHOP#c8tw6196-fashionnova-ecom-prod'
SK_START='JOB#2026-02-17T00:00:00+00:00'
SK_END='JOB#2026-02-17T01:00:00+00:00~'

aws dynamodb query \
  --profile "$PROFILE" --region "$REGION" \
  --table-name "$TABLE" \
  --key-condition-expression "pk = :pk AND sk BETWEEN :s1 AND :s2" \
  --expression-attribute-values "{\":pk\":{\"S\":\"$PK\"},\":s1\":{\"S\":\"$SK_START\"},\":s2\":{\"S\":\"$SK_END\"}}" \
  --projection-expression "job_id,created_at,started_at,job_status" \
  --output json | jq -r '
    def parse_ts($s): (($s|sub("\\+00:00$";"Z")|sub("\\.[0-9]+Z$";"Z")|strptime("%Y-%m-%dT%H:%M:%SZ")|mktime));
    [ .Items[]
      | {
          job_id: .job_id.S,
          created_at: .created_at.S,
          started_at: (.started_at.S // null),
          status: (.job_status.S // "UNKNOWN")
        }
      | . + { lag_sec: (if (.started_at==null) then null else (parse_ts(.started_at) - parse_ts(.created_at)) end) }
    ] as $rows |
    "total_jobs=\($rows|length)",
    "delayed_gt_600=\($rows|map(select(.lag_sec != null and .lag_sec > 600))|length)",
    "delayed_status_breakdown:",
    (($rows|map(select(.lag_sec != null and .lag_sec > 600))|group_by(.status)|map("  \(.[0].status)=\(length)"))[]? // "  none"),
    "top_20_lags:",
    ($rows|map(select(.lag_sec!=null))|sort_by(.lag_sec)|reverse|.[0:20]|map("  \(.lag_sec|floor)s \(.status) \(.job_id)")[])
  '

Expected output pattern for this incident:

  • total_jobs=27705
  • delayed_gt_600=7
  • Breakdown: CONFLICT=6, COMPLETED=1

Additional Result: Why Delayed Jobs Became CONFLICT

For the 6 delayed jobs that ended CONFLICT, conflict_details.older_or_equal_version points to a newer job for the same doc_id that completed first.

| Delayed Job | Lag | Doc ID | Newer Job | Newer Created At (UTC) | Newer Started At (UTC) | Newer Status |
|---|---|---|---|---|---|---|
| 50af0e12-4167-499b-a3e7-915f2878c73f | 969s | 6541400440956 | 13e40a0b-8481-4d80-b15f-ec60416c05bc | 2026-02-17T00:20:25.745683+00:00 | 2026-02-17T00:20:34.141021+00:00 | COMPLETED |
| 950ae038-d797-4d20-a04f-b5b0388c8aca | 967s | 6543226208380 | a0946a86-83ae-4a3b-8c83-458c0c2e6b98 | 2026-02-17T00:20:25.108583+00:00 | 2026-02-17T00:20:35.700047+00:00 | COMPLETED |
| 6967613b-8170-440c-8b17-0f5f61402a9b | 959s | 6543038840956 | 2f1a1ba8-2182-496b-a666-d6b54c4511ce | 2026-02-17T00:19:37.276996+00:00 | 2026-02-17T00:19:47.271925+00:00 | COMPLETED |
| 54ba88a8-89bd-4893-b3c1-482b15b3b647 | 959s | 6540983730300 | 4fec410d-caf0-4181-bb03-3e783647367e | 2026-02-17T00:19:33.595623+00:00 | 2026-02-17T00:19:43.814537+00:00 | COMPLETED |
| 16d8b0bb-2e94-4102-8e10-4453ca9142af | 959s | 6543851782268 | 17f90ce9-8388-4105-95f4-41d4eb190539 | 2026-02-17T00:19:38.446487+00:00 | 2026-02-17T00:19:58.878720+00:00 | COMPLETED |
| d199ad46-67af-4090-b808-81bcc2e98160 | 958s | 6543625158780 | 02400322-1d67-4941-ac9e-460fff9d1ec7 | 2026-02-17T00:19:39.284058+00:00 | 2026-02-17T00:19:52.546012+00:00 | COMPLETED |

Interpretation:

  • These delayed messages were stale by the time they were retried; the ~958-969s lags are consistent with a failed first attempt followed by redelivery after the 900s VisibilityTimeout expired.
  • Newer writes for the same docs landed first, so the delayed retries became older_or_equal_version conflicts.
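The conflict semantics can be illustrated with a minimal sketch (the helper name is hypothetical; the real version check lives in the indexer). It relies on ISO-8601 UTC timestamps comparing chronologically as strings:

```python
def is_stale_write(incoming_created_at: str, applied_created_at: str) -> bool:
    """True when a write for the same doc_id with an equal-or-newer created_at
    has already been applied, i.e. the retried job resolves as an
    older_or_equal_version CONFLICT. ISO-8601 UTC strings sort chronologically."""
    return incoming_created_at <= applied_created_at

# Delayed job 50af0e12 (created ~00:04) vs its superseding job (created 00:20:25):
print(is_stale_write("2026-02-17T00:04:27+00:00",
                     "2026-02-17T00:20:25.745683+00:00"))  # → True
```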

Command Log (New Results) - With Purpose

  1. Confirm delayed jobs and lag in jobs table. Why: prove that only a small set of jobs had extreme lag.
PROFILE=controller
REGION=us-east-1
TABLE=prod-EcomIndexerJobsTable
PK='PLATFORM#ecom#SHOP#c8tw6196-fashionnova-ecom-prod'
SK_START='JOB#2026-02-17T00:00:00+00:00'
SK_END='JOB#2026-02-17T01:00:00+00:00~'

aws dynamodb query \
  --profile "$PROFILE" --region "$REGION" \
  --table-name "$TABLE" \
  --key-condition-expression "pk = :pk AND sk BETWEEN :s1 AND :s2" \
  --expression-attribute-values "{\":pk\":{\"S\":\"$PK\"},\":s1\":{\"S\":\"$SK_START\"},\":s2\":{\"S\":\"$SK_END\"}}" \
  --projection-expression "job_id,created_at,started_at,job_status" \
  --output json | jq -r '
    def parse_ts($s): (($s|sub("\\+00:00$";"Z")|sub("\\.[0-9]+Z$";"Z")|strptime("%Y-%m-%dT%H:%M:%SZ")|mktime));
    [ .Items[]
      | {job_id: .job_id.S, created_at: .created_at.S, started_at: (.started_at.S // null), status: (.job_status.S // "UNKNOWN")}
      | . + { lag_sec: (if (.started_at==null) then null else (parse_ts(.started_at) - parse_ts(.created_at)) end) }
    ] as $rows |
    "total_jobs=\($rows|length)",
    "delayed_gt_600=\($rows|map(select(.lag_sec != null and .lag_sec > 600))|length)",
    "delayed_status_breakdown:",
    (($rows|map(select(.lag_sec != null and .lag_sec > 600))|group_by(.status)|map("  \(.[0].status)=\(length)"))[]? // "  none")
  '
  2. Map delayed CONFLICT jobs to the newer superseding jobs from conflict_details. Why: prove causality for CONFLICT status (older write lost to newer write).
# Save a 00:00-00:30 snapshot with conflict_details
PROFILE=controller
REGION=us-east-1
TABLE=prod-EcomIndexerJobsTable
PK='PLATFORM#ecom#SHOP#c8tw6196-fashionnova-ecom-prod'
SK_START='JOB#2026-02-17T00:00:00+00:00'
SK_END='JOB#2026-02-17T00:30:00+00:00~'

aws dynamodb query \
  --profile "$PROFILE" --region "$REGION" \
  --table-name "$TABLE" \
  --key-condition-expression "pk = :pk AND sk BETWEEN :s1 AND :s2" \
  --expression-attribute-values "{\":pk\":{\"S\":\"$PK\"},\":s1\":{\"S\":\"$SK_START\"},\":s2\":{\"S\":\"$SK_END\"}}" \
  --projection-expression "job_id,created_at,started_at,job_status,conflict_details" \
  --output json > /tmp/fn_jobs_0030.json

jq -r '
  def tsnorm: sub("\\+00:00$";"Z") | sub("\\.[0-9]+Z$";"Z");
  def tsec: (tsnorm | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime);
  [ .Items[]
    | {
        job_id: .job_id.S,
        created_at: .created_at.S,
        started_at: (.started_at.S // null),
        status: (.job_status.S // "UNKNOWN"),
        conflict_details: (.conflict_details.M // null)
      }
    | . + { lag_sec: (if .started_at then ((.started_at|tsec)-(.created_at|tsec)) else null end) }
  ] as $rows |
  ($rows | map({key:.job_id, value:.}) | from_entries) as $byid |
  $rows
  | map(select(.lag_sec != null and .lag_sec > 600 and .status=="CONFLICT"))
  | .[] as $d
  | ($d.conflict_details.older_or_equal_version.L[]?.M) as $c
  | ($byid[$c.job_id.S] // null) as $n
  | "delayed_job=\($d.job_id) doc=\($c.doc_id.S) newer_job=\($c.job_id.S) newer_created=\($n.created_at) newer_started=\($n.started_at) newer_status=\($n.status)"
' /tmp/fn_jobs_0030.json
  3. Check Lambda service-level failure pressure in the spike window. Why: rule out Lambda-level failures/throttles as the primary cause.
PROFILE=controller
REGION=us-east-1
FUNC=prod-EcomIndexerFunction
for metric in Errors Throttles Invocations; do
  aws cloudwatch get-metric-statistics \
    --profile "$PROFILE" --region "$REGION" \
    --namespace AWS/Lambda \
    --metric-name "$metric" \
    --dimensions Name=FunctionName,Value="$FUNC" \
    --start-time 2026-02-17T00:00:00Z \
    --end-time 2026-02-17T00:30:00Z \
    --period 60 \
    --statistics Sum \
    --output json | jq -r '.Datapoints | sort_by(.Timestamp)[] | "\(.Timestamp)\t\(.Sum)"'
done
  4. Check for FashionNova-specific retry/error signatures in logs. Why: test whether explicit app-level errors explain the delay.
PROFILE=controller
REGION=us-east-1
LOG_GROUP="/aws/lambda/prod-EcomIndexerFunction"
START=1771286400   # 2026-02-17 00:00:00 UTC
END=1771290000     # 2026-02-17 01:00:00 UTC
Q='fields @timestamp, @message
| filter @message like /fashionnova-ecom-prod/
| filter @message like /Rate limit error processing SQS record|Document not found error processing SQS record|Inference timeout error processing SQS record|Error processing SQS record|Failed to process record|Failed to process chunk|Task timed out|Failed to parse record for parallel mode|Deferring [0-9]+ message\\(s\\) from index/
| sort @timestamp asc
| limit 200'
QID=$(aws logs start-query --profile "$PROFILE" --region "$REGION" --log-group-name "$LOG_GROUP" --start-time "$START" --end-time "$END" --query-string "$Q" --output text --query queryId)
# get-query-results returns partial data while the query is still Running; poll until Complete
while :; do
  RES=$(aws logs get-query-results --profile "$PROFILE" --region "$REGION" --query-id "$QID" --output json)
  [ "$(echo "$RES" | jq -r .status)" = "Complete" ] && break
  sleep 2
done
echo "$RES" | jq .
  5. (Attempted) check CloudTrail for SQS visibility-change API events. Why: try to directly confirm explicit visibility-shortening calls in the period.
PROFILE=controller
REGION=us-east-1
aws cloudtrail lookup-events \
  --profile "$PROFILE" --region "$REGION" \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ChangeMessageVisibilityBatch \
  --start-time 2026-02-17T00:00:00Z \
  --end-time 2026-02-17T00:30:00Z \
  --max-results 50

Result note: this lookup returned no events for that window in this account/region. Note that ChangeMessageVisibilityBatch is an SQS data-plane action, so it only appears in CloudTrail when data-event logging for SQS is enabled; the empty result is therefore not conclusive.

S3 Payload Characteristics

  • All 7 were single-doc add_docs_0000.jsonl payloads (variants=1).
  • 6 later resolved as CONFLICT because newer writes for same document IDs had already been processed.

Errors and Failures

No explicit FashionNova-specific error signatures were found in the target window for:

  • Rate limit error processing SQS record
  • Document not found error processing SQS record
  • Inference timeout error processing SQS record
  • Error processing SQS record
  • Failed to process record
  • Failed to process chunk
  • Lambda runtime failures (Task timed out, OOM, runtime exit)

Lambda REPORT lines in the window peaked at ~99s duration, nowhere near the 900s timeout.

Confidence

  • High confidence: 7 stale/retried messages caused the age spike.
  • Medium confidence: original first-attempt trigger is unknown due to insufficient per-message retry reason logging.

Recommendation

Add targeted logging in the ecom_indexer Lambda to emit, per failed message:

  • messageId
  • job_id
  • ApproximateReceiveCount
  • explicit retry reason when adding to batchItemFailures

This will make the next spike directly attributable.
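A minimal sketch of what that per-message log line could look like (the event name, helper, and record shape follow the SQS Lambda event format plus this report's field list; none of this is existing code):

```python
import json
import logging

logger = logging.getLogger("ecom_indexer")

def log_retry_decision(record: dict, job_id: str, shop_id: str, index_name: str,
                       retry_trigger: str, mode: str, request_id: str) -> dict:
    """Emit one structured event for a message about to be returned in
    batchItemFailures, so the first retry cause is visible in CloudWatch Logs."""
    event = {
        "event": "INDEXER_RETRY_DECISION",
        "message_id": record["messageId"],
        "job_id": job_id,
        "shop_id": shop_id,
        "index_name": index_name,
        # ApproximateReceiveCount arrives as a string in SQS event records
        "receive_count": int(record["attributes"]["ApproximateReceiveCount"]),
        "retry_trigger": retry_trigger,
        "mode": mode,
        "request_id": request_id,
    }
    logger.info(json.dumps(event))
    return event
```

Emitting the event as a single JSON line keeps it queryable with CloudWatch Logs Insights filters like the ones used elsewhere in this report.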

Problem To Solve (Observability Gap)

We can prove delayed jobs and eventual CONFLICT, but we cannot always prove the first retry trigger that caused the delay.

Why this gap exists in current code:

  • Retry outcome is returned as batchItemFailures without a structured reason payload:
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1581
  • At least one retry branch is effectively silent on cause details:
  • components/ecom_indexer/ecom_indexer/lambda_function.py:586
  • Job details record processing outcomes, not retry-decision provenance.
  • Redrive can clear stored job detail fields, so first-cause evidence can be lost:
  • components/ecom_indexer/ecom_indexer/job_management.py:170

Minimal Plan: Retry-Cause Logging (Phase 1)

Goal: capture the first retry cause at decision time, with enough detail for fast root-cause triage.

Design:

  • Emit one structured log event per retry decision, e.g. INDEXER_RETRY_DECISION.
  • Scope is per retried SQS message (job/chunk message), not per document.
  • Required fields (always logged):
  • message_id, job_id, shop_id, index_name
  • receive_count (ApproximateReceiveCount)
  • retry_trigger (the single trigger that caused the retry)
  • mode (sequential, parallel_event, polled)
  • request_id
  • Optional context fields (log only when available):
  • all_error_types_detected (for mixed-failure batches)
  • error_type (exception class)
  • Full Marqo error capture (untruncated):
  • Persist the full original Marqo error as first_retry_marqo_error once (write-once).
  • Do not emit the full Marqo error on every retry log line by default (keeps Phase 1 minimal).
  • Write first_retry_* attribution to retry_attribution with write-once semantics so the first cause survives later status transitions:
  • first_retry_cause
  • first_retry_at
  • first_retry_receive_count
  • first_retry_request_id
  • first_retry_marqo_error
  • Write-once requirement:
  • Implement via conditional update (compare-and-set), e.g. attribute_not_exists(retry_attribution.first_retry_at).
  • On conditional failure, no-op (do not overwrite earlier first-cause data).
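The write-once semantics above can be sketched as a pure request builder (illustrative only: table, key, and field names follow this report's design, not existing code):

```python
def build_first_retry_update(pk: str, sk: str, attribution: dict) -> dict:
    """Build a write-once DynamoDB UpdateItem request for retry_attribution.

    The ConditionExpression makes the write a compare-and-set: it succeeds
    only when no first_retry_at exists yet (true both when retry_attribution
    is absent and when it lacks first_retry_at). On
    ConditionalCheckFailedException the caller should no-op, so earlier
    first-cause data is never overwritten.
    """
    return {
        "TableName": "prod-EcomIndexerJobsTable",
        "Key": {"pk": {"S": pk}, "sk": {"S": sk}},
        "ConditionExpression": "attribute_not_exists(retry_attribution.first_retry_at)",
        "UpdateExpression": "SET retry_attribution = :attr",
        "ExpressionAttributeValues": {
            ":attr": {"M": {k: {"S": str(v)} for k, v in attribution.items()}}
        },
    }
```

Usage would be along the lines of `client.update_item(**build_first_retry_update(...))` wrapped in a try/except that swallows ConditionalCheckFailedException.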

Primary code touchpoints:

  • Retry branch logging + reason capture:
  • components/ecom_indexer/ecom_indexer/lambda_function.py:586
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1408
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1459
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1469
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1503
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1537
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1581
  • Persist first-cause metadata:
  • components/ecom_indexer/ecom_indexer/job_management.py:360

Admin UI Wiring (Phase 2)

Goal: make triage easy in admin_worker without manual CloudWatch spelunking.

Plan:

  • Add an admin API endpoint to query retry-cause logs by index/time window.
  • Surface a summary + recent events on the Monitoring page.

Likely touchpoints:

  • Backend logs query plumbing:
  • components/admin_lambda/admin_lambda/clients/cloudwatch_logs_client.py
  • components/admin_lambda/admin_lambda/services/monitoring_service.py
  • components/admin_lambda/admin_lambda/routes/monitoring_routes.py
  • UI:
  • components/admin_worker/app/routes/monitoring.tsx
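A minimal sketch of the query string the backend endpoint might build (hypothetical helper; assumes the Phase 1 INDEXER_RETRY_DECISION event is emitted as JSON log lines):

```python
def build_retry_cause_query(index_name: str, limit: int = 100) -> str:
    """Build a CloudWatch Logs Insights query for retry-decision events
    scoped to one index, newest first. Field/event names follow the
    Phase 1 design, not existing code."""
    return "\n".join([
        "fields @timestamp, @message",
        "| filter @message like /INDEXER_RETRY_DECISION/",
        f"| filter @message like /{index_name}/",
        "| sort @timestamp desc",
        f"| limit {limit}",
    ])
```

The monitoring_service would pass this string to start-query against /aws/lambda/prod-EcomIndexerFunction, then poll get-query-results as in the command log above.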

References

  • Main incident report (this file):
  • docs/incidents/reports/fashionnova-ecom-queue-age-spike-2026-02-17.md
  • Incident index:
  • docs/runbooks/incidents/index.md
  • Full working notes / evidence trail:
  • tmp/agent-plans/2026-02-17-queue-spike-debug/plan.md