FashionNova Queue Age Spike (2026-02-17)

Summary

  • Queue: prod-fashionnova-ecom-prod-queue
  • Spike: ApproximateAgeOfOldestMessage reached 912s around 2026-02-17 00:19:00 UTC
  • High-confidence cause: exactly 7 stale/retried messages were processed around 00:20:36-00:20:39 UTC, matching the age spike behavior.
  • Exact first-failure trigger for those 7 messages remains unproven from current logs.

What Was Verified

  • Account/region: controller account 023568249301, us-east-1
  • Queue live settings:
  • VisibilityTimeout=900
  • ReceiveMessageWaitTimeSeconds=20
  • DLQ max receive count 5
  • Event source mapping (prod-EcomIndexerFunction):
  • BatchSize=30
  • MaximumBatchingWindowInSeconds=20
  • ScalingConfig.MaximumConcurrency=15
  • FunctionResponseTypes=[ReportBatchItemFailures]

Timeline (UTC)

  • 00:06 age 132s
  • 00:19 age 912s
  • 00:20 age 949s
  • 00:21 age 22s (reset)

In the same 00:20 minute:

  • NumberOfMessagesSent=391
  • NumberOfMessagesReceived=398
  • Delta: +7 receives (consistent with redelivery/retry behavior)

Correlated Jobs (FashionNova)

From prod-EcomIndexerJobsTable for shop c8tw6196-fashionnova-ecom-prod:

  • Exactly 7 jobs had lag >600s in 00:00-01:00 UTC (out of 27,705 jobs total).
  • The same 7 were the top lags (~958-969s).
  • Statuses: 6 CONFLICT, 1 COMPLETED.

Job IDs:

  • 50af0e12-4167-499b-a3e7-915f2878c73f
  • 950ae038-d797-4d20-a04f-b5b0388c8aca
  • 183374f3-54cc-4206-b2d8-520782cade97
  • 6967613b-8170-440c-8b17-0f5f61402a9b
  • 16d8b0bb-2e94-4102-8e10-4453ca9142af
  • 54ba88a8-89bd-4893-b3c1-482b15b3b647
  • d199ad46-67af-4090-b808-81bcc2e98160

Observed timing pattern:

  • Created and written to S3 around 00:04:27-00:04:40
  • First processing logs observed around 00:20:36-00:20:39

Manual Tracing: Find Delayed Jobs (lag_sec > 600)

Lag was computed as lag_sec = started_at - created_at (in seconds).
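For reference, the timestamp normalization and lag math used by the jq helper below can be mirrored in Python (a minimal sketch; helper names are illustrative):

```python
from datetime import datetime, timezone

def parse_ts(s: str) -> float:
    """Normalize '2026-02-17T00:04:27.123456+00:00'-style timestamps to epoch
    seconds, mirroring the jq parse_ts helper (drop sub-second precision)."""
    s = s.replace("+00:00", "Z")
    if "." in s:
        s = s.split(".", 1)[0] + "Z"
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc).timestamp()

def lag_sec(created_at: str, started_at: str) -> float:
    """lag_sec = started_at - created_at, in seconds."""
    return parse_ts(started_at) - parse_ts(created_at)

# One of the incident's delayed jobs: created 00:04:27, first processed 00:20:36
print(lag_sec("2026-02-17T00:04:27+00:00", "2026-02-17T00:20:36+00:00"))  # → 969.0
```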

Repro command used to find delayed jobs for this shop/hour:

PROFILE=controller
REGION=us-east-1
TABLE=prod-EcomIndexerJobsTable
PK='PLATFORM#ecom#SHOP#c8tw6196-fashionnova-ecom-prod'
SK_START='JOB#2026-02-17T00:00:00+00:00'
SK_END='JOB#2026-02-17T01:00:00+00:00~'

aws dynamodb query \
  --profile "$PROFILE" --region "$REGION" \
  --table-name "$TABLE" \
  --key-condition-expression "pk = :pk AND sk BETWEEN :s1 AND :s2" \
  --expression-attribute-values "{\":pk\":{\"S\":\"$PK\"},\":s1\":{\"S\":\"$SK_START\"},\":s2\":{\"S\":\"$SK_END\"}}" \
  --projection-expression "job_id,created_at,started_at,job_status" \
  --output json | jq -r '
    def parse_ts($s): (($s|sub("\\+00:00$";"Z")|sub("\\.[0-9]+Z$";"Z")|strptime("%Y-%m-%dT%H:%M:%SZ")|mktime));
    [ .Items[]
      | {
          job_id: .job_id.S,
          created_at: .created_at.S,
          started_at: (.started_at.S // null),
          status: (.job_status.S // "UNKNOWN")
        }
      | . + { lag_sec: (if (.started_at==null) then null else (parse_ts(.started_at) - parse_ts(.created_at)) end) }
    ] as $rows |
    "total_jobs=\($rows|length)",
    "delayed_gt_600=\($rows|map(select(.lag_sec != null and .lag_sec > 600))|length)",
    "delayed_status_breakdown:",
    (($rows|map(select(.lag_sec != null and .lag_sec > 600))|group_by(.status)|map("  \(.[0].status)=\(length)"))[]? // "  none"),
    "top_20_lags:",
    ($rows|map(select(.lag_sec!=null))|sort_by(.lag_sec)|reverse|.[0:20]|map("  \(.lag_sec|floor)s \(.status) \(.job_id)")[])
  '

Expected output pattern for this incident:

  • total_jobs=27705
  • delayed_gt_600=7
  • Breakdown: CONFLICT=6, COMPLETED=1

Additional Result: Why Delayed Jobs Became CONFLICT

For the 6 delayed jobs that ended CONFLICT, conflict_details.older_or_equal_version points to a newer job for the same doc_id that completed first.

| Delayed Job | Lag | Doc ID | Newer Job | Newer Created At (UTC) | Newer Started At (UTC) | Newer Status |
|---|---|---|---|---|---|---|
| 50af0e12-4167-499b-a3e7-915f2878c73f | 969s | 6541400440956 | 13e40a0b-8481-4d80-b15f-ec60416c05bc | 2026-02-17T00:20:25.745683+00:00 | 2026-02-17T00:20:34.141021+00:00 | COMPLETED |
| 950ae038-d797-4d20-a04f-b5b0388c8aca | 967s | 6543226208380 | a0946a86-83ae-4a3b-8c83-458c0c2e6b98 | 2026-02-17T00:20:25.108583+00:00 | 2026-02-17T00:20:35.700047+00:00 | COMPLETED |
| 6967613b-8170-440c-8b17-0f5f61402a9b | 959s | 6543038840956 | 2f1a1ba8-2182-496b-a666-d6b54c4511ce | 2026-02-17T00:19:37.276996+00:00 | 2026-02-17T00:19:47.271925+00:00 | COMPLETED |
| 54ba88a8-89bd-4893-b3c1-482b15b3b647 | 959s | 6540983730300 | 4fec410d-caf0-4181-bb03-3e783647367e | 2026-02-17T00:19:33.595623+00:00 | 2026-02-17T00:19:43.814537+00:00 | COMPLETED |
| 16d8b0bb-2e94-4102-8e10-4453ca9142af | 959s | 6543851782268 | 17f90ce9-8388-4105-95f4-41d4eb190539 | 2026-02-17T00:19:38.446487+00:00 | 2026-02-17T00:19:58.878720+00:00 | COMPLETED |
| d199ad46-67af-4090-b808-81bcc2e98160 | 958s | 6543625158780 | 02400322-1d67-4941-ac9e-460fff9d1ec7 | 2026-02-17T00:19:39.284058+00:00 | 2026-02-17T00:19:52.546012+00:00 | COMPLETED |

Interpretation:

  • These delayed messages were stale by the time they were retried; the ~958-969s lags are consistent with a failed first attempt followed by redelivery after the 900s VisibilityTimeout expired.
  • Newer writes for the same docs landed first, so the delayed retries became older_or_equal_version conflicts.
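The conflict semantics can be illustrated with a minimal sketch (the helper name is hypothetical; the real version check lives in the indexer). It relies on ISO-8601 UTC timestamps comparing chronologically as strings:

```python
def is_stale_write(incoming_created_at: str, applied_created_at: str) -> bool:
    """True when a write for the same doc_id with an equal-or-newer created_at
    has already been applied, i.e. the retried job resolves as an
    older_or_equal_version CONFLICT. ISO-8601 UTC strings sort chronologically."""
    return incoming_created_at <= applied_created_at

# Delayed job 50af0e12 (created ~00:04) vs its superseding job (created 00:20:25):
print(is_stale_write("2026-02-17T00:04:27+00:00",
                     "2026-02-17T00:20:25.745683+00:00"))  # → True
```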

Command Log (New Results) - With Purpose

  1. Confirm delayed jobs and lag in jobs table. Why: prove that only a small set of jobs had extreme lag.
PROFILE=controller
REGION=us-east-1
TABLE=prod-EcomIndexerJobsTable
PK='PLATFORM#ecom#SHOP#c8tw6196-fashionnova-ecom-prod'
SK_START='JOB#2026-02-17T00:00:00+00:00'
SK_END='JOB#2026-02-17T01:00:00+00:00~'

aws dynamodb query \
  --profile "$PROFILE" --region "$REGION" \
  --table-name "$TABLE" \
  --key-condition-expression "pk = :pk AND sk BETWEEN :s1 AND :s2" \
  --expression-attribute-values "{\":pk\":{\"S\":\"$PK\"},\":s1\":{\"S\":\"$SK_START\"},\":s2\":{\"S\":\"$SK_END\"}}" \
  --projection-expression "job_id,created_at,started_at,job_status" \
  --output json | jq -r '
    def parse_ts($s): (($s|sub("\\+00:00$";"Z")|sub("\\.[0-9]+Z$";"Z")|strptime("%Y-%m-%dT%H:%M:%SZ")|mktime));
    [ .Items[]
      | {job_id: .job_id.S, created_at: .created_at.S, started_at: (.started_at.S // null), status: (.job_status.S // "UNKNOWN")}
      | . + { lag_sec: (if (.started_at==null) then null else (parse_ts(.started_at) - parse_ts(.created_at)) end) }
    ] as $rows |
    "total_jobs=\($rows|length)",
    "delayed_gt_600=\($rows|map(select(.lag_sec != null and .lag_sec > 600))|length)",
    "delayed_status_breakdown:",
    (($rows|map(select(.lag_sec != null and .lag_sec > 600))|group_by(.status)|map("  \(.[0].status)=\(length)"))[]? // "  none")
  '
  2. Map delayed CONFLICT jobs to the newer superseding jobs from conflict_details. Why: prove causality for CONFLICT status (older write lost to newer write).
# Save a 00:00-00:30 snapshot with conflict_details
PROFILE=controller
REGION=us-east-1
TABLE=prod-EcomIndexerJobsTable
PK='PLATFORM#ecom#SHOP#c8tw6196-fashionnova-ecom-prod'
SK_START='JOB#2026-02-17T00:00:00+00:00'
SK_END='JOB#2026-02-17T00:30:00+00:00~'

aws dynamodb query \
  --profile "$PROFILE" --region "$REGION" \
  --table-name "$TABLE" \
  --key-condition-expression "pk = :pk AND sk BETWEEN :s1 AND :s2" \
  --expression-attribute-values "{\":pk\":{\"S\":\"$PK\"},\":s1\":{\"S\":\"$SK_START\"},\":s2\":{\"S\":\"$SK_END\"}}" \
  --projection-expression "job_id,created_at,started_at,job_status,conflict_details" \
  --output json > /tmp/fn_jobs_0030.json

jq -r '
  def tsnorm: sub("\\+00:00$";"Z") | sub("\\.[0-9]+Z$";"Z");
  def tsec: (tsnorm | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime);
  [ .Items[]
    | {
        job_id: .job_id.S,
        created_at: .created_at.S,
        started_at: (.started_at.S // null),
        status: (.job_status.S // "UNKNOWN"),
        conflict_details: (.conflict_details.M // null)
      }
    | . + { lag_sec: (if .started_at then ((.started_at|tsec)-(.created_at|tsec)) else null end) }
  ] as $rows |
  ($rows | map({key:.job_id, value:.}) | from_entries) as $byid |
  $rows
  | map(select(.lag_sec != null and .lag_sec > 600 and .status=="CONFLICT"))
  | .[] as $d
  | ($d.conflict_details.older_or_equal_version.L[]?.M) as $c
  | ($byid[$c.job_id.S] // null) as $n
  | "delayed_job=\($d.job_id) doc=\($c.doc_id.S) newer_job=\($c.job_id.S) newer_created=\($n.created_at) newer_started=\($n.started_at) newer_status=\($n.status)"
' /tmp/fn_jobs_0030.json
  3. Check Lambda service-level failure pressure in the spike window. Why: rule out Lambda-level failures/throttles as the primary cause.
PROFILE=controller
REGION=us-east-1
FUNC=prod-EcomIndexerFunction
for metric in Errors Throttles Invocations; do
  aws cloudwatch get-metric-statistics \
    --profile "$PROFILE" --region "$REGION" \
    --namespace AWS/Lambda \
    --metric-name "$metric" \
    --dimensions Name=FunctionName,Value="$FUNC" \
    --start-time 2026-02-17T00:00:00Z \
    --end-time 2026-02-17T00:30:00Z \
    --period 60 \
    --statistics Sum \
    --output json | jq -r '.Datapoints | sort_by(.Timestamp)[] | "\(.Timestamp)\t\(.Sum)"'
done
  4. Check for FashionNova-specific retry/error signatures in logs. Why: test whether explicit app-level errors explain the delay.
PROFILE=controller
REGION=us-east-1
LOG_GROUP="/aws/lambda/prod-EcomIndexerFunction"
START=1771286400   # 2026-02-17 00:00:00 UTC
END=1771290000     # 2026-02-17 01:00:00 UTC
Q='fields @timestamp, @message
| filter @message like /fashionnova-ecom-prod/
| filter @message like /Rate limit error processing SQS record|Document not found error processing SQS record|Inference timeout error processing SQS record|Error processing SQS record|Failed to process record|Failed to process chunk|Task timed out|Failed to parse record for parallel mode|Deferring [0-9]+ message\\(s\\) from index/
| sort @timestamp asc
| limit 200'
QID=$(aws logs start-query --profile "$PROFILE" --region "$REGION" --log-group-name "$LOG_GROUP" --start-time "$START" --end-time "$END" --query-string "$Q" --output text --query queryId)
# get-query-results returns partial data while the query is still Running; poll until Complete
while :; do
  RES=$(aws logs get-query-results --profile "$PROFILE" --region "$REGION" --query-id "$QID" --output json)
  [ "$(echo "$RES" | jq -r .status)" = "Complete" ] && break
  sleep 2
done
echo "$RES" | jq .
  5. (Attempted) check CloudTrail for SQS visibility-change API events. Why: try to directly confirm explicit visibility-shortening calls in the period.
PROFILE=controller
REGION=us-east-1
aws cloudtrail lookup-events \
  --profile "$PROFILE" --region "$REGION" \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ChangeMessageVisibilityBatch \
  --start-time 2026-02-17T00:00:00Z \
  --end-time 2026-02-17T00:30:00Z \
  --max-results 50

Result note: this lookup returned no events for that window in this account/region. Note that ChangeMessageVisibilityBatch is an SQS data-plane action, so it only appears in CloudTrail when data-event logging for SQS is enabled; the empty result is therefore not conclusive.

S3 Payload Characteristics

  • All 7 were single-doc add_docs_0000.jsonl payloads (variants=1).
  • 6 later resolved as CONFLICT because newer writes for same document IDs had already been processed.

Errors and Failures

No explicit FashionNova-specific error signatures were found in the target window for:

  • Rate limit error processing SQS record
  • Document not found error processing SQS record
  • Inference timeout error processing SQS record
  • Error processing SQS record
  • Failed to process record
  • Failed to process chunk
  • Lambda runtime failures (Task timed out, OOM, runtime exit)

Lambda REPORT lines in the window peaked at ~99s duration, nowhere near the 900s timeout.

Confidence

  • High confidence: 7 stale/retried messages caused the age spike.
  • Medium confidence: original first-attempt trigger is unknown due to insufficient per-message retry reason logging.

Recommendation

Add targeted logging in the ecom_indexer Lambda to emit, per failed message:

  • messageId
  • job_id
  • ApproximateReceiveCount
  • explicit retry reason when adding to batchItemFailures

This will make the next spike directly attributable.
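A minimal sketch of what that per-message log line could look like (the event name, helper, and record shape follow the SQS Lambda event format plus this report's field list; none of this is existing code):

```python
import json
import logging

logger = logging.getLogger("ecom_indexer")

def log_retry_decision(record: dict, job_id: str, shop_id: str, index_name: str,
                       retry_trigger: str, mode: str, request_id: str) -> dict:
    """Emit one structured event for a message about to be returned in
    batchItemFailures, so the first retry cause is visible in CloudWatch Logs."""
    event = {
        "event": "INDEXER_RETRY_DECISION",
        "message_id": record["messageId"],
        "job_id": job_id,
        "shop_id": shop_id,
        "index_name": index_name,
        # ApproximateReceiveCount arrives as a string in SQS event records
        "receive_count": int(record["attributes"]["ApproximateReceiveCount"]),
        "retry_trigger": retry_trigger,
        "mode": mode,
        "request_id": request_id,
    }
    logger.info(json.dumps(event))
    return event
```

Emitting the event as a single JSON line keeps it queryable with CloudWatch Logs Insights filters like the ones used elsewhere in this report.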

Problem To Solve (Observability Gap)

We can prove delayed jobs and eventual CONFLICT, but we cannot always prove the first retry trigger that caused the delay.

Why this gap exists in current code:

  • Retry outcome is returned as batchItemFailures without a structured reason payload:
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1581
  • At least one retry branch is effectively silent on cause details:
  • components/ecom_indexer/ecom_indexer/lambda_function.py:586
  • Job details record processing outcomes, not retry-decision provenance.
  • Redrive can clear stored job detail fields, so first-cause evidence can be lost:
  • components/ecom_indexer/ecom_indexer/job_management.py:170

Minimal Plan: Retry-Cause Logging (Phase 1)

Goal: capture the first retry cause at decision time, with enough detail for fast root-cause triage.

Design:

  • Emit one structured log event per retry decision, e.g. INDEXER_RETRY_DECISION.
  • Scope is per retried SQS message (job/chunk message), not per document.
  • Required fields (always logged):
  • message_id, job_id, shop_id, index_name
  • receive_count (ApproximateReceiveCount)
  • retry_trigger (the single trigger that caused the retry)
  • mode (sequential, parallel_event, polled)
  • request_id
  • Optional context fields (log only when available):
  • all_error_types_detected (for mixed-failure batches)
  • error_type (exception class)
  • Full Marqo error capture (untruncated):
  • Persist the full original Marqo error as first_retry_marqo_error once (write-once).
  • Do not emit the full Marqo error on every retry log line by default (keeps Phase 1 minimal).
  • Write first_retry_* attribution to retry_attribution with write-once semantics so the first cause survives later status transitions:
  • first_retry_cause
  • first_retry_at
  • first_retry_receive_count
  • first_retry_request_id
  • first_retry_marqo_error
  • Write-once requirement:
  • Implement via conditional update (compare-and-set), e.g. attribute_not_exists(retry_attribution.first_retry_at).
  • On conditional failure, no-op (do not overwrite earlier first-cause data).
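The write-once semantics above can be sketched as a pure request builder (illustrative only: table, key, and field names follow this report's design, not existing code):

```python
def build_first_retry_update(pk: str, sk: str, attribution: dict) -> dict:
    """Build a write-once DynamoDB UpdateItem request for retry_attribution.

    The ConditionExpression makes the write a compare-and-set: it succeeds
    only when no first_retry_at exists yet (true both when retry_attribution
    is absent and when it lacks first_retry_at). On
    ConditionalCheckFailedException the caller should no-op, so earlier
    first-cause data is never overwritten.
    """
    return {
        "TableName": "prod-EcomIndexerJobsTable",
        "Key": {"pk": {"S": pk}, "sk": {"S": sk}},
        "ConditionExpression": "attribute_not_exists(retry_attribution.first_retry_at)",
        "UpdateExpression": "SET retry_attribution = :attr",
        "ExpressionAttributeValues": {
            ":attr": {"M": {k: {"S": str(v)} for k, v in attribution.items()}}
        },
    }
```

Usage would be along the lines of `client.update_item(**build_first_retry_update(...))` wrapped in a try/except that swallows ConditionalCheckFailedException.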

Primary code touchpoints:

  • Retry branch logging + reason capture:
  • components/ecom_indexer/ecom_indexer/lambda_function.py:586
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1408
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1459
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1469
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1503
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1537
  • components/ecom_indexer/ecom_indexer/lambda_function.py:1581
  • Persist first-cause metadata:
  • components/ecom_indexer/ecom_indexer/job_management.py:360

Admin UI Wiring (Phase 2)

Goal: make triage easy in admin_worker without manual CloudWatch spelunking.

Plan:

  • Add an admin API endpoint to query retry-cause logs by index/time window.
  • Surface a summary + recent events on the Monitoring page.

Likely touchpoints:

  • Backend logs query plumbing:
  • components/admin_lambda/admin_lambda/clients/cloudwatch_logs_client.py
  • components/admin_lambda/admin_lambda/services/monitoring_service.py
  • components/admin_lambda/admin_lambda/routes/monitoring_routes.py
  • UI:
  • components/admin_worker/app/routes/monitoring.tsx
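A minimal sketch of the query string the backend endpoint might build (hypothetical helper; assumes the Phase 1 INDEXER_RETRY_DECISION event is emitted as JSON log lines):

```python
def build_retry_cause_query(index_name: str, limit: int = 100) -> str:
    """Build a CloudWatch Logs Insights query for retry-decision events
    scoped to one index, newest first. Field/event names follow the
    Phase 1 design, not existing code."""
    return "\n".join([
        "fields @timestamp, @message",
        "| filter @message like /INDEXER_RETRY_DECISION/",
        f"| filter @message like /{index_name}/",
        "| sort @timestamp desc",
        f"| limit {limit}",
    ])
```

The monitoring_service would pass this string to start-query against /aws/lambda/prod-EcomIndexerFunction, then poll get-query-results as in the command log above.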

References

  • Main incident report (this file):
  • docs/incidents/reports/fashionnova-ecom-queue-age-spike-2026-02-17.md
  • Incident index:
  • docs/runbooks/incidents/index.md
  • Full working notes / evidence trail:
  • tmp/agent-plans/2026-02-17-queue-spike-debug/plan.md