EcomQueueBacklogIncreasing-{system_account_id}-{index_name}
Summary
This alarm means that an ecom index’s async indexing queue is getting backed up: write operations are being sent (and added to the queue) faster than they can be processed.
Context
Queues usually start backing up when the customer performs a large batch of writes in a short time, e.g. reindexing all of their docs, or reconciling with some source of truth. These operations are not inherently wrong or bad, but can lead to bad customer experiences if not managed well.
For example, suppose the customer wants to reconcile and reindex 15M docs over the weekend; because everything is serverless, they expect it to be relatively quick and dump all the docs on Friday afternoon. If we fail to notice, they could come in on Monday morning to a queue backed up with ~14M docs, and it could take the whole day (or longer) to flush it out before any further changes can be processed.
The purpose of this alarm is to alert us of a situation that could develop into something unpleasant for the customer. We should at least validate that nothing else about the index looks problematic (e.g. high add docs latency; many 4XX/5XXs causing retries; resource starvation).
Conditions
- Ecom index
- Indexing queue has ≥ 1000 messages (jobs) “visible” (i.e. queued up)
- The number of visible jobs has increased every minute for ≥ 10 minutes
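The alarm conditions above can be sketched as a small helper (hypothetical; the real alarm is defined on the queue's metrics, and `visible_counts` stands in for per-minute samples of the queue's visible message count, oldest first):

```python
def backlog_increasing(visible_counts, threshold=1000, window=10):
    """Return True if the backlog alarm condition holds: the latest sample
    is at or above `threshold` and the count has strictly increased every
    minute for the last `window` minutes (which needs window + 1 samples)."""
    if len(visible_counts) < window + 1:
        return False
    recent = visible_counts[-(window + 1):]
    increased_every_minute = all(b > a for a, b in zip(recent, recent[1:]))
    return recent[-1] >= threshold and increased_every_minute
```

A flat queue holding steady at 1000+ messages does not fire; only a sustained climb does.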
Actions
- This situation may require no action: it often just means many writes occurred in a short span of time.
- ✅ If the request rate returns to normal soon, the queue will be processed normally.
- ⚠️ If the high rate of write requests continues for a long time, or the rate is exceptionally high for a short time, the queue could grow very large, and writes could become severely delayed.
- Keep an eye on the 📈 prod-EcomDashboard (in [PROD] Controller account) to track the number of visible (backed up) messages on the queue. A very large backlog will become a problem.
- “Very large” depends on the customer’s documents and infra. Divide the queue’s visible messages by its messages deleted per minute: that’s roughly how many minutes it will take to clear the backlog with no further traffic.
- Action may be warranted when that number exceeds ~100-500 (a couple of hours to a business day). Beyond that, customers start to lose control of the contents of their index.
- If the backlog continues to grow very large, escalate to the customer’s account manager.
- Understand whether this traffic is expected, if we know when it will end, and if the time to clear the queue is acceptable.
- Determine how urgent it is that we clear the queue (e.g. if everything must be processed by Monday morning).
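The time-to-clear estimate above can be sketched as (hypothetical helper; the inputs correspond to the queue’s visible-message and deletion-rate metrics):

```python
def minutes_to_clear(visible_messages, deleted_per_minute):
    """Estimate how long the backlog takes to drain, assuming no new
    writes arrive and the deletion rate holds steady."""
    if deleted_per_minute <= 0:
        return float("inf")  # queue is not draining at all
    return visible_messages / deleted_per_minute

# e.g. a 14M-doc backlog draining at ~20k deletions/minute:
# minutes_to_clear(14_000_000, 20_000) → 700 minutes (~12 hours)
```

Against the ~100–500 minute guideline above, 700 minutes is well into “action warranted” territory.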
- If action is required to reduce the backlog, this usually means scaling out wherever the bottleneck is.
- Indexing is typically rate limited to avoid overwhelming Vespa and interfering with serving search requests. The main rate limit is defined on the indexer Lambda trigger that consumes the index’s queue.
- (For admin access, use Escalator: Self-Service Admin for the Controller account)
- Go to the prod-EcomIndexerFunction Lambda’s triggers config
- Find the trigger for the index in question
- Select it and click Edit
- Review the “Maximum concurrency”. Increasing this value will increase the rate at which messages are processed, but will also proportionally increase the RPS of write operations on the inference and Vespa nodes.
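The same edit can be made programmatically. A sketch using boto3 (the function name comes from this runbook; the queue ARN and concurrency value are placeholders, and the client is injected so the helper can be exercised without AWS credentials):

```python
def set_indexer_concurrency(lambda_client, queue_arn, max_concurrency,
                            function_name="prod-EcomIndexerFunction"):
    """Find the SQS trigger (event source mapping) for `queue_arn` on the
    indexer Lambda and set its maximum concurrency."""
    mappings = lambda_client.list_event_source_mappings(
        FunctionName=function_name)["EventSourceMappings"]
    # Each index has its own queue, so match the trigger by queue ARN.
    mapping = next(m for m in mappings if m["EventSourceArn"] == queue_arn)
    return lambda_client.update_event_source_mapping(
        UUID=mapping["UUID"],
        ScalingConfig={"MaximumConcurrency": max_concurrency},
    )
```

In practice `lambda_client` would be `boto3.client("lambda")` under the Escalator-granted admin role; the console Edit flow above changes the same `ScalingConfig` field.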
- Before increasing the value, review the index’s infrastructure in Polo or the Cloud cell’s CustomerIndexConfigTable, and check the per-index dashboard for the average CPU and GPU utilization across the board. If any component is near its limits, consider scaling it out before increasing the indexing concurrency.
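The pre-flight check above can be sketched as (the 80% headroom threshold is illustrative, not a documented policy):

```python
def components_to_scale_first(utilizations, headroom_threshold=0.80):
    """Given average utilization per component (0.0-1.0), return the
    components that should be scaled out before raising indexing
    concurrency: anything already at or above the headroom threshold."""
    return [name for name, util in utilizations.items()
            if util >= headroom_threshold]

# e.g. {"inference_gpu": 0.85, "vespa_cpu": 0.55} → ["inference_gpu"]
```

An empty result suggests there is headroom to raise the trigger’s maximum concurrency directly.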
- After the queue is sufficiently processed, and the write RPS has returned to normal, remember to reverse any scale out ops to return the index’s infra to its natural configuration.