Dump index contents to S3 (vespa-visit)
Context
You want to dump all the Vespa docs from a Marqo index for some reason. For example:
- reindexing
- bulk patching
- data analysis across all fields
- copying the contents of one index to another
Process
- Get the system account ID and index name of the index you want to dump from (Polo may help with this).
- Go to the Vespa Dump Docs GitHub Action in the cloud_data_plane repo, and “Run workflow”:
  - Set the environment to prod-cell-1 if the index is in prod (usually the case).
  - Enter the system account ID and index name.
  - The “Job ID” is just the suffix to append to the output. If not provided, it will use a timestamp. You can enter something more memorable if you prefer.
- If the index is in the prod environment, ask a member of the Data Plane team to approve the workflow (on #cloud-data-plane-team).
- Wait until the workflow is complete (should be ~1 minute for a small index, and up to 10-20 minutes for a larger one). Find the output docs at this S3 path:
  s3://bastionfilesbucket-prod-cell-1/_vespa_visit/{system_account_id}-{index_name}-{job_id}.jsonl
- Download the files and do whatever you like with them!
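
For the download step, a minimal Python sketch may be handy. It builds the object key from the path pattern above, downloads the file with boto3, and reads it back one document per line. The helper names (dump_key, download_dump, iter_docs) are hypothetical and not part of any Marqo tooling. It assumes boto3 is installed, that your AWS credentials can read the bucket, and that each line of the .jsonl file is a single JSON object.

```python
import json
from typing import Iterator

# Bucket and prefix taken from the path pattern in the steps above.
BUCKET = "bastionfilesbucket-prod-cell-1"
PREFIX = "_vespa_visit"


def dump_key(system_account_id: str, index_name: str, job_id: str) -> str:
    """Build the S3 object key: {system_account_id}-{index_name}-{job_id}.jsonl."""
    return f"{PREFIX}/{system_account_id}-{index_name}-{job_id}.jsonl"


def download_dump(system_account_id: str, index_name: str, job_id: str,
                  dest_path: str) -> None:
    """Download the dump to dest_path (assumes AWS credentials with read access)."""
    import boto3  # imported lazily so the other helpers work without boto3

    key = dump_key(system_account_id, index_name, job_id)
    boto3.client("s3").download_file(BUCKET, key, dest_path)


def iter_docs(path: str) -> Iterator[dict]:
    """Yield one parsed document per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

With placeholder values, download_dump("acct123", "my-index", "my-job", "dump.jsonl") followed by a loop over iter_docs("dump.jsonl") would stream the documents without loading the whole file into memory, which matters for the larger indexes.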