S3 Object Querying with JMESPath

A quick post with some useful querying patterns when using JMESPath queries to find keys in a target S3 bucket.

Finding and filtering with JMESPath expressions

Find keys ending or starting with a certain value, and sort by Size

Here is a JMESPath query using s3api to find and sort keys based on the ending with a certain value, with the sort then being applied based on the resulting key sizes.

aws s3api list-objects-v2 --bucket example-bucket --query "Contents[?ends_with(Key, 'example')] | sort_by(@, &Size)"

To do the same as above, but for keys starting with a specific value, change the ends_with boolean expression to starts_with.

List all objects in the bucket, selecting only specific target keys, you can use a command like:

aws s3api list-objects-v2 --bucket example-bucket --query "Contents[*].[Key,Size]"

To refine that down to the first 3 x items only, add [-3:] to the end. For example:

aws s3api list-objects-v2 --bucket example-bucket --query "Contents[*].[Key,Size][-3:]"

Pipe operator

The pipe operator is used to stop projections in the query, or group expressions together.

Here is an example of filtering objects in a bucket down, followed by another expression to find only those with a key containing the value example_string:

aws s3api list-objects-v2 --bucket example-bucket --query "Contents[*] | [? contains(Key, 'example_string')]"

Another example, filtering down to include only objects on the STANDARD StorageClass, and then only those starting with a specific value:

aws s3api list-objects-v2 --bucket example-bucket --query "Contents[?StorageClass == 'STANDARD'] | [? starts_with(Key, 'ffc302a')]"

Transforming property names

Transforming keys / properties can be done using curly braces. For example, Key can be changed to become lowercase key:

aws s3api list-objects-v2 --bucket example-bucket --query "Contents[*].{key:Key}[-1:]"

This can be useful if you have a large, nested object structure and wish to create a short property in the pipeline for use in expressions further down the line. This wouldn’t be the case in the S3 object structure we’re primarily working with here, but a query example would be:

"InitialResults[*].{shortkey:Some.Nested.Object.Key} | [? starts_with(shortkey, 'example')]"

Fast Batch S3 Bucket object deletion from the shell

This is a quick post showing a nice and fast batch S3 bucket object deletion technique.

I recently had an S3 bucket that needed cleaning up. It had a few million objects in it. With path separating forward slashes this means there were around 5 million or so keys to iterate.

The goal was to delete every object that did not have a .zip file extension. Effectively I wanted to leave only the .zip file objects behind (of which there were only a few thousand), but get rid of all the other millions of objects.

My first attempt was straight forward and naive. Iterate every single key, check that it is not a .zip file, and delete it if not. However, every one of these iterations ended up being an HTTP request and this turned out to be a very slow process. Definitely not fast batch S3 bucket object deletion…

I fired up about 20 shells all iterating over objects and deleting like this but it still would have taken days.

I then stumbled upon a really cool technique on serverfault that you can use in two stages.

  1. Iterate the bucket objects and stash all the keys in a file.
  2. Iterate the lines in the file in batches of 1000 and call delete-objects on these – effectively deleting the objects in batches of 1000 (the maximum for 1 x delete request).

In-between stage 1 and stage 2 I just had to clean up the large text file of object keys to remove any of the lines that were .zip objects. For this process I used sublime text and a simple regex search and replace (replacing with an empty string to remove those lines).

So here is the process I used to delete everything in the bucket except the .zip objects. This took around 1-2 hours for the object key path collection and then the delete run.

Get all the object key paths

Note you will need to have Pipe Viewer installed first (pv). Pipe Viewer is a great little utility that you can place into any normal pipeline between two processes. It gives you a great little progress indicator to monitor progress in the shell.

aws s3api list-objects --output text --bucket the-bucket-name-here --query 'Contents[].[Key]' | pv -l > all-the-stuff.keys

 

Remove any object key paths you don’t want to delete

Open your all-the-stuff.keys file in Sublime or any other text editor with regex find and replace functionality.

The regex search for sublime text:

^.*.zip*\n

Find and replace all .zip object paths with the above regex string, replacing results with an empty string. Save the file when done. Make sure you use the correctly edited file for the following deletion phase!

Iterate all the object keys in batches and call delete

tail -n+0 all-the-stuff.keys | pv -l | grep -v -e "'" | tr '\n' '\0' | xargs -0 -P1 -n1000 bash -c 'aws s3api delete-objects --bucket the-bucket-name-here --delete "Objects=[$(printf "{Key=%q}," "$@")],Quiet=false"' _

This one-liner effectively:

  • tails the large text file (mine was around 250MB) of object keys
  • passes this into pipe viewer for progress indication
  • translates (tr) all newline characters into a null character ‘\0’ (effectively every line ending)
  • chops these up into groups of 1000 and passes the 1000 x key paths as an argument with xargs to the aws s3api delete-object command. This delete command can be passed an Objects array parameter, which is where the 1000 object key paths are fed into.
  • finally quiet mode is disabled to show the result of the delete requests in the shell, but you can also set this to true to remove that output.

Effectively you end up calling aws s3api delete-object passing in 1000 objects to delete at a time.

This is how it can get through the work so quickly.

Nice!