Building Deterministic Zip Files with Built-In Commands
Zip files are not deterministic by nature, and this can cause some problems when you’re trying to do what you gotta do.
One specific use-case (my use case) was knowing when our continuous integration should deploy a new version of our AWS Lambda functions.
Why Aren’t Zip Files Deterministic?
Non-determinism in zip files is not due to any of the underlying algorithms, but due to the other elements involved in building the zip file.
“Extra Information”
By default, zip will include “extra information” in the zip file. This depends on OS and such, but can include owner, creation time, access time etc.
The md5s are different, even though the files are the same.
To correct this part of the puzzle, use the -X (or --no-extra) flag.
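As a rough sketch of what that looks like in practice: the helper name `zip_no_extra` and the example paths are made up for illustration, and it assumes the Info-ZIP `zip` command is installed.

```shell
# Hypothetical helper: build an archive without the OS-specific "extra" fields.
# Usage: zip_no_extra <archive.zip> <file>...
zip_no_extra() {
  archive=$1
  shift
  # zip updates an existing archive in place, so start from a clean slate.
  rm -f "$archive"
  # -X drops the extra file attributes, -q keeps the output quiet.
  zip -X -q "$archive" "$@"
}

# Example (hypothetical paths):
#   zip_no_extra function.zip handler.py
```

Two archives built this way from the same file should only match if the file's permissions and timestamp also match, which is what the next two sections deal with.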
Permissions
Two files with identical contents but different permissions will produce different zip files, because zip records each file's permission bits in the archive.
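One way to take permissions out of the equation is to normalise the modes before zipping. This is a sketch under my own assumptions, not something the rest of this post depends on: the helper name and the 644/755 modes are illustrative, and a package that ships executables would need different modes for those files.

```shell
# Hypothetical helper: give every file and directory under a path the same,
# predictable mode before it is zipped.
# Usage: normalize_perms <dir>
normalize_perms() {
  # Regular files: rw-r--r--
  find "$1" -type f -exec chmod 644 {} +
  # Directories: rwxr-xr-x (the x bit is needed to traverse them)
  find "$1" -type d -exec chmod 755 {} +
}
```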
Timestamp
The -X flag still includes the timestamp of the file, so two files with the same contents, but different timestamps, will create different zip files.
In order to create a deterministic zip file, we need to make sure that the timestamp of each included file stays the same.
To do this we could simply set the timestamp of all the zipped files to a constant value.
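A minimal sketch of the constant-timestamp approach, assuming a POSIX `find` and `touch`. The helper name and the default date are arbitrary choices for illustration.

```shell
# Hypothetical helper: stamp everything under a directory with one fixed time.
# Usage: set_constant_mtime <dir> [timestamp]
# The timestamp uses touch -t syntax: [[CC]YY]MMDDhhmm[.SS]
set_constant_mtime() {
  # 2010-01-01 00:00:00 is an arbitrary default; any fixed value works,
  # as long as it is 1980 or later (zip's DOS timestamps start there).
  find "$1" -exec touch -t "${2:-201001010000.00}" {} +
}

# Example (hypothetical directory):
#   set_constant_mtime build
```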
But in my case, I needed something a bit more clever, and I suspect you will as well…
Generating Meaningful Timestamps with Git
Setting the timestamp to a constant value does solve the problem of generating a deterministic zip, but it’s not very meaningful. What if we could get the timestamp of the most recent modification to each included file?
We can with git! (Probably other version control also, but that’s left as an exercise for the reader).
This uses the ls-files git command, pipes the output to the log git command and formats the output to just output the author date.
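Roughly, the idea looks like the sketch below. It is a variation that prints the path next to each date so the output is easier to read. The function name is made up, and it assumes you run it inside a git work tree whose paths contain no newlines or characters that `git ls-files` would quote.

```shell
# Print the author date of the most recent commit touching each tracked file,
# followed by a tab and the file's path.
list_author_dates() {
  git ls-files | while IFS= read -r path; do
    # -1 limits the log to the latest commit; %ad is the author date.
    printf '%s\t%s\n' "$(git log -1 --format=%ad -- "$path")" "$path"
  done
}
```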
We now need to format the date into something we can sort, and provide as input to touch -t.
(Passing a custom format via --date=format:... requires git version 2.6 or higher.)
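Here is a sketch of the formatting step for a single path. The function name is hypothetical. The format string produces the `[[CC]YY]MMDDhhmm[.SS]` layout that `touch -t` expects, and because every field is fixed-width it also sorts correctly as plain text.

```shell
# Print the last-change author date of one path in touch -t syntax.
# Needs git 2.6 or newer for --date=format:.
touch_stamp() {
  git log -1 --date=format:'%Y%m%d%H%M.%S' --format=%ad -- "$1"
}

# Example (hypothetical path):
#   touch_stamp handler.py
```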
We can now put this together with touch.
This will recursively set the timestamp of every tracked file to the timestamp of its last change in git.
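Put together, a sketch of the whole step might look like this. The function name is mine, and it assumes git 2.6+, the repository root as the current directory, and paths without newlines or characters that `git ls-files` would quote.

```shell
# Set each tracked file's mtime to the author date of its latest commit.
touch_from_git() {
  git ls-files | while IFS= read -r path; do
    stamp=$(git log -1 --date=format:'%Y%m%d%H%M.%S' --format=%ad -- "$path")
    # Skip files with no commit yet (staged only) or missing from the work
    # tree, so touch never creates a file that is not really there.
    if [ -n "$stamp" ] && [ -e "$path" ]; then
      touch -t "$stamp" "$path"
    fi
  done
}
```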
Reusable Command for CircleCI
Here is the code as a reusable command for CircleCI.
This command also sets the timestamp of the resulting zip file to the timestamp of the last changed file that it contains.
This makes it easy to use aws s3 sync to upload the zip only when the code it contains has changed.
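The zip-stamping part can be sketched in plain shell. This is one possible approach and not the CircleCI command itself. The function, archive, and directory names are hypothetical, and it assumes git 2.6+ and a git work tree as the current directory.

```shell
# Print the newest last-change stamp among the given tracked paths
# (all tracked files if no path is given).
# Stamps are in touch -t syntax, so a plain text sort orders them by time.
latest_stamp() {
  git ls-files -- "$@" | while IFS= read -r path; do
    git log -1 --date=format:'%Y%m%d%H%M.%S' --format=%ad -- "$path"
  done | sort | tail -n 1
}

# Usage sketch (hypothetical names):
#   touch -t "$(latest_stamp src)" function.zip
```

With the archive's own timestamp pinned like this, a rebuild from unchanged code produces a file with the same size and modification time as the previous build.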
I included how to update git on the (very old) version of Debian that CircleCI uses, since that’s a headache.
Learning!
I hope this helped demystify the process of creating zip files, and some of the reasons why they can be non-deterministic.
If it didn’t, well, write your own damn blog.