Leet's Stuff

Managing thousands of Windows ISOs: Exploring the problem (Part 1)

By Leet on 2025-06-02 20:54:04

Most people do not store ISOs for Windows after using them. Some more tech-savvy people store a couple of them so that they can easily access them to reimage computers. Finally, some people like me store every single installer they have as a collection. I frequently require different Windows editions, variants, versions, and languages for testing software, so I have a big library.

Legend for ISO management series

Part 1: Exploring the problem (this post)
Part 2: Creating the database

The problems

Managing such a library gets inconvenient. Very, very fast. There are several reasons for this.

Space

First and foremost, each ISO file is 3-6 GB, depending on the version, so a large collection takes up a lot of space. Luckily, the 'Windows ISO Community' has a solution for this: SmartVersion. SmartVersion uses a binary diff format, SVF, to store the changes between two files. For example, you could have the following files: en-us_windows7.iso and nl-nl_windows7.svf. SmartVersion can then use the SVF file to convert the English ISO into the Dutch one. The English version is known as the 'source' file, and the Dutch version is called the 'target' file. Because SVF files are extremely space-efficient, I can keep thousands of ISOs in my collection. Fully extracted, this library would be over 3 TiB, but with SVF it is only about 100 GiB.

Nesting

The second issue is that SVF files can be nested, i.e. you need multiple SVF files to get the desired ISO. An example of this:

a.iso + a_to_b.svf => b.iso
b.iso + b_to_c.svf => c.iso

In this case, you store only a.iso and all SVF files. When you want to get c.iso from your library, you have to first extract b.iso and then apply the second SVF to it to get c.iso. Fortunately, SmartVersion can also combine the SVF files. So a_to_b.svf and b_to_c.svf can be converted to a_to_c.svf. This way, you only have to extract once. The actual issue is not the extraction though, but rather that it is not always clear which SVF and source ISO files are needed to retrieve a specific image.
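
A minimal sketch of that lookup, assuming the library is described by a simple mapping from each target file to its source file and the SVF that connects them. The names `resolve_chain`, `patches` and `stored_isos` are made up for illustration and are not part of SmartVersion:

```python
from typing import Dict, List, Set, Tuple


def resolve_chain(
    target: str,
    patches: Dict[str, Tuple[str, str]],
    stored_isos: Set[str],
) -> Tuple[str, List[str]]:
    """Return the stored base ISO and the SVF files to apply, in order.

    patches maps a target file to (source file, SVF that turns source into target).
    stored_isos holds the files that are kept as full ISOs.
    """
    svfs: List[str] = []
    seen: Set[str] = set()
    current = target
    while current not in stored_isos:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        if current not in patches:
            raise KeyError(f"no ISO or SVF available for {current}")
        source, svf = patches[current]
        svfs.append(svf)
        current = source
    svfs.reverse()  # collected target-first, but they are applied source-first
    return current, svfs


# The a/b/c example from above.
patches = {
    "b.iso": ("a.iso", "a_to_b.svf"),
    "c.iso": ("b.iso", "b_to_c.svf"),
}
base, chain = resolve_chain("c.iso", patches, {"a.iso"})
print(base, chain)
```

With only a handful of files this is trivial, but with thousands of SVFs spread over folders the mapping itself is the part that is missing.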

Metadata

Finally, it is very difficult to store information about these files. The SVF files only give you the filename, and the actual specifics of the ISO (what build, edition, licensing etc.) are often hard to determine from the filename alone. For example: en_windows_7_enterprise_x64_dvd_x15-70749.iso. From this name, the language, main version, edition and architecture are quite clear. But what build of Windows is this? What licensing does this use? I know it purely out of experience (RTM, Volume), but for some of my ISOs I still do not know exactly what they contain. Having written Panther2K, I know a thing or two about WIM files. Therefore, I can easily obtain metadata if I have access to the WIM files. Unfortunately, extracting all 3 TiB of ISOs from the SVFs is not viable. At least, it is not good for my SSD to write this many terabytes of ISOs, and I would like to avoid doing so if I purely need metadata.
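
To illustrate how little the filename gives away, here is a rough sketch that parses the example name above. The regular expression is an assumption that only fits this one naming style, and `parse_iso_name` and the `trailing_id` field are hypothetical names:

```python
import re
from typing import Dict, Optional

# Assumed pattern for names like the example above; other naming styles will not match.
NAME_RE = re.compile(
    r"^(?P<language>[a-z]{2}(?:-[a-z]{2})?)_"
    r"windows_(?P<version>[0-9.]+)_"
    r"(?P<edition>[a-z_]+?)_"
    r"(?P<architecture>x86|x64)_"
    r"dvd_(?P<trailing_id>[a-z0-9-]+)\.iso$"
)


def parse_iso_name(filename: str) -> Optional[Dict[str, Optional[str]]]:
    """Extract what the filename states; build and licensing stay unknown."""
    match = NAME_RE.match(filename.lower())
    if match is None:
        return None
    info: Dict[str, Optional[str]] = dict(match.groupdict())
    info["build"] = None      # not encoded in the name
    info["licensing"] = None  # not encoded in the name
    return info


print(parse_iso_name("en_windows_7_enterprise_x64_dvd_x15-70749.iso"))
```

The two `None` fields are exactly the ones that have to come from the WIM file instead.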

Solution

My solution for this is a database system for Windows installers. Note that this also includes ESD files, which I use to quickly create Panther2K USBs. The requirements for the software are as follows:

The ISOs/SVFs/ESDs should all be hashed (SHA256 and MD5) and the stored files in the database should be named according to their SHA256 hash. Having the hashes allows me to quickly compare my ISOs to known ISO databases and/or requests from other people for ISOs.
When adding files, a tree is created first to determine the dependencies of SVF files. If any files do not have their dependencies met (either in the form of an ISO or an SVF which can generate that ISO), the system will not allow the submitted files into the database.
This tree can then also be used for extraction. By going up the tree, the system can automatically figure out which SVF and source files are needed for extraction. The system then merges the SVF files and extracts the requested file in a single SVF extraction operation. This is done because SVF files are slow to extract (roughly 2 to 3 minutes).
When importing files, the system automatically does a one-time in-memory extraction to retrieve the WIM file, from which the metadata can be generated. Doing it in memory reduces strain on my storage and also slightly speeds up the import.
If the imported files are already on the same NTFS volume, they are hard linked instead of copied. Hard linking is better than copying because it does not write the entire file again. It is also better than moving, because the source files stay in the same place in case something goes wrong.
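
The hashing and hard-linking requirements could be sketched roughly as follows. This is only an illustration under a few assumptions: `hash_file`, `import_file` and `store_dir` are hypothetical names, keeping the original file extension after the hash is my own choice, and falling back to a copy when linking fails is one possible way to handle files on another volume.

```python
import hashlib
import os
import shutil
from pathlib import Path
from typing import Tuple

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so multi-GB files never sit in memory


def hash_file(path: Path) -> Tuple[str, str]:
    """Return (sha256, md5) hex digests, reading the file only once."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    with path.open("rb") as handle:
        while chunk := handle.read(CHUNK_SIZE):
            sha256.update(chunk)
            md5.update(chunk)
    return sha256.hexdigest(), md5.hexdigest()


def import_file(source: Path, store_dir: Path) -> Path:
    """Place source in store_dir under its SHA256 name, hard linking when possible."""
    sha256_hex, _md5_hex = hash_file(source)
    store_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: keep the extension so ISO, SVF and ESD files stay distinguishable.
    destination = store_dir / f"{sha256_hex}{source.suffix.lower()}"
    if destination.exists():
        return destination  # identical content is already stored
    try:
        os.link(source, destination)  # no data is rewritten; source stays in place
    except OSError:
        shutil.copy2(source, destination)  # e.g. different volume: fall back to a copy
    return destination
```

Computing both digests in the same read loop matters here, because reading a multi-gigabyte file twice would double the import time for no benefit.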

Such a system comes with a lot of challenges. In this blog series, I will try to go through each of the components of the system to see what is involved in creating it. Stay tuned, because there are a lot of cool technologies and tricks used to make this run smoothly.

xx Leet