Managing thousands of Windows ISOs: Creating the database (Part 2)

By Leet on 2025-06-04 17:30:13


In my last post, we looked at the problem of storing thousands of Windows ISOs efficiently. In terms of storage space, the only viable option is of course SmartVersion. Unfortunately, SmartVersion comes with three major downsides. I already discussed the first two in part 1 (nesting and metadata); the third, speed, I will address in the next part of the series.

Legend for ISO management series

How to handle nested SVFs

For this post, I want to look at how we can overcome problem #1: Nested SVFs. Let's look at a random Windows 7 ISO in my library:

[0] File en_windows_7_ultimate_with_sp1_x64_dvd_u_677332.iso
  [141] SVF [de-de]_de_windows_7_ultimate_with_sp1_x64_dvd_u_677306.svf
      [142] SVFContent de_windows_7_ultimate_with_sp1_x64_dvd_u_677306.iso
          [147] SVF [de-de]_de_windows_7_ultimate_n_with_sp1_x64_dvd_u_677540.svf
              [148] SVFContent de_windows_7_ultimate_n_with_sp1_x64_dvd_u_677540.iso
                  [149] SVF [de-de]_de_windows_7_enterprise_n_with_sp1_x64_dvd_u_677697.svf
                      [150] SVFContent de_windows_7_enterprise_n_with_sp1_x64_dvd_u_677697.iso

Note that this is a reverse dependency tree. So to extract German Enterprise N x64 (ID 150) from its SVF file (ID 149), we need ID 148. But 148 is not a real file! It is a file stored inside an SVF. So we first need to extract that file using its parent SVF (147), which requires file 142, which requires... This is getting confusing. There is a much better way to do this.

Namely, you can simply walk all the way up the tree, keeping track of each SVF file as you go. We also make a copy of each SVF file, because we will be modifying them as part of the extraction process. You can then ask SmartVersion to merge each one (i.e. append a version to the SVF file), starting at the bottom of the tree and slowly working your way up:

smv av 147.svf 149.svf
smv av 141.svf 147.svf

If done correctly, 141.svf is now an SVF that converts en_windows_7_ultimate_with_sp1_x64_dvd_u_677332.iso directly into the requested ISO (ID 150). In math terms: to extract a file k layers deep, we need k-1 merge operations and only 1 extraction operation.

The code for this is quite a bit more complicated though. This is mainly because we need to make a distinction between SVF files themselves and contents of SVF files. Here's a rough sketch of the logic for SVF content:

// Copy the immediate parent SVF from the content-addressed store to the output folder
FileEntry parent = _DatabaseFlat[entry.ParentId];
string databaseFile = "./data/" + parent.Hash.SHA1Hash.Substring(0, 4) + "/" + parent.Hash.SHA1Hash.Substring(4);
File.Copy(databaseFile, Path.GetFullPath("./output/" + parent.FileName));

// Try finding any ancestors and merge the SVF files
string lastPath = Path.GetFullPath("./output/" + parent.FileName);
FileEntry ancestor = parent;
while (_DatabaseFlat[ancestor.ParentId].FileType == FileEntry.EntryType.SVFContent)
{
    // Get parent SVF (two layers up)
    ancestor = _DatabaseFlat[_DatabaseFlat[ancestor.ParentId].ParentId];

    // Copy SVF to output
    string databaseFileA = "./data/" + ancestor.Hash.SHA1Hash.Substring(0, 4) + "/" + ancestor.Hash.SHA1Hash.Substring(4);
    File.Copy(databaseFileA, Path.GetFullPath("./output/" + ancestor.FileName));

    // Append the previous SVF to its parent SVF
    string curPath = Path.GetFullPath("./output/" + ancestor.FileName);
    Process.Start("smv", $"av \"{curPath}\" \"{lastPath}\"").WaitForExit();

    // Remove the previous SVF
    File.Delete(lastPath);
    lastPath = curPath;
}

// Extract the actual source file at this point (hardlink)
Extract(_DatabaseFlat[ancestor.ParentId]);

// Source file and SVF for converting source to target are now in ./output
// Call SmartVersion to perform the actual extraction...
// Cleanup any extra files made...

Creating the tree

You might be thinking: "That's an efficient way to extract, but how do you generate the tree in the first place?" Well, this is where we start diving into the technical part of the solution. Adding files to a tree-like database consists of three main steps.

First, we analyze each file individually. This entails hashing each file and, for SVF files, reading the table of contents. This is important, because the filenames will be used to link files together. (In hindsight, I would have preferred to use the file hashes for this, but that might be a task for later.) Each SVF file stores two filenames: that of the required source file and that of the target file.
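As a minimal sketch, the hashing part of this step could look like the following, using .NET's built-in SHA1 class (the SVF table-of-contents parsing is format-specific and not shown here):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class FileHasher
{
    // Computes the SHA-1 digest of a file as a lowercase hex string,
    // streaming the contents so large ISOs are not loaded into memory.
    public static string ComputeSha1(string path)
    {
        using (var sha1 = SHA1.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] digest = sha1.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

The resulting hex string is what the database paths above are built from: the first four characters become the subdirectory, the rest the filename.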

With this information, we can move to the second step: building the tree. This is a surprisingly simple algorithm (that is, if you have had a Data Structures course):

  1. Start with the root nodes (i.e. nodes that have no parent files) as base points.
  2. For each unconnected node, check if any node in the tree matches the required filename for this node's parent. If it matches, we connect the node as the found node's child.
  3. Repeat step 2 until no more nodes are unconnected, or we get stuck (a full pass makes no progress).

The code looks something like this in C#:

// Layer 0 elements (files)
roots.AddRange(entries.Where(a => a.FileType == FileEntry.EntryType.File).Select(a => new TreeNode<BaseFileEntry>(a)).ToList());

// Remove all layer 0 elements from list of unconnected nodes
entries.RemoveAll(a => a.FileType == FileEntry.EntryType.File);

bool success = true;
while (entries.Count > 0)
{
    int numChanges = 0;

    // ToList makes a shallow copy here
    foreach (UnimportedFileEntry entry in entries.ToList())
    {
        IEnumerable<TreeNode<BaseFileEntry>> parents = roots
            .SelectMany(b => b.GetNodeEnumerable())
            .Where(a => a.Data.FileName == entry.BaseFileName);
        int count = parents.Count();
        if (count == 0) continue;

        TreeNode<BaseFileEntry> parent = parents.First();
        parent.AddChild(new TreeNode<BaseFileEntry>(entry));
        entries.Remove(entry);
        numChanges++;
    }

    // Stall, input data cannot be put into tree
    if (numChanges == 0)
    {
        success = false;
        break;
    }
}
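For reference, here is a minimal sketch of the TreeNode&lt;T&gt; helper the snippet above assumes. The member names AddChild and GetNodeEnumerable come from the snippet; the real class in the project may look different:

```csharp
using System.Collections.Generic;

class TreeNode<T>
{
    public T Data { get; }
    public List<TreeNode<T>> Children { get; } = new List<TreeNode<T>>();

    public TreeNode(T data) { Data = data; }

    public void AddChild(TreeNode<T> child) => Children.Add(child);

    // Enumerates this node and all of its descendants (pre-order).
    // This is what the filename lookup in the loop above iterates over.
    public IEnumerable<TreeNode<T>> GetNodeEnumerable()
    {
        yield return this;
        foreach (TreeNode<T> child in Children)
            foreach (TreeNode<T> node in child.GetNodeEnumerable())
                yield return node;
    }
}
```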

There are some more specifics to the code, like assigning IDs and hard linking or copying the files to the database directory. There is also a case distinction between files that were already in the tree and files that are being added.
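.NET has no managed API for creating hard links, so on Windows this has to go through the Win32 CreateHardLink function via P/Invoke. A sketch, with a plain copy as fallback (the helper name LinkOrCopy is made up for illustration):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

static class HardLink
{
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern bool CreateHardLink(
        string lpFileName,            // path of the new link
        string lpExistingFileName,    // path of the existing file
        IntPtr lpSecurityAttributes); // must be IntPtr.Zero

    // Hard link when possible (same NTFS volume); otherwise fall back to a copy.
    public static void LinkOrCopy(string source, string destination)
    {
        if (!CreateHardLink(destination, source, IntPtr.Zero))
            File.Copy(source, destination);
    }
}
```

A hard link is preferable here because the database and output can share the same bytes on disk, which matters when the files are multi-gigabyte ISOs.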

The final step is to generate metadata for these files. Think of version, language, editions, service pack info, etc. This is all important information to know so you can search the database quickly. Extracting this information from a WIM file is easy, but for ISO files you need some way of reading the WIM data contained within it. For SVF files this gets even more complex: you need to extract multiple layers to get to the ISO first, then the WIM, and only then can you read the data required. This will be the topic of the next blog post.

Recap of current progress

Currently we have code that can do the following:

  • Take a list of input files, hash them, and find connections between them.
  • Generate a relation tree of processed files and hard link or copy them to a data folder.
  • Automatically extract any file from the database using the tree, even if it is stored inside SVF files.

What is left to do

We still need the following, which we will look into in the coming parts of the series:

  • Generate metadata for quick lookup.
  • Create a UI application that allows easy access to files in the repository.

xx Leet