By Leet on 2025-06-04 17:30:13
In my last post, we looked at the problem of storing thousands of Windows ISOs efficiently. In terms of storage space, the only viable option is of course SmartVersion. Unfortunately, SmartVersion comes with three major downsides. I already discussed the first two (nesting and metadata) in part 1. The third is speed, which I will address later in this series.
For this post, I want to look at how we can overcome problem #1: Nested SVFs. Let's look at a random Windows 7 ISO in my library:
[0] File en_windows_7_ultimate_with_sp1_x64_dvd_u_677332.iso
  [141] SVF [de-de]_de_windows_7_ultimate_with_sp1_x64_dvd_u_677306.svf
    [142] SVFContent de_windows_7_ultimate_with_sp1_x64_dvd_u_677306.iso
      [147] SVF [de-de]_de_windows_7_ultimate_n_with_sp1_x64_dvd_u_677540.svf
        [148] SVFContent de_windows_7_ultimate_n_with_sp1_x64_dvd_u_677540.iso
          [149] SVF [de-de]_de_windows_7_enterprise_n_with_sp1_x64_dvd_u_677697.svf
            [150] SVFContent de_windows_7_enterprise_n_with_sp1_x64_dvd_u_677697.iso
Note that this is a reverse dependency tree. So to extract German Enterprise N x64 (ID 150) from its SVF file (ID 149), we need ID 148. But 148 is not a real file! It is a file stored inside an SVF. So we first need to extract that file using its parent SVF (147), which in turn requires file 14...???? This is getting confusing. There is a much better way to do this.
Namely, you can simply go all the way up the tree, keeping track of each SVF file as you go. We also copy the SVF files out of the database first, because we will be modifying them as part of the extraction process. You can then ask SmartVersion to merge each one (i.e. append a version to the SVF file), starting from the bottom of the tree and slowly working your way up:
smv av 147.svf 149.svf
smv av 141.svf 147.svf
If done correctly, 141.svf is now an SVF that converts en_windows_7_ultimate_with_sp1_x64_dvd_u_677332.iso directly into the requested ISO (ID 150). In math terms, to extract a file k layers deep, we need k-1 merge operations and only 1 extraction operation. ID 150 sits three layers deep, which is why the two merges above are all it takes.
The code for this is quite a bit more complicated though. This is mainly because we need to make a distinction between SVF files themselves and contents of SVF files. Here's a rough sketch of the logic for SVF content:
// Look up the SVF that produces this file and copy it out of the database
FileEntry parent = _DatabaseFlat[entry.ParentId];
string databaseFile = "./data/" + parent.Hash.SHA1Hash.Substring(0, 4) + "/" + parent.Hash.SHA1Hash.Substring(4);
File.Copy(databaseFile, Path.GetFullPath("./output/" + parent.FileName));
// Try finding any ancestors and merge the SVF files
string lastPath = Path.GetFullPath("./output/" + parent.FileName);
FileEntry ancestor = parent;
while (_DatabaseFlat[ancestor.ParentId].FileType == FileEntry.EntryType.SVFContent)
{
    // Get parent SVF (two layers up)
    ancestor = _DatabaseFlat[_DatabaseFlat[ancestor.ParentId].ParentId];
    // Copy SVF to output
    string databaseFileA = "./data/" + ancestor.Hash.SHA1Hash.Substring(0, 4) + "/" + ancestor.Hash.SHA1Hash.Substring(4);
    File.Copy(databaseFileA, Path.GetFullPath("./output/" + ancestor.FileName));
    // Append the previous SVF to its parent SVF
    string curPath = Path.GetFullPath("./output/" + ancestor.FileName);
    Process.Start("smv", $"av \"{curPath}\" \"{lastPath}\"").WaitForExit();
    // Remove the previous SVF
    File.Delete(lastPath);
    lastPath = curPath;
}
// Extract the actual source file at this point (hardlink)
Extract(_DatabaseFlat[ancestor.ParentId]);
// Source file and SVF for converting source to target are now in ./output
// Call SmartVersion to perform the actual extraction...
// Cleanup any extra files made...
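The snippet leans on a few pieces I haven't shown here. _DatabaseFlat is simply a flat ID-to-entry lookup built from the tree, and the entries themselves look something along these lines (heavily simplified; anything not referenced in the snippet above is illustrative):
// _DatabaseFlat in the snippet above is just a flat lookup: Dictionary<int, FileEntry>
public class FileEntry
{
    public enum EntryType { File, SVF, SVFContent }

    public int Id { get; set; }
    public int ParentId { get; set; }     // entry this one hangs under in the dependency tree
    public string FileName { get; set; }
    public EntryType FileType { get; set; }
    public HashInfo Hash { get; set; }    // "HashInfo" is an illustrative name
}

public class HashInfo
{
    public string SHA1Hash { get; set; }  // hex SHA-1; determines the path under ./data
}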
You might be thinking: "That's an efficient way to extract, but how do you generate the tree in the first place?" Well, this is where we start diving into the technical part of the solution. Adding files to a tree-like database consists of three main steps.
First, we analyze each file individually. This entails hashing each file and, for SVF files, reading the table of contents. This is important, because the filenames will be used to link files together. In hindsight, I would've preferred to use the file hashes for this, but that might be a task for later. Each SVF file stores two filenames: the name of the required source file and the name of the target file.
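Hashing and the on-disk layout go hand in hand: the first four hex characters of the SHA-1 become the directory under ./data and the rest becomes the filename, which is exactly what the extraction snippet earlier reconstructs. A small sketch of that part of the analysis (helper names here are mine, not the actual code):
using System;
using System.IO;
using System.Security.Cryptography;

static class FileAnalysis   // illustrative name
{
    // Hex-encode the SHA-1 of a file's contents
    public static string Sha1Hex(string path)
    {
        using (var sha1 = SHA1.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(sha1.ComputeHash(stream)).Replace("-", "");
        }
    }

    // First four hex characters form the directory, the rest the filename
    public static string DatabasePath(string sha1Hex)
    {
        return Path.Combine("./data", sha1Hex.Substring(0, 4), sha1Hex.Substring(4));
    }
}
Reading the table of contents of an SVF is the other half of the analysis, but that depends on the SVF format itself, so I'll leave it out here.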
With this information, we can move to the second step: building the tree. This is a surprisingly simple algorithm (that is, if you have had a Data Structures course): start with all plain files as the roots of the tree, then repeatedly scan the still-unplaced entries and attach each one under a node whose filename matches its required source filename, and stop when nothing is left or when a full pass places nothing (meaning the input cannot form a valid tree).
The code looks something like this in C#:
// Layer 0 elements (files)
roots.AddRange(entries
    .Where(a => a.FileType == FileEntry.EntryType.File)
    .Select(a => new TreeNode<BaseFileEntry>(a))
    .ToList());
// Remove all layer 0 elements from list of unconnected nodes
entries.RemoveAll(a => a.FileType == FileEntry.EntryType.File);
bool success = true;
while (entries.Count > 0)
{
    int numChanges = 0;
    // ToList makes a shallow copy here
    foreach (UnimportedFileEntry entry in entries.ToList())
    {
        // Find nodes already in the tree whose filename matches this entry's required source file
        IEnumerable<TreeNode<BaseFileEntry>> parents = roots
            .SelectMany(b => b.GetNodeEnumerable())
            .Where(a => a.Data.FileName == entry.BaseFileName);
        int count = parents.Count();
        if (count == 0) continue;
        // Attach the entry underneath the first match
        TreeNode<BaseFileEntry> parent = parents.First();
        parent.AddChild(new TreeNode<BaseFileEntry>(entry));
        entries.Remove(entry);
        numChanges++;
    }
    // Stall: the remaining input data cannot be put into the tree
    if (numChanges == 0)
    {
        success = false;
        break;
    }
}
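TreeNode is nothing special either: any generic tree node with a child list and a way to enumerate the whole subtree will do. A minimal sketch (not the actual implementation):
using System.Collections.Generic;

public class TreeNode<T>
{
    public T Data { get; }
    public List<TreeNode<T>> Children { get; } = new List<TreeNode<T>>();

    public TreeNode(T data) { Data = data; }

    public void AddChild(TreeNode<T> child) => Children.Add(child);

    // Enumerate this node and all of its descendants (depth-first)
    public IEnumerable<TreeNode<T>> GetNodeEnumerable()
    {
        yield return this;
        foreach (TreeNode<T> child in Children)
            foreach (TreeNode<T> node in child.GetNodeEnumerable())
                yield return node;
    }
}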
There are some more specifics to the code, like assigning IDs and hard linking or copying the files into the database directory. There's also a case distinction between files that were already in the tree and files that are being added.
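On the hard linking: the .NET class library has no managed API for hard links, so one way to do it (assuming everything sits on an NTFS volume, which is a safe bet for a SmartVersion-based setup) is a small P/Invoke with a plain copy as fallback. A sketch of the idea, not necessarily what the database code does:
using System;
using System.IO;
using System.Runtime.InteropServices;

static class HardLink   // illustrative name
{
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool CreateHardLink(string lpFileName, string lpExistingFileName, IntPtr lpSecurityAttributes);

    // Hard link newPath to existingPath; fall back to a copy if that fails
    // (e.g. different volume, or a filesystem without hard link support)
    public static void LinkOrCopy(string existingPath, string newPath)
    {
        if (!CreateHardLink(newPath, existingPath, IntPtr.Zero))
            File.Copy(existingPath, newPath);
    }
}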
The final step is to generate metadata for these files: version, language, edition, service pack info, etc. This is all important information to have so you can search the database quickly. Extracting this information from a WIM file is easy, but for ISO files you need some way of reading the WIM data contained within them. For SVF files this gets even more complex: you need to extract multiple layers to get to the ISO first, then the WIM, and only then can you read the data you need. This will be the topic of the next blog post.
Currently we have code that can do the following:
- analyze files: hash them and, for SVFs, read the table of contents
- build the dependency tree and store the files in the database directory
- extract any file from the database, including deeply nested SVF content, via the merge trick above
We still need the following, which we will look into in the coming parts of the series:
- metadata generation (version, language, edition, service pack), which means digging the WIM information out of ISOs and SVFs
- doing something about extraction speed
xx Leet