WAV files arrive at runtime from mod archives as raw byte arrays. There is no build-time import step, no Unity AudioImporter, no pre-processing. Each file goes from bytes to a playable AudioClip in a single path, and that path needs to handle whatever a mod author drops in.
Why there is no Unity importer step
Unity's audio import pipeline runs at editor time, producing platform-specific compressed assets baked into the build. Warman's mod system packages mods as .wm archives: zip files containing raw assets, loaded at runtime. The AudioImporter never runs on mod content; it exists only inside the editor, with no runtime equivalent.
This is the same constraint that required a custom terrain mesh system and custom binary serialization. Once runtime mod loading is a design requirement, any Unity tool that only runs at build time becomes unavailable.
The RIFF container
A WAV file is a RIFF file. RIFF (Resource Interchange File Format) is a chunked container: a root header with a type tag followed by sub-chunks, each with a 4-byte identifier, a 4-byte size, and a payload. A WAV file's root header has the type WAVE and must contain at least two sub-chunks: fmt (audio format; the four-byte identifier is "fmt " with a trailing space) and data (PCM samples).
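Before the chunk scan starts, the 12-byte root header can be validated up front. A minimal sketch, assuming the raw bytes are wrapped in a MemoryStream/BinaryReader as in the snippets below (OpenWav and wavBytes are illustrative names, not from the actual decoder):

```csharp
using System;
using System.IO;

// Sketch: validate the 12-byte RIFF root header before scanning chunks.
static BinaryReader OpenWav(byte[] wavBytes)
{
    var br = new BinaryReader(new MemoryStream(wavBytes));
    if (new string(br.ReadChars(4)) != "RIFF")
        throw new InvalidDataException("Not a RIFF container");
    br.ReadInt32(); // overall RIFF size; the chunk scan uses the stream length instead
    if (new string(br.ReadChars(4)) != "WAVE")
        throw new InvalidDataException("RIFF type is not WAVE");
    return br; // positioned at the first sub-chunk
}
```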
The decoder does not assume a fixed layout. It scans chunks in a loop until it has found both fmt and data, skipping anything else:
while (ms.Position <= ms.Length - 8) // need 8 bytes for a chunk header
{
    var chunkId = new string(br.ReadChars(4));
    var chunkSize = br.ReadInt32();
    switch (chunkId)
    {
        case "fmt ": /* read format fields */ break;
        case "data": pcmBytes = br.ReadBytes(chunkSize); break;
        default: br.ReadBytes(chunkSize); break; // skip unknown chunks
    }
}
Different DAWs embed different things. Ableton adds a LIST chunk with session metadata. Some editors write id3 tags. Adobe Audition adds a bext broadcast extension chunk. A decoder that assumes fmt is always first, or that data immediately follows it, breaks on a large fraction of real-world files. Scanning by chunk ID handles all of them.
One rule from the RIFF spec that silently breaks naive parsers: chunk payloads are padded to an even byte boundary. A chunk with an odd-sized payload has an extra padding byte that is not counted in the size field. The decoder advances past it explicitly after processing each chunk:
// Align to 2-byte boundary (RIFF spec)
if (chunkSize % 2 != 0 && ms.Position < ms.Length)
    ms.Position++;
Without this, the reader falls out of alignment on the first odd-sized chunk and misidentifies every chunk identifier that follows.
Bit depth decoding
The fmt chunk specifies bit depth alongside channel count and sample rate. Unity's AudioClip.SetData() expects float samples normalized to -1.0 to 1.0, regardless of source bit depth. Four integer formats and one float format each need different handling.
8-bit PCM is unsigned. Samples run from 0 to 255 with silence at 128, not 0. The conversion subtracts the midpoint before dividing:
s[i] = (b[i] - 128) / 128f;
16-bit PCM is signed two's complement. BitConverter.ToInt16 handles it directly:
s[i] = BitConverter.ToInt16(b, i * 2) / 32768f;
24-bit PCM is the interesting case. There is no 24-bit integer type in C# and no BitConverter method for it. The three bytes must be manually sign-extended into a 32-bit integer. The decoder does this with left shifts:
var o = i * 3;
var v = (b[o] << 8) | (b[o + 1] << 16) | (b[o + 2] << 24);
s[i] = v / 2147483648f;
Shifting the three bytes into the upper 24 bits of a 32-bit int places the sign bit at position 31, where C# integer arithmetic correctly handles negative values. Dividing by 2^31 rather than 2^23 keeps the -1.0 to 1.0 range consistent with every other bit depth. Unity has no native runtime support for 24-bit WAV, which is why this needs to exist at all. 24-bit is common output from DAWs and professional recording software, so a decoder that skips it would reject a lot of real-world files.
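A quick worked check of the sign extension, runnable as a plain C# top-level program:

```csharp
using System;

// The 24-bit sample 0xFF 0xFF 0xFF (little-endian) is -1 in two's complement.
// Shifted into the top 24 bits of an int, it must come out as a small negative float.
byte[] b = { 0xFF, 0xFF, 0xFF };
int v = (b[0] << 8) | (b[1] << 16) | (b[2] << 24);
Console.WriteLine(v);               // -256: -1 shifted left by 8
Console.WriteLine(v / 2147483648f); // ≈ -1.19e-7, i.e. -1/2^23, one step below silence
```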
32-bit integer PCM follows the same pattern as 16-bit using BitConverter.ToInt32 and dividing by 2^31. The IEEE float variant (indicated by audioFmt == 3 in the fmt chunk) is the simplest case: read four bytes as a float and clamp to -1.0 to 1.0. IEEE float WAV is technically valid but rare; most DAWs export integer PCM by default.
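The two 32-bit branches can be sketched as a helper using the same b/i names as the snippets above (Decode32 and the audioFmt parameter are illustrative; Math.Clamp stands in for Unity's Mathf.Clamp to keep the sketch self-contained):

```csharp
using System;

// Sketch of the two 32-bit cases. audioFmt comes from the fmt chunk:
// 1 = integer PCM, 3 = IEEE float.
static float Decode32(byte[] b, int i, int audioFmt)
{
    if (audioFmt == 3)
        return Math.Clamp(BitConverter.ToSingle(b, i * 4), -1f, 1f);
    return BitConverter.ToInt32(b, i * 4) / 2147483648f; // divide by 2^31
}
```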
Normalization
WAV files from different sources arrive at different recording levels. A sound effect recorded at -3 dBFS and an ambience loop recorded at -18 dBFS would play at completely different volumes without intervention. Normalizing on import means all audio enters the runtime at a consistent relative level, and the in-game volume sliders become meaningful.
Two strategies are available. Peak normalization scales audio so the loudest single sample hits a target level (default -6 dBFS, linear value 0.5). It is fast and guarantees no clipping, but peak level does not correlate well with perceived loudness. A punchy sound with rare loud transients and low average energy would be scaled to the same peak as a sustained tone at the same level, but it would sound much quieter in practice.
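Peak normalization is simple enough to sketch in full (method names are illustrative, with System.Math in place of Unity's Mathf so the sketch runs standalone):

```csharp
using System;

// Find the loudest absolute sample value in the clip.
static float MeasurePeak(float[] samples)
{
    float peak = 0f;
    foreach (var x in samples)
        peak = Math.Max(peak, Math.Abs(x));
    return peak;
}

// Scale so the loudest single sample lands at targetPeak (0.5 ≈ -6 dBFS).
static void NormalizeToPeak(float[] samples, float targetPeak = 0.5f)
{
    float peak = MeasurePeak(samples);
    if (peak < 1e-6f) return; // silence guard, mirroring the RMS path
    float gain = targetPeak / peak;
    for (int i = 0; i < samples.Length; i++)
        samples[i] *= gain;
}
```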
RMS normalization measures the root-mean-square of the entire signal, a rough average power level, and scales to match a target RMS value. Two clips normalized to the same RMS will generally sound like similar volumes even if their peaks differ considerably:
public static float MeasureRms(float[] samples)
{
    double sum = 0;
    for (int i = 0; i < samples.Length; i++)
        sum += samples[i] * samples[i];
    return Mathf.Sqrt((float)(sum / samples.Length));
}
The RMS normalizer includes two guards. A gain cap prevents clipping: the computed gain is clamped so the peak never exceeds 1.0, which matters for quiet clips that have occasional loud transients. A silence guard skips normalization when the measured RMS is below 1e-6, preventing noise amplification on near-silent files:
if (rms < 1e-6f) return; // silence guard
float gain = Mathf.Min(targetRms / rms, 1f / MeasurePeak(samples));
ApplyGain(samples, gain);
The dB math
Gain adjustments across the system use decibels. The conversion between dB and linear amplitude follows the standard formula: linear gain is 10 raised to the power of dB divided by 20.
public static float DbToLinear(float db) => Mathf.Pow(10f, db / 20f);
public static float LinearToDb(float linear) =>
linear > 0f ? 20f * Mathf.Log10(linear) : -80f;
The -80 floor on the linear-to-dB direction is a practical convention. True silence maps to negative infinity dB, which is not useful as a return value. -80 dB sits below the audible threshold of any real content and keeps the value finite. The DbFsToLinearPeak helper converts a dBFS target to its equivalent linear peak: -6 dBFS converts to roughly 0.501, -3 dBFS to 0.708. These values correspond to how audio engineers talk about headroom, so mod authors who know their tools can reason about what the normalizer will do to their files.
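The quoted headroom values check out numerically, using DbToLinear as defined above with System.Math in place of Unity's Mathf:

```csharp
using System;

// Worked check of the dBFS-to-linear values quoted in the text.
Console.WriteLine(DbToLinear(-6f)); // ≈ 0.501
Console.WriteLine(DbToLinear(-3f)); // ≈ 0.708
Console.WriteLine(DbToLinear(0f));  // 1.0: 0 dBFS is full scale

static float DbToLinear(float db) => (float)Math.Pow(10f, db / 20f);
```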
From bytes to AudioClip
The pipeline is two static classes: WavDecoder handles chunk parsing and bit-depth conversion; AudioGain handles normalization and gain math. A mod sound file loads as raw bytes from the .wm archive, passes through WavDecoder.Decode() to produce a float sample array with channel count and sample rate attached, then goes through the appropriate normalization call. The float array is written into a Unity AudioClip via SetData() and registered against its string key in the runtime sound system.
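Put together, the load path reads roughly like this sketch. Only WavDecoder.Decode, AudioGain, and AudioClip.SetData are named in the text; the decoded-result field names, NormalizeRms entry point, and variable names are assumptions for illustration:

```csharp
// Sketch only: runs inside Unity, so field and method names beyond
// WavDecoder.Decode, AudioGain, and AudioClip.SetData are assumptions.
var decoded = WavDecoder.Decode(rawBytes);      // float samples + channels + sample rate
AudioGain.NormalizeRms(decoded.Samples);        // or peak normalization, per clip type
var clip = AudioClip.Create(soundKey,
    decoded.Samples.Length / decoded.Channels,  // AudioClip length is in frames, not samples
    decoded.Channels, decoded.SampleRate, false);
clip.SetData(decoded.Samples, 0);
// clip is then registered against soundKey in the runtime sound system
```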
From that point the clip is indistinguishable from any other clip in the game. It plays as a one-shot with a world-space position for spatial falloff, loops as music, or runs as a room ambience source. Mod authors can export at whatever bit depth and sample rate their DAW defaults to. The decoder handles it.