Hello there. I’m going to explore a pitch detection method called the McLeod Pitch Method, using JavaScript and WebAudio. I’ll try to use this work later in a web-based voice training app to help with voice-triggered dysphoria, but it can be used for other things as well.

A Brief Overview of the McLeod Pitch Method

Before we delve into the code, it’s worthwhile to understand the McLeod Pitch Method (MPM). In simple terms, this method evaluates the similarity between a waveform (our sound signal) and a time-shifted version of itself. By systematically doing this for a series of shifts, we can determine the fundamental frequency: the pitch we’re looking for.
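
For instance, a 100 Hz tone sampled at 44100 Hz repeats every 441 samples, so shifting the waveform by 441 samples (or any multiple of that) lines it up with itself almost perfectly. That shift is the period we’re after, and 44100 / 441 = 100 Hz recovers the pitch.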

For those inclined to explore deeper, the original research paper, “A Smarter Way to Find Pitch” by Philip McLeod and Geoff Wyvill, provides a thorough examination of MPM.

An Implementation

To implement the algorithm outlined in the paper, we need to perform the following steps:

Step 1. Normalized Squared Difference Function (NSDF) Calculation: In this step, we measure the similarity between the waveform and its shifted version. We iterate through our signal and compute an NSDF value for each potential shift; the exact formula is spelled out right after these steps.

Step 2. Identifying Local Maxima: Within the NSDF array, we search for points where a value stands out by being larger than its immediate neighbors. These peaks are crucial in identifying the dominant frequency.

Step 3. Selecting the Pitch from the Highest Peak: The most prominent peak gives us the pitch period we’re searching for. Dividing the sample rate by that period converts it to a frequency in Hertz (Hz).
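
To make step 1 concrete, the NSDF from the paper can be written, using the same variable names as the code below, as:

    nsdf[tau] = 2 * Σ signal[j] * signal[j + tau] / Σ (signal[j]² + signal[j + tau]²)

where both sums run over j = 0 … length − tau − 1. The result always lies between −1 and 1; a value near 1 means the waveform shifted by tau lines up almost perfectly with the original, which is exactly what happens when tau matches the period of the signal.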

So how do we do this in JavaScript? Let’s explore with our code below.

function mcleodPitchMethod(signal, sampleRate) {
    // 1. Calculate the normalized squared difference function (NSDF).
    // NSDF is a measure of how similar a waveform is to a time-shifted version of itself.
    
    let nsdf = new Float32Array(signal.length); // Array to store NSDF values.
    for (let tau = 0; tau < signal.length; tau++) {
        // 'tau' is the amount of time-shift.
        let m = 0.0; // Accumulator for the numerator of the NSDF formula.
        let n1 = 0.0; // Part of the denominator.
        let n2 = 0.0; // Another part of the denominator.
        
        for (let j = 0; j < signal.length - tau; j++) {
            // Calculate the values used in the NSDF formula.
            m += signal[j] * signal[j + tau];
            n1 += signal[j] * signal[j];
            n2 += signal[j + tau] * signal[j + tau];
        }
        
        // Calculate the NSDF value for this value of 'tau'.
        nsdf[tau] = 2.0 * m / (n1 + n2);
    }

    // 2. Find local maxima in the NSDF array.
    // These are potential candidates for the detected pitch period.
    let maxPositions = []; // Array to store positions of local maxima.
    for (let i = 1; i < nsdf.length - 1; i++) {
        // Check if the current value is greater than its neighbors.
        if (nsdf[i] > nsdf[i - 1] && nsdf[i] > nsdf[i + 1]) {
            maxPositions.push(i);
        }
    }

    // 3. Choose the highest peak as the pitch period.
    // The highest peak corresponds to the best match for the pitch period.
    if (maxPositions.length === 0) {
        // No local maxima were found (e.g. near-silence), so there is
        // no pitch to report for this buffer.
        return null;
    }
    let highestPeakPos = maxPositions[0];
    for (let i = 1; i < maxPositions.length; i++) {
        if (nsdf[maxPositions[i]] > nsdf[highestPeakPos]) {
            highestPeakPos = maxPositions[i];
        }
    }

    // Convert the pitch period (in samples) to frequency (in Hz).
    // 'sampleRate' is the number of samples processed per second.
    let pitchFrequency = sampleRate / highestPeakPos;

    return pitchFrequency;
}
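
Before pointing this at a microphone, it’s worth sanity-checking the function on a synthetic tone. Here’s a minimal sketch; I picked 441 Hz because its period is exactly 100 samples at a 44100 Hz sample rate, and the 4096-sample buffer matches what we’ll use later:

// Generate one buffer of a pure 441 Hz sine wave.
let testSampleRate = 44100;
let testSignal = new Float32Array(4096);
for (let i = 0; i < testSignal.length; i++) {
    testSignal[i] = Math.sin(2 * Math.PI * 441 * i / testSampleRate);
}

// Expect a value of roughly 441.
console.log(mcleodPitchMethod(testSignal, testSampleRate));

One caveat: because this simplified version takes the single highest peak, a perfectly periodic tone can occasionally come back as an integer subdivision of the true frequency (441/2, 441/3, …) when a later peak wins by a rounding hair. The full method in the paper guards against this by taking the first peak above a fixed fraction of the global maximum and refining its position with parabolic interpolation; those refinements are omitted here for clarity.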

Volume Thresholding

The algorithm above works well for detecting pitch in a clearly voiced signal. However, on silence or background noise the NSDF peaks are meaningless, and it will happily report a spurious pitch. To make it more robust, we can add a volume threshold: we only attempt pitch detection when the volume of the signal exceeds a certain level. This is a common technique in audio processing. We can use RMS (Root Mean Square) to calculate the volume of the signal. The code below shows how to do this.

// Function to calculate the Root Mean Square (RMS) of a set of audio
// samples. RMS provides a measure of the magnitude (or volume) of an
// audio signal.
function calculateRMS(samples) {
  // Initialize the sum of squares of each sample.
  let sumOfSquares = 0;

  // Loop through each sample in the audio buffer.
  for (let i = 0; i < samples.length; i++) {
    // Square each sample value and add it to the sum.
    sumOfSquares += samples[i] * samples[i];
  }

  // Calculate the mean (average) of the squared samples.
  // Then, take the square root of that mean to get the RMS value.
  let rms = Math.sqrt(sumOfSquares / samples.length);

  // Return the RMS value.
  return rms;
}
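
To get a feel for the numbers involved (which will matter when we pick a threshold in a moment), here’s a quick illustration; the tone frequency and buffer size are arbitrary:

// Digital silence has an RMS of 0.
let quiet = new Float32Array(1024);
console.log(calculateRMS(quiet)); // 0

// A full-scale sine wave has an RMS of about 0.707 (1 / sqrt(2)).
let loud = new Float32Array(1024);
for (let i = 0; i < loud.length; i++) {
  loud[i] = Math.sin(2 * Math.PI * 440 * i / 44100);
}
console.log(calculateRMS(loud)); // ≈ 0.707

A quiet-but-audible voice lands somewhere in between, which is why the threshold below is a small fraction of full scale.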

WebAudio and audio capture

Now that we have a pitch detection method and a way to measure volume, we need to capture audio from the microphone and feed it to the algorithm. We can do this using the WebAudio API. The code below uses a ScriptProcessorNode to get at the raw samples; note that this node is deprecated in favor of AudioWorklet (there’s a sketch of that approach after the code), but it still works in current browsers and keeps the example short.

const PITCH_VOLUME_THRESHOLD = 0.03; // This is an arbitrary number; adjust based on testing and requirements.

// Create a new audio context. This is the primary object used in the Web Audio API.
let audioContext = new (window.AudioContext || window.webkitAudioContext)();

// Create a script processor node with a buffer size of 4096 samples.
// This node lets us directly process raw audio data using JavaScript.
// params are: buffer size, input channels, output channels.
let processor = audioContext.createScriptProcessor(4096, 1, 1);
let sampleRate = audioContext.sampleRate;

// Request access to the user's microphone.
navigator.mediaDevices
  .getUserMedia({ audio: true })
  .then((stream) => {
    // If access is granted, create a media stream source from the microphone stream.
    let source = audioContext.createMediaStreamSource(stream);

    // Connect the media source to the script processor node.
    // This means that audio from the mic will flow into our custom processing node.
    source.connect(processor);

    // Also, connect the script processor to the audio context's destination.
    // This ensures the audio plays back, so you can hear it (optional, based on use case).
    processor.connect(audioContext.destination);

    // This event fires whenever our buffer (of 4096 samples in this case) is full.
    processor.onaudioprocess = function (event) {
      // Get the audio samples from the input buffer.
      let inputBuffer = event.inputBuffer;

      // We're assuming mono sound, so only one channel.
      let inputData = inputBuffer.getChannelData(0);

      // Calculate the volume (RMS) of the current buffer.
      let rms = calculateRMS(inputData);

      // Check if the volume exceeds the threshold.
      if (rms > PITCH_VOLUME_THRESHOLD) {
        // If the volume is above the threshold, we process the data to detect pitch.
        let pitch = mcleodPitchMethod(inputData, sampleRate);
        // mcleodPitchMethod returns null when it finds no peaks.
        if (pitch !== null) {
          console.log("Pitch:", pitch);
        }
      } else {
        // If the volume is too low, skip processing and log a message.
        console.log("Volume too low. Skipping processing.");
      }
    };
  })
  .catch((err) => {
    // Handle any errors that occur when trying to access the microphone.
    console.error("Error accessing the microphone", err);
  });
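
As noted above, ScriptProcessorNode is deprecated in favor of AudioWorklet. For completeness, here’s a rough sketch of what the capture side could look like with an AudioWorklet; the file name pitch-processor.js and the message-passing scheme are my own choices, not anything mandated by the API:

// In a separate file, e.g. "pitch-processor.js":
class PitchProcessor extends AudioWorkletProcessor {
  process(inputs, outputs, parameters) {
    // inputs[0][0] is a Float32Array holding one render quantum
    // (128 samples). Pitch detection needs a much larger window,
    // so we just forward each chunk to the main thread, which
    // accumulates them into a 4096-sample buffer.
    const channel = inputs[0][0];
    if (channel) {
      this.port.postMessage(channel.slice());
    }
    return true; // Keep the processor alive.
  }
}
registerProcessor("pitch-processor", PitchProcessor);

// In the main script, in place of createScriptProcessor:
// await audioContext.audioWorklet.addModule("pitch-processor.js");
// let workletNode = new AudioWorkletNode(audioContext, "pitch-processor");
// workletNode.port.onmessage = (event) => {
//   // Collect event.data chunks into a 4096-sample buffer, then run
//   // calculateRMS and mcleodPitchMethod on it exactly as before.
// };
// source.connect(workletNode);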

Thanks for reading. In part 2, we will discuss the concept of “Formants” and how they can also be a useful metric for voice training and continue from there.