Slicing Odd Beats

One of my musical wet dreams is a real time beat slicer that will slice up and shuffle-replay any type of acoustic input material on the fly. Of course it is easy to fill a circular buffer and replay segments of it at different speeds and in different directions. But I want to carefully slice the stuff, not randomly chop it, so I need a decent analyser to identify coherent segments of sound.

My analyser should specialise in odd sounds, like the cute kalimba which is my favourite electro-acoustical testing-and-doodling-toy.

The kalimba produces very clear attacks, and at least for the eye it is easy to identify the beats:

But, the computer has no eyes and it needs unambiguous instructions about how to determine the exact location of an attack. The simplest method would be to check for a certain sound level, since the highest level is always at the attack. But beats are more about dynamics than about sound level. I need to find a difference, rather than an absolute value. The point where the sudden increase starts.

Let us look at such a point a bit closer. Here is a very clear starting point. Yet the rise in sample values is not that sudden at all, when viewed at this scale:

Looking closer still, individual samples appear. It is from these values that the computer must determine the beat start. But all these cycles could look like a beat start!

One problem is that we see sample values and not amplitudes. These cycles may represent a rather steady amplitude, but how do we get amplitudes from the sample values? Amplitudes can be computed over an interval, by taking the root of the mean of squares. This would give some average amplitude - is that convenient? We are trying to detect a sudden rise... There is another method, computing so-called instantaneous amplitudes. I will illustrate that method later, but let me first do an example plot of sample values versus (instantaneous) amplitudes:
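As a minimal sketch of the root-of-mean-of-squares idea, here is how an average amplitude over one interval could be computed in Python. The function name `rms` and the 64-sample test cycle are only illustrative choices.

```python
import math

def rms(samples):
    """Root of the mean of squares over one interval of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

# One full cycle of a sine with peak amplitude 1.0, spread over 64 samples.
cycle = [math.sin(2.0 * math.pi * n / 64) for n in range(64)]
average_amplitude = rms(cycle)  # close to 1/sqrt(2) for a full sine cycle
```

Note that the result is one number for the whole interval, which is exactly why this kind of average blurs the moment where a sudden rise starts.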

The amplitudes are not exactly constant after the attack. There is amplitude modulation, and this is even a favourable example that I selected. In most cases there is more modulation in the transient part.

In the decay part of the kalimba sound, the amplitude smoothly fades away:

This is a convenient phenomenon. After the detection of an attack, a gate for the analysis routine can be temporarily closed. A timer can be set to open that gate after the transient part, which is relatively short.

A practical implementation of the beat detector could thus start with instantaneous amplitude computation, for which a rather fascinating method exists. The idea is this: if you could create two orthogonal phases of one and the same signal, you would have pairs of samples, and from each pair you could compute the amplitude for that moment in time. The mathematics behind the process is known as the Discrete Hilbert Transform. That is why you can find objects named [hilbert~] in Max/MSP and Pd. More details are on the page Complexify a Real Signal.

attack detector prototype in Max/MSP

Here is a short sketch of an attack detector which uses instantaneous amplitudes for analysis. This method is not perfect, and its description serves to pinpoint the major difficulty in attack detection.

A signal is routed in the Hilbert Transformer, and two orthogonal signal phases come out: a real phase and an imaginary phase. These are both squared, then summed, and the square root of the sum is taken:

[figure: pyth]

The multiplication by 0.707 (1/sqrt(2)) is a normalisation, because the extra phase added its own amplitude. It was not originally there but created as a helper signal.
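The square, sum and square-root steps can be sketched in Python as below. This is not the Max/MSP patch itself: the quadrature pair is built by hand from a cosine and a sine, which is what an ideal Hilbert transformer would deliver for a single sinusoid, and the `scale` parameter is a hypothetical stand-in for the 0.707 multiplication.

```python
import math

def instantaneous_amplitude(real, imag, scale=1.0):
    """Per-sample amplitude from two orthogonal phases of the same signal."""
    return [scale * math.sqrt(re * re + im * im) for re, im in zip(real, imag)]

# Hand-made quadrature pair: a 200 Hz sinusoid with peak amplitude 0.5.
samplerate = 44100.0
freq = 200.0
real = [0.5 * math.cos(2.0 * math.pi * freq * n / samplerate) for n in range(256)]
imag = [0.5 * math.sin(2.0 * math.pi * freq * n / samplerate) for n in range(256)]

amps = instantaneous_amplitude(real, imag)              # steady, one value per sample
scaled = instantaneous_amplitude(real, imag, 0.707)     # same shape, scaled down
```

For this ideal pair, every sample gets the same amplitude value even though the sample values themselves swing between positive and negative. A real Hilbert transformer is built from filters, so its output settles with some delay.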

The amplitudes are routed into a differentiator, because I want to find an amplitude rise, not an amplitude value. The differentiator output is the amplitude difference A[n]-A[n-1]. There is also an object 'delta~' which can do this.

The difference of two successive instantaneous amplitudes reflects an amplitude slope at any sampled moment in time. It is not precisely a tangent but that does not matter for practical purposes.
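The differentiator A[n]-A[n-1] is simple enough to sketch in a few lines of Python. The function name `delta` and the assumption that the value before the first sample is zero are illustrative choices.

```python
def delta(values, previous=0.0):
    """First difference: out[n] = values[n] - values[n-1]."""
    out = []
    for v in values:
        out.append(v - previous)
        previous = v
    return out

# A small amplitude sequence: quiet, a sudden rise, then a slight drop.
slopes = delta([0.0, 0.25, 1.0, 0.75])
```

A large positive value in the output marks a steep rise, a negative value marks a falling amplitude.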

Next comes the thresh~ object, which sends a boolean (0. or 1. for false or true) out, true when the input value reaches the threshold level. There is also a hysteresis level which you can set, so the boolean will not flip-flop rapidly around the threshold level. The output of the thresh~ object is used as a trigger.
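To illustrate what hysteresis does, here is a small Python sketch of a two-level threshold. It is modelled loosely on the description above and is not claimed to reproduce thresh~ exactly; the names `high` and `low` are assumptions for the threshold level and the hysteresis level.

```python
def threshold_with_hysteresis(values, high, low):
    """Output 1.0 once a value reaches `high`, and 0.0 again only after it drops to `low`."""
    state = 0.0
    out = []
    for v in values:
        if state == 0.0 and v >= high:
            state = 1.0
        elif state == 1.0 and v <= low:
            state = 0.0
        out.append(state)
    return out

# The dip to 0.45 stays above `low`, so the output does not flip back.
flags = threshold_with_hysteresis([0.0, 0.6, 0.45, 0.55, 0.2, 0.1], high=0.5, low=0.3)
```

With a single threshold at 0.5, the same input would have switched off and on again around the dip.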

Following thresh~ comes an object cuepoint~, which makes a cuepoint from an index signal coming in its left inlet, at the moment when a trigger signal comes in the middle inlet. cuepoint~ will close a gate internally after a trigger has come in. The right inlet is for a user parameter 'refractory time' (in number of samples). This is a timer that will open the gate again after it has counted down to zero. The output will send the cuepoint as a message, not a signal.
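The gate-and-timer logic can be sketched in Python as follows. This is only a model of the behaviour described above, not the code of cuepoint~: the index signal is replaced by the position in the list, and treating a trigger as a rise from 0 to 1 is an assumption.

```python
def cuepoints(trigger, refractory):
    """Sample indices where the trigger rises while the gate is open."""
    cues = []
    countdown = 0        # samples left before the gate opens again
    previous = 0.0
    for index, t in enumerate(trigger):
        rising = t > 0.5 and previous <= 0.5
        if countdown > 0:
            countdown -= 1            # gate closed: ignore any trigger
        elif rising:
            cues.append(index)        # gate open: store the cuepoint
            countdown = refractory    # and close the gate
        previous = t
    return cues

# Triggers at 1, 3, 6 and 8; with a refractory time of 4 samples, 3 and 8 are ignored.
found = cuepoints([0, 1, 0, 1, 0, 0, 1, 0, 1], refractory=4)
```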

cuepoint~ is not a regular Max/MSP object; I wrote it for the purpose. Converting a value from the signal stream to a message with sample-precise timing is problematic, in my experience. I could not get it fixed with whatever combination of regular objects, though I may have overlooked possibilities. At the page bottom is a link to download the cuepoint~ object, compiled for Intel Mac. Anyway, the intention is to store an array of cuepoints in message form. Such an array need not be very long. If my circular signal buffer has room for a couple dozen audio segments, then that is also the number of cuepoints I need.

Just like the circular audio buffer is recycled by writing new samples over old ones, the same holds for the cuepoints: when the array is full, the oldest value can be discarded and overwritten by the newest value. For messages, this is what the cycle object does in Max/MSP. Here I have an array with four cuepoints in the test patch. I can mouse-click on a cuepoint to set a marker in the waveform at the cuepoint location, and check how accurate the beat detection was...
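The recycling of cuepoint slots can be sketched in Python like this. The class name `CuepointRing` is hypothetical, and the sketch only models the overwrite-the-oldest behaviour, not the Max/MSP cycle object itself.

```python
class CuepointRing:
    """Fixed number of cuepoint slots; the newest value overwrites the oldest."""

    def __init__(self, size):
        self.slots = [None] * size
        self.write_pos = 0

    def store(self, cue):
        self.slots[self.write_pos] = cue
        self.write_pos = (self.write_pos + 1) % len(self.slots)

# Four slots, as in the test patch; the fifth cuepoint replaces the first.
ring = CuepointRing(4)
for cue in (100, 200, 300, 400, 500):
    ring.store(cue)
```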

Here is one such cuepoint, as found by the beat-detection routine:

Checking the same cuepoint at the scale of individual samples, it turns out that the analyser was off by only about 0.2 milliseconds, ten samples. From the figure, it is clear that the amplitude rise is delayed with respect to the sample values. That is because the filters in the Hilbert transformer have a rise time, a kind of mathematical inertia.

Was this detection accuracy coincidental, or systematic? I checked a lot of cuepoints from kalimba beat detection and found the results to be systematically of equivalent accuracy. Maybe the kalimba happens to produce the easiest targets for beat detection?

In the figure below, I pronounced the word 'beat', and what you see is the first 150 milliseconds, the b, and the red line showing the cuepoint:

The 't' of 'beat' comes after a short silence, and it is detected separately:

But there was also a false detection in the middle of 'bea'. That is not surprising, there is a lot of amplitude modulation there, in contrast with the smooth decay of a simple instrument like the kalimba. Can I do something about that?

I tried filtering the amplitudes with a lo-pass filter to shape an amplitude envelope. The filter smooths the amplitude modulation, but it also smooths the attack, and the detection comes way too late, if it comes at all. Therefore, amplitude filtering is not in itself an effective solution.

If I am to distinguish 'beat-internal' or periodic amplitude modulation from overall amplitude fluctuations, I had better inspect its cause and character closely before speculating about solutions.

Here is an example showing three periods from the middle of the word 'beat':

The periodicity, both in sample values and in amplitude, is 5 milliseconds, which translates to 200 Hz. That must be the fundamental of my regular speech, because I found that periodicity at many places throughout the buffer.

periodic amplitude modulation

When different frequencies sound simultaneously, amplitude modulation will happen, caused by phase cancellation. This means that amplitude modulation will be present in all but the simplest types of sound material. This holds for harmonic recipes just as well as for inharmonic sounds. Here are two examples with computer-generated cosines to illustrate that:

Here is the sum of a harmonic set of frequencies: 200, 400 and 600 Hz with equal amplitudes. The periodicity of this wave is 200 Hz, the fundamental frequency.

In harmonic recipes, the amplitude modulation will show rectified-sinusoidal shapes. The periodicity is that of the fundamental note, but the pattern can be quite complex.

This is an inharmonic combination of frequencies: 200 Hz and 211.89 Hz, a one-note (semitone) interval in equal temperament. Theoretically, there is no fundamental period for this particular combination, because the frequency ratio of an equal-tempered semitone is irrational and the two waves never realign exactly. Still, such nearly-equal frequency combinations produce a very strong sense of periodicity, by their amplitude modulation. The difference frequency is perceived in the modulation, in this case (nearly) 12 Hz. With good reason, this phenomenon is called 'beat frequency'.
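The beat frequency can be checked numerically. The Python sketch below relies on the standard identity cos(a) + cos(b) = 2 cos((a-b)/2) cos((a+b)/2), which says that the sum of two equal-amplitude cosines is a fast cosine shaped by a slow one. The names `two_cosines` and `envelope` are illustrative.

```python
import math

f1 = 200.0
f2 = f1 * 2.0 ** (1.0 / 12.0)   # one equal-tempered semitone up, about 211.89 Hz
beat = f2 - f1                  # difference frequency, a little under 12 Hz

def two_cosines(t):
    """Sum of the two cosines at time t (in seconds)."""
    return math.cos(2.0 * math.pi * f1 * t) + math.cos(2.0 * math.pi * f2 * t)

def envelope(t):
    """Slow factor of the product form: it bounds the magnitude of the sum."""
    return abs(2.0 * math.cos(math.pi * beat * t))

# The envelope falls to zero half a beat period after the start,
# which is where the two cosines cancel each other.
first_null = 0.5 / beat
```

The envelope repeats at the difference frequency, although neither of the two components has that frequency. That is the 'ghost' character of the modulation.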

Because amplitude modulation can often be perceived as a low frequency, it is tempting to try hi-pass filtering. That does not help. The modulations are like ghost frequencies: when you try to kill them, it turns out they are not really there, although they keep on plaguing you. The only way to catch them is to average the amplitude over a large analysis period. The analysis should cover two (pseudo-)periods of the lowest amplitude modulation to relieve the effect. For the example above, with the one-note interval and its modulation of nearly 12 Hz, that would be about 0.17 seconds, corresponding to a rate of about 6 Hz. A sudden amplitude rise may still be detected, even within such a large analysis frame, but its exact start position can no longer be found. That is the dilemma.

Is there a practical way to recognise periodic amplitude modulation, and distinguish it from sudden amplitude rise? Of course, periodic modulation is characterised by repetition. Is such repetition systematic enough to identify periodicity? After a lot of experimenting, I decided to test actual amplitude against a 'peak envelope' of amplitudes. By the way, this may be a conventional method, even though I have not seen it explained, so far.

The peak envelope describes an exponential decay of amplitude peaks. Every time an actual amplitude rises above the envelope, the envelope is reset to that new peak value. The peak envelope of an amplitude-modulated signal could look like this:

Periodic amplitude peaks rise above the peak envelope, but only by a small amount. For an unambiguous attack however, the picture is different. It looks more like this:

I found that instantaneous amplitudes still vary too much to make a good peak envelope, and an amplitude average over a couple of milliseconds is about the minimum that works. Therefore, I said goodbye to my beloved instantaneous amplitudes, and compute average amplitudes over blocks of 64 samples, using the root of mean of squares (RMS) method. The envelope decay factor is computed from a time constant in milliseconds, as can be done for resonators (for more details, see the page 'complex resonator'):

decay_factor = exp(-1000/samplerate/timeconstant);
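A peak envelope with this decay factor could be sketched in Python as below. The function names are illustrative. One assumption needs to be stated loudly: `samplerate` here means the rate at which the envelope is updated. The text does not say whether the factor is applied per sample or per 64-sample block; with one update per block, the update rate would be the sampling rate divided by 64.

```python
import math

def decay_factor(samplerate, timeconstant_ms):
    """Decay factor per update, from a time constant in milliseconds."""
    return math.exp(-1000.0 / samplerate / timeconstant_ms)

def peak_envelope(amplitudes, factor):
    """Exponentially decaying envelope, reset whenever an amplitude rises above it."""
    envelope = 0.0
    out = []
    for a in amplitudes:
        envelope *= factor      # let the old peak decay
        if a > envelope:
            envelope = a        # a new peak resets the envelope
        out.append(envelope)
    return out

# An exaggerated factor of 0.5 makes the behaviour easy to follow by hand.
env = peak_envelope([1.0, 0.2, 0.2, 0.9], 0.5)
```

In this toy sequence the envelope decays over the two quiet values and is then reset by the last value, which rises far above the decayed envelope.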

It is also possible to work with deciBels, logarithms of the energy or amplitude. That is what I did accidentally. Below is an impression of amplitudes expressed in deciBel form. DeciBels show a rather different curve.

DeciBels are computed from energy or amplitude with:

deciBel level = 10 * log10(energy / reference energy), or
deciBel level = 20 * log10(amplitude / reference amplitude)
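Both deciBel formulas translate directly into Python. The `floor_db` parameter is an addition of this sketch, not something from the text: the logarithm of zero is undefined, so digital silence needs some lowest value.

```python
import math

def amplitude_to_db(amplitude, reference=1.0, floor_db=-120.0):
    """20 * log10 of an amplitude ratio, limited to a lowest level."""
    if amplitude <= 0.0:
        return floor_db
    return max(floor_db, 20.0 * math.log10(amplitude / reference))

def energy_to_db(energy, reference=1.0, floor_db=-120.0):
    """10 * log10 of an energy ratio, limited to a lowest level."""
    if energy <= 0.0:
        return floor_db
    return max(floor_db, 10.0 * math.log10(energy / reference))

half_amplitude_db = amplitude_to_db(0.5)      # about -6 dB relative to unity
quarter_energy_db = energy_to_db(0.25)        # same level, since energy is amplitude squared
```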

Because deciBels are logarithms of amplitudes, a ratio of amplitudes will translate to a difference in deciBels. Or in other words, an exponential decay of amplitudes translates to a linear decay of deciBels. Here is an impression:

Because of this translation, a peak envelope of deciBels should theoretically be calculated by subtracting a small term, instead of applying a decay factor. This is what I overlooked in my first implementation with a deciBel peak envelope. Lucky me. After discovering my mistake, I compared the wrong method with the correct one. The wrong method turned out to perform better, in practice! It is more sensitive to attacks, while it does not seem to produce more false detections. Therefore, I stick to my wrong method: testing an actual deciBel level against an exponentially decaying peak envelope of deciBels.
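The two variants can be put side by side in a Python sketch. It assumes that deciBel levels are positive numbers, as in the Pd convention where amplitude 1 corresponds to 100 dB. That is an assumption of this sketch: with negative levels, multiplying by a factor below 1 would move the envelope towards 0 dB, that is upwards. The function names and the example numbers are illustrative.

```python
def linear_db_envelope(levels_db, step_db):
    """Theoretically matching variant: subtract a fixed term per update."""
    envelope = 0.0
    out = []
    for level in levels_db:
        envelope = max(level, envelope - step_db)
        out.append(envelope)
    return out

def exponential_db_envelope(levels_db, factor):
    """The 'wrong' variant: apply a decay factor to the deciBel value itself."""
    envelope = 0.0
    out = []
    for level in levels_db:
        envelope = max(level, envelope * factor)
        out.append(envelope)
    return out

# A peak at 80 dB followed by a steady 40 dB (positive dB scale assumed).
levels = [80.0, 40.0, 40.0, 40.0]
linear = linear_db_envelope(levels, step_db=2.0)
exponential = exponential_db_envelope(levels, factor=0.75)
```

In this toy example the exponential variant drops quickly right after a high peak and slows down at lower levels, while the linear variant loses the same number of deciBels at every update. The decay rates here are exaggerated for readability.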

Many of the amplitude modulations happen within a 64-sample frame. At a 44.1 kHz sampling rate, a frame length of 64 samples corresponds to 689 Hz. Periodic amplitude modulation with lower frequency is partially masked by the peak envelope. The extent of masking is determined by the time constant (thus decay factor).

An attack is detected when a specified difference between actual deciBel level and peak envelope is reached or exceeded. For example: when the minimum difference is defined as 6 deciBels, a sound whose amplitude suddenly doubles is identified as an attack. The problem with this test is that low level noise can easily vary by this amount and trigger a false detection. Therefore, the actual deciBel level must also exceed an absolute threshold level, to pass the attack detection test.
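The two-part test can be sketched in Python as follows. The default values of `min_rise_db` and `attack_threshold_db` are hypothetical; the text gives 6 deciBels only as an example and does not name an absolute threshold.

```python
import math

def is_attack(level_db, envelope_db, min_rise_db=6.0, attack_threshold_db=-40.0):
    """True when the level rises far enough above the envelope AND is loud enough."""
    rises_enough = (level_db - envelope_db) >= min_rise_db
    loud_enough = level_db >= attack_threshold_db
    return rises_enough and loud_enough

# A doubling of amplitude expressed in deciBels, a little over 6 dB.
doubling_db = 20.0 * math.log10(2.0)

loud_jump = is_attack(-20.0, -30.0)     # 10 dB rise, above the absolute threshold
noise_jump = is_attack(-70.0, -80.0)    # same rise, but far below the threshold
small_rise = is_attack(-20.0, -22.0)    # loud enough, but only a 2 dB rise
```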

While I am computing deciBel levels anyway, it is worthwhile to test these against a level defined as 'silence'. With this extra test, the end of a sound segment can be identified as well, and silences can be omitted from recording. All together, there are now three reference levels and one reference interval:

- unity gain, the reference amplitude from which the deciBels are computed
- absolute attack threshold in deciBels
- absolute silence threshold in deciBels
- [actual level - peak envelope level], an interval in deciBels
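The silence test can be illustrated with a small Python sketch that finds the stretches of frames worth recording. The function name `sounding_segments`, the example levels and the threshold of -60 dB are assumptions for the sake of illustration.

```python
def sounding_segments(levels_db, silence_threshold_db):
    """Return (start, end) frame index pairs of stretches above the silence level.

    `end` is exclusive: it is the first silent frame after the segment.
    """
    segments = []
    start = None
    for i, level in enumerate(levels_db):
        if level > silence_threshold_db:
            if start is None:
                start = i                      # sound begins
        elif start is not None:
            segments.append((start, i))        # sound ends, silence begins
            start = None
    if start is not None:
        segments.append((start, len(levels_db)))
    return segments

# Frame levels in dB relative to unity gain: two sounding stretches with silence between.
frame_levels = [-80.0, -30.0, -20.0, -70.0, -75.0, -10.0, -15.0]
segments = sounding_segments(frame_levels, silence_threshold_db=-60.0)
```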

A definite advantage of testing against a peak envelope is the large dynamic range covered by the analysis. Even at very low levels, attacks can be detected, if you want. The absolute attack threshold can be used to delimit musical attacks from unintentional clicks and pops of instrument handling or background noises.

In practice, the deciBel peak envelope is very good at masking periodic amplitude modulation. The plot below, of some nonsense words spoken, shows the attack cuepoints as vertical lines. The vowels, with their heavy amplitude modulation, do not produce false triggers with this method.

zero-crossings

Since I work with 64-sample analysis frames now, there is an uncertainty about the exact time position of an attack. The attack started somewhere within a frame. That could be at the beginning of the frame, in the middle, or at... the end? Probably not at the end, because in that case, the attack would be detected in the next analysis frame. So, it is even possible that the attack started just before the frame where it was detected.

Of course, I could inspect a frame of attack detection more closely, and try to find the attack start with better accuracy. But how? Plain sample values tell very little about an amplitude rise. At least, the cue must not point to an index with substantial sample value. Therefore, I decided to keep track of zero-crossings, and locate the cuepoint at a zero-crossing preceding the attack detection.

Let me illustrate an example. Here is a sequence of kalimba notes, with vertical lines indicating attack cuepoints. One of them is marked red, and this attack interrupts a note which has not yet decayed to a low level:

Zooming in on this spot, the position of the cuepoint is shown in a 512 point plot. The sample value of the cuepoint is printed, and the value just before the cuepoint. Between them is a zero-crossing. It is the ideal location for a cuepoint indeed.
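Locating the zero-crossing that precedes a detection can be sketched in Python as below. The sketch searches backwards from the detection index, which is a simplification: in a realtime setting one would rather remember the most recent crossing while samples stream in. The function name is illustrative, and the cuepoint is taken to be the first sample after the sign change, in line with the figure described above.

```python
def last_zero_crossing(samples, detect_index):
    """Index of the first sample after the latest sign change at or before detect_index.

    Returns None when no sign change is found.
    """
    for n in range(detect_index, 0, -1):
        if (samples[n - 1] < 0.0) != (samples[n] < 0.0):
            return n
    return None

# The signal crosses zero between index 3 and 4, then rises into the attack.
wave = [0.3, 0.1, -0.2, -0.1, 0.2, 0.5, 0.9]
cue = last_zero_crossing(wave, detect_index=6)
```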

It is not always so ideal as in the example above. The uncertainty about the location necessitates a safety margin extending over the frame of detection and the preceding frame. Here is an example of a less ideal detection:

Moreover, a low frequency component in a wave preceding the attack can make zero-crossings sparse, thereby enlarging the distance between a zero-crossing and the actual attack start.

An alternative strategy would be to ignore zero-crossings, and just apply a short fade-in to erase the ugly blips that come with arbitrary cuts. But, in my view, a fade-in can erase the most distinct part of a musical sound: its onset. In most cases, the distance from cuepoint to actual attack is less than two blocks of 64 samples, or 3 milliseconds at a 44.1 kHz sampling rate. With a slowed down replay of the audio fragment, the delay increases. For example, at two octaves down, it can be up to 12 milliseconds. Although that is much more than I aimed for when starting my experiments, the timing error does not exceed the limits of practical use.
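The arithmetic behind these figures is easy to verify in Python. The function name and its parameters are illustrative.

```python
def worst_case_delay_ms(frames=2, frame_size=64, samplerate=44100.0, octaves_down=0):
    """Largest distance from cuepoint to attack, in milliseconds at replay speed."""
    seconds = frames * frame_size / samplerate
    return seconds * 1000.0 * 2 ** octaves_down   # each octave down doubles the duration

normal_speed = worst_case_delay_ms()                    # a little under 3 ms
two_octaves_down = worst_case_delay_ms(octaves_down=2)  # four times as long
```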

beatslice recorder

Using the above described technique, a signal can be recorded into a (circular) buffer, while attack cuepoints are stored for replay purposes. I tried building separate loop-recorder and attack detector object classes, for Pure Data. The result was not sample-precise, for some reason. Then I combined the loop-recorder and attack-detector into one single object class, and got sample-accurate registration of cuepoints at zero-crossings. The data is recorded in a named buffer, so it can be accessed for replay by other objects. For that purpose, I designed a specialised player class, which can play a segment at any speed, forward or backward, even if that segment wraps around the cut of the buffer.
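Reading a segment that wraps around the cut of a circular buffer comes down to modulo indexing. The Python sketch below is not the player class mentioned above: it picks the nearest earlier sample instead of interpolating, and the function name and parameters are hypothetical.

```python
def read_segment(buffer, start, length, speed=1.0):
    """Read `length` samples starting at `start`, at any non-zero speed, wrapping around.

    A negative speed reads the segment backwards.
    """
    if speed == 0.0:
        raise ValueError("speed must not be zero")
    size = len(buffer)
    steps = int(length / abs(speed))          # number of output samples
    out = []
    for k in range(steps):
        offset = int(k * abs(speed))          # position inside the segment
        if speed < 0.0:
            offset = length - 1 - offset      # mirror the position for backward replay
        out.append(buffer[(start + offset) % size])
    return out

# An 8-sample buffer holding its own indices; the segment starts at 6 and wraps to 0 and 1.
ring = list(range(8))
forward = read_segment(ring, start=6, length=4)
backward = read_segment(ring, start=6, length=4, speed=-1.0)
half_speed = read_segment(ring, start=6, length=4, speed=0.5)
```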

The record object class, titled [slicerec~], has some details which were not yet mentioned above. I found that the end of a slice, even when it ends neatly at a zero-crossing, can produce an audible blip at replay. This happens when a slice was terminated by a new attack before it decayed to a low level. While a sudden attack is a normal physical or even musical phenomenon, a sudden end is not natural.

The sudden end is a discontinuity, and the blip, sounding so conspicuous in the void after the end, is undesired. Therefore I decided to apply a very short (64 samples) fade-out to every slice, before writing these samples to the buffer. This fade-out is implemented as the second half of a Hann or cosine window (see the page 'FFT window' for details and illustrations on window functions). Despite the limited length, this Hann type fade-out is very effective in eliminating blip sounds.
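A fade-out shaped like the second half of a Hann window could be sketched in Python as below. The exact indexing of the window (starting at gain 1.0 and ending just above zero) is one possible choice, not necessarily the one used in [slicerec~], and the function names are illustrative.

```python
import math

def hann_fade_out(length=64):
    """Second half of a Hann window: gains falling smoothly from 1.0 towards 0."""
    return [0.5 * (1.0 + math.cos(math.pi * n / length)) for n in range(length)]

def apply_fade_out(samples, length=64):
    """Return a copy of `samples` with the fade-out applied to its last `length` samples."""
    length = min(length, len(samples))
    faded = list(samples)
    start = len(faded) - length
    for n, gain in enumerate(hann_fade_out(length)):
        faded[start + n] *= gain
    return faded

# A slice that ends abruptly at full level gets a smooth tail.
slice_end = apply_fade_out([1.0] * 100)
```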

Another point of inspection was the onset of the signal after a period of silence. Once an attack is detected, one or two frames of meaningful audio are already history. If periods of silence are not recorded, the history of two frame lengths must at least be constantly buffered, and inserted when recording is resumed. Since recording is done framewise in the [slicerec~] object, it can not be forced to start at a zero-crossing. A short fade-in is obligatory here. The fade-in is 64 samples and its shape is the first half of a Hann window.
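The history buffer and the fade-in can be sketched together in Python. The class name `PreRoll` and its methods are hypothetical and only illustrate the idea of constantly keeping the last two frames and inserting them, faded in, when recording resumes.

```python
import math
from collections import deque

FRAME = 64

def hann_fade_in(length=FRAME):
    """First half of a Hann window: gains rising smoothly from 0.0 towards 1."""
    return [0.5 * (1.0 - math.cos(math.pi * n / length)) for n in range(length)]

class PreRoll:
    """Keeps the most recent frames so they can be inserted when recording resumes."""

    def __init__(self, frames=2):
        self.history = deque(maxlen=frames)   # the oldest frame drops out automatically

    def push(self, frame):
        self.history.append(list(frame))

    def flush(self):
        """Return the buffered frames as one list, faded in over the first frame."""
        samples = [s for frame in self.history for s in frame]
        for n, gain in enumerate(hann_fade_in(min(FRAME, len(samples)))):
            samples[n] *= gain
        self.history.clear()
        return samples

# Three frames pass by during silence; only the last two are kept.
preroll = PreRoll(frames=2)
for value in (1.0, 2.0, 3.0):
    preroll.push([value] * FRAME)
resumed = preroll.flush()
```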

So far, I have only described how to identify an attack in an audio stream. For realtime applications, the sliced result should preferably be stored in a constantly refreshed circular buffer. Reading slices from a circular buffer brings some peculiarities, unknown from conventional slicers and loopers. These peculiarities are discussed on the next page, because this text is getting too long.

cuepoint~.mxo.zip, 4 KB, contains cuepoint~.mxo, Max/MSP object compiled for Intel Mac

Other downloads which used to be here are replaced by a new package Slice//Jockey.
