Visualising audio data

James Boyden

University of Sydney, Australia

17 November 2003


Abstract

This report will describe various methods of visualising audio data. Using two sample slices of audio data, the different methods will be illustrated and contrasted. Discussion of the shortcomings of each method will lead to the proposal of the next method as a potential improvement, culminating in the presentation of a method of visualising audio data using the time-varying frequency spectrum generated by the short-time Fourier transform.

Keywords: audio; visualisation; visualization; spectral analysis; digital signal processing; discrete Fourier transform; short-time Fourier transform


Contents

Chapters:

  1. Introduction
  2. Obtaining the sample audio data
  3. Visualising PCM data
  4. Fourier transforms and spectral analysis
  5. Visualising STFT data
  6. Animating STFT data (1)
  7. Animating STFT data (2)
  8. Animating STFT data (3)
  9. Conclusion

Appendices:

  1. References
  2. Software used
  3. C programming libraries used

1. Introduction

For most of us, sound is an integral part of life. Not only do we employ spoken language for communication, we also listen to music for pleasure. We are able to discern a wide variety of sounds: in addition to the nuances of spoken communication conveyed through variations in volume, rate and tone, we are able to recognise different speakers based upon the distinguishing audio characteristics of their voices. Musical compositions exploit our ability to identify certain audio characteristics when setting the mood of a piece, or expressing emotion in a narrative.

Yet, for something so intimately embedded in the human consciousness, sound is still difficult to visualise. This report will describe various methods of visualising audio data, with the aim of discovering a style of visualisation which conveys to the observer an understanding of the distinguishing audio characteristics of the data.


2. Obtaining the sample audio data

In order to present slices of audio data which were complex enough to display interesting audio characteristics, yet stable enough that these audio characteristics could be identified and distinguished, it was decided that the sample slices would be taken from musical recordings.

The musical recording selected for analysis was a song called "Like Tears In Rain" by a group called "Covenant". This song was selected because it has a more layered texture than most "pop" music, but fewer layers than most "classical" (orchestral) music, making the individual layers more easily distinguished. Choral music would also be appropriate in this regard, but the different layers (all vocal) would lack the diversity of audio characteristics possessed by more heterogeneous music. Like Tears In Rain is "electronic" music: in addition to an electronic beat, it consists of electronic instrumental harmonies and rhythms, sampled choral harmonies, and recorded human vocals.

Another reason for choosing Like Tears In Rain was the desire to use audio data of the highest possible quality. The digital audio data stored on audio CDs and DVDs is the highest-quality recorded audio material available to consumers, so any musical recordings stored on either of these two media were obviously preferred. Since the author is fortunate enough to own the appropriate Covenant album on audio CD, there were no dilemmas.

The way in which audio CDs store audio material is called "CDDA" (or "Compact-Disc Digital Audio"). Since audio CDs contain stereo sound recordings, CDDA data is composed of two channels of sound, called "left" and "right". The sound itself is encoded in a format called "PCM" (or "Pulse Code Modulation"), which is the closest digital approximation to the original (analog) sound. PCM audio data is like a sequence of snapshots of the sound, taken at very high speed. Each snapshot is called a "sample" and the rate at which snapshots are taken is called the "sampling rate". Different sampling rates are used for different purposes, but CDDA uses a sampling rate of 44100 samples per second (per channel). In general, the higher the sampling rate, the higher the quality of the digital audio recording (that is, the closer the digital audio recording is to the original sound). A sample is just a single number which represents the amplitude of the sound at that exact instant in time.
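
As a concrete illustration (a sketch, not code from the report), one CDDA "frame" of PCM data can be pictured in C as one 16-bit sample per channel, captured 44100 times per second:

#include <stdint.h>

/* One CDDA snapshot: a 16-bit amplitude value for each channel. */
typedef struct {
	int16_t left;    /* amplitude of the left channel at this instant  */
	int16_t right;   /* amplitude of the right channel at this instant */
} StereoSample;

/* Normalise a raw 16-bit sample into the floating-point range [-1.0, 1.0]. */
double
normalise(int16_t s)
{
	return (double) s / 32768.0;
}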

The program cdda2wav enables a user to extract individual tracks from audio CDs and dump the audio data into files (in the WAVE file format) on the hard-disk. The WAVE format was created by Microsoft, and its inclusion in Microsoft Windows 3.1 made it popular on "IBM-compatible" PCs. The WAVE format can store audio data in a variety of encodings, so cdda2wav is able to use the same high-quality PCM encoding as CDDA. cdda2wav was used to extract the track of Like Tears In Rain from the CD and dump it into a file.

The program audacity is a sound editor which can import and export WAVE files. It was used to extract two shorter audio slices from the track of Like Tears In Rain. These particular audio slices were chosen because they each contain a variety of interesting features. Click on the icons below to listen to the audio slices:


3. Visualising PCM data

The most obvious way of visualising audio data which is encoded as PCM is to graph the PCM data directly. Recall that PCM data is the closest digital approximation to the original sound, so displaying PCM data must convey some useful information about the sound. Recall also that PCM data is a sequence of snapshots of the sound over time, where each snapshot represents the amplitude of the sound at that exact instant in time. It is not a leap of faith to expect that graphing this sequence of snapshots will reveal something about how the amplitude of the sound varies over time.

Note, however, that each sample is a single value representing an amplitude, so if we have an audio slice which is 40 seconds long, which was recorded at a sample rate of 44100 samples/second, this will give us 1764000 values (per channel). If we decided to display this as a graph of amplitude as a function of time, we would obtain a graph defined over 1764000 time intervals, each time interval representing the lifetime of a single sample (called the "sampling period"), and each mapped to a single value (per channel). If this graph were displayed on-screen at a scale of one column of pixels per sample, only 0.02 seconds of the audio data could be displayed at once (assuming a screen width of 1000 pixels) and we would need to look at 1764 such screens of data to view the whole 40 second slice. This hardly seems like a useful means of representation.

If, when we are displaying PCM data, we somehow "condense" the graph horizontally, so that multiple samples are plotted within a single column of pixels, the graph will become more manageable -- due to both its reduced length (thus resulting in fewer screens to look at) and reduced resolution (thus making it easier to observe features on the order of a tenth of a second to a second, which is the sort of time-scale which would typically interest us). To plot a column of pixels, one simply notes the maximum and minimum sample values within that column's time interval, and colours all the pixels between them.

For example, if we display 441 samples within a single column of pixels, each column of pixels spans a time interval of a hundredth of a second, allowing 10 seconds of data to be displayed at once, and requiring only four full screens of data to display the whole 40 second slice.
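
In C, the "condensing" of one column might be sketched as follows (function and parameter names are illustrative, not taken from the report's program):

#define SAMPLES_PER_COLUMN 441   /* one hundredth of a second at 44100 samples/second */

/* Find the minimum and maximum sample values within one column's worth of
 * samples; every pixel between the two is then coloured in that column. */
void
column_extremes(const double *samples, long num_samples, long column,
                double *min, double *max)
{
	long start = column * SAMPLES_PER_COLUMN;
	long end   = start + SAMPLES_PER_COLUMN;
	long i;

	if (end > num_samples)
		end = num_samples;

	*min = *max = samples[start];
	for (i = start + 1; i < end; i++) {
		if (samples[i] < *min) *min = samples[i];
		if (samples[i] > *max) *max = samples[i];
	}
}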

A C program was written to load a WAVE file (employing the libsndfile library to load the WAVE audio data as an array of double-precision floating-point values), construct an image of the graph of amplitude as a function of time (441 samples condensed into each column of pixels; time increasing from left to right across the image; right channel below left channel) and then save this image as a file in the PPM format. The program gimp was used to convert this PPM image to a JPEG image. Click on the thumbnailed images below to view the full-size PCM visualisation for each audio slice:
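
A minimal sketch, assuming the libsndfile API, of how the WAVE data might be loaded as an array of double-precision values (error handling abbreviated; this is not the report's actual program):

#include <sndfile.h>
#include <stdlib.h>

/* Load an entire WAVE file as interleaved double-precision samples
 * (left, right, left, right, ...).  The caller frees the returned array. */
double *
load_wave(const char *path, SF_INFO *info)
{
	SNDFILE *file;
	double  *data;

	info->format = 0;
	file = sf_open(path, SFM_READ, info);
	if (file == NULL)
		return NULL;

	data = malloc(info->frames * info->channels * sizeof(double));
	sf_readf_double(file, data, info->frames);
	sf_close(file);
	return data;
}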

It may be observed that this representation of audio data is the one traditionally used by sound editors such as audacity and Microsoft's Sound Recorder. Although computationally-inexpensive to generate, this representation is not particularly enlightening: it provides a reasonable suggestion of the variation in total sound volume over time, but no insight into the combination of sounds which may be heard at a given moment. Hence, this visualisation cannot convey any understanding of the distinguishing audio characteristics of the data more advanced than "it's loud" or "it's soft".


4. Fourier transforms and spectral analysis

Discrete Fourier Transform

The pitch of a musical note is determined by the frequency of the sound waves which are carrying the note. Thus, if we are seeking to discern the combination of musical tones sounding at a particular point in time, an obvious starting-point would be an examination of the frequency spectrum of the sound.

In the field of signal processing, a signal is a set of values which vary as a function of some independent parameter (or parameters). Often, the parameter is time, which means the signal is some quantity which varies over time. A continuous-time signal is a set of values which vary over a continuous time interval, while a discrete-time signal is a sequence of values which vary over a sequence of discrete "chunks" of time. A digital signal is a sequence of digital values which vary over a discrete parameter (or parameters) -- again, this parameter is often time, and each value (called a "sample") represents the amplitude of the signal at that point in time. Obviously, digital audio data in the PCM encoding is a digital signal which varies with time.

Digital signals are the only class of signal which can be concretely stored (value-by-value) and manipulated on a computer, and the set of analysis techniques known as "DSP" (or "Digital Signal Processing") is an important subset of the field of signal processing.

The Fourier transform is used to analyse the frequency content of a signal: given a particular signal, the Fourier transform generates the set of frequencies contained within that signal. The particular class of Fourier transform used to analyse digital signals is the DFT (or "Discrete Fourier Transform"), although this is usually implemented as one of several more computationally-efficient algorithms known collectively as the "FFT" ("Fast Fourier Transform"). Just as the input to the DFT is a sequence of digital values which vary over the discrete time domain, the result of the DFT is a sequence of (complex) digital values which vary over the discrete frequency domain. Just as the absolute value of each (real) digital value in the time domain specifies the amplitude of the signal at that point in time, the "absolute value" (complex modulus) of each (complex) digital value in the frequency domain specifies the magnitude of that frequency in the "amplitude spectrum" of the signal.
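
For illustration only, a naive C implementation of the DFT amplitude spectrum might look like the sketch below; the report's programs use an FFT library instead (see Chapter 5). For N input samples, bin k corresponds to the frequency k multiplied by (sampling rate / N), for k = 0 .. N/2.

#include <math.h>

/* Compute the amplitude spectrum (complex modulus of each DFT bin) of the
 * real-valued signal x[0..n-1].  The caller supplies mag[0..n/2]. */
void
dft_magnitude(const double *x, int n, double *mag)
{
	const double pi = 3.14159265358979323846;
	int j, k;

	for (k = 0; k <= n / 2; k++) {
		double re = 0.0, im = 0.0;
		for (j = 0; j < n; j++) {
			re += x[j] * cos(2.0 * pi * k * j / n);
			im -= x[j] * sin(2.0 * pi * k * j / n);
		}
		mag[k] = sqrt(re * re + im * im);   /* the complex modulus */
	}
}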

Nyquist Frequency

The process of converting an analog signal to a digital signal is known as "sampling". To sample an analog sound, snapshots are taken at a regular time interval (called the "sampling period"). The reciprocal of the sampling period is the sampling rate, which describes the number of snapshots taken per second. (Sampling is obviously central to the process of recording analog sound into a digital representation such as PCM.)

Any arbitrary analog signal may be constructed using a particular combination of sinusoidal waves of various frequencies and amplitudes. (This is the basis of Fourier analysis.) There are no limitations on the range of frequencies which may be contained within an analog signal. In contrast, the highest rate of fluctuation which can be represented by a digital signal is half the sampling rate. Consider, for example, a sequence of ones:

[1, 1, 1, 1, 1, 1, 1, ...]

This represents a constant digital signal, one which displays no variation. Consider now an alternating sequence of ones and zeros:

[1, 0, 1, 0, 1, 0, ...]

This represents a digital signal which repeats itself every two samples, so the duration of a single fluctuation is twice the sampling period, and the frequency of fluctuation is half the sampling rate of the signal. (Note that the units of frequency are "fluctuations per second" (or Hertz), while the units of sampling rate are "samples per second".) It is possible to construct digital signals which repeat every three, four or more samples; the longer the duration of a single fluctuation, the lower the frequency. Thus, it may be concluded that the highest possible frequency which may be contained within a digital signal is half the sampling rate. This frequency is called the "Nyquist frequency".

Shannon's sampling theorem states that, when sampling an analog signal, if the sampling rate is at least double the highest frequency component in the analog signal, then the analog signal will be sampled without frequency distortion. Otherwise, the frequency spectrum of the analog signal will not be correctly represented by the digital signal (a phenomenon known as "aliasing"); and if this digital signal is then reconstructed back into an analog signal (for instance, when a digital audio recording is played back through headphones or loud-speakers), the new analog signal will be distorted relative to the original. Simply put: if there are any frequency components in the analog signal which are greater than the Nyquist frequency, the digital signal will represent a distorted frequency spectrum. For this reason, when recording analog sound into a digital representation, it is common practice to filter out all frequencies higher than the Nyquist frequency before sampling occurs. It may be noted that CDDA, which samples at 44100 samples per second, has a Nyquist frequency of 22.05 kHz. Humans generally hear frequencies in the range of 20 Hz to 20 kHz, making it possible to filter out frequencies above 22.05 kHz for CDDA recordings, without affecting the audible band of the frequency spectrum.

Since the highest possible frequency in any digital signal is the Nyquist frequency, the frequency spectrum generated by the DFT will consist of a sequence of frequencies distributed evenly from 0 Hz to the Nyquist frequency.

Short-Time Fourier Transform

The problem with the DFT is that it only describes the magnitude of each frequency in the amplitude spectrum of the whole signal; it provides no fine-grained information about how the magnitude of a particular frequency varies through time (such as when a note of that frequency begins to sound, how long it sounds for, and how its volume varies throughout its lifetime). This problem is addressed by the STFT ("Short-Time Fourier Transform"). Instead of analysing the whole signal, the STFT applies the DFT to a short sub-interval of the signal (called a "window"), generating a frequency spectrum for that window. The window is then advanced by a certain amount through the signal, and the next sub-interval is analysed. By analysing a succession of windows over the lifetime of the signal, a succession of frequency spectra can be obtained. This succession of frequency spectra will demonstrate how the frequency spectrum of the signal changes over time.

Unfortunately, the STFT cannot magically produce more information out of nothing: there is a trade-off between precision in the time domain and precision in the frequency domain. Recall that the DFT acts upon a sequence of time values, to produce a sequence of frequency values. In fact, the length of the sequence of frequency values will be directly proportional to the length of the sequence of time values. The greater the number of samples in the signal, the greater the number of frequencies the DFT will be able to distinguish. Conversely, the shorter the STFT window (providing a greater resolution in the time domain), the sparser the frequency spectrum will be (the result of a lesser resolution in the frequency domain). The frequencies will still be distributed evenly between 0 Hz and the Nyquist frequency: there will just be fewer of them.
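
To make this trade-off concrete: for a window of N samples taken from a signal sampled at f_s samples per second, the DFT yields N/2 + 1 evenly-spaced frequencies, so the spacing between adjacent frequencies is

    delta_f = f_s / N

Halving the window length therefore doubles the resolution in the time domain, but also doubles the spacing between the frequencies. (The figures quoted in the next chapter follow from this relation.)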

This trade-off can be mitigated by lengthening the STFT window slightly and allowing windows to overlap, but even this approach has a drawback: the increase in the resolution in the frequency domain is offset by a slight "fuzziness" in the time domain, since the signal is duplicated in adjacent windows.


5. Visualising STFT data

In signal processing applications where precision is important, a balance must be found between precision in the time domain and precision in the frequency domain. For the purposes of visualisation, however, both of these concerns are overshadowed by the requirement that the viewer be presented with an appropriate amount of information. Just as the huge quantity of raw PCM data threatened to overwhelm the viewer with an excess of high-resolution information, so too must the resolution of the STFT visualisation be limited.

The previous C program was modified to produce an STFT visualisation instead of a PCM visualisation (using the FFT implementation provided by the FFTW library). A window length of 4096 samples was chosen (since the FFT operates most efficiently upon windows whose length is a power of 2), which resulted in a frequency spectrum of 2049 distinct frequencies in the range from 0 Hz to 22.05 kHz (a frequency resolution of approximately 10.8 Hz). An overlap of 2332 samples per window was chosen, which meant that each successive window advanced by 1764 samples. (These exact values were chosen primarily to be "nice round numbers": 44100 divided by 1764 equals 25, resulting in a time resolution of exactly 25 spectra per second.) When it was observed that nearly all of the interesting features of the audio data lay in the frequency range from 0 Hz to 2000 Hz, the frequency spectrum was truncated at 250 frequencies, resulting in a spectrum ranging from 0 Hz to 2700 Hz. (For reference, the range of the human voice is 300 Hz to 3000 Hz.)
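
A minimal sketch of the STFT loop described above, assuming the FFTW 3 real-to-complex interface (the report may have used a different FFTW interface; variable names are illustrative, the sketch processes a single channel, and no window function is applied):

#include <fftw3.h>
#include <math.h>

#define WINDOW_LEN 4096
#define HOP        1764                    /* 44100 / 1764 == 25 spectra per second */
#define NUM_BINS   (WINDOW_LEN / 2 + 1)    /* 2049 frequencies from 0 Hz to 22.05 kHz */

/* For each window, compute the magnitude of every frequency bin and hand the
 * resulting spectrum to the caller via emit_spectrum(). */
void
stft(const double *samples, long num_samples,
     void (*emit_spectrum)(const double *magnitude, int num_bins))
{
	double       *in   = fftw_malloc(sizeof(double) * WINDOW_LEN);
	fftw_complex *out  = fftw_malloc(sizeof(fftw_complex) * NUM_BINS);
	fftw_plan     plan = fftw_plan_dft_r2c_1d(WINDOW_LEN, in, out, FFTW_ESTIMATE);
	double        magnitude[NUM_BINS];
	long          start;
	int           i, k;

	for (start = 0; start + WINDOW_LEN <= num_samples; start += HOP) {
		for (i = 0; i < WINDOW_LEN; i++)
			in[i] = samples[start + i];
		fftw_execute(plan);
		for (k = 0; k < NUM_BINS; k++)
			magnitude[k] = sqrt(out[k][0] * out[k][0] + out[k][1] * out[k][1]);
		emit_spectrum(magnitude, NUM_BINS);
	}

	fftw_destroy_plan(plan);
	fftw_free(in);
	fftw_free(out);
}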

Although the amount of information presented by the STFT visualisation had been limited to useful levels, the challenge was how to present this 3-dimensional data (amplitude as a function of frequency over time) in a 2-dimensional image, ensuring that the data was easily and immediately understandable, that local fluctuations in the data were clearly visible, and that it was possible to identify interesting features in the data at a glance. It was decided that the data should be represented by a 2-dimensional "colour map", where the two independent variables of the data occupied the two dimensions of the image (time increasing from left to right across the image; frequency increasing within each channel from bottom to top; right channel below left channel) and the amplitude of each frequency at each point in time was represented by a colour. The mapping of amplitude (which lay in the floating-point range [0.0, 1.0]) to colour underwent a great deal of fine-tuning (since the most illuminating mapping for a particular set of data is utterly dependent upon the characteristics of that data), but the final mapping was a set of gradients:

[0.0, 0.005) -> black to dark blue
[0.005, 0.01) -> dark blue to green
[0.01, 0.03) -> green to yellow
[0.03, 1.0] -> yellow to white
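
A sketch of how such a piecewise-linear mapping might be implemented is given below. Only the amplitude break-points are taken from the list above; the RGB values chosen for "dark blue", "green" and "yellow" are illustrative guesses, not the report's tuned values:

typedef struct { unsigned char r, g, b; } Colour;

/* Linearly interpolate between two colours; t lies in [0.0, 1.0]. */
static Colour
blend(Colour a, Colour b, double t)
{
	Colour c;
	c.r = (unsigned char) (a.r + t * (b.r - a.r));
	c.g = (unsigned char) (a.g + t * (b.g - a.g));
	c.b = (unsigned char) (a.b + t * (b.b - a.b));
	return c;
}

/* Map an amplitude in [0.0, 1.0] to a colour, using the gradients above. */
Colour
amplitude_to_colour(double amp)
{
	const Colour black  = {   0,   0,   0 };
	const Colour blue   = {   0,   0, 128 };
	const Colour green  = {   0, 160,   0 };
	const Colour yellow = { 255, 255,   0 };
	const Colour white  = { 255, 255, 255 };

	if (amp < 0.005) return blend(black,  blue,   amp / 0.005);
	if (amp < 0.01)  return blend(blue,   green,  (amp - 0.005) / 0.005);
	if (amp < 0.03)  return blend(green,  yellow, (amp - 0.01)  / 0.02);
	return                  blend(yellow, white,  (amp - 0.03)  / 0.97);
}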

Click on the thumbnailed images below to view the full-size STFT visualisation for each audio slice:

It is extremely gratifying to observe that this representation of audio data reveals a distinct frequency structure: it clearly demonstrates the presence of different frequencies, and provides a suggestion of harmonies and rhythms. It enables the observer to identify large-scale features in the data: in the visualisation of audio slice 1, it is possible to identify when the electronic beat stops and then starts again; in the visualisation of audio slice 2, the four phrases of the male vocals are easily distinguished, including such details as the drop-off in vocal pitch halfway through each phrase. It is almost possible, while listening to the audio slices, to trace through the visualisations by associating visual structure with the harmonies and rhythms of the music. This style of visualisation will be known as the "colour-map landscape".


6. Animating STFT data (1)

Rather than requiring the viewer to manually trace through the visualisation in time with the sound, why not trace through for them?

The previous C program was modified so that, instead of producing a single PPM image, it produced a sequence of PPMs which simulated the motion of a rectangular viewport over the previous colour-map landscape. The viewport traversed the colour map through time, with the current point in time (the column of pixels representing the current STFT window) in the centre of the viewport (marked with a red line), and 100 STFT windows (equivalent to 4 seconds of sound) on either side.
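
The frame-generation loop might be sketched as follows; the three drawing helpers stand in for the report's actual drawing code and are hypothetical:

#define HALF_SPAN 100   /* 100 STFT windows == 4 seconds at 25 spectra per second */

/* Hypothetical helpers: */
void copy_spectrum_column(long window, int frame_column);
void draw_red_line(int frame_column);
void write_ppm_frame(long frame_index);

/* Write one PPM frame per STFT window, showing the HALF_SPAN windows on
 * either side of the current window, which is marked in the centre. */
void
render_viewport_frames(long num_windows)
{
	long current;
	int  offset;

	for (current = 0; current < num_windows; current++) {
		for (offset = -HALF_SPAN; offset <= HALF_SPAN; offset++) {
			long window = current + offset;
			if (window >= 0 && window < num_windows)
				copy_spectrum_column(window, offset + HALF_SPAN);
		}
		draw_red_line(HALF_SPAN);   /* the "current point in time" */
		write_ppm_frame(current);
	}
}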

To avoid the possible confusion of a representation in which the contents of one channel appear to flow into the other, the channels must be placed side-by-side so that the two red lines are collinear. Because the usual shape of a computer screen encourages images which are wider than they are high, it was decided that the layout of the channels and the direction of the time axis should be changed from previous layouts so that the channels are placed horizontally while the time axis is vertical. As would be intuitively expected, the left channel was placed on the left, and the viewport moved downwards over the colour map, with the frequency increasing within each channel from left to right.

The program ppmtompeg was used to encode each sequence of PPM images into an MPEG-1 video with a frame-rate of 25 frames per second, then the program ffmpeg was used to combine this MPEG video with the WAVE audio of the corresponding audio slice, producing an audiovisual MPEG. Click on the thumbnailed screen-shots below to view the STFT animation for each audio slice:

The MPEG compression resulted in a noticeable loss of audio and video quality, but the combination of audio and visual data was a brilliant success. In addition to the obvious synchronisation of the audio structure with the visual structure generated by the STFT, this form of visualisation even highlights features of the audio data which were not immediately apparent when listening to the music on its own: in the visualisation of audio slice 1, a harmony of low-frequency string instruments is highlighted when the electronic beat stops; in the visualisation of audio slice 2, a tremolo in the singer's voice is revealed at the end of each sub-phrase. Surely, there cannot be many visualisations which convey to the observer a better understanding of the distinguishing audio characteristics of the data. This style of visualisation will be referred to as the "colour-map landscape animation".


7. Animating STFT data (2)

If there is one drawback to the colour-map landscape animation, it is that it does not clearly distinguish variations in the amplitude of each frequency in regions with a dense rhythmic or harmonic structure: the eye is not good at discerning subtle variations of colour in regions of brightness.

When presented with a dense horizontal array of similar objects, the eye is better able to distinguish and differentiate between objects of subtly-different height than objects of subtly-different colour. Since we are now using movies instead of 2-dimensional still images to visualise our 3-dimensional data, it is possible to use the extra temporal dimension of movies to represent the time dimension of our STFT data. This leaves the two dimensions of each video frame to represent frequency and amplitude. When displaying 2-dimensional data, it is convention that the independent variable (in this case, frequency) increases along the horizontal axis while the dependent variable (in this case, amplitude) increases along the vertical axis.

The previous C program was modified so that the contents of each frame now represented a single STFT window. Each frame consisted of an array of white vertical spikes on a black background, one spike for each of the 250 frequencies, with the height of each spike proportional to the amplitude of that frequency in the current STFT window. Although the amplitudes lie in the floating-point range [0.0, 1.0], it was observed that the majority of the frequencies had a very low amplitude, which resulted in a poor definition of the frequency structure. However, it was decided that simply "zooming in" on a smaller frequency range (for example, [0.0, 0.25]) was not the solution, since:

  1. the amplitudes of the frequencies corresponding to the electronic beat were on the whole much larger than the rest of the amplitudes, and "zooming in" to an arbitrary degree of magnification might "cut off" the tops of the taller spikes. On the other hand, if the maximum amplitude of each audio slice were determined in advance, and the degree of magnification calculated to scale the largest amplitude of that slice to exactly 1.0, this would result in different scalings being applied to each slice, which would in turn preclude a meaningful side-by-side comparison.
  2. it was debatable whether small linear scalings would even be able to provide sufficient definition of the frequency structure.

Hence, it was decided that non-linear scaling was the answer. This scaling was required to adhere to three conditions:

  1. To avoid "cutting off" the tops of the tallest spikes, the scaling must ensure that the range of the amplitudes is still [0.0, 1.0]. That is, the value 0.0 must be scaled to the value 0.0, while the value 1.0 must be scaled to the value 1.0.
  2. Despite its non-linearity, the scaling must preserve the ordering of the spikes. That is, the tallest spike must remain the tallest, the second-tallest spike must remain the second-tallest, etc.
  3. The whole point of scaling the amplitudes in the first place is to provide greater definition of the frequency structure. Thus, the smaller amplitudes must be scaled in a way which results in enhancement of the subtle variations between adjacent amplitudes, and hence greater differentiation.

It was felt that these requirements were best and most simply fulfilled by logarithmic scaling, implemented in C code similar to the function below:

/* Logarithmic scaling: maps the range [0.0, 1.0] back onto [0.0, 1.0]
 * (since log2(0 + 1) = 0 and log2(1 + 1) = 1), preserving the ordering of
 * values while boosting the smaller amplitudes.  Requires #include <math.h>. */
double
ScaleValue(double v)
{
	return log(v + 1.0) / log(2.0);   /* log2(v + 1) */
}

Every amplitude was filtered through this scaling before being displayed, but there was still not enough definition. Filtering every amplitude through this logarithmic scaling twice did not provide enough definition; nor did filtering each amplitude three times. After some experimentation, it was determined that each amplitude needed to be filtered six times in total before being displayed. The table below illustrates the effect of this scaling, comparing values before and after:

Before      After
0.00000     0.00000
0.00001     0.00009
0.00010     0.00090
0.00100     0.00894
0.01000     0.08273
0.10000     0.48287
1.00000     1.00000
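
The repeated filtering described above amounts to something like the following sketch (the function name is illustrative; the body simply repeats the ScaleValue computation given earlier):

#include <math.h>

/* Apply the logarithmic scaling six times in succession. */
double
ScaleValueSixTimes(double v)
{
	int i;
	for (i = 0; i < 6; i++)
		v = log(v + 1.0) / log(2.0);
	return v;
}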

Despite the scaling of spikes, the rapid frame-rate (25 frames per second) meant that each frame tended to flash by, making it difficult to observe short-lived high-points of the frequency structure. To make it easier to observe these high-points, the program was modified so that spikes would "fade out" in subsequent frames instead of simply disappearing. Click on the thumbnailed screen-shots below to view the STFT animation for each audio slice:

In comparison with the colour-map landscape animation, this visualisation is much worse at representing extended structure over time, but much better at representing rapid rhythms. The sharp peaks of the clearly-defined frequency structure provide a comparable demonstration of the presence of harmonic structures, and a superior illustration of rapid changes of frequency (such as the vocal "glissandos" in audio slice 2). This style of visualisation will be referred to as the "spike snapshot animation".


8. Animating STFT data (3)

The colour-map landscape animation and the spike snapshot animation are complementary in their strengths and weaknesses: the former provides a better visualisation of frequency structure (especially harmonic structure) over an extended period of time, while the latter provides a superior representation of dense rhythmic structure. Hence, it seemed an obvious step forward to combine these two representations in the search for the best possible style of visualisation. The challenge was how to combine them -- the colour-map landscape animation used the two dimensions of each frame for frequency and time, while the spike snapshot animation used them for frequency and amplitude. It was necessary to represent three dimensions within the two dimensions of the frame.

The solution adopted was to use some form of perspective to display a 3-dimensional structure. Since the amplitude had already been associated with height in the spike snapshot animation, it made sense to maintain this association in the 3-dimensional world, while the time and frequency would inhabit the two horizontal dimensions. To provide an easy approximation to perspective, the colour-map landscape animation was transformed by a shear operation: this tricked the eye into believing that the landscape was receding into the distance. The spike snapshot animation was then overlaid onto this scene, sitting at the "current point in time" marked by the red line. This would have the effect that frequency structures would approach on the "horizontal" landscape, and upon reaching the "current point in time" would instantaneously be converted into the "vertical" spike snapshot, before receding into the distance on the landscape again. Thus, the spike snapshot animation would intuitively appear to be a magnification of the structure on the colour-map landscape.
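
The shear might be sketched as follows; the shear factor here is an illustrative guess, not a value from the report:

#define SHEAR_PER_ROW 0.5

/* Map a landscape pixel's x position to its sheared x position in the frame.
 * Rows further from the "current point in time" are shifted further sideways,
 * so the colour-map landscape appears to recede into the distance. */
int
sheared_x(int x, int rows_from_current_time)
{
	return x + (int) (SHEAR_PER_ROW * rows_from_current_time);
}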

The previous two C programs were combined to implement these changes. To enable the white spike snapshot animation to "stand out" more from the brighter regions of the landscape in the background, it was surrounded by a translucent-black halo. The gimp program was used to create titles and axes as PPM images, which were then loaded by the C program and incorporated into the image. Click on the thumbnailed screen-shots below to view the STFT animation for each audio slice:

This style of visualisation really captures the best of the colour-map landscape and spike snapshot animations. It provides a detailed analysis of the frequency structure over time, courtesy of the colour-map landscape, while the much more dynamic spike snapshot animation makes the visualisation feel "real-time", rather than merely a correspondence between what is heard and what is seen. A sense of the energy of the song is conveyed.


9. Conclusion

This report set out to describe various methods of visualising audio data, with the aim of discovering a style of visualisation which conveys to the observer an understanding of the distinguishing audio characteristics of the data. The distinguishing audio characteristics of the audio slices taken from the Covenant song Like Tears In Rain may be enumerated as:

  1. the electronic beat;
  2. the electronic instrumental harmonies and rhythms;
  3. the sampled choral harmonies;
  4. the recorded human vocals (including such details as the drop-off in pitch, the glissandos and the tremolo noted in earlier chapters).

All of these audio characteristics are represented by the final, combined STFT animation. It cannot be disputed that the final STFT animation achieves the aim of this report.


Appendix 1. References


Appendix 2. Software used

cdda2wav -- extraction of the track from the audio CD into a WAVE file
audacity -- sound editing; extraction of the two shorter audio slices
gimp -- conversion of PPM images to JPEG; creation of the title and axis images
ppmtompeg -- encoding of the PPM image sequences into MPEG-1 video
ffmpeg -- combination of the MPEG video with the corresponding WAVE audio


Appendix 3. C programming libraries used

libsndfile -- loading of WAVE audio data as arrays of double-precision floating-point values
FFTW -- the FFT implementation used for the STFT


James Boyden, 2003
[Email: jboyden at student.usyd.edu.au]
