Most sounds nowadays are stored digitally (if you don't count ancient LPs and non-digital magnetic tapes). Let's see how it's done.
First, sound is converted to electrical current using a microphone. Continuous oscillations of air pressure become continuous oscillations of voltage in an electrical circuit. This fast-changing voltage is then converted into a series of numbers by a digitizer. A digitizer acts like a very fast digital voltmeter. It makes thousands of measurements per second. Each measurement results in a number that can be stored digitally (that is, only a finite number of significant digits of this number are recorded). This number is called a sample and the whole conversion of sound to a series of numbers is called sampling.
If you have a sound card in your computer, it has a digitizer. If you plug in a microphone, the sound card will produce a stream of samples that can be retrieved by your software (see example).
To give you an example--CD-quality music recording is created by sampling the sound 44,100 times per second and storing each sample as a 16-bit binary number (twice as much for a stereo recording). So an hour of stereo music is equivalent to 3,600 * 44,100 * 2 = 317,520,000 samples or 635,040,000 bytes. That's over half a gigabyte--which is more or less the capacity of a standard CD. (You can drastically reduce storage requirements if you apply some clever compression scheme--for instance MP3).
There are many ways to learn about the Fourier transform. However, since you are presumably a programmer who deals with binary numbers, the digital approach will be more to the point.
Sound acquisition in a computer is usually implemented by getting a buffer full of samples at a time--an array of integers, each integer corresponding to the amplitude of the sound at a particular moment. In mathematics, an array of numbers is called a vector (that's why the C++ Standard Library calls its dynamic array a vector). More precisely, a vector is an abstract entity that can be represented as an array of numbers in a particular coordinate system. Sound analysis can be thought of a representing the same vector of samples in different coordinate systems.
Think of a three-dimensional vector. Imagine it as an arrow of some length, pointing in some direction in space. If you have a coordinate system, you can project this vector, call it v, on the three axes and get three numbers, x, y, and z. These are the coordinates of this vector in this particular coordinate system.
However, you are not bound to one particular coordinate system. The same vector can be represented in a different coordinate system by a different set of numbers. A coordinate system in a 3-dimensional space is nothing but a set of three vectors. These are called the basis vectors. We are used to orthonormal coordinate systems in which the three basis vectors are orthogonal to each other and have unit length. In general, however, this is not necessary. What really matters is for the three vectors to be linearly independent to form a good basis. Vectors are linearly independent if none of them can be represented as a linear combination of the others. In particular, you can't have one vector lying in the plane formed by the two other vectors. Such a trio wouldn't form a good coordinate system in 3-d.
Imagine now that you have a new set of three basis vectors, i1, i2, and i3, and you want to find out the coordinates of v in this new basis. A coordinate of v with respect of a basis vector ik is equal to its projection onto ik. So to get the three coordinates of v in the new basis, you have to project v onto the three basis vectors, i1, i2, and i3.
How do you calculate these projections? That's easy. First, each of the new basis vectors must be represented as three numbers--these are the coordinates of the new basis vectors in our original coordinate system. For instance i1 = (x1, y1, z1). The projection of v = (x, y, z) onto i1 is equal to
(v * i1) = x * x1 + y * y1 + z * z1This is called a scalar product of two vectors.
All these notions generalize easily to n-dimensional spaces. An n-dimensional vector can be represented as n numbers. A basis in an n-dimensional vector space consists of n linearly independent vectors. To calculate the coordinates of a vector v in a new basis, ik (k = 1...n), we have to calculate scalar products of v with each of the n basis vectors.
Although wavelet analysis is only a recent discovery, it is in principle simpler than its predecessor, Fourier analysis, especially in the digital domain.
Consider a fixed-size buffer that is the result of sampling a sound. If the buffer contains n samples, it can be viewed as an n-dimensional vector. Each sample becomes a coordinate of this vector in some n-dimensional space. For instance, if we only had three-sample buffers at our disposal, the values of the samples would correspond to the x, y, z coordinates of a 3-d vector. Normally we have hundreds or thousands of samples in a single buffer, so it's really hard to imagine the multi-dimensional space in which they would form a vector. But the math is still the same.
The wavelet transform is obtained by re-calculating the coordinates of this vector of samples using a different basis in n-dimensional space. The result is, again, n numbers. We start with n numbers and we end up with n numbers, so what's the big deal? It turns out that, depending on the choice of the basis, these new numbers may be easier to interpret. They can tell us something about the sampled sound that was not immediately obvious. Or we may drop some of these numbers for transmission or storage--we would have compressed the sound in a particular way.
Here's a little example. Suppose that we are dealing with small buffers, each containing 8 consecutive samples. Instead of drawing an 8-dimensional vector to represent each buffer (on a 2-dimensional screen?), I will draw the 8 samples as 8 vertical segments, one after another.
The eight basis vectors can be represented in the same fashion. Here's an example that is often used in wavelet analysis, the so-called Haar basis.
This particular basis is orthogonal--the scalar product between any two basis elements is zero. Moreover, the "square" of each element (the scalar product with itself) is equal to one. For instance, the last basis vector in the picture has the following coordinates
i8 = (0, 0, 0, 0, 0, 0, a, -a), a = 1/sqrt(2)and its square is
(i8*i8) = 02 + 02 + 02 + 02 + 02 + 02 + a2 + a2 = 2 * a2 = 1
A scalar product of an arbitrary vector v = (v1, v2, v3, v4, v5, v6, v7, v8) with the nth basis vector, in is called the nth wavelet coefficient. For instance, let's calculate the 8th wavelet coefficient of v, or the scalar product of v with i8:
w8 = (v * i8) = (v7 - v8)/sqrt(2)It is simply the difference between the seventh and the eighth vector components (or samples in our buffer of samples). Notice that if the signal changes very slowly, this difference will be very small--the 8th wavelet coefficient will be small.
Just by looking at the pictures of the basis vectors, you can figure out that the first wavelet coefficient, or the scalar product of v with i1, tells you the average amplitude of the signal; the second tells you how much difference there is between the first half of the signal and the second half; and so on. If you decide to drop the last four coefficients, you'll just lose the information about some fine details of the signal, but you'll still be able to recover its overall shape. That's the principle of wavelet compression in a nutshell.
The set of wavelet coefficients, w1, w2, ... w8 is called the wavelet transform of the signal v. Since we are operating on 8-sample buffers, this is called an 8-point wavelet transform.
How do we recover the signal given its wavelet coefficients? Simple: multiply the basis vectors by these coefficients and add them up:
v = w1 i1 + w2 i2 + ... + w8 i8For instance, if w1 = 10, w2 = -4, and the rest is zero, the reconstructed signal will be:
v = 10 * i1 - 4 * i2Taking into account proper normalization of these basis vectors, the result is:
v = (3, 3, 3, 3, 7, 7, 7, 7)/sqrt(2)