2E1366
Project Course in Signal Processing and Digital Communication
May 2000
Johan Abramson
Magnus Flodman
Mattias Hessel
Anders Kjellström
Henrik Lundin
Fassil Mehary
The task of this project was first to identify beeps in The Jerry Springer Show and then to replace them with synthetic sound. The sound was to be chosen such that the word loss caused by the beep was concealed to the greatest extent possible. The difficulty of the concealment was to find a replacement sound that matched the sound in the show before the beep.
A beep was detected when a Fourier transform showed that most of the signal energy lay at the beep frequency of 1000 Hz. A sequence of the signal preceding the detected beep was then compared to a number of predefined AR parameter sets. After the most suitable AR parameters had been found, the sound was generated from AR parameters for three related frequency subbands, combined with previously calculated residuals from appropriate sound sequences. The subband division was made in order to enhance the quality of the replacement sound.
The solution did not give an unambiguous result. However, the evaluation of the result by multiple listening tests showed that the solution was satisfactory.
2.1 Previous work and Fundamental Limitations
3.2 Subband Signal Processing
3.3 Estimating AR parameters and Residuals
3.5.3 Generation of Replacement Sound
4.1.1 Generation of Parameters and Residuals
4.1.3 Subband Signal Processing
4.2.1 Deviations from the Matlab code
5.2.2 Test Results
B.1 Linear Extrapolation in the Time-Frequency Domain
B.3 On-Line Estimation of AR parameters
The rapid speed of everyday error concealment applications makes them difficult to study and not very intuitive. A simple yet effective way to study error concealment is to study a large-scale example such as error concealment of word losses in human speech. The unaided human ear is capable of detecting abnormalities in the relatively slow human speech. This gives inaccurate but intuitive measures of error concealment quality.
The censorship of language in American talk shows provides a good example of word losses in human speech. The task of this project was to compensate for these word losses, which makes it a comprehensive example of error concealment.
The problem can be divided into two parts: detection of the sine tone and error concealment. The sine tones, referred to as beeps, have a known frequency. Beep detection should be sensitive enough not to miss any beeps while keeping false detections to a minimum.
In the error concealment, the beeps should be replaced with some sound more comfortable to the human ear. It should be something that sounds natural in the show.
· PC (Intel Pentium III, 64 MB internal RAM) with Microsoft Windows 98
· Texas Instruments TMS320C6701 EVM digital signal processor
· Texas Instruments Code Composer Studio 1.0
· Microsoft Visual C++ 6.0
· MATLAB 5.3
· VCR and TV set
· Video recording of the Jerry Springer Show
The flow chart consists of different steps and two main branches: the “beep” case and the “no beep” case. A brief description is given below. In the following subsections, a more thorough description will be made.
In the first step one block, or frame, consisting of M samples is read from the input. In the second step, the beep detection investigates whether the block contains a beep signal. This is performed with one block look-ahead, i.e. the further treatment of block k is determined by the beep detection on block k+1. This is necessary because the beep detection algorithm is not reliable when the block is not a homogeneous beep block (see also section 3.4). The outcome of the beep detection determines which step comes next.
If the block does not contain a beep, the signal is analysed. First, the auto-correlation function, and hence the variance, is estimated. Next, the program tries to distinguish between two different cases that have to be treated in different ways:
Pure talk: The sound consists of speech on top of a calm audience.
Noisy audience: The sound consists of a noisy audience with or without speech on top.
These two cases require different types of replacement sound. In the pure talk case, the dedicated AR process for calm audience is selected. Furthermore, the variance of the replacement sound is set to be the estimate of the variance of the background sound, which is rather low compared to the variance of the speech. In the noisy audience case, the program selects the AR process that is the best representation of the present block. The variance of the replacement sound is set to be the variance of the preceding signal so that the replacement will sound natural. These operations are all done when the present block does not contain a beep, and the selected AR parameters and variance are stored for use if the next block contains a beep.
If, on the other hand, the block contains a beep, a replacement signal is generated from the previously selected AR parameters using subband signal generation (see section 3.2). To accomplish a smooth transfer between the authentic and the synthetic sound the replacement sound is faded in and out at the beginning and the end of the beep, respectively (see 3.5.3). Finally, the sound is sent to the output with a two-block delay, i.e. 64 ms using 8 kHz sample frequency and 256 samples per block, due to signal processing delay and the one block look-ahead.

Figure 3.1 Solution outline with the two main branches Beep and No beep. Replacement sound is generated in the Beep branch. The No beep branch consists of the subbranches Noisy audience and Pure talk.

Figure 3.2 A three-band encoding is shown in a). All filters have cut-off frequency at $\omega_g = \pi/2$ in the local sample rate. b) shows the frequency ranges for the resulting subbands ($\pi$ is the Nyquist frequency at the input).
The solution depends on generating sound from AR parameters and (quantised) residuals. The generated signal should resemble a studio audience. Thus, the signal consists mainly of human voices, which suggests a speech coding approach to the problem.
One common method is subband coding, or frequency subdivision, described in [1]. The idea is to divide the speech signal into two signals, one low-pass covering $0 \le \omega \le \pi/2$ ($\omega$ is discrete-time angular frequency) and one high-pass covering $\pi/2 \le \omega \le \pi$. Both signals can now be downsampled (decimated) by a factor 2 without aliasing or loss of information. Subsequently, the low-pass signal can be divided once more in the same manner. Figure 3.2a shows a two-step subband encoder and Figure 3.2b shows the corresponding frequency subdivision. The purpose of frequency subdivision is twofold:
· The AR parameters can be concentrated to the speech frequencies (approximately 500 Hz)
· The downsampling allows fewer AR parameters and shorter residuals.
Note, however, that multiple AR parameter sets and residuals are required, one for each subband.
Reconstruction of the signal is the reverse of the above. The subband signals are generated separately from each other using the corresponding AR parameters and residuals. The signals are upsampled (interpolated), filtered and added according to Figure 3.3.
Figure 3.3 A three-band decoding system. The filters are the same as in the encoder (Figure 3.2).
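The two-step split and its reverse can be sketched as follows. This is a minimal illustration, not the filters used in the project: the report does not specify its filters, so the sketch swaps in the two-tap Haar pair, which is the simplest half-band pair with perfect reconstruction. The function names `split`, `merge`, `encode3` and `decode3` are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # keeps the two-tap Haar pair orthonormal


def split(x):
    """Split x (even length) into half-rate low-pass and high-pass signals."""
    half = len(x) // 2
    low = [(x[2 * i] + x[2 * i + 1]) * _S for i in range(half)]
    high = [(x[2 * i] - x[2 * i + 1]) * _S for i in range(half)]
    return low, high


def merge(low, high):
    """Inverse of split: upsample both halves, filter and add."""
    x = []
    for l, h in zip(low, high):
        x.append((l + h) * _S)
        x.append((l - h) * _S)
    return x


def encode3(x):
    """Two-step encoder: split once, then split the low-pass half again."""
    lower_half, high = split(x)
    low, mid = split(lower_half)
    return low, mid, high


def decode3(low, mid, high):
    """Three-band decoder: rebuild the lower half first, then the full band."""
    return merge(merge(low, mid), high)
```

With a 256-sample block the three subbands hold 64, 64 and 128 samples, which is where the saving in AR parameters and residual length comes from.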
Let $s(n)$ be a sound sequence. $s(n)$ is divided into subband signals $s_L(n)$, $s_M(n)$ and $s_H(n)$ for the low, middle and high bands respectively, according to section 3.2. For each subband signal an AR model is generated using the Levinson-Durbin recursion (see for instance [1]). Denote the AR parameters $a_B(i)$, $i = 0, 1, \ldots, N_B$, with $a_B(0) = 1$,
where $B$ denotes the subband (i.e. $B$ is $L$, $M$ or $H$) and $N_B$ is the model order for the subband $B$. The residual is the estimation error $e_B(n)$ such that
$$e_B(n) = \sum_{i=0}^{N_B} a_B(i)\, s_B(n-i).$$
The residual is quantised to $b$ bits, in order to save memory space, without significant loss of audio quality. Let $e_{Bq}(n) = Q_b[e_B(n)]$ denote the $b$-bit quantised residual.
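As an illustration of this step, the sketch below estimates AR parameters with the Levinson-Durbin recursion and computes the residual as the output of the prediction-error filter. It assumes the convention $a(0) = 1$ from the notation list, a biased autocorrelation estimate, a non-zero $r(0)$, and zero signal history before the first sample; the function names are hypothetical and quantisation is left out.

```python
def autocorr(x, max_lag):
    """Biased autocorrelation estimate r(0..max_lag) of a zero-mean signal."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, err): AR parameters with a[0] = 1 and the prediction error power."""
    a = [1.0] + [0.0] * order
    err = r[0]  # assumed non-zero
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err  # reflection coefficient of stage m
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] + k * a[m - i]
        new_a[m] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def residual(x, a):
    """e(n) = sum_i a(i) x(n-i), taking x(n) = 0 for n < 0."""
    return [sum(a[i] * x[n - i] for i in range(len(a)) if n - i >= 0)
            for n in range(len(x))]
```

For one subband signal `s_b`, a call such as `levinson_durbin(autocorr(s_b, 5), 5)` would give a fifth-order parameter set, and `residual(s_b, a)` the sequence to be quantised and stored.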
Finally, a scaling factor $T_B$ has to be derived for each subband $B$. Let $\sigma_s^2$ denote the variance of the full-spectrum signal $s(n)$ and $\sigma_B^2$ denote the variance of the subband signal $s_B(n)$. Then, $K_B$ is the ratio between the energy in the subband $B$ and the total energy of $s(n)$ such that $\sigma_B^2 = K_B\sigma_s^2$. Furthermore, let $H_B$ be the ratio between the energy of the output from the AR process defined by (3.1) and the energy of its driving noise. In other words, if the input signal to the AR process has variance $\sigma_{w_B}^2$ then the output has variance $H_B\sigma_{w_B}^2$. The factor $T_B$ is defined from $K_B$, $H_B$ and $\sigma_{e_B}^2$,
where $\sigma_{e_B}^2$ is the energy of the quantised residual sequence $e_{Bq}(n)$ as it is stored. Thus, when generating synthetic sound, the residual $e_{Bq}(n)$ should be scaled according to (3.4)
to produce a properly scaled signal. Here, $\sigma_y^2$ is the desired variance of the synthetic signal $y(n)$.
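The scaling can be worked through from the definitions above. The following is one consistent reading, not necessarily the exact form of the report's equations: it assumes that $T_B$ absorbs the square root, that the synthetic signal keeps the subband energy split $K_B$ of $s(n)$, and that the decoder filters preserve energy.

```latex
% Target: subband B of the synthetic signal should carry the share K_B of sigma_y^2.
\sigma_{y_B}^2 = K_B\,\sigma_y^2,
\qquad
\sigma_{y_B}^2 = H_B\,\sigma_{w_B}^2
\;\Longrightarrow\;
\sigma_{w_B}^2 = \frac{K_B}{H_B}\,\sigma_y^2 .

% The driving signal is the stored residual times a constant c:
w_B(n) = c\,e_{Bq}(n),
\qquad
c^2\,\sigma_{e_B}^2 = \sigma_{w_B}^2
\;\Longrightarrow\;
c = \sigma_y \sqrt{\frac{K_B}{H_B\,\sigma_{e_B}^2}} .

% Hence, under the stated assumptions:
T_B = \sqrt{\frac{K_B}{H_B\,\sigma_{e_B}^2}},
\qquad
w_B(n) = T_B\,\sigma_y\,e_{Bq}(n).
```

Under this reading $T_B$ depends only on stored quantities and can be computed once per subband and parameter set, leaving $\sigma_y$ as the only run-time input.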
The audience was modelled as white Gaussian noise with unknown variance. This model turned out to work well. Hence, there are two different test hypotheses H0 and H1:
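· $H_0$: $x(n) = w(n)$, where $w(n)$ is Gaussian noise,
· $H_1$: $x(n)$ is a sine tone at the beep frequency $f_0$ plus $w(n)$, where the phase $\varphi$ is unknown.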
The beep detector is designed from three quantities: (3.5), the short-time periodogram of $x(n)$; (3.6), an estimate of the variance of $x(n)$; and (3.7), the ratio $T$ between the two. A beep is detected when $T$, the ratio between the short-time periodogram and the average variance, is greater than the threshold $\gamma$. In other words, a block with a beep tone has a greater part of its total energy at the frequency $f_0$ than a block without a beep tone.
The threshold level $\gamma$ is a design parameter that depends on the selected block size and sample frequency. When choosing $\gamma$ there is a trade-off between the probability of missing a beep and the probability of false detections, i.e. detecting a beep when none is present.
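The detection rule can be sketched as below. The normalisation is an assumption: the periodogram is taken as $|X(f_0)|^2/M$ and the variance as the mean square of the block, which may differ from the exact form of (3.5) through (3.7). The function names and default arguments are hypothetical, with the defaults taken from Table 4.1.

```python
import cmath
import math


def beep_ratio(block, f0=1000.0, fs=8000.0):
    """Ratio T between the periodogram at f0 and the block variance (zero mean assumed)."""
    m = len(block)
    dft = sum(x * cmath.exp(-2j * math.pi * f0 * n / fs)
              for n, x in enumerate(block))
    periodogram = abs(dft) ** 2 / m          # assumed 1/M normalisation
    variance = sum(x * x for x in block) / m
    return periodogram / variance if variance > 0.0 else 0.0


def is_beep(block, threshold=80.0, f0=1000.0, fs=8000.0):
    """A beep is declared when T exceeds the threshold."""
    return beep_ratio(block, f0, fs) > threshold
```

Under this normalisation a pure tone at $f_0$ gives $T = M/2$, i.e. 128 for $M = 256$, so thresholds of 80 and 100 leave some margin for beeps mixed with background sound.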
Pure talk: The sound consists of speech on top of a calm audience.
Noisy audience: The sound consists of a noisy audience with or without speech on top.
The algorithm estimates the variance of the block, assuming zero mean, using (3.6). The variances of the K latest blocks, including the present block, are stored in a vector v(k), k = 0, 1, …, K-1, where K-1 represents the present block. Figure 3.4 depicts typical variance vectors for the two cases. The figure shows that the variance comes closer to zero in the pure talk case. This property is exploited in the talk detection algorithm.
[Figure 3.4: typical variance vectors v(k) for the pure talk and noisy audience cases]
The vector $v$ is normalised with the mean of the vector, $\bar{v} = \frac{1}{K}\sum_{k=0}^{K-1} v(k)$. Finally, the minimum of the normalised vector is compared to a threshold level $\alpha$ as in (3.8). If the minimum value is less than $\alpha$, i.e. $\min_k v(k)/\bar{v} < \alpha$, the case pure talk is selected; otherwise noisy audience is selected.
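The decision rule is small enough to sketch directly. The function name and the handling of an all-zero variance vector (treated here as pure talk) are assumptions; the default threshold is the value of the audience detection threshold in Table 4.1.

```python
def classify_history(v, alpha=0.05):
    """Classify the K latest block variances as 'pure talk' or 'noisy audience'."""
    mean = sum(v) / len(v)
    if mean <= 0.0:
        # Assumption: complete silence counts as a calm audience.
        return "pure talk"
    normalised_min = min(v) / mean
    return "pure talk" if normalised_min < alpha else "noisy audience"
```

A single near-silent block among the K latest is enough to select pure talk, which matches the observation that pauses between words expose the calm audience.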
The normal equations, or N.E., (3.9), relate the AR parameters to $r_{xx}(k)$, an estimate of the autocorrelation function of $x(n)$ at lag $k$.
The optimal set of AR parameters is the one that minimises the left-hand side of the N.E. in the least-squares sense, (3.10).
Rewriting the product to be minimised in (3.10) as in (3.11)
results in a computationally efficient algorithm, since $\mathbf{T}$ is Toeplitz. Furthermore, the first term in the last equality in (3.11) can be omitted when minimising in (3.10), since it does not depend on the AR parameters. The AR parameters used in (3.9) through (3.11) are not the subband parameters derived in subsection 3.3. Instead, a master set describing the full-spectrum signal is used.
The (number of the) selected parameter set is stored for the K latest blocks. When the replacement sound is to be generated, the parameter set that has been picked the most times among the K latest blocks is used. Here K is a design variable.
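The selection and the vote over the K latest blocks can be sketched as follows. The fit measure is an assumption about the form of (3.9) and (3.10): it evaluates $\sum_i a(i)\, r_{xx}(k-i)$ for lags $k = 1, \ldots, N$ with $a(0) = 1$ and sums the squares, without the Toeplitz rewriting of (3.11). The names `ne_error`, `select_ar` and `SetVoter` are hypothetical.

```python
from collections import Counter, deque


def ne_error(a, r):
    """Squared norm of the normal-equation left-hand side for AR set a (a[0] == 1).

    r must hold the autocorrelation estimates r(0..N), where N = len(a) - 1.
    """
    order = len(a) - 1
    return sum(
        sum(a[i] * r[abs(k - i)] for i in range(order + 1)) ** 2
        for k in range(1, order + 1)
    )


def select_ar(ar_sets, r):
    """Index of the predefined AR set that best fits the autocorrelation r."""
    return min(range(len(ar_sets)), key=lambda j: ne_error(ar_sets[j], r))


class SetVoter:
    """Remembers the selections of the K latest blocks and returns the most frequent."""

    def __init__(self, memory=20):
        self.history = deque(maxlen=memory)

    def push(self, index):
        self.history.append(index)

    def winner(self):
        # Requires at least one pushed selection.
        return Counter(self.history).most_common(1)[0][0]
```

Voting over several blocks keeps a single atypical block just before the beep from deciding the character of the whole replacement sound.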
The residual $w_B(n)$ is scaled using equation (3.4), where the desired variance $\sigma^2$ is set in two different ways for the cases pure talk and noisy audience, respectively. In both cases the variance vector $v$ is used (cf. subsection 3.5.1). In the pure talk case, $\sigma^2$ is set to the second smallest value in $v$. The reason for this is that the low-variance blocks can be interpreted as blocks with only the audience. To avoid near-zero values the second smallest value is selected.
In the case of noisy audience, the change in variance preceding the beep is repeated during the beep. The purpose of this method is to resemble the rhythm of the audience. In each block, the subband residuals are scaled with a ramp function g(n). The ramp function reaches between the last and the first value in v. Equation (3.13) shows how the subband residual for one block is scaled.
When a block of replacement sound has been generated, the v vector is rotated one step, so that the first value becomes the last.
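The two ways of setting the desired variance can be sketched as below. The exact form of (3.13) is not reproduced here: the sketch assumes a ramp that is linear in variance, running from the last value in v to the first over one block, and both function names are hypothetical.

```python
def pure_talk_variance(v):
    """Desired variance in the pure talk case: the second smallest value in v."""
    return sorted(v)[1]


def noisy_audience_ramp(v, block_len):
    """Return (ramp, rotated_v) for one replacement block.

    ramp holds one target variance per sample, moving linearly (an assumption)
    from the last value in v towards the first. rotated_v is v rotated one
    step so that the first value becomes the last.
    """
    start, end = v[-1], v[0]
    ramp = [start + (end - start) * (n + 1) / block_len
            for n in range(block_len)]
    return ramp, v[1:] + v[:1]
```

Calling `noisy_audience_ramp` once per replacement block and feeding the rotated vector back in replays the variance pattern that preceded the beep.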
To assure good sound quality, yB(n-l) in the right-hand side of (3.12) should not be set to zero for negative indices, i.e. when l > n. Samples from the preceding block should be used instead, or else the block frequency (fs/M) will be heard as a disturbance. Thus, yB(-i) in block k is equal to yB(MB-i) in block k-1. The subband signals are merged according to the methods in subsection 3.2 to form the replacement signal y(n).
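Carrying the filter state from one block to the next can be sketched as follows. The sketch assumes that (3.12) has the usual all-pole form $y_B(n) = w_B(n) - \sum_{l=1}^{N_B} a_B(l)\, y_B(n-l)$; the function name and the layout of the state list are hypothetical.

```python
def ar_synthesise(a, w, state):
    """Generate one block of an AR process driven by the scaled residual w.

    a     : AR parameters with a[0] == 1
    w     : driving samples for this block
    state : the last len(a) - 1 outputs of the preceding block, oldest first
    Returns (block, new_state); new_state is passed to the next call.
    """
    order = len(a) - 1
    hist = list(state)
    out = []
    for wn in w:
        yn = wn - sum(a[l] * hist[-l] for l in range(1, order + 1))
        out.append(yn)
        hist.append(yn)
        hist.pop(0)
    return out, hist
```

Passing `new_state` into the next call gives the same output as filtering all blocks in one pass, so no disturbance at the block rate is introduced.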
Next, the transitions to and from replacement sound have to be smooth. This is accomplished by fading with the authentic sound x(n). Figure 3.5 and the text below describe the fading procedures.

Figure 3.5 Authentic and synthetic sound before, during and after a beep. The output sound z(n) is the sum of the authentic and the synthetic sound. Note that the beginning of the synthetic sound is mixed with a mirror copy of the last authentic sound.
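The fade at the start of a beep can be sketched from the caption above. The fade shape is an assumption (a linear cross-fade), as are the function name and the choice to fade over as many samples as both inputs provide; with the fade time of 0.2 s and a sample frequency of 8 kHz this would be 1600 samples.

```python
def fade_in_with_mirror(last_authentic, synthetic):
    """Cross-fade from a mirrored copy of the last authentic samples into synthetic sound.

    last_authentic : authentic samples immediately preceding the beep
    synthetic      : replacement samples starting at the beep
    Returns the output samples for the synthetic part.
    """
    n = min(len(last_authentic), len(synthetic))
    mirror = last_authentic[::-1][:n]  # last sample first
    out = list(synthetic)
    for i in range(n):
        gain = (i + 1) / n  # linear ramp (assumption)
        out[i] = (1.0 - gain) * mirror[i] + gain * synthetic[i]
    return out
```

Mirroring makes the output continue from the last authentic sample, so the waveform has no jump at the start of the beep.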
In the theory section, some design variables were left unassigned. These have been assigned values according to Table 4.1.
| Parameter | Symbol | Value |
|---|---|---|
| Block size | $M$ | 256 samples |
| Sample frequency | $f_s$ | 8 kHz |
| Beep frequency | $f_0$ | 1000 Hz |
| Fade time | | 0.2 s |
| Variance memory | $K$ | 20 |
| Audience detection threshold | $\alpha$ | 0.05 |
| Beep detection threshold | $\gamma$ | 80 (100 in DSP[2]) |
| Master AR order | $N$ | 10 |
| Low band AR order | $N_L$ | 5 |
| Mid band AR order | $N_M$ | 5 |
| High band AR order | $N_H$ | 10 |
| Number of AR sets | | 4 |
The main function reads the library files and the sound file to be filtered. The sound is read from a .mat-file. All constants and global variables are defined as well. After these initialisation procedures are done the main function starts the program loop outlined in Figure 3.1, using the following functions:
sample reads the next block of data from the in-sound vector.
beepdetect performs the beep detection.
gensound generates one block of synthetic sound using the selected AR parameters and residuals.
select_ar selects which AR process fits best with the current situation.
set_var handles the variance extrapolation described in subsection 3.5.3.
submerge performs the subband decoding, i.e. performs upsampling and appropriate filtering of the two input signals and finally adds them.
Since the signal processing is performed in blocks, the filters have to be initialised correctly in order to assure maximal sound quality (cf. subsection 3.5.3). Thus, the input signal should not be set to zero for negative indices, i.e. for l > n. The last samples in the preceding block should be used instead according to equation (4.2).
This scheme, however, is not used if the present block is the first block in the beep, when x(n) should be zero for negative n because no subband signals are available for the sound preceding the beep.
Another difference between the C code for the DSP and the Matlab code was the sparing use of separate functions. Memory handling in the DSP was more convenient when the DSP had fewer functions to handle.
In the C code, the replacement parameters (e.g. AR parameters and residual values) were included in the source code, instead of reading them from a separate file, which was the case in Matlab. This led to a loss of generality and user modification of the parameters was rendered impossible. The procedure of reading replacement parameters from a separate file would require knowledge of interaction between C++ programs, executed in the CPU, and C programs, executed in the DSP, a time consuming task for the programmers. The entire final program was written and executed in the DSP since the task at hand did not demand CPU-DSP interaction. The program was not slowed down by excluding the CPU but on the contrary, the program was executed faster.
The memory handling in the DSP presented many problems. A large number of variables were declared globally in order to avoid function calls with pointer-type arguments. This does not affect the functionality of the final program.
The execution of the program was not satisfactory at first. The number of calculations slowed the DSP down to such an extent that the output sound was distorted. After using the built-in optimiser this problem disappeared.
Some bugs in the DSP hardware and software (TMS320C600 DSP and the Code Composer Studio version 1.0) turned up during the programming. After consulting the Texas Instruments DSP homepage, these bugs could be avoided.
Figure 5.1 A censored audio signal (bottom) and the corresponding beep detection ratio $T$ (top). The detection threshold $\gamma$ is shown as the dashed line. $T$ larger than $\gamma$ indicates a beep.
In the DSP implementation, a very similar structure was used. The threshold was set by a combination of human knowledge and estimations based on the Matlab results. The threshold value was fixed at 100 in the DSP.
In the first part, the listener heard six different sound sequences, each about five seconds. In these sequences, a number of beeps had been replaced using the Matlab implementation of the solution (see 5.2.2 for the exact number of beeps in the different sequences). The listener was prompted to guess how many beeps each sequence contained.
In the second part, the listener first heard a short sequence with a beep followed by the same sequence with the beep replaced. Then, the test in part one was repeated, using the same six sequences. The purpose of this was to give the listener a better knowledge of what the replacements sounded like and evaluate whether this knowledge led to a different test result.
Figure 5.2 Mean of the answers and the actual number of beeps in each sequence. The graph indicates that the test group was unable to detect all beeps.
Overall, the subjective listening test results are considered satisfactory. However, the results are not interpreted as proof of success, but as a guide to how the system performs. A more thorough test involving more test persons should be conducted to verify the results.
The involved techniques, such as beep detection, subband signal processing and selection of AR parameters, worked as expected.
The program is easy and straightforward to operate. Three different versions of the program can be run: Beep Filter, Beep Demo and Bypass. Beep Filter replaces the beeps with generated sound, Beep Demo amplifies the beeps and Bypass outputs the original sound.
Follow the description below to run the program.
1. Start the program JerrySpringer.exe. This can be done by double clicking on the file in Windows Explorer. You will now get a window like the one in Figure 6.1.
2. Choose which of the versions to run in the drop down list.
3. Press the Run Program button and the chosen program will start.
4. Press the Stop Program button to stop the program.
5. Start another version or exit JerrySpringer with the Exit button.
Figure 6.1 The Jerry Springer interface is easy to use. Select which program to run and press “Run Program”.
A number of other methods have been evaluated (in Matlab) and discarded due to poor performance in the problem at hand. Some of these methods are described in this subsection. Their common fault is that they try to generate replacement sound using information from the preceding sound only. Since the preceding sound generally is dominated by one speaker, the replacement sound will be coloured by this speaker and hence not resemble the audience.
B.1 Linear Extrapolation in the Time-Frequency Domain
This method is described in [3]. The two blocks preceding the beep are transformed to the frequency domain using the discrete Fourier transform (DFT). The replacement signal is synthesised in the frequency domain, treating the amplitude and phase separately.
The amplitudes for the different frequencies of the replacement blocks are all set to the amplitude of the last block preceding the beep in the corresponding frequency. The phase, on the other hand, is linearly extrapolated from the two preceding blocks for each frequency. Finally, the replacement blocks are inverse-transformed to form the sound.
The method did not give satisfactory performance in the problem at hand, probably because the beeps are too long. In addition, the two blocks used for extrapolation should be overlapping and closer in time to improve performance.
In this method, the beeps were replaced with a duplication of the preceding sound. Two different versions were tried: one where the replacement sound was reversed and one where it was not. Some linear filtering was tried in order to make the copy sound slightly different from the authentic sound. However, all versions tried sounded like a copy of the sound before the beep.
B.3 On-Line Estimation of AR parameters
This third method continuously estimated AR parameters for the sound preceding the beep. When the beep started, the most recently derived AR parameters were used, driven by white Gaussian noise.
The resulting sound was, as mentioned above, influenced by the speaker's voice. Furthermore, a noise-driven process often sounds noisy. The result was a noise with approximately the same spectral properties as the talker in the preceding sound.
| Symbol | Description |
|---|---|
| | AR parameters representing $s(n)$ |
| | AR parameters representing $s_B(n)$ |
| $a(i)$ | The $i$:th AR parameter ($a(0) = 1$) |
| $a_B(i)$ | The $i$:th AR parameter for band $B$ ($a_B(0) = 1$) |
| $\alpha$ | Audience detection threshold |
| $B$ | Subband designator, e.g. $L$, $M$ or $H$ |
| $b$ | Residual quantisation bits |
| $\gamma$ | Beep detection threshold |
| $e_B(n)$ | Residual, or estimation error, for band $B$ |
| $e_{Bq}(n)$ | Quantised residual, or estimation error, for band $B$ |
| $f_0$ | Beep frequency |
| $f_s$ | Sample frequency |
| $g(n)$ | Scaling ramp function |
| $I(f)$ | Short-time periodogram |
| $K$ | Variance memory length |
| $K_B$ | Energy ratio between $y_B(n)$ and $y(n)$ |
| $M$ | Block size in samples per block |
| $N$ | Master AR order |
| $n_{be}$ | Discrete time index for beep end |
| $n_{bs}$ | Discrete time index for beep start |
| $N_H$ | High band AR order |
| $N_L$ | Low band AR order |
| $N_M$ | Mid band AR order |
| $\mathbf{R}$ | Autocorrelation matrix for $x(n)$ |
| $r_{xx}(k)$ | Estimate of autocorrelation function for $x(n)$ at lag $k$ |
| $s(n)$ | Sound signal from which a set of AR parameters and residual is generated |
| $s_B(n)$ | $B$-band part of $s(n)$ |
| $\sigma_{e_B}^2$ | Variance of $e_{Bq}(n)$ |
| $\sigma_{w_B}^2$ | Variance of $w_B(n)$ |
| $\sigma_s^2$ | Variance of $s(n)$ |
| $\sigma_B^2$ | Variance of $s_B(n)$ |
| | Estimate of variance of $x(n)$ |
| $T$ | Ratio between $I(f)$ and the estimated variance of $x(n)$ |
| $\mathbf{T}$ | Toeplitz autocorrelation matrix for $x(n)$ |
| $T_B$ | Residual scaling factor for band $B$ |
| $v$ | Block variance vector |
| $w_B(n)$ | Scaled quantised residual for band $B$ |
| $\omega$ | Discrete-time angular frequency |
| $x(n)$ | Input signal (from VCR) |
| $y(n)$ | Synthetic replacement signal |
| $y_B(n)$ | $B$-band part of $y(n)$ |
| $z(n)$ | Output signal (to TV set) |
[1] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications. New Jersey: Prentice-Hall, 1996.
[2] S. M. Kay, Fundamentals of Statistical Signal Processing, Detection Theory, Section 7.6.4. New Jersey: Prentice-Hall, 1998.
[3] F. Laxhed, Linear Domain Error Concealment for Speech Frame Losses in Packet Switched Networks, Master Thesis IR-SB-EX-9911, Royal Institute of Technology, Stockholm, 1999.