Error Concealment for Speech Losses in TV Transmissions

2E1366
Project Course in Signal Processing and Digital Communication

May 2000

Johan Abramson

Magnus Flodman

Mattias Hessel

Anders Kjellström

Henrik Lundin

Fassil Mehary


Abstract

The task in this project was first to detect beeps in The Jerry Springer Show and then to replace them with other sound. The replacement sound was to be chosen so that the word loss caused by the beep was concealed as well as possible. The main difficulty was to find a replacement sound that fits the sound in the show immediately before the beep.

A beep was detected when, according to a Fourier transform of the block, most of the signal energy was found at the beep frequency of 1000 Hz. A sequence of the signal preceding the detected beep was then compared to a number of predefined AR parameter sets. Once the most suitable set had been found, the replacement sound was generated from three related subband AR parameter sets, driven by residuals calculated in advance from appropriate sound sequences. The subband division was made in order to enhance the quality of the replacement sound.

The quality of the result cannot be measured unambiguously. Evaluation by listening tests indicated that the solution was satisfactory.


Contents

1 Introduction
2 Problem
2.1 Previous work and Fundamental Limitations
2.2 Available Equipment
3 Theory
3.1 Solution Outline
3.2 Subband Signal Processing
3.3 Estimating AR parameters and Residuals
3.4 Beep Detection
3.5 Error Concealment
3.5.1 Talk Detection
3.5.2 Choosing AR parameters
3.5.3 Generation of Replacement Sound
4 Implementation
4.1 Implementation in Matlab
4.1.1 Generation of Parameters and Residuals
4.1.2 Beep Filtering
4.1.3 Subband Signal Processing
4.2 Implementation in DSP
4.2.1 Deviations from the Matlab code
4.2.2 Problems in DSP
5 Results
5.1 Beep Detection
5.2 Listening test
5.2.1 Test Suite
5.2.2 Test Results
6 Conclusions
6.1 Further Work
A User’s Manual
A.1 How to run the program
B Discarded Methods
B.1 Linear Extrapolation in the Time-Frequency Domain
B.2 Replication
B.3 On-Line Estimation of AR parameters
C Symbols


1 Introduction

Error concealment is a necessity in switched and mobile networks. Channel disturbances in mobile telephony networks and packet losses in packet-switched networks cannot be avoided, so their effect on the signal has to be compensated for by error concealment.

Everyday error concealment applications operate on such short time scales that they are difficult to study and not very intuitive.

A simple yet effective alternative is to study a large-scale example, such as concealment of word losses in human speech. The unaided human ear is capable of detecting abnormalities in the relatively slow human speech. This gives inexact but intuitive measures of error concealment quality.

The censorship of the language in American talk shows serves well as a source of word losses in human speech. The task of this project was to compensate for these word losses, which makes a comprehensive example of error concealment.

2 Problem

The problem stated was to design a real-time error concealment system for censored talk shows. The talk shows are censored with a sine tone whenever someone uses foul language. The solution was to be implemented on a TMS320C6701 EVM digital signal processor (DSP). The audio signal from the TV show (taped on a VCR) is processed, while the video signal is left unmodified. Real-time performance was a crucial criterion since the audio and video have to stay matched.

The problem can be divided into two parts: detection of the sine tone and error concealment. The sine tones, referred to as beeps, have a known frequency. Beep detection should be precise enough not to miss any beeps, with a minimum of false detections.

In the error concealment, the beeps should be replaced with some sound more comfortable to the human ear. It should be something that sounds natural in the show.

2.1 Previous work and Fundamental Limitations

Much work has been done on error concealment, especially in mobile and IP telephony (e.g. [3]). Most of the previous work has concentrated on concealing short errors, typically 20 ms (see footnote 1). In this case, however, the errors to be concealed, i.e. the beeps, have a duration of up to one second. Human speech cannot be considered stationary with respect to pitch and envelope for more than about 30 ms. Thus, reconstruction of the speaker signal is impossible. The solution described below tries to reconstruct the sound of the background instead of the speaker. In this case, the typical background is audience sound.

2.2 Available Equipment

The following equipment was available for design and implementation:

· PC (Intel Pentium III, 64 MB internal RAM) with Microsoft Windows 98
· Texas Instruments TMS320C6701 EVM digital signal processor
· Texas Instruments Code Composer Studio 1.0
· Microsoft Visual C++ 6.0
· MATLAB 5.3
· VCR and TV set
· Video recording of the Jerry Springer Show

3 Theory

In this section, the theory of the derived solution is discussed, starting with an outline. After the outline, subband signal processing is reviewed followed by the methods for estimating AR parameters and residual sequences. Finally, the important features of the replacement method are described.

3.1 Solution Outline

The solution outline is depicted in Figure 3.1. The main idea of the method is to replace the beeps with synthetic sound generated as an AR process. The AR parameters and the driving noise, referred to as residual sequence (see section 3.3), are both taken from a “library” of parameter sets and residuals derived in advance from suitable sound sequences. In order to improve sound quality subband signal processing is used.

The flow chart consists of different steps and two main branches: the “beep” case and the “no beep” case. A brief description is given below. In the following subsections, a more thorough description will be made.

In the first step one block, or frame, consisting of M samples is read from the input. In the second step, the beep detection investigates whether the block contains a beep signal. This is performed with one block look-ahead, i.e. the further treatment of block k is determined by the beep detection on block k+1. This is necessary because the beep detection algorithm is not reliable when the block is not a homogeneous beep block (see also section 3.4). The outcome of the beep detection determines the next step.

If the block does not contain a beep, the signal is analysed. First, the auto-correlation function, and hence the variance, is estimated. Next, the program tries to distinguish between two different cases that have to be treated in different ways:

Pure talk: The sound consists of speech on top of a calm audience.
Noisy audience: The sound consists of a noisy audience with or without speech on top.

These two cases require different types of replacement sound. In the pure talk case, the dedicated AR process for calm audience is selected. Furthermore, the variance of the replacement sound is set to be the estimate of the variance of the background sound, which is rather low compared to the variance of the speech. In the noisy audience case, the program selects the AR process that is the best representation of the present block. The variance of the replacement sound is set to be the variance of the preceding signal so that the replacement will sound natural. These operations are all done when the present block does not contain a beep, and the selected AR parameters and variance are stored for use if the next block contains a beep.

If, on the other hand, the block contains a beep, a replacement signal is generated from the previously selected AR parameters using subband signal generation (see section 3.2). To accomplish a smooth transfer between the authentic and the synthetic sound the replacement sound is faded in and out at the beginning and the end of the beep, respectively (see 3.5.3). Finally, the sound is sent to the output with a two-block delay, i.e. 64 ms using 8 kHz sample frequency and 256 samples per block, due to signal processing delay and the one block look-ahead.

Figure 3.1 Solution outline with the two main branches Beep and No beep. Replacement sound is generated in the Beep branch. The No beep branch consists of the subbranches Noisy audience and Pure talk.

3.2 Subband Signal Processing

Figure 3.2 A three-band encoding is shown in (a). All filters have cut-off frequency $\omega_g = \pi/2$ in the local sample rate. (b) shows the frequency ranges for the resulting subbands ($\pi$ is the Nyquist frequency at the input).

The solution depends on generating sound from AR parameters and (quantised) residuals. The generated signal should resemble a studio audience. Thus, the signal consists mainly of human voices, which suggests a speech coding approach to the problem.

One common method is subband coding, or frequency subdivision, described in [1]. The idea is to divide the speech signal into two signals, one low-pass covering $0 \le \omega \le \pi/2$ ($\omega$ is the discrete-time angular frequency) and one high-pass covering $\pi/2 \le \omega \le \pi$. Both signals can now be downsampled (decimated) by a factor 2 without aliasing or loss of information. Subsequently, the low-pass signal can be divided once more in the same manner. Figure 3.2a shows a two-step subband encoder and Figure 3.2b shows the corresponding frequency subdivision. The purpose of frequency subdivision is twofold:

· The AR parameters can be concentrated to the speech frequencies (approximately 500 Hz).

· The downsampling allows fewer AR parameters and shorter residuals.

Note, however, that multiple AR parameter sets and residuals are required, one for each subband.
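As a worked example of the two-step subdivision, the sketch below computes the frequency range and the number of samples per block for each subband. The function name subband_layout is hypothetical; the sample frequency (8 kHz) and block size (256 samples) are the values stated elsewhere in this report.

```python
def subband_layout(fs_hz, block_size):
    """Band edges (Hz) and samples per block for a two-step half-band split."""
    nyquist = fs_hz / 2.0
    return {
        # First split: upper half of the spectrum, decimated by 2.
        "H": {"range_hz": (nyquist / 2, nyquist), "block": block_size // 2},
        # Second split of the lower half: both parts decimated by 4 in total.
        "M": {"range_hz": (nyquist / 4, nyquist / 2), "block": block_size // 4},
        "L": {"range_hz": (0.0, nyquist / 4), "block": block_size // 4},
    }

layout = subband_layout(8000, 256)
for band in ("L", "M", "H"):
    print(band, layout[band])
```

With these values the low band covers 0 to 1000 Hz in 64 samples per block, the mid band 1000 to 2000 Hz in 64 samples, and the high band 2000 to 4000 Hz in 128 samples.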

Reconstruction of the signal is the reverse of the above. The subband signals are generated separately from each other using the corresponding AR parameters and residuals. The signals are upsampled (interpolated), filtered and added according to Figure 3.3.

Figure 3.3 A three-band decoding system. The filters are the same as in the encoder (Figure 3.2).

3.3 Estimating AR parameters and Residuals

The AR parameter sets and the corresponding residual sequences are estimated from selected sound sequences. The sound sequences represent suitable replacement audience sound.

Let s(n) be a sound sequence. s(n) is divided into subband signals sL(n), sM(n) and sH(n) for the low, middle and high bands respectively, according to section 3.2. For each subband signal an AR model is generated using the Levinson-Durbin recursion (see for instance [1]). Denote the AR parameters

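$$\mathbf{a}_B = [\,a_B(0)\ \ a_B(1)\ \ \cdots\ \ a_B(N_B)\,]^T, \quad a_B(0) = 1, \qquad (3.1)$$

where B denotes the subband (i.e. B is L, M or H) and $N_B$ is the model order for the subband B. The residual is the estimation error $e_B(n)$ such that

$$e_B(n) = \sum_{i=0}^{N_B} a_B(i)\, s_B(n-i). \qquad (3.2)$$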

The residual is quantised to b bits in order to save memory space, without significant loss of audio quality. Let $e_{Bq}(n) = Q_b[e_B(n)]$ denote the b-bit quantised residual.

Finally, a scaling factor $T_B$ has to be derived for each subband B. Let $\sigma_s^2$ denote the variance of the full-spectrum signal s(n) and $\sigma_B^2$ denote the variance of the subband signal $s_B(n)$. Then $K_B$ is the ratio between the energy in the subband B and the total energy of s(n), such that $\sigma_B^2 = K_B \sigma_s^2$. Furthermore, let $H_B$ be the ratio between the energy of the output from the AR process defined by (3.1) and the energy of its driving noise. In other words, if the input signal to the AR process has variance $\sigma_{wB}^2$ then the output has variance $H_B \sigma_{wB}^2$. The factor $T_B$ is defined as

,(3.3)

where $\sigma_{eB}^2$ is the energy of the quantised residual sequence $e_{Bq}(n)$ as it is stored. Thus, when generating synthetic sound, the residual $e_{Bq}(n)$ should be scaled as

(3.4)

to produce a properly scaled signal. Here, $\sigma_y^2$ is the desired variance of the synthetic signal y(n).
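The overall gain implied by these definitions can be sketched as follows. The subband should carry the share $K_B$ of the desired variance, and the AR process multiplies the driving variance by $H_B$, so the driving noise needs variance $K_B \sigma_y^2 / H_B$. The function name residual_gain and the numbers in the example are hypothetical, and how the report divides this gain between the stored factor $T_B$ and the run-time variance is not assumed here.

```python
import math

def residual_gain(k_b, h_b, var_e, var_y):
    """Amplitude gain for the stored residual of one subband.

    k_b   : share of the total energy that lies in band B
    h_b   : AR output variance divided by driving-noise variance
    var_e : variance of the stored quantised residual
    var_y : desired variance of the full-band synthetic signal
    """
    var_w = k_b * var_y / h_b          # driving-noise variance that is needed
    return math.sqrt(var_w / var_e)    # gain applied to every residual sample

# Hypothetical numbers: half of the energy in the band, AR gain 4,
# stored residual variance 2, desired output variance 16.
print(residual_gain(0.5, 4.0, 2.0, 16.0))
```

In this example the required driving variance equals the stored residual variance, so the gain is one.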

3.4 Beep Detection

The beep detection algorithm used is capable of detecting a beep consisting of a sine tone with a known frequency f0. The basic principle of the detector is to locate the frequency f0, which is the frequency of the beep tone. This is done using a method described in [2]. The detector tests each block, represented by the signal x(n), to find out whether the beep tone is present or not.

The audience was modelled as white Gaussian noise with unknown variance. This model turned out to work well. Hence, there are two different test hypotheses H0 and H1:

· $H_0: x(n) = w(n)$, where w(n) is Gaussian noise,

· $H_1: x(n) = A\cos(2\pi f n + \varphi) + w(n)$, where f is unknown.

The equations used to design the beep detector are

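$$I(f) = \frac{1}{M} \left| \sum_{n=0}^{M-1} x(n)\, e^{-j 2\pi f n / f_s} \right|^2 \qquad (3.5)$$

$$\hat{\sigma}_x^2 = \frac{1}{M} \sum_{n=0}^{M-1} x^2(n) \qquad (3.6)$$

$$T = \frac{I(f_0)}{\hat{\sigma}_x^2} > \gamma. \qquad (3.7)$$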

(3.5) is the short-time periodogram of x(n) and (3.6) is an estimate of the variance of x(n). A beep is detected when the ratio T in (3.7) between the short-time periodogram and the average variance is greater than the threshold $\gamma$. In other words, a block with a beep tone has a greater part of its total energy at the frequency f0 than a block without a beep tone.

The threshold level $\gamma$ is a design parameter dependent on the selected block size and sample frequency. When choosing $\gamma$ there is a trade-off between the probability of not detecting a beep and the probability of false detections, i.e. detecting a beep when there is none present.
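A minimal sketch of such a detector is given below, assuming a periodogram normalised by 1/M and a zero-mean variance estimate. The function names beep_ratio and is_beep are hypothetical, and the default threshold of 80 is the Matlab value from Table 4.1.

```python
import cmath
import math
import random

def beep_ratio(block, f0_hz, fs_hz):
    """Ratio T between the periodogram at f0 and the block variance estimate."""
    m = len(block)
    # Single-frequency DFT at f0 (short-time periodogram, 1/M normalisation).
    acc = sum(x * cmath.exp(-2j * math.pi * f0_hz * n / fs_hz)
              for n, x in enumerate(block))
    periodogram = abs(acc) ** 2 / m
    variance = sum(x * x for x in block) / m   # zero mean assumed
    return periodogram / variance if variance > 0.0 else 0.0

def is_beep(block, f0_hz=1000.0, fs_hz=8000.0, threshold=80.0):
    """True when the ratio T exceeds the detection threshold."""
    return beep_ratio(block, f0_hz, fs_hz) > threshold

M, FS, F0 = 256, 8000.0, 1000.0
tone = [math.sin(2 * math.pi * F0 * n / FS) for n in range(M)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(M)]
print(beep_ratio(tone, F0, FS))
print(is_beep(tone), is_beep(noise))
```

With this normalisation a pure tone that fills the whole block gives T = M/2, i.e. 128 for M = 256, which indicates the scale on which the threshold has to be chosen.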

3.5 Error Concealment

When a beep has been detected, it has to be replaced with some other sound, less disturbing to the ear.

3.5.1 Talk Detection

In order to distinguish between the two cases pure talk and noisy audience described in section 3.1, a talk detection algorithm has been derived. The two cases were defined as:

Pure talk: The sound consists of speech on top of a calm audience.
Noisy audience: The sound consists of a noisy audience with or without speech on top.

The algorithm estimates the variance of the block, assuming zero mean, using (3.6). The variances of the K latest blocks, including the present block, are stored in a vector v(k), k=0,1,…,K-1, where k=K-1 represents the present block. Figure 3.4 depicts typical variance vectors for the two different cases. The figure shows that the variance comes closer to zero in the pure talk case. This property is exploited in the talk detection algorithm.

Figure 3.4 Typical block variance vectors v for the pure talk case (a) and the noisy audience case (b), both in logarithmic scale. Note that the minimum values are lower in the pure talk case than in the noisy audience case.

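The vector v is normalised with the mean of the vector, $\bar{v} = \frac{1}{K}\sum_{k=0}^{K-1} v(k)$. Finally, the minimum of the normalised vector is compared to a threshold level $\alpha$ as in (3.8). If the minimum value is less than $\alpha$ then the case pure talk is selected, otherwise noisy audience is selected.

$$\min_{k} \frac{v(k)}{\bar{v}} < \alpha \quad \Rightarrow \quad \text{pure talk} \qquad (3.8)$$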
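The decision rule can be sketched in a few lines. The function name classify_history and the example variance vectors are hypothetical; the default threshold 0.05 is the value from Table 4.1. Treating an all-zero history as pure talk is an assumption made only to avoid a division by zero.

```python
def classify_history(variances, alpha=0.05):
    """Return 'pure talk' or 'noisy audience' from the K latest block variances."""
    mean_v = sum(variances) / len(variances)
    if mean_v == 0.0:
        return "pure talk"   # assumption: a silent history counts as calm background
    normalised_min = min(variances) / mean_v
    return "pure talk" if normalised_min < alpha else "noisy audience"

speech_with_pauses = [4.0, 5.0, 0.05, 6.0]   # one block contains almost only audience
cheering = [2.0, 2.5, 1.8, 2.2]              # variance never drops far below the mean
print(classify_history(speech_with_pauses))
print(classify_history(cheering))
```

The first vector contains a pause between words, so its minimum is far below the mean and the block history is classified as pure talk, while the second is classified as noisy audience.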

3.5.2 Choosing AR parameters

The method for choosing suitable AR parameters from a set of predefined parameter sets is based on the normal equations (see e.g. [1]). The purpose of the method is to find the set of AR parameters that best represents the signal x(n). Note that the method is only used in the noisy audience case (see section 3.1 or 3.5.1 above).

The normal equations, or N.E., can be written as

,(3.9)

where rxx(k) is an estimate of the autocorrelation function of x(n) at lag k.

The optimal set of AR parameters is that which minimises the left-hand side of the N.E. in the least-squares sense, i.e.

.(3.10)

Rewriting the product to be minimised in (3.10) as

(3.11)

results in a computationally efficient algorithm, since T is Toeplitz. Furthermore, the first term in the last equality in (3.11) can be omitted when minimising in (3.10) since it does not depend on the AR parameters. In the formulae (3.9) through (3.11), the AR parameters used are not the subband parameters derived in subsection 3.3. Instead, a master set describing the full-spectrum signal is used.

The (number of the) selected parameter set is stored for the K latest blocks. When the replacement sound is to be generated, the parameter set that has been picked the most times among the K latest blocks is used. Here K is a design variable.
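The selection and the majority vote can be sketched as follows. The sketch assumes the normal equations in the form sum over i of a(i) r(|k-i|) = 0 for k = 1, ..., N with a(0) = 1, and evaluates the residuals directly instead of using the expanded quadratic form, so it does not show the Toeplitz-based speed-up. All function names and the two-entry library are hypothetical.

```python
from collections import Counter, deque

def normal_equation_cost(r, a):
    """Sum of squared normal-equation residuals for AR set a (a[0] == 1).

    r : autocorrelation estimates r[0..N] of the current block
    """
    order = len(a) - 1
    cost = 0.0
    for k in range(1, order + 1):
        residual = sum(a[i] * r[abs(k - i)] for i in range(order + 1))
        cost += residual * residual
    return cost

def select_ar_set(r, library):
    """Index of the library set with the smallest normal-equation residual."""
    costs = [normal_equation_cost(r, a) for a in library]
    return costs.index(min(costs))

def most_frequent(history):
    """Set index picked most often among the stored blocks."""
    return Counter(history).most_common(1)[0][0]

library = [[1.0, -0.9], [1.0, 0.5]]   # two hypothetical first-order master sets
r = [1.0, 0.9]                        # autocorrelation of an AR(1) process with pole 0.9
history = deque(maxlen=20)            # selections for the K = 20 latest blocks
for _ in range(3):
    history.append(select_ar_set(r, library))
history.append(1)                     # one block that happened to pick the other set
print(most_frequent(history))
```

The vote makes the choice robust against single blocks that pick a different set, which is the purpose of storing the K latest selections.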

3.5.3  Generation of Replacement Sound

When a beep is detected, synthetic sound y(n) is generated to replace the beep. The sound is generated using the subband AR parameters selected according to subsection 3.5.2 above and the corresponding residual sequences. The selected AR parameters are used throughout the entire beep. The residual sequence is looped if it is shorter than required. Equation (3.12) describes how the subband signal is generated in one block. MB is the number of samples in each block. Note that MB varies for the different subbands; MB is equal to M/2 for the high band and M/4 for the low and mid bands.

$$y_B(n) = w_B(n) - \sum_{l=1}^{N_B} a_B(l)\, y_B(n-l), \qquad n = 0, 1, \ldots, M_B - 1 \qquad (3.12)$$

The residual $w_B(n)$ is scaled using equation (3.4), where the desired variance $\sigma_y^2$ is set in two different ways for the cases pure talk and noisy audience, respectively. In both cases the variance vector v is used (cf. subsection 3.5.1). In the pure talk case, $\sigma_y^2$ is set to the second smallest value in v. The reason for this is that the low-variance blocks can be interpreted as blocks with only the audience. To avoid near-zero values the second smallest value is selected.

In the case of noisy audience, the change in variance preceding the beep is repeated during the beep. The purpose of this method is to resemble the rhythm of the audience. In each block, the subband residuals are scaled with a ramp function g(n). The ramp function reaches between the last and the first value in v. Equation (3.13) shows how the subband residual for one block is scaled.

(3.13)

When a block of replacement sound has been generated, the v vector is rotated one step, so that the first value becomes the last.

$$[\,v(0),\ v(1),\ \ldots,\ v(K-1)\,] \leftarrow [\,v(1),\ \ldots,\ v(K-1),\ v(0)\,] \qquad (3.14)$$
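The replay of the variance history can be sketched as below. The exact shape of the ramp g(n) is not recoverable from the text, so a linear ramp of the target variance from the last to the first value in v is assumed; the names variance_ramp and rotate are hypothetical. Since the ramp is expressed as a variance, the residual amplitude would be scaled with its square root.

```python
def variance_ramp(v, block_len):
    """Target variance per sample, ramping from v[-1] to v[0] over one block."""
    start, stop = v[-1], v[0]
    step = (stop - start) / block_len
    return [start + step * (n + 1) for n in range(block_len)]

def rotate(v):
    """Rotate one step so that the first value becomes the last."""
    return v[1:] + v[:1]

v = [1.0, 2.0, 4.0]       # K = 3 block variances, oldest first
g = variance_ramp(v, 4)   # ramps from 4.0 down to 1.0 within the block
v = rotate(v)             # the old first value is now the last one
print(g, v)
```

After the rotation the next block starts its ramp at the value where the previous block ended, so the variance history preceding the beep is replayed without jumps.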

To assure good sound quality, $y_B(n-l)$ in the right-hand side of (3.12) should not be set to zero for negative indices, i.e. when l > n. Samples from the preceding block should be used instead, or else the block frequency (fs/M) will be heard as a disturbance. Thus, $y_B(-i)$ in block k is equal to $y_B(M_B-i)$ in block k-1. The subband signals are merged according to the methods in subsection 3.2 to form the replacement signal y(n).

Next, the transitions to and from replacement sound have to be smooth. This is accomplished by fading with the authentic sound x(n). Figure 3.5 and the text below describe the fading procedures.
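The carry-over between blocks can be illustrated with a block-wise AR synthesis routine that returns its last output samples as state. This is a sketch under the sign convention a(0) = 1; the function name ar_synthesize_block and the example parameters are hypothetical.

```python
def ar_synthesize_block(w, a, state):
    """Generate one block y(n) = w(n) - sum over l = 1..N of a[l] * y(n - l).

    w     : scaled residual for this block
    a     : AR parameters with a[0] == 1 and order N >= 1
    state : the last N output samples of the previous block, oldest first
    Returns the block and the state to pass to the next call.
    """
    order = len(a) - 1
    past = list(state)                       # y(-N) ... y(-1)
    out = []
    for sample in w:
        y = sample - sum(a[l] * past[-l] for l in range(1, order + 1))
        out.append(y)
        past.append(y)
    return out, past[-order:]

a = [1.0, -0.5, 0.25]
w = [1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
whole, _ = ar_synthesize_block(w, a, [0.0, 0.0])
first, state = ar_synthesize_block(w[:4], a, [0.0, 0.0])
second, _ = ar_synthesize_block(w[4:], a, state)
print(first + second == whole)
```

Generating the signal in two blocks with the state passed on gives the same samples as generating it in one piece, which is exactly the property that removes the audible block frequency.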

Fade-in

The replacement sound y(n) is faded in from the beginning of the beep. Thus, y(n) is multiplied by a fading variable f starting at zero when the beep starts. The fading variable is linearly increased to reach one after a specified fade-time. The faded replacement sound is added with a faded mirror copy of the authentic sound x(n) preceding the beep. Hence, assuming that the beep starts at sample number nbs, the resulting sound z(n) becomes

.(3.15)

Fade-out

After the beep is finished, the replacement sound is faded out in a similar manner. The difference is that it is mixed with the authentic sound following the beep. Equation (3.16) describes the fade-out assuming that the beep ends at sample number nbe.

(3.16)

Figure 3.5 Authentic and synthetic sound before, during and after a beep. The output sound z(n) is the sum of the authentic and the synthetic sound. Note that the beginning of the synthetic sound is mixed with a mirror copy of the last authentic sound.
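The two fading procedures can be sketched as linear cross-fades. The exact indexing of the mirror copy and of the fading variable is an assumption, and the function names fade_in and fade_out are hypothetical. With the fade time 0.2 s and 8 kHz sample frequency, fade_len would be 1600 samples.

```python
def fade_in(x_before, y, fade_len):
    """Cross-fade from a mirrored copy of the pre-beep sound into y.

    x_before : authentic samples preceding the beep (at least fade_len long)
    y        : replacement sound, starting at the first beep sample
    """
    out = list(y)
    for n in range(min(fade_len, len(y))):
        f = n / fade_len                  # fading variable, rising from 0 towards 1
        mirrored = x_before[-1 - n]       # last authentic samples in reverse order
        out[n] = f * y[n] + (1.0 - f) * mirrored
    return out

def fade_out(y_after, x_after, fade_len):
    """Cross-fade from continued replacement sound back to the authentic sound."""
    out = list(x_after)
    for n in range(min(fade_len, len(x_after), len(y_after))):
        f = 1.0 - n / fade_len            # fading variable, falling from 1 towards 0
        out[n] = f * y_after[n] + (1.0 - f) * x_after[n]
    return out

start = fade_in([1.0, 2.0, 3.0, 4.0], [10.0] * 6, 4)
end = fade_out([10.0] * 4, [0.0] * 6, 4)
print(start)
print(end)
```

In the example the output starts at the last authentic sample value and reaches the pure replacement sound after fade_len samples; the fade-out mirrors this behaviour at the end of the beep.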

4 Implementation

The solution outlined in section 3 was implemented in Matlab and in the DSP. The Matlab implementation was used for simulation and evaluation and the DSP implementation constitutes the final solution. The libraries of AR parameters and residual sequences were derived using Matlab functions and transferred to the DSP source code.

In the theory section, some design variables were left unassigned. These have been assigned values according to Table 4.1.

Parameter                      Symbol      Value
Block size                     M           256 samples
Sample frequency               fs          8 kHz
Beep frequency                 f0          1000 Hz
Fade time                                  0.2 s
Variance memory                K           20
Audience detection threshold   $\alpha$    0.05
Beep detection threshold       $\gamma$    80 (100 in DSP, see footnote 2)
Master AR order                N           10
Low band AR order              NL          5
Mid band AR order              NM          5
High band AR order             NH          10
Number of AR sets                          4

Table 4.1 Numerical values for design variables

4.1 Implementation in Matlab

The Matlab implementation was divided in two parts: the generation of parameters and residuals, and the actual beep-filter function. Generally, the Matlab implementation concurs with the theory described in section 3.

4.1.1 Generation of Parameters and Residuals

AR parameters and residual sequences were derived from selected sound sequences in the TV show. These sequences were chosen to represent different types of audiences, e.g. calm audience, cheering audience, etc. The sound sequences were sampled to .mat-files using the DSP.

4.1.2 Beep Filtering

The beep filtering function consists of a main function, main.m, and a number of auxiliary functions that are described below.

The main function reads the library files and the sound file to be filtered. The sound is read from a .mat-file. All constants and global variables are defined as well. After these initialisation procedures are done the main function starts the program loop outlined in Figure 3.1, using the following functions:

sample reads the next block of data from the in-sound vector.

beepdetect performs the beep detection.

gensound generates one block of synthetic sound using the selected AR parameters and residuals.

select_ar selects which AR process fits best with the current situation.

set_var handles the variance extrapolation described in subsection 3.5.3.

submerge performs the subband decoding, i.e. performs upsampling and appropriate filtering of the two input signals and finally adds them.

4.1.3 Subband Signal Processing

Some special measures regarding the filtering have to be taken when implementing the subband signal processing. The filters are implemented as FIR filters of order 20, shown in equation (4.1).

(4.1)

Since the signal processing is performed in blocks, the filters have to be initialised correctly in order to assure maximal sound quality (cf. subsection 3.5.3). Thus, the input signal should not be set to zero for negative indices, i.e. for l > n. The last samples in the preceding block should be used instead according to equation (4.2).

(4.2)

This scheme, however, is not used if the present block is the first block in the beep, when x(n) should be zero for negative n because no subband signals are available for the sound preceding the beep.
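The initialisation can be illustrated with a block-wise FIR filter that takes the tail of the preceding block as history. The function name fir_filter_block and the three-tap filter are hypothetical; the filters in the report have order 20, i.e. 21 taps.

```python
def fir_filter_block(x, h, history):
    """Filter one block with FIR taps h, using the tail of the previous block.

    x       : input samples of the present block
    h       : filter taps (at least two)
    history : the last len(h) - 1 input samples of the preceding block,
              oldest first; all zeros for the first block in a beep
    Returns the filtered block and the history for the next call.
    """
    taps = len(h)
    padded = list(history) + list(x)
    out = [sum(h[l] * padded[n + taps - 1 - l] for l in range(taps))
           for n in range(len(x))]
    return out, padded[-(taps - 1):]

h = [0.5, 0.3, 0.2]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
whole, _ = fir_filter_block(x, h, [0.0, 0.0])
first, hist = fir_filter_block(x[:3], h, [0.0, 0.0])
second, _ = fir_filter_block(x[3:], h, hist)
print(first + second == whole)
```

Passing the history on makes block-wise filtering equivalent to filtering the whole signal at once, whereas a zero history is only correct for the first block in a beep.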

4.2 Implementation in DSP

The implementation in the digital signal processor, written in the C programming language, agrees with the Matlab code in most parts. The C code is less general than the Matlab code because of the lack of ready-to-use mathematical tools in C.

4.2.1 Deviations from the Matlab code

Problems were encountered when trying to use the implementations of mathematical functions (such as FIR filters) provided by Texas Instruments. Instead, the calculations were written as less general code, tailor-made for each situation that would otherwise have required a function call.

Another difference between the C code and the Matlab code is the sparing use of separate functions. The memory handling in the DSP was more convenient when the DSP had fewer functions to handle.

In the C code, the replacement parameters (e.g. AR parameters and residual values) were included in the source code instead of being read from a separate file, as was the case in Matlab. This led to a loss of generality, and user modification of the parameters was rendered impossible. Reading replacement parameters from a separate file would require knowledge of the interaction between C++ programs, executed in the CPU, and C programs, executed in the DSP, which would have been a time-consuming task for the programmers. The entire final program was written for and executed in the DSP, since the task at hand did not demand CPU-DSP interaction. Excluding the CPU did not slow the program down; on the contrary, the program executed faster.

4.2.2 Problems in DSP

The calculations that were written instead of using general mathematical functions resulted in many problems, all of which were solved after thorough debugging.

The memory handling in the DSP presented many problems. A large number of variables were declared globally in order to avoid function calls with pointer-type arguments. This does not affect the functionality of the final program.

The execution of the program was not satisfactory at first. The number of calculations slowed the DSP down to such an extent that the output sound was distorted. After using the built-in optimiser this problem disappeared.

Some bugs in the DSP hardware and software (TMS320C6701 DSP and Code Composer Studio version 1.0) turned up during the programming. After consulting the Texas Instruments DSP homepage, these bugs could be avoided.

5 Results

The performance of the system is largely dependent on subjective judgements. Therefore, a listening test was conducted with a number of persons not involved in the project. However, some parts of the system such as the beep detection can be evaluated in a more objective manner.

5.1 Beep Detection

In Matlab, the detector performs acceptably when the threshold is set to 75 per cent of the maximum value of the ratio T in (3.7). With this threshold value, the detector consistently detects the beeps and seldom triggers false alarms. Figure 5.1 shows how the beep threshold is exceeded, i.e. beeps are detected, where beeps occur in the sound signal. In the sound signal diagram at the bottom of the figure, the beeps appear as flat segments (constant frequency).

Figure 5.1 A censored audio signal (bottom) and the corresponding beep detection ratio T (top). The detection threshold $\gamma$ is shown as the dashed line. T larger than $\gamma$ indicates a beep.

In the DSP implementation, a very similar structure was used. The threshold was set by a combination of human knowledge and estimations based on the Matlab results. The threshold value was fixed at 100 in the DSP.

5.2 Listening test

Since the result of the beep masking is hard to measure with objective methods a listening test was made. In the test, 12 persons evaluated the quality of the beep masking.

5.2.1 Test Suite

The test was carried out in Matlab and was automatic, i.e. the test person entered the answers into the program, which then saved the test results in a file. The test consisted of two parts.

In the first part, the listener heard six different sound sequences, each about five seconds. In these sequences, a number of beeps had been replaced using the Matlab implementation of the solution (see 5.2.2 for the exact number of beeps in the different sequences). The listener was prompted to guess how many beeps each sequence contained.

In the second part, the listener first heard a short sequence with a beep followed by the same sequence with the beep replaced. Then, the test in part one was repeated, using the same six sequences. The purpose of this was to give the listener a better knowledge of what the replacements sounded like and evaluate whether this knowledge led to a different test result.

5.2.2 Test Results

The actual number of beeps in each test sequence along with the mean of the answers in part one and two are shown in Figure 5.2. In general, the listeners did not perceive all the masked beeps. In all sequences, the number of beeps perceived increased in part two, which can be interpreted as the effect of the listener knowing what to listen for. However, this can also be a result of the listener hearing the sequences for the second time in part two.

Figure 5.2 Mean of the answers and the actual number of beeps in each sequence. The graph indicates that the test group was unable to detect all beeps.

Overall, the subjective listening test results are considered satisfactory. However, the results are not interpreted as proof of success, but as a guide to how the system performs. A more thorough test involving more test persons should be conducted to verify the results.

6 Conclusions

The listening test indicated that the results were satisfactory. About half of the masked beeps were found by the listeners. The task, however, was error concealment and not speech reconstruction, which would have been virtually impossible. Hence, even though some of the concealments are audible, this is still better than the original audio signal.

The involved techniques, such as beep detection, subband signal processing and selection of AR parameters, worked as expected.

6.1 Further Work

The solution can be improved in many aspects. Some examples are discussed briefly below.
Increase Sample Frequency
One way of improving the overall sound quality is to increase the sample frequency, fs, from the implemented 8 kHz. This, however, requires modification of the subband signal processing. Using the processing scheme proposed in subsection 3.2 and 8 kHz sample frequency leads to a low band frequency range of 0 to 1000 Hz, which is suitable for speech coding. Increasing the sample frequency will cause the subband limit frequencies to increase as well. Thus, further division of the lowest band may be necessary to suit speech frequencies.
Selectable Beep Frequency or Adaptive Beep Detection
The implemented solution relies on an a priori knowledge of the beep frequency f0. The solution can be extended to let the user select a desired frequency. A more advanced solution would be to make the algorithms adaptive, i.e. automatically find the frequency.
Additional AR parameter sets
To improve the quality of the replacement sound and how well it fits to the preceding and succeeding sound, additional AR parameter sets can be implemented, representing other audience “types”.
User Controlled AR generation
The user might want to generate new AR parameter sets. This could be accomplished by making the program load parameters and residual sequences from file at program start, rather than having them defined in code. The parameters and residuals could also be generated in the DSP using sound from the VCR.


A User’s Manual

The program is easy and straightforward to operate. Three different versions of the program can be run: Beep Filter, Beep Demo and Bypass. Beep Filter replaces the beeps with generated sound, Beep Demo amplifies the beeps and Bypass outputs the original sound.

A.1 How to run the program

Follow the description below to run the program.

1. Start the program JerrySpringer.exe. This can be done by double-clicking on the file in Windows Explorer. You will now get a window like the one in Figure 6.1.

2. Choose which of the versions to run in the drop-down list.

3. Press the Run Program button and the chosen program will start.

4. Press the Stop Program button to stop the program.

5. Start another version or exit JerrySpringer with the Exit button.

Figure 6.1 The Jerry Springer interface is easy to use. Select which program to run and press “Run Program”.

B Discarded Methods

A number of other methods have been evaluated (in Matlab) and discarded due to poor performance in the problem at hand. Some of these methods are described in this appendix. The common fault of these methods is that they try to generate replacement sound using information from the preceding sound only. Since the preceding sound is generally dominated by one speaker, the replacement sound is coloured by that speaker and hence does not resemble the audience.

B.1 Linear Extrapolation in the Time-Frequency Domain

This method is described in [3]. The two blocks preceding the beep are transformed to the frequency domain using the discrete Fourier transform (DFT). The replacement signal is synthesised in the frequency domain, treating the amplitude and phase separately.

The amplitudes for the different frequencies of the replacement blocks are all set to the amplitude of the last block preceding the beep in the corresponding frequency. The phase, on the other hand, is linearly extrapolated from the two preceding blocks for each frequency. Finally, the replacement blocks are inverse-transformed to form the sound.

The method did not give satisfactory performance in the problem at hand, probably because the beeps are too long. In addition, the two blocks used for extrapolation should be overlapping and closer in time to improve performance.

B.2 Replication

In this method, the beeps were replaced with a duplicate of the preceding sound. Two different versions were tried: one where the replacement sound was reversed and one where it was not. Some linear filtering was tried in order to make the copy sound slightly different from the authentic sound. However, all versions tried sounded like a copy of the sound before the beep.

B.3 On-Line Estimation of AR parameters

This third method continuously estimated AR parameters for the sound preceding the beep. When the beep started, the most recently derived AR parameters were used, driven by white Gaussian noise.

The resulting sound was, as mentioned above, influenced by the speaker's voice. Furthermore, a noise-driven process often sounds noisy. The result was a noise with approximately the same spectral properties as the talker in the preceding sound.

C Symbols

Symbol                 Description
$\mathbf{a}$           AR parameters representing s(n)
$\mathbf{a}_B$         AR parameters representing sB(n)
a(i)                   The i:th AR parameter (a(0) = 1)
aB(i)                  The i:th AR parameter for band B (aB(0) = 1)
$\alpha$               Audience detection threshold
B                      Subband designator, e.g. L, M or H
b                      Residual quantisation bits
$\gamma$               Beep detection threshold
eB(n)                  Residual, or estimation error for band B
eBq(n)                 Quantised residual, or estimation error for band B
f0                     Beep frequency
fs                     Sample frequency
g(n)                   Scaling ramp function
I(f)                   Short-time periodogram
K                      Variance memory length
KB                     Energy ratio between yB(n) and y(n)
M                      Block size in samples per block
N                      Master AR order
nbe                    Discrete time index for beep end
nbs                    Discrete time index for beep start
NH                     High band AR order
NL                     Low band AR order
NM                     Mid band AR order
R                      Autocorrelation matrix for x(n)
rxx(k)                 Estimate of autocorrelation function for x(n) at lag k
s(n)                   Sound signal from which a set of AR parameters and residual is generated
sB(n)                  B-band part of s(n)
$\sigma_{eB}^2$        Variance of eBq(n)
$\sigma_{wB}^2$        Variance of wB(n)
$\sigma_s^2$           Variance of s(n)
$\sigma_B^2$           Variance of sB(n)
$\hat{\sigma}_x^2$     Estimate of variance of x(n)
T                      Ratio between I(f) and $\hat{\sigma}_x^2$
$\mathbf{T}$           Toeplitz autocorrelation matrix for x(n)
TB                     Residual scaling factor for band B
v                      Block variance vector
wB(n)                  Scaled quantised residual for band B
$\omega$               Discrete-time angular frequency
x(n)                   Input signal (from VCR)
y(n)                   Synthetic replacement signal
yB(n)                  B-band part of y(n)
z(n)                   Output signal (to TV set)


References

[1] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications. New Jersey: Prentice-Hall, 1996.

[2] S. M. Kay, Fundamentals of Statistical Signal Processing, Detection Theory, Section 7.6.4. New Jersey: Prentice-Hall, 1998.

[3] F. Laxhed, Linear Domain Error Concealment for Speech Frame Losses in Packet Switched Networks, Master Thesis IR-SB-EX-9911, Royal Institute of Technology, Stockholm, 1999.



Footnote 1: In the GSM mobile telephony system, one frame is 160 samples and the sample frequency is 8 kHz, resulting in 20 ms per frame.
Footnote 2: These figures differ because the beep detection was implemented in a slightly different manner in the DSP.