With the rapid development of speech, audio, image, and video compression methods, it is no longer difficult to distribute digital multimedia over the Internet. This makes the protection of digital intellectual property rights and content ownership a serious problem.
Hence, the technology of digital watermarking has received a great deal of attention. Generally, digital watermarking techniques are based either on spread spectrum methods or on altering the least significant bits of selected coefficients of a certain signal transform. For audio watermarking, the auditory masking phenomenon is exploited together with these conventional techniques to ensure that the embedded watermark is imperceptible.
In addition, an audio watermarking system should be robust to various speech compression operations. The development of audio watermarking algorithms therefore involves a trade-off among speech fidelity, robustness, and watermark data embedding rate. Audio watermarking techniques normally embed the watermark in redundant parts of the speech signal or in parts to which the human auditory system is insensitive. Some methods alter a time interval of the signal to embed the watermark; however, this kind of method has the drawback of an unavoidable degradation in robustness.
In other methods, the watermarks are embedded by means of artificially generated (forged) human speech. Unfortunately, this type of method also suffers from weak robustness, particularly when the forged speech is destroyed: any deformation of the forged speech also damages the watermark.
More generally, systems for embedding messages into host signals can be divided into watermarking systems, in which the hidden message is related to the host signal, and non-watermarking systems, in which the message is unrelated to the host signal. They can also be divided into steganographic systems, in which the very existence of the message is kept secret, and non-steganographic systems, in which the presence of the embedded message does not have to be secret.
Audio watermarking algorithms are characterized by five essential properties, namely perceptual transparency, watermark bit rate, robustness, blind or informed watermark detection, and security.
Perceptual transparency
In most applications, the watermark embedding algorithm has to insert additional information without affecting the perceptual quality of the host audio signal. The fidelity of a watermarking algorithm is normally defined as the perceptual similarity between the original and watermarked audio sequences.
However, the quality of the watermarked audio is usually degraded, either deliberately by an adversary or unintentionally during transmission, before a person perceives it. In that case, it is more appropriate to define the fidelity of a watermarking algorithm as the perceptual similarity between the watermarked audio and the original host audio at the point at which they are presented to the consumer.
Watermark bit rate
The bit rate of the embedded watermark is the number of embedded bits per unit of time and is normally given in bits per second (bps). Some audio watermarking applications, such as copy control, require the insertion of a serial number or author ID, with an average bit rate of up to 0.5 bps.
For a broadcast monitoring watermark, the bit rate is higher, owing to the need to embed the ID signature of a commercial within the first second of the broadcast clip, with an average bit rate of up to 15 bps. In some applications, such as hiding speech in audio or a compressed audio stream in audio, algorithms have to be able to embed watermarks with a bit rate that is a significant fraction of the host audio bit rate, up to 150 kbps.
Robustness
The robustness of the algorithm is defined as the ability of the watermark detector to extract the embedded watermark after common signal processing operations. A detailed overview of robustness tests is given in Chapter 3. Applications normally require robustness against a predefined set of signal processing alterations so that the watermark can be reliably extracted at the receiving end.
For example, in wireless broadcast monitoring, embedded watermarks need only survive the distortions caused by the transmission process, including dynamic compression and low-pass filtering, because watermark detection is performed directly on the broadcast signal. On the other hand, in some algorithms robustness is entirely undesirable, and those algorithms are referred to as fragile audio watermarking algorithms.
Blind or informed watermark detection
In some applications, the detection algorithm may use the original host audio to extract the watermark from the watermarked audio sequence (informed detection). This often significantly improves detector performance, as the original audio can be subtracted from the watermarked signal, leaving the watermark sequence alone.
However, if the detection algorithm does not have access to the original audio (blind detection), this inability significantly decreases the amount of data that can be hidden in the host signal. The complete process of embedding and extracting the watermark can be modeled as a communication channel in which the watermark is distorted by strong interference and channel effects: the strong interference is caused by the presence of the host audio, and the channel effects correspond to signal processing operations.
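To make the contrast concrete, the following minimal Python sketch (NumPy only) compares an informed detector, which subtracts the known host before correlating, with a blind detector, for which the host acts as interference. The additive embedding rule, the pseudo-random watermark, and the embedding strength are illustrative assumptions, not the scheme used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy host "audio" and a pseudo-random +/-1 watermark sequence (additive embedding assumed).
host = rng.standard_normal(4096)
wm = rng.choice([-1.0, 1.0], size=host.size)
alpha = 0.05                      # embedding strength, kept small for transparency
watermarked = host + alpha * wm   # attacks/channel effects would further distort this

# Informed (non-blind) detection: the original host is available and is subtracted,
# leaving essentially only the watermark before correlation.
residual = watermarked - host
rho_informed = np.dot(residual, wm) / (np.linalg.norm(residual) * np.linalg.norm(wm))

# Blind detection: correlate the received signal directly with the watermark;
# the host signal now acts as strong interference.
rho_blind = np.dot(watermarked, wm) / (np.linalg.norm(watermarked) * np.linalg.norm(wm))

print(f"normalized correlation, informed: {rho_informed:.3f}")  # close to 1.0
print(f"normalized correlation, blind:    {rho_blind:.3f}")     # much smaller
```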
Security
The watermarking algorithm must be secure in the sense that an adversary must not be able to detect the presence of embedded information, let alone extract it. The security of the watermarking process is interpreted in the same way as the security of encryption techniques: the system must not be breakable by anyone without access to the secret key that controls watermark embedding.
An unauthorized user should be unable to extract the embedded information in a reasonable amount of time, even if they know that the host signal contains a watermark and are familiar with the exact watermark embedding algorithm. Security requirements vary with the application; the most stringent arise in secure communications applications, where in some cases the information is encrypted prior to embedding into the host audio.
Theory
The fundamental process in any watermarking system can be modeled as a form of communication, in which a message is transmitted from the watermark embedder to the watermark receiver. The watermarking process is viewed as a transmission channel through which the watermark message is sent, with the host signal being a part of that channel. Figure 2 gives a general view of a watermarking system as a communication-theoretic model. After the watermark is embedded, the watermarked work is normally distorted by watermark attacks. As in the data communication model, the distortions of the watermarked signal are modeled as additive noise.
In this project, signal processing methods are used for the watermark embedding and extraction procedures, the derivation of perceptual thresholds, the transformation of signals into different domains (e.g., the Fourier domain and the wavelet domain), filtering, and spectral analysis.
Communication principles and models are used for channel noise modeling, for different ways of signaling the watermark (e.g., direct-sequence spread spectrum or frequency hopping), for the derivation of optimal detection methods (e.g., matched filtering), and for evaluating the overall detection performance of the algorithm (bit error rate, normalized correlation value at the detector). Basic information theory principles are used for computing the perceptual entropy of an audio sequence, for channel capacity bounds of the watermark channel, and in the design of an optimal channel coding method.
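As an illustration of these principles, the sketch below (NumPy) embeds watermark bits by direct-sequence spread spectrum, one bit per block, and recovers them with a matched-filter (correlation) detector, reporting the bit error rate. The block length, embedding strength, and noise level are arbitrary assumptions chosen so the example runs quickly, not values used in this project.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters: each payload bit is spread over a block of 'chips' samples.
chips, n_bits, alpha = 1024, 32, 0.1
pn = rng.choice([-1.0, 1.0], size=chips)      # secret pseudo-noise spreading sequence
bits = rng.integers(0, 2, size=n_bits)        # payload bits to embed
host = rng.standard_normal(chips * n_bits)    # stand-in for the host audio

# Embedding: bit b modulates the PN sequence as (2b - 1) * pn, added to its block.
marked = host.copy()
for i, b in enumerate(bits):
    marked[i * chips:(i + 1) * chips] += alpha * (2 * b - 1) * pn

noisy = marked + 0.02 * rng.standard_normal(marked.size)   # simple additive channel noise

# Blind matched-filter detection: correlate each block with the PN sequence and
# decide the bit from the sign of the correlation.
detected = np.array([int(np.dot(noisy[i * chips:(i + 1) * chips], pn) > 0)
                     for i in range(n_bits)])
ber = np.mean(detected != bits)
print(f"bit error rate: {ber:.3f}")   # usually 0.000 at these illustrative settings
```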
During transmission and reception, signals are frequently corrupted by noise, which can cause severe problems for downstream processing and for the listener's perceptual experience. It is well known that to cancel the noise component present in a corrupted signal using an adaptive signal processing technique, a reference signal that is highly correlated with the noise is needed. Since the noise is added in the channel and is entirely random, there is no way of creating such a correlated reference at the receiving end.
The only possible way is to estimate the noise from the corrupted signal itself, as only that signal carries information about the noise added to it. Therefore, an automated means of removing the noise would be an invaluable first step for many signal-processing tasks. Denoising has long been a focus of research, yet there always remains room for improvement.
Simple methods originally employed time-domain filtering of the corrupted signal. However, this is only successful when removing high-frequency noise from low-frequency signals and does not provide satisfactory results under real-world conditions.
To improve performance, modern algorithms filter signals in a transform domain, such as the frequency (Fourier) domain. Over the past two decades, a flurry of activity has surrounded the wavelet transform, after the community recognized it as a possible superior alternative to Fourier analysis. Numerous signal and image processing techniques have since been developed to leverage the power of wavelets. These techniques include the discrete wavelet transform, wavelet packet analysis, and most recently, the lifting scheme.
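As a small, hedged example of transform-domain filtering, the sketch below uses the PyWavelets library (pywt) to denoise a toy signal by soft-thresholding its detail coefficients. The 'db4' wavelet, decomposition depth, and universal-style threshold are illustrative choices, not prescriptions.

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(2)

# Toy clean signal plus additive noise.
t = np.linspace(0, 1, 2048)
clean = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)

# Transform-domain denoising: decompose, soft-threshold the detail coefficients, reconstruct.
coeffs = pywt.wavedec(noisy, 'db4', level=5)
sigma = np.median(np.abs(coeffs[-1])) / 0.6745            # noise estimate from finest details
thresh = sigma * np.sqrt(2 * np.log(noisy.size))          # universal-style threshold
den_coeffs = [coeffs[0]] + [pywt.threshold(d, thresh, mode='soft') for d in coeffs[1:]]
denoised = pywt.waverec(den_coeffs, 'db4')[: noisy.size]

print("noisy    MSE:", float(np.mean((noisy - clean) ** 2)))
print("denoised MSE:", float(np.mean((denoised - clean) ** 2)))   # typically much smaller
```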
Speech Production: Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract. The vocal tract extends from the opening in the vocal cords (called the glottis) to the oral cavity, and in an average adult male is about 17 centimeters long. It introduces short-term correlations (of the order of 1 ms) into the speech signal and can be thought of as a filter with broad resonances called formants.
The frequencies of these formants are controlled by varying the shape of the vocal tract, for example by moving the position of the tongue. An important part of many speech codecs is the modeling of the vocal tract as a short-term filter. Because the shape of the vocal tract varies relatively slowly, the transfer function of its modeling filter needs to be updated only relatively infrequently (typically every 20 ms or so).
From a technical, signal-oriented point of view, the production of speech is widely described as a two-stage process: in the first stage the sound source is generated, and in the second stage that source is filtered. This differentiation between stages has its origin in the source-filter model of speech production.
The basic premise of the model is that the source signal produced at the glottal level is linearly filtered through the vocal tract. The resulting sound is emitted to the surrounding air through the radiation load at the lips. The model assumes that the source and the filter are independent of each other. Although recent findings show some interaction between the vocal tract and the glottal source (Rothenberg 1981; Fant 1986), Fant's theory of speech production is still used as a model for the description of the human voice, particularly as far as the articulation of vowels is concerned.
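A minimal sketch of the source-filter idea is given below, assuming an impulse-train stand-in for the glottal source and two illustrative formant resonances realized as an all-pole filter (SciPy's lfilter). The sampling rate, pitch, and formant values are arbitrary assumptions for illustration, not measured data.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                                    # sampling rate in Hz (assumed)
f0 = 120                                     # fundamental frequency of the source (assumed)
n = int(0.5 * fs)

# Source: an impulse train standing in for the glottal pulses of a voiced sound.
source = np.zeros(n)
source[::fs // f0] = 1.0

# Filter: an all-pole "vocal tract" with two illustrative formants (resonances).
formants = [(700, 60), (1200, 90)]           # (centre frequency Hz, bandwidth Hz), assumed
a = np.array([1.0])
for fc, bw in formants:
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * fc / fs
    a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r * r])   # one pole pair per formant

# The source signal is linearly filtered by the vocal-tract model.
speech_like = lfilter([1.0], a, source)
print("samples:", speech_like.size, "peak amplitude:", float(np.abs(speech_like).max()))
```

In a real codec the filter coefficients would be re-estimated every frame (around every 20 ms), reflecting the slowly varying vocal-tract shape described above.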
Speech Processing: The term "speech processing" refers to the scientific discipline concerned with the analysis and processing of speech signals to obtain the best benefit in various practical scenarios. The field of speech processing is currently undergoing rapid growth in terms of both performance and applications, stimulated by advances in microelectronics, computing, and algorithm design. Nevertheless, speech processing still covers an extremely wide area, which relates to the following three technology applications:
- Speech Coding and Transmission, which is primarily concerned with man-to-man voice communication.
- Speech Synthesis, which deals with machine-to-man communications.
- Speech Recognition, which relates to man-to-machine communication.
Speech Coding: Speech coding, or compression, is the field concerned with compact digital representations of speech signals for the purpose of efficient transmission or storage. The fundamental aim is to represent a signal with a minimal number of bits while maintaining perceptual quality. Current applications for speech and audio compression algorithms include cellular and personal communications networks (PCNs), teleconferencing, desktop multimedia systems, and secure communications.
The Discrete Wavelet Transform: The Discrete Wavelet Transform (DWT) uses dyadic scales and positions based on powers of two: the mother wavelet is rescaled (dilated) by powers of two and translated by integers. Specifically, a function f(t) ∈ L²(ℝ) (the space of square-integrable functions) can be represented as:
f(t) = Σ_k a(L,k) φ(L,k)(t) + Σ_{j≤L} Σ_k d(j,k) ψ(j,k)(t)
The function φ(L,k) is the scaling function, while ψ(j,k) is the mother wavelet. The set of functions {φ(L,k), ψ(j,k) | j ≤ L, k ∈ Z}, where Z is the set of integers, forms an orthonormal basis for L²(ℝ). The numbers a(L,k) are the approximation coefficients at scale L, while d(j,k) are the detail coefficients at scale j. They are given by the inner products:
a(L,k) = ⟨f(t), φ(L,k)(t)⟩,   d(j,k) = ⟨f(t), ψ(j,k)(t)⟩
To provide some intuition about these coefficients, consider the projection f_L(t) of the function f(t) that gives the best approximation (in the sense of minimum error energy) to f(t) at scale L. This projection is constructed from the approximation coefficients a(L,k) using:
f_L(t) = Σ_k a(L,k) φ(L,k)(t)
As the scale L decreases, the approximation becomes finer and converges to f(t). The difference between the approximations at two successive scales, f_L(t) − f_{L+1}(t), is entirely described by the detail coefficients at scale L + 1:
f_L(t) − f_{L+1}(t) = Σ_k d(L+1,k) ψ(L+1,k)(t)
Using these relations, given a(L,k) and {d(j,k) | j ≤ L}, it is clear that we can construct the approximation at any scale. Hence, the wavelet transform breaks the signal up into a coarse approximation f_L(t) (given by the a(L,k)) and a number of layers of detail (given by the d(j,k) for j ≤ L). As each layer of detail is added back, the approximation at the next finer scale is obtained.
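The decomposition into a coarse approximation plus detail layers can be demonstrated with the PyWavelets library; the 'db4' wavelet, the level L = 3, and the random test signal below are assumptions made only for illustration.

```python
import numpy as np
import pywt

rng = np.random.default_rng(3)
f = rng.standard_normal(1024)                # stand-in for a sampled signal f(t)

# Decompose to level L = 3: coeffs = [a_L, d_L, d_{L-1}, ..., d_1] (coarse to fine).
L = 3
coeffs = pywt.wavedec(f, 'db4', level=L)

# Coarse approximation alone: zero out all detail layers before reconstructing.
approx_only = [coeffs[0]] + [np.zeros_like(d) for d in coeffs[1:]]
f_L = pywt.waverec(approx_only, 'db4')
print("error with no details kept:", float(np.max(np.abs(f_L - f))))

# Adding the detail layers back one scale at a time refines the approximation,
# until the original signal is recovered (up to floating-point error).
for j in range(1, L + 1):
    kept = [c if k <= j else np.zeros_like(c) for k, c in enumerate(coeffs)]
    f_j = pywt.waverec(kept, 'db4')
    print(f"error with {j} detail layer(s) kept:", float(np.max(np.abs(f_j - f))))
```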
Vanishing Moments
The number of vanishing moments of a wavelet indicates the smoothness of the wavelet function as well as the flatness of the frequency response of the wavelet filters (the filters used to compute the DWT). A wavelet with p vanishing moments satisfies
∫ t^m ψ(t) dt = 0,   m = 0, 1, …, p − 1,
or, equivalently, its Fourier transform Ψ(ω) has a zero of order p at ω = 0.
For the representation of smooth signals, a higher number of vanishing moments leads to a faster decay rate of the wavelet coefficients. Therefore, wavelets with many vanishing moments give a more compact signal representation and are useful in coding applications. However, the length of the filters generally increases with the number of vanishing moments, and the complexity of computing the DWT coefficients increases with the length of the wavelet filters.
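A quick way to see this effect is to compare how much energy the detail coefficients of a smooth test signal carry under a wavelet with one vanishing moment ('db1', the Haar wavelet) versus one with eight ('db8'); the test signal and decomposition depth below are illustrative assumptions.

```python
import numpy as np
import pywt

# A smooth (polynomial plus low-frequency sinusoid) test signal.
t = np.linspace(0, 1, 2048)
f = t**3 - 0.5 * t + 0.2 * np.sin(2 * np.pi * 3 * t)

# More vanishing moments -> faster decay of detail coefficients for smooth signals,
# i.e. a smaller fraction of the energy ends up in the details.
for name in ('db1', 'db8'):
    coeffs = pywt.wavedec(f, name, level=5)
    detail_energy = sum(float(np.sum(d ** 2)) for d in coeffs[1:])
    total_energy = sum(float(np.sum(c ** 2)) for c in coeffs)
    print(f"{name}: fraction of energy in detail coefficients = {detail_energy / total_energy:.2e}")
```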
The Fast Wavelet Transform Algorithm
The Discrete Wavelet Transform (DWT) coefficients can be computed using Mallat's Fast Wavelet Transform (FWT) algorithm. This algorithm is sometimes referred to as the two-channel sub-band coder and involves filtering the input signal with filters derived from the wavelet used.
Implementation Using Filters
To explain the implementation of the Fast Wavelet Transform algorithm, consider the following equations:
φ(t) = 2 Σ_k c(k) φ(2t − k)
ψ(t) = 2 Σ_k g(k) φ(2t − k)
Σ_k c(k) g(k + 2m) = 0 for all integers m
The first equation is known as the twin-scale relation (or the dilation equation) and defines the scaling function φ. The second expresses the wavelet ψ in terms of the scaling function φ. The third is the condition required for the wavelet to be orthogonal to the scaling function and its translates.
The coefficients c(k), {c(0), …, c(2N − 1)}, in the above equations are the impulse response coefficients of a low-pass filter of length 2N, whose coefficients sum to 1 and have a norm of 1/√2. The high-pass filter is obtained from the low-pass filter through the alternating flip relationship g(k) = (−1)^k c(2N − 1 − k), with k running over 0, 1, …, 2N − 1.
The twin-scale relation shows that the scaling function is essentially a low-pass filter and is used to define the approximations. The wavelet function defined by the second equation is a high-pass filter and defines the details. Starting with a discrete input signal vector s, the first stage of the FWT algorithm decomposes the signal into two sets of coefficients: the approximation coefficients cA1 (low-frequency information) and the detail coefficients cD1 (high-frequency information), as shown in the figure below.
The coefficient vectors are obtained by convolving s with the low-pass decomposition filter Lo_D for the approximation and with the high-pass decomposition filter Hi_D for the details. This filtering operation is then followed by dyadic decimation, i.e., downsampling by a factor of 2. Mathematically, the two-channel filtering of the discrete signal s is represented by the expressions:
cA1(k) = Σ_n s(n) Lo_D(2k − n),   cD1(k) = Σ_n s(n) Hi_D(2k − n)
These equations implement a convolution followed by downsampling by a factor of 2 and give the forward fast wavelet transform. If the length of each filter is equal to 2N and the length of the original signal s is equal to n, then the corresponding lengths of the coefficient vectors cA1 and cD1 are given by:
length(cA1) = length(cD1) = floor((n − 1)/2) + N
This shows that the total length of the wavelet coefficients is always slightly greater than the length of the original signal, owing to the transient effects of the filtering.
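The filtering-and-decimation stage can be checked directly. The sketch below performs one analysis step by hand (NumPy convolution followed by taking every other sample) and compares it with pywt.dwt; zero-padding is requested so the boundary handling of the two versions is comparable. The downsampling phase used here follows a common convention and is an assumption, so small differences at the borders are possible.

```python
import numpy as np
import pywt

rng = np.random.default_rng(4)
s = rng.standard_normal(512)                       # discrete input signal, length n = 512

w = pywt.Wavelet('db4')                            # filter length 2N = 8
Lo_D, Hi_D = np.asarray(w.dec_lo), np.asarray(w.dec_hi)

# One analysis stage "by hand": convolution with each filter, then downsampling by 2.
cA1_manual = np.convolve(s, Lo_D)[1::2]
cD1_manual = np.convolve(s, Hi_D)[1::2]

# The same stage via the library, with zero-padding for comparability.
cA1, cD1 = pywt.dwt(s, w, mode='zero')

print("lengths:", len(cA1), len(cD1),
      "expected floor((n-1)/2)+N =", (len(s) - 1) // 2 + len(Lo_D) // 2)
print("max |difference| in cA1:", float(np.max(np.abs(cA1 - cA1_manual))))
print("max |difference| in cD1:", float(np.max(np.abs(cD1 - cD1_manual))))
```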
Multilevel Decomposition
The decomposition procedure can be iterated, with successive approximations being decomposed in turn, so that one signal is broken down into many lower-resolution components. This is called the wavelet decomposition tree. Since the analysis procedure is iterative, in theory it can be continued indefinitely.
In practice, the decomposition can continue only until the coefficient vector consists of a single sample. Normally, however, there is little or no advantage in decomposing a signal beyond a certain level. The selection of the optimal decomposition level depends on the nature of the signal being analyzed or on some other suitable criterion, such as the low-pass filter cut-off.
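With PyWavelets, such a multilevel decomposition is a single call, and the library also reports the maximum useful depth for a given signal length and filter length; the random signal and 'db4' wavelet below are assumptions used only to show the mechanics.

```python
import numpy as np
import pywt

rng = np.random.default_rng(5)
s = rng.standard_normal(4096)

# Maximum useful depth for this signal length and filter length, then full decomposition.
max_level = pywt.dwt_max_level(len(s), pywt.Wavelet('db4').dec_len)
coeffs = pywt.wavedec(s, 'db4', level=max_level)

print("maximum useful level:", max_level)
for i, c in enumerate(coeffs):
    label = "approximation" if i == 0 else f"details, level {max_level - i + 1}"
    print(f"{label:>22}: {len(c)} coefficients")
```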
Signal Reconstruction
The original signal can be reconstructed or synthesized using the inverse discrete wavelet transform (IDWT). The synthesis starts from the approximation and detail coefficients cAj and cDj and reconstructs cAj−1 by upsampling and filtering with the reconstruction filters. The reconstruction filters are designed so as to cancel the aliasing introduced in the wavelet decomposition stage. The reconstruction filters (Lo_R and Hi_R), together with the low-pass and high-pass decomposition filters, form a system known as quadrature mirror filters (QMF).
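The sketch below illustrates this analysis/synthesis pairing with PyWavelets: it lists the four filters of the QMF system for an assumed 'db4' wavelet and verifies that one DWT/IDWT round trip reconstructs a test signal to within floating-point error.

```python
import numpy as np
import pywt

w = pywt.Wavelet('db4')
# Decomposition filters (Lo_D, Hi_D) and reconstruction filters (Lo_R, Hi_R) form the QMF system.
Lo_D, Hi_D, Lo_R, Hi_R = (np.asarray(flt) for flt in (w.dec_lo, w.dec_hi, w.rec_lo, w.rec_hi))
print("filter lengths:", len(Lo_D), len(Hi_D), len(Lo_R), len(Hi_R))

rng = np.random.default_rng(6)
s = rng.standard_normal(1024)

# One-level analysis followed by synthesis: the IDWT cancels the aliasing introduced
# by the decimation and recovers the signal up to floating-point error.
cA1, cD1 = pywt.dwt(s, w)
s_hat = pywt.idwt(cA1, cD1, w)
print("max reconstruction error:", float(np.max(np.abs(s_hat - s))))
```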
Methodology
Speech Watermarking: Speech watermarking here means embedding digital information (a speech signal) into another speech signal (a .wav file) and subsequently extracting it from the desired (host) signal. Two speech signals are considered.
- Select the desired speech signal (.wav), read the desired wave file, and play the selected desired speech signal.
- Select the embedded speech signal (.wav), read and play the selected embedded speech signal.
- Select the desired speech signal (.wav), read the selected desired speech signal.
- Then, apply the discrete wavelet transform to the desired signal using the "Haar" wavelet, which provides the coefficients required for the subsequent processing.
- For the watermarking itself, the desired signals are processed one by one; a cat map is used here (a sketch of the embedding step, under stated assumptions, is given after this list).
- The watermarking results are then played back.
- SWPR: stands for the speech-watermarked signal played back and then recorded.
- SWRP: stands for the speech-watermarked signal recorded and then played back.
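The report does not spell out the exact embedding rule, so the sketch below shows only one common possibility, not necessarily the method used here: additive embedding of a payload signal in the Haar detail coefficients of a host signal, followed by informed extraction. The random arrays stand in for the two selected speech signals (in practice they would be read from the chosen .wav files, e.g. with scipy.io.wavfile.read), and the embedding strength alpha is an arbitrary assumption.

```python
import numpy as np
import pywt

rng = np.random.default_rng(7)

# Stand-ins for the two selected speech signals.
host = rng.standard_normal(8000)          # "desired" speech signal
payload = rng.standard_normal(2000)       # "embedded" speech signal (shorter)

# Haar DWT of the host, as in the procedure above.
cA, cD = pywt.dwt(host, 'haar')

# One possible embedding rule (an assumption, not necessarily the report's method):
# add the scaled payload to the detail coefficients, which are perceptually less significant.
alpha = 0.01
cD_marked = cD.copy()
cD_marked[: payload.size] += alpha * payload
watermarked = pywt.idwt(cA, cD_marked, 'haar')

# Informed extraction: repeat the DWT and subtract the original detail coefficients.
_, cD_rx = pywt.dwt(watermarked, 'haar')
recovered = (cD_rx[: payload.size] - cD[: payload.size]) / alpha
print("max recovery error:", float(np.max(np.abs(recovered - payload))))
```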