
An efficient systolic architecture for the one-dimensional wavelet transform

R. Lang, E. Plesner, H. Schroder, A. Spray
Application Specific Computer Design (ASCOD), Department of Electrical and Computer Engineering, University of Newcastle, Australia

ABSTRACT
We present in this paper an architectural design for a Wavelet Transform chip for use in real-time one-dimensional signal processing applications. Based on the observation that further levels of the Wavelet Transform require only as much computation as the first level, our architecture requires only one row of processing elements to compute the complete transform. This is compared to previous designs requiring one row of processing elements per level of the transform. Our architecture provides the output of the transform in two forms: one with all levels multiplexed on one line (useful for transmission or compression), and the other as individual levels on separate lines synchronised in time to facilitate real-time analysis. We consider the usefulness of this architecture for real-time analysis of audio signals (typically 40 kHz sampling rate) and discuss the design and implementation benefits of the computational simplicity of the presented architecture.

1 INTRODUCTION
The objectives of this paper are to show how the Discrete Wavelet Transform (DWT) relates to the simple convolution operation, to modify a systolic array designed for convolution so that it performs the DWT, to discuss implementation details of this array, and to mention some real-time signal processing considerations. The Wavelet Transform is used in a variety of signal processing applications. The DWT is becoming increasingly popular due to its computational simplicity.1 For orthogonal wavelets, the DWT is computed by recursive filtering through decimated high- and low-pass filter banks. Each level of filtering can be implemented using a systolic array2 to convolve the filter with the signal. We have made use of idle processing time in the systolic array so that only one such array is needed to compute the Forward Discrete Wavelet Transform. This is a vast improvement on requiring one systolic array per level being calculated. We look at the architectural design of this array and show how any number of levels of the DWT can be computed using the one array with simple control. This design makes the systolic array run very close to 100% efficiency. The architecture we have designed is similar to the second architecture designed by Vishwanath et al.3 and has the same AT² lower bound as that design. Rather than a routing network for scheduling the intermediate levels,

we use a simple memory bank to accomplish the same task. We discuss the implementation issues of the array and highlight some important hidden complexities. Real-time audio processing considerations linked to our work in enhancing speech with the Wavelet Transform4 are briefly mentioned. We start with a short description of the DWT and its algorithm in Section 2, followed by an explanation of the filtering process with a systolic array in Section 3. In Section 4 we modify the single systolic array so that it performs the complete DWT operation, discuss the complexities of the design and give some simulation snapshots to assist the reader in understanding the processing. Section 5 discusses our implementation of the architecture and finally Section 6 briefly lists some concerns when implementing our architecture for real-time audio applications.

2 THE DISCRETE WAVELET TRANSFORM
The Discrete Wavelet Transform (DWT) is described in a number of good texts and papers.5-7 It is a popular signal processing technique which decomposes a signal into its components in different frequency bands.8 The Inverse Discrete Wavelet Transform (IDWT) reconstructs a signal from this decomposition. Mallat formalised the DWT and showed it to be a computationally efficient process. Many papers since have applied Mallat's technique to signal processing with a variety of wavelets. In this paper we are only concerned with the algorithmic details of the Discrete Wavelet Transform, which is described in more detail by Cody1 (note: Cody refers to the Fast Wavelet Transform as the FWT; we make the distinction between the Discrete Wavelet Transform and the Inverse Discrete Wavelet Transform as DWT and IDWT respectively, but use the same algorithm as Cody). The DWT is a recursive filtering process. The signal is filtered by a low-pass filter and a high-pass filter at each level. The result of each filter stage is decimated, that is, the output of the filter has every second value removed. The decimated output of the high-pass filter is output as part of the result, and the decimated output of the low-pass filter is reapplied to the same two filters to form the next level of the computation. This recursive procedure continues for the depth required by the application, typically 5-8 levels. For finite signals the recursion continues until only one value remains from the output of the decimated filter stage, i.e. there is no further decimation possible; the depth of the DWT of a finite signal of length N is therefore log2(N) levels. The final low-pass result is retained as part of the output. It is usual to refer to the result of the low-pass filter as the approximation at a certain level and to the result of the high-pass filter as the difference at a certain level, hence the sets of values produced are labelled a and d in Figure 1. (The subscripts indicate the level of the output.)
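As a concrete illustration of this filter-and-decimate step, the following C sketch (C being the language the authors later use for their algorithm simulation) computes one level of the DWT directly. The function name, the Haar-style coefficients used in the usage note, and the truncated treatment of the signal ends are illustrative assumptions, not the paper's exact choices.

```c
#include <assert.h>
#include <stddef.h>

/* One level of the DWT as described above: convolve the signal with the
 * low-pass and high-pass filters of width w, keeping only every second
 * output (decimation).  Window positions that would run past the end of
 * the signal are simply skipped in this sketch. */
static void dwt_level(const double *x, size_t n,
                      const double *lo, const double *hi, size_t w,
                      double *approx, double *detail, size_t *n_out)
{
    size_t m = 0;
    for (size_t i = 0; i + w <= n; i += 2) {   /* step by 2: decimation   */
        double a = 0.0, d = 0.0;
        for (size_t j = 0; j < w; j++) {
            a += lo[j] * x[i + j];             /* low-pass: approximation */
            d += hi[j] * x[i + j];             /* high-pass: difference   */
        }
        approx[m] = a;   /* fed back into the next level */
        detail[m] = d;   /* emitted as part of the result */
        m++;
    }
    *n_out = m;
}
```

Applying `dwt_level` repeatedly to its own `approx` output yields the recursive decomposition of Figure 1; with unnormalised Haar-like filters lo = {0.5, 0.5}, hi = {0.5, -0.5} and input {1, 3, 5, 7}, one level gives approximations {2, 6} and differences {-1, -1}.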

3 FILTERING WITH A SYSTOLIC ARRAY
Since computing the DWT involves filtering, an efficient filtering process is essential in DWT hardware. Discrete filtering involves convolving the filter (which is described by the set of wavelet coefficients) with the signal values; this process is explained by Cody.1 Conceptually, the filter is a window moving along the signal, producing a filtered value at each step as shown in Figure 2. (The F values are the filter coefficients, the X values form the input signal and the x values form the filtered signal.) Each step involves multiplying the filter coefficients by the signal values and adding these products together to produce the filtered value for that step. A systolic array can be used to perform the convolution operation9,10 and hence is suitable for our filtering application. Since the filter coefficients are constant during the filtering process, good data locality and hence efficient communication can be obtained if one coefficient is stored in each processing element of the systolic array. This removes the need for communicating the coefficients through the array (as for the convolution of two complete signals), hence reduces the amount of communication needed to produce the result. Inputs are provided to the

[Figure 1: The Forward Wavelet Transform filtering process (5 levels). The input {x} passes through a cascade of decimated low-pass filters (LPF) producing approximations {a}1 ... {a}5, with a high-pass filter (HPF) at each level producing differences {d}1 ... {d}5.]

array every second clock cycle, as shown in Figure 3 (the dotted inputs are unused clock cycles between input data). At each step, if a processing element reads a valid input value from its west neighbour, it then multiplies that value by its filter coefficient and adds the result to the partial sum read in the same cycle from its east neighbour. The resulting partial sum is then written to the west neighbour. The output of the convolution operation is provided at the same end of the array from which the input is read. Since every second input is a null input, this array must work at twice the speed of the input signal, hence achieves only 50% efficiency. Two separate convolution operations can be interleaved into the input to utilise the processing elements more efficiently, both operations being performed at the same time.
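In software terms, the operation performed at each window position of Figure 2 is a single multiply-accumulate pass over the filter; a minimal C sketch (the function name is ours):

```c
#include <assert.h>
#include <stddef.h>

/* Direct model of the filtering of Figure 2: the filter F slides along
 * the signal X, and each window position is a multiply-accumulate over
 * the w coefficients.  In the systolic array, F[j] is held locally in a
 * processing element rather than communicated through the array. */
static size_t filter_signal(const double *X, size_t n,
                            const double *F, size_t w, double *out)
{
    size_t m = 0;
    for (size_t i = 0; i + w <= n; i++) {   /* each window position */
        double sum = 0.0;
        for (size_t j = 0; j < w; j++)
            sum += F[j] * X[i + j];         /* multiply-accumulate  */
        out[m++] = sum;
    }
    return m;   /* number of filtered values produced */
}
```

The systolic array computes exactly these sums, but distributed in space: each partial sum accumulates one F[j] * X[i + j] product per processing element as it travels westwards.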

4 THE DWT WITH A SINGLE SYSTOLIC ARRAY
The DWT requires two filtering stages at each computational step, so we can interleave these two filtering processes onto the one systolic array. Since a new input value is provided at every second clock step, this would normally require extra control to interleave the two input signals, but in this case the two filtering stages are operating on the same input signal, so there is no extra control introduced by doing the two filtering stages. The decimation by two of the output of the DWT means that every second value produced in the filtering stage will be removed from the output. Hence as a matter of efficiency we need not produce these values. In the context of the filtering process, this means that the filter window is stepping along the signal in steps of two rather than processing at every step; see Figure 4 (the L values are low-pass filter coefficients, the H values are high-pass filter coefficients). With some slightly more complex timing, we can use the alternate calculation steps not needed because of the decimation to perform the further levels of the transform, requiring intermediate storage to store the low-pass results that will be used in further calculations. To prove that one systolic array can be used to perform all levels of the transform, we define W to be the amount of work the systolic array is doing to filter two signals together (by interleaving the problems) without

[Figure 2: Filtering a signal. The filter window F1 ... F4 steps along the input signal X1 ... X10, producing the filtered values x1 ... x4.]
[Figure 3: A systolic array for convolution. The filter coefficients F4 ... F1 are held one per processing element, the inputs X1, X2, ... enter from the left with a null cycle between them, a zero is fed to the rightmost partial-sum input, and the output emerges at the left end.]

decimation (i.e. when the array is working at 100% efficiency). With decimation, we do not perform every second calculation of each filtering stage, and so it is in these clock cycles that we will perform the further levels of the transform. Therefore after decimation the systolic array, which is capable of performing work W, is only performing work W/2. The amount of work at each level halves as each output is decimated, so the total amount of work for the remaining levels of the transform equals W/4 + W/8 + W/16 + ... + W/2^n. With the first level included, the total amount of work for the whole transform is the simple series:

    sum_{i=1}^{n} W/2^i = W/2 + W/4 + W/8 + W/16 + ... + W/2^n = W - W/2^n

This sum converges to W as n → ∞, hence the computational power of the systolic array is sufficient to perform all levels of the FWT. The resulting algorithm will use the systolic array efficiently provided the calculations can be sufficiently interleaved and the output at any level can be used as input to the succeeding level.
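The partial sums of this series can be checked numerically; a short C sketch (the function name is ours):

```c
#include <assert.h>

/* Work for levels 1..n of the transform: W/2 + W/4 + ... + W/2^n.
 * The result equals W - W/2^n, so it approaches but never exceeds W,
 * the capacity of the fully-utilised systolic array. */
static double levels_work(double W, int n)
{
    double term = W, sum = 0.0;
    for (int i = 1; i <= n; i++) {
        term /= 2.0;   /* each level costs half the level below it */
        sum += term;
    }
    return sum;
}
```

For W = 1 and n = 4 the sum is 15/16 = 0.9375, matching W - W/2^4; for any depth n the sum stays strictly below W.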

4.1 Memory access and level timing
The timing of the alternate level calculations can readily be shown to produce a predictable sequence3 and hence can be used both as an address to access memory (see Section 4.3.2 for memory requirements) and as an indication of the level being processed by a PE. By scheduling a calculation as soon as possible (i.e. when there are enough values to be filtered and there are no lower level calculations to be done) the sequence of calculation (and memory access) is:

    ... 1 2 1 3 1 2 1 4 1 2 1 3 1 2 1 ...
This sequence for the four-coefficient DWT is exactly the transition sequence of a 4-bit Gray code11 and can be produced by various methods: through the use of XOR gates, by accessing a simple look-up table, or by combinational logic involving a binary clock signal.
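The transition sequence can also be generated in software: the level scheduled at step t is one plus the number of trailing zero bits in the binary representation of t. A C sketch, assuming steps are numbered from 1 (our convention, not stated in the paper):

```c
#include <assert.h>

/* Level scheduled at step t (t >= 1): one plus the number of trailing
 * zero bits of t.  This reproduces the Gray-code transition sequence
 * 1 2 1 3 1 2 1 4 1 2 1 3 1 2 1 ... used for level timing and memory
 * addressing. */
static int level_at_step(unsigned t)
{
    int level = 1;
    while ((t & 1u) == 0) {   /* strip trailing zeros of t */
        t >>= 1;
        level++;
    }
    return level;
}
```

This is the software analogue of the XOR-gate or look-up-table circuits mentioned above; in hardware the same value falls out of a binary counter's carry chain.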

[Figure 4: Filtering in the DWT. The low-pass window L1 ... L4 steps along the input X1 ... X10 in steps of two, producing the approximations a1 ... a4, while the high-pass window H1 ... H4 produces the differences d1 ... d4 at the same positions.]

4.2 Output ordering
The sequence mentioned generates the output on a single line. Since the leftmost PE is only processing one filter at a time, it is also possible to output the low-pass results on a single line without data clashes. This could be useful for compression applications but may not be useful for general signal processing. The circuitry to demultiplex this line onto several separate lines is trivial: the reverse of the higher level timing circuit based on the Gray code. There is further complication if the output of the DWT on a finite signal is required such that complete levels are output in order. For example, rather than separate lines for output, some applications may require the output on one line but ordered such that all the first level values come first, followed by the second level values, and so on. This requires a reordering of the output, and since higher level values are being produced interleaved with the first level, extra memory of O(kN) for k levels and N samples will be needed to accomplish this reordering.

4.3 The architectural design
A modular layout of the hardware design for a four coefficient wavelet is shown in Figure 5. This design consists of four Processing Elements (PEs), four Memory Units (MUs), and one Control Unit (CU). Each PE has addition and multiplication capabilities, reads from its own column of memory, and writes to the column of memory two PEs to the right (if such a PE exists). There is a single control unit which controls the rightmost processing element through a series of control signals. Since each PE performs the same calculations as its right neighbour one clock cycle behind it, these control signals can propagate from the rightmost PE leftwards through the systolic array. This shows an interesting aspect of the data flow within the systolic array. Control of the systolic array is usually treated as trivial and not given much attention in the literature. Such designs specify only the flow of input data (rightwards) and the flow of results (leftwards). Rather than have each PE perform its own control, we can propagate control signals leftwards through

the array, generating only one set at the rightmost PE and hence requiring only one control unit for the whole array.

[Figure 5: The components of the DWT chip. A row of four multiply-add Processing Elements, each with its own column of Memory Elements above it, a zero fed to the rightmost partial-sum input, and a Control Unit supplying the Control Signals; Input and Output are at the ends of the array.]

    MEM  HF   Calculation
    0    0    low-frequency filter from input
    0    1    high-frequency filter from input
    1    0    low-frequency filter from memory
    1    1    high-frequency filter from memory

Table 1: Control flags to show the processing state of a PE.

4.3.1 Systolic processing order
The systolic array uses a four-stage timing scheme. We use two control signals to uniquely identify these control stages. The processing order defining the four states of the array is shown in Table 1; the flow of states is from top to bottom. The MEM flag indicates whether processing is being done from the local memory or from the systolic connections. The HF flag indicates whether processing is using the high-pass filter coefficient or the low-pass filter coefficient.

4.3.2 Local Memory
Since filtering a level requires at least a filter-width of data, it is easy to see that the minimum memory requirement per level is one word per coefficient in the filter. For any level k, a new value in that level is produced every 2^(k+1) clock cycles. This level itself is being processed

(to produce a value at the higher level) at half the pace of this production, namely every 2^(k+2) clock cycles, but the decimation within that level means that only half the values produced have a unique filtering step associated with them. Hence the data is being consumed at the same rate at which it is being produced. The decimation operation is the key to understanding that the memory requirement of one word per coefficient in a given level is not exceeded. Hence for a filter width of w, our design requires only w × (k − 1) words of storage to produce k levels of the DWT (since the last level values need not be stored for further computation). Such a small memory requirement implies that this intermediate storage memory should be realisable on-chip. The memory accessing scheme is further simplified by dividing the memory into columns and adding the restriction that each PE controls a column of memory and only that PE can read from that memory. This distributed memory approach simplifies the control of the memory and solves any potential problems involved with arbitrating for memory accesses. Extra memory control is needed to deal with the shifting of the window within the intermediate levels. This basically involves a "shift" operation where the values in a particular level of each of the columns are shifted two columns to the right, overwriting the two rightmost values and freeing the two leftmost values for the next two values produced in that level.
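The two quantities derived above — the per-level production period and the total intermediate storage — can be written down directly. A small C sketch (the function names are ours):

```c
#include <assert.h>

/* A new value at level k appears every 2^(k+1) clock cycles (input
 * arrives every 2 cycles and each level decimates by 2). */
static unsigned long production_period(unsigned k)
{
    return 1UL << (k + 1);
}

/* Intermediate storage for a k-level DWT with a width-w filter:
 * one word per coefficient per level, with the last level's outputs
 * not stored, giving w * (k - 1) words in total. */
static unsigned storage_words(unsigned w, unsigned k)
{
    return (k > 0) ? w * (k - 1) : 0;
}
```

For the four-coefficient, five-level design discussed here this gives 4 × 4 = 16 words of on-chip storage, consistent with the claim that the intermediate memory is small enough to realise on-chip.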

4.4 Simulation snapshots
Figure 6 shows four snapshots of the systolic processing for a four coefficient DWT. These snapshots show key principles in this hardware design. Dashed arrows indicate memory accesses; the binary values inside the processing element circles indicate the status of the HF and MEM flags for that PE (left bit for MEM, right bit for HF). To make the commentary easier to follow, we label first level low-pass results as a values and second level low-pass results as b values. The values below the processing elements show the input stream moving rightwards through the array. The subscripts on each datum are a simple enumeration within that level. The following discussion will focus on the leftmost PE because this is where the final results are being produced. The reader can examine the processing of the other three PEs to see how those partial sums are calculated and propagated to the leftmost PE in the same way as in the systolic convolution. Figure 6(a) shows time t = 45. The leftmost processor has MEM = 1 and HF = 0, indicating that it is processing a low-frequency calculation from memory. The dashed line from a10 to the PE indicates a memory read in this cycle. This value is stored in a register for use again in the next clock cycle. The shift operation is at work in this clock cycle, but not at the leftmost processor. In the previous clock cycle the PE second from the left read the value a9 from memory. This value is now no longer needed here but will be needed later in this level, so it is written to memory two columns to the right, as indicated by the dashed line. The partial sum is calculated, added to the partial sum read from the right neighbour, and the result is produced and written to the CU. The result for this time is b4; it is a low-pass value so needs to be stored in memory as well as being output. The CU writes this value in the next clock cycle.
At time t = 46 in Figure 6(b) we see that the leftmost PE has MEM = 1 and HF = 1, indicating that it is now doing the high-pass calculation to match the low-pass calculation in the previous clock cycle. Rather than perform another read operation, the value from the previous clock cycle can be accessed from an internal register. The result is output by the CU. We also see the shift operation again. At the end of this clock cycle the value read by the leftmost PE (a10) is written to memory two columns to the right. The diagonal dashed arrows indicate this. Figure 6(c) shows t = 47, where the leftmost PE has MEM = 0 and HF = 0 so is now processing a low-pass calculation from input. The input value x24 is read and used to calculate the final part of a first level result, a11. The CU outputs this and writes it to the next available memory slot as shown by the dashed line. Note that this

memory slot still contains a9 and the slot to the left of it contains a10 (see the previous clock cycle), but as these values have been shifted rightwards in the memory these memory slots are now available for use and hence blank in the snapshot. Time t = 48 shows the high-pass calculation to match the low-pass calculation at t = 47. There is no memory shift here because this is not a memory access cycle. Note that the PE second from the right has MEM = 1 and HF = 1 and so would ordinarily perform a shift operation in this clock cycle. The shift operation involves writing the value read in the previous clock cycle to a memory location two columns to the right, and since such a PE does not exist this shift operation does not happen.

[Figure 6: Snapshots of the systolic processing at times (a) t = 45, (b) t = 46, (c) t = 47 and (d) t = 48, showing the MEM/HF flags of each PE, the input stream below the array, and the memory reads, writes and shifts as dashed arrows.]

5 IMPLEMENTATION DETAILS
In order to verify the algorithm of our design, we simulated it using the C programming language. This provided us with an avenue for testing for the correct output and correct handling of the input and multiple levels of the DWT. VHDL12 (Very high speed integrated circuit Hardware Description Language) was used for hardware verification. Separate VHDL objects were written to implement the PE, the MU, and the CU. The PE and the MU are multiply instantiated and joined with the one instance of the CU to produce the complete hardware. The dashed lines in Figure 5 indicate the separate VHDL objects. The VHDL implementation demonstrated that the complexity of the control circuit for such a systolic processor is non-trivial and may in fact dominate the architecture (in both size and complexity) if the number of levels being processed is small. The CU produces the control signals for the rightmost PE. These signals are propagated leftwards through the systolic array. The initial values of the control signals were worked out by hand so that when the first input value (propagating rightwards through the systolic array) met the rightmost PE the correct 'initial' control values were

achieved. The CU also produces the higher level memory addresses for the rightmost PE, which are also propagated leftwards through the systolic array. Each MU simply receives relevant results from the CU as they are produced by the systolic array. Based on the control signals, these values are referenced by the PEs as they are needed for the higher level calculations. The PEs are very simple in that they simply execute the four-stage processing of our algorithm ad infinitum. Control signals are propagated by the PEs through the array. No control decisions are made in the PE; even memory address values are generated from externally supplied control signals, or provided indirectly from the CU.

6 REAL-TIME AUDIO SIGNAL PROCESSING
For real-time audio signal processing, the highest quality audio signal we are likely to be working on will have a sample rate of around 40 kHz. This is very high quality audio; the hearing aid applications we have been working on require only 20 kHz signals. If the input signal is at 40 kHz, then the internal processing elements must run at 80 kHz to facilitate the null gap between each input value. This is a clock period of 12.5 microseconds. Application Specific Integrated Circuit (ASIC) development at the University of Newcastle has produced serial multiplication hardware capable of one 16 bit multiplication in 800 nanoseconds at a 20 MHz clock rate. Whether this particular multiplier is fast or not is another issue, but at the processing speed of 80 kHz we are capable of performing 15 multiplications using a similar hardware design. At every clock cycle in our design we only require one multiplication and one addition per processing element, hence our design is more than fast enough for real-time audio processing considerations, and perhaps for chip size optimisation we may be able to share one multiplier between all processing elements and still be able to produce a finite-level DWT. This of course depends on how much processing on the result of the filtering is needed and how quickly the signal can be reconstructed. Since this is still an open problem4 we will leave optimisations of the array to be done if the whole machine needs to be made faster for a specific application, although at this stage it is clear that data dependencies restrict simple pipelining optimisations of the hardware. The benefit of the hardware design is the capability to increase the size of the filters or the depth of the DWT without an effect on the speed of processing (only an effect on the latency). The change in filters may be needed for the processing of high-quality signals.
For every coefficient added, a PE must be added to the hardware, and for every extra level of DWT to be performed, an extra word of storage must be added to each column of memory. On a conventional microprocessor or with previous hardware designs the effects of increased processing requirements are more substantial.
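The multiplier budget quoted above is simple arithmetic: a 12.5 µs processing period divided by an 800 ns serial multiply leaves time for 15 multiplications per cycle. As a sketch (the function name is ours):

```c
#include <assert.h>

/* Multiplications that fit in one processing clock period, given the
 * 12.5 us (12500 ns) internal cycle and the 800 ns serial 16-bit
 * multiply quoted above.  Integer division discards the fractional
 * remainder, as a real schedule must. */
static int mults_per_cycle(long period_ns, long mult_ns)
{
    return (int)(period_ns / mult_ns);
}
```

Since each PE needs only one multiplication per cycle, this headroom is what motivates the suggestion of sharing one multiplier across all processing elements.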

6.1 Synchronization of outputs
Real-time signal processing requires that the values for each level correspond in the time domain. Because the values in higher levels cannot be produced until a set of values in the lower levels has been produced, the values at different levels are never in time with one another. This may be unsuitable for real-time processing if there are a large number of levels, as the lack of synchronisation will cause problems in the analysis. This can be overcome by including delays on the output to ensure that the outputs are aligned in the time domain such that output number 0 appears at the same time in all levels; the i'th level is delayed by k − i computational steps, where k is the number of levels of the DWT.

6.2 Signal periodicity
The Discrete Wavelet Transform expects a periodic signal. This is an important consideration if the input signal is finite.

Our applications currently will be working with continuous signals, so the periodicity of the signal is not important since we will never reach the end of the signal. In other applications, if periodicity is important, extra hardware can be added to either pad the input signal with enough zeros to make periodicity a non-issue, or alternatively the first values of the input signal can be stored to be used at the end of the signal to maintain periodicity.
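A software analogue of the stored-first-values option is wrap-around (modulo) indexing, which makes a finite signal behave as if it were periodic. A C sketch with an illustrative filter (the function name and filter choice are ours):

```c
#include <assert.h>
#include <stddef.h>

/* Periodic filtering of a finite signal of length n: window positions
 * that run past the end wrap around to the start, mimicking the
 * hardware option of replaying the first input values at the end of
 * the signal. */
static void filter_periodic(const double *x, size_t n,
                            const double *f, size_t w, double *out)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < w; j++)
            sum += f[j] * x[(i + j) % n];   /* modulo wrap-around */
        out[i] = sum;
    }
}
```

Unlike the truncated filtering used earlier, this produces exactly n outputs for n inputs, which is what the decimated filter bank of a periodic-signal DWT expects.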

7 CONCLUSION
We have modified a systolic array designed for convolution to implement the Discrete Wavelet Transform as the dual-filter decimated recursive filtering process. The resulting machine uses the processing elements of the systolic array efficiently and produces the output such that it can be multiplexed onto a single line or demultiplexed onto several lines depending on the analysis being performed. We have implemented this machine in VHDL and verified that it works as expected, and at the same time noted some implementation strategies for implementing a systolic array in VHDL. An important observation during this implementation is that the control circuit for such a systolic array is non-trivial and likely to dominate the overall circuit, both in complexity and size, for small filter widths. The DWT algorithm is a computationally efficient algorithm well-suited to the systolic array, filling the processing gaps that made the original systolic array (used for convolution) only 50% efficient. The efficiency of the design indicates that this design should certainly be fast enough for our hearing aid applications, and the overall simplicity of the implementation suggests that a hardware realisation of the machine would be small enough for the same applications. We have currently only looked at the DWT and have now started to work with the Inverse Discrete Wavelet Transform (IDWT). Although the algorithms are basically similar, the upsampling (as opposed to decimation) means that the datapaths associated with the two algorithms are slightly different. At this stage it appears that these differences make the current architecture unsuitable for the IDWT without changes. We are considering minor architectural modifications which will make the hardware suitable for both the DWT and the IDWT.

References
[1] Mac A. Cody. The fast wavelet transform. Dr. Dobb's Journal, pages 16-28, Apr 1992.
[2] H. T. Kung. Why systolic architectures? COMPUTER, 15(1):37-46, 1982.
[3] Mohan Vishwanath, Robert M. Owens, and Mary Jane Irwin. Discrete wavelet transforms in VLSI. In Proceedings of the International Conference on Application Specific Array Processors, pages 218-229, Berkeley, August 1992.
[4] B. T. Tan, R. Lang, H. Schroder, and P. Dermody. Applying wavelet analysis to speech segmentation, classification and enhancement. In H. Szu, editor, Proceedings of Wavelet Applications Conference, Orlando, April 1994. SPIE.
[5] Randy K. Young. Wavelet Theory and its Applications. Kluwer Academic Publishers, Boston, 1993.
[6] Charles K. Chui. An Introduction to Wavelets. Academic Press, San Diego, CA, 1992.
[7] Olivier Rioul and Martin Vetterli. Wavelets and signal processing. IEEE Signal Processing Magazine, October 1991.

[8] Stephane G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674-693, Jul 1989.
[9] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, San Mateo, CA, 1992.
[10] Jeffrey D. Ullman. Computational Aspects of VLSI. Computer Science Press, Rockville, Maryland, 1984.
[11] Edward M. Reingold, Jurg Nievergelt, and Narsingh Deo. Combinatorial Algorithms: Theory and Practice. Prentice-Hall, Englewood Cliffs, New Jersey, 1977.
[12] Douglas L. Perry. VHDL. McGraw-Hill, New York, NY, 1991.

