# **DESIGN OF POWER AND DELAY EFFICIENT 32 BIT X 32 BIT MULTI-PRECISION MULTIPLIER WITH OPERANDS SCHEDULER**

Jitha K T<sup>1</sup>, Remya K R<sup>2</sup>

<sup>1</sup>PG scholar, Electronics and Communication Department, Nehru College of Engineering and Research Centre, Pampady, Kerala, India

<sup>2</sup>Assistant Professor, Electronics and Communication Department, Nehru College of Engineering and Research Centre, Pampady, Kerala, India

## Abstract

Multiplication is among the most frequently encountered arithmetic operations in DSP applications. This paper proposes a multi-precision (MP) multiplier that incorporates variable precision, parallel processing (PP), and dedicated MP operands scheduling to provide optimum performance under various operating conditions. The building blocks of the proposed reconfigurable multiplier can either work as independent smaller-precision multipliers or work in parallel to form higher-precision multipliers. To reduce power consumption and delay, the razor flip-flop and voltage scaling unit of the conventional design are replaced: a look-up table (LUT), together with a dynamic voltage and frequency management system, configures the multiplier to work at low power consumption. The LUT stores the minimum voltages required for 8-bit, 16-bit, and 32-bit multiplications. The carry propagation adder in the multiplier is replaced by a carry select adder, which yields a considerable delay reduction. The MP multiplier also includes a frequency management unit that makes the multiplier operate at the proper frequency. Finally, the proposed MP multiplier benefits further from an operands scheduler that rearranges the input data and determines the optimum voltage and frequency operating conditions for minimum power consumption. Experimental results show that the proposed MP multiplier provides a 14.55% reduction in power consumption and a 9.67% reduction in delay compared with the conventional razor-based DVS MP multiplier. Combining this MP design with the LUT, parallel processing, and the operands scheduler reduces delay and power substantially. This paper demonstrates that the MP multiplier architecture allows more aggressive frequency and supply voltage scaling for improved power and delay efficiency.

Keywords: Computer arithmetic, low power design, multi-precision multiplier, Input Operands Scheduler (IOS), Look Up Table.

\*\*\*

# **1. INTRODUCTION**

The demand for low-power, high-performance portable devices has increased greatly. The growing market for portable electronic systems demands microelectronic circuit designs with low power dissipation, which is mainly due to internal components [1]. In DSP applications, the most frequently used arithmetic operation is multiplication, so multipliers play an important role in today's digital signal processing and various other applications. With advances in technology, many researchers have tried to design multipliers that offer high speed, low power consumption, small area, or a combination of these, making them suitable for various high-speed, low-power VLSI implementations.

The basic multiplication principle has two parts: evaluation of partial products and accumulation of partial products. In many DSP systems the output is frequently truncated because of the fixed register size and bus width inside the hardware, and the unused sections of the multiplier can be disconnected to reduce dynamic power [8]. Significant power savings can therefore be achieved by removing some adder cells. However, removing the cells that compute the LSBs of the product causes large truncation errors. Various error compensation approaches and circuits can be used to feed estimated compensation carries to the carry inputs of the retained adder cells and so reduce the truncation error.

# 1.1 System Model

Multiplication is a fundamental operation in most signal processing applications, and the multipliers used in these applications have a large area and consume considerable power. The design of low-power multipliers is therefore an important part of low-power VLSI system design. Fast multipliers are essential parts of digital signal processing systems. The proposed multiplier reduces power and delay during variable-precision multiplication and also provides parallel processing. The critical path replica approach typically involves an on-chip replica that approximates the actual critical path, so that the voltage can be selected such that the replica meets timing. However, the critical path changes as a result of supply voltage, process, temperature, or frequency variations. If this occurs, computations will fail regardless of the safety margins. With the proposed approach, timing errors can be largely eliminated, and delay is reduced accordingly.

## **1.2 Previous Work**

Since most applications are based on 8–16-b operands, the proposed multiplier is designed to perform not only single 16-b multiplication but also single 8-b or twin parallel 8-b multiplication operations. In some applications, 16-bit and 32-bit operands are sent to smaller multiplication circuits operating in parallel, which reduces power consumption and area overhead.

Because of their complex structure and interconnections, multipliers have a large number of unbalanced paths, which cause unwanted signal generation and propagation. This can be avoided by proper internal balancing through architectural and transistor-level optimization. In most multipliers, the maximum word length is provided, so small multiplications are performed on large multipliers, which causes unwanted switching activity and power consumption. Word-length optimization, in which an 8-bit multiplier is reused for 16-bit and 32-bit multiplication, is therefore the best method [2], [3]. Pipelining can also be incorporated to increase the speed of the multiplier.

In the conventional DVS technique, the LUT tunes the supply voltage using predefined voltage and frequency relationships that account for all worst-case conditions, which consumes more time and area. In the razor-based DVS technique, the minimum voltage required for multiplication is found using razor-based feedback, and many voltage transitions occur during the calculation of a particular product [1], [7].

These voltage transitions increase the power consumption and delay of the product calculation. The dynamic voltage scaling unit also causes an increased number of preemptions and frequency switches, which leads to worst-case power consumption and delay and prevents optimum performance under various operating conditions. The previous work therefore has two main drawbacks: it takes more time to calculate a particular product, and the increased number of hardware components increases power consumption.

The contribution of this paper is as follows: the proposed MP multiplier reduces power consumption, delay, and area overhead compared with a conventional 32 × 32 bit fixed-width multiplier. The multiplier also includes an operands scheduler that rearranges the input operands to reduce voltage transitions and thus power consumption.

#### 2. PROPOSED TECHNOLOGY

Fig. 1 shows the overall multiplier system architecture, which consists of five units: 1) the input operands scheduler (IOS), which rearranges the input data to reduce supply voltage transitions and hence power consumption; 2) the MP multiplier, which performs multiplication with variable precision and parallel processing; 3) the look-up table (LUT), a storage element that holds the voltage required for multiplication; 4) the frequency scaling unit (FSU), which provides the required frequency for the multiplication; and 5) the voltage and frequency management unit (VFMU), which receives the user requirements and controls the LUT and FSU so that a suitable voltage and frequency are supplied for proper operation of the MP multiplier.



Fig-1: Overall Multiplier System Architecture

All five building blocks work together to perform multiplication with variable precision. Input operands pass through the IOS block, which rearranges them to reduce voltage transitions and forwards them to the MP multiplier. The multiplier initially works at the standard supply voltage of 3.3 V. Since the LUT stores the minimum voltage for each combination, the input voltage is adjusted to suit the precision of the incoming operands.



Fig-2: Possible configuration modes of MP multiplier

Fig. 2 shows the proposed multiplier, which consists of nine 8 × 8 bit multipliers. These multipliers can perform individual 8 × 8 bit operations and can also operate in parallel for 16-bit and 32-bit multiplication. The processing elements (PEs) of the multiplier can work as nine independent multipliers, or work in parallel to form one, two, or three 16 × 16 bit multipliers or a single 32 × 32 bit multiplier. The pipelining method reduces the delay between successive results and delivers fast multiplication results without error. Parallel processing is the ability of a device to process different incoming inputs simultaneously. Pipelining increases instruction throughput by performing multiple operations concurrently, but it does not reduce instruction latency (the time to complete a single instruction from start to finish), since each instruction must still go through all the steps.

# 2.1 Look Up Table

A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or designer after manufacturing, hence field-programmable. The FPGA configuration is generally specified using a hardware description language (HDL). Contemporary FPGAs have large resources of logic gates and RAM blocks to implement complex digital computations. As FPGA designs employ very fast I/Os and bidirectional data buses, it becomes a challenge to verify correct timing of valid data within setup and hold times. FPGAs contain programmable logic components called logic blocks and a hierarchy of reconfigurable interconnects that allow the blocks to be wired together. Each logic block consists of logic cells, and each logic cell contains a 4-input look-up table used to implement various functions.

A look-up table is an array that replaces runtime computation with a simpler array indexing operation. The savings in processing time can be significant, since retrieving a value from memory is often faster than performing an expensive computation or input/output operation. The tables may be pre-calculated and stored in static program storage, calculated as part of a program's initialization phase, or even stored in hardware in application-specific platforms.

An FPGA has three main elements, look-up tables (LUTs), flip-flops, and the routing matrix, which work together to create a very flexible device. Look-up tables are where the logic is actually implemented. A LUT consists of some number of inputs and one output. What makes a LUT powerful is that the output can be programmed for every possible input combination. A LUT consists of a block of RAM that is indexed by the LUT's inputs; its output is whatever value is stored at the indexed location. Each LUT's output can optionally be connected to a flip-flop, and groups of LUTs and flip-flops are called slices. In the proposed design, the minimum voltage required for each multiplication combination is stored in the LUT. Depending on the input operands, the corresponding voltage is selected from the LUT and the corresponding frequency is set by the frequency scaling unit. The selected voltage and frequency are then applied to the MP multiplier along with the input operands to produce the multiplication result.
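The voltage look-up described above can be pictured as a small table indexed by the operating precision. The following Python sketch is only a behavioural illustration: the names `Precision`, `MIN_VOLTAGE_LUT`, and `select_voltage` are hypothetical, and the 8-bit and 16-bit voltages are placeholders chosen for the example, since the paper does not list the stored values. Only the 3.3 V nominal supply is taken from the text.

```python
from enum import Enum


class Precision(Enum):
    """Operating precisions supported by the MP multiplier."""
    BITS_8 = 8
    BITS_16 = 16
    BITS_32 = 32


NOMINAL_VOLTAGE_V = 3.3  # standard supply voltage stated in the text

# PLACEHOLDER entries for illustration only: the paper does not report the
# minimum voltages it stores, so the 8-bit and 16-bit values are assumptions.
MIN_VOLTAGE_LUT = {
    Precision.BITS_8: 1.8,
    Precision.BITS_16: 2.5,
    Precision.BITS_32: NOMINAL_VOLTAGE_V,
}


def select_voltage(precision: Precision) -> float:
    """Return the stored minimum voltage, falling back to the nominal supply."""
    return MIN_VOLTAGE_LUT.get(precision, NOMINAL_VOLTAGE_V)
```

The point of the table form is that selecting a voltage becomes a single indexed read rather than a feedback search, which is the property the LUT-based design relies on.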

#### 2.2 Multi-Precision and Reconfigurability Property

The input interface unit of the MP multiplier, shown in Fig. 3, forms a sub-module of the multiplier. Its function is to provide data to the nine independent processing elements (PEs) of the MP multiplier shown in Fig. 2, depending on the selected operating mode. A 2-bit mode control indicates whether the inputs are 8-bit operands (1/4/9 pairs), 16-bit operands (1/2/3 pairs), or a single 32-bit operand pair. The unit uses an extra MSB sign bit to support signed and unsigned multiplication of the incoming operands. The PEs perform 8-bit, 16-bit, or 32-bit operations according to the selected operating mode.

Fig. 4 shows how three 8 × 8 bit processing elements combine to form a 16 × 16 bit multiplier. A 32-bit multiplier can be formed in a similar way, but it requires 3 × 3 PEs of 8 × 8 bit. The 2-bit mode select signal is used as the control signal that determines which PEs are active and which are inactive; based on this signal, the proper precision is selected and the multiplication is performed. If the full precision (32 × 32 bit) is not exercised, the supply voltage and frequency may be scaled down according to the workload of the MP multiplier.
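As a rough illustration of how the operating mode determines PE usage, the sketch below counts the active 8 × 8 bit PEs for each configuration named in the text (one PE per 8-bit product, three per 16-bit product, nine for the 32-bit product). The function name `active_pe_count` is hypothetical, and the actual 2-bit mode encoding is not given in the paper, so the sketch takes the operand width and number of operand pairs directly.

```python
# PEs consumed by one product at each precision, as described in the text.
PES_PER_PRODUCT = {8: 1, 16: 3, 32: 9}

# Numbers of simultaneous operand pairs listed for each precision.
ALLOWED_PAIRS = {8: (1, 4, 9), 16: (1, 2, 3), 32: (1,)}


def active_pe_count(operand_bits: int, pairs: int) -> int:
    """Return how many of the nine 8x8 PEs are active in a configuration."""
    if operand_bits not in PES_PER_PRODUCT:
        raise ValueError(f"unsupported operand width: {operand_bits}")
    if pairs not in ALLOWED_PAIRS[operand_bits]:
        raise ValueError(f"unsupported pair count {pairs} for {operand_bits}-bit mode")
    return PES_PER_PRODUCT[operand_bits] * pairs
```

Under these counts, nine 8-bit pairs, three 16-bit pairs, and one 32-bit pair each occupy all nine PEs, while the smaller configurations leave PEs idle, which is where voltage and frequency can be scaled down.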

Let X and Y be the 2n-bit wide multiplicand and multiplier, used here to evaluate the overhead associated with MP and reconfigurability. $X_H$ and $Y_H$ are their respective n MSBs, and $X_L$ and $Y_L$ are their respective n LSBs. $X_L Y_L$, $X_H Y_L$, $X_L Y_H$, and $X_H Y_H$ are the corresponding cross products. The product can then be written as:

$$P = (X_H Y_H)2^{2n} + (X_H Y_L + X_L Y_H)2^n + X_L Y_L \tag{1}$$

Since $X_L Y_L$, $X_H Y_L$, $X_L Y_H$, and $X_H Y_H$ can be computed using four n × n bit multipliers, a 2n-bit reconfigurable multiplier can be constructed from these multipliers and adders. If

$$X' = X_H + X_L \tag{2}$$

$$Y' = Y_H + Y_L \tag{3}$$

then equation (1) becomes

$$P = (X_H Y_H)2^{2n} + (X' Y' - X_H Y_H - X_L Y_L)2^n + X_L Y_L \tag{4}$$

Comparing equations (1) and (4), one n × n bit multiplier (for the calculation of $X_H Y_L$ or $X_L Y_H$) and one 2n-bit adder (for the calculation of $X_H Y_L + X_L Y_H$) can be removed. In exchange, two n-bit adders and two (2n+2)-bit subtractors are needed for the calculation of $X_H + X_L$, $Y_H + Y_L$, and $X' Y' - X_H Y_H - X_L Y_L$, respectively.
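The three-multiplier decomposition of equation (4) can be checked with a short behavioural model. The Python sketch below is not the authors' hardware description: the function name `three_block_multiply` and the `stats` counter are hypothetical, and it works on unsigned Python integers of arbitrary size. It applies equation (4) recursively until the half-width reaches 8 bits and counts the base multiplications, which gives three for a 16-bit product and nine for a 32-bit product. Note that $X'$ and $Y'$ can be one bit wider than n, so a real PE computing $X' Y'$ would need slightly wider inputs; the sketch ignores that detail.

```python
def three_block_multiply(x, y, width, base_width=8, stats=None):
    """Multiply unsigned x and y using the three sub-block form of eq. (4).

    `stats["base_multiplies"]` counts how many base-width multiplications
    were used, mirroring the number of PEs needed for one product.
    """
    if stats is None:
        stats = {"base_multiplies": 0}
    if width <= base_width:
        stats["base_multiplies"] += 1
        return x * y  # stands in for one 8x8 processing element

    n = width // 2
    mask = (1 << n) - 1
    x_h, x_l = x >> n, x & mask  # X_H and X_L
    y_h, y_l = y >> n, y & mask  # Y_H and Y_L

    p_hh = three_block_multiply(x_h, y_h, n, base_width, stats)
    p_ll = three_block_multiply(x_l, y_l, n, base_width, stats)
    # X' * Y' with X' = X_H + X_L and Y' = Y_H + Y_L (may carry one extra bit).
    p_sum = three_block_multiply(x_h + x_l, y_h + y_l, n, base_width, stats)

    middle = p_sum - p_hh - p_ll  # equals X_H*Y_L + X_L*Y_H
    return (p_hh << (2 * n)) + (middle << n) + p_ll
```

For example, calling `three_block_multiply(a, b, 32, 8, stats)` with a fresh `stats = {"base_multiplies": 0}` returns `a * b` and leaves the counter at 9, matching the 3 × 3 PE arrangement described for the 32-bit mode.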

The proposed MP multiplier is evaluated by comparing it with a 32-bit fixed-width multiplier and a four sub-block MP multiplier, both designed using a Booth radix-4 Wallace tree structure [4], [5], in the same way as the proposed MP multiplier that consists of three sub-block multipliers. The power simulations are performed at a clock frequency of 50 MHz and a power supply voltage of 3.3 V. Table 1 shows the comparison between the different multipliers at 50 MHz. In this evaluation, the proposed three sub-block MP multiplier architecture achieves a reduction in power and area compared with the fixed-width multiplier design. The large size of the fixed-width multiplier leads to irregular and complex interconnects, which increase area and power consumption. The proposed three sub-block MP multiplier architecture is therefore better than the other sub-block multipliers in most multiplication applications.



Fig-3: Input Interface Unit

**Table-1:** Area and Power Comparison of Proposed MP Multipliers against Conventional Fixed-Width Multiplier running at 50 MHz

| Schemes                          | Power (mW) | Area (mm<sup>2</sup>) |
|----------------------------------|------------|-----------------------|
| 32-bit fixed-width multiplier    | 39         | 0.624                 |
| 32-bit 4 sub-block MP multiplier | 33.36      | 0.736                 |
| 32-bit 3 sub-block MP multiplier | 20         | 0.448                 |
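As a quick reading aid, the relative savings of the three sub-block design over the fixed-width design can be worked out from the Table 1 entries. These percentages are derived here from the tabulated values and are not stated in the paper:

$$\frac{39 - 20}{39} \approx 48.7\% \ \text{(power)}, \qquad \frac{0.624 - 0.448}{0.624} \approx 28.2\% \ \text{(area)}$$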



**Fig-4**: Three PEs combined to form  $16 \times 16$  bit multiplier

## 2.3 Frequency Scaling Unit

The frequency scaling unit of the proposed MP multiplier is used for frequency tuning to meet the system throughput requirements. It is implemented using a voltage-controlled oscillator (VCO) built as a seven-stage current-starved ring oscillator. Using four control bits (5 MHz/step), the output frequency of the VCO is tuned from 5 to 50 MHz. With this frequency range, the proposed multiplier can reach up to 450 MIPS (9 × 50), because it can operate either as a 32-bit multiplier or as nine independent 8-bit multipliers. The power consumption of the VCO ranges from 85 to 149 µW over the 5 MHz to 50 MHz range, which is negligible compared with the power consumption of the multiplier. One clock cycle is required for the clock frequency to settle.

The frequency scaling unit, equipped with the VCO, selects the frequency for each multiplication combination. Depending on the control signal, it supplies the frequency pre-calculated for 8 × 8 bit, 16 × 16 bit, or 32 × 32 bit multiplication to reduce delay. The VCO adjusts the frequency according to the voltage, so a suitable frequency can be selected for each multiplication combination.
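The relation between the control bits, the VCO frequency, and the quoted peak throughput can be sketched as follows. The mapping of control code k to a frequency of 5k MHz (codes 1 to 10) is an assumption made for this example, since the text gives only the step size and the range; the names `vco_frequency_mhz` and `peak_throughput_mops` are hypothetical. The sketch also assumes each PE completes one multiplication per clock cycle.

```python
STEP_MHZ = 5                   # frequency step per control code (from the text)
MIN_CODE, MAX_CODE = 1, 10     # ASSUMED code range covering 5 to 50 MHz
NUM_PES = 9                    # independent 8x8 processing elements


def vco_frequency_mhz(code: int) -> int:
    """Map a 4-bit control code to the VCO frequency (assumed linear mapping)."""
    if not MIN_CODE <= code <= MAX_CODE:
        raise ValueError(f"control code {code} is outside the 5-50 MHz range")
    return code * STEP_MHZ


def peak_throughput_mops(code: int) -> int:
    """Peak multiplications per microsecond with all nine PEs in 8-bit mode."""
    return NUM_PES * vco_frequency_mhz(code)
```

With the highest code, `peak_throughput_mops(10)` evaluates to 9 × 50 = 450, the figure quoted in the text.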

## 2.4 Input Operands Scheduler

The input operands scheduler (IOS) rearranges the input data to reduce supply voltage transitions and hence power consumption. It consists of a range detector, a buffer (RAM), and a voltage and frequency analyzer, which together rearrange the inputs, detect their precision, and send them to the MP multiplier. The proposed IOS performs the following tasks: 1) it reorders the input data stream so that same-precision operands are grouped together in a buffer, and 2) it takes the minimum supply voltage and frequency from the LUT. Fig. 5 shows the input operands scheduler. The operation of the multiplier is controlled by two external signals, the operating frequency and the voltage signal, which are tuned to the correct values depending on the actual workload, that is, on the input operands. The design is simulated by applying input operands and comparing the results with reference results computed on a PC; timing is also verified. The multi-precision data multiplication supports data word lengths up to 32 bits.
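The reordering task of the IOS can be illustrated with a small behavioural model. The sketch below is an assumption-laden simplification, not the authors' design: the names `required_precision`, `count_transitions`, and `schedule` are hypothetical, operands are treated as unsigned integers, and the whole stream is buffered before being reordered. The 8/16/32-bit thresholds follow the word-length ranges given in the simulation section (0-8, 9-16, and 17-32 bits).

```python
from collections import defaultdict


def required_precision(a: int, b: int) -> int:
    """Range detector: smallest supported precision that fits both operands."""
    bits = max(a.bit_length(), b.bit_length())
    if bits <= 8:
        return 8
    if bits <= 16:
        return 16
    if bits <= 32:
        return 32
    raise ValueError("operands wider than 32 bits are not supported")


def count_transitions(precisions):
    """Number of precision (and hence supply voltage) changes in a sequence."""
    return sum(1 for prev, cur in zip(precisions, precisions[1:]) if prev != cur)


def schedule(pairs):
    """Group same-precision operand pairs together, keeping arrival order."""
    buckets = defaultdict(list)  # models the buffer (RAM), one bucket per precision
    for a, b in pairs:
        buckets[required_precision(a, b)].append((a, b))
    ordered = []
    for precision in sorted(buckets):
        ordered.extend(buckets[precision])
    return ordered
```

For the stream `[(3, 5), (70000, 2), (300, 4), (7, 9), (65536, 65536), (1000, 1000)]`, the precision sequence is 8, 32, 16, 8, 32, 16 with five transitions; after `schedule` it becomes 8, 8, 16, 16, 32, 32 with two transitions, which is the effect the scheduler uses to limit supply voltage switching.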



Fig-5: Input Operands Scheduler

# **3. SIMULATION RESULTS**

The operation of the multiplier is controlled by two external signals, the operating frequency and the voltage signal, which are automatically tuned to the correct values depending on the actual workload. The simulation is performed by applying input operands and comparing the results with reference results computed on a PC. The 32-bit precision multiplication covers data word lengths of 17–32 bits, while the 16-bit and 8-bit precision multiplications cover data word lengths of 9–16 bits and 0–8 bits, respectively. The proposed MP multiplier provides the most reconfigurability while exhibiting the smallest relative area: compared with existing designs with the same maximum word length of 32 bits, the design exhibits a much smaller area.

The design is described in VHDL and simulated using ModelSim and Xilinx ISE; these simulators offer fast debugging and advanced code coverage analysis. Power analysis is performed using the Xilinx XPower Analyzer, a tool dedicated to power analysis of post-implementation placed-and-routed designs. It provides a comprehensive graphical user interface (GUI) that allows a detailed analysis of the power consumed and also offers thermal information under operating conditions. The power consumption reported by XPower Analyzer is 20.1 mW, a 14.55% reduction compared with the existing system. By avoiding the razor flip-flop and the DVS unit, a considerable delay reduction is also achieved; the delay is obtained by observing the timing constraints after running XPower Analyzer.

The delay of the proposed system is 19.517 ns, a 9.67% reduction. The carry propagation adder is replaced with a carry select adder (CSA), so the delay of the sum calculation is reduced. Table 2 shows the comparison between the MP multiplier with LUT and the MP multiplier with razor flip-flop and dynamic voltage scaling unit. The carry select adder slightly increases the area, but it achieves a considerable reduction in delay, which also helps fast multiplication. Compared with the reduction in delay, the increase in area is negligible.
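The principle behind the carry select adder can be shown with a bit-level behavioural model. This is a generic sketch of the technique rather than the adder used in the paper: the names `ripple_carry_add` and `carry_select_add` are hypothetical, and the 4-bit block size is an assumption. Each block computes its sum for both possible carry-in values, and the real carry then only selects between the two precomputed results instead of rippling through every bit. In hardware the two candidate sums of all blocks are formed in parallel; the sequential Python loop only models the selection logic.

```python
def ripple_carry_add(a, b, carry_in, width):
    """Bit-by-bit ripple-carry addition of two width-bit unsigned values."""
    result, carry = 0, carry_in
    for i in range(width):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        result |= (bit_a ^ bit_b ^ carry) << i
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
    return result, carry


def carry_select_add(a, b, width=32, block=4):
    """Carry-select addition; returns (sum modulo 2**width, carry_out)."""
    result, carry = 0, 0
    for low in range(0, width, block):
        size = min(block, width - low)
        mask = (1 << size) - 1
        a_blk = (a >> low) & mask
        b_blk = (b >> low) & mask
        # Both candidates are precomputed; in hardware this happens in parallel.
        sum0, carry0 = ripple_carry_add(a_blk, b_blk, 0, size)
        sum1, carry1 = ripple_carry_add(a_blk, b_blk, 1, size)
        # The incoming carry acts as the multiplexer select signal.
        chosen, carry = (sum1, carry1) if carry else (sum0, carry0)
        result |= chosen << low
    return result, carry
```

The duplicated per-block adders are the source of the extra area mentioned above, while the shorter carry path (one multiplexer per block instead of a full ripple) is the source of the delay reduction.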

**Table-2:** Comparison between MP Multiplier with LUT and MP Multiplier with Razor Flip-Flop and DVS

| Schemes                                    | Power (mW) | Delay (ns) |
|--------------------------------------------|------------|------------|
| MP multiplier with razor flip-flop and DVS | 23.52      | 21.605     |
| MP multiplier with LUT                     | 20.1       | 19.517     |
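The reported percentage reductions can be reproduced from the Table 2 entries. The short calculation below uses only the tabulated values and agrees with the quoted 14.55% and 9.67% to within rounding:

$$\frac{23.52 - 20.1}{23.52} = \frac{3.42}{23.52} \approx 14.5\% \ \text{(power)}, \qquad \frac{21.605 - 19.517}{21.605} = \frac{2.088}{21.605} \approx 9.7\% \ \text{(delay)}$$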

# 4. CONCLUSION

The proposed MP multiplier, featuring variable precision and parallel processing, performs multiplication up to 32 bits with lower delay and power consumption than a fixed-width multiplier. The architecture replaces the razor flip-flop and VSU and thereby reduces delay and power consumption considerably. The MP multiplier benefits from an operands scheduler that rearranges the inputs to reduce the number of supply voltage transitions and thus minimizes the overall power consumption of the multiplier. The MP multiplier achieves a 14.55% reduction in power consumption and a 9.67% reduction in delay. The proposed MP multiplier based on LUT and DVFM thus provides a solution for reduced delay and low power consumption.

#### **ACKNOWLEDGEMENTS**

We are indebted to God Almighty for blessing us with His grace and taking our endeavour to a successful culmination. Our sincere thanks to the experts who have contributed towards the development of the paper. We, finally, thank all our friends and well-wishers who had supported us directly and indirectly during the work.

#### REFERENCES

- [1] Xiaoxiao Zhang and Amine Bermak, "32 Bit×32 Bit Multiprecision Razor-Based Dynamic Voltage Scaling Multiplier With Operands Scheduler," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 4, Apr. 2014.
- [2] M. Bhardwaj, R. Min, and A. Chandrakasan, "Quantifying and enhancing power awareness of VLSI systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp. 757–772, Dec. 2001.
- [3] A. Wang and A. Chandrakasan, "Energy-aware architectures for a real-valued FFT implementation," in Proc. IEEE Int. Symp. Low Power Electron. Design, Aug. 2003.
- [4] H. Lee, "A power-aware scalable pipelined booth multiplier," in Proc. IEEE Int. SOC Conf., Sep. 2004, pp. 123–126.
- [5] S.-R. Kuang and J.-P. Wang, "Design of power-efficient configurable booth multiplier," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 3, pp. 568–580, Mar. 2010.
- [6] T. Yamanaka and V. G. Moshnyaga, "Reducing multiplier energy by data-driven voltage variation," in Proc. IEEE Int. Symp. Circuits Syst., May 2004, pp. 285–288.
- [7] Nakai, S. Akui, K. Seno, T. Meguro, T. Seki, T. Kondo, A. Hashiguchi, H. Kawahara, K. Kumano, and M. Shimura, "Dynamic voltage and frequency management for a low-power embedded microprocessor," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 28–35, Jan. 2005.
- [8] S. D. Haynes, A. Ferrari, and P. Y. K. Cheung, "Flexible reconfigurable multiplier blocks suitable for enhancing the architecture of FPGAs," in Proc. IEEE Custom Integr. Circuits, May 1999, pp. 191–194.
- [9] M. Sjalander, M. Drazdziulis, P. Larsson-Edefors, and H. Eriksson, "A low-leakage twin-precision multiplier using reconfigurable power gating," in Proc. IEEE Int. Symp. Circuits Syst., May 2005, pp. 1654–1657.
- [10] W. Ling and Y. Savaria, "Variable-precision multiplier for equalizer with adaptive modulation," in Proc. 47th Midwest Symp. Circuits Syst., vol. 1, Jul. 2004, pp. I-553–I-556.

#### BIOGRAPHIES



**Ms. Jitha K T** received her B.Tech (ECE) from IES College of Engineering, Thrissur. At present she is pursuing an M.Tech in VLSI Design at Nehru College of Engineering and Research Centre, Thrissur. Her areas of interest include Communication, Embedded Systems, and VLSI.



**Ms. Remya K.R** received her B.Tech (ECE) from N.S.S College of Engineering, Palakkad, and her M.E in Embedded Systems from Kumaraguru College of Technology, Coimbatore. Presently she is working as an Assistant Professor at Nehru College of Engineering and Research Centre, Thrissur. Her research and teaching interests include Embedded Systems and VLSI.