# Power Optimization Technique for Pulsed Latches

# P. Sreenivasulu, K. Srinivasa Rao, J. I. R. Prakash, A. Vinaya Babu

Abstract—In this paper, we implement a design technique for registers used with pulsed latches that lowers leakage current and thereby reduces standby power consumption. The technique takes into account the type of timing path (short or long) and the type of register (launching or capturing). Three dual threshold voltage (dual-Vth) registers are developed, each of which trades one timing parameter (clock-to-Q delay, setup time, or hold time) for lower leakage current while keeping the remaining timing constraints unchanged. The overall reduction in the leakage current of a register can exceed 90% while the clock frequency and other design parameters, such as area and dynamic power, remain the same. This work also presents a methodology that uses pulsed latches instead of flip-flops without altering the existing design style. It reduces the dynamic power of the clock network, which can consume half of a chip's dynamic power. Real designs have shown approximately a 20 percent reduction in dynamic power using this methodology. Three ISCAS 89 benchmark circuits are used to evaluate the methodology, demonstrating, on average, a 23% reduction in the overall leakage current. The overall reduction in leakage current is compared for each case in different technologies. Predictive device models are used for each technology. The analysis is performed using H-SPICE.

Keywords— leakage current; low leakage register design; power consumption; static power.

#### I. INTRODUCTION

Low power has emerged as a principal theme in today's electronics industry. Power dissipation has become as important a consideration as performance and area in VLSI chip design. With shrinking technology, reducing power consumption and managing overall on-chip power are key challenges below 100 nm due to increased complexity. For many designs, power optimization is as important as timing because of the need to reduce package cost and extend battery life. Leakage current also plays an important role in the power management of low power VLSI designs, since it is becoming an increasingly important fraction of the total power dissipation of integrated circuits.

Manuscript received on March, 2013.

**P. Sreenivasulu**, Assistant Professor of ECE, Dr.S.G.I.E.T, MARKAPUR, Prakasam Dist., A.P, INDIA.

**Dr. K.Srinivasa Rao**, Principal and Professor of ECE, T.R.R College of Engineering, Hyderabad, A.P, INDIA.

**J.I.R Prakash**, M.Tech Student, Dr.SGIT, Markapur.

Traditionally, technology scaling has relied on enhancing the drive current capability by reducing the channel length and gate oxide thickness. Power supply voltage has also been reduced to satisfy reliability constraints. Decreasing the power supply voltage requires the threshold voltage to be reduced as well to maintain high drive current capability. The reduction of the threshold voltage, however, exponentially increases the subthreshold leakage current. Similarly, a reduction in the gate oxide thickness exponentially increases the quantum-mechanical tunneling of the carriers through the oxide, producing significant gate leakage current.
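To make the exponential dependence on threshold voltage concrete, the following minimal sketch uses the standard first-order subthreshold model, in which the leakage current changes by one decade for every subthreshold swing S of threshold voltage shift. The function name and the default swing of 100 mV/decade are illustrative assumptions, not values taken from this work.

```python
def leakage_ratio(delta_vth_mv: float, swing_mv_per_decade: float = 100.0) -> float:
    """Return I_leak(new) / I_leak(old) for a threshold shift of delta_vth_mv.

    First-order model: subthreshold current is proportional to 10 ** (-Vth / S).
    swing_mv_per_decade (S) is an assumed, illustrative value.
    """
    if swing_mv_per_decade <= 0:
        raise ValueError("subthreshold swing must be positive")
    return 10.0 ** (-delta_vth_mv / swing_mv_per_decade)


if __name__ == "__main__":
    # Raising Vth by one swing reduces leakage by a factor of ten;
    # lowering it by the same amount increases leakage tenfold.
    print(leakage_ratio(+100.0))
    print(leakage_ratio(-100.0))
```

Under this model, a high-Vth device whose threshold is one swing above the low-Vth device leaks roughly ten times less, which is the effect the dual-Vth registers described later exploit.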

More than 40% of the total energy in the active mode can be dissipated due to idle transistors in modern systems-on-chip. Furthermore, leakage current is the dominant source of energy consumption when the IC is in the idle mode, significantly degrading the battery life in portable devices.

ITRS identifies leakage power consumption as "a clear long term threat and a focus topic for design technology in the next 15 years". Projections of the overall power dissipation within an IC are plotted in Figure 1 based on ITRS predictions.



Various methodologies have been proposed to alleviate subthreshold leakage current, such as power gating, dynamic adjustment of the threshold voltage through body biasing, and multi-threshold voltage transistors, also referred to as dual threshold voltage (dual-Vth) partitioning. As these existing approaches have several limitations, particularly for low leakage register design, a comprehensive methodology is proposed in this paper to design path specific dual-Vth, low leakage registers while simultaneously considering clock-to-Q delay, setup time, hold time, type of timing path (short or long), and type of register (launching or capturing). Existing dual-Vth based registers reduce the leakage current only along the feedback path so as not to affect the timing constraints. This traditional approach significantly limits the amount of leakage that can be reduced, particularly in sub 22 nm CMOS technologies. Furthermore, in conventional approaches, the hold time of the register may be affected, which may produce a timing violation depending upon the type of timing path and register.

These limitations of the existing approaches are overcome with the proposed design methodology while significantly increasing the amount of leakage current that is reduced.



Published By: Blue Eyes Intelligence Engineering & Sciences Publication


Dynamic power is consumed across all elements of a chip. The clock network is one of the largest consumers of dynamic power. According to a recent IBM study, half of the dynamic power is dissipated in the clock network.

Therefore, reducing power in the clock network can significantly affect the overall dynamic power. Designers already use a variety of techniques to reduce clock power: using smaller clock buffers, reducing the overall wiring capacitance, employing clock gating, and de-cloning to move the clock buffers to higher levels of hierarchy.

Even with these techniques, the dynamic power of the clock network can be large since registers are used as state elements in the design. In general, a flip-flop is used as the register. A conventional flip-flop is composed of two latches (master and slave) triggered by a clock signal.

Flip-flop synchronization with the clock edge is widely used because it is well matched to static timing analysis (STA). Timing optimization based on STA is a must for SoCs. On the other hand, designers may choose to use a latch for storing the state. A latch is simpler and consumes much less power than a flip-flop. However, it is difficult to apply static timing analysis to a latch design because of the data transparent behavior of latches.

A methodology has been developed which uses latches triggered with pulse clock waveforms. With this methodology, designers can apply static timing analysis and timing optimization to a latch design while reducing the dynamic power of the clock networks. The following describes this pulsed latch design methodology in detail and gives some guidelines as to how designers can apply this methodology in their designs.

# **II. PREVIOUS WORK**

Multi-threshold voltage design techniques to reduce leakage current, and their limitations, are summarized in this section.

In idle mode, a high-Vth sleep transistor placed between the circuit and the power supply and/or ground node is cut off, disconnecting the circuit from the power supply voltage and/or ground. Subthreshold leakage current is reduced because the sleep transistor behaves as a large resistance between the combinational circuit and the power supply and/or ground node. The substrate of the circuit can also be reverse biased to increase the threshold voltage, thereby reducing the leakage current. The primary drawback of this body biasing methodology is the difficulty of generating the bias voltage for the substrate in a power efficient way. Control circuitry is also required, further decreasing the power efficiency.

In active mode, the sleep transistor is on and the low-Vth combinational circuit operates normally. The drain of the sleep transistor is referred to as virtual power or virtual ground when the sleep transistor is placed between the circuit and the supply node or the ground node, respectively. Several clock cycles are typically required for the virtual ground or power to stabilize while charging and discharging as the circuit changes from idle to active. Furthermore, the circuit may experience ground bounce during this wakeup latency, affecting the reliable operation of nearby logic circuits.



Dual-Vth partitioning is another technique to reduce leakage current based on MTCMOS. Logic gates that are not part of the critical path are implemented with high-Vth transistors by exploiting the excess slack, whereas gates along the critical path are implemented with low-Vth transistors to satisfy the timing constraints. An important limitation of the existing approach is its inability to consider timing constraints such as setup and hold times. The type of timing path, i.e., short or long, and the type of register, i.e., launching or capturing, significantly affect the design process of low leakage registers.



# **III. BACKGROUND**

A simple synchronous digital circuit consisting of two sequentially-adjacent registers with a combinational circuit between these registers is shown. The first register is referred to as launching register whereas the second register is called capturing register. Two inequalities should be satisfied for this circuit to function properly.

$$T_{Cf} + T_{CP} \geq T_{Ci} + T_D + T_S$$

$$T_{Ci} + T_D \geq T_{Cf} + T_H$$

where $T_{Ci}$ and $T_{Cf}$ are the delays for the clock signals to arrive, respectively, at the launching and capturing registers, $T_{CP}$ is the clock period, and $T_D$ is the data path delay consisting of the clock-to-Q delay of the launching register, the logic delay of the combinational circuit, and the interconnect delay. $T_S$ is the setup time of the capturing register and $T_H$ is the hold time of the capturing register.



The second (hold) inequality guarantees that the data is not latched into the capturing register on the same clock edge that launched it (no race condition).



The above expressions require a difference called a skew to be larger than or equal to a timing constraint. These inequalities, therefore, can be rewritten as

$$\text{Setup skew} \geq T_S, \qquad \text{Hold skew} \geq T_H$$

where the setup skew and hold skew are, respectively,

$$\text{Setup skew} = T_{Cf} + T_{CP} - (T_{Ci} + T_D)$$

$$\text{Hold skew} = T_{Ci} + T_D - T_{Cf}$$
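As a worked illustration of these definitions, the sketch below evaluates the setup and hold skews for a single launch/capture register pair and checks them against the setup and hold times. The class and field names are hypothetical, and all quantities are assumed to share one time unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingPath:
    """Timing parameters of one launch/capture register pair (same time unit)."""
    t_ci: float  # clock arrival delay at the launching register
    t_cf: float  # clock arrival delay at the capturing register
    t_cp: float  # clock period
    t_d: float   # data path delay (clock-to-Q + logic + interconnect)
    t_s: float   # setup time of the capturing register
    t_h: float   # hold time of the capturing register

    def setup_skew(self) -> float:
        return self.t_cf + self.t_cp - (self.t_ci + self.t_d)

    def hold_skew(self) -> float:
        return self.t_ci + self.t_d - self.t_cf

    def setup_slack(self) -> float:
        # Non-negative when the setup constraint is satisfied.
        return self.setup_skew() - self.t_s

    def hold_slack(self) -> float:
        # Non-negative when the hold constraint is satisfied.
        return self.hold_skew() - self.t_h

    def meets_timing(self) -> bool:
        return self.setup_slack() >= 0 and self.hold_slack() >= 0
```

The slack values indicate how much clock-to-Q delay, setup time, or hold time could be given up on a path before a constraint is violated, which is the margin the proposed registers trade for leakage.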

According to the setup time constraint, the data signal should be stable at the input of a register for a sufficient amount of time before the active edge of the clock signal. The active edge is a low-to-high transition of the clock signal since the data propagates to the output after this transition. Setup time guarantees that the data is reliably latched to the master before the rising edge of the clock signal arrives. Ideally, the data signal should propagate through TG1 and INV1, arriving at the output of INV1 before the rising edge of the clock signal. According to this condition, the path that determines the setup time consists of TG1 and INV1. This condition, however, may require a relatively large setup time. A conventional technique to characterize the setup time constraint of a register is to examine the setup skew versus clock-to-Q delay relationship.

The smallest setup skew that corresponds to the nominal clock-to-Q delay is approximately equal to the sum of the two delays: TG1 and INV1. As the setup skew is further reduced, the clock-to-Q delay gradually increases since, for smaller setup skews, the data signal cannot reach the output of INV1. After a specific point, the clock-to-Q delay starts to increase exponentially due to a race condition at node r, since this node is simultaneously driven by two gates: TG1 and TG2. The race condition occurs between the new data driven by TG1 and the old data driven by TG2. This region is referred to as metastable and is therefore avoided during the characterization process. Typically, a 10% degradation in clock-to-Q delay is allowed while characterizing the setup time.
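The characterization rule above can be sketched as a search over a sweep of simulated points. The function below is a minimal illustration, assuming that the sweep is a list of (setup skew, clock-to-Q delay) pairs obtained from circuit simulation and that the clock-to-Q delay does not decrease as the skew shrinks. The function and parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def characterize_setup_time(
    sweep: Sequence[Tuple[float, float]],
    nominal_clk_to_q: float,
    allowed_degradation: float = 0.10,
) -> float:
    """Return the smallest setup skew whose clock-to-Q delay stays within
    (1 + allowed_degradation) * nominal_clk_to_q.

    sweep holds (setup_skew, clk_to_q) pairs in a common time unit.
    """
    limit = nominal_clk_to_q * (1.0 + allowed_degradation)
    passing = [skew for skew, delay in sweep if delay <= limit]
    if not passing:
        raise ValueError("no sweep point meets the degradation limit")
    return min(passing)
```

With a nominal delay of 50 and the default 10% allowance, any sweep point whose clock-to-Q delay exceeds 55 is rejected, and the smallest remaining skew is reported as the setup time.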



According to the hold time constraint, the data signal should be stable at the input of a register for a sufficient amount of time after the active edge of the clock signal. This constraint is due to non-ideal characteristics of TG1 as a switch. If the hold time constraint is not satisfied, the new data can be latched into the register and overwrite the previous valid data during the same clock cycle. Note that hold time can sometimes be smaller than zero. In this case, even if the new data propagates through TG1, a race condition exists at node r between the new and old data. If the old data succeeds over the new data, the register works correctly and the negative hold time is valid. The hold time constraint is therefore partly determined by the relative drive strengths of TG1 and TG2. Note that, if the hold time is further reduced, the clock-to-Q delay exponentially increases.

#### Pulsed latch concept

A latch can capture data during the sensitive (transparent) time determined by the width of the clock pulse. If a pulse clock waveform triggers a latch, the latch is synchronized with the clock similarly to an edge-triggered flip-flop because the rising and falling edges of the pulse clock are almost identical in terms of timing.

With this approach, the setup times of a pulsed latch are characterized with respect to the rising edge of the pulse clock, and the hold times with respect to the falling edge of the pulse clock. This means that the timing models of pulsed latches are represented similarly to those of edge-triggered flip-flops.

The pulsed latch requires pulse generators that generate pulse clock waveforms with a source clock. The pulse width is chosen such that it facilitates the transition. The following diagram represents a simple pulse generator and the associated pulse waveform.
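As an illustration only, one common way to build such a generator is to AND the source clock with a delayed, inverted copy of itself, so that the pulse width is set by the delay. The discrete-time sketch below models that behavior on a sampled 0/1 waveform. The circuit choice, the function name, and the sample-based delay are assumptions and may differ from the generator used in this work.

```python
from typing import List


def pulse_clock(clk: List[int], delay_samples: int) -> List[int]:
    """Model pulse = clk AND NOT(clk delayed by delay_samples).

    clk is a sampled 0/1 waveform; samples before index 0 are taken as 0.
    """
    if delay_samples < 0:
        raise ValueError("delay_samples must be non-negative")
    pulse = []
    for i, level in enumerate(clk):
        delayed = clk[i - delay_samples] if i >= delay_samples else 0
        pulse.append(level & (1 - delayed))
    return pulse


if __name__ == "__main__":
    source = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
    # Each rising edge of the source clock yields a pulse two samples wide.
    print(pulse_clock(source, delay_samples=2))
```

In this model, a longer delay widens the transparent window of the latch, which relaxes the setup side but tightens the hold requirement on short paths.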

#### IV. PROPOSED METHODOLOGY

Existing work on dual-Vth based register design does not consider different types of data paths and registers. A typical approach is to design TG1, INV1, TG3, and INV3 with low-Vth transistors to improve the setup time and clock-to-Q delay. The remaining inverters and transmission gates that are located along the feedback path are designed with high-Vth devices to minimize the leakage current. This approach, however, is not practical for all of the timing paths. The design process of a dual-Vth, low leakage register is therefore strongly dependent upon the type of data path, i.e., long (critical), noncritical, and short; and type of register, i.e., launching or capturing. Three different types of dual-Vth registers that consider these dependencies are proposed in this paper. An edge triggered D type flip-flop with 2X drive capability is chosen from an industrial standard cell library.




The transistor level schematic of the register is illustrated, including the W/L ratios of each transistor.

Note that in the master latch, a tristate inverter is used that combines TG1 and INV1. Similarly, the feedback of the master latch also utilizes a tristate inverter. This schematic and these W/L ratios are used in the simulations without any modification.

Table 1. Timing characteristics of the proposed dual-Vth registers.

|            | Timing Path | Register Type | Clock-to-Q Delay | Setup Time   | Hold Time    |
|------------|-------------|---------------|------------------|--------------|--------------|
| Register 1 | Noncritical | Launching     | Larger           | Same or less | Same or less |
| Register 2 | Noncritical | Capturing     | Same or less     | Larger       | Same or less |
| Register 3 | Critical    | Capturing     | Same or less     | Same or less | Larger       |

#### **Register 1**

This register is designed to replace launching registers in noncritical or short paths. Since there is excessive setup slack in noncritical paths, the primary objective is to trade clock-to-Q delay for leakage current. Both the setup and hold times of the register, however, should remain the same (or be reduced) since this register behaves as a capturing register for the previous data path, which may be a critical or short path. Thus, to guarantee that the timing characteristics of the previous path are not affected, the setup and hold times of the register should not increase.

To design Register 1, high-Vth devices are used for those transistors located along the clock-to-Q delay path, i.e., M13, M14, M17, M18, M19, M20, M21, and M22. Clock-to-Q delay is therefore traded to reduce leakage current. Note that the setup and hold times of the register remain the same since these transistors do not affect the timing constraints of the register.



#### **Register 2**

This register is designed to replace capturing registers in noncritical or short paths. Due to excessive setup slack, the primary objective is to trade setup time for leakage current. The clock-to-Q delay of the register, however, should remain the same (or be reduced) since this register behaves as a launching register for the following data path, which may be a critical path. Furthermore, the hold time should also remain the same (or be reduced) since, for a short data path, hold time is critical. Note that this second register is effective in reducing leakage current since the setup time is relatively more important in advanced technologies. According to this figure, starting at the 22 nm technology node, the setup time of the register is higher than the clock-to-Q delay. Thus, the opportunity to trade setup time for leakage current should not be overlooked. Note that the setup time has been characterized using the procedure described in Section III.

To design Register 2, high-Vth transistors are used only for M2 and M3 to trade setup time for leakage current. Note that M5 and M6 are designed using low-Vth transistors even though this inverter is along the setup path. As described above, the clock-to-Q delay and hold time of the register should remain the same, and replacing M5 and M6 with high-Vth transistors would affect the clock-to-Q delay since this inverter drives the input of the slave latch.

#### **Register 3**

The third register is designed to replace capturing registers in critical paths. The primary objective is to trade hold time for leakage current since, in a critical path, setup time is important and hold slack is typically large. The clock-to-Q delay should remain the same (or be reduced) since the register behaves as a launching register for the following data path, which may also be a critical path. Furthermore, the setup time should also remain the same (or be reduced) since, for a critical path, setup time is important.

To design Register 3, high-Vth transistors are used for M7, M8, M9, and M10 to trade hold time for leakage current. Note that the feedback path becomes weaker due to the high-Vth transistors. As such, the hold time increases since it is more difficult for the old data to overwrite the new data at the output of the first gate, thereby requiring a larger hold time constraint. Low-Vth devices are used for the remaining transistors to guarantee that the clock-to-Q delay and setup time remain the same. For example, M1, M2, M3, and M4 directly affect the setup time constraint and are therefore designed with low-Vth transistors.
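The selection rules for the three registers can be summarized as a lookup keyed on the type of timing path and the role of the register on that path. The sketch below follows Table 1 and the transistor lists given above. The function name, the string labels, and the choice to return `None` (keep the original register) for cases the text does not cover are assumptions, and a register with several fan-in or fan-out paths would need every such path checked.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# High-Vth transistor sets as listed in the text for each proposed register.
HIGH_VTH: Dict[str, FrozenSet[str]] = {
    "Register 1": frozenset({"M13", "M14", "M17", "M18", "M19", "M20", "M21", "M22"}),
    "Register 2": frozenset({"M2", "M3"}),
    "Register 3": frozenset({"M7", "M8", "M9", "M10"}),
}

# (timing path, register role on that path) -> proposed register.
SELECTION: Dict[Tuple[str, str], str] = {
    ("noncritical", "launching"): "Register 1",  # trades clock-to-Q delay
    ("short", "launching"): "Register 1",
    ("noncritical", "capturing"): "Register 2",  # trades setup time
    ("short", "capturing"): "Register 2",
    ("critical", "capturing"): "Register 3",     # trades hold time
}


def select_register(path_type: str, role: str) -> Optional[str]:
    """Return the proposed register for this path and role, or None to keep
    the original register (an assumption for cases not covered in the text)."""
    return SELECTION.get((path_type, role))


if __name__ == "__main__":
    choice = select_register("critical", "capturing")
    print(choice, sorted(HIGH_VTH[choice]))
```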

In the proposed work, conventional edge triggered flip-flops are used before clock tree synthesis (CTS). During CTS, pulsed latch replacement and pulse generator insertion are performed. In order to use pulsed latches in a design, the complete methodology should be implemented, including:

- Pulsed latch replacement and pulse generator insertion
- Skew and slew control of the clock tree
- Timing analysis and optimization
- Power analysis
- Pulse latch design rule checking

A design can have a mixture of pulsed latches and edge-triggered flip-flops because some of the flip-flops cannot be replaced with pulsed latches. The methodology should support all designs.

# Pulsed latch replacement and pulse generator insertion

Since pulse generators must be inserted in the clock network, the power consumed by the pulse generators has to be weighed against the power saved by the pulsed latches. The methodology therefore employs an objective function in clock tree synthesis (CTS) to pick the clock-tree structure with the most efficient power reduction through pulsed latch replacement. Pulsed latches can then be used to substitute existing flip-flops wherever such substitution is possible.

Consideration is given to flip-flops connected to primary ports for timing model generation when a bottom-up design approach is applied. For those flip-flops transitioning on a different clock edge and those with tight hold time margins, the approach applies local cloning of clock trees to ensure maximum use of pulsed latches.

# Skew and slew control

In order to insert pulse generators to control pulsed latches, clock tree synthesis needs to maintain skew balancing and the required clock slew. Unless CTS has good control of skew and slew, it is very difficult to find an optimal placement of pulsed latches.

As pulse generators are inserted, the delay must be matched with the other branches by using delay cells. To ensure that the clock pulse is not degraded, CTS should have good control of the slew across the entire clock tree. This methodology also allows designers to specify a different slew constraint for the clock tree before and after the pulse generator, providing a trade-off between maintaining the pulse shape and lowering power consumption.

# Timing analysis and optimization

Pulsed latches have similar timing libraries to that of conventional flip-flops allowing designers to fully utilize conventional static timing analysis. The timing reports with pulsed latch design are exactly the same as those with edge triggered flip-flops, although special care should be taken with hold timing analysis.

Conventional timing optimization is also performed with pulsed latch designs, but it must account for the tight hold time margins. Since this pulsed latch methodology is fully compatible with the entire timing optimization flow, designers can natively take coupling noise effects into account during optimization.

# Power analysis

Power analysis must distinguish between pulsed latches and normal flip-flops and apply the appropriate power values to each. Moreover, additional power is consumed in the pulse generators and delay cells. These contributions must be included in the power analysis to obtain a complete picture of the power savings.
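The accounting described above amounts to subtracting the overhead of the inserted cells from the power saved by each replacement. The following sketch shows that bookkeeping under the simplifying assumption of a single average power value per cell type. The function and parameter names are hypothetical, and the numbers in the example are placeholders rather than measured data.

```python
def net_clock_power_saving(
    n_replaced: int,
    p_flip_flop: float,
    p_pulsed_latch: float,
    n_pulse_generators: int,
    p_pulse_generator: float,
    n_delay_cells: int = 0,
    p_delay_cell: float = 0.0,
) -> float:
    """Return the net power saved (same unit as the inputs) after replacing
    n_replaced flip-flops with pulsed latches."""
    saved = n_replaced * (p_flip_flop - p_pulsed_latch)
    overhead = n_pulse_generators * p_pulse_generator + n_delay_cells * p_delay_cell
    return saved - overhead


if __name__ == "__main__":
    # Placeholder values chosen only to exercise the function.
    print(net_clock_power_saving(1000, 2.0, 1.0, 20, 5.0, 10, 1.0))
```

A negative result would indicate that the pulse generators and delay cells cost more than the replaced latches save, which is the situation the objective function in CTS is meant to avoid.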

# Pulse latch design rule checking

Since pulsed latches coexist with normal flip-flops in this methodology, a number of pulse latch design rules must be checked to ensure design integrity:

- Minimum and maximum slew limits on the clock network
- Multiple pulse generators or dummy cells in the same clock path
- Negative edge-triggered flip-flop or macro driven by the pulse clock
- Pulsed latches not driven by the pulse clock
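The rules listed above could be checked with a pass over every root-to-sink clock path. The sketch below is one possible arrangement, assuming a simplified, hypothetical data model in which each path records the kinds of cells it traverses, the slew at each cell output, and the kind of sink. It counts only pulse generators for the second rule and does not model dummy cells.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ClockPath:
    """One root-to-sink clock path; all fields are hypothetical."""
    sink_name: str
    sink_kind: str               # "pulsed_latch", "pos_ff", "neg_ff", or "macro"
    cell_kinds: Tuple[str, ...]  # cells from root to sink, e.g. "buffer", "pulse_gen"
    slews: Tuple[float, ...]     # slew observed at each cell output


def check_pulsed_latch_rules(
    paths: Iterable[ClockPath], min_slew: float, max_slew: float
) -> List[str]:
    """Return one message per violated rule per clock path."""
    violations: List[str] = []
    for p in paths:
        n_gen = p.cell_kinds.count("pulse_gen")
        pulsed = n_gen > 0
        if any(s < min_slew or s > max_slew for s in p.slews):
            violations.append(f"{p.sink_name}: slew outside [{min_slew}, {max_slew}]")
        if n_gen > 1:
            violations.append(f"{p.sink_name}: multiple pulse generators in clock path")
        if pulsed and p.sink_kind in ("neg_ff", "macro"):
            violations.append(f"{p.sink_name}: {p.sink_kind} driven by pulse clock")
        if not pulsed and p.sink_kind == "pulsed_latch":
            violations.append(f"{p.sink_name}: pulsed latch not driven by pulse clock")
    return violations
```

Returning messages instead of raising on the first failure lets a single run report every violation in a design that mixes pulsed latches and flip-flops.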

# V. RESULTS

The results are illustrated below. Note that for the first register, the state of the clock signal does not change the results since all of the high-Vth transistors are within the slave latch. For the second and third registers, however, high-Vth transistors exist within the tristate inverters. The state of the clock signal is therefore important in evaluating the results. For example, for the second register, clock signal should be at VSS to guarantee that the initial tristate inverter is not in the high impedance state. Similarly, for the third register, clock signal should be at VDD so that the second tristate inverter located along the feedback path is not in the high impedance state. The leakage current of the original register is therefore compared with the first two registers and third register when the clock signal is, respectively, at VSS and VDD.

The leakage current increases as technology scales, exhibiting a large jump at the 16 nm node. A significant reduction in the leakage current, 79% on average, is achieved by the first register since it contains the largest number of high-Vth transistors, as listed in Table 2. The second register also achieves a considerable reduction in the leakage current, 13% on average and higher below the 32 nm technology node, since the importance of setup time has been increasing with technology scaling. The reduction in the leakage current obtained by the third register is relatively smaller.



This automated pulse latch design methodology has been implemented and tested in several production chips. On average, there was a 20 percent reduction in the total dynamic power consumption.

# **VI. FUTURE SCOPE**

The results presented in this paper are based on a specific type of register. A similar methodology can be applied to other types of registers where clock-to-Q delay, setup, and hold times are traded to reduce the leakage current without affecting the clock frequency. The numerical results may change depending upon the transistor level design of a register. Effect of different register architectures on leakage reduction can therefore be investigated as future work.

#### VII. CONCLUSION

Traditional dual-Vth registers utilize high-Vth transistors only along the feedback path of the master and slave latches, so the overall reduction in leakage current is limited. As opposed to existing techniques, a register design methodology that considers the type of timing path (short or long) and the type of register (launching or capturing) is developed. Three different dual-Vth registers are introduced, where the first register trades clock-to-Q delay for leakage current, achieving, on average, a 79% reduction in leakage current. The second and third registers trade, respectively, setup time and hold time to further reduce the leakage current. Depending on the type of timing paths, the overall reduction in the leakage current of a register can exceed 90%. Designing with pulsed latches is a novel approach that saves up to 20 percent of dynamic power in nanometer designs. Reducing dynamic clock power is particularly important in high frequency designs as well as in designs with high flip-flop counts.

# ACKNOWLEDGMENT

The authors wish to thank Mr. Ashwan Kumar Karrolla, Chairman of Cedronics, Hyderabad and the team for their valuable guidance.



