Statistical Timing Analysis Considering Clock Jitter and Skew due to Power Supply Noise and Process Variation

Takashi ENAMI†), Student Member, Shinya NINOMIYA†), Nonmember, Ken-ichi SHINKAI†), Student Member, Shinya ABE†), Nonmember, and Masanori HASHIMOTO†), Member

SUMMARY  Clock drivers, like combinational cells, suffer from delay variation due to manufacturing and environmental variability. This delay variation causes clock skew and jitter, and changes both setup and hold timing margins. This paper presents a timing verification method that takes into consideration delay variation inside a clock network due to both manufacturing variability and dynamic power supply noise. We also show that setup and hold slack calculation inherently involves a structural correlation problem due to common paths, and demonstrate that assigning individual random variables to upstream clock drivers provides a notable accuracy improvement in clock skew estimation with limited increase in computational cost. We applied the proposed method to industrial designs in a 90 nm process. Experimental results show that dynamic delay variation reduces setup slack by over 500 ps and hold slack by 16.4 ps in test cases.

key words: statistical timing analysis, clock jitter, setup verification, structural correlation, power supply noise

1. Introduction

As manufacturing variability becomes severer, statistical static timing analysis (SSTA) has been studied intensively, and is expected to estimate statistical timing distribution before fabrication. To accurately and efficiently perform SSTA, decomposition of manufacturing variability is critically important, and appropriate consideration of correlation in SSTA is required for correlated variability components. For this issue, orthogonalized variability expressions have been proposed (e.g. [1] and [2]), and the efficiency of timing analysis with correlated variables has been significantly improved.

On the other hand, another correlation factor comes from circuit structure, and the importance of the structural correlation was pointed out in [3]. This problem has been partially solved by introducing so-called canonical form [1]. Regarding die-to-die and within-die spatially-correlated components, the structural correlation is automatically considered. However, the random component unique to each element has not been perfectly treated in terms of the structural correlation. After each MAX and SUM operation, the random variables which correspond to the random variability components are translated into a single random variable, and the correlation information is discarded. This translation is performed to reduce the number of random variables in SSTA. To take the structural correlation into account, different random variables should be assigned to each gate [1], [4], [5], whereas it generally needs huge computational cost and memory.

When two arrival times that share common paths are maxed or subtracted, the structural correlation should be considered. This situation necessarily arises when setup and hold slacks are computed by subtracting the arrival times of the launch and capture paths at each FF, because both paths share some clock buffers and a common path inherently exists. Common path pessimism reduction has been studied to tackle this common path problem. References [6], [7] proposed a method to solve this problem from the viewpoint of path-based analysis. First, the authors determine a target path for timing verification, and they obtain the pair of FFs which are the source and sink of the path. Next, the clock distribution network (CDN) is traced from the FFs to the clock source, and the meeting point is found. Finally, by regarding the meeting point as the source of the clock signal, the common path is eliminated and setup and hold verification is performed. However, these methods involve a problem peculiar to path-based analysis. Namely, if the number of candidates for the critical path is large, a large number of paths must be verified. As for block-based analysis, which does not suffer from the path-count problem, the treatment of random variability components in the CDN should be reexamined in terms of clock skew and slack estimation.

On the other hand, another problem for setup and hold verification is delay variation due to power supply noise. Within-cycle and inter-cycle dynamic noises fluctuate both clock buffer delays and combinational cell delays. However, this dynamic clock jitter and cell delay variation have not been well considered in timing analysis. Figure 1 illustrates the difficulty of the timing verification. The dynamic behavior within a cycle must be appropriately modeled. Spatial noise difference must also be considered. Furthermore, these noise waveforms vary cycle by cycle, and the worst-case noises for setup and hold verification cannot be identified within a practical time. Very recently, to tackle this problem, a statistical modeling of dynamic power supply noise that takes into consideration spatial and temporal correlations was proposed in [8]. The worst-case timing for combinational circuits was statistically estimated instead of the exact estimation. Although the power supply noise problem is deterministic and handling power supply noise statistically may be debatable, we believe that statistical treatment of supply noise enables a reasonable, though not exact, worst-case timing estimation.

This paper presents the first work on statistical timing verification that considers clock skew and jitter due to dynamic power supply noise and manufacturing variability. We extend the statistical noise model in [8] so that the noise correlation between successive cycles can be modeled for setup verification. We show that the structural correlation could degrade the accuracy of timing analysis, and quantitatively demonstrate that assigning individual random variables to upstream clock drivers improves the estimation accuracy with limited computation penalty. We apply the proposed method to two industrial embedded processors, and evaluate the timing degradation due to dynamic power supply noise.

The rest of this paper is organized as follows. Section 2 reviews statistical timing analysis for spatially-correlated variabilities and introduces the statistical noise modeling in [8]. Section 3 presents the proposed method. Experimental results are shown in Sect. 4, and the paper is concluded in Sect. 5.

2. Statistical Timing Analysis for Spatially-Correlated Variabilities

2.1 Canonical Variation Expression of Arrival Time and Delay

Variation of a parameter that affects delay ($F$), e.g. gate length and threshold voltage, is often expressed as [9]

$$ F = f_0 + X_g + X_s + X_r, $$

(1)

where $f_0$ is the average value of $F$, and $X_g, X_s, X_r$ are random variables whose averages are 0. $X_g$ represents die-to-die variability, and it fluctuates uniformly within a chip. All elements on a single chip have the same value in terms of die-to-die variability. In contrast, $X_s$ and $X_r$ represent within-chip variability, and elements within a chip have different values.

For the simplicity of explanation, we henceforth suppose the within-chip variability of a single variation parameter. The within-chip variability consists of spatially-correlated variation $X_s$ and random variation $X_r$ that differs element by element. $X_s$ has strong correlation between neighboring elements, and the correlation abates with the distance, whereas $X_r$ fluctuates randomly independent of other elements. As for $X_s$, relative placement of elements affects correlation between the element delays.

To handle the spatially-correlated variability in SSTA, a model that can reproduce the variability with a reasonable accuracy is necessary. Reference [1] proposes an SSTA that takes the spatially-correlated manufacturing variability into consideration using PCA (principal component analysis). We here explain how the variability is modeled in [1]. We first divide a chip spatially. Spatially-correlated component $X_s$ is discretized in a 2-D grid, and a random variable is assigned to each region. Within a region, the variability is assumed to be identical. After the variable assignment, a correlation coefficient matrix is constructed, and PCA is applied to the matrix. Random variable $p_i$ associated with region $i$ is expressed as a sum of orthogonalized variable (PC: principal component) $p'_j$:

$$ p_i = \mu_i + \sigma_i \sum_{j=1}^{m} \sqrt{\lambda_{ij}} v_{ij} p'_j, $$

(2)

where $\mu_i$ is the average of $p_i$, $\sigma_i$ is the standard deviation of $p_i$, $\lambda_{ij}$ is the $j$-th largest eigenvalue of the correlation coefficient matrix, $v_{ij}$ is the $j$-th value of the eigenvector corresponding to $\lambda_{ij}$, and $m$ is the number of the PCs. Die-to-die variability $X_g$ can be handled at the same time by substituting $X_g$ for $X_s$ and performing the above procedure. The following discussion assumes that this substitution is carried out.

Applying the above grid-based modeling to $F$ in Eq. (1), $F_i$, which means $F$ in region $i$, becomes a linear summation of uncorrelated random variables.

$$ F_i = f_0 + \sum_{j=1}^{m} k_{ij} p'_j + \delta_i, $$

(3)

where $k_{ij}$ is the coefficient of $p'_j$ in region $i$ and is calculated in accordance with $\sigma_i$, $\lambda_{ij}$, and $v_{ij}$. $\delta_i$ corresponds to random component $X_r$.

Here, $d_l$ is defined as the delay of element $l$ or the arrival time at element $l$, and is represented as a function of a set of parameters $\vec{F}$, $d_l = g(\vec{F})$, where $F_i \in \vec{F}$. We approximate the function with a first-order Taylor expansion.

$$ d_l = \mu_l + \sum_{F_i} \left[ \frac{\partial d_l}{\partial F_i} \right]_0 \Delta F_i, $$

(4)

where $\mu_l$ is the average of $d_l$, and $\Delta F_i = F_i - f_0$. $\left[ \frac{\partial d_l}{\partial F_i} \right]_0$ is the sensitivity of $d_l$ to $F_i$, which can be characterized with circuit simulation or calculated with sophisticated gate delay models such as in Refs. [10], [11]. By substituting Eq. (3) into Eq. (4), we can finally obtain a canonical form of arrival time and element delay.


Fig. 2 Statistical modeling of dynamic supply noise [8].

$$ d_l = \mu_l + \sum_{j=1}^{m} a_{l,j} p'_j + \sigma_l \delta_S, $$

(5)

where $a_{l,j}$ is the delay sensitivity to PC $p'_j$, $\sigma_l$ is the standard deviation of the random component, and $\delta_S$ is the standard random variable whose average and standard deviation are 0 and 1, respectively.

Given the canonical form of Eq. (5) with the orthogonal transformation, the correlation between arrival times is mostly embedded in the coefficients of the PCs, and the correlation due to die-to-die and spatially-correlated components is automatically considered. However, the structural correlation due to common paths cannot be perfectly captured, because the structural information is partially discarded. Once a MAX, SUM or SUB (subtraction) operation is performed on two canonical forms, a new canonical form is obtained. Here, the two original $\delta_S$ terms are merged into a single random variable, although they should be treated separately to preserve structural correlation since their variability sources are different.
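The effect of this merging can be seen in a minimal Python sketch of the canonical form in Eq. (5). The class name `Canonical` and the helpers `sum_op` and `sub_op` are illustrative names introduced here, not an interface from [1]; MAX is omitted, and the random components of the two operands are assumed to be independent, which is exactly the assumption that discards structural correlation.

```python
from dataclasses import dataclass
from math import sqrt, isclose


@dataclass(frozen=True)
class Canonical:
    """d = mean + sum_j pcs[j] * p'_j + sigma_r * delta_S, as in Eq. (5)."""
    mean: float
    pcs: tuple      # sensitivities a_{l,j} to the principal components
    sigma_r: float  # standard deviation of the merged random component

    def std(self):
        # The PCs and delta_S are uncorrelated with unit variance.
        return sqrt(sum(a * a for a in self.pcs) + self.sigma_r ** 2)


def sum_op(x, y):
    """SUM: PC coefficients add; random parts merge by root-sum-square."""
    pcs = tuple(a + b for a, b in zip(x.pcs, y.pcs))
    return Canonical(x.mean + y.mean, pcs, sqrt(x.sigma_r ** 2 + y.sigma_r ** 2))


def sub_op(x, y):
    """SUB: PC coefficients subtract; random parts still merge by root-sum-square."""
    pcs = tuple(a - b for a, b in zip(x.pcs, y.pcs))
    return Canonical(x.mean - y.mean, pcs, sqrt(x.sigma_r ** 2 + y.sigma_r ** 2))


# Hypothetical arrival time; subtracting it from itself should give exactly 0.
t = Canonical(mean=100.0, pcs=(3.0, 4.0), sigma_r=5.0)
diff = sub_op(t, t)
print(diff.mean, diff.pcs, round(diff.sigma_r, 2))
```

In this sketch the PC terms cancel, as the text states for die-to-die and spatially-correlated components, but the merged random term does not: `diff.sigma_r` is $\sqrt{5^2 + 5^2}$ rather than 0, because the information that both operands share the same random sources has been lost.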

2.2 Statistical Modeling of Power Supply Noise

The proposed timing analysis extends the statistical modeling of power supply noise proposed in [8]. We here introduce the statistical noise modeling of [8].

Noise waveforms differ cycle by cycle depending on input patterns. Power supply noise varies continuously in space and time, and strictly speaking, every cell has different noise waveform at each cycle. We spatially and temporally discretize power supply noise and assign random variables. We then compute statistical properties, such as average, standard deviation and correlation, of the assigned random variables.

We first determine voltage observation points by spatially discretizing a chip. The spatial discretization is performed by partitioning the chip/block area into a 2-D grid and choosing a representative value for each partition. As a representative value, for example, the voltage at the center point (Fig. 2) or the average voltage in each partition is a candidate. The voltages of all nodes in the same partition are assumed to be identical.

An important property of power supply noise is its dynamic behavior. To express dynamic waveforms within a cycle, we partition a clock cycle into several time spans, and compute a representative voltage (e.g. average as shown in Fig. 2).

We then assign a random variable to the power supply or ground voltage at each time span and at each spatial grid. We call this assigned random variable a power variable. We treat the voltage value at every clock cycle as a different sample. Figure 2 shows an example in which the voltage at position $(x, y)$ is divided into three time spans and its random variables are denoted as $V_{x,y,1}$, $V_{x,y,2}$ and $V_{x,y,3}$. The number of time spans is determined according to the modeling requirement: when dynamic variation within a clock cycle must be modeled accurately, the number of spans should be increased; otherwise a few spans are sufficient.

To efficiently perform SSTA, Ref. [8] orthogonalizes the variables with PCA, and derives a statistical model including the statistical information such as averages, standard deviations and correlation coefficients of the variables. For the modeling appropriateness, such as Gaussianity, please see [8].
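As a rough illustration of the temporal discretization described above, the following Python sketch computes power variables at one grid position from a sampled voltage waveform and then their average and standard deviation over cycles. The function names, the uniform sampling, and the toy waveform are assumptions made for this sketch; the PCA step of [8] is not shown, and the number of samples per cycle is assumed to be divisible by the number of time spans.

```python
from statistics import mean, pstdev


def span_averages(cycle_samples, n_spans):
    """Average voltage in each of n_spans equal time spans of one clock cycle."""
    size = len(cycle_samples) // n_spans
    return [mean(cycle_samples[k * size:(k + 1) * size]) for k in range(n_spans)]


def power_variable_stats(waveform, samples_per_cycle, n_spans):
    """Return (average, standard deviation) of each power variable over all cycles.

    Every clock cycle is treated as one sample of the variables V_{x,y,1..n_spans}.
    """
    cycles = [waveform[i:i + samples_per_cycle]
              for i in range(0, len(waveform) - samples_per_cycle + 1, samples_per_cycle)]
    per_cycle = [span_averages(c, n_spans) for c in cycles]
    columns = list(zip(*per_cycle))  # columns[k] holds V_{x,y,k+1} across cycles
    return [(mean(col), pstdev(col)) for col in columns]


# Toy waveform (volts): two cycles, six samples each, with a dip in the middle span.
wave = [1.0, 1.0, 0.9, 0.9, 1.0, 1.0,
        1.0, 1.0, 0.8, 0.8, 1.0, 1.0]
for k, (avg, sd) in enumerate(power_variable_stats(wave, 6, 3), start=1):
    print(f"V_(x,y,{k}): average={avg:.3f} V, std={sd:.3f} V")
```

Correlation coefficients between power variables could be computed from the same per-cycle samples in the same manner.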

3. Timing Verification Considering Clock Skew and Jitter

This section describes the proposed timing verification considering clock skew and jitter due to both process variation and power supply noise.

3.1 Slack Computation for Sequential Circuits

We first review slack computation for setup and hold constraints using Fig. 1. All arrival times and element delays are supposed to be expressed and computed with the canonical form in Eq. (5).

First, the procedure to calculate slack for setup constraint is explained.

1. Regarding the source of the clock network as an origin, we set the arrival time at the clock source to 0. We perform SSTA, and obtain the latest arrival time $t_1$ at input D of each FF. Here, the clock signal propagates to the combinational circuit through the CLK-to-Q path in FFs, and hence each FF is regarded as a combinational cell, and CLK-to-Q delay is added to the arrival time.

2. Regarding the clock source as an origin, we set the arrival time at the clock source to clock cycle $T$, and obtain the arrival time $t_2$ at clock input CLK of each FF.

3. Slack for setup constraint $S_{su}$ is calculated for every FF:

$$ S_{su} = t_2 - t_1 - T_{su}, $$

(6)

where $T_{su}$ is the setup time of the FF.

Similarly, slack for hold constraint can be computed with the following procedure.

1. Regarding the clock source as an origin, we set the arrival time at the clock source to 0. We then perform SSTA, and obtain the earliest arrival time $t_3$ at input D of each FF. We also compute the arrival time $t_4$ at clock input CLK of each FF.

2. Slack for hold constraint $S_{hold}$ is calculated for every FF:

$$ S_{hold} = t_3 - t_4 - T_{hold}, $$

(7)

where $T_{hold}$ is the hold time of the FF.

Setup and hold slacks in an entire circuit are the minimum setup and hold slacks among all FFs.

3.2 Improved Statistical Modeling of Power Supply Noise

To compute $S_{su}$ in Eq. (6), the statistical noise modeling shown in Fig. 2 is insufficient, because the clock propagation inside the CDN and the signal propagation inside the combinational logic for $t_1$ are analyzed using different samples. In other words, the correlation between noise waveforms of successive clock cycles cannot be taken into consideration. Reference [12] pointed out that this correlation helps mitigate timing violations even when large clock jitter arises due to resonant power supply noise. To consider the correlation and compute $S_{su}$ appropriately, we improve the statistical noise modeling explained in Sect. 2.2.

The improved modeling is illustrated in Fig. 3. In the improved model, the origin of temporal division is set to the clock launch timing at the source of the CDN. The sample, which is divided into several time spans, is extended in time so that the clock and signal propagations are both analyzed using the same sample. Therefore, a correlation between the variables of successive clock cycles, such as between $V_{s,p,0}$ and $V_{s,p,3}$ in Fig. 3, is naturally considered, whereas the modeling in Fig. 2 cannot capture this correlation. With this improvement, the correlation between noise waveforms of successive clock cycles, which is required to accurately compute $S_{su}$ in Eq. (6), can be appropriately modeled.

3.3 Structural Correlation

The structural correlation problem discussed in Sect. 2.1 could be a severe problem in the slack computation, though it might not be significant in SSTA for combinational circuits as experimentally shown in [1]. The problem arises from the subtractions $t_2 - t_1$ in Eq. (6) and $t_3 - t_4$ in Eq. (7). The path corresponding to $t_2$ partially overlaps the paths of $t_1$ inside the CDN.

Let us examine the simple example of Fig. 4. This example assumes all variations are static and uncorrelated for simplicity. In this case, $\mu(t_1)$, $\sigma(t_1)$, $\mu(t_2)$ and $\sigma(t_2)$ are estimated to be 110 ps, 6.6 ps, 120 ps and 5 ps, respectively, and $T_{su}$ is 0 ps. When Eq. (6) is evaluated without considering the common path, $\mu(S_{su})$ and $\sigma(S_{su})$ are 10 ps (= 120 - 110) and 8.3 ps (= $\sqrt{5^2 + 6.6^2}$). In contrast, the correct $\mu(S_{su})$ and $\sigma(S_{su})$ considering the common path are 10 ps and 4.2 ps. In this case, the correct values are estimated by virtually eliminating the leftmost inverter; namely, $\mu(t_1)$, $\sigma(t_1)$, $\mu(t_2)$ and $\sigma(t_2)$ become 60 ps, 4.2 ps, 70 ps and 0 ps, respectively. Ignoring the structural correlation overestimates the standard deviation by 98%. Thus, for a given pair of FFs, the common path can be considered by a backward traversal of the clock tree [6], [7].
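The arithmetic of this example can be reproduced with a few lines of Python. The sketch below uses only the values quoted above for Fig. 4; the helper name `sub_sigma_independent` is introduced here for illustration.

```python
from math import sqrt


def sub_sigma_independent(sigma_a, sigma_b):
    """Standard deviation of a difference whose operands are treated as independent."""
    return sqrt(sigma_a ** 2 + sigma_b ** 2)


# Values quoted in the text for Fig. 4 (all in ps).
mu_t1, sigma_t1 = 110.0, 6.6
mu_t2, sigma_t2 = 120.0, 5.0
naive_mu = mu_t2 - mu_t1
naive_sigma = sub_sigma_independent(sigma_t2, sigma_t1)

# Same subtraction after virtually eliminating the shared leftmost inverter.
mu_t1_nc, sigma_t1_nc = 60.0, 4.2
mu_t2_nc, sigma_t2_nc = 70.0, 0.0
correct_mu = mu_t2_nc - mu_t1_nc
correct_sigma = sub_sigma_independent(sigma_t2_nc, sigma_t1_nc)

overestimate = naive_sigma / correct_sigma - 1.0
print(f"common path ignored:    mu={naive_mu:.1f} ps, sigma={naive_sigma:.1f} ps")
print(f"common path considered: mu={correct_mu:.1f} ps, sigma={correct_sigma:.1f} ps")
print(f"sigma overestimated by {overestimate:.0%}")
```

The mean is unaffected, while the standard deviation roughly doubles. With unrounded intermediate values the ratio comes out near 97%; the 98% quoted above corresponds to rounding the naive value to 8.3 ps before dividing by 4.2 ps.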

On the other hand, the proposed method estimates timing by block-based analysis. Let us analyze Fig. 5 and calculate arrival times $t_1$ and $t_2$ as an example. In this situation, both $t_1$ and $t_2$ of Fig. 5 contain the delays of $cb_a$ and $cb_b$ in common, which causes the structural correlation problem. To cope with this problem in the slack computation, we assign individual random variables to each clock driver. As an example, we assume that clock buffers $cb_a$, $cb_b$, and $cb_c$ have their own random variables $r_a$, $r_b$, and $r_c$, respectively. By extending Eq. (5) with $r_a$, $r_b$, and $r_c$, $t_1$ and $t_2$ are expressed in Eqs. (8) and (9).

$$ t_1 = \mu_1 + \sum_{j=1}^{m} a_{1,j} p'_j + \sigma_1 \delta_S + c_{1,a} r_a + c_{1,b} r_b + c_{1,c} r_c, $$

(8)

$$ t_2 = \mu_2 + \sum_{j=1}^{m} a_{2,j} p'_j + \sigma_2 \delta_S + c_{2,a} r_a + c_{2,b} r_b + c_{2,c} r_c, $$

(9)

where $\delta_S$ is the random variable representing the random components of the gates other than the clock buffers, and $c_{1,a}$ and $c_{2,a}$ are the coefficients that represent the magnitudes of $r_a$ in $t_1$ and $t_2$, respectively. In this example, $c_{2,c}$ is 0, since $t_2$ is obtained by summing the delays of $cb_a$ and $cb_b$. On the other hand, all paths to the D terminal of the topmost FF include $cb_a$, and hence $t_1$ includes the delay of $cb_a$. In this case, $c_{1,a}$ and $c_{2,a}$ are identical, which means that the terms of $r_a$ in $t_1$ and $t_2$ cancel out when $t_2 - t_1$ is computed. In contrast, $c_{1,b}$ and $c_{1,c}$ change depending on the circuit structure. If the paths that go through $cb_b$ dominate those through $cb_c$ in the estimation of $t_1$, $c_{1,b}$ and $c_{2,b}$ are equal and $r_b$ is also removed when $t_2 - t_1$ is computed. Conversely, if the paths through $cb_c$ dominate those through $cb_b$, $c_{1,b}$ becomes 0, and both terms of $r_b$ and $r_c$ remain in $t_2 - t_1$. In this way, the structural correlation is considered automatically thanks to the assignment of individual random variables to clock drivers. An advantage of our method is that, unlike Refs. [6], [7], a backward traversal of the clock tree for every pair of FFs is not required.
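The cancellation described above can be sketched in Python by carrying the per-driver coefficients alongside the canonical form. The dictionary layout, the function names `sub` and `std`, and all numerical values are hypothetical choices for illustration; the case shown is the one in which paths through $cb_c$ dominate $t_1$, so that $c_{1,b} = 0$.

```python
from math import sqrt


def sub(t_a, t_b):
    """Compute t_a - t_b for forms extended with per-driver random variables.

    Coefficients of the same driver variable are subtracted exactly, so shared
    drivers cancel; only the merged random parts are combined by root-sum-square.
    """
    names = set(t_a["drivers"]) | set(t_b["drivers"])
    return {
        "mean": t_a["mean"] - t_b["mean"],
        "pcs": [a - b for a, b in zip(t_a["pcs"], t_b["pcs"])],
        "sigma_r": sqrt(t_a["sigma_r"] ** 2 + t_b["sigma_r"] ** 2),
        "drivers": {n: t_a["drivers"].get(n, 0.0) - t_b["drivers"].get(n, 0.0)
                    for n in names},
    }


def std(t):
    """Standard deviation, assuming all variables are uncorrelated with unit variance."""
    return sqrt(sum(a * a for a in t["pcs"]) + t["sigma_r"] ** 2
                + sum(c * c for c in t["drivers"].values()))


# Hypothetical numbers: t1 reaches input D through cb_a and cb_c, t2 reaches CLK
# through cb_a and cb_b.
t1 = {"mean": 110.0, "pcs": [2.0, 1.0], "sigma_r": 4.0,
      "drivers": {"r_a": 5.0, "r_c": 3.0}}
t2 = {"mean": 120.0, "pcs": [2.0, 0.5], "sigma_r": 0.0,
      "drivers": {"r_a": 5.0, "r_b": 3.0}}

diff = sub(t2, t1)
print(sorted(diff["drivers"].items()))
print(round(std(diff), 2))
```

Here the $r_a$ term vanishes in $t_2 - t_1$ while the $r_b$ and $r_c$ terms remain, without any backward traversal of the clock tree.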

When the analyzed circuit is large, the number of random variables could become a problem in terms of SSTA run time and memory. However, this problem can be mitigated by exploiting the tree structure of the CDN. In the CDN, the clock drivers in the upstream network are often included in the common path, while the number of upstream drivers is much smaller than that of downstream drivers. Therefore, assigning random variables to clock drivers from the clock source in breadth-first order, as far as the computational cost permits, efficiently mitigates the structural correlation problem. If the available computational budget is large enough, every clock driver has its own random variable, which achieves a full consideration of the structural correlation in the CDN. Even when the available budget is small, the random variables assigned only to the upstream drivers work efficiently, and the structural correlation is largely considered.
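The breadth-first assignment can be sketched as follows. This is a minimal illustration, assuming the clock tree is given as a dictionary that maps each driver to its fanout drivers; the function name `assign_upstream_rvs`, the `budget` parameter, and the driver names beyond $cb_a$, $cb_b$ and $cb_c$ are hypothetical.

```python
from collections import deque


def assign_upstream_rvs(tree, root, budget):
    """Return the drivers that receive an individual random variable.

    Drivers are visited breadth-first from the clock source, so upstream
    drivers are served first; the traversal stops once `budget` variables
    have been handed out.
    """
    chosen = []
    queue = deque([root])
    while queue and len(chosen) < budget:
        driver = queue.popleft()
        chosen.append(driver)
        queue.extend(tree.get(driver, []))
    return chosen


# Toy clock tree: cb_a drives cb_b and cb_c, which drive the leaf-level buffers.
clock_tree = {
    "cb_a": ["cb_b", "cb_c"],
    "cb_b": ["cb_d", "cb_e"],
    "cb_c": ["cb_f"],
}
print(assign_upstream_rvs(clock_tree, "cb_a", budget=3))
```

Drivers that are not selected keep sharing the single merged random variable of Eq. (5), so the budget directly trades accuracy against the number of variables carried through SSTA.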

Reference [13], a related work, presented a method that reduces the number of random variables corresponding to the random component. It merges variables that are insignificant to the signal arrival time into a single variable when SUM and MAX operations are performed. In particular, insignificant variables are expected to be actively pruned in MAX operations. In a clock tree, on the other hand, each stage has similar propagation delay and no MAX operation is performed. In such a case, the number of random variables is unlikely to be reduced even if Ref. [13] is applied. In contrast, assigning random variables to upstream drivers exploits the importance of each clock buffer from the viewpoint of timing and, despite its simplicity, can trade accuracy against computational cost, as will be shown in Sect. 4.

4. Experiments

This section evaluates the effectiveness of the proposed method. We first explain the circuits used for experiments, and then experimentally demonstrate how much the structural correlation impacts on the setup slack and clock skew. We next show the influence of dynamic power supply noise on setup and hold slacks.

We implemented the proposed timing analysis method in C++. Operations for timing propagation, such as MAX and SUM, in [8] were programmed. The gate delay model proposed in [10], [11] was adopted and implemented, though other models can be used if the accuracy of delay and sensitivity permits. Wire capacitances extracted after detailed routing were annotated and considered in timing analysis. $T_{su}$ and $T_{hold}$ in Eqs. (6) and (7) were set to 0 for simplicity.

4.1 Circuits for Experiments

4.1.1 MeP

Media embedded Processor (MeP) [14] is an industrial embedded processor, and we designed it for experiments (Table 1). The processor was synthesized and laid out by commercial tools [15], [16] using a 90 nm standard cell library. The layout size was \( 2 \times 2 \text{ mm}^2 \), and the processor was composed of 83,552 cells and ten SRAM macros. Clock gating was applied to both the synthesized circuit and SRAM macros, which means power supply noise caused by clock gating was included. External voltage sources were attached through package lead and bonding wire.

We implemented a C program of image processing. We compiled the program using gcc for MeP, and obtained an executable binary. We then performed a gate-level simulation using the executable binary, and obtained VCD (value change dump) of each instance including standard cells and SRAMs for 8,920 cycles. We generated current waveforms of each instance based on the VCD, and gave them to the power grid as piecewise linear current sources.

In this paper, we prepared three different settings for power grid analysis to evaluate the effectiveness of the proposed method under various noise waveforms.

- **Waveform A (WA):** A very rich power grid was designed, and the power supply noise was well suppressed.
- **Waveform B (WB):** On-chip wiring resource for the power grid was saved compared to Waveform A, and the DC voltage drop was significant.
- **Waveform C (WC):** Current consumption of each instance was increased artificially, and the voltage fluctuation due to $\frac{dI}{dt}$ was intensified.

Table 1  MeP and dual-core processor specification.

<table>
<tbody>
<tr>
<td></td>
<td>Embedded processor for media processing</td>
</tr>
<tr>
<td>Process/Voltage</td>
<td>90 nm CMOS/1.0 V</td>
</tr>
<tr>
<td>Frequency</td>
<td>200 MHz</td>
</tr>
<tr>
<td>#cells/#RAMs</td>
<td>64k/10</td>
</tr>
<tr>
<td>Chip size</td>
<td>2.0 mm×2.0 mm</td>
</tr>
<tr>
<td>I/O</td>
<td></td>
</tr>
</tbody>
</table>

An example of noise waveforms is shown in Fig. 6. The noise of the first 100 clock cycles was analyzed at the center of the chip. The average DC drops of the noise waveforms are listed in Fig. 6. The Vdd and Vss voltage drops of WB were over 80 mV, and the spatial distribution of DC voltage drop was thought to affect timing performance. In contrast, the noise magnitude of WC was large although the DC drop was small. The within-cycle and cycle-by-cycle fluctuations were expected to degrade timing performance. WA is thought to correspond to high-end design whose power supply noise was kept small.

4.1.2 Dual-Core Processor

We also adopted another industrial chip whose physical design was completed by a semiconductor company. The chip specifications are summarized in Table 1. The chip contains two embedded processors, and many embedded SRAM macros. Clock gating was applied to this design. We obtained the power supply noise in the case that integer operations were performed in a processor, and used it for experiments. Unfortunately, the gate-level netlist was not disclosed to us, and hence a pseudo netlist that was expected to resemble the execution stages in the processor in terms of logic depth and size was generated and used for timing analysis. The circuit size under timing analysis is 102k cells.

4.2 Structural Correlation

We evaluated how the incomplete consideration of the structural correlation involved in Eq. (5) degraded the timing estimation. Referring to a variability in a 90nm technology [17], we assumed that the correlation coefficient of spatially-correlated variability was dependent on distance, and expressed as \(e^{-\frac{x}{mm}}\), where \(x\) mm is the distance between two elements. We presumed that the magnitudes of the spatially-correlated and random components were the same and the total standard deviation was 25 mV. In this subsec-

Table 2  Setup slack (MeP, $T=4$ ns).

| Method                        | Average setup slack [ps] |
|-------------------------------|--------------------------|
| Conventional                  | 177.4                    |
| Assigning RVs to all drivers  | 191.3                    |

We performed SSTA for MeP in two ways: (M1) using standard canonical form in Eq. (5), (M2) assigning separate random variables to each clock driver as described in Sect. 3.3. The result of setup slack analysis is listed in Table 2. The conventional method (M1) underestimates the average of setup slack by 13.9 ps. MeP is an embedded processor that was not designed for high-frequency operation, and hence the logic depth in the combinational circuit is much larger than that of the CDN. On the other hand, high-end processors with deep pipelining may suffer from the underestimation because of larger magnitude of the underestimation and tighter timing budget.

To validate the approach of assigning random variables to upstream clock drivers, we varied the number of random variables associated with clock drivers, and evaluated the accuracy of setup slack estimation and SSTA run time. Figure 7 presents the results. MeP was used for the experiment, and SSTA was run on a 3.16 GHz Xeon processor. Here, initialization and file I/O times are not included in SSTA run time. The cases of 1 and 2019 random variable(s) correspond to (M1) and (M2), respectively. As expected, there is a tradeoff between accuracy and run time: as the number of random variables increases, the worst setup slack, defined as $\mu - 3\sigma$, converges and SSTA run time increases. A small number of random variables, such as 10, attains a worst setup slack within 0.1 ps of that obtained with more than 2,000 variables. In this case, individual random variables are assigned to the drivers in the first three stages, and the increase in SSTA run time from (M1) is only 0.4%.

Computing clock skew also needs arrival time subtraction of two paths that have a common path. Figures 8 and 9 depict the distributions of clock skews estimated by (M1) and (M2), respectively. Note that absolute values of clock skew were evaluated here. The horizontal and vertical axes are the mean and standard deviation of the estimated clock skews, and each dot corresponds to a pair of FFs. All pairs were plotted in the figures. We can see that clock skews were overestimated in Fig. 8 because of the structural correlation. To accurately estimate clock skews for each pair of FFs, assigning random variables to each clock driver is necessary. In the following experiments, individual random variables were assigned to each clock driver to exclude the estimation error originating from the structural correlation.

Fig. 6  An example of noise waveforms (chip center, first 100 cycles), and average DC voltage drops of three noises.

4.3 Results of Timing Analysis

This subsection shows evaluation results considering both power supply noise and process variation.

4.3.1 Accuracy of the Proposed Method

We first verified the accuracy of the proposed timing analysis. In this experiment, the number of spatial divisions was set to 10×10 and the number of temporal divisions was set to 20. As a reference, we performed STA iteratively for every clock cycle by directly using the noise information, which was the same as the information given to PCA. The overview of the proposed SSTA and the iterative STA simulation is shown in Fig. 10. The noisy power-voltage waveforms of each cycle were given to all cells considering their placements. The delay of each cell was calculated with the voltage value corresponding to the cell position and switching timing. With these gate delays, conventional STA was carried out and the timing slack of each cycle was obtained. The iterative STA results do not include errors originating from, for example, discretization and SSTA operations. The iterative STA results were therefore treated as ideal solutions and compared with those of SSTA.

Table 3 lists the averages and standard deviations of the slacks acquired by SSTA and iterative STA. The differences between SSTA and iterative STA are small, at most about 1% of each clock cycle $T$. The errors are especially small at the $\mu - 3\sigma$ point. The cumulative distribution functions of the proposed SSTA and iterative STA are depicted in Fig. 11, which shows that the two distributions overlap well.

4.3.2 Analysis Conditions

We performed setup and hold timing verification under the following four conditions.

**Condition 1 (C1):** Dynamic power supply noise in both the combinational circuit and the CDN was considered (Proposed).

**Condition 2 (C2):** Dynamic power supply noise only in the combinational circuit was considered. DC voltage drop in Fig. 6 was given to the CDN.

**Condition 3 (C3):** DC voltage drop in Fig. 6 was given to both the combinational circuit and the CDN (Conventional).

Table 4 Slacks for setup constraints [ps].

<table>
<thead>
<tr>
<th>Analysis Cond.</th>
<th>MeP</th>
<th>WD</th>
<th>WC</th>
<th>Dual-core processor</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>µ</td>
<td>σ</td>
<td>µ</td>
<td>σ</td>
</tr>
<tr>
<td>C1</td>
<td>149.8</td>
<td>84.1</td>
<td>238.0</td>
<td>116.6</td>
</tr>
<tr>
<td>C2</td>
<td>134.5</td>
<td>83.8</td>
<td>264.8</td>
<td>119.4</td>
</tr>
<tr>
<td>C3</td>
<td>134.7</td>
<td>70.7</td>
<td>173.4</td>
<td>79.2</td>
</tr>
<tr>
<td>C4</td>
<td>−1.0</td>
<td>−1.0</td>
<td>−1.0</td>
<td>−1.0</td>
</tr>
</tbody>
</table>

Fig. 12  Worst slack (MeP, WA).

**Condition 4 (C4):** Worst voltage drop (minimum $V_{DD}$, maximum $V_{SS}$) at each region was given to both the combinational circuit and the CDN.

In statistical noise modeling, $10 \times 10$ spatial divisions and 20 temporal divisions for two clock cycles were used. In this experiment, the $V_{th}$ variability explained in the previous section was also given to the combinational circuit and the CDN in all conditions.

4.3.3  Setup Verification

Table 4 lists the results of setup verification. The clock cycles given to MeP with WA, WB and WC are 4 ns, 5 ns and 4.5 ns, respectively. The achievable clock cycles and the standard deviations of slack differed depending on the power supply noise even though the same circuit (MeP) was analyzed. We hereafter examine the analyzed results.

Comparison to Worst Voltage Drop Analysis (C4)

We first compare the proposed analysis (C1) with the analysis in which the worst voltage drop at each region was given (C4). Figures 12–15 show the worst slack defined as $\mu - 3\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of slack. The horizontal bold bar represents the estimation result of the proposed method (C1). Under C4, the worst slack was underestimated by 113.3, 214.0, 458.4 and 48.2 ps in MeP with WA, WB and WC and Dual-core processor, respectively. We can see that the analysis results of C4 are excessively pessimistic.
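As a consistency check, the worst slack at the lower 3σ point, $\mu - 3\sigma$, can be recomputed from a mean and a standard deviation. The sketch below uses the first µ/σ pair in the C1 row of Table 4 (149.8 and 84.1 ps). That this pair belongs to MeP with WA is an assumption; it is supported by the result agreeing, to within rounding, with the Proposed entry for MeP WA in Table 5 (−102.4 ps). The function name is introduced only for this illustration.

```python
def worst_slack(mu_ps, sigma_ps, k=3.0):
    """Worst-case setup slack taken at the lower k-sigma point of the slack distribution."""
    return mu_ps - k * sigma_ps

# First mu/sigma pair of the C1 row in Table 4 (assumed to be MeP with WA).
c1_worst = worst_slack(149.8, 84.1)
print(f"worst slack (C1) = {c1_worst:.1f} ps")
```

A negative value here means that the circuit does not meet the given clock cycle at the 3σ point, even though the mean slack is positive.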

Comparison to DC Drop Analysis (C3)

We next compare the results of C1 and C3. The analysis of C3 did not take inter-cycle and within-cycle voltage fluctuation into consideration, and hence the overestimation of setup slack became significant when the standard deviation of power supply noise was large, as in MeP with WC and Dual-core processor. Indeed, the overestimations in MeP with WC and Dual-core processor are 577.2 and 47.2 ps, which are comparable to the underestimation of C4. Setup timing verification that does not consider dynamic power supply noise is therefore not sufficient.

Comparison to C2

When the statistical noise model was applied only to the combinational circuits, the worst slack could be either overestimated or underestimated, as shown in Figs. 12–15. The largest error of C2 was 18.4 ps, found in MeP with WB. In this configuration, DC voltage drop was spatially distributed, which increased the clock skews and then decreased the average of setup slack by 26.8 ps.

The underestimation or overestimation caused by ignoring clock skew and jitter due to dynamic power supply noise is not negligible; however, its magnitude was not as large as we expected. One reason is that the power supply noises of two successive clock cycles were correlated, and a sudden change in the power consumption of processors does not happen so frequently. In fact, the correlation between power variables at the same timing from the clock edge in the same sample was 0.812 in MeP with WA.

Table 5 Setup slacks estimated by proposed and separate analyses [ps].

<table>
<thead>
<tr>
<th>Cond.</th>
<th>MeP WA</th>
<th>MeP WB</th>
<th>MeP WC</th>
<th>Dual-core processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>Proposed</td>
<td>−102.4</td>
<td>−116.4</td>
<td>−324.0</td>
<td></td>
</tr>
<tr>
<td>Separate</td>
<td>−616.3</td>
<td>−777.9</td>
<td>−1083</td>
<td></td>
</tr>
<tr>
<td>Diff.</td>
<td>513.9</td>
<td>661.5</td>
<td>759.0</td>
<td></td>
</tr>
</tbody>
</table>

Table 6 Slacks for hold constraints [ps].

<table>
<thead>
<tr>
<th>Cond.</th>
<th>MeP WA</th>
<th>WB WC</th>
<th>Dual-core processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>C1</td>
<td>−99.5</td>
<td>33.0</td>
<td>−91.4</td>
</tr>
<tr>
<td>C3</td>
<td>−112.1</td>
<td>32.7</td>
<td>−75.0</td>
</tr>
<tr>
<td>C4</td>
<td>−104.3</td>
<td>35.1</td>
<td>−92.1</td>
</tr>
</tbody>
</table>

Proposed Method vs. Separate Analysis

We lastly demonstrate by comparison that simultaneous consideration of combinational circuit delay, clock skew and clock jitter is critically important. We evaluated the statistical distributions of combinational circuit delay $T_{com}$ and clock arrival time $T_{a,i}$ at the $i$-th FF separately, and obtained combinational circuit delay $T_{\text{delay}} (= \mu(T_{com}) + 3\sigma(T_{com}))$, clock skew $T_{\text{skew}} (= \max(\mu(T_{a,i})) - \min(\mu(T_{a,i})))$ and clock jitter $T_{\text{jitter}} (= 6 \max(\sigma(T_{a,i})))$. Table 5 lists the setup slack defined as $S_{su} = T - T_{\text{delay}} - T_{\text{skew}} - T_{\text{jitter}}$, where $T$ is the cycle time. We clearly see that the separate analysis gives highly pessimistic estimates. The pessimism in MeP with WC reaches 759.0 ps, which consumes 17% of the clock cycle. Simultaneous consideration is indispensable for accurate analysis.
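The separate analysis can be written down directly from the definitions above. The following sketch is illustrative only: the function name and the example inputs are hypothetical, and the mapping of the three data columns of Table 5 to MeP with WA, WB and WC (clock cycles of 4 ns, 5 ns and 4.5 ns) is an assumption, consistent with the 759.0 ps figure quoted for WC.

```python
def separate_setup_slack(cycle_ps, mu_com, sigma_com, clock_mu, clock_sigma):
    """S_su = T - T_delay - T_skew - T_jitter with each term evaluated separately."""
    t_delay = mu_com + 3.0 * sigma_com          # mu(T_com) + 3 sigma(T_com)
    t_skew = max(clock_mu) - min(clock_mu)      # max(mu(T_a,i)) - min(mu(T_a,i))
    t_jitter = 6.0 * max(clock_sigma)           # 6 max(sigma(T_a,i))
    return cycle_ps - t_delay - t_skew - t_jitter

# Hypothetical inputs [ps]: one combinational delay distribution and
# clock arrival-time statistics at three FFs.
example = separate_setup_slack(
    cycle_ps=4000.0,
    mu_com=3000.0,
    sigma_com=100.0,
    clock_mu=[500.0, 520.0, 540.0],
    clock_sigma=[10.0, 12.0, 15.0],
)

# Table 5 values [ps]; column-to-design mapping is assumed (see lead-in).
table5 = {
    "MeP WA": {"cycle": 4000.0, "proposed": -102.4, "separate": -616.3},
    "MeP WB": {"cycle": 5000.0, "proposed": -116.4, "separate": -777.9},
    "MeP WC": {"cycle": 4500.0, "proposed": -324.0, "separate": -1083.0},
}

# Pessimism of the separate analysis and its share of the clock cycle.
pessimism = {k: v["proposed"] - v["separate"] for k, v in table5.items()}
ratio = {k: pessimism[k] / table5[k]["cycle"] for k in table5}

for name in table5:
    print(f"{name}: {pessimism[name]:.1f} ps ({100 * ratio[name]:.1f}% of cycle)")
```

The pessimism arises because the separate analysis adds the 3σ combinational delay, the largest mean skew and the largest jitter as if they all occurred on the same path and in the same cycle.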

4.3.4 Hold Verification

We finally describe the results of hold verification. We executed hold timing verification under C1, C3 and C4 explained in Sect. 4.3.2. Table 6 lists the results.

In hold verification, C4, which gave the worst voltage drop at each region, did not necessarily estimate the slack conservatively. In the case of Dual-core processor, the slack estimated in C4 was the largest. In addition, the analysis under C3 overestimated the average of slack by 16.4 ps in MeP with WB.

5. Conclusion

This paper proposed a statistical timing analysis method for sequential circuits. The proposed method takes into account clock jitter and skew due to power supply noise and manufacturing variability in addition to delay variation of combinational logic. We discussed that slack computation by subtracting two arrival times expressed in canonical forms faces the structural correlation problem, and showed that assigning individual random variables to upstream clock drivers efficiently improves timing estimation accuracy with negligible run time overhead. Experiments using two industrial processors showed that statistical timing analysis with DC voltage drop overestimated the worst setup slack by over 500 ps. The proposed method contributes to accurate estimation of both setup and hold slacks under dynamic power supply noise.

Acknowledgement

This work is supported in part by Semiconductor Technology Academic Research Center (STARC), New Energy and Industrial Technology Development Organization (NEDO) and VLSI Design and Education Center (VDEC).


Takashi Enami received the B.E. and M.E. degrees in Information Systems Engineering from Osaka University, Osaka, Japan, in 2006 and 2008, respectively, where he is currently working toward the Ph.D. degree in the Department of Information Systems Engineering, Osaka University. His research interests include noise-aware timing analysis and distribution of power supply network. He is a student member of IEEE.

Shinya Ninomiya received the B.E. and M.E. degrees in information systems engineering from Osaka University in 2007 and 2009, respectively. He is currently with Renesas Electronics Corporation. His research interest includes variability modeling and statistical timing analysis.

Ken-ichi Shinkai received the B.E. and M.E. degrees in Information Systems Engineering from Osaka University, Osaka, Japan, in 2006 and 2008, respectively, where he is currently working toward the Ph.D. degree in the Department of Information Systems Engineering at Osaka University. His research interest is variation-aware timing and reliability analysis. He is a student member of IEEE.

Shinya Abe received the B.E. and M.E. degrees in Information Systems Engineering from Osaka University in 2007 and 2009, respectively. He is currently with Renesas Electronics Corporation. His major interest is mesh-style clock distribution.

Masanori Hashimoto received the B.E., M.E. and Ph.D. degrees in Communications and Computer Engineering from Kyoto University, Kyoto, Japan, in 1997, 1999, and 2001, respectively. Since 2004, he has been an Associate Professor in the Department of Information Systems Engineering, Graduate School of Information Science and Technology, Osaka University. His research interests include computer-aided design for digital integrated circuits and high-speed circuit design. Dr. Hashimoto served on the technical program committees for international conferences including DAC, ICCAD, ASP-DAC, ICCD and ISQED. He is a member of IEEE and IPSJ.