# 6. Sequential Logic

Most of today’s digital systems are built with sequential logic,
including virtually all computer systems. A **sequential circuit** is
a digital circuit whose outputs depend on the history of its inputs.
Thus, sequential circuits have a memory that permits significantly
more complex functional behaviors than *combinational circuits* are capable of. In fact, the study of the
complexity of sequential circuit behavior is the cradle of the
field of computer science.

## 6.1. Feedback Loops

We construct a sequential circuit by introducing a feedback loop. For
example, we may connect the output of a combinational circuit to one
of its inputs, as shown in Figure 6.1. This
*multiplexer* circuit feeds output \(Y\) back
to data input \(D_0.\) Because of this feedback loop, inputs of
the past, in particular inputs that are arbitrarily older than the
*propagation delay* of the multiplexer, can
determine output \(Y.\) In the following, we analyze the circuit
to derive a precise characterization of its behavior.

We may express the functionality of the multiplexer loop by means of
the defining equation of the *multiplexer*,
substituting \(Y\) for data input \(D_0\) and \(D\) for
data input \(D_1\):

\[ Y = \overline{S}\,Y + S\,D. \]

This definition is recursive, because \(Y\) appears as output on
the *lhs* and as input on the *rhs* in case \(S=0.\) We interpret
the recursion such that output \(Y\) follows input \(D\) if
select input \(S\) equals 1, and holds output \(Y\) while
\(S\) equals 0. Thus, when \(S=0\) the circuit retains the
last value of input \(D\) before \(S\) transitions from 1 to
0. We know this behavior from the *D-latch* already.
Now, we utilize the tools for combinational circuit design, including
*K-maps* and *timing analysis*, to
study the feedback loop in more detail.
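Before diving into the formal analysis, the hold-versus-follow behavior of the recursive definition can be sketched in a few lines of Python. This is an illustrative model only; the function name `mux_loop` is ours, not part of the text:

```python
def mux_loop(d, s, y_prev):
    """One evaluation of the multiplexer feedback loop:
    Y follows D while S = 1, and holds its previous value while S = 0."""
    return d if s == 1 else y_prev

# Follow mode: with S = 1 the output tracks D.
y = 0
y = mux_loop(d=1, s=1, y_prev=y)   # y becomes 1

# Hold mode: with S = 0 the output retains the last stored value of D,
# no matter how D changes afterwards.
y = mux_loop(d=0, s=0, y_prev=y)   # y stays 1
print(y)  # -> 1
```

The second call shows the memory effect: the value of \(D\) before \(S\) dropped to 0 is retained, exactly as in a D-latch.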

First, we inspect the combinational portion of the multiplexer circuit
without the feedback loop, and second we study the effects of the
feedback loop. This separation of concerns is known as **cut analysis**:

- Cut all feedback loops, and rename all fed-back outputs that drive the loops.
- Analyze the combinational circuit by deriving the K-maps, called
  excitation tables, for each fed-back output.
- Generate the transition table by marking all stable states in the
  excitation table, i.e. those K-map cells where the outputs driving the
  feedback loops equal the inputs driven by the feedback loops.
- Analyze the transitions of the sequential circuit in the presence of
  all feedback loops with the aid of the transition table.
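Steps 2 and 3 of the cut analysis lend themselves to a small Python sketch: tabulate the excitation function for the cut circuit and mark every entry where the fed-back output equals the fed-back input. The helper `cut_analysis` and the bit ordering are our own illustrative choices:

```python
from itertools import product

def cut_analysis(f, n_inputs):
    """Tabulate the excitation function Y'(inputs, Y) of a cut feedback
    loop and mark as stable every entry where Y' equals the fed-back
    input Y."""
    table = {}
    for bits in product((0, 1), repeat=n_inputs + 1):
        *inputs, y = bits
        y_next = f(*inputs, y)
        table[bits] = (y_next, y_next == y)   # (excitation, stable?)
    return table

# Excitation function of the cut multiplexer loop: Y' = S'·Y + S·D.
mux = lambda d, s, y: (1 - s) & y | s & d
for state, (y_next, stable) in cut_analysis(mux, 2).items():
    print(state, "->", y_next, "stable" if stable else "unstable")
```

Running the sketch reproduces the transition table derived below: the unstable states are exactly those where the excitation output disagrees with the feedback input.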

We analyze the multiplexer circuit, starting with step 1 by cutting
the feedback loop. Figure 6.2 illustrates the cut in the
black-box diagram on the left and the combinational circuit
implementing a *glitch-free* multiplexer on the right.
We rename output \(Y\) that drives the feedback loop to
\(Y',\) and keep name \(Y\) for the input of the multiplexer
that is driven by the feedback loop.

The resulting gate-level circuit of the multiplexer in Figure 6.2 is acyclic and, hence, combinational. Output \(Y'\) is a function of three inputs, \(D,\) \(S,\) and \(Y\):

The equivalent SOP form corresponding to the gate-level circuit is

\[ Y' = \overline{S}\,Y + S\,D + D\,Y. \]
Besides analyzing the functionality, we can analyze the timing of the
combinational circuit, which resembles the analysis of the
*multiplexer without consensus term*. Assume for
simplicity that each gate has a propagation delay of 1 time unit,
i.e. \(t_{pd}(inv) = 1,\) \(t_{pd}(and) = 1,\) and
\(t_{pd}(or) = 1.\) Then, the multiplexer in Figure 6.2 has one path from \(S\) through the inverter with propagation
delay \(t_{pd}(mux) = 3\) time units, and all other paths have
contamination delay \(t_{cd}(mux) = 2\) time units. This
completes our analysis of the combinational circuit.
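Under the unit-delay assumption, the path delays are simple sums of gate delays. A minimal sketch (variable names are ours, the path structure follows the text):

```python
# Unit gate delays, as assumed in the text.
t_inv, t_and, t_or = 1, 1, 1

# Longest path of the multiplexer: from S through the inverter,
# an AND gate, and the OR gate.
t_pd_mux = t_inv + t_and + t_or   # propagation delay: 3 time units

# Shortest paths: from D or Y through an AND gate and the OR gate.
t_cd_mux = t_and + t_or           # contamination delay: 2 time units

print(t_pd_mux, t_cd_mux)  # -> 3 2
```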

According to step 2 of the cut analysis, we translate the multiplexer
function into a K-map, also called *excitation table*, because it
specifies the function of the circuit node that drives or *excites*
the feedback loop. In case of the multiplexer circuit this node is
\(Y'.\) Figure 6.3 shows the excitation table of
\(Y'\) on the left. We have arranged variable \(Y\) to change
across rows, because this simplifies step 3, determining the stable
states for the transition table.

A circuit with a feedback loop is stable, if the output of the
combinational circuit that drives the feedback loop is equal to the
input of the combinational circuit driven by the feedback loop. After
all, this is the purpose of the feedback wire to begin with: The
feedback wire forces the voltage levels of the connected output and
input of the combinational circuit to be equal. The **stable states**
of our multiplexer loop are those cells in the excitation table where
output \(Y'\) equals input \(Y.\) In Figure 6.3, the top row is associated with input \(Y=0.\) Therefore,
the stable states are those cells of the top row marked with \(Y'
= 0.\) Analogously, the bottom row is associated with \(Y=1,\)
such that the stable states in the bottom row are all cells marked
with \(Y' = 1.\) The excitation table with encircled cells
identifying stable states is called **transition table**, see
Figure 6.3 on the right. All remaining cells without
circle identify **unstable states**. An unstable state is temporary,
because output \(Y' \ne Y\) drives a new input \(Y'\) via the
feedback loop into the combinational circuit.

We now use the transition table to pursue our original goal, the systematic analysis of the transitions of the multiplexer loop, i.e. step 4 of the cut analysis. The input terminals of the multiplexer loop in Figure 6.1 are data input \(D\) and select input \(S.\) We assume that the circuit is in a stable state, for example \((D,S,Y) = 100.\) This is the state associated with the rightmost cell in the top row of the transition table, shown on the left in Figure 6.4 below. The loop holds output value \(Y' = 0\) indefinitely until we stimulate a transition by changing an input.

Next, we change select input \(S\) from 0 to 1. In the transition
table, we move by one cell to the left. This state is unstable,
because output \(Y'=1\) differs from input \(Y=0.\) Input
\(Y\) will assume output value 1 via the feedback loop after a
delay of \(t_{cd} = 2\) time units, because inputs
\((D,S)=11\) cause AND gate \(W\) to change from 0 to 1, which
switches output \(Y'\) from 0 to 1. Therefore, the circuit
transitions without further stimulus from unstable state
\((D,S,Y)=110\) to state \((D,S,Y)=111.\) In the transition
table, we move by one cell downward. State \((D,S,Y)=111\) is
stable and produces output \(Y'=1.\) The same transition behavior
occurs in a *D-latch*, after switching from opaque to
transparent. The output follows input \(D\) such that \(Y'\)
transitions from 0 to 1.

If we change data input \(D\) from 1 to 0 when the multiplexer loop occupies stable state \((D,S,Y)=111,\) the state transitions along the path illustrated in Figure 6.4 on the right. Since \((D,S,Y)=011\) is an unstable state, input \(Y\) will change to output \(Y'=0.\) Without further stimulus the circuit transitions to stable state \((D,S,Y)=010.\) The transition table enables us to argue about such transitions without inspecting the circuit diagram. Thus, the transition table is a powerful tool suited for tracing all possible transition paths in a circuit with feedback loops. In the transition table of the multiplexer loop, signal changes on the input terminals \(D\) or \(S\) correspond to transitions across columns within a row. State transitions due to the feedback loop start in an unstable state and cross rows within a column.
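Tracing a transition path amounts to iterating the excitation function until the fed-back input agrees with the output. A Python sketch of this idea (the helper `settle` is ours):

```python
def settle(f, inputs, y, max_steps=10):
    """Follow the feedback loop from a possibly unstable state until
    the fed-back input Y equals the excitation output Y'."""
    trace = [y]
    for _ in range(max_steps):
        y_next = f(*inputs, y)
        if y_next == y:          # stable state: Y' == Y
            return trace
        y = y_next
        trace.append(y)
    raise RuntimeError("no stable state reached: the loop oscillates")

# Excitation function of the cut multiplexer loop: Y' = S'·Y + S·D.
mux = lambda d, s, y: (1 - s) & y | s & d

# S: 0 -> 1 in stable state (D,S,Y) = 100: the loop passes through
# unstable state 110 and settles with Y = 1 in stable state 111.
print(settle(mux, (1, 1), 0))   # -> [0, 1]
# D: 1 -> 0 in stable state 111: via unstable state 011 to stable 010.
print(settle(mux, (0, 1), 1))   # -> [1, 0]
```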

The transition table can be viewed as a simplifying abstraction of the
timing diagram resulting from a *timing analysis*. The waveforms of the timing diagram provide
the delays as additional information. Figure 6.5
shows the timing diagram of the transitions discussed in Figure 6.4 based on the signals of the multiplexer loop in
Figure 6.1. Input transition \(S: 0 \rightarrow
1\) occurs at time \(t = 0,\) and input transition \(D:
1\rightarrow 0\) at time \(t = 6.\) Since we assume that each gate
has a delay of 1 time unit, the transition of select input \(S\)
takes a delay of \(t_{cd} = 2\) time units to produce a stable
output \(Y=1,\) although it takes another time unit for the
feedback loop to drive node \(X\) to value 1. Similarly, the
transition of data input \(D\) takes \(t_{cd} = 2\) time units to
stabilize output \(Y=0.\)

Comparing the timing diagram with the transition tables in Figure 6.4, we observe that the multiplexer loop is in initial state \((D,S,Y)=100\) for \(t < 0.\) Then, we change select input \(S\) to 1, and the circuit is in unstable state \((D,S,Y)=110\) for time period \(0 < t < 2.\) At time \(t=2,\) output \(Y\) transitions to 1, and the circuit into stable state \((D,S,Y)=111.\) The multiplexer loop remains in this state until we provide the next input stimulus. This happens at time \(t=6,\) where we change data input \(D\) to 0. In time period \(6 < t < 8,\) the circuit occupies unstable state \((D,S,Y)=011,\) before it transitions into stable state \((D,S,Y)=010\) at time \(t=8.\) The transition table captures these transitions without the complexity of a timing analysis.

Using the transition table, we can study the behavior of the multiplexer loop when select input \(S = 0.\) Figure 6.6 shows on the left the transition starting in state \((D,S,Y)=010\) and changing \(S\) from 1 to 0. We move by one cell to the left into stable state \((D,S,Y)=000.\) Subsequently, we can change data input \(D\) from 0 to 1, transitioning the circuit to stable state \((D,S,Y)=100,\) and back, as shown in Figure 6.6 on the right. Output \(Y\) stores the last value of \(D\) before we changed \(S\) from 1 to 0, no matter when and how often we toggle data input \(D.\) This timing behavior demonstrates the characteristic feature of sequential circuits: output \(Y\) depends on the history of inputs \(D\) and \(S.\) If \(D\) had been 1 before we changed \(S\) from 1 to 0, the multiplexer loop would have stored \(D=1\) by holding output \(Y = 1\) in stable states \((D,S,Y)=101\) or \((D,S,Y)=001.\)

The distinction of stable and unstable states leads us to classify the
behavior of entire sequential circuits with feedback as either stable
or unstable. A sequential circuit is stable if a change at an input
terminal causes the circuit to transition from one stable state into
another. Otherwise, if the outputs change beyond the propagation
delay, the sequential circuit is unstable. We distinguish two types
of unstable circuits. If the outputs become stable eventually, then
the change of an input causes the circuit to transition through one or
more unstable states into a stable state. If, however, the outputs
change indefinitely, then the circuit must enter a cycle of
transitions through unstable states. Unless we wish to design an
*oscillator*, such a behavior is usually
considered a bug. Transitions that are eventually stable but traverse
unstable states are considered a harmless feature of sequential
circuits, similar to *glitches* in combinational
circuits.

We perform a cut analysis of the ring oscillator in Figure 6.7. The feedback loop connects output \(Y\) to the input of the NAND gate. We cut the feedback loop and rename output \(Y\) to \(Y'.\)

The combinational portion of the cyclic circuit is the 3-stage path NAND-INV-INV. The output function \(Y'(A,Y)\) is easily derived as follows. Since \(Y' = \overline{X} = W,\) we find \(Y' = \overline{A \cdot Y}.\) We may interpret the function of the NAND gate as

\[ Y' = \begin{cases} 1, & \text{if}\ A = 0,\\ \overline{Y}, & \text{if}\ A = 1, \end{cases} \]
such that output \(Y'\) equals 1 if control input \(A=0,\) and \(Y'\) is the complement of input \(Y\) if \(A=1.\) Next, we translate this functionality into a K-map, and encircle the only stable state \((A,Y)=01\) to obtain the transition table below.

Based on the transition table we deduce the behavior of the ring oscillator. If input \(A=0,\) the circuit will eventually enter the stable state. If the circuit is in unstable state \((A,Y)=00,\) then output \(Y'=1\) enforces the transition into stable state \((A,Y)=01.\) If we change input \(A\) from 0 to 1, we move from the stable state by one cell to the right. State \((A,Y)=11\) is unstable. Since \(Y' = 0,\) it transitions after the propagation delay of the 3-stage path to state \((A,Y)=10,\) i.e. we move one cell up in the transition table. The resulting state \((A,Y)=10\) is unstable too. Output \(Y'=1\) forces the transition to state \((A,Y)=11,\) i.e. we move one cell down again. As long as input \(A\) remains 1, the circuit toggles between states \((A,Y)=10\) and \((A,Y)=11,\) and the output between \(Y'=0\) and \(Y'=1.\) The propagation delay between transitions is the delay of the 3-stage path. In other words, the oscillator produces a clock signal. To stop the oscillation, we change input \(A\) from 1 to 0, so that the circuit transitions into the stable state eventually. In the transition table, stimulus \(A: 1\rightarrow 0\) moves from the right to the left column, and if we are in the top row, then without further stimulus into the bottom row.
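The oscillation predicted by the transition table can be reproduced by iterating the excitation function \(Y' = \overline{A \cdot Y}\). In this sketch, one iteration models one traversal of the 3-stage combinational path (the function names are ours):

```python
def nand_loop(a, y):
    """Excitation function of the NAND loop: Y' = not (A and Y)."""
    return 1 - (a & y)

def simulate(a, y, steps):
    """Iterate the feedback loop; each step models one traversal of
    the 3-stage combinational path NAND-INV-INV."""
    trace = [y]
    for _ in range(steps):
        y = nand_loop(a, y)
        trace.append(y)
    return trace

# A = 0: the loop settles in the stable state (A,Y) = 01.
print(simulate(a=0, y=0, steps=4))  # -> [0, 1, 1, 1, 1]
# A = 1: the loop toggles between states 10 and 11, i.e. it oscillates.
print(simulate(a=1, y=0, steps=4))  # -> [0, 1, 0, 1, 0]
```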

Not all combinational circuits with a feedback loop are sequential. There exist cyclic circuits whose outputs depend on the present inputs only rather than on the past inputs. An example of such a multioutput circuit is shown below.

We use a modified form of the cut analysis to show that this circuit is combinational. A circuit is sequential if we cut the feedback loop and the output driving the feedback loop is a function of the inputs driven by the feedback loop. If the output function of the driver of the feedback loop is independent of the feedback signal, then the feedback wire is redundant and the circuit is combinational.

Following this argument, we investigate whether output \(y_3\) of the circuit is sequential or combinational. We cut the feedback loop as shown below, and rename the output into \(y_3'.\)

Next, we deduce the Boolean expression for \(y_3'\) as a function of inputs \(x_0,\) \(x_1,\) and feedback input \(y_3,\) and simplify the expression to see whether \(y_3'\) depends on \(y_3\) or not:

We conclude that \(y_3'\) is not a function of feedback input \(y_3\) but is a function of terminal inputs \(x_0\) and \(x_1\) only. Therefore, \(y_3'\) depends on the present inputs \(x_0\) and \(x_1\) only, and is hence combinational.

Since the circuit is a multioutput circuit, it is combinational if all of its output functions are combinational. Therefore, we repeat the cut analysis for each output, cutting the loop at the input driven by the corresponding output. For example, to test \(y_0,\) we cut the loop at the \(y_0\)-input of the AND gate. Then, we find these expressions for the output functions:

None of the output functions depends on the feedback signal. Therefore, all output functions and the circuit as a whole are combinational. Intuitively, the complementation and covering theorems cause the feedback signal to vanish from the outputs. More interesting is the observation that this cyclic combinational circuit computes four distinct 2-variable functions with four gates, and is therefore competitive with four independent acyclic circuits.

Perform a cut analysis of the NAND loop:

- Cut the loop and rename the loop driver.
- Derive a K-map for the output (excitation table).
- Mark all cells with stable states (transition table).
- Analyze the transitions of the circuit.

Output \(Y\) of the NAND gate drives the feedback loop to its input. The first step of a cut analysis is to cut the feedback loop:

Then, we rename the loop driver, here \(Y \rightarrow Y',\) and redraw the resulting acyclic combinational circuit to enhance clarity:

The combinational circuit in (b) is easy to analyze: \(Y' = \overline{A\,Y}.\)

The excitation table of the loop circuit is the K-map of the acyclic combinational circuit, specified above by means of Boolean equation \(Y' = \overline{A\,Y}.\) Since the NAND operation produces \(Y' = 0\) only if both inputs are \(A = Y = 1,\) the corresponding K-map is:

|          | \(A=0\) | \(A=1\) |
|----------|---------|---------|
| \(Y=0\)  | 1       | 1       |
| \(Y=1\)  | 1       | 0       |

We derive the transition table of the cyclic circuit with feedback loop from the excitation table of the acyclic circuit by marking all stable states that observe feedback constraint \(Y = Y'.\) The cells of the excitation table are marked with the associated value of output \(Y'.\) In the top row of the excitation table input \(Y = 0.\) However, both cells are marked with output \(Y' = 1.\) Thus, there are no stable states in the top row. In the bottom row of the excitation table input \(Y = 1.\) The leftmost cell is marked with output \(Y' = 1,\) which is equal to input \(Y.\) Therefore, this cell identifies the only stable state of the circuit.

We study the transitions of the NAND loop with the aid of the transition table. Input \(A\) is the only input of the NAND loop. Therefore, we first study the behavior of the circuit when \(A\) is fixed either to 0 or to 1, and then when input \(A\) changes.

Consider the case where input \(A = 0.\) We distinguish two cases depending on input \(Y.\) If \(Y = 1\) initially, then the circuit is in its stable state because output \(Y' = \overline{0\cdot 1} = 1\) of the NAND gate reinforces input \(Y = Y' = 1\) through the feedback loop. In the other case, input \(Y = 0\) initially. In the transition table below this initial state is associated with the top-left cell where \((A,Y) = 00.\)

State \((A,Y)=00\) is unstable, because output \(Y'=1\) of the NAND gate drives input \(Y=Y'=1\) through the feedback loop, which differs from the initial state \(Y=0.\) Thus, the feedback loop transitions the circuit into new state \((A,Y)=01.\) This is the stable state of the circuit associated with the bottom-left cell. After transitioning into the stable state, the circuit has stabilized. We conclude that the circuit stabilizes when input \(A=0\) such that output \(Y'=1\) eventually.

Next, we consider case \(A=1.\) If input \(Y=0\) initially, the circuit is in state \((A,Y)=10\) associated with the top-right cell of the transition table. Since the output of the NAND gate produces \(Y'=1,\) the feedback loop enforces the input transition \(Y: 0\rightarrow 1.\) The circuit transitions to state \((A,Y)=11\) associated with the bottom-right cell of the transition table.

In the other case, where input \(Y=1\) initially, the circuit is in state \((A,Y)=11.\) This state is unstable, because output \(Y'=0\) of the NAND gate enforces the input transition \(Y: 1\rightarrow 0\) through the feedback loop. Thus, the circuit transitions to state \((A,Y)=10.\) We conclude that the circuit oscillates when input \(A=1,\) like the ring oscillator of Example 6.1.

Changes at input \(A\) transition the circuit between stable state \((A,Y) = 01\) and the oscillating behavior when input \(A=1.\) The exact state sequence of the transition depends on the state of the circuit when \(A\) transitions. In particular, if the circuit oscillates when \(A\) changes from 1 to 0, input \(Y\) assumes one of the two values 0 or 1 beyond our control. If \(Y\) happens to be 1, then state transition \((A,Y): 11 \rightarrow 01\) assumes the stable state immediately. On the other hand, if \(Y\) happens to be 0 when \(A\) changes from 1 to 0, then the state transitions through unstable state 00 into stable state 01, i.e. \((A,Y): 10 \rightarrow 00 \rightarrow 01.\) The inverse input transition \(A: 0\rightarrow 1\) causes the circuit to oscillate, most likely originating in stable state \((A,Y) = 01,\) because unstable state \((A,Y)=00\) transitions into the stable state when \(A=0.\)

## 6.2. Synchronous Circuits

Feedback is the key to processing past and present information of an
input sequence. *Memory elements*, including
the *D-latch* and the *D-flipflop*,
are sequential circuits with *feedback loops*.
In general, it is nontrivial to design a sequential circuit by
augmenting a combinational circuit with feedback loops, because the
delays of such loops are difficult to control and the timing behavior
of compositions is hard to analyze. Even worse, *hazards* are difficult to predict in compositions of subcircuits
with feedback loops, and potentially cause the circuit to become
unstable indefinitely and malfunction. Nevertheless, sequential
circuits with feedback loops are of interest per se, and design
methodologies have been developed to cope with the problems of this
class of circuits, widely known as **asynchronous circuits** today.
The *D-latch* and the *D-flipflop*
are asynchronous sequential circuits whose design has benefited from
these developments.

Another circuit style dominates today’s digital design practice, known
as **synchronous circuits**. The name derives from the introduction
of a chronometer or clock signal that is used to trigger all
*D-flipflops* of a circuit at the same beat of the
clock. Furthermore, in an effort to simplify the control of delays
through feedback loops, we insert a D-flipflop in every feedback loop.
Since D-flipflops are edge-triggered memory elements, their outputs
are stable for an entire clock period. When placed in a feedback
loop, the output of the D-flipflop is the feedback input for the
combinational circuit. Unlike in an asynchronous circuit, the
D-flipflop keeps the feedback input stable during the clock period.
Hence, the loop cannot become unstable and can even tolerate hazards
in the combinational logic, provided the clock period is larger than
the combinational propagation delay.

The design style of synchronous circuits radically restricts the overall design space for sequential circuits. Nevertheless, it simplifies the task of circuit design dramatically, and is still powerful enough to assemble reliable circuits with billions of transistors. As such the synchronous design style is part of the unique success story of large scale digital systems engineering. Given their practical importance, we focus the remainder of this chapter on synchronous sequential circuits. This section discusses an analysis methodology for synchronous sequential circuits and introduces three sequential building blocks, a binary counter, shift registers, and memories.

### 6.2.1. Analysis of Synchronous Sequential Circuits

We demonstrate the functional analysis of synchronous sequential
circuits by means of the *multiplexer loop*,
extended with a D-flipflop in the feedback loop. As shown in
Figure 6.8, D-flipflop output \(Q\) drives the
feedback input of the multiplexer, and clock signal \(\phi\)
triggers the D-flipflop at the positive clock edge. At this point in
time the D-flipflop stores its input \(Y\) for an entire clock
cycle until the next positive clock edge occurs.

The value stored in the D-flipflop is called the **state** of the
synchronous circuit. In Figure 6.8 the state is
observable at D-flipflop output \(Q.\) The purpose of circuit
analysis is to determine the **next state** of the next clock cycle.
If we consider all possible input combinations, the analysis enables
us to predict the sequence of output values given a sequence of input
values. The next state of the multiplexer loop is the value of input
\(Y\) of the D-flipflop at the time of the triggering clock edge.
Signal \(Y\) is driven by the combinational multiplexer circuit,
and is a function of inputs \(D\) and \(S,\) and state
\(Q\):

\[ Y = \overline{S}\,Q + S\,D. \]

We use an adapted form of transition table, the so-called **state
transition table**, to list the next state as a function of the state
and the inputs. Since the state of the circuit is equal to \(Q\)
and the next state is equal to \(Y,\) the state transition table
of the multiplexer loop in Figure 6.8 is:

| state (\(Q\)) | \(D\) | \(S\) | next state (\(Y\)) |
|---------------|-------|-------|--------------------|
| 0             | 0     | 0     | 0                  |
| 0             | 0     | 1     | 0                  |
| 0             | 1     | 0     | 0                  |
| 0             | 1     | 1     | 1                  |
| 1             | 0     | 0     | 1                  |
| 1             | 0     | 1     | 0                  |
| 1             | 1     | 0     | 1                  |
| 1             | 1     | 1     | 1                  |

State transition tables are commonly drawn in form of a truth table,
although we could use the equivalent K-map form as well. Under the
assumption that the input values \(D\) and \(S\) are stable at
the time of the triggering clock edge, we derive next state \(Y\)
by evaluating equation \(Y(D,S,Q)\) for the combination of input
values in each row. This equation specifies the **next state logic**
implemented by the combinational multiplexer.

Although the state transition table specifies the transition behavior
of a sequential circuit unambiguously, it can be advantageous to
visualize the transition behavior by means of a directed graph, the
so-called **state transition diagram**. Here, each vertex represents
a state, and each arc a transition. We annotate the arcs with the
combination of input values associated with the state transition.
Separated by a slash, we also include output \(Y\) of the circuit,
which happens to be equal to the next state in this example.
Figure 6.9 shows the state transition diagram of the
multiplexer loop in Figure 6.8. The vertices are marked
with the value of state \(Q.\) The format of the arc annotations
is \((D,S)/Y,\) i.e. the pair of input values followed by a slash
followed by the output value.

We interpret the state transition diagram as follows. During the present clock cycle, the circuit is in one of the states represented by a vertex, for example in state 0. While in this state, the inputs may change causing the output to change. For example, in state 0, if we apply inputs \((D,S)=01,\) then output \(Y=0,\) as indicated by arc annotation \(01/0.\) If we apply inputs \((D,S)=11,\) then output \(Y=1,\) and the arc annotation is \(11/1.\) The input combination at the time of the triggering clock edge determines the next state. For example, arc \(11/1\) points from state 0 to state 1, because at the triggering clock edge next state \(Y=1\) is stored in the D-flipflop. The other three input combinations \(00/0,\) \(01/0,\) and \(10/0\) store next state \(Y=0\) at the triggering clock edge, which is also the present state. Therefore, the corresponding arc is a loop from state 0 to state 0 for all three input combinations. Comparison with the next state table shows that each arc annotation in the diagram corresponds to one row in the table. The diagram captures the transition behavior in a more compact form, that some designers find easier to comprehend than a table.

The state transition table and/or the state transition diagram enable
us to derive the output sequence of the circuit if the inputs are
synchronized with the clock, such that during each clock cycle one new
pair of inputs arrives. For example, assume that the multiplexer loop
is initially in state 0, and we apply the input sequence 00, 11, 10,
01, 10. Then, the corresponding output sequence is 0, 1, 1, 0, 0.
Here is why. Initially, the circuit is in state 0, and we apply the
first input pair 00. Hence, the output is \(Y=0.\) At the next
triggering clock edge, the D-flipflop stores next state 0. Now, we
apply the second input pair 11. The output changes to \(Y=1.\)
At the next triggering clock edge the circuit transitions to state 1.
The third input pair 10 holds output \(Y=1,\) and so on. Use the
interactive model in Figure 6.10 to develop a
feeling for the synchronous multiplexer loop. You can toggle inputs
\(D\) and \(S\) and trigger a positive clock edge by pushing
the \(\phi\) *trigger* button.

Figure 6.10: Interactive model of synchronous multiplexer loop.

The multiplexer loop does not restrict the inputs to change only once per cycle. In fact, you can change inputs \(D\) and \(S\) as often as you like between triggering clock edges. The output follows the inputs through the combinational logic implemented by the multiplexer. State changes, however, occur only upon triggering a positive clock edge. At this time, the D-flipflop stores the next state supplied to its input by multiplexer output \(Y.\) In summary, if input \(S=1,\) the loop stores input \(D\) at the time of the clock trigger. Otherwise, while \(S=0,\) the loop holds its state across clock trigger events independent of \(D.\) The state transition table and diagram above contain the same information about the circuit.
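The clock-by-clock behavior, including the example input sequence above, can be condensed into a short Python sketch. The function name `sync_mux_loop` is our own; the next-state logic is the multiplexer equation from the text:

```python
def sync_mux_loop(seq, q=0):
    """Clock-by-clock simulation of the synchronous multiplexer loop.
    The combinational output Y = S'·Q + S·D is also the next state,
    which the D-flipflop captures at each triggering clock edge."""
    outputs = []
    for d, s in seq:
        y = d if s == 1 else q   # next state logic Y(D, S, Q)
        outputs.append(y)
        q = y                    # positive clock edge stores next state
    return outputs

# Input sequence (D,S) from the text, starting in state Q = 0.
print(sync_mux_loop([(0, 0), (1, 1), (1, 0), (0, 1), (1, 0)]))
# -> [0, 1, 1, 0, 0]
```

The simulated output sequence matches the one derived from the state transition diagram: 0, 1, 1, 0, 0.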

In the following, we analyze several synchronous sequential circuits by deriving their state transition table and diagram. These circuits are sequential building blocks for the design of larger sequential circuits.

### 6.2.2. Binary Counter

An \(n\)-bit **binary counter** is a synchronous sequential
circuit that increments its \(n\)-bit output \(Y\) at every
triggering clock edge. Since \(n\) bits can represent
*unsigned numbers* in range
\([0, 2^n-1],\) the counter rolls over to 0 when incrementing the
largest value \(2^n-1\) of the representable number range.
Therefore, \(n\)-bit binary counters are also called
*modulo*-\(2^n\) *counters*. Figure 6.11
shows a 3-bit binary counter with a *reset* input to force the counter
to zero. While the *reset* input is 0, the reset signal is inactive.
Whenever a positive clock edge occurs the counter increments the value
stored in the 3-bit *register*. If the *reset* input
is equal to 1 during a positive clock edge, the register is reset to
zero.

Figure 6.11: Interactive model of a 3-bit binary counter.

The counter circuit is synchronous, because the feedback loop contains
a clock triggered register. Output \(Y\) of the register feeds
back into the *3-bit adder*. The second
input of the adder is hardwired to constant value \(1_{10}.\) We
tacitly assume that the carry-in of the adder is grounded to value 0,
and the carry-out is unused. Sum output \(S\) of the adder
generates the 3-bit sum \(S=(Y+1) \bmod 8.\) Recall that the
largest unsigned binary number representable with 3 bits is
\(2^3-1 = 7_{10},\) and \(7+1=8_{10} = 1000_2\) requires 4
bits. Since the sum output of the adder excludes the most significant
carry-out bit, we obtain \(S = 000_2 = 0_{10}\) for input
\(Y=7_{10}.\) The multiplexer implements the reset logic. Next
state signal \(Y'\) is the output of the combinational portion of
the feedback loop, and obeys the *multiplexer logic*:

\[ Y' = \begin{cases} (Y + 1) \bmod 8, & \text{if}\ \mathit{reset} = 0,\\ 0, & \text{if}\ \mathit{reset} = 1. \end{cases} \]

We analyze the behavior of the counter circuit by deriving the state
transition diagram. The circuit has one input besides clock
\(\phi,\) i.e. the *reset* signal, and one output, which is
counter value \(Y.\) First assume that the *reset* signal is
inactive, and the counter increments its output at every positive
clock edge. During a clock cycle the register stores the present
state and drives it on output \(Y,\) where we can observe the
state. At the next positive clock edge, the circuit transitions to
next state \(Y'.\) To draw the state transition diagram, we need
to start with some state, for example state \(0_{10}.\) We
represent the state with a new vertex and mark it with its identifying
state value 0. The next state of state \(0_{10}\) is
\(Y'=1_{10}.\) We draw a second vertex, mark it with state value
\(1_{10},\) and draw an arc from vertex 0 to vertex 1.

Rather than annotating the arc with the output value as we did in
Figure 6.9, we annotate the vertices with the output
values. The reason for this change is a subtle difference in the
output behavior of the circuits. In the multiplexer loop of
Figure 6.8 the combinational logic drives the output
whereas in the counter loop of Figure 6.11
the register drives the output. Whereas the output of the multiplexer
loop depends on the present inputs, this is not the case for the
binary counter. Output \(Y\) of the binary counter is stable
during the entire clock cycle, independent of the *reset* input. The
output changes only at the positive clock edge. In the state
transition diagram in Figure 6.12 we emphasize this
behavior by marking arcs with the input value of the *reset* input,
and augment the state vertices with the output value.

As long as the *reset* input is 0, the counter circuit transitions
from state \(Y\) to state \((Y+1) \bmod 8\) at each triggering
clock edge. Thus, the state transition graph is a single cycle with
eight states. If the *reset* input is 1, however, the counter
transitions to state 0, independent of the present state. The reset
arcs form additional cycles in the state transition diagram. As shown
in Figure 6.12, the complete diagram has nine cycles.
We could merge the two outgoing arcs of state 7, however, because both
input values 0 and 1 cause a transition to state 0.

The state transition table is the equivalent representation of the
state transition diagram in Figure 6.12. For the
binary counter, it assumes the form of a truth table with one column
for the present state and the *reset* signal as inputs, and the next
state as output. Frequently, designers augment the state transition
table with the output specification, here with one column for output
\(Y\):

| state | reset | next state | output |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 2 | 1 |
| 1 | 1 | 0 | 1 |
| 2 | 0 | 3 | 2 |
| 2 | 1 | 0 | 2 |
| 3 | 0 | 4 | 3 |
| 3 | 1 | 0 | 3 |
| 4 | 0 | 5 | 4 |
| 4 | 1 | 0 | 4 |
| 5 | 0 | 6 | 5 |
| 5 | 1 | 0 | 5 |
| 6 | 0 | 7 | 6 |
| 6 | 1 | 0 | 6 |
| 7 | 0 | 0 | 7 |
| 7 | 1 | 0 | 7 |
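The next-state behavior of the counter can be double-checked with a short Python sketch; the `next_state` helper below is a hypothetical name for illustration, not part of the circuit:

```python
def next_state(state: int, reset: int) -> int:
    """Next-state logic of the 3-bit binary counter with synchronous reset."""
    return 0 if reset else (state + 1) % 8

# Reproduce the state transition table; output Y equals the present state.
for state in range(8):
    for reset in (0, 1):
        print(state, reset, next_state(state, reset), state)
```

The loop prints the sixteen rows of the table above, confirming the single eight-state cycle for reset = 0 and the transition to state 0 for reset = 1.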

In case of the binary counter, specifying output \(Y\) in the
state transition diagram or table is redundant, since the state and
the output are equal. This is not the case in general, however. Note
that we can deduce from the state transition table that the output is
driven by the register rather than the combinational logic, because
the output equals the present state, independent of the *reset* input.

### 6.2.3. Shift Register¶

An \(n\)-bit **shift register** has one serial input and \(n\)
parallel outputs. Its function can be described as a
*serial-to-parallel converter*. Figure 6.13
shows a 4-bit shift register with serial input \(S_{in}\) and four
parallel outputs \(Y_i,\) where \(0 \le i < 4.\) Some
implementations provide parallel output \(Y_{n-1}\) also as a
separate serial output \(S_{out}.\) At each positive clock edge,
the state is shifted by one D-flipflop to the right. After shifting
four bits during four consecutive clock cycles into the shift
register, all four bits are available at the parallel outputs.
Considering the serial input and serial output only, the shift
register implements a **FIFO**, short for first-in-first-out buffer.
The first of multiple bits shifted in at \(S_{in}\) is the first
bit to be shifted out at \(S_{out}.\) It takes four clock cycles
for a bit to propagate through the FIFO.

Figure 6.13: Interactive model of a 4-bit shift register.

The shift register is a synchronous sequential circuit, although the circuit diagram does not show any loops. The feedback loops are hidden inside the D-flipflops. In essence, a shift register is a series composition of D-flipflops, all triggered by the same clock signal. The state of the shift register can be observed at the outputs as binary number \(Y_3 Y_2 Y_1 Y_0.\) Figure 6.13 enables you to derive the state transition table of the 4-bit shift register:

| state | \(S_{in}\) | next state | output |
|---|---|---|---|
| 0 | 0 | 0 | 0000 |
| 0 | 1 | 8 | 0000 |
| 1 | 0 | 0 | 0001 |
| 1 | 1 | 8 | 0001 |
| 2 | 0 | 1 | 0010 |
| 2 | 1 | 9 | 0010 |
| 3 | 0 | 1 | 0011 |
| 3 | 1 | 9 | 0011 |
| 4 | 0 | 2 | 0100 |
| 4 | 1 | 10 | 0100 |
| 5 | 0 | 2 | 0101 |
| 5 | 1 | 10 | 0101 |
| 6 | 0 | 3 | 0110 |
| 6 | 1 | 11 | 0110 |
| 7 | 0 | 3 | 0111 |
| 7 | 1 | 11 | 0111 |
| 8 | 0 | 4 | 1000 |
| 8 | 1 | 12 | 1000 |
| 9 | 0 | 4 | 1001 |
| 9 | 1 | 12 | 1001 |
| 10 | 0 | 5 | 1010 |
| 10 | 1 | 13 | 1010 |
| 11 | 0 | 5 | 1011 |
| 11 | 1 | 13 | 1011 |
| 12 | 0 | 6 | 1100 |
| 12 | 1 | 14 | 1100 |
| 13 | 0 | 6 | 1101 |
| 13 | 1 | 14 | 1101 |
| 14 | 0 | 7 | 1110 |
| 14 | 1 | 15 | 1110 |
| 15 | 0 | 7 | 1111 |
| 15 | 1 | 15 | 1111 |

The state transition table confirms that the outputs are independent of input \(S_{in}.\) This is no surprise, because the outputs are driven by the D-flipflops. Besides, the shift register has no combinational logic other than wires. Therefore, when drawing the corresponding state transition diagram we associate the output with a state vertex and the value of input \(S_{in}\) with an arc, analogous to the state transition diagram of the binary counter in Figure 6.12. We leave it as an exercise to translate the state transition table into a state transition diagram. You may view the resulting state transition diagram as a directed graph, and find the shortest paths to transition from one state to another in order to minimize the number of clock cycles and the length of the serial input sequence. For example, given state 6 you need at least two cycles to transition to state 9. The path transitions through state 3 by first applying serial input 0 and then 1.
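The shortest-path view of the exercise can be sketched in Python; `next_state` and `shortest_input_sequence` are hypothetical helper names. A breadth-first search over the 16-state transition graph recovers the two-cycle path from state 6 to state 9:

```python
from collections import deque

def next_state(state: int, s_in: int) -> int:
    """Right shift: the serial input becomes the new bit 3."""
    return (state >> 1) | (s_in << 3)

def shortest_input_sequence(start: int, goal: int):
    """BFS over the 16-state transition graph of the 4-bit shift register."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, inputs = queue.popleft()
        if state == goal:
            return inputs
        for s_in in (0, 1):
            nxt = next_state(state, s_in)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, inputs + [s_in]))

print(shortest_input_sequence(6, 9))  # [0, 1]: via state 3
```

The search confirms the example from the text: starting in state 6, serial input 0 leads to state 3, and serial input 1 then leads to state 9.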

We can extend the shift register to function not only as a
*serial-to-parallel converter* but also as a *parallel-to-serial
converter*. To that end, we provide parallel data inputs and include
a multiplexer at each D-flipflop input. Figure 6.14 shows the extended 4-bit shift register. The *Load*
input selects the multiplexer inputs between the parallel data inputs
\(D_i,\) \(0 \le i < 4,\) and the original serial inputs.
When *Load* equals 1 at the triggering clock edge, all D-flipflops
store the values provided at the parallel data inputs. Otherwise,
when *Load* equals 0, the circuit performs a right shift. During each
clock cycle, the circuit outputs the data serially, bit-by-bit, at
output \(S_{out}.\)
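A minimal Python sketch of the parallel-to-serial operation, following the numeric convention of the state transition table of Figure 6.13, where the entering bit occupies bit position 3 and the bit at position 0 leaves the register; `step` is a hypothetical helper:

```python
def step(state: int, load: int, d: int = 0, s_in: int = 0) -> int:
    """One clock cycle of the extended 4-bit shift register.

    With load = 1 the 4-bit parallel data word d is stored; with
    load = 0 the register shifts, s_in entering at bit position 3.
    """
    if load:
        return d & 0xF
    return (state >> 1) | (s_in << 3)

# Parallel-to-serial conversion: load 0b1011, then shift four times.
state = step(0, load=1, d=0b1011)
bits = []
for _ in range(4):
    bits.append(state & 1)     # bit leaving the register this cycle
    state = step(state, load=0)
print(bits)  # the loaded word appears serially, lsb first: [1, 1, 0, 1]
```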

The shift register in Figure 6.14 can
be used to implement the ports of a serial communication channel, such
as RS-232. The sender uses
the *parallel-to-serial* functionality to load a byte of data into the
shift register, and transmit the byte serially across one wire to a
receiver. The receiver shifts the bits serially into its shift
register, and outputs the byte in parallel, using the
*serial-to-parallel* functionality.

### 6.2.4. Memories¶

The *memory elements* discussed so far,
including the *D-flipflop*, are relatively fast
circuits but require more transistors than necessary to store a single
bit. The hallmark of memory circuits is their density, i.e. the
number of bits stored per unit of chip area.

#### Memory Organization¶

There exist several different technologies to implement a 1-bit
memory element. However, independent of the technology, virtually all
memory architectures consist of a 2-dimensional array of such 1-bit
memory elements, also called **bit cells** in this context.
Figure 6.15 shows the block diagram of a \(4\times
3\) array of bit cells.

The array has an address bus \(A\) as input, and a data bus
\(D\) as input and output. In general, \(n\)-bit address
\(A\) drives an \(n{:}2^n\)-*decoder* to assert
one of \(2^n\) **wordlines**, thereby selecting one row of the
array of bit cells. We call a group of \(m\) data bits a
**word**, and consider a word the unit of data transfer in and out of
the memory array. In order to read one word out of one row of bit
cells, we apply the associated binary-coded address, and the
\(m\)-bit data bus outputs the word via the **bitlines**. If we
want to write a word into the memory, we drive the word onto the
**bitlines**, and apply the desired address.

Each bit cell is responsible for implementing a *tristate* behavior depending on the select input driven by
the wordline and the signal on the bitline. Figure 6.16 shows one bit cell with its select input \(S\)
connected to the wordline and data port \(D\) connected to the
bitline for input and output. If the wordline drives a 0 on select
input \(S,\) the cell disables data port \(D,\) indicated by
value \(Z\) in Figure 6.16. As a result the
cell is invisible to the bitline. Otherwise, if the wordline drives a
1 on select input \(S\) the cell is enabled. The enabled cell
performs a read or write operation depending on the state of the
bitline. In order to write the cell, we use a tristate buffer to
drive the desired value onto the bitline. This data input is enabled
with signal \(D_{en}.\) While \(D_{en} = 1,\) the tristate
buffer drives data input \(D_{in}\) onto the bitline, and writes
\(D_{in}\) into the enabled cell. On the other hand, if
\(D_{en}=0,\) the data input is disabled, and the enabled cell
makes its stored bit visible to the bitline, where we can observe it
at data output \(D_{out}.\)

Figure 6.16: Interactive model for reading and writing a bit cell.

When multiple rows of bit cells are arranged in an array, all cells in
a column are connected to the same bitline. To read one word from the
array, at most one row may make its stored bits visible on the
bitlines. The cells in all other rows must disable their data ports.
Figure 6.17 illustrates the interplay of the
rows in an array when reading one row. Note that the address decoder
guarantees that at any point in time exactly one row of cells is
enabled, because the decoder drives a *one-hot encoding* of the binary address onto the wordlines.
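The interplay of the decoder and the rows can be sketched in Python; `decode` and `read` are hypothetical helper names, and `None` stands in for the high-impedance value \(Z\) of a disabled data port:

```python
def decode(address: int, n: int):
    """n:2^n decoder: a one-hot wordline pattern for a binary address."""
    return [1 if row == address else 0 for row in range(2 ** n)]

def read(array, address):
    """Read one word: only the selected row drives the bitlines."""
    wordlines = decode(address, n=2)
    word = None                     # bitlines float if no row is enabled
    for row, select in enumerate(wordlines):
        if select:                  # all other rows leave the bitlines at Z
            word = array[row]
    return word

# A 4x3 array: four rows (depth), three bits per word (width).
array = [[0, 1, 0], [1, 1, 1], [0, 0, 1], [1, 0, 0]]
print(read(array, 2))  # [0, 0, 1]
```

Because `decode` asserts exactly one wordline, at most one row ever drives the bitlines, mirroring the one-hot guarantee of the address decoder.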

Figure 6.17: Interactive model for reading a 4x3 memory array.

#### Memory Technologies¶

If a memory cannot be written but read only, like the memory array in
Figure 6.17, it is called **read only memory**,
or **ROM** for short. ROMs are **nonvolatile** memories because they
retain their data if the power supply is switched off. Historically,
data were stored irreversibly in a ROM by burning a fuse in each bit
cell. Most ROMs today are programmable ROMs or **PROMs** that can be
written once. Popular examples for PROMs are CDs and DVDs. A memory that can be read and
written is called **random access memory** or **RAM**. A RAM is a
**volatile** memory that loses its data when the supply power is
switched off. The name *random access* is historically motivated to
distinguish a RAM, where each bit is assumed to have the same access
time, from *sequential access* memories like tapes, where a bit is
located in a particular position on the tape, and the mechanical
winding of the tape into the bit position determines its access time.
Today, two types of semiconductor RAMs are widespread, the *DRAM* and
the *SRAM*. We briefly introduce these two types of RAMs by
discussing their characteristic bit cell technologies.

A **dynamic RAM**, or **DRAM** for short, stores a bit in form of
charge on a capacitor. Figure 6.19 shows the DRAM bit cell
consisting of the capacitor and an nMOS transistor. If the capacitor
is discharged to *GND*, we interpret the bit stored in the cell
as logical 0. Otherwise, if the capacitor is charged to
\(V_{DD},\) we interpret the stored bit as logical 1. The nMOS
transistor controls the cell. If the wordline is 1, the nMOS
transistor enables the cell by connecting the capacitor to the
bitline. Otherwise, if the wordline is 0, the nMOS transistor is
switched off, and disables the cell by disconnecting the capacitor
from the bitline.

When reading the DRAM bit cell, the nMOS transistor connects the
capacitor to the bitline. If the capacitor is discharged, we observe
a zero voltage at the data output of the bitline. Otherwise, the
charge of the capacitor diffuses onto the bitline, and we observe a
logical 1 at the data output. Since the charge diffuses due to
reading, we need to restore the bit cell after reading it. This is
the reason why a DRAM is called *dynamic* RAM. Restoring the bit cell
simply means writing the read value by driving it onto the bitline
again, see Figure 6.16. The capacitor must be
large enough so that a positive charge can be detected at the data
output of the bitline. Figure 6.18 shows a *deep trench
capacitor*, where a deep hole etched into the silicon wafer is
filled with metal. Since real capacitors leak, the charge stored on
the capacitor must be restored periodically or the bit is lost.

The **SRAM** is a **static RAM** where the bit cell is stable because
its state is continuously replenished by the power supply. An SRAM
cell is essentially a *bistable inverter pair*.
As shown in Figure 6.20, we control the inverter pair with two
nMOS transistors and extend the memory array with a complemented
bitline. Since each CMOS inverter requires two transistors, the bit
cell as a whole consists of six transistors (6T). When the wordline
is 0, the nMOS transistors disconnect the inverter pair from the
bitlines. Otherwise, the bitlines are connected to the complemented
and uncomplemented bit \(Q\) stored in the cell. We read the bit
of the enabled cell at the data outputs of the bitlines. Writing a
bit requires drivers at the data inputs of the bitlines that are
stronger (larger) than the inverters of the cell, so as to overpower
the inverter pair into the desired state.

DRAM and SRAM memories are sequential circuits. However, neither
memory design requires a clock signal for its operation. Therefore,
DRAMs and SRAMs are asynchronous sequential circuits in principle. To
operate within a synchronous design, memory arrays are wrapped into
synchronous control logic to ensure that the read and write operations
stabilize within a clock cycle. Figure 6.21
shows a black-box synchronous memory module. It has a clock input, a
*write-enable* input WE, an \(n\)-bit address input, and an
\(m\)-bit data input and \(m\)-bit data output.

The read operation of a synchronous memory module is considered a combinational operation, independent of the triggering clock edge. Reset the WE input to 0, apply the desired address \(A,\) and the associated data word appears at data output \(D_{out}\) after the propagation delay within one clock cycle. In contrast, the write operation is synchronized by the clock. Set the WE input to 1, apply the desired address \(A\) and data input \(D_{in},\) and the data is written into memory at the triggering positive clock edge.
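A behavioral Python sketch of the synchronous memory module; the class and method names are hypothetical. Reads are combinational, while writes take effect only at the triggering clock edge with WE set:

```python
class SyncMemory:
    """Black-box synchronous memory: combinational read, clocked write."""

    def __init__(self, n: int):
        self.cells = [0] * (2 ** n)   # depth 2^n words

    def read(self, a: int) -> int:
        # Combinational: valid after the propagation delay, no clock edge.
        return self.cells[a]

    def clock_edge(self, we: int, a: int, d_in: int) -> None:
        # The write happens only at the triggering edge and only if WE = 1.
        if we:
            self.cells[a] = d_in

mem = SyncMemory(n=2)
mem.clock_edge(we=1, a=3, d_in=0b101)
mem.clock_edge(we=0, a=3, d_in=0b111)   # WE = 0: no write occurs
print(mem.read(3))  # 5
```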

The bus width of the address input specifies the number of rows, also
called **depth**, of the memory array. Given \(n\) address wires,
the \(n{:}2^n\) address decoder drives \(2^n\) wordlines, as
shown in Figure 6.15, so that the depth of the memory
array is \(2^n.\) The bus width of the data input and the data
output is equal to the number of columns, also called **width** or
**word size**, of the memory array. Thus, the memory array inside the
synchronous memory module in Figure 6.21 has a
width of \(m.\) The **capacity** of a memory is the number of bit
cells, i.e. *depth* \(\times\) *width* bits. The memory array in
Figure 6.15, for example, has 2 address bits, a depth
of \(2^2 = 4,\) a width of 3, and a capacity of \(4 \times 3
= 12\) bits.
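The depth, width, and capacity arithmetic can be captured in a small Python helper (a hypothetical name, for illustration):

```python
def memory_geometry(address_bits: int, width: int):
    """Return (depth, width, capacity) of a memory array."""
    depth = 2 ** address_bits          # one row per decoded address
    return depth, width, depth * width

# The 4x3 array of Figure 6.15: 2 address bits, 3-bit words.
print(memory_geometry(2, 3))  # (4, 3, 12) -> capacity of 12 bits
```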

#### Logic in Memories¶

Memories serve not only their purpose as a storage medium but are also
used as *universal* logic operators. Given an
\(n\)-variable logic function, we use a \(2^n \times 1\)
memory array with \(n\) address inputs and one data output as a
*lookup table* to read the function values. We can *configure* a RAM
to implement any logic function by storing one function value per row.
Essentially, the memory stores the truth table of a given
combinational function and we select the row by applying the desired
input combination. Figure 6.22 shows a 2-input
XOR function \(Y = A \oplus B,\) implemented by means of a
\(4\times 1\) memory array.

Figure 6.22: 2-input XOR gate implemented with a \(4\times 1\) memory array.

Memories can implement *multioutput functions* by including one column per output. Thus, a
multioutput function with \(m\) outputs can be realized with a
memory array of width \(m.\) Since RAMs permit configuring any
multioutput function that fits the depth and width of the memory
array, RAMs serve as lookup tables in reconfigurable logic devices,
including FPGAs.
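A Python sketch of the lookup-table idea, using a plain list for the \(4\times 1\) array; the names are hypothetical:

```python
# 4x1 memory array configured as a lookup table for Y = A xor B.
# The inputs form the address; each row stores one function value.
lut = [0, 1, 1, 0]            # truth table of XOR for rows 00, 01, 10, 11

def xor_via_lut(a: int, b: int) -> int:
    return lut[(a << 1) | b]  # inputs (A, B) drive the address decoder

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_via_lut(a, b))

# A multioutput function uses one column per output: a 4x2 array can
# store, for example, Sum and Carry of a half adder in each row.
half_adder = [(0, 0), (1, 0), (1, 0), (0, 1)]  # (Sum, Carry) per row
```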

Analyze the synchronous sequential NAND loop.

- Derive the next-state logic.
- Derive a state transition table.
- Derive a state transition diagram.
- Use a time table to analyze the transitions of input sequence \(A = \langle 0, 0, 1, 0, 0, 0, 1, 1, 1, 1 \rangle.\)

The sequential NAND loop has a register on the feedback path. The state of the present clock cycle is observable at register output \(Q.\) The next state is the value of the data input of the register, and is stored at the next positive clock edge. In the NAND loop, the next state equals output \(Y,\) and is a function of present state \(Q\) and input \(A\):

\[Y(A,Q) = \overline{A Q}\,.\]

This function is the combinational next state logic of the circuit.

The state transition table is the truth table of the next state logic:

| \(A\) | \(Q\) | \(Y\) |
|---|---|---|
| 0 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

The state transition diagram is the graphical representation of the state transition table. It has one vertex per state, here \(Q \in \{ 0, 1 \},\) and one arc per transition. We annotate the arcs with input \(A\) and output \(Y,\) separated by a slash. Note that output \(Y\) is also the next state of the NAND loop.

The diagram shows that the NAND loop transitions from state \(Q=0\) to state \(Q=1\) regardless of input \(A.\) Thus, when the circuit transitions into state \(Q=0,\) it remains there for one clock cycle before returning to state \(Q=1.\) It remains in state \(Q=1\) while input \(A=0,\) and transitions to state \(Q=0\) when input \(A=1.\) If we hold input \(A\) at value 1, the circuit toggles between states 0 and 1 at every positive clock edge. This behavior resembles the oscillation in Exercise 6.1.

We use a time table to study the behavior of the NAND loop over time given input sequence \(A\) for ten clock cycles, \(0 \le t < 10\):

| t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Q | X | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| A | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| Y | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 |

We do not need to know initial state \(Q[0]\) during clock cycle \(t = 0\) to conclude that next state \(Y[0]=1\) if \(A=0.\) Therefore, we mark \(Q[0] = X\) as unspecified. While \(A=0,\) the circuit stays in state 1, and toggles between states 0 and 1 while \(A=1.\)
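The time table can be reproduced with a few lines of Python, a hypothetical sketch of the clocked loop:

```python
def nand(a: int, q: int) -> int:
    return 1 - (a & q)

A = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1]  # input sequence, one value per cycle
q = 0        # any initial state works here, because A[0] = 0 forces Y[0] = 1
trace = []
for a in A:
    y = nand(a, q)   # next state, which is also the output of this cycle
    trace.append(y)
    q = y            # stored at the next positive clock edge
print(trace)  # [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
```

The printed trace matches the \(Y\) row of the time table.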

## 6.3. Finite State Machines¶

**Finite state machines**, or **FSM**s for short, are a subset of
synchronous sequential circuits with a finite number of states.
Although state machines with infinitely many states are of minor
practical relevance, the distinction hints at the theoretical
importance of FSMs as an abstract model for the study and design of
sequential circuits.[1] In this section, we discuss FSMs and
their use as a design methodology for synchronous sequential circuits.

### 6.3.1. Mealy and Moore Machines¶

Our study of *synchronous circuits* has
revealed nuances in the behavior of such circuits. In particular, the
*multiplexer loop* has an
output which is a function of the input and the state, whereas the
output of the *binary counter* is a function of
the state only. Since the output of the multiplexer loop is a
function of the input, the output may change during a clock cycle if
the input changes. In contrast, the output of the binary counter
remains stable during the clock cycle. We exploit this difference by
tailoring the annotations in the state transition diagrams of
Figure 6.9 and Figure 6.12 to their
respective needs. In fact, these circuits exemplify two distinct
types of synchronous sequential circuits.

#### Machine Models¶

FSMs are generalizing models of synchronous sequential circuits.
Figure 6.23 shows the **Mealy machine** model, named after
George H. Mealy, and
Figure 6.24 shows the **Moore machine** model, named after
Edward F. Moore.
Both machines have

- input bus \(I\) with \(m \ge 0\) signals \(I_0, I_1, \ldots, I_{m-1},\)
- output bus \(O\) with \(n \ge 0\) signals \(O_0, O_1, \ldots, O_{n-1},\)
- current state \(S\) stored in a \(k\)-bit state register with output signals \(S_0, S_1, \ldots, S_{k-1},\) \(k > 0,\) and capable of encoding up to \(2^k\) distinct states,
- next state \(S'\) with \(k\) signals \(S'_0, S'_1, \ldots, S'_{k-1},\)
- combinational next state logic \(\sigma,\) and
- combinational output logic \(\omega.\)

The next state logic of both Mealy and Moore machines is a
combinational function \(\sigma: \mathcal{B}^{m+k}\rightarrow
\mathcal{B}^k\) that computes next state \(S' = \sigma(S,I).\) The
next state computation receives the present or current state \(S\)
via the *feedback loop*. The loop includes the
state register triggered by clock signal \(\phi,\) which qualifies
both machines as *synchronous sequential circuits*.

The essential difference between the two machine models is the combinational output logic \(\omega.\) For the Mealy machine, output function \(\omega: \mathcal{B}^{m+k} \rightarrow \mathcal{B}^n\) computes output \(O = \omega(S,I)\) as a function of current state \(S\) and input \(I.\) In contrast, the output function of the Moore machine \(\omega: \mathcal{B}^k \rightarrow \mathcal{B}^n\) computes output \(O = \omega(S)\) as a function of current state \(S\) only. The only difference in the circuit diagrams is that the Mealy machine connects input terminal \(I\) to combinational output logic \(\omega\) whereas the Moore machine does not. As a matter of convention for this chapter, we distinguish combinational and sequential logic in circuit diagrams by drawing black-box subcircuits of combinational logic as rectangles with round corners.
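The two machine models can be sketched as small Python classes; the names are hypothetical, and one method call models one clock cycle:

```python
class Mealy:
    """O = omega(S, I); next state S' = sigma(S, I)."""
    def __init__(self, sigma, omega, state):
        self.sigma, self.omega, self.state = sigma, omega, state

    def cycle(self, i):
        o = self.omega(self.state, i)           # output depends on input
        self.state = self.sigma(self.state, i)  # stored at the clock edge
        return o

class Moore:
    """O = omega(S); next state S' = sigma(S, I)."""
    def __init__(self, sigma, omega, state):
        self.sigma, self.omega, self.state = sigma, omega, state

    def cycle(self, i):
        o = self.omega(self.state)              # output: state only
        self.state = self.sigma(self.state, i)
        return o

# The multiplexer loop as a Mealy machine: sigma and omega coincide.
mux = lambda q, i: i["D"] if i["S"] else q
loop = Mealy(sigma=mux, omega=mux, state=0)
print(loop.cycle({"S": 1, "D": 1}), loop.cycle({"S": 0, "D": 0}))  # 1 1

# The binary counter as a Moore machine: the output is the state itself.
counter = Moore(sigma=lambda s, reset: 0 if reset else (s + 1) % 8,
                omega=lambda s: s, state=0)
print([counter.cycle(0) for _ in range(4)])  # [0, 1, 2, 3]
```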

#### Machine Identification¶

It is not always immediately obvious which type of machine, Mealy or Moore, a synchronous sequential circuit implements. In fact, it may take several attempts at restructuring a circuit before it fits the structure of the Mealy machine in Figure 6.23 or the Moore machine in Figure 6.24. However, the distinguishing characteristic is usually straightforward to check, i.e. whether the output is a function of the input or not. Consider the multiplexer loop in Figure 6.8, for example. Output \(Y\) is a function of state \(Q\) and inputs \(D\) and \(S.\) Therefore, the circuit is a Mealy machine. But, what is its next state logic \(\sigma\) and what its output logic \(\omega\)? Since the multiplexer output drives both output \(Y\) and the next state input of the D-flipflop, the circuit in Figure 6.8 does not have the structure of the Mealy machine in Figure 6.23. Nevertheless, duplicating the multiplexer, as shown in Figure 6.25, reveals that the multiplexer loop is an optimized version of a vanilla Mealy machine that reuses the next state logic as output logic or vice versa.

As another example, consider the shift register in Figure 6.13. The outputs \(Y_0,\) \(Y_1,\) \(Y_2,\) and \(Y_3 = S_{out}\) are driven by the registers. Therefore, the outputs are a function of the current state but not a function of input \(S_{in}.\) We conclude that the circuit is a Moore machine. The topology of the circuit in Figure 6.13 is simpler than that of the equivalent vanilla Moore machine in Figure 6.26, where the next state logic \(\sigma\) and output logic \(\omega\) are identity functions.

Since Mealy and Moore machines differ in their output behavior only, we can often use either type to solve a given problem. However, if we wish to generate the output within the cycle of a particular state and input, then a Mealy machine is required. A Moore machine cannot change the output while in the particular state. Instead, in a Moore machine we must wait until the next triggering clock edge, where the machine assumes the next state as a function of the particular state and input, and can then produce the desired output. On the other hand, a Moore machine has a more robust timing behavior, because the consumer of the output can rely on the fact that the output is stable during the entire clock cycle after the propagation delay of the output logic has elapsed following the triggering clock edge. This delay is independent of changes of the input of the machine.

#### Machine Transformations¶

Occasionally FSM designers perceive a Moore machine as more natural than a Mealy machine. If a particular type of machine is desired, such perceptions can be overcome because we may translate one machine type systematically into the other. Figure 6.27 shows the two graphical rules for translating state \(A\) from a Mealy type into a Moore type state transition diagram and vice versa. Recall that we associate the output in a Mealy diagram with the input and in a Moore machine with the state.

Rule 1 states that state \(A\) in a Mealy machine can be translated into an equivalent state \(A\) in a Moore machine if all incoming transitions of the Mealy machine have the same output. In this case, we associate the output with state \(A\) of the Moore machine. Conversely, we translate state \(A\) of a Moore machine into a Mealy machine by associating output \(O\) of the Moore machine with all incoming transitions in the Mealy machine. Rule 2 covers the case where the incoming arcs of state \(A\) in the Mealy machine have different outputs. In this case, we replicate Mealy state \(A\) in the Moore machine for each distinct output. Conversely, we transform the Moore machine into a Mealy machine by merging Moore states \(A_1\) and \(A_2\) into Mealy state \(A,\) and by associating the outputs of the Moore states with the inputs of the Mealy transitions.

Rule 2 is the reason why, in general, Mealy machines require fewer states than Moore machines to solve a given problem. Hence, a Mealy machine may be constructed with fewer D-flipflops than the equivalent Moore machine. Whether the reduction in state bits leads to an overall savings in logic, including the next state and output logic, depends on the particular implementation, and requires a direct comparison of alternative designs.

### 6.3.2. Synthesis of Finite State Machines¶

The FSM designer uses the tools for the *analysis of synchronous
circuits*, in particular the state
transition diagram and the state transition table, to specify a FSM
for a given problem description, and to implement the FSM as a
synchronous sequential circuit. FSM synthesis may be viewed as
reversal of the analysis procedure. Although synthesis requires
more creativity than analysis, the synthesis of FSMs is a relatively
systematic process.

The design methodology for an FSM can be summarized as a 6-step recipe:

- Identify inputs and outputs.
- Design a state transition diagram.
- Select a state encoding.
- Derive the state transition table.
- Minimize the next state and output logic.
- Synthesize the circuit.

Most of a designer’s creativity flows into step 2, where a problem description, typically provided as an informal statement, is formalized by means of a state transition diagram either of Mealy type or of Moore type. The other five steps are comparatively straightforward applications of established techniques.

Before we illustrate the FSM design methodology by means of several examples, let us clarify the role of the state register as the central element of every FSM. The state register holds the current state \(S,\) which we may observe at its output \(Q.\) State \(S\) remains stable for the entire clock cycle between triggering clock edges. Next state \(S'\) is stored at the next triggering clock edge. The next state logic computes \(S'\) as a function of the inputs \(I\) and current state \(S.\) Next state \(S'\) must be stable before the next triggering clock edge so that the state register can store the next state reliably. In Moore machines, the state register decouples the outputs from the inputs. Since the outputs are a function of current state \(S\) only, the inputs affect the outputs only indirectly, through the next state. In contrast, in a Mealy machine, the inputs bypass the state register so that the outputs react to changes at the inputs within the current clock cycle.

#### Serial Adder¶

We wish to design a serial version of a *ripple carry adder*. Assume that the binary summands appear one bit
per clock cycle at inputs \(A\) and \(B,\) least significant
bits first. The adder shall output the sum bits one per clock cycle,
*lsb* first.

Following the *design recipe*, we find that step 1
is trivial whereas step 2 requires some thought. Recall that bit
position \(i\) might generate a carry into position \(i+1.\)
Since the summands appear at the inputs *lsb* first, imagine that the
*lsb*s appear in cycle 0, the bits of the next position 1 in cycle 1, and
so on. Thus, the bits of the summands in position \(i\) are
available in cycle \(i,\) and we need to make the carry out of
position \(i\) available as carry input during cycle \(i+1.\)
We conclude that we need a 1-bit register that stores the carry-out at
the next triggering clock edge, which separates current cycle
\(i\) and next cycle \(i+1\) in time. The output of the
register is the carry-in during cycle \(i+1.\) Since bit position
\(i\) generates a carry or not, a 1-bit register suffices to
distinguish these two states. Therefore, we decide to design the FSM
with a 1-bit state register. Furthermore, we decide to design a Mealy
machine.

We develop the state transition diagram of the Mealy machine, starting
with two state vertices for state S0, meaning the carry-in is 0, and
state S1 to represent a carry-in equal to 1 during the current clock
cycle. Figure 6.28(a) shows the two state
vertices. Next, we consider the state transitions. Since the current
state represents the carry-in, the next state represents the carry
out. Thus, our next state logic must implement the carry-out function
of a *full adder*. The carry-out, i.e. the next
state, is a combinational function that depends on the two inputs
\(A\) and \(B\) and the carry-in, i.e. the current state
\(S.\)

Recall that the carry-out of a full adder is 1 if two or three inputs
are 1, and otherwise the carry-out is 0. First, we consider the
outgoing transitions of state S0, see Figure 6.28(b).
The only input combination that can generate a carry-out of 1 is
\(A = B = 1.\) We draw an arc from S0 to S1, and annotate it with
input combination \((A,B)=11.\) The remaining three input
combinations do not generate a carry out. Therefore, we draw a
looping arc at state S0, and annotate it with the three remaining
input combinations. The outgoing transitions of state S1 are shown in
Figure 6.28(c). With a carry-in equal to 1, the only
input combination that causes the carry-out to be 0 is \(A = B =
0.\) We draw an arc from S1 to S0 for input combination \((A,B) =
00.\) The other three input combinations cause a carry-out of 1,
represented by the loop pointing from S1 to S1. In Figure 6.28(d), we add the outputs to each input combination. The
output of the serial adder is the sum bit. If the carry-in is 0,
i.e. in state S0, the sum is 1 if exactly one of the inputs \(A\)
or \(B\) is 1, otherwise the sum is 0. In state S1, representing
a carry-in of 1, the sum is 1 if inputs \(A\) and \(B\) are
equal. We also mark state S0 with another circle, indicating that S0
is the **start state** of the Mealy machine, because initially the
carry into *lsb* position 0 should be 0.

Step 3 of the FSM recipe requires selecting a state encoding. We
choose the natural encoding suggested by our design, \(S0 = 0\)
and \(S1 = 1,\) because the state register should store the value
of the carry. Using the opposite encoding is possible, but would be
rather confusing. Next, we derive the state transition table
according to step 4 of the FSM recipe. It shouldn’t come as a
surprise that converting the state transition diagram of
Figure 6.28(d) into a state transition table yields
the truth table for the carry-out of a *full adder*,
where the current state is the carry-in:

| state | A | B | next state | Sum |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 |

We have included the Sum output of the serial adder in the state transition table, because it adds just one more column to the truth table. Since the output of a Mealy machine is a function of the current state and the inputs, both of which the next state computation needs as well, this is faster and less error-prone than compiling a separate truth table for the output function. Note that the translation of the state transition diagram into the state transition table is a straightforward transliteration of the directed graph into truth table format. No knowledge specific to the purpose of our adder design is required, except the state encoding.

Step 5 of the FSM recipe asks us to minimize the next state and output
logic. To that end, we apply the techniques developed for
combinational logic design, for example *K-maps*. The
K-maps for the next state \(S'\) and the *Sum* output yield the
well-known SOP expressions of the *majority* and *odd-parity* functions:

\[S' = A\,B + A\,S + B\,S\,,\qquad
Sum = \overline{A}\,\overline{B}\,S + \overline{A}\,B\,\overline{S} + A\,\overline{B}\,\overline{S} + A\,B\,S\,.\]

Without further logic optimizations or technology mapping, we complete step 6 of the FSM recipe, and synthesize a circuit diagram for the serial adder using two-level combinational logic. Figure 6.29 shows the resulting Mealy machine in the topological form of Figure 6.23. The next state logic \(\sigma\) is the majority function and output logic \(\omega\) is a 3-variable XOR function.

Figure 6.29 completes the FSM design of the serial
adder. Note that the serial adder consists of the combinational logic
of one full adder plus one register. Compared to a *ripple carry
adder*, which requires \(n\) full adders to
add two \(n\)-bit summands, our serial adder FSM requires only one
full adder, independent of \(n.\) However, our serial adder FSM
requires \(n\) clock cycles to produce all sum bits. This is a
typical example of the trade-off between space and time in digital
logic.
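The cycle-by-cycle behavior of the serial adder is easy to check in software. The following Python sketch (our own illustration, not part of the original design flow) simulates the Mealy machine with the carry as the 1-bit state; the function name `serial_add` is our invention:

```python
# Serial adder as a Mealy FSM: the 1-bit state s holds the carry,
# the next state logic is the majority function, and the Sum output
# is the odd-parity (3-input XOR) function.
def serial_add(a_bits, b_bits):
    """Add two equal-length bit sequences, least significant bit first."""
    s = 0  # start state S0: carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ s)          # output logic: 3-variable XOR
        s = (a & b) | (a & s) | (b & s)     # next state logic: majority
    return sum_bits, s                      # sum bits LSB-first, final carry

# 6 + 7 = 13: LSB-first, [0,1,1] + [1,1,1] -> sum [1,0,1] with carry 1
bits, carry = serial_add([0, 1, 1], [1, 1, 1])
```

Feeding \(n\)-bit summands takes \(n\) iterations of the loop, mirroring the \(n\) clock cycles of the hardware machine.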

Now that we have a Mealy machine, we have two options to design a
Moore machine for the serial adder. Option 1 is to start from
scratch, working our way through the six steps of the FSM recipe.
However, step 2 of the recipe requires more creativity for a Moore
machine than for the Mealy machine. This becomes clear after working
through Option 2, deriving the Moore machine from the Mealy machine
using *machine transformations*. The
state transition diagram of the Mealy machine in Figure 6.28(d) shows four incoming transitions for each state with different
outputs, 0 or 1. Therefore, we apply Rule 2 and split each Mealy state into two Moore states, as shown
in Figure 6.30. Moore state \(S0_0\) represents
a carry-in of 0 with sum output 0 and Moore state \(S0_1\)
represents a carry-in of 0 with sum output 1. Analogously, Moore
states \(S1_0\) and \(S1_1\) represent a carry-in of 1 with
the corresponding outputs.

The state transition diagram of the Moore machine in Figure 6.30 is significantly more complex than the corresponding diagram of the Mealy machine. This is, of course, why we present the design of the Mealy machine for the serial adder to begin with. We leave it as an exercise to complete the design of the Moore FSM.

#### Pattern Recognizer¶

We receive a periodic binary signal and wish to recognize patterns of
two consecutive 1’s. To solve this problem, we follow the *FSM
design recipe*.

At first glance, the problem description appears to be rather vague. However, step 1 of the recipe gives us a chance to tie up loose ends by defining the inputs and outputs of our pattern recognizer machine. The binary signal constitutes a 1-bit input; call it \(A.\) Since the binary signal is periodic, we assume a clock signal \(\phi\) with the period of input \(A.\) To indicate recognition of two consecutive 1’s in input \(A,\) we attach a 1-bit output, say \(Y,\) to the machine. Our goal is to set output \(Y\) to 1 after recognizing two consecutive 1’s in \(A\) and to 0 otherwise. The block diagram on the right summarizes our findings.

Following step 2 of the design recipe, we model the problem as a state transition diagram. With little information in the problem description to go on, our best bet is to examine the sampling behavior of the input signal with a concrete example. As shown in Figure 6.31, we assume that clock \(\phi\) and input \(A\) have the same period. Furthermore, to facilitate reliable sampling, input signal \(A\) must be stable before and at the positive clock edge. As shown in the waveform diagram, we assume that signal \(A\) changes shortly after the positive clock edge. Thus, \(A\) is stable during most of the clock cycle and, in particular, at the triggering clock edge at the end of the cycle. This timing behavior gives our machine enough slack to compute the next state during the clock cycle, and store the next state at the end of the cycle. Under these assumptions, we count time in the waveform diagram in multiples of clock cycles.

To simplify the problem even further, we ignore the details of the timing behavior, and use abstract binary sequences for the input and output signals of the machine. The input sequence in Figure 6.31 is

In the following, we refer to element \(i\) of sequence \(A\) as \(A[i],\) and interpret \(A[i]\) as the Boolean value of input signal \(A\) during cycle \(i.\) For example, \(A[0] = 0\) and \(A[2] = 1.\)

Next, we develop the desired output signal for input sequence
\(A.\) Assume output \(Y\) is initially 0, until we have seen
the first pair of inputs \(A[0] = 0\) in cycle 0 and \(A[1] =
0\) in cycle 1. Since two consecutive 0’s differ from the 11-pattern
we are looking for, we set the output to 0. Since we need to inspect
the inputs during cycles 0 and 1, it seems natural to assign 0 to
output \(Y\) during cycle 2 after receiving two input bits. Thus,
we pursue a timing behavior where the machine inspects the first input
bit \(A[t]\) during cycle \(t,\) the second input bit
\(A[t+1]\) during cycle \(t+1,\) and asserts the corresponding
output \(Y[t+2]\) during cycle \(t+2.\) In Figure 6.31, the first pair of consecutive 1’s is
\(A[2]=1\) and \(A[3]=1,\) so that cycle 4 is the first cycle
where we set \(Y[4]=1.\) During cycle 5 we set \(Y[5]=0,\)
because \(A[3] =1\) and \(A[4]=0.\) The second pair of
consecutive 1’s appears in cycles 8 and 9, \(A[8]=1\) and
\(A[9]=1.\) Therefore, we set \(Y[10]=1.\) Cycle 9 presents
a notable case, because \(A[9]=1\) and \(A[10]=1.\) Hence, we
set \(Y[11]=1.\) This case is notable, because input
\(A[9]=1\) belongs to two overlapping pairs of 11-patterns. We
interpret our problem description to recognize 11-patterns as
recognizing *every* 11-pattern, and include overlapping patterns. As
a result our pattern recognizer should produce the output sequence

for input sequence \(A.\)

The sequence abstraction is crucial for the design of a finite state machine. First, however, we have to decide on the type of the machine, Mealy or Moore machine. Our earlier decision to assign the output during the cycle after inspecting two consecutive input bits implies a Moore machine. The waveform diagram in Figure 6.31 clarifies this implication. Consider cycle 4, where output \(Y[4]=1\) in response to recognizing 11-pattern \(A[2] = A[3] = 1.\) Output \(Y\) should be independent of input \(A,\) or \(Y\) might switch to 0 during cycle 4 when \(A\) switches from 1 to 0. In a Moore machine, output \(Y\) is independent of the input and, therefore, switches at the triggering clock edges only, as indicated by the arrows in Figure 6.31.

We are ready to design the state transition diagram for a Moore type pattern recognizer. Since the number of required states is not immediately obvious, we develop the states incrementally. We begin with start state S0, where we wait for the first 1 to appear in input \(A.\) When the first 1 occurs, we transition to a new state S1. Otherwise, we stay in state S0 and output \(Y=0.\) Figure 6.32(a) shows start state S0 and its two outgoing transitions. Next, consider state S1. We transition into S1 after observing one 1 on input \(A.\) Since we have not observed two 1’s, we output \(Y=0.\) If we observe input 0 while in S1, we return to S0, and wait for the next 1 to appear. Otherwise, if we observe input 1, we transition to new state S2, see Figure 6.32(b). In state S2, we have received two consecutive 1’s in the two preceding cycles. Therefore, we output \(Y=1.\) The transitions out of state S2 are shown in Figure 6.32(c). If we observe input 1, we encounter the case of overlapping 11-patterns. We remain in state S2 and output \(Y=1.\) Otherwise, if input \(A=0,\) we return to state S0 and wait for the next 1 to appear in the input. This completes the state transition diagram.

We continue with step 3 of the design recipe, and select a state encoding. The Moore machine in Figure 6.32(c) has three states. Thus, we need at least \(\lceil \lg 3\rceil = 2\) bits to encode three states. We choose a binary encoding of the state number:

| state | binary encoding |
|---|---|
| S0 | 00 |
| S1 | 01 |
| S2 | 10 |

The binary encoding seems natural, but is not necessarily the best choice. Other choices may reduce the cost of the next state and output logic. However, the binary state representation minimizes the width of the state register. For our Moore machine, we need a 2-bit state register with binary state \(S = S_1 S_0.\)

Given the state encoding, we can develop the state transition table according to step 4 of the design recipe. The table defines the state \(S' = S_1' S_0'\) as a function of the current state \(S = S_1 S_0\) and input \(A.\) For clarity, we also include columns for the corresponding state names used in the state transition diagram of Figure 6.32(c).

| \(S\) | \(S_1\) | \(S_0\) | \(A\) | \(S'\) | \(S_1'\) | \(S_0'\) |
|---|---|---|---|---|---|---|
| S0 | 0 | 0 | 0 | S0 | 0 | 0 |
| S0 | 0 | 0 | 1 | S1 | 0 | 1 |
| S1 | 0 | 1 | 0 | S0 | 0 | 0 |
| S1 | 0 | 1 | 1 | S2 | 1 | 0 |
| S2 | 1 | 0 | 0 | S0 | 0 | 0 |
| S2 | 1 | 0 | 1 | S2 | 1 | 0 |

The state transition table is a truth table that specifies the combinational next state logic. More succinctly, the next state logic is a multioutput function \(S' = \sigma(S, A)\) where \(S_1'\) and \(S_0'\) are Boolean functions in three variables, \(S_1,\) \(S_0,\) and \(A.\)

Furthermore, we compile a truth table for the output logic. Output \(Y\) of the Moore machine is a function of the current state \(S = S_1 S_0.\) The state transition diagram in Figure 6.32(c), specifies in each state vertex the associated output, which we summarize in a separate truth table.

| \(S\) | \(S_1\) | \(S_0\) | \(Y\) |
|---|---|---|---|
| S0 | 0 | 0 | 0 |
| S1 | 0 | 1 | 0 |
| S2 | 1 | 0 | 1 |

Following step 5 of the design recipe, we minimize the next state and output logic. We use K-maps for the incompletely specified functions \(S_1'(S_1,S_0,A),\) \(S_0'(S_1,S_0,A),\) and \(Y(S_1,S_0),\) treating the unused state \(S = 11_2\) as a don't-care, to derive the minimal SOP expressions

\[S_1' = S_1\,A + S_0\,A, \qquad S_0' = \overline{S}_1\,\overline{S}_0\,A, \qquad Y = S_1.\]
Last but not least, we obtain the circuit diagram in Figure 6.33, which realizes step 6 of the design recipe. The
output logic is the trivial identity function. To enter start state
S0, we assume that the register has a separate *reset* input that
enables us to reset all D-flipflops.
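As a sanity check, the Moore machine can be simulated directly from its state transition table. The Python sketch below (the names `MOORE_NEXT`, `MOORE_OUT`, and `recognize_11_moore` are our own) replays an input stream and records the output of every cycle, including the cycle after the last input arrives:

```python
# Moore recognizer for two consecutive 1's, built from the state
# transition table in the text (states S0, S1, S2).
MOORE_NEXT = {                      # (state, input A) -> next state
    ('S0', 0): 'S0', ('S0', 1): 'S1',
    ('S1', 0): 'S0', ('S1', 1): 'S2',
    ('S2', 0): 'S0', ('S2', 1): 'S2',
}
MOORE_OUT = {'S0': 0, 'S1': 0, 'S2': 1}  # Y is a function of the state only

def recognize_11_moore(stream):
    state, outputs = 'S0', []
    for a in stream:
        outputs.append(MOORE_OUT[state])  # output during the current cycle
        state = MOORE_NEXT[(state, a)]    # state update at the clock edge
    outputs.append(MOORE_OUT[state])      # output one cycle after the last input
    return outputs

recognize_11_moore([1, 1, 0, 1, 1, 1])   # -> [0, 0, 1, 0, 0, 1, 1]
```

The output list has one more element than the input, reflecting the one-cycle latency of the Moore machine.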

Now that we have a complete design for the pattern recognizer as a Moore machine, we amortize our intellectual investment by designing a Mealy machine for direct comparison. We restart the design process at step 2 of the recipe. Figure 6.34 shows the waveform diagram with the same input signal as in Figure 6.31, assuming that output \(Y\) depends not only on the state but also on input \(A.\) We exploit the combinational dependence on input \(A\) to redefine the timing behavior of output \(Y,\) such that \(Y=1\) during the clock cycle when the second 1 appears on input \(A.\) More precisely, if \(A[t] = 1\) during cycle \(t\) and \(A[t+1] = 1\) during cycle \(t+1,\) then we can set \(Y[t+1]=1\) in a Mealy machine, which is one cycle earlier than in the Moore machine.

Consider the first occurrence of two consecutive 1’s in cycles 2 and 3.
The Mealy machine sets output \(Y\) to 1 during cycle 3 already.
Note that output \(Y\) will remain 1 at the beginning of cycle 4
until \(A\) changes from 1 to 0. Indeed, the timing behavior of
output \(Y\) of the Mealy machine is not as clean as that of the
Moore machine. The output may even exhibit *glitches*
as at the beginning of cycle 6. However, if we sample output
\(Y\) toward the end of the cycle, then the behavior is as
expected: we observe \(Y=1\) at the positive clock edges at the
end of cycle 3, cycle 9, and cycle 10.

The state transition diagram of the Mealy machine requires only two states. State S0 is the start state, which waits for the first 1 to appear on input \(A.\) When \(A=1,\) we transition to state S1, and output \(Y=0.\) If \(A=1\) in state S1, the 1 input must have been preceded by another 1, so we output \(Y=1\) and remain in state S1. If input \(A = 0\) in state S0 or S1, the machine outputs \(Y=0\) and transitions to next state S0. The Mealy machine saves one state compared to the Moore machine.

The remaining steps of the design recipe involve straightforward design work. Since the Mealy machine has two states, we need a 1-bit register to store \(S=0\) representing state S0 and \(S=1\) for state S1. We translate the state transition diagram in Figure 6.35 into the combined truth table for the next state logic and output logic:

| \(S\) | \(A\) | \(S'\) | \(Y\) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 |

We can spot the minimal Boolean expressions in the truth table without the aid of K-maps: next state \(S' = A\) and output \(Y = A \cdot S.\) The resulting circuit diagram of the Mealy machine is shown in Figure 6.36. The machine is a degenerate Mealy machine without a feedback loop.
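The minimized equations \(S' = A\) and \(Y = A \cdot S\) translate into a three-line simulation. The Python sketch below (function name `recognize_11_mealy` is ours) shows the Mealy machine asserting its output one cycle earlier than the Moore machine:

```python
# Mealy recognizer for two consecutive 1's, using the minimized logic:
# next state S' = A, output Y = A AND S.
def recognize_11_mealy(stream):
    s, outputs = 0, []
    for a in stream:
        outputs.append(a & s)  # Y depends on current state AND current input
        s = a                  # next state simply stores the current input
    return outputs

recognize_11_mealy([1, 1, 0, 1, 1, 1])  # -> [0, 1, 0, 0, 1, 1]
```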

Comparison with the Moore machine in Figure 6.33 shows that the Mealy machine needs one D-flipflop rather than two and only one gate rather than four. These hardware savings come at the expense of a noisy timing behavior. If the clean timing behavior of the Moore machine output is desired, we can have it albeit at increased hardware cost.

#### Traffic Light¶

Perhaps the seminal application of finite state machines is the traffic light controller. The animation in Figure 6.37 shows the baroque Austrian style, which ends the green phase in a flashy trill.

Figure 6.37: Austrian traffic light animation.

Designing a controller FSM for the traffic light requires determining a suitable clock period. A simple, natural choice is the shortest duration that any light is switched on or off. Assume that the blinking green light at the end of the green phase determines the clock period \(T,\) and all other phases last a multiple of \(T\):

| phase | duration |
|---|---|
| red | \(13\,T\) |
| red-yellow | \(4\,T\) |
| green | \(5\,T\) |
| green blinking | \(2\,T \times 4\) |
| yellow | \(4\,T\) |

Given this problem description, we design an Austrian traffic light
controller following the *FSM recipe*. First, step
1, we define the inputs and outputs. Other than the clock signal, our
rudimentary controller receives no data inputs. Standard traffic
lights consist of three colored lights, red, yellow, and green. The
controller switches each of the three lights on or off. Therefore, we
attach three 1-bit outputs, \(R,\) \(Y,\) and \(G.\) If
the output is 1, the associated light shall be switched on and
otherwise off.

Next, step 2 of the *FSM recipe*, we design a state transition diagram. Since the
controller has no input signals, we have no choice but to construct
a Moore machine. We introduce one state per clock cycle, for a total
of 34 states. The green blinking phase consists of four times two
cycles, one cycle green off plus one cycle green on. Figure 6.38 shows the Moore type state transition diagram. If an
output equals 1, we include the associated output letter in the state
vertex. Absence of the output letter means the output shall be 0.

To encode 34 states, step 3 of the *FSM recipe*, we need \(\lceil \lg 34\rceil = 6\)
bits if we use a binary state encoding \(S = S_5 S_4 S_3 S_2 S_1
S_0.\) The number of states is large enough to get us into trouble
with the minimization of the next state and output logic. K-maps for
six variables do exist, but algorithmic minimization methods may be
preferable in this case. Rather than following the FSM recipe, we
develop a more creative solution to turn the state transition diagram
into a synchronous circuit. We note that the state sequence equals
that of a *binary counter* modulo 34. Thus,
rather than deriving minimal expressions for each of the six next
state functions, we extend the binary counter in Figure 6.11 to generate the desired state sequence. We
transition from state S33 to S0 by asserting the *reset* signal if the
current state is equal to \(33_{10} = 100001_2,\) or in terms of
Boolean algebra:

\[reset \,=\, S_5\,S_0,\]

since S33 is the only one of the 34 reachable states in which both
\(S_5\) and \(S_0\) equal 1.

Minimizing the output functions individually, we find the SOP expressions as functions of the current state:

The resulting circuit diagram of the traffic controller is shown in Figure 6.39, with the output logic hidden in a black box.
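Since the synthesized output logic is hidden in the black box, a behavioral model is a convenient way to check the design. The Python sketch below (our own model, not the synthesized logic; function names are our invention) combines a modulo-34 state counter with an output lookup derived from the phase duration table:

```python
# Behavioral model of the traffic light controller: a modulo-34
# counter provides the state, a phase lookup provides (R, Y, G).
def traffic_state_sequence(cycles):
    states, s = [], 0
    for _ in range(cycles):
        states.append(s)
        s = 0 if s == 33 else s + 1   # reset from S33 back to S0
    return states

def traffic_outputs(s):
    if s < 13:                        # red phase, 13 cycles
        return (1, 0, 0)
    elif s < 17:                      # red-yellow phase, 4 cycles
        return (1, 1, 0)
    elif s < 22:                      # green phase, 5 cycles
        return (0, 0, 1)
    elif s < 30:                      # green blinking: 4x (off, on)
        return (0, 0, (s - 22) % 2)
    else:                             # yellow phase, 4 cycles
        return (0, 1, 0)
```

The phase boundaries 13, 17, 22, and 30 follow directly from the cumulative durations \(13, 4, 5,\) and \(2 \times 4\) cycles in the table above.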

This example illustrates the design problem with monolithic finite
state machines. If the number of states becomes too large, steps 4-6
of the *FSM design recipe* become unmanageable. At
this point, we may compose the machine out of smaller machines. We
discuss the *composition of FSMs* as an
alternative design methodology below.

A **divide-by-2 frequency divider** asserts output \(Y\) for
one clock cycle out of 2:

- Design a synchronous divide-by-2 frequency divider.
- Verify your circuit by deriving its state transition diagram.

The waveform of the divide-by-2 frequency divider suggests that output \(Y\) toggles between 0 and 1 at every positive clock edge. Thus, we may use a register to store present state \(Q,\) and use an inverter to toggle next state \(Q' = \overline{Q}.\) The sequential circuit below connects output \(Y\) to the output of the inverter, so that \(Y = Q'.\)

The problem is simple enough to ignore the *FSM recipe* and derive the frequency divider machine in an ad hoc fashion. We analyze our machine design by deriving its state transition diagram. Since the state register stores one of two states \(Q \in \{ 0, 1\},\) the transition diagram has two vertices. The machine has no inputs other than clock signal \(\phi.\) Whenever the machine is in state \(Q = 0\) it transitions to state \(Q = 1\) and vice versa. Therefore, we annotate the arcs with the corresponding output \(Y=Q'\) only.

The state transition diagram enables us to grasp the behavior of the frequency divider easily. The machine toggles between states 0 and 1, and outputs 1 in state 0 and 0 in state 1. This is the expected behavior given the waveform above. In fact, we could have connected output \(Y\) to the output of the register rather than the output of the inverter, so that \(Y = Q.\) The behavior observed at output \(Y\) would be the same as for \(Y=Q',\) if we ignore the startup behavior of the machine. The only change in the state transition diagram would be complementing the outputs annotated at the arcs.
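The register-plus-inverter circuit can be replayed in a few lines of Python (a sketch of our own; the function name is our invention) to confirm the toggling behavior:

```python
# Divide-by-2 frequency divider: a 1-bit register plus an inverter,
# with output Y taken from the inverter, Y = Q'.
def divide_by_2(cycles):
    q, ys = 0, []              # register reset to 0
    for _ in range(cycles):
        q_next = 1 - q         # inverter: next state is the complement
        ys.append(q_next)      # Y = Q', the inverter output
        q = q_next             # register stores Q' at the clock edge
    return ys

divide_by_2(4)  # -> [1, 0, 1, 0]
```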

A **divide-by-3 frequency divider** asserts output \(Y\) for
one clock cycle out of 3.

Design a state machine for the frequency divider following the
*FSM recipe*:

- Identify inputs and outputs.
- Design a state transition diagram.
- Select a state encoding.
- Derive the state transition table.
- Minimize the next state and output logic.
- Synthesize the circuit, and determine the cost of the logic.

Given the waveform diagram, the frequency divider needs no input other than clock signal \(\phi,\) and one output signal \(Y.\)

The creative part of the design process consists of identifying the states for the state transition diagram. The divide-by-3 frequency divider exhibits periodic behavior with a period of 3 clock cycles. We set output \(Y=1\) during one of the 3 cycles and \(Y=0\) during the other two cycles of one period. Therefore, we introduce three states, \(S0,\) \(S1,\) and \(S2,\) corresponding to each of the three clock cycles:

We choose state \(S0\) to set output \(Y=1.\) This decision is arbitrary in the sense that the observer of output \(Y\) cannot distinguish the particular choice of the state unless the startup behavior of the machine is relevant. The states transition at every positive clock edge, so that each state is visited for exactly one clock cycle, and the output transitions accordingly. Since we associate an output with each state, our state transition diagram models a Moore machine.

To encode 3 states we need at least \(\lceil \lg 3\rceil = 2\) bits. We choose the natural binary encoding where \(S = S_1 S_0\) is an unsigned binary number:

| state | \(S_1\) | \(S_0\) |
|---|---|---|
| \(S0\) | 0 | 0 |
| \(S1\) | 0 | 1 |
| \(S2\) | 1 | 0 |

Given the state encoding, we derive the state transition table by transliterating the state transition diagram. For each current state \(S,\) we include the next state \(S'\) and output \(Y\):

| \(S_1\) | \(S_0\) | \(S'_1\) | \(S'_0\) | \(Y\) |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |

This table specifies next state functions \(S'_1(S_1, S_0)\) and \(S'_0(S_1, S_0)\) and output function \(Y(S_1,S_0).\) These functions are incompletely specified, because the state transition table omits state \(S = 11_2.\)

We minimize the next state and output logic using K-map minimization for incompletely specified functions:

We find minimal covers \(S'_1 = S_0,\) \(S'_0 = \overline{S}_1\,\overline{S}_0,\) and \(Y = \overline{S}_1\,\overline{S}_0.\) Note that \(Y = S'_0,\) which is obvious because the corresponding columns in the state transition table are identical. Thus, we could have skipped the minimization of \(Y\) altogether.

We synthesize the frequency divider circuit at the gate level. The schematic includes a 2-bit state register, and the minimized next state logic and output logic using NOR technology:

The cost of the combinational logic of the frequency divider is the cost of two 2-input NOR gates for a total of \(\mathcal{C} = 4.\) The cost of the machine is the cost of the logic plus two 1-bit registers.
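A quick simulation of the minimized design (a sketch of our own, with the single shared NOR gate modeled explicitly) confirms that the machine asserts \(Y\) once every three cycles:

```python
# Divide-by-3 frequency divider from the minimized covers:
# S1' = S0, S0' = NOR(S1, S0), and Y = NOR(S1, S0) = S0'.
def divide_by_3(cycles):
    s1 = s0 = 0                 # start state S0 = 00
    ys = []
    for _ in range(cycles):
        nor = 1 - (s1 | s0)     # one NOR gate shared by S0' and Y
        ys.append(nor)          # Y asserted in state 00 only
        s1, s0 = s0, nor        # 2-bit state register update
    return ys

divide_by_3(6)  # -> [1, 0, 0, 1, 0, 0]
```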

Redesign the divide-by-3 frequency divider of Exercise 6.4 with a one-hot state encoding.

Follow the *FSM recipe*:

- Identify inputs and outputs.
- Design a state transition diagram.
- Use a one-hot state encoding.
- Derive the state transition table.
- Minimize the next state and output logic.
- Synthesize the circuit, and determine the cost of the logic.

We redesign the divide-by-3 frequency divider of
Exercise 6.4. This exercise requests a
*one-hot state encoding*. Steps (1) and (2)
of the *FSM recipe* remain unaffected by the
state encoding. Therefore, we reuse the state transition diagram
of step (2) in Exercise 6.4, and continue with
step (3) of the FSM recipe.

A one-hot encoding of three states \(S0,\) \(S1,\) and \(S2,\) sets one bit out of three to 1. Therefore, our one-hot code words consist of three bits, \(S = S_2 S_1 S_0,\) such that:

| state | \(S_2\) | \(S_1\) | \(S_0\) |
|---|---|---|---|
| S0 | 0 | 0 | 1 |
| S1 | 0 | 1 | 0 |
| S2 | 1 | 0 | 0 |

Given the one-hot state encoding, we derive the state transition table by transliterating the state transition diagram of Exercise 6.4 into a truth table:

| \(S_2\) | \(S_1\) | \(S_0\) | \(S'_2\) | \(S'_1\) | \(S'_0\) | \(Y\) |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 |

The state transition table specifies four incompletely specified Boolean functions in three variables. We minimize these functions using K-map minimization:

We find the minimal covers \(S'_2 = S_1,\) \(S'_1 = S_0,\) \(S'_0 = S_2,\) and \(Y = S_0 (= S'_1).\)

The synthesized circuit requires a 3-bit register and no logic gates:

The cost of the combinational logic is \(\mathcal{C} = 0.\) Compared to the binary encoding in Exercise 6.4, the one-hot encoding eliminates the logic cost entirely at the expense of one additional 1-bit register. Note that the one-hot frequency divider is a cyclically wired *shift register*.
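The cyclic shift register interpretation is immediate in simulation. The sketch below (ours) rotates the single hot bit through the 3-bit state:

```python
# One-hot divide-by-3 divider: a cyclically wired 3-bit shift register.
# Minimized covers: S2' = S1, S1' = S0, S0' = S2, and Y = S0.
def divide_by_3_onehot(cycles):
    s2, s1, s0 = 0, 0, 1         # start state S0 = 001
    ys = []
    for _ in range(cycles):
        ys.append(s0)            # Y = S0, no logic gates needed
        s2, s1, s0 = s1, s0, s2  # rotate the hot bit at the clock edge
    return ys

divide_by_3_onehot(6)  # -> [1, 0, 0, 1, 0, 0]
```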

Given a bit-serial input stream, design an **odd parity checker**
that asserts its output when the input stream contains an odd
number of 1’s.

Follow the *FSM recipe*:

- Identify inputs and outputs.
- Design a state transition diagram.
- Select a state encoding.
- Derive the state transition table.
- Minimize the next state and output logic.
- Synthesize the circuit.

The black-box diagram of the parity checker shows one input \(A\) and one output \(Y.\) We assume that the clock signal is supplied implicitly.

We design a state transition diagram for the *odd parity* checker with bit-serial input \(A\) and output \(Y.\) The specification requires that output \(Y\) is 1 when the number of 1’s in the input stream is odd. Before formalizing the specification, we clarify several details. To observe the number of 1’s in the input bit stream, we need to specify the start of the observation. Thus, we introduce clock cycle \(t,\) and let \(t = 0\) be the start cycle. Assuming our machine inspects input \(A[t-1]\) during cycle \(t-1,\) we set \(Y[t] = 1\) if the number of 1’s in sequence \(A[0{:}t-1] = \langle A[0], A[1], \ldots, A[t-1]\rangle\) is odd. A time table illustrates our design:

| t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Y | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |

During start cycle \(t=0,\) we set output \(Y=0,\) indicating that no 1’s have been detected before the start cycle. Note that zero is an even number. The machine inspects input \(A[0] = 0,\) and computes the next output \(Y[1]=0,\) because the number of 1’s in sequence \(A[0{:}0] = A[0]\) is zero. During cycle \(t=1,\) the machine outputs \(Y[1]=0,\) and inspects input \(A[1]=1.\) Since the number of 1’s in sequence \(A[0{:}1] = \langle 0, 1\rangle\) is 1, i.e. an odd number, the next output is \(Y[2]=1.\) The remaining entries in the time table are derived analogously.

Next, we formalize our parity checker by designing the state transition diagram. The machine has two states \(Q \in \{ \text{even}, \text{odd}\}\) indicating whether the current number of 1’s in input sequence \(A\) is even or odd. If current state \(Q\) is even and input \(A=0,\) we remain in state even. Otherwise, if \(A=1,\) we transition into next state \(Q' = \text{odd.}\) If current state \(Q = \text{odd}\) and input \(A=0,\) we remain in state odd, and if \(A=1,\) we transition into next state even. The state transition diagram below captures these arguments in form of a graph:

We associate output \(Y\) with the states, keeping in mind that the state during clock cycle \(t\) and associated output \(Y[t]\) reflect the parity of the number of 1’s in input sequence \(A[0{:}t-1].\) Thus, our design is a Moore machine. We handle the remaining steps of the *FSM recipe* in a systematic fashion. The state transition diagram suggests the state encoding:

| state | \(S\) |
|---|---|
| even | 0 |
| odd | 1 |

This encoding is convenient, and likely to produce a low-cost circuit, because output \(Y=S.\)

The state transition table specifies next state \(S'\) and output \(Y\) as functions of current state \(S\) and input \(A.\) We generate the table by straightforward transliteration of the state transition diagram and the chosen state encoding:

| \(S\) | \(A\) | \(S'\) | \(Y\) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 |

We recognize Boolean functions \(S'\) and \(Y\) in the state transition table as \(S' = A \oplus S\) and \(Y = S.\)

The synthesized gate-level circuit for the odd parity checker is:

Note that output \(Y\) is not a function of input \(A\) but is a function of state \(S\) only. Thus, our parity checker is a Moore machine, indeed.
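The equations \(S' = A \oplus S\) and \(Y = S\) yield a compact simulation. The Python sketch below (ours) reproduces the time table of the worked example:

```python
# Odd parity checker (Moore): next state S' = A XOR S, output Y = S.
def parity_outputs(stream):
    s, ys = 0, []
    for a in stream:
        ys.append(s)   # Y reflects the parity of all inputs seen so far
        s = a ^ s      # toggle the state on every 1 in the input
    ys.append(s)       # output in the cycle after the last input
    return ys

# Input sequence from the time table above
parity_outputs([0, 1, 0, 1, 1, 1, 1, 0, 0, 0])
# first ten outputs -> [0, 0, 1, 1, 0, 1, 0, 1, 1, 1]
```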

Consider a pattern recognizer for bit pattern 011.

- Design the state transition diagram of a Mealy machine.
- Design the state transition diagram of a Moore machine.

A *pattern recognizer* has input
\(A\) and output \(Y.\) We set output \(Y = 1\) for
one clock cycle if pattern 011 appears on input \(A.\)

A Mealy machine can respond to changes at its inputs within the current clock cycle, because the output is a function of the current state and the inputs. Therefore, we design the state transition diagram such that output \(Y\) is associated with the transition arcs. First, however, we study a concrete input sequence by means of a time table:

| t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| Y | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

Observe that pattern 011 appears at input sequence \(A\) during cycle interval \(3 \le t \le 5,\) interval \(6 \le t \le 8,\) and \(11 \le t \le 13.\) In a Mealy machine, we set output \(Y = 1\) during the cycle when the last 1 of the pattern appears, i.e. during cycles 5, 8, and 13. During all other cycles, we set output \(Y=0,\) as shown in the time table.

A Mealy machine needs three states to recognize pattern 011. State \(S0\) is the start state, where the machine waits until the first 0 appears at input \(A.\) Then, the machine transitions into state \(S1.\) If input \(A=1\) in state \(S1,\) the machine transitions into state \(S2.\) If input \(A=1\) in state \(S2,\) the machine recognizes pattern 011, and transitions back to start state \(S0.\) We include arcs from \(S0\) to \(S1\) for \(A/Y = 0/0,\) from \(S1\) to \(S2\) for \(A/Y = 1/0,\) and from \(S2\) to \(S0\) for \(A/Y = 1/1.\)

To complete the state transition diagram, we observe that each state must have two outgoing arcs, one for input \(A=0,\) and the other for \(A=1.\) Thus, we inspect each state and attach the outgoing arcs not discussed above. We need an arc from state \(S0\) to itself for \(A/Y = 1/0,\) because we wait in \(S0\) while input \(A=1\) until the first 0 appears. State \(S1\) waits until a 1 appears at the input. Thus, we need an arc from state \(S1\) to itself for \(A/Y = 0/0.\) If input \(A=0\) in state \(S2,\) the input pattern is 010, which differs from pattern 011. The last 0 may be the first 0 of pattern 011, however. Therefore, we need an arc from state \(S2\) to state \(S1\) for \(A/Y = 0/0.\)

A Moore machine computes its output as a function of the current state only. Therefore, a Moore machine responds to an input in the next clock cycle at the earliest. The state transition diagram of a Moore machine associates the output with the state vertices. Before designing the state transition diagram, we revisit the concrete example of subproblem (a), with output \(Y\) specified as expected for a Moore machine:

| t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| Y | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |

Compared to the Mealy machine, we set output \(Y=1\) during the clock cycle after receiving the third bit of pattern 011, i.e. during clock cycles 6, 9, and 14. The Moore machine needs four states to recognize pattern 011. State \(S0\) is the start state. If input \(A=0,\) the machine transitions to state \(S1,\) from there to state \(S2\) if \(A=1,\) and to state \(S3\) if \(A=1\) again. In state \(S3,\) the machine signals that it recognizes pattern 011 by setting output \(Y=1.\) In the other states, output \(Y=0.\)

We complete the state diagram by visiting each state, and attaching the missing outgoing arcs. Since the machine has one input, each state vertex must have two outgoing arcs one for input \(A=0\) and the other for \(A=1,\) as in the Mealy machine. If input \(A=1\) in start state \(S0,\) we wait for the first 0. Thus, we need an arc from \(S0\) to itself for \(A=1.\) In state \(S1,\) we wait for one or more 0’s in the input. Thus, for input \(A=0\) we need an arc from \(S1\) to itself. In state \(S2,\) we have seen pattern prefix 01. If the input is \(A=0,\) this 0 may be the first 0 of pattern 011. Therefore, we include an arc from \(S2\) to \(S1\) for \(A=0.\) In state \(S3,\) we recognize pattern 011. If input \(A=0,\) this 0 may be the first 0 of pattern 011, and we include an arc from \(S3\) to \(S1.\) On the other hand, input \(A=1\) does not start pattern 011, and we include an arc to start state \(S0\) to wait for the next 0 again.

We conclude that the Moore machine requires one more state than the Mealy machine, and its output responds one clock cycle later. We note that the synthesized circuit of the Moore machine is often more costly than that of the Mealy machine. In contrast, the timing behavior of output \(Y\) of the Moore machine is less sensitive to the timing behavior of input \(A\) than that of the Mealy machine.
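The Mealy transitions described above can be tabulated and replayed against the concrete input sequence of the time table. This Python sketch (our own; the dictionary encodes the arcs named in the text) reproduces the expected output sequence:

```python
# Mealy recognizer for pattern 011, following the arcs in the text:
# S0 -0/0-> S1, S0 -1/0-> S0, S1 -0/0-> S1, S1 -1/0-> S2,
# S2 -0/0-> S1, S2 -1/1-> S0.
MEALY_011 = {
    ('S0', 0): ('S1', 0), ('S0', 1): ('S0', 0),
    ('S1', 0): ('S1', 0), ('S1', 1): ('S2', 0),
    ('S2', 0): ('S1', 0), ('S2', 1): ('S0', 1),
}

def recognize_011_mealy(stream):
    state, ys = 'S0', []
    for a in stream:
        state, y = MEALY_011[(state, a)]  # transition and output per cycle
        ys.append(y)
    return ys

A = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
recognize_011_mealy(A)  # Y = 1 during cycles 5, 8, and 13
```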

## 6.4. Composition of Finite State Machines¶

The *6-step design recipe* for finite state
machines represents a methodology which produces *monolithic* machines
with a potentially large number of states. The *traffic light
controller* is an example of a monolithic FSM design
with 34 states. Larger FSMs with hundreds or even thousands of states
are by no means uncommon in real-world applications. An alternative
design methodology is the composition of larger FSMs from smaller FSMs
with combinational glue logic. Given a large monolithic FSM, we may
decompose or factor the machine into multiple smaller machines. In
this section, we demonstrate the opposite design process. Rather than
decomposing a given monolithic FSM, we start with a clean slate and
demonstrate how to design a larger FSM by composing multiple smaller
FSMs.

### 6.4.1. Primitive State Machines¶

We introduce several primitive state machines that serve as building blocks for larger state machines. We assume that all primitive state machines have a clock input and a reset input, without drawing them explicitly in our circuit diagrams. All clock inputs shall be driven by the same clock signal \(\phi\) so that the resulting state machine is a synchronous circuit. Similarly, all reset inputs shall be driven by the same reset signal, used to reset all D-flipflops to 0, for example to guarantee a well-defined start state.

#### Delay Element¶

The simplest primitive state machine is the delay element, which
resembles a 1-bit *shift register*. The symbol
of the delay element on the left in Figure 6.40 omits the
clock and reset inputs by convention. The implementation of the delay
element consists of a single D-flipflop. We assume that the
triggering clock edge is the positive clock edge. Furthermore, we
assume a resettable D-flipflop with a synchronous reset input. If the
*reset* input is 1 at the triggering clock edge the D-flipflop stores
a 0. Otherwise, if the *reset* input is 0, it stores the signal at
input \(D.\)

The delay element constitutes a degenerate Moore machine without a feedback loop. Both the next state logic \(\sigma(I) = I\) and the output logic \(\omega(S) = S\) implement identity functions, so that \(S' = I\) and \(O = S.\) We describe the time dependent behavior of the delay element abstractly as a function of clock cycle \(t,\) where \(t\) is a nonnegative integer. We use brackets, such as \(O[t],\) to refer to signal \(O\) in clock cycle \(t.\) With this notation, we model the behavior of the delay element for \(t \ge 0\) as

\[
O[t] = \begin{cases} 0 & \text{if } t = 0, \\ I[t-1] & \text{if } t > 0. \end{cases}
\]

We assume that the reset signal is asserted before the beginning of time to enforce initial state \(O[0] = 0.\) Thereafter, in cycle \(t\) the delay element outputs the input of previous cycle \(t-1.\) If the input signal consists of sequence

\[
I = \langle x_0, x_1, x_2, x_3, \ldots \rangle,
\]

then the output produces the same sequence with a delay of one cycle, with an initial value of \(O[0] = 0\):

\[
O = \langle 0, x_0, x_1, x_2, \ldots \rangle.
\]

Cycle \(t\) serves as selector of the \(t^{th}\) element of a sequence, starting with \(t=0.\) For example, during cycle \(t=2\,\) we find \(I[2] = x_2\) and \(O[2] = x_1.\) Since \(x_1 = I[1],\) we have \(O[2] = I[1]\) as expected from expression \(O[t] = I[t-1].\)

As a concrete example, consider binary input sequence \(I =
\langle 1, 1, 0, 1, 0, 0, 1\rangle.\) The delay element transduces
input sequence \(I\) into output sequence \(O.\) We
illustrate the sequences in a **time table** with one column per
cycle:

| cycle | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-------|---|---|---|---|---|---|---|
| I     | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| O     | 0 | 1 | 1 | 0 | 1 | 0 | 0 |

The time table serves as an abstract version of a waveform diagram. By default, we assume that the delay element stores 0 initially, so that in cycle 0, \(O[0] = 0.\) Then, in cycle 1, we have \(O[1] = I[0] = 1,\) and so on.
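The time table is easy to replay in software. The following minimal Python sketch (the function name `delay_element` is ours, not from the text) simulates one D-flipflop as described, storing its input at each clock edge:

```python
def delay_element(inputs, start=0):
    """Simulate a 1-bit delay element: O[t] = I[t-1], with O[0] = start."""
    outputs = []
    state = start  # reset state of the D-flipflop
    for i in inputs:
        outputs.append(state)  # output the stored state during cycle t
        state = i              # at the clock edge, store I[t] for cycle t+1
    return outputs

print(delay_element([1, 1, 0, 1, 0, 0, 1]))  # [0, 1, 1, 0, 1, 0, 0]
```

The printed sequence reproduces row \(O\) of the time table above.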

#### Accumulator State Machine¶

An \(n\)-bit accumulator state machine outputs the sum of its inputs modulo \(2^n.\) We may design Mealy type and Moore type accumulators, as shown in Figure 6.41. The output of the Moore machine is delayed by one cycle compared to the Mealy machine. All buses comprise \(n\) signals, except the clock and reset signals.

The Mealy machine computes next state \(S' = (S + I) \bmod 2^n\) and output \(O = (S + I) \bmod 2^n = S'.\) Thus, we may save one adder by rearranging the feedback loop analogously to the multiplexer loop in Figure 6.25. Assuming that the machine is reset to start state \(S[0] = 0,\) the time dependent output of the Mealy accumulator is

\[
O[t] = \Bigl(\,\sum_{k=0}^{t} I[k]\Bigr) \bmod 2^n.
\]

The Moore machine computes the same next state \(S' = (S + I) \bmod 2^n\) as the Mealy machine, but outputs the current state \(O = S.\) With start state \(S[0] = 0,\) the output of the Moore accumulator is

\[
O[t] = \Bigl(\,\sum_{k=0}^{t-1} I[k]\Bigr) \bmod 2^n.
\]

The difference between the two machine types becomes clear by considering a concrete example. Assume we apply input sequence

\[
I = \langle x_0, x_1, x_2, x_3, \ldots \rangle
\]

to both machines with bit width \(n\) large enough for the sums not to overflow. Then, the Mealy accumulator produces output sequence

\[
O = \langle x_0,\; x_0 + x_1,\; x_0 + x_1 + x_2,\; \ldots \rangle,
\]

whereas the Moore accumulator has output sequence

\[
O = \langle 0,\; x_0,\; x_0 + x_1,\; x_0 + x_1 + x_2,\; \ldots \rangle.
\]

Thus, except that the output of the Moore machine is delayed by one
clock cycle compared to the Mealy machine, both accumulators compute a
*prefix sum* serially. The time table
illustrates the accumulation of an input sequence of 4-bit unsigned
binary numbers with a Moore accumulator:

| cycle | 0 | 1 | 2  | 3 | 4  | 5  | 6  |
|-------|---|---|----|---|----|----|----|
| I     | 9 | 3 | 11 | 6 | 1  | 0  | 8  |
| O     | 0 | 9 | 12 | 7 | 13 | 14 | 14 |

The output sequence is easy to verify, if we notice that a 4-bit Moore accumulator has output function \(O[t] = (O[t-1] + I[t-1]) \bmod 16\) with start value \(O[0] = 0.\)
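We can verify the table mechanically. The sketch below (our own Python rendering, with an assumed default bit width of 4) simulates the Moore accumulator's next state and output functions:

```python
def moore_accumulator(inputs, n=4):
    """Moore accumulator: O[t] is the sum of I[0..t-1] mod 2**n, O[0] = 0."""
    outputs = []
    state = 0  # start state after reset
    for i in inputs:
        outputs.append(state)           # Moore output is the current state
        state = (state + i) % (1 << n)  # next state S' = (S + I) mod 2^n
    return outputs

print(moore_accumulator([9, 3, 11, 6, 1, 0, 8]))  # [0, 9, 12, 7, 13, 14, 14]
```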

For bit width \(n = 1,\) the arithmetic adder reduces to a logical 2-input XOR gate, and the accumulator computes a prefix XOR function. The time table illustrates a Moore version of the prefix XOR transduction:

| cycle | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-------|---|---|---|---|---|---|---|
| I     | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| O     | 0 | 1 | 0 | 0 | 1 | 1 | 1 |

Element \(O[t]\) of the output sequence obeys the Boolean equation \(O[t] = O[t-1] \oplus I[t-1]\) with start value \(O[0] = 0.\)

#### Counter State Machine¶

We can turn the accumulator into an \(n\)-bit *binary counter* by tying accumulator input \(I\) to
\(n\)-bit binary number 1. Such a counter machine has no input
signal. The next state logic of an \(n\)-bit counter computes
\(S' = (S + 1) \bmod 2^n.\) Therefore, the time dependent
behavior of a Moore type \(n\)-bit counter is

\[
O[t] = t \bmod 2^n.
\]

For bit width \(n\) large enough to prevent overflow, the counter state machine counts the clock cycles, \(O[t] = t.\) Such a state machine is useful for performance measurements, for example. The count of the corresponding Mealy machine is one more than the count of the Moore machine, \(O[t] = t + 1.\)

A counter can be used as a clock divider. For example, a 2-bit counter implements output function \(O[t] = t \bmod 4\):

| \(t\)     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-----------|---|---|---|---|---|---|---|---|---|---|
| \(O\)     | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 |
| \(O_0\)   | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| \(O_1\)   | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |

Observe that 2-bit output \(O = O_1 O_0\) transduces clock signal \(\phi,\) which is 1 during the first half cycle and 0 during the second half cycle, into two clock signals, \(O_0\) with half the frequency of \(\phi,\) and \(O_1\) with a quarter of the frequency of \(\phi.\)
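A short Python sketch (names are ours) reproduces the clock-divider table by computing the count and slicing out the individual output bits:

```python
def moore_counter(cycles, n=2):
    """Moore counter: O[t] = t mod 2**n; no inputs besides clock and reset."""
    return [t % (1 << n) for t in range(cycles)]

counts = moore_counter(10)           # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
o0 = [c & 1 for c in counts]         # O_0 toggles every cycle: half frequency
o1 = [(c >> 1) & 1 for c in counts]  # O_1 toggles every 2 cycles: quarter frequency
print(counts, o0, o1)
```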

#### Average State Machine¶

An averaging machine computes the arithmetic average of a window of \(k\) inputs. For example, given a window size of \(k = 3,\) the average of three consecutive inputs \(I[t],\) \(I[t-1],\) and \(I[t-2]\) is

\[
avg_3(t) = \frac{I[t] + I[t-1] + I[t-2]}{3}.
\]

The group of inputs \(I[t],\) \(I[t-1],\) and \(I[t-2]\) constitutes a window that slides across the input sequence as time progresses. Figure 6.42 illustrates the window in cycles \(t=3\) and \(t=4.\)

A machine that outputs \(avg_3(t)\) during cycle \(t,\) such that \(O[t] = avg_3(t),\) must be a Mealy machine with output function

\[
O[t] = \frac{I[t] + I[t-1] + I[t-2]}{3}.
\]

Figure 6.43 shows the corresponding Mealy version of the average machine. The next states are \(S_1' = I\) and \(S_2' = S_1,\) and output \(O = (I + S_1 + S_2)/3.\) We assume the divider implements an integer division by 3 for binary numbers. Furthermore, the machine shall be reset to start state \(S_1[0] = 0\) and \(S_2[0] = 0.\)

The time table shows the averages of signed binary numbers in decimal format, assuming \(n\) is large enough to prevent overflows in the combinational arithmetic units:

| cycle | 0  | 1 | 2  | 3  | 4  | 5  | 6 |
|-------|----|---|----|----|----|----|---|
| I     | 10 | 2 | -3 | -8 | 5  | 12 | 4 |
| O     | 3  | 4 | 3  | -3 | -2 | 3  | 7 |

By inserting another D-flipflop at input \(I\) in Figure 6.43, we transform the Mealy average machine into a Moore average machine, with output function

\[
O[t] = \frac{I[t-1] + I[t-2] + I[t-3]}{3}.
\]

Average machines with various window sizes serve as low-pass filters in digital signal processing applications.
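The time table of the Mealy average machine can be checked with the following Python sketch. The names are ours, and we assume the integer divider truncates toward zero; all quotients in the table happen to be exact, so the assumption does not affect the result:

```python
def mealy_average3(inputs):
    """Mealy sliding-window average over the current and two previous inputs."""
    outputs = []
    s1 = s2 = 0  # start states of the two delay registers: S1 = I[t-1], S2 = I[t-2]
    for i in inputs:
        total = i + s1 + s2                      # combinational 3-input addition
        q = abs(total) // 3                      # integer division by 3 ...
        outputs.append(q if total >= 0 else -q)  # ... truncating toward zero
        s1, s2 = i, s1                           # shift the window: S1' = I, S2' = S1
    return outputs

print(mealy_average3([10, 2, -3, -8, 5, 12, 4]))  # [3, 4, 3, -3, -2, 3, 7]
```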

### 6.4.2. Series Composition¶

We compose two state machines in series just like we compose two
*resistors*, two *capacitors*, or two *switches* in
series. Given two state machines \(M_1: I_1 \rightarrow O_1\) and
\(M_2: I_2 \rightarrow O_2,\) such that output \(O_1\) of
\(M_1\) and input \(I_2\) of \(M_2\) have the same bit
width. Then, the series composition has input \(I = I_1,\) output
\(O = O_2,\) and connects output \(O_1\) to input \(I_2,\)
enforcing \(I_2 = O_1.\)

As an example, consider the series composition of two *delay
elements*. We wish to characterize the behavior of
the series composition by expressing output \(O\) as a time
dependent function of input \(I.\) In cycle \(t,\) the output
of delay element \(M_1\) is \(O_1[t] = I_1[t-1] = I[t-1],\)
and for delay element \(M_2,\) we have \(O[t] = O_2[t] =
I_2[t-1].\) The series composition enforces \(I_2[t-1] =
O_1[t-1].\) Therefore, we find \(O[t] = I[t-2]\) since
\(O_1[t-1] = I[t-2].\) Next, we determine the start behavior. In
cycle 0, each delay machine outputs \(O_1[0] = O_2[0] = 0.\)
Therefore, in cycle 0 output \(O[0] = 0,\) and in cycle 1
\(M_2\) outputs the start value of \(M_1,\) i.e. \(O[1] =
O_1[0] = 0.\) In summary, the time dependent output function of the
series composition of two delay elements is

\[
O[t] = \begin{cases} 0 & \text{if } t \le 1, \\ I[t-2] & \text{if } t > 1. \end{cases}
\]

The time table illustrates the transduction of input sequence \(I\) by the series composition of two delay elements into output sequence \(O.\)

| cycle           | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------------|---|---|---|---|---|---|---|
| \(I = I_1\)     | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| \(O_1 = I_2\)   | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| \(O = O_2\)     | 0 | 0 | 1 | 1 | 0 | 1 | 0 |

This time table assumes that the connecting wire \(O_1 = I_2\) is visible. However, for a specification of the functional behavior of the series composition as a whole, the sequence on the internal node is irrelevant.
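Series composition corresponds to ordinary function composition on sequences. A minimal Python sketch (function name `delay` is ours) makes the point: applying the delay twice yields the two-cycle delay, and the intermediate sequence is exactly the internal node \(O_1 = I_2\):

```python
def delay(inputs, start=0):
    """One delay element: O[t] = I[t-1] with O[0] = start."""
    out, state = [], start
    for i in inputs:
        out.append(state)
        state = i
    return out

x = [1, 1, 0, 1, 0, 0, 1]
o1 = delay(x)   # internal node O_1 = I_2
o = delay(o1)   # series composition: O[t] = I[t-2] for t > 1
print(o1)       # [0, 1, 1, 0, 1, 0, 0]
print(o)        # [0, 0, 1, 1, 0, 1, 0]
```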

### 6.4.3. Parallel Composition¶

The parallel composition of two state machines \(M_1: I_1 \rightarrow O_1\) and \(M_2: I_2 \rightarrow O_2\) instantiates \(M_1\) and \(M_2\) and arranges the inputs and outputs side by side such that the combined machine maps \(I_1 \times I_2 \rightarrow O_1 \times O_2.\) Parallel composition is the easiest composition we can accomplish in hardware. We impose no constraints on machines \(M_1\) and \(M_2\) other than synchronizing them with the same clock and reset signals.

As an example, consider the parallel composition of two delay elements in Figure 6.46. If we interpret the two inputs as a 2-bit bus and the two outputs as a 2-bit bus, the parallel composition is a delay element for a 2-bit bus. This construction extends directly to \(n\) parallel delay elements for \(n\)-bit buses.

If we use the \(n\)-bit buses on the right of Figure 6.46 to carry signals that we interpret as unsigned binary numbers, then we have a delay element for numbers. The time table illustrates the behavior with unsigned binary numbers shown in decimal format:

| cycle | 0  | 1  | 2 | 3  | 4  | 5  | 6  |
|-------|----|----|---|----|----|----|----|
| I     | 42 | 7  | 1 | 36 | 15 | 2  | 29 |
| O     | 0  | 42 | 7 | 1  | 36 | 15 | 2  |

Figure 6.47 shows a composite state machine with a parallel composition of a series composition of two delay elements and a single delay element. The counter drives both inputs of a parallel composition. A combinational adder combines the outputs of the parallel composition. The composite machine has no input and \(n\)-bit output \(O.\)

We analyze the time dependent behavior of the machine starting at output \(O,\) and working our way toward the counter. We find that output \(O\) combines the outputs of the parallel composition, \(O_1\) and \(O_2,\) by means of an arithmetic addition. As for every combinational module, we assume that the output of the adder appears within the same cycle \(t\) as the inputs:

\[
O[t] = (O_1[t] + O_2[t]) \bmod 2^n.
\]

Next, we determine the outputs of the parallel composition.
Output \(O_1\) is the output of the *series composition*
of two delay elements, which transduces input \(I_1\) such that

\[
O_1[t] = I_1[t-2].
\]

Output \(O_2\) delays input \(I_2\) by one clock cycle:

\[
O_2[t] = I_2[t-1].
\]

Both inputs of the parallel composition are driven by the \(n\)-bit counter. Assuming that the counter is a Moore machine, we have

\[
I_1[t] = I_2[t] = t \bmod 2^n.
\]

Substituting the signals in the equation for output \(O[t],\) we obtain

\[
O[t] = \bigl((t-2) + (t-1)\bigr) \bmod 2^n = (2t - 3) \bmod 2^n \quad \text{for } t \ge 2.
\]

We analyze the start behavior of the composite machine by means of a time table, assuming bit width \(n\) is large enough, so that we can ignore the modulo operations.

| cycle           | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7  | 8  | 9  |
|-----------------|---|---|---|---|---|---|---|----|----|----|
| \(I_1 = I_2\)   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7  | 8  | 9  |
| \(O_1\)         | 0 | 0 | 0 | 1 | 2 | 3 | 4 | 5  | 6  | 7  |
| \(O_2\)         | 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6  | 7  | 8  |
| \(O\)           | 0 | 0 | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 |

The Moore type *counter* drives the cycle number on
inputs \(I_1\) and \(I_2,\) shown in the top row of the time
table. The two delay elements in series output two 0’s initially in
cycles 0 and 1, i.e. \(O_1[0] = O_1[1] = 0.\) Then, \(O_1\)
outputs input \(I_1,\) so that \(O_1[2] = I_1[0] = 0,\)
\(O_1[3] = I_1[1] = 1,\) and so on. Similarly, the delay element
of output \(O_2\) produces a 0 initially, \(O_2[0],\) and
thereafter outputs input \(I_2.\) The adder combines outputs
\(O_1\) and \(O_2\) cycle-by-cycle into output \(O.\) We
observe that the state machine in Figure 6.47
implements an *odd number generator*, with a startup delay of two
clock cycles.
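The cycle-by-cycle analysis above translates directly into a simulation. In the Python sketch below (names are ours), registers `d1` and `d2` form the two-delay path for \(O_1,\) and `d3` is the single delay for \(O_2\):

```python
def odd_number_generator(cycles):
    """Counter drives a two-delay path (O_1) and a one-delay path (O_2);
    a combinational adder sums them: O[t] = O_1[t] + O_2[t]."""
    outputs = []
    d1 = d2 = d3 = 0  # reset states: d1 -> d2 in series for O_1, d3 for O_2
    for t in range(cycles):       # Moore counter output: I_1[t] = I_2[t] = t
        outputs.append(d2 + d3)   # adder combines the two delayed counts
        d1, d2, d3 = t, d1, t     # clock edge: shift the counter value in
    return outputs

print(odd_number_generator(10))  # [0, 0, 1, 3, 5, 7, 9, 11, 13, 15]
```

The startup delay of two cycles shows up as the two leading zeros, after which the odd numbers appear.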

### 6.4.4. Feedback Composition¶

The feedback composition connects output \(O\) of state machine \(M\) to one of its inputs. Since \(M\) may be a composite state machine, the feedback composition generalizes the simple feedback loops of Mealy and Moore machines. For a feedback composition to qualify as a synchronous circuit, we require that \(M\) is not a combinational logic module, but a state machine with at least one delay element on the loop segment from \(I_2\) to \(O.\)

The feedback composition enables us to produce complex feedback
behavior if \(M\) itself is a composite state machine. However,
we may also use the feedback composition to construct primitive state
machines. For example, Figure 6.49 shows a primitive
*counter* state machine, based on a feedback
composition of machine \(M,\) which is a series composition of a
combinational adder and a delay element.

We check whether output \(O\) of the state machine implements the
same behavior as our earlier primitive *counter*
state machine by analyzing the time dependent behavior of the feedback
composition. We begin with machine \(M.\) Output \(O_1\) of
the combinational adder produces the \(n\)-bit sum of \(I_1\)
and constant 1:

\[
O_1[t] = (I_1[t] + 1) \bmod 2^n.
\]

The delay element introduces one clock cycle of delay at output \(O\) of machine \(M,\) such that

\[
O[t] = O_1[t-1].
\]

The feedback composition connects output \(O\) to input \(I_1,\) which enforces

\[
I_1[t] = O[t].
\]

Substituting the equations into the expression for \(O[t]\) yields

\[
O[t] = (O[t-1] + 1) \bmod 2^n.
\]

The base case of this recurrence is the start state of the delay element, \(O[0] = 0.\) The recurrence itself holds for cycles \(t > 0.\) We construct the time table interpreting the \(n\)-bit signals as unsigned binary numbers, with bit width \(n\) large enough to ignore the modulo operation. Starting from \(O[0] = 0,\) according to the output recurrence, in cycle 1 we have \(O[1] = O[0] + 1 = 1,\) in cycle 2 \(O[2] = O[1] + 1 = 2,\) and so on.

| cycle | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-------|---|---|---|---|---|---|---|
| O     | 0 | 1 | 2 | 3 | 4 | 5 | 6 |

Thus, the machine has the output behavior of the primitive \(n\)-bit binary counter machine, \(O[t] = t \bmod 2^n,\) indeed.
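The feedback recurrence is a one-line loop in software. The Python sketch below (names and default bit width are ours) iterates \(O[t] = (O[t-1] + 1) \bmod 2^n\) from start state \(O[0] = 0\):

```python
def feedback_counter(cycles, n=8):
    """Feedback composition of an adder and a delay element:
    O[t] = (O[t-1] + 1) mod 2**n, with O[0] = 0."""
    outputs = []
    state = 0  # reset state of the delay element on the loop
    for _ in range(cycles):
        outputs.append(state)
        state = (state + 1) % (1 << n)  # adder output, fed back through the delay
    return outputs

print(feedback_counter(7))  # [0, 1, 2, 3, 4, 5, 6]
```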

We analyze the composite state machine in Figure 6.50. Upon resetting the machine, the delay elements shall have start state \(S_0 = 1\) rather than default value \(S_0 = 0.\) Machine \(M_1\) is the counter machine of Figure 6.49. Machine \(M_2\) replaces the adder of \(M_1\) with a combinational multiplier for unsigned binary numbers modulo \(2^n.\)

The machine is a series composition of two feedback compositions \(M_1\) and \(M_2.\) For a back-of-the-envelope analysis, we consider numbers with a bit width small enough to ignore the modulo operations of the adder and the multiplier.

Machine \(M_1\) is a primitive *counter*
state machine with output function \(O_1[t] = O_1[t-1] + 1\)
with start state \(S_0 = 1.\) We construct a time table to
deduce the effect of the start state on the output function. In
cycle 0, the output is equal to the start state, \(O_1[0] =
1.\) According to the recurrence, in cycle \(t=1,\) the output
is \(O_1[0]+1 = 2,\) and so on:

| cycle   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---------|---|---|---|---|---|---|---|
| \(O_1\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

We conclude that counter \(M_1\) with initial output \(O_1[0] = 1\) generates output

\[
O_1[t] = t + 1.
\]

Machine \(M_2\) multiplies the current output with the current input. The delay element introduces a delay of one clock cycle. Thus, we have

\[
O_2[t] = O_1[t-1] \cdot O_2[t-1].
\]

Given initial output \(O_2[0] = 1\) of the delay element of \(M_2,\) we construct a time table to derive the output of \(M_2\) as a function of \(O_1.\) In initial cycle \(t=0,\) output \(O_2[0] = 1,\) as determined by the start state of the delay element. Then, in cycle 1, the output is the arithmetic product of outputs \(O_1[0]\) and \(O_2[0]\) of the previous cycle, such that \(O_2[1] = 1 * 1 = 1.\) In cycle 2, we find \(O_2[2] = O_1[1] * O_2[1] = 2,\) and so on:

| cycle   | 0 | 1 | 2 | 3 | 4  | 5   | 6   |
|---------|---|---|---|---|----|-----|-----|
| \(O_1\) | 1 | 2 | 3 | 4 | 5  | 6   | 7   |
| \(O_2\) | 1 | 1 | 2 | 6 | 24 | 120 | 720 |

We recognize the output sequence of the composite machine
\(O[t] = O_2[t]\) as the *factorial sequence*

\[
O[t] = t\,! = \prod_{k=1}^{t} k,
\]
including base case \(0! = 1,\) which is commonly assumed by definition of the factorial function.
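The two coupled recurrences simulate naturally side by side. In this Python sketch (names are ours; the modulo operation is ignored, as in the back-of-the-envelope analysis), `s1` and `s2` hold the states of the delay elements of \(M_1\) and \(M_2,\) both reset to 1:

```python
def factorial_machine(cycles):
    """Series composition of counter M1 (start state 1) and multiplier loop M2:
    O_1[t] = t + 1 and O_2[t] = O_1[t-1] * O_2[t-1], with O_2[0] = 1."""
    o1_list, o2_list = [], []
    s1, s2 = 1, 1  # delay elements reset to start state 1
    for _ in range(cycles):
        o1_list.append(s1)
        o2_list.append(s2)
        s1, s2 = s1 + 1, s1 * s2  # next states of M1 and M2 at the clock edge
    return o1_list, o2_list

o1, o2 = factorial_machine(7)
print(o2)  # [1, 1, 2, 6, 24, 120, 720]
```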

We analyze the output function of the composite state machine in Figure 6.51. The machine consists of a parallel composition of delay elements, similar to Figure 6.47. However, we assume that two of the delay elements have start state \(S_0 = 1\) rather than default start state \(S_0 = 0.\) Instead of driving the parallel inputs with a counter, as in Figure 6.47, we connect output \(y\) of the combinational adder to the delay elements, forming a feedback composition.

As in Example 6.3 we perform a back-of-the-envelope analysis, ignoring the modulo operation of the arithmetic addition when the sum exceeds bit width \(n.\) We construct a time table to deduce the sequences at the outputs of the delay elements \(O[t] = y[t-2]\) and \(x[t] = y[t-1].\) Signal \(y\) is the combinational sum of \(O\) and \(x,\) i.e. \(y[t] = O[t] + x[t].\) The start states of the delay elements determine the initial states \(x[0] = 1,\) \(O[0] = 0,\) and \(O[1] = 1,\) so that \(y[0] = O[0] + x[0] = 1.\) Then, we find \(x[1] = y[0] = 1\) which implies \(y[1] = O[1] + x[1] = 2.\) Furthermore, since \(y[0] = 1,\) we deduce \(O[2] = y[0] = 1,\) and so on.

| cycle  | 0 | 1 | 2 | 3 | 4 | 5  | 6  | 7  | 8  | 9  |
|--------|---|---|---|---|---|----|----|----|----|----|
| \(O\)  | 0 | 1 | 1 | 2 | 3 | 5  | 8  | 13 | 21 | 34 |
| \(x\)  | 1 | 1 | 2 | 3 | 5 | 8  | 13 | 21 | 34 | 55 |
| \(y\)  | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |

We recognize output \(O[t]\) as the *Fibonacci sequence* of
Example 5.1, where \(O[t] = Fib(t).\)
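The feedback composition simulates as three registers around a combinational adder. In the Python sketch below (names are ours), `o` and `o_mid` form the two-delay path so that \(O[t] = y[t-2],\) and `x` holds \(x[t] = y[t-1],\) with the start states \(x[0] = 1,\) \(O[0] = 0,\) \(O[1] = 1\) given in the text:

```python
def fibonacci_machine(cycles):
    """Feedback composition of Figure 6.51: y[t] = O[t] + x[t],
    x[t] = y[t-1], O[t] = y[t-2]; start states x[0] = 1, O[0] = 0, O[1] = 1."""
    o_list = []
    o, o_mid, x = 0, 1, 1  # O delay chain holds <0, 1>; x register holds 1
    for _ in range(cycles):
        o_list.append(o)
        y = o + x                  # combinational adder
        o, o_mid, x = o_mid, y, y  # clock edge: shift y into both delay paths
    return o_list

print(fibonacci_machine(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```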

We revisit the *traffic light controller*,
whose synthesis as a monolithic Moore machine caused us some
difficulties due to the large number of states. Now that we know
how to compose state machines, we redesign the traffic light
controller as a composition of smaller state machines.

The key to factoring the monolithic traffic light controller is to
realize that an Austrian traffic light loops through five phases,
each with its own duration, and the green-blinking phase exhibiting
the extravagant flashing feature. Hence, a natural design employs
a *phase state machine* to loop through the five phases, with one
state per phase: red (R), red-yellow (RY), green (G), green
blinking (GB), and yellow (Y). The provisional state diagram below
shows the loop of phase transitions.

The controller visits each of the states for a different number of
clock cycles. Thus, we design a parameterizable *timer state
machine* that effects the transition from one phase state to the
next. When entering a phase state, we start the timer for the
desired number of cycles. Once the timer expires, the phase state
machine transitions into the next phase state. For example, the
red state initializes the timer to expire after another 12 cycles,
keeping the red light on for a total of 13 cycles. Thereafter, the
phase state machine transitions into the red-yellow state.

The block diagram on the right shows the traffic light controller factored into two communicating state machines. The phase state machine starts the timer with a desired number of cycles. When the timer expires, it informs the phase state machine which, in turn, transitions into the next state.

We design the timer by adopting the primitive *counter* state machine, see Figure 6.52. Rather than
counting up, we count down, and compute the *expired* signal by
means of a combinational *equality-to-zero comparator*. When the counter counts down to zero, it
retains value zero until the state register stores a new counter
value, supplied via the *cycles* input, when the *start* signal is
asserted.

Given the timer state machine, we can now refine the phase state machine. In each state, the phase state machine outputs the number of cycles, which is the duration in cycles minus two, and issues a start pulse to the timer. The phase state loops to itself until the timer expires. The green blinking phase requires special treatment. We split state \(GB\) into two states, \(GB_0\) and \(GB_1.\) In state \(GB_0\) all lights are off, whereas in \(GB_1\) the green light is switched on.

Since the phase state machine has only six states, the synthesis procedure, including next state and output logic minimization, does not pose any difficulties. Thus, we leave the synthesis of the Moore type phase state machine as an exercise.
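The countdown timer can be sketched in a few lines. The Python model below is our own reading of the description, assuming the simple next-state rule "load *cycles* on *start*, otherwise decrement toward zero and retain zero"; the exact off-by-one convention between the *cycles* value and the phase duration is the book's, not modeled here:

```python
def timer(starts, cycles_values):
    """Countdown timer sketch: when start[t] = 1, load cycles[t]; otherwise
    decrement toward zero and retain zero. expired[t] = (S[t] == 0)."""
    expired = []
    s = 0  # state register, reset to 0 (timer expired)
    for start, cycles in zip(starts, cycles_values):
        expired.append(1 if s == 0 else 0)  # comparator output in cycle t
        if start:
            s = cycles                      # load a new countdown value
        elif s > 0:
            s -= 1                          # count down toward zero
    return expired

# load a countdown of 3 in cycle 0; the timer signals expiration in cycle 4
print(timer([1, 0, 0, 0, 0], [3, 0, 0, 0, 0]))  # [1, 0, 0, 0, 1]
```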

## 6.5. Sequencing Composite State Machines¶

We can construct large composite state machines by means of
*series*, *parallel*, and
*feedback* compositions of smaller state machines.
Composite state machines are sequential circuits that execute
according to the common beat of a global clock signal. The
*compositions* discussed so far have in
common that all constituent state machines perform a state transition
at every triggering clock edge. This is neither necessary nor
desirable. For example, we may want to save energy by putting the
phase state machine of the factored traffic light controller in
Example 6.5 to sleep while the timer runs.

To provide additional control over the temporal behavior of composite
state machines, we introduce **sequencing** as an additional
discipline for our digital design repertoire. Sequencing enables us
to selectively execute constituent machines of a composite state
machine by

- controlling when a constituent machine starts execution, and by
- determining the actions of a constituent machine after it has run to completion.

Informally, sequencing controls when a constituent machine of a composite machine executes and what it does when not. To facilitate sequencing, we need a mechanism that permits us to start execution of a state machine, and after termination enables another state machine to continue execution. We prepare a state machine for sequencing by introducing two control signals:

- *doit*: an input signal that starts the execution of the machine, and
- *done*: an output signal that informs the outside world about its termination.

Figure 6.54 shows a black box state machine with doit and done signals. A filled triangle in one of its corners marks a state machine as capable of sequencing. Which of its inputs and outputs are the doit and done signals should be clear from the context, or otherwise by proper wire naming. By convention, we must not restart a busy machine by asserting the doit signal before it has signalled termination by asserting the done signal.

The sequencing mechanism generalizes the communication in
Example 6.5 between the phase and timer
state machines via the *start* and *expired* signals. Here, the
*start* signal of the timer assumes the role of the *doit* signal, and
the *expired* signal of the timer corresponds to the *done* signal.
After termination the timer retains its state, i.e. state register
\(S = 0,\) until the *start* signal restarts the timer.

We specify the timing behavior of the inputs and outputs of a
sequenced state machine, including the *doit* and *done* signals, as
follows:

- To start machine \(M,\) set the *doit* signal to 1 for the duration of one clock cycle. In the same cycle as this doit pulse, apply initial input value \(I[0].\)
- When the result of machine \(M\) is available, produce a done pulse by setting output *done* to 1 for the duration of one clock cycle, accompanying the stabilized output \(O.\)
- After the done pulse, output \(O\) remains unchanged until the subsequent clock cycle in which another done pulse accompanies a new output \(O.\)

Assuming machine \(M\) has a delay of \(T_M\) clock cycles, and we apply a doit pulse in clock cycle \(t,\) then the done pulse appears in cycle \(t+T_M.\)

### 6.5.1. Sequencing Logic¶

We illustrate the design of the sequencing logic for a state machine
by extending the *AddR* machine shown in Figure 6.55. Machine
*AddR* is a series composition of a combinational adder and a delay
element consisting of a register. We classify the machine as a
*register machine* to emphasize that it stores the result of the
combinational addition in a register. The bit width depends on the
application, and is left unspecified.

Output \(Y\) of *AddR* has the time dependent behavior

\[
Y[t] = A[t-1] + B[t-1],
\]

because the register introduces a delay of one clock cycle. The output behavior with input sequences \(A = \langle 1, 2, 3, \ldots \rangle\) and \(B = \langle 3, 4, 5, \ldots \rangle\) is illustrated in the waveform diagram below. In cycle 0 output \(Y\) depends on the start state of the register. Since we do not care about the initial value, we mark \(Y[0]\) as undefined. The essential observation is that the outputs appear cycle-by-cycle whether we use them or not. For as long as the input sequences change, the output follows. The machine has no provisions to pause or stop the computation. However, such a feature is desirable, for instance, if we want to reuse a particular output value several times, say \(Y[4]\) in cycles 4, 5, and 6.

This is where sequencing enters the stage. Assuming that the timing
behavior of inputs \(A\) and \(B\) is beyond our control, we
wish to be able to start and stop the machine independently. The
important insight is that we can stop register machine *AddR* by
retaining the state of its register. Otherwise, *AddR*
executes by storing a new sum at every triggering clock edge.

We introduce an **enabled register** to control whether a register
stores its input at a triggering clock edge or not, see
Figure 6.56. The enabled register obeys the time
dependent functional specification

\[
O[t] = \begin{cases} I[t-1] & \text{if } enable[t-1] = 1, \\ O[t-1] & \text{if } enable[t-1] = 0. \end{cases}
\]

If the *enable* input is 1, the enabled register behaves like a
regular *register* or *delay element*. Otherwise, if
*enable* equals 0, the register retains its current state.

The first step towards adding a sequencing capability to the *AddR*
machine is to replace the register with an enabled register, and to
connect the *doit* signal to the *enable* input. Figure 6.57 shows the *doit* extension of the *AddR* machine on the
left. The *doit* input changes the behavior of the *AddR* machine
such that the output function is now

\[
Y[t] = \begin{cases} A[t-1] + B[t-1] & \text{if } doit[t-1] = 1, \\ Y[t-1] & \text{if } doit[t-1] = 0. \end{cases}
\]

If the *doit* signal is 1, the *AddR* machine computes the sum of the
inputs with a single-cycle delay as before. Otherwise, if the *doit*
signal is 0, the *AddR* machine retains its state. Thus, by setting
the *doit* signal to 0, we effectively freeze the machine independent
of input signals \(A\) and \(B.\)

As the second step, we introduce a *done* output to signal that the
machine has terminated and the result is available at output
\(Y.\) Since the enabled register introduces one cycle of delay,
we produce the *done* signal by delaying the *doit* signal by one
cycle. Therefore, we extend the *AddR* machine with a D-flipflop
between the *doit* input and the *done* output, as shown on the right
in Figure 6.57. The *done* output implements the
time dependent function of a delay element:

\[
done[t] = doit[t-1].
\]

The resulting *AddR* machine is capable of sequencing. It outputs a
new sum only if we assert the *doit* signal. If we apply a doit pulse
in cycle \(t_1,\) then we observe a done pulse in cycle
\(t_1+1\) and the sum \(Y[t_1+1] = A[t_1]+B[t_1]\) at the
output. This sum remains visible until we issue another doit pulse in
cycle \(t_2 > t_1,\) independent of inputs \(A\) and
\(B.\) Figure 6.58 illustrates the output
behavior by means of a waveform diagram. We apply doit pulses in
cycles 0, 1, 4, and 6. Due to the 1-cycle delay of the *AddR*
machine, we expect valid outputs in cycles 1, 2, 5, and 7. Otherwise,
output \(Y\) retains the output associated with the last done
pulse.
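The waveform is reproducible with a short simulation. The Python sketch below (names are ours) models the enabled register and the *done* flipflop; unlike the waveform, which leaves \(Y[0]\) undefined, we assume the register resets to 0:

```python
def addr_sequenced(a_seq, b_seq, doit_seq):
    """Sequenced AddR sketch: an enabled register holds the sum, and a
    D-flipflop delays doit into done. Register reset value assumed 0."""
    y_out, done_out = [], []
    y = done = 0  # reset states of the enabled register and the done flipflop
    for a, b, doit in zip(a_seq, b_seq, doit_seq):
        y_out.append(y)
        done_out.append(done)
        if doit:
            y = a + b  # enabled register stores the new sum
        done = doit    # done[t] = doit[t-1]
    return y_out, done_out

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [3, 4, 5, 6, 7, 8, 9, 10]
doit = [1, 1, 0, 0, 1, 0, 1, 0]  # doit pulses in cycles 0, 1, 4, and 6
y, done = addr_sequenced(a, b, doit)
print(y)     # [0, 4, 6, 6, 6, 12, 12, 16]
print(done)  # [0, 1, 1, 0, 0, 1, 0, 1]
```

Between done pulses, \(Y\) retains the last sum, exactly as the waveform diagram shows.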

Note that the abstract behavior of the sequenced *AddR* machine is
quite easy to comprehend, even without analyzing every detail of the
synchronous sequential circuit.

### 6.5.2. Sequencing Series Compositions¶

We arrange a series composition of sequenced state machines
\(M_1\) and \(M_2\) as shown in Figure 6.59,
assuming output \(O_1\) of \(M_1\) and input \(I_2\) of
\(M_2\) are compatible. Composite machine \(M\) is capable of
sequencing by means of the *doit* and *done* signals. The delay of
\(M\) is the sum of the delays of constituent machines \(M_1\)
and \(M_2.\)

As an example, we study the sequenced version of an *average
machine* for computing the average of two
binary numbers

\[
Y = \frac{A + B}{2}.
\]

Assume we have a supply of sequenced *AddR* machines, as shown in
Figure 6.57 on the right, and, furthermore,
sequenced *HalfR* machines with a combinational divider-by-2 and a
register. Then, we can construct an average machine with the series
composition of Figure 6.60.

Without the sequencing logic, the average machine exhibits the output behavior

\[
Y[t] = \frac{A[t-2] + B[t-2]}{2},
\]

because the series composition of *AddR* and *HalfR* substitutes
\(X[t-1] = A[t-2] + B[t-2]\) into \(Y[t] = X[t-1]/2.\)

Including the sequencing logic, the composite average machine still has a delay of two clock cycles, such that \(done[t] = doit[t-2],\) and output function

\[
Y[t] = \begin{cases} (A[t-2] + B[t-2])/2 & \text{if } doit[t-2] = 1, \\ Y[t-1] & \text{otherwise.} \end{cases}
\]

Figure 6.61 illustrates the output behavior by means of a waveform diagram. We apply doit pulses in cycles 0, 1, 4, and 6. Due to the 2-cycle delay of the series composition, we expect valid outputs in cycles 2, 3, 6, and 8. Otherwise, output \(Y\) retains the output associated with the last done pulse. For example, in cycles 4 and 5, the machine outputs the result of the last computation in cycle 3, i.e. the average \(Y[3] = (A[1] + B[1])/2 = 3.\)

We interpret the waveform diagram such that the average machine samples the inputs during the doit pulses, outputs the average two cycles later accompanied by a done pulse, and retains the last average between two average computations.

Consider the composite state machine in Figure 6.62. We have a series composition of three *R* machines.
An *R* machine is a sequenced delay element or register, with
output behavior

\[
O[t] = \begin{cases} I[t-1] & \text{if } doit[t-1] = 1, \\ O[t-1] & \text{otherwise.} \end{cases}
\]

The *R* machine stores the input when a doit pulse arrives, and
outputs the input with a 1-cycle delay. The corresponding
sequencing logic produces done pulse \(done[t] = doit[t-1].\)
The series composition in Figure 6.62 connects
input \(X\) to the input of each *R* machine.

Our goal is to understand the behavior of outputs \(A,\) \(B,\) and \(C\) as a time dependent function of \(X.\) To that end, we analyze the behavior of the series composition by deducing a time table for input sequence \(X = \langle 1, 2, 3, \ldots \rangle.\) We assume that in cycle 0, all registers are reset to value 0. Furthermore, we apply a single doit pulse during cycle 0.

| cycle  | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|--------|---|---|---|---|---|---|---|
| \(X\)  | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| doit   | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| \(A\)  | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| \(B\)  | 0 | 0 | 2 | 2 | 2 | 2 | 2 |
| \(C\)  | 0 | 0 | 0 | 3 | 3 | 3 | 3 |

The doit pulse starts *R* machine \(A\) in cycle 0. Next in
the series composition, it starts *R* machine \(B\) in cycle 1,
and *R* machine \(C\) in cycle 2. In cycle 3, a done pulse
signals termination. Thereafter, the composite machine retains its
state until we apply another doit pulse. Tracing the doit pulse,
we find that the doit pulse in cycle 0 triggers output \(A[1]
= X[0] = 1,\) then in cycle 1 \(B[2] = X[1] = 2,\) and in cycle
2 \(C[3] = X[2] = 3.\) Output \(A\) retains its value
after cycle 1, output \(B\) after cycle 2, and output \(C\)
after cycle 3.
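Tracing the doit pulse through the chain is easy to simulate. In the Python sketch below (names are ours), `doit_b` and `doit_c` are the one- and two-cycle delayed copies of the doit pulse that trigger machines \(B\) and \(C\):

```python
def r_machine_series(x_seq, doit_seq):
    """Series of three sequenced R machines, all fed input X; the doit pulse
    ripples through the chain one cycle at a time (registers reset to 0)."""
    rows = {"A": [], "B": [], "C": []}
    a = b = c = 0        # register states of the three R machines
    doit_b = doit_c = 0  # delayed doit pulses for machines B and C
    for x, doit in zip(x_seq, doit_seq):
        rows["A"].append(a); rows["B"].append(b); rows["C"].append(c)
        if doit:   a = x  # machine A samples X on the doit pulse
        if doit_b: b = x  # machine B samples X one cycle later
        if doit_c: c = x  # machine C samples X two cycles later
        doit_b, doit_c = doit, doit_b  # shift the doit pulse down the chain
    return rows

rows = r_machine_series([1, 2, 3, 4, 5, 6, 7], [1, 0, 0, 0, 0, 0, 0])
print(rows["A"])  # [0, 1, 1, 1, 1, 1, 1]
print(rows["B"])  # [0, 0, 2, 2, 2, 2, 2]
print(rows["C"])  # [0, 0, 0, 3, 3, 3, 3]
```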

We can express this behavior by means of a simple program in your favorite programming language, e.g. C, Java, or Python, as a sequence of assignment statements:

```
A = 1;
B = 2;
C = 3;
```

The semicolon in a programming language signifies that the assignments must be executed in textual order, from top to bottom, in subsequent time steps. Actually, since newlines do not matter in programs, the one-liner:

```
A = 1; B = 2; C = 3;
```

constitutes an equivalent program, except that the textual order is now interpreted as left-to-right rather than top-to-bottom by convention. The sequenced series composition in Figure 6.62 implements the programs above in the form of a synchronous sequential circuit in hardware. Like a program, you can execute the state machine repeatedly by applying input sequence \(X\) accompanied by a doit pulse.

We can *reprogram* our series composition of *R* machines by
rewiring their inputs and outputs. For example, Figure 6.63 shows the series composition with a cyclic connection
of the *R* machines. The internal wiring of the *R* machines is
omitted for clarity. The output of machine \(A\) connects to
the input of machine \(C,\) the output of \(C\) to the
input of \(B,\) and the output of \(B\) to the input of
\(A.\)

When we apply a doit pulse, then during the first cycle we store
the output of \(B\) in \(A,\) which corresponds to program
statement `A = B;`. In the second cycle, we store the output of
\(C\) in \(B,\) i.e. execute statement `B = C;`, and in
the third cycle we store the output of \(A\) in \(C,\)
corresponding to statement `C = A;`. Thus, the rewired state
machine executes the program:

```
A = B;
B = C;
C = A;
```

You may recognize these three statements as a program for swapping registers \(B\) and \(C\) using \(A\) as temporary register. Note that a cyclic composition of registers without sequencing logic implements a different functionality.
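Because the sequenced cyclic composition executes one assignment per cycle in textual order, it behaves exactly like a serial program. The following sketch of our own makes the swap explicit, with the initial register values passed in as arguments:

```python
# Serial execution of the cyclic series composition, one assignment per cycle.
def cyclic_series(A, B, C):
    A = B   # cycle 0: doit pulse stores B in A
    B = C   # cycle 1: stores C in B
    C = A   # cycle 2: stores the new value of A in C
    return A, B, C
```

Starting from \((A, B, C) = (1, 2, 3),\) the function returns \((2, 3, 2)\): registers \(B\) and \(C\) are swapped, with \(A\) serving as the temporary register.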

### 6.5.3. Sequencing Parallel Compositions¶

Parallel compositions of sequenced state machines can be difficult to
tame. However, if we restrict ourselves to simplifying assumptions,
then it becomes straightforward to extend a parallel composition with
a sequencing capability. Figure 6.64 shows two
sequenced state machines \(M_1\) and \(M_2\) arranged in
parallel, and the sequencing logic with a *doit* input and a *done*
output.

This parallel composition exhibits the desired sequencing behavior, if

- input pairs \(I_1\) and \(I_2\) arrive simultaneously, so that a single *doit* signal suffices, and
- both machines \(M_1\) and \(M_2\) have the same delay.

If the inputs arrive in different clock cycles, then each input must
be associated with its own *doit* signal. Such a scenario requires
more complex synchronization circuitry than needed if we restrict all
machines to the black-box specification of Figure 6.54.
On the other hand, the first assumption is trivially fulfilled in the case
where a single input \(I\) drives \(I_1 = I\) and \(I_2 =
I.\) Then, the *doit* signal associated with input \(I\) is a
valid *doit* signal for both parallel machines \(M_1\) and
\(M_2.\)

The second assumption is necessary to ensure that the output pairs
\(O_1\) and \(O_2\) appear simultaneously, so that a single
*done* output suffices. If we compose two machines with unequal
delays, we can equalize the delays by increasing the delay of the
faster machine using a series composition with sequenced delay
elements. Then, the delay of the parallel composition is the maximum
of the delays of \(M_1\) and \(M_2.\) Note that the AND gate
is redundant if both assumptions are fulfilled. We could use either
of the two *done* signals as composite *done* output, and ignore the
other one.

Consider the parallel composition of three *R* machines shown in
Figure 6.65 with a single input
\(X\) accompanied by a *doit* signal, and outputs \(A,\)
\(B,\) and \(C\) accompanied by a *done* signal. The state
machine shown on the left resembles the parallel composition
template in Figure 6.64. As shown on the
right, we can save two sequencing D-flipflops and the AND gate, and
obtain the equivalent register machine *R3*.

To understand the behavior of outputs \(A,\) \(B,\) and \(C\) as a time-dependent function of \(X,\) we repeat our experiment of Example 6.6, and deduce a time table for input sequence \(X = \langle 1, 2, 3, \ldots \rangle.\) We assume that in cycle 0, all registers are reset to value 0, and we apply a single doit pulse.

| cycle | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-------|---|---|---|---|---|---|---|
| \(X\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| doit  | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| \(A\) | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| \(B\) | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| \(C\) | 0 | 1 | 1 | 1 | 1 | 1 | 1 |

In response to the doit pulse in cycle 0, all registers store \(X[0],\) so that \(A[1] = B[1] = C[1] = X[0] = 1.\) Also in cycle 1, the done pulse signals termination. Thus, the state of the machine does not change after cycle 1 until the next doit pulse arrives. The behavior of the parallel machine is simpler than that of the series composition. We can express the behavior as a program by means of three assignments:

```
A = 1;
B = 1;
C = 1;
```

The semantics of this program suggests an order of execution where, first, 1 is assigned to register \(A,\) then 1 to register \(B,\) and thereafter 1 to register \(C.\) This serial semantics is inherent in most of today’s programming languages. However, in our parallel machine, all three assignments occur simultaneously in cycle 0. Such a parallel assignment cannot be expressed in C and Java, but Python [2] does support a parallel assignment by using the list data structure:

```
[A, B, C] = [1, 1, 1]
```

Parallel assignments are natural in digital circuits, but tend to confuse serial programmers. Consider the cyclic parallel composition in Figure 6.66. The wiring connects the same register inputs and outputs as in Figure 6.63. The output of \(B\) is connected to the input of \(A,\) the output of \(C\) to the input of \(B,\) and the output of \(A\) to the input of \(C.\) A doit pulse in cycle 0 causes all registers to store their inputs, such that \(A[1] = B[0],\) \(B[1] = C[0],\) and \(C[1] = A[0].\) These equations expose the dependence on time, and in particular the values of the registers in cycle 1 as a function of their values in cycle 0.

If we express the state changes in terms of program statements, we
sacrifice the time information. We write assignment `A = B;` to
express \(A[1] = B[0],\) `B = C;` for \(B[1] = C[0],\)
and `C = A;` for \(C[1] = A[0].\) Combining these three
assignments into a one-liner yields:

```
A = B; B = C; C = A;
```

Interpreting this one-liner as a serial program does not reflect
the behavior of the *R3* machine in Figure 6.66. For example, given initial state \(A = 1,\)
\(B = 2,\) and \(C = 3,\) then executing the sequence of
assignment statements of the one-liner left-to-right results in
\(A=2,\) \(B=3,\) and \(C=2.\) The effect is a swap of
registers \(B\) and \(C\) using \(A\) as temporary
register. In contrast, register machine *R3* transforms initial
state \(A=1,\) \(B=2,\) and \(C=3\) within a single
cycle into state \(A=2,\) \(B=3,\) and \(C=1,\) which
is a rotation. A doit pulse triggers a parallel assignment in the
*R3* machine, which we express in Python as:

```
[A, B, C] = [B, C, A]
```

Although this Python version of a parallel assignment is written compactly as a one-liner, today’s computers need at least four clock cycles for its execution, and use one additional temporary register.
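The contrast between serial and parallel semantics can be checked directly in Python. The following small experiment of our own replays both interpretations of the one-liner:

```python
# Serial execution: three assignments in textual order swap B and C via A.
A, B, C = 1, 2, 3
A = B; B = C; C = A
assert (A, B, C) == (2, 3, 2)

# Parallel assignment: all right-hand sides are evaluated before any
# target is written, reproducing the single-cycle rotation of R3.
A, B, C = 1, 2, 3
A, B, C = B, C, A
assert (A, B, C) == (2, 3, 1)
```

Python evaluates the entire right-hand side of a multiple assignment first, which is exactly the simultaneous-capture behavior of edge-triggered registers.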

As an aside, we can design a composite state machine to swap two
registers without a third temporary register as a straightforward
parallel composition. Take the *R3* machine and remove register
\(A.\) Connect the output of register \(B\) to the input
of register \(C,\) and the output of \(C\) to the input of
\(B.\) The resulting swap machine has a single cycle delay and
performs the parallel assignment `[B, C] = [C, B]` by executing
state transition \(B[t] = C[t-1]\) and \(C[t] = B[t-1]\) in
response to a doit pulse in cycle \(t-1.\)

### 6.5.4. Conditional Composition¶

Every programming language offers a decision capability, typically in
form of an *if-then-else* statement. We present the sequencing logic
for a composite state machine capable of executing the program:

```
if (predicate == 1) then
    consequent;
else
    alternative;
```

The combinational *predicate* and the sequenced state machines for the
*consequent* and the *alternative* logic depend on the application.
Figure 6.67 shows the conditional composition of the
consequent and alternative state machines. Inputs and outputs, except
those of the sequencing logic, are omitted. In case no alternative is
desired, replace the alternative state machine with a sequencing
D-flipflop.

The combinational predicate logic outputs a *predicate* signal.
The composite machine receives doit pulses from the *doit* input.
While the *predicate* equals 1, doit pulses are steered to the
consequent state machine. Otherwise, if the *predicate* signal is 0,
a doit pulse is steered to the alternative state machine. Hence,
either the consequent or the alternative machine is started in
response to a doit pulse, depending on the *predicate* signal. When
one of the consequent or alternative machines outputs a done pulse, the
OR gate steers the done pulse to the *done* output of the composite
machine. The delay of the conditional composition is at least one
clock cycle, because the consequent and alternative are sequenced
state machines, which require at least one sequencing D-flipflop.
The actual delay of the conditional composition may vary, depending
on the delays of the consequent and alternative machines.

We wish to design a sequenced state machine for an **up-down
counter**. An up-down counter receives the counting direction
*dir* as an input. If the *dir* signal equals 1, the counter shall
increment its value by 1, like the *counter state machine*. Otherwise, if *dir* equals 0, the counter shall
decrement its value by 1. We summarize the behavior of the up-down
counter machine in form of a program, given counter register
\(C\) and input *dir*:

```
if (dir == 1) then
    C = C + 1;
else
    C = C - 1;
```

Before translating this program into a sequenced state machine, we notice that a single counter register \(C\) suffices to store the counter value. Therefore, we do not honor the clean separation into independent consequent and alternative state machines suggested in Figure 6.67, but share register \(C\) between them. Figure 6.68 shows the resulting state machine design.

The predicate logic implements an identity function, such that a
doit pulse is steered to the consequent when the *dir* input equals
1 and to the alternative when *dir* equals 0. The consequent
portion of the machine has a combinational adder to increment
\(C\) by 1, and the alternative portion has a combinational
subtractor to decrement \(C\) by 1. The *tristate* buffers at the adder and subtractor outputs
implement a multiplexer. If the *dir* signal equals 1, then the
consequent tristate buffer is enabled during a doit pulse, steering
\(C+1\) to the input of register \(C.\) Otherwise, if the
*dir* signal equals 0 during a doit pulse, the alternative tristate
buffer steers \(C-1\) to the input of register \(C.\)

The D-flipflops of the sequencing logic in both consequent and
alternative portions of the machine delay the doit pulse by one
clock cycle. Consequently, we have \(done[t] = doit[t-1],\)
independent of the *dir* signal. Furthermore, the OR gate at the
enable input of register \(C\) guarantees that the doit pulse
enables \(C\) independent of its path through the consequent or
alternative portions of the machine. Thus, when the up-down
counter receives a doit pulse in cycle \(t-1\) and the *dir*
signal is 1, requesting a counter increment, then the counter
register transitions to \(C[t] = C[t-1] + 1.\) Otherwise, if
the *dir* signal is 0, the doit pulse triggers the state
transition \(C[t] = C[t-1] - 1.\)

| cycle | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------|---|---|---|---|---|---|---|---|---|---|
| dir   | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| doit  | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| \(C\) | 0 | 1 | 2 | 2 | 3 | 2 | 1 | 1 | 1 | 2 |

The time table above illustrates the behavior of the up-down counter, assuming the counter register is reset to \(C=0\) in cycle 0. Since the machine is sequenced, the counter changes in response to receiving a doit pulse only.
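The time table can be reproduced with a behavioral sketch of our own that captures only the register-transfer behavior of the up-down counter, abstracting away the tristate buffers and sequencing flipflops:

```python
# Behavioral sketch of the sequenced up-down counter: the counter
# register changes only in response to a doit pulse, in the direction
# selected by the dir input.
def updown_counter(dir_seq, doit_seq):
    C = 0                      # counter register reset in cycle 0
    trace = [C]
    for d, doit in zip(dir_seq, doit_seq):
        if doit:
            C = C + 1 if d == 1 else C - 1
        trace.append(C)
    return trace
```

With the *dir* and *doit* sequences of the time table, `updown_counter([1,1,1,1,0,0,0,1,1,0], [1,1,0,1,1,1,0,0,1,0])` yields counter sequence \(0, 1, 2, 2, 3, 2, 1, 1, 1, 2.\)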

### 6.5.5. Loop Composition¶

We cap our study of sequenced state machines with another fundamental
sequencing construct, the *while loop*. Well known among computer
programmers, a while loop has the general program structure:

```
while (predicate == 1)
body;
```

The semantics of the while loop is to exit the loop if the *predicate*
is not equal to 1 or, otherwise, to execute its *body* and then
restart the loop execution. Figure 6.69 shows the sequencing
logic for a loop composition of a combinational *predicate* and a
sequenced *body* state machine. Inputs and outputs, besides *doit*
and *done*, are omitted because they are application specific.

The combinational predicate logic outputs a *predicate* signal. When
a doit pulse arrives at the *doit* input and the *predicate* equals 1,
the doit pulse starts the *body* state machine. Upon termination, the
*body* outputs a done pulse, which is fed back through the OR gate,
where we reinterpret it as a doit pulse to restart the *body* machine
provided the *predicate* is still 1. Otherwise, if the *predicate* is
0, the doit pulse enters the sequencing D-flipflop at the *done*
output of the loop machine, and is output as a done pulse after a
delay of one clock cycle. Thus, if the *predicate* is 0 upon
receiving an external doit pulse or an internal done pulse from the
body machine, we terminate the loop and output a done pulse.

The delay of the loop composition is at least one clock cycle to
compute the *predicate* and to start the *body* or terminate the loop.
Otherwise, the delay depends on the delay of the *body* machine, which
is at least one clock cycle as for any sequenced state machine, and on
the number of loop iterations.

We wish to design a loop machine that counts from 0 up to 3 and terminates. We can specify the desired behavior as a program with a while loop, assuming we have a register to store iteration count \(i\):

```
i = 0;
while (i < 3)
    i = i + 1;
```

This program generates state sequence \(i = \langle 0, 1, 2, 3\rangle\) and terminates, as desired. The loop machine shown in Figure 6.70 implements the program.

The predicate logic performs the comparison \(i < 3,\) which we
can implement with a combinational *magnitude comparator*. Assume the comparator output is equal to
1 if \(i < 3,\) and 0 otherwise. Then, the comparator output
matches the requirements of the *predicate* test in the loop
composition pattern in Figure 6.69. The *body* machine is
essentially the *AddR* machine of Figure 6.57
with one input tied to constant 1, and the other input connected to
the output of register \(i.\)

We examine the behavior of the loop counter machine by studying the
time table below. Assume that register \(i\) of the *body*
machine is reset to value 0 in cycle 0, and we apply a doit pulse
in cycle 0.

| cycle | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------|---|---|---|---|---|---|---|---|---|---|
| doit  | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| \(i\) | 0 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| done  | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

In cycle 0, the value of the counter register is \(i = 0,\)
which is less than 3. Therefore, the magnitude comparator of the
predicate logic outputs a 1. As a result, the external doit pulse
in cycle 0 starts the *body* machine. The *body* machine
transitions to \(i[1] = i[0] + 1 = 1\) and outputs a done pulse
during cycle 1.

In cycle 1, the magnitude comparator finds that \(i = 1\) is
less than 3, and outputs a 1. Consequently, the done pulse of the
*body* machine, which passes through the OR gate, starts the *body*
machine again. The *body* machine transitions to \(i[2] =
i[1] + 1 = 2\) and outputs a done pulse during cycle 2. During
cycle 2, the loop machine executes analogously to cycle 1, with the
result that the *body* machine transitions to \(i[3] = 3\) and
issues a done pulse during cycle 3.

In cycle 3, the magnitude comparator outputs a 0, because \(i
= 3\) is not less than 3. Therefore, the done pulse of the *body*
machine now enters the sequencing D-flipflop of the *done* output
of the counter machine. After a delay of one cycle, i.e. during
cycle 4, the done pulse appears on the *done* output, and signals
termination of the counter execution.
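The cycle-by-cycle reasoning above can be checked with a small simulation. The sketch below is our own model of the loop sequencing, assuming the body (the *AddR* machine) has a one-cycle delay and the predicate \(i < 3\) steers each pulse either into the body or into the done flipflop:

```python
# Sketch of the loop machine: an OR gate merges external doit pulses
# and internal done pulses of the body; the predicate steers the pulse.
def loop_counter(cycles=10):
    i = 0
    body_done = 0        # done pulse of the body (AddR) machine
    done_ff = 0          # sequencing D-flipflop at the done output
    trace = []
    for t in range(cycles):
        doit = 1 if t == 0 else 0
        trace.append((i, done_ff))          # values during cycle t
        pulse = doit or body_done           # OR gate merges the pulses
        if pulse and i < 3:                 # predicate 1: start the body
            i, body_done, done_ff = i + 1, 1, 0
        elif pulse:                         # predicate 0: signal termination
            body_done, done_ff = 0, 1
        else:
            body_done, done_ff = 0, 0
    return trace
```

Running `loop_counter()` reproduces the \(i\) and done rows of the time table, including the done pulse in cycle 4.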

We are given a state machine with value ranges and state transition table:

Describe the functionality of the state machine.

Which output sequence does the machine generate for input sequence

\[\text{in} = \langle 2, 0, 0, 1, 1, 1, 2, 0, 1 \rangle\,.\]

Give an input sequence that generates output sequence

\[\text{out} = \langle 0, 2, 1, 0, 0, 1, 2, 0, 1 \rangle\]

with initial state \(S[0] = 0.\)

We study the behavior of the state machine by transliterating the state transition table into a state transition diagram. The machine has three states \(S \in \{ 0, 1, 2 \}.\) Since next state \(S'\) equals output \(\text{out},\) we can observe \(S'\) at the output of the machine. Furthermore, we conclude that the state machine is a Mealy machine because the output is a function of the state and the input. The machine produces different outputs for different inputs in the same state. In contrast, a Moore machine produces one output in each state, independent of the input. We develop the state transition diagram by inserting for each state all outgoing transition arcs, see (a), (b), and (c):

The complete state transition diagram is shown in (d), with the arcs colored to emphasize the three functions:

- modulo-3 up-counter when \(\text{in} = 0\) (blue arcs),
- modulo-3 down-counter when \(\text{in} = 1\) (red arcs),
- reset to 0 when \(\text{in} = 2\) (green arcs).

We conclude that the state machine is an up-down counter with reset capability.

The output sequence of a machine would be unique, if the state of the first clock cycle were known. Although no start state \(S[0]\) is given, the output associated with \(in[0] = 2\) is known. Since input \(in[0] = 2\) resets the machine to state 0, we know next state \(S[1] = 0.\) Thus, because the next state equals the current output, i.e. \(S[t+1] = \text{out}[t],\) we know that \(\text{out}[0] = S[1] = 0.\) We use a time table to derive the output sequence:

| t     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|---|
| in    | 2 | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 1 |
| \(S\) | X | 0 | 1 | 2 | 1 | 0 | 2 | 0 | 1 |
| out   | 0 | 1 | 2 | 1 | 0 | 2 | 0 | 1 | 0 |

We mark state \(S[0]\) with an \(X\) to indicate that the state is unknown. The next states are specified in the state transition table and the diagram. State sequence \(S\) is output sequence \(\text{out}\) shifted by one clock cycle to the right.
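The derivation can be double-checked with a sketch of the transition function, written from the state transition diagram; since the machine is a Mealy machine whose output equals the next state, one function suffices:

```python
# Transition function of the up-down counter with reset:
# in = 0 counts up modulo 3, in = 1 counts down, in = 2 resets to 0.
def step(S, inp):
    if inp == 0:
        return (S + 1) % 3
    elif inp == 1:
        return (S - 1) % 3
    return 0

def run(in_seq, S=0):
    out = []
    for inp in in_seq:
        S = step(S, inp)
        out.append(S)        # output equals next state: out[t] = S[t+1]
    return out
```

Calling `run([2, 0, 0, 1, 1, 1, 2, 0, 1])` reproduces the output row of the time table; the choice of start state is irrelevant here, because \(\text{in}[0] = 2\) resets the machine.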

An output sequence of the state machine may be generated by more than one input sequence, because the state machine has multiple arcs between pairs of vertices, from state 1 to state 0 and from state 2 to state 0. We derive one possible input sequence using a time table. Output sequence \(\text{out}\) is given. Since state sequence \(S\) is output sequence \(\text{out}\) shifted by one cycle to the right, we also know state sequence \(S\) with the exception of \(S[0].\) However, \(S[0]\) is given as \(S[0] = 0.\)

| t     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|---|
| out   | 0 | 2 | 1 | 0 | 0 | 1 | 2 | 0 | 1 |
| \(S\) | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 2 | 0 |
| in    | 2 | 1 | 1 | 1 | 2 | 0 | 0 | 0 | 0 |

At the end of cycle \(t=0,\) the state machine transitions from \(S[0] = 0\) to \(S[1] = \text{out}[0] = 0.\) This transition requires input \(\text{in}[0] = 2,\) as we see in the state transition table or the diagram. Analogously, transition \(S[1] = 0 \rightarrow S[2] = 2\) requires input \(\text{in}[1] = 1,\) and transition \(S[2] = 2 \rightarrow S[3] = 1\) requires \(\text{in}[2] = 1.\) The transition from \(S[3]=1\) to \(S[4]=0\) is not unique. We may choose input \(\text{in}[3]=1\) or input \(\text{in}[3]=2,\) and we choose \(\text{in}[3] = 1\) arbitrarily. The other transition with a choice of input occurs at the end of cycle \(t=7.\) We may choose input \(\text{in}[7] = 0\) or \(\text{in}[7]=2\) to transition from \(S[7]=2\) to \(S[8]=0.\) In the time table we pick \(\text{in}[7]=0.\) Since the time table has two clock cycles with two choices for the input value, there exist four input sequences that generate the desired output sequence.

Implement the state machine of Exercise 6.8
using a *conditional composition*.

Specify the predicate, consequent, and alternative. Assume you can use resettable registers.

Design the state machine.

Verify your state machine by constructing a time table for input sequence

\[\text{in} = \langle 2, 0, 0, 1, 1, 1, 2, 0, 1 \rangle\,.\]

Recall the functionality of the state machine in Exercise 6.8, here written in form of a conditional statement:

if (in == 0) count up modulo-3; else if (in == 1) count down modulo-3; else if (in == 2) reset to 0.

Assume we implement the reset capability with resettable registers, and encode the input as a binary number. Then, we can interpret the *msb* as *reset* input and the *lsb* as control input *dir* for the counter direction:

| \(\text{in}_{10}\) | \(\text{in}_2\) | *reset* | *dir* |
|---|---|---|---|
| 0 | 00 | 0 | 0 |
| 1 | 01 | 0 | 1 |
| 2 | 10 | 1 | 0 |

Hence, we plan to use the conditional composition, shown again on the right, to control the direction of the up-down counter. The predicate is the trivial identity function of input *dir*, if we use the alternative SM to implement the up-counter and the consequent SM for the down-counter.

In the following, we study the design of up and down counters for the alternative and consequent SMs. We begin with a state transition table for the up-counter using binary format for decimal numbers 0, 1, and 2.

| \(S\) | \(S'\) |
|---|---|
| 00 | 01 |
| 01 | 10 |
| 10 | 00 |

The up-counter modulo-3 wraps around from state \(S = 2\) to next state \(S'=0.\) Denoting the state bits as \(S = S_1 S_0\) and \(S' = S'_1 S'_0,\) the minimal next state logic is:

\[S'_1 = S_0\,,\qquad S'_0 = \overline{S_1 + S_0}\,.\]

Thus, the up-counter modulo-3 consists of a 2-bit state register with a feedback loop. The gate-level circuit is shown in (a) below.

The down-counter modulo-3 wraps around from state \(S=0\) to next state \(S' = 2.\) Its state transition table is:

| \(S\) | \(S'\) |
|---|---|
| 00 | 10 |
| 01 | 00 |
| 10 | 01 |

The corresponding next state logic is:

\[S'_1 = \overline{S_1 + S_0}\,,\qquad S'_0 = S_1\,,\]

and the gate-level circuit is shown in (b) above.
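Both pairs of next-state equations can be checked exhaustively against their transition tables. The snippet below is a verification sketch of our own, encoding each state as a 2-bit integer:

```python
# Exhaustive check of the minimal next-state equations against the tables.
up   = {0b00: 0b01, 0b01: 0b10, 0b10: 0b00}   # up-counter modulo 3
down = {0b00: 0b10, 0b01: 0b00, 0b10: 0b01}   # down-counter modulo 3
for S, Sn in up.items():
    S1, S0 = (S >> 1) & 1, S & 1
    nor = 1 - (S1 | S0)
    assert Sn == (S0 << 1) | nor              # S'1 = S0, S'0 = NOR(S1, S0)
for S, Sn in down.items():
    S1, S0 = (S >> 1) & 1, S & 1
    nor = 1 - (S1 | S0)
    assert Sn == (nor << 1) | S1              # S'1 = NOR(S1, S0), S'0 = S1
```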

To use the two modulo-3 counters for the alternative and consequent SMs, we would extend them with sequencing logic by using enabled state registers and sequencing D-flipflops. However, using two separate state registers, one in the up-counter and another in the down-counter, does not solve our problem of designing a combined up-down counter. For example, consider separate state registers and apply a reset such that \(S_{up} = S_{down} = 0.\) Then, if we wish to count up, we would apply input *dir* \(= 0,\) so that \(S'_{up} = 1.\) If we wish to count down in the next clock cycle, we apply input *dir* \(= 1,\) causing next state \(S'_{down} = 2.\) As a result, our two state registers would have different values, neither of which is the desired counter value 0 that we would expect from an up-down counter after counting up and down, producing state sequence \(0 \rightarrow 1 \rightarrow 0.\) We conclude that we need to share a single state register between the consequent and alternative SMs, such that each SM contains the next state logic of the corresponding counter only.

We design the state machine based on the conditional composition by merging the state registers of the up and down counters, and introduce multiplexers to select the next state signals for 2-bit state register \(S.\) The consequent SM contains the next state logic of the down counter and a sequencing D-flipflop, and the alternative SM the next state logic of the up counter and a sequencing D-flipflop. The multiplexers at the state register inputs select the next state from the consequent or alternative depending on input *dir*. We enable state register \(S\) whenever one of the consequent or alternative receives a *doit* pulse. Furthermore, we reset register \(S\) if the *reset* input of the SM is 1 when a *doit* pulse arrives.

Now that we have a design for the counter SM based on the conditional composition, we notice that there are opportunities to simplify the circuit. An obvious example is the redundant NOR gate: we need only one NOR gate rather than two. Also, we can save OR gates to produce the enable and done signals. We leave the optimization of the circuit to the reader.

We verify the up-down counter machine assuming that the *doit* signal is always 1. Here is the time table for the given input sequence with the *reset* and *dir* inputs listed separately:

| cycle | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|---|
| in    | 2 | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 1 |
| reset | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| dir   | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| doit  | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| \(Y\) | X | 0 | 1 | 2 | 1 | 0 | 2 | 0 | 1 |
| done  | X | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

The verification effort lies in arguing why our SM produces output sequence \(Y.\) Since we are not given a start state, we assume that \(Y[0] = S[0]\) is undefined, and the *done* signal in cycle 0 is undefined because the state of the two sequencing D-flipflops is unknown. During cycle 0, the reset signal is 1, which resets the state register, so that \(S[1] = 0\) and \(Y[1] = S[1] = 0.\) Note that the *doit* pulse traverses the D-flipflop of the alternative because *dir*\([0] = 0,\) so that *done*\([1] = 1.\) During cycle 1, the *reset* input is 0 and the *dir* input is 0. We expect the SM to count up. Since *dir* \(= 0,\) the *doit* pulse activates the alternative, and steers the next state computed in the alternative into register \(S.\) Therefore, the machine sets \(Y[2] = 1\) and *done*\([2] = 1,\) as expected. The remaining cycles of the time table can be verified analogously.

We may also want to verify that the SM retains its state when the *doit* input is 0. Inspecting our SM circuit, we find that *doit* \(= 0\) implies that the enable and reset inputs of state register \(S\) are both 0, so that \(S\) retains its current state. Furthermore, *doit* \(= 0\) implies that both sequencing D-flipflops store next state 0, so that the *done* output of the next cycle equals 0.

Design a primitive right-shift machine for 4-bit numbers.

Specify input \(I[t]\) and output \(O[t]\) of the state machine.

Design a primitive right-shift machine.

Extend the right-shift machine with sequencing logic.

Construct a time table for sequences:

\[\begin{eqnarray*} \text{I} &=& \langle 1, 1, 0, 0, 1, 1, 0, 0 \rangle\,, \\ \text{doit} &=& \langle 1, 0, 1, 1, 1, 0, 1, 0 \rangle\,. \end{eqnarray*}\]

Assume initial state \(O[0]=0000_2.\)

First, we clarify the behavior that we expect from a primitive right-shift machine for 4-bit numbers. Consider 4-bit number \(A = a_3 a_2 a_1 a_0.\) Shifting \(A\) by 1 position to the right produces a new number

\[A >> 1 = a_4 a_3 a_2 a_1\]

that drops *lsb* \(a_0\) of \(A\) and inserts a new *msb* \(a_4.\) Bit \(a_4\) must come from somewhere. So, we assume it is supplied as an input to the primitive state machine. In fact, if the input is a sequence of bits, one bit per clock cycle, the machine can produce a new number by right shifting in each cycle. Such a machine requires a 1-bit input and a 4-bit output, as shown on the right.

Next, we express the output as a function of the input. We consider 1-bit input \(I\) and 4-bit output \(O = O_{3{:}0}.\) During cycle \(t,\) the machine shall output \(O_{3{:}0}[t-1]\) of the previous cycle, right shifted by 1 position, with input \(I[t-1]\) of the previous cycle prepended as the new *msb*. Using braces to denote the concatenation of bit fields, we write output

\[O_{3{:}0}[t] = \{ I[t-1], O_{3{:}1}[t-1] \}\,.\]

For example, if \(O_{3{:}0}[t-1] = 1101\) and \(I[t-1] = 0,\) then \(1101\) is shifted to the right, such that \(O_{2{:}0}[t] = O_{3{:}1}[t-1] = 110,\) and \(O_{3{:}0}[t] = \{ 0, 110\} = 0110.\)

Recall the functionality of a *shift register*. A 4-bit shift register matches our requirements, if we connect input \(I\) to the serial input of the shift register and parallel output \(Y_{0{:}3}\) of the shift register to output \(O_{3{:}0}\) of the right-shift machine. It takes four clock cycles to load and output the first 4-bit number. The *msb* appears at output \(O_3.\) Thereafter, in each cycle, we input a new *msb* and right shift the number of the previous cycle.

We extend the primitive right-shift machine to obtain a sequenced state machine. To that end, we augment the input with a *doit* signal and the output with a *done* signal. Furthermore, we enable the registers of the shift register, and connect the *doit* signal to all enable inputs, because we want to shift each register's state into the register to the right within a single clock cycle. Since a right shift takes one clock cycle, we insert a single sequencing D-flipflop to pass the *doit* input to output *done*.

We verify our right-shift state machine by means of a time table. Input sequences \(I\) and *doit* are given. Output sequence *done* is the *doit* sequence delayed by 1 cycle through the D-flipflop. We record the elements of output sequence \(O\) in 4-bit binary format.

| t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| \(I\) | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| doit | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| \(O\) | 0000 | 1000 | 1000 | 0100 | 0010 | 1001 | 1001 | 0100 |
| done | X | 1 | 0 | 1 | 1 | 1 | 0 | 1 |

Initial state \(O[0]\) is given as \(0000_2.\) During cycle 0, we shift in input \(I[0] = 1\) and right shift \(O[0],\) so that \(O[1] = \{1, 000\} = 1000.\) Since *doit*\([1] = 0,\) the shift register retains its state of cycle 1 in cycle 2. During cycle 2, we shift in *msb* \(I[2] = 0\) to obtain \(O[3] = \{ 0, 100 \} = 0100.\) The remaining cycles are derived analogously.
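The time table can also be generated mechanically. The following behavioral sketch of our own represents the 4-bit shift register state as a bit string \(o_3 o_2 o_1 o_0\):

```python
# Behavioral sketch of the sequenced 4-bit right-shift machine.
def right_shift_machine(I_seq, doit_seq, O="0000"):
    trace = [O]
    for i, doit in zip(I_seq, doit_seq):
        if doit:                    # enabled registers shift on a doit pulse
            O = str(i) + O[:-1]     # input I becomes the new msb, lsb drops
        trace.append(O)
    return trace
```

With the \(I\) and *doit* sequences of the exercise, the function reproduces the \(O\) row of the time table above.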

Design a *loop machine* for 4-bit number
\(x\) that executes:

```
while (x < 15)
    x = x / 2 + 8;
```

- Design the body machine using a 4-bit right-shift machine.
- Design the loop machine.
- Verify the loop machine for initial state \(x = 4.\)

The while loop has *predicate* \(x < 15\) and an arithmetic
expression in its *body* statement: \(x = x / 2 + 8.\) We
assume that \(x\) is an unsigned integer, and the arithmetic is
integer arithmetic. In particular, in most programming languages
integer division produces the quotient and ignores the remainder.

To implement the body with the 4-bit right-shift machine of Exercise 6.10, we inspect the arithmetic expression of the body statement. First, we notice that for unsigned binary numbers integer division by 2 is equal to a right shift by 1 position. For example, \(13 / 2 = 6\) in decimal format, whereas \(1101 >> 1 = 0110\) in binary format. Thus, for 4-bit numbers, we can use our right-shift machine with input 0 into the *msb* position to implement the integer division by 2. Second, we observe that the 4-bit binary representation of \(8_{10}\) is \(1000_2.\) Since \(x/2\) produces a quotient with *msb* \(= 0,\) adding \(8_{10}\) can be accomplished by replacing the 0 in the *msb* with a 1. For example, given \(x = 1101_2,\) we obtain:

\[\begin{eqnarray*} x / 2 &=& \phantom{+} 0110 \\ + 8 && + 1000 \\ \hline x/2+8 &=& \phantom{+} 1110 \end{eqnarray*}\]

The 4-bit right-shift machine enables us to add 8 by shifting in a 1, which sets the *msb* to 1. We conclude that the 4-bit right-shift machine is capable of evaluating \(x / 2 + 8,\) assuming the shift register contains 4-bit number \(x.\) Furthermore, the shift register stores the result of the arithmetic evaluation in the next clock cycle, such that \(x[t] = x[t-1] / 2 + 8.\) Therefore, the 4-bit right-shift machine implements the body machine of the while loop as is.

The design of the loop machine for the while loop requires a magnitude comparator for the predicate, and wiring of the 4-bit right shifter as body state machine:

We assume that the shift register of the body machine is initialized with value \(x.\) The associated initialization logic may be based on a shift register with parallel load, see Figure 6.14.
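The two arithmetic observations behind the body machine can be checked exhaustively over all 4-bit values. This is a quick verification sketch of our own:

```python
# For 4-bit unsigned x, x // 2 equals a right shift by 1, and adding 8
# to the shifted value equals setting its msb, because that msb is
# always 0 after the shift.
for x in range(16):
    assert x // 2 == x >> 1
    assert (x >> 1) + 8 == (x >> 1) | 0b1000
```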

We verify the functionality of our while loop machine by means of a time table for initial state \(x = 4.\) Assuming that a single *doit* pulse starts the loop machine, the time table traces state sequence \(x\) of the shift register in the body state machine using decimal format:

| t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| doit | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| \(x\) | 4 | 10 | 13 | 14 | 15 | 15 | 15 | 15 |
| done | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

The *doit* pulse in cycle 0 starts the body machine, because predicate \(x < 15\) evaluates to 1 for \(x[0] = 4.\) We find \(x[1] = x[0] / 2 + 8 = 10.\) The body machine outputs a *done* pulse during cycle 1, restarting itself, because predicate \(10 < 15\) is true. Therefore, in cycle 2 we have \(x[2] = x[1] / 2 + 8 = 13.\) Since \(13 < 15,\) the body machine is active in cycle 3, where \(x[3] = x[2] / 2 + 8 = 14,\) and in cycle 4, because \(14 < 15.\) During cycle 4, \(x[4] = x[3]/2 + 8 = 15,\) and predicate \(15 < 15\) is false. Therefore, the body machine is not activated in cycle 5. Instead, the sequencing D-flipflop outputs a *done* pulse to signal termination of the while loop.
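Since the body machine evaluates exactly the loop's arithmetic, executing the while loop itself in Python reproduces the state sequence for initial state \(x = 4\):

```python
# The while loop of the exercise, using integer division as in the body
# machine; trace collects the value of x in each iteration.
x = 4
trace = [x]
while x < 15:
    x = x // 2 + 8
    trace.append(x)
```

After the loop, `trace` holds the sequence \(4, 10, 13, 14, 15,\) matching the \(x\) row of the time table.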

## 6.6. Timing Analysis¶

The performance of synchronous sequential circuits is determined by the clock period. The smaller the clock period \(T_\phi,\) the more state transitions we can execute per time unit. The clock frequency \(f\) is the reciprocal of the clock period,

\[f = \frac{1}{T_\phi}\,,\]

and is a simpler indicator of performance, because it is directly
proportional to the number of state transitions per time unit. Since
the clock period is a time period, it is measured in seconds. The
unit of frequency is the **Hz**, pronounced *Hertz*, in honor of the
German physicist Heinrich Hertz (1857-1894):

\[1\,\text{Hz} = \frac{1}{1\,s}\,.\]

Typical clock frequencies of state-of-the-art digital circuits are in the range of GigaHertz (GHz) or \(10^9\) cycles per second. The corresponding clock period lasts one nanosecond, or \(10^{-9}\,s = 1\,\mathit{ns}.\)

The two primary factors that determine the clock frequency of a synchronous sequential circuit are (1) the timing behavior of the registers at the triggering clock edge and (2) the timing behavior of the combinational portion of the circuit. In this section, we examine the effects of timing on the design of synchronous logic.

### 6.6.1. Timing Characteristics in Synchronous Logic¶

Synchronous logic consists of registers and combinational logic.
Recall the *timing behavior* of combinational
logic. Figure 6.71 illustrates the characteristic
delays. The *contamination delay* \(t_{cd}\) is the minimum delay
from the input transition to the output to start its transition. The
*propagation delay* \(t_{pd}\) is the maximum delay from the input
transition to the output transition to stabilize.

In synchronous sequential circuits, all registers are triggered by the
same global clock signal. Proper timing behavior of all registers is
a prerequisite for reliable operation of the circuit as a whole.
Registers consist of one or more *D-flipflops*. The
timing behavior of a D-flipflop as a series composition of two
*D-latches* is determined by the timing behavior of the
master D-latch. The triggering clock edge of a D-flipflop is the
clock edge that switches the master D-latch from transparent into
opaque mode. For the master D-latch to capture the input signal
properly, input \(D\) must be stable for the *setup time*
\(t_{setup}\) before the clock edge and for the *hold time*
\(t_{hold}\) after the clock edge.

Figure 6.72 illustrates the setup time and hold time
around the triggering positive clock edge. Depending on the
transistor-level design of the register, the transition of output
\(Q\) is not necessarily a sharp edge, but may take some time. In
general, the output of a register starts to transition after the clock
edge, and may exhibit glitches or multiple transitions before
stabilizing, as we know from the timing behavior of *feedback
loops*. We define the start and end times of the
output transition relative to the triggering clock edge:

- The **contamination delay** \(t_{ccq}\) (clock-to-Q) of a register is the time after the triggering clock edge when output \(Q\) starts transitioning.
- The **propagation delay** \(t_{pcq}\) (clock-to-Q) of a register is the time period after the triggering clock edge beyond which output \(Q\) is guaranteed to be stable.

We note that the hold time of a register affects the timing behavior of the master D-latch. Although the hold time is typically less than \(t_{ccq},\) as indicated in Figure 6.72, different register circuits may have hold times that exceed \(t_{ccq}\) and even exceed \(t_{pcq}.\)

A synchronous sequential circuit combines registers with combinational logic. In Figure 6.73, we show the characteristic time periods of a simple feedback loop, whose timing behavior is representative for any synchronous circuit. The clock period \(T_\phi\) spans the time period between triggering clock edges. The register requires that next state input \(S'\) is stable during the setup time before and the hold time after the triggering clock edge. The earliest time next state \(S'\) begins to transition depends on the contamination delay of the register and the combinational logic. Output \(S'\) of the next state logic stabilizes at latest depending on the propagation delay of the register and the combinational logic.

The clock period of the circuit is equal to the sum of the characteristic delays

\[T_\phi = t_{pcq} + t_{pd} + t_{pslack} + t_{setup}\,,\]

where \(t_{pcq}\) is the register specific propagation delay from
the triggering clock edge until output \(S\) has stabilized,
\(t_{setup}\) is the register specific setup time, \(t_{pd}\)
is the propagation delay of the combinational logic, and
\(t_{pslack}\) is the **propagation slack** of the design. We
discuss the propagation slack, as well as the **contamination slack**
\(t_{cslack}\) shown in the timing diagram, below. The
register specific delays \(t_{pcq}\) and \(t_{setup}\) are
commonly referred to as **sequencing overhead**. An ideal register
would have zero sequencing overhead.

### 6.6.2. Setup-Time Constraint¶

If we are interested in maximizing the performance of a circuit by
minimizing the clock period, we must be sure that next state signal
\(S'\) is stable before the setup time of the register begins. In
Figure 6.73, propagation slack \(t_{pslack}\) is the
time period between \(S'\) having stabilized and the beginning of the
setup time. If we have the choice, we may shorten clock period
\(T_\phi\) such that \(t_{pslack}=0.\) Since this is the
shortest clock period the circuit supports reliably, we formulate the
**setup-time constraint** for synchronous circuits as the inequality

\[T_\phi \;\ge\; t_{pcq} + t_{pd} + t_{setup}\,.\]

If \(T_\phi\) is equal to the sum \(t_{pcq} + t_{pd} + t_{setup},\) then our design has no propagation slack. We may characterize a violation of the setup-time constraint with a negative propagation slack \(t_{pslack} < 0.\)

Typical synchronous circuits consist of many registers with combinational logic in between, each of which with its own propagation delay. For reliable operation of the circuit as a whole, we must substitute the maximum propagation delay of all combinational circuits for \(t_{pd}\) in the setup-time constraint. Consequently, paths with faster combinational logic than the maximum propagation delay have propagation slack \(t_{pslack} > 0.\) This slack represents a safety margin for the timing behavior of the logic path. On the flip side, the slack also represents an inefficiency, because the path does not exhaust the entire propagation delay available within the clock period.
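The bookkeeping described above is simple arithmetic; a minimal Python sketch (function names are our own) makes it explicit, including the substitution of the maximum propagation delay over all paths:

```python
def propagation_slack(T_phi, t_pcq, t_pd, t_setup):
    # Propagation slack of one register-to-register path.
    # Positive slack is a safety margin; negative slack signals
    # a setup-time violation.
    return T_phi - (t_pcq + t_pd + t_setup)

def min_clock_period(t_pcq, t_setup, path_delays):
    # For a circuit with many paths, substitute the maximum
    # propagation delay of all combinational paths for t_pd.
    return t_pcq + max(path_delays) + t_setup
```

For example, `min_clock_period(8, 19, [30, 42, 56])` evaluates to 83, and the path with delay 30 then has a propagation slack of 26 time units.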

In practice, the design of synchronous sequential circuits is often guided by a desired maximum clock period. Since digital designers have little to no control over the register specific delays \(t_{pcq}\) and \(t_{setup},\) because they are given by the manufacturer of the circuit, the only tuning knob available to the designer is the propagation delay of the combinational logic. Rearranging the setup-time constraint, we find

\[t_{pd} \;\le\; T_\phi - (t_{pcq} + t_{setup})\,.\]

Given clock period \(T_\phi,\) the setup-time constraint presents
an upper bound on the propagation delay of the combinational logic.
Thus, to the first order, the setup-time constraint limits the maximum
number of stages on logic paths. When designing digital circuits, we
typically design a correct circuit according to the functional
specification first, and then obtain **timing closure** by reducing
the delay of critical paths so as to guarantee that the propagation
delay does not exceed the upper bound given by the setup time
constraint. There is no reason to minimize the propagation delay
further. Just the opposite, positive propagation slack indicates
optimization potential with respect to other design features including
area and energy consumption.

We wish to determine the maximum clock frequency to operate the
sequential circuit in Figure 6.74. The
operands of the 4-bit combinational *ripple-carry adder* are driven by registers \(A\) and
\(B,\) and the sum, including the carry-out bit, is stored in
register \(S.\) The manufacturer specifies the register delays
in normalized time units of \(\tau = 10\,\mathit{ps}\) such that
\(t_{ccq} = 5,\) \(t_{pcq} = 8,\) \(t_{setup} = 19,\)
and \(t_{hold} = 16.\)

Recall our timing analysis of the combinational RCA in Example 5.16. The logic delays \((t_{cd}, t_{pd})\) in normalized time units are reproduced in Figure 6.74. The propagation delay of the critical paths, \(A_0 \longrightarrow S_3,\) \(B_0 \longrightarrow S_3,\) and \(C_{in} \longrightarrow S_3,\) is \(t_{pd} = 56\) time units.

The maximum clock frequency \(f_{max}\) is the reciprocal of minimum clock period \(T_\phi.\) According to the setup-time constraint, the minimum clock period requires zero propagation slack. Then, the minimum clock period is equal to the sum of the \(t_{pcq}\) delay of one of the input registers, e.g. register \(A,\) the propagation delay of the RCA, and the setup time of sum register \(S\):

\[T_\phi = t_{pcq} + t_{pd} + t_{setup} = 8 + 56 + 19 = 83\]

time units, or \(830\,\mathit{ps}.\) The reciprocal of the clock period

\[f_{max} = \frac{1}{T_\phi} = \frac{1}{830\,\mathit{ps}} \approx 1.2\,\mathit{GHz}\]

is the maximum clock frequency of the circuit.
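The computation of this example can be restated as a short Python snippet (variable names are ours; delays in the manufacturer's normalized units of \(\tau = 10\,\mathit{ps}\)):

```python
# Delays in normalized time units of tau = 10 ps (Example 6.10).
tau_ps = 10
t_pcq, t_setup = 8, 19   # register-specific delays
t_pd = 56                # critical path of the combinational RCA

T_min = t_pcq + t_pd + t_setup        # minimum clock period: 83 units
f_max_ghz = 1000 / (T_min * tau_ps)   # reciprocal of 830 ps, in GHz
print(T_min, round(f_max_ghz, 2))
```

This prints the minimum clock period of 83 time units and the maximum clock frequency of roughly 1.2 GHz.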

### 6.6.3. Hold-Time Constraint¶

For the registers of a synchronous circuit to operate reliably, we
must observe the register specific hold time. In Figure 6.73, next state input \(S'\) must not change for the hold time
after the triggering clock edge. The earliest the next state
\(S'\) can change after the clock edge is after the sum of the
contamination delays of the register and the combinational logic
\(t_{ccq} + t_{cd}.\) Therefore, this sum must not be shorter
than the hold time of the register. We formulate this condition as
the **hold-time constraint**:

\[t_{ccq} + t_{cd} \;\ge\; t_{hold}\,.\]

In Figure 6.73, the hold-time constraint is fulfilled,
because \(t_{hold}\) is shorter than the sum \(t_{ccq} +
t_{cd}.\) The **contamination slack** quantifies how much shorter the
hold time is:

\[t_{cslack} = t_{ccq} + t_{cd} - t_{hold}\,.\]

Analogous to the propagation slack, a positive contamination slack is a safety margin. A negative \(t_{cslack}\) can be interpreted as a violation of the hold-time constraint.

Since the contamination delay \(t_{ccq}\) and hold time \(t_{hold}\) are register specific constants determined by the manufacturer, digital designers can only vary the contamination delay of the combinational logic to ensure that the hold-time constraint is observed:

\[t_{cd} \;\ge\; t_{hold} - t_{ccq}\,.\]

The hold-time constraint serves as lower bound for contamination delay \(t_{cd}\) of the combinational logic. A violation of the hold-time constraint can occur if the combinational portion of a synchronous circuit is too fast or, more succinctly, if \(t_{cd}\) is too small. Typical register designs prevent such hold-time violations by ensuring that \(t_{hold} \le t_{ccq}.\) In this case, the hold-time constraint simplifies to \(t_{cd} \ge 0,\) meaning that a synchronous circuit without combinational logic between register output and input will still operate properly. For register designs where \(t_{hold} > t_{ccq},\) we can prevent violating the hold-time constraint by placing logic on every path between register output and input, for example by inserting buffers.

We examine whether the sequential RCA circuit in Example 6.10 fulfills or violates the hold-time constraint. The manufacturer’s register delays are given as \(t_{ccq} = 5\) and \(t_{hold} = 16\) time units. The timing analysis of the combinational portion of the circuit reveals shortest paths \(A_3 \longrightarrow C_{out}\) and \(B_3 \longrightarrow C_{out}\) with a contamination delay of \(t_{cd} = 9\) time units. Thus, the hold-time constraint

\[t_{ccq} + t_{cd} = 5 + 9 = 14 \;<\; 16 = t_{hold}\]

is violated. The shortest paths of the RCA are too fast for the sum register to capture the carry output reliably, because \(C_{out}\) may begin to transition before the end of the hold time period of the sum register.

We can fix the hold-time violation by increasing the delay of the shortest paths. In general, we want to increase the contamination delay without increasing the propagation delay of the combination logic. This is possible in the RCA circuit. For example, we may insert back-to-back inverters between the carry output of the combinational RCA and the sum register.

Assuming that each inverter has a normalized delay of \(d_{inv} = 2\) time units, we increase the contamination delay of the RCA from \(t_{cd} = 9\) to \(t_{cd} = 13\) time units. The propagation delay of the carry path also increases from \(t_{pd}(C_{out}) = 36\) to \(t_{pd}(C_{out}) = 40\) time units. However, this is still less than propagation delay \(t_{pd} = 56\) time units of the critical paths. The hold-time constraint of the modified circuit

\[t_{ccq} + t_{cd} = 5 + 13 = 18 \;\ge\; 16 = t_{hold}\]

is fulfilled without increasing the minimum clock period. Our fix even produces a contamination slack of \(t_{cslack} = 2\) time units as safety margin.
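The hold-time check before and after the fix can be verified with a one-line slack function (a sketch with our own function name, using the normalized time units of Example 6.10):

```python
def contamination_slack(t_ccq, t_cd, t_hold):
    # Hold-time constraint: t_ccq + t_cd >= t_hold.
    # Negative slack means the constraint is violated.
    return t_ccq + t_cd - t_hold

# Original RCA (Example 6.10): shortest path t_cd = 9 units.
print(contamination_slack(5, 9, 16))   # negative: hold-time violation
# With two back-to-back inverters inserted, t_cd = 13 units.
print(contamination_slack(5, 13, 16))  # positive: safety margin
```

The first call reports a slack of \(-2\) time units (violation), the second a slack of \(+2\) time units, matching the analysis above.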

### 6.6.4. Clock Skew¶

Our discussion of the setup-time and hold-time constraints assumes
that the triggering edges of the global clock signal arrive at all
registers at exactly the same time. This is an idealizing assumption
which is virtually impossible to guarantee in real designs, where the
arrival times of the triggering edges vary across registers. The
difference of the arrival times is called **clock skew**.
Figure 6.75 illustrates the clock skew in a sequential
circuit, where two registers receive the global clock signal with
slight timing variations such that \(\phi_1\) arrives slightly
earlier than \(\phi_2.\) The difference of the arrival times
of the triggering clock edges is the clock skew \(t_{skew}.\)

The presence of clock skew requires a refinement of the setup-time and hold-time constraints to guarantee reliable operation of synchronous circuits. We assume that \(t_{skew}\) is the maximum skew of our global clock signal between any two registers connected by combinational logic. Furthermore, we consider the worst-case scenario of the triggering edges arriving in order \(\phi_1\) before \(\phi_2\) or vice versa.

We consider the setup-time constraint first. If \(\phi_1\) arrives earlier than \(\phi_2\) the clock skew increases the propagation slack. In this case, the clock skew increases the safety margin. On the other hand, if \(\phi_2\) arrives earlier than \(\phi_1,\) then the clock skew reduces the propagation slack, as illustrated in Figure 6.76.

To guarantee that signal \(S'\) is stable before the setup time begins w.r.t. clock \(\phi_2,\) we refine the setup-time constraint with clock skew:

\[T_\phi \;\ge\; t_{pcq} + t_{pd} + t_{setup} + t_{skew}\,.\]

The added clock skew tightens the upper bound on the propagation delay of the combinational logic:

\[t_{pd} \;\le\; T_\phi - (t_{pcq} + t_{setup} + t_{skew})\,.\]

For the digital designer, this refinement implies that tolerating clock skew requires fewer stages on the combinational logic paths. In practice, this usually means increasing the design effort to obtain timing closure.

For the hold-time constraint, the worst-case scenario occurs if the triggering edge of clock signal \(\phi_1\) arrives earlier than the triggering edge of \(\phi_2.\) Figure 6.77 illustrates this case. We need to hold signal \(S'\) stable for the hold time after the triggering clock edge of \(\phi_2.\) If \(\phi_2\) arrives by \(t_{skew}\) later than \(\phi_1,\) the clock skew reduces the contamination slack.

We refine the hold-time constraint with clock skew:

\[t_{ccq} + t_{cd} \;\ge\; t_{hold} + t_{skew}\,.\]

The clock skew tightens the lower bound on the contamination delay of the combinational logic:

\[t_{cd} \;\ge\; t_{hold} + t_{skew} - t_{ccq}\,.\]

Fixing hold-time violations in the presence of clock skew requires increasing the delay of the shortest paths further by the clock skew.
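The two refined bounds translate directly into code. This sketch (function names are ours) computes the designer's tuning range for the combinational delays, with skew defaulting to zero for the ideal case:

```python
def max_propagation_delay(T_phi, t_pcq, t_setup, t_skew=0):
    # Upper bound on combinational t_pd from the skew-refined
    # setup-time constraint.
    return T_phi - (t_pcq + t_setup + t_skew)

def min_contamination_delay(t_ccq, t_hold, t_skew=0):
    # Lower bound on combinational t_cd from the skew-refined
    # hold-time constraint.
    return t_hold + t_skew - t_ccq
```

For instance, with the register delays of Example 6.10 in picoseconds, a 1 ns clock, and 20 ps skew, `max_propagation_delay(1000, 80, 190, 20)` yields 710 ps and `min_contamination_delay(50, 160, 20)` yields 130 ps.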

Consider the sequential ripple-carry adder circuit of Example 6.10. Determine the propagation slack and contamination slack, assuming we wish to operate the circuit at a frequency of \(f = 1\,GHz\) and want to tolerate a clock skew of \(t_{skew} = 20\,\mathit{ps}.\)

From Figure 6.76, we deduce the propagation slack

\[t_{pslack} = T_\phi - (t_{pcq} + t_{pd} + t_{setup} + t_{skew})\,.\]

Given the register specifications in Example 6.10, we obtain

\[t_{pslack} = 1000\,\mathit{ps} - (80 + 560 + 190 + 20)\,\mathit{ps} = 150\,\mathit{ps}\,.\]
We conclude that running the circuit by 20% below its maximum clock
frequency suffices to offset the clock skew, and still retain a
safety margin of 150 *ps*.

We derive the contamination slack from Figure 6.77 as

\[t_{cslack} = t_{ccq} + t_{cd} - (t_{hold} + t_{skew})\,.\]

With the register specifications in Example 6.10, we find

\[t_{cslack} = (50 + 90)\,\mathit{ps} - (160 + 20)\,\mathit{ps} = -40\,\mathit{ps}\,.\]
A negative contamination slack indicates a hold-time violation. The fix in Example 6.11 increases the contamination delay to \(t_{cd} = 130\,\mathit{ps},\) which is just enough to obtain zero contamination slack, independent of the clock frequency. Therefore, the modified RCA circuit of Example 6.11 should work reliably at 1 GHz, whereas the original circuit of Example 6.10 will not due to the hold-time violation.
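Both slack values can be recomputed directly from the refined constraints. This sketch (our own variable names; delays converted from normalized units to picoseconds) reproduces the numbers of this example:

```python
# Example 6.10 register delays converted to picoseconds (tau = 10 ps).
t_ccq, t_pcq    = 50, 80
t_setup, t_hold = 190, 160
t_cd, t_pd      = 90, 560       # combinational RCA delays
T_phi, t_skew   = 1000, 20      # 1 GHz clock, 20 ps skew

t_pslack = T_phi - (t_pcq + t_pd + t_setup + t_skew)
t_cslack = t_ccq + t_cd - (t_hold + t_skew)
print(t_pslack, t_cslack)       # 150 ps margin, -40 ps violation
```

Raising `t_cd` to 130 ps, as the fix in Example 6.11 does, brings the contamination slack up to exactly zero.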

Given the FSM in the diagram with timing data for the next state logic and the register:

- Determine the minimum clock period \(T_\phi\) for the machine.
- Determine the slack of the hold-time constraint.
- Does the machine observe the hold-time constraint?

The FSM is a synchronous sequential circuit that must observe the setup-time constraint to function properly. The state register introduces sequencing overhead in the form of delays \(t_{pcq} = 20\,\mathit{ps}\) and \(t_{setup} = 10\,\mathit{ps}.\) We assume that input \(A\) causes no timing problems. Then, the critical path of the FSM is the feedback loop from output \(S\) of the state register through next state logic \(\sigma\) to register input \(S'.\) Thus, the setup-time constraint of the circuit involves delay \(t_{pcq}\) of the register output, propagation delay \(t_{pd}\) of next state logic \(\sigma,\) and setup time \(t_{setup}\) of the register input. The clock period must not be smaller than the sum of these three delays:

\[\begin{eqnarray*} T_\phi &\ge& t_{pcq} + t_{pd} + t_{setup} \\ &=& 20\,\mathit{ps} + 180\,\mathit{ps} + 10\,\mathit{ps} \\ &=& 210\,\mathit{ps}\,. \end{eqnarray*}\]

We conclude that the minimum clock period of the FSM is \(T_\phi = 210\,\mathit{ps},\) so that next state signal \(S'\) stabilizes just in time before the next positive clock edge.

We apply the hold-time constraint to ensure that next state \(S'\) does not change so early that it modifies current state \(S\) unintentionally. Contamination slack \(t_{cslack}\) is the time reserve of the circuit to fulfill the hold-time constraint. If \(t_{cslack} < 0,\) the hold-time constraint is violated, and the circuit may not function properly:

\[\begin{eqnarray*} t_{ccq} + t_{cd} &=& t_{hold} + t_{cslack} \\ \Rightarrow\qquad t_{cslack} &=& 5\,\mathit{ps} + 20\,\mathit{ps} - 30\,\mathit{ps} \\ &=& -5\,\mathit{ps}\,. \end{eqnarray*}\]

We find a negative contamination slack, which implies a hold-time violation. To fix the problem, we should increase the delay of the shortest path(s) of next state logic \(\sigma\) by at least \(5\,\mathit{ps}.\)
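Both answers of this exercise follow from the same two formulas used throughout this section. A minimal Python check with the FSM's timing data (variable names are ours):

```python
# Timing data of the FSM exercise, in picoseconds.
t_pcq, t_setup = 20, 10     # sequencing overhead of the state register
t_ccq, t_hold  = 5, 30      # register contamination delay and hold time
t_pd, t_cd     = 180, 20    # next state logic sigma

T_min    = t_pcq + t_pd + t_setup   # minimum clock period
t_cslack = t_ccq + t_cd - t_hold    # hold-time slack
print(T_min, t_cslack)              # 210 ps and -5 ps
```

The negative slack confirms the hold-time violation; the machine observes the setup-time constraint for any \(T_\phi \ge 210\,\mathit{ps},\) but not the hold-time constraint.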

### 6.6.5. Dynamic Discipline¶

The timing behavior of synchronous circuits can be confusing at times,
even to seasoned designers. However, observing the setup-time and
hold-time constraints does produce reliable synchronous sequential
circuits. The **dynamic discipline** summarizes the key aspects of
synchronous circuit design and serves as a guideline for digital
designers:

- No combinational feedback loops are permitted.
- A single clock signal triggers all registers.
- All register inputs must be stable around the triggering clock edge, and obey the setup-time and hold-time constraints.
- The clock period must be at least as large as the maximum combinational propagation delay plus the sequencing overhead of the registers.
- The next state of the circuit is determined by the output signals of the combinational circuits just before the triggering clock edge.

The dynamic discipline enables us to view the continuous, time-dependent behavior of signals in a synchronous sequential circuit as an abstract state machine that produces a discrete sequence of states. Figure 6.78 contrasts both views. From the circuit perspective, signals \(S[t]\) and \(S[t+1]\) coexist simultaneously during cycle \(t.\) The register drives signal \(S[t]\) into the combinational logic. After the propagation delay of the combinational logic, next state signal \(S[t+1]\) appears at the combinational outputs, where it coexists with state \(S[t]\) of the register. In accordance with the setup-time constraint, next state \(S[t+1]\) exists at the register inputs during the setup-time period before the triggering clock edge at the end of current cycle \(t.\)

From the perspective of a state machine, state \(S[t]\) precedes next state \(S[t+1].\) In clock cycle \(t,\) the register stores current state \(S[t],\) and in the next clock cycle \(t+1,\) the register stores next state \(S[t+1].\) The fact that most of the cycle may be spent to prepare the transition between the two states is irrelevant for the digital abstraction of a state sequence. However, without this abstract, perhaps even simplistic view of a state sequence, it is not only cumbersome but virtually impossible to argue about the complex sequences finite state machines are capable of producing.

## 6.7. Pipelining¶

So far, we have characterized the performance of a digital circuit by
means of its delay. The smaller the propagation delay of a
combinational circuit is, the faster it executes a computation. In a
synchronous sequential circuit, the higher the clock frequency is, the
faster it executes a computation. A state machine may require
multiple clock cycles to perform one computation, like *recognize
a pattern* or *add two binary numbers*. We call the execution time of such a computation the
**latency** of the computation, to distinguish it from the delay of
the underlying circuit. A *serial adder* incurs
different latencies for adding 4-bit numbers compared to 64-bit
numbers, unless we perform the 64-bit addition at a much higher clock
frequency. In this section, we introduce *pipelining* as a method for
improving the performance of a digital circuit without reducing its
latency, but by increasing its **throughput**, i.e. the number of
computations executed per unit of time.[3] Pipelining is
invaluable if we wish to perform multiple computations on a circuit
rather than just one.

### 6.7.1. Combinational Pipelines¶

We consider our options for performing multiple computations on the
combinational *AND tree* shown in Figure 6.79. Assume that the tree is symmetric and each 2-input AND gate
has delay \(t_{and} = 50\,\mathit{ps}.\) Thus, the delay from each input
to the output is \(t_{pd} = 4\,t_{and} = 200\,\mathit{ps},\) because the
number of tree levels is 4. Note that the contamination delay
\(t_{cd}\) is equal to the propagation delay \(t_{pd}\) of the
circuit. The waveform diagram in Figure 6.79 illustrates
the signal propagation from any of the inputs at level 0 to the output
at level 4, where the inputs are applied at time 0.

If we wish to perform ten AND computations, we might want to reuse the circuit as soon as one result has been produced. In this sequential mode of operation, we apply the inputs, wait for \(t_{pd}\) time units until the output has stabilized, and apply the next input. Then, each AND computation has a latency of \(t_{pd} = 200\,\mathit{ps},\) and ten AND computations take \(T_{10} = 10 \cdot 200\,\mathit{ps} = 2\,\mathit{ns}.\) The throughput \(X_{seq}\) of ten AND computations executed one after another is the average number of computations per time unit, here one second:

\[X_{seq} = \frac{10}{2\,\mathit{ns}} = \frac{1}{200\,\mathit{ps}} = 5 \cdot 10^9\ \mathrm{computations/second}\,.\]

Note that the throughput is actually independent of the particular number of AND computations we perform.

Now, observe that the sequential mode of operation utilizes the AND gates of the tree rather inefficiently. Rather than waiting until the output of the tree has stabilized, we only need to wait until the level-1 outputs of the stage-1 AND gates have stabilized before applying the next input. If we apply a new input every \(t_{and} = 50\,\mathit{ps},\) we obtain waves of signals propagating through the levels of the combinational AND tree. Figure 6.80 illustrates this pipelined mode of operation.[4] The AND tree with four levels acts like a pipeline with up to five signals that are active simultaneously. If we keep the inputs stable for \(t_{and}\) time units, the corresponding output of the AND computation is stable for \(t_{and}\) time units. In Figure 6.80, time \(t_A\) is a suitable point in time to capture the conjunction of input signal \(A\) at the output of the tree, i.e. at tree level 4. At the same time \(t_A,\) four more signals activate the tree. Signal \(E\) is applied to the inputs of the tree at level 0, signal \(D\) has propagated through the first stage of AND gates to level 1, signal \(C\) to level 2, and signal \(B\) to level 3.

We analyze the performance of the pipelined mode of operation by deducing the latency of one AND computation and the throughput of multiple AND computations. The latency of one AND computation is the same as in the sequential mode, because pipelining does not affect the propagation delay of a particular input signal through the tree. Therefore, the latency of one AND computation is still \(t_{pd} = 4\,t_{and} = 200\,\mathit{ps}.\) However, the pipelined mode increases the throughput. Once the pipeline is full, every \(t_{and} = 50\,\mathit{ps}\) a new input enters and a new output exits the tree. Therefore, throughput \(X_{pipe}\) in pipelined mode of operation is four times larger than in sequential mode:

\[X_{pipe} = \frac{1}{t_{and}} = \frac{1}{50\,\mathit{ps}} = 2 \cdot 10^{10}\ \mathrm{computations/second}\,.\]
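The two throughput figures of the AND tree can be recomputed with a few lines of Python (variable names are ours; delays in picoseconds):

```python
t_and = 50              # delay of one 2-input AND gate in ps
levels = 4
t_pd = levels * t_and   # tree propagation delay: 200 ps

X_seq  = 1e12 / t_pd    # sequential mode: one result per t_pd
X_pipe = 1e12 / t_and   # pipelined mode: one result per t_and
print(X_seq, X_pipe)    # computations per second
```

The pipelined throughput exceeds the sequential throughput by a factor equal to the number of tree levels, here 4.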

The efficiency of pipelining hinges on the contamination and propagation delays of the subcircuits. The AND tree is a best-case example, where each AND gate has equal contamination and propagation delays, \(t_{and} = t_{cd}(and) = t_{pd}(and).\) If the subcircuits have distinct delays, \(t_{cd} < t_{pd},\) signals on faster paths outpace signals on slower paths. Consider an adder tree of combinational ripple-carry adders. Each subcircuit is an RCA, as in Example 5.16, but with a contamination delay of \(t_{cd}(rca) = 40\) time units and a propagation delay of \(t_{pd}(rca) = 50\) time units. An adder tree with four levels can add 16 binary numbers, and has a timing behavior similar to the AND tree. However, the transitions spread out the deeper the partial sums propagate into the tree. More succinctly, the transition period at level \(k\) of the adder tree lasts \(k (t_{pd}(rca) - t_{cd}(rca))\) time units. Figure 6.81 illustrates the increasing spread of the transitions towards deeper logic levels.

To prevent the signals on the shortest paths from racing ahead of the previous input on the critical paths, we must delay the signal transitions at the inputs of the tree. In Figure 6.81, we assume \(t_{pd}(rca)/t_{cd}(rca) = 5/4,\) and apply new inputs every \(2\,t_{pd}(rca)\) time units only. Then, the output of the adder tree at level 4 is stable for \(5/4\,t_{pd}(rca)\) time units. We can capture the stable sum signals at times \(t_{A}\) and \(t_{B},\) for example. The minimum delay between new inputs is determined by the spread of the transitions at the output. For example, if we require the output to be stable for \(1\,t_{pd}(rca)\) time units, then we can reduce the delay between new inputs to \(7/4\,t_{pd}(rca)\) time units.

The latency of one addition of 16 numbers on the adder tree is \(4\,t_{pd}(rca)\) time units, independent of whether we operate the adder tree in sequential or pipelined mode. The rate of sums exiting the adder tree is equal to the rate of inputs entering the tree. This rate is the throughput of the adder tree in terms of additions per time unit. In Figure 6.81, we perform 1 addition every \(2\,t_{pd}(rca)\) time units, which is half of the maximum throughput we could achieve using RCA circuits with \(t_{cd}(rca) = t_{pd}(rca).\) We conclude that distinct contamination and propagation delays of the subcircuits cause inefficiencies in the pipelined mode of operation.

### 6.7.2. Synchronous Pipelines¶

Operating a combinational circuit in pipelined mode requires careful circuit design, including the contamination and propagation delays of all subcircuits. Significantly simpler to implement, although in general less efficient, is the pipelining of synchronous circuits. Figure 6.82 shows a synchronous circuit with four stages of combinational logic and registers. Input signal \(S_{in}\) enters the combinational logic of stage 0, and output signal \(S_{out}\) is driven by the register of stage 3. At each triggering clock edge, the signal propagates from one stage to the next. After four triggering clock edges, the result of the computation exits the circuit. Therefore, the latency of a computation is 4 clock cycles.

Pipelining is the natural mode of operation for such a synchronous
circuit. There is no reason to operate the circuit in sequential mode
by applying the inputs to \(S_{in},\) and then waiting for 4
cycles until the output appears at \(S_{out}\) before applying the
next input. The latency of such a sequential computation is 4 cycles
and the throughput is 1 computation per 4 cycles. In the natural
pipelined mode, we apply a new input in every clock cycle, as
illustrated with the waveform diagram in Figure 6.82. The throughput is 1 computation per clock cycle, which is
4 times larger than in sequential mode. The crucial improvement
compared to a combinational pipeline is that the transition delays are
confined to each pipeline stage. We do not need to worry about the
increasing transition times that affect combinational pipelines when
increasing the number of pipeline stages. Instead, the minimum
clock period \(T_\phi\) is determined by the maximum propagation
delay as in every synchronous sequential circuit. In fact, when
designing synchronous pipelines, we do not need to worry about
contamination delays other than observing the *hold-time
constraint*.

Synchronous pipelines permit a simple abstraction that hides the details of the timing behavior of the underlying circuit. We sketch the pipeline as a row of cells with one cell per stage, and assume that each stage has a register at its output. Inputs enter the pipeline on the left and outputs exit the pipeline on the right.

We track the signal propagation of computations in the pipeline using
a **Gantt chart** to visualize the allocation of the pipeline stages
to computations over time. Figure 6.83 shows a
Gantt chart of six computations, \(A, B, C, D, E, F,\) on a
4-stage pipeline. The vertical axis tracks the sequence of
computations, and the horizontal axis spans the clock cycles. In the
row of a computation, we draw the abstract pipeline aligned to the
columns of the clock cycles when the computation occurs. Computation
\(A\) occupies stage 0 in cycle 0, \(B\) in cycle 1, \(C\)
in cycle 2, and so on.

The Gantt chart tells us which computation occupies which pipeline stage in which clock cycle. For example, computation \(C\) occupies stage 1 of the pipeline in clock cycle 3. Since each row of the chart corresponds to one computation, the position of the pipeline sketch in the chart encodes the time of execution. For example, computation \(C\) starts at the beginning of cycle 2 and terminates at the end of cycle 5. Each column of the chart informs us about the computations that occupy the pipeline simultaneously within one clock cycle. In cycle 4, for instance, the pipeline is occupied by computations \(B,\) \(C,\) \(D,\) and \(E.\) Moreover, it is easy to determine the execution time of multiple computations. For example, a sequence of five computations that enters the pipeline in consecutive clock cycles has an execution time of 8 cycles. Let the first computation be \(B\) and the fifth computation \(F,\) then the sequence of computations starts in cycle 1 when \(B\) occupies stage 0 and ends in cycle 8 when \(F\) occupies stage 3, for a total of 8 cycles.
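The stage-occupancy bookkeeping of a Gantt chart reduces to one subtraction. A small Python sketch (function name is ours, assuming one new computation may enter per cycle) answers the "which stage in which cycle" question:

```python
def stage_of(start_cycle, cycle, depth=4):
    # Stage occupied in the given cycle by a computation that
    # entered the pipeline in start_cycle, or None if the
    # computation is not in the pipeline during that cycle.
    stage = cycle - start_cycle
    return stage if 0 <= stage < depth else None

# Computation C of Figure 6.83 enters the pipeline in cycle 2:
print(stage_of(2, 3))   # stage 1 in cycle 3
print(stage_of(2, 5))   # stage 3, the last cycle of C
print(stage_of(2, 6))   # None: C has left the pipeline
```

Reading one column of the chart corresponds to evaluating `stage_of` for every computation at a fixed cycle; for cycle 4, computations \(B, C, D, E\) (start cycles 1 through 4) occupy stages 3, 2, 1, 0.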

The pipeline in Figure 6.83 is fully utilized in
cycles 3, 4, and 5 only. In cycles 0, 1, and 2, the stages at the
tail of the pipeline idle. This is the **startup phase** of the
pipeline, where we fill an initially empty pipeline with computations
from head to tail. Similarly, cycles 6, 7, and 8 are the **drain
phase** of the pipeline. When no computations enter the pipeline any
longer, the pipeline empties out from head to tail. The startup and
drain phases affect the utilization of the pipeline. Assume we
execute a sequence of \(n\) computations on a pipeline of depth
\(k.\) Then we characterize the performance of the pipeline with
these metrics:

- **Startup time** \(T_s\): length of the startup phase in clock cycles, \(T_s = k - 1\) cycles. By symmetry, the drain time equals the startup time.

- **Latency** \(T_n\): total number of clock cycles to execute \(n\) computations, \(T_n = T_s + n = n + k - 1\) cycles.

- **Throughput** \(X_n\): throughput of \(n\) computations, \(X_n = n / T_n\) computations per clock cycle.

- **Speedup** \(S_n\): speedup of pipelined execution versus sequential execution, \(S_n = n T_1 / T_n.\)

We illustrate these metrics with the aid of the Gantt chart in Figure 6.83. The pipeline has \(k = 4\) stages, and a startup time of \(T_s = 3\) cycles. The latency of \(n=6\) computations is \(T_6 = T_s + n = 9\) cycles. These are the cycles \(0, \ldots, 8\) in the Gantt chart. One computation has a latency of \(T_1 = 4\) cycles. As expected, the number of cycles of \(T_1\) equals the number of stages \(k\) of the pipeline. The throughput of \(n=6\) computations is \(X_6 = 2/3\) computations per cycle, or 2 computations every 3 clock cycles. Only when \(n \rightarrow \infty,\) do we approach the maximum throughput of \(X_\infty = 1\) computation per cycle, supported by the pipeline. The maximum throughput matches the observation that in a full pipeline, one computation can enter and another exit the pipeline within the same cycle. If we perform a finite number of computations only, then the actual throughput decreases because of the startup and drain phases.
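
These definitions translate directly into code. Here is a minimal sketch (function names are mine) for the 4-stage pipeline of Figure 6.83:

```python
# Pipeline performance metrics for a k-stage pipeline executing n computations.
def startup_time(k):
    return k - 1                      # T_s = k - 1 cycles

def latency(n, k):
    return n + k - 1                  # T_n = T_s + n cycles

def throughput(n, k):
    return n / latency(n, k)          # X_n = n / T_n computations per cycle

def speedup(n, k):
    return n * latency(1, k) / latency(n, k)   # S_n = n T_1 / T_n

k = 4                                  # pipeline depth from Figure 6.83
print(startup_time(k))                 # 3 cycles
print(latency(6, k))                   # 9 cycles for n = 6 computations
print(throughput(6, k))                # 2/3 computations per cycle
```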

The throughput reflects the utilization of the pipeline stages. We may want to know how many computations we need for utilization of 50%, which corresponds to a throughput \(X_n = 1/2\) computations per cycle, or 1 computation every 2 cycles. For a pipeline of depth \(k\) we obtain:

\[X_n = \frac{n}{n + k - 1} = \frac{1}{2} \quad\Longleftrightarrow\quad n = k - 1\,.\]

Thus, \(n = 3\) computations achieve a throughput of \(X_3 = 1/2\) computations per cycle on our 4-stage pipeline. More computations increase the throughput. To increase the utilization beyond 90%, we find \(n\) where \(X_n = 9/10.\) Analogous to the algebra above, we obtain \(n = 9 (k - 1).\) Thus, for \(n \ge 27\) computations, the throughput of our 4-stage pipeline is \(X_n \ge 9/10.\) The deeper a pipeline, the more computations we need to utilize it efficiently.
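
Solving \(X_n = n/(n+k-1) \ge x\) for the smallest \(n\) gives \(n \ge x(k-1)/(1-x).\) A short sketch that checks both utilization targets:

```python
# Number of computations needed to reach target throughput x on a k-stage
# pipeline. Solving X_n = n/(n+k-1) >= x gives n >= x(k-1)/(1-x); the loop
# below avoids floating-point pitfalls of the closed form.
def computations_for_throughput(k, x):
    n = 1
    while n / (n + k - 1) < x:
        n += 1
    return n

print(computations_for_throughput(4, 1/2))    # 3 computations for X_n = 1/2
print(computations_for_throughput(4, 9/10))   # 27 computations for X_n = 9/10
```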

The speedup of a pipeline is the performance gain of operating a circuit in pipelined mode compared to sequential mode. If we operate a synchronous pipeline of depth \(k\) sequentially, each computation has a latency of \(T_1 = k\) cycles, and the latency of \(n\) computations is \(n T_1 = n k\) cycles. The speedup

\[S_n = \frac{n\,T_1}{T_n} = \frac{n\,k}{n + k - 1}\]

is the factor by which the pipelined execution of \(n\) computations outperforms the sequential execution. For example, \(n = 3\) computations achieve a speedup of \(S_3 = 12/6 = 2\) on our 4-stage pipeline. The pipelined mode of execution is 2 times faster than the sequential mode. The maximum speedup achievable on a \(k\)-stage pipeline is \(S_\infty = k,\) by taking the limit of \(S_n\) for \(n \rightarrow \infty.\) Thus, the deeper a pipeline, the larger the maximum speedup we can achieve.

### 6.7.3. Register-Push Method¶

Synchronous pipelines are of considerable practical value, because
they can improve the performance of a digital circuit by increasing
its throughput, even if we cannot reduce its latency any further.
Therefore, we are interested in transforming a given combinational
circuit into a synchronous pipeline. In this section, we introduce
the *register-push method* that facilitates this transformation in
a systematic fashion.

We begin by extending our notion of a \(k\)-stage pipeline to acyclic circuit topologies. Consider the acyclic combinational circuit in Figure 6.84 with inputs \(A\) and \(B\) and outputs \(X\) and \(Y.\) Combinational subcircuits \(C1,\) \(C2,\) \(C3,\) and \(C4\) are shown as black boxes, because we wish to treat them as indivisible subcircuits, like a single CMOS gate for example. Since the circuit topology does not match the simple linear structure of the synchronous pipeline in Figure 6.82, it is not necessarily obvious where we can insert registers to transform the combinational circuit into a synchronous pipeline. The following definition of a k-stage pipeline is the key:

A **\(k\)-stage pipeline** is a synchronous sequential circuit where every path from input to output contains \(k\) registers.

The idea of a \(k\)-stage pipeline is easily understood by example. The circuit below extends the combinational circuit of Figure 6.84 with four registers. We ask whether this circuit is a \(k\)-stage pipeline, and if this is the case, we want to know the value of \(k.\)

The circuit has five paths from inputs to outputs, \(A \longrightarrow C2 \longrightarrow X,\) \(A \longrightarrow C3 \longrightarrow X,\) \(A \longrightarrow Y,\) \(B \longrightarrow X,\) and \(B \longrightarrow Y.\) On path \(A \longrightarrow C2 \longrightarrow X,\) we find 2 registers. Thus, we hypothesize that the circuit is a 2-stage pipeline, and check whether all other paths have 2 registers. This is not the case for path \(A \longrightarrow Y,\) which has 1 register only. We conclude that the circuit is not a \(k\)-stage pipeline.

As a second example, consider the modified circuit below, where we have shifted the register from the output of \(C2\) to the output of \(C1.\)

In this circuit, each of the five paths from inputs to outputs has 2 registers. Therefore, the circuit is a \(k\)-stage pipeline with \(k = 2.\)
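
The path-counting definition is easy to check mechanically. The sketch below encodes a circuit as a DAG whose edges carry register counts; the example topology is illustrative, not a transcription of Figure 6.84:

```python
# Check the k-stage pipeline property: every input-to-output path must cross
# the same number k of registers. Each edge carries the number of registers
# placed on that wire. (The topology below is a made-up illustration.)
def path_register_counts(edges, node, count=0):
    """Yield the total register count of every path from node to an output."""
    if node not in edges:                # no outgoing edges: an output terminal
        yield count
        return
    for succ, regs in edges[node]:
        yield from path_register_counts(edges, succ, count + regs)

def pipeline_depth(edges, inputs):
    """Return k if the circuit is a k-stage pipeline, else None."""
    counts = {c for i in inputs for c in path_register_counts(edges, i)}
    return counts.pop() if len(counts) == 1 else None

edges = {
    "A":  [("C1", 0), ("C4", 0)],
    "B":  [("C4", 0)],
    "C1": [("C2", 1), ("C3", 1)],        # one register on each fork branch
    "C2": [("X", 1)],
    "C3": [("X", 1)],
    "C4": [("Y", 2)],                    # two registers on the C4 output
}
print(pipeline_depth(edges, ["A", "B"]))   # 2
```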

The **register-push method** transforms an acyclic combinational
circuit of indivisible subcircuits into a functionally equivalent
\(k\)-stage pipeline:

1. Determine the critical path and its propagation delay \(T_{crit}\) between all inputs and outputs.
2. Choose clock period \(T_C\) to be at least the maximum propagation delay of all indivisible subcircuits.
3. Choose the number of pipeline stages \(k\) such that \(k = \lceil T_{crit} / T_C\rceil.\)
4. Place \(k\) registers on every output node.
5. Push registers toward the inputs such that at least one register remains on each output node, and the circuit remains a \(k\)-stage pipeline.
6. Stop pushing when no registers can be pushed without violating the property of being a \(k\)-stage pipeline, and when the propagation delay between any two registers is less than or equal to \(T_C.\) If the propagation delay of a path between an input and the first register is larger than \(T_C,\) set \(k = k + 1,\) place one additional register on every output node, and go to step 5.

We illustrate the register-push method by transforming the circuit in Figure 6.85 into a \(k\)-stage pipeline.

We begin by identifying the critical path \(A \longrightarrow C1
\longrightarrow C2 \longrightarrow C4 \longrightarrow X\) with
propagation delay \(T_{crit} = 70\,\mathit{ns}.\) The maximum propagation
delay among all subcircuits is \(30\,\mathit{ns}\) of subcircuit
\(C1.\) Therefore, we choose a clock period of \(T_C =
30\,\mathit{ns}.\) We could choose a larger clock period but this might result
in a smaller clock frequency than necessary for the pipelined circuit.
Given \(T_{crit}\) and \(T_C,\) we can choose the number of
pipeline stages \(k = \lceil 70 / 30 \rceil = 3.\) This completes
the first three steps of the register-push method. The remaining
steps specify a graphical method for inserting registers in the
combinational circuit, that resembles the *bubble-push method*. Figure 6.86 shows the register
pushing from outputs toward the inputs of the circuit. Step 4 of the
register-push method initiates the register pushing by placing
\(k=3\) registers on each output node. As shown in Figure 6.86(a), we use a simple stroke to represent a register rather
than drawing register symbols.
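
Steps 1-3 reduce to a longest-path computation. The sketch below uses the subcircuit delays stated in the text; the path list and \(t_{pd}(C3) = 5\,\mathit{ns}\) are inferred from the discussion:

```python
import math

# Steps 1-3 of the register-push method for the circuit of Figure 6.85.
# Subcircuit delays in ns, as stated in the text (C3 inferred from the
# 5 ns path delay to output Y).
t_pd = {"C1": 30, "C2": 20, "C3": 5, "C4": 20}

# Input-to-output paths, each a list of traversed subcircuits (inferred).
paths = [["C1", "C2", "C4"], ["C1", "C3", "C4"], ["C1", "C3"]]

T_crit = max(sum(t_pd[c] for c in p) for p in paths)   # step 1: 70 ns
T_C = max(t_pd.values())                               # step 2: 30 ns
k = math.ceil(T_crit / T_C)                            # step 3: 3 stages
print(T_crit, T_C, k)                                  # 70 30 3
```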

Similar to bubble pushing, we push registers across a subcircuit by removing a register from each output, and placing a register on each input. In Figure 6.86(b), we push two registers across subcircuit \(C4,\) leaving one register behind on output \(X.\) Note that subcircuit \(C4\) is now framed by registers. If we operate the circuit with a clock period of \(T_C = 30\,\mathit{ns},\) subcircuit \(C4\) stabilizes its output signal earlier than necessary, because its propagation delay of \(t_{pd}(C4) = 20\,\mathit{ns}\) is less than \(T_C.\)

Next, in Figure 6.86(c), we push two registers across the wire fork. We remove two registers from each of the two output wires, and insert two registers on the input wire of the fork. This transformation does not affect the timing behavior of the circuit. One register remains on output \(Y.\)

In Figure 6.86(d), we push both registers from the output of subcircuit \(C3\) to its inputs. This step is justified by observing that the propagation delays between the registers at the inputs of \(C3\) and the registers on all output paths of \(C3\) are less than clock period \(T_C = 30\,\mathit{ns}.\) There are two such paths. The path between the input registers of \(C3\) to the register on output \(Y\) has a propagation delay of \(5\,\mathit{ns}.\) The other path through \(C4\) to the register on output \(X\) has a propagation delay of \(t_{pd}(C3) + t_{pd}(C4) = 25\,\mathit{ns}.\)

We cannot push both registers on the output of subcircuit \(C2\) to its input, because this would create a path between the register at the input of \(C2\) and the register at output \(X.\) This path would have a propagation delay of \(t_{pd}(C2) + t_{pd}(C4) = 40\,\mathit{ns},\) which is larger than the chosen clock period \(T_C = 30\,\mathit{ns}.\) Instead, we push only one register across \(C2\) as shown in Figure 6.86(e). The paths between two registers through \(C2\) and through \(C4\) each have a propagation delay of \(20\,\mathit{ns},\) which is less than \(T_C.\)

The register push across the wire fork in Figure 6.86(f) saves one register without changing the timing behavior of the circuit. We cannot push the register across subcircuit \(C1,\) because the propagation delay of the resulting path through \(C1\) and \(C2\) would exceed the clock period. Since there are no more opportunities to push registers, the register-push method terminates. The resulting synchronous circuit is the desired 3-stage pipeline. You may verify that each of the five paths between inputs and outputs has 3 registers, and that every path between two registers or from an input to the first register has a propagation delay less than or equal to clock period \(T_C.\)

Now that we have a pipelined version of the original combinational circuit, we analyze the performance benefit of our pipelining efforts. We compare the performance of the 3-stage pipeline in Figure 6.86(f) with the performance of the combinational circuit in Figure 6.85. The combinational circuit has a latency of \(T_{crit} = 70\,\mathit{ns}.\) Operated in sequential mode, the throughput of the combinational circuit is

\[X_{comb} = \frac{1}{T_{crit}} = \frac{1}{70\,\mathit{ns}} \approx 1.43 \times 10^7\ \text{computations}/s\,.\]

Next, we consider the pipelined circuit. The 3-stage pipeline has a clock period of \(T_C = 30\,\mathit{ns}.\) Therefore, the latency of one computation is \(T_1 = k \cdot T_C = 90\,\mathit{ns}.\) We note that pipelining increases the latency by \(20\,\mathit{ns},\) or by nearly 30%. For infinitely many computations, the maximum throughput of the pipeline is 1 computation per clock cycle, or

\[X_\infty = \frac{1}{T_C} = \frac{1}{30\,\mathit{ns}} \approx 3.3 \times 10^7\ \text{computations}/s\,.\]

The speedup of \(n\) computations due to pipelining is the ratio of the execution time of \(n\) computations on the combinational circuit, \(n \cdot T_{crit},\) divided by the corresponding execution time on the pipelined circuit \(T_n = (T_s + n)\,T_C.\) In the limit, for infinitely many computations we obtain speedup \(S_\infty\) of the pipelined versus the combinational circuit:

\[S_\infty = \lim_{n \rightarrow \infty} \frac{n\,T_{crit}}{(T_s + n)\,T_C} = \frac{T_{crit}}{T_C} = \frac{70\,\mathit{ns}}{30\,\mathit{ns}} \approx 2.33\,.\]

Thus, the speedup of our pipeline is \(S_\infty = 2.33.\) This speedup is below the maximum speedup 3 of a pipeline with \(k=3\) stages. The reason for this loss is the imbalance of the propagation delays in the stages. Only if the propagation delays of all subcircuits are equal is the speedup equal to the number of pipeline stages.

The 3-stage pipeline suffers an additional slowdown, if we include the sequencing overheads of the registers in our analysis. To observe the setup-time constraint, we must increase clock period \(T_C\) to account for the \(t_{pcq}\) and \(t_{setup}\) delays within a pipeline stage. Furthermore, we may have to worry about clock skew \(t_{skew}.\) Thus, the refined clock period for the pipeline is

\[T_\phi = T_C + (t_{pcq} + t_{setup} + t_{skew})\,.\]

The sum within the parentheses is the so-called **pipelining
overhead**. The pipelining overhead increases the latency of the
pipeline from \(k\cdot T_C\) to \(k \cdot T_\phi,\) and reduces
the speedup compared to the combinational circuit from
\(T_{crit}/T_C\) to \(T_{crit}/T_\phi.\) Because of the
pipelining overhead, we do not want to partition a given combinational
circuit into too many stages. Increasing the number of stages
increases the maximum speedup but also the pipelining overhead due to
the pipeline registers. In fact, for most combinational circuits
there exists a best number of stages that maximizes the speedup by
balancing the tradeoff between the number of stages and the pipelining
overhead.
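
The existence of a best stage count can be illustrated with a small sweep. This is a sketch under simplifying assumptions (perfectly balanced stages, a fixed number of computations \(n,\) and illustrative delay numbers not taken from the text):

```python
# Tradeoff between stage count and pipelining overhead, assuming a perfectly
# balanced partition: stage delay T_crit/k, refined period T_phi = T_crit/k + o.
# Speedup of n pipelined computations over n combinational executions.
def speedup(n, k, t_crit, overhead):
    t_phi = t_crit / k + overhead          # refined clock period T_phi
    return n * t_crit / ((n + k - 1) * t_phi)

t_crit, overhead, n = 70.0, 2.0, 100       # illustrative values in ns
best_k = max(range(1, 200), key=lambda k: speedup(n, k, t_crit, overhead))
print(best_k, round(speedup(n, best_k, t_crit, overhead), 2))
```

With these numbers the speedup first grows with \(k\) and then falls again, because the startup phase and the per-stage overhead start to dominate, so an intermediate \(k\) wins.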

We wish to pipeline the *ripple-carry adder* for 4-bit numbers shown in Figure 6.87.

We treat the *full adder (FA)* as an indivisible
circuit. Assuming that all FA’s have the same propagation delay
from inputs to outputs, we can determine the number of pipeline
stages \(k\) skipping steps 1 and 2 of the *register-push
method*. The critical path of the 4-bit RCA
stretches through the carry chain from the inputs in position 0,
\(A_0,\) \(B_0,\) and \(C_{in},\) to the outputs in
position 3, \(S_3\) and \(C_{out}.\) Since the carry chain
contains 4 FA’s, we choose to allocate one FA per pipeline stage,
so that \(k = 4.\) Since all stages have the same propagation
delay of the FA, the FA stages of the pipeline will be perfectly
balanced.

We initialize the register configuration for the pipelined RCA by placing \(k=4\) registers on each output of the combinational RCA, \(S_0,\) \(S_1,\) \(S_2,\) \(S_3,\) and \(C_{out}.\) Figure 6.88(a) shows the RCA with register strokes on its outputs.

We begin to push registers in bit position 3 of the RCA, because the FA is the only one with registers on all outputs to push towards its inputs. Leaving one register behind on outputs \(C_{out}\) and \(S_3,\) we can push three registers across the FA. Figure 6.88(b) shows the RCA with three registers removed from the outputs of the FA in bit position 3, and three registers added to each of its inputs \(A_3,\) \(B_3,\) and \(C_2.\) Now, we have registers on all outputs of the FA in bit position 2. We need to keep one register on \(C_2\) to turn the carry chain into a 4-stage pipeline. Thus, two of the three registers on \(C_2\) are available for pushing. We remove two registers from each output of the FA in bit position 2, \(S_2\) and \(C_2,\) and place two registers on each of its inputs, \(A_2,\) \(B_2,\) and \(C_1,\) as shown in Figure 6.88(c). Finally, we push one register across the FA in bit position 1, to obtain the 4-stage pipeline in Figure 6.88(d). You may verify that all paths from inputs to outputs have 4 registers, and no pipeline stage has a propagation delay larger than that of a full adder.

The manufacturer of our 4-stage pipeline has full adders with a
propagation delay of \(t_{pd}(FA) = 100\,\mathit{ps},\) registers with
\(t_{pcq} = 10\,\mathit{ps}\) and \(t_{setup} = 15\,\mathit{ps},\) and claims
that the clock signal will have negligible skew. We are interested
in the speedup of our pipelined RCA compared to the combinational
RCA. First, however, let’s step back, and use the timing data to
verify our choice of \(k=4\) pipeline stages by means of steps
1-3 of the *register-push method*. The
critical path of the combinational RCA in Figure 6.87 is the carry chain with propagation delay
\(T_{crit} = 4\,t_{pd}(FA) = 400\,\mathit{ps}.\) The subcircuit with
the largest propagation delay is an FA with a propagation delay of
\(t_{pd}(FA) = 100\,\mathit{ps}.\) Thus, we choose the clock period to
be \(T_C = t_{pd}(FA) = 100\,\mathit{ps}.\) Then, we choose the number
of pipeline stages

\[k = \biggl\lceil \frac{T_{crit}}{T_C} \biggr\rceil = \biggl\lceil \frac{4\,t_{pd}(FA)}{t_{pd}(FA)} \biggr\rceil = 4\,.\]

Note that the propagation delays of the FA cancel out, justifying our argument that the number of pipeline stages equals the number of FA’s on the critical path.

The combinational RCA is our reference design. The latency of one addition on the combinational RCA is the propagation delay of the critical path \(T_{crit} = 400\,\mathit{ps}.\) If we operate the combinational RCA in sequential mode, we obtain a throughput of

\[X_{comb} = \frac{1}{T_{crit}} = \frac{1}{400\,\mathit{ps}} = 2.5 \times 10^9\ \text{additions}/s\,.\]

Next, we analyze the performance of the pipelined RCA. Without pipelining overhead, the 4-stage pipeline has the same latency and a four times larger throughput than the combinational RCA, because the pipeline can produce one addition per clock cycle with a period of \(T_C = 100\,\mathit{ps}.\) Thus, the maximum throughput for infinitely many additions is \(10^{10}\) additions\(/s\) on an ideal pipeline. Accounting for the pipelining overhead, we must increase the clock period to observe the setup-time constraint:

\[T_\phi = T_C + t_{pcq} + t_{setup} = (100 + 10 + 15)\,\mathit{ps} = 125\,\mathit{ps}\,.\]

The increased clock period increases the latency of the 4-stage pipeline to \(T_1 = 4\,T_\phi = 500\,\mathit{ps},\) and the throughput decreases to

\[X_\infty = \frac{1}{T_\phi} = \frac{1}{125\,\mathit{ps}} = 8 \times 10^9\ \text{additions}/s\,.\]

We find that the speedup of our pipelined RCA compared to the combinational RCA is

\[S_\infty = \frac{X_{pipe}}{X_{comb}} = \frac{8 \times 10^9}{2.5 \times 10^9} = 3.2\]

in terms of maximum throughput for infinitely many additions. The pipelining overhead reduces the speedup by 20% compared to the ideal speedup of 4 of a 4-stage pipeline. We conclude that pipelining increases the throughput of a combinational RCA by up to a factor of 3.2 at the expense of increasing the latency by 25%.
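
The arithmetic above is easy to reproduce (timing data as given in the text; variable names are mine):

```python
# Performance of the pipelined 4-bit RCA versus the combinational RCA,
# using the manufacturer's timing data (all times in ps).
t_pd_fa, t_pcq, t_setup, k = 100, 10, 15, 4

T_crit = 4 * t_pd_fa                      # combinational latency: 400 ps
X_comb = 1 / (T_crit * 1e-12)             # 2.5e9 additions/s, sequential mode

T_phi = t_pd_fa + t_pcq + t_setup         # refined clock period: 125 ps
T_1 = k * T_phi                           # pipelined latency: 500 ps
X_pipe = 1 / (T_phi * 1e-12)              # 8e9 additions/s maximum throughput

print(T_1, X_pipe / X_comb)               # 500 ps latency, speedup 3.2
```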

Design a circuit for binary numbers \(A,\) \(B,\) \(C,\) \(D,\) and \(E\) to compute

\[Y = \frac{A + 3 \times B + 5 \times C + 3 \times D + E}{13}\,.\]

Use these combinational arithmetic building blocks:

operator | inputs | delay [ps]
---|---|---
adder | 2 | 200
times-3 | 1 | 300
times-5 | 1 | 300
divide-by-13 | 1 | 700

- (a) Derive alternative combinational circuits for \(Y(A,B,C,D,E).\)
- (b) Find a pipelined circuit with maximum throughput.
- (c) Determine minimum latency and maximum throughput of an ideal pipeline without pipelining overhead.

We assume integer arithmetic and that all operators have sufficient bit widths to ignore exceptional conditions such as overflow. Furthermore, we assume that we can apply the theorems of associativity, commutativity, and distributivity without changing the result of the computation. The key to designing alternative combinational circuits is to parenthesize the arithmetic expression so as to map directly into the operators under their input constraints. In particular, the adder has two inputs only. Therefore, we must group the additions of the numerator into additions with two operands. For example, this parenthesization maps into the given operators:

\[Y = \frac{(((A + (3 \times B)) + (5 \times C)) + (3 \times D)) + E}{13}\,.\]

The corresponding circuit (a) is shown below.

We observe that circuit (a) has a critical path from input \(B\) through the times-3 multiplier, four adders, and the divider to output \(Y.\) The associated propagation delay is \((300 + 4 \times 200 + 700)\,\mathit{ps} = 1.8\,\mathit{ns}.\) The circuit exhibits obvious optimizations to reduce the propagation delay. For example, exploiting associativity of addition suggests alternative parenthesization

\[Y = \frac{((A + (3 \times B)) + ((5 \times C) + (3 \times D))) + E}{13}\,,\]

which yields circuit (b) with critical paths from \(B,\) \(C,\) or \(D\) to \(Y\) and a propagation delay of \((300 + 3 \times 200 + 700)\,\mathit{ps} = 1.6\,\mathit{ns}.\) Furthermore, we can apply commutativity and distributivity to save one times-3 multiplier with expression

\[Y = \frac{((A + E) + (5 \times C)) + (3 \times (B + D))}{13}\,,\]

shown as circuit (c). This alternative has critical paths from \(B,\) \(C,\) or \(D\) to \(Y,\) and has a propagation delay of \(1.4\,\mathit{ns}\) only. This is the fastest circuit we can design for expression \(Y\) with the given operators.
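
The critical path delays of the three parenthesizations can be computed over expression trees. The tuple encoding and function names below are mine; the operator delays come from the table:

```python
# Propagation delay of the three parenthesizations, computed over expression
# trees with the operator delays of the table (delays in ps).
DELAY = {"add": 200, "times3": 300, "times5": 300, "div13": 700}

def delay(expr):
    """expr is a leaf (input name) or a tuple (operator, operand, ...)."""
    if isinstance(expr, str):
        return 0
    op, *args = expr
    return DELAY[op] + max(delay(a) for a in args)

a = ("div13", ("add", ("add", ("add", ("add", "A", ("times3", "B")),
                               ("times5", "C")), ("times3", "D")), "E"))
b = ("div13", ("add", ("add", ("add", "A", ("times3", "B")),
                       ("add", ("times5", "C"), ("times3", "D"))), "E"))
c = ("div13", ("add", ("add", ("add", "A", "E"), ("times5", "C")),
               ("times3", ("add", "B", "D"))))
print(delay(a), delay(b), delay(c))   # 1800 1600 1400
```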

We consider the three alternative circuits of subproblem (a). Note that the divide-by-13 divider is the operator with the largest delay of \(700\,\mathit{ps}.\) Therefore, the minimum clock period of any pipelined version must be \(T_C = 700\,\mathit{ps},\) and the divider occupies one pipeline stage by itself.

Consider circuit (a) below. We frame the divider at output \(Y\) with pipeline registers, and plan to place additional pipeline registers such that the propagation delay of a stage does not exceed \(T_C = 700\,\mathit{ps}.\) Working up towards inputs \(A\) and \(B,\) we find that three adders have a delay of \(3 \times 200\,\mathit{ps} = 600\,\mathit{ps},\) which is less than \(T_C,\) whereas four adders have a delay of \(800\,\mathit{ps},\) exceeding \(T_C.\) Thus, we insert another level of pipeline registers as shown in circuit (a). The resulting pipeline has three stages. Stage 1 has a propagation delay of \(500\,\mathit{ps}\) through one multiplier and one adder, stage 2 requires \(600\,\mathit{ps}\,\) through three adders, and stage 3 requires \(700\,\mathit{ps}\) through the divider. The maximum throughput of pipeline (a) is one result \(Y\) per \(T_C = 700\,\mathit{ps}.\)

Pipelining circuit (b) requires three stages too. Stage 1 has a delay of \(300\,\mathit{ps}\) through one multiplier. Stages 2 and 3 have delays of \(600\,\mathit{ps}\) and \(700\,\mathit{ps},\) like pipeline (a). Our design effort to reduce the latency of circuit (a) does not translate into an advantage from the perspective of pipelining. The maximum throughput of pipeline (b) is one result \(Y\) per \(T_C = 700\,\mathit{ps},\) just like pipeline (a). The reason is the divider, which enforces the minimum clock period \(T_C.\) The reduced propagation delay of circuit (b) results in a larger stage imbalance, though. Pipeline (b) can utilize stage 1 for \(300\,\mathit{ps}\) only, whereas pipeline (a) utilizes stage 1 for \(500\,\mathit{ps}.\)

Pipeline (c) requires two stages only. Framing the divider with pipeline registers leaves us with a propagation delay of \(700\,\mathit{ps}\) for the remaining operators, one multiplier and two adders. This delay equals \(T_C\) and, hence, fits into one pipeline stage. We conclude that pipeline (c) saves one pipeline stage compared to alternatives (a) and (b). However, the maximum throughput of pipeline (c) is equal to that of pipelines (a) and (b), i.e. one result \(Y\) per \(T_C = 700\,\mathit{ps}.\) The maximum throughput of a pipeline is not determined by the number of stages but by the operator with maximum delay, because this operator determines minimum clock period \(T_C\) that applies to all stages.

We could have found the pipeline with maximum throughput by applying the *register-push method*. Consider circuit (c) from the perspective of register pushing. Combinational circuit (c) has a critical path delay of \(T_{crit} = 1.4\,\mathit{ns}.\) The indivisible subcircuit with maximum delay is the divider, which implies minimum clock period \(T_C = 700\,\mathit{ps}.\) In the register-push method, we choose the number of pipeline stages \(k = \lceil T_{crit} / T_C \rceil = 2,\) place two registers at output \(Y,\) and push one register across the divider, which yields pipeline (c) as before. Applying the register-push method to circuits (a) and (b) produces a different register placement in pipeline (a) without affecting the maximum throughput.

An ideal pipeline has no pipelining overhead, which is otherwise caused by the sequencing overhead of the pipeline registers. Therefore, the maximum throughput determined in subproblem (b) is the throughput of pipelines (a), (b), and (c) when operated at minimum clock period \(T_C = 700\,\mathit{ps}\) with infinitely many computations, i.e. \(X_{pipe} = X_\infty = 1\,\text{computation} / 700\,\mathit{ps}\,.\) The latency of an ideal pipeline is the time a single input requires to traverse the pipeline, i.e. \(T_1 = k \times T_C.\) Since pipelines (a) and (b) have \(k=3\) stages whereas pipeline (c) has \(k=2\) stages only, pipeline (c) is the pipeline with minimum latency \(T_1 = 2 \times T_C = 1.4\,\mathit{ns}.\)

Note that we choose the number of computations carefully when characterizing the performance of a pipeline with latency and throughput. If we are interested in \(n\) computations, we can determine their latency \(T_n\) and throughput \(X_n,\) which depend on \(n.\) In contrast, if we are interested in the latency and throughput of the pipeline per se, we wish to characterize the pipeline performance independent of the number of computations \(n.\) We choose latency \(T_1,\) because it expresses the minimum delay to produce one output, and throughput \(X_\infty,\) because it expresses the throughput of maximum utilization, when the pipeline produces one output every clock cycle.

Consider the combinational circuit with subcircuits \(A, B, \ldots, F\):

- (a) Apply the register-push method to pipeline the circuit.
  - Determine \(T_{crit}\) of the longest path.
  - Determine minimum clock period \(T_C.\)
  - Choose \(k = \lceil T_{crit} / T_C \rceil.\)
  - Push registers from output toward input.
- (b) Use register pushing to derive a 2-stage and a 3-stage pipeline.
- (c) Given registers with sequencing overhead \(t_{pcq} + t_{setup} = 1\,\mathit{ns},\) which number of stages maximizes throughput?

The critical path of the circuit is the only path from input \(X\) to output \(Y.\) The delay is the sum of the delays of the subcircuits:

\[T_{crit} = (5 + 2 + 3 + 5 + 4 + 5)\,\mathit{ns} = 24\,\mathit{ns}\,.\]

The minimum clock period is the maximum delay of any of the subcircuits, here

\[T_C = 5\,\mathit{ns}\]

of subcircuits \(A,\) \(D,\) and \(F.\) We choose the number of pipeline stages:

\[k = \biggl\lceil \frac{T_{crit}}{T_C} \biggr\rceil = \biggl\lceil \frac{24\,\mathit{ns}}{5\,\mathit{ns}} \biggr\rceil = 5\,.\]

We place \(k=5\) registers on output \(Y,\) and begin to push registers toward input \(X\):

Circuit (a) shows the initial placement of 5 registers on output \(Y.\) Since subcircuit \(F\) has a delay of \(5\,\mathit{ns},\) which is equal to \(T_C,\) we leave 1 register behind and push 4 registers across subcircuit \(F,\) see circuit (b). For subcircuit \(F\) to occupy its own pipeline stage, we leave 1 register between subcircuits \(E\) and \(F,\) and push 3 registers across \(E,\) see circuit (c). Notice that the delay of subcircuit \(D\) equals \(T_C,\) so that \(D\) should occupy its own pipeline stage. Therefore, we keep 1 register between \(D\) and \(E,\) and push 2 registers across subcircuit \(D,\) see circuit (d). Next, we keep 1 register at the input of \(D\) and push 1 register across subcircuit \(C,\) see circuit (e). If we push the register between \(B\) and \(C\) across \(B,\) the path of subcircuits \(B\) and \(C\) has a delay of \(5\,\mathit{ns},\) which fits into a single stage because the delay equals \(T_C.\) The resulting pipeline is shown as circuit (f). Subcircuit \(A\) occupies its own stage, assuming the circuit that drives input \(X\) has a register on its output. Circuit (f) is a 5-stage pipeline that can be operated with a minimum clock period of \(T_C = 5\,\mathit{ns},\) because each stage has a delay less than or equal to \(T_C.\)
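
For this linear chain, a greedy left-to-right partition under the stage-delay bound \(T_C\) happens to reproduce the same 5 stages that register pushing from the outputs produces. A sketch (encoding mine):

```python
# Greedy stage partition of the chain A..F under clock period T_C = 5 ns:
# walk from input to output and cut whenever adding the next subcircuit
# would exceed T_C. This mirrors the register placement of circuit (f).
def partition(delays, t_c):
    stages, current = [], []
    for name, d in delays:
        if current and sum(d2 for _, d2 in current) + d > t_c:
            stages.append(current)
            current = []
        current.append((name, d))
    stages.append(current)
    return stages

chain = [("A", 5), ("B", 2), ("C", 3), ("D", 5), ("E", 4), ("F", 5)]
for stage in partition(chain, 5):
    print([name for name, _ in stage])   # five stages: [A], [B, C], [D], [E], [F]
```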

Rather than minimizing the clock period of the pipelined circuit, which requires \(k=5\) stages as shown in subproblem (a), we want to consider a 2-stage pipeline and 3-stage pipeline as alternatives. The primary goal of pipeline design with fewer than 5 stages is to balance the stage delays so as to minimize the clock period. Ideally, we can identify stages with equal delays, which implies perfect utilization of all stages.

We begin with a 2-stage pipeline. Given \(T_{crit} = 24\,\mathit{ns},\) we shoot for two stages with \(12\,\mathit{ns}\) delay each. Here is the register pushing into a 2-stage pipeline:

We place 2 registers on output \(Y.\) Then, we leave 1 register behind, and push the other register toward input \(X.\) If we push the register across \(F\) we obtain 2 stages, the first stage with subcircuits \(A\) through \(E,\) and the second stage with subcircuit \(F.\) The delay of stage 1 would be \(19\,\mathit{ns},\) and the delay of stage 2 is merely \(5\,\mathit{ns}.\) The minimum clock period would be the maximum delay, here \(19\,\mathit{ns}.\) We get a better balance by pushing the register across circuit \(E,\) such that stage 1 covers subcircuits \(A\) through \(D\) with a delay of \(15\,\mathit{ns},\) and stage 2 has a delay of \(9\,\mathit{ns}.\) Pushing the register between subcircuits \(C\) and \(D,\) as shown in pipeline (b), offers the best balance of stage delays. Stage 1 has a delay of \(10\,\mathit{ns}\) and stage 2 a delay of \(14\,\mathit{ns}.\) The minimum clock period of this 2-stage pipeline is the maximum of the delays, \(T_C = 14\,\mathit{ns}.\)

To transform the circuit into a 3-stage pipeline, we look for 3 stages, each with a delay of about \(T_{crit}/3 = 8\,\mathit{ns}.\) We place 3 registers on output \(Y,\) and start pushing 2 registers toward input \(X\):

Pushing 2 registers across subcircuits \(F\) and also \(E\) produces a stage with a delay of \(9\,\mathit{ns},\) which is close to the balanced delay of \(8\,\mathit{ns},\) see circuit (b). Therefore, we leave 1 register between subcircuits \(D\) and \(E,\) and push the other register toward input \(X.\) If we place the register between subcircuits \(B\) and \(C,\) as shown in (c), the stage consisting of subcircuits \(C\) and \(D\) has a delay of \(8\,\mathit{ns},\) as desired. The first stage has a delay of \(7\,\mathit{ns}\) only. However, the delays of the resulting three stages are nearly equal. The minimum clock period of the 3-stage pipeline is the maximum stage delay, here \(T_C = 9\,\mathit{ns}.\)
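
For a linear chain, the balancing problem can also be solved exhaustively: place \(k-1\) register cuts so as to minimize the slowest stage. A sketch that reproduces both hand-derived pipelines:

```python
# Exhaustive search for the best placement of k-1 register cuts in the chain
# A..F: minimize the maximum stage delay, which sets the clock period T_C.
from itertools import combinations

def best_partition(delays, k):
    n = len(delays)
    best = None
    for cuts in combinations(range(1, n), k - 1):
        bounds = [0, *cuts, n]
        worst = max(sum(delays[bounds[i]:bounds[i+1]])
                    for i in range(k))            # slowest stage delay
        if best is None or worst < best[0]:
            best = (worst, cuts)
    return best

delays = [5, 2, 3, 5, 4, 5]          # A, B, C, D, E, F in ns
print(best_partition(delays, 2))     # (14, (3,)): cut between C and D
print(best_partition(delays, 3))     # (9, (2, 4)): stages A-B | C-D | E-F
```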

We compare the throughput of the 2-stage pipeline, 3-stage pipeline, and 5-stage pipeline. Incorporating the sequencing overhead of the pipeline registers, the minimum clock period \(T_{\phi}\) is the maximum stage delay plus the sequencing overhead:

\[T_\phi = T_C + t_{pcq} + t_{setup}\,.\]

We use throughput \(X_\infty = 1 / T_\phi\) to compare the three pipeline designs:

stages | \(T_\phi\) [ns] | \(X_\infty\) [computations/s]
---|---|---
2 | \(15\) | \(6.67 \times 10^7\)
3 | \(10\) | \(10^8\)
5 | \(6\) | \(1.67 \times 10^8\)

We find that the 5-stage pipeline offers the maximal throughput of the three designs.

Consider the combinational circuit with subcircuits \(A, B, \ldots, F\):

- Apply the register-push method to pipeline the circuit.
- Calculate speedup \(S_n\) w.r.t. the sequential pipeline execution.
- Calculate the speedup w.r.t. \(n\) combinational executions.
- What is the speedup if circuit \(A\) had a delay of \(50\,\mathit{ps}\)?

The critical path of the circuit is path \(W \rightarrow B \rightarrow D \rightarrow E \rightarrow F \rightarrow Y\) with path delay

\[T_{crit} = (25 + 50 + 10 + 40)\,\mathit{ps} = 125\,\mathit{ps}\,.\]

The subcircuit with maximum delay is subcircuit \(D.\) Therefore, the minimum clock period of the pipeline is

\[T_C = 50\,\mathit{ps}\,.\]

We choose the number of pipeline stages

\[k = \biggl\lceil \frac{T_{crit}}{T_C} \biggr\rceil = \biggl\lceil \frac{125\,\mathit{ps}}{50\,\mathit{ps}} \biggr\rceil = 3\,.\]

We place 3 pipeline registers on output \(Y,\) and push registers toward the inputs of the circuit:

In pipeline (a), we leave 1 register on output \(Y\) and push 2 registers across subcircuit \(F\) to each of its inputs. We observe in the resulting pipeline (b) that we can push both registers on the output of subcircuit \(E\) across \(E,\) because the resulting pipeline stage with subcircuits \(E\) and \(F\) has a delay of \(10\,\mathit{ps} + 40\,\mathit{ps} = 50\,\mathit{ps},\) which equals our clock period target \(T_C.\) Pipeline (c) shows the register placement after pushing both registers from the output of \(E\) to each of its three inputs. Next, we have two opportunities for register pushing. First, we push 1 of the 2 registers from the output of subcircuit \(D\) to its inputs, so that subcircuit \(D\) occupies one pipeline stage on its own. Second, we push 2 registers from the inputs of subcircuits \(E\) and \(F\) across the wire fork, and place 2 registers on the output of subcircuit \(C.\) As a result, we obtain pipeline (d). Here, we have another two opportunities for register pushing. First, we can push 1 register across the wire fork at the output of subcircuit \(B.\) Second, we may push 1 register across subcircuit \(C.\) The resulting pipeline is shown in (e).

To obtain a cleaner picture of the pipelined circuit, we redraw the schematic with vertically aligned stages and register symbols for the pipeline registers.

In this schematic, it is straightforward to verify that the circuit is a 3-stage pipeline, and that each pipeline stage has a propagation delay of at most \(T_C = 50\,\mathit{ps},\) including stage 1 from the input terminals to the first pipeline register.
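The staging arithmetic above is easy to double-check programmatically. The following Python sketch recomputes the critical path delay, the minimum clock period, and the stage count from the subcircuit delays given in the exercise (only the delays on the critical path are used; the variable names are ours):

```python
import math

# Delays (in ps) of the subcircuits on the critical path
# W -> B -> D -> E -> F -> Y of the exercise.
critical_path = {"B": 25, "D": 50, "E": 10, "F": 40}

T_crit = sum(critical_path.values())   # critical path delay: 125 ps
# The slowest subcircuit (here D, which lies on the critical path)
# bounds the minimum clock period of the pipeline.
T_C = max(critical_path.values())      # 50 ps
k = math.ceil(T_crit / T_C)            # number of pipeline stages

print(T_crit, T_C, k)                  # -> 125 50 3
```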

The speedup of the pipelined execution w.r.t. the sequential pipeline execution is the ratio of the latency of the sequential and the pipelined executions, both on the same pipelined circuit. The latency of \(n\) computations on the 3-stage pipeline in pipelined execution mode is

\[T_n = (n + 3 - 1)\, T_C\,.\]

As reference point, we consider the *sequential* execution mode on the 3-stage pipeline, i.e. we start a new computation after the previous computation has left the pipeline. At any point in time, exactly one computation occupies the pipeline. Thus, if one computation takes time \(T_1 = k \cdot T_C = 3\, T_C,\) then \(n\) computations executed sequentially one-by-one take time \(n\, T_1.\) The speedup of the pipelined w.r.t. the sequential pipeline execution is then:

\[\begin{eqnarray*} S_n &=& \frac{n\, T_1}{T_n} \\ &=& \frac{3\, n\, T_C}{(n+3-1)\, T_C} \\ &=& \frac{3 n}{n+2}\,. \end{eqnarray*}\]

For a single computation, \(n = 1,\) and the speedup is \(S_1 = 1.\) For infinitely many computations, the speedup is \(S_\infty = 3,\) which equals the number of pipeline stages.
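The closed form \(S_n = 3n/(n+2)\) is easy to evaluate numerically. A minimal Python sketch (the function name is ours) confirms both limiting cases:

```python
def speedup_vs_sequential(n, k=3):
    """Speedup of pipelined over sequential execution on a k-stage
    pipeline: S_n = n*k*T_C / ((n + k - 1)*T_C) = k*n / (n + k - 1)."""
    return k * n / (n + k - 1)

print(speedup_vs_sequential(1))       # 1.0: a single computation gains nothing
print(speedup_vs_sequential(10**6))   # approaches k = 3 as n grows
```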

The speedup of the pipelined execution w.r.t. \(n\) combinational executions is the ratio of the latency of \(n\) sequential executions on the combinational circuit without pipeline registers and their pipelined execution on the pipelined circuit. The latency of \(n\) computations on the 3-stage pipeline in pipelined execution mode is

\[T_n = (n + 3 - 1)\, T_C\,.\]

The reference point is the sequential execution mode on the original combinational circuit. The latency of 1 computation on the combinational circuit is the critical path delay \(T_{crit}.\) To avoid races between subsequent computations along faster paths, we assume that we apply the input signals to the combinational circuit, wait for time \(T_{crit},\) and only then sample the output signal and apply the next input signals. This sequential execution mode yields execution time \(n\, T_{crit}\) for \(n\) computations. The speedup of the pipelined execution w.r.t. the sequential combinational execution is

\[\begin{eqnarray*} S_n &=& \frac{n\, T_{crit}}{T_n} \\ &=& \frac{n \cdot 125\,\mathit{ps}}{(n+2)\, 50\,\mathit{ps}} \\ &=& \frac{2.5\, n}{n+2}\,. \end{eqnarray*}\]

For one computation, the 3-stage pipeline incurs a speedup of \(S_1 = 0.83,\) i.e. a slowdown compared to the combinational circuit. For infinitely many computations, however, the pipeline achieves a speedup of \(S_\infty = 2.5.\) This speedup is not optimal; the optimal speedup equals the number of pipeline stages \(k = 3.\) The reason is the lack of stage balance. In particular, stage 1 utilizes only \(25\,\mathit{ps}\) of clock period \(T_C = 50\,\mathit{ps}\) via subcircuit \(B.\)
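The same calculation can be sketched in Python, using the delay values of the exercise (the function name and defaults are ours):

```python
def speedup_vs_combinational(n, t_crit=125, t_c=50, k=3):
    """Speedup of pipelined execution over n sequential executions on
    the combinational circuit: S_n = n*T_crit / ((n + k - 1)*T_C)."""
    return n * t_crit / ((n + k - 1) * t_c)

print(round(speedup_vs_combinational(1), 2))   # 0.83: slowdown for one computation
print(speedup_vs_combinational(10**6))         # approaches T_crit / T_C = 2.5
```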

Assuming subcircuit \(A\) has a delay of \(50\,\mathit{ps},\) the critical path of the combinational circuit is \(V \rightarrow A \rightarrow D \rightarrow E \rightarrow F \rightarrow Y\) with a delay of \(T_{crit} = 150\,\mathit{ps}.\) Pipelining the circuit yields the same 3-stage pipeline as in subproblem (a). However, subcircuit \(A\) utilizes the entire clock period of stage 1. The speedup of subproblem (c) changes to

\[\begin{eqnarray*} S_n &=& \frac{n\, T_{crit}}{T_n} \\ &=& \frac{n \cdot 150\,\mathit{ps}}{(n+2)\, 50\,\mathit{ps}} \\ &=& \frac{3 n}{n+2}\,. \end{eqnarray*}\]

We find that one computation incurs no slowdown; its speedup is \(S_1 = 1.\) Furthermore, infinitely many computations yield the optimal speedup \(S_\infty = 3,\) equal to the number of pipeline stages.

Consider the Gantt chart of 5 computations \(A, B, \ldots, E\) executed on a 4-stage pipeline:

- Identify all pipeline stalls with arrows.
- Determine latency, throughput, and speedup w.r.t. the sequential pipeline execution.
- Determine the speedup assuming there were no pipeline stalls.

If a computation stalls in stage \(i\) of a \(k\)-stage pipeline, all stages from 0 to \(i\) must keep their computations, whereas stages \(i+1\) through \(k-1\) continue to operate as usual. In the 4-stage pipeline below, we assume the computation in stage 1 stalls. Then, all stages at the head of the pipeline must keep their computations, here stages 0 and 1, whereas the stages at the tail of the pipeline, here stages 2 and 3, can continue their operation.

Gantt charts make it easy to spot whether a computation stalls in a pipeline stage: if the computation occupies stage \(i\) in cycle \(t\) and also in the next cycle \(t+1,\) then the computation is stalled in cycle \(t.\) We use vertical arrows pointing downwards from the cell of the stalled computation to stage 0 of the pipeline, so that the arrow covers all pipeline stages stalled during one clock cycle.

Given the Gantt chart of the 4-stage pipeline, we begin inspecting computation \(A\) in the top row. Computation \(A\) occupies stage 0 during cycle 0, progresses to stage 1 during cycle 1, then to stage 2 during cycle 2, and remains in stage 2 during cycle 3. Thus, computation \(A\) stalls the pipeline during cycle 2. We draw a vertical arrow in cycle 2 from computation \(A\) to computation \(C\) in stage 0. The arrow indicates that the pipeline stalls computation \(A\) in stage 2, computation \(B\) in stage 1, and computation \(C\) in stage 0. All three pipeline stages must retain their computations during the next clock cycle 3. Otherwise, if computation \(B\) progressed into stage 2 during cycle 3, for example, a resource conflict would occur in stage 2 because two computations, \(A\) and \(B,\) would occupy the same stage in the same cycle. While this is possible in a Gantt chart, it is not in a digital circuit.
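The stall rule is mechanical enough to automate: a computation that occupies the same stage in two consecutive cycles is stalled in the first of them. A minimal Python sketch, assuming a Gantt chart row is given as an entry cycle plus the sequence of stages the computation occupies (the helper name `detect_stalls` is ours):

```python
def detect_stalls(entry_cycle, stages):
    """Return the cycles in which a computation is stalled.  `stages`
    lists the stage the computation occupies in each cycle, starting at
    `entry_cycle`.  A stall occurs whenever the computation sits in the
    same stage during two consecutive cycles."""
    return [entry_cycle + t
            for t in range(len(stages) - 1)
            if stages[t] == stages[t + 1]]

# Computation A of the exercise: stage 0 in cycle 0, stage 1 in cycle 1,
# stage 2 in cycles 2 and 3, stage 3 in cycle 4.
print(detect_stalls(0, [0, 1, 2, 2, 3]))   # -> [2]
```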

The arrows in the Gantt chart identify stall cycles 2, 4, 5, and 8. In cycle 2, computation \(A\) stalls stage 2. Computation \(B\) stalls stage 2 for two clock cycles 4 and 5, and computation \(C\) stalls stage 3 during cycle 8.

We find latency \(T_5\) of the 5 computations directly in the Gantt chart. The first computation \(A\) begins in cycle 0 and the last computation \(E\) ends in cycle 11. Therefore, the 5 computations together take \(T_5 = 12\) clock cycles.

The throughput of \(n\) computations on a pipeline is the average number of computations per cycle \(X_n = n / T_n.\) In case of the 5 computations with \(T_5 = 12\) the throughput is

\[X_5 = \frac{5}{T_5} = \frac{5}{12}\,.\]

The speedup of the pipelined execution w.r.t. the sequential pipeline execution is the ratio of the latency of five sequential executions on the pipeline and latency \(T_5\) of the pipelined execution. We assume that the pipeline does not stall during the sequential execution of computations, where exactly one computation occupies the pipeline at any point in time. Therefore, in sequential mode, 1 computation has a latency of \(T_1 = k = 4\) clock cycles, and 5 computations executed one after another have latency \(5\, T_1.\) Then, the speedup of pipelined w.r.t. sequential execution is

\[S_5 = \frac{5\, T_1}{T_5} = \frac{20}{12} = 5/3\,.\]

We conclude that \(S_5\) is larger than 1, and therefore a true speedup. However, the speedup is relatively small. The maximum speedup on a 4-stage pipeline is \(S_\infty = 4,\) so that \(S_5\) is clearly suboptimal. The reason for the performance loss of the pipelined execution is the pipeline stalls.

We wish to know the maximum speedup \(S_5\) if no pipeline stalls occur. The latency of 5 computations on a 4-stage pipeline without stalls is

\[T_5 = 5 + 4 - 1 = 8\]

clock cycles. Therefore, the speedup without pipeline stalls w.r.t. the sequential execution is

\[S_5 = \frac{5\, T_1}{T_5} = \frac{5 \cdot 4}{8} = 5/2\,.\]

We conclude that avoiding pipeline stalls can boost the performance of the 5 computations from a speedup of \(5/3 \approx 1.67\) with pipeline stalls to \(5/2 = 2.5\) without stalls.
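The comparison of the stalled and stall-free speedups boils down to a few lines of arithmetic. A quick Python check, using the cycle counts read off the Gantt chart (the variable names are ours):

```python
k, n = 4, 5                  # pipeline stages and number of computations
T1 = k                       # sequential latency of one computation, in cycles

T5_stalled = 12              # latency read off the Gantt chart (4 stall cycles)
T5_ideal = n + k - 1         # latency without stalls: 8 cycles

print(n * T1 / T5_stalled)   # speedup with stalls: 5/3
print(n * T1 / T5_ideal)     # speedup without stalls: 5/2
```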

Consider a 4-stage pipeline with stages 0, 1, 2, and 3, and computations \(A,\) \(B,\) \(C,\) \(D,\) and \(E.\) Assume that

- computation \(A\) stalls in stage 1 for 2 clock cycles,
- computation \(B\) stalls in stage 2 for 1 clock cycle, and
- computation \(D\) stalls in stage 1 for 1 clock cycle.

- Draw a Gantt chart to illustrate the resource utilization.
- Determine latency and throughput of the computations.
- Assuming computations \(A, B, \ldots, E\) are repeated infinitely often, determine the throughput of the pipeline.

We assume that computations \(A,\) \(B,\) \(C,\) \(D,\) and \(E\) enter the pipeline in this order at the earliest possible clock cycle. The first computation \(A\) shall occupy stage 0 during clock cycle 0. Then, the Gantt chart of computations \(A, B, \ldots, E\) on a 4-stage pipeline is:

Computation \(A\) enters stage 1 in cycle 1, and stalls for 2 cycles by assumption. Thus, the first stall cycle is cycle 1, the second stall cycle is cycle 2, and stage 1 resumes execution of computation \(A\) in cycle 3. Thereafter, computation \(A\) traverses stages 2 and 3 without further stalls. Computation \(B\) enters stage 0 in cycle 1, and is stalled by computation \(A\) during cycles 1 and 2. Thus, computation \(B\) resumes execution in stage 0 during cycle 3, occupies stage 1 in cycle 4, and stalls in stage 2 during cycle 5 by assumption. Computation \(C\) enters stage 0 in cycle 4, and is stalled in stage 1 by computation \(B\) during cycle 5. Computation \(D\) enters stage 0 in cycle 5, where it is stalled by \(B\) for 1 cycle. Computation \(D\) proceeds to stage 1 in cycle 7 and stalls for 1 cycle by assumption. The last computation \(E\) is stalled in stage 0 by computation \(D,\) and exits the pipeline after cycle 11.

The latency of the 5 computations \(A, B, \ldots, E\) is visible in the Gantt chart. The first computation starts in cycle 0 and the last computation exits the pipeline after cycle 11. Therefore, the latency is

\[T_5 = 12\]

clock cycles. The throughput is the average number of computations per cycle:

\[X_5 = \frac{5}{T_5} = \frac{5}{12}\,.\]

We wish to deduce throughput \(X_\infty,\) assuming the group of 5 computations \(A, B, \ldots, E\) is repeated infinitely often. To that end, we observe in the Gantt chart that we can restart computation \(A\) in stage 0 during cycle 9, after computation \(E\) has propagated from stage 0 into stage 1. Thus, in steady state, the group of 5 computations requires 9 cycles, and computation \(A\) occupies stage 0 in cycles 0, 9, 18, 27, etc. Note that computations \(D\) and \(E\) do not stall the pipeline after restarting \(A.\) Therefore, the pipeline executes 5 computations every 9 cycles, such that the throughput is

\[X_\infty = \frac{5}{9}\]

computations per cycle. The conclusion from one group of computations \(A, B, \ldots, E\) to infinitely many groups is justified because throughput is an average metric.
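The Gantt chart of this exercise can also be generated programmatically. Below is a minimal in-order pipeline simulator sketched under the assumptions of the exercise; the function `simulate` and its stall encoding are our own, not from the text. Each computation dwells in a stage for one cycle plus its stall cycles, and may advance only once it is done and the stage ahead is free; a blocked computation thereby holds up every computation behind it.

```python
def simulate(num_stages, stalls):
    """Simulate an in-order pipeline.  `stalls` holds one dict per
    computation, mapping a stage index to the extra stall cycles spent
    in that stage.  Returns (enter, leave), where enter[j][s] is the
    cycle computation j enters stage s, and leave[j][s] is the last
    cycle it occupies stage s."""
    enter, leave = [], []
    for j, stall in enumerate(stalls):
        e = [0] * num_stages
        l = [0] * num_stages
        # Entry into stage 0: wait until the predecessor has left it.
        e[0] = leave[j - 1][0] + 1 if j > 0 else 0
        for s in range(num_stages):
            # Last cycle needed for own work (incl. stalls) in stage s.
            ready = e[s] + stall.get(s, 0)
            if s + 1 < num_stages:
                # Advance once done AND once the predecessor has left s+1.
                e[s + 1] = ready + 1
                if j > 0:
                    e[s + 1] = max(e[s + 1], leave[j - 1][s + 1] + 1)
                l[s] = e[s + 1] - 1   # held in s until it may advance
            else:
                l[s] = ready          # final stage: leave when done
        enter.append(e)
        leave.append(l)
    return enter, leave

# Exercise: A stalls 2 cycles in stage 1, B stalls 1 cycle in stage 2,
# D stalls 1 cycle in stage 1; C and E have no stalls of their own.
stalls = [{1: 2}, {2: 1}, {}, {1: 1}, {}]
enter, leave = simulate(4, stalls)
latency = leave[-1][-1] + 1      # E leaves stage 3 after cycle 11
print(latency, 5 / latency)      # -> 12 cycles, throughput 5/12
```

The printed entry cycles reproduce the chart derived by hand above, e.g. computation \(B\) enters stage 1 in cycle 4 and the last computation exits after cycle 11.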

Footnotes

[1] In theoretical computer science, finite state machines, also called finite automata, are the foundational model for the study of discrete systems and their computational complexity.

[2] If you wish to learn Python, check out Allen Downey’s book Think Python: How to Think Like a Computer Scientist.

[3] Henry Ford introduced pipelining to increase the throughput of his Ford car factories. A detailed account of Ford’s accomplishments can be found in the book Ford Methods and the Ford Shops by Horace L. Arnold and Fay L. Faurote, published by The Engineering Magazine Company in 1915.

[4] Pipelining combinational circuits is fittingly known as wave pipelining, because signals traverse the circuit like wavefronts travel across a water surface.