# INTRODUCTION TO DIGITAL SIGNAL PROCESSORS (DSPs)

*Figure labels: accumulator architecture; memory-register architecture (register file, on-chip memory).*

#### Prof. Brian L. Evans

*Contributions by* Niranjan Damera-Venkata and Magesh Valliappan

Embedded Signal Processing Laboratory, The University of Texas at Austin, Austin, TX 78712

http://signal.ece.utexas.edu/

### Outline

- Signal processing applications
- Conventional DSP architecture
- Pipelining in DSP processors
- RISC vs. DSP processor architectures
- TI TMS320C6000 DSP architecture introduction
- Signal processing on general-purpose processors
- Conclusion

### **Signal Processing Applications**

#### Embedded system demand in world: volume, volume, ...

- 400 Million units/year: automobiles, PCs, cell phones
- 30 Million units/year: ADSL modems and printers

#### Consumer electronics products

| Product               | Average Unit Price | Annual Revenue |
|-----------------------|--------------------|----------------|
| Wireless phone        | \$136              | \$11.5 Billion |
| Digital cameras       | \$271              | \$4.2 Billion  |
| Portable CD players   | \$48               | \$0.9 Billion  |
| MP3 players           | \$137              | \$0.7 Billion  |
| Compact audio systems | \$111              | \$0.5 Billion  |

Source: CEA Market Research (US). Data for 2004 calendar year.

- How much should an embedded processor cost?

### Signal Processing Applications

#### Embedded system cost and input/output rates

- Low-cost, low-throughput: sound cards, cell phones, MP3 players, car audio, guitar effects
- Medium-cost, medium-throughput: low-end printers, disk drives, PDAs, ADSL modems, digital cameras, video conferencing
- High-cost, high-throughput: high-end printers, audio mixing boards, wireless basestations, high-end video conferencing, 3-D sonar, 3-D reconstructions from 2-D slices (e.g. X-rays) in medical imaging
- Embedded processor requirements
  - Inexpensive with small area and volume
  - Predictable input/output (I/O) rates to/from processor
  - Power constraints (severe for handheld devices)

Typical implementations for the three tiers above, in order: single DSP; single DSP + coprocessor; multiple DSPs.

### **Conventional DSP Processors**

- Low cost: as low as \$2/processor in volume
- Deterministic interrupt service routine latency guarantees predictable input/output rates
  - On-chip direct memory access (DMA) controllers
    - Processes streaming input/output separately from CPU
    - Sends interrupt to CPU when block has been read/written
  - Ping-pong buffering
    - CPU reads/writes buffer 1 as DMA reads/writes buffer 2
    - After DMA finishes buffer 2, roles of buffers 1 & 2 switch
- Low power consumption: 10-100 mW
  - TI TMS320C54 0.32 mA/MIP → 76.8 mW at 1.5 V, 160 MHz
  - TI TMS320C55 0.05 mA/MIP → 22.5 mW at 1.5 V, 300 MHz
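
The power figures above follow from multiplying the current per MIPS by the instruction rate and the supply voltage. The worked arithmetic below assumes one instruction per clock cycle, so that 160 MHz corresponds to 160 MIPS and 300 MHz to 300 MIPS; that assumption is not stated on the slide.

$$
\begin{aligned}
P_{\text{C54}} &= 0.32\ \tfrac{\text{mA}}{\text{MIPS}} \times 160\ \text{MIPS} \times 1.5\ \text{V} = 51.2\ \text{mA} \times 1.5\ \text{V} = 76.8\ \text{mW} \\
P_{\text{C55}} &= 0.05\ \tfrac{\text{mA}}{\text{MIPS}} \times 300\ \text{MIPS} \times 1.5\ \text{V} = 15\ \text{mA} \times 1.5\ \text{V} = 22.5\ \text{mW}
\end{aligned}
$$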

Based on conventional (pre-1996) architecture
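
The ping-pong buffering scheme listed above can be sketched in C. This is a minimal illustration, not vendor code: the DMA transfer is simulated by ordinary writes, and the names `PingPong`, `pingpong_swap`, `block_sum`, and the block length `BLOCK_LEN` are all hypothetical.

```c
#include <stdint.h>

#define BLOCK_LEN 4  /* hypothetical block size, chosen for illustration */

/* Two buffers: at any time the DMA owns one and the CPU owns the other. */
typedef struct {
    int16_t buf[2][BLOCK_LEN];
    int dma_index;  /* buffer currently owned by the (simulated) DMA */
} PingPong;

void pingpong_init(PingPong *pp) {
    pp->dma_index = 0;
}

/* Buffer that the DMA should fill next. */
int16_t *pingpong_dma_buffer(PingPong *pp) {
    return pp->buf[pp->dma_index];
}

/* Intended to be called when the DMA signals "block complete":
   swap the roles of the two buffers and return the block that was
   just filled, which the CPU may now process. */
int16_t *pingpong_swap(PingPong *pp) {
    int16_t *ready = pp->buf[pp->dma_index];
    pp->dma_index ^= 1;
    return ready;
}

/* Example CPU work on a completed block: sum its samples. */
int32_t block_sum(const int16_t *block) {
    int32_t sum = 0;
    for (int i = 0; i < BLOCK_LEN; i++) {
        sum += block[i];
    }
    return sum;
}
```

In a real system the swap would happen inside the DMA interrupt service routine, and the CPU must finish each block before the DMA fills the other one.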

### **Conventional DSP Architecture**

- Multiply-accumulate (MAC) in 1 instruction cycle
- Harvard architecture for fast on-chip I/O
  - Data memory/bus separate from program memory/bus
  - One read from program memory per instruction cycle
  - Two reads/writes from/to data memory per instruction cycle
- Instructions to keep pipeline (3-6 stages) full
  - Zero-overhead looping (one pipeline flush to set up)
  - Delayed branches
- Special addressing modes supported in hardware
  - Bit-reversed addressing (e.g. fast Fourier transforms)
  - Modulo addressing for circular buffers (e.g. filters)
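
As a rough software picture of bit-reversed addressing, the function below reverses the low-order bits of an index, which is the reordering a radix-2 fast Fourier transform needs. A DSP performs this in its address generation hardware; the loop here is only an illustrative sketch, and the name `bit_reverse` is hypothetical.

```c
#include <stdint.h>

/* Reverse the lowest `bits` bits of `index` (bits between 1 and 32). */
uint32_t bit_reverse(uint32_t index, unsigned bits) {
    uint32_t reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);  /* move LSB of index into result */
        index >>= 1;
    }
    return reversed;
}
```

For an 8-point transform (3 address bits), index 1 (binary 001) maps to 4 (binary 100) and index 3 (011) maps to 6 (110).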

### Conventional DSP Architecture (cont.)

- Buffer of length *K* 
  - Used in finite and infinite impulse response filters

#### Linear buffer

- Sort by time index
- Update: discard oldest data, copy old data left, insert new data

#### Circular buffer

- Oldest data index
- Update: insert new data at oldest index, update oldest index
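
The circular buffer update and its use in a finite impulse response filter can be sketched as follows. This is a minimal example under stated assumptions: the buffer length `K`, the type `CircBuf`, and the function names are hypothetical, and the `%` operator stands in for the modulo addressing that a DSP performs in hardware.

```c
#include <stddef.h>

#define K 4  /* hypothetical buffer length, chosen for illustration */

typedef struct {
    float data[K];
    size_t oldest;  /* index of the oldest sample */
} CircBuf;

void circ_init(CircBuf *cb) {
    for (size_t i = 0; i < K; i++) {
        cb->data[i] = 0.0f;
    }
    cb->oldest = 0;
}

/* Update: insert new data at the oldest index, then advance the
   oldest index modulo K. No samples are copied. */
void circ_insert(CircBuf *cb, float sample) {
    cb->data[cb->oldest] = sample;
    cb->oldest = (cb->oldest + 1) % K;
}

/* FIR output y[n] = sum over k of h[k] * x[n-k], one
   multiply-accumulate per tap. The newest sample sits just
   before the oldest index. */
float fir_output(const CircBuf *cb, const float h[K]) {
    float acc = 0.0f;
    size_t idx = (cb->oldest + K - 1) % K;  /* newest sample */
    for (size_t k = 0; k < K; k++) {
        acc += h[k] * cb->data[idx];        /* multiply-accumulate */
        idx = (idx + K - 1) % K;            /* step back one sample, modulo K */
    }
    return acc;
}
```

A linear buffer would instead shift K-1 samples on every update, which is the copying cost the circular buffer avoids.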



#### **Modulo Addressing Using a Circular Buffer**



### **Conventional DSP Processors Summary**

|                | Fixed-Point                             | Floating-Point                            |
|----------------|-----------------------------------------|-------------------------------------------|
| Cost/Unit      | \$2 - \$79                              | \$3 - \$381                               |
| Architecture   | Accumulator                             | Load-store or memory-register             |
| Registers      | 2-4 data; 8 address                     | 8 or 16 data; 8 or 16 address             |
| Data Words     | 16 or 24 bit integer and fixed-point    | 32 bit integer and fixed/floating-point   |
| On-Chip Memory | 2-64 kwords data; 2-64 kwords program   | 8-64 kwords data; 8-64 kwords program     |
| Address Space  | 16-128 kw data; 16-64 kw program        | 16 Mw - 4 Gw data; 16 Mw - 4 Gw program   |
| Compilers      | C, C++ compilers; poor code generation  | C, C++ compilers; better code generation  |
| Examples       | TI TMS320C5000; Freescale DSP56000      | TI TMS320C30; Analog Devices SHARC        |

### **Conventional DSP Processor Families**

#### Floating-point DSPs

- Used in initial prototyping of algorithms
- Resurgence due to professional and car audio
- DSP market (est.): fixed-point 95%, floating-point 5%
- Different on-chip configurations in each family
  - Size and map of data and program memory
  - A/D, input/output buffers, interfaces, timers, and D/A
- Drawbacks to conventional DSP processors
  - No byte addressing (needed for images and video)
  - Limited on-chip memory
  - Limited addressable memory on fixed-point DSPs (exceptions include Freescale 56300 and TI C5409)
  - Non-standard C extensions for fixed-point data type

# **Pipelining**



#### Pipelining

- Process instruction stream in stages (as stages of assembly on a manufacturing line)
- Increase throughput
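
The throughput gain can be made concrete with an idealized cycle count. The model below is an assumption for illustration, not taken from the slide: a pipeline of $k$ stages, one cycle per stage, and no hazards or stalls.

$$
T_{\text{unpipelined}} = kN, \qquad T_{\text{pipelined}} = k + (N - 1)
$$

For $N = 100$ instructions on a $k = 4$ stage pipeline, this gives 400 cycles without pipelining and 103 cycles with it. As $N$ grows, the speedup $kN / (k + N - 1)$ approaches $k$.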

#### **Managing Pipelines**

- Compiler or programmer
- Pipeline interlocking

# **Pipelining: Operation**

#### Time-stationary pipeline model

- Programmer controls each cycle
- Example: Freescale DSP56001 (has separate X/Y data memories/registers)

MAC X0,Y0,A X:(R0)+,X0 Y:(R4)-,Y0

#### Data-stationary pipeline model

- Programmer specifies data operations
- Example: TI TMS320C30

MPYF \*++AR0(1), \*++AR1(IR0), R0

- Interlocked pipeline
  - "Protection" from pipeline effects
  - May not be reported by simulators: inner loops may take extra cycles

MAC means multiplication-accumulation.



### **Pipelining: Hazards**

- A control hazard occurs when a branch instruction is decoded
  - Processor "flushes" the pipeline, or
  - Use delayed branch (expose pipeline)
- A data hazard occurs because an operand cannot be read yet
  - Intended by programmer, or
  - Interlock hardware inserts "bubble"
  - TI TMS320C5000 (20 CPU & 16 I/O registers, one accumulator, and one address pointer ARP implied by \*)

| Instruction | Operands  | Comment                               |
|-------------|-----------|---------------------------------------|
| LAR         | AR2, ADDR | load address register                 |
| LACC        | \*-       | load accumulator with contents of AR2 |




LAR: 2 cycles to update AR2 & ARP; need NOP after it

### **Pipelining: Avoiding Control Hazards**







#### TI TMS320C6000 DSP Architecture







#### TI TMS320C6000 Instruction Set

#### **C6000 Instruction Set by Functional Unit**

| Unit    | Instructions |
|---------|--------------|
| .S Unit | ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO |
| .L Unit | ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO |
| .D Unit | ADD, ADDA, LD, MV, NEG, ST, SUB, SUBA, ZERO |
| .M Unit | MPY, MPYH, SMPY, SMPYH |
| Other   | NOP, IDLE |

Six of the eight functional units can perform integer add, subtract, and move operations

### TI TMS320C6000 Instruction Set

#### **C6000 Instruction Set by Category**

| Category        | Instructions |
|-----------------|--------------|
| Arithmetic      | ABS, ADD, ADDA, ADDK, ADD2, MPY, MPYH, NEG, SMPY, SMPYH, SADD, SAT, SSUB, SUB, SUBA, SUBC, SUB2, ZERO |
| Logical         | AND, CMPEQ, CMPGT, CMPLT, NOT, OR, SHL, SHR, SSHL, XOR |
| Bit Management  | CLR, EXT, LMBD, NORM, SET |
| Data Management | LD, MV, MVC, MVK, MVKH, ST |
| Program Control | B, IDLE, NOP |

Slide annotations on the arithmetic instructions: (un)signed multiplication; saturation/packed arithmetic.
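
Saturation arithmetic, as in instructions such as SADD, clamps a result to the representable range instead of letting it wrap around. The C function below is a minimal sketch of that behavior for 32-bit operands; the name `saturating_add32` is hypothetical and the code does not model any status flags the hardware may set.

```c
#include <stdint.h>

/* Add two 32-bit signed integers, clamping to the representable
   range instead of wrapping around on overflow. */
int32_t saturating_add32(int32_t a, int32_t b) {
    int64_t sum = (int64_t)a + (int64_t)b;  /* exact in 64 bits */
    if (sum > INT32_MAX) {
        return INT32_MAX;
    }
    if (sum < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)sum;
}
```

Clamping keeps an overflowing signal sample at full scale, which distorts far less than the sign flip produced by wraparound.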

### C6000 vs. C5000 Addressing Modes

| Addressing mode | Description                                                                | TI C5000  | TI C6000             |
|-----------------|----------------------------------------------------------------------------|-----------|----------------------|
| Immediate       | The operand is part of the instruction                                     | ADD #0FFh | add .L1 -13,A1,A6    |
| Register        | Operand is specified in a register                                         | (implied) | add .L1 A7,A6,A7     |
| Direct          | Address of operand is part of the instruction (added to imply memory page) | ADD 010h  | not supported        |
| Indirect        | Address of operand is stored in a register                                 | ADD \*    | ldw .D1 \*A5++[8],A1 |
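
The indirect example `ldw .D1 *A5++[8],A1` can be read as a pointer dereference followed by a post-increment. The C sketch below models that reading, assuming the bracketed offset counts 32-bit words; the function name `load_post_increment` is hypothetical.

```c
#include <stdint.h>

/* Load the word at *ptr, then advance the pointer by `step` words.
   With step = 8 this mirrors the effect of `ldw .D1 *A5++[8],A1`,
   under the assumption that [8] is scaled by the word size. */
int32_t load_post_increment(const int32_t **ptr, int step) {
    int32_t value = **ptr;  /* A1 receives the word addressed by A5 */
    *ptr += step;           /* A5 advances by `step` words */
    return value;
}
```

Folding the address update into the load is what lets an inner loop walk through an array without separate pointer arithmetic instructions.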



#### TI TMS320C6700 Extensions

**C6700 Floating Point Extensions by Unit**

| Unit    | Instructions |
|---------|--------------|
| .S Unit | ABSDP, ABSSP, CMPEQDP, CMPEQSP, CMPGTDP, CMPGTSP, CMPLTDP, CMPLTSP, RCPDP, RCPSP, RSQRDP, RSQRSP, SPDP |
| .L Unit | ADDDP, ADDSP, DPINT, DPSP, DPTRUNC, INTDP, INTSP, SPINT, SPTRUNC, SUBDP, SUBSP |
| .M Unit | MPYDP, MPYI, MPYID, MPYSP |
| .D Unit | ADDAD, LDDW |

Four functional units perform IEEE single-precision (SP) and double-precision (DP) floating-point add, subtract, and move.

Operations beginning with R are reciprocal (i.e. 1/x) calculations.

#### Selected TMS320C6700 DSPs

| DSP   | MHz | MIPS | Data (kbits) | Program (kbits) | Level 2 (kbits) | Price | Applications |
|-------|-----|------|-----------------|--------------------|--------------------|-------|--------------------|
| C6701 | 150 | 1200 | 512             | 512                | 0                  | \$ 82 | C6701 EVM board    |
| C6711 | 150 | 1200 | 32              | 32                 | 512                | \$ 22 | C6711 DSK board    |
|       | 167 | 1336 |                 |                    |                    | \$ 20 |                    |
|       | 250 | 2000 |                 |                    |                    | \$ 19 |                    |
| C6712 | 150 | 1200 | 32              | 32                 | 512                | \$ 14 |                    |
| C6713 | 167 | 1336 | 32              | 32                 | 1000               | \$ 21 |                    |
|       | 225 | 1800 | 32              | 32                 | 1000               | \$ 28 | C6713 DSK board    |
|       | 300 | 2400 | 32              | 32                 | 1000               | \$ 39 |                    |
| C6722 | 250 | 2000 | 1000            | 3072               | 256                | \$ 16 | Professional Audio |
| C6726 | 250 | 2000 | 1000            | 3072               | 256                | \$ 19 | Professional Audio |

DSK means DSP Starter Kit. EVM means Evaluation Module. Unit price is for 1,000 units. Prices effective June 3, 2005. For more information: http://www.ti.com
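
In every row of the table, the MIPS column equals eight times the clock rate in MHz. That matches the eight functional units mentioned earlier if each unit issues one instruction per cycle, which is an interpretation rather than something the table states. The hypothetical helper `peak_mips` below captures the arithmetic.

```c
/* Peak MIPS when every functional unit issues one instruction
   per clock cycle (an idealized upper bound). */
unsigned peak_mips(unsigned clock_mhz, unsigned functional_units) {
    return clock_mhz * functional_units;
}
```

Sustained rates are lower whenever the code cannot keep all units busy in every cycle.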

### **Digital Signal Processor Cores**



- Application Specific Integrated Circuit (ASIC)
  - Programmable DSP core
  - RAM
  - ROM
  - Standard cells
  - Codec
  - Peripherals
  - Gate array
  - Microcontroller core

#### **General Purpose Processors**

#### Multimedia applications on PCs

- Video, audio, graphics and animation
- Repetitive parallel sequences of instructions
- Single Instruction Multiple Data (SIMD)
  - One instruction acts on multiple data in parallel
  - Well-suited for graphics
- Native signal processing extensions use SIMD
  - Sun Visual Instruction Set [1995] (UltraSPARC 1/2)
  - Intel MMX [1996] (Pentium I/II/III/IV)
  - Intel Streaming SIMD Extensions (Pentium III)
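
To illustrate "one instruction acts on multiple data in parallel", the sketch below adds two pairs of 16-bit values packed into 32-bit words using plain C. SIMD extensions do this in hardware on wider registers; this emulation and the name `packed_add16` are illustrative assumptions, not any vendor's API.

```c
#include <stdint.h>

/* Add two pairs of 16-bit values packed into 32-bit words.
   Each lane wraps around independently: no carry crosses from
   the low lane into the high lane. */
uint32_t packed_add16(uint32_t a, uint32_t b) {
    /* Low lane: keep only the low 16 bits of the sum. */
    uint32_t low = (a + b) & 0x0000FFFFu;
    /* High lane: add with the low lanes masked off. */
    uint32_t high = ((a & 0xFFFF0000u) + (b & 0xFFFF0000u)) & 0xFFFF0000u;
    return high | low;
}
```

For example, `packed_add16(0x00030004, 0x00010002)` adds 3 + 1 in the high lane and 4 + 2 in the low lane in a single call.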







## **Concluding Remarks**

#### Digital signal processor market

- 40% annual growth 1990-2000: #1 in semiconductor market
- Worldwide revenue: \$4.4B '99, \$6.1B '00, \$4.5B '01, \$4.9B '02, \$6.1B '03, \$8.0B '04, \$9.5B '05 (estimated); est. annual growth of 23% for 2003-08
- Market share 2001: 40% TI, 16% Agere, 12% Freescale, 8% Analog Devices
- Market share 2002: 43% TI, 14% Freescale, 14% Agere, 9% Analog Devices
- Source: Forward Concepts (http://www.fwdconcepts.com)
- Independent processor benchmarking by industry
  - Berkeley Design Technology Inc. http://www.bdti.com
  - Embedded Microproc. Benchmark Consortium www.eembc.org
- Web resources
  - Newsgroup comp.dsp: FAQ http://www.bdti.com/faq/dsp\_faq.html
  - Embedded processors and systems: http://www.eg3.com
  - On-line courses: http://www.techonline.com