Summary of the Design of Digital Circuits course by Onur Mutlu at ETH Zurich. Thank you very much for opening up these great lectures and materials to self-learners like me. This course gave me invaluable insight into how computers work.

1. Introduction and Basics

2. Mysteries in Comp Arch
• Meltdown and Spectre
• RowHammer
3. Introduction to the Labs and FPGAs
• FPGA (Field Programmable Gate Array)
4. Mysteries in Comp Arch and Basics
• Memory Performance Attacks
• DRAM Refresh
• Bloom Filters
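A Bloom filter answers "is this item in the set?" with a fixed-size bit array: it can report false positives but never false negatives. The following is a minimal sketch, not the lecture's implementation; the class name, the bit-array size, the number of hashes, and the sample row numbers are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by k hash functions (illustrative sketch)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical usage: remember a few row numbers, then query them.
bf = BloomFilter()
for row in (17, 42, 99):
    bf.add(row)
```

Every inserted item is always reported as present; an item that was never inserted is usually, but not always, reported as absent.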
5. Combinational Logic
• Computer
• MOS (Metal-Oxide-Semiconductor) Transistors
• n-type, p-type
• Logic gates
• CMOS NOT, NAND, AND Gates
• Moore’s Law
• Functional Specification
• Boolean Algebra
6. Combinational Logic, Hardware Description Lang. & Verilog
• Two-Level Canonical Forms
• Sum Of Products (SOP)
• Product Of Sums (POS)
• Combinational Building Blocks
• Decoder
• Multiplexer (MUX)
• PLA
• Karnaugh Maps (K-Maps)
• Hardware Description Languages (HDL)
• Verilog
• Implementation
• Structural HDL
• Behavioral HDL
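The canonical Sum Of Products form can be read straight off a truth table: one product term (minterm) per input row whose output is 1. The sketch below enumerates the truth table of a 2-to-1 multiplexer and prints its SOP expression. The helper names `sum_of_products` and `mux2` are made up for this example, and the `~`, `&`, `|` operators are borrowed from Verilog notation.

```python
from itertools import product


def sum_of_products(func, names):
    """Return the canonical SOP expression of func as a string."""
    terms = []
    for values in product((0, 1), repeat=len(names)):
        if func(*values):
            # One minterm per input row where the output is 1.
            literals = [n if v else f"~{n}" for n, v in zip(names, values)]
            terms.append(" & ".join(literals))
    return " | ".join(f"({t})" for t in terms) if terms else "0"


def mux2(s, a, b):
    # 2-to-1 multiplexer: selects a when s is 0, b when s is 1.
    return b if s else a


expr = sum_of_products(mux2, ["s", "a", "b"])
print(expr)
# (~s & a & ~b) | (~s & a & b) | (s & ~a & b) | (s & a & b)
```

The result has four minterms; a K-map would reduce it to `(~s & a) | (s & b)`.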
7. Sequential Logic Design
• Basic Storage Element
• The Reset-Set Latch (R-S Latch)
• Gated D Latch
• The Register
• Memory
• Sequential Logic Circuits
• Clock
• Finite State Machines
• Next state logic, State register, Output logic
• Sequential circuits
• D Flip-Flop
• Timing Diagram
• State Encoding
• Moore, Mealy Machines
• Sequential Logic Using Verilog
• always, posedge, negedge, reg, begin ~ end, if ~ else, case
• Asynchronous / Synchronous reset
• Blocking / Non-Blocking assignment
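The three parts of a finite state machine (next state logic, state register, output logic) can be modeled in a few lines of software. This is a sketch of a Moore machine that outputs 1 once it has seen two consecutive 1s on its input; the state names and the detected pattern are assumptions chosen for illustration, not an example taken from the lectures.

```python
# Next state logic: (current state, input bit) -> next state.
NEXT_STATE = {
    ("S0", 0): "S0", ("S0", 1): "S1",
    ("S1", 0): "S0", ("S1", 1): "S2",
    ("S2", 0): "S0", ("S2", 1): "S2",
}

# Output logic: in a Moore machine the output depends only on the state.
OUTPUT = {"S0": 0, "S1": 0, "S2": 1}


def run_moore(inputs, reset_state="S0"):
    """Apply one input bit per clock cycle and collect the outputs."""
    state = reset_state
    outputs = []
    for bit in inputs:
        state = NEXT_STATE[(state, bit)]  # state register loads the next state
        outputs.append(OUTPUT[state])
    return outputs


trace = run_moore([1, 1, 0, 1, 1, 1])
print(trace)  # [0, 1, 0, 0, 1, 1]
```

A Mealy version would instead key the output table on the (state, input) pair, so the output could change in the same cycle as the input.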
8. Timing and Verification
• Combinational Circuit Timing
• contamination delay, propagation delay
• output glitches
• Avoiding glitches using K-Maps
• Sequential Circuit Timing
• input timing constraints
• setup time, hold time, aperture time
• output timing constraints
• contamination delay clock-to-q (ccq), propagation delay clock-to-q (pcq)
• setup time constraints
• c > pcq + pd + setup
• hold time constraints
• cd > hold - ccq
• Timing Analysis
• Clock skew
• Circuit Verification
• Functional Verification
• device under test (DUT)
• Testbench-based functional testing
• Simple, self-checking, automatic testbench
• Timing verification
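The setup and hold constraints listed above can be checked numerically. The sketch below uses the same symbols (pcq, pd, setup, ccq, cd, hold) and adds an optional clock skew term, which tightens both constraints. All delay values are made-up numbers in picoseconds, and the function names are hypothetical.

```python
def min_clock_period(pcq, pd, setup, skew=0):
    """Lower bound on the clock period from the setup time constraint."""
    return pcq + pd + setup + skew


def hold_satisfied(ccq, cd, hold, skew=0):
    """Hold time constraint: data must not change too soon after the clock edge."""
    return ccq + cd > hold + skew


# Illustrative values in picoseconds (assumptions, not measured data).
period = min_clock_period(pcq=80, pd=250, setup=50)
print(period)                                   # 380
print(hold_satisfied(ccq=30, cd=25, hold=60))   # False: 55 is not > 60
print(hold_satisfied(ccq=30, cd=50, hold=60))   # True: extra delay on the short path fixes it
```

Note that a setup violation can be fixed by slowing the clock, while a hold violation cannot, since the clock period does not appear in the hold constraint.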
9. Von Neumann Model, ISA, LC-3 and MIPS
• The Von Neumann Model
• Memory
• Memory Data Register (MDR)
• Processing Unit
• Arithmetic and Logic Unit (ALU)
• Registers
• Input and output
• Control Unit
• Instruction register (IR)
• Program counter (PC)
• LC-3 & MIPS
• Instruction Set Architecture (ISA)
• Instruction cycle
• Fetch
• Decode
• Fetch operands
• Execute
• Store result
• Instruction set
• Opcodes
• Data types
10. ISA (II), and Assembly Programming
• Instruction Set Architecture (ISA)
• Operate Instruction
• Data movement instruction
• LD, LDR, LDI, LEA, ST, STR, STI
• PC-relative mode
• Indirect mode
• Base + offset mode
• Immediate mode
• Control flow instruction
• Jump
• JMP, RET, JSR, JSRR
• Conditional branches
• BRn, BRz, BRp, BRzp, BRnp, BRnz, BRnzp
• Looping
• Assembly programming
• Sequential construct
• Conditional construct
• Iterative construct
• Debugging
• Array
• Function call
• Stack
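The LC-3 data movement instructions differ only in how they form the address. The sketch below models memory as a Python dictionary and shows the PC-relative (LD), indirect (LDI), base + offset (LDR), and LEA computations. Here `pc` stands for the already incremented PC, offsets are assumed to be sign-extended beforehand, and the addresses and memory contents are invented for the example.

```python
def ld(mem, pc, offset):
    # PC-relative mode: address = PC + offset.
    return mem[pc + offset]


def ldi(mem, pc, offset):
    # Indirect mode: the PC-relative location holds the address of the data.
    return mem[mem[pc + offset]]


def ldr(mem, base, offset):
    # Base + offset mode: address = base register + offset.
    return mem[base + offset]


def lea(pc, offset):
    # LEA computes the address itself and does not access memory.
    return pc + offset


# Hypothetical memory contents.
mem = {0x3005: 0x4000, 0x4000: 42, 0x4002: 7}
pc = 0x3001  # incremented PC of an instruction fetched from 0x3000

print(hex(ld(mem, pc, 4)))   # 0x4000
print(ldi(mem, pc, 4))       # 42
print(ldr(mem, 0x4000, 2))   # 7
print(hex(lea(pc, 4)))       # 0x3005
```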
11. Microarchitecture
• ISA vs Microarchitecture
• Single cycle machine vs Multi-cycle machine
• Instruction processing cycle vs Machine clock cycle
• Datapath & Control logic
• Performance Analysis
• Single-Cycle Microarchitecture
• Instruction Processing
• IF, ID/RF, EX/AG, MEM, WB
• Arithmetic and logical instruction
• Data Movement Instruction
• lw, sw
• Control flow instruction
• j, jal, jr, jalr
• beq, bne, blez, bgtz
12. Microarchitecture II
• Single-Cycle Microarchitecture
• Control Logic
• Control signals
• RegDest, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, PCSrc1, PCSrc2
• Control box
• Performance Analysis
• Cycles per instruction (CPI), clock period (T), clock frequency (f)
• {# of instructions} x {Average CPI} x {Clock cycle time}
• Slowest instruction
• Microarchitecture design principle
• Critical path design
• Balanced design
• Multi-Cycle Microarchitecture
• States
• Control Unit
• Main controller
• MUX: MemtoReg, RegDst, IorD, PCSrc, ALUSrcB, ALUSrcA
• Register Enable: IRWrite, MemWrite, PCWrite, Branch, RegWrite
• ALU decoder
• ALUControl
• Performance Analysis
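The performance equation {# of instructions} x {Average CPI} x {Clock cycle time} can be evaluated directly to compare two designs. The sketch below does this for a single-cycle and a multi-cycle machine. The instruction mix, the per-class cycle counts, and both clock periods are assumed values for illustration only, not figures from the course.

```python
def execution_time(num_instructions, avg_cpi, clock_period_s):
    """Execution time = instruction count x average CPI x clock cycle time."""
    return num_instructions * avg_cpi * clock_period_s


def average_cpi(mix):
    # mix maps an instruction class to (fraction of instructions, cycles).
    return sum(fraction * cycles for fraction, cycles in mix.values())


N = 1_000_000

# Single-cycle: CPI is 1, but the clock must fit the slowest instruction.
single_cycle = execution_time(N, 1.0, 900e-12)

# Multi-cycle: shorter clock, but each instruction takes several cycles.
mix = {"load": (0.25, 5), "store": (0.10, 4), "alu": (0.50, 4), "branch": (0.15, 3)}
multi_cpi = average_cpi(mix)
multi_cycle = execution_time(N, multi_cpi, 300e-12)

print(f"single-cycle: {single_cycle * 1e6:.0f} us")
print(f"multi-cycle:  {multi_cycle * 1e6:.0f} us (average CPI {multi_cpi:.2f})")
```

With these particular numbers the multi-cycle design comes out slower, which shows that a shorter clock period alone does not guarantee better performance.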
13. Microprogramming
• Performance Analysis
• {# of instructions} x {Average CPI} x {Clock cycle time}
• Single cycle critical path
• Multi cycle performance
• Microprogramming
• Microsequencer - control store - microinstruction
• Microinstruction
• data path, control signals
• State machine
• nodes (31 states), arcs (flow)
• Datapath
• Single-bus datapath
• Power of Abstraction
14. Pipelining
• Pipelining Instruction Processing
• IF - ID/RF - EX/AG - MEM - WB
• Pipeline Registers
• Control Signals
• Issues in Pipeline Design
• Balancing work in pipeline stages
• Keeping the pipeline correct, moving, and full in the presence of events that disrupt pipeline flow
• Handling exceptions, interrupts
• Minimizing stalls
• Dependences
• Resource contention
• Data Dependences
• Flow dependence (RAW)
• Output dependence (WAW)
• Anti dependence (WAR)
• Control dependences
• Data dependence handling
• Five fundamental ways
• Detect and wait until value is available in register file
• Detect and forward/bypass data to dependent instruction
• Detect and eliminate the dependence at the software level
• Predict the needed value, execute “speculatively”
• Do something else
• Interlocking
• Detect
• Scoreboarding
• Combinational dependence check logic
• Data Forwarding / Bypassing
• Control dependence
• Implementation
• nops, bubbles
• Hazard Unit, forwardAE, forwardBE
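The three kinds of data dependence can be detected by comparing the destination and source registers of two instructions, which is roughly what combinational dependence check logic does in hardware. The sketch below is one simple way to express that comparison; the `Instr` record and the register names are assumptions made for this example.

```python
from collections import namedtuple

# dest is the register written (or None); srcs are the registers read.
Instr = namedtuple("Instr", ["dest", "srcs"])


def dependences(first, second):
    """Return the data dependences of `second` on the earlier `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")  # flow: second reads what first writes
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")  # output: both write the same register
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")  # anti: second overwrites what first reads
    return found


add = Instr(dest="r3", srcs=("r1", "r2"))            # r3 = r1 + r2
print(dependences(add, Instr("r5", ("r3", "r4"))))   # {'RAW'}
print(dependences(add, Instr("r3", ("r6", "r7"))))   # {'WAW'}
print(dependences(add, Instr("r1", ("r4", "r5"))))   # {'WAR'}
```

Only RAW is a true dependence; WAW and WAR exist because register names are reused, which is why register renaming can remove them.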
15. Pipelining Issues
• Control dependence handling
• Early Branch Resolution
• Control Forwarding
• Hardware vs software based scheduling
• Precise Exceptions
• Exceptions vs Interrupts
• Handling exceptions in pipelining
• Reorder Buffer (ROB)
• Valid bits
• Random Access Memory vs Content Addressable Memory
• Indirection
• Register Renaming
16. Out-of-Order Execution
• Out-of-Order Execution (Dynamic Instruction Scheduling)
• Register Renaming
• Tomasulo’s Algorithm
• Register Alias Table (RAT)
• Reservation Stations
• Dataflow Graph
17. Out-of-Order, DataFlow, Superscalar Execution
• Out-of-Order Execution
• Frontend register file (RAT)
• Architectural register file
• Restricted Dataflow
• instruction window
• Memory Dependence Handling
• Memory disambiguation problem
• Conservative
• Aggressive
• Intelligent
• Data Forwarding
• Load queue (LQ), store queue (SQ)
• Data Flow (at ISA level)
• Superscalar Execution
18. Branch Prediction
• Control dependence handling
• Stall
• Branch prediction
• Branch delay slot
• Predicated execution
• Multipath execution
• Branch Prediction
• Misprediction penalty
• Always Guess NextPC = PC + 4
• predicate combining
• predicated execution
• Branch Target Buffer (BTB)
• Compile time (static)
• Always not taken
• Always taken
• Backward taken, forward not taken
• Profile based
• Program Analysis based
• Programmer-based
• Pragmas
• Run time (dynamic)
• Last time prediction
• Two-bit counter based prediction
• Two-level prediction
• global branch correlation
• global history register (GHR) / pattern history table (PHT)
• local branch correlation
• local history register
• Hybrid
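Two-bit counter based prediction keeps a saturating counter per branch and only flips its prediction after two consecutive mispredictions. The sketch below simulates a single counter on a loop branch that is taken nine times and then falls through, executed twice. The initial counter value and the branch pattern are assumptions for illustration.

```python
def count_correct_two_bit(outcomes, counter=0):
    """Simulate one 2-bit saturating counter; return the number of correct predictions.

    Counter values 0 and 1 predict not taken, 2 and 3 predict taken.
    """
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        # Saturating update: move toward 3 on taken, toward 0 on not taken.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct


# Loop branch: taken 9 times, then not taken; the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
print(count_correct_two_bit(outcomes), "of", len(outcomes))  # 16 of 20
```

After the warm-up in the first pass, the only misprediction per pass is the loop exit, because one not-taken outcome is not enough to change the prediction.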
19. Branch Prediction II, VLIW, Fine-Grained Multithreading
• Branch Prediction
• Loop branch detector and predictor
• Perceptron branch predictor
• Hybrid history length based predictor
• Branch confidence estimation
• Branch Delay Slot
• Delayed branching with squashing
• Predicate Combining
• Predicated execution
• Multipath execution
• Call and Return Prediction
• Indirect branch prediction
• VLIW (Very Long Instruction Word)
• RISC (Reduced Instruction Set Computer)
• Superblock
• Modern GPUs
20. SIMD Processors
• SIMD Processing (Single Instruction Multiple Data)
• SISD, SIMD, MISD, MIMD
• Array Processor
• Vector Processor
• Vector registers
• Vector length register (VLEN)
• Vector stride register (VSTR)
• Vector functional units
• Memory Banking
• Vector Memory System
• Vector Chaining
• Multiple memory ports
• Vector stripmining
• Gather / Scatter operation
• simple implementation, density-time implementation
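Vector stripmining handles arrays longer than the hardware vector registers by splitting the loop into chunks of at most the maximum vector length and setting the vector length register for each chunk. The following sketch mimics that in software; the maximum vector length of 64 and the function name are assumptions, and each chunk stands in for one vector instruction.

```python
def stripmined_add(a, b, max_vlen=64):
    """Add two equal-length lists in chunks of at most max_vlen elements."""
    result = []
    chunk_lengths = []
    for start in range(0, len(a), max_vlen):
        vlen = min(max_vlen, len(a) - start)  # value that would be loaded into VLEN
        chunk_lengths.append(vlen)
        # One "vector add" over vlen elements.
        result.extend(x + y for x, y in zip(a[start:start + vlen], b[start:start + vlen]))
    return result, chunk_lengths


a = list(range(150))
b = [1] * 150
total, chunks = stripmined_add(a, b)
print(chunks)  # [64, 64, 22]
```

This version processes the short remainder chunk last; handling it first is an equally valid ordering.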
21. SIMD Processors II and Graphics Processing Units
• SIMD processing
• Vector instruction execution
• Vector unit structure
• Automatic code vectorization
• SIMD operations in modern ISA
• Image overlaying
• GPUs (Graphics Processing Units)
• Programming model
• SPMD (Single Program Multiple Data)
• Hardware execution model
• SIMT (Single Instruction Multiple Thread)
• Warp (Wavefront)
• Dynamic warp formation, merging
22. GPU Programming
• GPU Programming
• Thread / Block / Grid
• Memory Hierarchy
• CUDA / OpenCL
• Traditional Program Structure in CUDA
• CUDA programming language
• Memory allocation, memory copy, kernel launch, memory deallocation, explicit synchronization
• Memory access
• GPU Architecture
• Streaming Processor array
• Streaming Multiprocessors (SM)
• Streaming Processors (SP)
• Performance Consideration
• Global memory access
• CPU-GPU data transfers
• Memory Access
• Latency hiding
• Occupancy
• Memory coalescing
• Array of Structures (AoS) / Structure of Arrays (SoA)
• Data Reuse
• Tiling
• Shared memory
• Memory Bank Conflicts
• SIMD Utilization
• Intra-warp divergence, Divergence-free execution
• Vector reduction
• Divergence-free mapping
• Atomic Operations
• Atomic Conflicts
• Histogram Calculation
• Privatization
• Data Transfer between CPU and GPU
• Synchronous and asynchronous transfer
• Streams
23. Systolic Arrays and Beyond & Memory Organization and Memory Technology
• Systolic Arrays
• Systolic Architectures
• Systolic Computation
• Two-Dimensional systolic array
• Combinations
• Pipeline-parallelism
• Stage
• WARP Computer
• TPU (Tensor Processing Unit)
• Decoupled Access / Execute (DAE)
• A Computing System
• Computation, communication, storage/memory
• Memory Organization
• Memory array
• Interleaving (Banking)
• Memory Technology
• DRAM (Dynamic Random Access Memory)
• SRAM (Static Random Access Memory)
24. Memory Hierarchy and Caches
• Memory Hierarchy
• SRAM, DRAM, Hard Disk, Flash memory, PC-RAM, MRAM, RRAM
• Locality - Temporal / Spatial
• Cache hierarchy
• Manual / Automatic management
• Hierarchical latency analysis
• T_i = t_i + m_i * T_{i+1}
• Cache
• Block
• Design decisions
• Placement
• Replacement
• Granularity of management
• Write policy
• Instructions / data
• Tag store / data store
• Average memory access time (AMAT)
• (hit-rate * hit-latency) + (miss-rate * miss-latency)
• Hardware Cache Design
• Degree of associativity
• Direct-Mapped Cache, Set / Higher / Full Associativity
• Issues in set-associative caches
• Insertion
• Promotion
• Eviction / replacement policy
• LRU (Least recently used)
• Not MRU (Most recently used)
• Hierarchical LRU
• Victim-NextVictim Replacement
• Random
• Set thrashing
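Placement, insertion, promotion, and LRU eviction can be seen together in a small set-associative cache model. The sketch below tracks only the tag store and counts hits, then plugs the measured hit rate into the AMAT formula above. The cache geometry, the access pattern, and the latencies (1 and 100 cycles) are assumptions chosen to show a conflict pattern, not parameters from the course.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag store of a set-associative cache with LRU replacement (sketch)."""

    def __init__(self, num_sets, ways, block_size):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # One OrderedDict per set; keys are tags ordered from LRU to MRU.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address):
        """Return True on a hit, False on a miss."""
        block = address // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # promotion to the MRU position
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)   # evict the LRU block
        lines[tag] = None               # insertion at the MRU position
        return False


def hit_count(cache, addresses):
    return sum(cache.access(addr) for addr in addresses)


pattern = [0, 64, 0, 64]  # two blocks that map to the same set in both caches
direct_hits = hit_count(SetAssociativeCache(num_sets=4, ways=1, block_size=16), pattern)
two_way_hits = hit_count(SetAssociativeCache(num_sets=2, ways=2, block_size=16), pattern)
print(direct_hits, two_way_hits)  # 0 2

hit_rate = two_way_hits / len(pattern)
amat = hit_rate * 1 + (1 - hit_rate) * 100  # assumed latencies in cycles
print(amat)  # 50.5
```

Both caches hold four blocks in total, but the direct-mapped one thrashes on this pattern while the 2-way cache keeps both blocks resident.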
25. More Caches & Virtual Memory
• Cache
• Handling Writes
• Write back, write through
• Write miss
• Subblock Cache
• Instruction / Data caches
• Separate / Unified
• Multilevel Caching
• Cache Performance
• Cache size
• Working set
• Block size
• Subblocking
• Associativity
• Cache misses
• Compulsory miss, Capacity miss, Conflict miss
• Improve cache performance
• Reduce miss rate
• Reduce miss latency or miss cost
• Reduce hit latency or hit cost
• Software approaches
• Restructuring data access patterns
• Loop interchange
• Blocking (Tiling)
• Restructuring data layout
• Data structure separation / merging
• Multi-Core Issues in Caching
• Private / Shared cache
• Resource Sharing
• Cache Coherence
• Physical memory
• Virtual Memory
• Indirection
• Virtual pages / Physical frames
• Definitions
• Demand paging
• Page size
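Virtual memory is a form of indirection: a virtual address is split into a virtual page number and a page offset, and a page table maps the virtual page to a physical frame. The sketch below shows that translation with a single-level page table held in a dictionary. The 4 KiB page size, the table contents, and the use of `LookupError` to stand in for a page fault are assumptions for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (4 KiB)


def translate(vaddr, page_table):
    """Translate a virtual address to a physical address, or signal a page fault."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        # With demand paging, the OS would now bring the page in from disk.
        raise LookupError(f"page fault on virtual page {vpn}")
    frame = page_table[vpn]
    return frame * PAGE_SIZE + offset  # the offset is unchanged by translation


# Hypothetical mapping: virtual page number -> physical frame number.
page_table = {0: 5, 1: 2}

print(translate(4100, page_table))  # 8196: page 1 maps to frame 2, offset 4
print(translate(16, page_table))    # 20496: page 0 maps to frame 5, offset 16
```

A larger page size means fewer page table entries but more data brought in per page fault, which is the trade-off behind the page size decision.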