Summary of the Design of Digital Circuits course taught by Onur Mutlu at ETH Zurich. Thank you very much for making these great lectures and materials available to self-learners like me. This course gave me invaluable insight into how computers work.

  1. Introduction and Basics

  2. Mysteries in Comp Arch
    • Meltdown and Spectre
    • RowHammer
  3. Introduction to the Labs and FPGAs
    • FPGA (Field Programmable Gate Array)
    • Vivado
  4. Mysteries in Comp Arch and Basics
    • Memory Performance Attacks
    • DRAM Refresh
    • Bloom Filters
  5. Combinational Logic
    • Computer
    • MOS (Metal-Oxide-Semiconductor) Transistors
      • n-type, p-type
    • Logic gates
      • CMOS NOT, NAND, AND Gates
    • Moore’s Law
    • Functional Specification
    • Boolean Algebra
  6. Combinational Logic, Hardware Description Lang. & Verilog
    • Two-Level Canonical Forms
      • Sum Of Products (SOP)
      • Product Of Sums (POS)
    • Combinational Building Blocks
      • Decoder
      • Multiplexer (MUX)
      • Full adder
      • PLA
    • Karnaugh Maps (K-Maps)
    • Hardware Description Languages (HDL)
      • Verilog
      • Implementation
        • Structural HDL
        • Behavioral HDL
  7. Sequential Logic Design
    • Basic Storage Element
      • The Reset-Set Latch (R-S Latch)
      • Gated D Latch
      • The Register
    • Memory
    • Sequential Logic Circuits
      • Clock
    • Finite State Machines
      • Next state logic, State register, Output logic
      • Sequential circuits
        • D Flip-Flop
      • Timing Diagram
      • State Encoding
      • Moore, Mealy Machines
    • Sequential Logic Using Verilog
      • always, posedge, negedge, reg, begin ~ end, if ~ else, case
      • Asynchronous / Synchronous reset
      • Blocking / Non-Blocking assignment
  8. Timing and Verification
    • Combinational Circuit Timing
      • contamination delay, propagation delay
      • output glitches
        • Avoiding glitches using K-Maps
    • Sequential Circuit Timing
      • input timing constraints
        • setup time, hold time, aperture time
      • output timing constraints
        • contamination delay clock-to-q (ccq), propagation delay clock-to-q (pcq)
      • setup time constraints
        • Tc > pcq + pd + setup (Tc: clock period)
      • hold time constraints
        • cd > hold - ccq
      • Timing Analysis
      • Clock skew
    • Circuit Verification
      • Functional Verification
        • device under test (DUT)
        • Testbench-based functional testing
          • Simple, self-checking, automatic testbench
      • Timing verification
  9. Von Neumann Model, ISA, LC-3 and MIPS
    • The Von Neumann Model
      • Memory
        • Memory Address Register (MAR)
        • Memory Data Register (MDR)
      • Processing Unit
        • Arithmetic and Logic Unit (ALU)
        • Registers
      • Input and output
      • Control Unit
        • Instruction register (IR)
        • Program counter (PC)
      • LC-3 & MIPS
    • Instruction Set Architecture (ISA)
      • Instruction cycle
        • Fetch
        • Decode
        • Evaluate address
        • Fetch operands
        • Execute
        • Store result
      • Instruction set
        • Opcodes
        • Data types
  10. ISA (II), and Assembly Programming
    • Instruction Set Architecture (ISA)
      • Operate Instruction
      • Data movement instruction
        • LD, LDR, LDI, LEA, ST, STR, STI
        • Addressing modes
          • PC-relative mode
          • Indirect mode
          • Base + offset mode
          • Immediate mode
      • Control flow instruction
        • Jump
          • JMP, RET, JSR, JSRR
        • Conditional branches
          • BRn, BRz, BRp, BRzp, BRnp, BRnz, BRnzp
          • Looping
    • Assembly programming
      • Sequential construct
      • Conditional construct
      • Iterative construct
      • Debugging
      • Array
      • Function call
      • Stack
  11. Microarchitecture
    • ISA vs Microarchitecture
    • Single cycle machine vs Multi-cycle machine
      • Instruction processing cycle vs Machine clock cycle
      • Datapath & Control logic
      • Performance Analysis
    • Single-Cycle Microarchitecture
      • Instruction Processing
        • IF, ID/RF, EX/AG, MEM, WB
      • Arithmetic and logical instruction
      • Data Movement Instruction
        • lw, sw
      • Control flow instruction
        • j, jal, jr, jalr
        • beq, bne, blez, bgtz
  12. Microarchitecture II
    • Single-Cycle Microarchitecture
      • Control Logic
        • Control signals
          • RegDest, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, PCSrc1, PCSrc2
        • Control box
      • Performance Analysis
        • Cycles per instruction (CPI), clock period (T), clock frequency (f)
        • {# of instructions} x {Average CPI} x {Clock cycle time}
        • Slowest instruction
    • Microarchitecture design principle
      • Critical path design
      • Bread and butter design
      • Balanced design
    • Multi-Cycle Microarchitecture
      • States
      • Control Unit
        • Main controller
          • MUX: MemtoReg, RegDst, IorD, PCSrc, ALUSrcB, ALUSrcA
          • Register Enable: IRWrite, MemWrite, PCWrite, Branch, RegWrite
        • ALU decoder
          • ALUControl
      • Performance Analysis
  13. Microprogramming
    • Performance Analysis
      • {# of instructions} x {Average CPI} x {Clock cycle time}
      • Single cycle critical path
      • Multi cycle performance
    • Microprogramming
      • Microsequencer - control store - microinstruction
      • Microinstruction
        • data path, control signals
      • State machine
        • nodes (31 states), arcs (flow)
      • Datapath
        • Single-bus datapath
        • gating, loading
      • Advantages
        • Power of Abstraction
  14. Pipelining
    • Pipelining Instruction Processing
      • IF - ID/RF - EX/AG - MEM - WB
      • Pipeline Registers
      • Control Signals
    • Issues in Pipeline Design
      • Balancing work in pipeline stages
      • Keeping the pipeline correct, moving, and full in the presence of events that disrupt pipeline flow
      • Handling exceptions, interrupts
      • Minimizing stalls
    • Dependences
      • Resource contention
      • Data Dependences
        • Flow dependence (RAW)
        • Output dependence (WAW)
        • Anti dependence (WAR)
      • Control dependences
    • Data dependence handling
      • Five fundamental ways
        • Detect and wait until value is available in register file
        • Detect and forward/bypass data to dependent instruction
        • Detect and eliminate the dependence at the software level
        • Predict the needed value, execute “speculatively”
        • Do something else
      • Interlocking
        • Detect
          • Scoreboarding
          • Combinational dependence check logic
        • Data Forwarding / Bypassing
      • Control dependence
      • Implementation
        • nops, bubbles
        • Hazard Unit, forwardAE, forwardBE
  15. Pipelining Issues
    • Control dependence handling
      • Early Branch Resolution
      • Control Forwarding
    • Hardware vs software based scheduling
    • Precise Exceptions
      • Exceptions vs Interrupts
      • Handling exceptions in pipelining
    • Reorder Buffer (ROB)
      • Valid bits
      • Random Access Memory vs Content Addressable Memory
      • Indirection
    • Register Renaming
  16. Out-of-Order Execution
    • Out-of-Order Execution (Dynamic Instruction Scheduling)
      • Register Renaming
      • Tomasulo’s Algorithm
      • Register Alias Table (RAT)
      • Reservation Stations
    • Dataflow Graph
  17. Out-of-Order, DataFlow, Superscalar Execution
    • Out-of-Order Execution
      • Frontend register file (RAT)
      • Architectural register file
      • Restricted Dataflow
        • instruction window
    • Memory Dependence Handling
      • Memory disambiguation problem
        • Conservative
        • Aggressive
        • Intelligent
      • Data Forwarding
        • Load queue (LQ), store queue (SQ)
        • store-to-load forwarding logic
    • Data Flow (at ISA level)
    • Superscalar Execution
  18. Branch Prediction
    • Control dependence handling
      • Stall
      • Branch prediction
      • Branch delay slot
      • Fine-grained multithreading
      • Predicated execution
      • Multipath execution
    • Branch Prediction
      • Misprediction penalty
      • Always Guess NextPC = PC + 4
        • predicate combining
        • predicated execution
      • Branch Target Buffer (BTB)
      • Compile time (static)
        • Always not taken
        • Always taken
        • Backward taken, forward not taken
        • Profile based
        • Program Analysis based
        • Programmer-based
          • Pragmas
      • Run time (dynamic)
        • Last time prediction
        • Two-bit counter based prediction
        • Two-level prediction
          • global branch correlation
            • global history register (GHR) / pattern history table (PHT)
          • local branch correlation
            • local history register
        • Hybrid
        • Advanced algorithms
  19. Branch Prediction II, VLIW, Fine-Grained Multithreading
    • Branch Prediction
      • Advanced algorithms
        • Loop branch detector and predictor
        • Perceptron branch predictor
        • Hybrid history length based predictor
        • Branch confidence estimation
    • Branch Delay Slot
      • Delayed branching with squashing
    • Predicate Combining
    • Predicated execution
    • Multipath execution
    • Call and Return Prediction
      • Indirect branch prediction
        • Return Address Stack
    • VLIW (Very Long Instruction Word)
      • RISC (Reduced Instruction Set Computer)
      • Superblock
    • Fine-Grained Multithreading
      • Modern GPUs
  20. SIMD Processors
    • SIMD Processing (Single Instruction Multiple Data)
      • SISD, SIMD, MISD, MIMD
      • Array Processor
      • Vector Processor
        • Vector registers
        • Vector length register (VLEN)
        • Vector stride register (VSTR)
        • Vector mask register (VMASK)
        • Vector functional units
        • Memory Banking
        • Vector Memory System
        • Vector Chaining
        • Multiple memory ports
        • Vector stripmining
        • Gather / Scatter operation
        • Masked vector instruction
          • simple implementation, density-time implementation
  21. SIMD Processors II and Graphics Processing Units
    • SIMD processing
      • Vector instruction execution
        • Vector unit structure
        • Automatic code vectorization
    • SIMD operations in modern ISA
      • Image overlaying
    • GPUs (Graphics Processing Units)
      • Programming model
        • SPMD (Single Program Multiple Data)
      • Hardware execution model
        • SIMT (Single Instruction Multiple Thread)
      • Fine-Grained Multithreading
        • Warp (Wavefront)
        • Dynamic warp formation, merging
  22. GPU Programming
    • GPU Programming
      • Thread / Block / Grid
      • Memory Hierarchy
      • CUDA / OpenCL
        • Traditional Program Structure in CUDA
        • CUDA programming language
          • Memory allocation, memory copy, kernel launch, memory deallocation, explicit synchronization
          • Memory access
    • GPU Architecture
      • Streaming Processor array
      • Streaming Multiprocessors (SM)
      • Streaming Processors (SP)
    • Performance Consideration
      • Global memory access
      • CPU-GPU data transfers
      • Memory Access
        • Latency hiding
          • Occupancy
        • Memory coalescing
        • Array of Structures (AoS) / Structure of Arrays (SoA)
        • Data Reuse
          • Tiling
          • Shared memory
            • Memory Bank Conflicts
      • SIMD Utilization
        • Intra-warp divergence, Divergence-free execution
        • Vector reduction
          • Divergence-free mapping
      • Atomic Operations
        • Atomic Conflicts
        • Histogram Calculation
          • Privatization
      • Data Transfer between CPU and GPU
        • Synchronous and asynchronous transfer
        • Streams
  23. Systolic Arrays and Beyond & Memory Organization and Memory Technology
    • Systolic Arrays
      • Systolic Architectures
      • Systolic Computation
      • Two-Dimensional systolic array
      • Combinations
      • Pipeline-parallelism
        • Stage
      • WARP Computer
      • TPU (Tensor Processing Unit)
    • Decoupled Access / Execute (DAE)
    • A Computing System
      • Computation, communication, storage/memory
    • Memory Organization
      • Memory array
        • Address / data
      • Interleaving (Banking)
    • Memory Technology
      • DRAM (Dynamic Random Access Memory)
      • SRAM (Static Random Access Memory)
  24. Memory Hierarchy and Caches
    • Memory Hierarchy
      • SRAM, DRAM, Hard Disk, Flash memory, PC-RAM, MRAM, RRAM
      • Locality: Temporal / Spatial
      • Cache hierarchy
      • Manual / Automatic management
      • Hierarchical latency analysis
        • T_i = t_i + m_i * T_{i+1}
    • Cache
      • Block
      • Design decisions
        • Placement
        • Replacement
        • Granularity of management
        • Write policy
        • Instructions / data
      • Tag store / data store
      • Average memory access time (AMAT)
        • (hit-rate * hit-latency) + (miss-rate * miss-latency)
      • Hardware Cache Design
        • Degree of associativity
        • Direct-Mapped Cache, Set / Higher / Full Associativity
      • Issues in set-associative caches
        • Insertion
        • Promotion
        • Eviction / replacement policy
          • LRU (Least recently used)
          • Not MRU (Most recently used)
          • Hierarchical LRU
          • Victim-NextVictim Replacement
          • Random
          • Set thrashing
          • Belady’s OPT
  25. More Caches & Virtual Memory
    • Cache
      • Handling Writes
        • Write through, write back
        • Write miss
        • Subblock Cache
      • Instruction / Data caches
        • Separate / Unified
      • Multilevel Caching
        • Serial / parallel access to tag / data store
    • Cache Performance
      • Cache size
        • Working set
      • Block size
        • Subblocking
      • Associativity
      • Cache misses
        • Compulsory miss, Capacity miss, Conflict miss
      • Improve cache performance
        • Reduce miss rate
        • Reduce miss latency or miss cost
        • Reduce hit latency or hit cost
      • Software approaches
        • Restructuring data access patterns
          • Loop interchange
          • Blocking (Tiling)
        • Restructuring data layout
          • Data structure separation / merging
    • Multi-Core Issues in Caching
      • Private / Shared cache
        • Resource Sharing
      • Cache Coherence
    • Physical memory
    • Virtual Memory
      • Indirection
      • Virtual pages / Physical frames
      • Definitions
        • Demand paging
        • Page size
        • Address Translation
        • Virtual page number (VPN), Physical page number (PPN)
        • Page table
          • Valid bit, PPN, Replacement policy, Dirty bits
      • Physical memory as a cache
      • Translation Lookaside Buffer (TLB)
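
As a rough illustration of the address translation items that close lecture 25 (VPN, PPN, page table, valid bit, TLB), here is a minimal Python sketch. It is not taken from the course materials: the 4 KB page size, the dictionary-based page table and TLB, and the names `translate` and `PageFault` are all assumptions made for this example.

```python
# Minimal sketch of virtual-to-physical address translation.
# Assumptions (not from the course): 4 KB pages, a single-level page table
# modeled as a dict {vpn: {"valid": bool, "ppn": int}}, and a TLB modeled
# as a plain dict {vpn: ppn} with no capacity limit or replacement policy.

OFFSET_BITS = 12                  # log2(4096): assumed 4 KB page size
PAGE_SIZE = 1 << OFFSET_BITS


class PageFault(Exception):
    """Raised when the page table has no valid mapping for a VPN."""


def translate(vaddr, page_table, tlb):
    """Return (physical_address, tlb_hit) for a virtual address."""
    vpn = vaddr >> OFFSET_BITS            # virtual page number
    offset = vaddr & (PAGE_SIZE - 1)      # page offset is not translated

    if vpn in tlb:                        # TLB hit: skip the page table
        return (tlb[vpn] << OFFSET_BITS) | offset, True

    entry = page_table.get(vpn)           # TLB miss: consult the page table
    if entry is None or not entry["valid"]:
        raise PageFault(f"no valid mapping for VPN {vpn:#x}")

    tlb[vpn] = entry["ppn"]               # cache the translation in the TLB
    return (entry["ppn"] << OFFSET_BITS) | offset, False


if __name__ == "__main__":
    page_table = {0x1: {"valid": True, "ppn": 0x2A}}
    tlb = {}
    print(translate(0x1234, page_table, tlb))  # first access misses the TLB
    print(translate(0x1238, page_table, tlb))  # same page now hits the TLB
```

The sketch leaves out demand paging itself: a real system would handle the page fault by bringing the page into a physical frame and updating the page table, rather than just raising an exception.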