
Digital System Architectures

141IN (last 3 CFU part)

Eric Medvet

A.Y. 2021/2022

1 / 91

Lecturer

Eric Medvet

Research interests:

  • Evolutionary Computation
  • Machine Learning applications
  • Embodied intelligence

Labs:

2 / 91

Materials

Teacher's slides:

  • available here
  • might be updated during the course

Intended usage:

  • slides should contain every concept that has to be taught/learned
  • but, slides are designed for consumption during a lecture, so they might be suboptimal for self-consumption ⇒ take notes!

Teacher's slides are based on:

  • Patterson, David A., and John L. Hennessy. Computer Organization and Design ARM Edition: The Hardware Software Interface. Morgan Kaufmann, 2016.
3 / 91

Exam

Written test with questions and short open answers

4 / 91

Course content

In brief:

  1. Cache
  2. Virtual memory

See the syllabus!

5 / 91

Memory: does it matter?

6 / 91

Registers and memory

Processor operates on data through instructions:

  • instructions operate on registers
  • when needed, load (read) data from memory to registers
  • when needed, store (write) data from registers to memory

Loads and stores take time!

Overall, how long?

  • depends on number of operations
  • depends on duration of single operation
7 / 91

Load/store: how many? (number)

C:

float fahrenheit2celsius(float f) {
  return ((5.0 / 9.0) * (f - 32.0));
}

LEGv8:

LDURS S16, [X27, const5] ; S16 = 5.0 (5.0 in memory)
LDURS S18, [X27, const9] ; S18 = 9.0 (9.0 in memory)
FDIVS S16, S16, S18 ; S16 = 5.0 / 9.0
LDURS S18, [X27, const32] ; S18 = 32.0
FSUBS S18, S12, S18 ; S18 = f – 32.0
FMULS S0, S16, S18 ; S0 = (5/9)*(fahr – 32.0)
BR LR ; return

3 loads (LDURS); might be 2 with compiler optimizations

8 / 91

More complex case

// 32x32 matrices
void matrixMult(double c[32][32], double a[32][32], double b[32][32]) {
  int i, j, k;
  for (i = 0; i < 32; i = i + 1) {
    for (j = 0; j < 32; j = j + 1) {
      for (k = 0; k < 32; k = k + 1) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }
}

Many more loads and stores!
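A rough count (an estimate, not in the original slides): the innermost statement executes 32 · 32 · 32 = 32768 times, and each execution needs about 3 loads (c[i][j], a[i][k], b[k][j]) and 1 store (c[i][j]), i.e., on the order of 32768 · 4 ≈ 130000 data memory accesses, ignoring index variables and compiler optimizations.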

9 / 91

Load/store: how long? (duration)

Depends on the memory technology:

Type Access time [ns] Cost [$/GB]
SRAM 0.5–2.5 500–1000
DRAM 50–70 10–20
Flash 5000–50000 0.75–1.00
Magnetic disk 5000000–20000000 0.05–0.10

Recall: 1 CPU cycle takes ≈ 1 ns

Data from 2012

We'll see more later

10 / 91

Goal

Goal: process more data, faster, cheaper

Ideally, we want to minimize overall time spent accessing memory.

We can work on:

  1. number of accesses (depends mainly on the code)
  2. duration of each access

How?

  1. write better (optimized) code
  2. use faster memory?

➝ access time/cost trade-off!

11 / 91

Guy at the library

German National Library

A guy is doing some research about some topic. Needs to:

  • take a book from shelf (long walk)
  • read some part
  • put back the book on shelf (long walk)

How to work faster?

12 / 91

Library ⇿ computer

  • Book ⇿ data
  • Shelf ⇿ "main" (farther) memory
  • Desk ⇿ "local" (closer) memory
  • Guy ⇿ processor
13 / 91

Idea: two memories

Suppose to have:

  • close memory M_1: small, fast access
  • large memory M_2: large, slow access

Whenever you look for data item x:

  • if in M_1, take it from there
  • otherwise
    • take it from M_2
    • put it in M_1 (freeing some space, if needed)

Useful if you soon access again x

x is the address of the data item
"look for" means load/read; we'll see store/write later

14 / 91

Better idea

Whenever you look for data item x:

  • if in M_1, take it from there
  • otherwise
    • take (..., x-2, x-1, x, x+1, x+2, ...) from M_2
    • put them in M_1 (freeing more space, if needed)

Useful if you soon access again x or something close to x!

15 / 91

Locality principle

Useful if you soon access again x or something close to x!

Temporal locality: if a data location x is accessed at t, it will likely be accessed again soon (at some t + δt)

Spatial locality: if a data location x is accessed at t, data locations close to x (some x + δx) will likely be accessed soon

How to exploit spatial locality at the library? What's the assumption?

16 / 91

Localities in action

float findAvgDiff(float a[], float b[], int n) {
  float aSum = 0;
  float bSum = 0;
  int i;
  for (i = 0; i < n; i = i + 1) {
    aSum = aSum + a[i];
  }
  for (i = 0; i < n; i = i + 1) {
    bSum = bSum + b[i];
  }
  return (aSum - bSum) / n;
}

"Where" in this representation is spatial locality? "where" is temporal locality?

a, b, n (=3), aSum, bSum, i

Accesses (simplified):

·······w··
········w·
·········w
······r···
·········r
r·········
·······r··
·······w··
······r···
·········r
·········r
·········w
·········r
·r········
·······r··
·······w··
······r···
·········r
·········r
·········w
·········r
··r·······
·······r··
·······w··
...

17 / 91

More than one level: memory hierarchy

Instead of just two memories M_1 (fast and small) and M_2 (large and slow), a hierarchy of memories:

  • CPU
  • M_1: closest to CPU, fastest, smallest
  • M_2: a bit farther, a bit slower, a bit larger
  • M_3: a bit farther, a bit slower, a bit larger
  • ...
  • M_k: far, slow, large
18 / 91

Memory hierarchy

  • M_1: closest to CPU, fastest, smallest
  • ...
  • M_k: far, slow, large

Pros:

  • the size of M_k (the largest)
  • (almost) the access time of M_1 (the closest)
  • an overall cost much lower than that of a single memory M with the size of M_k and the access time of M_1 (ideality)

Cons:

  • need to implement the access algorithm within the CPU/system
19 / 91

Cache

The memory hierarchy is a pattern, a scheme of solution for a common problem

Other cases:

  • local copy of network data
  • storing of complex computation results
  • physical memory

Common name for the closer memory: cache

20 / 91

Memory technology

(very briefly)

21 / 91

Recap

Type Access time [ns] Cost [$/GB]
SRAM 0.5–2.5 500–1000
DRAM 50–70 10–20
Flash 5000–50000 0.75–1.00
Magnetic disk 5000000–20000000 0.05–0.10

How do they work?

22 / 91

Static Random Access Memory (SRAM)

Properties:

  • fixed access (read/write) time
    • read and write times may be different

Technology:

  • 6–8 transistors for each bit (for minimizing disturbance during reads): large footprint on silicon
  • no need to refresh: access time is very close to cycle time
  • low power consumption in standby
  • "currently" SRAM are no longer separate chips, but are integrated onto processors
23 / 91

Dynamic Random Access Memory (DRAM)

Properties:

  • faster access to batches of data (rows are buffered)

Technology:

  • 1 transistor + 1 capacitor per bit: lower footprint than SRAM
  • stored data has to be refreshed (read and write)
    • no access while refreshing
    • bits are organized in rows, rows are refreshed at once, rows are buffered
24 / 91

"Modern" DRAM

  • synchronous DRAM (SDRAM): a clock facilitates burst transfer
  • Double Data Rate (DDR) SDRAM: transfer on both rising and falling edge
    • DDR4-3200 means 1600 MHz clock
    • organized in banks: reads/writes done concurrently on several banks
  • often physically packed in dual inline memory modules (DIMMs)
    • a DIMM usually contains 4-16 DRAM chips
    • a typical DIMM can transfer 25600 MB/s (PC25600)
25 / 91

DRAM over time

Year Chip size [bit] Cost [$/GB] Access time new¹ [ns] Access time existing² [ns]
1980 64 K 1500000 250 150
1983 256 K 500000 185 100
1985 1 M 200000 135 40
1989 4 M 50000 110 40
1992 16 M 15000 90 30
1996 64 M 10000 60 12
1998 128 M 4000 60 10
2000 256 M 1000 55 7
2004 512 M 250 50 5
2007 1 G 50 45 1.25
2010 2 G 30 40 1
2012 4 G 10 35 0.8

1: row accessed for the first time
2: row recently accessed

26 / 91

Flash memory

Technology:

  • based on electrically erasable programmable read-only memory (EEPROM)
  • writes can wear hardware behind bits
    • a controller takes care of distributing usage evenly

Solid state drives (SSDs) use flash memory

27 / 91

Disk memory

Technology:

  • with moving parts (unlike SRAM, DRAM, and flash)

Hard disk drives (HDDs) are storage devices based on disk memory

The internals of an HDD

28 / 91

Terminology

Mechanical parts:

  • platters
  • heads

Logical parts:

  • track (1000s of sectors): concentric circle on the platter
  • sector (512–4096 bytes): portion of the track
  • cylinder: set of vertically aligned tracks

Read/write phases:

  • seek: the head moves toward the track (≈ 3–13 ms)
  • rotational latency or delay (≈ 6 ms @ 5400 RPM)
  • actual data read/write
    • overall transfer rate: 100–200 MB/s (2012, w/o cache)
29 / 91

HDD "history"

Sizes of HDDs of different ages

An HDD in 1979!

30 / 91

Caches

31 / 91

What's a cache?

Cache: a hiding place to put something for future use

First usage of the term in computers denoted specifically the small memory between the processor and the main memory

Nowadays, cache denotes any storage that takes advantage of locality

  • e.g., browser cache
32 / 91

Using a cache

(Consider just reads for now)

Whenever the processor wants to read a memory location x:

  • check if x is in the cache (x is the address of the data)
  • if yes, read it from there; otherwise, read it from main memory and store it in the cache

How to?

  1. know if x is in the cache?
  2. if yes, know where it is in the cache? (x is the address of the location in the main memory, not in the cache; if it was, the size of the cache would be the same as that of the main memory ➝ useless!)
33 / 91

Premise

Memory content vs. memory addresses

000 13
001 45
010 e4
011 14
100 0f
101 76
110 1a
111 ff

8-byte memory:

  • addresses go from 000₂ = 0 to 111₂ = 7, i.e., 3-bit addresses
  • at each address, there is a byte
    • 13₁₆ = 00010011₂
  • [x] is the data at x
    • [011] = 14₁₆ = 00010100₂

In general, there are n-bit addresses (e.g., n = 64)

34 / 91

Direct mapped cache

The cache consists of s_c = 2^{n_c} triplets ⟨validity bit, tag, block⟩

  • a block contains s_b = 2^{n_b} bytes (e.g., s_b = 8, 64, ...)
    • the k-th block will host the portions of the main memory of s_b bytes starting at addresses x s.t. (x / s_b) % s_c = k
    • ⇒ many different x map to the same block
  • a tag indicates which one of the many different x is actually in the block
    • there can be 2^{n_m - n_c - n_b} values, with main memory size s_m = 2^{n_m}
  • a validity bit indicates if the block is to be considered valid
    • initially set to 0
35 / 91

Example

Cache
(s_c = 4, n_c = 2, s_b = 1, n_b = 0)

00 1 01 00010000
01 1 10 10000010
10 0 11 10000010
11 0 11 10010101

Tag size:
n_m - n_c - n_b = 4 - 2 - 0 = 2 bits

Triplet size:
1 + n_m - n_c - n_b + 8·s_b = 1 + 2 + 8 = 11 bits

Cache size:
s_c · (1 + n_m - n_c - n_b + 8·s_b) = 4 · 11 = 44 bits

Main memory
(n_m = 4, s_m = 16 bytes = 128 bit)

0000 00000001
0001 00000010
0010 00000100
0011 00001000
0100 00010000
0101 00100000
0110 01000000
0111 10000000
1000 10000001
1001 10000010
1010 10000100
1011 10001000
1100 10010000
1101 10100000
1110 11000000
1111 11000001

36 / 91

Looking in the cache

(Assume s_b = 1)

Given a cache with s_c = 2^{n_c} blocks, looking for x means looking at the k-th block with x % s_c = k

With binary numbers, x % s_c = x % 2^{n_c} is "the n_c least significant bits of x"

Example with n_m = 8, n_c = 3:

  • x = 01001111₂ = 79 ⇒ k = 79 % 2³ = 79 % 8 = 7 = 111₂
  • x = 00101011₂ = 43 ⇒ k = 43 % 2³ = 43 % 8 = 3 = 011₂
  • ...
  • ...
37 / 91

One to many

Cache
(s_c = 4, n_c = 2, s_b = 1)

00 1 01 00010000
01 1 10 10000010
10 0 11 10000010
11 0 11 10010101

Main memory
(n_m = 4, s_m = 16 bytes = 128 bit)

0000 00000001
0001 00000010
0010 00000100
0011 00001000
0100 00010000
0101 00100000
0110 01000000
0111 10000000
1000 10000001
1001 10000010
1010 10000100
1011 10001000
1100 10010000
1101 10100000
1110 11000000
1111 11000001

38 / 91

Cache size

  1. For a main memory of 4 GB and a cache with 1024 blocks of 4 words each
    • compute the tag size in bits
    • compute the ratio between actual data stored in the cache and overall cache size
  2. Repeat for a main memory of 4 GB and a cache with 1024 blocks of 16 words each

1 word is 4 bytes
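
A possible worked solution (assuming byte addressing, hence n_m = 32 for 4 GB):

  1. 4 words = 16 bytes ⇒ n_b = 4; 1024 blocks ⇒ n_c = 10; tag size = 32 - 10 - 4 = 18 bits; each triplet has 1 + 18 + 128 = 147 bits, so the data-to-total ratio is 128 / 147 ≈ 87%
  2. 16 words = 64 bytes ⇒ n_b = 6; tag size = 32 - 10 - 6 = 16 bits; each triplet has 1 + 16 + 512 = 529 bits, so the data-to-total ratio is 512 / 529 ≈ 97%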

39 / 91

Algorithm for reading xx

Given xx of nmn_m bits (assume sb=1s_b=1):

  1. take y=x[nmnc,nm[y=x_{[n_m-n_c,n_m[} as the ncn_c less significant bits of xx
  2. read triplet t=[y]t=[y] from cache at address yy
    • tval=t[0,1[t_\text{val} = t_{[0,1[} is the first bit of tt
    • ttag=t[1,1+(nmnc)[t_\text{tag}=t_{[1,1+(n_m-n_c)[} are bits of tt from 2-nd to 2+(nmnc)2+(n_m-n_c)-th
    • tblock=t[1+(nmnc),1+(nmnc)+8[t_\text{block}=t_{[1+(n_m-n_c),1+(n_m-n_c)+8[} is the block (of 1 byte)
  3. if tvalt_\text{val} \ne 1, go to 6
  4. if ttagt_\text{tag} \ne the nmncn_m-n_c most significant bits of xx, go to 6
  5. return tblockt_\text{block} (hit)
  6. read xx from main memory (miss)
  7. put 1 x[0,nmnc[x_{[0,n_m-n_c[} [x][x] at y=x[nmnc,nm[y=x_{[n_m-n_c,n_m[} in the cache
  8. return [x][x]
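
As an illustration only (not from the original slides), the algorithm above can be sketched in C for a tiny byte-addressed memory with n_m = 8, n_c = 3, s_b = 1; all names and sizes are hypothetical.

#include <stdint.h>

/* hypothetical sketch of the read algorithm, with n_m = 8, n_c = 3, s_b = 1 */
#define N_M 8                    /* address bits                             */
#define N_C 3                    /* index bits: s_c = 8 blocks               */
#define S_C (1 << N_C)

typedef struct {
  int valid;                     /* validity bit                             */
  uint8_t tag;                   /* n_m - n_c = 5 most significant bits of x */
  uint8_t block;                 /* block of 1 byte (s_b = 1)                */
} Triplet;

static Triplet cache[S_C];       /* the cache, initially all zeros           */
static uint8_t mem[1 << N_M];    /* main memory: 256 bytes                   */

uint8_t cache_read(uint8_t x) {
  uint8_t y = x & (S_C - 1);     /* step 1: n_c least significant bits of x  */
  uint8_t tag = x >> N_C;        /* n_m - n_c most significant bits of x     */
  Triplet t = cache[y];          /* step 2: read the triplet                 */
  if (t.valid && t.tag == tag)   /* steps 3-4: validity and tag checks       */
    return t.block;              /* step 5: hit                              */
  cache[y].valid = 1;            /* steps 6-7: miss, read from memory, fill  */
  cache[y].tag = tag;
  cache[y].block = mem[x];
  return cache[y].block;         /* step 8                                   */
}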
40 / 91

Example (hit for x = 1001)

Cache (n_c = 2, n_b = 0)

00 1 01 00010000
01 1 10 10000010
10 0 11 10000010
11 0 11 10010101

Main memory (n_m = 4)

0000 00000001
0001 00000010
...
1110 11000000
1111 11000001

  1. y = x_{[4-2, 4[} = 01
  2. t = [y] = [01] = 1 10 10000010
    • t_val = t_{[0,1[} = 1
    • t_tag = t_{[1, 1+4-2[} = 10
    • t_block = t_{[1+4-2, 1+4-2+8[} = 10000010
  3. t_val = 1 = 1
  4. t_tag = 10 = x_{[0, 4-2[} = 10
  5. return t_block = 10000010
41 / 91

Example (miss for x = 0000)

Cache (n_c = 2, n_b = 0)

00 1 01 00010000
01 1 10 10000010
10 0 11 10000010
11 0 11 10010101

Main memory (n_m = 4)

0000 00000001
0001 00000010
...
1110 11000000
1111 11000001

  1. y = x_{[4-2, 4[} = 00
  2. t = [y] = [00] = 1 01 00010000
  3. t_val = 1 = 1
  4. t_tag = 01 ≠ x_{[0, 4-2[} = 00
  5. read 0000 from main memory
  6. put ⟨1, 00, 00000001⟩ at y = 00
  7. return [x] = 00000001
42 / 91

Miss rate and miss penalty

Given a sequence of n addresses x_1, x_2, ..., x_n

  • the miss rate is the ratio between the number of accesses resulting in a miss and n
    • the hit rate is 1 - miss rate
  • the miss penalty is the time taken to read a block from main memory and put it in the cache (steps 6 and 7)
    • during this time, the processor waits
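
A standard way (not stated explicitly in these slides) to combine the two quantities is: average access time ≈ hit time + miss rate · miss penalty; e.g., with a 1-cycle hit, a 5% miss rate, and a 100-cycle miss penalty, the average is about 1 + 0.05 · 100 = 6 cycles.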
43 / 91

Example of misses and hits

x = 101, 000, 111, 001, 010 (left), 000, 101, 110, 111, 110 (right)

00 0 0 00000000
01 1 1 00100000
10 0 0 00000000
11 0 0 00000000

00 1 0 00000001
01 1 1 00100000
10 0 0 00000000
11 0 0 00000000

00 1 0 00000001
01 1 1 00100000
10 0 0 00000000
11 1 1 10000000

00 1 0 00000001
01 1 0 00000010
10 0 0 00000000
11 1 1 10000000

00 1 0 00000001
01 1 0 00000010
10 1 0 00000100
11 1 1 10000000

00 1 0 00000001
01 1 0 00000010
10 1 0 00000100
11 1 1 10000000

00 1 0 00000001
01 1 1 00100000
10 1 0 00000100
11 1 1 10000000

00 1 0 00000001
01 1 1 00100000
10 1 1 01000000
11 1 1 10000000

00 1 0 00000001
01 1 1 00100000
10 1 1 01000000
11 1 1 10000000

00 1 0 00000001
01 1 1 00100000
10 1 1 01000000
11 1 1 10000000

Initial cache

00 0 0 00000000
01 0 0 00000000
10 0 0 00000000
11 0 0 00000000

Main memory

000 00000001
001 00000010
010 00000100
011 00001000
100 00010000
101 00100000
110 01000000
111 10000000

hit rate = 30%

Is there temporal or spatial locality in this sequence of x? Are they both exploited?

44 / 91

Exploiting spatial locality

Spatial locality: if a data location x is accessed at t, data locations close to x (some x + δx) will likely be accessed soon

We need to have not only x but also x + δx in the cache
⇒ use larger blocks! (i.e., s_b > 1)

  • the miss penalty increases too: upon a miss, a larger data transfer from main memory is needed
45 / 91

Direct mapping with s_b > 1

Assume n_m = 6, n_c = 2, n_b = 2 (block size is s_b = 4 bytes)

  • the block at y = 00 hosts
    • bytes from x = 000000 to 000011
    • bytes from x = 010000 to 010011
    • bytes from x = 100000 to 100011
    • bytes from x = 110000 to 110011
  • the block at y = 01 hosts
    • bytes from x = 000100 to 000111
    • bytes from x = 010100 to 010111
    • ...
  • ...

y is the bits of x from n_m - n_c - n_b to n_m - n_b
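
For instance, for a hypothetical x = 010110 with these parameters: the tag is x_{[0,2[} = 01, the block address is y = x_{[2,4[} = 01, and the offset within the block is x_{[4,6[} = 10; indeed, the block at y = 01 hosts the bytes from 010100 to 010111, which include x.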

46 / 91

Algorithm for reading x (n_b ≥ 0)

  1. y = x_{[n_m-n_c-n_b, n_m-n_b[}
  2. t = [y]
    • t_val = t_{[0,1[}
    • t_tag = t_{[1, 1+(n_m-n_c-n_b)[}
    • t_block = t_{[1+(n_m-n_c-n_b), 1+(n_m-n_c-n_b)+8·2^{n_b}[}
  3. if t_val ≠ 1, go to 6
  4. if t_tag ≠ x_{[0, n_m-n_c-n_b[}, go to 6
  5. return t_{block [8z, 8z+8[} with z = x_{[n_m-n_b, n_m[} (hit) (z = 0 if n_b = 0)
  6. read x_0, ..., x_{s_b-1} from main memory (miss)
    • x_0 = x_{[0, n_m-n_b[} 0...0 (n_b 0s)
    • x_k = x_{[0, n_m-n_b[} k (k as binary with n_b bits)
    • x_{s_b-1} = x_{[0, n_m-n_b[} 1...1 (n_b 1s)
  7. put ⟨1, x_{[0, n_m-n_c-n_b[}, [x_0] ... [x_{s_b-1}]⟩ at y
  8. return [x]
47 / 91

Example (hit for x = 00001)

Cache (n_c = 2, n_b = 1)

00 1 00 00000001 00000010
01 1 10 00100000 00110000
10 0 00 00000000 00000000
11 0 00 00000000 00000000

Main memory (n_m = 5)

00000 00000001
00001 00000010
...
11110 11100000
11111 11100001

  1. y = x_{[5-2-1, 5-1[} = 00
  2. t = [y] = [00] = 1 00 00000001 00000010
    • t_val = t_{[0,1[} = 1
    • t_tag = t_{[1, 1+(5-2-1)[} = 00
    • t_block = t_{[1+(5-2-1), 1+(5-2-1)+8·2[} = 00000001 00000010
  3. t_val = 1 = 1
  4. t_tag = 00 = x_{[0, 5-2-1[} = 00
  5. return t_{block [8z, 8z+8[} = 00000010 with z = x_{[5-1, 5[} = 1
48 / 91

Misses and hits with s_b = 4

For a main memory of 256 bytes, where [x] = x, an initially empty cache with 4 blocks of 1 word each, and the reads 10, 11, 13, 20, 21, 22, 10, 20, 21 (decimal addresses x):

  • show the cache content after the 3rd request
  • compute the hit rate

Hint: use decimal notation for block contents

"Initially empty" means all bits set to 0

49 / 91

Miss rate/penalty and block size

The larger the block size:

  • the lower the miss rate (thanks to a better exploitation of spatial locality)
  • the longer the miss penalty
  • the lower the number of blocks (for the same cache size)
    • thus, the greater the miss rate

The net effect depends on the specific sequence of memory accesses

50 / 91

In practice

Miss rate vs. block size

from Patterson, David A., and John L. Hennessy. Morgan Kaufmann, 2016

51 / 91

Miss penalty

The larger the block size

  • the longer the miss penalty

Possible optimizations for reducing the penalty:

  • early restart: restart when [x] is read, not when the entire block has been read from main memory
  • requested word first: first read [x], then early restart, then read the rest of the block
52 / 91

Miss penalty: numbers

Assume:

  • s_b = 4
  • 1 cycle per instruction (CPI)
  • 1 cycle for reading one byte from cache
  • 100 cycles for reading one byte from main memory

Cycles for a read:

  • Hit: 1 cycle
  • Miss: 4 · 100 + 1 = 401 cycles
  • Miss with early restart: from 1 · 100 + 1 = 101 to 4 · 100 + 1 = 401 cycles, 251 cycles on average
  • Miss with requested word first: 1 · 100 + 1 = 101 cycles

These are gross estimates, not real numbers

53 / 91

Impact on CPI

  • Hit: 1 cycle
  • Miss: 4 · 100 + 1 = 401 cycles
  • Miss with early restart: from 1 · 100 + 1 = 101 to 4 · 100 + 1 = 401 cycles, 251 cycles on average
  • Miss with requested word first: 1 · 100 + 1 = 101 cycles

Actual CPI based on miss rate:

1% 5% 10% 50%
Miss 5 21 41 201
Miss with early restart 4 14 26 126
Miss with requested word first 2 6 11 51
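
Each entry can be reproduced (roughly, under the assumptions above) as actual CPI ≈ (1 - miss rate) · 1 + miss rate · miss cycles; e.g., for a 5% miss rate without optimizations: 0.95 · 1 + 0.05 · 401 ≈ 21.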
54 / 91

Writes

If a write acts only on the cache, the content of the main memory at a given x and that of the cache at the corresponding y may differ: they are inconsistent.

How to avoid inconsistencies?

  1. write-through
  2. write-back
55 / 91

Write-through

At each write:

  • write at y in the cache and
  • write at x in the main memory

Pros:

  • "simple" strategy, both conceptually and for the implementation

Cons:

  • basically, the cache does not give any advantage for writes
56 / 91

Impact on CPI

Assume:

  • s_b = 1
  • 1 cycle per instruction (CPI)
  • 1 cycle for reading one byte from cache
  • 100 cycles for reading/writing 1 byte from main memory

Actual CPI based on miss rate (rows) and read-to-write ratio (columns):

Read-to-write ratio 20 10 5 1
1% 7 11 18 51
5% 10 15 22 53
10% 15 19 26 55
50% 53 55 59 75
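
Each entry can be reproduced (roughly, under the assumptions above) by charging ≈ 100 cycles to every write and 1 or ≈ 101 cycles to every read (hit or miss, respectively); e.g., for a 1% miss rate and a 20-to-1 ratio: (20/21) · (0.99 · 1 + 0.01 · 101) + (1/21) · 100 ≈ 7.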
57 / 91

Write-back

At each write:

  • write at yy in the cache

At each miss at y (read and write¹): ¹: can be optimized

  • first, write the block at y back to the proper x
  • then, load from x to y

Pros:

  • lower impact on actual CPI because of fewer accesses to main memory

Cons:

  • more complex, harder to implement
    • more logic components, larger footprint, greater cost!
58 / 91

Actual CPI

Consider a processor with a CPI of 2 and a miss penalty of 100 cycles for both read and write; consider a program with a 3-to-1 read-to-write ratio resulting in an overall 4% miss rate:

  1. what's the change in actual CPI by halving the miss rate?
  2. what's the change in actual CPI by halving the miss penalty?

Three pieces of information are missing: the cycles needed to access the cache; the write policy; and, unless the CPI of 2 is already the actual read/write CPI, the percentage of read/write instructions in the program

59 / 91

A real cache (FastMATH processor)

FastMATH cache scheme

from Patterson, David A., and John L. Hennessy. Morgan Kaufmann, 2016

Why is the instruction miss rate much lower?

  • 2x cache: data, instructions
  • 16 KB
  • n_b = 4
  • write-through and write-back with write buffer

Miss rate:

  • instruction: 0.4%
  • data: 11.4%
  • combined: 3.2%
60 / 91

Decreasing miss rate

One option: increasing block size

  • has some cons

Another option: each x can be hosted in (associated with) many y, not in just one
➝ increase the degree of associativity

What's the counterpart of associativity in the library case?

61 / 91

Associativity

  • each x in exactly one y: 1-way associativity
    • i.e., no associativity
  • each x in n different y: n-way associativity
    • usually, n = 2^k (in practice, 2 or 4)
  • each x in all y: full associativity
    • k = n_c

With k = 1: many-to-one from x to y

With k > 1: many-to-many from x to y

62 / 91

Finding/putting an x with >1-way associativity

Where should I look in the cache for a given x?

  • with 1-way: in exactly one y
  • with n-way: in n different y

Where should I put an x in the cache?

  • with 1-way: in exactly one y
  • with n-way: in exactly one y chosen among n; how?
    • least recently used (LRU)
    • randomly
63 / 91

Set of blocks

With associativity, the s_c blocks are organized in sets of blocks, each consisting of s_s = 2^{n_s} contiguous blocks (the former n becomes s_s)

  • in practice, s_s ∈ {1, 2, 4, s_c}

A given x will go in a block of the k-th set, with x % (s_c / s_s) = k (instead of the k-th block)

  • s_c / s_s is the number of sets of blocks; it takes n_c - n_s bits

The tag indicates which x is in a given y

64 / 91

x to y

Assume n_s = 2, n_c = 4:

Set index k to y:

  • set k = 00 contains blocks from y = 0000 to 0011
  • set k = 01 contains blocks from y = 0100 to 0111
  • set k = 10 contains blocks from y = 1000 to 1011
  • set k = 11 contains blocks from y = 1100 to 1111

x to k: x % (s_c / s_s) = k, that is, k = x_{[n_m-(n_c-n_s)-n_b, n_m-n_b[}

Hence, x → Y = {y : y_{[0, n_c-n_s[} = x_{[n_m-(n_c-n_s)-n_b, n_m-n_b[}}

n_c "becomes" n_c - n_s

65 / 91

Algorithm for reading x

  1. Y = {y : y_{[0, n_c-n_s[} = x_{[n_m-(n_c-n_s)-n_b, n_m-n_b[}}
  2. for each y ∈ Y
    1. t = [y] = (t_val, t_tag, t_block)
    2. if t_val = 1 and t_tag = x_{[0, n_m-(n_c-n_s)-n_b[}
    3. return t_{block [8z, 8z+8[} with z = x_{[n_m-n_b, n_m[} (hit) (z = 0 if n_b = 0)
  3. read x_0, ..., x_{s_b-1} from main memory (miss)
  4. choose k (with n_s bits)
  5. put ⟨1, x_{[0, n_m-(n_c-n_s)-n_b[}, [x_0] ... [x_{s_b-1}]⟩ at y = x_{[n_m-(n_c-n_s)-n_b, n_m-n_b[} k₂
  6. return [x]

Steps 2.1 to 2.3 are actually done in parallel!

  • more comparators, larger footprint
66 / 91

Example with misses and hits

n_c = 3, n_s = 1, n_m = 8, n_b = 1

000 1 00000 00000001 00000010
001 1 01010 00100000 00110000
010 0 11010 00000000 00000000
011 1 01111 00000000 11100000
100 1 00101 00000001 00000010
101 1 10110 00100000 00110000
110 1 01100 00000000 00000000
111 0 01011 00000000 00000000

x = 00100110 ⇒ Y = {110, 111} ⇒ miss (tag), miss (validity)

x = 01111011 ⇒ Y = {010, 011} ⇒ miss (validity), hit
⇒ returns t_{block [8z, 8z+8[} = 11100000 with z = x_{[8-1, 8[} = 1

x = 01011000 ⇒ Y = {000, 001} ⇒ miss (tag), miss (tag)

Can we have >1 hits in a block?

67 / 91

Choosing a block in set (LRU)

For each block in the set, we need to know when it was used the last time: this can be done with n_a recency bits (a recency tag):

  • 00 for the most recently used
  • 01 for the 2nd most recently used
  • 10 for the 3rd most recently used
  • 11 for the 4th most recently used

Whenever a y is accessed, if its recency tag is ≠ 00, the recency tags of all the blocks in the set have to be updated:

  • the one at this y becomes 00
  • each other one is increased by 1 (11 stays 11)
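
A minimal C sketch of this update rule, for one set of 4 blocks with 2 recency bits each (illustrative names, not from the slides):

#define WAYS 4

typedef struct {
  unsigned recency;   /* 00 = most recently used, ..., 11 = least recently used */
  /* validity bit, tag, and block omitted for brevity */
} Block;

/* mark the block at position `used` in the set as most recently used, aging the others */
void lru_touch(Block set[WAYS], int used) {
  if (set[used].recency == 0)
    return;                             /* recency tag already 00: nothing to update */
  for (int i = 0; i < WAYS; i++)
    if (i != used && set[i].recency < WAYS - 1)
      set[i].recency++;                 /* each other block is increased by 1 (11 stays 11) */
  set[used].recency = 0;                /* the accessed block becomes 00 */
}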
68 / 91

Choosing a block in set (random)

Just random!

  • no recency tags needed
  • no update on hits

Much less complex than LRU:

  • in practice, LRU is used only with n_s = 1 or 2
  • the impact on the miss rate is low: ≈ 10%
    • negligible with large caches
69 / 91

Misses and hits with associativity

For a main memory of 256 bytes, where [x] = x, an initially empty associative cache with 4 blocks of 1 word each, and the reads 10, 11, 13, 20, 21, 22, 10, 20, 21 (hexadecimal addresses x):

  • compute the percentage improvement in hit rate for s_s = 2 with respect to the case of no associativity
  • compute the overall size of the cache for s_s = 2 and LRU

"Initially empty" means all bits set to 0

70 / 91

Virtual memory

71 / 91

Goals

Goal of cache: memory with fast access and large size

Unsolved issues:

  1. memory used by many programs at the same time
  2. physical memory being smaller than addressable memory
72 / 91

Memory sharing

Suppose program A and program B are being executed concurrently (we'll see, briefly, how)

Question:

  • how to prevent A from accessing an address x that "belongs" to B?

Trivial, but wrong, solution: make A access addresses from x_{A,i} to x_{A,f} and B access addresses from x_{B,i} to x_{B,f}, with [x_{A,i}, x_{A,f}] and [x_{B,i}, x_{B,f}] being disjoint...

  • works only for A and B; what about C, D, ...?
  • works only if [x_{A,i}, x_{A,f}] is free (same for B)
  • requires knowing [x_{A,i}, x_{A,f}] at compile time
73 / 91

Addressable memory

In ancient times, RAMs were small, much smaller than the addressable space (i.e., 2^{n_m})

Today, RAMs are much bigger, but still often smaller than the addressable space:

  • n_m = 32 means 4 GB: not all computers have 4 GB of RAM
  • n_m = 64 means 18 exabytes... actually, it depends on the architecture: usually "much" lower

Questions:

  • how to allow programs to use a memory of 2^{n_m} bytes while having a smaller physical memory?
  • (further) how to make all running programs "see" a memory of size 2^{n_m}?
74 / 91

Idea (sketch of)

At any time, the memory content is only partially in the physical memory (RAM); the remaining part is on the disk

When a program wants to access a given address, it may be on the physical memory:

  • if yes, nice
  • if not, find it on the disk, bring it to the physical memory, and use it

Solves the issue of size, not that of concurrent access to memory

75 / 91

Idea (finer sketch of)

Memory is organized in pages:

  • each page has a page number
  • within a page, bytes have an address (page offset) starting at 0
  • a physical memory location is composed of a page number and a page offset

Each program is given a page and accesses memory specifying the page offset

Solves the issue of concurrent access, not that of size

  • A cannot access the page of B
  • A is compiled specifying just the page offset; the page number is assigned at runtime by someone
76 / 91

Idea (together)

  • memory is organized in pages
  • programs are given many pages, not just one
    • from the program's point of view, the first one always has number 0
  • at any time a page may be in the physical memory or on the disk
    • programs do not need to know if a page is on memory or on disk

Solves all problems:

  • there can be as many pages as we want, actually a number of pages corresponding to 2^{n_m} bytes for each program
  • no undesired interference among programs (there can be regulated sharing of pages, though)

This way, what is seen by programs is a larger, private memory that is not the physical memory → virtual memory

77 / 91

Virtual memory and addresses

From the program's point of view, a memory address is given by ⟨page number, page offset⟩

  • each program has pages starting from number 0

But physically, bytes are either in RAM or on disk

  • in general, not starting "at 0"

⇒ addresses have to be translated

78 / 91

Virtual memory and caches

There are similarities:

  • both motivated by "have your cake and eat it too"
  • both exploit a memory hierarchy (i.e., two memories)
  • both are organized in chunks
  • both require an address translation

Cache / Virtual memory:

  • Goal: fast as SRAM, large as DRAM / private (as if split among programs), large (as if entire for each program)
  • Two memories: SRAM and DRAM / DRAM and disk
  • Chunks: block / page
  • Address translation: x to y (or Y) / we'll see

Names are different for historical reasons

79 / 91

Address translation

In cache:

  • x is the physical address in the (main) memory (in general, the lower-level memory)
  • y is the physical block address in the cache

In virtual memory:

  • x_v is the virtual address
    • generated by the program
  • x_p is the physical address
80 / 91

Address parts

Assume:

  • a virtual memory addressable with n_v = 32 bits
  • a page size of s_p = 16 KB, i.e., n_p = log₂(16 · 1024) = 4 + 10 = 14

A virtual address is composed of:

  • the page number, ranging from 0 to s_n - 1, with s_n = 2^{32} / 2^{14}, i.e., n_n = n_v - n_p = 18 bits
  • the page offset, of n_p = 14 bits

In general, the virtual memory can be larger than the physical memory (illusion of a large memory):

  • n_v > n_m, e.g., n_v = 48 and n_m = 40
81 / 91

Finding a page

Assume a program wants to access x_v: what happens?

In brief and conceptually:

  1. get the page number p_v = x_{v[0, n_n[} from x_v
  2. if the page is in physical memory
    • translate x_v to x_p
    • access it
  3. otherwise (page fault)
    • find it on disk (where?)
    • move it into memory (where? remove what?)
    • translate x_v to x_p
    • access it

Looks very much like the case of the cache, but...

82 / 91

Cost of miss vs. fault

Type Access time [ns] Cost [$/GB]
SRAM 0.5–2.5 500–1000
DRAM 50–70 10–20
Flash 5000–50000 0.75–1.00
Magnetic disk 5000000–20000000 0.05–0.10

Cache:

  • hit (SRAM): 1 cycle
  • miss (DRAM): ≈ 50 / 0.5 = 100 cycles, 100× longer

Virtual memory:

  • hit (DRAM): t ≈ 50 ns
  • fault (disk): t ≈ 5 · 10⁶ ns, 100000× longer
    • much uglier if expressed in cycles: millions of them!
83 / 91

Cost of page fault

Since the cost of a page fault is very high, greater care is put into reducing the page fault probability:

  • full associativity
    • with smarter replacement strategies, employed by the OS (operating system)
  • much larger chunk size
    • exploits the high bandwidth, despite the big latency, of disks
  • write-back
    • write-through would be very impractical (100000× longer)
84 / 91

Page table

Every program has a page table consisting of 2^{n_n} entries

Each entry contains:

  • one validity bit
  • a physical page number p_p of n_m - n_p bits (the remaining n_p bits are the page offset)

The translation from x_v to x_p is a look-up in the page table:

  1. take p_v = x_{v[0, n_n[}
  2. look at the p_v-th entry in the page table
  3. if the validity bit is set (hit), x_p = p_p x_{v[n_n, n_v[}
  4. otherwise (miss) ...

No tag: why?
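
A rough C sketch of this look-up (hypothetical names; n_v = 32 and n_p = 14 as in the earlier example; page-fault handling is left to the OS and omitted here):

#include <stdint.h>

#define N_P 14                          /* page offset bits                       */
#define N_N (32 - N_P)                  /* virtual page number bits: n_n = 18     */

typedef struct {
  int valid;                            /* validity bit                           */
  uint64_t ppn;                         /* physical page number p_p               */
} PageTableEntry;

static PageTableEntry page_table[1u << N_N];

/* translate virtual address xv into a physical address; sets *fault on a page fault */
uint64_t translate(uint32_t xv, int *fault) {
  uint32_t pv = xv >> N_P;              /* 1. p_v = the n_n most significant bits of xv */
  uint32_t offset = xv & ((1u << N_P) - 1);
  PageTableEntry e = page_table[pv];    /* 2. look at the p_v-th entry                  */
  *fault = !e.valid;
  if (!e.valid)                         /* 4. miss: the OS handles the page fault       */
    return 0;
  return (e.ppn << N_P) | offset;       /* 3. x_p = p_p followed by the page offset     */
}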

85 / 91

Program state

For a program, the page table is in memory

  • hardware is optimized for keeping the address of the page table (which is accessed frequently) in a specific register

State of a program (also called process):

  • program counter
  • registers
  • page table (address)

State can be stored (in memory) and restored back:

  • active process if contents above are in place
  • inactive process otherwise
86 / 91

Page location on fault

The page table does not contain the physical address of the faulted page on the disk

The OS takes care of reserving a disk portion for all the pages of each process: the swap space (actually, things can be more complex)

  • usually, the pages of a program are contiguous in the swap space
  • p_v can be easily translated to a position on the swap disk for a given program

The OS takes care of page faults!

  • they are long enough to make it convenient to manage them in software instead of hardware

The first request for a page does not need to go to the disk for reading

87 / 91

Replacement policy

Because of full associativity, any page table entry can be replaced

Since replacement is managed by OS, replacement policy may be complex (e.g., LRU may be preferable over random)

However, real LRU is still too costly, since it requires updating usage info at each access, not only upon faults

⇒ approximated LRU with a usage bit

88 / 91

Usage/reference bit

Each entry in the page table contains:

  • one validity bit
  • one usage or reference bit
  • a physical page number ppp_p

At every hit, the usage bit is set

Every once in a while, all usage bits are unset

for bits, "set" is "set to 1", "unset" is "set to 0"

How to estimate the effectiveness of usage bit with respect to actual LRU?

89 / 91

Size of page table

The page table may be very large!

E.g., for n_v = 48, n_p = 14, n_m = 40, there are 2^{n_n} entries, each consisting of 1 + 1 + n_m - n_p = 28 bits, with n_n = n_v - n_p = 48 - 14 = 34
⇒ 28 · 2^{34} bits ≈ 60 GB!!! One per process!!!

It is unfeasible to keep them all in memory!

Optimizations:

  • use smaller tables (processes address a smaller virtual memory)
  • use dynamic tables (initially small, grown as needed)
    • possibly in two directions, to accommodate heap/stack growth
  • hash the virtual page numbers (inverted page table)
  • page the page table!
  • ...
90 / 91

Writes

Write-back policy with an optimization for avoiding writing back when it is unneeded (because the page has not been written)

Each entry in the page table contains:

  • one validity bit
  • one usage bit
  • one dirty bit
  • a physical page number ppp_p

Upon read/creation, the dirty bit is unset

Upon writes on the page, the dirty bit is set

91 / 91
