Eric Medvet
Research interests:
Labs:
Teacher's slides:
Intended usage:
Teacher's slides are based on:
Written test with questions and short open answers
Processor operates on data through instructions:
Loads and stores take time!
Overall, how long?
C:
```c
float fahrenheit2celsius(float f) {
  return ((5.0 / 9.0) * (f - 32.0));
}
```
LEGv8:
```
LDURS S16, [X27, const5]  ; S16 = 5.0 (5.0 in memory)
LDURS S18, [X27, const9]  ; S18 = 9.0 (9.0 in memory)
FDIVS S16, S16, S18       ; S16 = 5.0 / 9.0
LDURS S18, [X27, const32] ; S18 = 32.0
FSUBS S18, S12, S18       ; S18 = f − 32.0
FMULS S0, S16, S18        ; S0 = (5/9) * (fahr − 32.0)
BR LR                     ; return
```
3 loads (LDURS); might be 2 with compiler optimizations
```c
// 32x32 matrices
void matrixMult(double c[32][32], double a[32][32], double b[32][32]) {
  int i, j, k;
  for (i = 0; i < 32; i = i + 1) {
    for (j = 0; j < 32; j = j + 1) {
      for (k = 0; k < 32; k = k + 1) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }
}
```
Many more loads and stores!
Depends on the memory technology:
Type | Access time [ns] | Cost [$/GB] |
---|---|---|
SRAM | 0.5–2.5 | 500–1000 |
DRAM | 50–70 | 10–20 |
Flash | 5000–50000 | 0.75–1.00 |
Magnetic disk | 5000000–20000000 | 0.05–0.10 |
Recall: 1 CPU cycle takes ≈ 1 ns
Data from 2012
We'll see more later
Goal: process more data, faster, cheaper
Ideally, we want to minimize overall time spent accessing memory.
We can work on:
How?
➝ access time/cost trade-off!
Someone is doing research on some topic and needs to:
How to work faster?
Suppose you have:
Whenever you look for data item x:
Useful if you soon access x again
x is the address of the data item
"look for" means load/read; we'll see store/write later
Whenever you look for data item x:
Useful if you soon access x again, or something close to x!
Temporal locality: if a data location x is accessed at t, it will likely be accessed again soon (at some t+δt)
Spatial locality: if a data location x is accessed at t, data locations close to x (some x+δx) will likely be accessed soon
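To make the two notions concrete, here is a small C sketch (mine, not from the slides): summing the same 2D array row by row touches consecutive addresses (spatial locality), while summing it column by column jumps a whole row ahead at every access.

```c
#include <stdio.h>

#define N 1024

static double m[N][N]; /* stored row by row (row-major, as in C) */

/* good spatial locality: consecutive iterations touch adjacent addresses */
double sumRowMajor(void) {
  double s = 0;
  for (int i = 0; i < N; i = i + 1)
    for (int j = 0; j < N; j = j + 1)
      s = s + m[i][j];
  return s;
}

/* poor spatial locality: consecutive iterations jump N*sizeof(double) bytes */
double sumColMajor(void) {
  double s = 0;
  for (int j = 0; j < N; j = j + 1)
    for (int i = 0; i < N; i = i + 1)
      s = s + m[i][j];
  return s;
}

int main(void) {
  printf("%f %f\n", sumRowMajor(), sumColMajor());
  return 0;
}
```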
How to exploit spatial locality at the library? What's the assumption?
```c
float findAvgDiff(float a[], float b[], int n) {
  float aSum = 0;
  float bSum = 0;
  int i;
  for (i = 0; i < n; i = i + 1) {
    aSum = aSum + a[i];
  }
  for (i = 0; i < n; i = i + 1) {
    bSum = bSum + b[i];
  }
  return (aSum - bSum) / n;
}
```
"Where" in this representation is spatial locality? "where" is temporal locality?
● a, ● b, ● n (=3), ● aSum, ● bSum, ● i
Accesses (simplified):
·······w··
········w·
·········w
······r···
·········r
r·········
·······r··
·······w··
······r···
·········r
·········r
·········w
·········r
·r········
·······r··
·······w··
······r···
·········r
·········r
·········w
·········r
··r·······
·······r··
·······w··
...
Instead of just two memories M1 (fast and small) and M2 (large and slow), a hierarchy of memories:
Pros:
Cons:
The memory hierarchy is a pattern, a scheme of solution for a common problem
Other cases:
Common name for the closer memory: cache
(very briefly)
Type | Access time [ns] | Cost [$/GB] |
---|---|---|
SRAM | 0.5–2.5 | 500–1000 |
DRAM | 50–70 | 10–20 |
Flash | 5000–50000 | 0.75–1.00 |
Magnetic disk | 5000000–20000000 | 0.05–0.10 |
How do they work?
Properties:
Technology:
Properties:
Technology:
Year | Chip size [bit] | Cost [$/GB] | Access time new¹ [ns] | Access time existing² [ns] |
---|---|---|---|---|
1980 | 64 K | 1500000 | 250 | 150 |
1983 | 256 K | 500000 | 185 | 100 |
1985 | 1 M | 200000 | 135 | 40 |
1989 | 4 M | 50000 | 110 | 40 |
1992 | 16 M | 15000 | 90 | 30 |
1996 | 64 M | 10000 | 60 | 12 |
1998 | 128 M | 4000 | 60 | 10 |
2000 | 256 M | 1000 | 55 | 7 |
2004 | 512 M | 250 | 50 | 5 |
2007 | 1 G | 50 | 45 | 1.25 |
2010 | 2 G | 30 | 40 | 1 |
2012 | 4 G | 10 | 35 | 0.8 |
1: row accessed for the first time
2: row recently accessed
Technology:
Solid state drives (SSDs) use flash memory
Technology:
Hard disk drives (HDDs) are storage devices based on disk memory
Mechanical parts:
Logical parts:
Read/write phases:
Different sizes over time
HDD in 1979!
Cache: a hiding place to put something for future use
First usage of the term in computers denoted specifically the small memory between the processor and the main memory
Nowadays, cache denotes any storage that takes advantage of locality
(Consider just reads for now)
Whenever the processor wants to read a memory location x:
How to?
Memory content vs. memory addresses
8-byte memory (address → content):
000 13
001 45
010 e4
011 14
100 0f
101 76
110 1a
111 ff
In general, there are n-bit addresses (e.g., n=64)
The cache consists of s_c = 2^{n_c} triplets ⟨validity bit, tag, block⟩
Cache
(s_c = 4, n_c = 2, s_b = 1, n_b = 0)
00 1 01 00010000
01 1 10 10000010
10 0 11 10000010
11 0 11 10010101
Triplet size:
1 + n_m − n_c − n_b + 8·s_b = 1 + 2 + 8 = 11 bits
Cache size:
s_c · (1 + n_m − n_c − n_b + 8·s_b) = 4 · 11 = 44 bits
Main memory
(n_m = 4, s_m = 16 bytes = 128 bits)
0000 00000001
0001 00000010
0010 00000100
0011 00001000
0100 00010000
0101 00100000
0110 01000000
0111 10000000
1000 10000001
1001 10000010
1010 10000100
1011 10001000
1100 10010000
1101 10100000
1110 11000000
1111 11000001
(Assume s_b = 1)
Given a cache with s_c = 2^{n_c} blocks, looking for x means looking at the k-th block, with x mod s_c = k
With binary numbers, x mod s_c = x mod 2^{n_c} is "the n_c least significant bits of x"
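As a minimal C sketch (mine, not from the slides) of this decomposition, assuming s_b = 1 (so n_b = 0): the index k is the n_c least significant bits of x, the tag is the remaining most significant bits.

```c
#include <stdio.h>
#include <stdint.h>

/* direct-mapped cache, s_b = 1: split an n_m-bit address x
   into index (the n_c least significant bits) and tag (the rest) */
void splitAddress(uint64_t x, int nC, uint64_t *index, uint64_t *tag) {
  *index = x % (1u << nC);   /* x mod 2^{n_c} */
  *tag = x >> nC;            /* remaining most significant bits */
}

int main(void) {
  uint64_t index, tag;
  splitAddress(0xB6, 3, &index, &tag); /* e.g. n_m = 8, n_c = 3 */
  printf("index=%llu tag=%llu\n",
         (unsigned long long) index, (unsigned long long) tag);
  return 0;
}
```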
Example with n_m = 8, n_c = 3:
Cache
(s_c = 4, n_c = 2, s_b = 1)
00 1 01 00010000
01 1 10 10000010
10 0 11 10000010
11 0 11 10010101
Main memory
(n_m = 4, s_m = 16 bytes = 128 bits)
0000 00000001
0001 00000010
0010 00000100
0011 00001000
0100 00010000
0101 00100000
0110 01000000
0111 10000000
1000 10000001
1001 10000010
1010 10000100
1011 10001000
1100 10010000
1101 10100000
1110 11000000
1111 11000001
1 word is 4 bytes
Given x of n_m bits (assume s_b = 1):
Cache (n_c = 2, n_b = 0)
00 1 01 00010000
01 1 10 10000010
10 0 11 10000010
11 0 11 10010101
Main memory (n_m = 4)
0000 00000001
0001 00000010
...
1110 11000000
1111 11000001
Cache (n_c = 2, n_b = 0)
00 1 01 00010000
01 1 10 10000010
10 0 11 10000010
11 0 11 10010101
Main memory (n_m = 4)
0000 00000001
0001 00000010
...
1110 11000000
1111 11000001
Given a sequence of n addresses x_1, x_2, …, x_n
x= 101, 000, 111, 001, 010 (left), 000, 101, 110, 111, 110 (right)
00 0 0 00000000
01 1 1 00100000
10 0 0 00000000
11 0 0 00000000
00 1 0 00000001
01 1 1 00100000
10 0 0 00000000
11 0 0 00000000
00 1 0 00000001
01 1 1 00100000
10 0 0 00000000
11 1 1 10000000
00 1 0 00000001
01 1 0 00000010
10 0 0 00000000
11 1 1 10000000
00 1 0 00000001
01 1 0 00000010
10 1 0 00000100
11 1 1 10000000
00 1 0 00000001
01 1 0 00000010
10 1 0 00000100
11 1 1 10000000
00 1 0 00000001
01 1 1 00100000
10 1 0 00000100
11 1 1 10000000
00 1 0 00000001
01 1 1 00100000
10 1 1 01000000
11 1 1 10000000
00 1 0 00000001
01 1 1 00100000
10 1 1 01000000
11 1 1 10000000
00 1 0 00000001
01 1 1 00100000
10 1 1 01000000
11 1 1 10000000
Initial cache
00 0 0 00000000
01 0 0 00000000
10 0 0 00000000
11 0 0 00000000
000 00000001
001 00000010
010 00000100
011 00001000
100 00010000
101 00100000
110 01000000
111 10000000
Is there temporal or spatial locality in this sequence of x? Are they both exploited?
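For checking sequences like this one (or the exercise below) mechanically, here is a hedged C sketch of a direct-mapped cache simulator with s_b = 1; the parameters and printed messages are my own choices, not from the slides.

```c
#include <stdio.h>
#include <stdint.h>

#define NC 2                      /* n_c = 2, hence s_c = 4 blocks, s_b = 1 */
#define SC (1 << NC)

int main(void) {
  int valid[SC] = {0};            /* "initially empty": all bits set to 0 */
  uint32_t tag[SC] = {0};
  /* the address sequence from the slide: 101, 000, 111, 001, 010, ... */
  uint32_t xs[] = {5, 0, 7, 1, 2, 0, 5, 6, 7, 6};
  int n = sizeof(xs) / sizeof(xs[0]);
  int hits = 0;
  for (int i = 0; i < n; i = i + 1) {
    uint32_t k = xs[i] % SC;      /* block index: n_c least significant bits */
    uint32_t t = xs[i] >> NC;     /* tag: the remaining bits */
    if (valid[k] && tag[k] == t) {
      hits = hits + 1;
      printf("x=%u hit\n", xs[i]);
    } else {                      /* miss: load the block and update the tag */
      valid[k] = 1;
      tag[k] = t;
      printf("x=%u miss\n", xs[i]);
    }
  }
  printf("%d hits out of %d accesses\n", hits, n);
  return 0;
}
```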
Spatial locality: if a data location x is accessed at t, data locations close to x (some x+δx) will likely be accessed soon
We need to have not only x but also x+δx in cache
⇒ use larger blocks! (i.e., s_b > 1)
Assume n_m = 6, n_c = 2, n_b = 2 (block size is s_b = 4 bytes)
y is the bits of x from n_m − n_c − n_b to n_m − n_b
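A hedged C sketch (the helper name is mine) of the resulting three-way split of x into tag, index y and block offset:

```c
#include <stdio.h>
#include <stdint.h>

/* split an address into tag, index and block offset,
   for a cache with 2^{n_c} blocks of 2^{n_b} bytes each */
void splitAddress(uint32_t x, int nC, int nB,
                  uint32_t *tag, uint32_t *index, uint32_t *offset) {
  *offset = x & ((1u << nB) - 1);          /* n_b least significant bits */
  *index = (x >> nB) & ((1u << nC) - 1);   /* next n_c bits: this is y */
  *tag = x >> (nB + nC);                   /* remaining most significant bits */
}

int main(void) {
  uint32_t tag, index, offset;
  splitAddress(0x2Du, 2, 2, &tag, &index, &offset); /* x = 101101, n_m = 6 */
  printf("tag=%u index=%u offset=%u\n", tag, index, offset);
  return 0;
}
```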
Cache (n_c = 2, n_b = 1)
00 1 00 00000001 00000010
01 1 10 00100000 00110000
10 0 00 00000000 00000000
11 0 00 00000000 00000000
Main memory (n_m = 6)
00000 00000001
00001 00000010
...
11110 11100000
11111 11100001
For a main memory of 256 bytes, where x = [x], an initially empty cache with 4 blocks of 1 word each, and the reads 10, 11, 13, 20, 21, 22, 10, 20, 21 (decimal addresses x)
Hint: use decimal notation for block contents
"Initially empty" means all bits set to 0
The larger the block size
Depends on the specific sequence of memory accesses
(figure from Patterson, David A., and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface (ARM Edition). Morgan Kaufmann, 2016)
The larger the block size, the larger the miss penalty
Possible optimizations for reducing the penalty:
Assume:
Cycles for read:
These are gross estimates, not real numbers
Actual CPI based on miss rate:
Miss rate → | 1% | 5% | 10% | 50% |
---|---|---|---|---|
Miss | 5 | 21 | 41 | 201 |
Miss with early restart | 4 | 14 | 26 | 126 |
Miss with requested word first | 2 | 6 | 11 | 51 |
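The rows above are consistent with the simple model cycles = hit cost + miss rate × miss penalty, with a hit costing 1 cycle and (reverse-engineered from the table, so an assumption) penalties of roughly 400, 250 and 100 cycles; a C sketch, with small differences due to rounding:

```c
#include <stdio.h>

/* effective cycles per read under the simple model:
   cycles = hit cost + miss rate * miss penalty
   (the penalty values below are inferred from the table, not from the slides) */
double readCycles(double hitCycles, double missRate, double missPenalty) {
  return hitCycles + missRate * missPenalty;
}

int main(void) {
  double rates[] = {0.01, 0.05, 0.10, 0.50};
  for (int i = 0; i < 4; i = i + 1) {
    printf("miss rate %.0f%%: plain=%g, early restart=%g, word first=%g\n",
           100 * rates[i],
           readCycles(1, rates[i], 400),   /* wait for the whole block */
           readCycles(1, rates[i], 250),   /* early restart */
           readCycles(1, rates[i], 100));  /* requested word first */
  }
  return 0;
}
```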
If a write acts only on the cache, the content of the main memory at a given x and of the cache at the corresponding y may differ: they are inconsistent.
How to avoid inconsistencies?
At each write:
Pros:
Cons:
Assume:
Actual CPI based on miss rate (y-axis) and read-to-write ratio (x-axis):
Miss rate ↓ / r-to-w ratio → | 20 | 10 | 5 | 1 |
---|---|---|---|---|
1% | 7 | 11 | 18 | 51 |
5% | 10 | 15 | 22 | 53 |
10% | 15 | 19 | 26 | 55 |
50% | 53 | 55 | 59 | 75 |
At each write:
At each miss at y (read and¹ write):
¹: can be optimized
Pros:
Cons:
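To contrast the two policies, here is a hedged C sketch (names and structure are mine, s_b = 1 for brevity): write-through updates the cache and the main memory together, write-back only marks the block dirty and flushes it to memory when the block is evicted.

```c
#include <stdio.h>
#include <stdint.h>

#define SC 4                      /* s_c blocks, s_b = 1 for simplicity */

typedef struct {
  int valid;
  int dirty;                      /* used only by the write-back policy */
  uint32_t tag;
  uint8_t block;
} CacheLine;

CacheLine cache[SC];
uint8_t mainMemory[1 << 8];       /* toy 256-byte main memory */

/* write-through: every write updates both the cache and the main memory */
void writeThrough(uint32_t x, uint8_t value) {
  uint32_t k = x % SC;
  cache[k].valid = 1;
  cache[k].tag = x / SC;
  cache[k].block = value;
  mainMemory[x] = value;          /* memory is always consistent (but slow) */
}

/* write-back: writes touch only the cache; memory is updated lazily,
   when a dirty block is evicted by a miss on the same index */
void writeBack(uint32_t x, uint8_t value) {
  uint32_t k = x % SC;
  if (cache[k].valid && cache[k].dirty && cache[k].tag != x / SC) {
    uint32_t oldX = cache[k].tag * SC + k;  /* reconstruct the old address */
    mainMemory[oldX] = cache[k].block;      /* write the evicted block back */
  }
  cache[k].valid = 1;
  cache[k].dirty = 1;
  cache[k].tag = x / SC;
  cache[k].block = value;
}

int main(void) {
  writeThrough(0x2A, 7);          /* memory updated immediately */
  writeBack(0x2A, 9);             /* only the cache holds the 9 */
  writeBack(0x2E, 1);             /* same index: the 9 is flushed to memory */
  printf("mainMemory[0x2A] = %u\n", mainMemory[0x2A]);
  return 0;
}
```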
Consider a processor with a CPI of 2 and a miss penalty of 100 cycles for both read and write; consider a program with a 3-to-1 read-to-write ratio resulting in an overall 4% miss rate:
Three pieces of information are missing: the cycles needed to access the cache; the write policy; and, if not the "actual r/w CPI", the percentage of r/w instructions in the program
(figure from Patterson, David A., and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface (ARM Edition). Morgan Kaufmann, 2016)
Why is the instruction miss rate much lower?
Miss rate:
One option: increasing block size
Another option: each x can be hosted in (associated with) many y, not in just one.
➝ increase the degree of associativity
What's the counterpart of associativity in the library case?
With k=1: many-to-one from x to y
With k>1: many-to-many from x to y
Where should I look in the cache for a given x?
Where should I put an x in the cache?
With associativity, the s_c blocks are organized in sets of blocks, each consisting of s_s = 2^{n_s} contiguous blocks
A given x will go in a block of the k-th set, with x mod (s_c/s_s) = k (instead of in the k-th block)
Tag indicates which x is in a given y
Assume n_s = 2, n_c = 4:
Block index k to y:
x to k: x mod (s_c/s_s) = k, that is, k = x[n_m−(n_c−n_s)−n_b, n_m−n_b[
Hence, x → Y = {y : y[0, n_c−n_s[ = x[n_m−(n_c−n_s)−n_b, n_m−n_b[}
n_c "becomes" n_c−n_s
Steps 2.1 to 2.3 are actually done in parallel!
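A hedged C sketch (names are mine) of this lookup: compute the set index k from x, then compare the tag of every block of the set — sequentially here, in parallel in the hardware.

```c
#include <stdio.h>
#include <stdint.h>

#define NC 3                       /* n_c: 2^3 = 8 blocks in total */
#define NS 1                       /* n_s: sets of 2^1 = 2 blocks */
#define NB 1                       /* n_b: blocks of 2^1 = 2 bytes */
#define SC (1 << NC)
#define SS (1 << NS)

typedef struct {
  int valid;
  uint32_t tag;
  uint8_t block[1 << NB];
} CacheLine;

CacheLine cache[SC];

/* returns 1 and fills *value on hit, 0 on miss (validity or tag) */
int readByte(uint32_t x, uint8_t *value) {
  uint32_t z = x & ((1u << NB) - 1);                 /* byte offset in block */
  uint32_t k = (x >> NB) & ((1u << (NC - NS)) - 1);  /* set index */
  uint32_t tag = x >> (NB + NC - NS);                /* the remaining bits */
  for (int i = 0; i < SS; i = i + 1) {               /* in HW: in parallel */
    CacheLine *line = &cache[k * SS + i];            /* blocks of set k: Y */
    if (line->valid && line->tag == tag) {
      *value = line->block[z];
      return 1;                                      /* hit */
    }
  }
  return 0;                                          /* miss */
}

int main(void) {
  uint8_t v;
  cache[6].valid = 1;              /* populate one block by hand... */
  cache[6].tag = 0x13;
  cache[6].block[0] = 0x2A;
  /* ...then read an address that maps to set 11 with that tag */
  printf("%s\n", readByte(0x9E, &v) ? "hit" : "miss");
  return 0;
}
```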
n_c = 3, n_s = 1, n_m = 8, n_b = 1
000 1 00000 00000001 00000010
001 1 01010 00100000 00110000
010 0 11010 00000000 00000000
011 1 01111 00000000 11100000
100 1 00101 00000001 00000010
101 1 10110 00100000 00110000
110 1 01100 00000000 00000000
111 0 01011 00000000 00000000
x= 00100110 ⇒Y={110, 111}⇒ miss tag, miss validity
x= 01111011 ⇒Y={010, 011}⇒ miss validity, hit
⇒ returns block[8z, 8z+8[ = 11100000, with z = x[8−1, 8[ = 1
x= 01011000 ⇒Y={000, 001}⇒ miss tag, miss tag
Can we have >1 hits in a block?
For each block in the set, we need to know when it was used the last time: it can be done with n_a recency bits (a recency tag):
Whenever a y is accessed, if its recency tag is = 00, the recency tags of all blocks in the set have to be updated:
Just random!
Much less complex than LRU:
For a main memory of 256 bytes, where x = [x], an initially empty associative cache with 4 blocks of 1 word each, and the reads 10, 11, 13, 20, 21, 22, 10, 20, 21 (hexadecimal addresses x):
"Initially empty" means all bits set to 0
Goal of cache: memory with fast access and large size
Unsolved issues:
Suppose program A and program B are being executed concurrently: we'll see, briefly, how
Question:
Trivial, but wrong, solution: make A access addresses from x_{A,i} to x_{A,f} and B access addresses from x_{B,i} to x_{B,f}, with [x_{A,i}, x_{A,f}] and [x_{B,i}, x_{B,f}] being disjoint...
In ancient times, RAMs were small, much smaller than the addressable space (i.e., 2^{n_m})
Today, RAMs are much bigger, but still often smaller than addressable space:
Questions:
At any time, the memory content is only partially in the physical memory (RAM); the remaining part is on the disk
When a program wants to access a given address, it may be on the physical memory:
Solves the issue of size, not that of concurrent access to memory
Memory is organized in pages:
Each program is given a page and accesses memory specifying the page offset
Solves the issues of concurrent access, not that of size
Solves all problems:
This way, what is seen by programs is a larger, private memory that is not the physical memory: → virtual memory
From the program's point of view, a memory address is given by ⟨page number, page offset⟩
But physically, bytes are either in RAM or on disk
⇒ addresses have to be translated
There are similarities:
 | Cache | Virtual memory |
---|---|---|
Goal | fast as SRAM, large as DRAM | private (as if split among programs), large (as if entire, for each program) |
Two memories | SRAM and DRAM | DRAM and disk |
Chunks | block | page |
Address translation | x to y (or Y) | we'll see |
Names are different for historical reasons
In cache:
In virtual memory:
Assume:
A virtual address is composed of:
In general, the virtual memory can be larger than the physical memory (illusion of large memory):
Assume a program wants to access x_v: what happens?
In brief and conceptually:
Looks very like the case of cache, but...
Type | Access time [ns] | Cost [$/GB] |
---|---|---|
SRAM | 0.5–2.5 | 500–1000 |
DRAM | 50–70 | 10–20 |
Flash | 5000–50000 | 0.75–1.00 |
Magnetic disk | 5000000–20000000 | 0.05–0.10 |
Cache:
Virtual memory:
Since the cost of page faults is very high, greater care is put in reducing page fault probability:
Every program has a page table consisting of 2^{n_n} entries
Each entry contains:
The translation from x_v to x_p is a look-up in the page table:
No tag: why?
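A hedged C sketch of this look-up (field and function names are mine, and the table is kept tiny): the virtual page number indexes the page table directly, one entry per virtual page, which is why no tag is needed.

```c
#include <stdio.h>
#include <stdint.h>

#define NP 14                           /* n_p: page offset bits */
#define NN 8                            /* n_n entries; tiny here, 2^{n_n} in general */

typedef struct {
  int valid;                            /* 1 if the page is in physical memory */
  uint64_t physicalPageNumber;
} PageTableEntry;

PageTableEntry pageTable[1u << NN];     /* one entry per virtual page, no tag */

/* translate a virtual address into a physical one;
   returns 0 on a page fault (handled by the OS, not shown) */
int translate(uint64_t xv, uint64_t *xp) {
  uint64_t pageNumber = xv >> NP;       /* index into the page table */
  uint64_t pageOffset = xv & ((1u << NP) - 1);
  if (pageNumber >= (1u << NN)) {
    return 0;                           /* outside this toy table */
  }
  if (!pageTable[pageNumber].valid) {
    return 0;                           /* page fault: the page is on disk */
  }
  *xp = (pageTable[pageNumber].physicalPageNumber << NP) | pageOffset;
  return 1;
}

int main(void) {
  uint64_t xp;
  pageTable[3].valid = 1;               /* pretend virtual page 3 is in RAM */
  pageTable[3].physicalPageNumber = 7;
  uint64_t xv = (3ull << NP) | 0x2A;    /* page 3, offset 0x2A */
  if (translate(xv, &xp)) {
    printf("xp = %llx\n", (unsigned long long) xp);
  } else {
    printf("page fault\n");
  }
  return 0;
}
```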
For a program, the page table is in memory
State of a program (also called process):
State can be stored (in memory) and restored back:
The page table does not contain the physical address of the faulted page on the disk
The OS takes care of reserving a disk portion for all the pages of each process: the swap space (actually, things can be more complex)
OS takes care of page faults!
The first request for a page does not need to go to disk for reading
Because of full associativity, any page table entry can be replaced
Since replacement is managed by OS, replacement policy may be complex (e.g., LRU may be preferable over random)
However, real LRU is still too costly, since it requires updating usage info at each access, not only upon faults
⇒ approximated LRU with usage bit
Each entry in the page table contains:
At every hit, the usage bit is set
Every once in a while, all usage bits are unset
for bits, "set" is "set to 1", "unset" is "set to 0"
How to estimate the effectiveness of usage bit with respect to actual LRU?
The page table may be very large!
E.g., for n_v = 48, n_p = 14, n_m = 40, there are 2^{n_n} entries, each consisting of 1 + 1 + n_m − n_p = 28 bits, with n_n = n_v − n_p = 48 − 14 = 34
⇒ 28 · 2^{34} bits ≈ 60 GB... one per process!
Unfeasible to keep them in memory!
Optimizations:
Write-back policy with optimization: for avoiding writing back when unneeded (because the page has not been written)
Each entry in the page table contains:
Upon read/creation, the dirty bit is unset
Upon writes on the page, the dirty bit is set
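A short hedged C sketch (names are mine) of how the dirty bit avoids useless write backs on eviction:

```c
#include <stdio.h>

typedef struct {
  int valid;
  int dirty;                        /* set on writes, unset on read/creation */
} PageTableEntry;

/* on a write to the page, the dirty bit is set */
void onWrite(PageTableEntry *e) {
  e->dirty = 1;
}

/* on eviction, write back to disk only if the page was actually written */
void evict(PageTableEntry *e) {
  if (e->dirty) {
    printf("writing page back to disk\n");  /* stands for the actual OS I/O */
  }
  e->valid = 0;
  e->dirty = 0;
}

int main(void) {
  PageTableEntry clean = {1, 0};    /* read but never written */
  PageTableEntry written = {1, 0};
  onWrite(&written);
  evict(&clean);                    /* nothing to write back */
  evict(&written);                  /* must be written back to disk */
  return 0;
}
```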