CS 지식/컴퓨터구조

[컴퓨터구조] 14-3. Pipelined MIPS Performance and Exception handler

2023.10.10

Pipelined Performance Example

SPECINT2000 benchmark:
- 25% loads
- 10% stores
- 11% branches
- 2% jumps
  - jump 명령이 fetch되고 decode 되면서 jump라는 것을 알 수 있는데 그 때 PC에 BTA를 넣어주어야 하는데 이미 그 다음 줄 instruction이 들어와 있다. 그래서 그 명령을 flush 해 주는 clock이 하나 더 소비 되기 때문에 jump 명령의 CPI는 2이다.
- 52% R-type
Suppose:
- 다음 명령어에 의해 사용되는 loads의 40% -> load use hazard(stalling)
  - CPI = 2 (due to stalling)
- 잘못 예측된 branch의 25%
  - CPI = 2 (due to flush)
- 모든 jumps는 다음 명령어를 flush시킨다.
  - All jump CPI = 2
What is the average CPI?
- Load/Branch CPI = 1 when no stalling, 2 when stalling
- CPI_lw = 1(0.6) + 2(0.4) = 1.4
- CPI_beq = 1(0.75) + 2(0.25) = 1.25
- Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15

jump는 CPI가 다른 것과 비교했을 때 매우 큰 값인 2이지만 비율로 따졌을 때 2%이기 때문에 크게 영향을 미치지 않았다.

Pipelined processor critical path:
Tc = max {2(t_RFread+ t_mux+ t_eq+ t_AND+ t_mux+ t_setup) -> critical patht_pcq+ t_memwrite+ t_setupW방과 D방에서는 RFread or RFwrite은 falling edge trigger이기 때문에 2 배만큼의 clock이 더 걸리게 된다.
2(t_pcq+ t_mux+ t_RFwrite) }
t_pcq+ t_mux+ t_mux+ t_ALU+ t_setup
t_pcq + t_mem + t_setup

Pipelined Performance Example

D방이 제일 오래 걸린다.

Tc = 2(t_RFread+ t_mux+ t_eq+ t_AND+ t_mux+ t_setup)

=2[150 + 25 + 40 + 15 + 25 + 20] ps = 550 ps

Program with 100 billion instructions

Execution Time = (#instructions) x CPI x Tc

= (100 x 10⁹)(1.15)(550 x 10^-12)

= 63 seconds

Processor Performance Comparison

single-cycle보다 1.47배 만큼 더 빨라진 것을 볼 수 있음

Review: Exceptions

Unscheduled function call to exception handler
Caused by:
- Hardware, interrupt라고도 불리움. e.g. keyboard
- Software, traps,라고도 불리움. e.g. undefined instruction
- Divide-by-zero, overflow, debugger breakpoint, hardware malfunction, attempt to read nonexistent memory, accessing privileged instruction or memory
exception이 발생할 때 the processor는..:
- exception의 이유를 기록한다.(Cause register)
- exception handler로 Jump (0x80000180)
- Returns to program (EPC register를 보고 다시 리턴한다.)

Example Exception

Exception Register

Not part of register file
- Cause
  - Records cause of exception
  - Coprocessor 0 register 13
- EPC (Exception PC)
  - Records PC where exception occurred (어디서 발생했는지 기록)
  - Coprocessor 0 register 14
    - Coprocessor: 별도의 process를 처리하는 processor
Move from Coprocessor 0
- mfc0 $t0, Cause
  - Cause의 내용을 읽어서 $t0에 write
- Moves contents of Cause into $t0

Exception Causes

마지막 두 타입의 exceptions을 다루기 위해 multicycle MIPS processor을 확장한다.

Exception Hardward: EPC & Cause

Exception이 발생하면(EPCWrite active) 현재 실행되고 있는 PC 값이 EPC register에 저장이 되고

어떤 exception인지를 mux가 결정하여 그 값이 cause register에 저장이 된다.

Exception Hardware: mfc0

mfc0 $3, Cause

WriteReg에 cause 정보를 저장하는 path

mux를 통해 cause 혹은 EPC 값을 전달한다.

Control FSM with Exceptions

Advanced Microarchitecture

Deep Pipelining
Branch Prediction
Superscalar Processors
Out of Order Processors
Register Renaming
SIMD (Single Instruction Multiple Data )
Multi-threading
Multi-processors

Deep Piplining

10-20 stages typical (너무 높으면 오히려 성능 저하가 생기기 때문에 적당한 stage로 구성)
stage의 수는 다음에 의해 제한된다:
- Pipeline hazards
- Sequencing overhead
- Power - frequency에 비례하기 때문에 너무 빠른 클락을 사용하면 그만한 power를 요구
- Cost

Tc: clock cycle time이 pipeline stage가 늘어날 수록 짧아지지만 Instruction time 즉, 성능은 줄어들다가 어느 순간 늘어난다.

따라서 너무 deep pipeline은 좋지 않다.

Branch Prediction

branch가 언제 수행될 지 추측한다.
- Backward branches가 대개 수행된다.(loops)
- 추측을 향상하기 위해 history를 고려한다.
좋은 prediction은 flush를 요구하는 branches의 조각(fraction)을 감소시킨다.

Ideal pipelined processor: CPI = 1
Branch misprediction increases CPI (CPI = 2)
- flush 하는 과정 때문에 beq의 경우 CPI가 2
Static branch prediction:
- branch의 방향을 확인(forward or backward)
  - backward면, predict taken
  - forward면, predict not taken
Dynamic branch prediction:
- branch target buffer 안에 마지막 (수백가지) branches의 history를 계속 기록한다:
  - Branch destination
  - Whether branch was taken

Branch Prediction Example

loop를 10번 돌리는 과정
history 1 bit가 있고 처음에는 T(aken) 으로 되어 있는데 실제로는 일어나지 않아야 하기 때문에 여기서 한 번 틀리게 되고 그 이후 NT로 바뀌어서 loop를 10번 진행할 동안에는 모두 NT가 맞기 때문에 계속 맞다가 마지막에는 beq명령어가 done으로 이동해야 하기 때문에 T인데 history는 NT로 되어있기 때문에 여기서 또 한 번 틀리고 다시 NT로 바뀌게 된다.
따라서 맨 처음과 맨 끝에서만 틀리게 된다.

1-Bit Branch Predictor

Mispredicts first and last branch of loop
branch가 마지막에 수행되는지를 기억해서 같은 행위를 한다.
첫번째와 마지막 branch loop를 잘못 예측한다.
- state transition이 곧 mispredict를 의미한다.

이 방법도 한계가 있는데, 가령 doubly nested loop(이중 for문)를 들어보면 위와 같은 구조는 branch instruction에서 miss가 날 때마다 prediction state가 변경된다. 그런데 전체적으로 루프 한 개가 다른 loop 한 개를 포괄하고 있는 상태에서 전체 state를 예측하는 것이 확실하다고 보장하지 못할 것이다.
다시 말해 내부 loop가 수행되면서 branch가 taken이 되더라고 마지막 반복구문에서는 그 branch에 대한 prediction이 일어나지 않으므로 그 상태가 not taken으로 변경된다. 그런데 이 상태에서 안쪽 loop를 다시 수행하면 처음 branch predict를 수행하면 다시 수행하기 위해 state가 taken으로 변경되면서 결론적으로 안쪽 루프의 처음 state와 나중 state가 맞지 않는 현상이 발생하는 것이다.

2-Bit Branch Predictor

연속적으로 계속 taken이 일어나고 있는지를 확인한다.

Only mispredicts last branch of loop

loop의 마지막에서만 mispredict가 일어난다.

연속적으로 Not Taken이 일어났을 때서야 (Weakly) Not Taken으로 예측한다.

2bit에서는 branch prediction이 두 번 연속으로 틀릴 경우에만 state가 변경되기 때문에 위의 케이스 경우에는 안쪽 루프의 마지막 instruction이 mispredict 되더라도 state가 taken을 유지하게 된다. 물론 그 상태에서 다시 안쪽 루프를 수행하더라도 전체적으로 prediction state가 맞게 되는 것이다.

Superscalar

만약 CPU의 ALU를 통해서 덧셈과 곱셈의 연산결과를 뽑으라는 instruction을 줬다고 가정했을 때 앞에서 말한 IPC=1인 환경에서는 한 cylce이 지나면 곱셈의 결과를 얻을 수 있어야 한다. 그런데 만약 곱셈의 결과가 덧셈의 결과를 바탕으로 나오는 것이라면 한 cylce이 지나기 전에 곱셈의 결과를 얻기는 어려울 것이다. 왜냐면 덧셈의 instruction을 수행하는데도 1cylce이 소모되기 때문이다. 이렇게 컴퓨터 성능을 막는 요소 중 하나를 instruction 간의 dependency라고 한다.

이걸 극복하기 위해선 instruction이 처리되는 path를 여러개 만들고 각각의 instruction을 해당 path를 통해 처리하게 하면 된다.

(위 예제에서 덧셈과 곱셈) 물론 완전히 똑같으면 성능 측면에서 의미가 없으니까 하나는 data만 처리하게 하고, 다른 하나는 instruction만 처리할 수 있도록 구성하면 instruction을 읽어왔을 때 data pipeline을 통해 결과를 뽑을 수 있게 한다.

이 구조를 바로 SuperScalar 구조라고 하고 앞에서 잠깐 언급한 path를 SuperScalar에서는 way로 표현한다.

Multiple copies of datapath execute multiple instructions at once
Dependencies make it tricky to issue multiple instructions at once CLK CL
내부구조는 아래 그림과 같이 생겼다.

기존의 pipeline과 다른 점은 superscalar는 두 개의 instruction을 한 번에 처리할 수 있다는 점이다.

그래서 이어진 명령어들 간에 같은 register 위치를 사용하는 것이 불가능 하다.

Superscalar Example

Superscalar with Dependencies

t0를 load 하는데 그 다음 명령에서 t0를 사용해야 한다.

6개 명령어가 들어가는데 5clock이 걸렸다. -> IPC = 6/5

datapath가 두 개로 늘긴 했지만 그만큼의 효과를 보지 못했다.(due to dependencies)

Out of Order (OOO) Processor

들어가는 instruction 순서와 실행되는 순서가 뒤죽박죽 by compiler (hazard를 줄이기 위해)

multiple insturction에 걸쳐 살펴본다.
가능한 한 많은 issue를 한 번에
Issues instructions out of order (as long as no dependencies)
Dependencies:
- RAW (read after write): one instruction writes, later instruction reads a register
- WAR (write after read): one instruction reads, later instruction writes a register
- WAW (write after write): one instruction writes, later instruction writes a register

Q) RAR는 dependency에 속하지 않는 이유

A) dependency는 연속된 명령어들 간에 의존성을 말한다. 그래서 떨어질 수 없는 경우를 말하는데 read의 경우는 값이 달라질 문제가 없기 때문에 그냥 하면 된다.

Out of Order Processor

Instruction level parallelism (ILP):
- 동시에 발행될 수 있는 명령어의 수 (average < 3)
Scoreboard: table that keeps track of(기록):
- 발행을 기다리는 명령어들
- Available functional units
- Dependencies

Out of order Processor Example

dependencies를 따진 후에 or, sw의 위치를 위로 올려서 stall을 없앴다. -> IPC 향상

Register Renaming

rename register - nonarchitectural register (우리가 사용할 수 없는 register; MIPS: $r0~$r19)

t0를 r0로 rename 함으로써 add명령을 아래로 내릴 수 있었고 이를 통해 IPC의 향상을 이끌어냈다.

SIMD

Single Instruction Multiple Data (SIMD)
- Single instruction이 여러 데이터에 동시 작용함
- Common application: graphics
- 짧은 arithmetic operations를 수행한다. (packed arithmetic이라고도 불리는)
For example, add four 8-bit elements

Advanced Architecture Techniques

Multithreading
- Wordprocessor: thread for typing, spell checking, printing
Multiprocessors
- Multiple processors (cores) on a single chip

processor vs. process
- processor는 hardware
- process는 software(task)이다.

Threading: Definitions

Process: 컴퓨터에서 실행 중인 프로그램
- Multiple processes가 한 번에 실행될 수 있다. (e.g., surfing Web, playing music, writing a paper)
Thread: 프로그램의 일부
- 각 process는 multiple threads를 갖는다.
- e.g., word processor는 typing, spell checking, printing에 대해 thread를 가질 수 있다.

Threads in Conventional Processor

Single-Core system

한 번에 하나의 스레드
하나의 thread가 stall일 때 (예를 들어, 메모리를 기다리는):
- thread가 저장되는 Architectural state
- Architectural state of waiting thread loaded into processor and it runs
- Architectural state는 processor로 로드되는 thread를 기다리다 그리고 실행된다.
- Called context switching
마치 모든 스레드가 동시에 실행 중인 것처럼 유저에게 보인다.
- 동시에 실행이 되는 것처럼 보이지만 실제로는 이거했다 저거했다 context switching이 이루어지는 것

Multithreaded processor

thread를 여러개로 동시에 할 수 있는 hardware

architectural state 의 multiple copies를 갖는다.(PC, reg)
Multiple threads는 한 순간에 active 될 수 있다.
- 한 thread가 stall 되면, 또 다른 thread는 동시에 실행될 수 있다(어떠한 delay 없이).
  - PC와 register는 이미 실행가능하기 때문
- 만약 한 thread가 모든 execution units을 바쁘게 유지할 수 없으면, 또 다른 thread는 idle units를 사용할 수 있다.
single thread의 instruction-level parallelism(ILP)을 증가시키지 않지만 throughput을 증가시킨다.

Intel calls this “hyperthreading”

Multiprocessors

Multiple processors (cores) with a method of communication between them
Types:
- Homogeneous: shared memory을 가진 multiple cores
- Heterogeneous: 다른 tasks에 분리된 cores(for example, DSP and CPU in cell phone)
- Clusters: 각 core는 각자의 memory system을 갖는다.

참고문헌 (reference)

Patterson & Hennessy’s: Computer Architecture: A Quantitative Approach
Conferences:
www.cs.wisc.edu/~arch/www/
ISCA (International Symposium on Computer Architecture)
HPCA (International Symposium on High Performance Computer Architecture)

'CS 지식 > 컴퓨터구조' 카테고리의 다른 글

[컴퓨터구조] 16-1. Cache (1)	2023.10.10
[컴퓨터구조] 15. Memory (2)	2023.10.10
[컴퓨터구조] 14-2. Pipelined MIPS - hazard (1)	2023.10.10
[컴퓨터구조] 14-1. Pipelined MIPS (1)	2023.10.10
[컴퓨터구조] 13. Multi-cycle Performance (2)	2023.10.10

Contents

새소식

인기 검색어

[컴퓨터구조] 14-3. Pipelined MIPS Performance and Exception handler

Pipelined Performance Example

Pipelined Performance Example

Processor Performance Comparison

Review: Exceptions

Example Exception

Exception Register

Exception Causes

Exception Hardward: EPC & Cause

Exception Hardware: mfc0

Control FSM with Exceptions

Advanced Microarchitecture

Deep Piplining

Branch Prediction

Branch Prediction Example

1-Bit Branch Predictor

2-Bit Branch Predictor

Superscalar

Superscalar Example

Superscalar with Dependencies

Out of Order (OOO) Processor

Out of Order Processor

Out of order Processor Example

Register Renaming

SIMD

Advanced Architecture Techniques

Threading: Definitions

Threads in Conventional Processor

Multithreaded processor

Multiprocessors

참고문헌 (reference)

'CS 지식 > 컴퓨터구조' 카테고리의 다른 글

당신이 좋아할만한 콘텐츠

티스토리툴바