Hello



Shivaram Venkataraman Fall 2020

## ADMINISTRIVIA

Assignments are done

- Course project titles
- Project proposal aka Introduction (10/16)
  - Introduction
  - Related Work
  - Timeline (with eval plan)
- Midterm: Oct 22

## MACHINE LEARNING: STACK



## **MOTIVATION: PERFORMANCE PORTABILITY**

PyTorch -> model file

- Different hardware (e.g. an Intel CPU, i.e. MKL) has a different memory hierarchy and different compute primitives
- You want high performance across hardware backends
- Dependence on vendor-specific libraries
- ML models evolve fast: new operators, or new combinations of operators, are not available in existing vendor tensor libraries

Figure: Memory Subsystem Architecture and **Compute Primitive** for CPU, GPU and 'TPU'. Memory is implicitly managed (CPU: L3, L2, L1D, L1I), mixed (GPU: L2, SM) or explicitly managed (TPU: Activation Buffer). Compute primitives: scalar, vector.





## **TENSOR EXPRESSION LANGUAGE**



Common arithmetic and math operations

Know the shape of the output and the data accessed
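To make "know the shape of the output and the data accessed" concrete, here is a minimal pure-Python sketch of a tensor expression: an output shape plus a rule that computes one element from its indices. The `compute` helper, the sizes `K`, `M`, `N` and the input values are all assumptions made for illustration; this is not TVM's API.

```python
from itertools import product

def compute(shape, rule):
    """Hypothetical helper (not TVM's API): build a tensor from an
    output shape and an index -> value rule."""
    out = {}
    for idx in product(*(range(n) for n in shape)):
        out[idx] = rule(*idx)
    return out

# Illustrative inputs: A is K x M and B is K x N, indexed as A[k][y], B[k][x].
K, M, N = 3, 2, 2
A = [[k + m for m in range(M)] for k in range(K)]
B = [[k * n + 1 for n in range(N)] for k in range(K)]

# C[y, x] = sum over k of A[k][y] * B[k][x].
# The output shape (M, N) is declared up front, and the rule states
# exactly which elements of A and B each output element reads.
C = compute((M, N), lambda y, x: sum(A[k][y] * B[k][x] for k in range(K)))
```

Because the expression says nothing about loop order, tiling or threading, those choices can be made separately as a schedule.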

Expression + schedule (as in Halide; compare OpenMP & gcc)

## CODE GENERATION

**Nested parallelism**

Launch threads; each thread does a load of C[i,j], b[i,j]. In a loop such as

    for i in 1:10
      for j in ...
        C[i,j] = b[i,j] + 2

each iteration is independent of the other loop iterations.

    for thread_group (by, bx) in cross(64, 64):
      for thread_item (ty, tx) in cross(2, 2):
        local CL[8][8] = 0
        for k in range(1024):
          for i in range(4):
            AS[ty][i*4+tx] = A[k][by*64+ty*8+i*4+tx]
          for each i in 0..4:
            BS[ty][i*4+tx] = B[k][bx*64+ty*8+i*4+tx]
          memory_barrier_among_threads()

After the barrier, threads can use AS, BS to do the computation.

**Tensorization**

Instructions: load, store, add. What is the hardware instruction?

    def gemm_intrin_lower(inputs, outputs):
        ww_ptr = inputs[0].access_ptr("r")
        xx_ptr = inputs[1].access_ptr("r")
        zz_ptr = outputs[0].access_ptr("w")
        compute = t.hardware_intrin("gemm8x8", ww_ptr, xx_ptr, zz_ptr)
        reset = t.hardware_intrin("fill_zero", zz_ptr)
        update = t.hardware_intrin("fuse_gemm8x8_add", ww_ptr, xx_ptr, zz_ptr)
        return compute, reset, update

    gemm8x8 = t.decl_tensor_intrin(y.op, gemm_intrin_lower)

Allows you to register operator intrinsics. Extensible! Same as PyTorch etc.

## LATENCY HIDING

What is the goal? Overlap computation and communication

**Monolithic Pipeline**

Schedule that utilizes memory bandwidth & compute units

|   | ld | ex | ld | ex | ld | ex | ld | ex |
|---|----|----|----|----|----|----|----|----|
|   | 0  | 0  | 1  | 1  | 2  | 2  | 3  | 3  |
| t | →  |    |    |    |    |    |    |    |

**Decoupled Access-Execute Pipeline** 



#### Instruction Stream

Monolithic pipeline:

    ld.perform_action(ld0)
    ex.perform_action(ex0)
    ld.perform_action(ld1)
    ex.perform_action(ex1)
    ...

Decoupled access-execute pipeline:

    ld.perform_action(ld0)
    ld.push_dep_to(ex)
    ld.perform_action(ld1)
    ld.push_dep_to(ex)
    ex.pop_dep_from(ld)
    ex.perform_action(ex0)
    ex.push_dep_to(ld)
    ex.pop_dep_from(ld)
    ex.perform_action(ex1)
    ex.push_dep_to(ld)
    ld.pop_dep_from(ex)
    ld.perform_action(ld2)
    ...
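The push/pop dependency tokens in the decoupled instruction stream can be mimicked in software. The sketch below is an assumption-laden illustration, not how the hardware or TVM implements it: two Python threads stand in for the load and execute units, two `queue.Queue` objects carry the tokens, and the function name `run_decoupled`, the two-slot buffer and the fake "load" and "execute" arithmetic are all made up.

```python
import queue
import threading

def run_decoupled(num_tiles, num_buffers=2):
    """Run a load stage and an execute stage as two threads that
    synchronize only through dependency tokens."""
    ld_to_ex = queue.Queue()  # token: "tile i has been loaded"
    ex_to_ld = queue.Queue()  # token: "a buffer slot is free again"
    buffers = [None] * num_buffers
    results, trace = [], []
    trace_lock = threading.Lock()

    def log(event):
        with trace_lock:
            trace.append(event)

    def load_stage():
        for i in range(num_tiles):
            if i >= num_buffers:
                ex_to_ld.get()                 # ld.pop_dep_from(ex)
            buffers[i % num_buffers] = i * 10  # ld.perform_action(ld_i)
            log(f"ld{i}")
            ld_to_ex.put(i)                    # ld.push_dep_to(ex)

    def execute_stage():
        for i in range(num_tiles):
            ld_to_ex.get()                     # ex.pop_dep_from(ld)
            results.append(buffers[i % num_buffers] + 1)  # ex.perform_action(ex_i)
            log(f"ex{i}")
            ex_to_ld.put(i)                    # ex.push_dep_to(ld)

    threads = [threading.Thread(target=load_stage),
               threading.Thread(target=execute_stage)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, trace

results, trace = run_decoupled(num_tiles=5)
```

With two buffer slots the loader can run up to two tiles ahead, so `ld1` may overlap with `ex0`, but `ld2` cannot start until `ex0` has released its slot. That matches the order of the stream above.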

## AUTOMATING OPTIMIZATION

Goal: Create a specialized operator for input shape and layout

Challenge:



## ML-BASED COST MODEL

Machine Learning Model Design Choices

Speed: Faster than the time it takes to evaluate a config

Quality: Use a rank objective to predict the relative order of runtimes

Features:

- memory access count
- reuse ratio of each memory buffer at each loop level
- one-hot encoding of loop annotations
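A rank objective only asks the model to order configurations correctly, not to predict absolute runtimes. The sketch below illustrates that idea with a linear scorer trained on a pairwise hinge loss. The linear model is a stand-in chosen for brevity, not the model used in the lecture or in TVM, and the feature values and runtimes are made up.

```python
# (memory access count, reuse ratio) -> measured runtime in ms.
# All numbers are made up for illustration.
measurements = [
    ((1.0, 0.9), 1.2),
    ((2.0, 0.5), 3.0),
    ((3.0, 0.4), 5.5),
    ((4.0, 0.1), 9.0),
]

def score(w, x):
    """Linear score; a higher score means 'predicted slower'."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise(data, epochs=20, lr=0.1, margin=1.0):
    """Pairwise hinge loss: for every (fast, slow) pair, push the slow
    config's score above the fast config's score by at least `margin`."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x_fast, t_fast in data:
            for x_slow, t_slow in data:
                if t_fast >= t_slow:
                    continue  # only consider pairs where the first is faster
                diff = [s - f for s, f in zip(x_slow, x_fast)]
                if score(w, diff) < margin:
                    w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

w = train_pairwise(measurements)
by_score = sorted(measurements, key=lambda m: score(w, m[0]))
by_runtime = sorted(measurements, key=lambda m: m[1])
```

Here `by_score` and `by_runtime` agree even though the scores are not runtimes, which is all the search needs to decide which candidate to try next.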

## ML-BASED COST MODEL

Each candidate is a configuration. Training data, e.g. <C2, 8 ms>, <C4, fail>.

Iteration

- Select a batch of candidates
- Step (1) above -> training data

How to select candidates? Ask the model: is C3' better than C3? Yes -> go try C3' on the cluster. No -> generate another.

Parallel Simulated Annealing

- Start from a random config
- Walk to a nearby config
- Successful if cost decreases
- Else reject
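The search loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: `predicted_cost` is a made-up formula standing in for the learned cost model, the configuration is a pair of tile-size indices, and the independent chains run one after another rather than in parallel. It follows the accept-only-if-cost-decreases rule from the slide; textbook simulated annealing also accepts some uphill moves with a temperature-dependent probability.

```python
import random

TILE_SIZES = [1, 2, 4, 8, 16, 32, 64]  # assumed knob values

def predicted_cost(config):
    """Stand-in for the learned cost model (hypothetical formula)."""
    iy, ix = config
    return (iy - 3) ** 2 + (ix - 4) ** 2

def neighbors(config):
    """Nearby configs: move one knob one step up or down."""
    iy, ix = config
    cands = [(iy - 1, ix), (iy + 1, ix), (iy, ix - 1), (iy, ix + 1)]
    return [(a, b) for a, b in cands
            if 0 <= a < len(TILE_SIZES) and 0 <= b < len(TILE_SIZES)]

def anneal(seed, steps=200):
    rng = random.Random(seed)
    # Start from a random config.
    current = (rng.randrange(len(TILE_SIZES)), rng.randrange(len(TILE_SIZES)))
    for _ in range(steps):
        # Walk to a nearby config; keep it only if the predicted cost decreases.
        candidate = rng.choice(neighbors(current))
        if predicted_cost(candidate) < predicted_cost(current):
            current = candidate
    return current

def select_batch(num_chains=4):
    """Run several independent chains and return their end points,
    best predicted cost first, as the batch to measure on real devices."""
    finals = {anneal(seed) for seed in range(num_chains)}
    return sorted(finals, key=predicted_cost)

batch = select_batch()
best_tiles = tuple(TILE_SIZES[i] for i in batch[0])
```

Only the configs in `batch` would be run on hardware; their measured runtimes then become new training data for the cost model.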

## **DISTRIBUTED DEVICE POOL**

Pool of devices to speed up profiling

RPC interface to run a trial on device

Share device pools for multiple graphs
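A rough sketch of how a shared device pool might hand out devices to concurrent trials. Everything here is hypothetical: `DevicePool`, `run_trial` and `fake_measure` are invented names, a local thread pool stands in for the RPC layer, and the "measurement" is a made-up formula rather than a real run on hardware.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

class DevicePool:
    """Hypothetical pool: a trial blocks until some device is free."""

    def __init__(self, device_names):
        self._free = queue.Queue()
        for name in device_names:
            self._free.put(name)

    def run_trial(self, config, measure):
        device = self._free.get()        # wait for a free device
        try:
            return config, device, measure(config, device)
        finally:
            self._free.put(device)       # hand the device back to the pool

def fake_measure(config, device):
    """Stand-in for an RPC call that runs the config on the device."""
    return config * 1.5                  # pretend runtime in ms

pool = DevicePool(["dev0", "dev1"])
configs = [1, 2, 3, 4]

# More workers than devices: the extra trials simply wait for a device.
with ThreadPoolExecutor(max_workers=4) as executor:
    trials = list(executor.map(lambda c: pool.run_trial(c, fake_measure), configs))
```

Several tuning jobs could share one `DevicePool` instance, which is the sharing across multiple graphs that the slide mentions.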

## SUMMARY

TVM: Compiler for ML inference models

Support high performance for range of models, hardware devices

Key ideas

- Graph-level optimizations: operator fusion
- Tensor expression language: code-gen, latency hiding, etc.
- ML-based cost model for automation

## DISCUSSION

https://forms.gle/WiVgJ3abGXXgfBN99

Consider that you are building an optimizer for Spark programs instead of ML inference. What would be some configuration knobs that you could similarly tune? What might be different from the TVM optimizer?

- Similar logic: latency hiding -> overlap computation and communication
- IR & dimensions, access patterns
- Operator fusion -> map operations
- Partitioning -> can you automate the number of partitions / co-partitioning? -> performance!
- Persistence: manually insert rdd.cache() -> large space!



## NEXT STEPS

Next class: Ray

Course project: Oct 16 (introductions)

Midterm: Oct 22

Latency hiding in Spark? rdd1: map tasks (rdd.map, rdd.map, rdd.map) -> reduce tasks; transfer shuffle files to the reduce tasks