tinyML Asia 2021 Dongsoo Lee: Extremely low-bit quantization for Transformers

Dongsoo Lee (이동수), Executive Officer, NAVER CLOVA

The deployment of the widely used Transformer architecture is challenging because of its heavy computation load and memory overhead during inference, especially when the target device has limited computational resources, such as mobile or edge devices. Quantization is an effective technique for addressing these challenges. Our analysis shows that, for a given number of quantization bits, each block of a Transformer contributes to model accuracy and inference cost in a different manner. Moreover, even inside an embedding block, each word contributes very differently. Correspondingly, we propose a mixed-precision quantization strategy that represents Transformer weights with an extremely low number of bits (e.g., under 3 bits). For example, for each word in an embedding block, we assign a different number of quantization bits based on its statistical properties. We also introduce a new …
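
To make the idea of per-word mixed-precision quantization concrete, here is a minimal sketch, not the speaker's actual method: each row of an embedding matrix (one word) is quantized uniformly to its own bit width, and the bit width is chosen from a simple statistical proxy. The proxy used here (per-row L2 norm) and the equal-sized bit groups are assumptions made purely for illustration.

```python
import numpy as np

def quantize_row(row: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric quantization of one embedding row to `bits` bits."""
    if bits == 1:
        # 1-bit case: binarize to +/- alpha, with alpha = mean absolute value.
        alpha = np.mean(np.abs(row))
        return np.where(row >= 0, alpha, -alpha).astype(row.dtype)
    levels = 2 ** (bits - 1) - 1                 # e.g. 3 bits -> codes in [-3, 3]
    max_abs = np.max(np.abs(row))
    scale = max_abs / levels if max_abs > 0 else 1.0
    # Round to integer codes, then de-quantize to simulate the low-bit weights.
    return (np.round(row / scale) * scale).astype(row.dtype)

def mixed_precision_embedding(emb: np.ndarray, bit_choices=(1, 2, 3)) -> np.ndarray:
    """Assign more bits to rows with larger L2 norm (an assumed importance proxy)."""
    norms = np.linalg.norm(emb, axis=1)
    ranks = np.argsort(np.argsort(norms))        # rank of each word by norm
    group = (ranks * len(bit_choices)) // len(emb)
    out = np.empty_like(emb)
    for i, row in enumerate(emb):
        out[i] = quantize_row(row, bit_choices[group[i]])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 64)).astype(np.float32)   # toy vocabulary of 1000 words
    q_emb = mixed_precision_embedding(emb)
    print("mean abs error:", np.mean(np.abs(emb - q_emb)))
```

With three equal groups at 1, 2, and 3 bits, the average weight cost is 2 bits per embedding entry, which illustrates how an "under 3 bits" budget can be met while still spending more precision on the words the heuristic deems most important.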
