Understanding Language Model Tokenizers
| LM | Tokenizer | text_a | text_b |
|---|---|---|---|
| None | - | A distant person is climbing up a very sheer mountain. | The mountain is unclimbable for humans. |
| BERT | WordPiece | ‘[CLS]’, ‘a’, ‘distant’, ‘person’, ‘is’, ‘climbing’, ‘up’, ‘a’, ‘very’, ‘sheer’, ‘mountain’, ‘.’, ‘[SEP]’ | ‘[CLS]’, ‘the’, ‘mountain’, ‘is’, ‘un’, ‘##cl’, ‘##im’, ‘##ba’, ‘##ble’, ‘for’, ‘humans’, ‘.’, ‘[SEP]’ |
| ALBERT | SentencePiece | ‘[CLS]’, ‘▁a’, ‘▁distant’, ‘▁person’, ‘▁is’, ‘▁climbing’, ‘▁up’, ‘▁a’, ‘▁very’, ‘▁sheer’, ‘▁mountain’, ‘.’, ‘[SEP]’ | ‘[CLS]’, ‘▁the’, ‘▁mountain’, ‘▁is’, ‘▁unc’, ‘limb’, ‘able’, ‘▁for’, ‘▁humans’, ‘.’, ‘[SEP]’ |
| RoBERTa | Byte-level BPE | ‘A’, ‘Ġ’, ‘d’, ‘istant’, ‘Ġ’, ‘person’, ‘Ġ’, ‘is’, ‘Ġ’, ‘cl’, ‘im’, ‘bing’, ‘Ġ’, ‘up’, ‘Ġ’, ‘a’, ‘Ġ’, ‘very’, ‘Ġ’, ‘she’, ‘er’, ‘Ġ’, ‘mount’, ‘ain’, ‘.’ | ‘The’, ‘Ġ’, ‘mount’, ‘ain’, ‘Ġ’, ‘is’, ‘Ġ’, ‘un’, ‘cl’, ‘imb’, ‘able’, ‘Ġ’, ‘for’, ‘Ġ’, ‘humans’, ‘.’ |
| T5 | SentencePiece | ‘▁A’, ‘▁distant’, ‘▁person’, ‘▁is’, ‘▁climbing’, ‘▁up’, ‘▁’, ‘a’, ‘▁very’, ‘▁sheer’, ‘▁mountain’, ‘.’ | ‘▁The’, ‘▁mountain’, ‘▁is’, ‘▁un’, ‘c’, ‘limb’, ‘able’, ‘▁for’, ‘▁humans’, ‘.’ |
- climbing = climbing (kept as one known piece)
- unclimbable = (ideally) un + climb + able
- unclimbable = (but in practice) un + cl + im + ba + ble (BERT) / unc + limb + able (ALBERT) … (reproduced in the sketch below)
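The comparison above can be reproduced with a short sketch using `AutoTokenizer` (these are the standard Hugging Face Hub checkpoint names; exact pieces can vary slightly across library versions):

```python
from transformers import AutoTokenizer

# Compare how each pretrained tokenizer splits the same sentence.
sentence = "The mountain is unclimbable for humans."
for name in ["bert-base-uncased", "albert-base-v2", "roberta-base", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {tok.tokenize(sentence)}")
```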
Tokenization [Transformers]
BertTokenizer
- Construct a BERT tokenizer based on WordPiece.
- This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
```python
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
```
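As a quick sanity check, a minimal sketch of what this tokenizer produces for the example sentence (pieces per the table above; `tokenize()` returns the raw WordPiece pieces, while `encode()` additionally adds the special tokens):

```python
# WordPiece marks word-internal continuation pieces with '##'.
print(bert_tokenizer.tokenize("The mountain is unclimbable for humans."))
# per the table: ['the', 'mountain', 'is', 'un', '##cl', '##im', '##ba', '##ble', 'for', 'humans', '.']

# encode() also inserts the [CLS] and [SEP] special tokens.
ids = bert_tokenizer.encode("The mountain is unclimbable for humans.")
print(bert_tokenizer.convert_ids_to_tokens(ids))
```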
AlbertTokenizer
- Construct an ALBERT tokenizer based on SentencePiece.
- This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
```python
from transformers import AlbertTokenizer

albert_tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
```
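A quick sketch of the corresponding SentencePiece output (pieces per the table above):

```python
# SentencePiece marks the start of a word with '▁' (U+2581),
# instead of WordPiece's '##' continuation prefix.
print(albert_tokenizer.tokenize("The mountain is unclimbable for humans."))
# per the table: ['▁the', '▁mountain', '▁is', '▁unc', 'limb', 'able', '▁for', '▁humans', '.']
```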
RobertaTokenizer
- Constructs a RoBERTa tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.
- This tokenizer has been trained to treat spaces like parts of the tokens (a bit like SentencePiece), so a word will be encoded differently depending on whether it is at the beginning of the sentence (without a space) or not:
```python
from transformers import RobertaTokenizer

roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
```
- Even a space becomes its own token (‘Ġ’, id = 232)!
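A minimal sketch of that space sensitivity (exact pieces may vary by vocabulary version; ‘Ġ’ is the byte-level marker for a preceding space):

```python
# The same word is encoded differently with and without a leading space.
print(roberta_tokenizer.tokenize("mountain"))   # e.g. ['mount', 'ain']
print(roberta_tokenizer.tokenize(" mountain"))  # e.g. ['Ġmountain']
```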
T5Tokenizer
- Construct a T5 tokenizer based on SentencePiece.
- This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
```python
from transformers import T5Tokenizer

t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
```
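A short sketch (pieces per the table above; note that `encode()` appends T5's end-of-sequence token by default):

```python
print(t5_tokenizer.tokenize("The mountain is unclimbable for humans."))
# per the table: ['▁The', '▁mountain', '▁is', '▁un', 'c', 'limb', 'able', '▁for', '▁humans', '.']

# encode() appends the '</s>' end-of-sequence token.
ids = t5_tokenizer.encode("The mountain is unclimbable for humans.")
print(t5_tokenizer.convert_ids_to_tokens(ids))
```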