Understanding Language Model Tokenizers

Tokenizer outputs for the two input sentences:

Input text
  text_a: A distant person is climbing up a very sheer mountain.
  text_b: The mountain is unclimbable for humans.

BERT (WordPiece)
  text_a: '[CLS]', 'a', 'distant', 'person', 'is', 'climbing', 'up', 'a', 'very', 'sheer', 'mountain', '.', '[SEP]'
  text_b: '[CLS]', 'the', 'mountain', 'is', 'un', '##cl', '##im', '##ba', '##ble', 'for', 'humans', '.', '[SEP]'

ALBERT (SentencePiece)
  text_a: '[CLS]', '▁a', '▁distant', '▁person', '▁is', '▁climbing', '▁up', '▁a', '▁very', '▁sheer', '▁mountain', '.', '[SEP]'
  text_b: '[CLS]', '▁the', '▁mountain', '▁is', '▁unc', 'limb', 'able', '▁for', '▁humans', '.', '[SEP]'

RoBERTa (byte-level BPE)
  text_a: 'A', 'Ġd', 'istant', 'Ġperson', 'Ġis', 'Ġcl', 'im', 'bing', 'Ġup', 'Ġa', 'Ġvery', 'Ġshe', 'er', 'Ġmount', 'ain', '.'
  text_b: 'The', 'Ġmount', 'ain', 'Ġis', 'Ġun', 'cl', 'imb', 'able', 'Ġfor', 'Ġhumans', '.'

T5 (SentencePiece)
  text_a: '▁A', '▁distant', '▁person', '▁is', '▁climbing', '▁up', '▁', 'a', '▁very', '▁sheer', '▁mountain', '.'
  text_b: '▁The', '▁mountain', '▁is', '▁un', 'c', 'limb', 'able', '▁for', '▁humans', '.'
  • climbing = climbing (usually kept as a single piece)
  • unclimbable = (ideally) un + climb + able
  • unclimbable = (but in practice) un + cl + im + ba + ble / unc + limb + able / …
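
A minimal sketch of how a comparison like the table above can be produced with the transformers library (the checkpoints are the ones used in the sections below; exact splits may vary across library and vocabulary versions):

from transformers import (AlbertTokenizer, BertTokenizer,
                          RobertaTokenizer, T5Tokenizer)

text_b = 'The mountain is unclimbable for humans.'

tokenizers = {
    'BERT (WordPiece)': BertTokenizer.from_pretrained('bert-base-uncased'),
    'ALBERT (SentencePiece)': AlbertTokenizer.from_pretrained('albert-base-v2'),
    'RoBERTa (byte-level BPE)': RobertaTokenizer.from_pretrained('roberta-base'),
    'T5 (SentencePiece)': T5Tokenizer.from_pretrained('t5-small'),
}

for name, tok in tokenizers.items():
    # tokenize() shows the raw subword split; the [CLS]/[SEP] specials in the
    # table come from encoding with special tokens added
    print(name, tok.tokenize(text_b))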

Tokenization [Transformers]

BertTokenizer

Construct a BERT tokenizer based on WordPiece.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
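
As a quick check, encoding text_a and converting the ids back to tokens should reproduce the WordPiece row of the table above (a minimal sketch; encode() adds the [CLS]/[SEP] special tokens, and bert-base-uncased lowercases its input):

ids = bert_tokenizer.encode('A distant person is climbing up a very sheer mountain.')
print(bert_tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'a', 'distant', 'person', 'is', 'climbing', 'up', 'a', 'very', 'sheer', 'mountain', '.', '[SEP]']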

AlbertTokenizer

  • Construct an ALBERT tokenizer based on SentencePiece.

  • This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

from transformers import AlbertTokenizer

albert_tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
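
The same check for ALBERT should reproduce the SentencePiece row, where '▁' marks a preceding space and the out-of-vocabulary word 'unclimbable' is split into '▁unc' + 'limb' + 'able' (a minimal sketch; expected output taken from the table above):

ids = albert_tokenizer.encode('The mountain is unclimbable for humans.')
print(albert_tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', '▁the', '▁mountain', '▁is', '▁unc', 'limb', 'able', '▁for', '▁humans', '.', '[SEP]']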

RobertaTokenizer

  • Constructs a RoBERTa tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.

  • This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not:

from transformers import RobertaTokenizer

roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

  • The space, too, becomes part of a single token! (id = 232)
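
The id example below is the usage snippet from the transformers documentation for RoBERTa; it shows that 'world' maps to different ids with and without a leading space, because the byte-level space marker 'Ġ' is fused into the token (ids may differ across vocabulary versions):

# 'Hello' at the start of the sentence has no leading space, so it gets a
# different id than ' Hello'; ' world' keeps its id (232) in both cases
print(roberta_tokenizer("Hello world")['input_ids'])   # [0, 31414, 232, 2]
print(roberta_tokenizer(" Hello world")['input_ids'])  # [0, 20920, 232, 2]
print(roberta_tokenizer.convert_ids_to_tokens([232]))  # ['Ġworld']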

T5Tokenizer

  • Construct a T5 tokenizer based on SentencePiece.

  • This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

from transformers import T5Tokenizer

t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
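
For T5, tokenize() alone matches the table row; unlike BERT and ALBERT, T5 keeps the original casing, and encode() would append the '</s>' end-of-sequence token rather than [CLS]/[SEP] (a minimal sketch; expected output taken from the table above):

print(t5_tokenizer.tokenize('The mountain is unclimbable for humans.'))
# ['▁The', '▁mountain', '▁is', '▁un', 'c', 'limb', 'able', '▁for', '▁humans', '.']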
