What Is WordPiece Tokenization?
WordPiece is a subword tokenization algorithm that splits words into smaller units drawn from a fixed vocabulary learned from training data. The vocabulary is built from statistics on how often character sequences occur together, so common words tend to stay whole while rare or unseen words are represented as combinations of known subwords. It is best known as the tokenizer used by BERT and related models.
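To make the splitting step concrete, here is a minimal sketch of the greedy longest-match-first lookup that BERT-style WordPiece tokenizers apply to a single word once a vocabulary exists. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and exist only for illustration. The `##` continuation prefix and the `[UNK]` token follow BERT's conventions. Vocabulary training is not shown.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                # Pieces that do not begin the word carry the continuation prefix.
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No known piece covers this position: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, not taken from any real model.
toy_vocab = {"un", "##afford", "##able", "play", "##ing", "##s"}

print(wordpiece_tokenize("unaffordable", toy_vocab))  # ['un', '##afford', '##able']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

A real tokenizer would first normalize the text and split it into words, and would use a vocabulary of tens of thousands of entries, but the per-word lookup follows this same pattern.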