{"id":10524,"date":"2025-10-11T04:17:39","date_gmt":"2025-10-11T04:17:39","guid":{"rendered":"https:\/\/www.aihello.com\/resources\/?p=10524---60d42093-9e04-4f5b-abc5-710b2a6bbac8"},"modified":"2025-10-27T01:26:39","modified_gmt":"2025-10-27T01:26:39","slug":"a-complete-overview-to-tokenization","status":"publish","type":"post","link":"https:\/\/www.aihello.com\/resources\/blog\/a-complete-overview-to-tokenization\/","title":{"rendered":"A Complete Overview to Tokenization"},"content":{"rendered":"<p>Let <a href=\"https:\/\/www.aihello.com\/resources\/blog\/how-to-launch-an-amazon-mexico-business\/\">us<\/a> discuss absolutely everything about the greatest challenge that Large Language Models (LLMs) face\u200a\u2014\u200a<a href=\"https:\/\/www.aihello.com\/resources\/blog\/tokenization-and-its-application\/\">tokenization<\/a>.<\/p><h2 class=\"wp-block-heading\">What Exactly is Tokenization?<\/h2><p>Tokenization is the process of breaking down text into smaller parts called \u2018<em>tokens<\/em>\u2019 whereas a \u2018<em>tokenizer<\/em>\u2019 is the <a href=\"https:\/\/www.aihello.com\/resources\/blog\/amazon-brand-analytics-a-complete-guide-2024\/\">tool<\/a> or algorithm that lets us do that.<\/p><p>Text can be tokenized into characters or words or even subwords\u200a\u2014\u200awe\u2019ll be exploring each of these and more in detail below.<\/p><h2 class=\"wp-block-heading\">Why do we Tokenize?<\/h2><p>Unstructured text usually lacks a structure, splitting the text into smaller tokens and assigning a numerical value to each unique token offers the computer interpretability of that text.<\/p><p>All tokens in a given text are processed parallelly through the Transformers architecture which is the base architecture for LLMs\u200a\u2014\u200athis parallel processing helps in increasing the processing speed. 
These tokens are then passed through positional embeddings, which let the model learn the order of the tokens in the text and build a sense of context for each one. Context is extremely important, since a change in context can drastically change the meaning of a word or a sentence.<\/p><h3 class=\"wp-block-heading\">Different Tokenization Techniques<\/h3><ul class=\"wp-block-list\"><li><strong>Character Tokenization<\/strong><\/li><\/ul><p>In this method we split the text into its individual characters. In most programming languages a string is already just a sequence of characters, so this method is trivial to code.<\/p><pre class=\"wp-block-code\"><code>text = \"Sample text for tokenization!\"
tokens = list(text)
print(tokens)<\/code><\/pre><pre class=\"wp-block-code\"><code>&#091;'S', 'a', 'm', 'p', 'l', 'e', ' ', 't', 'e', 'x', 't', ' ', 'f', 'o', 'r', ' ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', '!']<\/code><\/pre><p>The problem with this type of tokenization is that individual characters carry almost no meaning on their own\u200a\u2014\u200athe same characters appear in a huge number of different words, so the model gets very little sense of context from any single token.<\/p><ul class=\"wp-block-list\"><li><strong>Word Tokenization<\/strong><\/li><\/ul><p>In this technique the text is split on whitespace into words, and the whitespace itself is discarded.<\/p><pre class=\"wp-block-code\"><code>tokens = text.split()
print(tokens)<\/code><\/pre><pre class=\"wp-block-code\"><code>&#091;'Sample', 'text', 'for', 'tokenization!']<\/code><\/pre><p>As we can see from the above example, this approach is usually better than the previous one because whole words carry more context\u200a\u2014\u200aalthough a word can still take on different meanings depending on the sentence.&nbsp;<\/p><p>We can also see that symbols such as \u2018!\u2019 stay attached to a word, which is not desirable 
because a symbol has a meaning of its own and can change the meaning of any word it is attached to. We also don\u2019t want a separate vocabulary entry for every word-plus-symbol combination. To solve this, we split all symbols in the text into separate tokens.<\/p><pre class=\"wp-block-code\"><code>&#091;'Sample', 'text', 'for', 'tokenization', '!']<\/code><\/pre><ul class=\"wp-block-list\"><li><strong>Subword Tokenization<\/strong><\/li><\/ul><p>The basic idea of subword tokenization is that each word in a text can be split further into meaningful subword units\u200a\u2014\u200atypically a base word plus frequently appearing substrings such as suffixes.<\/p><p>The most popular tokenizer that uses this type of algorithm is called \u2018<strong><em>WordPiece<\/em><\/strong>\u2019. Let\u2019s see a popular WordPiece-based tokenizer used by the DistilBERT model in action through the code below-<\/p><pre class=\"wp-block-code\"><code>from transformers import DistilBertTokenizer

model = \"distilbert-base-uncased\"
tokenizer = DistilBertTokenizer.from_pretrained(model)
encoded_text = tokenizer(text)
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)<\/code><\/pre><pre class=\"wp-block-code\"><code>&#091;'&#091;CLS]', 'sample', 'text', 'for', 'token', '##ization', '!', '&#091;SEP]']<\/code><\/pre><p>As we can see, the word \u2018tokenization\u2019 is split into two tokens\u200a\u2014\u200a\u2018token\u2019 and \u2018##ization\u2019. \u2018token\u2019 is a meaningful base word, while \u2018ization\u2019 is a suffix that appears across many different words and, much like a symbol, changes the meaning of whatever word it is attached to. 
The \u2018##\u2019 prefix in \u2018##ization\u2019 indicates that the token attaches directly to the token before it.<\/p><p>Another thing we notice is the addition of two new tokens\u200a\u2014\u200a\u2018[CLS]\u2019 and \u2018[SEP]\u2019. These are special tokens added by this particular tokenizer to mark the start and end of the text respectively, making interpretation easier for the model.<\/p><ul class=\"wp-block-list\"><li><strong>Byte-Pair Encoding&nbsp;(BPE)<\/strong><\/li><\/ul><p>This algorithm first pre-tokenizes the text into unique words and records the frequency of each one. For example-<\/p><pre class=\"wp-block-code\"><code>(\"hug\", 10), (\"pug\", 5), (\"pun\", 12), (\"bun\", 4), (\"hugs\", 5)<\/code><\/pre><p>The technique then creates a base vocabulary for itself from all the unique characters: <em>[\u201cb\u201d, \u201cg\u201d, \u201ch\u201d, \u201cn\u201d, \u201cp\u201d, \u201cs\u201d, \u201cu\u201d].<\/em><\/p><p>BPE then works at the level of character pairs and their frequencies. To see what it does next, let\u2019s split all the unique words into their characters-<\/p><pre class=\"wp-block-code\"><code>(\"h\" \"u\" \"g\", 10), (\"p\" \"u\" \"g\", 5), (\"p\" \"u\" \"n\", 12), (\"b\" \"u\" \"n\", 4),(\"h\" \"u\" \"g\" \"s\", 5)<\/code><\/pre><p>The algorithm looks for the most frequently appearing contiguous pair of characters. 
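<\/p><p>This pair-counting step can be sketched in plain Python (a toy illustration, not the actual BPE implementation)-<\/p><pre class=\"wp-block-code\"><code>from collections import Counter

# Toy corpus: word (split into symbols) mapped to its frequency
corpus = {('h', 'u', 'g'): 10, ('p', 'u', 'g'): 5, ('p', 'u', 'n'): 12,
          ('b', 'u', 'n'): 4, ('h', 'u', 'g', 's'): 5}

# Count every contiguous symbol pair, weighted by word frequency
pairs = Counter()
for word, freq in corpus.items():
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += freq

print(pairs.most_common(1)[0])  # (('u', 'g'), 20)<\/code><\/pre><p>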
In the example above, the pair (\u2018u\u2019, \u2018g\u2019) appears most frequently (10+5+5=20), so the two characters are merged into \u2018ug\u2019, which is also added to the base vocabulary-<\/p><pre class=\"wp-block-code\"><code>(\"h\" \"ug\", 10), (\"p\" \"ug\", 5), (\"p\" \"u\" \"n\", 12), (\"b\" \"u\" \"n\", 4),(\"h\" \"ug\" \"s\", 5)<\/code><\/pre><p>This merging process then continues in the same way until the vocabulary reaches a size threshold set by the programmer.<\/p><p>The final vocabulary is then applied to any text the tokenizer encounters. For example, \u2018bugs\u2019 would be tokenized into <em>[\u2018b\u2019, \u2018ug\u2019, \u2018s\u2019]<\/em> whereas \u2018mugs\u2019 would be tokenized into <em>[\u2018&lt;unk&gt;\u2019, \u2018ug\u2019, \u2018s\u2019]<\/em>.&nbsp;<\/p><p>As we can see, the letter \u2018m\u2019 became an <em>&lt;unk&gt;<\/em> token because \u2018m\u2019 was not present anywhere in the base vocabulary; the same happens to every character the vocabulary has never seen.<\/p><p><strong><em>Byte-level BPE<\/em><\/strong> is a method used by GPT-2 where, instead of building the base vocabulary from all Unicode characters, bytes (0\u2013255) are used. Every character can be represented as a sequence of bytes, so the base vocabulary shrinks to just 256 symbols and no input ever produces an unknown token.<\/p><ul class=\"wp-block-list\"><li><strong>Unigram Tokenization<\/strong><\/li><\/ul><p>This algorithm works in the opposite direction: it initialises a large vocabulary and trims it down along the way. At each training step, the model computes a loss of the current vocabulary over the training data and removes the tokens whose removal increases that loss the least. 
This process is repeated until the vocabulary has reached the desired size.&nbsp;<\/p><p>The method always retains all the base characters so that any word can be tokenized.<\/p><h2 class=\"wp-block-heading\">Popular Tokenizer Libraries<\/h2><ul class=\"wp-block-list\"><li><strong>SentencePiece<\/strong><\/li><\/ul><p>SentencePiece is a library developed by Google for general-purpose tokenization. It supports the Unigram, BPE, word and character methods, although by default it is set to Unigram.<\/p><p>It differs from the vanilla methods in that it also keeps whitespace in the tokenization, marking it with the \u2018\u2581\u2019 symbol.<\/p><p>Let\u2019s use a pre-trained BPE model with SentencePiece-<\/p><pre class=\"wp-block-code\"><code>import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('en.wiki.bpe.vs1000.model')
tokens = sp.encode_as_pieces(text)
print(tokens)<\/code><\/pre><pre class=\"wp-block-code\"><code>&#091;'\u2581', 'S', 'amp', 'le', '\u2581te', 'x', 't', '\u2581for', '\u2581to', 'k', 'en', 'iz','ation', '!']<\/code><\/pre><ul class=\"wp-block-list\"><li><strong>tiktoken<\/strong><\/li><\/ul><p>tiktoken is an open-source tokenizer library developed by OpenAI. It is a fast BPE tokenizer made specifically for use with GPT models. 
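<\/p><p>A minimal usage sketch (assuming the tiktoken package is installed; 'gpt2' is one of its bundled encodings)-<\/p><pre class=\"wp-block-code\"><code>import tiktoken

# Load the BPE encoding used by GPT-2
enc = tiktoken.get_encoding('gpt2')

ids = enc.encode('Sample text for tokenization!')
print(ids)              # a list of integer token IDs
print(enc.decode(ids))  # round-trips back to the original text<\/code><\/pre><p>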
Because it is built for one particular family of models, it can be heavily optimised and is considerably faster than general-purpose BPE tokenizers.<\/p><h2 class=\"wp-block-heading\">Challenges with Tokenization for&nbsp;LLMs<\/h2><p>Beyond the difficulty of conveying a token\u2019s exact meaning and context to the model, there is another problem that sometimes breaks entire language models\u200a\u2014\u200a<strong><em>glitch tokens<\/em><\/strong>.<\/p><p>Glitch tokens are tokens that, when input into most language models, cause the model to give anomalous outputs; some of these tokens include \u2018SolidGoldMagikarp\u2019, \u2018attRot\u2019 and \u2018ysics\u2019.<\/p><p>It is believed that this happens because the tokenizers of these LLMs are trained on a huge corpus of data scraped directly from the internet, while the text the LLMs themselves are trained on is curated far more carefully. Some tokens therefore appear in the tokenizer\u2019s training set but never in the <a href=\"https:\/\/www.aihello.com\/resources\/blog\/tokenization-and-its-application\/\">LLM<\/a>\u2019s, so the language model does not know how to deal with them and breaks.<\/p><p>It has also been observed that when k-means is used to cluster similar tokens, these glitch tokens tend to sit at the centers of the clusters, although the reason is not confirmed.<\/p><h2 class=\"wp-block-heading\">Tokenizer-Free Approach for&nbsp;LLMs<\/h2><p>A recent paper, \u2018<a href=\"https:\/\/arxiv.org\/abs\/2406.19223\" target=\"_blank\" rel=\"noreferrer noopener\">T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings<\/a>\u2019, proposes representing text directly through sparse representations, which skips the heavy tokenization process entirely. 
This sparse representation can be more memory-efficient while still preserving the semantic meaning that conventional post-tokenization embeddings capture.<\/p><h2 class=\"wp-block-heading\">Research Papers on Tokenization<\/h2><p>Here are some research papers that I recommend reading to learn more about tokenization-<\/p><ul class=\"wp-block-list\"><li><a href=\"https:\/\/arxiv.org\/pdf\/2404.08335\" target=\"_blank\" rel=\"noreferrer noopener\">Toward a Theory of Tokenization in LLMs<\/a><\/li><li><a href=\"https:\/\/arxiv.org\/pdf\/2405.17067\" target=\"_blank\" rel=\"noreferrer noopener\">Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization<\/a><\/li><li><a href=\"https:\/\/arxiv.org\/pdf\/2406.11687\" target=\"_blank\" rel=\"noreferrer noopener\">Tokenization Falling Short: hTe Cusre of Tkoeniaztion<\/a><\/li><\/ul>","protected":false},"excerpt":{"rendered":"<p>Explore tokenization, the essential text processing method for LLMs. 
Understand its importance and how character, word, and subword techniques enhance AI comprehension.<\/p>\n","protected":false},"author":30,"featured_media":10528,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[89,906,103,163,34],"tags":[1149,1147,1148,1150],"class_list":["post-10524","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-amazon","category-amazon-seller-tips","category-machine-learning","category-resources","category-tutorials","tag-llm","tag-tokenization","tag-tokenization-techniques","tag-tokenizer-libraries"],"_links":{"self":[{"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/posts\/10524","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/users\/30"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/comments?post=10524"}],"version-history":[{"count":1,"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/posts\/10524\/revisions"}],"predecessor-version":[{"id":12290,"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/posts\/10524\/revisions\/12290"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/media\/10528"}],"wp:attachment":[{"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/media?parent=10524"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/categories?post=10524"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/tags?post=10524"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}