If I asked you to separate helloworld into meaningful words, you would immediately jump to hello and world before your mind could even consider that the first word might be he or hell.

You’ve just created a token boundary using only pre-existing knowledge of the language. In other words, the token boundary wasn’t encoded in the source material — you used outside information to infer one.

This isn’t how lexers work.

Computer languages are designed to be tokenized unambiguously. This is achieved by encoding token boundaries into the source material.

This comes at a price.

Knowing token boundaries

Whitespace is the most apparent token boundary. The space between class and Configuration allows the lexer to separate them into two distinct tokens.

class Configuration {
};
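
A quick Python illustration of the same idea (this snippet is mine, just for illustration): splitting on whitespace alone is enough to separate the keyword from the identifier.

print('class Configuration {'.split())
# =['class', 'Configuration', '{']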

The less obvious token boundary is this:

c=a+b

There are no spaces here, and yet, we know exactly how to tokenize this text.
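
In fact, Python’s own lexer agrees. As a sanity check, the standard tokenize module splits this string without needing any whitespace:

import io
import tokenize

for tok in tokenize.tokenize(io.BytesIO(b'c=a+b').readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# NAME 'c', OP '=', NAME 'a', OP '+', NAME 'b'
# (plus encoding and end-of-input bookkeeping tokens)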

You are probably a human looking at this example in a human way and seeing operators and variables. I think you would appreciate my point more if the source looked like this [1]:

[459.99, 408.88, 457.3, 384.67, 458.645]

The token boundaries aren’t so obvious here. How do you decide where to draw the lines?
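
(As an aside, the obfuscation recipe from footnote [1] is easy to invert once you know it, though decoding is beside the point; the question is about boundaries.)

obfuscated = [459.99, 408.88, 457.3, 384.67, 458.645]
print(''.join(chr(round(x / 1.345) - 243) for x in obfuscated))
# prints: c=a+b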

We implicitly know that = does not belong in the same set as c and a [2]. Therefore, c and = cannot be part of the same token. Therefore, there must be a token boundary between c and =.

Token boundaries are encoded in the change of character sets of adjacent tokens.

The whitespace example is not an exception; it is a special case of this rule. Whitespace encodes two boundaries — the first boundary is encoded in the transition from keyword to whitespace, and the second in the transition from whitespace to identifier.

The other notable characteristic of this special case is that whitespace encodes no information beyond the token boundary itself [3].
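
Here is a minimal sketch of a lexer built directly on this rule. The names tokenize_by_class and char_class are mine, and this is an illustration rather than a production tokenizer; note how a run of whitespace becomes a token of its own, with one boundary on each side.

import string

set_operator   = set('+-*/=')
set_identifier = set(string.ascii_lowercase)

def char_class(ch):
    # The set a character belongs to determines its class
    if ch in set_operator:
        return 'operator'
    if ch in set_identifier:
        return 'identifier'
    return 'other'

def tokenize_by_class(text):
    # Start a new token whenever the character class changes
    tokens = []
    for ch in text:
        if tokens and char_class(ch) == char_class(tokens[-1][-1]):
            tokens[-1] += ch
        else:
            tokens.append(ch)
    return tokens

print(tokenize_by_class('c=a+b'))
# =['c', '=', 'a', '+', 'b']
print(tokenize_by_class('class test'))
# =['class', ' ', 'test']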

How many Shannons does it take

Example 1

To start with, we want to calculate the number of Shannons [4] that the token boundary in c= occupies.

import math

def entropy(token_set, token_length):
    token_set_size = len(token_set)

    probability_unit = 1 / token_set_size
    entropy_unit = -1 * token_set_size * \
        (probability_unit * math.log2(probability_unit))
    # ^ We assume that all characters in the set are equally likely

    entropy_token = token_length * entropy_unit
    return entropy_token
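
Since every character in the set is assumed equally likely, this whole expression reduces to token_length * math.log2(token_set_size); the expanded form above just spells out the textbook definition. A quick sanity check:

# Uniform case: entropy is just length * log2(set size)
assert math.isclose(entropy(set('abc'), 4), 4 * math.log2(3))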

Also, let

import string

set_empty      = set()
set_operator   = set('+-*/=')
set_identifier = set(string.ascii_lowercase)
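
These set sizes drive all the numbers that follow: 26 identifier characters, 5 operator characters, and, because the sets are disjoint, 31 characters in the union.

print(len(set_identifier), len(set_operator), len(set_operator | set_identifier))
# =26 5 31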

Case 1

For this case, our token boundary is encoded in the change of character sets between the two tokens.

assert set_identifier & set_operator == set_empty

In place of c, we could have any one of 26 possible states. For this token of length 1, the Shannon entropy is:

entropy_identifier = entropy(set_identifier, 1)
print(entropy_identifier)
# =4.700439718141092

Likewise, in place of =, we could have any one of 5 possible states. For a token of this type and length 1, the Shannon entropy is:

entropy_operator = entropy(set_operator, 1)
print(entropy_operator)
# =2.321928094887362

The total entropy of the source is:

entropy_case_1 = entropy_identifier + entropy_operator
print(entropy_case_1)
# =7.022367813028454

Case 2

For this case, let’s say that the token boundary is encoded separately, outside the source.

Since the boundary is not encoded in the change of character sets, each individual position can be filled with any character available to us (i.e. the union of the operator and identifier sets).

The Shannon entropy for the source message can be calculated by:

entropy_case_2 = entropy(set_operator | set_identifier, 2)
print(entropy_case_2)
# =9.908392620773748

Entropy of the token boundary

entropy_token_boundary = entropy_case_2 - entropy_case_1
print(entropy_token_boundary)
# =2.8860248077452937

This means that this token boundary encodes ~3 Shannons’ worth of information, or roughly 30% of the total entropy of the message. Granted, the message itself is quite small, but this is still a significant result.
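
A quick check of that percentage, reusing the values computed above:

print(entropy_token_boundary / entropy_case_2)
# ~0.29, i.e. roughly 30%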

Think about it this way: a certain character set is reserved for any given token. A boundary is detected where a character from outside the reserved set appears. Each character of the token carries information about the token boundary: specifically, where the token boundary isn’t.

Each character of a token encodes two different pieces of information: its own value, and where the token boundary isn’t.


1. I obfuscated this example slightly using [(ord(_)+243)*1.345 for _ in "c=a+b"]

2. In this context.

3. As opposed to the case of c=, where the change in character set encodes the token boundary, c encodes the identifier name, and = encodes the specific operation. The whitespace character, when used as a token separator, contains no information of its own.

4. I have intentionally avoided using ‘bits’.