Tokenizing

Wednesday, June 11, 2003

Two friends have recently asked me to define the term "Tokenize", as in "I'm going to tokenize this chunk of text", and I didn't really provide an answer. I guess they asked because it's a term that I've used quite a bit in the past - and will no doubt use in the future too ;-)

Merriam-Webster provides several definitions for the term "Token", a couple of which are:

    a distinguishing feature : CHARACTERISTIC
    a small part representing the whole :

For the record, now that I've had time to give it some thought, I'd like to give an example of what I mean when I use it...

Imagine that you've been assigned the task of building a service that provides word-wrap functionality to applications. An application supplies a body of text and a LineLength property to the service, which returns the original text with every line formatted to be "no longer than" the LineLength limit. Additionally, subscribers to this service can toggle between word-wrap states via the Wrapped and UnWrapped values of the WrapMode enumerated datatype.

To convert raw text into formatted "wrapped" text, you decide to apply an algorithm similar to the one shown here http://www.namesuppressed.com/syneryder/code-phpwordwrap.shtml; that is:

    - Find all paragraphs
    - For Each paragraph
        - Remove linebreaks
        - Split on spaces
        - Enumerate the words and append them to a string    
        - When the length of the string reaches the LineLength limit insert a linebreak

Given the following chunk of raw text:

    This is a paragraph of text.
    This is a yet another
    paragraph of boring 
    old text.

A LineLength of 18 would see it formatted as:

    This is a 
    paragraph of text.
    This is a yet 
    another paragraph
    of boring old 
    text.
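The steps above can be sketched roughly as follows. This is a minimal sketch in Python rather than the VB.NET used elsewhere in this post, and it is not the linked PHP code. The name `wrap_text` is hypothetical, paragraphs are assumed to be separated by blank lines, and the length check happens before a word is appended so that no line exceeds the limit (a single word longer than the limit is left on its own line).

```python
def wrap_text(raw: str, line_length: int) -> str:
    """Greedy word wrap, one paragraph at a time."""
    wrapped_paragraphs = []
    # Assumption: a blank line separates paragraphs.
    for paragraph in raw.split("\n\n"):
        lines = []
        line = ""
        # split() with no arguments removes linebreaks and splits on whitespace.
        for word in paragraph.split():
            if not line:
                line = word
            elif len(line) + 1 + len(word) <= line_length:
                line += " " + word
            else:
                # The next word would overflow the limit, so break here.
                lines.append(line)
                line = word
        if line:
            lines.append(line)
        wrapped_paragraphs.append("\n".join(lines))
    return "\n\n".join(wrapped_paragraphs)


raw = (
    "This is a paragraph of text.\n"
    "This is a yet another\n"
    "paragraph of boring \n"
    "old text."
)
print(wrap_text(raw, 18))
```

With the sample text and a limit of 18 this yields the same six lines as shown above, minus the trailing spaces.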

To return the text in its original, raw format, you might store two versions of the data in private fields and simply return whichever version is requested. That is, you'd already hold the raw value in one field and, after the initial "formatting", you'd write the formatted version to another, i.e.:

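Private mstrRawValue As String
Private mstrFormattedValue As String
Private mCurrentState As WrapMode

Public Function GetText() As String
  If Me.mCurrentState = WrapMode.Wrapped Then
    ' Format lazily, the first time the wrapped version is requested.
    ' The Nothing check avoids a NullReferenceException on an unset field.
    If mstrFormattedValue Is Nothing OrElse mstrFormattedValue.Length = 0 Then
      FormatRawText()
    End If
    Return mstrFormattedValue
  Else
    Return mstrRawValue
  End If
End Function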

That's probably fine, even though the memory required is approximately double the size of the raw string alone. But you might find it difficult to scale if the client asks for another one, or two, or twenty-two different WrapMode states, or asks you to provide an "offline" version of the formatted text!

In situations such as the one above, if I think there's a chance that I might need to perform more than one operation on a string, I'll often "tokenize" it after the first pass. When I say tokenize, I mean that I leave small, descriptive "marks" in the text that can be read at a later date to describe a given state. To show what I mean, here's the algorithm above, amended for "tokenizing":

    - Find all paragraphs
    - For Each paragraph
        - Split on linebreaks and wrap with "<raw>...</raw>" tokens
        - Split on spaces
        - Enumerate the words and append them to a string    
        - When the length of the string reaches the LineLength limit insert a linebreak and wrap with "<formatted>...</formatted>" tokens

And, again, given the following chunk of raw text:

    This is a paragraph of text.
    This is a yet another
    paragraph of boring 
    old text.

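A LineLength of 18 would see it tokenized as (shown on three lines here for readability):

<raw><formatted>This is a</formatted> <formatted>paragraph of text.</formatted></raw>

<raw><formatted>This is a yet</formatted> <formatted>another</raw> <raw>paragraph</formatted>

<formatted>of boring</raw> <raw>old</formatted> <formatted>text.</formatted></raw>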
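The amended steps might be sketched like this. It is only a sketch in Python (not the VB.NET used elsewhere in this post): the names `greedy_line_ends` and `tokenize` are hypothetical, the whole input is treated as a single paragraph, and the text is assumed not to contain the token strings itself. Each word gets an opening token when it starts a line in a given state and a closing token when it ends one, so the two kinds of token are free to overlap.

```python
def greedy_line_ends(words, line_length):
    """Return the indexes of the words that end a wrapped line."""
    ends = set()
    current = 0  # length of the line being built
    for i, word in enumerate(words):
        if current == 0:
            current = len(word)
        elif current + 1 + len(word) <= line_length:
            current += 1 + len(word)
        else:
            ends.add(i - 1)  # the previous word closed the line
            current = len(word)
    if words:
        ends.add(len(words) - 1)
    return ends


def tokenize(raw, line_length):
    words = []
    raw_ends = set()  # indexes of the words that end an original line
    for raw_line in raw.splitlines():
        line_words = raw_line.split()
        if not line_words:
            continue  # simplification: blank lines are skipped
        words.extend(line_words)
        raw_ends.add(len(words) - 1)
    formatted_ends = greedy_line_ends(words, line_length)

    pieces = []
    for i, word in enumerate(words):
        piece = ""
        if i == 0 or (i - 1) in raw_ends:
            piece += "<raw>"
        if i == 0 or (i - 1) in formatted_ends:
            piece += "<formatted>"
        piece += word
        if i in formatted_ends:
            piece += "</formatted>"
        if i in raw_ends:
            piece += "</raw>"
        pieces.append(piece)
    # Words stay separated by single spaces; the tokens carry the linebreaks.
    return " ".join(pieces)


sample = (
    "This is a paragraph of text.\n"
    "This is a yet another\n"
    "paragraph of boring \n"
    "old text."
)
print(tokenize(sample, 18))
```

The result is one continuous string in which every closing token marks where a line ends in the corresponding state.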

Compared with storing two copies, this roughly halves the memory required to store the data (token overhead aside), as the document now describes its own states. The algorithm for returning text is like so:

Imports System.Text.RegularExpressions

Private mstrStoredValue As String
Private mblnIsMarked As Boolean = False
Private mCurrentState As WrapMode

Public Function GetText() As String
  Dim tmpStrng As String
  ' Tokenize once; FormatRawText is assumed to set mblnIsMarked to True.
  If Not mblnIsMarked Then
    FormatRawText()
  End If
  If Me.mCurrentState = WrapMode.Wrapped Then
    ' Drop the raw tokens and the opening formatted tokens...
    tmpStrng = Regex.Replace(mstrStoredValue, "<formatted>|</?raw>", "")
    ' ...then turn each closing formatted token into a linebreak.
    Return Regex.Replace(tmpStrng, "</formatted>", Environment.NewLine)
  Else
    ' Same idea with the roles of the two tokens swapped.
    tmpStrng = Regex.Replace(mstrStoredValue, "<raw>|</?formatted>", "")
    Return Regex.Replace(tmpStrng, "</raw>", Environment.NewLine)
  End If
End Function
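To illustrate how this might scale beyond two states, here is a hedged Python sketch of the same two-pass idea (Python rather than VB.NET, purely for illustration). The enum values, the `TOKEN` pattern and the sample string are assumptions made for this example. Unlike the VB.NET above, it also swallows the whitespace that follows a closing token so that lines don't start with a stray space.

```python
import re
from enum import Enum


class WrapMode(Enum):
    # Each value is the token name used in the stored text (an assumption).
    WRAPPED = "formatted"
    UNWRAPPED = "raw"


# Matches the opening or closing token of any known mode.
KINDS = "|".join(re.escape(mode.value) for mode in WrapMode)
TOKEN = re.compile(rf"</?(?:{KINDS})>")


def get_text(stored: str, mode: WrapMode) -> str:
    closing = f"</{mode.value}>"
    # Pass 1: drop every token except the closing token of the requested mode.
    only_closers = TOKEN.sub(
        lambda m: m.group(0) if m.group(0) == closing else "", stored
    )
    # Pass 2: each remaining closing token marks the end of a line.
    text = re.sub(re.escape(closing) + r"\s*", "\n", only_closers)
    return text.rstrip("\n")


stored = (
    "<raw><formatted>This is a</formatted> "
    "<formatted>paragraph of text.</formatted></raw> "
    "<raw><formatted>This is a yet</formatted> "
    "<formatted>another</raw> <raw>paragraph</formatted> "
    "<formatted>of boring</raw> <raw>old</formatted> "
    "<formatted>text.</formatted></raw>"
)
print(get_text(stored, WrapMode.WRAPPED))
print(get_text(stored, WrapMode.UNWRAPPED))
```

Supporting another state in this sketch would mean adding one enum member and emitting its tokens when the text is marked; `get_text` itself would not change.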

Either way, because the document is now "described" via the tokenizing process, the presentation of the data can now be separated from the logic required for the formatting of it.

Well, that pretty much covers the "Darren" interpretation of Tokenizing. I should add, however, that anyone with compiler or interpreter experience will probably have a different interpretation, where the term generally refers to the "substitution" of text rather than marking or adding to text.

I always think of "tokenizing" as the recognition of tokens. So, the string: "This is a nice blog," is tokenized by your brain, by chunking out the various words from the mess of characters. "This" "is" "a" "nice" "blog". Perhaps I am confusing this term with lexing? I dunno, it's been a while since my last compilers class! :-)

Now that I think about it, isn't what I just described lexing, and tokenizing is saying something like, "Ok, 'This is' is a verb participle and 'nice' is an adjective and 'blog' is a noun." (Apologies for any incorrect grammar terms.) Heh, good thing I'm graduating grad school NOW before any tests on this material! :-)

Scott Mitchell - Wednesday, June 11, 2003 9:15:00 AM

Interesting stuff Dazza,

One thing I thought I should add... since you're tokenizing your text with valid XML, your string becomes an XML fragment. Now if you had multiple markup tags in there you could do something like the following...

Pseudo code:

    - If output mode = formatted
        - set node list to all "formatted" elements
    - Else
        - set node list to all "raw" elements
    - For each node in the list
        - output the innerText
        - output a linebreak

It's more work for the current solution, but it scales easily if you have more wrap modes. In fact, if you ended up with 30 different wrap modes, you could use the enum value's ToString method to name your markup nodes, and then use the same ToString method in your selectNodes() call; no extra coding required except extending the Enum.
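A rough Python sketch of that idea follows, using the standard `xml.etree.ElementTree` module in place of selectNodes(). Note the assumptions: an XML parser needs well-formed input, so the overlapping tokens from the post won't parse as they stand. This example therefore stores each mode's lines in separate, non-overlapping elements, which gives up the memory saving, and wraps them in a hypothetical `doc` root. The enum member names double as element names.

```python
import xml.etree.ElementTree as ET
from enum import Enum


class WrapMode(Enum):
    # Member names are reused as element names (the ToString idea).
    raw = 1
    formatted = 2


def get_text(fragment: str, mode: WrapMode) -> str:
    # A fragment has no single root, so wrap it in a hypothetical <doc> element.
    root = ET.fromstring(f"<doc>{fragment}</doc>")
    # Select only the elements named after the requested mode.
    nodes = root.findall(mode.name)
    return "\n".join("".join(node.itertext()) for node in nodes)


fragment = (
    "<raw>This is a paragraph of text.</raw>"
    "<raw>This is a yet another</raw>"
    "<raw>paragraph of boring</raw>"
    "<raw>old text.</raw>"
    "<formatted>This is a</formatted>"
    "<formatted>paragraph of text.</formatted>"
    "<formatted>This is a yet</formatted>"
    "<formatted>another paragraph</formatted>"
    "<formatted>of boring old</formatted>"
    "<formatted>text.</formatted>"
)
print(get_text(fragment, WrapMode.formatted))
```

Adding a wrap mode here means adding an enum member and emitting elements with that name; the selection code stays the same.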

You could also use XSLT of course, but given your dislike for it you'll be wanting to wait until a WYSIWYG editor before you try that right? :)

Tim.

Tim Walters - Wednesday, June 11, 2003 10:04:00 AM

Dear sir,

I saw your comments on various concepts, but I feel I should say that you should include a dynamic view of each example to make it more interactive.

best of luck and good luck for future

with regards

yours truly

balu

balram_online@yahoo.com - Friday, February 20, 2004 11:32:00 AM

Thanks for the great feedback balu... I'll consider revising it by adding some images.

Cheers,

- Darren

Darren Neimke - Friday, February 20, 2004 11:32:00 AM
