Tokenizing
Two friends have recently asked me to provide a definition for the term "Tokenize" as in: "I'm going to tokenize this chunk of text.", and I didn't really provide an answer. I guess that they asked me because it's a term that I've used quite a bit in the past - and in the future too no doubt ;-)
Merriam-Webster provides several definitions for the term "Token", a couple of which are:
- a distinguishing feature : CHARACTERISTIC
- a small part representing the whole
For the record, now that I've had time to give it some thought, I'd like to give an example of what I mean when I use it...
Imagine that you've been assigned the task of building a service that provides word-wrap functionality to applications. An application supplies a body of text and a LineLength property, and the service returns the text with every line formatted to be "no longer than" the LineLength limit. Additionally, subscribers to the service can toggle between word-wrap states via the Wrapped and UnWrapped values of the WrapMode enumerated datatype.
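To make the moving parts concrete, the surface of such a service might be sketched along these lines. Only LineLength, WrapMode, Wrapped and UnWrapped come from the description above; the interface name IWordWrapService and its Text, Mode and GetText members are hypothetical names chosen purely for illustration.

```vbnet
' The two word-wrap states a subscriber can toggle between.
Public Enum WrapMode
    Wrapped
    UnWrapped
End Enum

' Hypothetical service contract: supply the text and a line length,
' pick a mode, then ask for the text back in that mode.
Public Interface IWordWrapService
    Property Text As String          ' the raw body of text (assumed name)
    Property LineLength As Integer   ' maximum length of a wrapped line
    Property Mode As WrapMode        ' which version GetText should return
    Function GetText() As String
End Interface
```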
To convert raw text into formatted "wrapped" text, you decide to apply an algorithm similar to the one shown here http://www.namesuppressed.com/syneryder/code-phpwordwrap.shtml; that is:
- Find all paragraphs
- For Each paragraph
    - Remove linebreaks
    - Split on spaces
    - Enumerate the words and append them to a string
    - When the length of the string reaches the LineLength limit insert a linebreak
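A rough VB.NET sketch of those steps might look like the following. It is an illustration rather than the linked PHP routine: the module and function names (WordWrapSketch, WrapText, WrapParagraph) are made up, paragraphs are assumed to be separated by blank lines, and a word longer than the limit is simply left on a line of its own.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module WordWrapSketch

    ' Wraps one paragraph so that no line is longer than lineLength
    ' (unless a single word is itself longer than the limit).
    Function WrapParagraph(paragraph As String, lineLength As Integer) As String
        ' Treat existing line breaks as ordinary whitespace, then split into words.
        Dim separators As Char() = {" "c, ControlChars.Cr, ControlChars.Lf, ControlChars.Tab}
        Dim words As String() = paragraph.Split(separators, StringSplitOptions.RemoveEmptyEntries)

        Dim lines As New List(Of String)()
        Dim current As New StringBuilder()

        For Each word As String In words
            ' Start a new line if this word (plus a space) would pass the limit.
            If current.Length > 0 AndAlso current.Length + 1 + word.Length > lineLength Then
                lines.Add(current.ToString())
                current.Clear()
            End If
            If current.Length > 0 Then current.Append(" "c)
            current.Append(word)
        Next

        If current.Length > 0 Then lines.Add(current.ToString())
        Return String.Join(Environment.NewLine, lines)
    End Function

    ' Assumption: paragraphs are separated by one or more blank lines.
    Function WrapText(source As String, lineLength As Integer) As String
        Dim wrapped As New List(Of String)()
        For Each paragraph As String In Regex.Split(source, "(?:\r?\n){2,}")
            wrapped.Add(WrapParagraph(paragraph, lineLength))
        Next
        Return String.Join(Environment.NewLine & Environment.NewLine, wrapped)
    End Function

    Sub Main()
        Dim raw As String = "This is a paragraph of text. This is a yet another paragraph of boring old text."
        Console.WriteLine(WrapText(raw, 18))
        ' Prints:
        ' This is a
        ' paragraph of text.
        ' This is a yet
        ' another paragraph
        ' of boring old
        ' text.
    End Sub

End Module
```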
Given the following chunk of raw text:
This is a paragraph of text.
This is a yet another
paragraph of boring
old text.
A LineLength of 18 would see it formatted as:
This is a
paragraph of text.
This is a yet
another paragraph
of boring old
text.
To return the text in its original, raw format, you might store two versions of the data in private fields and simply return whichever version is requested. That is, you'd keep the raw value in one field and, after the initial "formatting", write the formatted version to another, i.e.:
    Private mstrRawValue As String
    Private mstrFormattedValue As String
    Private mCurrentState As WrapMode

    Public Function GetText() As String
        If Me.mCurrentState = WrapMode.Wrapped Then
            ' Format lazily, the first time the wrapped version is requested.
            ' IsNullOrEmpty also covers the case where the field is still Nothing.
            If String.IsNullOrEmpty(mstrFormattedValue) Then
                FormatRawText()
            End If
            Return mstrFormattedValue
        Else
            Return mstrRawValue
        End If
    End Function
That's probably fine, even though the memory required is approximately double the size of the raw string alone, but you might find it difficult to scale if the client asks for another one, or two, or twenty-two different WrapMode states, or asks you to provide an "offline" version of the formatted text!
In situations such as the one above, if I think there's a chance that I might need to perform more than one operation on a string, I'll often "tokenize" it after the first pass. When I say tokenize, I mean that I leave small, descriptive "marks" in the text that can be read at a later date to describe a given state. To show what I mean, here's the algorithm above, amended for "tokenizing":
- Find all paragraphs
- For Each paragraph
    - Split on linebreaks and wrap with "<raw>...</raw>" tokens
    - Split on spaces
    - Enumerate the words and append them to a string
    - When the length of the string reaches the LineLength limit insert a linebreak and wrap with "<formatted>...</formatted>" tokens
And, again, given the following chunk of raw text:
This is a paragraph of text.
This is a yet another
paragraph of boring
old text.
A LineLength of 18 would see it formatted as:
<raw><formatted>This is a</formatted>
<formatted>paragraph of text.</formatted></raw>
<raw><formatted>This is a yet</formatted>
<formatted>another</raw> <raw>paragraph</formatted>
<formatted>of boring</raw> <raw>old</formatted>
<formatted>text.</raw></formatted>
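For what it's worth, the marking pass might be sketched in VB.NET roughly as below. This is one possible variation rather than a definitive implementation: the names TokenizeSketch and MarkParagraph are made up, the function handles a single paragraph, and it always writes one space between words (rather than a line break) so that whatever reads the string later can decide which closing tokens become line breaks. The raw line breaks used in Main are inferred from the <raw> tokens in the example above.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module TokenizeSketch

    ' Marks up one paragraph so that both layouts can be recovered later.
    ' <raw> tokens record the original lines; <formatted> tokens record
    ' the lines produced by wrapping at lineLength. The two overlap freely.
    Function MarkParagraph(paragraph As String, lineLength As Integer) As String
        Dim output As New StringBuilder()
        Dim wrappedLength As Integer = 0   ' length of the wrapped line being built
        Dim firstWord As Boolean = True

        Dim breaks As Char() = {ControlChars.Cr, ControlChars.Lf}
        Dim spaces As Char() = {" "c}
        Dim rawLines As String() = paragraph.Split(breaks, StringSplitOptions.RemoveEmptyEntries)

        For Each rawLine As String In rawLines
            Dim words As String() = rawLine.Split(spaces, StringSplitOptions.RemoveEmptyEntries)
            For i As Integer = 0 To words.Length - 1
                Dim word As String = words(i)
                Dim startsRawLine As Boolean = (i = 0)
                Dim startsWrappedLine As Boolean = firstWord OrElse wrappedLength + 1 + word.Length > lineLength

                If Not firstWord Then
                    ' Close whatever ended with the previous word, then add the space.
                    If startsRawLine Then output.Append("</raw>")
                    If startsWrappedLine Then output.Append("</formatted>")
                    output.Append(" "c)
                End If

                ' Open whatever starts with this word.
                If startsRawLine Then output.Append("<raw>")
                If startsWrappedLine Then
                    output.Append("<formatted>")
                    wrappedLength = word.Length
                Else
                    wrappedLength += 1 + word.Length
                End If

                output.Append(word)
                firstWord = False
            Next
        Next

        ' Close the last raw line and the last wrapped line.
        If Not firstWord Then output.Append("</raw></formatted>")
        Return output.ToString()
    End Function

    Sub Main()
        Dim raw As String = String.Join(Environment.NewLine, New String() {
            "This is a paragraph of text.",
            "This is a yet another",
            "paragraph of boring",
            "old text."})
        Console.WriteLine(MarkParagraph(raw, 18))
    End Sub

End Module
```

The result has the same token placement as the example above, except that a single space sits where the example shows a line break.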
This lets you store a single copy of the text rather than two, because the document is now self-describing of its states; the saving only approaches half when the tokens are short relative to the text they wrap. The algorithm for returning the text looks like this:
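    ' Requires: Imports System.Text.RegularExpressions
    Private mstrStoredValue As String
    Private mblnIsMarked As Boolean = False
    Private mCurrentState As WrapMode

    Public Function GetText() As String
        Dim tmpStrng As String

        ' Mark up the stored text once, on first use.
        If Not mblnIsMarked Then
            FormatRawText()
        End If

        If Me.mCurrentState = WrapMode.Wrapped Then
            ' Drop the opening <formatted> tokens and both <raw> tokens,
            ' then turn each closing </formatted> token into a line break.
            tmpStrng = Regex.Replace(mstrStoredValue, "<formatted>|</?raw>", "")
            Return Regex.Replace(tmpStrng, "</formatted>", Environment.NewLine)
        Else
            ' Drop the opening <raw> tokens and both <formatted> tokens,
            ' then turn each closing </raw> token into a line break.
            tmpStrng = Regex.Replace(mstrStoredValue, "<raw>|</?formatted>", "")
            Return Regex.Replace(tmpStrng, "</raw>", Environment.NewLine)
        End If
    End Function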
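Here is a small, self-contained sketch of the same read-back idea that can be run as a console program. It rests on a couple of assumptions of my own: the marked-up string keeps a single space between words with the tokens sitting directly against the words, and the reader swallows the space that follows a closing token so that lines don't start with a stray blank. The names ReadTokensSketch and ReadMarked are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReadTokensSketch

    Enum WrapMode
        Wrapped
        UnWrapped
    End Enum

    ' Returns one layout from a marked-up string.
    Function ReadMarked(marked As String, mode As WrapMode) As String
        ' The tokens of the requested state are kept; the other state's are dropped.
        Dim keep As String = If(mode = WrapMode.Wrapped, "formatted", "raw")
        Dim drop As String = If(mode = WrapMode.Wrapped, "raw", "formatted")

        ' Remove both tokens of the other state and the opening token of this one.
        Dim result As String = Regex.Replace(marked, "</?" & drop & ">|<" & keep & ">", "")

        ' Each remaining closing token, plus the space after it, becomes a line break.
        result = Regex.Replace(result, "</" & keep & "> ?", Environment.NewLine)

        ' The final closing token leaves a trailing line break; trim it.
        Return result.TrimEnd()
    End Function

    Sub Main()
        Dim marked As String =
            "<raw><formatted>This is a</formatted> " &
            "<formatted>paragraph of text.</raw></formatted> " &
            "<raw><formatted>This is a yet</formatted> " &
            "<formatted>another</raw> <raw>paragraph</formatted> " &
            "<formatted>of boring</raw> <raw>old</formatted> " &
            "<formatted>text.</raw></formatted>"

        Console.WriteLine(ReadMarked(marked, WrapMode.Wrapped))
        Console.WriteLine("---")
        Console.WriteLine(ReadMarked(marked, WrapMode.UnWrapped))
    End Sub

End Module
```

Adding a third state then means adding a third token pair and one more branch when choosing what to keep, rather than storing yet another copy of the text.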
Either way, because the document is now "described" by the tokenizing process, the presentation of the data can be separated from the logic required to format it.
Well, that pretty much covers the "Darren" interpretation of Tokenizing. I should add, however, that anyone with compiler or interpreter experience will probably have a different interpretation, where the term generally refers to the "substitution" of text rather than marking or adding to it.