Natural Language & AI: Dissecting the process of reading.
It has been a few years since I've seen the reports on how people read and, more specifically, how speed readers get so darn fast, but I recently got an email that demonstrated the concept rather well. Rather than explain the research study in plain words, it explained the research study in plain words! Only they weren't exactly plain. Each word consisted of the first and last letters completely in place, but the remainder were munged and out of place. Strangely enough, you read the email without much thought as to the out of place characters. In fact, the document is rather easy to read. As an example:
It has been a few yreas scine Iv'e seen the rortpes on how popele raed and mroe seifcpically how seped redares get so dran fsat, but I rlecenty got an eaiml taht darmtteonsed the cpeonct rtaher wlel. Rthear tahn elaipxn the rseearch sutdy in pailn wrdos, it elxpained the rseaerch sdtuy in pilan wodrs! Olny tehy wrnee't eltxacy pialn. Ecah wrod ctsiseond of the frist and lsat lreetts cmpetolely in pcale, but the rienamder wree mngeud and out of palce. Sengtraly eonugh you raed the eamil wtihout mcuh toughht as to the out of pclae ccehtarars. In fcat the deumocnt is rthaer esay to raed. As an eamxple:
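For the curious, the transform itself is small enough to sketch in Python. This is not the program that produced the paragraph above; the names `scramble_word` and `scramble_text` are made up for illustration, and the sketch assumes a "word" is any run of ASCII letters, so punctuation and apostrophes stay where they are.

```python
import random
import re


def scramble_word(word, rng):
    """Shuffle the interior letters of one word, keeping first and last fixed."""
    if len(word) <= 3:
        # Zero or one interior letter: nothing to shuffle.
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text, seed=None):
    """Scramble every run of ASCII letters in text; everything else is untouched."""
    rng = random.Random(seed)  # pass a seed for repeatable output
    # Assumption: a "word" is a run of A-Z/a-z, so "I've" is treated as "I" and "ve".
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(0), rng), text)


if __name__ == "__main__":
    print(scramble_text("Strangely enough you read the email without much thought."))
```

Because the shuffle is random, a word can occasionally come out unchanged, and short words (three letters or fewer) always do.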
Don't ask me for the program, since I don't find the process interesting enough to post readable code. What I will point out is that if, as humans, we really only use the first and last characters of a word initially, then there must be some other properties or features that help us quickly identify words. If we can find out exactly what those are, we can probably write better software. The people over at Google and Yahoo most likely have a nice head start. Mathematically, the first and last characters act as a reduction, letting us partition our list of words into some number of groups:
26^2 = 676 possible combinations for the first and last characters.
470,000 words in the unabridged dictionary
As you can see, breaking down by first and last character doesn't buy us all that much in terms of reducing the problem set: spread evenly, 470,000 words over 676 groups is still about 695 words per group. Or does it? Humans don't really know all 470,000 words. Hell, a bunch of those would seem pretty strange, even incorrect. We really only use a few thousand, so across the 676 possible combinations there are only a few words left in each group by the time we are done parsing just the first and last characters. If we combine that with relative word lengths, maybe we don't even need the letters in the middle. There are probably more patterns to exploit, namely defining features of the character shapes, such as curvy, tall, or short.
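The grouping idea is easy to try out. Here is a minimal Python sketch, assuming lowercase comparison and non-empty words. The helper names `signature` and `partition` and the tiny `sample` list are invented for illustration; a real test would load a frequency-ranked word list instead.

```python
from collections import defaultdict


def signature(word, use_length=True):
    """Reduce a non-empty word to (first letter, last letter[, length])."""
    w = word.lower()
    return (w[0], w[-1], len(w)) if use_length else (w[0], w[-1])


def partition(words, use_length=True):
    """Group words that share the same signature."""
    buckets = defaultdict(list)
    for w in words:
        buckets[signature(w, use_length)].append(w)
    return dict(buckets)


# Hypothetical sample; stands in for the "few thousand" words people actually use.
sample = ["read", "road", "raid", "reports", "rather", "reader",
          "plain", "place", "people", "words", "wired"]

if __name__ == "__main__":
    for use_length in (False, True):
        groups = partition(sample, use_length)
        largest = max(len(g) for g in groups.values())
        print(f"use_length={use_length}: {len(groups)} groups, largest has {largest} words")
```

On this sample, adding length separates "place" from "people", but "read", "road", and "raid" still collide, which is where extra features such as letter shape would have to do the work.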
Why in the hell would I even bring this up? Well, I figured it tied in with some of the other natural language posts, like the telephone number to words algorithm. I'm also working on a neat little social game that involves a lot of dictionary processing to see how little virtual people pass around rumors. It is not easy to let a rumor grow, shrink, be remembered, and/or change how strongly it affects the various people of an area. I think understanding how we munge things mentally can help define some basic transforms, or perhaps I should call them compressions, that occur while people transfer and remember information. If you think of rumors as a fixed asset, with each person allowed only a certain amount of space to store them, then only the important properties will be stored. Other properties may remain, but in reduced form. Each player may store a small dictionary and reduce words to look-ups. In the case of a redundant look-up, where more than one word fits the same entry, the character has to guess based on context... Rumors .NET is low on my radar for now, but it isn't that much work. It's one of those games I go back to every now and then as I get the time, and I find it fun to play with the algorithms. If I get over the fact that the graphics suck, I might give you all a glance.
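To make the look-up idea concrete, here is a toy Python sketch. It is not code from Rumors .NET. The `Villager` class, the `key` function, and the "something" placeholder are all invented for illustration. Truncating to a word budget stands in for "only important properties are stored", and a random pick stands in for guessing from context. Rumors are assumed to be plain words separated by spaces.

```python
import random


def key(word):
    """Compress a non-empty word to (first letter, last letter, length)."""
    w = word.lower()
    return (w[0], w[-1], len(w))


class Villager:
    def __init__(self, vocabulary, capacity, seed=None):
        # Small personal dictionary: compressed key -> words this villager knows.
        self.lookup = {}
        for w in vocabulary:
            self.lookup.setdefault(key(w), []).append(w)
        self.capacity = capacity  # how many words of a rumor fit in memory
        self.rng = random.Random(seed)

    def remember(self, rumor):
        """Keep only the first `capacity` words, each reduced to its key."""
        return [key(w) for w in rumor.split()[:self.capacity]]

    def retell(self, memory):
        """Expand stored keys back into words, guessing when a key is ambiguous."""
        words = []
        for k in memory:
            candidates = self.lookup.get(k)
            if not candidates:
                words.append("something")  # word outside this villager's dictionary
            else:
                # Redundant look-up: a random pick stands in for a context-based guess.
                words.append(self.rng.choice(candidates))
        return " ".join(words)


if __name__ == "__main__":
    v = Villager(["the", "mayor", "stole", "store", "gold"], capacity=3, seed=1)
    print(v.retell(v.remember("the mayor stole gold")))
```

Here "stole" and "store" share a key, so the retold rumor may drift from theft to shopping, and "gold" is lost entirely because it does not fit in memory.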