Using Geometry Differentiation for Hand-Written Character Segmentation

Sunday, May 31, 2009

.NET AI Math OCR

Intro:

The most important part of a Persian OCR system is the segmentation process. Because characters in Persian script are joined together, segmentation is not easy, and it becomes even harder for hand-written words, because a single word can be written by hand in a great many ways.

Note: This algorithm targets Persian OCR systems because these systems have no reliable algorithm for this purpose, which is why no Persian OCR system recognizes hand-written words with high accuracy. I've decided to share the algorithm here because it may give you ideas for other purposes.

The Idea:

The idea is simple: this algorithm tries to imitate one stage of the recognition process of the human brain. When you see a word, your brain first tries to match the whole word against its database; if nothing is found, it tries to segment the word into characters by comparing the word's shape with what it has learned, and so on.

So, the questions are: how does the brain store shapes, and how does it match shapes against its stored data?

A1: Depending on your language, the brain stores some principles while it is being trained. For example, in English, the principle that the brain uses to segment characters is the space between characters and/or, in contiguous (joined) words, the connection between matched shapes (which is done after recognition).

The principle in Persian is the difference between the ascents and descents in shapes. This is easy to see: when a character lacks these features, a human cannot recognize it on its own. It can still be recognized, but only by guessing from the other characters around it.

A2: As mentioned above, matching is based on the differences between ascents and descents in shapes. One important point is that at this stage we don't really have to focus on the "dot" and the "kashida" (some Persian characters, like "ب" and "ک", include these special components). The brain may or may not focus on them, but this algorithm ignores them.

The Algorithm:

Stage 1: Generate a multi-level horizontal histogram of the whole word's image. This histogram helps us separate dots and kashidas.

Stage 2: Generate a vertical histogram of the image. This helps us find the image's baseline, and combined with the multi-level histogram it is used to decide which pixels below or above the baseline to ignore.
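The post does not say how the "multi-level" histogram is built or exactly how the baseline is read from the histograms, so the following is only a minimal sketch of the underlying idea: plain projection profiles of a binary image, in standard C++. The names Image, rowProfile, columnProfile and estimateBaseline are hypothetical, and taking the row with the most ink as the baseline is an assumed (though common) heuristic, not necessarily what the original implementation does.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A binary image: img[y][x] is true for an ink (foreground) pixel.
using Image = std::vector<std::vector<bool>>;

// Number of ink pixels in each row of the image.
std::vector<int> rowProfile(const Image& img) {
    std::vector<int> profile(img.size(), 0);
    for (std::size_t y = 0; y < img.size(); ++y)
        profile[y] = static_cast<int>(
            std::count(img[y].begin(), img[y].end(), true));
    return profile;
}

// Number of ink pixels in each column of the image.
std::vector<int> columnProfile(const Image& img) {
    std::size_t width = img.empty() ? 0 : img[0].size();
    std::vector<int> profile(width, 0);
    for (const auto& row : img)
        for (std::size_t x = 0; x < width && x < row.size(); ++x)
            if (row[x]) ++profile[x];
    return profile;
}

// ASSUMPTION: the baseline is the row that contains the most ink.
// Returns -1 for an empty image.
int estimateBaseline(const Image& img) {
    std::vector<int> rows = rowProfile(img);
    if (rows.empty()) return -1;
    return static_cast<int>(
        std::max_element(rows.begin(), rows.end()) - rows.begin());
}
```

Rows far from the estimated baseline that hold only a little ink are natural candidates for dots, which is presumably the kind of decision the combined histograms are used for.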

Stage 3: Convert all of the remaining pixels to curves. For example, the following part of the image can be converted to sinh(x^2).

Stage 4: Take the first derivative with respect to x. For the example above, the derivative is 2x cosh(x^2).

Stage 5: Solve the equation 2x cosh(x^2) = 0. Since cosh is never zero, the only solution is x = 0, and sinh(0) = 0, so the point (0, 0) is the reference point of the curve.
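How the pixels are fitted to a curve is not described in the post, so here is only a rough numerical sketch of Stages 4 and 5, assuming the curve is already available as a list of sampled points. Instead of solving the derivative equation symbolically, it looks for samples where the discrete slope changes sign. The names Point, sampleSinhSquared and stationaryPoints are hypothetical, and this simplification finds only local minima and maxima, not flat inflection points.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point {
    double x;
    double y;
};

// Samples y = sinh(x^2) at x = -halfSteps*step, ..., 0, ..., +halfSteps*step.
std::vector<Point> sampleSinhSquared(int halfSteps, double step) {
    std::vector<Point> curve;
    for (int i = -halfSteps; i <= halfSteps; ++i) {
        double x = i * step;
        curve.push_back({x, std::sinh(x * x)});
    }
    return curve;
}

// Returns the samples where the slope changes sign, i.e. a discrete
// stand-in for the solutions of f'(x) = 0 at local extrema.
std::vector<Point> stationaryPoints(const std::vector<Point>& curve) {
    std::vector<Point> result;
    for (std::size_t i = 1; i + 1 < curve.size(); ++i) {
        double left = curve[i].y - curve[i - 1].y;   // slope coming in
        double right = curve[i + 1].y - curve[i].y;  // slope going out
        if (left * right < 0.0) result.push_back(curve[i]);
    }
    return result;
}
```

For the sinh(x^2) example, calling stationaryPoints(sampleSinhSquared(4, 0.25)) yields the single point (0, 0), which agrees with the analytic result of Stage 5.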

Stage 6: Add this reference point to the list, and store the x and y differences between point n and point n-1.

Stage 7: Calculate the averages of the x and y differences between reference points.
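Stages 6 and 7 could be sketched as follows. RefPoint, AverageGap and averageGaps are hypothetical names; the two results play the role of segHorAvg and segVerAvg in Stage 8. Using absolute differences is an assumption, since the post does not say whether the stored differences are signed.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

struct RefPoint {
    int x;
    int y;
};

struct AverageGap {
    double horizontal;  // plays the role of segHorAvg
    double vertical;    // plays the role of segVerAvg
};

// Averages the x and y differences between each reference point n and
// its predecessor n-1. ASSUMPTION: absolute differences are used.
AverageGap averageGaps(const std::vector<RefPoint>& points) {
    AverageGap avg{0.0, 0.0};
    if (points.size() < 2) return avg;  // no gaps to average
    for (std::size_t n = 1; n < points.size(); ++n) {
        avg.horizontal += std::abs(points[n].x - points[n - 1].x);
        avg.vertical += std::abs(points[n].y - points[n - 1].y);
    }
    double gaps = static_cast<double>(points.size() - 1);
    avg.horizontal /= gaps;
    avg.vertical /= gaps;
    return avg;
}
```

For example, the points (10, 5), (7, 5), (1, 9) have x differences 3 and 6 and y differences 0 and 4, so the averages are 4.5 and 2.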

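Stage 8: Walk through the reference points and select the slice points:

    for (int i = 0; i < refpoints->Count; i++)
        for (int j = i + 1; j < refpoints->Count; j++)
            // Candidate cut: point j lies to the left of point i by more than
            // segHorAvg, or it differs vertically by more than segVerAvg
            // while lying more than 2 to the left of point i.
            if (refpoints[i].X - refpoints[j].X > segHorAvg ||
                (Math::Abs(refpoints[i].Y - refpoints[j].Y) > segVerAvg &&
                 refpoints[i].X - refpoints[j].X > 2)) {
                // Add the very first reference point once, the first time
                // a candidate is found.
                if (i == 0 && !beg) {
                    slices->Add(refpoints[i]);
                    beg = true;
                }
                if (!IsDescendingAscending(rp, i, j))
                    slices->Add(refpoints[j]);
                else
                    continue; // candidate rejected: try the next j
                i = j; // the outer loop's i++ then resumes at j + 1
                break;
            }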
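The Stage 8 loop above is C++/CLI and relies on IsDescendingAscending, which is not shown in the post. As a rough way to experiment with it, here is a sketch of the same loop in standard C++, where that check is passed in by the caller. RefPoint, SkipTest and selectSlices are hypothetical names, and the behaviour of the predicate is left entirely to the caller because the original is unknown.

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <vector>

struct RefPoint {
    int x;
    int y;
};

// Caller-supplied stand-in for the post's IsDescendingAscending helper.
using SkipTest = std::function<bool(const std::vector<RefPoint>&,
                                    std::size_t, std::size_t)>;

std::vector<RefPoint> selectSlices(const std::vector<RefPoint>& refpoints,
                                   double segHorAvg, double segVerAvg,
                                   const SkipTest& isDescendingAscending) {
    std::vector<RefPoint> slices;
    bool beg = false;  // becomes true once the first point has been added
    for (std::size_t i = 0; i < refpoints.size(); ++i) {
        for (std::size_t j = i + 1; j < refpoints.size(); ++j) {
            int dx = refpoints[i].x - refpoints[j].x;
            int dy = std::abs(refpoints[i].y - refpoints[j].y);
            if (dx > segHorAvg || (dy > segVerAvg && dx > 2)) {
                if (i == 0 && !beg) {
                    slices.push_back(refpoints[i]);
                    beg = true;
                }
                if (isDescendingAscending(refpoints, i, j))
                    continue;  // candidate rejected: try the next j
                slices.push_back(refpoints[j]);
                i = j;  // the outer ++i then resumes at j + 1
                break;
            }
        }
    }
    return slices;
}
```

With reference points ordered right to left, such as x = 100, 98, 80, 78, 60 on one row, segHorAvg = 10, and a predicate that never rejects, the selected slice points are the ones at x = 100, 80 and 60.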

Real-World Results:

In the following images, the green line is the image's baseline, the red points are the calculated reference points, and the blue lines are the slices of the image.

(You can see how the dots are ignored.)

94 milliseconds. (Debug mode)

93 milliseconds. (Debug mode)

Author: Mehran Toosi
