Quick attempt at a validating roman numeral parser... Lots of gotchas.

Got asked about a roman numeral parser during an interview. I have to say that I don't mind when the process of obtaining employment plays into my strengths. The process was quite similar to a previous process where I wrote a spoken numerics converter. Not only that, there were many similar qualities to my int parsing routines. With that in mind I think I did fairly well. The goal at the time was to produce a routine to validate numbers up to roman numeral 30 or XXX. Didn't take long, but in the end, I had left out many different validation techniques. I really wanted to revisit the problem since I had the code correctly written in my mind. Check the algorithms out, they should handle just about anything you can throw at them at this point. If you find issues, please feel free to submit your problems, since I'd love to solidfy things a bit more. Apparently roman numeral parsing has great application in reading dates.

Roman Numeral Parsing: Code Only: Bidirectional roman numeral parsing. [EDIT: Added alternate parsing routines and performance fixes]
Integer to Spoken Numerics: Code-Only: int/long/double conversion to Spoken Numerics
Phone Number to Words: Trying my hand at the old Phone number to Words teaser project!
Integer Parsing: DWC.Algorithms.NumberUtilities

Published Sunday, October 24, 2004 10:02 PM by Justin Rogers
Filed under:

Comments

Monday, October 25, 2004 12:55 PM by Mike Lorengo

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

A better solution would be to create an array of string [30] lookup = { "0", "I", "II" ... };
Then do a match on the input and return the index. Quick and easy :)
Monday, October 25, 2004 12:59 PM by Justin Rogers

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

Show me!

There is a partial test harness in the code, and I can upload the performance harness and boundary test harness to make sure you catch all of the invalid input.

Remember, on my blog you can't make a statement without backing it up with proof.
Monday, October 25, 2004 1:12 PM by Justin Rogers

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

My goodness, you were speaking more to the design of the original question. I would postulate, in your defense, that they'd accept such an algorithm, and then ask you to write a more general parser. This isn't just based on assumption, but rather the level of detail that is required during any sort of development oriented interview at Microsoft.

Sorry to seem so rude in the comment. Now that I know more of what you are speaking towards, yes, it would be possible to do an IndexOf over the array, and return the index. Just be forewarned and prepared when they ask you to rewrite the parser to accept any value from 1 to Int32.MaxValue and make sure you bring lots of dry erase markers for that huge array.
Monday, October 25, 2004 2:03 PM by Justin Rogers

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

That article was actually fairly enjoyable. Oddly enough I'm not sure I could code using that technique, but then I also yell at people for build fixing (build fixing is where you quickly write code, compile to get bugs, then fix). At least the test-driven technique isn't quite compile fixing. Thinking back I did write an entire page or two full of invalid inputs that I wanted to ensure the function got rid of, so maybe I do more of that than I think?

Looking at that post our integer to roman numeral routine could be more table driven if we chose. In fact, I'll add a table driver right along-side the existing routine and upload in a bit.

Now, the reason int2Rom is easy to write as a table driven approach is that there is no ambiguity in the handling of an integer. Aka, the interger is in the right format, no matter how you slice it.

With the reverse rom2Int the input string can be completely bogus and not represent a roman numeral at all. Even worse, it can be a malformed roman numeral. A table driven approach would work to a point, and then require that you bolt on validation for the basic things like "VV" which is invalid or "DD" which is invalid.
Monday, October 25, 2004 2:31 PM by Derick Bailey

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

I wonder if it would be possible to do roman numeral validation using a regular expression? And then, could you use a single replace parameter in the expression, to convert the numeral to an integer?
Monday, October 25, 2004 2:48 PM by Justin Rogers

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

Short answer is No! I don't believe so. This falls into a class of problems that I've been thinking about for a while in terms of how powerful regular expressions could be. So let's talk a bit about replacements...

1. The standard replacement is to take a Match region and replace it with Groups out of that match.
2. Replacements don't allow mappings in the replacement string. An example would be a conditional capture that if captured, replace with 1, etc... You can map 1 thing (aka replace a Match with some fixed string).
3. You can only construct captures from material in the string. You can't create new material at any point...

Longer answer is... Yes/Maybe! It would be complex, but you can add material to the string before you do a replacement. By appending the digits 0 through 9 as the first ten characters, they can now be used to construct captures.

So what kinds of captures do we have? Well, We use conditionals that look for a pattern in the string, and then create a new capture group that is either empty or filled with a number... By concatenating all of these items in the replacement string, you can possibly retrieve a number back.

The maximum number achievable will be fixed to how you build expression. Remember that thousands can walk on forever, and it isn't possible to *count* them and then use that information to more properly add digits... Actually, I take that back, you kind of CAN do that too...

I might put some work into this. I imagine the place to use it would be in client-side script, however, we could just as easily convert our parsers to Javascript...
Monday, October 25, 2004 4:08 PM by Justin Rogers

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

Just to give an idea of what in the hell I'm yammering about above:

((?<two>II)|(?<one>I))(?(two)(?=.*(?<ones>2))|(?(one)(?=.*(?<ones>1))|(?=.*(?<ones>0))))012

Notice that we match 012 at the end... The input string needs to be of the form II012 or I012 for the above example to work... So what we assume is that we'll be appending 012 to the end of any roman numeral BEFORE doing the replacement.

We'll in essence take 9 sub-patterns to represent the ones, 9 sub-patterns for the tens, 9 sub-patterns for the hundreds... This is a first chance at this after only thinking a few minutes, so if you spot a short-cut that I'm missing please point it out.

Being a tool driven world we live in, I now know the intricacies of the expression, enough such that I can dynamically generate the full pattern with just a small amount of code... Yay!
Sunday, November 14, 2004 9:45 PM by Edward G. Nilges

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

Quick question...have you considered attacking the problem of creating a Roman Numeral parser using Backus Naur Form? Just thinking about the problem in these terms may clean up your code a bit by associating precisely the elements of the RN with "what is to be done".

I say this not to tease you (for I consider our discussion of my book over) but because I have found BNF useful in surprising areas including the parsing of international phone numbers.
Monday, November 15, 2004 6:17 AM by Edward G. Nilges

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

A quick approximation:

romanNumeral = [ thousands ] [ hundreds ] [ fifty ] [ tens ] [ five ] [ ones ]
thousands = M [ thousands ]
hundreds = C [ hundreds ]
fifty = [ decrementByTen ] L
tens = ( decrementByOne X ) | straightTens
straightTens = X [ straightTens ]
five = [ decrementByOne ] V
ones = I [ ones ]
decrementByTen = X
decrementByOne = I

The grammar is clearly ambiguous, but this shouldn't scare us because the ambiguity is resolved in a known limited context.

For example, a "preprocessor" could resolve all incidences of a I, acting as a decrement operator by appearing in other than the first position. It could then replace the decrement operator and the next symbol by a bogus symbol representing IV (four) in the case of decrementing five by one.

Once ambiguity is resolved, you start to code one procedure for each nonterminal, passing the input RN and an index.

The ambiguities don't have to be resolved by preprocessing and transcribing. Instead, a routine (get next token) looks at I to see if it is followed by other-than-I, and returns a special symbol in place of the input symbol.

I will implement a code solution and post it this weekend.

The above grammar doesn't let you decrement the C or M symbols and needs further work.

Monday, March 2, 2009 7:21 PM by ...

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

Gute Arbeit hier! Gute Inhalte.

Saturday, March 14, 2009 2:35 PM by ...

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

Sehr wertvolle Informationen! Empfehlen!

Wednesday, March 25, 2009 6:06 AM by Figure_Flattening_Garter_Belts

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

The Best Figure Flattening Garter Belts Online Available

Tuesday, June 26, 2012 5:58 AM by Bluetooth mouse

# re: Quick attempt at a validating roman numeral parser... Lots of gotchas.

Wow, I never really knew I needed to look at my net worth. It is really interesting the great things you can learn by searching the internet a little bit. Thanks so much for the valuable information.