Programming idiom to parse a string in multiple-passes

I'm working on a Braille translation library, and I need to translate a string of text into braille. I plan to do this in multiple passes, but I need a way to keep track of which parts of the string have been translated and which have not, so I don't retranslate them.

I could always create a class which would track the ranges of positions in the string which had been processed, and then design my search/replace algorithm to ignore them on subsequent passes, but I'm wondering if there isn't a more elegant way to accomplish the same thing.

I would imagine that multi-pass string translation isn't all that uncommon, I'm just not sure what the options are for doing it.

13.10.2009 15:32:27
What sort of tasks do you perform on the different passes? Would it not be possible to have the 2nd pass tasks sufficiently different that they don't clash? Eg 1st pass, substitute char for char latin for braile, 2nd pass, substitute long-hand braile for short-hand braile?
Andrew M 13.10.2009 15:37:49
Interesting ... I had thought Braille was character for character, and then it would be a simple substitution. This isn't the case? (Just curious)
John 13.10.2009 15:45:56
Hehe - my knowledge of braille comes mostly from the book "code" by Charles Petzold, but I believe there are abbreviations or conventions, that effectively shorten spellings, and generally make the script terser.
Andrew M 13.10.2009 15:48:03
Wikipedia covers it quite well: it's called Grade 2 braille, and uses 'spare' characters for letter combinations or short words
Andrew M 13.10.2009 15:50:26
Braille can be almost one-for-one, if you use what is called "uncontracted Braille", which almost no one does because it takes up way too much space. Contracted Braille adds nearly 200 additional short-hand symbols to uncontracted Braille. Unfortunately, I don't believe that there is a way to order the translation passes in such a way that subsequent passes will never need to refer to already translated text. In short, I need a way to look at the untranslated chars which are adjacent to the bit I am translating, whether or not those adjacent chars have been translated yet or not.
Aaron 13.10.2009 16:05:07

A more usual approach would be to tokenize your input, then work on the tokens. For example, start by tokenizing the string into a token for each character. Then, in a first pass generate a straightforward braille mapping, token by token. In subsequent passes, you can replace more of the tokens - for example, by replacing sequences of input tokens with a single output token.

Because your tokens are objects or structs, rather than simple characters, you can attach additional information to each - such as the source token(s) you translated (or rather, transliterated) the current token from.

13.10.2009 16:11:46

Check out some basic compiler theory..

5.01.2010 17:44:24