Fit Reading (Part 2 of 8)

Inside Parse

I spent last time on tests only – this time I want to go inside the Parse class.

The top of the class reveals strings for leader, tag, body, end, and trailer, as expected. There are also parts and more, which are Parses. A skim through the class, looking for big routines, shows that the constructor, findMatchingEndTag(), removeNonBreakTags(), print(), and footnote() methods seem to be the biggest and most complex.

Footnote? What’s that? The tests didn’t mention it! Looks odd – it’s not referenced inside fit anywhere; rather, it’s used by some clients, typically after a call to wrong(). It appears to create a file Reports/footnote/n.html, and prints the parse to it.

My strategy today is to chew off the routines that are small and/or simple, then go back and figure out the big routines. I have two things I’m trying to understand: “What happens with nested tables?” and “How do I insert stuff into the middle of a Parse?” (I need the latter for fixtures that want to report a little more nicely.) I guess I have a third question too – “how are spaces handled?” This arises because I saw a note on the mailing list that says there are differences in the various fit implementations.

Small Fry

There are some small and simple recursive routines: size(), last(), leaf(), and at(). There are a bunch of little routines for escaping characters and dealing with HTML; I’ll come back to those.
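To fix the shape of these routines in my head, here is a minimal sketch of how the four recursive ones could look. Node is a hypothetical stand-in for Parse that keeps only the more and parts links; the clamping behavior of at() is my assumption, not something I have confirmed in the source.

```java
// Hypothetical stand-in for Parse: only the two links matter here.
class Node {
    Node parts; // first child, one level down
    Node more;  // next sibling at the same level

    Node(Node parts, Node more) {
        this.parts = parts;
        this.more = more;
    }

    // Count this node plus every sibling after it.
    int size() {
        return more == null ? 1 : more.size() + 1;
    }

    // Follow the sibling chain to its end.
    Node last() {
        return more == null ? this : more.last();
    }

    // Descend through first children until there are none left.
    Node leaf() {
        return parts == null ? this : parts.leaf();
    }

    // The i-th sibling; assumed to clamp to the last one if i runs past the end.
    Node at(int i) {
        return (i == 0 || more == null) ? this : more.at(i - 1);
    }
}
```

Each method recurses along exactly one link, which is why they all stay one-liners.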

There’s a little helper routine addToBody() that just appends text to the body. That doesn’t sound like much – and it’s a one line routine, basically “body = body + text”, but a search for usages shows that this is what fixtures use to get their info into the output. (If a fixture wants to show a cell’s expected value, it uses this method to append some HTML text to the cell’s Parse.) That answers one of my questions. I’ll have to play with it to learn it better.

The print() routine is longer than these one-line methods, but looks straightforward. It writes the Parse out: leader, tag, then either body or parts, the end tag, and either the more or the trailer. I knew body and parts were mutually exclusive; I hadn’t realized that more and trailer are exclusive as well. I wonder if body and trailer appear together, and parts and more appear together? If so, I wonder about splitting Parse up so subclasses can deal with that difference. It’s not a huge class; may not be worth it.
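The output order described above can be sketched like this. Cell is a hypothetical, stripped-down class of my own (not the real Parse), and the render() helper exists only so the result is easy to inspect.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical, simplified node that mirrors the fields print() walks over.
class Cell {
    String leader = "", tag = "", body = "", end = "", trailer = "";
    Cell parts, more;

    Cell(String tag, String body, String end) {
        this.tag = tag;
        this.body = body;
        this.end = end;
    }

    void print(PrintWriter out) {
        out.print(leader);
        out.print(tag);
        if (parts != null) parts.print(out); else out.print(body); // parts win over body
        out.print(end);
        if (more != null) more.print(out); else out.print(trailer); // more wins over trailer
    }

    // Convenience for experiments: print into a string.
    String render() {
        StringWriter sw = new StringWriter();
        PrintWriter out = new PrintWriter(sw);
        print(out);
        out.flush();
        return sw.toString();
    }
}
```

In this sketch a trailer set on a node that also has a more is simply never written, which is the exclusivity I noticed.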

Constructors

That leads me up to the first constructor – Parse(String tag, String body, Parse parts, Parse more). Note that it has parameters for both body and parts. So much for my theory of a paragraph ago. But it’s close – I did a search and found 15 places that called this constructor. All but three used either body or parts exclusively.

One of the ones that didn’t is in fat.Table. It is using this constructor to copy an existing Parse. That looks misplaced: if we need to copy these, then we can put a method on Parse to do so. A second place is fat.FixtureNameFixture. The GenerateRowParses() method passes in a string for “body” and a Parse for “more”. (So we have an example where “parts” and “more” don’t go together.) I can’t tell why it does this on a quick look. The final place is eg.AllFiles.td(), which also uses “body” and “more” together.

The first constructor passes in all the pieces separately. Then there are a few constructors that default the tags and so on, and delegate to the main constructor, which actually parses some HTML. That constructor looks for several key positions in the input: the start of the target tag, the end of the target tag, the start and end of the corresponding end tag, and the start of the rest of the text.

I see that the first search starts at the beginning of the string, rather than at “offset”. That seems odd.

We’ll have to double-check how findMatchingEndTag() works, but the rest of the constructor looks straightforward: if there are more tag levels, turn the body into a new Parse (and set body to null). If there’s a nested table, parse the table and set the body to “”. (That seems odd also, like it’s throwing away any non-table stuff. I’m not sure what the “” body accomplishes either.) Finally, if there are more tags at the current level, null out the trailer and parse the remaining tags into “more”.

FindMatchingEndTag() looks like an implementation of the parenthesis-balancing rule – add 1 every time you see a left parenthesis, subtract 1 every time you see a right parenthesis. If you’re balanced, you’ll have a net of 0.
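Here is a minimal sketch of that balancing idea applied to tags. TagBalance and matchingEnd() are hypothetical names, and this is my simplification rather than the code in Parse; it assumes the input is already lower-cased and that no other tag name starts with the same letters.

```java
// Hypothetical helper illustrating depth counting over nested tags.
class TagBalance {
    // Returns the index of the "</tag" that closes a tag opened just before `from`,
    // or -1 if the input runs out before the count gets back to zero.
    static int matchingEnd(String lc, int from, String tag) {
        String open = "<" + tag;
        String close = "</" + tag;
        int depth = 1; // we are already inside one open tag
        int pos = from;
        while (true) {
            int nextOpen = lc.indexOf(open, pos);
            int nextClose = lc.indexOf(close, pos);
            if (nextClose < 0) return -1; // unbalanced input
            if (nextOpen >= 0 && nextOpen < nextClose) {
                depth++; // a nested open tag comes first
                pos = nextOpen + open.length();
            } else {
                depth--; // a close tag comes first
                if (depth == 0) return nextClose;
                pos = nextClose + close.length();
            }
        }
    }
}
```

Starting just after the outer opening tag of a table that contains another table, this skips the inner closing tag and returns the position of the outer one.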

So I have an answer about nested tables: it’s trying to handle them. I’m seeing a little weirdness that makes it look like a nested table is the only thing retained inside a cell. But at least I know it’s trying. I’ll make some tests to fill in what I’m seeing. I only have a few minutes left, so I want to move on to the htmlToText() part of the code.

Html to Text

The htmlToText() routine has four steps: normalize line breaks, remove non-break tags, condense white space, and unescape. Normalizing line breaks turns <br> and strings of <p> tags into <br />.

Removing non-break tags is a little tricky-looking, but it basically squeezes out tags other than the normalized break tags we just produced. The method “looks forward” to see an end-of-tag; if it’s there, it trims out the tag and looks at the rest of the string.
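A minimal sketch of that look-forward loop, as I understand it. TagSqueezer and squeeze() are hypothetical names of my own, and the sketch assumes breaks have already been normalized to exactly "<br />".

```java
// Hypothetical helper: drop every tag except the normalized break tag.
class TagSqueezer {
    static String squeeze(String s) {
        int i = 0;
        while ((i = s.indexOf('<', i)) >= 0) {
            int j = s.indexOf('>', i + 1); // look forward for the end of the tag
            if (j < 0) break;              // a lone '<' with no '>' after it: stop
            if (s.substring(i, j + 1).equals("<br />")) {
                i = j + 1;                 // keep the break and move past it
            } else {
                s = s.substring(0, i) + s.substring(j + 1); // squeeze the tag out
            }
        }
        return s;
    }
}
```

Note that a stray "<" with no closing ">" is left alone, so text like "1 < 2" survives.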

Condensing white space applies the rule: convert multiple blanks to a single blank, convert a “160” to a space, and convert &nbsp; to a space. I assume 160 is the code for a non-breaking space in Word’s font.
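Those three rules can be sketched as follows. Whitespace and condense() are hypothetical names, and treating any run of white space (not just blanks) as one blank is my assumption.

```java
// Hypothetical helper illustrating the three white space rules.
class Whitespace {
    static String condense(String s) {
        final char nonBreakingSpace = (char) 160;
        s = s.replaceAll("\\s+", " ");        // a run of white space becomes one blank
        s = s.replace(nonBreakingSpace, ' '); // each character 160 becomes one blank
        s = s.replace("&nbsp;", " ");         // each entity becomes one blank
        return s;
    }
}
```

Because Java's default \s does not match character 160, the non-breaking spaces are converted after the run-collapsing step, so two of them in a row stay two blanks.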

Unescaping is simple too: br tags are converted to newlines, standard entities such as &lt; are converted to their simple character, and smart quotes are converted to " or ' as appropriate.
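A minimal sketch of the unescaping step. Unescaper and unescape() are hypothetical names, and the particular set of entities handled here is my assumption about what "standard" covers.

```java
// Hypothetical helper illustrating the unescape rules.
class Unescaper {
    static String unescape(String s) {
        s = s.replace("<br />", "\n"); // normalized break tags become newlines
        s = s.replace("&lt;", "<");
        s = s.replace("&gt;", ">");
        s = s.replace("&quot;", "\"");
        s = s.replace("&amp;", "&");   // done last, so "&amp;lt;" yields "&lt;"
        s = s.replace('\u201c', '"').replace('\u201d', '"');   // curly double quotes
        s = s.replace('\u2018', '\'').replace('\u2019', '\''); // curly single quotes
        return s;
    }
}
```

The ampersand is handled last in this sketch so that an escaped entity is not unescaped twice.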

The result of all this is that text() produces the Parse in straight text form – no tags. This is what fixtures will want when they compare expected values.

Summary

I had three questions:

What happens with nested tables? They are apparently handled, although it looks like only the nested table is retained, not anything surrounding it.
How do I put stuff inside a Parse? Use the addToBody() method.
What happens with spaces? Multiple spaces get converted into one, and non-breaking spaces get converted into one space each.

I still have two open questions: why the Parse constructor doesn’t use the offset when it looks for the first tag, and how nested tables work in detail. But that’s ok; I learned a lot today.
