Chapter 3 - A View of Documents

Last updated: March 26, 1996

Documents as Categories

What does it mean to move this document to that format? This is the key question we'll address through the framework of category theory. Category theory is a branch of abstract mathematics dealing with the algebra of functions. It is concerned with objects and the mappings between them, which dovetails nicely with the problem of translating documents.

Category Theory: Background

Category theory addresses fundamental structures. It generalizes several familiar mathematical constructs: sets, graphs, groups, and others.

Pierce [] defines a category:

1.1.1 Definition. A category C comprises:
1. a collection of objects;
2. a collection of arrows (often called morphisms);
3. operations assigning to each arrow f an object dom f, its domain, and an object cod f, its codomain (we write f:A->B or A-f->B to show that dom f = A and cod f = B; the collection of all arrows with domain A and codomain B is written C(A,B));
4. a composition operator assigning to each pair of arrows f and g, with cod f = dom g, a composite arrow gof: cod f -> cod g, satisfying the following associative law:
  for any arrows f:A->B, g:B->C, and h:C->D (with A, B, C, and D not necessarily distinct),
  h o (g o f) = (h o g) o f
5. for each object A, an identity arrow idA:A->A satisfying the following identity law:
  for any arrow f:A->B, idB o f = f and f o idA = f.

Let's look at a few examples. The simplest category is 0. It has no objects and no arrows, so it satisfies conditions 3 through 5 vacuously. The category 1 has only one object. By condition 5, we must have at least an identity arrow for it. Having this one arrow trivially satisfies the other conditions. This category looks like this:

          __
         v  \
         o   |
          \_/
The category 2 has two objects and three arrows, and looks like this:
             __             __
            /  v           v  \
            |  o --------->o   |
             \_/            \_/
We'll see this one in a later example.

Two other important categories are Set and Graph. "The category Set has sets as objects and total functions between sets as arrows." [Pierce, p. 2] "The category Graph has directed multi-graphs as its objects. An arrow f:G->H in Graph is a structure-preserving map between graphs." [Pierce, p. 51]

The next important notion is that of a functor. Again, from Pierce [p. 36]:

2.1.1 Definition. Let C and D be categories. A functor F:C->D is a map taking each C-object A to a D-object F(A) and each C-arrow f:A->B to a D-arrow F(f):F(A)->F(B), such that for all C-objects A and composable C-arrows f and g:
  1. F(idA) = idF(A)
  2. F(g o f) = F(g) o F(f)

Just as an arrow in a category maps between objects, functors map between categories. The important thing is that functors preserve structure (in an abstract sense): not only are objects mapped, but arrows as well. As we will see, analyzing how arrows are mapped is a powerful tool.

[[Further discussion TBD. Introduce commuting diagrams. Talk briefly about several concepts: isomorphism, natural transformations, limits, adjoints [canonical constructions], colimits, and comma categories.]] [[Not clear these are really needed.]]

Approach:

  • Embed category
  • Develop/define/identify functors
  • Transformations: One-way, invertible

Category Theory

A category is a collection of several things: objects, arrows, and composition operators. Many familiar mathematical objects have a category associated with them.

[Examples TBD]

The fundamental approach is to define a pair of categories, and then define a mapping between them, known as a functor. A functor maps both objects and arrows in a way that preserves identity and associativity.

For example, consider a stack as one category, and integers as another. Glossing over the details, each can be modeled as a category. We will define a functor that maps each stack to its size.

In a non-categorical approach, this might be a sufficient description. But note that it only maps the objects, not the arrows. For our stack category, we will consider the arrow push as operating on stacks. The definition of functor requires that we identify a mapping for arrows as well. So, if we push onto a stack, its size increases by 1, so the corresponding arrow for integers is incr.

What does this give us that a simple depth function does not? One nice thing is the ability to do diagram chasing. Consider a mapping:

        S    push    S'
        o ---------> o
        |            |
down    |            |  down
        V            V
        o ---------> o
        I    incr    I'
This diagram asserts that down(push) = incr(down). It tells us that it doesn't matter whether we push and then translate down, or translate first and then use the increment arrow.

The S, S, I, and I' are arbitrary values. We can use this fact to build up complex diagrams by pasting two diagrams together on a common edge:

        S    push    S'  push      S''
        o ---------> o ----------> o
        |            |             |
down    |            |  down       | down
        V            V             V
        o ---------> o ----------> o
        I    incr    I'   incr     I''
This shows us that
  down o push o push = inc o inc o down
The proof is embedded in the diagram, so we are spared the tedium of an algebraic proof.

Side Note: Categories are object oriented. Note how OO defines objects as both data and functions, and CT as mapping of data plus mapping of functions.

Fables

To show the application of the category theory notions, we'll use a few stories as springboards for discussion.

Fable 1. The Land Before Printers

In a faraway land, where computers were invented before printers, a printer merchant came to town. The merchant drew a picture like this:
     Online  o       o  Paper
              \______^
               print

and said, "This device converts your files into an image on a piece of paper. Your work can be moved from the world of electronics to the world of paper." Everybody thought this was great, and ordered one. It's too bad those early printers had a tendency to make mistakes.

The next time the merchant was in town, he heard about it: "Sometimes it prints `The quick brown fox' and others `The quick jrown fox'. And what about that picture you drew?"

The merchant replied, "That diagram just shows a document moving from online to paper, and the printer does that. It's not 100% accurate, but nothing's perfect." So, everybody put down money in anticipation of the next software upgrade.

Moral: Domains matter.

We want to consider the category implied by the diagram. To the merchant, there are only two objects, Online and Paper. To the townspeople, there is an object corresponding to each document. A simple attempt to identify the category reveals this misunderstanding. Category theory is "typed"; it requires explicit identification of the domains.

A similar analysis applies whenever crossing domains. Does a translator from C++ to C promise to produce merely some C program for a C++ program, or does it promise to produce a particular C program for a given C++ program? The diagram looks the same, but the analysis level can matter greatly.

Fable 2. Down Below the Ocean

A wizard came to the capital and bragged, "I'm the last survivor of Atlantis; nothing is left except me and my pen. As I'm the only person left who can read or write Atlantan, and it is everywhere admitted to be the best language for expressing laws, let me translate your laws into that glorious tongue."

The ruler thought this was worth a few million silver coins, and authorized the translation. After several months, and at great expense, the wizard returned with several volumes in a strange script. The court was thrilled, paid its money, and sent the wizard to the next kingdom.

A squirrel asked, "How do we know it's really our laws written in real Atlantan?"

Moral: Verification matters.

The ruler should have learned to read Atlantan (unless the volumes were only meant as a coffee table decoration). This situation structurally matches the previous one:

        Laws1   o       o  Laws2
                 \______^
                toAtlantan
The fable suggests the importance of having another arrow toEnglish:
             toEnglish
              ______
             v      \
     Laws1   o       o  Laws2
              \______^
             toAtlantan

Suppose the category objects are {Laws1, Laws2, Other}. We want to require that
   toEnglish o toAtlantan = idEnglish
that our translation is accurate. If there is no function to "close the circle" then we have no means to verify the translation.

This is known as a commuting diagram. It is an assertion that all paths for getting somewhere are equivalent. Diagrams can be composed, and very complex paths established. In this simple case, we realize that we have no way to verify the translation unless we have the second arrow.

The same analysis applies to printers: if we add a "retype" arrow, which means to re-enter a paper document back online, we have the key notion that lets us see how the townspeople in the first story can verify the problem: they see that the (obvious) retype function is incapable of making the diagram commute. (If we don't trust the printer, we'd better have an online copy somewhere to guide us in making corrections.)

Fable 3: Rarer Than a Pearl

After people had been printing for a while, they started noticing a problem: sometimes they would have a paper document for which they'd lost the online version, but they wanted it back online. They had no choice but to re-enter the document manually.

A new merchant came to town, an unprinter merchant. He said, "Let me show you a picture," and the townspeople groaned, but he showed them anyway:

              unprint
              ______
             v      \
      Online o       o Paper
             \______^
               print

He said, "I know the tales; I'd like to tell you my unprinter works like this on all documents, but it doesn't." (He was a very honest merchant.) "But, it is only 85% accurate." (I told you he was honest.) "Let me show you what it really does":

                      __________
                     v          \
        D1  D2  D3  D4  D5  D6  |
           ^ ^   ^     ^   ^   /
          /  |   |    /   /   /
        P1  P2  P3  P4  P5   P6

"If you print D1 to P1, and then unprint it, you unfortunately get D2 back. But, D2 and D3 work right, although I admit D4, D5, and D6 get a bit scrambled." He went on, "Everybody needs one. And it's already in Version 3.0, even though it just came out." The early adopters bought one, but most people decided to wait for Version 3.1 (which would have all the bug fixes).

Moral: Sometimes the hidden domains are what's interesting.

We have a discrepancy: the diagram
            unprint
            ______
           v      \
    Online o      o Paper
           \______^
             print
applies at the highly abstract level ("is it online or not?"), but does not apply at the level of individual documents (printing D1, and unprinting the result, yields D2 rather than D1). But we want to say more; after all, the unprinter is certainly doing something useful.

When there is a leaky or lossy transformation, there is an opportunity to find a "hidden" domain. This hidden domain may be trivial, or it may not even be obvious, but it provides a point of leverage. Here, the hidden domain might be an "abstract" version of a document.

Unprinting a printed document may yield a different document:

D1 o           o D2
     \         ^
      \       /
print  \     /  unprint
        \   /
         v /
          o
         P1
but they are in the relation "D2 is what results when you unprint what you get from printing D1." We can consider a new abstract document AD:
           AD
           o
          ^  ^
   up    /    \  up
        /      \
       /        \  
   D1 o          o D2
       \         ^
        \       /
  print  \     /  unprint
          \   /
           v /
            o
            P1
The up function takes a document to its abstract form. The diagram asserts that
  up(d) = up (unprint (print (d))).

But what are the abstract documents? We'll treat the abstract documents as a partition of the electronic documents: each represents a set. We can construct them by taking the closure of the print and unprint functions. Since unprint(print(D1)) = D2, then D1 and D2 are in the same partition. In our example, the abstract online documents are AD1 = {D1, D2}, AD2 = {D3}, AD3 = {D4, D5, D6}. (We could build a similar structure for abstract paper documents.)

[[Side note: This glosses over where arrow starts]]

So, up takes a document to the abstract document which contains it. Since we have a partition, there is exactly one such abstract document.

We can compare unprinters by the way they partition the documents. One might be so bad it is essentially a "two-state" version. Another might tend to arbitrarily substitute letters; it has a partition for each possible length of document. A good one might only confuse certain words (like "Smith", "SmiTh", and "5mith"); it will have many partitions.

This example reveals a guideline: many apparent inverses are not true inverses. When this happens, there may be an abstract domain that can reconcile the differences.

Real Problems

SortTables

Development of the SortTables system provides an example of how categorical notions can drive design decisions.

SortTables has a set of records, a current column (the sort order), and a current row (indicated by a cursor bar). One of the transformations available (arrows in the categorical framework) is to change the sort order.

It is clear that the order of the records are changed by this operation; but what about the current row? This was not addressed by the original design, but came up during implementation. The three possibilities were to leave it at the same position (rejected as confusing), to put it at the first row in the new order, or to locate the same record as the previous order and put it there.

The argument of trying to maintain as much information even across the "functor" won the day for the latter choice. It's interesting that people learning to use the system did not at first realize why the cursor remained on the record it did. They understood that the order was different, but didn't have a model to tell them why the current line was where it was. After a short use, they became accustomed to it, and agreed that it was the proper behavior, as they often wanted to explore items in the neighborhood of a given item when sorted by a different attribute.

Character Sets

Given a document in one character-set encoding, how do we display it in another? The SortTables interface has this problem, as do web gateways. (A source document might be in "ANSEL", the target in ISO-Latin-1.)

For example, the ANSEL encoding of US-MARC records encodes diacritics as the diacritic mark (a value above 128) followed by the letter. The ISO-Latin-1 encoding uses a single character for the letter with the diacritic. Thus ANSEL "'e" must map to ISO-Latin-1 "é". Unfortunately, not all ANSEL characters have representations in ISO-Latin-1. (For example, the Polish slashed-L is missing.) The question then becomes, "what to do about missing characters?"

Two approaches are very common: either ignore the diacritic (which is not acceptable in all languages), or introduce "special translations" for the missing character (e.g., "*L" for the slashed-L).

Looking at this problem in the category theory framework reveals why both approaches must be regarded as hacks: it is not possible to map completely back from ISO-Latin-1 to ANSEL, as we have lost critical information. If the diacritic is ignored, we don't know whether to map "L"ISO-Latin-1 to "L"ANSEL or "slashed-L"ANSEL. If the special translation is introduced, we don't know whether to map "*L"ISO-Latin-1 to "*L"ANSEL or "slashed-L"ANSEL. We might decide to live with the second "solution": perhaps "*L" never appears in the source text. Or we can introduce special mapping rules, perhaps involving control characters. (But this reduces readability.)

What is the solution? In a sense, there is none. Our domains are incompatible, and something has to give. A partial approach is to map to a union character set. For example, the 16-bit Unicode character set contains all the characters in both ANSEL and ISO-Latin-1. By working in the common domain where possible, we defer the problem.

Eventually, the piper must be paid, if we ever must move "back" to ANSEL. If such back-translation is required, we will have to define unambiguous representations for each Unicode character.

But the example of the web gateway allows another way out. Some web browsers "know" Unicode, in the sense that they support character entities for Unicode characters. Thus, "Lukasiewicz" might be encoded "&#597;ukasiewicz". If we must deal with a browser that doesn't understand this much, we might use an even more verbose solution: "<img src="letter/Lslash.gif" alt="L/">ukasiewicz", wherein we display an image of the required character.

We gain a lot from mapping to the Unicode domain, because it solves other problems as well. Real systems don't deal with just two character sets: they have M input alphabets and N output alphabets. By using a union character set, we only need M+N mappings (M up, N down), rather than the M*N separate mappings that would otherwise be required. (This is similar to the problem of compiling M languages for N machines.)

Other Examples

(Needs expanding)
  • Structure of documents - bring on hierarchical documents
  • Variations and the Bible problem
  • Two printers, which is right?
  • Lessons: CT suggests importance of identifying domains.
  • New domains induce engineering decisions, i.e., economic tradeoffs.
  • Introducing correction:
               AD
               o
              ^  ^
       up    /    \  up
            /      \
           /  fix   \  
       D1 o<-------- o D2
           \         ^
            \       /
     print   \     /  unprint
              \   /
               v /
                o
                   P1
    
    Note that correction depends on original; objects become pairs (original, unprinted).

    RightPages: image + text

               AD
                o
              ^  ^
       up    /    \  up
            /      \
           /        \  
    ascii o          o screen
           \         ^
            \       /
    unprint  \     /  display
              \   /
               v /
                o
                ^
                | scan
                o 
              paper
    
  • Views: some are orthogonal, some are dependent.
  • Translations, views, multiple views, leaky transformations. "Leaky transformations induce hidden views." Lossy transformations.
  • Models: when can we go up and down?
  • How are multiple views related to each other?
  • The Myth of Conversion, the Myth of Perfection
  • Higher-level concepts in category theory: functors, adjoints, etc.

RightPages

The RightPages system uses optical character recognition and page images to search for and present articles in an online library.

Page images are scanned in for display. The cover of the journal is scanned in also. They also attempt to integrate the text of the article. If the text is online, they will use that. If not, they do optical character recognition to try to recover the text.

Clearly, OCR is not a consistent process. We already know that it's not clear what document is converged to if it's cycled through printing and OCRing (recall the unprinter example). However, they take steps to avoid that problem: they try to ensure that each document is scanned in only once.

Another problem is the inaccuracy of the OCR: they're aware of this too, as their interface encourages visual searching: by displaying the covers of various journals, it encourages people to look in places they already "know".

[[Do a categorical treatment first, then discuss problems identified in the system, their (RP) solution, and a CT solution.]]

What are the categories? We have a scanned version of a document, and a possible ASCII or OCR version. The ASCII version is clearly a category:

        objects = strings
        arrows = transforms on strings
The scanned image is too:
        object = scanned image
        arrow = image edit

One approach:

             OCR
        Image --------> ASCII

Another:

    Image     ASCII
   OCR \       / copy
        ASCII2
ie. convert image to ASCII.

Yet another:

      IMAGE X ASCII
             |
           IMAGE
("forgetful functor")

A document is thus two things: an image and some ASCII. They certainly can't be fully derived from each other. What does search produce? It must produce both to enable future manipulation.

            OCR
            ______
           v      \
     ASCII o       o IMAGE
            \______^
              print
(especially for OCR - what is recovered doesn't include e.g., font information).

Pen-Based Systems

Some ink .../\__ meant to be a word. What can we say?

In the case of the Newton, it supposedly doesn't keep ink form around for later analysis. (That is, handwriting analysis is a mode.) This may be a flaw.

Hierarchical Editing in a Linear World

Include other paper here

Multiple Hierarchy Document Model

TBD

Lessons

The stories and examples above suggest design rules that can be useful to people working with multiple document formats.

Conclusions

TBD

Bibliography

  • Mayfield et al. Apply CT to IR.
  • Pierce. Good intro to CT.
  • Barr and Wells. More thorough treatment of CT.
  • O'Gorman et al. RightPages.
  • SortTables.
  • MARC ANSEL document.
  • Web browser document.

Superseded - OLD PARTS NOT RE-INTEGRATED

What is a View?

A view is a way of looking at things. To formalize the notion, we will frame views in the language of category theory, a branch of abstract mathematics dealing with the algrebra of functions. We hope this will provide a more solid basis for the use of views, and hope to use this basis to generate insight into the problem of maintaining them.

Background and Direction

The programming language Smalltalk has developed views as part of a general framework for developing user interfaces. This framework is known as MVC, for Model-View-Controller. The Model is a single entity, on which several Views and Controllers may depend. Views are presentations of the model; Controllers allow interaction with (and modification of) the model.

Smalltalk factors user interfaces this way to provide intellectual control of the process of software development. As Models tend to be more stable than Views, Views are made to depend on Models rather than the other way around.

In our categorical approach, we will be less concerned with which is the Model and which the View, and more concerned with translations between them.

Category Theory as Inspiration

A key idea in this research is that multiple views are important. Category theory is a meta-model, providing a framework for moving between views. We do not want to reject previous document models, but rather to address them through a common framework.

This research originally set out to develop yet-another-document-model: to more fully define what it means to be a document with multiple hierarchies. Instead, it has evolved toward a category-theory approach to dealing with documents. Category theory is a branch of mathematics that attempts to provide insight into the question "What is the fundamental structure?" This approach requires identifying and manipulating the categories involved. Most document models have been ones for which an obvious category exists (e.g., sets and graphs).

The difficulty in using documents is often not working with them in their intended domain, but rather moving them between domains. A category theory approach can provide insight into these difficulties and perhaps suggest resolutions. Like [Mayfield etc.] we adopt the notion and notation of categories, but don't develop new theorems in category theory. Rather, we use category theory ideas to illuminate some dark corners in document models.

We will use some parables to demonstrate the utility of this approach in small examples. Then we will apply it to several extended examples, including ....

Copyright 1994-2010, William C. Wake - William.Wake@acm.org