Primitive types are the basis of others. They’re typically small (ish), and don’t involve repetition or recursion. We’ve already looked at numbers; today we’ll look at Booleans, characters, and simple enumerations. Beyond that, we’ll look at combining values via product and sum types.
Boolean
In some ways, Boolean are the simplest type – a single bit of information.
True and False – they form two natural partitions, one for each value. We often need both, though sometimes we can fold their use into a single expression.
Characters
A character is a graphic symbol such as a letter or punctuation.
One challenge is that there are multiple character sets and encodings. You need to be clear which ones your application supports.
There are a few challenges:
- The ways of using characters tend to be app-specific, so the partitions are too. For example, if I’m trying to parse numbers. I care about digits and a few other characters vs. everything else. If I’m writing an editor, I may care about visible vs. invisible.
- There are many representations; they’ve been evolving for years.
Originally, each manufacturer had its own character sets (the characters used), and encodings (how they’re represented in memory or files). For example, 3-character file extensions were common in operating systems for a while. Why three characters? Because three characters fit in 2 bytes with the RAD-50 encoding. Rad-50 has only 40 characters: uppercase letters, numbers, and four other characters. (The 50 is octal.)
Then was some standardization – EBCDIC and ASCII (and their extensions) flourished for a while. When PCs came in, they tended to use ASCII with extensions. Then a multiple-year process produced Unicode, which has become more common but not universal.
However hard you think managing characters is – it’s probably harder than that!
Character Partitions
How do we partition characters? There’s usually a top-level partition of characters for those to:
- treat specially
- pass through but don’t interpret
- not handle
Your application’s needs drive the partitions, tests, and behavior you devise. For example, many “small objects” have domain-specific character rules: email addresses, credit card numbers, social security number, phone numbers, URLs.
Unicode divides characters into a number of groups: letter, number, mark (for accents), punctuation, symbol, separator (spaces), and others (including control characters). Do these reflect your needs?
There are other common distinctions:
- Control/visible
- Alpha vs number vs other
- ASCII vs Unicode (single- vs. multi-byte)
- Lowercase vs upper
- Digit vs digit-like – do you consider ④ (circled 4) to be a digit?
Character Equality
Does your application need to decide whether two characters are equal?
Are upper-case and lower-case equivalent?
How will you handle accented characters? Are single accented characters equivalent to ones formed with a combining character? Is a plain letter (e) equivalent to an accented one (é)?
Are there other equivalencies? Is “∫ρą𝔪” “spam”?
Do you ever have to compare characters that aren’t in the same character set or encoding?
Ordering Characters
If you’re sorting, how will you order the characters?
Sort order is language specific, and it can’t be done fully on a single-character by single-character comparison. For example, “ll” in Spanish may sort separately from “l”.
Does case matter? Do you want the order “Apple, Banana, aardvark”, or “Apple, aardvark, Banana”. Where will non-letters fit? Are any characters ignored for sorting purposes?
See the references for an article about how Oracle handles “linguistic sorting”.
Simple Enum Types
At the simplest level, enums are a way to define constants. Simple enums may correspond to a single value. (Some languages allow more complicated treatment – e.g., Java lets you attach functions, Swift lets you attach various values on a per-case basis.)
Simple enums can act like integers, but with a slight bit more protection.
Almost always, you have a separate partition for each enum, although you may be able to fold some partitions together.
Product Types
So far, our discussion has considered single values – primitive types. But there’s a good chance you also work with types that combine multiple primitives (or other types).
Type theory has the notion of product and sum types. In C terms, a struct is a product type, and a (tagged) union is a sum type.
For product types, the number of possibilities is the product of the number for each individual type. For example, if one argument is a boolean and the other is a 16-bit unsigned integer, the number of possible values passed in is 2 * 2^16 = 2^17.
Some things that form product types:
- Record (struct)
- Tuple (anonymous record)
- Class
- Object
- Parameters (not strictly a product type, but you can think of multiple arguments as being like a tuple)
When the parts interact, your partitions tend to cut across multiple dimensions.
For example, take tax rules that say:
- If you make < $100K/year, you get a credit of $1000/child for 0-4 children, and $4000 for five or more.
- If you make $100K-$200K/year, you get $500/child for 0-2 children, and $1300 for three or more.
- If you make > $200K/year, you get $0 regardless of the number of children.
Our partitions will consider both salary range and number of children together. For example, one partitioning might be: {(<100K, 0..4), (<100K, >4), (100K..200K, 0..2), (100K..200K, >2), (>200K, any)}. (You could devise others.)
Sum Types
Product types say “this AND that AND the other”. Sum types say “this OR that OR another” (mutually exclusive).
Some examples:
- unions (specifically, tagged unions with a value that tells which one to use)
- enums, possibly extended with additional data
- objects of different classes
In code, these are often revealed by the need for “if” and “switch” statements.
For sum types, each separate case requires its own handling.
For example, I’m starting work on an elm interpreter in Swift. Swift has extended enums where you can attach extra values. Here’s the AST enum (AST = Abstract Syntax Tree):
enum AST: Equatable {
case elmInteger(_ value: Int, _ type : ElmTypes)
case elmFloat(_ value: Double, _ type: ElmTypes)
//...
}
enum ElmTypes {
case number
case float
}
To use this, somewhere there’s a “switch” statement:
func interpret(_ value: AST) {
switch value {
case AST.elmInteger(let number, let type):
output.append("\(number) : \(formatType(type))")
case AST.elmFloat(let number, let type):
output.append("\(number) : \(formatType(type))”)
// ...
}
}
In effect, each enum forces completely separate handling. (Also notice that the enum – sum type – contains a product – record-like values.)
Objects Are Both Products and Sums
You may have noticed that objects appear on both the list of product and sum types.
A class is a product: its fields are the Cartesian product of its own and those of its ancestor classes. In this sense, it’s like a record. Partitioning may need to consider combinations of values.
A class is a sum: it’s an alternative to its relatives (ancestors and their descendants). Even if its fields are the same as another relative’s, it’s still a distinct type – the type itself acts as a discriminator.
Conclusion
We’ve looked at primitive types and ways you might partition them:
- Integer (see references: “Integers, Floats, and Partitions…”)
- Double (likewise)
- Boolean
- Character
- Enum
We also considered building more complex types out of primitives (and other types):
- Product: struct, record, class (fields), tuple, multiple arguments
- Sum: union, enum, class (vs. its relatives)
A product type multiples the possibilities. Partitions for these typically cut across multiple dimensions (e.g., across fields).
A sum type adds possibilities. Each separate case adds in more possibilities, and often represents a separate partition.
References
“Integers, Floats, and Partitions – TDD Patterns”, retrieved 2021-01-26.
“Linguistic Sorting and String Searching”, retrieved 2021-01-24.