Input Validation - TDD Patterns

Input validation is the process of checking that an input value is legal according to some set of rules. We’ll look at some considerations in developing such code.

Validation and Normalization Functions

The simplest form of validation is a boolean function over the input, for example:

function isValid(String) → Bool

To test this function, give it both valid and invalid inputs, and check that it accepts and rejects them properly. Choose “near misses” for the invalid inputs, as a form of boundary testing.
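As a rough sketch in Python, such a test might look like the following. The rule itself (exactly three ASCII uppercase letters) is a made-up assumption for illustration, as are the function and test names.

```python
def is_valid(text: str) -> bool:
    """Hypothetical rule: exactly three ASCII uppercase letters."""
    return len(text) == 3 and all("A" <= ch <= "Z" for ch in text)


def test_accepts_valid_inputs():
    for text in ["ABC", "XYZ", "AAA"]:
        assert is_valid(text)


def test_rejects_near_misses():
    # Each input breaks the rule in just one small way.
    for text in ["", "AB", "ABCD", "AbC", "AB1", "AB "]:
        assert not is_valid(text)
```

The invalid inputs are deliberately close to valid ones: one character too short, one too long, one lowercase letter, one digit, one trailing space.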

(When validation is a side effect of, say, converting from the input type to the internal type, it may not be necessary to separate validation from conversion.)

To simplify validation, you might introduce normalization: converting the input to a standard form. For example, you might trim spaces off the end or convert to uppercase.

Usage then looks like this:

isValid(normalize(input))

We can test-drive normalization and validation separately. Testing validation becomes easier because it can make simplifying assumptions.
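Here is a minimal Python sketch of that split. The three-letter rule and both function names are assumptions for illustration. Note that `strip()` trims whitespace from both ends, which goes slightly beyond trimming only the end.

```python
def normalize(text: str) -> str:
    """Trim surrounding whitespace and convert to uppercase."""
    return text.strip().upper()


def is_valid(text: str) -> bool:
    """Assumes normalized input. Hypothetical rule: three letters A-Z."""
    return len(text) == 3 and all("A" <= ch <= "Z" for ch in text)


# Normalization is tested on its own...
assert normalize("  abc ") == "ABC"

# ...so the validation tests never need to consider case or padding.
assert is_valid(normalize(" abc\n"))
assert not is_valid(normalize("ab c"))
```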

Validation need not be a boolean decision; it can provide multiple messages about what went wrong. It may even include locations so you can highlight the input. For example, password forms may tell you that your password isn’t long enough, and that it doesn’t have enough variety.

With a validator detecting several errors, your tests must drive out each error individually, as well as a combination of two or more errors.
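A sketch of a multi-message validator in Python might look like this. The specific rules (a minimum of 8 characters, and at least two of lowercase, uppercase, and digit) and the message texts are assumptions chosen only to illustrate the idea.

```python
def password_problems(password: str, min_length: int = 8) -> list[str]:
    """Return a list of problems; an empty list means the password is accepted."""
    problems = []
    if len(password) < min_length:
        problems.append("too short")
    # Hypothetical variety rule: count which kinds of character are present.
    kinds_present = [
        any(ch.islower() for ch in password),
        any(ch.isupper() for ch in password),
        any(ch.isdigit() for ch in password),
    ]
    if sum(kinds_present) < 2:
        problems.append("not enough variety")
    return problems


# Each error on its own, then both together, then neither.
assert password_problems("ab1") == ["too short"]
assert password_problems("abcdefgh") == ["not enough variety"]
assert password_problems("abc") == ["too short", "not enough variety"]
assert password_problems("abcdefgh1") == []
```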

Ward Cunningham’s “Checks” pattern (see References) suggests a way to manage validation and error reporting.

Example Validations

Here are some examples of things you might like to validate:

Numbers
Names
Date
Time
Email address
Phone number
Credit card number
ISBN (book number)
Password
Money
Domain name
URL
…

Simple Validation Functions

The simplest validations are those that check the length. For example, “length > 0” or “length == 3”.

The next easiest is probably a function that checks the types of characters. For example, this field requires all digits.

You may have a simple format, e.g., a US ZIP (postal) code is either 5 digits or 5 digits followed by “-” and 4 digits.
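These checks compose naturally. The Python sketch below builds the ZIP code check out of a length check and an all-digits check. The function names are illustrative, and the digit check is restricted to ASCII because `str.isdigit()` alone also accepts other Unicode digit characters.

```python
def is_all_digits(text: str) -> bool:
    """True for a non-empty string of ASCII digits 0-9."""
    return text.isascii() and text.isdigit()


def is_us_zip(text: str) -> bool:
    """Accept 'ddddd' or 'ddddd-dddd'."""
    if len(text) == 5:
        return is_all_digits(text)
    if len(text) == 10:
        return (
            is_all_digits(text[:5])
            and text[5] == "-"
            and is_all_digits(text[6:])
        )
    return False


assert is_us_zip("12345")
assert is_us_zip("12345-6789")
# Near misses: too short, wrong separator, missing separator, short suffix.
assert not is_us_zip("1234")
assert not is_us_zip("12345 6789")
assert not is_us_zip("123456789")
assert not is_us_zip("12345-678")
```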

Calculated Validation

Some validation is done by a formula.

For example, ISBN-10 (the product number on a book) has a validation function that multiplies each digit by a weight based on its position, then adds a check digit that makes the sum come to zero modulo 11. (This is why you occasionally see an “X” in an ISBN; it represents 10 mod 11.) (See the References.)
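A minimal Python sketch of the ISBN-10 check follows. It assumes the common weighting of 10 down to 1 from left to right, and it assumes hyphens should be ignored; the function name is illustrative.

```python
def is_valid_isbn10(text: str) -> bool:
    """Check the ISBN-10 weighted sum; hyphens are ignored (an assumption)."""
    chars = text.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch in "0123456789":
            value = int(ch)
        elif ch == "X" and position == 9:
            value = 10  # "X" is only legal as the final check digit
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


assert is_valid_isbn10("0-306-40615-2")
assert is_valid_isbn10("080442957X")
assert not is_valid_isbn10("0-306-40615-3")  # wrong check digit
assert not is_valid_isbn10("0-306-40651-2")  # two adjacent digits swapped
```

The last two cases show what the formula is for: a single wrong digit or a swapped pair changes the sum so it no longer divides by 11.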

Credit cards also have a validation function with a checksum, known as the Luhn algorithm. (See References.) This prevents someone from just making up and using a random card number. The credit card system has a reserved test number: 4111111111111111. This passes validation but is known not to be a real account.
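The Luhn check can be sketched in a few lines of Python. This version assumes the input has already been normalized to digits only (no spaces or dashes), and the function name is illustrative.

```python
def passes_luhn(number: str) -> bool:
    """Luhn checksum over a string of ASCII digits."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, ch in enumerate(reversed(number)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


assert passes_luhn("4111111111111111")      # the test number mentioned above
assert not passes_luhn("4111111111111112")  # last digit changed
assert not passes_luhn("4111 1111 1111 1111")  # not normalized
```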

Grammatical Tests

Some validation expresses constraints on the sequence of characters. For example, US phone numbers might be expressed like this: (ddd) ddd-dddd, where each d represents a digit 0-9.

Patterns like this are often checked with regular expressions. They aren’t a perfect solution, as Jamie Zawinski noted: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” (See References.)

Regular expressions can work well (when they’re powerful enough). But they aren’t always easy to understand or get right.
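As a small sketch, the US phone format described earlier, (ddd) ddd-dddd, could be checked in Python like this. The function name is illustrative, and the pattern uses `[0-9]` rather than `\d` because `\d` also matches non-ASCII digits in Python string patterns.

```python
import re

# Literal parentheses must be escaped; the space and hyphen are literal too.
US_PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")


def is_us_phone(text: str) -> bool:
    # fullmatch requires the whole string to match, not just a prefix.
    return US_PHONE.fullmatch(text) is not None


assert is_us_phone("(555) 123-4567")
assert not is_us_phone("555-123-4567")     # different punctuation
assert not is_us_phone("(555)123-4567")    # missing space
assert not is_us_phone("(555) 123-45678")  # one digit too many
```

Even this small pattern shows how easy it is to get details wrong: forgetting to escape the parentheses, or using `match` instead of `fullmatch`, would silently change what is accepted.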

For example, a surprising number of formats are acceptable for an email address. See RFC 5322 (in References).

One of the RFC examples includes this:

   From: Pete(A nice \) chap) <pete(his account)@silly.test(his host)>

and describes the whole example as “aesthetically displeasing, but perfectly legal”.

I’ve also seen systems be inconsistent about what they accept. For example, me+mine@example.com is legal to sign up with, but rejected by the unsubscribe mechanism.

You may even have complicated enough rules that you need a context-free grammar (one capable of handling nested structures). By the time you get to this, you’re getting beyond simple validation.

Double Validation

For web applications, and other applications where we can’t trust the user’s environment, we often have to do double validation: validate on the front end so we can find errors without a trip to the back end, then again on the back end as we can’t trust the front end.

Since a front end and back end are often written in two different languages, it’s hard to make sure they implement the same validation.

I18N and Other Special Cases

If your software needs to support multiple languages or cultures, you need to take into account the full variety of rules around the world. Postal codes, currency, dates, etc. are all formatted differently for different regions.

For almost any attribute, there are complexities that you need to be aware of. Names are notorious for this: some people have only one name (ask Cher or Madonna); some have names with only one letter; some names (gasp) aren’t even written with English letters. (See “Falsehoods Programmers Believe About Names” in the References.)

Even things that appear to be standardized may have special circumstances. I used to live in an outdoor museum, and our (18th-century) houses weren’t allowed to display street addresses (since houses weren’t numbered back then). Getting a package was tricky, even though our legal address met Post Office standards. (I loved forms with a notes section!:)

Avoiding Validation

You may be able to avoid validation, or simplify it, when you control or trust the user interface.

Consider the user interface for a date. If it’s just a text field, it’s hard to interpret.

Your code has to provide all the interpretation. Is this “4/7/21” a day/month/year or year/month/day? Or will they spell it out, with a 3-letter or full month name? Or what else?
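To make the ambiguity concrete, here is a short Python sketch that parses the same text under two different assumptions. The day-first and month-first readings are chosen for illustration; nothing in the text itself says which one the user meant.

```python
from datetime import datetime

text = "4/7/21"

# The same input, read under two different conventions.
as_day_first = datetime.strptime(text, "%d/%m/%y").date()
as_month_first = datetime.strptime(text, "%m/%d/%y").date()

print(as_day_first)    # 2021-07-04
print(as_month_first)  # 2021-04-07
```

Both parses succeed, so validation alone can’t tell you which date was intended.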

Instead, suppose you define a set of dropdowns for the year, month, and day.

You may have some work to ensure that only valid dates are possible (no “Feb. 31, 2025”!), but at least you know the year, month, and day component values are in the right range.
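One way to do that remaining work, sketched in Python, is to let the standard library’s `date` constructor reject impossible combinations. The helper name `date_from_parts` is an assumption for illustration.

```python
from datetime import date
from typing import Optional


def date_from_parts(year: int, month: int, day: int) -> Optional[date]:
    """Return a date for the dropdown values, or None if no such day exists."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


assert date_from_parts(2025, 2, 31) is None  # no "Feb. 31, 2025"
assert date_from_parts(2025, 2, 29) is None  # 2025 is not a leap year
assert date_from_parts(2024, 2, 29) == date(2024, 2, 29)
```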

Finally, suppose you have a calendar widget that only accepts valid dates, and gives you a valid Date object.

Since the widget does all the work (and hopefully works when localized too), your code doesn’t have to validate any formats.

Conclusion

It’s common to need to validate input. You can often separate validation from processing, to make it easier to test each independently. Sometimes it’s helpful to normalize input before validating it, simplifying the problem even further. We looked at a number of typical validations, and some ideas you might use to implement them. Finally, we considered that judicious user interface design may make some validation unnecessary.

References

“The CHECKS Pattern Language of Information Integrity”, by Ward Cunningham. Retrieved 2021-04-13.

“Falsehoods Programmers Believe About Names”, by Patrick McKenzie.

“How to Verify an ISBN”. Retrieved 2021-04-13.

“Luhn algorithm”, Wikipedia. Retrieved 2021-04-13.

“Source of the famous ‘Now you have two problems’ quote”, by Jeffrey Friedl. Retrieved 2021-04-13.