Time Troubles - Binary Timer Overflow

A fixed-size counter (triggered by some sort of pulse) is a common way to represent time.

When you use this approach, you generally have to answer three questions:

What moment does “0” represent? (Sometimes called the “epoch”.)
What time interval does adding 1 represent? (Are you counting milliseconds, seconds, or what?)
How many bits are in the counter?

Answering those questions lets you answer several others. (For all of these, make sure your units line up.)

Q: What’s the representation of the current time? A: (t – epoch) / interval
Q: What time does this timer count represent? A: epoch + n * interval
Q: When will the timer run out of room? A: epoch + max * interval is the last moment it can represent (where max is the largest counter value); it overflows one interval later.
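
These three answers can be sketched in a few lines of Python. The function names are made up for illustration, and the example values (a signed 32-bit count of seconds since 1 January 1970) are an assumed Unix-style setup. The sketch treats the overflow as happening one tick after the largest counter value.

```python
from datetime import datetime, timedelta, timezone

def time_to_count(t, epoch, interval):
    """Counter value representing time t: (t - epoch) / interval, rounded down."""
    return (t - epoch) // interval

def count_to_time(n, epoch, interval):
    """Time represented by counter value n: epoch + n * interval."""
    return epoch + n * interval

def overflow_time(epoch, interval, max_count):
    """First moment the counter cannot represent: one tick past max_count."""
    return epoch + (max_count + 1) * interval

# Assumed example: signed 32-bit count of seconds since 1 January 1970 (UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
TICK = timedelta(seconds=1)
MAX_COUNT = 2**31 - 1

print(count_to_time(MAX_COUNT, EPOCH, TICK))   # last representable second
print(overflow_time(EPOCH, TICK, MAX_COUNT))   # moment of overflow
```

Using timedelta for the interval keeps the units lined up automatically: dividing a time difference by an interval yields a plain count.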

Examples

There are a number of examples where overflow affected (or will affect) systems:

Windows 95 and Windows 98 would hang after 49.7 days of uptime (2^32 msec). (The API call GetTickCount() returns uptime as a 32-bit count of milliseconds.)
Boeing 787 generator control units go into failsafe mode after 248 days of continuous power: 2^31 * 0.01 sec.
Airbus A350s require a reboot every 149 hours. (I didn’t find a source saying exactly what the overflow was.)
Deep Impact 2013: The Deep Impact space probe, used to study comet Tempel 1, is believed to have been lost because its timer counter overflowed after 2^32 * 0.1 sec. Counting from 1 January 2000, it ran out on 11 August 2013.
Anonymous tracking system: It (unintentionally) called 16-bit time functions on Windows to measure the duration of a task, so it computed task durations incorrectly and crashed when run overnight.
Y2036 Problem: NTP (Network Time Protocol, which allows computers to synchronize their clocks) uses a 32-bit unsigned counter of seconds since 1 January 1900, and will run out in 2036.
Y2038 Problem: Many Unix systems represent time as a signed 32-bit count of the number of seconds since 1 January 1970, and will run out on 19 January 2038.
Y292,277,026,596 Problem: Systems that use the Unix scheme with a 64-bit count will overflow. This sounds far away, but note that problems may occur in years preceding, e.g., retirement calculations or long-term mortgages may need to project beyond that date.
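
The arithmetic behind several of these examples is easy to check: the rollover period is the number of counter values times the tick. Here is a minimal sketch. The helper name rollover_period is hypothetical, the Deep Impact start date is the one given in the list, and the 0.01 sec tick for the 787 is simply the value that reproduces the 248-day figure.

```python
from datetime import datetime, timedelta

def rollover_period(bits, tick_seconds):
    """Time until a counter with `bits` bits wraps, given its tick in seconds."""
    return timedelta(seconds=(2**bits) * tick_seconds)

# 32-bit count of milliseconds (GetTickCount-style uptime)
print(rollover_period(32, 0.001).days)     # whole days before the wrap

# 31-bit count of hundredths of a second (matches the 248-day figure)
print(rollover_period(31, 0.01).days)

# 32-bit count of tenths of a second starting 1 January 2000 (Deep Impact)
print(datetime(2000, 1, 1) + rollover_period(32, 0.1))

# Unsigned 32-bit count of seconds since 1 January 1900 (NTP)
print(datetime(1900, 1, 1) + rollover_period(32, 1))
```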

Detecting Timer Overflow

How will you know a timer overflow occurred? It’s a tricky problem.

Before Runtime

Analytic: When you set a starting point for your timer, a time interval, and a fixed counter size, you’ve effectively decided the maximum time you can represent. Can your system run into these limits?
Hardware Review and Code Review: Systematic human review, from a time management perspective, may be able to establish that a timer is properly managed, or at least identify that it is not.
Tools: Compilers, linters, and special-purpose tools may be able to identify code that (potentially) doesn’t handle timer overflow.

Built-In Detection

Detection isn’t the whole story; you also need to address the problem.

Exceptions: Some languages or hardware provide an exception when an overflow occurs for signed and/or unsigned values.
Flags – Timer overflow, arithmetic overflow, or carry: Many processors can detect when incrementing or adding to a value pushes it out of range.
Timer Overflow Interrupt: Some systems (e.g., Arduino) provide an interrupt when a timer overflows, where you can install a handler to detect and address it.
Check for Wrapping: In some languages, you can tell when a timer has wrapped because the value decreases instead of increases. This may require that you don’t wait too long between checking values – if it could wrap twice, or wrap beyond the previous value, you won’t be able to tell.
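
The wrap check in the last item, and the related trick of computing elapsed time with modular arithmetic, can be sketched as follows. This assumes a 32-bit unsigned tick counter that wraps at most once between readings; the function names are hypothetical.

```python
COUNTER_BITS = 32                 # assumed counter width
MODULUS = 1 << COUNTER_BITS

def has_wrapped(previous, current):
    """A reading lower than the previous one means the counter wrapped
    (valid only if it wrapped at most once and did not pass `previous`)."""
    return current < previous

def elapsed_ticks(start, end):
    """Ticks from start to end, correct across a single wrap."""
    return (end - start) % MODULUS

# Counter read shortly before and shortly after a wrap
before = MODULUS - 10
after = 5
print(has_wrapped(before, after))    # True
print(elapsed_ticks(before, after))  # 15
```

The modular subtraction gives the right duration whether or not a wrap occurred, so code that only needs elapsed time often does not have to detect the wrap at all.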

Run-Time Detection

Soak Test: Run the system at load for a “long” period of time, ideally long enough for any timer-related problems to reveal themselves.
Debug Crashes, esp. “Cyclic” Crashes: When a failure or crash occurs “randomly”, it’s worth trying to see if it is really occurring a consistent amount of time after some event. For example, if the app consistently crashes 18.2 hours after it starts, you might look for a 16-bit counter of seconds.
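
To turn an observed failure period into a hypothesis, you can search over plausible counter widths and ticks. This is a rough sketch: the candidate widths, the candidate ticks, and the 2% tolerance are assumptions chosen for illustration.

```python
CANDIDATE_BITS = (8, 15, 16, 24, 31, 32)
CANDIDATE_TICKS = {          # tick name -> seconds
    "1 ms": 0.001,
    "10 ms": 0.01,
    "100 ms": 0.1,
    "1 sec": 1.0,
}

def guess_counters(period_seconds, tolerance=0.02):
    """Return (bits, tick name) pairs whose rollover period is within
    `tolerance` (relative) of the observed period."""
    matches = []
    for bits in CANDIDATE_BITS:
        for name, tick in CANDIDATE_TICKS.items():
            rollover = (2**bits) * tick
            if abs(rollover - period_seconds) <= tolerance * period_seconds:
                matches.append((bits, name))
    return matches

print(guess_counters(18.2 * 3600))      # crash every 18.2 hours
print(guess_counters(49.7 * 86400))     # hang every 49.7 days
```

A match is only a lead: it tells you which kind of counter to look for in the code, not that one exists.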

Solutions

Timer Scaling: Some systems let you explicitly control the interval. For example, you can choose whether your timer has a one-millisecond or a one-second “tick”. You may be able to modify the design to deal with a larger tick (thus increasing the time before the timer runs out).

Timer Extension: Keep a second (or beyond) counter, incremented when the first overflows. Think of it as a “carry”. You still have to decide where this extra space lives and how it’s managed. And you may have extra synchronization problems. (See Lamport article in the references.)
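
Here is a minimal sketch of timer extension, assuming a 16-bit hardware counter that is polled at least once per wrap period. The class name and the polling approach are illustrative; a real system with interrupts or multiple readers would also need the synchronization care mentioned above.

```python
class ExtendedCounter:
    """Extends a narrow wrapping counter with a software 'carry' count."""

    def __init__(self, bits=16):
        self.modulus = 1 << bits
        self.overflows = 0      # the extra high-order counter
        self.last_raw = 0

    def update(self, raw):
        """Feed the latest raw reading; return the extended count.
        Must be called at least once per wrap period."""
        if raw < self.last_raw:          # value went down: the counter wrapped
            self.overflows += 1
        self.last_raw = raw
        return self.overflows * self.modulus + raw

counter = ExtendedCounter(bits=16)
for raw in (100, 40000, 65000, 20, 30000):   # simulated readings; wraps once
    extended = counter.update(raw)
print(extended)                               # 65536 + 30000 = 95536
```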

Timer Expansion: Get a bigger boat: use a counter with more bits. Note that this can cause legacy problems: input, storage, output, APIs, cross-system compatibility, etc.

Change the Starting Point: Use a later starting point to give you a later rollover time. For example, you were thinking of using the Unix 1-Jan-1970 “epoch”, but your system deploys in 2020 or later, so you use 1-Jan-2020 instead. (Note that this is asking for trouble if you have to interact with 1970-based dates.)
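
To see what a later starting point buys, and what it costs when talking to 1970-based systems, here is a sketch assuming a signed 32-bit count of seconds. The constant and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
CUSTOM_EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)   # assumed new epoch
MAX_COUNT = 2**31 - 1                                      # signed 32-bit

def rollover(epoch):
    """First moment a signed 32-bit count of seconds cannot represent."""
    return epoch + timedelta(seconds=MAX_COUNT + 1)

# Offset needed whenever a value crosses between the two conventions
EPOCH_OFFSET = int((CUSTOM_EPOCH - UNIX_EPOCH).total_seconds())

def custom_to_unix(count):
    # Note: the Unix-side result needs more than 32 bits for dates after 2038.
    return count + EPOCH_OFFSET

def unix_to_custom(count):
    return count - EPOCH_OFFSET

print(rollover(UNIX_EPOCH))     # 2038-01-19 03:14:08+00:00
print(rollover(CUSTOM_EPOCH))   # 2088-01-19 03:14:08+00:00
print(EPOCH_OFFSET)             # 1577836800
```

Every boundary with a 1970-based system has to apply the offset in the right direction, which is where the trouble tends to come from.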

Timer Conversion Charts

“𝛑 seconds is a nanocentury.” – Tom Duff (from More Programming Pearls)

Since time is measured in a mix of base-10 and base-60 values, with a few 12s and 24s thrown in, it’s hard to relate a time that’s a power of 2 or power of 10 bigger. These charts help you do that: you can easily see that a 16-bit count of seconds is 18.2 hours, but a 32-bit count is 136.2 years.

For this table, I used Google’s time converter, where 1 day is 86400 sec, and 1 year is 365 days. I rounded most numbers. Verify any critical numbers before you use them.
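
If you need a value that isn’t in the charts, or want to double-check one, the entries can be recomputed. This sketch uses the same conventions (1 day = 86400 sec, 1 year = 365 days); the function name and the rule for picking a unit are my own choices for illustration, so the unit shown may differ from the chart’s.

```python
UNITS = (                       # (name, seconds per unit), largest first
    ("years", 365 * 86400),
    ("days", 86400),
    ("hours", 3600),
    ("min", 60),
    ("sec", 1),
)

def counter_range(bits, tick_seconds):
    """Describe how long a `bits`-bit counter lasts at the given tick,
    using the largest unit that gives a value of at least 1."""
    seconds = (2**bits) * tick_seconds
    for name, size in UNITS:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.2g} sec"   # sub-second ranges

print(counter_range(16, 1))       # 18.2 hours
print(counter_range(32, 1))       # 136.2 years
print(counter_range(32, 0.001))   # 49.7 days
print(counter_range(31, 0.01))    # 248.6 days
```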

Time Conversion Chart – ns to sec

bits	1 nsec	1 μsec	1 ms	1 sec
7	1.3e-7 sec	1.3e-4 sec	0.13 sec	2.1 min
8	2.6e-7 sec	2.6e-4 sec	0.26 sec	4.3 min
15	3.3e-5 sec	0.033 sec	32.8 sec	9.1 hours
16	6.6e-5 sec	0.066 sec	1.1 min	18.2 hours
23	8.4e-3 sec	8.4 sec	2.3 hours	97.1 days
24	1.7e-2 sec	16.8 sec	4.7 hours	194.2 days
31	2.1 sec	35.8 min	24.9 days	68.1 years
32	4.3 sec	71.6 min	49.7 days	136.2 years
63	292.4 years	2.92e5 years	2.92e8 years	2.92e11 years
64	584.9 years	5.85e5 years	5.85e8 years	5.85e11 years

Time Conversion Chart – ms to sec

bits	1 ms (0.001 sec)	0.01 sec	0.1 sec	1 sec
7	0.13 sec	1.3 sec	12.8 sec	2.1 min
8	0.26 sec	2.6 sec	25.6 sec	4.3 min
15	32.8 sec	5.5 min	54.6 min	9.1 hours
16	1.1 min	10.9 min	1.8 hours	18.2 hours
23	2.3 hours	23.3 hours	9.7 days	97.1 days
24	4.7 hours	1.9 days	19.4 days	194.2 days
31	24.9 days	248.6 days	6.8 years	68.1 years
32	49.7 days	1.4 years	13.6 years	136.2 years
63	2.92e8 years	2.92e9 years	2.92e10 years	2.92e11 years
64	5.85e8 years	5.85e9 years	5.85e10 years	5.85e11 years

Time Conversion Chart – Seconds to Years

bits	1 sec	1 min	1 day	1 year
7	2.1 min	2.1 hours	0.35 years	127 years
8	4.3 min	4.3 hours	0.7 years	255 years
15	9.1 hours	22.8 days	89.8 years	32,767 years
16	18.2 hours	45.5 days	179.5 years	65,535 years
23	97.1 days	16.0 years	22,982.5 years	8.4e6 years
24	194.2 days	31.9 years	45,965.0 years	1.7e7 years
31	68.1 years	4,085.8 years	5.9e6 years	2.1e9 years
32	136.2 years	8,171.6 years	1.2e7 years	4.3e9 years
63	2.92e11 years	1.75e13 years	2.5e16 years	9.2e18 years
64	5.85e11 years	3.51e13 years	5.05e16 years	1.8e19 years

Acknowledgments

Thanks to Lisa Crispin and Mat Bess for reviewing an earlier draft. (Errors and shortcomings are mine, of course.)

References

“Computer Hangs After 49.7 Days”, http://web.archive.org/web/20111224012719/http://support.microsoft.com/kb/216641. Retrieved 2020-08-17.
“Concurrent Reading and Writing of Clocks”, by Leslie Lamport. ACM Transactions on Computer Systems, Nov. 1990.
“Docket No. FAA-2015-0936”, https://s3.amazonaws.com/public-inspection.federalregister.gov/2015-10066.pdf. Retrieved 2020-08-17. “The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing that GCU to go into failsafe mode.”
“EASA AD No.: 2017-0129R1”, https://ad.easa.europa.eu/blob/EASA_AD_2017_0129_R1.pdf/AD_2017-0129R1_1. Retrieved 2020-08-17. “Prompted by in-service events where a loss of communication occurred between some avionics systems and avionics network, analysis has shown that this may occur after 149 hours of continuous aeroplane power-up.”
“Deep Impact (spacecraft)”, https://en.wikipedia.org/wiki/Deep_Impact_(spacecraft). Retrieved 2020-08-17.
More Programming Pearls, by Jon Bentley. ISBN 0201118890.
“Unix time”, https://en.wikipedia.org/wiki/Unix_time. Retrieved 2020-08-17.
“Why Windows 95 and Windows 98 would crash after 49.7 days of uptime”. Retrieved 2024-09-25.
“Year 2038 Problem”, https://en.wikipedia.org/wiki/Year_2038_problem. Retrieved 2020-08-17. Discusses both the Y2036 NTP problem and the Y2038 Unix problem.
