A fixed-size counter (triggered by some sort of pulse) is a common way to represent time.
When you use this approach, you generally have to answer three questions:
- What moment does “0” represent? (Sometimes called the “epoch”.)
- What time interval does adding 1 represent? (Are you counting milliseconds, seconds, or what?)
- How many bits are in the counter?
Answering those questions lets you answer several others. (For all of these, make sure your units line up.)
- Q: What’s the representation of the current time? A: (t – epoch) / interval
- Q: What time does this timer count represent? A: epoch + n * interval
- Q: When will the timer run out of room? A: epoch + max * interval (where max is the largest counter value)
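The three formulas above can be sketched in a few lines of Python. The function names (to_count, to_time, last_time) are made up for illustration, and the example assumes the Unix scheme described later in this article (a signed 32-bit count of seconds since 1 January 1970).

```python
from datetime import datetime, timedelta, timezone

def to_count(t, epoch, interval):
    """Counter value for time t: (t - epoch) / interval, rounded down."""
    return (t - epoch) // interval

def to_time(n, epoch, interval):
    """Time represented by counter value n: epoch + n * interval."""
    return epoch + n * interval

def last_time(epoch, interval, bits, signed=False):
    """Last representable time: epoch + max * interval."""
    max_count = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return to_time(max_count, epoch, interval)

# Assumed example: signed 32-bit seconds since 1 January 1970 (UTC).
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_SECOND = timedelta(seconds=1)

print(to_count(datetime(2020, 1, 1, tzinfo=timezone.utc), UNIX_EPOCH, ONE_SECOND))
# 1577836800
print(to_time(1_000_000_000, UNIX_EPOCH, ONE_SECOND))
# 2001-09-09 01:46:40+00:00
print(last_time(UNIX_EPOCH, ONE_SECOND, 32, signed=True))
# 2038-01-19 03:14:07+00:00
```

Python's datetime only reaches the year 9999, so this sketch is not suitable for 64-bit counts of seconds.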
Examples
There are a number of examples where overflow affected (or will affect) systems:
- Windows 95 and Windows 98 consistently crashed after 49.7 days of uptime (2^32 msec). (The API call GetTickCount() returns uptime as a 32-bit count of milliseconds.)
- Boeing 787s shut down after 248 days: 2^31 * 0.01 sec.
- Airbus A350s require a reboot every 149 hours. (I didn’t find a source saying exactly what the overflow was.)
- Deep Impact 2013: The Deep Impact space probe, used to study comet Tempel 1, was believed to have been lost because its timer counter overflowed after 2^32 * 0.1 sec. Counting from 1 January 2000, it ran out on 11 August 2013.
- Anonymous tracking system: It unintentionally called 16-bit time functions on Windows to measure the duration of a task, so it computed task durations incorrectly and crashed when run overnight.
- Y2036 Problem: NTP (Network Time Protocol, which lets computers synchronize their clocks) uses a 32-bit unsigned counter of seconds since 1 January 1900, and will run out in 2036.
- Y2038 Problem: Many Unix systems represent time as a signed 32-bit count of the number of seconds since 1 January 1970, and will run out on 19 January 2038.
- Y292,277,026,596 Problem: Systems that use the Unix scheme with a signed 64-bit count of seconds will overflow too, in the year 292,277,026,596. This sounds far away, but note that problems may occur in the preceding years; e.g., retirement calculations or long-term mortgages may need to project beyond that date.
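Several of these figures can be checked with a little arithmetic. The sketch below is only a sanity check: the helper name overflow_after is made up, and the 0.01 sec tick for the Boeing 787 is an assumption (it is the tick size that makes 2^31 ticks come out at about 248 days).

```python
from datetime import datetime, timedelta, timezone

def overflow_after(ticks, tick_seconds):
    """Time until a counter that can hold `ticks` distinct values wraps."""
    return timedelta(seconds=ticks * tick_seconds)

ONE_DAY = timedelta(days=1)

# Windows 95/98: 32-bit count of milliseconds.
windows = overflow_after(2 ** 32, 0.001)
# Boeing 787: 2^31 ticks, assuming a 0.01 sec tick.
boeing = overflow_after(2 ** 31, 0.01)
# Deep Impact: 32-bit count of 0.1 sec ticks since 1 January 2000.
deep_impact = datetime(2000, 1, 1, tzinfo=timezone.utc) + overflow_after(2 ** 32, 0.1)
# NTP: unsigned 32-bit count of seconds since 1 January 1900.
ntp = datetime(1900, 1, 1, tzinfo=timezone.utc) + overflow_after(2 ** 32, 1)

print(round(windows / ONE_DAY, 1))   # 49.7
print(round(boeing / ONE_DAY, 1))    # 248.6
print(deep_impact.date())            # 2013-08-11
print(ntp.date())                    # 2036-02-07
```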
Detecting Timer Overflow
How will you know a timer overflow occurred? It’s a tricky problem.
Before Runtime
- Analytic: When you set a starting point for your timer, a time interval, and a fixed counter size, you’ve effectively decided the maximum time you can represent. Can your system run into these limits?
- Hardware Review and Code Review: Systematic human review, from a time management perspective, may be able to identify whether a timer can be (or will be) properly managed, or at least identify that it is not.
- Tools: Compilers, linters, and special-purpose tools may be able to identify code that (potentially) doesn’t handle timer overflow.
Built-In Detection
Detection isn’t the whole story; you also need to address the problem.
- Exceptions: Some languages or hardware provide an exception when an overflow occurs for signed and/or unsigned values.
- Flags – Timer overflow, arithmetic overflow, or carry: Many processors can detect when incrementing or adding to a value pushes it out of range.
- Timer Overflow Interrupt: Some systems (e.g., Arduino) provide an interrupt when a timer overflows, where you can install a handler to detect and address it.
- Check for Wrapping: In some languages, you can tell when a timer has wrapped because the value decreases instead of increases. This may require that you don’t wait too long between checking values – if it could wrap twice, or wrap beyond the previous value, you won’t be able to tell.
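As a minimal sketch of the wrap check, assume a free-running 32-bit unsigned counter that is read often enough to wrap at most once between readings. The names has_wrapped and elapsed_ticks are hypothetical. Subtracting modulo the counter size gives the right elapsed time even across a wrap, which is often more useful than detecting the wrap itself.

```python
COUNTER_BITS = 32                      # assumed counter width
MASK = (1 << COUNTER_BITS) - 1

def has_wrapped(previous, current):
    """True if the counter appears to have wrapped between two readings."""
    return current < previous

def elapsed_ticks(start, now):
    """Ticks from start to now; correct if fewer than 2^32 ticks have passed."""
    return (now - start) & MASK

# One reading 16 ticks before the wrap, one reading 16 ticks after it.
start = 0xFFFFFFF0
now = 0x00000010
print(has_wrapped(start, now))    # True
print(elapsed_ticks(start, now))  # 32
```

If the counter could wrap more than once between readings, neither function can tell; that is the limitation described above.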
Run-Time Detection
- Soak Test: Run the system at load for a “long” period of time, hopefully long enough for any timer-related problems to reveal themselves.
- Debug Crashes, esp. “Cyclic” Crashes: When a failure or crash occurs “randomly”, it’s worth trying to see if it is really occurring a consistent amount of time after some event. For example, if the app consistently crashes 18.2 hours after it starts, you might look for a 16-bit counter of seconds.
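When a crash looks cyclic, one rough way to generate suspects is to compare the observed period against the wrap times of common counter sizes and tick lengths. The function name, the list of widths and ticks, and the 2% tolerance below are all assumptions chosen for illustration.

```python
def candidate_counters(period_seconds, tolerance=0.02):
    """List (bits, tick_seconds) pairs whose wrap time is near the period."""
    ticks = [1e-9, 1e-6, 1e-3, 1e-2, 1e-1, 1.0, 60.0]
    widths = [7, 8, 15, 16, 23, 24, 31, 32, 63, 64]
    matches = []
    for bits in widths:
        for tick in ticks:
            wrap = (2 ** bits) * tick  # seconds until the counter wraps
            if abs(wrap - period_seconds) <= tolerance * period_seconds:
                matches.append((bits, tick))
    return matches

print(candidate_counters(18.2 * 3600))   # [(16, 1.0)]
print(candidate_counters(49.7 * 86400))  # [(32, 0.001)]
```

A match is only a hint about where to look; it does not prove that such a counter exists in the system.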
Solutions
Timer Scaling: Some systems let you explicitly control the interval. For example, you can choose whether your timer has a one-millisecond or a one-second “tick”. You may be able to modify the design to deal with a larger tick (thus increasing the time before the timer runs out).
Timer Extension: Keep a second (or beyond) counter, incremented when the first overflows. Think of it as a “carry”. You still have to decide where this extra space lives and how it’s managed. And you may have extra synchronization problems. (See Lamport article in the references.)
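Here is a minimal, single-threaded sketch of timer extension. It assumes a 16-bit hardware counter that is polled at least once per wrap; the class name ExtendedCounter and its fields are hypothetical.

```python
class ExtendedCounter:
    """Extends a narrow counter by counting its wraps in software."""

    def __init__(self, bits=16):
        self.modulus = 1 << bits
        self.high = 0       # number of wraps seen so far (the "carry")
        self.last_low = 0   # most recent raw reading

    def update(self, low):
        """Feed in a raw reading; return the extended count."""
        if low < self.last_low:  # value went down, so assume exactly one wrap
            self.high += 1
        self.last_low = low
        return self.high * self.modulus + low

counter = ExtendedCounter(bits=16)
readings = [100, 40000, 65000, 200, 30000]  # wraps between 65000 and 200
print([counter.update(r) for r in readings])
# [100, 40000, 65000, 65736, 95536]
```

On real hardware the high and low parts are usually not read atomically, and an interrupt may update one of them between the two reads. That is where the synchronization problems come in, and this sketch does not address them.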
Timer Expansion: Get a bigger boat: use a counter with more bits. Note that this can cause legacy problems: input, storage, output, APIs, cross-system compatibility, etc.
Change the Starting Point: Use a later starting point to give you a later rollover time. For example, you were thinking of using Unix’s 1-Jan-1970 “epoch”, but your system deploys in 2020 or later, so you use 1-Jan-2020 instead. (Note that this is asking for trouble if you have to interact with 1970-based dates.)
Timer Conversion Charts
“𝛑 seconds is a nanocentury.” – Tom Duff (from More Programming Pearls)
Since time is measured in a mix of base-10 and base-60 units, with a few 12s and 24s thrown in, it’s hard to see how long a counter lasts when its range is a power of 2 and its tick is a power of 10. These charts help you do that: you can easily see that a 16-bit count of seconds is 18.2 hours, but a 32-bit count is 136.2 years.
For these tables, I used Google’s time converter, where 1 day is 86400 sec, and 1 year is 365 days. I rounded most numbers. Verify any critical numbers before you use them.
Time Conversion Chart – ns to sec
bits | 1 nsec | 1 μsec | 1 msec | 1 sec |
7 | 1.3e-7 sec | 1.3e-4 sec | 0.13 sec | 2.1 min |
8 | 2.6e-7 sec | 2.6e-4 sec | 0.26 sec | 4.3 min |
15 | 3.3e-5 sec | 0.033 sec | 32.8 sec | 9.1 hours |
16 | 6.6e-5 sec | 0.066 sec | 1.1 min | 18.2 hours |
23 | 8.4e-3 sec | 8.4 sec | 2.3 hours | 97.1 days |
24 | 1.7e-2 sec | 16.8 sec | 4.7 hours | 194.2 days |
31 | 2.1 sec | 35.8 min | 24.9 days | 68.1 years |
32 | 4.3 sec | 71.6 min | 49.7 days | 136.2 years |
63 | 292.5 years | 2.92e5 years | 2.92e8 years | 2.92e11 years |
64 | 584.9 years | 5.85e5 years | 5.85e8 years | 5.85e11 years |
Time Conversion Chart – ms to sec
bits | 0.001 sec (1 msec) | 0.01 sec | 0.1 sec | 1 sec |
7 | 0.13 sec | 1.3 sec | 12.8 sec | 2.1 min |
8 | 0.26 sec | 2.6 sec | 25.6 sec | 4.3 min |
15 | 32.8 sec | 5.5 min | 54.6 min | 9.1 hours |
16 | 1.1 min | 10.9 min | 1.8 hours | 18.2 hours |
23 | 2.3 hours | 23.3 hours | 9.7 days | 97.1 days |
24 | 4.7 hours | 1.9 days | 19.4 days | 194.2 days |
31 | 24.9 days | 248.6 days | 6.8 years | 68.1 years |
32 | 49.7 days | 1.4 years | 13.6 years | 136.2 years |
63 | 2.92e8 years | 2.92e9 years | 2.92e10 years | 2.92e11 years |
64 | 5.85e8 years | 5.85e9 years | 5.85e10 years | 5.85e11 years |
Time Conversion Chart – Seconds to Years
bits | 1 sec | 1 min | 1 day | 1 year |
7 | 2.1 min | 2.1 hours | 0.35 years | 128 years |
8 | 4.3 min | 4.3 hours | 0.7 years | 256 years |
15 | 9.1 hours | 22.8 days | 89.8 years | 32,768 years |
16 | 18.2 hours | 45.5 days | 179.5 years | 65,536 years |
23 | 97.1 days | 16.0 years | 22,982.5 years | 8.4e6 years |
24 | 194.2 days | 31.9 years | 45,965.0 years | 1.7e7 years |
31 | 68.1 years | 4,085.8 years | 5.9e6 years | 2.1e9 years |
32 | 136.2 years | 8,171.6 years | 1.2e7 years | 4.3e9 years |
63 | 2.92e11 years | 1.75e13 years | 2.5e16 years | 9.2e18 years |
64 | 5.85e11 years | 3.51e13 years | 5.05e16 years | 1.8e19 years |
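If you need a combination that isn’t in the charts, the entries are easy to recompute. The sketch below uses the same conventions (1 day is 86400 sec, 1 year is 365 days, range taken as 2^bits ticks). The function names and the rounding rules are my own choices, so its formatting will not always match the charts exactly.

```python
def friendly(seconds):
    """Format a duration using the largest unit that keeps the value >= 1."""
    units = [("years", 365 * 86400), ("days", 86400),
             ("hours", 3600), ("min", 60), ("sec", 1)]
    for name, size in units:
        if seconds >= size:
            value = seconds / size
            # One decimal place for ordinary values, scientific for huge ones.
            return f"{value:.1f} {name}" if value < 1e5 else f"{value:.2e} {name}"
    return f"{seconds:.1e} sec"  # less than one second

def wrap_time(bits, tick_seconds):
    """How long a counter of the given width lasts at the given tick size."""
    return friendly((2 ** bits) * tick_seconds)

print(wrap_time(16, 1))      # 18.2 hours
print(wrap_time(32, 1))      # 136.2 years
print(wrap_time(32, 0.001))  # 49.7 days
```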
Acknowledgments
Thanks to Lisa Crispin and Mat Bess for reviewing an earlier draft. (Errors and shortcomings are mine, of course.)
References
- “Computer Hangs After 49.7 Days”, http://web.archive.org/web/20111224012719/http://support.microsoft.com/kb/216641. Retrieved 2020-08-17.
- “Concurrent Reading and Writing of Clocks”, by Leslie Lamport. ACM Transactions on Computer Systems, Nov. 1990.
- “Docket No. FAA-2015-0936”, https://s3.amazonaws.com/public-inspection.federalregister.gov/2015-10066.pdf. Retrieved 2020-08-17. “The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing that GCU to go into failsafe mode.”
- “EASA AD No.: 2017-0129R1”, https://ad.easa.europa.eu/blob/EASA_AD_2017_0129_R1.pdf/AD_2017-0129R1_1. Retrieved 2020-08-17. “Prompted by in-service events where a loss of communication occurred between some avionics systems and avionics network, analysis has shown that this may occur after 149 hours of continuous aeroplane power-up.”
- “Deep Impact (spacecraft)”, https://en.wikipedia.org/wiki/Deep_Impact_(spacecraft). Retrieved 2020-08-17.
- More Programming Pearls, by Jon Bentley. ISBN 0201118890.
- “Unix time”, https://en.wikipedia.org/wiki/Unix_time. Retrieved 2020-08-17.
- “Why Windows 95 and Windows 98 would crash after 49.7 days of uptime”. Retrieved 2024-09-25.
- “Year 2038 Problem”, https://en.wikipedia.org/wiki/Year_2038_problem. Retrieved 2020-08-17. Discusses both the Y2036 NTP problem and the Y2038 Unix problem.