You're 4 microseconds late, Jetson! NTP at NIST Boulder has lost power

Thread Starter

nsaspook

Joined Aug 27, 2009
16,250
https://lists.nanog.org/archives/li...org/message/ACADD3NKOG2QRWZ56OSNNG7UIEKKTZXL/

Dear colleagues, In short, the atomic ensemble time scale at our Boulder campus has failed due to a prolonged utility power outage. One impact is that the Boulder Internet Time Services no longer have an accurate time reference. At time of writing the Boulder servers are still available due to a standby power generator, but I will attempt to disable them to avoid disseminating incorrect time.
 

WBahn

Joined Mar 31, 2012
32,703
I love the phrase, "I will attempt to disable them."

Gives an idea of the degree to which they have tried to make it very difficult to disable them!

Very unfortunate, but fortunately not a disaster, since the Boulder labs are only one of the ensemble clock operators around the world that coordinate to produce the official time standard (hence the name Coordinated Universal Time). That's the external reference being referred to.

When I was a co-op student working for NBS/NIST back in the late 80s, I dropped down to visit the Time and Frequency Division (I was in the Superconductor and Magnetic Measurements Group). It was a bit surreal -- it looked like any other mundane research lab, with seemingly nondescript equipment here and there. There were a few posters up on the wall talking about their work, but that was common everywhere. One of the guys gave me the nickel tour and pointed out the time standards they had (I think they had three at the time) and how they coordinated them with others around the world.

I wish I had been further along in my education at that point, because there are a lot of questions I would have loved to ask about issues such as latency that I just wasn't aware of at the time.
 

schmitt trigger

Joined Jul 12, 2010
2,027
Help me understand the sequence of events:
-High winds and other atmospheric disturbances caused a loss of utility power.
-The standby generators did start as planned. But then one failed.
-This one failure caused a glitch that has de-synced the clocks.

Am I understanding correctly?
 

WBahn

Joined Mar 31, 2012
32,703
Probably. I don't know the details of this specific incident, so I'm speculating here. Having said that, critical systems are designed with the possibility of various failures in mind. But there's no one-size-fits-all algorithm to determine what that entails. One of the key considerations is what the worst consequences are and how they can best be avoided. Remember that these atomic clocks, together, define what time is.

Let's say that there was only a single clock. You would go to great lengths to make sure that it took multiple things failing for it to have a hiccup. If a hiccup did occur, you'd probably rather accept a certain degree of error than a complete loss, make whatever corrections you could down the road, and accept whatever residual error remained -- that's about the best you can do.

But what if you had two clocks and the official time was the average of them? Now things are fundamentally different. Which would be worse: to give each clock the same failure design as if it were a single clock, meaning to keep it in play however possible despite the risk of errors that might result, or to have monitors on the clocks that would remove one from the time-keeping process if it even looked like it might not be running correctly? You would probably choose the latter and rely on the other clock to keep time on its own until the first clock could be repaired and brought back online. Better to run on a sole clock that is believed to be fine than to have the official time be the average of two clocks, one of which might not be running fine.

Now imagine that official time is established not by two clocks, but by a weighted average of between 400 and 500 clocks scattered across 85 national laboratories around the world. The accuracy and precision of the time keeping is such that a single clock could pollute the result well beyond the allowable limits, so it is standard practice to pull a clock from the ensemble any time there isn't a high degree of confidence that it is running optimally. Clocks are moving in and out of the ensemble all the time as they are upgraded or taken down for maintenance or whatever. So while you want to keep the clock running via backups and resiliency as much as possible, the fail-safe is to quickly remove the clock from the ensemble, deal with the issue, resync the clock, and move it back into the ensemble.
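To put rough numbers on why one bad clock matters, here is a minimal Python sketch of the idea. Everything in it is an assumption made up for illustration: the `ClockReading` type, the `ensemble_offset` function, the `healthy` flag standing in for whatever monitoring a lab really uses, and the weights and offsets themselves. It is not how the actual time scale is computed, only a toy weighted average with the "pull the suspect clock" fail-safe.

```python
from dataclasses import dataclass


@dataclass
class ClockReading:
    name: str          # hypothetical label for a clock
    offset_ns: float   # offset from a common reference, in nanoseconds
    weight: float      # relative weight in the average (invented values)
    healthy: bool      # set to False by a (hypothetical) monitor


def ensemble_offset(readings):
    """Weighted mean offset of the clocks currently trusted."""
    active = [r for r in readings if r.healthy and r.weight > 0]
    if not active:
        raise ValueError("no healthy clocks left in the ensemble")
    total_weight = sum(r.weight for r in active)
    return sum(r.offset_ns * r.weight for r in active) / total_weight


readings = [
    ClockReading("clock_a", 2.0, 1.0, True),
    ClockReading("clock_b", -1.0, 2.0, True),
    ClockReading("clock_c", 4000.0, 1.0, True),  # glitched, not yet flagged
]

# With the glitched clock still included, the average is dragged far off.
polluted = ensemble_offset(readings)

# The fail-safe: pull the suspect clock and average the rest.
readings[2].healthy = False
clean = ensemble_offset(readings)

print(f"with suspect clock:    {polluted:.1f} ns")
print(f"without suspect clock: {clean:.1f} ns")
```

With these invented numbers, leaving the glitched clock in moves the average by a full microsecond, while dropping it costs only the loss of one contributor. That asymmetry is the reason to err on the side of removal.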
 

schmitt trigger

Joined Jul 12, 2010
2,027
I investigated a little further….
Ever since the California wildfires that were caused by trees shorting out high-voltage transmission lines (a result of insufficient right-of-way maintenance), utilities have been imposing pre-emptive blackouts during high winds and dry conditions. That is apparently exactly what happened in Colorado.
 

WBahn

Joined Mar 31, 2012
32,703
Yep. The loss of main utility power was expected and planned for. It was the loss of one of the on-site emergency generators that caused them to pull the clocks from the ensemble.

One of my friends who lives up in that general area (a bit west, in the mountains) also lost power via an announced shutdown. But due to damage to the lines, it took three days for him to get power back, by far the longest he's been without power in the 30+ years he's been living there.
 

schmitt trigger

Joined Jul 12, 2010
2,027
This coming February is the fifth anniversary of the Texas “deep freeze”, which happened for all of the wrong reasons, collectively known as the Texas Way approach to utilities.

The TX government solution? Regulation to force utilities to prepare against inclement weather? A tie-in with the Eastern or Western grids, perhaps even with Mexico? Of course not, that isn’t being business friendly. And besides, we, in Texas, are rugged individuals who value our freedoms. We don’t tolerate communist regulations in the Lone Star State.

The solution was a weekend sales-tax holiday for Texans on emergency preparedness products, like generators. If you want to survive in Texas, you have to prepare for it yourself.
 