Winter is coming… actually, it has already come, bringing a constructor's nightmare. A sudden freeze caused unexpected, massive restarts of all field stations, all for the same hardware reason, one I had thought could not happen: the circuit boards were not prepared for really harsh outdoor conditions.
Two stations worked flawlessly through the winter of 2014/15. Four more identical stations were deployed in 2015 to complete the monitoring grid for the river basin under observation. All of them ran for months without a single intervention. Sporadic restarts, occurring once a month on average, were a prototype trade-off: they were induced stochastically by the tight timing of the hardware watchdog and by rare but still possible SD card glitches. Autumn and early winter were unusually warm, with temperatures around 10 °C (50 °F). The Arctic front on New Year's Eve, which cooled the air to -16 °C (about 3 °F) over a couple of days, was a shock for the human body as well as for electronic circuitry. The reset rate jumped from 6 resets per month (for all stations combined) to 40 restarts in one freezing week of January, only to drop back after the weather warmed up. I was really upset: hundreds of hours of work over the last two years, months of tests, and then such a nasty surprise.
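To put the jump in perspective, the two rates can be compared on a weekly basis. The snippet below is only a back-of-the-envelope check; the 30-day month is my assumption, the two counts come from the paragraph above.

```python
# Compare the background reset rate with the cold-week rate on a weekly basis.
DAYS_PER_MONTH = 30        # assumption: an average month
baseline_per_month = 6     # from the text: all stations combined
cold_week_resets = 40      # from the text: one freezing week in January

baseline_per_week = baseline_per_month / DAYS_PER_MONTH * 7
increase = cold_week_resets / baseline_per_week

print(f"baseline:  {baseline_per_week:.1f} resets/week")
print(f"cold week: {cold_week_resets} resets/week, about {increase:.0f}x the baseline")
```

In other words, the background rate of roughly 1.4 resets per week grew nearly thirtyfold during the cold snap.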
A review of the logs showed a repeating pattern of misbehavior around the GSM section, with problems in serial communication as well as in power control. When the system works as designed, the MCU periodically turns the GSM section on, transmits data and then turns it off again. The MCU controls power to the whole GSM block and its dedicated DC/DC converter through a MOSFET switch. The log excerpt below shows a typical failure.
 1  2016-01-02 18:03:51 INFO gsm No more requests for GSM. Shutting down.
 2  2016-01-02 18:03:51 DBG gsm SIM900 power off sequence.
 3  2016-01-02 18:03:57 WARN gsm SIM900 did not stop in 5 seconds.
 4  2016-01-02 18:03:57 INFO gsm SIM900 powered off.
 5  ...
 6  2016-01-04 12:01:20 DBG gsm GPRS context attached.
 7  2016-01-04 12:01:41 WARN gsm SIM900 does not respond, power cycle.
 8  2016-01-04 12:01:41 DBG gsm SIM900 power off sequence.
 9  2016-01-04 12:01:47 WARN gsm SIM900 did not stop in 5 seconds.
10  2016-01-04 12:01:47 INFO gsm SIM900 powered off.
11  2016-01-04 12:01:50 DBG gsm SIM900 power on sequence skipped: already powered on.
12  2016-01-04 12:01:56 ERR gsm SIM900 UART does not respond.
13  2016-01-04 12:01:56 WARN gsm Permanent problem, SIM900 extra power cycle.
14  2016-01-04 12:01:56 DBG gsm SIM900 power off sequence.
15  2016-01-04 12:02:01 INFO cron Scheduling now.
16  2016-01-04 12:02:01 DBG cron Scheduled 0 request(s), next check in 60 seconds.
17  2016-01-04 12:02:02 WARN gsm SIM900 did not stop in 5 seconds.
18  2016-01-04 12:02:02 INFO gsm SIM900 powered off.
19  2016-01-04 12:02:05 DBG gsm SIM900 power on sequence skipped: already powered on.
20  ...
21  2016-01-04 12:18:43 ERR wdog Task 'gsm' stalled. Rebooting.
Before the outright failure, when the temperature started diving below 0 °C, the log files had already caught the first symptoms. The SIM900 module has a PWRKEY line (pin 1) for power cycling and a STATUS line (pin 66) that reflects its current state. Line 3 of the log excerpt shows that the reported STATUS was incorrect, either because PWRKEY was misdriven or because STATUS was misread.
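The check behind that warning can be sketched roughly as follows. This is a hypothetical Python simulation, not the station firmware: `FakeModem`, `FakeClock`, `power_off` and the 0.1 s polling interval are names and values I made up for illustration. Only the 5-second timeout comes from the log, and cutting power through the MOSFET after the timeout is my reading of the "powered off" entries.

```python
class FakeClock:
    """Simulated time, so the sketch runs instantly."""
    def __init__(self):
        self.now = 0.0
    def time(self):
        return self.now
    def sleep(self, seconds):
        self.now += seconds

class FakeModem:
    """Hypothetical stand-in for the SIM900 and its MOSFET power switch."""
    def __init__(self, status_stuck_high):
        self.status_stuck_high = status_stuck_high
        self.status = True       # STATUS line: True means "module is on"
        self.mosfet_on = True
    def pulse_pwrkey(self):
        # A healthy module drops STATUS after the PWRKEY pulse.
        if not self.status_stuck_high:
            self.status = False
    def read_status(self):
        return self.status
    def cut_power(self):
        self.mosfet_on = False

def power_off(modem, clock, timeout_s=5.0, poll_s=0.1):
    """Return True if STATUS went low in time; power is cut either way."""
    modem.pulse_pwrkey()
    deadline = clock.time() + timeout_s
    stopped = False
    while clock.time() < deadline:
        if not modem.read_status():
            stopped = True
            break
        clock.sleep(poll_s)
    if not stopped:
        print(f"WARN gsm SIM900 did not stop in {timeout_s:.0f} seconds.")
    modem.cut_power()
    return stopped

healthy = FakeModem(status_stuck_high=False)
frozen = FakeModem(status_stuck_high=True)
healthy_ok = power_off(healthy, FakeClock())
frozen_ok = power_off(frozen, FakeClock())
```

The frozen case prints the same warning as line 3 of the log. Note that the sketch cannot tell a misdriven PWRKEY from a misread STATUS either; both look identical from the MCU side.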
Over time the problems accumulated. At some point UART communication became distorted and the MCU failed to parse the module's output (lines 6-7), which pushed the GSM part into a restart by turning it off and on again. The disturbing fact is that 3 seconds after the physical power-off (turning the MOSFET off) the GSM block still reported being powered on (line 11); that had never happened before for any reason other than a trivial firmware bug. As a consequence, the locked UART remained unresponsive (line 12), and another power-cycle attempt (lines 13-19) did not help either. Later the GSM task got blocked for an unknown reason, and the software watchdog task, which monitors the liveness of the other tasks over long periods, stepped in to restart the whole system (line 21).
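The telltale symptom, a module that claims to be on right after its power was cut, is easy to search for in the log files. Below is a minimal sketch of such a scan; the function name `find_stale_status` and the regular expression are my own, written against the line layout of the excerpt above (timestamp, level, task, message), and the sample entries are copied from it.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: "<date> <time> <LEVEL> <task> <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (\w+) (.*)$")

def find_stale_status(lines):
    """Return (timestamp, seconds since power-off) for every
    'already powered on' entry that directly follows 'powered off'."""
    findings = []
    last_off = None
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue                      # skip "..." and malformed lines
        stamp, _level, task, message = match.groups()
        if task != "gsm":
            continue                      # ignore other tasks, e.g. cron
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if message.endswith("powered off."):
            last_off = when
        elif "already powered on" in message and last_off is not None:
            findings.append((stamp, (when - last_off).total_seconds()))
            last_off = None
        else:
            last_off = None               # any other gsm entry breaks the pair
    return findings

sample = [
    "2016-01-04 12:01:41 DBG gsm SIM900 power off sequence.",
    "2016-01-04 12:01:47 WARN gsm SIM900 did not stop in 5 seconds.",
    "2016-01-04 12:01:47 INFO gsm SIM900 powered off.",
    "2016-01-04 12:01:50 DBG gsm SIM900 power on sequence skipped: already powered on.",
    "2016-01-04 12:01:56 ERR gsm SIM900 UART does not respond.",
    "2016-01-04 12:02:01 INFO cron Scheduling now.",
    "2016-01-04 12:02:02 INFO gsm SIM900 powered off.",
    "2016-01-04 12:02:05 DBG gsm SIM900 power on sequence skipped: already powered on.",
]

findings = find_stale_status(sample)
for stamp, gap in findings:
    print(f"{stamp}: still 'on' {gap:.0f} s after power-off")
```

On the sample it flags both occurrences, each 3 seconds after the power-off entry, which corresponds to lines 11 and 19 of the excerpt.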
The failure was obviously caused by hardware, and I immediately linked it to low temperature. At first I was not sure why it happened at all: a year earlier there had been days with similar conditions, and two stations (plus an extra one on my balcony) simply coped. The temperature drop was much faster this time, which made water condensation the more serious suspect. I rushed to my workbench to simulate the MOSFET misbehavior with wet fingers. But hold on: I was using a sealed case rated IP65 that had passed immersion tests at home, the cables pass through self-sealing glands of the kind used in the underwater probe, and desiccant bags were placed inside the case. Still, that might not be enough against humidity sucked in over hundreds of warming and cooling cycles outdoors.
To recreate the failure in a controlled environment I prepared a torture test: a logger case with plugged cable glands was cooled in a fridge for a while, then opened and held over the steam of boiling water to catch as much frosting moisture as possible, and then frozen again down to -18 °C. Within an hour I had a constantly restarting system in my lab.
The box was reopened, warmed and dried. I applied a single layer of conformal coating, using V-66 insulating varnish in a spray can. I had already used it when building the underwater probe. Coating the PCB required masking the SD card and SIM slots as well as the board connectors. Ten minutes later the fast-drying coating was ready and the torture test began again. Over 5 hours the frosty station power-cycled the GSM section 20 times, and the box restarted 4 times. A single layer of coating was a joke against such rapid condensation.
I moved on to another cycle of warming, drying and spraying: a second layer of coating, plus a third on top of the GSM section. Half a day later the system had again reported 4 restarts for the same reason. Imagine my frustration and racing thoughts: if coating did not work, then the prototype had some other serious construction flaw that rendered the whole work useless.
Browsing YouTube and some extra reading made me realize how thick the coating must be to cover the curved elements of the PCB landscape. The more complex the shape, the more meticulous the coating has to be. I gave it one more chance, literally drowning the boards in varnish so thick that it took 2 hours to dry. Surprisingly, the finish was still quite thin, visible only as shiny blobs around angled corners and cavities. The torture test began once more. After the first 8 hours the logger was working fine. To rule out a fluke I pushed the limits even further and applied steam once more. The logger worked for another 24 hours in the fridge without a single sign of a problem. This is not the end, though: later this year I will have to make a round of all the stations, replacing the boards one by one to apply conformal coating to all of them, which easily adds another 20 unplanned work-hours.
This was another hard lesson in engineering and humility. I have started to appreciate the cosy environments a software developer has: mostly virtual places ruled by logic and equipped with sophisticated testing and debugging tools. Going down to the hardware means an inevitable confrontation with a bumpy physical world full of uncertainty.