Outdoor device? Bundle up your PCB.

Winter is coming… actually, it has already come, bringing a constructor's nightmare. A sudden freezing cold caused unexpected and massive restarts of all field stations, all for the same hardware reason, one that seemed impossible. All that because the circuit boards were not suited up for really harsh outdoor conditions.

Two stations worked flawlessly during the 2014/15 winter. Four more identical stations were deployed in 2015 to complete the monitoring grid for the river basin under observation. All of them worked for months without a single intervention. Sporadic restarts, occurring once a month on average, were a prototype trade-off: induced stochastically by the tight timing of the hardware watchdog and by rare but still possible SD card glitches. Autumn and early winter were unusually warm, with temperatures around 10C (50F). The arctic front on New Year's Eve, which cooled the air down to -16C (about 3F) over a couple of days, was a shock for the human body as well as for electronic circuitry. The reset rate jumped from 6 resets per month (for all stations) to 40 restarts over one freezing cold week of January, only to disappear after the warm-up. I was really upset: hundreds of hours of work over the last two years, months of tests, and such a nasty surprise.
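To put the jump in perspective, both figures can be normalised to resets per week. This is only a back-of-the-envelope sketch; the 30-day month is an assumption, the two counts are the ones quoted above.

```python
# Normalise both reset counts to "resets per week" for the whole grid.
DAYS_PER_MONTH = 30        # assumption: average month length
baseline_per_month = 6     # all stations together, normal operation
cold_week_resets = 40      # all stations together, one freezing week

baseline_per_week = baseline_per_month * 7 / DAYS_PER_MONTH
increase = cold_week_resets / baseline_per_week

print(f"baseline:  {baseline_per_week:.1f} resets/week")
print(f"cold week: {cold_week_resets} resets/week, about {increase:.0f}x the baseline")
```

Under that assumption the baseline is roughly 1.4 resets per week, so the cold week was close to thirty times worse.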

The log review showed a repeating pattern of misbehavior around the GSM section, with problems in serial communication as well as in power control. When the system works as designed, the MCU periodically turns the GSM section on, transmits data and then turns GSM off. The MCU controls the power of the whole GSM block and its dedicated DC/DC converter through a MOSFET switch.

 1  2016-01-02 18:03:51	INFO	gsm    	No more requests for GSM. Shutting down.
 2  2016-01-02 18:03:51	DBG 	gsm    	SIM900 power off sequence.
 3  2016-01-02 18:03:57	WARN	gsm    	SIM900 did not stop in 5 seconds.
 4  2016-01-02 18:03:57	INFO	gsm    	SIM900 powered off.
 5  ...
 6  2016-01-04 12:01:20	DBG 	gsm    	GPRS context attached.
 7  2016-01-04 12:01:41	WARN	gsm    	SIM900 does not respond, power cycle.
 8  2016-01-04 12:01:41	DBG 	gsm    	SIM900 power off sequence.
 9  2016-01-04 12:01:47	WARN	gsm    	SIM900 did not stop in 5 seconds.
10  2016-01-04 12:01:47	INFO	gsm    	SIM900 powered off.
11  2016-01-04 12:01:50	DBG 	gsm    	SIM900 power on sequence skipped: already powered on.
12  2016-01-04 12:01:56	ERR 	gsm    	SIM900 UART does not respond.
13  2016-01-04 12:01:56	WARN	gsm    	Permanent problem, SIM900 extra power cycle.
14  2016-01-04 12:01:56	DBG 	gsm    	SIM900 power off sequence.
15  2016-01-04 12:02:01	INFO	cron   	Scheduling now.
16  2016-01-04 12:02:01	DBG 	cron   	Scheduled 0 request(s), next check in 60 seconds.
17  2016-01-04 12:02:02	WARN	gsm    	SIM900 did not stop in 5 seconds.
18  2016-01-04 12:02:02	INFO	gsm    	SIM900 powered off.
19  2016-01-04 12:02:05	DBG 	gsm    	SIM900 power on sequence skipped: already powered on.
20  ...
21  2016-01-04 12:18:43	ERR 	wdog   	Task 'gsm' stalled. Rebooting.
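Spotting such a pattern across several stations is easier with a small script than by eye. The sketch below is not the tool used for this review; it assumes the raw log files are tab-separated with four fields (timestamp, level, task, message), as the excerpt suggests, and the function names are made up for illustration.

```python
from collections import Counter

# A few raw lines in the same format as the excerpt; fields are tab-separated.
SAMPLE = (
    "2016-01-04 12:01:41\tWARN\tgsm    \tSIM900 does not respond, power cycle.\n"
    "2016-01-04 12:01:56\tERR \tgsm    \tSIM900 UART does not respond.\n"
    "2016-01-04 12:02:01\tINFO\tcron   \tScheduling now.\n"
    "...\n"
)

def parse_line(line):
    """Split one log line into (timestamp, level, task, message), or None."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 4:
        return None  # e.g. the "..." elision markers
    timestamp, level, task, message = (p.strip() for p in parts)
    return timestamp, level, task, message

def count_problems(text, task="gsm", levels=("WARN", "ERR")):
    """Count WARN/ERR messages of one task, grouped by message text."""
    counts = Counter()
    for line in text.splitlines():
        record = parse_line(line)
        if record and record[2] == task and record[1] in levels:
            counts[record[3]] += 1
    return counts

for message, n in count_problems(SAMPLE).most_common():
    print(n, message)
```

Grouping by message text makes a recurring complaint such as "did not stop in 5 seconds" stand out immediately.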

Well before the failure, when the temperature started diving below zero degrees Celsius, the log files caught the first symptoms. The SIM900 module has a PWRKEY line (pin 1) for power cycle manipulation and a STATUS line (pin 66) that reflects its current state. Line 3 in the log example shows that the STATUS reported back was incorrect, either because PWRKEY was misdriven or because STATUS was misread.
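The power-off step behind that warning can be pictured roughly as below. This is a simplified simulation, not the station firmware: `FakeModem`, `power_off` and the polling interval are hypothetical, and the model assumes that one PWRKEY pulse simply toggles the module state. Only the 5-second timeout comes from the log.

```python
import time

class FakeModem:
    """Hypothetical stand-in for the PWRKEY and STATUS lines."""
    def __init__(self, status_stuck_high=False):
        self.status_stuck_high = status_stuck_high
        self.powered = True

    def pulse_pwrkey(self):
        # Simplification: a healthy module toggles its state on each pulse.
        self.powered = not self.powered

    def read_status(self):
        # With leakage on the board the STATUS reading may stay high.
        return True if self.status_stuck_high else self.powered

def power_off(modem, timeout_s=5.0, poll_s=0.1,
              sleep=time.sleep, clock=time.monotonic):
    """Pulse PWRKEY, then wait for STATUS to drop. True means a clean stop."""
    modem.pulse_pwrkey()
    deadline = clock() + timeout_s
    while clock() < deadline:
        if not modem.read_status():
            return True
        sleep(poll_s)
    return False  # corresponds to "did not stop in 5 seconds"

# Short timeout here only to keep the demo quick.
print(power_off(FakeModem(), timeout_s=0.3, poll_s=0.05))
print(power_off(FakeModem(status_stuck_high=True), timeout_s=0.3, poll_s=0.05))
```

Note that the timeout branch cannot tell a misdriven PWRKEY from a misread STATUS, which is exactly the ambiguity described above.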

Over time the problems accumulated. At some point UART communication was distorted and the MCU failed to parse the output (lines 6-7), pushing the GSM part to restart by turning it off and on again. The disturbing fact is that 3 seconds after the physical power-off (turning the MOSFET off) the GSM block reported that it was still powered on (line 11); that had never happened before for any reason other than a trivial firmware bug. As a consequence the locked UART was still non-responsive (line 12) and another power-cycle attempt (lines 13-19) did not help either. Later on the GSM task got blocked for an unknown reason and the software watchdog task, responsible for monitoring the liveness of other tasks over long time periods, jumped in to restart the whole system (line 21).
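The idea of a software watchdog that monitors task liveness can be sketched as a table of heartbeats. Again this is an illustration only: the class name, the task limits and the heartbeat scheme are assumptions, not the actual firmware design.

```python
import time

class TaskWatchdog:
    """Tracks the last heartbeat of each task and reports stalled ones."""
    def __init__(self, limits_s, clock=time.monotonic):
        self.clock = clock
        self.limits_s = dict(limits_s)  # task name -> max allowed silence [s]
        now = clock()
        self.last_beat = {name: now for name in self.limits_s}

    def kick(self, task):
        """Called by a task whenever it makes progress."""
        self.last_beat[task] = self.clock()

    def stalled(self):
        """Return the names of tasks silent for longer than their limit."""
        now = self.clock()
        return [name for name, limit in self.limits_s.items()
                if now - self.last_beat[name] > limit]

# Hypothetical limits: gsm may stay silent for 15 minutes, cron for 2.
watchdog = TaskWatchdog({"gsm": 900, "cron": 120})
watchdog.kick("cron")
for name in watchdog.stalled():
    print(f"Task '{name}' stalled. Rebooting.")
```

In a real system the supervising loop would call `stalled()` periodically and trigger a reboot when the list is not empty, which is the behaviour seen in the last log line.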

Frost built from humid air

The failure was obviously caused by hardware and instantly linked to low temperature. At first I was not sure why it even happened; a year earlier there were days with similar conditions and two stations (plus one extra on my balcony) simply coped with them. The temperature drop was much faster this time, which made water condensation a more serious suspect. I rushed to my workbench to simulate the MOSFET misbehavior with my wet fingers. But hold on: I was using a sealed case rated IP65 that passed immersion tests at home, the cables are passed through self-sealing glands used in the underwater probe, and finally desiccant bags were placed inside the case. That might not be enough against humidity sucked in by hundreds of warming-cooling cycles in an outdoor environment.
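Why a fast temperature drop matters can be illustrated with the dew point, the temperature at which air of a given humidity starts to release water. The sketch below uses the Magnus approximation with commonly quoted coefficients for water; the 80% relative humidity is an assumed figure for illustration, not a measurement from the stations.

```python
import math

def dew_point_c(temp_c, rel_humidity_pct, b=17.62, c=243.12):
    """Approximate dew point in Celsius using the Magnus formula."""
    gamma = math.log(rel_humidity_pct / 100.0) + b * temp_c / (c + temp_c)
    return c * gamma / (b - gamma)

# Assumed example: air trapped in the case at 10C and 80% relative humidity.
print(f"dew point: {dew_point_c(10.0, 80.0):.1f} C")
```

With these assumed numbers moisture starts to settle once a surface cools only a few degrees below the 10C of a mild autumn day, long before the board gets anywhere near -16C. Below freezing the deposit forms as frost, for which the formula is only a rough guide.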

Almost invisible single layer coating

To recreate the failure in a controlled environment I prepared a torture test: the logger case with blinded cable glands was cooled for a while in a fridge, then opened up and placed over the steam of boiling water to catch as much frosting moisture as possible, and then frozen again down to -18C. Within an hour I had a constantly restarting system in my lab.

The box was reopened, warmed and dried. I applied a single layer of conformal coating using V-66 insulating varnish in spray, which I had already used when building the underwater probe. Covering the PCB required masking the SD card and SIM slots as well as the board connectors. Ten minutes later the fast-drying coating was ready and the torture test began again. Over 5 hours the frosty station power-cycled the GSM section 20 times, during which the box restarted 4 times. A single layer of coating was a joke against such rapid condensation.

I moved on to another cycle of warming, drying and spraying a 2nd coating layer, plus a 3rd layer on top of the GSM section. Half a day later the system again reported 4 restarts for the same reason. Imagine my frustration and racing thoughts: if coating does not work, then the prototype has some other serious construction flaw rendering the whole work useless.

Thick wet glossy coating

Dried glossy coating

Challenging cavities like the slot in the GSM module case

Browsing YouTube and some extra reading made me realize how thick the coating must be to cover the curved elements of the PCB landscape. The more complex the shape, the more meticulous the coating has to be. I gave it one more chance, literally sinking the boards in varnish so thick that it took 2 hours to dry. Surprisingly, the finish was still considerably thin and visible only as shiny blobs around angled corners and cavities. The torture test began once more. After the first 8 hours it was working fine. To rule out a lucky run I pushed the limits even further and applied steam once more. The logger worked fine for another 24 hours in the fridge without a single sign of a problem. It is not the end though: later this year I will have to make a round of all the stations, replacing the boards one by one to apply conformal coating to all of them, easily adding another unplanned 20 work-hours.

This was another hard lesson in engineering and humility. I started to appreciate the cosy environment a software developer has: mostly virtual places ruled by logic and equipped with sophisticated testing and debugging tools. Once you go down to the hardware, confrontation with the bumpy physical world full of uncertainty is inevitable.
