When your computer crashes and you get the dreaded blue screen or your smartphone freezes and you have to go through the time-consuming process of a reset, most likely you blame the manufacturer: Microsoft or Apple or Samsung. In many instances, however, these operational failures may be caused by the impact of electrically charged particles generated by cosmic rays that originate outside the solar system.
"This is a really big problem, but it is mostly invisible to the public," said Bharat Bhuva, professor of electrical engineering at Vanderbilt University, in a presentation on Friday, Feb. 17 at a session titled "Cloudy with a Chance of Solar Flares: Quantifying the Risk of Space Weather" at the annual meeting of the American Association for the Advancement of Science in Boston.
When cosmic rays traveling at fractions of the speed of light strike the Earth's atmosphere they create cascades of secondary particles including energetic neutrons, muons, pions and alpha particles. Millions of these particles strike your body each second. Despite their numbers, this subatomic torrent is imperceptible and has no known harmful effects on living organisms. However, a fraction of these particles carry enough energy to interfere with the operation of microelectronic circuitry. When they interact with integrated circuits, they may alter individual bits of data stored in memory. This is called a single-event upset or SEU.
Since it is difficult to know when and where these particles will strike and they do not do any physical damage, the malfunctions they cause are very difficult to characterize. As a result, determining the prevalence of SEUs is not easy or straightforward. "When you have a single bit flip, it could have any number of causes. It could be a software bug or a hardware flaw, for example. The only way you can determine that it is a single-event upset is by eliminating all the other possible causes," Bhuva explained.
There have been a number of incidents that illustrate how serious the problem can be, Bhuva reported. For example, in 2003 in the town of Schaerbeek, Belgium a bit flip in an electronic voting machine added 4,096 extra votes to one candidate. The error was only detected because it gave the candidate more votes than were possible and it was traced to a single bit flip in the machine's register. In 2008, the avionics system of a Qantus passenger jet flying from Singapore to Perth appeared to suffer from a single-event upset that caused the autopilot to disengage. As a result, the aircraft dove 690 feet in only 23 seconds, injuring about a third of the passengers seriously enough to cause the aircraft to divert to the nearest airstrip. In addition, there have been a number of unexplained glitches in airline computers - some of which experts feel must have been caused by SEUs - that have resulted in cancellation of hundreds of flights resulting in significant economic losses.
An analysis of SEU failure rates for consumer electronic devices performed by Ritesh Mastipuram and Edwin Wee at Cypress Semiconductor on a previous generation of technology shows how prevalent the problem may be. Their results were published in 2004 in Electronic Design News and provided the following estimates:
- A simple cell phone with 500 kilobytes of memory should only have one potential error every 28 years.
- A router farm like those used by Internet providers with only 25 gigabytes of memory may experience one potential networking error that interrupts their operation every 17 hours.
- A person flying in an airplane at 35,000 feet (where radiation levels are considerably higher than they are at sea level) who is working on a laptop with 500 kilobytes of memory may experience one potential error every five hours.
The 16-nanometer study was funded by a group of top microelectronics companies, including Altera, ARM, AMD, Broadcom, Cisco Systems, Marvell, MediaTek, Renesas, Qualcomm, Synopsys, and TSMC
"The semiconductor manufacturers are very concerned about this problem because it is getting more serious as the size of the transistors in computer chips shrink and the power and capacity of our digital systems increase," Bhuva said. "In addition, microelectronic circuits are everywhere and our society is becoming increasingly dependent on them."
To determine the rate of SEUs in 16-nanometer chips, the Vanderbilt researchers took samples of the integrated circuits to the Irradiation of Chips and Electronics (ICE) House at Los Alamos National Laboratory. There they exposed them to a neutron beam and analyzed how many SEUs the chips experienced. Experts measure the failure rate of microelectronic circuits in a unit called a FIT, which stands for failure in time. One FIT is one failure per transistor in one billion hours of operation. That may seem infinitesimal but it adds up extremely quickly with billions of transistors in many of our devices and billions of electronic systems in use today (the number of smartphones alone is in the billions). Most electronic components have failure rates measured in 100's and 1,000's of FITs.
"Our study confirms that this is a serious and growing problem," said Bhuva. "This did not come as a surprise. Through our research on radiation effects on electronic circuits developed for military and space applications, we have been anticipating such effects on electronic systems operating in the terrestrial environment."
Although the details of the Vanderbilt studies are proprietary, Bhuva described the general trend that they have found in the last three generations of integrated circuit technology: 28-nanometer, 20-nanometer and 16-nanometer.
As transistor sizes have shrunk, they have required less and less electrical charge to represent a logical bit. So the likelihood that one bit will "flip" from 0 to 1 (or 1 to 0) when struck by an energetic particle has been increasing. This has been partially offset by the fact that as the transistors have gotten smaller they have become smaller targets so the rate at which they are struck has decreased.
More significantly, the current generation of 16-nanometer circuits have a 3D architecture that replaced the previous 2D architecture and has proven to be significantly less susceptible to SEUs. Although this improvement has been offset by the increase in the number of transistors in each chip, the failure rate at the chip level has also dropped slightly. However, the increase in the total number of transistors being used in new electronic systems has meant that the SEU failure rate at the device level has continued to rise.
Unfortunately, it is not practical to simply shield microelectronics from these energetic particles. For example, it would take more than 10 feet of concrete to keep a circuit from being zapped by energetic neutrons. However, there are ways to design computer chips to dramatically reduce their vulnerability.
For cases where reliability is absolutely critical, you can simply design the processors in triplicate and have them vote. Bhuva pointed out: "The probability that SEUs will occur in two of the circuits at the same time is vanishingly small. So if two circuits produce the same result it should be correct." This is the approach that NASA used to maximize the reliability of spacecraft computer systems.
The good news, Bhuva said, is that the aviation, medical equipment, IT, transportation, communications, financial and power industries are all aware of the problem and are taking steps to address it. "It is only the consumer electronics sector that has been lagging behind in addressing this problem."
The engineer's bottom line: "This is a major problem for industry and engineers, but it isn't something that members of the general public need to worry much about."