There’s a BIG problem with how we test electronics and energy storage…

And it slows production for EVERY device you can imagine.

Madhav Malhotra
Voltx
19 min read · Apr 4, 2021

I’m a researcher from Voltx: an AI product to reduce lifetime testing time for supercapacitors and batteries by 85–96%. On the side, we’re writing this article to explain testing issues for lasers, LEDs, photovoltaics, optical coatings, and integrated circuit chips!

Feel free to email the editor of this article (Alishba Imran) or Shagun Maheshwari with any questions.

Have you had an old phone where the battery somehow drops from 80% to 10%…?

The swollen phone battery… a dreaded sight in all too many homes. 😭 (CC-BY-2.0)

Why can’t you buy a magical new phone with magically better batteries to make that problem magically disappear???

Well, you can… we already have magically better batteries to fix this problem! ‘Solid-state’ batteries from companies like Solid Power can hold 2x as much electricity as current batteries in phones and laptops. AND they’re safer and last longer too.

What’s the catch? 😕

Well, we have magically better batteries, but you can’t BUY them. At least, not for a few years. That’s because of the big problem with how we make electronics: it usually takes anywhere from 3 months to multiple years to test them! As I’ll soon explain, electronics have testing issues at EVERY step of their development (from the initial design to maintenance while in use).

And it’s not just a problem for your old phone batteries. Incredibly important innovations are being slowed down by testing: solar panels for clean energy, LiDAR for autonomous vehicles, lasers for fibre-optic networks… EVERY SINGLE electronic device you can IMAGINE has this issue.

So can we get to the technical stuff already?

Sure! 😄 I’m going to walk you through these key lessons:

  1. What are ‘electronic tests’ anyways?
  2. What (specifically) do tests measure?
  3. Why is ‘EVERY’ electronic test slow?
  4. What are current ways to speed up testing?

#1 What are ‘electronic tests’ anyways?

There are some common categories of testing for all electronic industries. One of the longest types of testing is reliability testing. I’ll summarise a few major reliability tests throughout the device’s life: from the beginning of designing an electronic to maintaining it when it’s in use.

Overview of the tests I’ll talk about in a product’s life (Source: Madhav Malhotra)

Reliability enhancement tests: these tests happen when electronics are still being designed. The goal is to find the maximum limits of stress (ex. vibrations, heat, current) that will break a product design. Then, engineers can fix the most common reasons for failure.

For example, hard drives (electronics that store data on older computers) have a VERY tiny ‘head’ that reads and stores data on a magnetic disc (details here). It can be as small as a flake of pepper and is suspended over a disc rotating at 130 km/h! That’s why ANY vibrations can damage this ‘head.’ So when engineers try to design better hard drives, they should concentrate on making the head safer.

Engineers have to make this tiny part (as small as a pepper flake) survive for years. (CC-BY-2.0)

Accelerated lifetime test: These tests are done before producing a new electronic product for sale on markets. Their purpose is to find the specifications to market the product with (specifically, the product’s warranty and lifetime). But some electronics last a REALLY long time! An LED bulb can last over 10,000 hours. Manufacturers don’t want to wait that long before they find out how to market their product’s lifetime! 😱

So instead, electronics (ex. LEDs) are put in stressful environments (ex. 85° C and/or 100% humidity 🔥) so they start to break down faster. Regardless of the specific device, you usually set the temperature, humidity, current (how many electrons flow), and/or voltage (controls how fast electrons flow) higher.

With this ‘accelerated’ degradation, manufacturers might only have to wait a few thousand hours instead of tens of thousands of hours to find their products’ lifetime. Keep in mind… even 1000 hours is a REALLY long time (42 days of running a device — without pause).

Burn-in/screening tests: These tests happen during the production process. Their purpose is to find units of a product that have manufacturing defects (ex. cracks, wires not soldered tightly, exposure to contaminants, etc.). They do this by setting a ‘challenge’ for all units by forcing them to survive tough conditions for a short amount of time. 💪 The theory is that units with defects will degrade in this short amount of time, so they can be separated from functional devices.

For example, consider semiconductor manufacturing for integrated circuit chips (fancy words for the devices in your computer that do all the calculations). Every chip can have BILLIONS of transistors (switches to control where the electricity goes). There’s no way you can test all of them! 😰 So manufacturers will bake the chips at 130° C and 85% humidity for 4 days. By then, all the defective chips would have failed. So any quality-checks later won’t pass chips that are functioning now but will break in just a few days.

Intel’s chip fabrication plant in Arizona. The orange containers move the chips around isolated from the environment to minimise contamination/defects. (Public Domain)

Acceptance Test: These tests happen right before the electronics are installed for use. Their purpose is to make sure that products delivered match the standards they’re supposed to. This is more important for larger electronics (like a factory machine) than consumer electronics. For example, buyers of an optical coating for laser mirrors/lenses might check that the coating isn’t damaged by chemical solvents and that it protects the mirrors even if you try to scratch it with sandpaper.

Maintenance/Field Testing: These tests happen after electronics are deployed in the field. Their purpose is to check electronics’ quality and remaining useful life. Remaining useful life is NOT the same as lifetime. Lifetime means: “My phone battery will last 3 years.” Remaining useful life means: “I’ve used my battery for a year and now it has 2 years left to last.”

This is a LOT harder to predict.

Why? Because the real world is messy! You don’t just have a nice simple lab with exactly 50° C of heat, no weather changes, and no humans dropping devices in a toilet… 😅 It’s hard to account for the probability of ANY of those issues happening SOMEtime in the next year. That’s why startups exist just to predict the remaining useful life of important electronics (like electric car batteries).

Here’s the recap again if you got lost:

Don’t worry, no pop quizzes later if they’re hard to remember 😁 (Source: Madhav Malhotra)

#2 Okay, what (specifically) do all these tests measure???

Good question! If you were looking for patterns above, you might have noticed how a LOT of tests (for everything from hard drives to solar panels to lasers) involve heating the devices up. Heat is one of the most common reasons for electronics to degrade. Why is that?

Basically, modern electronics are made with very SPECIFIC chemistry. All the atoms have to be in just the right place (and the places have weird names, like ‘p-n junction’ 😕). If you increase temperature, atoms move about more and more. Eventually, they don’t stay in the places we want and in the order we want… so the electronic breaks. That’s why temperature is part of so many electronic tests.

🔑 Oversimplified: for every 10° C increase in temperature, an electronic breaks twice as fast.
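
To put a number on that rule of thumb, here's a minimal sketch. It assumes the oversimplified "doubles every 10° C" rule holds exactly, and the 25° C and 85° C temperatures are made-up example values (the function name is hypothetical too).

```python
def rule_of_thumb_speedup(delta_temp_c: float, doubling_step_c: float = 10.0) -> float:
    """Return how many times faster a device degrades after a temperature increase.

    Assumes the oversimplified rule of thumb: the degradation rate doubles
    for every `doubling_step_c` degrees Celsius of extra temperature.
    """
    return 2.0 ** (delta_temp_c / doubling_step_c)


# Hypothetical example: a device that normally sits at 25 C is tested at 85 C.
speedup = rule_of_thumb_speedup(85 - 25)
print(f"Degradation is roughly {speedup:.0f}x faster")  # 2**6 = 64
```

So under this rule, a 60° C hotter test would squeeze about 64 hours of normal aging into every hour of testing. Real devices rarely follow the rule this neatly, which is why more careful models come up later.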

Other common variables that can affect the lifetime of many types of electronics are the current and voltage they receive. Quick reminder: current is the flow of electrons, resistance slows down the flow, and voltage speeds up the flow.

A common analogy of voltage, resistance, and current on the Internet. (Source: Unknown)

Alongside these basic variables measured, there are some that are more specific to the electronic being measured.

  • Sometimes, they’re based on the above. For example, engineers testing laser diodes measure the change in threshold current, which is still a measurement of current. But it’s specifically the minimum current the laser needs before it can operate. As the laser degrades, more and more current is needed before it can operate, so its threshold current increases. This can be monitored as a sign of aging.
  • Sometimes, they’re unique to individual electronics. For instance, battery tests measure a variable called self-discharge as the battery ages, which is when a battery loses energy without being plugged into anything. Over time, this decreases the amount of energy the battery has stored. This can also be tracked as a sign of aging.

But behind these variables is a LOT of complicated physics. 😵 It has to do with two key terms: failure modes and failure mechanisms.

  • Failure modes are events that break an electronic.
  • Failure mechanisms are the causes behind those events.
  • Physics variables are how we tell when a failure mode occurs.

🔑 Every electronic has UNIQUE failure modes, though we detect them using SIMILAR variables.

For example, take a power cable that transmits electricity underwater. Because of the water, the cable would undergo corrosion (failure mechanism). Eventually, there would be so much corrosion that the cable would snap (failure mode). And we would detect the cable snapping when current is no longer flowing through the wire (variable).

This underwater power cable has corroded metal. You can imagine how inconvenient this is to monitor. 😮 (Public Domain)

Though we can use current to monitor failure modes of power cables just like we can use it for laser diodes, there are VERY different failure modes for the two electronics. This is the in-depth explanation of why it’s so complicated to perform field testing for the remaining useful life of electronics. For every electronic you want to test, you have to consider ALL the environmental conditions that could trigger ANY of the unique failure modes for that electronic. And you have to simulate the physics of the situation to understand when the failure mode might be triggered! 😰

So what do electronic tests measure? Well, many DIFFERENT failure mechanisms cause damage until many DIFFERENT failure modes break the electronics, and we can measure this happening with SIMILAR variables.

#3 Okay, and why is EVERY electronic test slow?

Given all these differences, you’d think that the problem of diving underwater to test submarine power cables HAS to be different than just checking if your phone battery is dying more quickly. That’s true — for field/maintenance testing. It certainly is a LOT harder to monitor cables on the ocean floor, which is why they’re made to last for decades without needing maintenance.

It’s the same for reliability enhancement tests and other tests done at the design stage of products; you have different questions to answer if you want to make a battery last longer vs. a power cable.

But slow tests affect EVERY electronic you can imagine for accelerated lifetime testing and burn-in testing. As I explained earlier, these tests are often done by manipulating the exact same variables: you set the temperature, humidity, current, and/or voltage higher and then WAIT…

…and eventually, you’ll check a specific variable (this part is different for different electronics) to see how performance decreased over time.

To make it less theoretical, here’s an example of EXACTLY what slow ‘things’ engineers are doing when designing an electronic. First, I’ll talk about lifetime testing.

Let’s say you’re a brilliant ✨shiny✨ laser that’s about to be tested to see if you work! I’ll just put you in a 75°C oven (with all your other laser friends) and see how long it takes you to break… 😶

Here’s what I’d see:

This figure and data are not my work. (Source: Lawrence A. Johnson, ILX Lightwave, 2006).

This is a curve like many others in electronics testing. It shows how variable X (in this case current) changes as time goes on and a device ages. Basically, as you shine longer, you age more and get tired. You need more fuel (current) to keep you going. Eventually, you’ll need so much more current (in this case, 20% more than normal) that you just won’t meet my laser shining needs… 😕

At that time, you’re officially a non-functional laser and I’ll record your lifetime as the time taken until your current is 20% more than normal.
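
Here's a rough sketch of how that "20% more current" rule could be applied to a list of readings. The function name and the readings are made up for illustration, and it assumes readings are sorted by time and that the first reading is the laser's "normal" current.

```python
from typing import Optional, Sequence


def lifetime_from_readings(
    hours: Sequence[float],
    currents_ma: Sequence[float],
    max_increase: float = 0.20,
) -> Optional[float]:
    """Return the hour at which current first reaches (1 + max_increase) x its start value.

    Linearly interpolates between the two readings on either side of the limit.
    Returns None if the limit is never reached during the test.
    """
    limit = currents_ma[0] * (1.0 + max_increase)
    for i in range(1, len(hours)):
        if currents_ma[i] >= limit:
            t0, t1 = hours[i - 1], hours[i]
            c0, c1 = currents_ma[i - 1], currents_ma[i]
            return t0 + (limit - c0) * (t1 - t0) / (c1 - c0)
    return None


# Made-up readings (mA) for two hypothetical lasers in the same oven.
hours = [0, 500, 1000, 1500]
fast_laser = [50, 54, 58, 62]        # crosses the 60 mA limit during the test
slow_laser = [50, 50.5, 51, 51.5]    # never crosses it, so no lifetime is recorded

print(lifetime_from_readings(hours, fast_laser))
print(lifetime_from_readings(hours, slow_laser))
```

The second laser returns None, which is exactly the headache described next: the test ended before it "failed".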

But what about those maroon, purple, and pink lasers whose current increases really slowly??? Even after waiting 1000+ hours, we don’t know how long it’ll take for their current to increase by 20%. This is the key problem slowing down lifetime tests:

🔑 Some lasers have currents increase VERY slowly. We’d have to wait thousands of hours before their current increased 20%!

But luckily, engineers have another alternative to waiting this long. You can see how the laser currents increase along pretty similar, steady trends AFTER 100 hours. So we can just extrapolate a line of best fit forward without actually having to wait thousands of hours for the laser to fail!

(Like this — but with higher accuracy and higher resolution 😁)
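
A minimal sketch of that extrapolation, assuming the current rises in a roughly straight line after the first 100 hours. The readings are made up, and the function names are hypothetical rather than taken from any real test software.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolated_lifetime(
    hours: Sequence[float],
    currents_ma: Sequence[float],
    max_increase: float = 0.20,
    settle_hours: float = 100.0,
) -> Optional[float]:
    """Predict the hour at which current reaches the failure limit.

    Only readings taken at or after `settle_hours` are used for the fit,
    since the first stretch of the test behaves differently.
    """
    late = [(t, c) for t, c in zip(hours, currents_ma) if t >= settle_hours]
    slope, intercept = fit_line([t for t, _ in late], [c for _, c in late])
    if slope <= 0:
        return None  # current is not rising, so no failure time can be predicted
    limit = currents_ma[0] * (1.0 + max_increase)
    return (limit - intercept) / slope


# Made-up readings (mA) for a slowly degrading laser, stopped at 1000 hours.
hours = [0, 50, 100, 400, 700, 1000]
currents = [50.0, 50.6, 51.0, 51.3, 51.6, 51.9]
print(f"Predicted lifetime: {extrapolated_lifetime(hours, currents):.0f} hours")
```

With these invented numbers, 1000 hours of data gives a prediction several times further out than the test itself. The catch is that the prediction is only as good as the straight-line assumption.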

Still, remember this test is being done in a 75°C oven (‘burn-in chamber’). Lasers degrade a lot faster here than in everyday applications. This is where we use tools called Arrhenius equations to figure out what the laser lifetime would be at regular temperatures, given what we know about their lifetime at high temperatures.

Warning… CHEMISTRY ahead😭

Remember this monstrosity from high school??? 😈

If you haven’t seen the Arrhenius Equation since your high school chemistry class, fear not! Here’s the oversimplified version:

  • k is how fast atoms move. T is the temperature.
  • k is equal to a complicated mess that we don’t care about. We just need to know k (how fast atoms move) increases as T (temperature) increases.
  • SO the complicated mess allows us to model how fast atoms move at different temperatures (among other things… probably 😁).
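
For reference, the standard textbook form of the Arrhenius equation is written out below. Everything besides k and T is the ‘complicated mess’: A is a constant specific to the process, E_a is the activation energy (the energy barrier atoms have to get over), and R is the gas constant. T has to be the absolute temperature in kelvin.

```latex
k = A \, e^{-\frac{E_a}{R T}}
```

As T gets bigger, the fraction E_a / (R T) gets smaller, so the exponent gets less negative and k goes up.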

And remember what I said in part #2! The faster atoms move, the faster electronics degrade. So, the Arrhenius equation can be used to model how long it will take for electronics to degrade at different temperatures:

Life is inversely proportional to how fast atoms move, which is complicated to calculate.

With this information, we can plug different temperatures into the complicated part and see how it changes the lifetime! We’ll get a graph like this:

These two points can tell us EVERYTHING… and they don’t even mean anything! I made up the data 😁 (Source: Madhav Malhotra)

Keep in mind this assumes that you don’t compare SUCH different temperatures that they cause different failure modes (ex. comparing 75°C from our test to 1000°C temperature that causes the laser to melt). So basically, we’re just using statistics to find equations of lines and points on those lines to predict lifetime!
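
Here's a rough sketch of how that temperature conversion could look in code. It uses the common per-particle form of the Arrhenius relationship (Boltzmann constant with an activation energy in electron-volts, which is equivalent to using the gas constant with energy per mole). The 0.7 eV activation energy, the 25° C use temperature, and the 2000-hour oven lifetime are assumptions made up for the example. In practice the activation energy has to be measured for each failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV per kelvin


def arrhenius_acceleration(use_temp_c: float, stress_temp_c: float,
                           activation_energy_ev: float) -> float:
    """How many times faster a device ages at stress_temp_c than at use_temp_c."""
    use_k = use_temp_c + 273.15        # Arrhenius needs absolute temperature
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)


def lifetime_at_use_temp(stress_lifetime_hours: float, use_temp_c: float,
                         stress_temp_c: float, activation_energy_ev: float) -> float:
    """Scale a lifetime measured in the oven back to the everyday temperature."""
    factor = arrhenius_acceleration(use_temp_c, stress_temp_c, activation_energy_ev)
    return stress_lifetime_hours * factor


# Assumed values: 75 C oven, 25 C everyday use, 0.7 eV activation energy,
# and a made-up lifetime of 2000 hours measured in the oven.
factor = arrhenius_acceleration(25.0, 75.0, 0.7)
print(f"Acceleration factor: {factor:.1f}x")
print(f"Estimated everyday lifetime: {lifetime_at_use_temp(2000.0, 25.0, 75.0, 0.7):.0f} hours")
```

This only holds if the same failure mode dominates at both temperatures, which is the caveat above.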

Now, what happens during laser testing BEFORE 100 hours pass? That’s where burn-in testing happens (ie. every laser is run for 100 hours to find any defects). That’s because many failure modes for electronics happen ONLY in that first little while of testing. (It’s like you ONLY get chickenpox once — usually when you’re young. 😷) But for lasers, it’s more like the atoms in the crystal structure ‘dislocate/diffuse’ to the wrong places when they’re young. 😕

The problem with that is that it’s pretty hard to predict all those specific one-time failure modes. Burn-in testing doesn’t have a simple line with a constant risk of failure to extrapolate like for lifetime testing. That’s why statistical models aren’t as widely used to accelerate burn-in testing. Usually, burn-in tests have less optimisation in general (though I’ll talk about SOME in the next section).

#4 What are current ways to speed up testing?

Approaches to speed up both burn-in and accelerated lifetime testing are similar for the most part. Usually, they rely on changing the test conditions, so the electronics degrade faster. For example, you might run a burn-in test for an integrated circuit chip (computer chip) at 130°C instead of 110°C. Or, you might run a lifetime test where you have high temperature AND high humidity to make the electronic degrade faster ♨️ You’ll find a bunch of names for these kinds of tests in the industry:

  • Highly Accelerated Stress Test (HAST)
  • Highly Accelerated Lifetime Test (HALT)
  • High-Temperature Operating Life (HTOL) Test
  • Accelerated Lifetime Test
  • _________ & _________ Cycling (ex. Power and Temperature Cycling)

All these tests (REGARDLESS of the specific electronic device) follow the same principle: add harsh conditions to make the electronic degrade faster. Worse conditions = faster degradation = cheaper test. The issue is figuring out how to balance the test accuracy, how long the test takes, and making sure not to create unrealistic conditions where test results don’t apply to reality. How do you do that?? VERY complicated physics simulations 😵 (example for testing solar panel coatings).

This complexity is why testing standards organisations exist, like the International Organization for Standardization (ISO), the International Electrotechnical Commission (IEC), the Deutsches Institut für Normung (DIN), and SOOOO many more! These organisations have hundreds of committees of experts with decades of industry experience who decide the specific test requirements for any electronic part (from batteries to supercapacitors to hard drives and beyond).

🔑 It’s complicated + unscalable to change test standards to speed up testing.

Keep in mind that these standards aren’t regulations. Manufacturers just use them to show their products’ quality. But if your test saves enough money, you CAN still build new solutions that modify the standards. The smart way to modify them is not to come up with your own standards. Instead, you want to use the regular procedure with the regular variables measured, but create some sort of statistical/simulation-based tool to predict the test outcome early. These types of tools are called ‘prognostics’.

There are LOTS of physics simulations (‘physics of failure’ models) to predict outcomes for every type of electronic you can imagine. What holds back physics simulations is that they’re EXTREMELY specific: both to one type of product (ex. batteries) and to specific variables within a product test (ex. using ‘self-discharge’ for batteries specifically). If you did manage to simulate ALL the complex failure modes of a product, you would get what is called a ‘bathtub curve’.

Those are a lot of lines… (Public Domain)

To understand this graph, you need to know that each of the failure rates (‘infant mortality’, ‘constant’ or random, and ‘wearout’) corresponds to a specific type of failure mode (ie. a specific kind of event going wrong with a product).

  • “Infant mortality” failures happen due to defects in production (ex. an integrated circuit chip doesn’t conduct electricity well because it has silicon impurities). These failures are high in the beginning when burn-in testing finds defective products, but are low after that for functional products.
  • “Random” failures happen due to unpredictable failure modes. For example, lasers require very specific currents to function. If the laser’s power supply unexpectedly has a surge in current, the laser will overheat and start to melt. The risk of a failure mode like this is always the same (underlying in the background).
  • “Wear out” failures have to do with the product’s aging as it degrades. This is why they increase as time goes on. For example, as solar panels are left outside and exposed to weather over time, they accumulate tiny cracks that build up into larger and larger ones. Over time, these cracks present a greater risk of the solar panel failing as they accumulate.
  • And finally, IF you can model all the right failure rates with all the right failure modes, you can predict the ‘observed’ failure rate which combines all of this. This is very complicated due to all the possible ways for a product to fail at any given time.
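
To see how the three pieces add up to a bathtub shape, here's a toy sketch. It models the falling and rising parts with Weibull failure rates (a common textbook choice, not something taken from a specific product), and every number in it is made up.

```python
def weibull_hazard(t_hours: float, shape: float, scale_hours: float) -> float:
    """Instantaneous failure rate of a Weibull distribution at t_hours (must be > 0).

    shape < 1 gives a falling rate, shape > 1 gives a rising rate.
    """
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1.0)


def bathtub_failure_rate(t_hours: float) -> float:
    """Sum of three made-up failure rates (failures per hour)."""
    infant_mortality = weibull_hazard(t_hours, shape=0.5, scale_hours=20000.0)  # falls over time
    random_failures = 2e-5                                                     # stays constant
    wear_out = weibull_hazard(t_hours, shape=5.0, scale_hours=10000.0)          # rises over time
    return infant_mortality + random_failures + wear_out


# The observed rate starts high, dips in the middle, then climbs again.
for t in (10, 2000, 12000):
    print(f"{t:>6} hours: {bathtub_failure_rate(t):.2e} failures per hour")
```

The hard part in real life is not the addition. It's finding the right curve and the right numbers for every single failure mode.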

🔑 It’s complicated + unscalable to model physics to predict test outcomes.

The other type of prognostic is based on statistical simulations. Instead of using physics equations to model and predict very specific phenomena, statistical models find relationships between data. They don’t need to know WHY the variables are related, so they’re more generally applicable. Statistical models can extrapolate specific graphs with lines of best fit (like above), analyse special statistical distributions like the Weibull distribution, and even use machine learning techniques to classify and forecast data.
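
As one example of the Weibull analysis mentioned above, here's a minimal sketch that estimates the two Weibull parameters from a list of failure times. It uses median-rank regression with Bernard's approximation, which is one common textbook method among several, and it assumes every unit in the test actually failed. The failure times are made up.

```python
import math
from typing import Sequence, Tuple


def fit_weibull(failure_hours: Sequence[float]) -> Tuple[float, float]:
    """Estimate Weibull (shape, scale) from complete failure times.

    Uses the linearised form ln(-ln(1 - F)) = shape * ln(t) - shape * ln(scale)
    and fits a straight line through the points by least squares.
    """
    times = sorted(failure_hours)
    n = len(times)
    xs, ys = [], []
    for rank, t in enumerate(times, start=1):
        median_rank = (rank - 0.3) / (n + 0.4)  # Bernard's approximation of F
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - median_rank)))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    shape = sxy / sxx
    intercept = mean_y - shape * mean_x
    scale = math.exp(-intercept / shape)
    return shape, scale


# Made-up failure times (hours) for eight hypothetical units.
shape, scale = fit_weibull([410, 620, 790, 950, 1120, 1310, 1560, 1900])
print(f"shape = {shape:.2f}, scale = {scale:.0f} hours")
```

A shape above 1 suggests wear-out style failures, and a shape below 1 suggests early-life defects. The scale is roughly the time by which 63% of units have failed.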

Still, not all tests and electronics will be ‘fixable’ with statistical prognostics alone. For example, some types of electronic products don’t have ANY failure modes that show up for a certain period of time (longer than 100 hours). Though burn-in tests for 100 hours would catch defective units for many types of electronics, this means some electronics would need even longer burn-in tests.

This pretty much eliminates any type of statistical model where you want to collect some data and then extrapolate a curve of best fit forward to find the lifetime. 😕 There won’t be any observable changes to collect data on for the products within a short timespan. To figure out how early failure modes kick in for different electronics, you can refer to graphs called ‘P-F curves’.

Device health is a generic variable. We can use specific ones — ex. power output for solar panels. (Source: Author)

These are created for maintenance technicians that do field testing on electronics. The aim is to only test the electronics when failure modes will start to create observable changes. Before that time, it’s a waste of effort trying to gather data about electronic health. Keep in mind, this information doesn’t directly apply to burn-in testing (because it’s meant for functional electronics that are in the field instead of potentially defective units during production). But it can give us relative insight into how long it takes for one electronic’s failures to occur compared to another.

I’ll break down the points on the graph from left to right:

  • The point when a failure mode is first triggered: this information in a P-F curve tells us the number of hours until failure mode X can start to happen. (Keep in mind, each failure mode for an electronic will have a separate P-F curve.) Though the point when the failure mode is triggered happens pretty early in the visual above, imagine that the graph has a flat line where the device is at its best health for a long time. Some electronics are like that, so any type of reliability testing would be hard to speed up.
  • The point where the device health dips low enough that it’s first detectable: the number of hours until this happens is ideally the maximum number of hours that you do testing for. With a ‘magic’ statistical model, you could already use the early data by this time to predict what the rest of the curve would look like.
  • The point where the device health dips low enough that it’s easily observable: realistically, you’ll catch the failure SOMEtime between the point where failures are ‘first’ detectable and this point.
    As a real-life example, the electricity grid has massive electronics called transformers which change the voltage of electricity. They operate at specific temperatures, so temperature changes are one of the first clues that something is wrong. The transformer’s components are surrounded by oil to insulate them, so sensitive tests can pick up small differences in oil temperature as a sign of failure. But later on, there might be more obvious clues like overheating or buzzing sounds.
If you’ve ever seen equipment at electricity stations with those tower-y spirals (transformer bushings) at the top, those are transformers. (Public Domain)
  • And finally, the point where the electronic health dips so low that it can’t serve its original function is called functional failure. Note that from the point where failure modes can first be detected to this point is the ‘predictive interval’ — it’s only during this timeframe that prognostics can be useful by collecting data BEFORE the device fails.
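
The points above can be sketched in a few lines of code. This is a toy example that assumes ‘device health’ is a single number logged over time. The two thresholds, the readings, and the function name are all made up for illustration.

```python
from typing import Optional, Sequence, Tuple


def pf_interval(
    hours: Sequence[float],
    health: Sequence[float],
    detect_below: float,
    fail_below: float,
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Return (first detectable hour, functional failure hour, predictive interval).

    A point counts as 'detectable' once health drops below detect_below and as
    'functional failure' once it drops below fail_below. Any value that cannot
    be determined from the data is returned as None.
    """
    detect_time = next((t for t, h in zip(hours, health) if h < detect_below), None)
    fail_time = next((t for t, h in zip(hours, health) if h < fail_below), None)
    if detect_time is None or fail_time is None:
        return detect_time, fail_time, None
    return detect_time, fail_time, fail_time - detect_time


# Made-up health readings (percent of original performance).
hours = [0, 1000, 2000, 3000, 4000, 5000, 6000]
health = [100, 100, 100, 97, 90, 70, 40]

print(pf_interval(hours, health, detect_below=98, fail_below=50))  # (3000, 6000, 3000)
```

In this invented example, nothing useful can be measured for the first 3000 hours, and a prognostic tool gets a 3000-hour window to work with before the device fails.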

The other issue to keep in mind is that some electronics won’t have quantifiable data that you can use for statistical modelling solutions. Two reasons for this are that quantifiable test standards haven’t been developed or the electronic product is too complicated.

For instance, test standards for optical coatings in fibre-optic networks (ie. high-speed Internet) have only been around for about 45 years. Coatings are tested to make sure they can resist scratches and don’t easily come off things they’re applied to (like mirrors and lenses). These tests take hundreds to thousands of hours and some parts are quantifiable (ex. 1000 hours at 85°C and 85% humidity). But other parts are merely ‘visual’ checks: rub a cheesecloth against a coating 50 times and see if you have scratches.

It’s hard to gather data about the effect of cheesecloth rubs though. 😁 So statistical models aren’t easy to build here.

In contrast, integrated circuit chips (computer chips) have had testing standards since the 1970s. But each chip has billions of transistors (electronic switches to let current pass through), capacitors (electronic devices to store static charge), and other components. How do you test if one of them is broken???? 😮😮😮😮

No magical engineering solutions here, unfortunately. You just don’t. Instead, engineers will create what are called ‘test patterns’. These are basically different input/output pairs to test on the computer chip. If you input data and get the expected output, the individual computer chip is working. For example, you could write data to the computer chip and then read it back. If the data you read isn’t what you tried to write, you know the computer chip didn’t save the data properly.
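
Here's a toy version of that write-then-read-back idea. The `SimulatedMemory` class is a hypothetical stand-in for a real chip (it just pretends one bit in one cell is broken), and the four patterns are a simple illustrative choice, not an industry-standard test set.

```python
from typing import List, Optional


class SimulatedMemory:
    """Toy stand-in for a memory chip where one bit of one cell can be stuck at 0."""

    def __init__(self, size: int, stuck_cell: Optional[int] = None, stuck_bit: int = 0):
        self.size = size
        self._cells = [0] * size
        self._stuck_cell = stuck_cell
        self._keep_mask = ~(1 << stuck_bit) & 0xFF  # clears only the faulty bit

    def write(self, address: int, value: int) -> None:
        if address == self._stuck_cell:
            value &= self._keep_mask  # the faulty bit can never be set to 1
        self._cells[address] = value & 0xFF

    def read(self, address: int) -> int:
        return self._cells[address]


# All zeros, all ones, and two alternating-bit patterns (one byte each).
TEST_PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def find_faulty_addresses(memory: SimulatedMemory) -> List[int]:
    """Write each pattern everywhere, read it back, and report mismatching addresses."""
    faulty = set()
    for pattern in TEST_PATTERNS:
        for address in range(memory.size):
            memory.write(address, pattern)
        for address in range(memory.size):
            if memory.read(address) != pattern:
                faulty.add(address)
    return sorted(faulty)


print(find_faulty_addresses(SimulatedMemory(16)))                             # []
print(find_faulty_addresses(SimulatedMemory(16, stuck_cell=5, stuck_bit=3)))  # [5]
```

Notice that the result is just pass/fail per address. There's no slowly drifting number to extrapolate, which is the problem described next.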

BUT data saved isn’t really a ‘physics’ variable that we can quantifiably test like with other electronics. And even if we tried to test variables like temperature, current, etc. — which of the billions of components in a chip do you take readings from?? This is why statistical models are also hard to build here.

🔑 Statistical simulations can predict test outcomes for many electronics. But only for test standards that collect numeric data.

In both cases, though, there is an option for you if you just REALLY love statistics. 😅 Instead of collecting data from optical coatings or computer chips, you collect data from the machines that make them. And you use different data from the factory production environment to optimise the production process and MAYBE make some predictions about the functionality of individual batches of electronics.

There are companies working on this very complicated approach in major industries like computer chip production. Hopefully, the immense talent going into electronics testing problems will make slow testing a relic of the past! And then, you finally CAN buy better phone batteries without waiting for years…

In “short” 😑😜

  • Slow testing = slow development of new electronics
  • Lifetime testing and burn-in testing = important + universal + slow.
  • Failure modes = events where electronics break. Monitored via variables like current, power, etc.
  • Physics simulations = complicated + specific ways to predict test outcomes via failure modes.
  • Statistical simulations = generally-applicable ways to predict test outcomes via data trends.
  • Statistical simulations don’t work for electronics with a long time until failure or no possibility to collect data.

If you have any questions about this article, feel free to email Voltx’s cofounders: Alishba Imran (editor of this article) or Shagun Maheshwari!

Thank you to: 🙏

  • Dr. Jeff Jones from the IEC. We wouldn’t have understood the connections between different electronic products without you!
  • Dr. Darayus Patel from the Nanyang Technological University. I’m grateful for all your enthusiastic support in breaking down semiconductor fabrication with me!
  • Dr. Stefaan Vandendriessche from Edmund Optics. We couldn’t have imagined the issues with testing optical coatings without your tip!
  • Alishba Imran from Voltx. I appreciate your help in editing this article and sharing your research from photovoltaic testing!

Madhav Malhotra
Voltx

Is helpful/friendly :-) Wants to solve neglected global problems. Linkedin: linkedin.com/in/madhav-malhotra/