Robust Design Channel

Robust Design explores the toolbox of design principles that embedded developers can use to build components and systems that continue to work as intended, even when subjected to unintended operating conditions.

How could easing restrictions for in-flight electronics affect designs?

Wednesday, March 28th, 2012 by Robert Cravotta

The FAA (Federal Aviation Administration) has given pilots on some airlines permission to use iPads in the cockpit in place of paper charts and manuals. In order to gain this permission, those airlines had to demonstrate that the tablet did not emit radio waves that could interfere with aircraft electronics. In this case, the airlines only had to certify the types of devices the pilots wanted to use rather than all of the devices that passengers might want to use. This type of requirements-easing is a first step toward possibly allowing passengers to use electronics during takeoffs and landings.

Devices with E-ink displays, such as the Amazon Kindle, emit less than 0.00003 volts per meter when in use (according to tests conducted at EMT Labs) – well under the 100 volts per meter of electrical interference that the FAA requires airplanes to tolerate. Even if every passenger were using a device with emissions this low, the total would not exceed the minimum shielding requirements for aircraft.

One challenge, though, is whether approving some devices but not others for use throughout a flight would create situations where enough passengers accidentally leave on, or deliberately operate, their unapproved devices that – taken together – all of the devices exceed the safety constraints for radio emissions.
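As a back-of-envelope sketch of that aggregation concern (illustrative arithmetic only, not an EMC analysis): for uncorrelated emitters, the radiated powers add, so the aggregate field strength grows roughly with the square root of the device count rather than linearly. Even a hypothetical cabin full of Kindle-class devices would remain far below the 100 volts-per-meter figure:

```c
#include <math.h>
#include <stdio.h>

/* Back-of-envelope aggregate-emissions estimate for N identical,
 * uncorrelated emitters. For incoherent sources the radiated powers
 * add, so field strength scales as sqrt(N). Distance, shielding, and
 * reflections are all ignored - illustrative arithmetic only. */
int main(void)
{
    const double e_single  = 0.00003; /* V/m per device (EMT Labs figure)  */
    const double e_limit   = 100.0;   /* V/m the aircraft must tolerate    */
    const int    n_devices = 300;     /* hypothetical full cabin           */

    double e_total = e_single * sqrt((double)n_devices);

    printf("aggregate field: %.5f V/m (limit: %.0f V/m)\n", e_total, e_limit);
    printf("margin: about %.0fx below the limit\n", e_limit / e_total);
    return 0;
}
```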

On the other hand, being able to operate an electronic device throughout a flight would be a huge selling point for many people – and this could create further economic incentives for product designers to push their designs below the emission limits that could gain permission from the FAA.

Is the talk about easing the restrictions for using electronic gadgets during all phases of a flight wishful thinking, or is the technology advancing far enough to offer devices that operate well below the safety limits for unrestricted use on aircraft? I suspect this ongoing dialogue between the FAA, airlines, and electronics manufacturers could yield some worthwhile ideas on how to ensure proper certification, operation, and failsafe functions within an aircraft environment – ideas that could make unrestricted use of electronic gadgets a real possibility in the near future. What do you think will help this idea along, and what challenges need to be resolved before it can become a reality?

Are you using Built-in Self Tests?

Wednesday, February 15th, 2012 by Robert Cravotta

On many of the projects I worked on, it made a lot of sense to implement BISTs (built-in self tests) because the systems either had safety requirements or the cost of a test run of a prototype system was high enough to justify the extra cost of making sure the system was in as good a shape as possible before committing to the test. A quick search for articles about BIST techniques suggests that it may not be adopted as a general design technique except in safety-critical, high-margin, or automotive applications. I suspect that my literature search does not reflect reality and/or that developers are using a different term for BIST.

A BIST consists of tests that a system can initiate and execute on itself, via software and extra hardware, to confirm that it is operating within some set of conditions. In designs without ECC (error-correcting code) memory, we might include tests to ensure the memory was operating correctly; these tests might be exhaustive or based on sampling, depending on the specifics of each project and the time constraints for system boot-up. To test peripherals, we could use loopbacks between specific pins so that the system could control what the peripheral would receive and confirm that outputs and inputs matched.
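A minimal sketch of both test styles, assuming hypothetical board-support routines (gpio_write/gpio_read) and a RAM region set aside for testing – the names and details are illustrative, not from any particular project:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical board-support routines - real names depend on the HAL. */
extern void gpio_write(int pin, bool level);
extern bool gpio_read(int pin);

/* Walking-ones test over a RAM region reserved for BIST. Exhaustive per
 * word; a sampling variant could step by a stride to meet a boot-time
 * budget. Returns false on the first stuck or shorted data line. */
static bool bist_ram_walking_ones(volatile uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        uint32_t saved = base[i];
        for (uint32_t pattern = 1; pattern != 0; pattern <<= 1) {
            base[i] = pattern;
            if (base[i] != pattern) {
                base[i] = saved;
                return false;
            }
        }
        base[i] = saved;
    }
    return true;
}

/* Loopback test: OUT_PIN is wired back to IN_PIN on the board, so the
 * system controls what the peripheral receives and can confirm that
 * outputs and inputs match. */
#define OUT_PIN 4
#define IN_PIN  5

static bool bist_gpio_loopback(void)
{
    static const bool levels[] = { false, true, false };
    for (size_t i = 0; i < sizeof levels / sizeof levels[0]; i++) {
        gpio_write(OUT_PIN, levels[i]);
        if (gpio_read(IN_PIN) != levels[i])
            return false;
    }
    return true;
}
```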

We often employed a longer and a shorter version of the BIST to accommodate boot-time requirements. The longer version was usually activated manually or only as part of a cold start (possibly with an override signal). The short version might be activated automatically upon a cold or warm start. Despite the effort we put into designing, implementing, and testing the BISTs, as well as developing responses for when a BIST failed, we never actually experienced a BIST failure.
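One way such a long/short split might be wired into the boot path – again with hypothetical platform hooks for the reset cause and override signal:

```c
#include <stdbool.h>

/* Hypothetical platform hooks - actual names depend on the MCU and BSP. */
extern bool reset_was_cold(void);         /* power-on vs. watchdog/soft reset */
extern bool bist_override_asserted(void); /* jumper or maintenance command    */
extern bool bist_run_long(void);
extern bool bist_run_short(void);
extern void enter_degraded_mode(void);

void boot_self_test(void)
{
    /* Run the long BIST only on a cold start (or when explicitly
     * requested), since it may blow the warm-restart time budget. */
    bool ok = (reset_was_cold() || bist_override_asserted())
                  ? bist_run_long()
                  : bist_run_short();

    if (!ok)
        enter_degraded_mode(); /* policy choice: log, limp along, or halt */
}
```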

Are you using BIST in your designs? Are you specifying your own test sets, or are you relying on built-in tests that reside in BIOS or third-party firmware? Are BISTs a luxury or a necessity with consumer products? What are appropriate actions that a system might make if a BIST failure is detected?

Do you employ “Brown M&Ms” in your designs?

Wednesday, January 25th, 2012 by Robert Cravotta

I have witnessed many conversations where someone accuses a vendor of forcing customers to use only their own accessories, parts, or consumables as a way to extract the largest amount of revenue out of the customer base. A non-exhaustive list of examples of such products includes parts for automobiles, ink cartridges for printers, and batteries for mobile devices. While there may be some situations where a company is trying to own the entire vertical market around their product, there is often a reasonable and less sinister explanation for requiring such compliance by the user – namely to minimize the number of ways an end user can damage a product and create avoidable support costs and bad marketing press.

The urban legend that the rock band Van Halen employed a contract clause requiring a venue to provide a bowl of M&Ms backstage – with all of the brown candies removed – is not only true but provides an excellent example of such a non-sinister explanation. According to the autobiography of David Lee Roth (the band’s lead singer), the bowl of M&Ms with the brown candies removed was a nearly costless way to test whether the people setting up their stage had followed all of the details in their extensive setup and venue requirements. If the band found a single brown candy in the bowl, they ordered a complete line check of the stage before they would agree that the entire stage setup met their safety requirements.

This non-sinister explanation is consistent with the types of products whose vendors people accuse of merely locking customers into consumables for higher revenues. However, when I examine the details, I usually see a machine, such as an automobile, that requires tight tolerances on every part; otherwise small variations in non-approved components can combine to create unanticipated oscillations in the body of the vehicle. In the case of printers, variations in the formula for the ink can gum up the mechanical portions of the system when put through the wide range of temperature and humidity environments that printers operate in. And mobile device providers are very keen to keep the rechargeable batteries in their products from exploding and hurting their customers.

So, do you employ some clever “Brown M&M” in your design that helps to signal when components may or may not play together well? This could be as simple as performing a version check of the software before allowing the system to go into full operation, as sketched below. Or is the concept of “Brown M&Ms” just a story to cover greedy practices by companies?
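For what it is worth, a minimal sketch of that version-check idea, with invented field names and version numbers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor read from an attached module or accessory. */
struct module_info {
    uint16_t vendor_id;
    uint8_t  if_version_major; /* incremented on breaking changes       */
    uint8_t  if_version_minor; /* incremented on compatible additions   */
};

#define EXPECTED_VENDOR    0x1234
#define SUPPORTED_IF_MAJOR 3
#define MIN_IF_MINOR       1

/* The "brown M&M": a nearly costless check that signals a deeper
 * mismatch before the system commits to full operation. */
static bool module_is_compatible(const struct module_info *m)
{
    if (m->vendor_id != EXPECTED_VENDOR)
        return false;
    if (m->if_version_major != SUPPORTED_IF_MAJOR)
        return false;
    return m->if_version_minor >= MIN_IF_MINOR;
}
```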

Have you experienced a “bad luck” test failure?

Wednesday, December 7th, 2011 by Robert Cravotta

Despite all of the precautions that the Mythbusters team takes when doing their tests, the team accidentally launched a cannonball into a neighborhood and through a home. The test consisted of firing a 6-inch cannonball out of a homemade cannon to measure the cannonball’s velocity. The cannonball was fired at a sheriff’s bomb disposal range, and it was supposed to hit large containers filled with water. The projectile missed the containers and took an unlucky bounce off a safety berm, sending it into the nearby neighborhood. Luckily, despite the property damage – including careening through a house with people sleeping in it – no one was hurt.

This event reminds me of a number of bad luck test failures I have experienced. Two involved similar autonomous vehicle tests, where the failures were due to interactions with other groups. In the first case, we experienced a failure during a test flight that, ironically, was caused by delaying the test to improve its chances of success. In this test, we had a small autonomous vehicle powered with rocket engines. The rocket fuel (MMH and NTO) is very dangerous to work with, so we handled it as little as possible. We had fueled up the system for a test flight when word came down that the launch was going to be delayed because we were using the same kind of launch vehicle that had just experienced three failed flights before our test.

While we waited for the failure analysis to complete, our test vehicle was placed into storage with the fuel still aboard (there really was no way to empty the fuel tanks, as the single-test system had not been designed for that). A few months later we got the go-ahead on the test, and we pulled the vehicle out of storage. The ground and flight checkouts passed with flying colors and the launch proceeded. However, during the test, once our vehicle blew its ordnance to allow the fuel to flow through the propulsion system, the seals catastrophically failed and the fuel immediately vented. The failure occurred because the seals were not designed to be in constant contact with the fuel for the months it sat in storage. The good news was that all of the electronics operated correctly; the vehicle simply had no fuel to do what it was intended to do.

The other bad luck failure was the result of poor communication about an interface change. In this case, the system had been built around a 100Hz control cycle. A group new to the project decided to change the inertial measurement unit so that it operated at 400Hz. The change in sample rate was not communicated to the entire team, and the resulting test flight was a spectacular, spinning, out-of-control failure.
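To illustrate (with made-up numbers) why an uncommunicated sample-rate change is so destructive: if the software integrates sensor samples with a time step hard-coded for 100Hz while the IMU actually delivers them at 400Hz, every integrated quantity comes out scaled by a factor of four:

```c
#include <stdio.h>

/* The control code was written assuming 100Hz samples... */
#define ASSUMED_DT (1.0 / 100.0)

/* ...but the new IMU actually delivers them at 400Hz. */
#define ACTUAL_DT  (1.0 / 400.0)

int main(void)
{
    double rate = 10.0; /* deg/s constant body rate (illustrative) */
    double true_angle = 0.0, estimated_angle = 0.0;

    /* One second of real time = 400 samples at the actual rate. */
    for (int i = 0; i < 400; i++) {
        true_angle      += rate * ACTUAL_DT;  /* physics  */
        estimated_angle += rate * ASSUMED_DT; /* software */
    }

    /* The estimated attitude is 4x the truth, so the controller
     * issues violent "corrections" to an error that is not there. */
    printf("true: %.1f deg, estimated: %.1f deg\n",
           true_angle, estimated_angle);
    return 0;
}
```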

In most of the bad luck failures I am aware of, the failure occurred because assumptions masked or hid the consequences of miscommunication or unexpected decisions made by one group within the larger team. In our case, the tests were part of a series and mostly cost us precious time, but sometimes such failures are more serious. For example, the Mars Climate Orbiter (in 1999) unexpectedly disintegrated while executing a navigation command. The root cause of that failure was a mismatch in measurement systems: one team used English units while another used metric units.
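One inexpensive defense against that class of failure is to make the units part of the type instead of a comment. A sketch in C – single-member structs are just one way to do this, and strongly typed languages offer better tools:

```c
#include <stdio.h>

/* Wrap each unit in its own struct so that passing pound-force
 * seconds where newton-seconds are expected fails to compile. */
typedef struct { double value; } newton_seconds;
typedef struct { double value; } poundforce_seconds;

#define LBF_S_TO_N_S 4.44822 /* 1 lbf*s in N*s */

static newton_seconds from_lbf_s(poundforce_seconds imp)
{
    return (newton_seconds){ imp.value * LBF_S_TO_N_S };
}

static void apply_impulse(newton_seconds imp)
{
    printf("applying %.3f N*s\n", imp.value);
}

int main(void)
{
    poundforce_seconds thruster = { 1.0 };

    /* apply_impulse(thruster);          <- compile error, as intended */
    apply_impulse(from_lbf_s(thruster)); /* explicit conversion only   */
    return 0;
}
```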

I guess calling these bad luck failures is a nice way to say a group of people did not perform all of the checks they should have before starting their tests. Have you ever experienced a “bad luck” failure? What was the root cause for the failure and could a change in procedures have prevented it?

How should embedded systems handle battery failures?

Wednesday, November 30th, 2011 by Robert Cravotta

Batteries – increasingly we cannot live without them. We use batteries in more devices than ever before, especially as the trend to make a mobile version of everything continues its relentless advance. However, the investigation and events surrounding the battery fires in the Chevy Volt are yet another reminder that every engineering decision involves tradeoffs. In this case, damaged batteries, especially large ones, can cause fires. However, this is not the first time we have seen damaged-battery issues – remember the exploding cell phone batteries from a few years ago? Well, that problem has not been completely licked, as there are still reports of exploding cell phones even today (in Brazil).

These incidents remind me of when I worked on a battery charger and controller system for an aircraft. We put a large amount of effort into ensuring that the fifty-plus-pound battery could not and would not explode no matter what type of failures it might endure. We had to develop a range of algorithms to constantly monitor each cell of the battery and respond appropriately if anything improper started to occur with any of them. One additional constraint on our responses, though, was that the battery still had to deliver power on demand even with parts of it damaged or failing.

Keeping a battery operating as well as it can under all conditions may sound like an extreme requirement, but it is not so extreme when you realize that automobiles and possibly even cell phones sometimes demand similar levels of operation. I recall discussing the exploding batteries a number of years ago, and one comment was that they were a system-level design concern rather than just a battery manufacturing issue – in most of the exploding-phone cases at that time, the explosions were the consequence of improperly charging the battery at an earlier time. Adding intelligence to the battery to reject a charging load that was out of specification was a system-level method of minimizing the opportunity to damage the batteries via improper charging.
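A minimal sketch of that kind of pack-side intelligence, with hypothetical measurement hooks and made-up limits – real limits come from the cell datasheet and the applicable safety standards:

```c
#include <stdbool.h>

/* Hypothetical pack-side measurement and control hooks. */
extern double cell_voltage(int cell);    /* V */
extern double pack_charge_current(void); /* A */
extern double pack_temperature(void);    /* C */
extern void   open_charge_fet(void);     /* disconnect the charger */

#define NUM_CELLS    4
#define CELL_V_MAX   4.25 /* illustrative limits only */
#define CHARGE_I_MAX 2.0
#define CHARGE_T_MAX 45.0
#define CHARGE_T_MIN 0.0

/* Called periodically while charging: reject any charging profile
 * that is outside the pack's own specification, regardless of what
 * the charger claims to be doing. */
void charge_watchdog(void)
{
    bool fault = false;

    for (int c = 0; c < NUM_CELLS; c++)
        if (cell_voltage(c) > CELL_V_MAX)
            fault = true;          /* per-cell overvoltage */

    if (pack_charge_current() > CHARGE_I_MAX)
        fault = true;              /* overcurrent */

    double t = pack_temperature();
    if (t > CHARGE_T_MAX || t < CHARGE_T_MIN)
        fault = true;              /* charging outside the temp window */

    if (fault)
        open_charge_fet();         /* fail safe: stop accepting charge */
}
```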

Given the wide range of applications that batteries are finding use in, what design guidelines do you think embedded systems should follow to provide the safest operation of batteries despite the innumerable ways that they can be damaged or fail? Is disabling the system appropriate?

Food for thought on disabling the system is how CFLs (compact fluorescent lamps) handle end-of-life conditions, when too much of the mercury has migrated to the other end of the lighting tube: they purposefully burn out a fuse so that the controller board is unusable. While this simple approach avoids operating a CFL beyond its safe range, it has caused much concern among users, as more and more people are startled by the smell of burning components in their lamp.

How should embedded systems handle battery failures? Is there a one-size-fits-all approach, or even a tiered approach to handling different types of failures, so that users can confidently use their devices without fear of explosion or fire, while knowing when there is a problem with the battery system and getting it fixed before it becomes a major problem?

What game(s) do you recommend?

Thursday, June 30th, 2011 by Robert Cravotta

I have been thinking about how games and puzzles can help teach concepts and strengthen a person’s thought patterns for specific types of problem solving. However, there are literally thousands of games available across a multitude of forms – card, board, and computer-based. The large number of options can make it challenging to know when one might be particularly well suited to helping you train your mind for a type of design project. Discussion forums like this one can collect lessons learned and make you aware of games or puzzles that others have found useful in exercising their minds – as well as entertaining.

I have a handful of games that I could suggest, but I will start by offering only one recommendation in the hopes that other people will share their finds and thoughts about when and why the recommendation would be worthwhile to someone else.

For anyone who needs to do deep thinking while taking into account a wide range of conditions from a system perspective, I recommend checking out the ancient game of Go. It is a perfect-information game played between two players, and it has a ranking and handicap system that lets two players of slightly different strengths play a game that is challenging for both. Rather than explaining the specifics of the game here, I would instead like to focus on what the game forces you to do in order to play competently.

The rules are very simple – the players alternate turns placing stones on a grid board. The goal of the game is to surround and capture the most territory. The grid is of sufficient size (19×19 points) that your moves have both a short-term and a long-term impact. Understanding the subtlety and depth of the long-term impact of your moves grows in richness with experience and practice – not unlike designing a system in such a way as to avoid shooting yourself in the foot during troubleshooting. If you are too cautious, your opponent will capture too much of the board for your excellent long-term planning to matter. If you play too aggressively – grabbing as much territory as directly and quickly as possible – you risk having to defend what you have laid claim to with a structure that is too weak to withstand any stress from your opponent.

The more I play Go, the more easily I can see how the relationships between decisions and trade-offs affect how well the game – or a project – will turn out. Finding an adequate balance between building a strong structure and progressing forward at an appropriate pace is a constant exercise in reading your environment and adjusting to changing conditions.

I would recommend Go to anyone who needs to consider the system-level impacts of their design decisions. Do you have a game you would recommend for embedded developers? If so, what is it, and why might an embedded developer be interested in trying it out?

Are consumer products crossing the line to too cheap?

Wednesday, January 19th, 2011 by Robert Cravotta

I love that the price of so many products continues to fall year in and year out. However, I have recently started to wonder whether, in some cases, the design trade-offs that development teams make to lower production costs are crossing a quality line. I do not have a lot of data points, but I will share my observations about my office phones as an example that might inspire you to share your experience with some product. Maybe in the aggregate, our stories will uncover whether there really is a trend toward razor-thin quality margins.

Over the past ten years, I have relied on three different cordless phone systems. The price of each phone system was progressively lower than the previous one, and each usually provided more functionality. In each case, the phone that I replaced was still working, but there was something else that made replacing it worthwhile.

The first cordless phone I purchased was a single-line, 2.4GHz cordless phone. It consisted of a single handset that mounted on a full-function base station. I liked that phone a lot. Everything about it was robust – except the fact that it operated in the 2.4GHz band along with so many other devices in and around my office. For example, if someone turned on the microwave, it would cause interference with the phone.

I eventually replaced the 2.4GHz cordless phone with a dual-line, multi-handset, 5.8GHz phone system. That system cost me less than the 2.4GHz phone and offered additional valuable features. The handset was smaller and weighed less. The dual-line capability allowed me to consolidate my phone lines onto a single system, and the multi-handset feature made both phone lines available everywhere in my home office. There were a few features from the original phone that the new handsets did not provide, and I missed them, but I could use the phones no matter how other devices were being used around the area – so I was happy with them.

After a few years of heavy use, the batteries could not hold enough charge. I would regularly switch between handsets during a typical day because the battery in each handset provided less than an hour of talk time. I bought replacement batteries, which were better than the originals, but only slightly so. With continued use, the new batteries provided longer life (I assume this was because each handset’s state-of-charge values were slowly adjusting to the new batteries – I’d love it if someone could explain, or point me to an explanation of, why this happened). However, the batteries were only useful for about a year.

At this point, I bought my current cordless phone system which is made by the same manufacturer as the previous two phones. It is a DECT 6.0, dual-line, multi-handset system. It cost less money than either of the other two systems and added a larger database on the base station. However, the phone system exhibits intermittent behavior that I never experienced with my other phones.

Most notably, the communication between the base station and handsets is not always robust. Sometimes a handset will ring continuously instead of with the normal on/off cadence. Other times a handset will not receive the caller ID that the base station normally sends to it. And at other times I will notice a clicking sound that I have not been able to attribute to anything. These examples of lower robustness make me wonder how much the design team shrank the product’s performance margins to meet a lower price point.

Are these systemic examples of margins that are too small, or did I just get unlucky with the phone I received? Because the phone works well most of the time, I suspect narrow quality margins are the culprit – and this made me wonder whether other people are noticing similar changes in the robustness of newer products. Do you have a product whose robustness you have noticed change across generations?

What is your most memorable demonstration/test mishap?

Wednesday, January 12th, 2011 by Robert Cravotta

The crush of product and technology demonstrations at CES is over. As an attendee, I found that the vast majority of the product demonstrations performed as expected. The technology demonstrations, on the other hand, did not always fare quite so well – but then again, the technology demonstrations were prototypes of possibly useful ways to harness new ideas rather than fully developed and productized devices. Seeing all of these demonstrations reminded me of the prototypes I worked on and some of the spectacular ways that things could go wrong. I suspect that sharing these stories with each other will pass some valuable (and possibly expensively learned) lessons around the group here.

On the lighter side of the mishap scale, I still like the autonomous robots that Texas Instruments demonstrated last year at ESC San Jose. The demonstration consisted of four small, wheeled robots that would independently roam around the table top. When they bumped into each other, they would politely back away and zoom off in another direction. That is, except for one of the robots, which appeared to be a bit pushy and bossy, as it would push the other robots around longer before backing away. In this case, the robots were all running the same software. The difference in behavior came down to a difference in the sensitivity of the pressure bar that told each robot it had collided with something (a wall or another robot in this case).

I like this small-scale example because it demonstrates that even identical devices can exhibit significantly different observable behaviors because of small variances in the components that make them up. It also demonstrates the potential value of giving closed-loop control systems access to independent or outside reference points so they can calibrate their behavior to some set of norms (how’s that for a follow-up idea for that particular demonstration?). I personally would love to see an algorithm that allowed the robots to gradually influence each other’s behavior, but the robots might need more sensors to do that in any meaningful way.
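A sketch of one simple form of that calibration idea: each unit samples its own untouched sensor at startup to establish a baseline, so unit-to-unit sensitivity variance washes out. The sensor access is a hypothetical ADC read, and the threshold is illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ADC read of the pressure-bar sensor. */
extern uint16_t bump_sensor_raw(void);

static uint16_t baseline; /* per-unit no-contact reading */

/* Sample the untouched sensor at startup so each unit's trigger level
 * is relative to its own baseline rather than a one-size-fits-all
 * constant baked into the shared software image. */
void bump_sensor_calibrate(void)
{
    uint32_t sum = 0;
    for (int i = 0; i < 64; i++)
        sum += bump_sensor_raw();
    baseline = (uint16_t)(sum / 64);
}

#define BUMP_DELTA 200 /* counts above baseline = contact (illustrative) */

bool bumped(void)
{
    return bump_sensor_raw() > baseline + BUMP_DELTA;
}
```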

On the more profound side of the mishap scale, my most noteworthy stories involve live testing of fully autonomous vehicles that would maneuver in low orbit. These tests were quite expensive to perform, so we did not have the luxury of “do-overs”. Especially after a failure, we had to learn as much about the system as we could; the post-mortem analysis could last months after the live test.

In one case, we had a system that was prepped (loaded with fuel) for a low-orbit test; however, the family of transport vehicles we were using had experienced several catastrophic failures over the past year that resulted in the payloads being lost. We put the low-orbit payload system, with the fuel loaded, into storage for a few months while the company responsible for the transport vehicle went through a review and correction process. Eventually we got the go-ahead to perform the test. The transport vehicle delivered its payload perfectly; however, when the payload system activated its fuel system, the seals for the fuel lines blew.

In this case, the prototype system had not been designed to store fuel for a period of months. The intended scenario was to load the vehicle with fuel and launch it within days – not months. During its time in storage, the corrosive fuel and oxidizer weakened the seals, so they blew when the full pressure of the fuel system was placed upon them during flight. A key takeaway from this experience was to understand the full range of operating and non-operating scenarios that the system might be subjected to – including extended storage. In this case, the solution was implemented as additional steps and conditions in the test procedures.

My favorite profound failure involves a similar low-orbit vehicle that we designed to succeed when presented with a three-sigma (99.7%) scenario. In this test though, there was a cascade of failures during the delivery to low-orbit phase of the test, which presented us with a nine-sigma scenario. Despite the series of mishaps leading to deployment of the vehicle, the vehicle was almost able to completely compensate for its bad placement – except that it ran out of fuel as it was issuing the final engine commands to put it into the correct location. To the casual observer, the test was an utter failure, but to the people working on that project, the test demonstrated a system that was more robust than we ever thought it could be.

Do you have any demonstration or testing mishaps that you can share? What did you learn from it and how did you change things so that the mishap would not occur again?

What is your favorite failsafe design?

Wednesday, January 5th, 2011 by Robert Cravotta

We had snow falling for a few hours where I live this week. This is remarkable only to the extent that the last time we had any snowfall was over 21 years ago. The falling snow got me thinking about how most things in our neighborhood, such as cars, roads, houses, and our plumbing, are not subjected to the wonders of snow with any regularity. On days like that, I am thankful that the people who designed most of the things we rely on took into account the impact different extremes, such as hot and cold weather, would have on their designs. Designing a system to operate or degrade gracefully in rare operating conditions is a robust design concept that seems to be missing in so many “scientific or technical” television shows.

Designing systems so that they fail in a safe way is an important engineering concept – and it is often invisible to the end user. Developing a failsafe system is an exercise in trading off the consequences and probability of a failure against the cost of mitigating those consequences. There is no single best way to design a failsafe system, but two main tools available to designers are to incorporate interlocks or safeties into the system and/or to implement processes that the user needs to follow to mitigate the failure state. Take, for example, the simple inflatable beach ball; the ones I have seen have such a long list of warnings and disclaimers printed on them that it is quite humorous – until you realize that every item printed on that ball probably has a legal case associated with it.

I was completely unaware until a few months ago how a rodent could make an automobile inoperable. Worse, our vehicle became unsteerable while it was being driven. Fortunately no one got hurt (except the rat that caused the failure). In this case, it looks like the rat got caught in one of the belts in the engine compartment, which ultimately made the power steering fail. I was surprised to find out that this is actually a common failure when I looked it up on the Internet. I am not aware of a way to design better safety into the vehicle, so we have changed our process when using automobiles. In our case, we do periodic checks of the engine compartment for any signs of an animal living in there, and we sprinkled peppermint oil around the compartment because we heard that rodents hate the smell.

The ways to make a system failsafe are numerous, and I suspect there are a lot of great ideas that have been used over the years. As an example, let me share a memorable failsafe mechanism we implemented on a Space Shuttle payload I worked on for two years. The payload was going to be flying around the Space Shuttle – which means it would be firing its engines more than once. This was groundbreaking, as launching satellites involves firing the engines only once. As a result, we had to go to great lengths to ensure that there could be no way the engines could misfire – or worse, that the payload could receive a malicious command from the ground to direct the payload onto a collision course with the Shuttle. All of the fault-tolerant systems and failsafe mechanisms made the design quite complicated. In contrast, the mechanism we implemented to prevent acting on a malicious command was a table of random numbers that were loaded onto the payload 30 minutes before launch and known to only two people. Using encryption was not feasible at that time because we just did not have the computing power for it.
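A minimal sketch of how such a one-time authentication table might be checked on the payload side – the framing and names are invented for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_LEN 128

/* One-time authenticators loaded shortly before launch and known only
 * to the ground station and the payload. Filled by the pre-launch load. */
static uint32_t auth_table[TABLE_LEN];
static size_t   next_entry;

/* Hypothetical uplink frame: a command plus the next one-time value. */
struct uplink_cmd {
    uint16_t opcode;
    uint32_t auth;
};

/* Accept a command only if it carries the next unused table entry.
 * Each entry is consumed on success, so a recorded frame cannot be
 * replayed, and a guessed value has one chance in 2^32 per attempt. */
bool command_authentic(const struct uplink_cmd *cmd)
{
    if (next_entry >= TABLE_LEN)
        return false;              /* table exhausted: reject everything */
    if (cmd->auth != auth_table[next_entry])
        return false;              /* not from the ground station */
    next_entry++;                  /* burn the entry */
    return true;
}
```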

Another story of making a system more failsafe involved an X-ray machine. I was never able to confirm whether this actually occurred or was a local urban legend, but the lesson is still valid. The model of X-ray machine in question was exposing patients to larger doses of radiation than intended when the technician pressed the backspace key during a small time window. The short-term fix was to send out an order to remove the backspace key from all of the keyboards. The takeaway for me was that there are quick and cheap ways to alleviate a problem while you take the appropriate efforts to find a better way to fix it.

Have you ever used a clever approach to making your designs more failsafe? Have you ever run across a product you used that implemented an elegant failsafe mechanism? Have you ever seen a product that you thought of a better way that they could have made the system failsafe or degrade gracefully?

How do you exercise your orthogonal thinking?

Wednesday, December 29th, 2010 by Robert Cravotta

How are Christmas and Halloween the same? The intended answer to this question requires you to look at the question from different angles to find the significant relationship between these seemingly unrelated events. In fact, to be a competent problem solver, you often need to be able to look at a problem from multiple angles and find a way to take advantage of a relationship between different parts of the problem that might not be immediately obvious. If the relationship was obvious, there might not be a problem to solve.

I have found over the years that doing different types of puzzles and thinking games often helps me juggle the conditions of a problem around and find that elusive relationship that makes the problem solvable. While I do not believe being able to solve Sudoku puzzles will make you smarter, I do believe that practicing Sudoku puzzles in different ways can help exercise your “cognitive muscles” so that you can more easily reorganize difficult and abstract concepts in your mind and find the critical relationships between the different parts.

There are several approaches to solving Sudoku puzzles, and each requires different cognitive wiring to perform competently. One approach, and the one that I see most electronic versions of the puzzle support, involves penciling in all of the possible valid numbers in each square and using a set of rules to eliminate numbers from each square until one valid answer remains. Another approach finds the valid numbers without using the relationships between the “penciled” numbers. Each approach exercises my thought process in very different ways, and I find that switching between them provides a benefit when I am working on a tough problem.

I believe being able to switch gears and represent data in equivalent but different representations is a key skill for effective problem solving. In the case of Christmas and Halloween, rather than looking at the social context associated with each day, looking at the date of each day – October 31 and December 25 – can suggest a non-obvious relationship.

I find that many of the best types of puzzles or games for exercising orthogonal thinking engage a visual mode of looking at the problem. The ancient board game of Go is an excellent example. The more I play Go, the more abstract relationships I am able to recognize and, most surprisingly, apply to life and problem solving. If you have never played Go, I strongly recommend it.

Another game I find a lot of value in for exercising orthogonal thinking is Contract Bridge – mostly because it is a game of incomplete information, much like real-life problems, and relies on the players’ ability to communicate information to each other within a highly constrained vocabulary. Oftentimes, the toughest design problems are tough precisely because it is difficult to verbalize or describe what the problem actually is.

As for the relationship between October 31 and December 25, it is interesting that the abbreviations for these two dates also correspond to notation for the same exact number in two different number bases – Oct(al) 31 is the same value as Dec(imal) 25, since 3 × 8 + 1 = 25.

These examples are some of the ways I exercise my orthogonal thinking. What are your favorite ways to stretch your mind and practice switching gears on the same problem?