Multiprocessing Channel

Faster processing speeds have reached the point of diminishing returns. Chip architects and embedded designers are turning to multiprocessing architectures to harness the needed processing capability while keeping within challenging energy budgets. This series focuses on the challenges of using multicore devices, as well as homogeneous and heterogeneous topologies for multiprocessing.

Implementing system-wide load balancing

Friday, May 20th, 2011 by Carolina Blanch and Rogier Baert and Maja D'Hondt

To enable intelligent adaptation in multiprocessing designs that perform system-wide load balancing, we implemented several support functions on the processors, such as continuous monitoring of resources and workload, as well as bandwidth monitoring of the network link. This allows a device to react promptly when the workload exceeds the available resources, which can happen because (1) the workload itself varies or (2) the available processing power drops, for example as batteries run down. In both cases, the processing of some tasks should be transferred to more capable devices on the network.
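
As a minimal illustration of this kind of monitoring, the sketch below samples a device's status and applies a simple offload heuristic. The names, thresholds, and bandwidth floor are illustrative assumptions, not the actual middleware API.

```cpp
// Minimal sketch of per-device monitoring feeding an offload decision.
// DeviceStatus, should_offload, and all thresholds are illustrative
// assumptions, not the middleware described in the article.
#include <iostream>

struct DeviceStatus {
    double cpu_load;   // fraction of processing capacity in use, 0.0..1.0
    double link_mbps;  // monitored bandwidth of the network link
    bool   on_battery; // battery operation reduces usable headroom
};

// Request migration when the workload exceeds the available resources;
// a lower threshold on battery models the reduced processing power, and
// the link must have room to carry the migrated stream's data.
bool should_offload(const DeviceStatus& s) {
    const double load_threshold = s.on_battery ? 0.6 : 0.9;
    const double min_link_mbps  = 20.0;  // assumed per-stream requirement
    return s.cpu_load > load_threshold && s.link_mbps >= min_link_mbps;
}

int main() {
    DeviceStatus sample{0.95, 40.0, false};  // values a monitor might report
    if (should_offload(sample)) {
        std::cout << "overload detected: request task migration\n";
    }
}
```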

In our experiment, device A and device B both implement the decoding of video streams, followed by an image analysis function (a logo detector used for quality control) and re-encoding of the video streams (see Figure 2). Depending on the resource availability at either device, some stream processing tasks are transferred seamlessly to the less loaded device. Finally, the output of all processed streams is sent to a display client. This task migration automatically lowers the workload at the overloaded device, while all videos are displayed without artifacts.

Load balancing at the system level involves migrating tasks between devices in the network.

We implement a Device Runtime Manager (RM) at each device that monitors workload, resources, video quality, and link bandwidth within that device. Note that bandwidth monitoring is required to guarantee that the processed data can be accommodated within the available bandwidth.

While the Device RM can take care of the load balancing within the device, a Network Runtime Manager is implemented to perform the load balancing between devices in the network. To do so, it receives the monitored information from each individual Device RM and decides on which device to execute each task. This way, when resources at device A are insufficient for the existing workload, the Network RM shifts some stream processing tasks from device A to device B. Obviously, this task distribution between devices A and B depends on the resource availability at both devices. In our experiment, the screen on the client display shows on which device each displayed video stream is currently processed. In the example in Figure 2, due to lack of resources on device A, the processing of 6 video streams has been migrated to device B while the remaining 3 are processed on device A.
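
A minimal sketch of the Network RM's placement decision might look like the following: each Device RM reports its load, and a greedy policy assigns every new stream task to the least-loaded device. The starting loads are chosen so the resulting split happens to match the 6/3 distribution of Figure 2; the structure and costs are assumptions for illustration, not the actual Network RM.

```cpp
// Sketch of a greedy Network RM: place each stream task on the device
// whose Device RM reports the most headroom. Names, costs, and the
// greedy policy are illustrative assumptions.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct DeviceReport {
    std::string name;
    double load;  // load reported by the Device RM, 0.0..1.0
};

// Assign a task to the least-loaded device and account for its cost.
std::string place_task(std::vector<DeviceReport>& devices, double task_cost) {
    auto it = std::min_element(devices.begin(), devices.end(),
        [](const DeviceReport& a, const DeviceReport& b) {
            return a.load < b.load;
        });
    it->load += task_cost;
    return it->name;
}

int main() {
    // Device A starts out busier; each stream costs ~7% of capacity.
    std::vector<DeviceReport> devices{{"A", 0.45}, {"B", 0.30}};
    for (int stream = 1; stream <= 9; ++stream)
        std::cout << "stream " << stream << " -> device "
                  << place_task(devices, 0.07) << "\n";  // B gets 6, A gets 3
}
```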

In a similar way, Figure 3 shows the monitored workload on both devices A and B. Around time 65 s, several stream processing tasks are added to device A, overloading it. To decrease the load on device A, the Network RM gradually shifts tasks from device A to device B, resulting in the load distribution seen from time 80 s onward. This way, load balancing between devices in the network overcomes the processing limitations of device A.

The processing load distribution over time for devices A and B.

Another way to implement load balancing between devices is to scale down the task requirements. For video decoding applications, for example, this may translate into reducing the spatial or temporal resolution of the video stream. In another experimental setup, when the workload on a decoding client becomes too high, the server transcodes the video streams to a lower resolution before sending them to the client device. This naturally reduces the workload on the client device at the cost of increased processing at the server.
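
As a sketch of this second mechanism, the fragment below steps a stream down a resolution ladder while the client reports overload. The threshold, the ladder, and the rough assumption that decoding load scales with pixel count are all illustrative.

```cpp
// Sketch of content down-scaling: the server halves the stream's spatial
// resolution while the client stays overloaded. The threshold and the
// "load scales with pixel count" model are illustrative assumptions.
#include <iostream>

struct Resolution { int width, height; };

Resolution adapt_stream(Resolution r, double client_load) {
    const double kOverload = 0.85;  // assumed client overload threshold
    while (client_load > kOverload && r.height > 360) {
        r = {r.width / 2, r.height / 2};  // quarter the pixel count
        client_load /= 4.0;               // ~4x less decoding work
    }
    return r;
}

int main() {
    // An overloaded client (load 1.4) asks the server to adapt a 1080p stream.
    Resolution out = adapt_stream({1920, 1080}, 1.4);
    std::cout << "transcode to " << out.width << "x" << out.height << "\n";
}
```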

Note that we chose to place the Network RM at device A, but it could be located at any other network device as the overhead involved is very low. Note also that in both experiments the decision to migrate tasks or adapt the contents is taken to fit the processing constraints and to guarantee the desired quality-of-service.

However, another motivation for load balancing can be to minimize the energy consumption of a specific device, extending its battery lifetime, or even to control processor temperature. Think of cloud computing data centers, where cooling systems are critical and account for a large fraction of the energy consumed. Load balancing between servers could be used to control processor temperatures by switching processors on and off or by moving tasks between them. This could potentially reduce the cooling effort and pave the way for greener, more efficient data centers.

We are facing a future where applications become increasingly complex and demanding, with highly dynamic and unpredictable workloads. Tackling these challenges requires great flexibility of adaptation and cooperation from the devices in the network. Our research and experiments contribute to an environment of intelligent, flexible devices that can optimally exploit and share their own resources and those available in the network, while adapting to system dynamics such as workload, resources, and bandwidth.

Dynamic runtime task assignment

Tuesday, April 19th, 2011 by Carolina Blanch and Rogier Baert and Maja D'Hondt

Innovative multimedia applications, such as multi-camera surveillance or multipoint videoconferencing, are demanding in both processing power and bandwidth. Moreover, a video processing workload is highly dynamic and subject to stringent real-time constraints. To process these types of applications, the trend is to use commodity hardware, because it provides higher flexibility and reduces the hardware cost. This hardware is often heterogeneous: a mix of central processing units (CPUs), graphics processing units (GPUs), and digital signal processors (DSPs).

But implementing these multimedia applications efficiently onto one or more heterogeneous processors is a challenge, especially if you also want to take into account the highly dynamic workload. State-of-the-art solutions tackle this challenge through fixed assignments of specific tasks to types of processors. However, such static assignments lack the flexibility to adapt to workload or resource variations and often lead to poor efficiency and over-dimensioning.

Another strategy is dynamic task assignment. To prove that this is feasible, we have implemented middleware that performs both runtime monitoring of workloads and resources and runtime task assignment onto and between multiple heterogeneous processors. As platforms, we used an NVIDIA Quadro FX3700 GPU and dual quad-core Intel Xeon processors.

Load-balancing between the cores of a device

Our experiment consists of running multiple pipelines in which MPEG-4 decoding, frame conversion, and AVC video encoding are serialized tasks. Of these tasks, motion estimation, the most demanding part of video encoding, can run either on a CPU or be CUDA-accelerated on the GPU. Figure 1 compares our experimental runtime assignment strategy with two static assignment strategies that mimic the behavior of state-of-the-art OS-based systems.

This chart shows how the throughput increases by dynamic task assignment within a device.

The first static assignment lets the operating system assign all tasks to the CPU cores. The second assigns all CUDA-accelerated tasks to the GPU, while the remaining tasks are scheduled on the CPU cores. The latter, by enabling GPU implementations of the most demanding tasks, increases the number of streams that can be processed from 10 to 15. At this point, however, the GPU becomes the bottleneck and limits the number of processed frames.

The last strategy, dynamic assignment, overcomes both CPU and GPU bottlenecks by finding an optimal balance between CPU and GPU assignments at runtime. It achieves 20% higher throughput than the fixed assignments to GPU and CPU, increasing the efficiency and flexibility of the available hardware while the overhead remains marginal, around 0.05% of total execution time.
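
The following sketch shows the flavor of such a runtime decision: each motion-estimation task goes to whichever processor currently has the lowest estimated pressure, using queue depth over a relative-capacity figure as the load signal. This is an assumed simplification for illustration, not the middleware's actual policy.

```cpp
// Sketch of dynamic CPU/GPU assignment: route each motion-estimation task
// to the processor under the least pressure. Queue depth divided by a
// relative-capacity figure is an assumed load signal for illustration.
#include <iostream>
#include <string>

struct ProcState {
    std::string name;
    int queued;    // tasks currently waiting on this processor
    int capacity;  // rough relative throughput for this task type
};

// Lower is better: a higher-capacity processor absorbs a deeper queue.
double pressure(const ProcState& p) {
    return static_cast<double>(p.queued + 1) / p.capacity;
}

int main() {
    ProcState cpu{"CPU cores", 0, 2};
    ProcState gpu{"GPU (CUDA)", 0, 3};
    for (int task = 1; task <= 10; ++task) {
        ProcState& target = pressure(cpu) <= pressure(gpu) ? cpu : gpu;
        ++target.queued;
        std::cout << "motion estimation " << task << " -> "
                  << target.name << "\n";
    }
}
```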

From the device to the cloud

Load balancing within the device is a necessary first step to maximize the device's capabilities and to cope with demanding and variable workloads. But to overcome the limited processing capacity of a single device, a second step is needed: load balancing at the system level. Only then can the potential of a highly connected environment, where resources from other devices can be shared, be fully exploited.

In addition, today's applications tend to become more complex and demanding in both processing and bandwidth terms. On top of this, users keep demanding lighter, portable, multifunctional devices with longer battery life. Naturally, this makes it hard for such devices to meet the processing power required by many applications.

The way to solve this is to balance the load between multiple devices by offloading tasks from overloaded or processing-constrained devices to more capable ones that can process these tasks remotely. This is linked to the “thin client” and “cloud computing” concepts, in which the limited processing resources on a device are virtually expanded by shifting processing tasks to other devices in the same network.

As an example, think of home networks or local area networks connecting multiple heterogeneous devices. Some of these devices are light, portable devices such as iPhones and PDAs with limited processing capability and battery life, while others, such as media centers or desktops at home, or other processing nodes and servers in the network infrastructure, are capable of more processing.

One consequence of migrating tasks from lighter devices to more capable ones is that the communication and throughput between devices increase. In particular, in the case of video applications, where video is decoded remotely and transmitted, the bandwidth demand can be very high. Fortunately, upcoming wireless technologies provide increasingly high bandwidth and connectivity, enabling this kind of load balancing. This is the case for LTE femtocells, where up to 100 Mbps downlink is available, and for wireless HD communication systems in the 60 GHz range, where even 25 Gbps is promised.
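
A quick back-of-the-envelope calculation shows why the bandwidth demand is so high when decoded video has to cross the network; the resolution and frame rate below are illustrative, not from the experiments above.

```cpp
// Back-of-the-envelope bandwidth check for remotely decoded video:
// a raw (uncompressed) stream must fit in the link. The figures are
// illustrative assumptions, not measurements from the article.
#include <iostream>

int main() {
    const double width = 1280, height = 720, fps = 25;
    const double bits_per_pixel = 12;  // YUV 4:2:0 sampling
    const double mbps = width * height * bits_per_pixel * fps / 1e6;
    // Roughly 276 Mbps for one raw 720p25 stream: more than a 100 Mbps
    // LTE femtocell downlink, but comfortable on a multi-Gbps 60 GHz link.
    std::cout << "one raw 720p25 stream needs ~" << mbps << " Mbps\n";
}
```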

However, meeting the bandwidth needs is only one of the many challenges posed. Achieving efficient and flexible load balancing in the network also requires a high degree of cooperation and intelligence from devices in the network. This implies not only processing capabilities at the network side but also monitoring, decision making, and signaling capabilities.

Experimental work at imec has shown that, as challenging as it may sound, the future in which all devices in the network efficiently communicate and share resources is much closer than we think.

Exploring multiprocessing extremes

Friday, August 6th, 2010 by Robert Cravotta

Extreme multiprocessing is an interesting topic because it can mean vastly different things to different people depending on what types of problems they are trying to solve.

At one end of the spectrum, there are multiprocessing designs that maximize the amount of processing work the system performs within a unit of time while staying within an energy budget. These designs, often high-compute parallel processing, workstation, or server systems, deliver a higher processing throughput at lower power dissipation than a hypothetical single-core processor running at a significantly faster clock rate. The processor cores in these systems might operate in the GHz range.

While multiprocessing architectures are one approach to increasing processing throughput within an energy budget, for the past few years I have been unofficially hearing from high-performance processor suppliers that some of their customers are asking for faster processors despite the higher energy budget. These designers understand how to build their software systems using a single-instruction-stream model. Contemporary programming models and tools are falling short in enabling software developers to scale their code across multiple instruction streams, and the increased software complexity and risk outweigh the complexity of managing higher thermal and energy thresholds.

At the other end of the spectrum, there are multiprocessing designs that rely on multiple processor cores to partition the workload among independent resources, minimizing resource dependencies and design complexity. These designs are the meat and potatoes of the embedded multiprocessing world. The processor cores in these systems might operate in the tens to hundreds of MHz.

Let me clarify how I am using the term multiprocessing to avoid confusion. Multiprocessing designs use more than a single processing core, working together (even indirectly) to accomplish some system-level function. I make no assumption about what type of cores the design uses, nor whether they are identical, similar, or dissimilar. I also do not assume that the cores are co-located on the same silicon die, in the same chip package, on the same board, or even in the same chassis, because the primary differences between these implementation options are the energy dissipation and the latency of the data flow. The design concepts are similar at each scale as long as the implementation meets the energy and latency thresholds. To further clarify, multicore is a subset of multiprocessing where the processing cores are co-located on the same silicon die.

I will try to identify the size, speed, energy, and processing width limits of multiprocessing systems for each of these types of designers. In the next extreme processing article, I will explore how scaling multiprocessing upwards might change basic assumptions about processor architectures.

Robotics and autonomous systems

Tuesday, July 27th, 2010 by Robert Cravotta

Robotics is embedded design made visible. It is one of the few ways that users and designers can see and understand the rate of change in embedded technology. The various sensing and actuating subsystems are not the end-system, nor does the user much care how they are implemented, but both user and designer can recognize how each subsystem contributes, at a high level of abstraction, to the behavior of the end-system.

The state of the art for robotic systems keeps improving. Robots are not limited to military applications. Robots are entrenched in the consumer market in the form of toys and cleaning robots. Aquaproducts and iRobot are two companies that sell robots into the consumer market that clean pools, carpets, roof gutters, and hard floors.

A recent video from the GRASP (General Robotics Automation Sensing and Perception) Laboratory at the University of Pennsylvania demonstrates aggressive maneuvers for an autonomous, flying quadrotor (or quadrocopter). The quadrotor video demonstrates that it can autonomously sense and adjust for obstacles, as well as execute and recover from performing complex flight maneuvers.

An even more exciting capability is groups of autonomous robots that are able to work together toward a single goal. A recent video demonstrates multiple quadrotors flying together to carry a rigid structure. At this point, the demonstration only involves rigid structures, and I have not yet been able to confirm whether the cooperative control mechanism can work with carrying non-rigid structures.

Building robots that can autonomously work together in groups is a long-term goal. There are robot soccer competitions that groups such as FIRA and RoboCup sponsor throughout the year to promote interest and research into cooperative robots. However, building packs of cooperating robots is not limited to games. Six development teams were recently announced as finalists for the inaugural MAGIC (Multi Autonomous Ground-Robotic International Challenge) event.

Robotics relies on the integration of software, electronics, and mechanical systems. Robotic systems need to coordinate sensing of the external world with their own internal state to navigate through the real world and accomplish a task or goal. As robotic systems continue to mature, they incorporate more context recognition of their surroundings, self-state, and goals, so that they can perform effective planning. Lastly, multiprocessing concepts are put to practical tests, not only within a single robot but also within packs of robots. Understanding what does and does not work with robots may strongly influence the next round of innovations in embedded designs as they adopt and implement more multiprocessing concepts.

Extreme Processing: Oil Containment Team vs. High-End Multiprocessing

Friday, June 11th, 2010 by Robert Cravotta

So far, in this extreme processing series, I have been focusing on the low or small end of the extreme processing spectrum. But extreme processing thresholds do not only apply to the small end of the spectrum – they also apply to the upper end of the spectrum where designers are pushing the processing performance so hard that they are limited by how well the devices and system enclosures are able to dissipate heat. Watching the BP oil well containment effort may offer some possible insights and hints at the direction that extreme high processing systems are headed.

The incident command centre at Houma, Louisiana. Over 2500 people are working on the response operation. © BP p.l.c.

According to the BP CEO's video, there are 17,000 people working on the oil containment team. At a crude level, the containment team is analogous to a 17,000-core multiprocessing system. Now consider that contemporary extreme multiprocessing devices generally offer a dozen or fewer cores in a single package, and some of the highest-density multicore devices contain approximately 200 cores in a single package. The logistics of managing 17,000 distinct team members toward a single set of goals, by delivering new information where it is needed as quickly as possible, is analogous to the challenges designers of high-end multiprocessing systems face.

The people on the containment team span multiple disciplines, companies, and languages. Even though each team member brings a unique set of skills and knowledge to the team, there is some redundancy in the partitioning of those people. Take for example the 500 people in the crisis center. That group necessarily consists of two or three shifts of people that fulfill the same role in the center because people need to sleep and no single person could operate the center 24 hours a day. A certain amount of redundancy for each type of task the team performs is critical to avoid single-point failures because someone gets sick, hurt, or otherwise becomes unavailable.

Out in the field are many ships directly involved in the containment effort at the surface of the ocean over the leaking oil pipe. Every movement of those ships must be carefully planned, checked, and verified by a logistics team before it is executed, because each ship hosts up to a dozen active ROVs (remotely operated vehicles) connected to it by mile-long cables. Tangling those cables could be disastrous.

In the video, we learn that the planning lead time for the procedures the field team executes extends 6 to 12 hours ahead, and some planning extends out approximately a week. The larger, more ambitious projects require even more planning time. What is perhaps understated is that the time frames for these projects are up to four times faster than the normal pace: approximately one week to do what would normally take one month of planning.

The 17,000 people are working simultaneously, similar to the many cores in multiprocessing systems. There are people who specialize in routing data and new information to the appropriate groups, analogous to how the scheduling circuits in multiprocessing systems operate. The containment team is executing planning across multiple paths, analogous to speculative execution and multi-pipelined systems. The structure of the team cannot afford the critical-path hit of sending all information to a central core team to analyze and make decisions; those decisions are made in distributed pockets, and the results flow to the central core team to ensure that decisions from different teams do not conflict with or exclude one another.

I see many parallels with the challenges facing designers of multiprocessing systems. How about you? If you would like to be an information source for this series or provide a guest post, please contact me at Embedded Insights.

[Editor's Note: This was originally posted on the Embedded Master]