Parallel Programming - Basic Theory For The Unwary

Preface - Why Should You Care?

Writing sequential programs is the what most, if not all, programmers are being initially trained to do. At a certain phase in their life they discover that there are other models for programs. One of them is using parallelism. This means that instead of having your program carried out in a sequence, one instruction at a time, it is being executed by several different entities simultaneously. This can sometimes make the programs simpler to design, and may also run faster than a matching sequential program.

For example, if you have a computer with several processors, they might each be running a part of the program simultaneously, and thus complete it faster than if only one processor would have to run the whole program.

Another example is if you have a program that needs to read data from a disk, and meanwhile make some heavy calculations on it. Since the disk can transfer data to memory without intervention of the CPU, it would make sense to split your program into two parts running in parallel: one handles I/O, and reads data into memory. The other does the heavy calculations. You could do it all in one sequential part, that majorly does the calculations, and from time to time goes to read another block of data, but it is harder to write it this way, or to be efficient (how will the program know when the last block of data was read, and it is time to read another block?)

Scope Of This Tutorial

This document attempts to illustrate the terms and principles commonly used in parallel systems. It is by no means a replacement for a good parallel programming course. Yet, it may make it easier for people without this background able to read and understand the various parallel programming tutorials, and start writing parallel programs. I‘d still advise that they eventually take a course or two about parallel and/or distributed programming. I‘d like to hear from you if this tutorial really achieved its purpose for you, if it was too theoretical, too short, or was flawed in a different way.

Part of this tutorial concentrates on the underlying hardware and operating system software used in parallel systems. The last section tries to give a few in-sites as to when to try to make a system parallelic, how to design it properly, and what kind of tools and frameworks one could expect to find that can ease such development.

Entities Of Parallel Systems

A parallel system is a system (software and/or hardware) that allows one to write programs whose different parts are carried out in different threads of execution.
In order to better understand what a parallel (or parallelic) system is, we should check what are the different components such a system is made of.

Processes

A process is an entity that executes a computer program, and manipulates all other resources in order to fulfill the mission of the program. Several (or many) processes may be running on the same computer, or on different computers. Usually a single process is running on a single computer. Also, usually each process has its own address space, as well as a few other private resources (an private execution stack, for example).

Note - what we call here ‘process‘ is a broader entity than what you might know as a Unix process. We should have actually called it a ‘runnable‘, and in practice it could be a Unix process, a Unix thread and so on.

Resources

A resource is an entity that can be used by a process to perform some specific operation. There could be physical resources, as well as logical (or virtual) resources.
A physical resource could be, for example, a disk, which is used for saving files. Or a keyboard, which is used to read data from the user.
A logical resource could be a shared memory area, which several processes may access in order to read data to, or read data from.

Executers

An executer is a special resource that is used to execute a process. Usually, this is a CPU, so we‘ll use the term CPU from now on. Note that not all CPUs are equal - some are general-purpose, and might carry out the program as a whole. Others are specialized for different tasks - a CPU that only does mathematical floating point operations; A CPU that only handles graphical operations (commonly found on graphical screen cards, and serving as a graphic accelerator) and so on.

Schedulers

A scheduler is an entity specifying when processes will run, on which CPUs (executors) they will run, and in what order. Such a scheduler may be controlling a single computer, or several computers (see clustering below).

Synchronizers

A synchronizer is an entity used to control the access of processes to resources. Some synchronizers are used to make sure only a single process uses a given resource at any one time. Other synchronizers are used to make sure a given resource is accessed in a given order of priority by different processes. There is a host of other types of synchronizers to be dealt with. You will most likely bump into synchronizers such as Mutexes, Semaphores, Monitors, Conditions, etc.

Short Terms Dictionary

There are a few terms used in conjunction with all types of parallel systems. We will describe them here, divided into several categories in some sort of a (hopefully) logical order. For each term, we‘ll try to give a real-life example making it easier to grasp the concept, and to remember it (I‘d call this kind of example a "mantra". Forgive me for abusing this word).

Accessing Resources

Mutual Exclusion: A mechanism used to make sure no two processes are trying to use a resource at the same time. Used to avoid corrupting the internal state state of this resource. (mantra - that great big nurse sitting in front of the doctor‘s room, making sure no one gets in until the previous person comes out).
Deadlock: A situation in which a group of two or more processes are all waiting for a set of resources that are currently taken by other processes in the same group, or waiting for events that are supposed to be sent from other processes in the group. (mantra - when was the last time you were anxiously dialing to your boyfriend, finding his phone line constantly busy, while he was trying to call you back all this time? Each one of you was making one phone busy, while trying to dial the other‘s phone‘s number. If none of you would have let go of his/her phone, you‘d be in a deadlock. The phone is, indeed, a valuable resource).
Starvation: A situation where a process that tries to access some resource is never granted access to that resource (mantra - think of the last time you went into your fast-food restaurant, approached the counter, and were always pushed back to the end of the line by the crowd, verbally "starving to death").
Race Condition: A situation in which two (or more) processes are doing some competing operations at the same time, and the results might come up screwed up due to collisions. (mantra - think of two people trying to get into the same door at the same time, jamming each other‘s way in).
Busy Wait: A situation in which a process that is waiting for a resource to become free, enters a loop of constantly polling the resource in order to find if it is free. In the process it consumes all available CPU power to perform its constant polls. (mantra - when you are waiting for the cable guy to come and connect you new apartment, and keep looking out the window all the time to see when the technician arrives).

Handling Events And Notifications

Asynchronous Notification

A method for one process to notify a second process about some state change, while the second process is executing some unrelated operations. This is usually done using hardware interrupts, or using Unix signals (see the Unix signal programming tutorial). In properly designed programs, this notification method is not commonly used. (mantra - you go downtown to do some business, when suddenly your cellular phone rings, with your dad on the other side of the line, telling you his PC tells him he is "bad and invalid", and asks what to do...).

Synchronous Notification

A method for one process to notify a second process about some state change, that the second process will receive only when specifically polling for such notification. This is done by using functions such as select() (see internetworking with Unix sockets) in some sort of an event loop.

Event Loop

A control loop constantly executed by an event-driven program - its structure is:

Wait for a new event, suspending execution until one arrives from another process (or from the operating system).
Check which event type we got.
Invoke a function that can handle the given event type.
Go back to ‘1‘.

(mantra - a technical support technician is sitting by the phone, doing nothing in particular. Every once in a while a phone call arrives - an event - and the technician handles the call, and afterwards goes back to waiting for another call).

Process Scheduling

Process Preemption: An event of the system suspending the execution of a process that is in the middle of putting the CPU to good use, in order to let other processes run. This makes sure no single process will take over the full CPU, even if it gets into an infinite loop and does not make any blocking system calls.
Context Switch: The method by which the operating system makes one CPU suspend the execution of on process, and continue executing a second process. This is called a context switch since the CPU needs to switch to using the full context of the second process - its execution stack, its memory area, the values the registers contained when last executing this process, etc. A context switch may occure when the running process blocks waiting for some resource to become free, when it yields the CPU voluntarily, or when it is being preempted by the operating system‘s scheduler. (mantra - you talk to your mom on the phone, while trying to watch a very interesting T.V. program - you switch your attention between you T.V. and your mom constantly, always taking a short while to remember where you were before the last switch).
Round-Robin Scheduling: A scheduling algorithm in which several processes are being executed by an executed one after the other in turn, each getting the same amount of time to run. After this time slice passes, a context-switch is made, where the first process on the list gets to run next, while the suspended process is being put back at the end of the list. (mantra - when the ideal teacher in class makes sure every pupil gets a chance to speak in turn - well, as if that ideal teacher exists...).
Prioritized Scheduling: A scheduling algorithm in which a priority is assigned to each process, and when several processes are waiting for the CPU to get executed, the one with highest priority is being run, usually until it gets blocked waiting for some resource, or until a process with a higher priority wakes up and demands CPU time. (mantra - on a T.V. debate, some people are given higher priority than others, and interrupt everybody else when they have something to say. On Israeli T.V. - see under ‘Popolitica‘).
Load Balancing: A method by which several tasks are scheduled for separate CPUs or computers, based on the current load on each CPU. Sometimes this includes the ability to migrate processes from one CPU (or computer) to another during its execution.
Commands Pipelining: A process in which commands are broken into several smaller steps, that may be executed in sequence using different components of the CPU, meanwhile allowing others components of the CPU to start executing the next command in parallel. For example, suppose that we have a CPU component that can read data from memory, and another that can do mathematical calculations. If we have two consecutive addition commands that don‘t affect each other (i.e. working on un-related parameters), the CPU could first load the data for the first command using its I/O component, and then while the mathematical component calculates the sum of the data, the I/O component can start fetching the data for the next command in parallel, making overall execution faster. (mantra - when you wake up very late, and find you are very late for school, you start wearing your shirt, and while slipping it on your body with one hand, the other hand rushes to pick up a new pair of socks from the drawer, and so on).

Implementations Of Parallel Systems

Parallel systems implementation may be done in software, in hardware, or as a combination of both. It can also be made using symmetric elements, all of which are capable of performing the same tasks in the same level of efficiency, or using units that are specializing in in different jobs. We will show here several of the commonly used approaches for building a parallel system. In real life cases, we will usually see several methods combined, hopefully due to some good reasoning by the designers, and not by just the driving of marketeers wanting their product to carry the title of a "Parallelic system designed with the Cache-Coherency paradigm, using state-of-the-art virtual multi-threaded scheduling policy". You get the idea.

Implementations In Software

Software implementations of parallel systems usually have to handle the task of letting many processes run on a limited amount of hardware, using it in the most efficient way possible. Most efficient might mean "making sure the CPU is never idle", or better yet "making sure the whole system finishes its tasks as quickly as possible, in human time".

Multi Processes

This time we use the term process to denote an operating system process. All Unix systems, as well as many other operating systems, are multi-process systems. In most of these systems, each process is given its own address space, and they all share the same CPU in turn. The scheduling algorithm might be a simple round-robin algorithm, or one that takes priorities into account (priorities are mostly needed for real-time programs, that must be granted some limits on how long it‘ll take since an event they need arrives, until they are allowed to execute it).

Multi Threading Kernels

A system similar to the multi-process system, except that here a process may be divided into several threads. All threads share the same data area, and are being scheduled in a similar way to how processes are scheduled. Due to the sharing of data area (and few other resources), a context-switch between two threads of the same process is faster than a context switch between two processes. Also, passing data between threads is easier than between two processes, due to the shared address space. On the other hand, protection of threads from one another is weaker than that given to processes - if one thread corrupts its memory area, all other threads in its process may be directly affected. Threads supported by the kernel are sometimes called ‘Light-Weight Processes‘ (LWP).

Multi Threading Libraries

Similar to the kernel-based threads, except that the operating system‘s kernel is not aware of these threads - they are created and handled by a library. The advantage is that "context-switching" between them is faster - it does not involve any system call (with the overhead of switching into kernel mode). On the other hand, if one thread gets blocked by the operating system when invoking some system call (e.g. sleep(), or waiting for input from a network device), the whole process gets suspended, including all its threads. Thus a multi-threaded library implementation is suitable for calculation-intensive applications, not for I/O intensive applications.

Implementations In Hardware

Implementations of parallel systems often involve using extra hardware - several CPUs that can execute different processes in parallel being the most notable hardware addition.

SMP - Symmetric Multi-Processing

This method is used to contain several CPUs in a single computer, each of which has access to the same set of memory chips, and each working as a general-purpose CPU, that can execute any process in the system. The operating system is responsible for assigning newly created processes to specific processes, using some load-balancing algorithm. Sometimes these systems also handle moving a process from one CPU to another, that just became free.

In the past it was a simple task (after the current CPU finished executing one command, simple copy its registers contents into another CPU, and set it to continue execution from the same position). However, these days a CPU executes several commands of a process simultaneously, using pipelining techniques, and also contains an internal cache containing some data and commands, making such a process migration harder to implement, and more wasteful (all the partially-completed commands in the pipeline have to be flashed, the cache of the second CPU has to be reloaded, etc).

SMP systems exist now on many platforms, including PCs running with Intel and Intel-clone CPUs, Sun SPARC stations, using Ultra-SPARC CPUs (or older SPARC-10 and SPARC-20 CPUs), IBM RS/6000 systems using several PowerPC CPUs, and so on. Support for SMP systems is built into many operating systems running on these types of hardware, usually different by the amount of CPUs supported, as well as various internal implementation parameters.

CC-NUMA - Cache-Coherent Non-Uniform Memory Access

This is a different concept of having a parallel system with several CPUs. sometimes this system uses specialized processes to do different tasks, sometimes the access to just access to different memory parts is done in an asymmetric way (i.e. every CPU (or group of CPUs) have its own memory part, and thus memory is not shared between CPUs). An example of such a system is Silicon Graphic‘s Origin 2000 system. CC-NUMA systems are usually harder to implement than SMP systems (and thus more expensive), and thus normally not found in low-end or mid-sized systems.

Clustering

Clustering is a technique used to make several computers act as one larger machine, splitting tasks amongst them. They allow one to take several cheap stations, and combine them together to a larger system. It also allows for more redundancy for the system - if one machine in the cluster dies off, the other machines can cover up for it until the malfunctioning machine is repaired, and all this without bringing the whole system down. This type of setup is thus common in systems that must run 24 hours non-stop.

Clustering is often implemented in software, often using a protocol named PVM to communicate between the different machines. Examples fir such systems are Beowulf, for Linux systems, or the clustering system by Tandem corporation.

Tools And Methods For Parallel Applications Development

Writing a parallel application takes a different approach then writing a sequential program. After we decide what needs to be done, we need to decide who gets to do what, and find points where extra parallelism would be beneficial. We then need to decide how our different runnables are going to communicate with one another - sometimes a whole slew of different communications methods is used in one large parallel application, each of which fits a particular need best. We then come to the art of debugging parallel applications, which requires some techniques not required when debugging sequential applications. You will note that similar techniques are also used when debugging device drivers, and even windowing GUI applications.

Note that we‘re not trying to teach the whole methodology in a few paragraphs, but rather just to point out a few places where one might search for more information and wisdom.

Designing A Parallel Application

The first step in designing a parallel application, is determining what level or parallelism, if at all, is beneficial to the problem our application tries to solve. In many cases, parallelism would add much more overhead, than benefits. An important factor is the experience of the programmers with parallel systems. This is not a factor when you‘re trying to learn, of-course, but it is a factor if you want to get something done in an reasonable amount of time. take into account some extra overhead needed to fix hard bugs that stem from timing problems, race conditions, deadlocks and the like.

Once we decided to use parallel programming, we should work on decomposing our system into units that would logically belong to a single runnable. Sometimes we find very natural divisions, other times only experience will help us, or better, looking at other similar applications for which we could find some success record. If we‘re programming in order to learn, we should mostly experiment, write code, test it, dump bad ideas, and be ready to write again from scratch. If we see our design leads to new complexities, its probably time for a change.

Communications Frameworks

A very important factor for the success or a parallel application, is choosing an appropriate communications framework. There are several such framework in common use, and for anything but simplistic and experimental work, we should consider using one of them. We‘ll show here a few examples, thought of-course other methods (including methods implemented by various commercial products) exist.

ONC RPC - Remote Procedure Call

Remote Procedure Calls (RPC) are a method originally developed by Sun microsystems©, allows one process to activate a procedure in a second process, passing it parameters, and optionally getting a result back from the call.

The set of procedures supported by a process are defined in a file using notation called ‘RPC Language‘, and is pre-processed by a tool named ‘rpcgen‘, which creates two groups of files forming two ‘stubs‘. One stub defines functions whose invocation will cause a message to be sent to the remote process, with a request to invoke a certain procedure. This function is invoked by the first (client) process, and returns when we get a reply from the second (server) process, with the value it returned. The second stub contains declarations of functions that need to be implemented by the second (server) process, in order to actually implement the procedures.

During the years, new RPC variants were created, most notably ‘DCE RPC‘, which is part of the ‘Distributed Computing Environment‘, now being maintained by the open group.

OMG‘s CORBA - Common Object Request Broker Architecture

CORBA (Common Object Request Broker Architecture) was started up as an attempt of few hundreds of companies to define a standard that allows clients to invoke methods on specific objects running in remote servers.

This framework defines a language-neutral protocol to allow processes to communicate even if they are written in different programming languages, or running on different operating systems. A declarative language, named IDL (Interface Definition Language) was defined to allow specifying language-neutral interfaces. Each interface has a name, and contains a set of methods and attributes. The interface is then pre-processed by some tool, and generates client and server stubs, similarly to how it is done with ‘rpcgen‘ for RPC. An entity named ‘ORB‘ (Object Request Broker) is then used to allow these clients and servers to communicate.

Above this basic interface, a set of standard services were defined, that supply features that are commonly required: A naming service, to allow client processes to locate remote objects by name, An event service, to allow different objects to register for events sent by other objects, and so on. These services are collectively known as ‘Horizontal CORBA Services‘.
Yet other services are being defined for different areas of computing, for instance, services to be used by medical applications. These are called ‘vertical services‘.

For more information about CORBA, please refer to the Object Management Group‘s web site. You may also check the free CORBA page to locate and download various ‘free‘ implementations of CORBA.

I might annoy some of the readers here, but although we are dealing here with Unix programming, we cannot ignore what appears to be the currently more developed distributed objects framework, DCOM (Distributed Component Object Model). DCOM gives us rather similar services to what CORBA does, and people usually argue that their range of services is smaller than what CORBA provides. However, DCOM served as the foundation for the ActiveX set of interfaces, that are used to allow one windows application to activate another one and fully control it. This is most commonly used to allow one application to embed an object created by another application, and allow the user to manipulate the embedded object by invoking the second application when required. Of-course, the ActiveX interface allows much more than that.

There are several reasons why DCOM is important also for Unix programmers:

In many cases, one needs to be able to make Unix and Windows programs communicate. For that reason, a standard for a CORBA to DCOM interface has been defined by the OMG, and you, as a Unix programmer, might find yourself in a need to interface with such programs.
There are various ideas that exist in DCOM, that are worthy of looking at, and implementing on top of native Unix frameworks (such as CORBA). The rule here is - "don‘t throw out the baby with the bath water".
The Win32 API system is being ported to the Unix environment by several companies, and COM (the non-distributed version of DCOM) is already available for various Unix platforms. DCOM is also being ported to these platforms. When this is done, Unix programmers will be able to employ DCOM without the need to use a Microsoft© operating system.

Third-Party Libraries Supporting Process/Thread Abstractions

Various third-party libraries exist, whose purpose is to ease the development of cross-platform applications. of those, various libraries try to make multi-process and multi-threaded programming easier.

ACE (Adaptive Communications Environment) is a large C++ library, developed at the Washington university in St. Louis. ACE attempts to supply abstractions for a lot of system programming concepts, including sockets, pipes, share memory, processes and threads. These abstractions allow one source-code to be compiled by different compilers on different operating systems, from PCs running Linux, BSD and windows systems, through most types of Unix for workstations, and up to IBM‘s MVS open edition, and not to forget several real-time operating systems, such as VxWorks and LynxOS. There is also a version of ACE ported to Java, named JACE.

Rogue Wave© is a company known for writing commercial libraries that are used to ease the development of applications. One of their libraries is named ‘Threads++‘, and is used to make multi-threaded programming easier. This library is something to consider when developing a commercial multi-threaded application. Refer to Rogue Wave‘s home page for more information.

Debugging And Logging Techniques

Last, but not least, comes the host of problems we face when trying to debug parallel applications. The first problem is that things happen in different processes or threads at the same time, and we need a debugger that can follow all relevant runnables simultaneously. most debuggers now can handle processes properly, and on platforms with multi-threading support, usually the native debugger can handle threads as well. Of course, we need some kind of IDE if we want to use these debuggers without loosing our sanity.

However, because of the complex nature of such applications, and their sensitivity to timing issues, stopping the application in order to examine it with a debugger is something we sometimes cannot afford. Thus, our best shot would be at extensive logging, that can be turned on and off while the application is running. Such a logging facility must allow us to see which process or thread wrote each log record, and exactly when, to be able to deduce anything out of this info. We would also need to make the format of the log files easy to parse, in order to locate interesting events. It is advised that such a logging mechanism be devised early in the life of the project, as it can save hours of battling processes later.

One of the first problems we face when looking for the the cause of a bug in a parallel application, is finding the responsible process. Many times several processes are sending messages in a chain, and this chain breaks somewhere along the way. One method that could be used to debug such problems between two interacting processes, is simply suspending both of them. Then running the one that should initiate the message with a debugger, checking that the data it contains is legal and that the message it is about to send contains correct information. Then we can attach a debugger to the second process, set a breakpoint on the function that is supposed to receive the message, and resume the execution of this process. Of-course, this method cannot be employed when the message handling is sensitive to some timing constraints. In that case, only extensive logging will help us.

本站仅提供存储服务，所有内容均由用户发布，如发现有害或侵权内容，请点击举报。