Writing sequential programs is the what most, if not all, programmers are being initially trained to do. At a certain phase in their life they discover that there are other models for programs. One of them is using parallelism. This means that instead of having your program carried out in a sequence, one instruction at a time, it is being executed by several different entities simultaneously. This can sometimes make the programs simpler to design, and may also run faster than a matching sequential program.
For example, if you have a computer with several processors, they might each be running a part of the program simultaneously, and thus complete it faster than if only one processor would have to run the whole program.
Another example is if you have a program that needs to read data from a disk, and meanwhile make some heavy calculations on it. Since the disk can transfer data to memory without intervention of the CPU, it would make sense to split your program into two parts running in parallel: one handles I/O, and reads data into memory. The other does the heavy calculations. You could do it all in one sequential part, that majorly does the calculations, and from time to time goes to read another block of data, but it is harder to write it this way, or to be efficient (how will the program know when the last block of data was read, and it is time to read another block?)
This document attempts to illustrate the terms and principles commonly used in parallel systems. It is by no means a replacement for a good parallel programming course. Yet, it may make it easier for people without this background able to read and understand the various parallel programming tutorials, and start writing parallel programs. I‘d still advise that they eventually take a course or two about parallel and/or distributed programming. I‘d like to hear from you if this tutorial really achieved its purpose for you, if it was too theoretical, too short, or was flawed in a different way.
Part of this tutorial concentrates on the underlying hardware and operating system software used in parallel systems. The last section tries to give a few in-sites as to when to try to make a system parallelic, how to design it properly, and what kind of tools and frameworks one could expect to find that can ease such development.
A parallel system is a system (software and/or hardware) that allows one to write programs whose different parts are carried out in different threads of execution.
In order to better understand what a parallel (or parallelic) system is, we should check what are the different components such a system is made of.
A process is an entity that executes a computer program, and manipulates all other resources in order to fulfill the mission of the program. Several (or many) processes may be running on the same computer, or on different computers. Usually a single process is running on a single computer. Also, usually each process has its own address space, as well as a few other private resources (an private execution stack, for example).
Note - what we call here ‘process‘ is a broader entity than what you might know as a Unix process. We should have actually called it a ‘runnable‘, and in practice it could be a Unix process, a Unix thread and so on.
A resource is an entity that can be used by a process to perform some specific operation. There could be physical resources, as well as logical (or virtual) resources.
A physical resource could be, for example, a disk, which is used for saving files. Or a keyboard, which is used to read data from the user.
A logical resource could be a shared memory area, which several processes may access in order to read data to, or read data from.
An executer is a special resource that is used to execute a process. Usually, this is a CPU, so we‘ll use the term CPU from now on. Note that not all CPUs are equal - some are general-purpose, and might carry out the program as a whole. Others are specialized for different tasks - a CPU that only does mathematical floating point operations; A CPU that only handles graphical operations (commonly found on graphical screen cards, and serving as a graphic accelerator) and so on.
A scheduler is an entity specifying when processes will run, on which CPUs (executors) they will run, and in what order. Such a scheduler may be controlling a single computer, or several computers (see clustering below).
A synchronizer is an entity used to control the access of processes to resources. Some synchronizers are used to make sure only a single process uses a given resource at any one time. Other synchronizers are used to make sure a given resource is accessed in a given order of priority by different processes. There is a host of other types of synchronizers to be dealt with. You will most likely bump into synchronizers such as Mutexes, Semaphores, Monitors, Conditions, etc.
There are a few terms used in conjunction with all types of parallel systems. We will describe them here, divided into several categories in some sort of a (hopefully) logical order. For each term, we‘ll try to give a real-life example making it easier to grasp the concept, and to remember it (I‘d call this kind of example a "mantra". Forgive me for abusing this word).
select()
(see internetworking with Unix sockets) in some sort of an event loop. Parallel systems implementation may be done in software, in hardware, or as a combination of both. It can also be made using symmetric elements, all of which are capable of performing the same tasks in the same level of efficiency, or using units that are specializing in in different jobs. We will show here several of the commonly used approaches for building a parallel system. In real life cases, we will usually see several methods combined, hopefully due to some good reasoning by the designers, and not by just the driving of marketeers wanting their product to carry the title of a "Parallelic system designed with the Cache-Coherency paradigm, using state-of-the-art virtual multi-threaded scheduling policy". You get the idea.
Software implementations of parallel systems usually have to handle the task of letting many processes run on a limited amount of hardware, using it in the most efficient way possible. Most efficient might mean "making sure the CPU is never idle", or better yet "making sure the whole system finishes its tasks as quickly as possible, in human time".
This time we use the term process to denote an operating system process. All Unix systems, as well as many other operating systems, are multi-process systems. In most of these systems, each process is given its own address space, and they all share the same CPU in turn. The scheduling algorithm might be a simple round-robin algorithm, or one that takes priorities into account (priorities are mostly needed for real-time programs, that must be granted some limits on how long it‘ll take since an event they need arrives, until they are allowed to execute it).
A system similar to the multi-process system, except that here a process may be divided into several threads. All threads share the same data area, and are being scheduled in a similar way to how processes are scheduled. Due to the sharing of data area (and few other resources), a context-switch between two threads of the same process is faster than a context switch between two processes. Also, passing data between threads is easier than between two processes, due to the shared address space. On the other hand, protection of threads from one another is weaker than that given to processes - if one thread corrupts its memory area, all other threads in its process may be directly affected. Threads supported by the kernel are sometimes called ‘Light-Weight Processes‘ (LWP).
Similar to the kernel-based threads, except that the operating system‘s kernel is not aware of these threads - they are created and handled by a library. The advantage is that "context-switching" between them is faster - it does not involve any system call (with the overhead of switching into kernel mode). On the other hand, if one thread gets blocked by the operating system when invoking some system call (e.g. sleep()
, or waiting for input from a network device), the whole process gets suspended, including all its threads. Thus a multi-threaded library implementation is suitable for calculation-intensive applications, not for I/O intensive applications.
Implementations of parallel systems often involve using extra hardware - several CPUs that can execute different processes in parallel being the most notable hardware addition.
This method is used to contain several CPUs in a single computer, each of which has access to the same set of memory chips, and each working as a general-purpose CPU, that can execute any process in the system. The operating system is responsible for assigning newly created processes to specific processes, using some load-balancing algorithm. Sometimes these systems also handle moving a process from one CPU to another, that just became free.
In the past it was a simple task (after the current CPU finished executing one command, simple copy its registers contents into another CPU, and set it to continue execution from the same position). However, these days a CPU executes several commands of a process simultaneously, using pipelining techniques, and also contains an internal cache containing some data and commands, making such a process migration harder to implement, and more wasteful (all the partially-completed commands in the pipeline have to be flashed, the cache of the second CPU has to be reloaded, etc).
SMP systems exist now on many platforms, including PCs running with Intel and Intel-clone CPUs, Sun SPARC stations, using Ultra-SPARC CPUs (or older SPARC-10 and SPARC-20 CPUs), IBM RS/6000 systems using several PowerPC CPUs, and so on. Support for SMP systems is built into many operating systems running on these types of hardware, usually different by the amount of CPUs supported, as well as various internal implementation parameters.
This is a different concept of having a parallel system with several CPUs. sometimes this system uses specialized processes to do different tasks, sometimes the access to just access to different memory parts is done in an asymmetric way (i.e. every CPU (or group of CPUs) have its own memory part, and thus memory is not shared between CPUs). An example of such a system is Silicon Graphic‘s Origin 2000 system. CC-NUMA systems are usually harder to implement than SMP systems (and thus more expensive), and thus normally not found in low-end or mid-sized systems.
Clustering is a technique used to make several computers act as one larger machine, splitting tasks amongst them. They allow one to take several cheap stations, and combine them together to a larger system. It also allows for more redundancy for the system - if one machine in the cluster dies off, the other machines can cover up for it until the malfunctioning machine is repaired, and all this without bringing the whole system down. This type of setup is thus common in systems that must run 24 hours non-stop.
Clustering is often implemented in software, often using a protocol named PVM to communicate between the different machines. Examples fir such systems are Beowulf, for Linux systems, or the clustering system by Tandem corporation.
Writing a parallel application takes a different approach then writing a sequential program. After we decide what needs to be done, we need to decide who gets to do what, and find points where extra parallelism would be beneficial. We then need to decide how our different runnables are going to communicate with one another - sometimes a whole slew of different communications methods is used in one large parallel application, each of which fits a particular need best. We then come to the art of debugging parallel applications, which requires some techniques not required when debugging sequential applications. You will note that similar techniques are also used when debugging device drivers, and even windowing GUI applications.
Note that we‘re not trying to teach the whole methodology in a few paragraphs, but rather just to point out a few places where one might search for more information and wisdom.
The first step in designing a parallel application, is determining what level or parallelism, if at all, is beneficial to the problem our application tries to solve. In many cases, parallelism would add much more overhead, than benefits. An important factor is the experience of the programmers with parallel systems. This is not a factor when you‘re trying to learn, of-course, but it is a factor if you want to get something done in an reasonable amount of time. take into account some extra overhead needed to fix hard bugs that stem from timing problems, race conditions, deadlocks and the like.
Once we decided to use parallel programming, we should work on decomposing our system into units that would logically belong to a single runnable. Sometimes we find very natural divisions, other times only experience will help us, or better, looking at other similar applications for which we could find some success record. If we‘re programming in order to learn, we should mostly experiment, write code, test it, dump bad ideas, and be ready to write again from scratch. If we see our design leads to new complexities, its probably time for a change.
A very important factor for the success or a parallel application, is choosing an appropriate communications framework. There are several such framework in common use, and for anything but simplistic and experimental work, we should consider using one of them. We‘ll show here a few examples, thought of-course other methods (including methods implemented by various commercial products) exist.
Remote Procedure Calls (RPC) are a method originally developed by Sun microsystems©, allows one process to activate a procedure in a second process, passing it parameters, and optionally getting a result back from the call.
The set of procedures supported by a process are defined in a file using notation called ‘RPC Language‘, and is pre-processed by a tool named ‘rpcgen‘, which creates two groups of files forming two ‘stubs‘. One stub defines functions whose invocation will cause a message to be sent to the remote process, with a request to invoke a certain procedure. This function is invoked by the first (client) process, and returns when we get a reply from the second (server) process, with the value it returned. The second stub contains declarations of functions that need to be implemented by the second (server) process, in order to actually implement the procedures.
During the years, new RPC variants were created, most notably ‘DCE RPC‘, which is part of the ‘Distributed Computing Environment‘, now being maintained by the open group.
CORBA (Common Object Request Broker Architecture) was started up as an attempt of few hundreds of companies to define a standard that allows clients to invoke methods on specific objects running in remote servers.
This framework defines a language-neutral protocol to allow processes to communicate even if they are written in different programming languages, or running on different operating systems. A declarative language, named IDL (Interface Definition Language) was defined to allow specifying language-neutral interfaces. Each interface has a name, and contains a set of methods and attributes. The interface is then pre-processed by some tool, and generates client and server stubs, similarly to how it is done with ‘rpcgen‘ for RPC. An entity named ‘ORB‘ (Object Request Broker) is then used to allow these clients and servers to communicate.
Above this basic interface, a set of standard services were defined, that supply features that are commonly required: A naming service, to allow client processes to locate remote objects by name, An event service, to allow different objects to register for events sent by other objects, and so on. These services are collectively known as ‘Horizontal CORBA Services‘.
Yet other services are being defined for different areas of computing, for instance, services to be used by medical applications. These are called ‘vertical services‘.
For more information about CORBA, please refer to the Object Management Group‘s web site. You may also check the free CORBA page to locate and download various ‘free‘ implementations of CORBA.
I might annoy some of the readers here, but although we are dealing here with Unix programming, we cannot ignore what appears to be the currently more developed distributed objects framework, DCOM (Distributed Component Object Model). DCOM gives us rather similar services to what CORBA does, and people usually argue that their range of services is smaller than what CORBA provides. However, DCOM served as the foundation for the ActiveX set of interfaces, that are used to allow one windows application to activate another one and fully control it. This is most commonly used to allow one application to embed an object created by another application, and allow the user to manipulate the embedded object by invoking the second application when required. Of-course, the ActiveX interface allows much more than that.
There are several reasons why DCOM is important also for Unix programmers:
Various third-party libraries exist, whose purpose is to ease the development of cross-platform applications. of those, various libraries try to make multi-process and multi-threaded programming easier.
ACE (Adaptive Communications Environment) is a large C++ library, developed at the Washington university in St. Louis. ACE attempts to supply abstractions for a lot of system programming concepts, including sockets, pipes, share memory, processes and threads. These abstractions allow one source-code to be compiled by different compilers on different operating systems, from PCs running Linux, BSD and windows systems, through most types of Unix for workstations, and up to IBM‘s MVS open edition, and not to forget several real-time operating systems, such as VxWorks and LynxOS. There is also a version of ACE ported to Java, named JACE.
Rogue Wave© is a company known for writing commercial libraries that are used to ease the development of applications. One of their libraries is named ‘Threads++‘, and is used to make multi-threaded programming easier. This library is something to consider when developing a commercial multi-threaded application. Refer to Rogue Wave‘s home page for more information.
Last, but not least, comes the host of problems we face when trying to debug parallel applications. The first problem is that things happen in different processes or threads at the same time, and we need a debugger that can follow all relevant runnables simultaneously. most debuggers now can handle processes properly, and on platforms with multi-threading support, usually the native debugger can handle threads as well. Of course, we need some kind of IDE if we want to use these debuggers without loosing our sanity.
However, because of the complex nature of such applications, and their sensitivity to timing issues, stopping the application in order to examine it with a debugger is something we sometimes cannot afford. Thus, our best shot would be at extensive logging, that can be turned on and off while the application is running. Such a logging facility must allow us to see which process or thread wrote each log record, and exactly when, to be able to deduce anything out of this info. We would also need to make the format of the log files easy to parse, in order to locate interesting events. It is advised that such a logging mechanism be devised early in the life of the project, as it can save hours of battling processes later.
One of the first problems we face when looking for the the cause of a bug in a parallel application, is finding the responsible process. Many times several processes are sending messages in a chain, and this chain breaks somewhere along the way. One method that could be used to debug such problems between two interacting processes, is simply suspending both of them. Then running the one that should initiate the message with a debugger, checking that the data it contains is legal and that the message it is about to send contains correct information. Then we can attach a debugger to the second process, set a breakpoint on the function that is supposed to receive the message, and resume the execution of this process. Of-course, this method cannot be employed when the message handling is sensitive to some timing constraints. In that case, only extensive logging will help us.
联系客服