1. Introduction

How can we reach the peak performance of a machine? The challenge of designing an algorithm that exploits a parallel machine's architecture to produce a faster wall-clock time is the very question that drives parallel computing. Despite the advancement and complexity of modern computer architecture, a computer is still a finite machine, and its limitations must be taken into consideration when implementing an algorithm: is the code operating at peak efficiency without exceeding memory limits? This does not mean the code should perform the fewest number of operations. In fact, given two different algorithms, the one with more operations may be the more efficient if its operations are executed at the same time (in parallel), as opposed to the algorithm with fewer operations that execute in series.

So how can we utilize a parallel machine to execute an optimal number of operations for a given algorithm? Several issues must be addressed in order to answer this question: task partitioning, the mapping of independent tasks onto multiple processors; task scheduling, the assignment of tasks to one or more processors for simultaneous execution; and task synchronization, the determination of an order of execution so that the information exchanged among tasks maintains the desired progress of the algorithm's iterations. Another issue to be aware of is implementing an algorithm that depends on the specifics of a particular parallel computer architecture. In addition to limiting applicability, this approach would render the algorithm obsolete once the architecture changes, in one of the fastest-changing fields in the world.

There are many elements to consider when dealing with parallel optimization, and it is necessary to know which model or models will help achieve optimal efficiency. Two important models are control parallelism, which pertains to the partitioning of a program into independent instruction sets that are executed concurrently, and data parallelism, which pertains to the simultaneous performance of instructions on many data elements by many processors. After reading this technical journal you should have a greater understanding of the principles behind control and data parallelism, a basic understanding of several techniques for executing an optimal number of operations concurrently on a parallel machine, and a greater overall understanding of the issues, techniques, and applications of parallel computing.

2.1 Hazards and Conventions of Programming to Specific Parallel Architecture

Peak performance of a machine is often achieved only through the implementation of an algorithm that exploits that machine's specific architecture. However, by taking a more general approach, one can design an algorithm that is not dependent on a specific architecture but still delivers close to peak efficiency. This approach is greatly desired and should be preferred over an algorithm design that depends on a specific architecture: it ensures the algorithm does not become obsolete once the architecture changes, and it also improves applicability. With so many diverse parallel architectures in existence, an algorithm should have enough flexibility to allow its implementation on a range of architectures without a great degree of difficulty.

2.2 Control and Data Parallelism

Two models help facilitate the implementation of parallel algorithms on a wide range of parallel architectures: control parallelism and data parallelism. Control parallelism partitions the instructions of a program into instruction sets that can be executed concurrently because the sets are independent of each other; pipelining is a popular type of control parallelism. Data parallelism performs instructions simultaneously on many data elements using many processors, by creating tasks from the partitioning of the problem's data and then distributing them to multiple processors. Multiple tasks can be scheduled on the same processor for execution, so the actual number of processors on the target machine is not critical. Data parallelism is generally favored over control parallelism because, as problems become larger, the complexity of the algorithm and the code remains unchanged; only the amount of data increases. Because of this, data parallelism allows more processors to be effectively utilized for large-scale problems.

2.3 Task Partitioning, Scheduling, and Synchronization

A parallel algorithm that requires a large number of operations to reach a solution can be more efficient than a sequential algorithm with fewer operations. So the question becomes: in what ways does parallelism affect computations? Three specific issues must be addressed when designing an algorithm for parallel implementation: task partitioning, task scheduling, and task synchronization.

2.3.1 Task Partitioning

Task partitioning deals with the problem of partitioning operations or data into independent tasks to be mapped onto multiple processors. The operations of an algorithm are partitioned into sets that are independent of each other and can therefore overlap in the duration of their execution; the problem data are partitioned into blocks without interdependencies, so that multiple blocks can be processed in parallel. A task is the name given to such a partition of operations or block of independent data. Task partitioning becomes easier in algorithms designed around independent operations, or in algorithms that work on small subsets of the problem data at each step. Therefore, by addressing the problem of task partitioning through the design of a suitable algorithm, the algorithm designer assists the applications programmer by helping to eliminate a crucial problem in parallel programming.
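The data side of task partitioning, splitting the problem data into blocks without interdependencies, can be shown concretely. This is an illustrative sketch, not from the source; each returned block is a task that can be processed without reference to any other block.

```python
def partition(data, n_tasks):
    """Split the problem data into independent, contiguous blocks (tasks).

    Each block shares no elements with any other, so the blocks can be
    mapped onto different processors and processed in parallel.
    """
    size = (len(data) + n_tasks - 1) // n_tasks  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]
```

For example, `partition(list(range(10)), 3)` yields three blocks, `[0..3]`, `[4..7]`, and `[8, 9]`, which together cover the data exactly once.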

2.3.2 Task Scheduling

Task scheduling addresses the issue of assigning tasks to one or more processors for simultaneous execution. This problem cannot be left to the programmer alone; because of the large variety of architectures, the algorithm designer must produce an algorithm that can be structured to utilize the number of available processors on many different architectures. If the underlying theoretical algorithm is flexible, a satisfactory schedule of tasks to processors can be obtained for a variety of architectures. Therefore, as long as the operations of the algorithm can be structured into at least as many independent tasks as there are available processors, the programmer should be able to resolve any scheduling problem.
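One common way to assign independent tasks to processors, offered here as an illustrative sketch rather than a method from the source, is greedy list scheduling: sort the tasks by cost (longest first) and repeatedly give the next task to the least-loaded processor. The task costs and processor count below are hypothetical.

```python
import heapq

def schedule(task_costs, n_procs):
    """Greedy LPT scheduling: assign each task to the least-loaded processor.

    Returns the per-processor task lists and the makespan (the load of the
    busiest processor, i.e. the parallel finish time).
    """
    # Min-heap of (current load, processor id).
    heap = [(0, p) for p in range(n_procs)]
    assignment = [[] for _ in range(n_procs)]
    # Longest tasks first (the LPT heuristic) tends to balance the loads.
    for task, cost in sorted(enumerate(task_costs), key=lambda t: -t[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(task)
        heapq.heappush(heap, (load + cost, p))
    makespan = max(load for load, _ in heap)
    return assignment, makespan
```

With six tasks of costs `[4, 3, 3, 2, 2, 2]` on two processors, the heuristic splits the total work of 16 evenly, giving a makespan of 8.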

2.3.3 Task Synchronization

Task synchronization is the question of determining an order for the execution of tasks, and the points at which information must be exchanged among tasks, so that the iterations of the algorithm progress correctly throughout its execution. This may appear to be a problem solved strictly by the programmer's implementation; however, an algorithm whose convergence is guaranteed and whose synchronization requirements are not excessive is likely to be more efficient when implemented on a parallel architecture.
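A standard synchronization device for iterative algorithms is a barrier: every worker finishes its share of iteration k before any worker begins iteration k+1, which is exactly the "correct progress of iterations" described above. The following is a minimal sketch under assumed details (each worker simply doubles its block of a shared array each iteration); it is not the source's algorithm.

```python
import threading

def lockstep_doubling(data, n_iters, n_workers=2):
    """Each worker doubles its own block of `data` once per iteration.

    The barrier is the synchronization point: no worker may race ahead
    into the next iteration until all workers have finished the current one.
    """
    barrier = threading.Barrier(n_workers)
    size = (len(data) + n_workers - 1) // n_workers

    def worker(w):
        lo, hi = w * size, min((w + 1) * size, len(data))
        for _ in range(n_iters):
            for i in range(lo, hi):
                data[i] *= 2
            barrier.wait()  # iteration boundary: all workers rendezvous here

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return data
```

After three synchronized iterations every element has been multiplied by 2^3 = 8, regardless of how the operating system interleaves the workers.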

2.4 Work-Depth Models

A work-depth model takes the focus away from any particular machine and places it on the algorithm itself, by examining the total number of operations performed by the algorithm and the dependencies among those operations. The work W of an algorithm is the total number of operations it performs; the depth D is the longest chain of dependencies among its operations. The ratio P = W/D is called the parallelism of the algorithm. The advantage of a work-depth model is that it omits the machine-dependent details that, in other models, only serve to complicate the design and analysis of algorithms. The figure below shows a circuit for adding 16 numbers. All arcs (edges) are directed toward the bottom; input arcs are at the top; each + node adds the values on its incoming arcs and places the result on its outgoing arc. The sum of all inputs is returned on the single output arc at the bottom.
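The work and depth of such an addition circuit can be counted by simulating the balanced tree level by level. The sketch below is illustrative: for 16 inputs it counts W = 15 additions and D = 4 levels of dependent additions, giving parallelism P = 15/4.

```python
def tree_sum_work_depth(values):
    """Sum `values` with a balanced binary tree of additions.

    Returns (total, work, depth), where work is the number of additions
    performed and depth is the number of levels of dependent additions.
    """
    work, depth = 0, 0
    level = list(values)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])  # one + node of the circuit
            work += 1
        if len(level) % 2:        # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1                # one level of the circuit completed
    return level[0], work, depth
```

Every addition on a level depends only on the level above it, so all additions within a level could execute simultaneously; the depth, not the work, bounds the parallel running time.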