Task clustering for multigrain architectures

Excerpts from "Tecniche avanzate di compilazione applicate a macchine parallele e riconfigurabili", Laurea thesis by Giovanni Agosta

Introduction

Over the last decade, the study of algorithms has revealed the presence of varying amounts of parallelism in programs of every kind. Parallelism is the possibility of executing several different instructions of a program concurrently, without compromising the correctness of the execution.

Executing more than one instruction at a time requires dedicated hardware support; since it allows a considerable reduction of the execution time of a program, a number of different solutions have been devised to exploit the parallelism inherent in algorithms.

As entirely new forms of parallelism were discovered, new architectural paradigms were introduced to support them.

Different forms of parallelism can be distinguished by their granularity, that is, the nature of the context in which that parallelism can be detected and exploited.

We distinguish three main forms of parallelism:

- coarse grain parallelism, among large, largely independent portions of a program (tasks);
- medium grain parallelism, at an intermediate level, for instance among the iterations of a loop;
- fine grain parallelism, among individual instructions (instruction level parallelism, ILP).
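As a purely illustrative example, not drawn from the thesis, the following C fragment contains all three forms of parallelism: the two initialisation functions are independent tasks (coarse grain), the loop iterations are mutually independent (medium grain), and the final group of statements has no mutual dependences, so a VLIW or superscalar machine could issue them together (fine grain). All names and computations are invented for the example.

    /* Purely illustrative fragment showing the three granularities of
       parallelism on made-up computations; the names are not from the thesis. */
    #include <stdio.h>

    #define N 1000

    static double a[N], b[N], c[N];

    /* Coarse grain: the two functions below touch disjoint data, so a
       multiprocessor could run them as independent tasks. */
    static void init_inputs(void)  { for (int i = 0; i < N; i++) a[i] = i * 0.5; }
    static void init_weights(void) { for (int i = 0; i < N; i++) b[i] = i * 2.0; }

    int main(void) {
        init_inputs();   /* task 1 */
        init_weights();  /* task 2: independent of task 1 */

        /* Medium grain: the iterations of this loop are independent of one
           another, so distinct iterations can go to distinct processors. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        /* Fine grain (ILP): these statements have no mutual dependences, so a
           VLIW or superscalar machine can issue them in the same cycle. */
        double s0 = c[0] * 2.0;
        double s1 = c[1] + 1.0;
        double s2 = c[2] - 3.0;

        printf("%f %f %f\n", s0, s1, s2);
        return 0;
    }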

In order to exploit these three forms of parallelism, different architectural approaches have been adopted. Multiprocessor architectures mainly address coarse and medium grain parallelism, while VLIW and superscalar machines are two different strategies for exploiting ILP.

Each of these architectures requires different compiler techniques in order to expose the useful parallelism hidden in sequential code.

However, the strategies above are highly specialised, and none of them is able to efficiently manage all forms of parallelism. It is therefore necessary to turn to multigrain architectures, hybrids designed to combine the strengths of different architectural paradigms.

Besides the optimal management of parallelism, there are other goals in architecture and compiler design. In particular, the load balance between processing units and the cost of communication among those units are instrumental in minimising the execution time.
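As a toy illustration, with task costs and communication delays invented for the purpose rather than taken from the thesis, the following C snippet compares the makespan of a balanced mapping of four equal tasks onto two processors with that of an imbalanced mapping in which one task must additionally wait for data from the other processor.

    /* Toy makespan calculation on made-up costs, only to show why both load
       balance and communication cost affect the execution time. */
    #include <stdio.h>

    int main(void) {
        int tasks[4] = {10, 10, 10, 10};   /* computation cost of four tasks    */
        int comm = 5;                      /* delay of one inter-processor msg  */

        /* Balanced mapping: two tasks per processor, no cross-processor edges. */
        int p0 = tasks[0] + tasks[1];
        int p1 = tasks[2] + tasks[3];
        int balanced = p0 > p1 ? p0 : p1;            /* makespan = 20 */

        /* Imbalanced mapping: three tasks on one processor, and the fourth must
           wait for a message from it before starting on the other processor.   */
        int q0 = tasks[0] + tasks[1] + tasks[2];
        int q1 = q0 + comm + tasks[3];
        int imbalanced = q0 > q1 ? q0 : q1;          /* makespan = 45 */

        printf("balanced: %d, imbalanced: %d\n", balanced, imbalanced);
        return 0;
    }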

A family of architectures suited to both fine and medium-coarse grain parallelism is that of VLIW multiprocessors. These machines provide hardware support both for multiple flows of execution, to exploit coarse grain parallelism, and for very long instruction words, to manage ILP. On the other hand, they present the compiler with the issues of multiprocessor compilation as well as those of VLIW-oriented compilation.

The key issue in compiling for a VLIW multiprocessor is the partitioning of instructions over the processing units, which must preserve the properties of VLIW code while allowing for dynamic execution, so as to protect against sudden variations or imbalances in the computational load.

The solution we devised allows runtime scheduling of code blocks that have previously been compiled in VLIW format. The code blocks are identified by a dedicated task-clustering algorithm, which partitions the original flowgraph into a set of clusters.

This approach offers two benefits: runtime scheduling is simplified, since the number of nodes in the flowgraph is reduced; and the communication costs are bounded, since the intra-cluster communication costs are zeroed.
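The thesis algorithm itself is not reproduced here; purely as a sketch of the general idea, the following C program applies a simple greedy edge-zeroing heuristic, in the spirit of classical task-clustering heuristics, to a small invented task graph: it repeatedly merges the two clusters joined by the heaviest remaining edge, as long as the merged computation load stays below an arbitrary threshold, and then reports how many clusters remain and how much inter-cluster communication is left once intra-cluster edges have been zeroed. All weights and the load threshold are illustrative assumptions.

    /* Illustrative sketch of greedy edge-zeroing task clustering; this is NOT
       the algorithm of the thesis, and the task graph data are made up. */
    #include <stdio.h>

    #define NTASKS 6
    #define NEDGES 6

    static int parent[NTASKS];   /* union-find structure over clusters */

    static int find(int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    int main(void) {
        /* computation cost of each task (node of the flowgraph) */
        int cost[NTASKS] = {4, 3, 5, 2, 6, 3};
        /* edges (u, v, communication weight) of the task graph */
        int edge[NEDGES][3] = {
            {0, 1, 8}, {0, 2, 2}, {1, 3, 6}, {2, 4, 1}, {3, 5, 7}, {4, 5, 3}
        };
        int load[NTASKS];        /* total computation per cluster      */
        int limit = 12;          /* arbitrary load-balance threshold   */

        for (int i = 0; i < NTASKS; i++) { parent[i] = i; load[i] = cost[i]; }

        /* Greedily zero the heaviest edges: merging the two endpoints into the
           same cluster removes that communication cost, as long as the merged
           cluster does not exceed the load threshold. */
        for (int pass = 0; pass < NEDGES; pass++) {
            int best = -1, bestw = 0;
            for (int e = 0; e < NEDGES; e++) {
                int u = find(edge[e][0]), v = find(edge[e][1]);
                if (u != v && edge[e][2] > bestw && load[u] + load[v] <= limit) {
                    best = e; bestw = edge[e][2];
                }
            }
            if (best < 0) break;
            int u = find(edge[best][0]), v = find(edge[best][1]);
            parent[v] = u;
            load[u] += load[v];
        }

        /* Only inter-cluster edges still cost anything: intra-cluster
           communication has been zeroed by the merges above. */
        int remaining = 0, clusters = 0;
        for (int e = 0; e < NEDGES; e++)
            if (find(edge[e][0]) != find(edge[e][1])) remaining += edge[e][2];
        for (int i = 0; i < NTASKS; i++)
            if (find(i) == i) clusters++;

        printf("clusters: %d, residual communication: %d\n", clusters, remaining);
        return 0;
    }

On the example data this yields two clusters and a residual communication cost of 5, showing both effects mentioned above: fewer nodes are left to schedule at runtime, and the remaining communication is bounded by the weight of the surviving inter-cluster edges.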


Agathokles
agathokles@libero.it