1 Introduction

Distributed-memory systems are generally used for large-scale simulations. To program such systems, the Message Passing Interface (MPI) is widely adopted. However, programming with MPI is difficult because programmers must describe inter-process communication while keeping track of the execution flow of their programs, and mistakes can cause deadlocks or wrong results.

To address this issue, a parallel language named High Performance Fortran (HPF) was proposed in 1991. With HPF, programmers can execute their serial programs in parallel by inserting minimal directives into them. If the programmers specify data distribution with HPF directives, the compilers do all other tasks for parallelization (e.g. communication generation and work distribution). However, HPF eventually was not widely accepted because the compilers' automatic processing keeps programmers from tuning performance, and the resulting performance depends heavily on the environment (e.g. compiler and hardware).

________________________________________________________________________________________________

Note

For more details, please refer to: Ken Kennedy, Charles Koelbel and Hans Zima: The Rise and Fall of High Performance Fortran: An Historical Object Lesson, Proc. 3rd ACM SIGPLAN History of Programming Languages Conf. (HOPL-III), pp. 7-1–7-22 (2007).

________________________________________________________________________________________________

Under these circumstances, "the XMP Specification Working Group" was established in 2008 to develop a new parallel programming model that enables easy parallelization of existing serial programs and to design a new language based on it. This group drew on the lessons learned from HPF to define a new parallel language, XcalableMP (XMP). The group was reorganized into one of the working groups of the PC Cluster Consortium in 2011.

One lesson learned from HPF is that the more work a compiler does automatically, the wider the gap between a program and its execution becomes and, as a result, the less usable the language is. In XMP, the programmers explicitly specify the details of parallel programs by means of compiler directives to make their execution easy to understand. In particular, they can explicitly specify communication, synchronization, data mapping, and work mapping to facilitate performance tuning. In addition, XMP supports features for one-sided communication on each process, which were not available in HPF. This feature may enable programmers to implement parallel algorithms more easily.

This chapter presents an overview of the programming model and language specification of XMP. You can find the latest and complete language specification of XMP in: XcalableMP Specification Working Group, XcalableMP Specification Version 1.4, http://xcalablemp.org/download/spec/xmp-spec-1.4.pdf (2018).

1.1 Target Hardware

The target of XcalableMP is distributed-memory multicomputers (Fig. 1). Each compute node, which may contain several cores, has its own local memory (shared by the cores, if any), and is connected with the others via an interconnection network. Each node can access its local memory directly and remote memory (the memory of another node) indirectly (i.e. via inter-node communication). However, it is assumed that accessing remote memory may be much slower than accessing local memory.

Fig. 1 Target hardware of XMP

1.2 Execution Model

The execution entities in an XMP program are referred to as XMP nodes or, more simply, nodes, each of which has its own memory and can communicate with the others.

An XcalableMP program execution is based on the Single Program Multiple Data (SPMD) model, where each node starts execution from the same main routine, and continues to execute the same code independently (i.e. asynchronously) until it encounters an XcalableMP construct (Fig. 2).

Fig. 2 Execution model of XMP

A set of nodes that executes a procedure, statement, loop, or block is referred to as its executing node set, and is determined, dynamically (i.e. at runtime), by the innermost task, loop, or array directive surrounding it. The current executing node set is the executing node set of the current context, which is managed by the XcalableMP runtime system on each node.

The current executing node set at the beginning of the program execution, or the entire node set, is a node set that contains all the available nodes, which can be specified in an implementation-defined way (e.g. through a command-line option).

When a node encounters at runtime either a loop, array, or task construct, and is contained by the node set specified (explicitly or implicitly) by the on clause of the directive, it updates the current executing node set with the specified one and executes the body of the construct, after which it resumes the last executing node set and proceeds to execute the subsequent statements.

In particular, when a node in the current executing node set encounters a loop or an array construct, it executes the loop or the array assignment in parallel with the other nodes, so that each iteration of the loop or element of the assignment is independently executed by the node in which the specified data element resides.

When a node encounters a synchronization or a communication directive, synchronization or communication occurs between it and the other nodes. That is, such global constructs are performed collectively by the current executing nodes. Note that neither synchronization nor communication occurs unless these constructs are specified.

1.3 Data Model

There are two classes of data in XcalableMP: global data and local data. Data declared in an XcalableMP program are local by default.

Global data are distributed onto a node set by the align directive (see Sect. 2.4). Each fragment of distributed global data is allocated in the local memory of a node in the node set.

Local data comprise all data that are not global. They are replicated within the local memory of each of the executing nodes.

A node can directly access only local data and the sections of global data that reside in its local memory. To access data in remote memory, explicit communication must be specified by such means as global communication constructs and coarray assignments (Fig. 3).

Fig. 3 Data model of XMP

1.4 Programming Models

1.4.1 Partitioned Global Address Space

XMP can be classified as a partitioned global address space (PGAS) language, such as Co-Array Fortran [1], Unified Parallel C [2], and Chapel [3].

In such PGAS languages, multiple executing entities (i.e. threads, processes, or nodes in XMP) share a part of their address space, which is, however, partitioned and a portion of which is local to each executing entity.

The two programming models that XMP supports on top of the PGAS model to achieve both high performance and productivity, global-view and local-view, are explained below.

1.4.2 Global-View Programming Model

The global-view programming model is useful when, starting from a serial version of a program, the programmer parallelizes it in a data-parallel style by adding directives with minimum modification. Based on this model, the programmer specifies the distribution of data among nodes using the data distribution directives. The loop construct assigns each iteration of a loop to the node at which the computed data is located. The global-view communication directives are used to synchronize nodes, maintain the consistency of the shadow areas of distributed data, and move sections of distributed data globally. Note that the programmer must explicitly specify, with appropriate directives, the communication needed to make all data references in the program local.

In many cases, the XcalableMP program following the global-view programming model is based on a serial program, and it can produce the same result, regardless of the number of nodes (Fig. 4).

Fig. 4 Parallelization based on the global-view programming model

There are three groups of directives for this model:

  • Data mapping, which specifies the data distribution and mapping to nodes

  • Work mapping (parallelization), which specifies the work distribution and mapping to nodes.

  • Communication and synchronization, which specify how a node communicates and synchronizes with the other nodes.

Because these directives are treated as comments by the compilers of the base languages (Fortran and C), an XcalableMP program can usually be compiled by them and run properly as a serial program.

1.4.3 Local-View Programming Model

The local-view programming model is suitable for programs in which each node executes its own algorithm and accesses remote data explicitly (Fig. 5).

Fig. 5 Local-view programming model

For this model, some language extensions and directives are provided. The coarray notation, which is imported from Fortran 2008, is one such extension, and can be used to specify explicitly on which node the data to be accessed reside. For example, the expression A(i)[N] in XcalableMP Fortran is used to access an array element A(i) located on the node N. If the access is a reference, then a one-sided communication to read the value from the remote memory (i.e. the get operation) is issued by the executing node. If the access is a definition, then a one-sided communication to write the value to the remote memory (i.e. the put operation) is issued by the executing node.

1.4.4 Mixture of Global View and Local View

In the global-view model, nodes are used to distribute data and work. In the local-view model, nodes are used to address remote data in the coarray notation. In application programs, the programmers should choose an appropriate model according to the characteristics of their programs. Figure 6 illustrates the global view and the local view of data.

Fig. 6 Global view and local view

Data can have both a global view and a local view, and can be accessed in both of the views. XcalableMP provides a directive to give the local name (alias) to global data declared in the global-view programming model to enable them to also be accessed in the local-view programming model. This feature is useful to optimize a certain part of a program by using explicit remote data access in the local-view programming model.

1.5 Base Languages

The XcalableMP language specification is defined on the basis of Fortran and C as the base languages. More specifically, the base language of XcalableMP Fortran is Fortran 90 or later, and that of XcalableMP C is ISO C90 (ANSI C89) or later with some extensions (see below).

1.5.1 Array Section in XcalableMP C

In XcalableMP C, the base language C is extended so that a part of an array, that is, an array section or subarray, can be put in an array assignment statement, which is described in Sect. 1.5.2, and some XcalableMP constructs. An array section is built from a subset of the elements of an array, which is specified by a sequence of square-bracketed integer expressions or triplets, which are in the form of:

[base] : [length] [: step]

When step is positive, the triplet specifies a set of subscripts that is a regularly spaced integer sequence of length length, beginning with base and proceeding in increments of step up to the largest such value. The same applies when step is negative.

When base is omitted, it is assumed to be 0. When length is omitted, it is assumed to be the number of remainder elements of the dimension of the array. When step is omitted, it is assumed to be 1.

Assuming that an array A is declared by the following statement,

    int A[100];

some array sections can be specified as follows:

A[10:10] : array section of 10 elements from A[10] to A[19]

A[10:] : array section of 90 elements from A[10] to A[99]

A[:10] : array section of 10 elements from A[0] to A[9]

A[10:5:2] : array section of 5 elements from A[10] to A[18] by step 2

A[:] : array section of the whole of A

1.5.2 Array Assignment Statement in XcalableMP C

In XcalableMP C, the base language C is also extended so that it supports array assignment statements just as Fortran does.

With such statement, the value of each element of the result of the right-hand side expression is assigned to the corresponding element of the array section on the left-hand side. When an operator or an elemental function is applied to array sections in the right-hand side expression, it is evaluated to an array section that has the same shape as that of the operands or arguments, and each element of which is the result of the operator or function applied to the corresponding element of the operands or arguments. A scalar object is assumed to be an array section that has the same shape as that of the other array section(s) in the expression or on the left-hand side, and where each element has its value.

Note that an array assignment is a statement, and therefore cannot appear as an expression in any other statements.

In the example below, an array assignment statement in the fourth line copies the five elements from B[0] to B[4] into the elements from A[5] to A[9].
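
As a minimal sketch in XMP/C (the array sizes here are assumptions chosen for illustration), such code might look like:

    int A[10];
    int B[5];

    A[5:5] = B[0:5];   /* line 4: copies B[0]..B[4] into A[5]..A[9] */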

1.6 Interoperability

Most of the existing parallel applications are written with MPI. It is not realistic to port them over to XMP because each of them consists of millions of lines.

Because XMP is interoperable with MPI, users can develop an XMP application by modifying a part of an existing one instead of rewriting it totally. Besides, when developing a parallel application from scratch, it is possible to use XMP to write a complicated part such as domain decomposition, while using MPI, which could be faster than XMP, to write a hot-spot part that needs to be tuned carefully. In addition, XMP is interoperable with OpenMP and Python (see Chap. 5).

It might be difficult to develop an application with just one programming language or framework, since each generally has its own strengths and weaknesses. Thus, an XMP program is interoperable with programs in other languages to provide both high productivity and performance.

2 Data Mapping

2.1 nodes Directive

The nodes directive declares a node array, which is an array-like arrangement of nodes in a node set. A node array can be multi-dimensional.
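
For illustration, a one-dimensional declaration might be written as follows in XMP/C (a sketch assuming the directive syntax of the 1.4 specification; the XMP/Fortran counterpart would use parentheses):

    #pragma xmp nodes p[4]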

The nodes directive declares a one-dimensional node array p that includes four nodes. In XMP/C, it is zero-based and consists of p[0], p[1], p[2], and p[3]. In XMP/Fortran, it is one-based and consists of p(1), p(2), p(3), and p(4).
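
Similarly, a two-dimensional declaration might look like this XMP/C sketch:

    #pragma xmp nodes p[2][3]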

The nodes directive declares a two-dimensional node array p that includes six nodes. In XMP/C, it consists of p[0][0], p[0][1], p[0][2], p[1][0], p[1][1], and p[1][2]. In XMP/Fortran, it consists of p(1,1), p(2,1), p(3,1), p(1,2), p(2,2), and p(3,2).

________________________________________________________________________________________________

Note

The ordering of the elements in a node array follows that of a normal array in the base language, C or Fortran.

________________________________________________________________________________________________
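
As an illustrative sketch in XMP/C, a dynamic node array might be declared as:

    #pragma xmp nodes p[*]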

An asterisk can be specified as the size in the nodes directive to declare a dynamic node array. In the above code, a one-dimensional dynamic node array p is declared with an asterisk as the size. The actual size of a dynamic node array is determined at runtime to fit the size of the current executing node set. For example, when the programmer runs the sample code with three nodes, the node array p includes three nodes.

The programmer can also declare multi-dimensional dynamic node arrays with an asterisk.
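
For example, the following XMP/C sketch leaves the first dimension to be determined at runtime:

    #pragma xmp nodes p[*][3]   /* with 12 nodes, p becomes 4 x 3 */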

When the programmer runs the sample code with 12 nodes, the node array p has a shape of 4 × 3, in C, or 3 × 4, in Fortran.

________________________________________________________________________________________________

Note

The programmer can put an asterisk only in the last dimension, in XMP/Fortran, or the first dimension, in XMP/C, of the node array.

________________________________________________________________________________________________

_______________________________________________________________

Hint

The dynamic node array may interfere with compiler optimizations. In general, programs with static ones achieve better performance.

________________________________________________________________________________________________

The programmer can declare a node subarray derived from an existing node array. Node subarrays can be used, for example, to optimize inter-node communication by reducing the number of nodes participating in the communication.
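
A sketch of such declarations in XMP/C (assuming the node-reference syntax of the 1.4 specification) might be:

    #pragma xmp nodes p[16]
    #pragma xmp nodes q[8] = p[0:8]
    #pragma xmp nodes r[2][4] = p[8:8]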

In line 1, a node array p including 16 nodes is declared. In line 2, a node subarray q corresponding to the first half of p is declared. In line 3, a two-dimensional node subarray r corresponding to the latter half of p is declared.

The programmer can declare an n-dimensional node subarray derived from an m-dimensional one (Fig. 7).
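
A sketch in XMP/C, assuming the asterisk and colon node-reference notation explained below, might be:

    #pragma xmp nodes p[4][2]
    #pragma xmp nodes row[4] = p[:][*]
    #pragma xmp nodes col[2] = p[*][:]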

Fig. 7 Node subarrays

In line 1, a two-dimensional node array p including 4 × 2 nodes is declared. In line 2, a node subarray row derived from a single row of p is declared. In line 3, a node subarray col derived from a single column of p is declared.

A colon represents a triplet which indicates all possible indices in the dimension. An asterisk indicates the index of the current executing node in the dimension. For example, col[2] corresponds to p[0][0:2] on nodes p[0][0] and p[0][1], and to p[1][0:2] on nodes p[1][0] and p[1][1] in XMP/C. Similarly, col(2) corresponds to p(1:2,1) on nodes p(1,1) and p(2,1), and to p(1:2,2) on nodes p(1,2) and p(2,2) in XMP/Fortran.

In XMP/C, row[0] corresponds to p[0][0] and p[0][1] on p[:][0] and p[:][1], respectively; col[0] corresponds to p[0][0], p[1][0], p[2][0], and p[3][0] on p[0][:], p[1][:], p[2][:], p[3][:], respectively. In XMP/Fortran, row(1) corresponds to p(1,1) and p(2,1) on p(1,:) and p(2,:), respectively; col(1) corresponds to p(1,1), p(1,2), p(1,3), and p(1,4) on p(:,1), p(:,2), p(:,3), p(:,4), respectively.

________________________________________________________________________________________________

Note

The semantics of an asterisk in a node reference is different from that in a declaration.

________________________________________________________________________________________________

2.2 template Directive

The template directive declares a template, which is a virtual array that is used as a "template" for parallelization in the programs and is distributed onto a node array.
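
For illustration, a one-dimensional template might be declared as follows in XMP/C (the XMP/Fortran form would use parentheses, e.g. t(10)):

    #pragma xmp template t[10]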

This template directive declares a one-dimensional template t having ten elements. Templates are indexed in a similar manner to arrays in the base languages. In this case, the template t is indexed from zero to nine (i.e. t[0] to t[9]) in XMP/C, or from one to ten (i.e. t(1) to t(10)) in XMP/Fortran.

________________________________________________________________________________________________

Hint

In many cases, a template should be declared to have the same shape as your target array.

________________________________________________________________________________________________
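
A two-dimensional declaration might look like this XMP/C sketch:

    #pragma xmp template t[10][20]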

The template directive declares a two-dimensional template t that has 10 × 20 elements. In XMP/C, t is indexed from t[0][0] to t[9][19], and, in XMP/Fortran, from t(1,1) to t(20,10).
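
A dynamic template might be declared as in the following XMP/C sketch:

    #pragma xmp template t[:]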

In the above example, a colon instead of an integer is specified as the size to declare a one-dimensional dynamic template t. The colon indicates that the size of the template is not fixed at declaration and is to be fixed at runtime by the template_fix construct (Sect. 2.6).

2.3 distribute Directive

The distribute directive specifies a distribution of the target template. Any of block, cyclic, block-cyclic, and gblock (i.e. uneven block) can be specified to distribute a dimension of a template.

2.3.1 Block Distribution
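
As a sketch in XMP/C (the sizes, 22 elements over three nodes, are chosen to match the example discussed below):

    #pragma xmp nodes p[3]
    #pragma xmp template t[22]
    #pragma xmp distribute t[block] onto p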

The target template t is divided into contiguous blocks and distributed among nodes in the node array p (Fig. 8). Let's suppose that the size of the template is N and the number of nodes is K. If N is divisible by K, a block of size N/K is assigned to each node; otherwise, a block of size ceil(N/K) is assigned to each of floor(N/ceil(N/K)) nodes, a block of size mod(N, ceil(N/K)) is assigned to one node, and no block is assigned to the remaining K − floor(N/ceil(N/K)) − 1 nodes. The block distribution is useful for regular computations such as a stencil one.

Fig. 8 Block distribution

________________________________________________________________________________________________

Note

The function ceil(x) returns the smallest integer value greater than or equal to x, floor(x) returns the largest integer value less than or equal to x, and mod(x, y) returns x modulo y.

________________________________________________________________________________________________

Since ceil(22/3) is 8, eight elements are allocated on each of p[0] and p[1], and the remaining six elements are allocated on p[2].

2.3.2 Cyclic Distribution

The target template t is divided into chunks of size one and distributed among nodes in the node array p in a round-robin manner (Fig. 9). The cyclic distribution is useful for the case where the load on each element of the template is not balanced.
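
A sketch of the directive in XMP/C:

    #pragma xmp distribute t[cyclic] onto p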

Fig. 9 Cyclic distribution

2.3.3 Block-Cyclic Distribution

The target template t is divided into chunks of size w and distributed among nodes in the node array p in a round-robin manner (Fig. 10). The block-cyclic distribution is useful for the case where the load on each element of the template is not balanced but the locality of the elements is required.
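
A sketch in XMP/C, assuming a chunk width of 4:

    #pragma xmp distribute t[cyclic(4)] onto p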

Fig. 10 Block-cyclic distribution

2.3.4 Gblock Distribution

The target template t is divided into contiguous blocks of size W[0], W[1], ⋯, in XMP/C, or W(1), W(2), ⋯, in XMP/Fortran, and distributed among nodes in the node array p (Fig. 11). The array W is called a mapping array. The programmer can specify an irregular (uneven) block distribution with the gblock format.
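
A sketch in XMP/C (the mapping-array values are assumptions for illustration):

    int W[3] = {2, 4, 10};              /* block sizes per node; they must sum to the template size */
    #pragma xmp nodes p[3]
    #pragma xmp template t[16]
    #pragma xmp distribute t[gblock(W)] onto p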

Fig. 11 Gblock distribution

The programmer can specify an asterisk instead of a mapping array in the gblock distribution to defer fixing the actual distribution. In such a case, the actual distribution will be fixed at runtime by using the template_fix construct.

2.3.5 Distribution of Multi-Dimensional Templates

The programmer can distribute a multi-dimensional template onto a node array.
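
For example, a block-block distribution might be specified as in this XMP/C sketch:

    #pragma xmp nodes p[2][2]
    #pragma xmp template t[10][10]
    #pragma xmp distribute t[block][block] onto p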

The distribute directive declares the distribution of a two-dimensional template t onto a two-dimensional node array p. Each dimension of the template is divided in a block manner, and each of the resulting rectangular regions is assigned to a node (Fig. 12).

Fig. 12 Example of multi-dimensional distribution (1)

The programmer can specify a different distribution format in each dimension of a template (Fig. 13).

Fig. 13 Example of multi-dimensional distribution (2)

When an asterisk is specified in a distribute directive as a distribution format, the target dimension is “non-distributed.” In the following example, the first dimension is distributed in a block manner and the second dimension is non-distributed (Fig. 14).
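
A sketch in XMP/C:

    #pragma xmp nodes p[4]
    #pragma xmp template t[10][10]
    #pragma xmp distribute t[block][*] onto p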

Fig. 14 Example of multi-dimensional distribution (3)

2.4 align Directive

The align directive specifies that an array is to be mapped in the same way as a specified template. In other words, an align directive defines the correspondence of elements between an array and a template, and each array element is allocated on the node where the corresponding template element is assigned.
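
As an illustrative XMP/C sketch (the sizes are assumptions):

    #pragma xmp nodes p[2]
    #pragma xmp template t[10]
    #pragma xmp distribute t[block] onto p
    double a[10];
    #pragma xmp align a[i] with t[i]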

The array a is decomposed and laid out so that each element a(i) is colocated with the corresponding template element t(i) (Fig. 15).

Fig. 15 Example of array alignment (1)

The align directive can also be used for multi-dimensional arrays (Fig. 16).

Fig. 16 Example of array alignment (2)

The programmer can align an n-dimensional array with an m-dimensional template for n > m (Fig. 17).
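
A sketch in XMP/C, aligning a two-dimensional array with a one-dimensional template (the sizes are assumptions consistent with the description below):

    #pragma xmp nodes p[4]
    #pragma xmp template t[8]
    #pragma xmp distribute t[block] onto p
    double a[8][8];
    #pragma xmp align a[i][*] with t[i]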

Fig. 17 Example of array alignment (3)

When an asterisk is specified as a subscript in a dimension of the target array in the align directive, the dimension is "collapsed" (i.e. not distributed). In the sample program above, the first dimension of the array a is distributed onto the node array p while the second dimension is collapsed.

In XMP/C, a[0:2][:] will be allocated on p[0] while, in XMP/Fortran, a(:,1:2) will be allocated on p(1).

The programmer can also align an n-dimensional array with an m-dimensional template for n < m (Fig. 18).
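
A sketch in XMP/C, aligning a one-dimensional array with a two-dimensional template (the sizes are assumptions consistent with the description below):

    #pragma xmp nodes p[2][2]
    #pragma xmp template t[8][8]
    #pragma xmp distribute t[block][block] onto p
    double a[8];
    #pragma xmp align a[i] with t[i][*]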

Fig. 18 Example of array alignment (4)

When an asterisk is specified as a subscript in a dimension of the target template in the align directive, the array will be “replicated” along the axis of the dimension.

In XMP/C, a[0:4] will be replicated and allocated on p[0][0] and p[0][1] while, in XMP/Fortran, a(1:4) will be allocated on p(1,1) and p(2,1).

2.5 Dynamic Allocation of Distributed Array

This section explains how distributed (i.e. global) arrays are allocated at runtime. The basic procedure is common to XMP/C and XMP/Fortran, with a few language-specific differences.

In XMP/C, first, declare a pointer of the type of the target array; second, align it as if it were an array; finally, allocate memory for it with the xmp_malloc() function. xmp_desc_of() is an intrinsic/built-in function that returns the descriptor of the XMP object (i.e. nodes, templates, or global arrays) specified by the argument.
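
A sketch of this procedure in XMP/C (assuming xmp_malloc() and xmp_desc_of() are declared in xmp.h, and a size of 100 elements for illustration):

    #include <xmp.h>

    double *a;
    #pragma xmp nodes p[4]
    #pragma xmp template t[100]
    #pragma xmp distribute t[block] onto p
    #pragma xmp align a[i] with t[i]

    int main(void) {
      a = (double *)xmp_malloc(xmp_desc_of(a), 100);  /* allocate the distributed array */
      return 0;
    }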

In XMP/Fortran, first, declare an allocatable array; second, align it; finally, allocate memory for it with the allocate statement.

For multi-dimensional arrays, the procedure is the same as that for one-dimensional ones.

_____________________________________________________________

Note

If the size of the template is not fixed until runtime, the programmer has to fix it at runtime with the template_fix construct.

________________________________________________________________________________________________

2.6 template_fix Construct

The template_fix construct fixes the shape and/or the distribution of an unfixed template.
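
A sketch of the whole procedure in XMP/C (the template_fix syntax and the sizes are assumptions based on the 1.4 specification):

    #include <xmp.h>

    double *a;
    #pragma xmp nodes p[4]
    #pragma xmp template t[:]
    #pragma xmp distribute t[block] onto p
    #pragma xmp align a[i] with t[i]

    int main(void) {
      int n = 100;                       /* size becomes known only at runtime */
    #pragma xmp template_fix t[n]
      a = (double *)xmp_malloc(xmp_desc_of(a), n);
      return 0;
    }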

In the above sample code, first, a template t whose size is unfixed (":") is declared; second, a pointer a, in XMP/C, or an allocatable array a, in XMP/Fortran, is aligned with the template; third, the size of the template is fixed with a template_fix construct; finally, the pointer or the allocatable array is allocated with the xmp_malloc() built-in function in XMP/C or the allocate statement in XMP/Fortran, respectively.

________________________________________________________________________________________________

Note

The template_fix construct can be applied to a template only once.

________________________________________________________________________________________________

This construct can also be used to fix a mapping array of a template that is distributed in "gblock(*)" at declaration.

3 Work Mapping

3.1 task and tasks Construct

The task construct defines a task that is executed by a specified node set. The tasks construct asserts that the task constructs it surrounds can be executed in parallel.

3.1.1 task Construct

When a node encounters a task construct at runtime, it executes the associated block (called a task) if it is included by the node set specified by the on clause; otherwise, it skips the execution of the block (Fig. 19).
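
One way the first program might look in XMP/C (a sketch; capturing the node number outside the task is an assumption consistent with the output described below):

    #include <stdio.h>
    #include <xmp.h>
    #pragma xmp nodes p[4]

    int main(void) {
      int num = xmpc_node_num();     /* id within the entire node set */
    #pragma xmp task on p[1:3]
      {
        printf("%d: Hello\n", num);
      }
      return 0;
    }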

Fig. 19 Example of task construct (1)

In the above example, nodes p[1], p[2], and p[3] invoke the printf() function, and p[1] outputs "1: Hello" in XMP/C; p(2), p(3), and p(4) execute the write statement, and p(2) outputs "2: Hello" in XMP/Fortran.

Note that a new node set is generated by each task construct. Let’s consider inserting a bcast construct into the task.
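
A sketch in XMP/C (num is assumed to be declared elsewhere):

    #pragma xmp task on p[1:3]
    {
    #pragma xmp bcast (num)
    }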

This bcast construct is executed by the node set specified by the on clause of the task construct. Thus, the node p[1] broadcasts the value of num to p[2] and p[3] in XMP/C, and p(2) to p(3) and p(4) in XMP/Fortran.

The bcast construct in the above code is equivalent to that in the following code, where it is executed by a new node set q that is explicitly declared.

Note that the task is executed by the node set specified by the on clause. Therefore, xmpc_node_num() and xmp_node_num() return the id in the node set.

For example, consider inserting xmpc_node_num() or xmp_node_num() into the task in the first program.
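
That is, a sketch such as:

    #pragma xmp task on p[1:3]
    {
      printf("%d: Hello\n", xmpc_node_num());   /* id within the node set of the task */
    }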

The node p[1] outputs "0: Hello" in XMP/C, and p(2) outputs "1: Hello" in XMP/Fortran.

________________________________________________________________________________________________

Note

A new node set should be collectively generated by all of the executing nodes at the point of a task construct unless it is surrounded by a tasks construct. Therefore, in the above example, p[0] in XMP/C and p(1) in XMP/Fortran must process the task construct.

________________________________________________________________________________________________

3.1.2 tasks Construct

Let’s consider that each of two tasks invokes a function.
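
For example (a sketch in XMP/C; func_a() and func_b() are hypothetical functions, and the node subsets are assumptions):

    #pragma xmp task on p[0:2]
    {
      func_a();
    }
    #pragma xmp task on p[2:2]
    {
      func_b();
    }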

In the above example, the two tasks cannot be executed in parallel because the on clauses must be evaluated by all of the executing nodes (Fig. 20).

Fig. 20 Example of task construct (2)

In such a case, the programmer must specify a tasks construct surrounding the tasks to execute them in parallel (Fig. 21).
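
A sketch of the same two tasks surrounded by a tasks construct:

    #pragma xmp tasks
    {
    #pragma xmp task on p[0:2]
      {
        func_a();
      }
    #pragma xmp task on p[2:2]
      {
        func_b();
      }
    }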

Fig. 21 Example of tasks construct

Because the node sets specified by the on clauses of the task constructs surrounded by a tasks construct are disjoint, they can be executed in parallel.

3.2 loop Construct

The loop construct is used to parallelize a loop.
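
As an illustrative XMP/C sketch (the declarations and sizes are assumptions):

    #pragma xmp nodes p[2]
    #pragma xmp template t[10]
    #pragma xmp distribute t[block] onto p
    double a[10];
    #pragma xmp align a[i] with t[i]

    #pragma xmp loop on t[i]
    for (int i = 0; i < 10; i++)
      a[i] = i;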

The loop directive above specifies that the iteration i of the following loop is executed by the node that owns the template element t[i] or t(i), which is specified in the on clause.

Such a loop must satisfy the following two conditions:

  1. There is no data/control dependence among the iterations. In other words, the iterations of the loop can be executed in any order to produce the same result.

  2. Elements of distributed arrays, if any, are accessed only by the node(s) that own(s) the elements.

The program below is an example of a correct loop directive and loop statement. Condition 1 is satisfied because i is the only index of the distributed array a that is accessed within the loop, and condition 2 is also satisfied because the index of the template in the on clause of the loop directive is identical to that of the distributed array (Fig. 22).
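
Assuming the same declarations of p, t, and a as in the sketch above, such a combination might be:

    #pragma xmp loop on t[i]
    for (int i = 0; i < 10; i++)
      a[i] = a[i] + 1.0;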

Fig. 22 Example of loop construct (1)

Then, is it possible to parallelize the loops in the example below, where the loop bounds are narrowed relative to those above?
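
A sketch with narrowed bounds (the same declarations are assumed):

    #pragma xmp loop on t[i]
    for (int i = 1; i < 9; i++)
      a[i] = a[i] + 1.0;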

In this case, conditions 1 and 2 are satisfied and therefore it is possible to parallelize them. In XMP/C, p[0] processes the indices from one to four and p[1] from five to eight. In XMP/Fortran, p(1) processes the indices from two to five and p(2) from six to nine (Fig. 23).

Fig. 23 Example of loop construct (2)

Next, is it possible to parallelize the loops below, in which the index of the distributed array differs from the template index in the on clause?
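
A sketch of such a loop (the same declarations are assumed):

    #pragma xmp loop on t[i]
    for (int i = 0; i < 9; i++)
      a[i+1] = a[i+1] + 1.0;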

In this case, condition 1 is satisfied but 2 is not, and therefore it is not possible to parallelize them. In XMP/C, p[0] tries to access a[5] but does not own it. In XMP/Fortran, p(1) tries to access a(6) but does not own it (Fig. 24).

Fig. 24 Example of loop construct (3)

3.2.1 Reduction Computation

The serial programs below are examples of a reduction computation.
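
For example, a serial summation might look like the following sketch (the array and its size are assumptions):

    double a[10], sum = 0.0;

    for (int i = 0; i < 10; i++)
      sum += a[i];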

If the above loops are parallelized just by adding a loop directive, the value of the variable sum varies from node to node because it is calculated separately on each node (Fig. 25). The value should be reduced to produce the right result.

Fig. 25 Example of reduction computation (1)

Then, to correct the error in the above code, add a reduction clause to the loop directive as follows (Fig. 26).
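
A sketch of the corrected parallel version in XMP/C:

    double sum = 0.0;

    #pragma xmp loop on t[i] reduction(+:sum)
    for (int i = 0; i < 10; i++)
      sum += a[i];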

Fig. 26 Example of reduction computation (2)

An operator and target variables for reduction computation are specified in a reduction clause. In the above examples, a “+” operator and a target variable sum are specified for the reduction computation to produce a total sum among nodes.

Operations that can be specified as an operator in a reduction clause are limited to the following associative ones.

_____________________________________________________________

Note

The total result is calculated by combining the partial results on all nodes. The order of the combination is unspecified. Hence, if the target variable is of a floating-point type (e.g. float in XMP/C or real in XMP/Fortran), the difference in order can make the result slightly different from that of the original serial execution.

________________________________________________________________________________________________

3.2.2 Parallelizing Nested Loop

Parallelization of nested loops can be specified similarly to a single one, as follows.
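
A sketch in XMP/C (the declarations and sizes are assumptions):

    #pragma xmp nodes p[2][2]
    #pragma xmp template t[8][8]
    #pragma xmp distribute t[block][block] onto p
    double a[8][8];
    #pragma xmp align a[i][j] with t[i][j]

    #pragma xmp loop (i, j) on t[i][j]
    for (int i = 0; i < 8; i++)
      for (int j = 0; j < 8; j++)
        a[i][j] = i + j;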

3.3 array Construct

The array construct is for work mapping of array assignment statements.
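
For example (a sketch in XMP/C; the declarations are assumptions):

    double a[10];
    #pragma xmp align a[i] with t[i]

    #pragma xmp array on t[:]
    a[:] = 1.0;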

The code above is equivalent to the code below.
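
A sketch of the equivalent loop version:

    #pragma xmp loop on t[i]
    for (int i = 0; i < 10; i++)
      a[i] = 1.0;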

This construct can also be applied to multi-dimensional arrays.

_____________________________________________________________

Note

The template appearing in the on clause must have the same shape as the arrays in the following statement. The right-hand side value in this construct must be identical among all nodes because the array construct is a global (i.e. collective) operation.

________________________________________________________________________________________________

4 Data Communication

4.1 shadow Directive and reflect Construct

Stencil computation frequently appears in scientific simulation programs, where, to update an array element a[i], its neighboring elements a[i-1] and a[i+1] are referenced. If a[i] is on the boundary region of a block-distributed array on a node, a[i+1] may reside on another (neighboring) node.

Since copying a[i+1] from the neighboring node each time a[i] is updated involves a large overhead, a technique of collectively copying the elements on the neighboring node into an area added to the distributed array on each node is usually adopted. In XMP, such an additional area is called a "shadow."

4.1.1 Declaring Shadow

Shadow areas can be declared with the shadow directive. In the example below, an array a has shadow areas of width one on both the lower and upper bounds.
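
As an XMP/C sketch (the sizes are assumptions):

    #pragma xmp nodes p[4]
    #pragma xmp template t[16]
    #pragma xmp distribute t[block] onto p
    double a[16];
    #pragma xmp align a[i] with t[i]
    #pragma xmp shadow a[1]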

In Fig. 27, the shaded elements are those that each node owns, and the white ones are the shadow areas.

Fig. 27 Example of shadow directive (1)

________________________________________________________________________________________________

Note

Arrays distributed in a cyclic manner cannot have shadow.

________________________________________________________________________________________________

In some programs, it is natural that the widths of the shadow areas on the lower and upper bounds are different. There is also a case where the shadow area exists only on one of the bounds. In the example below, it is declared that a distributed array a has a shadow area of width one only on the upper bound (Fig. 28).
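
A sketch of such a declaration in XMP/C:

    #pragma xmp shadow a[0:1]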

Fig. 28 Example of shadow directive (2)

The values on the left- and right-hand sides of a colon designate the widths on the lower and upper bounds, respectively.

4.1.2 Updating Shadow

To copy data to shadow areas from neighboring nodes, use the reflect construct. In the example below, the shadow areas of an array a that are of width one on both the upper and lower bounds are updated (Fig. 29).
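
A sketch in XMP/C:

    #pragma xmp reflect (a)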

Fig. 29 Example of reflect construct (1)

With this reflect directive, in XMP/C, node p[1] sends an element a[4] to the shadow area on the upper bound on node p[0] and a[7] to the shadow area on the lower bound on p[2]; p[0] sends an element a[3] to the shadow area on the lower bound on p[1], and p[2] sends a[8] to the shadow area on the upper bound on p[1].

Similarly, in XMP/Fortran, node p(2) sends an element a(5) to the shadow area on the upper bound on node p(1) and a(8) to the shadow area on the lower bound on p(3); p(1) sends an element a(4) to the shadow area on the lower bound on p(2), and p(3) sends a(9) to the shadow area on the upper bound on p(2).

The default behavior of a reflect directive is to update the whole of the shadow area declared by the shadow directive. However, there are some cases where a specific part of the shadow area is to be updated to reduce the communication cost at a point of the code.

To update only a specific part of the shadow area, add the width clause to the reflect directive.

The values on the left- and right-hand sides of a colon in the width clause designate the widths on the lower and upper bounds to be updated, respectively. In the example below, only the shadow area on the upper bound is updated (Fig. 30).
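
A sketch in XMP/C:

    #pragma xmp reflect (a) width(0:1)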

Fig. 30 Example of reflect construct (2)

_____________________________________________________________

Note

If the widths of the shadow areas to be updated on the upper and lower bounds are equal, that is, for example, width(1:1), you can abbreviate it as width(1).

________________________________________________________________________________________________

_______________________________________________________________

Note

It is not possible to update the shadow area on a particular node because reflect is a collective operation.

________________________________________________________________________________________________

The reflect directive does not update either the shadow area on the lower bound on the leading node or that on the upper bound on the last node. However, the values in such areas are needed for stencil computation if periodic boundary conditions are used in the computation.

To update such areas, add a periodic qualifier into the width clause. Let’s look at the following example where an array a having shadow areas of width one on both the lower and upper bounds appears (Fig. 31).
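
A sketch in XMP/C (assuming the /periodic/ modifier syntax of the width clause):

    #pragma xmp reflect (a) width(/periodic/1:1)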

Fig. 31 Example of periodic reflect construct

The periodic qualifier has the following effects, in addition to those of a normal reflect directive: in XMP/C, node p[0] sends an element a[0] to the shadow area on the upper bound on node p[3], and p[3] sends a[15] to the shadow area on the lower bound on p[0]; in XMP/Fortran, node p(1) sends an element a(1) to the shadow area on the upper bound on node p(4), and p(4) sends a(16) to the shadow area on the lower bound on p(1).

The shadow directive and reflect construct can be applied to arrays distributed in multiple dimensions. The following programs are the examples for two-dimensional distribution.

The central node receives data from the surrounding eight nodes to update its shadow areas (Fig. 32). The shadow areas of the other nodes are also updated, which is omitted in the figure.

Fig. 32 Example of multi-dimensional shadow (1)

For some applications, data from ordinal directions are not necessary. In such a case, the data communication from/to the ordinal directions can be avoided by adding the orthogonal clause to a reflect construct (Fig. 33).
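
A sketch in XMP/C:

    #pragma xmp reflect (a) orthogonal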

Fig. 33 Example of multi-dimensional shadow (2)

_______________________________________________________________

Note

The orthogonal clause is effective only for arrays of which more than one dimension is distributed.

________________________________________________________________________________________________

Besides, you can also add shadow areas to only a specified dimension (Fig. 34).
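
A sketch in XMP/C, where only the first (distributed) dimension has a shadow of width one:

    double a[16][16];
    #pragma xmp align a[i][*] with t[i]
    #pragma xmp shadow a[1][0]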

Fig. 34 Example of multi-dimensional shadow (3)

For the array a, 0 is specified as the shadow width in non-distributed dimensions.

4.2 gmove Construct

The programmers can specify a communication of distributed arrays in the form of assignment statements by using the gmove construct. In other words, with the gmove construct, any array assignment between two arrays (i.e. global data movement) that may involve inter-node communication can be specified.

There are three modes of gmove; “collective mode,” “in mode,” and “out mode.”

4.2.1 Collective Mode

The global data movement involved in a collective gmove is performed collectively, and results in implicit synchronization among the executing nodes.
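
A sketch in XMP/C consistent with the communication described below (the array sizes and sections are assumptions):

    double a[16], b[16];            /* assumed block-distributed over a 4-node array via template t */
    #pragma xmp align a[i] with t[i]
    #pragma xmp align b[i] with t[i]

    #pragma xmp gmove
    a[9:5] = b[0:5];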

In XMP/C, p[0] sends b[0]-b[3] to p[2]-p[3], and p[1] sends b[4] to p[3]. Similarly, in XMP/Fortran, p(1) sends b(1)-b(4) to p(3)-p(4), and p(2) sends b(5) to p(4) (Fig. 35).

Fig. 35 Collective gmove (1)

While array a is distributed in a cyclic manner, array b is distributed in a block manner.

In XMP/C, p[0] sends b[0] and b[4] to p[2] and p[3]. p[1] sends b[1] to p[2]. Each element of p[2] and p[3] will be copied locally. Similarly, in XMP/Fortran, p(1) sends b(1) and b(5) to p(3) and p(4). p(2) sends b(2) to p(3). Each element of p(3) and p(4) will be copied locally (Fig. 36).

Fig. 36 Collective gmove (2)

By using this method, the distribution of an array can be “changed” during computation.

In this example (Fig. 37), the elements of an array b that is distributed in a block manner are copied to the corresponding elements of an array a that is distributed in a generalized-block manner. For the arrays a and b, communication occurs if the corresponding elements reside in different nodes (arrows illustrate communication between nodes in the figures).

Fig. 37 Collective gmove (3)

In the assignment statement, if a scalar (i.e. one element of an array or a variable) is specified on the right-hand side and an array section is specified on the left-hand side, a broadcast communication occurs for it.

In this example (Fig. 38), in XMP/C, an array element b[0] of node p[0] will be broadcast to the specified array section on nodes p[2] and p[3]. Similarly, in XMP/Fortran, an array element b(1) of node p(1) will be broadcast to the specified array section on nodes p(3) and p(4).

Fig. 38 Collective gmove (4)

Not only distributed arrays but also replicated arrays can be specified on the right-hand side.

In this example, a replicated array b is locally copied to a distributed array a without communication.

In this example (Fig. 39), in XMP/C, b[0][0:2] on p[0], b[0][2:2] on p[1], b[0][4:2] on p[2], and b[0][6:2] on p[3] are copied to a[0][:] on p[0]. Similarly, in XMP/Fortran, b(1:2,1) on p(1), b(3:4,1) on p(2), b(5:6,1) on p(3), and b(7:8,1) on p(4) are copied to a(:,1) on p(1).

Fig. 39 Collective gmove (5)

4.2.2 In Mode

The right-hand side data of the assignment, all or part of which may reside outside the executing node set, can be transferred from its owner nodes to the executing nodes with an in gmove.

In this example, the task directive divides four nodes into two sets, the first half and the second half. A gmove construct in in mode copies data using a get operation from the second-half nodes to the first-half nodes (Fig. 40).

Fig. 40 In gmove

4.2.3 Out Mode

For the left-hand side data of the assignment, all or part of which may reside outside the executing node set, the corresponding elements can be transferred from the executing nodes to its owner nodes with an out gmove construct.

A gmove construct in out mode copies data using a put communication from the first-half nodes to the second-half nodes (Fig. 41).

Fig. 41 Out gmove

4.3 barrier Construct

The barrier construct executes a barrier synchronization.

You can specify a node set on which the barrier synchronization is to be performed by using the on clause. In the example below, a barrier synchronization is performed among the first two nodes of p.
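
A sketch in XMP/C:

    #pragma xmp barrier on p[0:2]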

4.4 reduction Construct

This construct performs a reduction operation. It has the same meaning as the reduction clause of the loop construct, but this construct can be specified anywhere executable constructs can be located (Fig. 42).

Fig. 42 reduction construct (1)

You can specify the executing node set by using the on clause. In the example below, only the values on the last two of the four nodes are targeted by the reduction construct (Fig. 43).
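
A sketch in XMP/C (sum is assumed to be the target variable):

    #pragma xmp reduction (+:sum) on p[2:2]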

Fig. 43 reduction construct (2)

The operators you can use in the reduction construct are as follows:

_____________________________________________________________

Note

In contrast to the reduction clause of the loop construct, which precedes loops, the reduction construct does not accept operators of firstmax, firstmin, lastmax, and lastmin.

________________________________________________________________________________________________

_______________________________________________________________

Note

Similar to the reduction clause, the reduction construct may generate slightly different results in a parallel execution from those in a sequential execution, because the results depend on the order of combining the value.

________________________________________________________________________________________________

4.5 bcast Construct

The bcast construct broadcasts the values of the variables on the node specified by the from clause, that is, the root node, to the node set specified by the on clause. If there is no from clause, the first node of the executing node set is selected as the root node. If there is no on clause, the current executing node set of the construct is selected as the executing node set.

In the example below, the first node of the node set p, that is, p[0] or p(1), is the root node (Fig. 44).
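
A sketch in XMP/C (num is assumed to be the broadcast variable):

    #pragma xmp bcast (num)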

Fig. 44 bcast construct (1)

In the example below, the last node, that is, p[3] or p(4), is the root node (Fig. 45).

Fig. 45 bcast construct (2)

In the example below, only the last three of four nodes are included by the executing node set of the bcast construct (Fig. 46).

Fig. 46 bcast construct (3)

4.6 wait_async Construct

Communication directives (i.e. reflect, gmove, reduction, bcast, and reduce_shadow) can perform asynchronous communication if the async clause is added. The wait_async construct is used to guarantee the completion of such an asynchronous communication.
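
A sketch in XMP/C (the tag value 1 is an assumption):

    #pragma xmp bcast (num) async(1)
      /* computation that does not touch num may be overlapped here */
    #pragma xmp wait_async (1)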

Since the bcast directive has an async clause, the communication may not be completed immediately after the bcast directive. The completion of that communication is guaranteed by a wait_async construct having the same tag value as that of the async clause. Therefore, between the bcast construct and the wait_async construct, you must not reference the target variable of the bcast directive.

________________________________________________________________________________________________

Hint

Asynchronous communication can be overlapped with the following computation to hide its overhead.

________________________________________________________________________________________________

_______________________________________________________________

Note

Expressions that can be specified as tags in the async clause are of type int, in XMP/C, or integer, in XMP/Fortran.

________________________________________________________________________________________________

4.7 reduce_shadow Construct

The reduce_shadow directive adds the value of a shadow object to the corresponding data object of the array.
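
A sketch in XMP/C consistent with the values described below (the sizes, initial values, and default reduction width are assumptions):

    #pragma xmp nodes p[2]
    #pragma xmp template t[8]
    #pragma xmp distribute t[block] onto p
    int a[8];
    #pragma xmp align a[i] with t[i]
    #pragma xmp shadow a[1]

    int main(void) {
    #pragma xmp loop on t[i]
      for (int i = 0; i < 8; i++)
        a[i] = i + 1;
    #pragma xmp reflect (a)         /* fill the shadow areas */
    #pragma xmp reduce_shadow (a)   /* add each shadow value to the owner's element */
      return 0;
    }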

For the above example, in XMP/C, a[3] on p[0] has a value of eight, and a[4] on p[1] has a value of ten. Similarly, in XMP/Fortran, a(4) of p(1) has a value of eight, and a(5) on p(2) has a value of ten (Fig. 47).

Fig. 47 reduce_shadow construct (1)

The programmers can add the periodic modifier to the width clause to reduce shadow objects to the corresponding data object periodically.

In addition to the first example, in XMP/C, a[0] on p[0] has a value of two, and a[7] on p[1] has a value of 16. Similarly, in XMP/Fortran, a(1) in p(1) has a value of two, and a(8) in p(2) has a value of 16 (Fig. 48).

Fig. 48 reduce_shadow construct (2)

5 Local-View Programming

5.1 Introduction

The programmer can use coarrays to specify one-sided communication in the local-view model.

Depending on the environment, such one-sided communication might achieve better performance than global communication in the global-view model. However, it is more difficult and complicated to write parallel programs in the local-view model because the programmer must specify every detail of parallelization, such as data mapping, work mapping, and communication.

The coarray feature in XMP/Fortran is upward-compatible with that in Fortran 2008; that in XMP/C is defined as an extension to the base language.

An execution entity in local-view XMP programs is referred to as an "image," whereas it is called a node in global-view ones. These two words have almost the same meaning in XMP.

5.2 Coarray Declaration

In XMP/C, the programmer declares a coarray by adding ":[*]" after the array declaration. In XMP/Fortran, the programmer declares a coarray by adding "[*]" after the array declaration.
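
In XMP/C, for example (a sketch):

    int a[10]:[*];    /* a coarray of 10 ints; every image has its own instance */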

________________________________________________________________________________________________

Note

Following Fortran 2008, a coarray must have the same size on all images.

________________________________________________________________________________________________

Coarrays can be accessed in expressions by remote images as well as by the local image.

5.3 Put Communication

When a coarray appears in the left-hand side of an assignment statement, it involves put communication.

The integer in the square bracket specifies the target image index. The image index is zero-based, in XMP/C, or one-based, in XMP/Fortran. xmpc_this_image() in XMP/C and this_image() in XMP/Fortran return the current image index.
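
A sketch of a put in XMP/C consistent with the description below:

    int a[10]:[*], b[10]:[*];

    if (xmpc_this_image() == 0)
      a[0:3]:[1] = b[3:3];     /* put local b[3..5] into a[0..2] on image 1 */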

In the above example, in XMP/C, image 0 puts b[3:3] into a[0:3] on image 1; in XMP/Fortran, image 1 puts b(3:5) into a(1:3) on image 2. Figure 49 illustrates the put communication performed in the example.

Fig. 49 Remote write to a coarray

5.4 Get Communication

When a coarray appears in the right-hand side of an assignment statement, it involves get communication.
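
A sketch of a get in XMP/C consistent with the description below:

    int a[10]:[*], b[10]:[*];

    if (xmpc_this_image() == 0)
      b[3:3] = a[0:3]:[1];     /* get a[0..2] from image 1 into local b[3..5] */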

In the above example, in XMP/C, image 0 gets a[0:3] from image 1 and copies it into b[3:3]; in XMP/Fortran, image 1 gets a(1:3) from image 2 and copies it into b(3:5). Figure 50 illustrates the get communication performed in the example.

Fig. 50 Remote read from a coarray

_______________________________________________________________

Hint

As illustrated above, get communication involves an extra step to send a request to the target node. Put communication achieves better performance than get because there is no such extra step.

________________________________________________________________________________________________

5.5 Synchronization

5.5.1 Sync All

At "sync all," each image waits until all issued one-sided communication is complete and then performs barrier synchronization among all the images.

In the above example, the left image puts data to the right image, and both images invoke sync all. When both images return from it, the execution continues to the following statements (Fig. 51).

Fig. 51 sync all

5.5.2 Sync Images

Each image in the specified image set waits until all one-sided communication issued is complete, and performs barrier synchronization among the images.

5.5.3 Sync Memory

Each image waits until all one-sided communication is complete. This function/statement does not imply barrier synchronization, unlike sync all and sync images, and therefore can be locally executed.

6 Procedure Interface

Procedure calls in XMP are almost the same as those in the base language. Procedure calls across languages or to external libraries are also allowed if the base language supports them.

In the example below, a function/subroutine sub1() calls another function/subroutine sub2() with a distributed array x as an argument.
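
A sketch in XMP/C (the sizes are assumptions; the align directive for the dummy argument appears in the callee, as explained below):

    #pragma xmp nodes p[2]
    #pragma xmp template t[10]
    #pragma xmp distribute t[block] onto p

    void sub2(double x[10]);

    void sub1(void) {
      double x[10];
    #pragma xmp align x[i] with t[i]
      sub2(x);
    }

    void sub2(double x[10]) {
    #pragma xmp align x[i] with t[i]
      /* x is handled here as the same distributed (global) array */
    }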

To handle a parameter or dummy argument as global data in the callee procedure, the programmer needs to distribute it explicitly with an align directive (Fig. 52).

Fig. 52 Passing a global argument to a global parameter

If no align directive is specified in the callee procedure for a parameter or dummy argument that is declared as global data in the caller procedure, it is handled as if it were declared in the callee procedure as local data on each node (Fig. 53).

Fig. 53 Passing a global argument to a local parameter

7 XMPT Tool Interface

7.1 Overview

XMPT is the tool interface of XMP and is inspired by OMPT, the tool interface of OpenMP [4]. Hence, like OMPT, XMPT is designed to be event-based and callback-based; that is, for each event at runtime, the corresponding callback is invoked. One or more XMPT events are defined corresponding to each of the XMP constructs and coarray-related actions (e.g. remote write/read and synchronization).

XMPT is preliminarily implemented in the Omni XMP compiler (see chapter "Implementation and Performance Evaluation of Omni Compiler"), and is used in MUST [5] and, experimentally, in Extrae [6]. More details of the application of XMPT in MUST are described in [7].

7.2 Specification

7.2.1 Initialization

Tool developers can provide the xmpt_initialize function in which they register a callback for each of the XMPT events of interest, as follows.

In the above example, the tool developer implements the callbacks callback_bcast_begin and callback_bcast_end that interact with his/her tool.

When an XMP program starts execution, the XMP runtime implicitly invokes xmpt_initialize, if provided, to set up the callbacks.

7.2.2 Events

XMPT defines XMPT events, each of which corresponds to an XMP construct or a coarray-related action. Below is the list of XMPT events. For each of the events, the function signature of the corresponding callback is specifically defined. Note that the ones from xmpt_event_coarray_remote_write to xmpt_event_sync_images_end are coarray-related.

xmpt_event_task_begin, xmpt_event_task_end, xmpt_event_tasks_begin, xmpt_event_tasks_end,
xmpt_event_loop_begin, xmpt_event_loop_end, xmpt_event_array_begin, xmpt_event_array_end,
xmpt_event_reflect_begin, xmpt_event_reflect_begin_async, xmpt_event_reflect_end,
xmpt_event_gmove_begin, xmpt_event_gmove_begin_async, xmpt_event_gmove_end,
xmpt_event_barrier_begin, xmpt_event_barrier_end,
xmpt_event_reduction_begin, xmpt_event_reduction_begin_async, xmpt_event_reduction_end,
xmpt_event_bcast_begin, xmpt_event_bcast_begin_async, xmpt_event_bcast_end,
xmpt_event_wait_async_begin, xmpt_event_wait_async_end,
xmpt_event_coarray_remote_write, xmpt_event_coarray_remote_read,
xmpt_event_coarray_local_write, xmpt_event_coarray_local_read,
xmpt_event_sync_memory_begin, xmpt_event_sync_memory_end,
xmpt_event_sync_all_begin, xmpt_event_sync_all_end,
xmpt_event_sync_image_begin, xmpt_event_sync_image_end,
xmpt_event_sync_images_all_begin, xmpt_event_sync_images_all_end,
xmpt_event_sync_images_begin, xmpt_event_sync_images_end

When one of the XMPT events for which callbacks are registered occurs at runtime, the corresponding callback is invoked by the XMP runtime. For example, if callbacks are registered for events xmpt_event_bcast_begin and xmpt_event_bcast_end as in the example in the previous section, the callbacks callback_bcast_begin and callback_bcast_end are invoked immediately before and after each of bcast constructs, respectively.

The XMP runtime passes all the information about the construct, including the mapping of the target global arrays, to the callback as its parameters. Thus, the tool is able to extract the necessary information from the arguments.