1 Background

To develop applications on high-performance computing (HPC) cluster systems, Partitioned Global Address Space (PGAS) languages that can demonstrate high productivity are used [2,3,4,5]. Because PGAS languages are designed around one-sided communication, applications written in them can sometimes achieve higher performance than those using the MPI library by directly using a communication layer close to the hardware [5, 6]. Examples of PGAS languages include XcalableMP (XMP) [5, 7, 8], XcalableACC [9,10,11], Coarray Fortran [12], PCJ [13], Unified Parallel C, UPC++ [14], HabaneroUPC++ [15], X10 [16], Chapel [17], and DASH [18].

Although PGAS languages have many advantages, re-implementing an existing MPI application in a PGAS language is often not realistic for the following reasons: (1) since the code of a real-world application may reach several million lines, the programming cost of re-implementation is excessive, and (2) the cases where productivity and performance have actually been improved by PGAS languages are limited. Moreover, since each programming language generally has its own strong and weak points, it is difficult to develop all parallel applications in just one programming language.

In order to exploit the advantages of a PGAS language, it is important to consider the linkage between the PGAS language and other languages. For example, rewriting only the parts where a PGAS language improves performance or code readability has the potential to alleviate the two problems listed in the previous paragraph, because the programming cost of partial re-implementation is smaller than that of re-implementing the whole application.

We have designed the XMP language and developed Omni compiler, described in Chapter 2. This chapter describes the development of linkage functions between XMP and the MPI library for Omni compiler. It also describes how to call an XMP program from a Python program. In particular, since Python has numerous scientific computing libraries, we believe that linking Python and XMP will lead to a significant reduction in programming cost when developing HPC applications.

2 Translation by Omni Compiler

Figure 1 shows an example of a code translation for XMP in C language, where the code in the left figure is translated into that in the right figure. Omni compiler inserts xmp_init_all() and xmp_finalize_all() automatically to perform the initialization and finalization processes, respectively. These functions are defined in the Omni compiler runtime library. xmp_init_all() calls MPI_Init() and MPI_Comm_dup() internally to duplicate MPI_COMM_WORLD. Since the newly duplicated communicator is used to perform XMP communication, user MPI communication does not conflict with XMP communication. Additionally, xmp_finalize_all() calls MPI_Finalize() internally.

Fig. 1 Example of translation in Omni compiler [19]

In an XMP program, the task directive can divide a node set. The implementation of the task directive creates a new communicator from the duplicated communicator using MPI_Comm_split(). If communication occurs within the range of the task directive, the new communicator is used to perform it.
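
As an illustration, the following minimal sketch (not taken from the chapter; the node-section notation p[0:2], read here as start:length, is an assumption about XMP/C syntax) narrows execution to a two-node subset, so that the bcast inside the task runs on the split communicator.

#include <xmp.h>

#pragma xmp nodes p[4]

int x;

void sub(void)
{
#pragma xmp task on p[0:2]              /* two-node subset; its communicator comes from MPI_Comm_split() */
  {
#pragma xmp bcast (x)                   /* executed on the split communicator */
  }
}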

Omni compiler renames the user main() to xmpc_main() in order to place xmpc_main() between xmp_init_all() and xmp_finalize_all(). A new main() then calls xmpc_main(). This special translation is performed only for main(); other functions (such as foo()) are not renamed.
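
Conceptually, this translation can be pictured with the following hand-written sketch; it is not the actual output of Omni compiler (the real generated code in Fig. 1 passes additional runtime arguments, and the exact signatures of xmp_init_all() and xmp_finalize_all() are assumptions here).

/* before translation: user code */
int main(int argc, char **argv)
{
  /* user program */
  return 0;
}

/* after translation (simplified sketch) */
int xmpc_main(int argc, char **argv)    /* renamed user main() */
{
  /* user program */
  return 0;
}

int main(int argc, char **argv)
{
  xmp_init_all(argc, argv);             /* MPI_Init() + MPI_Comm_dup(MPI_COMM_WORLD, ...) */
  int r = xmpc_main(argc, argv);
  xmp_finalize_all();                   /* MPI_Finalize() */
  return r;
}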

3 Functions for Mixed-Language

For mixed-language programming with XMP, Omni compiler provides the following functions:

  • Calling an MPI program from an XMP program

  • Calling an XMP program from an MPI program

  • Calling an XMP program from a Python program

3.1 Function to Call MPI Program from XMP Program

Table 1 shows the functions to call an MPI program from an XMP program. These functions are defined in Appendix A.1 of the XMP specification version 1.4. Figure 2 shows an example of how to use them. In line 1 of the left figure, the XMP header file (xmp.h) is included in order to use the functions in Table 1. In line 2, the MPI header file (mpi.h) is included to obtain the definition of the MPI communicator type used in line 7. In line 6, xmp_init_mpi() initializes the MPI environment. In line 8, xmp_get_mpi_comm() returns an MPI communicator built from the information of the executing XMP node set. In line 9, foo(), which is defined in the right figure, is called. In line 10, xmp_finalize_mpi() finalizes the MPI environment. Note that xmp_get_mpi_comm() and other MPI functions must be placed between xmp_init_mpi() and xmp_finalize_mpi(). A simplified sketch of the two files is shown after Table 1.

Fig. 2 Example of calling MPI program (mpi.c) from XMP program (xmp.c) [19]

Table 1 Functions to call MPI program from XMP program [19]
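
Since the code of Fig. 2 is available only as a figure, the following minimal sketch reconstructs what the two files might look like from the description above; the signatures of the xmp_* functions are assumptions based on Table 1 and the XMP specification, and the line numbers in the text refer to the original figure.

/* xmp.c (sketch) */
#include <xmp.h>
#include <mpi.h>
#pragma xmp nodes p[*]

extern void foo(MPI_Comm comm);         /* defined in mpi.c */

int main(int argc, char **argv)
{
  xmp_init_mpi(&argc, &argv);           /* initialize the MPI environment */
  MPI_Comm comm = xmp_get_mpi_comm();   /* communicator of the executing XMP node set */
  foo(comm);                            /* call the MPI program */
  xmp_finalize_mpi();                   /* finalize the MPI environment */
  return 0;
}

/* mpi.c (sketch) */
#include <mpi.h>

void foo(MPI_Comm comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);
  /* ... MPI communication on comm ... */
}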

The implementations of xmp_init_mpi() and xmp_finalize_mpi() are empty because, as shown in Fig. 1, xmp_init_all() and xmp_finalize_all() are always invoked at the beginning and the end of the program, and these functions initialize and finalize the MPI environment. Next, the implementation of xmp_get_mpi_comm() is described. As shown in Sect. 2, the task directive creates a new MPI communicator. Thus, MPI communicators are stored in a stack data structure in the Omni compiler runtime library, and xmp_get_mpi_comm() returns the MPI communicator at the top of the stack.
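
The stack can be pictured with the following conceptual sketch; it is not the actual Omni compiler runtime code, and the helper names are invented for illustration.

#include <mpi.h>

#define MAX_DEPTH 16
static MPI_Comm comm_stack[MAX_DEPTH];
static int top = -1;

/* xmp_init_all() pushes the duplicated MPI_COMM_WORLD; each task region
   pushes the communicator created by MPI_Comm_split() and pops it on exit */
static void push_comm(MPI_Comm c) { comm_stack[++top] = c; }
static void pop_comm(void)        { top--; }

MPI_Comm xmp_get_mpi_comm(void)
{
  return comm_stack[top];               /* communicator of the currently executing node set */
}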

An example of compilation using Omni compiler is as follows:

$ mpicc mpi.c -c -o mpi.o
$ xmpcc xmp.c -c -o xmp.o
$ xmpcc mpi.o xmp.o -o a.out

Since Omni compiler uses an MPI compiler internally as its native compiler, MPI programs can also be compiled with the “xmpcc” command as follows:

$ xmpcc mpi.c -c -o mpi.o

Thus, it is also possible to execute all the compilation work using a single command as follows:

$ xmpcc mpi.c xmp.c -o a.out

The resulting binary is executed using the execution command provided by the user's MPI environment, as follows:

$ mpirun -np 4 a.out

3.2 Function to Call XMP Program from MPI Program

Table 2 shows the functions to call an XMP program from an MPI program. These functions are defined in Appendix A.2 of the XMP specification version 1.4. Figure 3 shows an example of how to use them. In line 1 of the left figure, the XMP header file is included in order to use the functions in Table 2. In line 7, xmp_init() initializes the XMP environment and creates the XMP node set based on the communicator specified in its argument. In line 8, foo() in the right figure is called. In line 9, xmp_finalize() finalizes the XMP environment. Note that the XMP functions must be placed between xmp_init() and xmp_finalize(). A simplified sketch is shown after Table 2.

Fig. 3 Example of calling XMP program from MPI program [19]

Table 2 Functions to call XMP program from MPI program [19]
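
As with Fig. 2, the code of Fig. 3 is available only as a figure; the following minimal sketch follows the description above, with the argument lists of xmp_init() and xmp_finalize() assumed from Table 2 and the XMP specification.

/* mpi.c (sketch) */
#include <mpi.h>
#include <xmp.h>

extern void foo(void);                  /* parallel XMP function defined in xmp.c */

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  xmp_init(MPI_COMM_WORLD);             /* create the XMP node set from this communicator */
  foo();                                /* call the XMP program */
  xmp_finalize();                       /* finalize the XMP environment */
  MPI_Finalize();
  return 0;
}

/* xmp.c (sketch) */
#include <xmp.h>
#pragma xmp nodes p[*]

void foo(void)
{
  /* ... XMP directives and communication ... */
}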

Although xmp_init() is implemented to call the xmp_init_all() function described in Sect. 2, it also includes the following adjustments:

  • Before calling xmp_init(), MPI_Init() must already have been called in the MPI program. Therefore, we added a procedure that ensures MPI_Init() is not called again in xmp_init_all().

  • As shown in Sect. 2, xmp_init_all() duplicates MPI_COMM_WORLD to perform XMP communication. We also added a procedure that ensures the communicator specified in xmp_init() is duplicated instead of MPI_COMM_WORLD.

Similarly, xmp_finalize() is implemented to call the xmp_finalize_all() function described in Sect. 2. As with xmp_init(), MPI_Finalize() is called in the MPI program after xmp_finalize() returns. Therefore, we added a procedure that ensures MPI_Finalize() is not called in xmp_finalize_all().
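
Conceptually, these guards can be expressed as follows. This is an illustrative sketch only, not the Omni runtime implementation; the flag argument and function names are invented for the illustration.

#include <mpi.h>

static MPI_Comm xmp_world;              /* communicator used for XMP communication */

/* sketch of xmp_init_all() with the guards described above */
void xmp_init_all_sketch(MPI_Comm base, int *argc, char ***argv)
{
  int initialized;
  MPI_Initialized(&initialized);
  if (!initialized)
    MPI_Init(argc, argv);               /* skipped when entered via xmp_init() */
  MPI_Comm_dup(base, &xmp_world);       /* base is MPI_COMM_WORLD in the normal case,
                                           or the communicator passed to xmp_init() */
}

/* sketch of xmp_finalize_all() with the guards described above */
void xmp_finalize_all_sketch(int called_from_xmp_finalize)
{
  MPI_Comm_free(&xmp_world);
  if (!called_from_xmp_finalize)
    MPI_Finalize();                     /* otherwise left to the MPI program */
}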

The compilation and execution methods are the same as in Sect. 3.1.

3.3 Function to Call XMP Program from Python Program

Omni compiler supports two ways of calling an XMP program from a Python program: one is calling from a parallel Python program, and the other is calling from a sequential Python program.

3.3.1 From Parallel Python Program

Figure 4 shows an example of calling an XMP program from a parallel Python program. We assume the use of “mpi4py” for the Python MPI environment. In line 2 of the left figure, the XMP package provided with Omni compiler is imported. In line 4, the shared library (bar.so) created from the right figure is loaded. In line 8, xmp.Lib.call() calls a parallel XMP program. This function performs the initialization and finalization of the XMP environment internally.

Fig. 4 Example of calling XMP program (bar.c) from parallel Python program (bar.py) [19]
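
On the C side, bar.c is an ordinary XMP/C file whose entry point is invoked from Python; the following sketch illustrates its shape (the entry-point name foo() is an assumption, since the original code is only available in the figure). It is compiled into bar.so as shown below.

/* bar.c (sketch): compiled into the shared library bar.so */
#include <xmp.h>
#pragma xmp nodes p[*]

void foo(void)                          /* entry point called from Python via xmp.Lib.call() */
{
  /* ... XMP directives and parallel computation ... */
}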

To use these features, the Omni compiler runtime library must be built as a shared library. Therefore, we developed a new build process that creates shared libraries for Omni compiler. When the option “--enable-shared” is added to “./configure” as follows, shared libraries are created.

$ ./configure --enable-shared

An example of compilation and execution using Omni compiler is as follows. A shared library is created from the user program by the “xmpcc” command. The compile options used to create the shared library depend on the native compiler (e.g., “-fPIC -shared” if the native compiler is gcc). The resulting binary is executed via Python.

$ xmpcc -fPIC -shared bar.c -o bar.so
$ mpirun -np 4 python bar.py

3.3.2 From Sequential Python Program

Figure 5 shows an example of calling an XMP program from a sequential Python program. In line 6, xmp.Lib.spawn() calls a parallel XMP program. The first argument is the number of XMP nodes. The last argument is an option for asynchronous operation; if it is true, control may return to Python before foo() completes. In line 7, xmp.Lib.wait() waits until foo() completes. In line 9, xmp.Lib.elapse_time() returns the processing time of foo().

Fig. 5 Example of spawning XMP program from sequential Python program [19]

An example of execution using Omni compiler is as follows. Note that the Python program is executed with one process, whereas the XMP program is executed with the number of processes specified in the code.

$ mpirun -np 1 python bar.py

4 Application to Order/Degree Problem

4.1 What Is the Order/Degree Problem

The order/degree problem is the problem of minimizing the diameter and average shortest path length (ASPL) among vertices in an undirected graph with a given number of vertices and a given degree. The problem is useful for designing low-latency interconnection networks (http://research.nii.ac.jp/graphgolf).

From the number of vertices (n) and the degree (d), the theoretical lower bounds of the diameter ($K_{n,d}$) and the ASPL ($L_{n,d}$) are calculated as follows [20], using the auxiliary quantities $S_{n,d}$ and $R_{n,d}$:

$$\displaystyle \begin{aligned} S_{n,d} = \sum_{i=1}^{K_{n,d}-1} id(d-1)^{i-1} \end{aligned}$$
$$\displaystyle \begin{aligned} R_{n,d} = n - 1 - \sum_{i=1}^{K_{n,d}-1} d(d-1)^{i-1} \end{aligned}$$
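
The diameter lower bound $K_{n,d}$ and the ASPL lower bound $L_{n,d}$ are then obtained from these quantities; the following formulation is the standard Moore-bound presentation and is given here for completeness (it may differ in notation from the exact form in [20]):

$$\displaystyle \begin{aligned} K_{n,d} = \min\left\{ k \;:\; 1 + \sum_{i=1}^{k} d(d-1)^{i-1} \geq n \right\} \end{aligned}$$
$$\displaystyle \begin{aligned} L_{n,d} = \frac{S_{n,d} + K_{n,d}\, R_{n,d}}{n-1} \end{aligned}$$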

Figures 6 and 7 show examples of graphs with n = 10 and d = 3 together with their distance matrices. The distance matrix gives the shortest hop count between each pair of vertices. The diameter is the maximum value of the elements in the distance matrix, and the ASPL is the value obtained by dividing the total of all elements by the number of elements, (n^2 − n)/2. While the graph in Fig. 6 has random edges, the graph in Fig. 7 has edges optimized by our algorithm, which is described in Sect. 4.2. The diameter and ASPL of Fig. 7 attain the theoretical lower bounds.
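
As a concrete illustration of these definitions, the diameter and ASPL can be computed from a distance matrix as in the following small C sketch (independent of the chapter's code):

#include <stddef.h>

/* dist is an n-by-n matrix of shortest hop counts between vertices */
void graph_metrics(size_t n, const int *dist, int *diameter, double *aspl)
{
  long long total = 0;
  *diameter = 0;
  for (size_t i = 0; i < n; i++)
    for (size_t j = i + 1; j < n; j++) {          /* each unordered pair once */
      int d = dist[i * n + j];
      if (d > *diameter) *diameter = d;
      total += d;
    }
  *aspl = (double)total / (((double)n * n - n) / 2.0);   /* divide by (n^2 - n)/2 */
}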

Fig. 6 Diameter = 3, ASPL = 1.89

Fig. 7 Diameter = 2, ASPL = 1.67

In an effort to expand the order/degree problem into open science, the National Institute of Informatics has held a “Graph Golf” competition every year since 2015 to search for graphs with the smallest diameter and ASPL. Several combinations of numbers of vertices and degrees are set for each event. The competition has two categories: the General Graph Category, where vertices are placed freely, and the Grid Graph Category, where vertices are placed on a two-dimensional grid. This section deals with the General Graph Category. Table 3 shows the combinations of vertices and degrees used in 2017.

Table 3 Combination of vertices and degrees of General Graph Category in 2017

A Python program, “create-random.py,” for the order/degree problem is available on the official Graph Golf website. Given the number of vertices and the degree, the program produces the outputs listed below. These calculations use the Python networkx package (https://networkx.github.io).

  • Initial graph with random edges (the graph of Fig. 6 is created by this function)

  • Calculation of diameter and ASPL

  • Graph figure in Portable Network Graphics (PNG) format (the graphs shown in Figs. 6 and 7 are created by this function)

4.2 Implementation

The “create-random.py” program does not search for the smallest diameter and ASPL. Moreover, to obtain the diameter and ASPL of a graph, it is necessary to calculate all of the shortest paths among its vertices. Although “create-random.py” calculates the shortest paths using the shortest_path_length method of the networkx package, this method requires a significant amount of time.

To search for the smallest diameter and ASPL, we developed a Graph Golf code in both Python and XMP/C based on “create-random.py.” A simulated annealing (SA) [21, 22] algorithm is used for optimization, and the shortest-path calculation is parallelized by XMP directives. Figure 8 shows a flow chart of the algorithm. While Python is used to create the initial graph and output the figure, XMP/C is used to implement the other parts.

Fig. 8 Flow chart of an algorithm in Python and XMP/C

Figures 9 and 10 show a portion of our code. In lines 6–10 of Fig. 9, the number of vertices and the degree are transformed so that they can be used in the program. In lines 12–17, an initial graph is created and stored in the variable arr; the first elements of arr store the number of vertices and the degree. In lines 19–20, the initial graph is passed to the xmp_graphgolf function of Fig. 10, which optimizes it. The result is saved in the variable arr, which is also one of the arguments. After line 22, the result is transformed into a figure. In lines 1–3 of Fig. 10, the XMP directives declare the template and the node set, and then distribute the template onto the node set in a block manner. In line 1, the “[:]” means that the template size is not fixed at this point. In line 9, the template_fix directive fixes the template size to vertices, the number of vertices. In lines 11–13, the diameter and ASPL are calculated; this calculation corresponds to the “Compute energy” state shown in Fig. 8 and uses the top-down approach of breadth-first search. In line 10, the loop directive parallelizes the loop statement so that the shortest paths are calculated in parallel. In lines 14–15, the reduction directives aggregate the diameter and ASPL values stored on each node. A simplified skeleton of this structure is sketched below.

Fig. 9 Code in Python

Fig. 10 Code in XMP/C
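
The overall shape of the XMP/C code in Fig. 10 can be pictured with the following simplified skeleton. The directive lines mirror the description above, but the body of the breadth-first search is omitted and details such as the layout of arr are assumptions.

#include <xmp.h>

#pragma xmp nodes p[*]
#pragma xmp template t[:]               /* template size not fixed yet ("[:]") */
#pragma xmp distribute t[block] onto p

void xmp_graphgolf(int *arr)
{
  int i, vertices = arr[0];             /* leading elements of arr: number of vertices, degree, ... */
  int diameter = 0;
  double aspl = 0.0;

#pragma xmp template_fix t[vertices]    /* fix the template size to the number of vertices */

#pragma xmp loop on t[i]
  for (i = 0; i < vertices; i++) {
    /* top-down breadth-first search from vertex i; update the local
       diameter and the local contribution to the ASPL */
  }

#pragma xmp reduction(max:diameter)     /* aggregate the diameter over all nodes */
#pragma xmp reduction(+:aspl)           /* aggregate the ASPL contribution over all nodes */
}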

4.3 Evaluation

To evaluate the performance of the Graph Golf application, we used the “coma” system located at the University of Tsukuba, the specifications of which are shown in Table 4.

Table 4 Coma system specifications

We measured the time required to calculate all of the shortest paths once while varying the number of nodes. We assigned 20 XMP nodes to each compute node and used a medium-sized problem (n = 9344, d = 10) from Table 3.

Figure 11 shows the performance results, where the bar graph shows the measured times and the line graph shows the parallel efficiency relative to one XMP node. The time for one XMP node is 123.17 s, while the time for 1280 XMP nodes (using 64 compute nodes) is 0.13 s, which is 921 times faster. Since the time using the Python networkx package on one CPU core is 148.83 s, the XMP implementation achieves a performance improvement of about 21% even on a single XMP node.

Fig. 11 Performance evaluation

From an examination of Fig. 11, we found that the parallel efficiency decreases as the number of nodes increases. We consider that the following reasons are responsible for the decrease:

  • The ratio of communication time increases. Figure 12 shows the ratio of communication time and calculation time to the total time. As the number of nodes increases, the proportion of communication time also increases.

    Fig. 12 Calculation and communication ratio

  • The lengths of the parallelized loops are non-uniform. The length of the loop statement in line 11 of Fig. 10 is 9344, which is the same as the number of vertices. Since this length is divided among the nodes, the non-uniformity increases as the number of nodes increases; for example, with 1280 nodes some nodes process 8 iterations while others process only 7.

5 Conclusion

This chapter describes how to use the linkage functions between XMP and the MPI library in Omni compiler. It also describes how to call an XMP program from a Python program. Users can call functions written in these languages with a simple procedure. Since many existing parallel applications are written with the MPI library, these functions are useful for extending existing applications. Furthermore, they make it possible to exploit the rich functionality of Python effectively.