1 Introduction

Spatial point processes are useful as statistical models for situations where observed patterns of points are to be analysed. These points typically represent locations of objects (e.g. trees in a forest, neurons in brain tissue) or events (e.g. disease cases, petty crimes) randomly occurring in \({\mathbb {R}}^d\), \(d \ge 2\). Both the number and positions of points occurring in an observation window are considered random. The points often exhibit interactions (attractive or repulsive) rather than independence, which makes the analysis of such datasets challenging.

Recently, statistical analysis of point pattern data has been of particular interest in a broad range of scientific disciplines (Illian et al. 2004), including biology and biostatistics (Andersen et al. 2018; Hooghoudt et al. 2017; Kuronen et al. 2021; Li et al. 2016), ecology (Redenbach et al. 2009; Zhang et al. 2022), and epidemiology (Diggle et al. 2005).

Traditionally, analysing spatial point pattern data means working with a single pattern from a specific measurement. Recently, another setting has also become common: a collection of patterns, formed by independent realisations of some underlying stochastic model, is analysed simultaneously (Bagchi and Illian 2015; Myllymäki et al. 2014; Ramón et al. 2016). These independent realisations are referred to as replicated point patterns. The increasing popularity of this data type has encouraged the adaptation of methods, such as supervised classification, to the point pattern setting.

In supervised classification, which this paper focuses on, the task is to predict a label variable (indicating class membership) for a newly observed point pattern, using our knowledge about a collection of labeled patterns (training data). The number of classes is known beforehand, based on the class membership of the patterns in the training data. Unsupervised classification, where the number of classes is unknown, is outside the scope of this paper.

To the best of our knowledge, the problem of supervised classification of point patterns has been studied only to a limited extent in the literature. Properties of a Bayes classifier for patterns generated by inhomogeneous Poisson point processes with different intensity functions are discussed in Cholaquidis et al. (2017). However, this method relies on the independence properties of the Poisson point process, and its use is thus limited to a small class of models. On the other hand, no assumptions about the underlying stochastic models are made in Mateu et al. (2015), where the task for replicated point patterns is transformed to a classification task in \({\mathbb {R}}^2\) with the help of multidimensional scaling (Torgerson 1952).

In Koňasová and Dvořák (2021), Pawlasová and Dvořák (2022), a kernel regression classifier for functional data (Ferraty and Vieu 2006) is adapted for replicated point patterns. Instead of classifying the patterns themselves, a selected functional characteristic (e.g. the pair correlation function) is estimated for each pattern. These estimated values are considered functional observations, and the classification is performed in the context of functional data.

The idea of linking point patterns to functional data also appears in Mateu et al. (2015): the dissimilarity matrix needed for the multidimensional scaling is based on the same type of dissimilarity measure that is used for the kernel regression classifier in Koňasová and Dvořák (2021), Pawlasová and Dvořák (2022). The problem of assessing dissimilarities for point patterns is also discussed in Jalilian and Mateu (2022), where the dissimilarity measure is built with the help of a Siamese neural network. This dissimilarity measure is then used to classify point patterns in the special scenario of one-shot learning. Finally, Vo et al. (2018) briefly discuss model-based supervised classification. Unsupervised classification is explored in Ayala et al. (2006).

This paper discusses the use of multi-class classifiers based on neural networks in the context of replicated point patterns. It employs the procedure described in Thind et al. (2022), where both functional and scalar observations enter a neural network. Hence, as in Koňasová and Dvořák (2021), Pawlasová and Dvořák (2022), each pattern is represented by the estimated values of a selected functional characteristic, and the classification is performed in the context of functional data. The resulting decision about class membership is based on the spatial properties of the observed patterns that the selected characteristic describes. Therefore, with a carefully chosen characteristic, this method has great potential across a wide range of classification scenarios. Moreover, it can be used without assuming stationarity of the underlying point processes, and it can be easily extended to more complicated settings (e.g. point patterns in non-Euclidean spaces or realisations of random sets).

The work of Pawlasová et al. (2023) uses this functional representation of point patterns to perform binary supervised classification based on neural networks. The present paper extends the idea to the multi-class classification problem. We demonstrate through a simulation study and a real data application that the approach based on neural networks can outperform the benchmark method (kernel regression). We present simulation experiments in which the individual classes are composed of realisations of the stationary Thomas process (Thomas 1949) with different choices of model parameters; the Thomas process is a model for attractive interactions, and varying its parameters regulates the strength of the interactions. Moreover, we consider a class of realisations from the Poisson point process (the benchmark model for the complete spatial randomness hypothesis) as an extreme case of the choice of parameters of the Thomas model.

The paper is organized as follows. Section 2 provides a brief theoretical background on spatial point patterns and their functional characteristics, including the definition of the pair correlation function, which plays a crucial role in the sequel. Section 3 summarizes the methodology introduced in Thind et al. (2022) about functional neural network models and describes the classification procedure. Section 4 is devoted to the design of the simulation experiments, and the results are presented in Sect. 5. The real dataset of intramembranous particles is analysed in Sect. 6. The concluding remarks and discussion are given in Sect. 7.

2 Background on spatial point patterns

This section presents the necessary definitions from point process theory. Our exposition closely follows the book by Møller and Waagepetersen (2004). For a detailed explanation of the theoretical foundations, see, e.g. Daley and Vere-Jones (2008).

Throughout the paper, we consider a simple point process X to be a random locally finite subset of \({\mathbb {R}}^d\), \(d \ge 2\). Each point \(x \in X\) corresponds to a specific object (e.g. tree in a forest) or event (e.g. disease case) occurring at the location \(x \in {\mathbb {R}}^d\). Observed realisations of the random process X are called point patterns.

2.1 Moment properties and pair correlation function

Let us quickly introduce some of the moment properties of the point process X. The intensity function

$$\begin{aligned} \lambda : {\mathbb {R}}^d \longrightarrow [0, \infty ] \end{aligned}$$

is a non-negative measurable function such that \(\lambda (x) \, \textrm{d}x\) corresponds to the probability of observing a point of X in a neighbourhood of x with an infinitesimally small volume \(\textrm{d}x\). For a stationary process, i.e. with translation invariant probability distribution in \({\mathbb {R}}^d\), the intensity function is constant. In such a case, the constant \(\lambda\) is called the intensity of the point process and is interpreted as the expected number of points of the process that occur in a set with unit d-dimensional volume.

Similarly, the second-order product density

$$\begin{aligned} \lambda ^{(2)}: {\mathbb {R}}^d \times {\mathbb {R}}^d \longrightarrow [0, \infty ] \end{aligned}$$

is a non-negative measurable function such that \(\lambda ^{(2)}(x, y) \, \textrm{d}x \, \textrm{d}y\) corresponds to the probability of observing two points of X that occur jointly in the neighbourhoods of x and y with infinitesimally small volumes \(\textrm{d}x\) and \(\textrm{d}y\).

Assuming the existence of \(\lambda\) and \(\lambda ^{(2)}\), the pair correlation function is defined as

$$\begin{aligned} g(x, y) = \dfrac{\lambda ^{(2)}(x, y)}{\lambda (x)\lambda (y)}, \quad x,y \in {\mathbb {R}}^d, \end{aligned}$$

whenever \(\lambda (x)\lambda (y) > 0\). If \(\lambda (x)=0\) or \(\lambda (y)=0\), we set \(g(x, y) = 0\). If g is translation invariant and isotropic (invariant under rotations around the origin), then it depends only on the distance between the arguments, i.e. \(g(x, y) = g\left( \Vert x - y \Vert \right)\).

2.2 Point process models

In this paper, two popular point process models are used. The first model is the stationary Poisson point process and the second model is the stationary Thomas process (Thomas 1949). The intensity functions of both processes are constant, see Sect. 2.1.

The stationary Poisson point process is used as the benchmark model for the complete spatial randomness hypothesis (no interactions are present among the points of the process). For this model, the second-order product density is also constant and it corresponds to \(\lambda ^2\). Hence, \(g \equiv 1\). Note that this is a special example of a translation invariant and isotropic pair correlation function.

Having in mind the special form of g for the Poisson model, we notice that for a general (stationary) point process the value of \(g(x, y)\) provides information about the likelihood of observing two points of the process jointly occurring in infinitesimally small neighbourhoods of x and y, relative to the corresponding likelihood in the model with no interactions. So, g is often used to detect deviations from the complete spatial randomness hypothesis: values above the benchmark value 1 indicate aggregation of points in the process, and values smaller than 1 suggest repulsive interactions.

The Thomas process belongs to the class of Poisson cluster processes and is fully determined by three parameters \(\kappa , \mu\), and \(\sigma\). The parameter \(\kappa > 0\) determines the intensity of the stationary Poisson point process of unobserved mother points. The parameter \(\mu > 0\) represents the expectation of the Poisson number of daughter points around each mother. The daughter points are independent, and the displacement of a daughter point around its mother point follows a centred bivariate Gaussian distribution whose variance matrix is the \(\sigma ^2 > 0\) multiple of the identity matrix.

The pair correlation function of the Thomas process is translation invariant and isotropic. It follows from the construction that the process exhibits attractive interactions between points (Møller and Waagepetersen 2004; Thomas 1949). The strength of the interactions is driven by the model parameter \(\sigma\); realisations of the Thomas process with different values of \(\sigma\), illustrating the changes in the strength of interactions, are shown in Fig. 2.
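For reference, in the planar case considered in this paper, the pair correlation function of the Thomas process has the well-known closed form (Møller and Waagepetersen 2004)

$$\begin{aligned} g(r) = 1 + \dfrac{1}{4\pi \kappa \sigma ^2}\, \exp \left( -\dfrac{r^2}{4\sigma ^2}\right) , \quad r \ge 0, \end{aligned}$$

so that \(g(r) > 1\) for all r, reflecting the attractive interactions, and the excess above the Poisson benchmark value 1 vanishes as \(\sigma\) grows.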

3 Functional neural networks

In this section, we first recall the basic concepts of the functional neural networks from Thind et al. (2022). Subsequently, employing their input layer, we establish a multi-class classification framework for point pattern data and outline the steps involved in the proposed procedure. We remark that point pattern data was not considered in Thind et al. (2022).

3.1 Functional input layer

In Thind et al. (2022), the goal is to build a neural network that has \(K + J\) input variables, \(K, J \in {\mathbb {N}}\), where the first K of them are functional variables and the remaining J are scalars. In detail, the authors consider functions \(f_k: \tau _k \longrightarrow {\mathbb {R}}\), \(k = 1, 2, \ldots , K\) (\(\tau _k\) are possibly different intervals in \({\mathbb {R}}\)), and scalars \(z^{(1)}_j \in {\mathbb {R}}\), \(j = 1, 2, \ldots , J\).

Denote by \(n_1 \in {\mathbb {N}}\) the number of neurons in the first layer of the network. The i-th neuron of this layer transfers the value

$$\begin{aligned} z^{(2)}_i = a\left( \sum ^K_{k = 1} \int _{\tau _k} \beta _{ik}(t)f_k(t) \ \textrm{d}t + \sum ^J_{j = 1} w^{(1)}_{ij}z^{(1)}_j + b_i^{(1)} \right) , \ \ i = 1, 2, \ldots , n_1, \end{aligned}$$
(1)

where \(a: {\mathbb {R}} \longrightarrow {\mathbb {R}}\) is an activation function and \(b_i^{(1)} \in {\mathbb {R}}\) is the bias. Two types of weighting appear in (1): weight functions \(\lbrace \beta _{ik}: \tau _k \longrightarrow {\mathbb {R}}\rbrace\) and scalar weights \(\lbrace w^{(1)}_{ij}, b^{(1)}_i\rbrace\). The values of the weights are optimized during the training of the network. To overcome the difficulty of finding the optimal weight functions, each \(\beta _{ik}\) is expressed as a linear combination of basis functions \(\phi _1, \ldots , \phi _{m_k}\). Standard choices for the basis are either the Fourier basis or the B-spline basis. The sum

$$\begin{aligned} \sum ^K_{k = 1} \int _{\tau _k} \beta _{ik}(t) f_k(t) \ \textrm{d}t \end{aligned}$$

can be expressed as

$$\begin{aligned} \sum ^K_{k = 1} \sum ^{m_k}_{l = 1} c_{ilk} \int _{\tau _k} \phi _l(t) f_k(t) \ \textrm{d}t, \end{aligned}$$
(2)

where the integrals can be calculated prior to the hyperparameter tuning of the network. The coefficients \(\left\{ c_{ilk}\right\}\) act as standard scalar weights in the first layer of the network. An in-depth discussion of the computational aspects is provided in Thind et al. (2022). In the software R, functional neural networks are provided by the package FuncNN (Thind et al. 2022).
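As a minimal illustration of this pre-computation, the following R sketch evaluates an orthonormal Fourier basis on a grid of \(\tau _1 = \left[ 0, 0.25\right]\) and approximates the integrals in (2) by the trapezoidal rule for a single functional input (\(K=1\)); the grid and the placeholder function values f_vals are illustrative assumptions, and the FuncNN package performs an analogous step internally.

```r
# Sketch: numerical approximation of the integrals in (2) for one functional
# input f_1 observed on a grid of tau_1 = [0, 0.25]; the values f_vals are a
# placeholder for an estimated summary characteristic evaluated on t_grid.
fourier_basis <- function(t, m, T_len) {
  # Orthonormal Fourier basis on [0, T_len]: constant function, then sine/cosine pairs
  B <- matrix(NA_real_, nrow = length(t), ncol = m)
  B[, 1] <- 1 / sqrt(T_len)
  j <- 1; col <- 2
  while (col <= m) {
    B[, col] <- sqrt(2 / T_len) * sin(2 * pi * j * t / T_len)
    if (col + 1 <= m) B[, col + 1] <- sqrt(2 / T_len) * cos(2 * pi * j * t / T_len)
    col <- col + 2; j <- j + 1
  }
  B
}

trapezoid <- function(y, t) sum(diff(t) * (head(y, -1) + tail(y, -1)) / 2)

t_grid <- seq(0, 0.25, length.out = 50)   # evaluation grid on tau_1
f_vals <- rep(1, length(t_grid))          # placeholder functional observation
Phi    <- fourier_basis(t_grid, m = 29, T_len = 0.25)

# One integral per basis function; these scalars enter the first layer of the network
features <- apply(Phi, 2, function(phi) trapezoid(phi * f_vals, t_grid))
```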

3.2 Supervised classification of point patterns

The proposed procedure for classifying point pattern data using a neural network with this specific input layer is schematically depicted in Fig. 1. We begin with a given dataset that consists of multiple point patterns. This dataset is divided into three distinct subsets: a training set, a validation set, and a test set. The recommended proportion of data allocated to the training and validation sets relative to the test set varies throughout the literature; proportions from 50:50 to 80:20 are used most frequently. However, the optimal choice depends on various factors, such as the sample size of the data at hand.
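For instance, a stratified split of the pattern indices could be performed as follows; the 60:20:20 proportions and the variable labels (a factor with one class label per pattern) are illustrative assumptions only.

```r
# Sketch: stratified split of pattern indices into training / validation / test sets;
# 'labels' is assumed to be a factor with one class label per pattern.
set.seed(1)
split_one_class <- function(idx) {
  idx   <- sample(idx)            # shuffle the indices within the class
  n     <- length(idx)
  n_tr  <- floor(0.6 * n)         # 60 % training
  n_val <- floor(0.2 * n)         # 20 % validation, the rest forms the test set
  list(train      = idx[seq_len(n_tr)],
       validation = idx[n_tr + seq_len(n_val)],
       test       = idx[-seq_len(n_tr + n_val)])
}
splits <- lapply(split(seq_along(labels), labels), split_one_class)
```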

As the second step, we choose a suitable functional summary characteristic, e.g. the pair correlation function. The empirical (estimated) pair correlation function serves as the input function \(f_1\) that enters the neural network. Naturally, other functional characteristics such as the nearest-neighbour distance distribution function may be chosen. A comprehensive summary of commonly used characteristics is given in Illian et al. (2004), Møller and Waagepetersen (2004). The selection of a suitable characteristic, together with the domain for its estimation, requires some expert knowledge, as it is essential to capture distinctions among different classes. It is worth stressing that we utilize only a single functional characteristic, which corresponds to the choices \(K=1\) and \(J=0\) in (1). In detail, the i-th neuron of the input layer transfers the value

$$\begin{aligned} z_i = a\left( \sum ^{m_1}_{l = 1} c_{il1} \int _{\tau _1} \phi _l(t) f_1(t) \ \textrm{d}t + b_i^{(1)} \right) . \end{aligned}$$
(3)

In some cases, for more complex data, it may be worthwhile to look for an optimal combination of several functional and scalar characteristics, as it may improve the classification accuracy.

The next step involves computing the integrals specified in (3).

Then, we construct the whole neural network, potentially incorporating further dense layers (for an explanation, see, e.g. Goodfellow et al. 2016) and dropout layers, and, notably, an output layer equipped with the softmax activation function, whose number of neurons equals the number of classes, known prior to the beginning of the procedure.
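Assuming the pre-computed integrals are stored as rows of a numeric matrix (one row per pattern, one column per basis function), a minimal keras sketch of such an architecture could read as follows; the numbers of neurons, the dropout rate, and the number of classes are placeholders to be tuned, and the weights of the first dense layer play the role of the coefficients \(c_{il1}\) in (3).

```r
library(keras)

n_basis   <- 29   # number of basis functions, i.e. columns of the feature matrix
n_classes <- 3    # number of classes, known beforehand

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu",
              input_shape = c(n_basis)) %>%   # functional input layer acting on the integrals
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = n_classes, activation = "softmax")

model %>% compile(
  optimizer = "adam",
  loss      = "categorical_crossentropy",
  metrics   = "accuracy"
)
```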

Using the training and validation datasets, we fine-tune the neural network’s architecture to optimize its performance. This tuning process involves adjusting the network’s hyperparameters, such as the number of layers, the number of units within each layer, activation functions, batch size, and the dropout rate. As this is a standard procedure, for details and interpretation of the hyperparameters we refer to standard literature, see e.g. Chollet and Allaire (2018, Chapter 4) or Goodfellow et al. (2016).

The final step is to evaluate the best-performing model on the test dataset, which has not been used for training; it serves to assess the model’s ability to make predictions on new data.

Fig. 1 The scheme of the classification procedure based on the neural network with functional inputs

4 Simulation experiments

In this section, we examine the behaviour of the neural network (NN) classifier described in Sect. 3 for three, five, and nine classes, and we generate three collections of point patterns for each case: training, validation, and testing data. The training data are composed of 1000 realisations per class, and validation and testing data are composed of 200 realisations per class. All the datasets are generated using the package spatstat (Baddeley et al. 2015) for the statistical software R.

For each point pattern, the pair correlation function g is estimated at 50 equidistant points over the interval \(\tau _1 = \left[ 0, 0.25\right]\). The upper endpoint is chosen based on the shape and size of the observation window (which is the unit square) and a common rule of thumb for rectangular windows (Baddeley et al. 2015), i.e. taking a quarter of the length of the shorter side of the rectangular window. These estimates then play the role of the function \(f_1\), see Sect. 3. We focus on g mainly because of its widespread use in practical applications and its ease of interpretation.

For the estimation of g, we use the pcf.ppp function from the package spatstat with a modified kernel estimator that reduces the bias of the estimator for values of the argument close to zero (the parameter ‘divisor’ is set to the value ‘d’). Moreover, the translation edge correction is applied (Illian et al. 2004). We use the Fourier basis with \(m_1 = 29\) basis functions, which is the default choice in the setting of Thind et al. (2022), to evaluate the integrals in (2). The networks are trained using the R interface for the Python libraries keras and tensorflow (Allaire et al. 2016; Chollet and Allaire 2017).
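For a single pattern, stored as a ppp object X, this estimation step could be sketched as follows; X is a hypothetical placeholder, and the 50 equidistant arguments match the description above.

```r
library(spatstat)

r_grid <- seq(0, 0.25, length.out = 50)   # 50 equidistant arguments on tau_1
g_hat  <- pcf(X, r = r_grid, divisor = "d", correction = "translate")

# g_hat$trans holds the translation-corrected estimate of g at r_grid;
# these values play the role of the function f_1 from Sect. 3
f1_vals <- g_hat$trans
```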

4.1 Description of experiments

We begin with a ternary classification where the first class consists of realisations of a stationary Poisson point process. For the two remaining classes, realisations of stationary Thomas processes with different model parameters are considered.

Fig. 2 Illustration of a single realisation of the Thomas process with \(\sigma\) set to 0.02, 0.05, and 0.1, going from left to right. The rightmost panel illustrates a single realisation of the Poisson point process

During the experiment, the intensity \(\lambda\) of the Poisson point process (the underlying model for the first class) is set to 400, and we set the parameters \(\kappa\) and \(\mu\) of the Thomas process accordingly so that the two processes have the same intensity, i.e. \(\kappa =50\) and \(\mu = 8\). All the considered realisations are observed in the unit square, resulting in the mean number of 400 points per pattern. The strength of the interactions is driven by the parameter \(\sigma\). Roughly speaking, small values of \(\sigma\) correspond to strong clustering, and as \(\sigma\) grows, the interactions become weaker, see Fig. 2. Given the known shape and size of the fixed observation window and since the analytical formula for g is known under the Thomas model, we can track the influence of \(\sigma\) on the behaviour of g, see Pawlasová et al. (2023, Figure 1). Concrete values of \(\sigma\) that we associate with strong/moderate/weak clustering are given later in the text.
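Realisations of the kind shown in Fig. 2 can be generated with spatstat as sketched below; the parameter \(\sigma\) enters rThomas through its scale argument.

```r
library(spatstat)

W <- square(1)   # unit square observation window

X_pois <- rpoispp(lambda = 400, win = W)                       # Poisson, intensity 400
X_002  <- rThomas(kappa = 50, scale = 0.02, mu = 8, win = W)   # sigma = 0.02
X_005  <- rThomas(kappa = 50, scale = 0.05, mu = 8, win = W)   # sigma = 0.05
X_010  <- rThomas(kappa = 50, scale = 0.10, mu = 8, win = W)   # sigma = 0.10
```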

In the ternary classification, the second class is set to be the realisations of the Thomas process with \(\sigma = 0.10\) (denoted by T\(\left[ 0.100\right]\)). This particular choice of \(\sigma\) corresponds to weak attractive interactions. With values of \(\sigma\) growing past 0.10, the interactions are even weaker. In the given observation window, it would not be possible to reasonably distinguish between such weak interactions and the Poisson point process by any method, since the Poisson process can be seen as the limit of the Thomas process with \(\sigma \rightarrow \infty\). Hence, this choice of \(\sigma\) challenges the NN classifier in the classification between the first and the second class. Finally, the third class corresponds to the realisations of the Thomas process with \(\sigma = 0.05\) (denoted by T\(\left[ 0.050\right]\)). This choice of \(\sigma\) leads to rather strong clustering, making this model distinguishable from the Poisson point process. Previous experiments with binary classification suggest that the classes T\(\left[ 0.100\right]\) and T\(\left[ 0.050\right]\) should be easily distinguishable, see Pawlasová et al. (2023), Pawlasová and Dvořák (2022). Examples of realisations from the three classes are given in Fig. 2.

With an increasing number of classes, the classification becomes more challenging. Therefore, we repeat the experiment with five and nine classes, respectively, and look for a significant drop in the performance of the NN classifier. The five-class scenario consists of the following classes: Poisson, T\(\left[ 0.100\right]\), T\(\left[ 0.075\right]\), T\(\left[ 0.050\right]\), and T\(\left[ 0.025\right]\). Similarly, the nine-class scenario consists of the Poisson, T\(\left[ 0.100\right]\), T\(\left[ 0.090\right]\), T\(\left[ 0.080\right]\), T\(\left[ 0.070\right]\), T\(\left[ 0.060\right]\), T\(\left[ 0.050\right]\), T\(\left[ 0.040\right]\), and T\(\left[ 0.030\right]\) classes.

4.2 Experiment evaluation

The performance of the NN classifier is evaluated through the following quantities: the confusion matrix, the per-class accuracies, and the overall accuracy. The \((i, j)\) entry of the confusion matrix (where i and j range from 1 to the total number of classes) corresponds to the total number of patterns in the testing data coming from the j-th class whose predicted label has the value i. The per-class accuracies correspond to the diagonal of the confusion matrix divided by the total number of patterns in the corresponding class in the testing data. Finally, the overall accuracy is the proportion of point patterns from the testing data whose predicted label matches the true one; it equals the average per-class accuracy when each class contributes the same number of testing patterns.
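Given vectors of true and predicted labels for the testing data, these quantities can be computed, for instance, as follows; true_lab and pred_lab are hypothetical factor vectors sharing the same set of levels.

```r
# Rows of the confusion matrix correspond to predicted labels, columns to true
# labels, matching the (i, j) convention described above.
conf_mat <- table(predicted = pred_lab, true = true_lab)

per_class_acc <- diag(conf_mat) / colSums(conf_mat)    # per-class accuracies
overall_acc   <- sum(diag(conf_mat)) / sum(conf_mat)   # overall accuracy
```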

All the reported quantities are compared to the corresponding quantities for the kernel regression classifier, which can be considered a benchmark method for point pattern data. Using the same training and testing data as for the NN classifier, the kernel regression classification is performed with the Bayes classifier combined with the k-nearest neighbours algorithm and the kernel regression method. This approach includes an automatic procedure for the optimal choice of the hyperparameter k. The dissimilarity measure (needed to determine the nearest neighbours) is based on the empirical pair correlation function and a Cramér-von-Mises type formula, which is typically used to build dissimilarities in the context of functional data (Ferraty and Vieu 2006). A detailed description of the classifier, together with simulation experiments mapping its performance for point patterns, can be found in Pawlasová and Dvořák (2022). For binary classification, the performance of the kernel regression classifier is compared to the NN classifier in Pawlasová et al. (2023).
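A minimal sketch of this benchmark, assuming the empirical pair correlation functions of the training and testing patterns are stored as rows of hypothetical matrices G_train and G_test evaluated on a common grid t_grid, could look as follows; it simplifies the kernel-weighted Bayes rule of Pawlasová and Dvořák (2022) to a plain majority vote over the k nearest neighbours.

```r
# Cramér-von-Mises-type dissimilarity between two estimated functions on t_grid
cvm_dissim <- function(f, g, t_grid) {
  d <- (f - g)^2
  sum(diff(t_grid) * (head(d, -1) + tail(d, -1)) / 2)   # trapezoidal rule
}

# Classify one test pattern by a majority vote among its k nearest training patterns
knn_classify <- function(f_new, G_train, labels_train, t_grid, k = 5) {
  dists <- apply(G_train, 1, cvm_dissim, g = f_new, t_grid = t_grid)
  nn    <- order(dists)[seq_len(k)]
  names(which.max(table(labels_train[nn])))
}

pred_lab <- apply(G_test, 1, knn_classify,
                  G_train = G_train, labels_train = labels_train, t_grid = t_grid)
```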

We do not aim to compare the NN approach with the Siamese network discriminant model from Jalilian and Mateu (2022), because the latter uses the particular framework of one-shot learning. Such a framework works with a support set consisting of \(M \in {\mathbb {N}}\) patterns from M different classes, where only one of the patterns has the same class membership as the incoming observation. This set is used to train the Siamese network. On the other hand, the approach from Thind et al. (2022) requires an extensive collection of training data with multiple observations for each of the considered classes.

Moreover, this paper does not specify any model for the analysed point pattern data; we want to keep the method purely non-parametric. Hence, we do not compare our results with the approaches proposed in Vo et al. (2018), and we use the non-parametric kernel regression method as a benchmark.

5 Results

In this section, we report the results of the multi-class classification for three, five, and nine classes, respectively. In the three- and five-class cases, the optimal architecture consists of the input layer equipped with the ReLU activation function, followed by a dropout layer and the output layer equipped with the softmax activation function. For the nine-class classification, the input layer is followed by an additional dense layer, again equipped with the ReLU activation function.

5.1 Three classes

For the ternary classification, the optimal network architecture uses 64 neurons in the input layer, i.e. \(n_1=64\) in (1), and a dropout rate of 0.3. The batch size is set to 32.
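With a model built as sketched in Sect. 3.2 and these hyperparameters, training could be carried out as follows; x_train and x_val are hypothetical matrices of pre-computed integrals, lab_train and lab_val integer class labels coded 0, 1, 2, and the number of epochs is a placeholder.

```r
history <- model %>% fit(
  x = x_train, y = to_categorical(lab_train, num_classes = 3),
  validation_data = list(x_val, to_categorical(lab_val, num_classes = 3)),
  epochs = 100, batch_size = 32
)
```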

Table 1 presents the confusion matrix for the ternary classification. First, for all three classes the dominant proportion of the predicted labels matches the true ones (the off-diagonal values are small). The per-class accuracy is above \(90.0\%\) for all three classes. In detail, the accuracies are equal to \(94.5\%\), \(92.5\%\), and \(97.5\%\) for Poisson, T\(\left[ 0.100\right]\), and T\(\left[ 0.050\right]\), respectively. The lowest per-class accuracy is attained for T\(\left[ 0.100\right]\), since realisations from this class may be mislabelled as either Poisson or T\(\left[ 0.050\right]\). On the other hand, we do not observe any realisation from the Poisson class incorrectly labeled as T\(\left[ 0.050\right]\), and vice versa. The overall accuracy is \(94.8\%\), which indicates successful classification; the behaviour of the classifier corresponds to our expectations.

For the kernel regression, similar conclusions can be made. From Table 1, we can see that the lowest per-class accuracy is again attained for T\(\left[ 0.100\right]\). However, in comparison to the NN classifier, the percentage of mislabelled realisations is higher. The overall accuracy is similar to the NN classifier.

Having a training set with thousands of patterns is a natural context for the NN classifier but not for the kernel regression classifier. Hence we investigate their performance also in the setting with much smaller training data. The overall accuracy for the kernel regression equals \(90.8\%\) (50 patterns per class) and \(93.8\%\) (25 patterns per class). A smaller training data size can significantly reduce the computational time needed for the kernel regression method. Although it is possible to train the NN classifier on such a small training set, the overall accuracy decreases with decreasing training sample size (in this particular example, the overall accuracy is \(88.7\%\) for 50 patterns per class and \(86.2\%\) for 25 patterns per class, respectively).

Table 1 Confusion matrix for the ternary classification

5.2 Five classes

For the five-class classification, the optimal network architecture uses 64 neurons in the input layer and a dropout rate of 0.2. The batch size is set to 32.

Table 2 summarizes the results. Regarding the NN classifier, there are three classes (Poisson, T\(\left[ 0.050\right]\), and T\(\left[ 0.025\right]\)) for which mislabelled realisations are rare, and when they occur, the incorrect label belongs to a neighbouring class. On the other hand, a part of the realisations from T\(\left[ 0.100\right]\) and T\(\left[ 0.075\right]\) is incorrectly assigned to at least two other classes. As a result, T\(\left[ 0.100\right]\) and T\(\left[ 0.075\right]\) have a significantly lower per-class accuracy than the other three classes.

In comparison to the ternary classification, the overall accuracy decreases from \(94.8\%\) to \(85.9\%\). Such a decline is expected, since we reduce the difference between the values of \(\sigma\) that define the individual classes. From the point of view of the empirical pair correlation function, it is more challenging to distinguish between realisations from T\(\left[ 0.100\right]\) and T\(\left[ 0.075\right]\) than between realisations from T\(\left[ 0.100\right]\) and T\(\left[ 0.050\right]\). The kernel regression classifier performs slightly worse than the NN classifier in all the reported quantities.

However, for the kernel regression, the overall accuracy and the per-class accuracies stay the same if we use a small set of training data (with 50 patterns per group). On the other hand, with such a small training set, the (optimal) NN classifier has an overall accuracy of \(77.9\%\).

Table 2 Confusion matrix for the five-class classification

5.3 Nine classes

Table 3 Confusion matrix for the nine-class classification

The optimal network architecture for the nine-class classification consists of the input layer followed by one additional dense layer, both with 32 neurons. The dropout rate is 0.2 and the batch size is set to 256.

Notice that the number of dense layers has increased compared to the previous examples. This is because the classification becomes more complex as the number of classes increases; for a smaller number of classes, networks with more dense layers lead to over-fitting.

Table 3 shows that only the two most extreme classes (Poisson and T\(\left[ 0.030\right]\)) have the per-class accuracy above \(90.0\%\), for both classifiers. Regarding the NN classifier, the lowest per-class accuracy is attained for T\(\left[ 0.090\right]\), where the realisations can be confused easily with those from other classes including Poisson, T\(\left[ 0.100\right]\), T\(\left[ 0.080\right]\), and T\(\left[ 0.070\right]\). The overall accuracy is \(61.5\%\).

Regarding the kernel regression classifier, the per-class accuracy is in most cases lower than for the NN classifier. A significant difference can be observed for T\(\left[ 0.100\right]\), where the kernel regression tends to incorrectly label the realisations of T\(\left[ 0.100\right]\) by the Poisson label more often than the NN classifier. The overall accuracy is \(57.3\%\).

For this experiment, the empirical pair correlation function does not provide sufficient information to capture the subtle changes in interactions for the neighbouring classes. Hence, as was expected, the performance of both classifiers is significantly worse than in the previous experiments. However, Table 3 shows that both classifiers keep their ability to distinguish between the groups for which the model parameters are sufficiently different.

6 Application to HeLa cells dataset

We assess the performance of both classifiers on a real dataset comprising 68 point patterns of intramembranous particles of mitochondrial membranes from HeLa cells (Schladitz et al. 2003). This dataset is categorized into three groups based on the different environments the cell line was observed in. It is composed of 33 patterns observed under normal conditions, 21 observed after exposure to rotenone, and 14 after exposure to sodium acid. An illustration of the observed patterns is given in Fig. 3.

Fig. 3 An illustration of the point patterns of HeLa cells, from left to right: acid-treated, untreated (normal conditions), and rotenone-treated

We divide the data into training, validation, and test datasets such that the training and validation datasets together contain \(60\%\) of the patterns from each group, split between training and validation in a 70:30 proportion. The test dataset consists of 13 patterns observed under normal conditions, 8 rotenone-treated patterns, and 6 acid-treated patterns.

Regarding the optimal network architecture, the best performance was obtained by the input layer with 64 neurons and a dropout rate of 0.1. The batch size is set to 2.

Table 4 Confusion matrix for the HeLa cells dataset

Table 4 presents an overview of our findings. Notably, the acid-treated patterns exhibit the lowest per-class accuracies, with the NN approach misclassifying 50% of these patterns as untreated and the kernel regression method misclassifying these patterns at a rate of 83.3%. This may be attributed mainly to the resemblance between these two classes, as can be observed in Fig. 3. For the remaining two classes, both classifiers demonstrate comparable per-class accuracies, which are acceptable given the limited sample size. Nevertheless, it is worth stressing that the overall accuracy is higher for the NN classifier, aligning with the conclusions drawn from our simulation study. It is also worth noting that higher accuracies could potentially be achieved with larger datasets.

7 Discussion

Our experiments show that functional neural networks are suitable for the supervised classification of point patterns, even in multi-class settings. This connection between spatial statistics and the fast-developing field of deep learning is of great interest for a wide range of applied research. Our simulation study includes point pattern data from stochastic models with attractive interactions of different strengths. Moreover, the complete spatial randomness model (the Poisson point process) is considered as an extreme case of weak attractive interactions. However, previous research (Pawlasová and Dvořák 2022), together with our preliminary experiments, suggests that the results would be very similar for repulsive models, e.g. Gaussian determinantal point processes. For clarity, we kept the experiments as simple as possible. Nevertheless, they cover the fundamental problem of spatial statistics and point pattern analysis: distinguishing between different types of interactions.

As expected, the overall accuracy and the per-class accuracies decrease with the increasing number of classes for both classifiers. In the three experiments, the benchmark method of kernel regression has been outperformed by the proposed NN classifier.

In the three and five-class cases, the information obtained from the empirical pair correlation function is enough to perform a successful classification. The choice of the method (NN classifier or kernel regression) then plays a minor role.

For the nine-class example, the step in the sequence of the values of \(\sigma\) is small. Thus, the variability in the empirical pair correlation function between realisations from one class is comparable with the variability between realisations from classes with neighbouring values of \(\sigma\). In this case, we are limited by the amount of information available rather than by the classifiers themselves. On the other hand, both classifiers can still correctly label realisations from classes with extreme values of \(\sigma\).

Since we work with simulated data and control the underlying stochastic models, the pair correlation function was a convenient choice for the feature to be extracted from the analysed point patterns. However, a wrong choice can negatively impact the performance of the studied classifier. In practice, choosing appropriate (functional) summary characteristics is challenging, and expert knowledge of the problem should always guide the choice. Note that the NN classifier can be easily modified to combine several functional and numerical summary characteristics. In contrast, such a modification is less straightforward for kernel regression; for instance, it is not obvious how to combine several dissimilarity measures (based on different functional characteristics) into one efficient dissimilarity measure.

In the analysis of the real dataset in Sect. 6, the pair correlation function turned out to be successful in the classification, while other characteristics such as the nearest-neighbour distance distribution function or the empty-space distribution function (Illian et al. 2004) resulted in poor classification. This further illustrates the need to choose a summary characteristic suitable for the problem at hand.

Neither of the classification methods considered above (the NN classifier and kernel regression) requires a constant intensity function of the point process model that generated the analysed point patterns. Moreover, the classification task can be directly extended to more complicated settings, such as random sets, provided that relevant summary characteristics are available.

The supervised classification task depends on the training data consisting of a certain number of patterns. If only a few patterns with a very large number of points are available for the training data, e.g. in applications such as fluorescence microscopy (Andersen et al. 2018; Hooghoudt et al. 2017; Li et al. 2016) or airborne laser scanning (Mehtätalo et al. 2022), we can subsample many smaller patterns and thus obtain a rather large dataset of patterns of the same type. Nevertheless, it is worth pointing out that this is not a viable strategy if the observed patterns have a small number of points.