1 Introduction

Spatial point processes are useful as statistical models for situations where observed patterns of points are to be analysed. These points typically represent locations of objects (e.g. trees in a forest, neurons in brain tissue) or events (e.g. disease cases, petty crimes) randomly occurring in \({\mathbb {R}}^d\), \(d \ge 2\). Both the number and positions of points occurring in an observation window are considered random. The points often exhibit interactions (attractive or repulsive) rather than independence, which makes the analysis of such datasets challenging.

Recently, statistical analysis of point pattern data has been of particular interest in a broad range of scientific disciplines (Illian et al. 2004), including biology and biostatistics (Andersen et al. 2018; Hooghoudt et al. 2017; Kuronen et al. 2021; Li et al. 2016), ecology (Redenbach et al. 2009; Zhang et al. 2022), and epidemiology (Diggle et al. 2005).

Traditionally, analysing spatial point pattern data means working with a single pattern from a specific measurement. Recently, another setting has also become common: a collection of patterns, formed by independent realisations of some underlying stochastic model, is analysed simultaneously (Bagchi and Illian 2015; Myllymäki et al. 2014; Ramón et al. 2016). These independent realisations are referred to as replicated point patterns. The increasing popularity of this data type has encouraged the adaptation of methods, such as supervised classification, to the point pattern setting.

In supervised classification, which this paper focuses on, the task is to predict a label variable (indicating class membership) for a newly observed point pattern, using our knowledge about a collection of labeled patterns (training data). The number of classes is known beforehand, based on the class membership of the patterns in the training data. Unsupervised classification, where the number of classes is unknown, is outside the scope of this paper.

To the best of our knowledge, the problem of supervised classification of point patterns has been studied only to a limited extent in the literature. Properties of a Bayes classifier for patterns generated by inhomogeneous Poisson point processes with different intensity functions are discussed in Cholaquidis et al. (2017). However, this method relies on the independence properties of the Poisson point process, and its use is thus limited to a small class of models. On the other hand, no assumptions about the underlying stochastic models are made in Mateu et al. (2015), where the task for replicated point patterns is transformed to a classification task in \({\mathbb {R}}^2\) with the help of multidimensional scaling (Torgerson 1952).

In Koňasová and Dvořák (2021), Pawlasová and Dvořák (2022), a kernel regression classifier for functional data (Ferraty and Vieu 2006) is adapted for replicated point patterns. Instead of classifying the patterns themselves, a selected functional characteristic (e.g. the pair correlation function) is estimated for each pattern. These estimated values are considered functional observations, and the classification is performed in the context of functional data.

The idea of linking point patterns to functional data also appears in Mateu et al. (2015): the dissimilarity matrix needed for the multidimensional scaling is based on the same type of dissimilarity measure that is used for the kernel regression classifier in Koňasová and Dvořák (2021), Pawlasová and Dvořák (2022). The problem of assessing dissimilarities for point patterns is also discussed in Jalilian and Mateu (2022), where the dissimilarity measure is built with the help of a Siamese neural network. This dissimilarity measure is then used to classify point patterns in the special scenario of one-shot learning. Finally, Vo et al. (2018) briefly discuss model-based supervised classification. Unsupervised classification is explored in Ayala et al. (2006).

This paper discusses the use of multi-class classifiers based on neural networks in the context of replicated point patterns. It employs the procedure described in Thind et al. (2022), where both functional and scalar observations enter a neural network. Hence, as in Koňasová and Dvořák (2021), Pawlasová and Dvořák (2022), each pattern is represented by the estimated values of a selected functional characteristic, and the classification is performed in the context of functional data. The resulting decision about class membership is based on the spatial properties of the observed patterns that the selected characteristic describes. Therefore, with a carefully chosen characteristic, this method has great potential across a wide range of classification scenarios. Moreover, it can be used without assuming stationarity of the underlying point processes, and it can be easily extended to more complicated settings (e.g. point patterns in non-Euclidean spaces or realisations of random sets).

The work of Pawlasová et al. (2023) uses this functional representation of point patterns to perform binary supervised classification based on neural networks. The present paper extends the idea to the multi-class classification problem. We demonstrate through a simulation study and a real data application that the approach based on neural networks can outperform the benchmark method (kernel regression). We present simulation experiments in which the individual classes are composed of realisations of the stationary Thomas process (Thomas 1949) with different choices of model parameters; the Thomas process is a model for attractive interactions, and varying its parameters regulates the strength of the interactions. Moreover, we consider a class of realisations from the Poisson point process (the benchmark model for the complete spatial randomness hypothesis) as an extreme case of the choice of parameters of the Thomas model.

The paper is organized as follows. Section 2 provides a brief theoretical background on spatial point patterns and their functional characteristics, including the definition of the pair correlation function, which plays a crucial role in the sequel. Section 3 summarizes the methodology introduced in Thind et al. (2022) about functional neural network models and describes the classification procedure. Section 4 is devoted to the design of the simulation experiments, and the results are presented in Sect. 5. The real dataset of intramembranous particles is analysed in Sect. 6. The concluding remarks and discussion are given in Sect. 7.

2 Background on spatial point patterns

This section presents the necessary definitions from point process theory. Our exposition closely follows the book by Møller and Waagepetersen (2004). For a detailed explanation of the theoretical foundations, see, e.g. Daley and Vere-Jones (2008).

Throughout the paper, we consider a simple point process X to be a random locally finite subset of \({\mathbb {R}}^d\), \(d \ge 2\). Each point \(x \in X\) corresponds to a specific object (e.g. tree in a forest) or event (e.g. disease case) occurring at the location \(x \in {\mathbb {R}}^d\). Observed realisations of the random process X are called point patterns.

2.1 Moment properties and pair correlation function

Let us quickly introduce some of the moment properties of the point process X. The intensity function

$$\begin{aligned} \lambda : {\mathbb {R}}^d \longrightarrow [0, \infty ] \end{aligned}$$

is a non-negative measurable function such that \(\lambda (x) \, \textrm{d}x\) corresponds to the probability of observing a point of X in a neighbourhood of x with an infinitesimally small volume \(\textrm{d}x\). For a stationary process, i.e. with translation invariant probability distribution in \({\mathbb {R}}^d\), the intensity function is constant. In such a case, the constant \(\lambda\) is called the intensity of the point process and is interpreted as the expected number of points of the process that occur in a set with unit d-dimensional volume.

Similarly, the second-order product density

$$\begin{aligned} \lambda ^{(2)}: {\mathbb {R}}^d \times {\mathbb {R}}^d \longrightarrow [0, \infty ] \end{aligned}$$

is a non-negative measurable function such that \(\lambda ^{(2)}(x, y) \, \textrm{d}x \, \textrm{d}y\) corresponds to the probability of observing two points of X that occur jointly in the neighbourhoods of x and y with infinitesimally small volumes \(\textrm{d}x\) and \(\textrm{d}y\).

Assuming the existence of \(\lambda\) and \(\lambda ^{(2)}\), the pair correlation function is defined as

$$\begin{aligned} g(x, y) = \dfrac{\lambda ^{(2)}(x, y)}{\lambda (x)\lambda (y)}, \quad x,y \in {\mathbb {R}}^d, \end{aligned}$$

whenever \(\lambda (x)\lambda (y) > 0\). If \(\lambda (x)=0\) or \(\lambda (y)=0\), we set \(g(x, y) = 0\). If g is translation invariant and isotropic (invariant under rotations around the origin), then it depends only on the distance between the arguments, i.e. \(g(x, y) = g\left( \Vert x - y \Vert \right)\).

2.2 Point process models

In this paper, two popular point process models are used. The first model is the stationary Poisson point process and the second model is the stationary Thomas process (Thomas 1949). The intensity functions of both processes are constant, see Sect. 2.1.

The stationary Poisson point process is used as the benchmark model for the complete spatial randomness hypothesis (no interactions are present among the points of the process). For this model, the second-order product density is also constant and it corresponds to \(\lambda ^2\). Hence, \(g \equiv 1\). Note that this is a special example of a translation invariant and isotropic pair correlation function.

Having in mind the special form of g for the Poisson model, we notice that for a general (stationary) point process the value of \(g(x, y)\) provides information about the likelihood of observing two points of the process jointly occurring in infinitesimally small neighbourhoods of x and y, relative to the corresponding likelihood in the model with no interactions. So, g is often used to detect deviations from the complete spatial randomness hypothesis: values above the benchmark value 1 indicate aggregation of points in the process, and values smaller than 1 suggest repulsive interactions.

The Thomas process belongs to the class of Poisson cluster processes and is fully determined by three parameters \(\kappa , \mu\), and \(\sigma\). The parameter \(\kappa > 0\) determines the intensity of the stationary Poisson point process of unobserved mother points. The parameter \(\mu > 0\) represents the expectation of the Poisson number of daughter points around each mother. The daughter points are independent, and the displacement of a daughter point around its mother point follows a centred bivariate Gaussian distribution whose variance matrix is the \(\sigma ^2 > 0\) multiple of the identity matrix.

The pair correlation function of the Thomas process is translation invariant and isotropic. It follows from the construction that the process exhibits attractive interactions between points (Møller and Waagepetersen 2004; Thomas 1949). The strength of the interactions is driven by the model parameter \(\sigma\); realisations of the Thomas process with different values of \(\sigma\), illustrating the changes in the strength of interactions, are shown in Fig. 2.
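For reference, in the planar case considered in this paper, the pair correlation function of the Thomas process has the well-known closed form (Møller and Waagepetersen 2004)

$$\begin{aligned} g(r) = 1 + \dfrac{1}{4\pi \kappa \sigma ^2}\, \exp \left( -\dfrac{r^2}{4\sigma ^2}\right) , \quad r \ge 0, \end{aligned}$$

so that \(g(r) > 1\) for all r, reflecting the attractive interactions, and the excess above the Poisson benchmark value 1 vanishes as \(\sigma\) grows.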

3 Functional neural networks

In this section, we first recall the basic concepts of the functional neural networks from Thind et al. (2022). Subsequently, employing their input layer, we establish a multi-class classification framework for point pattern data and outline the steps involved in the proposed procedure. We remark that point pattern data was not considered in Thind et al. (2022).

3.1 Functional input layer

In Thind et al. (2022), the goal is to build a neural network that has \(K + J\) input variables, \(K, J \in {\mathbb {N}}\), where the first K of them are functional variables and the remaining J are scalars. In detail, the authors consider functions \(f_k: \tau _k \longrightarrow {\mathbb {R}}\), \(k = 1, 2, \ldots , K\) (\(\tau _k\) are possibly different intervals in \({\mathbb {R}}\)), and scalars \(z^{(1)}_j \in {\mathbb {R}}\), \(j = 1, 2, \ldots , J\).

Denote by \(n_1 \in {\mathbb {N}}\) the number of neurons in the first layer of the network. The i-th neuron of this layer transfers the value

$$\begin{aligned} z^{(2)}_i = a\left( \sum ^K_{k = 1} \int _{\tau _k} \beta _{ik}(t)f_k(t) \ \textrm{d}t + \sum ^J_{j = 1} w^{(1)}_{ij}z^{(1)}_j + b_i^{(1)} \right) , \ \ i = 1, 2, \ldots , n_1, \end{aligned}$$
(1)

where \(a: {\mathbb {R}} \longrightarrow {\mathbb {R}}\) is an activation function and \(b_i^{(1)} \in {\mathbb {R}}\) is the bias. Two types of weighting appear in (1): weight functions \(\lbrace \beta _{ik}: \tau _k \longrightarrow {\mathbb {R}}\rbrace\) and scalar weights \(\lbrace w^{(1)}_{ij}, b^{(1)}_i\rbrace\). The values of the weights are optimized during the training of the network. To overcome the difficulty of finding the optimal weight functions, each \(\beta _{ik}\) is expressed as a linear combination of basis functions \(\phi _1, \ldots , \phi _{m_k}\). Standard choices for the basis are either the Fourier basis or the B-spline basis. The sum

$$\begin{aligned} \sum ^K_{k = 1} \int _{\tau _k} \beta _{ik}(t) f_k(t) \ \textrm{d}t \end{aligned}$$

can be expressed as

$$\begin{aligned} \sum ^K_{k = 1} \sum ^{m_k}_{l = 1} c_{ilk} \int _{\tau _k} \phi _l(t) f_k(t) \ \textrm{d}t, \end{aligned}$$
(2)

where the integrals can be calculated prior to the hyperparameter tuning of the network. The coefficients \(\left\{ c_{ilk}\right\}\) act as standard scalar weights in the first layer of the network. An in-depth discussion of the computational aspects is provided in Thind et al. (2022). In the software R, functional neural networks are provided by the package FuncNN (Thind et al. 2022).
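As a minimal illustration of this pre-computation, the following R sketch evaluates an orthonormal Fourier basis on a grid of \(\tau _1 = \left[ 0, 0.25\right]\) and approximates the integrals in (2) by the trapezoidal rule for a single functional input (\(K=1\)); the grid and the placeholder function values f_vals are illustrative assumptions, and the FuncNN package performs an analogous step internally.

```r
# Sketch: numerical approximation of the integrals in (2) for one functional
# input f_1 observed on a grid of tau_1 = [0, 0.25]; the values f_vals are a
# placeholder for an estimated summary characteristic evaluated on t_grid.
fourier_basis <- function(t, m, T_len) {
  # Orthonormal Fourier basis on [0, T_len]: constant function, then sine/cosine pairs
  B <- matrix(NA_real_, nrow = length(t), ncol = m)
  B[, 1] <- 1 / sqrt(T_len)
  j <- 1; col <- 2
  while (col <= m) {
    B[, col] <- sqrt(2 / T_len) * sin(2 * pi * j * t / T_len)
    if (col + 1 <= m) B[, col + 1] <- sqrt(2 / T_len) * cos(2 * pi * j * t / T_len)
    col <- col + 2; j <- j + 1
  }
  B
}

trapezoid <- function(y, t) sum(diff(t) * (head(y, -1) + tail(y, -1)) / 2)

t_grid <- seq(0, 0.25, length.out = 50)   # evaluation grid on tau_1
f_vals <- rep(1, length(t_grid))          # placeholder functional observation
Phi    <- fourier_basis(t_grid, m = 29, T_len = 0.25)

# One integral per basis function; these scalars enter the first layer of the network
features <- apply(Phi, 2, function(phi) trapezoid(phi * f_vals, t_grid))
```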

3.2 Supervised classification of point patterns

The proposed procedure for classifying point pattern data using a neural network with this specific input layer is schematically depicted in Fig. 1. We begin with a given dataset that consists of multiple point patterns. This dataset is divided into three distinct subsets: a training set, a validation set, and a test set. The recommended proportion of data allocated to the training and validation sets relative to the test set varies throughout the literature; proportions from 50:50 to 80:20 are used most frequently. However, the optimal choice depends on various factors, such as the sample size of the data at hand.
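For instance, a stratified split of the pattern indices could be performed as follows; the 60:20:20 proportions and the variable labels (a factor with one class label per pattern) are illustrative assumptions only.

```r
# Sketch: stratified split of pattern indices into training / validation / test sets;
# 'labels' is assumed to be a factor with one class label per pattern.
set.seed(1)
split_one_class <- function(idx) {
  idx   <- sample(idx)            # shuffle the indices within the class
  n     <- length(idx)
  n_tr  <- floor(0.6 * n)         # 60 % training
  n_val <- floor(0.2 * n)         # 20 % validation, the rest forms the test set
  list(train      = idx[seq_len(n_tr)],
       validation = idx[n_tr + seq_len(n_val)],
       test       = idx[-seq_len(n_tr + n_val)])
}
splits <- lapply(split(seq_along(labels), labels), split_one_class)
```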

As the second step, we choose a suitable functional summary characteristic, e.g. the pair correlation function. The empirical (estimated) pair correlation function serves as the input function \(f_1\) that enters the neural network. Naturally, other functional characteristics such as the nearest-neighbour distance distribution function may be chosen. A comprehensive summary of commonly used characteristics is given in Illian et al. (2004), Møller and Waagepetersen (2004). The selection of a suitable characteristic, together with the domain for its estimation, requires some expert knowledge, as it is essential to capture distinctions among different classes. It is worth stressing that we utilize only a single functional characteristic, which corresponds to the choices \(K=1\) and \(J=0\) in (1). In detail, the i-th neuron of the input layer transfers the value

$$\begin{aligned} z_i = a\left( \sum ^{m_1}_{l = 1} c_{il1} \int _{\tau _1} \phi _l(t) f_1(t) \ \textrm{d}t + b_i^{(1)} \right) . \end{aligned}$$
(3)

In some cases, for more complex data, it may be worthwhile to look for an optimal combination of several functional and scalar characteristics, as it may improve the classification accuracy.

The next step involves computing the integrals specified in (3).

Then, we construct the whole neural network, potentially incorporating further dense layers (for an explanation, see, e.g. Goodfellow et al. 2016) and dropout layers, and, notably, an output layer equipped with the softmax activation function, whose number of neurons equals the number of classes, known prior to the beginning of the procedure.
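Assuming the pre-computed integrals are stored as rows of a numeric matrix (one row per pattern, one column per basis function), a minimal keras sketch of such an architecture could read as follows; the numbers of neurons, the dropout rate, and the number of classes are placeholders to be tuned, and the weights of the first dense layer play the role of the coefficients \(c_{il1}\) in (3).

```r
library(keras)

n_basis   <- 29   # number of basis functions, i.e. columns of the feature matrix
n_classes <- 3    # number of classes, known beforehand

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu",
              input_shape = c(n_basis)) %>%   # functional input layer acting on the integrals
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = n_classes, activation = "softmax")

model %>% compile(
  optimizer = "adam",
  loss      = "categorical_crossentropy",
  metrics   = "accuracy"
)
```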

Using the training and validation datasets, we fine-tune the neural network’s architecture to optimize its performance. This tuning process involves adjusting the network’s hyperparameters, such as the number of layers, the number of units within each layer, activation functions, batch size, and the dropout rate. As this is a standard procedure, for details and interpretation of the hyperparameters we refer to standard literature, see e.g. Chollet and Allaire (2018, Chapter 4) or Goodfellow et al. (2016).

The final step is to evaluate the best-performing model on the test dataset, which has not been used for training; it serves to assess the model’s ability to make predictions on new data.

Fig. 1 The scheme of the classification procedure based on the neural network with functional inputs

4 Simulation experiments

In this section, we examine the behaviour of the neural network (NN) classifier described in Sect. 3 for three, five, and nine classes, and we generate three collections of point patterns for each case: training, validation, and testing data. The training data are composed of 1000 realisations per class, and validation and testing data are composed of 200 realisations per class. All the datasets are generated using the package spatstat (Baddeley et al. 2015) for the statistical software R.

For each point pattern, the pair correlation function g is estimated at 50 equidistant points over the interval \(\tau _1 = \left[ 0, 0.25\right]\). The upper endpoint is chosen based on the shape and size of the observation window (which is the unit square) and a common rule of thumb for rectangular windows (Baddeley et al. 2015), i.e. taking a quarter of the length of the shorter side of the rectangular window. These estimates then play the role of the function \(f_1\), see Sect. 3. We focus on g mainly because of its widespread use in practical applications and its ease of interpretation.

For the estimation of g, we use the pcf.ppp function from the package spatstat with a modified kernel estimator that reduces the bias of the estimator for values of the argument close to zero (the parameter ‘divisor’ is set to the value ‘d’). Moreover, the translation edge correction is applied (Illian et al. 2004). We use the Fourier basis with \(m_1 = 29\) basis functions, which is the default choice in the setting of Thind et al. (2022), to evaluate the integrals in (2). The networks are trained using the R interface for the Python libraries keras and tensorflow (Allaire et al. 2016; Chollet and Allaire 2017).
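For a single pattern, stored as a ppp object X, this estimation step could be sketched as follows; X is a hypothetical placeholder, and the 50 equidistant arguments match the description above.

```r
library(spatstat)

r_grid <- seq(0, 0.25, length.out = 50)   # 50 equidistant arguments on tau_1
g_hat  <- pcf(X, r = r_grid, divisor = "d", correction = "translate")

# g_hat$trans holds the translation-corrected estimate of g at r_grid;
# these values play the role of the function f_1 from Sect. 3
f1_vals <- g_hat$trans
```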

4.1 Description of experiments

We begin with a ternary classification where the first class consists of realisations of a stationary Poisson point process. For the two remaining classes, realisations of stationary Thomas processes with different model parameters are considered.

Fig. 2 Illustration of a single realisation of the Thomas process with \(\sigma\) set to 0.02, 0.05, and 0.1, going from left to right. The rightmost panel illustrates a single realisation of the Poisson point process

During the experiment, the intensity \(\lambda\) of the Poisson point process (the underlying model for the first class) is set to 400, and we set the parameters \(\kappa\) and \(\mu\) of the Thomas process accordingly so that the two processes have the same intensity, i.e. \(\kappa =50\) and \(\mu = 8\). All the considered realisations are observed in the unit square, resulting in the mean number of 400 points per pattern. The strength of the interactions is driven by the parameter \(\sigma\). Roughly speaking, small values of \(\sigma\) correspond to strong clustering, and as \(\sigma\) grows, the interactions become weaker, see Fig. 2. Given the known shape and size of the fixed observation window and since the analytical formula for g is known under the Thomas model, we can track the influence of \(\sigma\) on the behaviour of g, see Pawlasová et al. (2023, Figure 1). Concrete values of \(\sigma\) that we associate with strong/moderate/weak clustering are given later in the text.
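Realisations of the kind shown in Fig. 2 can be generated with spatstat as sketched below; the parameter \(\sigma\) enters rThomas through its scale argument.

```r
library(spatstat)

W <- square(1)   # unit square observation window

X_pois <- rpoispp(lambda = 400, win = W)                       # Poisson, intensity 400
X_002  <- rThomas(kappa = 50, scale = 0.02, mu = 8, win = W)   # sigma = 0.02
X_005  <- rThomas(kappa = 50, scale = 0.05, mu = 8, win = W)   # sigma = 0.05
X_010  <- rThomas(kappa = 50, scale = 0.10, mu = 8, win = W)   # sigma = 0.10
```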

In the ternary classification, the second class is set to be the realisations of the Thomas process with \(\sigma = 0.10\) (denoted by T\(\left[ 0.100\right]\)). This particular choice of \(\sigma\) corresponds to weak attractive interactions. With values of \(\sigma\) growing past 0.10, the interactions are even weaker. In the given observation window, it would not be possible to reasonably distinguish between such weak interactions and the Poisson point process by any method, since the Poisson process can be seen as the limit of the Thomas process with \(\sigma \rightarrow \infty\). Hence, this choice of \(\sigma\) challenges the NN classifier in the classification between the first and the second class. Finally, the third class corresponds to the realisations of the Thomas process with \(\sigma = 0.05\) (denoted by T\(\left[ 0.050\right]\)). This choice of \(\sigma\) leads to rather strong clustering, making this model distinguishable from the Poisson point process. Previous experiments with binary classification suggest that the classes T\(\left[ 0.100\right]\) and T\(\left[ 0.050\right]\) should be easily distinguishable, see Pawlasová et al. (2023), Pawlasová and Dvořák (2022). Examples of realisations from the three classes are given in Fig. 2.

With an increasing number of classes, the classification becomes more challenging. Therefore, we repeat the experiment with five and nine classes, respectively, and look for a significant drop in the performance of the NN classifier. The five-class scenario consists of the following classes: Poisson, T\(\left[ 0.100\right]\), T\(\left[ 0.075\right]\), T\(\left[ 0.050\right]\), and T\(\left[ 0.025\right]\). Similarly, the nine-class scenario consists of the Poisson, T\(\left[ 0.100\right]\), T\(\left[ 0.090\right]\), T\(\left[ 0.080\right]\), T\(\left[ 0.070\right]\), T\(\left[ 0.060\right]\), T\(\left[ 0.050\right]\), T\(\left[ 0.040\right]\), and T\(\left[ 0.030\right]\) classes.

4.2 Experiment evaluation

The performance of the NN classifier is evaluated through the following quantities: the confusion matrix, the per-class accuracies, and the overall accuracy. The \((i, j)\) entry of the confusion matrix (where i and j range from 1 to the total number of classes) corresponds to the total number of patterns in the testing data coming from the j-th class whose predicted label has the value i. The per-class accuracies correspond to the diagonal of the confusion matrix divided by the total number of patterns in the corresponding class in the testing data. Finally, the overall accuracy is the proportion of point patterns from the testing data whose predicted label matches the true one; it equals the average per-class accuracy when each class contributes the same number of testing patterns.
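Given vectors of true and predicted labels for the testing data, these quantities can be computed, for instance, as follows; true_lab and pred_lab are hypothetical factor vectors sharing the same set of levels.

```r
# Rows of the confusion matrix correspond to predicted labels, columns to true
# labels, matching the (i, j) convention described above.
conf_mat <- table(predicted = pred_lab, true = true_lab)

per_class_acc <- diag(conf_mat) / colSums(conf_mat)    # per-class accuracies
overall_acc   <- sum(diag(conf_mat)) / sum(conf_mat)   # overall accuracy
```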

All the reported quantities are compared to the corresponding quantities for the kernel regression classifier, which can be considered a benchmark method for point pattern data. Using the same training and testing data as for the NN classifier, the kernel regression classification is performed with the Bayes classifier combined with the k-nearest neighbours algorithm and the kernel regression method. This approach includes an automatic procedure for the optimal choice of the hyperparameter k. The dissimilarity measure (needed to determine the nearest neighbours) is based on the empirical pair correlation function and a Cramér-von-Mises type formula, which is typically used to build dissimilarities in the context of functional data (Ferraty and Vieu 2006). A detailed description of the classifier, together with simulation experiments mapping its performance for point patterns, can be found in Pawlasová and Dvořák (2022). For binary classification, the performance of the kernel regression classifier is compared to the NN classifier in Pawlasová et al. (2023).
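A minimal sketch of this benchmark, assuming the empirical pair correlation functions of the training and testing patterns are stored as rows of hypothetical matrices G_train and G_test evaluated on a common grid t_grid, could look as follows; it simplifies the kernel-weighted Bayes rule of Pawlasová and Dvořák (2022) to a plain majority vote over the k nearest neighbours.

```r
# Cramér-von-Mises-type dissimilarity between two estimated functions on t_grid
cvm_dissim <- function(f, g, t_grid) {
  d <- (f - g)^2
  sum(diff(t_grid) * (head(d, -1) + tail(d, -1)) / 2)   # trapezoidal rule
}

# Classify one test pattern by a majority vote among its k nearest training patterns
knn_classify <- function(f_new, G_train, labels_train, t_grid, k = 5) {
  dists <- apply(G_train, 1, cvm_dissim, g = f_new, t_grid = t_grid)
  nn    <- order(dists)[seq_len(k)]
  names(which.max(table(labels_train[nn])))
}

pred_lab <- apply(G_test, 1, knn_classify,
                  G_train = G_train, labels_train = labels_train, t_grid = t_grid)
```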

We do not aim to compare the NN approach with the Siamese network discriminant model from Jalilian and Mateu (2022), because the latter uses the particular framework of one-shot learning. Such a framework works with a support set consisting of \(M \in {\mathbb {N}}\) patterns from M different classes, where only one of the patterns has the same class membership as the incoming observation. This set is used to train the Siamese network. On the other hand, the approach from Thind et al. (2022) requires an extensive collection of training data with multiple observations for each of the considered classes.

Moreover, this paper does not specify any model for the analysed point pattern data; we want to keep the method purely non-parametric. Hence, we do not compare our results with the approaches proposed in Vo et al. (2018), and we use the non-parametric kernel regression method as a benchmark.

5 Results

In this section, we report the results of the multi-class classification for three, five, and nine classes, respectively. In the three- and five-class cases, the optimal architecture consists of the input layer equipped with the ReLU activation function, followed by a dropout layer and the output layer equipped with the softmax activation function. For the nine-class classification, the input layer is followed by an additional dense layer, again equipped with the ReLU activation function.

5.1 Three classes

For the ternary classification, the optimal network architecture uses 64 neurons in the input layer, i.e. \(n_1=64\) in (1), and a dropout rate of 0.3. The batch size is set to 32.
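With a model built as sketched in Sect. 3.2 and these hyperparameters, training could be carried out as follows; x_train and x_val are hypothetical matrices of pre-computed integrals, lab_train and lab_val integer class labels coded 0, 1, 2, and the number of epochs is a placeholder.

```r
history <- model %>% fit(
  x = x_train, y = to_categorical(lab_train, num_classes = 3),
  validation_data = list(x_val, to_categorical(lab_val, num_classes = 3)),
  epochs = 100, batch_size = 32
)
```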

Table 1 presents the confusion matrix for the ternary classification. First, for all three classes the dominant proportion of the predicted labels matches the true ones (the off-diagonal values are small). The per-class accuracy is above \(90.0\%\) for all three classes. In detail, the accuracies are equal to \(94.5\%\), \(92.5\%\), and \(97.5\%\) for Poisson, T\(\left[ 0.100\right]\), and T\(\left[ 0.050\right]\), respectively. The lowest per-class accuracy is attained for T\(\left[ 0.100\right]\), since realisations from this class may be mislabelled as either Poisson or T\(\left[ 0.050\right]\). On the other hand, we do not observe any realisation from the Poisson class incorrectly labeled as T\(\left[ 0.050\right]\), and vice versa. The overall accuracy is \(94.8\%\), which indicates successful classification; the behaviour of the classifier corresponds to our expectations.

For the kernel regression, similar conclusions can be made. From Table 1, we can see that the lowest per-class accuracy is again attained for T\(\left[ 0.100\right]\). However, in comparison to the NN classifier, the percentage of mislabelled realisations is higher. The overall accuracy is similar to the NN classifier.

Having a training set with thousands of patterns is a natural context for the NN classifier but not for the kernel regression classifier. Hence we investigate their performance also in the setting with much smaller training data. The overall accuracy for the kernel regression equals \(90.8\%\) (50 patterns per class) and \(93.8\%\) (25 patterns per class). A smaller training data size can significantly reduce the computational time needed for the kernel regression method. Although it is possible to train the NN classifier on such a small training set, the overall accuracy decreases with decreasing training sample size (in this particular example, the overall accuracy is \(88.7\%\) for 50 patterns per class and \(86.2\%\) for 25 patterns per class, respectively).

Table 1 Confusion matrix for the ternary classification

5.2 Five classes

For the five-class classification, the optimal network architecture uses 64 neurons in the input layer and a dropout rate of 0.2. The batch size is set to 32.

Table 2 summarizes the results. Regarding the NN classifier, there are three classes (Poisson, T\(\left[ 0.050\right]\), and T\(\left[ 0.025\right]\)) for which mislabelled realisations are rare, and when they occur, the incorrect label belongs to a neighbouring class. On the other hand, a part of the realisations from T\(\left[ 0.100\right]\) and T\(\left[ 0.075\right]\) is incorrectly assigned to at least two other classes. As a result, T\(\left[ 0.100\right]\) and T\(\left[ 0.075\right]\) have a significantly lower per-class accuracy than the other three classes.

In comparison to the ternary classification, the overall accuracy decreases from \(94.8\%\) to \(85.9\%\). Such a decline is expected, since we reduce the difference between the values of \(\sigma\) that define the individual classes. From the point of view of the empirical pair correlation function, it is more challenging to distinguish between realisations from T\(\left[ 0.100\right]\) and T\(\left[ 0.075\right]\) than between realisations from T\(\left[ 0.100\right]\) and T\(\left[ 0.050\right]\). The kernel regression classifier performs slightly worse than the NN classifier in all the reported quantities.

However, for the kernel regression, the overall accuracy and the per-class accuracies stay the same if we use a small set of training data (with 50 patterns per group). On the other hand, with such a small training set, the (optimal) NN classifier has an overall accuracy of \(77.9\%\).

Table 2 Confusion matrix for the five-class classification

5.3 Nine classes

Table 3 Confusion matrix for the nine-class classification

The optimal network architecture for the nine-class classification consists of the input layer followed by one additional dense layer, both with 32 neurons. The dropout rate is 0.2 and the batch size is set to 256.

Notice that the number of dense layers has increased compared to the previous examples. This is because the classification becomes more complex as the number of classes increases; for a smaller number of classes, networks with more dense layers lead to over-fitting.

Table 3 shows that only the two most extreme classes (Poisson and T\(\left[ 0.030\right]\)) have the per-class accuracy above \(90.0\%\), for both classifiers. Regarding the NN classifier, the lowest per-class accuracy is attained for T\(\left[ 0.090\right]\), where the realisations can be confused easily with those from other classes including Poisson, T\(\left[ 0.100\right]\), T\(\left[ 0.080\right]\), and T\(\left[ 0.070\right]\). The overall accuracy is \(61.5\%\).

Regarding the kernel regression classifier, the per-class accuracy is in most cases lower than for the NN classifier. A significant difference can be observed for T\(\left[ 0.100\right]\), where the kernel regression tends to incorrectly label the realisations of T\(\left[ 0.100\right]\) by the Poisson label more often than the NN classifier. The overall accuracy is \(57.3\%\).

For this experiment, the empirical pair correlation function does not provide sufficient information to capture the subtle changes in interactions for the neighbouring classes. Hence, as was expected, the performance of both classifiers is significantly worse than in the previous experiments. However, Table 3 shows that both classifiers keep their ability to distinguish between the groups for which the model parameters are sufficiently different.

6 Application to HeLa cells dataset

We assess the performance of both classifiers on a real dataset comprising 68 point patterns of intramembranous particles of mitochondrial membranes from HeLa cells (Schladitz et al. 2003). This dataset is categorized into three groups based on the different environments the cell line was observed in. It is composed of 33 patterns observed under normal conditions, 21 observed after exposure to rotenone, and 14 after exposure to sodium acid. An illustration of the observed patterns is given in Fig. 3.

Fig. 3 An illustration of the point patterns of HeLa cells, from left to right: acid-treated, untreated (normal conditions), and rotenone-treated

We divide the data into training, validation, and test datasets such that the training and validation datasets together contain \(60\%\) of the patterns from each group, split between training and validation in a 70:30 proportion. The test dataset consists of 13 patterns observed under normal conditions, 8 rotenone-treated patterns, and 6 acid-treated patterns.

Regarding the optimal network architecture, the best performance was obtained by the input layer with 64 neurons and a dropout rate of 0.1. The batch size is set to 2.

Table 4 Confusion matrix for the HeLa cells dataset

Table 4 presents an overview of our findings. Notably, the acid-treated patterns exhibit the lowest per-class accuracies, with the NN approach misclassifying 50% of these patterns as untreated and the kernel regression method misclassifying these patterns at a rate of 83.3%. This may be attributed mainly to the resemblance between these two classes, as can be observed in Fig. 3. For the remaining two classes, both classifiers demonstrate comparable per-class accuracies, which are acceptable given the limited sample size. Nevertheless, it is worth stressing that the overall accuracy is higher for the NN classifier, aligning with the conclusions drawn from our simulation study. It is also worth noting that higher accuracies could potentially be achieved with larger datasets.

7 Discussion

Our experiments show that functional neural networks are suitable for the supervised classification of point patterns, even in multi-class settings. This connection between spatial statistics and the fast-developing field of deep learning is of great interest for a wide range of applied research. Our simulation study includes point pattern data from stochastic models with attractive interactions of different strengths. Moreover, the complete spatial randomness model (the Poisson point process) is considered as an extreme case of weak attractive interactions. However, previous research (Pawlasová and Dvořák 2022), together with our preliminary experiments, suggests that the results would be very similar for repulsive models, e.g. Gaussian determinantal point processes. For clarity, we kept the experiments as simple as possible. Nevertheless, they cover the fundamental problem of spatial statistics and point pattern analysis: distinguishing between different types of interactions.

As expected, the overall accuracy and the per-class accuracies decrease with the increasing number of classes for both classifiers. In the three experiments, the benchmark method of kernel regression has been outperformed by the proposed NN classifier.

In the three and five-class cases, the information obtained from the empirical pair correlation function is enough to perform a successful classification. The choice of the method (NN classifier or kernel regression) then plays a minor role.

For the nine-class example, the step in the sequence of the values of \(\sigma\) is small. Thus, the variability in the empirical pair correlation function between realisations from one class is comparable with the variability between realisations from classes with neighbouring values of \(\sigma\). In this case, we are limited by the amount of information available rather than by the classifiers themselves. On the other hand, both classifiers can still correctly label realisations from classes with extreme values of \(\sigma\).

Since we work with simulated data and control the underlying stochastic models, the pair correlation function was a convenient choice for the feature to be extracted from the analysed point patterns. However, a wrong choice can negatively impact the performance of the studied classifier. In practice, choosing appropriate (functional) summary characteristics is challenging, and expert knowledge of the problem should always guide the choice. Note that the NN classifier can be easily modified to combine several functional and numerical summary characteristics. In contrast, such a modification is less straightforward for kernel regression; for instance, it is not obvious how to combine several dissimilarity measures (based on different functional characteristics) into one efficient dissimilarity measure.

In the analysis of the real dataset in Sect. 6, the pair correlation function turned out to be successful in the classification, while other characteristics such as the nearest-neighbour distance distribution function or the empty-space distribution function (Illian et al. 2004) resulted in poor classification. This further illustrates the need to choose a summary characteristic suitable for the problem at hand.

Neither of the classification methods considered above (the NN classifier and kernel regression) requires a constant intensity function of the point process model that generated the analysed point patterns. Moreover, the classification task can be directly extended to more complicated settings, such as random sets, provided that relevant summary characteristics are available.

The supervised classification task depends on the training data consisting of a certain number of patterns. If only a few patterns with a very large number of points are available for the training data, e.g. in applications such as fluorescence microscopy (Andersen et al. 2018; Hooghoudt et al. 2017; Li et al. 2016) or airborne laser scanning (Mehtätalo et al. 2022), we can subsample many smaller patterns and thus obtain a rather large dataset of patterns of the same type. Nevertheless, it is worth pointing out that this is not a viable strategy if the observed patterns have a small number of points.