Introduction

Rapid advancements in technology, particularly in information technology, artificial intelligence, and the Internet of Things, have endowed cities with tools to collect data, analyze them, and make data-driven decisions for urban planning and management. These decisions often rely on parameter values that are not definitively known at the time they are made. Forecasting models that predict the most probable realization of uncertain parameters can be employed. However, for decisions with enduring implications over the medium to long term, it is essential to account not only for the most likely scenario but also for a range of potential outcomes. In such cases, scenario generation procedures are necessary to assemble a comprehensive representation of potential future developments. This allows the decision model to mitigate risks and strengthen the resilience of the chosen strategies, enabling decision-makers to adapt to a variety of future circumstances and to anticipate and hedge against uncertainty.

When dealing with transportation management problems, it is crucial to consider various possible traffic conditions and generate scenarios for travel speed across the road network. In smart cities, hundreds of thousands of strategically placed sensors provide the capability to collect substantial data on travel speed under different conditions. Several studies, such as Lv et al. (2015), Geng et al. (2019), and Yao et al. (2019), have successfully employed machine learning techniques to analyze this data and develop speed forecasting models. Unlike these previous works, our objective is to develop a generative machine learning model that can generate scenarios by effectively capturing the joint distribution of speed variables throughout a road network. This approach accounts for complex dependencies among speed variations along road links. It prioritizes the generation of scenarios that approximate the actual joint distribution of the variables, rather than solely focusing on forecasting the most likely scenario. This will enable us to generate scenarios for consideration in decision models under uncertainty, such as those based on stochastic programming or robust optimization.

Speed measurements on road networks are influenced not only by proximity but also by exogenous factors, including human behavior, weather conditions, and the geographical distribution of Points of Interest (POIs), such as theaters, schools, hospitals, and shopping centers. Learning the joint distribution and dependencies among these variables from sensor data holds the potential to acquire more robust and generalizable representations. Undoubtedly, this task is challenging, as the learned models must not only account for intricate topological relationships but also effectively handle the influence of traffic attractors.

In this study, we propose a model based on generative adversarial networks (GANs) (Goodfellow et al. 2020), a deep learning (DL) architecture. GAN models are self-supervised and have demonstrated their potential in capturing multivariate joint distributions across various domains, including image generation (Bao et al. 2017; Gao et al. 2018; Reed et al. 2016), molecular structure generation (Maziarka et al. 2020), and multi-modal information integration (Sutter et al. 2020). To enhance the performance of our GAN-based model, we suggest pre-training the generator using a Variational AutoEncoder (VAE) (Kingma and Welling 2013a). The paper is structured as follows: in the next section, we review related work. In Sect. 3, we formalize the problem. The subsequent section outlines our proposed model. The penultimate section provides details on the datasets used for model validation and presents the results compared to state-of-the-art methods. Finally, we offer conclusions and suggest possible directions for future research.

Related Work

Scenario generation is an algorithmic strategy for approximating the stochastic process that describes the random parameters affecting a decision problem with a finite discrete distribution, so that computational solution techniques can be applied. The quality of the solution obtained naturally depends on the scenarios used. In traffic engineering, scenario generation is a sophisticated approach to enhance the planning and development of urban transportation systems. Using scenarios, it is possible to create and analyze a wide range of traffic conditions in order to optimize the design and functionality of infrastructure such as roads and urban spaces. Scenario generation allows transport and urban planners to implement solutions that take uncertainty into account, for example, to identify potential vulnerabilities in road networks, as proposed in Ciplyte et al. (2014), and to develop strategies to minimize traffic pollution, as in Ejercito et al. (2017). To make resilient and robust decisions, traffic scenario generation techniques can be used to generate traffic flow scenarios, as in Cervellera et al. (2022) and Wu et al. (2020), and travel time scenarios, as proposed in Chen et al. (2017), Yu et al. (2020), and Meiping et al. (2019). Our proposed model focuses on scenario generation of speed values on a road network.

In recent years, several approaches have been proposed to generate scenarios whose distributions approximate the original multivariate data distributions, especially in the fields of asset management (Kouwenberg and Zenios 2001) and renewable energy production (Ma et al. 2013). Scenario generation methods are generally based on sampling (Kaut and Wallace 2003) and forecasting (Kaut 2017; Lucheroni et al. 2019) techniques. Sampling-based methods generate scenarios by iteratively drawing samples from the underlying data distributions. Within this category, we can mention techniques such as Monte Carlo sampling (Dong et al. 2019; Xie et al. 2018), Markov Chain Monte Carlo sampling (Papaefthymiou and Klockl 2008), and Latin Hypercube sampling (Yu et al. 2009). Forecasting-based methods rely on models trained on historical data without making any assumptions about the underlying distribution functions. Examples include Auto-Regressive Moving Average (ARMA) models (Meibom et al. 2011), Auto-Regressive Integrated Moving Average (ARIMA) models (Chen et al. 2010), and, more recently, generative models based on neural networks (Vagropoulos et al. 2016; Stappers et al. 2020).

In a multivariate setting, sampling-based approaches must rely on a joint probability distribution model that, in real cases, when a large number of variables must be considered, is not easy to obtain. One popular technique for modeling a joint probability distribution employs a family of multivariate distributions with uniform marginals called copulas (Becker 2018; Valizadeh Haghi and Lotfifard 2015), a statistical model grounded in Sklar’s theorem (Chen et al. 2013). This theorem states that a joint distribution can be described by the marginal distribution functions of its random variables together with a copula that models the dependencies between them. The copula function takes the marginal distribution functions as input and can be sampled to generate new scenarios (Kaut and Wallace 2011).
The copula sampling method avoids constructing the joint probability distribution directly, but requires marginal distributions that resemble the normal distribution. Commonly used copula families include the elliptical and Archimedean ones. However, when considering real case studies, empirical copulas (copulas estimated from real data), consisting of mixtures of copulas commonly modeled with parametric methods, are often used. These methods, however, tend to become computationally challenging when applied to high-dimensional data.

More recently, generative models have been proposed for statistical modeling. These learn to generate data with the same statistics as a given training dataset, effectively learning its distribution; the trained model can then output new samples that could plausibly have belonged to the original dataset. GAN architectures (Jiang et al. 2018; Chen et al. 2018) are examples of this approach in the context of scenario generation. However, generative methods are mainly used for forecasting rather than for learning data distributions. Forecasting the most probable scenario is a different task: it disregards low-probability scenarios that could nonetheless have a large impact on decisions under uncertainty. For this reason, we propose a new GAN-based model aimed at learning the joint probability distribution directly from data, in order to generate a finite discrete distribution of scenarios.

Problem Statement

Given a set of \(N\) roads indexed by \(k \in \{1, \ldots , N\}\), let \(D^k = (x^k_1, x^k_2, \ldots , x^k_n) \in \mathbb {R}^n\) be a collection of \(n\) observations of the average speed on road \(k\). The sample size for each dataset is \(n\), and each sample has been collected simultaneously for all roads. In other words, each data point is an \(N\)-tuple of observations \(s_i = (x^1_i, x^2_i, \ldots , x^N_i) \in \mathbb {R}^N\), where \(i \in \{1, \ldots , n\}\). Let the observation set \(S = \{s_1, s_2, \ldots , s_n\} \in \mathbb {R}^{N \times n}\) be the collection of all such \(N\)-tuples. It represents a sample of realizations of an unknown multivariate random variable \(X\), while \(D^k\) collects the realizations of the marginal variable \(X^k\) for all \(k \in \{1, \ldots , N\}\).

A random variable is a formalization of a quantity (in this case, average car speeds measured on the road network) that depends on random events. Formally characterizing a random variable is typically a complex process, as discussed in the previous section, where various approaches from the literature have been presented. In this study, we propose to approximate \(X\) using a deep neural network \(G_\theta : \mathbb {R}^Z \longrightarrow \mathbb {R}^N\). This network takes as input a sample \(\xi \sim \mathcal {N}(0, 1)^Z\) drawn from a \(Z\)-dimensional uncorrelated Gaussian distribution and outputs an \(N\)-tuple of real values representing the speeds on the considered roads. The network’s weights \(\theta \) are obtained through an adversarial training process and backpropagation over the observation set \(S\).
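To make the mapping concrete, the following minimal PyTorch sketch shows a network of this form; the hidden size and all layer choices are illustrative placeholders, not the architecture used in the experiments (which is detailed in the Proposed Approach section).

```python
import torch
import torch.nn as nn

Z, N = 12, 16  # illustrative noise and output dimensions

# A minimal sketch of G_theta: maps xi ~ N(0, 1)^Z to an N-tuple of speeds.
generator = nn.Sequential(
    nn.Linear(Z, 64),
    nn.ReLU(),
    nn.Linear(64, N),
)

xi = torch.randn(1000, Z)   # 1000 samples from the uncorrelated Gaussian prior
scenarios = generator(xi)   # each row is one generated N-tuple of road speeds
```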

Fig. 1 Architectural view of the proposed generative approach

Fig. 2 PEMS-BAY detectors and main roadways

Proposed Approach

This research employs deep learning techniques, specifically a Generative Adversarial Network (GAN) and a Variational AutoEncoder (VAE), for traffic speed scenario generation. A GAN comprises two distinct neural networks: a Generator (Gen), responsible for creating synthetic instances from random noise, and a Discriminator (Dis), which is trained to differentiate between real instances (referred to as \(I_{\text {R}}\) in the following) and artificially generated ones (synthetic instances, \(I_{\text {S}}\)). This architecture implements a minimax two-player game, in which the Discriminator aims to enhance its ability to distinguish real from artificially generated instances, while the Generator strives to produce synthetic instances that closely resemble real ones. The GAN architecture employs the binary cross-entropy (\(\mathcal {L}_{\text {BCE}}\)) loss function, presented below in vectorized form:

$$\begin{aligned} \mathcal {L}_\text {{BCE}} = - \frac{1}{|I|} \left[ L \cdot \log (P) + (1 - L) \cdot \log (1 - P)\right] . \end{aligned}$$
(1)

Here, \(P \in [0,1]^{|I|}\) represents the vector of probabilities assigned by the Discriminator to a set of instances I, and \(L\in \{0,1\}^{|I|}\) is the vector of binary labels associated with real instances (1) and synthetic instances (0).
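For illustration, Eq. 1 corresponds to the standard binary cross-entropy readily available in deep learning frameworks; the sketch below, with made-up probabilities and labels, shows the computation in PyTorch.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # batch-averaged form of Eq. (1)

# Hypothetical Discriminator outputs P and labels L (1 = real, 0 = synthetic).
P = torch.tensor([0.9, 0.2, 0.7, 0.1])
L = torch.tensor([1.0, 0.0, 1.0, 0.0])

loss = bce(P, L)  # -(1/|I|) * sum(L*log(P) + (1-L)*log(1-P))
```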

The training process of a GAN is structured into three distinct phases, iteratively applied:

1. The Discriminator is trained only on real instances (\(I_{\text {R}}\)). During this phase, the loss function is minimized, affecting only the Discriminator’s weights. The Generator is kept idle, meaning that its weights are not updated.

2. The Discriminator is trained on synthetic instances generated by the Generator from uncorrelated Gaussian noise \(z\), that is, \(\text {Gen}(z)=I_{\text {S}}\). In this phase, the Generator is also kept idle.

3. The Generator is trained with the objective of generating synthetic instances capable of deceiving the Discriminator. The loss function from Eq. 1 reads as follows:

    $$\begin{aligned} \mathcal {L}_{\text {BCE}} = -\frac{1}{|I_{\text {S}}|} \log (1 - \text {Dis}(\text {Gen}(z))). \end{aligned}$$
    (2)

    Here, \(\text {Dis}(\text {Gen}(z)) \in [0,1]^{|I_\textrm{S}|}\) represents the vector of probabilities assigned by the Discriminator to synthetic instances \(\text {Gen}(z) = I_{\text {S}}\), and \(L = \{0\}^{|I_\textrm{S}|}\). During this phase, the loss function is maximized by updating the Generator’s weights while the Discriminator is kept idle.
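Putting the three phases together, the following minimal PyTorch sketch illustrates one training iteration. All names (gen, dis, the optimizers, the noise dimension Z) are illustrative assumptions; the third phase uses the common non-saturating formulation, which labels synthetic instances as real instead of literally maximizing Eq. 2.

```python
import torch
import torch.nn as nn

def train_step(gen, dis, real_batch, opt_gen, opt_dis, Z):
    """One GAN iteration; gen and dis are assumed nn.Module networks,
    with dis outputting probabilities in [0, 1] of shape (batch, 1)."""
    bce = nn.BCELoss()
    batch = real_batch.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Phase 1: Discriminator on real instances I_R (Generator idle).
    opt_dis.zero_grad()
    bce(dis(real_batch), ones).backward()
    opt_dis.step()

    # Phase 2: Discriminator on synthetic instances Gen(z) (Generator idle).
    opt_dis.zero_grad()
    fake = gen(torch.randn(batch, Z)).detach()  # detach keeps Gen idle
    bce(dis(fake), zeros).backward()
    opt_dis.step()

    # Phase 3: Generator tries to fool the Discriminator (Discriminator idle).
    opt_gen.zero_grad()
    # Non-saturating variant of Eq. (2): label synthetic instances as real.
    bce(dis(gen(torch.randn(batch, Z))), ones).backward()
    opt_gen.step()
```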

GAN-based architectures have demonstrated their effectiveness in various fields but often suffer from convergence issues (Kodali et al. 2017). To address this challenge and mitigate instability, we propose pre-training the Generator using a Variational AutoEncoder (VAE) (Kingma and Welling 2013b). VAE models extend the AutoEncoder architecture (Goodfellow et al. 2017; Hsieh 2001), which comprises two key components: an Encoder, responsible for generating a compact representation (referred to as the latent representation or latent space) of the input, and a Decoder, which reconstructs the model’s input starting from the latent representation. In VAE architectures, the encoder produces a regularized distribution over the latent space representing the input data, and the decoder samples from this distribution to reconstruct the provided input. The choice of a VAE for pre-training the Generator is grounded in two fundamental considerations: (i) it is possible to design a VAE whose primary objective is to learn how to reconstruct the input distribution, rather than the individual instances provided to the encoder, and (ii) it is possible to feed the decoder with samples extracted from an uncorrelated multivariate Gaussian distribution to generate synthetic instances consistent with a specific data distribution. Figure 1 presents an architectural overview of the proposed approach. To achieve goals (i) and (ii), we propose the following specialized loss function:

$$\begin{aligned} \mathcal {L}_{\text {VAE}}=\alpha \cdot \mathcal {L}_{\text {JSD}} + \beta \cdot \mathcal {L}_{\text {M}} + \gamma \cdot \mathcal {L}_{\rho _\textrm{s}} + \delta \cdot \mathcal {L}_{\perp }. \end{aligned}$$
(3)

Here, \(\mathcal {L}_{\text {JSD}}\), \(\mathcal {L}_{\text {M}}\), \(\mathcal {L}_{\rho _\textrm{s}}\), and \(\mathcal {L}_{\perp }\) are different loss terms and \(\alpha \), \(\beta \), \(\gamma \), and \(\delta \) are their relative scalar coefficients. The first three loss terms are introduced to steer the training process and penalize differences in terms of shape (\(\mathcal {L}_{\text {JSD}}\)), median (\(\mathcal {L}_{\text {M}}\)), and correlation (\(\mathcal {L}_{\rho _\textrm{s}}\)) between the input and reconstructed random variables (Goal i), while \(\mathcal {L}_{\perp }\) is responsible for decorrelating the dimensions of the latent space (Goal ii). It is worth noting that, to capture statistical features of both the input and reconstructed distributions, the computation of the loss function’s gradient considers the instances within the input batch collectively, rather than aggregating the contribution of each instance individually.

Fig. 3 METR-LA detectors and main roadways

Fig. 4 Chengdu road network maps

Fig. 5 The plot displays the correlations between a subset of 4 PEMS-BAY-16 latent space components. The distribution of the marginal components (with a mean of 0 and variance of 1) is shown on the diagonal

Table 1 Model architectures for the PEMS-BAY-16/METR-LA-16, PEMS-BAY-32/METR-LA-32, and PEMS-BAY-48/METR-LA-48 datasets

Going into more detail, the \(\mathcal {L}_{\text {JSD}}\) term (presented in Eq. 4) is included to force the model to generate a reconstructed joint distribution in which each constituent marginal distribution closely aligns with the corresponding input one. To achieve this, we employ the Jensen–Shannon Divergence (JSD) (Lin 1991), a metric quantifying the similarity between two distributions. To elaborate further, in our approach, we calculate the JSD to assess the dissimilarity between the kth input marginal variable (\(X^k_{\text {in}}\)) fed to the encoder and its reconstructed counterpart (\(X^k_{\text {out}}\)). We subsequently aggregate these measurements across all marginal variables (\(k=1\dots N\)). The \(\mathcal {L}_{\text {JSD}}\) term is calculated as follows:

$$\begin{aligned} \mathcal {L}_{\text {JSD}}=\sum _{k=1}^{N}{\sqrt{\frac{1}{2}\cdot KL(X_{\text {out}}^k,m_k)+\frac{1}{2}\cdot KL(X_{\text {in}}^k,m_k)}}. \end{aligned}$$
(4)

Here, \(m_k = \frac{1}{2}\cdot (X_{\text {in}}^k + X_{\text {out}}^k)\) and KL represents the Kullback–Leibler divergence (Kullback and Leibler 1951). The loss term \(\mathcal {L}_{\text {M}}\) (presented in Eq. 5) measures the aggregated distance between the medians of each input marginal variable (\(\widetilde{X}_{\text {in}}^k\)) and its reconstructed counterpart (\(\widetilde{X}_{\text {out}}^k\)):

$$\begin{aligned} \mathcal {L}_{\text {M}}=\sum _{k=1}^{N}{|\widetilde{X}_{\text {in}}^k- \widetilde{X}_{\text {out}}^k|}. \end{aligned}$$
(5)
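For illustration, the sketch below approximates \(\mathcal {L}_{\text {JSD}}\) and \(\mathcal {L}_{\text {M}}\) over a batch using per-marginal histograms. The bin count and all names are assumptions; hard histograms are not differentiable, so an actual training implementation would require a smooth relaxation.

```python
import torch

def jsd_loss(x_in, x_out, bins=50):
    """Sketch of Eq. (4): sum over marginals of the JS divergence between
    input and reconstructed empirical distributions (x_*: (batch, N))."""
    total = torch.tensor(0.0)
    for k in range(x_in.size(1)):
        lo = float(torch.min(x_in[:, k].min(), x_out[:, k].min()))
        hi = float(torch.max(x_in[:, k].max(), x_out[:, k].max()))
        p = torch.histc(x_in[:, k], bins=bins, min=lo, max=hi) + 1e-8
        q = torch.histc(x_out[:, k], bins=bins, min=lo, max=hi) + 1e-8
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        kl_pm = (p * (p / m).log()).sum()   # KL(X_in^k, m_k)
        kl_qm = (q * (q / m).log()).sum()   # KL(X_out^k, m_k)
        total = total + torch.sqrt(0.5 * kl_qm + 0.5 * kl_pm)
    return total

def median_loss(x_in, x_out):
    """Sketch of Eq. (5): summed distance between per-marginal medians."""
    return (x_in.median(dim=0).values - x_out.median(dim=0).values).abs().sum()
```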

The loss term \(\mathcal {L}_{\rho _\textrm{s}}\) is introduced to penalize the dissimilarity between the input and reconstructed variables in terms of their correlation, as quantified by the Spearman correlation. Like the Pearson correlation (Pearson 1895), the Spearman correlation (\(\rho _\textrm{s}\)) (Spearman 1904; Schober et al. 2018) measures the statistical dependence between two random variables, but it does so through their ranks. Whereas the Pearson correlation captures only linear relationships, the Spearman correlation captures any monotonic relationship, whether linear or non-linear. The Spearman correlation is calculated as follows:

$$\begin{aligned} \rho _\textrm{s} {(X,Y)} = \frac{\text {cov}(rg_X, rg_Y)}{\sigma _{rg_X} \sigma _{rg_Y}}. \end{aligned}$$
(6)

Here, \(rg_X\) and \(rg_Y\) represent the ranks associated with the random variables X and Y. Going into more detail, in our approach, we compute Spearman correlation matrices for every possible pair of marginal variables in both the input and reconstructed batches. Subsequently, the loss term \(\mathcal {L}_{\rho _\textrm{s}}\) is computed as the aggregated distance between the input and reconstructed correlation matrices, as follows:

$$\begin{aligned} \mathcal {L}_{\rho _\textrm{s}} = \sum _{k=1}^{N}{\sum _{l=1}^{N}{\left| \rho _\textrm{s}{(X^k_{\text {in}},X^l_{\text {in}}})-\rho _\textrm{s}{(X^k_{\text {out}},X^l_{\text {out}})} \right| }}. \end{aligned}$$
(7)

Here, \(\rho _\textrm{s}{(X^k_{\text {in}},X^l_{\text {in}})}\) represents the Spearman correlation calculated for the kth and lth input marginal variables, while \(\rho _\textrm{s}{(X^k_{\text {out}},X^l_{\text {out}})}\) is the Spearman correlation of their reconstructed counterparts. Finally, to drive the training process to generate an uncorrelated latent space, the term \(\mathcal {L}_{\perp }\) is incorporated into the loss function, drawing inspiration from Yoo et al. (2021). The rationale behind this choice lies in the idea that a latent space comprising uncorrelated variables compels the decoder to acquire an understanding of the relationships between its constituent components, as these relationships are no longer encoded in the latent space. Consequently, it becomes possible to generate verisimilar instances, i.e., instances that are consistent with the joint distribution used to train the VAE, by providing the decoder with uncorrelated Gaussian noise. The \(\mathcal {L}_{\perp }\) loss term is calculated as follows:

$$\begin{aligned} \mathcal {L}_{\perp }=\sum _{k=1}^{Z}{\sum _{i=1}^{Z}{\left| \frac{1}{M}\left( \sum _{j=1}^{M}{z_j^k z_j^i}\right) -\frac{1}{M^2}\left( \sum _{j=1}^{M}{z_j^k} \sum _{j=1}^{M}{z_j^i}\right) \right| }}, \end{aligned}$$
(8)

where Z represents the dimension of the latent space, M is the batch size, and \(z^k_j\) and \(z^i_j\) denote the values of the kth and ith latent space variables for the jth batch instance, respectively.
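The two correlation-related terms can be sketched as follows; ranking is non-differentiable, so a soft ranking would be needed to backpropagate through \(\mathcal {L}_{\rho _\textrm{s}}\) in practice, while Eq. 8 is directly differentiable. All function names are assumptions.

```python
import torch

def decorrelation_loss(z):
    """Sketch of Eq. (8): penalizes covariance between the latent
    dimensions across the batch; z has shape (M, Z)."""
    M = z.size(0)
    cross = z.t() @ z / M                # entries (1/M) * sum_j z_j^k z_j^i
    means = z.mean(dim=0, keepdim=True)  # entries (1/M) * sum_j z_j^k
    return (cross - means.t() @ means).abs().sum()

def spearman_matrix(x):
    """Spearman correlation matrix of a (batch, N) sample: rank each
    marginal, then take the Pearson correlation of the ranks."""
    ranks = x.argsort(dim=0).argsort(dim=0).float()
    return torch.corrcoef(ranks.t())

def spearman_loss(x_in, x_out):
    """Sketch of Eq. (7): elementwise distance between the input and
    reconstructed Spearman correlation matrices."""
    return (spearman_matrix(x_in) - spearman_matrix(x_out)).abs().sum()
```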

Experimental Evaluation

This section shows how the proposed GAN-based model for scenario generation can learn a multivariate distribution and generate scenarios that follow the same distribution as the original dataset. Specifically, the generated scenarios are consistent and coherent with the dataset instances.

Datasets

To evaluate our proposed model, we consider three datasets, PEMS-BAY, METR-LA, and CHENGDU.

PEMS-BAY (Li et al. 2017) is a dataset collected by the California Transportation Agencies Performance Measurement System (PeMS). The dataset contains average vehicle speed values computed every 5 min for each of the 325 speed detectors placed on the urban road network (Fig. 2) of Santa Clara (California, USA), between January 1st, 2017 and June 30th, 2017. The total number of observed traffic data points is 16,937,179.

METR-LA (Jagadish et al. 2014) is a dataset consisting of average vehicle speed values collected by 207 detectors on the highways of Los Angeles County (Fig. 3). The data spans 4 months, from March 1st, 2012, to June 30th, 2012, and encompasses a total of 6,519,002 observed traffic data points.

Both the PEMS-BAY and METR-LA datasets provide speed readings for different sections of 8 main roadways in both directions. For our experiments, we initially considered 16 marginal variables, obtained by averaging the data from all sensors associated with each direction of the same road.

Table 2 CHENGDU dataset architecture model—16 roads

The CHENGDU dataset consists of data on the Chengdu road network (Gao et al. 2021), as described in Guo et al. (2019). Chengdu is a megacity in western China. Recorded at 2-min intervals across five representative time horizons (3:00–5:00, 8:00–10:00, 12:00–14:00, 17:00–19:00, and 21:00–23:00), the dataset encompasses 5943 individual road segments (Fig. 4). To be consistent with the previous datasets, we considered the 16 main road arteries.

Fig. 6 Correlation matrix plots for the three datasets analyzed, where the ith row and jth column refer to the ith and jth marginal variable, respectively. On the left, we report the correlation matrices of the empirical data and on the right, those of the generated data

Fig. 7 Data distribution plots for marginal variables: the data distribution of the considered marginal variable for real data is shown in blue, while the distribution for generated data is shown in yellow. Original images and data distributions for PEMS-BAY-32, PEMS-BAY-48, METR-LA-32, and METR-LA-48 can be found on GitHub (Carbonera 2023)

Model Architecture

The PEMS-BAY and METR-LA datasets exhibit similar structures and data volumes, allowing for the use of the same general architecture with consistent hyperparameters. In contrast, the CHENGDU dataset necessitates a specialized configuration due to its different structure. The specifics of the model structures are presented below.

For the PEMS-BAY-16 and METR-LA-16 datasets, the Variational AutoEncoder (VAE) component of the proposed approach features three linear layers of heterogeneous dimensions in the encoder, mirrored in the decoder to ensure symmetry in the model’s representation learning. The input and output dimensions are \(\mathbb {R}^{16\times 1}\), and the latent space size is \(\mathbb {R}^{12\times 1}\). In addition, the latent space layer is preceded by a batch normalization layer, so that the resulting Generator in the GAN architecture operates on normally distributed data with mean 0 and variance 1. Figure 5 illustrates the pairwise correlations among the components of the VAE latent space when fed with the PEMS-BAY-16 dataset. The decoder generates scenarios that approximate the real data distribution starting from an uncorrelated, Gaussian-distributed latent space, suggesting that it learns the distributions and correlations of the original data. In this way, we can use uncorrelated Gaussian-sampled values as input to the decoder to generate scenarios. It is worth noting that the decorrelation observed in the figure results from the specific loss function employed during model training.
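A simplified, deterministic sketch of this mirrored structure is shown below; the hidden sizes (64 and 32) are illustrative placeholders rather than the experimentally selected values of Table 1, and the full VAE machinery is omitted for brevity.

```python
import torch
import torch.nn as nn

N, Z = 16, 12  # input/output dimension and latent size for the -16 datasets

encoder = nn.Sequential(
    nn.Linear(N, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, Z),
    nn.BatchNorm1d(Z),  # keeps the latent space close to mean 0, variance 1
)
decoder = nn.Sequential(  # mirrors the encoder
    nn.Linear(Z, 32), nn.ReLU(),
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, N),
)

# After training, the decoder doubles as the GAN Generator: feeding it
# uncorrelated Gaussian noise yields speed scenarios.
scenarios = decoder(torch.randn(1000, Z))
```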

The generative component (GAN) is composed of a Discriminator featuring three linear layers, while the Generator corresponds to the Decoder of the Variational AutoEncoder (VAE), as previously described. The architectural configuration of both the VAE and GAN components for the PEMS-BAY-16 and METR-LA-16 datasets, where network parameters such as the number of layers and neurons were chosen experimentally, is presented in the first column of Table 1. The model architectures for the PEMS-BAY-32/METR-LA-32 and PEMS-BAY-48/METR-LA-48 datasets are reported in the second and third columns of Table 1.

Since the CHENGDU dataset comprises a significantly larger number of samples, it allowed us to explore deeper parameterizations of the generative model proposed in this study. The availability of large data volumes enables deeper networks to capture hierarchical features and abstractions within the data, which proves particularly advantageous when dealing with complex patterns or nuanced information inherent in traffic-related phenomena. Furthermore, training with large sample collections helps mitigate concerns about overfitting and promotes generalization. As a result, both the VAE and the GAN Generator have five hidden layers (instead of three), while the latent space of the VAE remains the same size (\(\mathbb {R}^{12\times 1}\)). Additional details are provided in Table 2.

Computational Results

The analysis of the results obtained on the PEMS-BAY-16, METR-LA-16, and CHENGDU datasets reveals our approach’s proficiency in learning and replicating the actual correlation among random variables. To qualitatively assess our approach, we present in Fig. 6 the correlation matrices of both the original and generated distributions for PEMS-BAY-16, METR-LA-16, and CHENGDU. The visualizations reveal strong similarities in patterns, suggesting that our proposed model effectively captures the correlations.

Table 3 PEMS-BAY-16 dataset
Table 4 METR-LA-16 dataset

In Fig. 7, we show a visual comparison between the original and generated distributions of different marginal variables, each associated with the speed on the corresponding road. It can be observed that empirical distributions of different shapes are well represented.

To benchmark our method against the state of the art, we conducted a comparative analysis between our proposed approach and Gaussian copulas, a widely used technique for scenario generation. Tables 3, 4, and 5 display a comprehensive summary of statistical measures for each random variable. Specifically, we present the mean and standard deviation of the actual empirical distribution, alongside the distributions derived from 1000 instances generated by each approach.
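For reference, the Gaussian-copula baseline can be sketched as follows: each marginal is mapped to normal scores through its empirical CDF, the correlation of the scores is estimated, correlated normals are sampled, and the samples are pushed back through the empirical quantiles. Function and variable names are assumptions.

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(data, n_samples=1000, seed=0):
    """Sketch of the Gaussian-copula baseline; data: (n, N) observed speeds."""
    rng = np.random.default_rng(seed)
    n, N = data.shape
    # Empirical CDF values in (0, 1), then normal scores.
    u = stats.rankdata(data, axis=0) / (n + 1)
    z = stats.norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)
    # Sample correlated normals and push them back through the marginals.
    sims = rng.multivariate_normal(np.zeros(N), corr, size=n_samples)
    u_new = stats.norm.cdf(sims)
    samples = np.empty((n_samples, N))
    for k in range(N):
        samples[:, k] = np.quantile(data[:, k], u_new[:, k])
    return samples
```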

Table 5 CHENGDU dataset
Table 6 Wasserstein distances—10 repetitions

For our comparative analysis of distribution shapes, we also computed the Wasserstein distance between the original marginal distributions and the ones generated by our model and by the Copula model, respectively. Table 6 shows that our model performs better than or comparably to the Copula method in generating the less common instances in the distribution tails, with significantly lower computational effort. Further insights are gained by examining these tables in conjunction with the marginal distributions depicted in Fig. 7. The results indicate that, when a marginal distribution deviates from the Gaussian, the Copula-based model becomes less effective at replicating the true distribution, reporting a higher Wasserstein distance than our model. This underscores our model’s substantial capacity to learn and faithfully replicate complex joint probability distributions directly from data, without making any assumptions about the distribution of the underlying variables. Notably, this ability is evident in scenarios such as bimodal distributions, as seen in marginal variables #9 and #11 in the METR-LA-16 dataset (Fig. 7b), distributions characterized by outliers, like marginal variables #9 and #11 in the PEMS-BAY-16 dataset (Fig. 7a), or distributions exhibiting pronounced skewness, for instance, marginal variables #0 and #14 in the METR-LA-16 dataset (Fig. 7b), and marginal variables #5 and #10 in the CHENGDU dataset (Fig. 7c).
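The per-marginal comparison uses the one-dimensional Wasserstein distance, available in SciPy; a minimal sketch with synthetic stand-in arrays:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(60, 8, size=5000)       # stand-in for observed speeds on road k
generated = rng.normal(60, 8, size=1000)  # stand-in for 1000 generated instances

d = wasserstein_distance(real, generated)  # lower = closer marginal shapes
```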

To study the ability of our model to learn spatial correlations with an increased number of marginal components, we split each road of the PEMS-BAY and METR-LA datasets into sub-segments, obtaining models with 32 marginal variables (by splitting each road into 2 segments) and 48 marginal variables (by splitting each road into 3 segments). The resulting datasets are denoted PEMS-BAY-\(N\) and METR-LA-\(N\), where N is the number of road sections considered. The results demonstrate that the model effectively captures and reproduces the correlations present in the original data. Figure 8 visually presents the correlation matrices for both original and generated data from the PEMS-BAY-\(N\) and METR-LA-\(N\) datasets, with 32 and 48 road segments. Notably, the correlation patterns observed in the original data matrices (on the left) persist in the generated data (on the right).

For the models with 32 and 48 marginal variables, we also performed a comparative analysis with the Copula model. To enhance the robustness of our results, we performed 10 runs of the training and generation processes. The outcomes are presented in Tables 7, 8, 9, and 10.

Finally, in Fig. 9 we show, for each dataset, the 2D t-SNE (t-Distributed Stochastic Neighbor Embedding) plots (Maaten and Hinton 2008) of the data distributions, representing the real data (blue points), the GAN-generated data (orange points), and the Copula-generated data (green points); for comparison, the Wasserstein distances are reported in Tables 6 and 9.

t-SNE is a widely used method for visualizing multidimensional data by reducing it to two dimensions. t-SNE plots allow us to evaluate the ability of our model to learn features and patterns from the real data: the closer the generated data are to the real data, the more their t-SNE representations overlap (Tai et al. 2023). The t-SNE representations of the GAN-generated data reveal a substantial congruence with those derived from the real data. Examining the Wasserstein distance in the METR-LA-16 dataset (see the second column of Table 6), one can observe a higher distance for marginal variables #3, #6, and #7. Similarly, in the METR-LA-32 dataset (see Table 9), marginal variables #10, #14, and #25 exhibit higher Wasserstein distances. Moreover, the mismatch between real and generated data in the t-SNE plots is more pronounced in the smaller METR-LA datasets, specifically those with 16 and 32 road segments, than in the larger METR-LA-48 dataset with 48 road segments. This seemingly counterintuitive phenomenon can be explained by the dataset creation process, which aggregates average speed information at different scales across segments belonging to the same roads. When the aggregation scale is larger, the resulting road segments are, on average, longer, leading to heightened variability in average speeds and longer-tailed distributions. In contrast, shorter segments (as in the METR-LA-48 dataset) result in reduced variability. Furthermore, considering shorter segments allows the model to more effectively discern latent patterns associated with exogenous factors that may influence traffic continuously or periodically, such as Points of Interest.
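A sketch of how such a plot can be produced with scikit-learn and Matplotlib, using random stand-ins for the three data sources:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Hypothetical stand-ins for real, GAN-generated, and copula-generated batches.
rng = np.random.default_rng(0)
real, gan, cop = (rng.random((300, 16)) for _ in range(3))

# Embed all three sources jointly so their 2D coordinates are comparable.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([real, gan, cop]))

for block, color, label in [(slice(0, 300), "blue", "real"),
                            (slice(300, 600), "orange", "GAN"),
                            (slice(600, 900), "green", "copula")]:
    plt.scatter(emb[block, 0], emb[block, 1], s=5, c=color, label=label)
plt.legend()
plt.show()
```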

Conclusion and Future Works

This work introduces a generative model for realistic traffic scenarios. The model aims to capture the marginal variable distributions and their correlations found in real data. Key contributions include: (i) a GAN model with a pre-trained VAE-based generator for scenario creation, (ii) a specialized loss function prompting the VAE to learn both the overall distributions and the variable correlations, and (iii) empirical evidence of the model’s ability to accurately replicate the underlying marginal distributions and correlations. This approach outperforms existing methods in faithfully reproducing complex distributions, ensuring that the generated instances are consistent with the real datasets. A thorough analysis, employing statistical indices and the Wasserstein distance to compare the generated and real distributions, has been conducted to assess the performance of our model against a Gaussian Copula-based approach. The findings indicate that our model outperforms the Copula-based model without requiring assumptions about the actual marginal distributions. The model architecture presented in this work can be used to solve logistics problems under road speed uncertainty: by capturing the correlations between roads, robust solutions, resilient to the inherent uncertainties in traffic data, can be found. Future advancements may focus on exploring Deep Learning architectures and techniques that incorporate the graph structure during both the training and generation processes. Finally, the proposed model presents opportunities for incorporating the evolution of temporal correlations, potentially through temporal Neural Networks based on attention mechanisms.

Fig. 8 Correlation matrix plots of the PEMS-BAY datasets for the versions with (a) 32 road segments and (b) 48 road segments, where the ith row and jth column refer to the ith and jth marginal variable, respectively. On the left we show the correlation matrices of the empirical data and on the right those of the generated data

Table 7 PEMS-BAY-32 dataset—Wasserstein distances—10 repetitions
Table 8 PEMS-BAY-48 dataset—Wasserstein distances—10 repetitions
Table 9 METR-LA-32 dataset—Wasserstein distances—10 repetitions
Table 10 METR-LA-48 dataset—Wasserstein distances—10 repetitions
Fig. 9 t-SNE visualization of real data (blue), data generated by our approach (orange), and data generated by the copula method (green) for the PEMS-BAY-16/METR-LA-16, PEMS-BAY-32/METR-LA-32, and PEMS-BAY-48/METR-LA-48 datasets