Abstract
Stochastic games are a well established model for multi-agent sequential decision making under uncertainty. In practical applications, though, agents often have only partial observability of their environment. Furthermore, agents increasingly perceive their environment using data-driven approaches such as neural networks trained on continuous data. We propose the model of neuro-symbolic partially-observable stochastic games (NS-POSGs), a variant of continuous-space concurrent stochastic games that explicitly incorporates neural perception mechanisms. We focus on a one-sided setting with a partially-informed agent using discrete, data-driven observations and another, fully-informed agent. We present a new method, called one-sided NS-HSVI, for approximate solution of one-sided NS-POSGs, which exploits the piecewise constant structure of the model. Using neural network pre-image analysis to construct finite polyhedral representations and particle-based representations for beliefs, we implement our approach and illustrate its practical applicability to the analysis of pedestrian-vehicle and pursuit-evasion scenarios.
1 Introduction
Strategic reasoning is essential to ensure stable multi-agent coordination in complex environments, e.g., autonomous driving or multi-robot planning. Partially-observable stochastic games (POSGs) are a natural model for settings involving multiple agents, uncertainty and partial information. They allow the synthesis of optimal (or near-optimal) strategies and equilibria that guarantee expected outcomes, even in adversarial scenarios. But POSGs also present significant challenges: key problems are undecidable, already for the single-agent case of partially observable Markov decision processes (POMDPs) [24], and practical algorithms for finding optimal values and strategies are lacking.
Computational tractability can be improved using one-sided POSGs, a subclass of two-agent, zero-sum POSGs where only one agent has partial information while the other agent is assumed to have full knowledge of the state [40, 41]. This can be useful when making worst-case assumptions about one agent, such as in an adversarial setting (e.g., an attacker-defender scenario) or a safety-critical domain (e.g., a pedestrian in an autonomous driving application).
From a computational perspective, one-sided POSGs avoid the need for nested beliefs [39], i.e., reasoning about beliefs not only over states but also over opponents’ beliefs. This is because the fully-informed agent can reconstruct beliefs from observation histories. Recent advances [19] have led to the first practical variant of heuristic search value iteration (HSVI) [31] for computing approximately optimal values and strategies in (finite) one-sided POSGs.
However, in many realistic autonomous coordination scenarios, agents perceive continuous environments using data-driven observation functions, typically implemented as neural networks (NNs). Examples include autonomous vehicles using NNs to perform object recognition or to estimate pedestrian intention, and NN-enabled vision in an airborne pursuit-evasion scenario.
In this paper, we introduce one-sided neuro-symbolic POSGs (NS-POSGs), a variant of continuous-space POSGs that explicitly incorporates neural perception mechanisms. We assume one partially-informed agent with a (finite-valued) observation function synthesised in a data-driven fashion, and a second agent with full observation of the (continuous) state. Continuous-space models with neural perception mechanisms have already been developed, but are limited to the simpler cases of POMDPs [36] and (fully-observable) stochastic games [33]. Our model provides the ability to reason about an agent with a realistic perception mechanism and operating in an adversarial or worst-case setting.
Solving continuous-space models, even approximately, is computationally challenging. One approach is to discretise and then use techniques for finite-state models (e.g., [19] in our case). But this can yield exponential growth of the state space, depending on the granularity and time-horizon used. Furthermore, decision boundaries for data-driven perception are typically irregular and can be misaligned with gridding schemes for discretisation, limiting precision.
An alternative is to exploit structure in the underlying model and work directly with the continuous-state model. For example, classic dynamic programming approaches to solving MDPs can be lifted to continuous-state variants [12]: a piecewise constant representation of the value function is computed, based on a partition of the state space created dynamically during solution. This approach has been demonstrated to outperform discretisation and to generalise to solving POMDPs. We can adapt it to models with neural perception mechanisms [36], exploiting the fact that ReLU NN classifiers induce a finite decomposition of the continuous environment into polyhedra.
Contributions. The contributions of this paper are as follows. We first define the model of one-sided NS-POSGs and motivate it via an autonomous driving scenario based on a ReLU NN classifier for pedestrian intention learnt from public datasets [28]. We then prove that the (discounted reward) value function for NS-POSGs is continuous and convex, and is a fixed point of a minimax operator. Based on mild assumptions about the model, we give a piecewise linear and convex representation of the value function, which admits a finite polyhedral representation and which is closed with respect to the minimax operator.
In order to provide a feasible approach to approximating values of NS-POSGs, we present a variant of HSVI, which is a popular anytime algorithm for POMDPs that iteratively computes lower and upper bounds on values. We build on ideas from HSVI for finite one-sided POSGs [19] (but there are multiple challenges when moving to a continuous state space and NNs) and for POMDPs with neural perception mechanisms [36] (but, for us, the move to games brings a number of complications); see Sect. 6 for a detailed discussion.
We implement our one-sided NS-HSVI algorithm using the popular particle-based representation for beliefs and employing NN pre-image computation [25] to construct an initial finite polyhedral representation of perception functions. We apply this to the pedestrian-vehicle interaction scenario and a pursuit-evasion game inspired by mobile robotics applications, demonstrating the ability to synthesise agent strategies for models with complex perception functions, and to explore trade-offs when using perception mechanisms of varying precision.
Related Work. Solving POSGs is largely intractable. Methods based on exact dynamic programming [17] and approximations [11, 23] exist but have high computational cost. Further approaches exist for zero-sum POSGs, including conversion to extensive-form games [3], counterfactual regret minimisation [21, 22, 42] and methods based on reinforcement learning and search [5, 26]. In [9], an HSVI-like finite-horizon solver that provably converges to an \(\varepsilon \)-optimal solution is proposed; [32] provides convexity and concavity results but no algorithmic solution.
Methods exist for one-sided POSGs: a space partition approach when actions are public [40], a point-based approximate algorithm when observations are continuous [41] and projection to POMDPs based on factored representations [7]. But these are all restricted to finite-state games. Closer to our work, but still for finite models, is [19], which proposes an HSVI method for POSGs.
For the continuous-state but single-agent (POMDP) setting, point-based value iteration [6, 27, 38] and discrete space approximation [4] can be used; the former also uses \(\alpha \)-functions but works with (approximate) Gaussian mixtures or beta-densities, whereas we exploit structure, similarly to [12]. As discussed above, in earlier work, we proposed models and techniques for extending several simpler probabilistic models with neural perception mechanisms [33, 34, 36]. Recent work [37] builds on the one-sided NS-POSG model proposed in this paper, but focuses instead on online methods for strategy synthesis.
2 Background
POSGs. The semantics of our models are continuous-state partially observable concurrent stochastic games (POSGs) [5, 18, 21]. Letting \(\mathbb {P}(X)\) denote the space of probability measures on a Borel space X, POSGs are defined as follows.
A two-player POSG is a tuple \(\textsf{G}= (N, S, A, \delta , \mathcal {O},Z)\), where: \(N=\{1, 2\}\) is a set of two agents; S a Borel measurable set of states; \(A \triangleq A_1 {\times } A_2\) a finite set of joint actions where \(A_i\) are actions of agent i; \(\delta : (S {\times } A) \rightarrow \mathbb {P}(S)\) a probabilistic transition function; \(\mathcal {O}\triangleq \mathcal {O}_1 {\times } \mathcal {O}_2\) a finite set of joint observations where \(\mathcal {O}_i\) are observations of agent i; and \(Z: (S {\times } A {\times } S) \rightarrow \mathcal {O}\) an observation function.
In a state s of a POSG \(\textsf{G}\), each agent i selects an action \(a_i\) from \(A_i\). The probability to move to a state \(s'\) is \(\delta (s, (a_1, a_2))(s')\), and the subsequent observation is \(Z(s, (a_1, a_2), s')=(o_1, o_2)\), where agent i can only observe \(o_i\). A history of \(\textsf{G}\) is a sequence of states and joint actions \(\pi =(s^{0}, a^{0}, s^{1}, \dots , a^{t - 1}, s^t)\) such that \(\delta (s^k,a^k)(s^{k+1})>0\) for each k. For a history \(\pi \), we denote by \(\pi (k)\) the \((k{+}1)\)th state, and \(\pi [k]\) the \((k{+}1)\)th action. A (local) action-observation history (AOH) is the view of a history \(\pi \) from agent i’s perspective: \(\pi _i = (o_{i}^0, a_{i}^0, o_{i}^1,\dots , a_{i}^{t - 1}, o_{i}^t)\). If an agent has full information about the state, then we assume the agent is also informed of the history of joint actions. Let \( FPaths _{\textsf{G}}\) and \( FPaths _{\textsf{G},i}\) denote the sets of finite histories of \(\textsf{G}\) and AOHs of agent i, respectively.
A (behaviour) strategy of agent i is a mapping \(\sigma _i: FPaths _{\textsf{G},i} \rightarrow \mathbb {P}(A_i)\). We denote by \(\varSigma _i\) the set of strategies of agent i. A profile \(\sigma =(\sigma _1, \sigma _2)\) is a pair of strategies for each agent and we denote by \(\varSigma =\varSigma _1\times \varSigma _2\) the set of profiles.
Objectives. Agents 1 and 2 maximise and minimise, respectively, the expected value of the discounted reward \(Y(\pi ) = \sum _{k=0}^{\infty } \beta ^k r(\pi (k), \pi [k])\), where \(\pi \) is an infinite history, \(r : (S {\times } A) \rightarrow \mathbb {R}\) a reward structure and \(\beta \in (0, 1)\). The expected value of Y starting from state distribution b under profile \(\sigma \) is denoted \(\mathbb {E}_b^{\sigma } [Y]\).
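For intuition, the discounted objective truncated to a finite history prefix is a plain geometric sum; a minimal Python sketch (with illustrative names, not part of the formal model):

```python
def discounted_return(rewards, beta):
    """Finite prefix of Y = sum_k beta^k * r(pi(k), pi[k]),
    given the sequence of stage rewards along a history."""
    return sum((beta ** k) * r for k, r in enumerate(rewards))
```

For example, stage rewards \(1, 1\) with \(\beta = 0.5\) give \(1 + 0.5 \cdot 1 = 1.5\).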
Values and Minimax Strategies. If \(V^\star (b) \triangleq \sup \nolimits _{\sigma _1 \in \varSigma _1}\inf \nolimits _{\sigma _2 \in \varSigma _2} \mathbb {E}_b^{\sigma _1,\sigma _2}[Y]\) = \(\inf \nolimits _{\sigma _2 \in \varSigma _2} \sup \nolimits _{\sigma _1 \in \varSigma _1}\mathbb {E}_b^{\sigma _1,\sigma _2}[Y]\) for all \(b \in \mathbb {P}(S)\), then \(V^\star \) is called the value of \(\textsf{G}\). A profile \(\sigma ^{\star } = (\sigma _1^{\star }, \sigma _2^{\star })\) is a minimax strategy profile if, for any \(b \in \mathbb {P}(S)\), \(\smash {\mathbb {E}_b^{\sigma _1^{\star },\sigma _2}[Y] \ge \mathbb {E}_b^{\sigma _1^{\star },\sigma _2^{\star }}[Y] \ge \mathbb {E}_b^{\sigma _1,\sigma _2^{\star }}[Y]}\) for all \(\sigma _1 \in \varSigma _1\) and \(\sigma _2 \in \varSigma _2\).
3 One-Sided Neuro-Symbolic POSGs
We now introduce our model, which targets commonly deployed multi-agent scenarios with data-driven perception and therefore requires continuous environments.
One-Sided NS-POSGs. A one-sided neuro-symbolic POSG (NS-POSG) comprises a partially informed, neuro-symbolic agent and a fully informed agent in a continuous-state environment. The first agent has a finite set of local states, and is endowed with a data-driven perception mechanism, through which (and only through which) it makes finite-valued observations of the environment’s state, stored locally as percepts. The second agent can directly observe both the local state and percept of the first agent, and the state of the environment.
Definition 1
(NS-POSG). A one-sided NS-POSG \(\textsf{C}\) comprises agents \(\textsf{Ag}_1 = (S_1,A_1, obs _1,\delta _1)\) and \(\textsf{Ag}_2 {=} (A_2)\), and environment \(E {=} (S_E,\delta _E)\), where:
- \(S_1 = Loc _1 {\times } Per _1\) is a set of states for \(\textsf{Ag}_1\), where \( Loc _1\) and \( Per _1\) are finite sets of local states and percepts, respectively;
- \(S_E\subseteq \mathbb {R}^e\) is a closed set of continuous environment states;
- \(A_i\) is a finite set of actions for \(\textsf{Ag}_i\) and \(A \triangleq A_1 {\times } A_2\) is a set of joint actions;
- \( obs _1 : ( Loc _1 {\times } S_E)\rightarrow Per _1\) is \(\textsf{Ag}_1\!\)’s perception function;
- \(\delta _1: (S_1 {\times } A) \rightarrow \mathbb {P}( Loc _1)\) is \(\textsf{Ag}_1\!\)’s local probabilistic transition function;
- \(\delta _E: ( Loc _1 {\times } S_E {\times } A) \rightarrow \mathbb {P}(S_E)\) is a finitely-branching probabilistic transition function for the environment.
One-sided NS-POSGs are a subclass of two-agent, hybrid-state POSGs with discrete observations (\(S_1\)) and actions for \(\textsf{Ag}_1\), and continuous observations (\(S_1 {\times } S_E\)) and discrete actions for \(\textsf{Ag}_2\). Additionally, \(\textsf{Ag}_1\) is informed of its own actions and \(\textsf{Ag}_2\) of joint actions. Thus, \(\textsf{Ag}_1\) is partially informed, without access to environment states and actions of \(\textsf{Ag}_2\), and \(\textsf{Ag}_2\) is fully informed. Since \(\textsf{Ag}_2\) needs no percepts, its local state and transition function are omitted.
The game executes as follows. A global state of \(\textsf{C}\) comprises a state \(s_1 = ( loc _1, per _1)\) for \(\textsf{Ag}_1\) and an environment state \(s_E\). In state \(s=(s_1,s_E)\), the two agents concurrently choose one of their actions, resulting in a joint action \(a=(a_1,a_2)\in A\). Next, the local state of \(\textsf{Ag}_1\) is updated to some \( loc _1'\in Loc _1\), according to \(\delta _1(s_1,a)\). At the same time, the environment state is updated to some \(s_E'\in S_E\) according to \(\delta _E( loc _1, s_E,a)\). Finally, the first agent \(\textsf{Ag}_1\), based on \( loc _1'\), generates a percept \( per _1' = obs _1( loc '_1,s'_E)\) by observing the environment state \(s_E'\) and \(\textsf{C}\) reaches the global state \(s'=(( loc _1', per _1'), s_E')\).
We focus on neural perception functions, i.e., for each local state \( loc _1\), we associate an NN classifier \(f_{ loc _1} : S_E \rightarrow \mathbb {P}( Per _1)\) that returns a distribution over percepts for each environment state \(s_E \in S_E\). Then \( obs _1( loc _1, s_E) = f^{\max }_{ loc _1}(s_E)\), where \(f^{\max }_{ loc _1}(s_E)\) is the percept with the largest probability in \(f_{ loc _1}(s_E)\) (a tie-breaking rule is applied if multiple percepts have the largest probability).
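A minimal sketch of such a perception function, using a tiny hand-set ReLU network rather than a trained classifier (all weights and names here are illustrative assumptions):

```python
import numpy as np

def f_loc(s_E, W1, b1, W2, b2):
    """Tiny ReLU classifier returning a distribution over percepts."""
    h = np.maximum(W1 @ s_E + b1, 0.0)    # ReLU hidden layer
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

def obs_1(s_E, params):
    """obs_1(loc_1, s_E) = f_loc^max(s_E): the most probable percept,
    with ties broken towards the smallest percept index
    (np.argmax returns the first maximiser)."""
    return int(np.argmax(f_loc(s_E, *params)))
```

In the paper, a classifier \(f_{ loc _1}\) per local state is learnt from data; here the weights would be loaded from the trained network.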
Motivating Example: Pedestrian-Vehicle Interaction. A key challenge for autonomous driving in urban environments is predicting pedestrians’ intentions or actions. One solution is NN classifiers, e.g., trained on video datasets [28, 29]. To illustrate our NS-POSG model, we consider decision making for an autonomous vehicle using an NN-based intention estimation model for a pedestrian at a crossing [28]. We use their simpler “vanilla” model, which takes two successive (relative) locations of the pedestrian (the top-left coordinates \((x_1,y_1)\) and \((x_2,y_2)\) of two fixed size bounding boxes around the pedestrian) and classifies its intention as: unlikely, likely or very likely to cross. We train a feed-forward NN classifier with ReLU activation functions over the PIE dataset [28].
We build this perception mechanism into an NS-POSG model of a vehicle yielding at a pedestrian crossing, based on [13], illustrated in Fig. 1. A pedestrian further ahead at the side of the road may decide to cross and the vehicle must decide how to adapt its speed. The first, partially-informed agent represents the vehicle. It observes the environment (comprising the successive pedestrian locations) using the NN-based perception mechanism to predict the pedestrian’s intention. This is stored as a percept and its speed as its local state. The vehicle chooses between selected (positive or negative) acceleration actions. The second agent, the pedestrian, is fully informed, providing a worst-case analysis of the vehicle decisions, and can decide to cross or return to the roadside. The goal of the vehicle is to minimise the likelihood of a collision with the pedestrian, which is achieved by associating a negative reward with this event.
Figure 1 also shows selected slices of the state space decomposition obtained by computing the pre-image [25] of the learnt NN classifier, for each of the three predicted intentions. The decision boundaries are non-trivial, justifying our goal of performing a formal analysis, but some intuitive characteristics can be seen. When \(x_2 \ge x_1\), meaning that the pedestrian is stationary or moving away from the road, it will generally be classified as unlikely to cross. We also see that the prediction model is cautious when its first observation of the pedestrian is made from a greater distance. More details are in [35].
One-Sided NS-POSG Semantics. A one-sided NS-POSG \(\textsf{C}\) induces a POSG \(\llbracket {\textsf{C}} \rrbracket \), where we restrict to states that are percept compatible, i.e., where \( per _1 = obs _1( loc _1, s_E)\) for \(s = (( loc _1, per _1), s_E)\). The semantics of a one-sided NS-POSG is closed with respect to percept compatible states.
Definition 2
(Semantics). Given a one-sided NS-POSG \(\textsf{C}\), as in Definition 1, its semantics is the POSG \(\llbracket {\textsf{C}} \rrbracket = (N,S,A, \delta , \mathcal {O}, Z)\) where:
- \(N=\{1, 2\}\) is a set of two agents and \(A = A_1 \times A_2\);
- \(S \subseteq S_1 \times S_E\) is the set of percept compatible states;
- for \(s=(s_1,s_E),s'=(s_1',s_E')\in S\) and \(a \in A\) where \(s_1=( loc _1, per _1)\) and \(s_1'=( loc _1', per _1')\), we have \(\delta (s,a)(s')= \delta _1(s_1, a)( loc '_1) \delta _E( loc _1, s_E,a)(s_E')\);
- \(\mathcal {O}= \mathcal {O}_1 \times \mathcal {O}_2\), where \(\mathcal {O}_1 = S_1\) and \(\mathcal {O}_2 = S\);
- \(Z(s, a, s') = (s_1', s')\) for \(s \in S\), \(a \in A\) and \(s'=(s_1', s_E') \in S\).
Strategies. As \(\llbracket {\textsf{C}} \rrbracket \) is a POSG, we consider (behaviour) strategies for the two agents. Since \(\textsf{Ag}_2\) is fully informed, it can recover the beliefs of \(\textsf{Ag}_1\), thus removing nested beliefs. Hence, the AOHs of \(\textsf{Ag}_2\) are equal to the histories of \(\llbracket {\textsf{C}} \rrbracket \), i.e., \( FPaths _{\llbracket {\textsf{C}} \rrbracket ,2} = FPaths _{\llbracket {\textsf{C}} \rrbracket } \). We also consider the stage strategies at a history of \(\llbracket {\textsf{C}} \rrbracket \), which will later be required for solving the induced zero-sum normal-form games in the minimax operator. For a history \(\pi \) of \(\llbracket {\textsf{C}} \rrbracket \), a stage strategy for \(\textsf{Ag}_1\) is a distribution \(u_1 \in \mathbb {P}(A_1)\) and a stage strategy for \(\textsf{Ag}_2\) is a function \(u_2: S \rightarrow \mathbb {P}(A_2)\), i.e., \(u_2 \in \mathbb {P}(A_2 \mid S)\).
Beliefs. Since \(\textsf{Ag}_1\) is partially informed, it may need to infer the current state from its AOH. For an \(\textsf{Ag}_1\) state \(s_1 = ( loc _1, per _1)\), we let \(S_{E}^{s_1}\) be the set of environment states compatible with \(s_1\), i.e., \( S_{E}^{s_1} = \{ s_E \in S_E \mid obs _1( loc _1, s_E) = per _1 \}\). Since the states of \(\textsf{Ag}_1\) are also the observations of \(\textsf{Ag}_1\) and states of \(\llbracket {\textsf{C}} \rrbracket \) are percept compatible, a belief for \(\textsf{Ag}_1\), which can also be reconstructed by \(\textsf{Ag}_2\), can be represented as a pair \( b = (s_1, b_1)\), where \(s_1 \in S_1\), \(b_1 \in \mathbb {P}(S_E)\) and \(b_1(s_E) = 0\) for all \(s_E \in S_E \setminus S_E^{s_1}\). We denote by \(S_B\) the set of beliefs of \(\textsf{Ag}_1\).
Given a belief \((s_1, b_1)\), if action \(a_1\) is selected by \(\textsf{Ag}_1\), \(\textsf{Ag}_2\) is assumed to take stage strategy \(u_2 \in \mathbb {P}(A_2 \mid S)\) and \(s_1'\) is observed, then the updated belief of \(\textsf{Ag}_1\) via Bayesian inference is denoted \(\smash {(s_1', b_1^{s_1,a_1,u_2,s_1'})}\); see [35] for details.
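With a particle-based representation of \(b_1\) (as used in Sect. 6), this Bayesian update can be sketched as follows; the model functions passed in (`u2`, `delta1`, `succ_E`, `obs1`) are illustrative placeholders for the assumed stage strategy \(u_2\), \(\delta _1\), the finitely-branching \(\delta _E\) and \( obs _1\):

```python
def belief_update(particles, s1, a1, s1_new, u2, delta1, succ_E, obs1):
    """Bayesian update of a particle belief (list of (s_E, weight) pairs)
    after Ag1 plays a1 under assumed stage strategy u2 and observes s1_new.
    succ_E enumerates the finitely many environment successors with their
    probabilities; percept-incompatible successors are discarded."""
    loc1, per1 = s1
    loc1_new, per1_new = s1_new
    new_particles = {}
    for s_E, w in particles:
        for a2, p_a2 in u2(s1, s_E).items():       # Ag2's stage strategy
            p_loc = delta1(s1, (a1, a2)).get(loc1_new, 0.0)
            for s_E_new, p_env in succ_E(loc1, s_E, (a1, a2)):
                if obs1(loc1_new, s_E_new) != per1_new:
                    continue                       # percept-incompatible
                wt = w * p_a2 * p_loc * p_env
                new_particles[s_E_new] = new_particles.get(s_E_new, 0.0) + wt
    total = sum(new_particles.values())            # normalisation constant
    return [(s, wt / total) for s, wt in new_particles.items()]
```

This is only a sketch under the stated assumptions; the full definition of \(\smash {b_1^{s_1,a_1,u_2,s_1'}}\) is in [35].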
4 Values of One-Sided NS-POSGs
We establish the value function of a one-sided NS-POSG \(\textsf{C}\) with semantics \(\llbracket {\textsf{C}} \rrbracket \), which gives the minimax expected reward from an initial belief, and show its convexity and continuity. Next, to compute it, we introduce minimax and maxsup operators specialised for one-sided NS-POSGs, and prove their equivalence. Finally, we provide a fixed-point characterisation of the value function.
Value Function. We assume a fixed reward structure r and discount factor \(\beta \). The value function of \(\textsf{C}\) represents the minimax expected reward in each possible initial belief of the game, given by \(V^{\star } : S_B \rightarrow \mathbb {R}\), where \(V^{\star }(s_1, b_1) = \mathbb {E}_{(s_1, b_1)}^{\sigma ^{\star }}[Y]\) for all \((s_1, b_1) \in S_B\) and \(\sigma ^{\star }\) is a minimax strategy profile of \(\llbracket {\textsf{C}} \rrbracket \).
The value function for zero-sum POSGs may not exist when the state space is uncountable [2, 14, 30] as in our case. In this paper, we only consider one-sided NS-POSGs that are determined, i.e., for which the value function exists.
Convexity and Continuity. Since r is bounded, the value function \(V^{\star }\) has lower and upper bounds \(L = \min \nolimits _{s \in S, a \in A} r(s,a) /(1- \beta )\) and \( U = \max \nolimits _{s \in S, a \in A} r(s,a)/(1- \beta )\). The proof of the following and all other results can be found in [35].
Theorem 1
(Convexity and continuity). For \(s_1 \in S_1\), \(V^{\star }(s_1, \cdot ) : \mathbb {P}(S_E) \rightarrow \mathbb {R}\) is convex and continuous, and for \(b_1, b_1' \in \mathbb {P}(S_E):\) \(|V^{\star }(s_1, b_1) - V^{\star }(s_1, b_1')| \le K(b_1, b_1')\) where \( K(b_1, b_1') = \frac{1}{2} (U - L) \int _{s_E \in S_E^{s_1}} \big | b_1(s_E) - b_1'(s_E) \big | d s_E \).
Minimax and Maxsup Operators. We give a fixed-point characterisation of the value function \(V^{\star }\), first introducing a minimax operator and then simplifying to an equivalent maxsup variant. The latter will be used in Sect. 5 to prove closure of our representation for value functions and in Sect. 6 to formulate HSVI. For \(f: S \rightarrow \mathbb {R}\) and belief \((s_1, b_1)\), let \(\langle f, (s_1, b_1) \rangle = \, { \int _{s_E \in S_E}} f(s_1, s_E) b_1(s_E) d s_E\) and \(\mathbb {F}(S_B)\) denote the space of functions mapping the beliefs \(S_B\) to reals \(\mathbb {R}\).
Definition 3
(Minimax). The minimax operator \(T : \mathbb {F}(S_B) {\rightarrow } \mathbb {F}(S_B)\) is given by:
\[ [TV](s_1, b_1) = \max _{u_1 \in \mathbb {P}(A_1)} \min _{u_2 \in \mathbb {P}(A_2 \mid S)} \big ( \mathbb {E}_{(s_1,b_1),u_1,u_2}[r(s,a)] + \beta \textstyle \sum _{(a_1, s_1') \in A_1 \times S_1} \mathbb {P}_{(s_1,b_1)}^{u_1,u_2}(a_1, s_1') V(s_1', b_1^{s_1,a_1,u_2,s_1'}) \big ) \]
for \(V \in \mathbb {F}(S_B)\) and \((s_1, b_1) \in S_B\), where \(\mathbb {P}_{(s_1,b_1)}^{u_1,u_2}(a_1, s_1')\) is the probability that \(\textsf{Ag}_1\) takes action \(a_1\) and next observes \(s_1'\) under \((u_1,u_2)\) from belief \((s_1,b_1)\), and \(\mathbb {E}_{(s_1,b_1),u_1,u_2}[r(s,a)] = \int _{s_E \in S_E} b_1(s_E)\) \(\sum _{(a_1, a_2) \in A} u_1(a_1) u_2(a_2 \mid s_1, s_E ) r((s_1, s_E),(a_1,a_2)) d s_E \).
Motivated by [19], which proposed an equivalent operator for the discrete case, we instead prove that the minimax operator has an equivalent simplified form over convex continuous functions of \(\mathbb {F}(S_B)\).
For \(\varGamma \subseteq \mathbb {F}(S)\), we let \(\varGamma ^{A_1 \times S_1}\) denote the set of vectors of elements of the convex hull of \(\varGamma \) indexed by \(A_1 {\times } S_1\). Furthermore, for \(u_1 \in \mathbb {P}(A_1)\), \(\overline{\alpha } = (\alpha ^{a_1,s_1'} )_{(a_1,s_1') \in A_1 \times S_1} \in \varGamma ^{A_1 \times S_1}\) and \(a_2 \in A_2\), we define \(f_{u_1,\overline{\alpha }, a_2} : S \rightarrow \mathbb {R}\) to be the function such that, for \(s \in S\):
\[ f_{u_1,\overline{\alpha },a_2}(s) = \textstyle \sum _{a_1 \in A_1} u_1(a_1) \big ( r(s,(a_1,a_2)) + \beta \sum _{s_1' \in S_1} \sum _{s_E' \in S_E} \delta (s, (a_1, a_2))(s_1', s_E') \, \alpha ^{a_1,s_1'}(s_1', s_E') \big ) \quad (2) \]
where the sum over \(s_E'\) is due to the finite branching of \(\delta (s, (a_1, a_2))\).
Definition 4
(Maxsup). For \(\varnothing \ne \varGamma \subseteq \mathbb {F}(S)\), if \(V(s_1, b_1) = \sup _{\alpha \in \varGamma } \langle \alpha , (s_1,b _1) \rangle \) for \((s_1, b_1) \in S_B\), then the maxsup operator \(T_\varGamma : \mathbb {F}(S_B) \rightarrow \mathbb {F}(S_B)\) is defined as \([T_\varGamma V](s_1, b_1) = { \max _{u_1\in \mathbb {P}(A_1)} } { \sup _{\overline{\alpha } \in \varGamma ^{A_1 \times S_1} } } \langle f_{u_1,\overline{\alpha }}, (s_1, b_1) \rangle \) for \((s_1, b_1) \in S_B\) where \(f_{u_1,\overline{\alpha }} (s) = \min _{a_2 \in A_2} f_{u_1,\overline{\alpha },a_2} (s)\) for \(s \in S\).
In the maxsup operator, \(u_1\) and \(\overline{\alpha }\) are aligned with \(\textsf{Ag}_1 \!\)’s goal of maximising the objective, where \(u_1\) is over action distributions and \(\overline{\alpha }\) is over convex combinations of elements of \(\varGamma \). The minimisation by \(\textsf{Ag}_2\) is simplified to an optimisation over its finite action set in the function \(f_{u_1, \overline{\alpha }}\). Note that each state may require a different minimiser \(a_2\), as \(\textsf{Ag}_2\) knows the current state before taking an action.
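Since the functions \(f_{u_1,\overline{\alpha },a_2}\) are PWC (Sect. 5), the pointwise minimisation over \(A_2\) can be carried out region by region; a minimal sketch over an assumed shared FCP, with regions as dictionary keys and the per-region minimiser recorded as \(\textsf{Ag}_2\!\)’s pure best-response action:

```python
def pointwise_min(f_by_a2):
    """Given f_{u1,alpha,a2} as region->value maps over a shared FCP,
    return f_{u1,alpha} = min_a2 f_{u1,alpha,a2} together with the
    minimising action of Ag2 in each region (illustrative sketch)."""
    regions = next(iter(f_by_a2.values())).keys()
    f_min, best_response = {}, {}
    for phi in regions:
        a2_star = min(f_by_a2, key=lambda a2: f_by_a2[a2][phi])
        best_response[phi] = a2_star          # pure best response on phi
        f_min[phi] = f_by_a2[a2_star][phi]
    return f_min, best_response
```

Note that different regions may select different minimising actions, matching the observation that \(\textsf{Ag}_2\) may require a different minimiser per state.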
The maxsup operator avoids the minimisation over Markov kernels with continuous states in the original minimax operator. Given \(u_1\) and \(\overline{\alpha }\), the minimisation can induce a pure best-response stage strategy \(u_2 \in \mathbb {P}(A_2 \mid S)\) such that, for any \(s \in S\), \(u_2(a_2' \mid s) = 1\) for some \(a_2' \in \arg \min _{a_2 \in A_2} f_{u_1,\overline{\alpha },a_2} (s)\). Using Theorem 1, the operator equivalence and fixed-point result are as follows.
Theorem 2
(Operator equivalence and fixed point). For \(\varnothing \ne \varGamma \subseteq \mathbb {F}(S)\), if \(V(s_1, b_1) = \sup _{\alpha \in \varGamma } \langle \alpha , (s_1,b _1) \rangle \) for \((s_1, b_1) \in S_B\), then the minimax operator T and maxsup operator \(T_\varGamma \) are equivalent and their unique fixed point is \(V^{\star }\).
5 P-PWLC Value Iteration
We next discuss a representation for value functions using piecewise constant (PWC) \(\alpha \)-functions, called P-PWLC (piecewise linear and convex under PWC), originally introduced in [36]. This representation extends the \(\alpha \)-functions of [6, 27, 38] for continuous-state POMDPs, but a key difference is that we work with polyhedral representations (induced precisely from NNs) rather than approximations based on Gaussian mixtures [27] or beta densities [15].
We show that, given PWC representations for an NS-POSG’s perception, reward and transition functions, and under mild assumptions on model structure, P-PWLC value functions are closed with respect to the minimax operator. This yields a (non-scalable) value iteration algorithm and, subsequently, the basis for a more practical point-based HSVI algorithm in Sect. 6.
PWC Representations. A finite connected partition (FCP) of S, denoted \(\varPhi \), is a finite collection of disjoint connected regions (subsets) of S that cover it.
Definition 5
(PWC function). A function \(f: S \rightarrow \mathbb {R}\) is piecewise constant (PWC) if there exists an FCP \(\varPhi \) of S such that \(f : \phi \rightarrow \mathbb {R}\) is constant for \(\phi \in \varPhi \). Let \(\mathbb {F}_{C}(S)\) be the set of PWC functions in \(\mathbb {F}(S)\).
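A PWC function can thus be stored finitely as region–value pairs; a minimal sketch with regions given by membership predicates (simple intervals here, polyhedra in the actual implementation):

```python
class PWCFunction:
    """A PWC function over S, stored as a list of (region, value) pairs,
    where each region of the FCP is given by a membership predicate."""
    def __init__(self, pieces):
        self.pieces = pieces  # list of (predicate, value)
    def __call__(self, s):
        for inside, value in self.pieces:
            if inside(s):
                return value
        raise ValueError("state not covered by the FCP")

# Example: a PWC function on S_E = [0, 2] with two regions.
alpha = PWCFunction([
    (lambda s: 0.0 <= s < 1.0, 1.0),
    (lambda s: 1.0 <= s <= 2.0, -0.5),
])
```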
Since we focus on NNs for \(\textsf{Ag}_1\!\)’s perception function \( obs _1\), it is PWC (as for the one-agent case [36]) and the state space S of a one-sided NS-POSG can be decomposed into a finite set of regions, each with the same observation. Formally, there exists a perception FCP \(\varPhi _{P}\), the smallest FCP of S such that all states in any \(\phi \in \varPhi _{P}\) are observationally equivalent, i.e., if \((s_1,s_E),(s_1',s_E')\in \phi \), then \(s_1=s_1'\). We can use \(\varPhi _P\) to find the set \(\smash {S_E^{s_1}}\) for any agent state \(s_1 \in S_1\). Given an NN representation of \( obs _1\), the corresponding FCP \(\varPhi _P\) can be extracted (or approximated) offline by analysing its pre-image [25].
We also need to make some assumptions about the transitions and rewards of one-sided NS-POSGs (in a similar style to [36]). Informally, we require that, for any decomposition \(\varPhi '\) of the state-space into regions (i.e., an FCP), there is a second decomposition \(\varPhi \), the pre-image FCP, such that states in regions of \(\varPhi \) have the same rewards and transition probabilities into regions of \(\varPhi '\). The transitions of the (continuous) environment must also be decomposable into regions.
Assumption 1
(Transitions and rewards). Given any FCP \(\varPhi '\) of S, there exists an FCP \(\varPhi \) of S, called the pre-image FCP of \(\varPhi '\), where for \(\phi \in \varPhi \), \(a \in A\) and \(\phi ' \in \varPhi '\) there exists \(\delta _{\varPhi } : (\varPhi {\times } A) \rightarrow \mathbb {P}(\varPhi ')\) and \(r_{\varPhi } : (\varPhi {\times } A) \rightarrow \mathbb {R}\) such that \(\delta (s, a)(s') = \delta _{\varPhi }(\phi , a)(\phi ')\) and \(r(s, a) = r_{\varPhi }(\phi , a)\) for \(s \in \phi \) and \(s' \in \phi '\). In addition, \(\delta _E\) can be expressed in the form \(\sum _{i=1}^{n} \mu _i \delta _E^i\), where \(n \in \mathbb {N}\), \(\mu _i\in [0,1]\), \(\sum _{i = 1}^{n} \mu _i = 1\) and \(\delta _E^i : ( Loc _1 {\times } S_E {\times } A) \rightarrow S_E\) are piecewise continuous functions.
The need for this assumption also becomes clear in our later algorithms, which compute a representation for an NS-POSG’s value function over a (polyhedral) partition of the state space. This partition is created dynamically over the iterations of the solution, using a pre-image based splitting operation.
We now show, using results for continuous-state POMDPs [27, 36], that \(V^{\star }\) is the limit of a sequence of \(\alpha \)-functions, called piecewise linear and convex under PWC \(\alpha \)-functions, first introduced in [36] for neuro-symbolic POMDPs.
Definition 6
(P-PWLC Function). A function \(V: S_B \rightarrow \mathbb {R}\) is piecewise linear and convex under PWC \(\alpha \)-functions (P-PWLC) if there exists a finite set \(\varGamma \subseteq \mathbb {F}_C(S)\) such that \(V(s_1, b_1) = \max _{\alpha \in \varGamma } \langle \alpha , (s_1, b_1) \rangle \) for \((s_1, b_1) \in S_B\), where the functions in \(\varGamma \) are called PWC \(\alpha \)-functions.
If \(V \in \mathbb {F}(S_B)\) is P-PWLC, then it can be represented by a set of PWC functions over S, i.e., as a finite set of FCP regions and a value vector. Recall that \(\langle \alpha , (s_1, b_1) \rangle = \int _{s_E \in S_E} \alpha (s_1, s_E) b_1(s_E) d s_E\), and therefore computing the value for a belief involves integration. For one-sided NS-POSGs, we demonstrate, under Assumption 1, closure of the P-PWLC representation for value functions under the minimax operator and the convergence of value iteration.
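Under a particle-based belief, the integral \(\langle \alpha , (s_1, b_1) \rangle \) reduces to a weighted sum, so a P-PWLC value is a maximum of weighted sums; a minimal sketch (illustrative names):

```python
def value_p_pwlc(alphas, s1, particles):
    """Evaluate V(s1, b1) = max_alpha <alpha, (s1, b1)> at a particle
    belief, where each alpha is a (PWC) function of (s1, s_E) and
    particles is a list of (s_E, weight) pairs summing to 1."""
    return max(sum(w * alpha(s1, s_E) for s_E, w in particles)
               for alpha in alphas)
```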
LP, Closure Property and Convergence. By showing that \(f_{u_1, \overline{\alpha }, a_2}\) in (2) is PWC in S (see [35]), we use Theorem 2 to demonstrate that, if V is P-PWLC, the minimax operation can be computed by solving an LP.
Lemma 1
(LP for Minimax and P-PWLC). If \(V \in \mathbb {F}(S_B)\) is P-PWLC, then \([TV](s_1, b_1)\) is given by an LP for \((s_1, b_1) \in S_B\).
Using Lemma 1, we show that the P-PWLC representation is closed under the minimax operator. This closure property enables iterative computation of a sequence of such functions to approximate \(V^{\star }\) to within a convergence guarantee.
Theorem 3
(P-PWLC closure and convergence). If \(V \in \mathbb {F}(S_B)\) is P-PWLC, then so is [TV]. If \(V^0 \in \mathbb {F}(S_B)\) is P-PWLC, then the sequence \((V^t)_{t = 0}^{\infty }\), such that \(V^{t+1} = [TV^t]\), is P-PWLC and converges to \(V^\star \).
An implementation of value iteration for one-sided NS-POSGs is therefore feasible, since each \(\alpha \)-function involved is PWC and thus admits a finite representation. However, since the number of \(\alpha \)-functions grows exponentially with the number of iterations, this approach does not scale in practice.
6 Heuristic Search Value Iteration for NS-POSGs
To provide a more practical approach to solving one-sided NS-POSGs, we now present a variant of HSVI (heuristic search value iteration) [31], an anytime algorithm that approximates the value function \(V^\star \) via lower and upper bound functions, updated through heuristically generated beliefs.
Our approach broadly follows the structure of HSVI for finite POSGs [19], but every step presents challenges when extending to continuous states and NN-based observations. In particular, we must work with integrals over beliefs and deal with uncountability, using P-PWLC (rather than PWLC) functions for lower bounds, and therefore different ingredients to prove convergence. Value computations are also much more complex because the NN perception functions induce FCPs, which are used to compute images, pre-images and intersections.
We also build on ideas from HSVI for (single-agent) neuro-symbolic POMDPs in [36]. The presence of two opposing agents brings three main challenges. First, value backups at belief points require solving normal-form games instead of maximising over one agent’s actions. Second, since the first agent is not informed of the joint action, the value backups and belief updates of the minimax operator must consider uncountably many stage strategies of the second agent, whereas, in the single-agent variant, the agent can decide the transition probabilistically on its own. Third, the forward exploration heuristic is more complex, as it depends on the stage strategies of both agents in two-stage games.
6.1 Lower and Upper Bound Representations
We first discuss representing and updating the lower and upper bound functions.
Lower Bound Function. Selecting an appropriate representation for \(\alpha \)-functions requires closure properties with respect to the minimax operator. Motivated by [36], we represent the lower bound \(V_{ lb }^{\varGamma } \in \mathbb {F}(S_B)\) as the P-PWLC function for a finite set \(\varGamma \subseteq \mathbb {F}_C(S)\) of PWC \(\alpha \)-functions (see Definition 6), for which closure is guaranteed by Theorem 3. The lower bound \(V_{ lb }^{\varGamma }\) has a finite representation, since each \(\alpha \)-function is PWC, and is initialised as in [19].
Upper Bound Function. The upper bound \(V_{ ub }^{\varUpsilon } \in \mathbb {F}(S_B)\) is represented by a finite set of belief-value points \(\varUpsilon = \{ ((s_1^i, b_1^i), y_i) \in S_B \times \mathbb {R} \mid i \in I \}\), where \(y_i\) is an upper bound of \(V^{\star }(s_1^i, b_1^i)\). Similarly to [36], for any \((s_1, b_1) \in S_B\), the upper bound \(V_{ ub }^{\varUpsilon }(s_1, b_1)\) is the lower envelope of the lower convex hull of the points in \(\varUpsilon \) satisfying the following LP problem: minimise
for \(i \in I_{s_{{1}}}\) where \(I_{s_1} = \{ i \in I \mid s_1^i = s_1 \}\) and \(K_{ ub } : \mathbb {P}(S_E) \times \mathbb {P}(S_E) \rightarrow \mathbb {R}\) measures the difference between two beliefs such that, if K is the function from Theorem 1, then for any \(b_1, b_1', b_1'' \in \mathbb {P}(S_E)\): \(K_{ ub }(b_1, b_1) = 0\),
Note that (3) is close to the upper bound in regular HSVI for finite-state spaces, except for the function \(K_{ ub }\) that measures the difference between two beliefs (two continuous-state functions). With respect to the upper bound used in [36], \(K_{ ub }\) here needs to satisfy an additional triangle property in (4) to ensure the continuity of \(V_{ ub }^{\varUpsilon }\), for the convergence of the point-based algorithm below. The properties of \(K_{ ub }\) imply that (3) is an upper bound after a value backup, as stated in Lemma 3 below. The upper bound \(V_{ ub }^{\varUpsilon }\) is initialised as in [19].
Lower Bound Updates. For the lower bound \(V_{ lb }^{\varGamma }\), in each iteration we add a new PWC \(\alpha \)-function \(\alpha ^{\star }\) to \(\varGamma \) at a belief \((s_1, b_1) \in S_{B}\) such that:
where the second equality follows from Lemma 1 and \((\overline{p}_1^{\star }, \overline{\alpha }^{\star })\) is computed via the optimal solution to the LP in Lemma 1 at \((s_1, b_1)\).
Using \(\overline{p}_1^\star \), \(\overline{\alpha }^{\star }\) and the perception FCP \(\varPhi _P\), Algorithm 1 computes a new \(\alpha \)-function \(\alpha ^{\star }\) at belief \((s_1, b_1)\). To guarantee (5) and improve efficiency, we only compute the backup values for regions \(\phi \in \varPhi _P\) over which \((s_1, b_1)\) has positive probability, i.e., \(s^\phi _1=s_1\) (where \(s^\phi _1\) is the unique agent state appearing in \(\phi \)) and \(\int _{(s_1, s_E) \in \phi } b_1(s_E) d s_E > 0\), and assign the trivial lower bound L otherwise.
For each region \(\phi \), either \(\alpha ^{\star }(\hat{s}_1,\hat{s}_E) = f_{\overline{p}_1^\star , \overline{\alpha }^{\star }} (\hat{s}_1,\hat{s}_E)\) or \(\alpha ^{\star }(\hat{s}_1,\hat{s}_E) = L\) for all \((\hat{s}_1,\hat{s}_E) \in \phi \). Computing the backup values in line 4 of Algorithm 1 state by state is computationally intractable, as \(\phi \) contains an infinite number of states. However, the following lemma shows that \(\alpha ^{\star }\) is PWC, allowing a tractable region-by-region backup, called Image-Split-Preimage-Product (ISPP) backup, which is adapted from the single-agent variant in [36]. The details of the ISPP backup for one-sided NS-POSGs are in [35]. The lemma also shows that the lower bound function increases and remains valid after each update.
Lemma 2
(Lower bound). The function \(\alpha ^{\star }\) generated by Algorithm 1 is a PWC \(\alpha \)-function satisfying (5), and if \(\varGamma ' = \varGamma \cup \{\alpha ^{\star }\}\), then \(V_{ lb }^{\varGamma } \le V_{ lb }^{\varGamma '} \le V^{\star }\).
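The positive-probability region filter used by Algorithm 1 can be sketched as follows; the interval regions, the value of the trivial bound and the helper names are illustrative assumptions (the actual implementation operates on polyhedra and the full ISPP backup):

```python
# Sketch of the region filter in Algorithm 1: backup values are computed
# only for perception-FCP regions phi over which the current belief
# (s1, b1) has positive probability; all other regions receive the
# trivial lower bound L. Intervals stand in for polyhedral regions.

L_TRIVIAL = -10.0   # trivial lower bound L on achievable value (assumed)

def belief_mass(phi, s1, particles):
    """Probability mass of a particle-based belief inside region phi.
    phi = (s1_phi, lo, hi): the unique agent state of the region and an
    interval; particles = [(sE, kappa), ...] with normalised weights."""
    s1_phi, lo, hi = phi
    if s1_phi != s1:
        return 0.0
    return sum(k for sE, k in particles if lo <= sE < hi)

def region_backup(regions, s1, particles, backup_value):
    """Assign each region its backup value f(phi), or the bound L when
    the belief places no mass on it."""
    alpha_star = []
    for phi in regions:
        if belief_mass(phi, s1, particles) > 0.0:
            alpha_star.append((phi, backup_value(phi)))
        else:
            alpha_star.append((phi, L_TRIVIAL))
    return alpha_star

regions = [(0, 0.0, 1.0), (0, 1.0, 2.0), (1, 0.0, 2.0)]
particles = [(0.5, 1.0)]                       # all mass at sE = 0.5
result = region_backup(regions, 0, particles, lambda phi: 5.0)
print(result[0][1], result[1][1], result[2][1])   # 5.0 -10.0 -10.0
```

Only the first region receives a backup value: the second has no belief mass and the third belongs to a different agent state, so both fall back to the trivial bound.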
Upper Bound Updates. For the upper bound \(V_{ ub }^{\varUpsilon }\), due to representation (3), at a belief \((s_1, b_1) \in S_B\) in each iteration we add a new belief-value point \(((s_1, b_1), y^{\star })\) to \(\varUpsilon \) such that \(y^{\star } = [TV_{ ub }^{\varUpsilon }](s_1, b_1)\). Computing \([TV_{ ub }^{\varUpsilon }](s_1, b_1)\) via (1) and (3) requires the concrete formula for \(K_{ ub }\) and the belief representations, so we show how to compute \([TV_{ ub }^{\varUpsilon }](s_1, b_1)\) when introducing belief representations below. The following lemma shows that \(y^{\star } \ge V^{\star } (s_1, b_1)\), as required by (3), and that the upper bound function decreases and remains valid after each update.
Lemma 3
(Upper bound). Given a belief \((s_1, b_1) \in S_{B}\), if \(y^{\star } = [TV_{ ub }^{\varUpsilon }](s_1, b_1)\), then \(y^\star \) is an upper bound of \(V^{\star }\) at \((s_1, b_1)\), i.e., \(y^{\star } \ge V^{\star } (s_1, b_1)\), and if \(\varUpsilon ' = \varUpsilon \cup \{ ((s_1, b_1), y^{\star }) \}\), then \(V_{ ub }^{\varUpsilon } \ge V_{ ub }^{\varUpsilon '} \ge V^{\star }\).
6.2 One-Sided NS-HSVI
Algorithm 2 presents our NS-HSVI algorithm for one-sided NS-POSGs.
Forward Exploration Heuristic. The algorithm uses a heuristic to select which belief is considered next. Similarly to finite-state one-sided POSGs [19], we focus on the belief with the highest weighted excess gap. The excess gap at a belief \((s_1, b_1)\) at depth t from the initial belief is defined by \( excess _{t}(s_1, b_1) = V_{ ub }^{\varUpsilon }(s_1, b_1) - V_{ lb }^{\varGamma }(s_1, b_1) - \rho (t)\), where \(\rho (0) = \varepsilon \), \(\rho (t {+} 1) = (\rho (t) - 2(U - L) \bar{\varepsilon }) / \beta \) and \(\bar{\varepsilon } \in (0, (1 - \beta ) \varepsilon / (2U - 2L))\). Using this excess gap, the next action-observation pair \((\hat{a}_1, \hat{s}_1)\) for exploration is selected from:
To compute the next belief via lines 8 and 9 of Algorithm 2, the minimax strategy profiles in stage games \([TV_{ lb }^{\varGamma }](s_1, b_1)\) and \([TV_{ ub }^{\varUpsilon }](s_1, b_1)\), i.e., \((u_1^{ ub }, u_2^{ lb })\), are required. Since \(V_{ lb }^{\varGamma }\) is P-PWLC, using Lemma 1, the strategy \( u_2^{ lb }\) is obtained by solving an LP. However, the computation of the strategy \(u_1^{ ub }\) depends on the representation of \((s_1, b_1)\) and the measure function \(K_{ ub }\), and thus will be discussed later. One-sided NS-HSVI has the following convergence guarantees.
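As a minimal sketch (with assumed values for \(U\), \(L\), \(\beta \) and \(\varepsilon \)), the tolerance sequence \(\rho (t)\) and the excess gap can be computed directly from the definitions above:

```python
# Sketch: the depth-dependent tolerance rho(t) and the excess gap used by
# the forward exploration heuristic. U and L are upper/lower bounds on
# achievable value, beta the discount factor (all values here assumed).

def make_rho(eps, U, L, beta, eps_bar):
    """rho(0) = eps, rho(t+1) = (rho(t) - 2*(U - L)*eps_bar) / beta,
    with eps_bar in (0, (1 - beta)*eps / (2U - 2L))."""
    assert 0.0 < eps_bar < (1.0 - beta) * eps / (2.0 * (U - L))
    def rho(t):
        r = eps
        for _ in range(t):
            r = (r - 2.0 * (U - L) * eps_bar) / beta
        return r
    return rho

def excess(v_ub, v_lb, rho, t):
    """Excess gap at a depth-t belief: bound gap minus the tolerance."""
    return v_ub - v_lb - rho(t)

rho = make_rho(eps=0.1, U=1.0, L=0.0, beta=0.9, eps_bar=0.001)
print(round(rho(0), 6))               # 0.1
print(excess(0.6, 0.3, rho, 0) > 0)   # gap 0.3 exceeds rho(0) -> True
```

Note that \(\rho (t)\) grows with depth (division by \(\beta < 1\) dominates), so exploration terminates once the remaining bound gap falls below the tolerance at the current depth.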
Theorem 4
(One-sided NS-HSVI). For any \((s_{1}^{ init }, b_{1}^{ init }) \in S_B\) and \(\varepsilon > 0\), Algorithm 2 will terminate and upon termination: \(V_{ ub }^{\varUpsilon }(s_{1}^{ init }, b_{1}^{ init }) - V_{ lb }^{\varGamma }(s_{1}^{ init }, b_{1}^{ init }) \le \varepsilon \) and \(V_{ lb }^{\varGamma }(s_{1}^{ init }, b_{1}^{ init }) \le V^{\star }(s_{1}^{ init }, b_{1}^{ init }) \le V_{ ub }^{\varUpsilon }(s_{1}^{ init }, b_{1}^{ init })\).
6.3 Belief Representation and Computations
Implementing one-sided NS-HSVI requires a closed-form representation of beliefs. We use the popular particle-based representation [10, 27], which can approximate arbitrary beliefs and handle non-Gaussian systems. However, compared to region-based representations [36], it is more vulnerable to disturbances and can require many particles for a good approximation.
Particle-Based Beliefs. A particle-based belief \((s_1, b_1) \in S_B\) is represented by a weighted particle set \(\{ (s_E^i, \kappa _i) \}_{i=1}^{n_b}\) with a normalised weight \(\kappa _i\) for each particle \(s_E^i \in S_E\), where \(b_1(s_E) = \smash {\sum \nolimits _{i = 1}^{n_b}} \kappa _i D(s_E - s_E^i)\) for \(s_E \in S_E\) and \(D\) is a Dirac delta function centred at 0 (so \(D(s_E - s_E^i)\) places all mass at \(s_E^i\)).
To implement one-sided NS-HSVI using particle-based beliefs, we prove that \(V_{ lb }^{\varGamma }\) and \(V_{ ub }^{\varUpsilon }\) are eligible representations, since the belief update \(\smash {b_1^{s_1, a_1, u_2, s_1'}}\), expected values \(\langle \alpha , (s_1, b_1) \rangle \) and \(\langle r, (s_1, b_1) \rangle \), and probability \(P(a_1, s_1' \mid (s_1,b_1), u_1, u_2)\) can all be computed as simple summations for a particle-based belief \((s_1, b_1)\) (see [35]).
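As a minimal illustration of why these quantities reduce to summations (the function names are assumptions, not the paper's API), the Dirac-delta form of \(b_1\) collapses any expectation over \(S_E\) to a finite weighted sum:

```python
# Sketch: with a particle-based belief, integrals over S_E collapse to
# finite weighted sums, because b1 is a mixture of Dirac deltas.

def normalise(particles):
    """Normalise particle weights so they sum to 1."""
    total = sum(k for _, k in particles)
    return [(sE, k / total) for sE, k in particles]

def expect(f, s1, particles):
    """<f, (s1, b1)> = sum_i kappa_i * f(s1, sE_i), normalised particles."""
    return sum(k * f(s1, sE) for sE, k in particles)

particles = normalise([(0.2, 1.0), (0.8, 3.0)])   # weights 0.25, 0.75
reward = lambda s1, sE: sE                        # illustrative reward
print(round(expect(reward, 0, particles), 2))     # 0.65
```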
Lower Bound. Since \(V_{ lb }^{\varGamma }\) is P-PWLC with PWC \(\alpha \)-functions \(\varGamma \), for a particle-based belief \((s_1, b_1)\) represented by \(\{ (s_E^i, \kappa _i) \}_{i=1}^{n_b}\), using Definition 6, \(V_{ lb }^{\varGamma }(s_1, b_1) = \max \nolimits _{\alpha \in \varGamma } \sum _{i=1}^{n_b} \kappa _i \alpha (s_1, s_E^i)\). The stage game \([TV_{ lb }^{\varGamma }](s_1, b_1)\) and minimax strategy profile \((u_1^{ lb },u_2^{ lb } )\) follow from solving the LP in Lemma 1.
Upper Bound. To compute \(V_{ ub }^{\varUpsilon }\) in (3), we need a function \(K_{ ub }\) measuring belief differences that satisfies (4). We take \(K_{ ub } = K\), which satisfies (4) by definition. Given \(\varUpsilon = \{ ((s_1^i, b_1^i), y_i) \mid i \in I\}\), the upper bound and the stage game can each be computed by solving an LP, as demonstrated by the following theorem; the minimax strategy profile \((u_1^{ ub }, u_2^{ ub })\) is then synthesised (see [35]).
Theorem 5
(LPs for upper bound). For a particle-based belief \((s_1, b_1) \in S_B\), \(V_{ ub }^{\varUpsilon }(s_1,b_1)\) and \([TV_{ ub }^{\varUpsilon }](s_1,b_1)\) are each the optimal value of an LP.
7 Experimental Evaluation
We have built a prototype implementation in Python, using Gurobi [16] to solve the LPs needed for computing lower and upper bound values, and the minimax values and strategies of one-shot games. We use the Parma Polyhedra Library [1] to operate over polyhedral pre-images of NNs, \(\alpha \)-functions and reward structures.
Our evaluation uses two one-sided NS-POSG examples: a pursuit-evasion game and the pedestrian-vehicle scenario from Sect. 3. Below, we discuss the applicability and usefulness of our techniques on these examples. Due to limited space, we refer to [35] for more details of the models, including the training of the ReLU NN classifiers, and empirical results on performance.
Pursuit-Evasion. A pursuit-evasion game models a pursuer trying to catch an evader aiming to avoid capture. We build a continuous-space variant of the model from [19] inspired by mobile robotics applications [8, 20]. The environment state includes the exact positions of both agents. The (partially informed) pursuer uses an NN classifier to perceive its own location, which maps to one of \(3{\times }3\) grid cells. To showcase the ability of our methodology to assess the performance of realistic NN perception functions, we train two NNs, the second with coarser accuracy.
Figure 2 shows simulations of strategies synthesised for the pursuer, using the two different NNs. Its actual location is a red dot, and the pink arrows denote the strategy. Blue squares show the cell that is output by the pursuer’s perception function, and black lines mark the underlying polyhedral decomposition. The pursuer’s belief over the evader’s location is shown by the green shading and annotated probabilities; it initially (correctly) believes that the evader is in cell C and the belief evolves based on the optimal counter-strategy of the evader.
The plots show we can synthesise non-trivial strategies for agents using NN-based perception in a partially observable setting. We can also study the impact of a poorly trained perception function. Figure 2(b), for the coarser NN, shows the pursuer repeatedly mis-detecting its location because the grid cells' shapes are poorly approximated, and subsequently taking incorrect actions. This is exploited by the evader, leading to considerably worse performance for the pursuer.
Pedestrian-Vehicle Interaction. Figure 3 shows several simulations from strategies synthesised for the pedestrian-vehicle example described in Sect. 3 (Fig. 1), plotting the position \((x_2,y_2)\) of the pedestrian, relative to the vehicle. We fix the pedestrian’s strategy to simulate a crossing scenario: it moves from right to left, i.e., decreasing \(x_2\). The (partially informed) vehicle’s perception function predicts the intention of the pedestrian (green/yellow/red = unlikely/likely/very likely to cross), shown as coloured dots. Above and below each dot, we indicate the acceleration actions taken (black) and current speeds (orange), respectively, which determine the distance \(y_2\) to the pedestrian crossing.
Again, we investigate the feasibility of generating strategies for agents with realistic NN-based perception. Here, the goal is to avoid a crash scenario, denoted by the shaded region at the bottom left of the plots. We find that, in many cases, safe strategies can be synthesised. Figure 3(a) shows an example; notice that the pedestrian intention is detected early. This is not true in (b) and (c), which show two simulations from a strategy and starting point where the perception function results in much later detection; (c) shows we were then unable to synthesise a strategy for the vehicle that is always safe.
8 Conclusions
We have proposed one-sided neuro-symbolic POSGs, designed to reason formally about partially observable agents equipped with neural perception mechanisms. We characterised the value function for discounted infinite-horizon rewards, and designed, implemented and evaluated a HSVI algorithm for approximate solution. Computational complexity is high due to expensive polyhedral operations. Nevertheless, our method provides an important baseline that can reason about true decision boundaries for game models with NN-based perception, against which efficiency improvements can later be benchmarked. We plan to investigate ways to improve performance, e.g., merging of adjacent polyhedra or Monte-Carlo planning methods, and to study restricted cases of two-sided NS-POSGs, e.g., those with public observations [18].
References
Bagnara, R., Hill, P.M., Zaffanella, E.: The Parma Polyhedra Library: toward a complete set of numerical abstractions for the analysis and verification of hardware and software systems. Sci. Comput. Program. 72(1), 3–21 (2008). https://www.bugseng.com/ppl
Bhabak, A., Saha, S.: Partially observable discrete-time discounted Markov games with general utility. arXiv:2211.07888 (2022)
Bosansky, B., Kiekintveld, C., Lisy, V., Pechoucek, M.: An exact double-oracle algorithm for zero-sum extensive-form games with imperfect information. J. Artif. Intell. Res. 51, 829–866 (2014)
Brechtel, S., Gindele, T., Dillmann, R.: Solving Continuous POMDPs: value iteration with incremental learning of an efficient space representation. In: Proceedings of ICML’13, pp. 370–378. PMLR (2013)
Brown, N., Bakhtin, A., Lerer, A., Gong, Q.: Combining deep reinforcement learning and search for imperfect-information games. In: Proceedings of NeurIPS’20, pp. 17057–17069. Curran Associates, Inc. (2020)
Burks, L., Loefgren, I., Ahmed, N.R.: Optimal continuous state POMDP planning with semantic observations: a variational approach. IEEE Trans. Rob. 35(6), 1488–1507 (2019)
Carr, S., Jansen, N., Bharadwaj, S., Spaan, M.T., Topcu, U.: Safe policies for factored partially observable stochastic games. In: Robotics: Science and System XVII (2021)
Chung, T.H., Hollinger, G.A., Isler, V.: Search and pursuit-evasion in mobile robotics. Auton. Robot. 31(4), 299–316 (2011)
Delage, A., Buffet, O., Dibangoye, J.S., Saffidine, A.: HSVI can solve zero-sum partially observable stochastic games. Dyn. Games Appl., 1–55 (2023)
Doucet, A., Freitas, N., Gordon, N. (eds.): Sequential Monte Carlo Methods in Practice. Springer, New York, NY (2001). https://doi.org/10.1007/978-1-4757-3437-9
Emery-Montemerlo, R., Gordon, G., Schneider, J., Thrun, S.: Approximate solutions for partially observable stochastic games with common payoffs. In: Proceedings of AAMAS’04, pp. 136–143. IEEE (2004)
Feng, Z., Dearden, R., Meuleau, N., Washington, R.: Dynamic programming for structured continuous Markov decision problems. In: Proceedings of UAI’04, pp. 154–161 (2004)
Fu, T., Miranda-Moreno, L., Saunier, N.: A novel framework to evaluate pedestrian safety at non-signalized locations. Accid. Anal. Prev. 111, 23–33 (2018)
Ghosh, M.K., McDonald, D., Sinha, S.: Zero-sum stochastic games with partial information. J. Optim. Theory Appl. 121, 99–118 (2004)
Guestrin, C., Hauskrecht, M., Kveton, B.: Solving factored MDPs with continuous and discrete variables. In: Proceedings of UAI’04, pp. 235–242 (2004)
Gurobi Optimization, LLC: Gurobi Optimizer Reference Manual (2021). https://www.gurobi.com
Hansen, E.A., Bernstein, D.S., Zilberstein, S.: Dynamic programming for partially observable stochastic games. In: Proceedings of AAAI’04, vol. 4, pp. 709–715 (2004)
Horák, K., Bošanskỳ, B.: Solving partially observable stochastic games with public observations. In: Proceedings of AAAI’19, vol. 33, pp. 2029–2036 (2019)
Horák, K., Bošanskỳ, B., Kovařík, V., Kiekintveld, C.: Solving zero-sum one-sided partially observable stochastic games. Artif. Intell. 316, 103838 (2023)
Isler, V., Nikhil, K.: The role of information in the cop-robber game. Theoret. Comput. Sci. 399(3), 179–190 (2008)
Kovařík, V., Schmid, M., Burch, N., Bowling, M., Lisỳ, V.: Rethinking formal models of partially observable multiagent decision making. Artif. Intell. 303, 103645 (2022)
Kovařík, V., Seitz, D., Lisỳ, V., Rudolf, J., Sun, S., Ha, K.: Value functions for depth-limited solving in zero-sum imperfect-information games. Artif. Intell. 314, 103805 (2023)
Kumar, A., Zilberstein, S.: Dynamic programming approximations for partially observable stochastic games. In: Proceedings of FLAIRS’09, pp. 547–552 (2009)
Madani, O., Hanks, S., Condon, A.: On the undecidability of probabilistic planning and related stochastic optimization problems. Artif. Intell. 147(1–2), 5–34 (2003)
Matoba, K., Fleuret, F.: Computing preimages of deep neural networks with applications to safety (2020). https://openreview.net/forum?id=FN7_BUOG78e
Moravčík, M., et al.: DeepStack: expert-level artificial intelligence in heads-up no-limit poker. Science 356(6337), 508–513 (2017)
Porta, J.M., Vlassis, N., Spaan, M.T., Poupart, P.: Point-based value iteration for continuous POMDPs. J. Mach. Learn. Res. 7, 2329–2367 (2006)
Rasouli, A., Kotseruba, I., Kunic, T., Tsotsos, J.K.: PIE: a large-scale dataset and models for pedestrian intention estimation and trajectory prediction. In: Proceedings of ICCV’19, pp. 6262–6271 (2019)
Rasouli, A., Kotseruba, I., Tsotsos, J.K.: Are they going to cross? A benchmark dataset and baseline for pedestrian crosswalk behavior. In: Proceedings of ICCV’17, pp. 206–213 (2017)
Saha, S.: Zero-sum stochastic games with partial information and average payoff. J. Optim. Theory Appl. 160(1), 344–354 (2014)
Smith, T., Simmons, R.: Heuristic search value iteration for POMDPs. In: Proceedings of UAI’04, pp. 520–527. AUAI (2004)
Wiggers, A.J., Oliehoek, F.A., Roijers, D.M.: Structure in the value function of two-player zero-sum games of incomplete information. Front. Artif. Intell. Appl. 285, 1628–1629 (2016)
Yan, R., Santos, G., Norman, G., Parker, D., Kwiatkowska, M.: Strategy synthesis for zero-sum neuro-symbolic concurrent stochastic games. arXiv:2202.06255 (2022)
Yan, R., Santos, G., Duan, X., Parker, D., Kwiatkowska, M.: Finite-horizon equilibria for neuro-symbolic concurrent stochastic games. In: Proceedings of UAI’22, pp. 2170–2180. AUAI Press (2022)
Yan, R., Santos, G., Norman, G., Parker, D., Kwiatkowska, M.: Partially observable stochastic games with neural perception mechanisms. arXiv:2310.11566 (2023)
Yan, R., Santos, G., Norman, G., Parker, D., Kwiatkowska, M.: Point-based value iteration for POMDPs with neural perception mechanisms. arXiv:2306.17639 (2023)
Yan, R., Santos, G., Norman, G., Parker, D., Kwiatkowska, M.: HSVI-based online minimax strategies for partially observable stochastic games with neural perception mechanisms. In: Proceedings of L4DC’24 (2024)
Zamani, Z., Sanner, S., Poupart, P., Kersting, K.: Symbolic dynamic programming for continuous state and observation POMDPs. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
Zettlemoyer, L., Milch, B., Kaelbling, L.: Multi-agent filtering with infinitely nested beliefs. In: Advances in Neural Information Processing Systems, vol. 21 (2008)
Zheng, W., Jung, T., Lin, H.: The Stackelberg equilibrium for one-sided zero-sum partially observable stochastic games. Automatica 140, 110231 (2022)
Zheng, W., Jung, T., Lin, H.: Continuous-observation one-sided two-player zero-sum partially observable stochastic game with public actions. IEEE Trans. Autom. Control, 1–15 (2023)
Zinkevich, M., Johanson, M., Bowling, M., Piccione, C.: Regret minimization in games with incomplete information. In: Advances in Neural Information Processing Systems, vol. 20 (2007)
Acknowledgements
This project was funded by the ERC under the European Union’s Horizon 2020 research and innovation programme (FUN2MODEL, grant agreement No.834115).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2025 The Author(s)
Yan, R., Santos, G., Norman, G., Parker, D., Kwiatkowska, M. (2025). Partially Observable Stochastic Games with Neural Perception Mechanisms. In: Platzer, A., Rozier, K.Y., Pradella, M., Rossi, M. (eds) Formal Methods. FM 2024. Lecture Notes in Computer Science, vol 14933. Springer, Cham. https://doi.org/10.1007/978-3-031-71162-6_19
DOI: https://doi.org/10.1007/978-3-031-71162-6_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-71161-9
Online ISBN: 978-3-031-71162-6