Projection based inverse reinforcement learning for the analysis of dynamic treatment regimes

Shah, Syed Ihtesham Hussain; De Pietro, Giuseppe; Paragliola, Giovanni; Coronato, Antonio

doi:10.1007/s10489-022-04173-0


Open access
Published: 21 October 2022

Volume 53, pages 14072–14084, (2023)

Applied Intelligence


Syed Ihtesham Hussain Shah ORCID: orcid.org/0000-0002-6390-1864^1,2,
Giuseppe De Pietro²,
Giovanni Paragliola² &
…
Antonio Coronato^2,3


Abstract

Dynamic Treatment Regimes (DTRs) are adaptive treatment strategies that allow clinicians to personalize dynamically the treatment for each patient based on their step-by-step response to their treatment. There are a series of predefined alternative treatments for each disease and any patient may associate with one of these treatments according to his/her demographics. DTRs for a certain disease are studied and evaluated by means of statistical approaches where patients are randomized at each step of the treatment and their responses are observed. Recently, the Reinforcement Learning (RL) paradigm has also been applied to determine DTRs. However, such approaches may be limited by the need to design a true reward function, which may be difficult to formalize when the expert knowledge is not well assessed, as when the DTR is in the design phase. To address this limitation, an extension of the RL paradigm, namely Inverse Reinforcement Learning (IRL), has been adopted to learn the reward function from data, such as those derived from DTR trials. In this paper, we define a Projection Based Inverse Reinforcement Learning (PB-IRL) approach to learn the true underlying reward function for given demonstrations (DTR trials). Such a reward function can be used both to evaluate the set of DTRs determined for a certain disease, as well as to enable an RL-based intelligent agent to self-learn the best way and then act as a decision support system for the clinician.


1 Introduction

A dynamic treatment regime (DTR), also known as an adaptive treatment strategy, is a sequence of tailored treatment decision rules. It specifies how the treatments should be adjusted over time according to the dynamic state of the patient [1]. Each rule takes input information about the patient, e.g. medical history, laboratory measurements, demographics, etc., and recommends treatment options that aim to optimize the effectiveness of the treatment program [2]. Predefined DTRs are typically evaluated by adopting the Sequential Multiple Assignment Randomized Trial (SMART) method [3], which is a sequence of observations and treatments. The SMART design requires that a group of patients be randomized initially among the possible treatments and then re-randomized based on the response of the patient at successive stages of the treatment.

The most relevant goal of a SMART study is the determination of the optimal DTR. Statistical techniques [4] have been adopted to achieve this goal. Patients with similar conditions may have different reactions to given treatments. Some patients respond appropriately, some do not respond at all, and others may have complications (even leading to a worse condition). Communities and groups update best practices over time, since there are no ground truths with respect to best treatments [5, 6]. This makes the existing system inexpedient [7]. On the other hand, Reinforcement Learning (RL) techniques learn and update themselves through trial and error. In clinical medicine, RL can be used to assign an optimal regime to patients with similar characteristics. Moreover, RL has been widely investigated with the aim of identifying an optimal DTR. For example, Q-learning, which fits linear outcome models in a recursive manner, is among the earliest methods used to identify an optimal DTR [8]. That is why RL techniques have an edge over existing supervised learning techniques.

If we consider treatments as “actions” and patient responses as “states” then data involved in a SMART study fit naturally into the RL paradigm [9]. RL algorithms [10, 11] provide a framework for decision making [12] through learning a “policy” [13]. The policy indicates what action to perform in each situation in order to achieve a certain goal.

The RL model analyzes the “expected reward” of each DTR trial (a sequence of patient responses and recommended treatments) in order to identify the optimal DTR. An optimal DTR is the sequence of treatments with respect to the patient’s condition (the patient response) which ultimately result in a positive change in the patient’s health. Rewards are the numeric values that represent the effectiveness of the treatment for the current response of the patient. The higher the value of the rewards, the more effective the treatment will be.

However, in many RL problems the reward function may be extremely scattered under this setting and it may be difficult to recognize which actions are useful to obtain the ultimate feedback [14, 15]. Clinically guided reward functions can help in identifying compact rewards [16] but this requires expert knowledge which is not available in the SMART method.

On the other hand, the proposed IRL can mitigate these problems associated with RL methods [17, 18]. “DTR trials” are the sequences of recommended treatments and patient responses. The proposed IRL aims at learning the reward function for these “DTR trials”. The values of the rewards indicate how good or bad these “DTR trials” are. In other words, the values of the rewards indicate how good the response of the patient is in relation to each recommended treatment. Once we have obtained the reward values for the DTRs, we can easily find the optimal DTR (the one which has the highest reward value) for a particular patient.

In this paper, we have proposed an IRL [19] technique to find the best fitting reward function for each “DTR trial”. We refer to “DTR trials” as “demonstrations” in the rest of the paper. Given some information about the patient (i.e., location, cultural level and gender), some demonstrations (sequences of the patient responses and recommended treatments) are generated. Our aim is to find the reward functions for these demonstrations, where the “reward” (as described above) is a signal that gives feedback to the decision-maker, indicating how good the recommended treatment is in terms of the patient response. To achieve this goal, we assume that the reward function can be expressed as a linear function of known “features” and “weight vectors”. The weight vector is updated by finding policies similar to the demonstrated policies. A random search for similar policies can be time-consuming, especially in large state spaces. For this reason, we initially searched for random policies and later updated these policies by mixing them with their weights (explained in Section 4).

The results show that the proposed method is a quick and easy way to find the best fitting rewards for demonstrations. Once the true reward function is obtained, it can be used either to identify the optimal DTR or to train an RL-based agent that could act as a Decision Support System for the clinician to treat the patient.

The rest of the paper is organized as follows: Section 2 comprises a quick review of the background and problem formulation, where introductory concepts about the Markov Decision Process (MDP), RL and IRL are presented. We will discuss a case scenario in Section 3. The proposed approach and system model are detailed in Section 4. This section describes the problem of learning the reward function not explicitly, but through observing trajectories or expert demonstrations. A discussion about the experimental model, data description and results is presented in Section 5. The rewards learned from the proposed algorithm are tested on existing RL models in Section 6. Related work is described in Section 7 where a comparison between different methods of Imitation Learning (IL) in DTRs is discussed. We summarize the paper in Section 8.

2 Preliminaries

2.1 Reinforcement Learning (RL)

The idea underlying RL [20] is to interact with the surrounding environment without having any prior knowledge and to learn from these interactions how to achieve a goal. An RL problem is typically described through an MDP. It holds the Markov property, which does not consider past information while taking actions in a current state. An MDP is a tuple (S, A, T, R, γ) [21] as explained in Table 1.

Table 1 Description of related notations


The goal of RL methods is to learn the behavior of the environment through repetitive interactions. The component that interacts is called the Agent. An agent takes an action a_t in a state s_t at a time step t and the environment returns the next state s_{t+1} and reward r_{t+1}. The aim of the Agent is to learn an optimal policy. The policy (an RL term) indicates what action to take in a state in order to maximize the cumulative reward. A value function V^π(s) at a state s is the expected reward obtained by following a policy π [22]:

$$ V^{\pi}(s) = \sum\limits_{a\in A} \pi(a \vert s) \sum\limits_{s^{\prime} \in S, r} T(s, a, s^{\prime}) \left\{ r(s,a) + \gamma V^{\pi}(s^{\prime}) \right\} $$

(1)

Where r is the reward function, $T (s ,a , s^{\prime })$ is the transition probability and γ is the discount factor as described in Table 1.
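As an illustration of Eq. (1), the following minimal sketch evaluates a fixed policy by applying the equation repeatedly until the values stop changing. The nested-list layout of `pi`, `T` and `r`, the function name `evaluate_policy`, and the two-state toy problem are assumptions made for this example and are not taken from the paper.

```python
def evaluate_policy(pi, T, r, gamma, tol=1e-10, max_iter=10_000):
    """Apply Eq. (1) repeatedly until V changes by less than tol.

    pi[s][a]    : probability of choosing action a in state s
    T[s][a][s2] : probability of moving from s to s2 under action a
    r[s][a]     : immediate reward r(s, a)
    """
    n_states = len(T)
    V = [0.0] * n_states
    for _ in range(max_iter):
        V_new = []
        for s in range(n_states):
            v = 0.0
            for a, p_a in enumerate(pi[s]):
                # Inner sum of Eq. (1) over the next states s2.
                v += p_a * sum(p * (r[s][a] + gamma * V[s2])
                               for s2, p in enumerate(T[s][a]))
            V_new.append(v)
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical toy problem: state 0 moves to the absorbing state 1 and earns reward 1.
T = [[[0.0, 1.0]], [[0.0, 1.0]]]
r = [[1.0], [0.0]]
pi = [[1.0], [1.0]]
V = evaluate_policy(pi, T, r, gamma=0.9)
```

In this toy problem the only reward is collected on leaving state 0, so the evaluation settles after two sweeps.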

The goal of the RL agent is to find the optimal policy π^∗ (the one with the highest value). The value function V(s) for the optimal policy π^∗ using the Bellman equation is:

$$ V^{*} (s)= \max_{a} \sum\limits_{s^{\prime},r} T (s ,a , s^{\prime}) [ r + \gamma V^{*}(s^{\prime})] $$

(2)

Equation 2 returns the best possible value V^∗ by exploring all the states S.
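The Bellman optimality backup in Eq. (2) can be sketched as a plain Value Iteration loop. The code below is illustrative only: the data layout, the helper `backup`, and the toy transition table are assumptions for this example, and the reward is written as r(s, a) to match Eq. (1).

```python
def value_iteration(T, r, gamma, tol=1e-10, max_iter=10_000):
    """Iterate Eq. (2) and return the converged values and a greedy policy."""
    n_states = len(T)

    def backup(V, s, a):
        # Expected return of taking action a in state s, then following V.
        return sum(p * (r[s][a] + gamma * V[s2]) for s2, p in enumerate(T[s][a]))

    V = [0.0] * n_states
    for _ in range(max_iter):
        V_new = [max(backup(V, s, a) for a in range(len(T[s])))
                 for s in range(n_states)]
        done = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if done:
            break
    policy = [max(range(len(T[s])), key=lambda a: backup(V, s, a))
              for s in range(n_states)]
    return V, policy


# Hypothetical toy problem: in state 0, action 0 stays put (reward 0) and
# action 1 moves to the absorbing state 1 (reward 1).
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
r = [[0.0, 1.0], [0.0, 0.0]]
V_star, pi_star = value_iteration(T, r, gamma=0.5)
```

The greedy policy is read off the converged values with the same backup that produced them.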

In some RL algorithms (e.g. SARSA), the linear functions of features ϕ_i(s,a) are used to find the optimal policy [23].

$$ Q_{w} (s,a)=\sum\limits_{i} w_{i} \phi_{i} (s,a) $$

(3)

Where Q(s,a) is the Q-value function at each state s, ϕ_i(s,a) is the feature function ϕ : S → [0,1]^k and w is the weight vector that can be updated as:

$$ \begin{array}{@{}rcl@{}} w_{i} &\leftarrow& w_{i} + \eta \delta \phi_{i}(s,a) \\ \delta &=& r + \gamma Q_{w}(s^{\prime}, a^{\prime}) - Q_{w}(s,a) \end{array} $$

(4)

Where η is the step size, γ is the discount factor, and $s^{\prime }$, $a^{\prime }$ are the next state and action, respectively. In many applications, it is difficult to specify the reward function. These reward functions are sometimes very sparse and inaccurate. Therefore, to calculate the optimal policy in an uncertain environment, researchers have exploited IRL techniques to find the best fitting reward function for a set of given trajectories or expert demonstrations.
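A single SARSA step with the linear approximation of Eqs. (3) and (4) might look as follows. This is a sketch under stated assumptions: the function names and the two-dimensional feature vectors are invented for illustration, and each weight is moved along its own feature component, as the gradient of Eq. (3) suggests.

```python
def q_value(w, phi):
    """Eq. (3): Q_w(s, a) as a weighted sum of the features of (s, a)."""
    return sum(wi * fi for wi, fi in zip(w, phi))


def sarsa_update(w, phi_sa, reward, phi_next, eta, gamma):
    """Eq. (4): temporal-difference error followed by the weight update."""
    delta = reward + gamma * q_value(w, phi_next) - q_value(w, phi_sa)
    w_new = [wi + eta * delta * fi for wi, fi in zip(w, phi_sa)]
    return w_new, delta


# Hypothetical features for the current and the next state-action pair.
w_new, delta = sarsa_update(
    w=[0.0, 0.0], phi_sa=[1.0, 0.0], reward=1.0,
    phi_next=[0.0, 1.0], eta=0.5, gamma=0.9,
)
```

Only the weights of features that are active in the current state-action pair change, which is why the second weight stays at zero here.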

2.2 Inverse Reinforcement Learning (IRL)

IRL, also known as Learning from Demonstration (LD) [24], is the process of learning the preferences of an expert (agent) by observing its behavior and avoiding the specification of the reward function. IRL flips the RL problem and attempts to find an underlying reward function that is supposed to explain the expert behavior perfectly. Expert trajectories (sequences of states and actions), also known as expert demonstrations, represent the optimal policy. A policy may be optimal for many different reward functions. Therefore, the goal here is to find the true reward function that is being implicitly optimized by the optimal policy π^∗.

Linear Programming (LP) IRL [25] assumes that the expected value for an expert policy is always higher than that of all other policies. The expectation of the value function (as defined in (1)) is calculated as:

$$ \forall s_{0} \in S, \forall a \in A: \quad \mathbb{E}_{s^{\prime} \sim T(s^{\prime} \vert s, \pi(s))} \{V^{*}(s^{\prime})\} \geq \mathbb{E}_{s^{\prime} \sim T(s^{\prime} \vert s, a)} \{V^{\pi}(s^{\prime})\} $$

(5)

$$ \max \sum\limits_{s\in S_{0}} \underset{a \in A \vert \pi(s)}{\min} \left\{ \mathbb{E}_{s^{\prime} \sim T(s^{\prime} \vert s, \pi(s))} \{V^{*}(s^{\prime})\} - \mathbb{E}_{s^{\prime} \sim T(s^{\prime} \vert s, a)} \{V^{\pi}(s^{\prime})\} \right\} $$

(6)

This formulation maximizes the smallest difference between the expected value generated by the expert policy and the expected value generated by non-expert policies. The reward function that maximizes this difference is considered to be the true reward function which perfectly explains the expert behavior. These methods examine almost all the policies to find the best one. In a very large state space, it is very difficult and time-consuming to analyze all the actions in all states. To mitigate this problem, PB-IRL has been proposed to allow the agent to quickly explore underlying reward functions for the expert policy.

3 Use case scenario

Figure 1 shows DTRs for the management of people with alcohol addiction. The patients are categorized according to the input information (e.g. Location, Cultural level and Gender). Each circle with an R denotes a Randomization stage of the DTR. At each stage R, a new treatment is recommended according to the response of the patient. There are three basic treatments that are possible: Medication therapy (MED), Psychological therapy (PSY), or Telephone Monitoring (TM), plus a combination of these treatments, e.g. PSY+MED. Each patient is arbitrarily assigned to one of the available treatments.

A patient is classified as a Responder if he/she has had fewer than two drinking days during the previous 2 months. Otherwise, the patient is classified as a Non-Responder. Consequently, for example, a patient who is a Non-Responder to PSY treatment is re-randomized to MED treatment, or a combination of PSY with MED (PSY+MED) treatment. Instead, participants who are classified as Responders to the initial treatment are checked via TM for an additional period.
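The responder rule and the re-randomization step can be sketched in a few lines. The PSY branch follows the example in the text; the MED branch is an assumption, inferred from the stage-2 actions that the later sections list after an initial MED treatment. All names are hypothetical.

```python
import random


def classify(drinking_days):
    """Responder if fewer than two drinking days in the previous two months."""
    return "Responder" if drinking_days < 2 else "Non-Responder"


# Stage-2 options per (initial treatment, response).
# The MED rows are an assumption inferred from the rest of the paper.
STAGE2_OPTIONS = {
    ("PSY", "Responder"): ["TM"],
    ("PSY", "Non-Responder"): ["MED", "PSY+MED"],
    ("MED", "Responder"): ["TM"],
    ("MED", "Non-Responder"): ["PSY", "PSY+MED"],
}


def next_treatment(first_treatment, drinking_days, rng=random):
    """Classify the patient, then randomize among the admissible stage-2 options."""
    status = classify(drinking_days)
    return status, rng.choice(STAGE2_OPTIONS[(first_treatment, status)])
```

For example, `next_treatment("PSY", 5)` classifies the patient as a Non-Responder and draws either MED or PSY+MED.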

4 Proposed technique

Let us consider a set of demonstrations or trajectories, τ_E as below:

$$ \begin{array}{@{}rcl@{}} \tau_{E} &=& \left[({s^{1}_{1}},{a^{1}_{1}},{s^{1}_{2}},{a^{1}_{2}} , ..,{s^{1}_{d}},{a^{1}_{d}} ),\right. \\ && \left. ({s^{2}_{1}},{a^{2}_{1}},{s^{2}_{2}},{a^{2}_{2}} , ..,{s^{2}_{d}},{a^{2}_{d}}), ...\right] \\ &=& \left[\tau^{1},\tau^{2},\tau^{3} ... \right] \end{array} $$

(7)

A demonstration τⁱ represents the sequence of expert (physician) decisions. It contains the expert policies π_E. In RL, a policy maps a state to an action. In our model, actions (a_i, i ∈ {0,1,..,d}) represent the set of treatments and states (s_t ∈ S) represent the set of patient responses to each recommended treatment.

The goal is to estimate the true underlying reward function for these demonstrations. Rewards are the numerical values which indicate which treatment (action) is most effective against each of the patient’s responses (states).

The value V of a policy is the sum of all the discounted rewards collected by following that policy as given in (1). The expected value E[V_E(s)] is defined as:

$$ E[V_{E} (s)]=E[ \sum\limits^{\infty}_{t=1} \gamma^{t} R(s_{t}) \vert {\pi_{E}}] $$

(8)

$$ =E[ \sum\limits^{\infty}_{t=1} \gamma^{t} w.\phi(s_{t}) \vert {\pi_{E}}] $$

(9)

$$ =w . E[ \sum\limits^{\infty}_{t=1} \gamma^{t} \phi(s_{t}) \vert {\pi_{E}}] $$

(10)

where ϕ represents k known, fixed and bounded basis functions (or feature functions) ϕ : S → [0,1]^k and $w_{i} \in \mathbb{R}$ are weights. We define an expected accumulated feature value vector as a feature expectation μ_E:

$$ \mu_{E} = E[ \sum\limits^{\infty}_{t=1} \gamma^{t} \phi(s_{t}) \vert {\pi_{E}}] $$

(11)

The expected value E[V_E(s)] in (10) can be represented as:

$$ E[V_{E}(s)] = w . \mu_{E} $$

(12)

Expert feature expectations μ_E are calculated by using the expert policies π_E (given in the demonstrations), while estimated feature expectations $\mu (\hat {\pi })$ are calculated by using the estimated policies $\hat {\pi }$.
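In practice, the expectation in Eq. (11) can be approximated by averaging the discounted feature sums over the available demonstrations. The sketch below assumes one-hot state features and integer state indices. Both are illustrative choices, as the paper does not list its feature functions here.

```python
def one_hot(n_states):
    """Hypothetical feature map: phi(s) is the indicator vector of state s."""
    return lambda s: [1.0 if i == s else 0.0 for i in range(n_states)]


def feature_expectation(trajectories, phi, gamma):
    """Empirical form of Eq. (11): mean over demonstrations of sum_t gamma^t phi(s_t)."""
    k = len(phi(trajectories[0][0]))
    mu = [0.0] * k
    for states in trajectories:
        # The sum in Eq. (11) starts at t = 1.
        for t, s in enumerate(states, start=1):
            for i, f in enumerate(phi(s)):
                mu[i] += gamma ** t * f
    return [m / len(trajectories) for m in mu]


# Two made-up state sequences over three states.
mu_E = feature_expectation([[0, 1], [0, 2]], one_hot(3), gamma=0.5)
```

With `mu_E` in hand, Eq. (12) reduces the expected value of the expert policy to a dot product between `w` and `mu_E`.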

The idea here is to start from a random policy $\hat {\pi }$ and estimate its feature expectation $\mu (\hat {\pi })$. If the distance between the expert feature expectation μ_E and the estimated feature expectation $\mu (\hat {\pi })$ exceeds a predefined threshold 𝜖, then:

1. update the weight vector w and calculate the reward function ($R={\sum }_{i} w_{i} \phi _{i}$);
2. using the current reward function, apply an RL method to update the estimated policy and hence the estimated feature expectation $\mu (\hat {\pi })$.

We are trying to find a reward function that helps RL to estimate a policy which minimizes the difference between $\mu (\hat {\pi })$ and μ_E. It is computationally complex to estimate such a policy by random searching, especially in a large state space.

Alternatively, we can generate a set of random policies $\{\hat {\pi }^{1}, \ldots, \hat {\pi }^{d} \}$ and mix them with mixture weights λ_i in order to obtain a new policy. The probability of choosing $\hat {\pi }^{i}$ is given by λ_i. The feature expectation μ of the new policy is a convex combination of the feature expectations of these randomly generated policies:

$$ {\mu} = \sum\limits_{i} \lambda_{i} {\mu}^{(i)} ; \lambda_{i} \geq 0, \sum\limits_{i} \lambda_{i} = 1. $$

(13)
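Eq. (13) amounts to a weighted average of feature-expectation vectors. A minimal sketch, with the function name and input values chosen only for illustration:

```python
def mix_feature_expectations(mus, lambdas, tol=1e-9):
    """Eq. (13): convex combination of feature-expectation vectors."""
    if any(lam < 0 for lam in lambdas) or abs(sum(lambdas) - 1.0) > tol:
        raise ValueError("mixture weights must be non-negative and sum to 1")
    k = len(mus[0])
    return [sum(lam * mu[i] for lam, mu in zip(lambdas, mus)) for i in range(k)]


# Two hypothetical policies with two-dimensional feature expectations.
mu_mix = mix_feature_expectations([[1.0, 0.0], [0.0, 1.0]], [0.25, 0.75])
```

The constraint check mirrors the two conditions on λ_i stated with the equation.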

According to Carathéodory’s theorem [26], any point in k-dimensional space that is a convex combination of a set of N points, with N > k + 1, can be written as a convex combination of a subset of only k + 1 of those points.

$$ \mu = \arg \underset{\mu \in{C0} \{\mu(\hat{\pi}^{i} )\}^{n}_{i=0}}{\min} ||\mu_{E}-\mu||_{2} $$

(14)

where C0 denotes the convex hull.

By doing so we can obtain a set of k + 1 policies which can generate estimated feature expectations that are equally close to the expert feature expectation:

$$ \hat{\mu}^{i-1} = \hat{\mu}^{i-2} + \frac{(\mu^{i-1}-\hat{\mu}^{i-2})^{T}(\mu_{E} - \hat{\mu}^{i-2})}{(\mu^{i-1}-\hat{\mu}^{i-2})^{T}(\mu^{i-1}-\hat{\mu}^{i-2})} \left[ \mu^{i-1}-\hat{\mu}^{i-2} \right] $$

(15)

we set

$$ w^{i}=\mu_{E} - \hat{\mu}^{(i-1)} $$

(16)
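The projection in Eq. (15) and the weight update in Eq. (16) can be written with a small dot-product helper. The function names and the numbers in the example are assumptions made for illustration. The guard against a zero denominator is an addition that Eq. (15) does not mention.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(mu_bar_prev, mu_new, mu_expert):
    """Eq. (15): project mu_expert onto the line through mu_bar_prev and mu_new."""
    d = [m - b for m, b in zip(mu_new, mu_bar_prev)]
    denom = dot(d, d)
    if denom == 0.0:
        # The new policy adds no direction to move along (guard added in this sketch).
        return list(mu_bar_prev)
    gap = [e - b for e, b in zip(mu_expert, mu_bar_prev)]
    step = dot(d, gap) / denom
    return [b + step * di for b, di in zip(mu_bar_prev, d)]


def weights(mu_expert, mu_bar):
    """Eq. (16): w = mu_E - mu_bar."""
    return [e - b for e, b in zip(mu_expert, mu_bar)]


mu_bar = project([0.0, 0.0], [2.0, 0.0], [1.0, 1.0])
w = weights([1.0, 1.0], mu_bar)
```

In this example the projected point lies halfway along the segment, and the resulting weight vector points from it towards the expert feature expectation.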

The weight vector wⁱ defines the values of the underlying reward function through its dot product with the feature vector ϕ. The reward function is assumed to be a linear combination of the features ϕ:

$$ \begin{array}{@{}rcl@{}} R(s,a)&=& w_{1} \phi_{1} (s,a)+w_{2} \phi_{2} (s,a)+ \ldots +w_{k} \phi_{k} (s,a) \\ &=&\sum\limits_{i} w_{i} \phi_{i} (s,a) \end{array} $$

(17)

The distance t defines the termination criterion: the algorithm terminates once t is less than or equal to the predefined parameter 𝜖.

$$ t= \left\|\mu_{E} - \hat{\mu}^{(i-1)}\right\|_{2} \leq \epsilon $$

(18)

Initially, we set $w^{1} = \mu _{E} - \hat {\mu }^{(0)}$, where $\hat {\mu }^{(0)} = \mu ^{(0)}$. The workflow is shown in Fig. 2.

The dataset of the DTR trials provides the demonstrations τ_E and hence expert policies π_E. From these demonstrations, it is possible to estimate μ_E by following the procedure in (8)-(12).

On the other hand, an iterative process estimates the underlying reward function for these trajectories and updates the estimated policy according to Algorithm 1.

At each iteration: 1) a reward is calculated with respect to the current weight vector ($R={\sum }_{i} w_{i} \phi _{i}$); 2) the RL method (Value Iteration) is executed to estimate the policy ${\hat {\pi }}^{i}$ for the current reward; 3) the feature expectation μ is updated and $\hat {\mu }^{i-1}$ is computed through (15); and 4) the weight vector is updated as the difference $w^{i} = \mu _{E} - \hat {\mu }^{(i-1)}$, as in (16).

This process terminates when the Euclidean distance ($||\mu _{E} - \hat {\mu }^{(i-1)}||_{2} $) becomes less than or equal to some predefined value 𝜖.
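The four steps and the stopping rule can be put together in a compact, self-contained sketch. It is not the authors' Algorithm 1: it assumes one-hot state features (so that R(s) = w[s]), a finite horizon for the feature expectations, a caller-supplied initial policy, and a three-state toy problem. It returns the last weight vector that was actually used to derive a policy, which is a choice made for this example.

```python
import math


def greedy_policy(T, w, gamma, n_sweeps=300):
    """Value Iteration for state rewards R(s) = w[s]; returns a greedy policy."""
    n = len(T)
    V = [0.0] * n

    def q(s, a):
        return sum(p * V[s2] for s2, p in enumerate(T[s][a]))

    for _ in range(n_sweeps):
        V = [w[s] + gamma * max(q(s, a) for a in range(len(T[s]))) for s in range(n)]
    return [max(range(len(T[s])), key=lambda a: q(s, a)) for s in range(n)]


def feature_expectation(policy, T, p0, gamma, horizon=200):
    """Truncated sum_{t=1..horizon} gamma^t * P(s_t = s) under a deterministic policy."""
    n = len(T)
    mu, dist = [0.0] * n, list(p0)
    for t in range(1, horizon + 1):
        for s in range(n):
            mu[s] += gamma ** t * dist[s]
        nxt = [0.0] * n
        for s in range(n):
            for s2, p in enumerate(T[s][policy[s]]):
                nxt[s2] += dist[s] * p
        dist = nxt
    return mu


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pb_irl(mu_E, T, p0, gamma, init_policy, eps=1e-3, max_iter=50):
    mu_bar = feature_expectation(init_policy, T, p0, gamma)  # mu_hat^(0) = mu^(0)
    w = [e - b for e, b in zip(mu_E, mu_bar)]                # w^1
    w_used = w
    for _ in range(max_iter):
        if distance(mu_E, mu_bar) <= eps:                    # Eq. (18)
            break
        w_used = w
        policy = greedy_policy(T, w, gamma)                  # steps 1) and 2)
        mu = feature_expectation(policy, T, p0, gamma)       # step 3)
        d = [m - b for m, b in zip(mu, mu_bar)]
        denom = sum(x * x for x in d)
        if denom == 0.0:
            break
        step = sum(x * (e - b) for x, e, b in zip(d, mu_E, mu_bar)) / denom
        mu_bar = [b + step * x for b, x in zip(mu_bar, d)]   # Eq. (15)
        w = [e - b for e, b in zip(mu_E, mu_bar)]            # step 4), Eq. (16)
    return w_used, distance(mu_E, mu_bar)


# Toy problem: from state 0, action 0 leads to absorbing state 1, action 1 to absorbing state 2.
T = [[[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
     [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
     [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]]
p0, gamma = [1.0, 0.0, 0.0], 0.9
mu_E = feature_expectation([0, 0, 0], T, p0, gamma)          # "expert" always picks action 0
w_learned, final_distance = pb_irl(mu_E, T, p0, gamma, init_policy=[1, 0, 0])
```

In this toy run the learned weights reward state 1 and penalize state 2, so a greedy agent reproduces the expert's choice, and the distance to the expert feature expectation drops below the threshold after one projection.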

5 Experiments

In this section, we report the results of an experiment that we have conducted to evaluate the proposed model.

First, we will describe the dataset and test. Next, we will present the results and validation.

5.1 Model setup

The DTR under examination (Fig. 1) is a two-stage decision process. We completed and mapped this process into the RL model represented in Fig. 3. The model consists of seven states {s₀,...,s₆}. s₀ is the initial state. s₁ and s₃ represent the “Responder (Res)” states for the “PSY” and “MED” actions, respectively. Similarly, s₂ and s₄ represent the “Non-Responder” states for the “PSY” and “MED” actions, while s₅ and s₆ are the terminal states.

Our model has four possible actions (treatments) $\{\text{MED}, \text{PSY}, \text{TM}, \text{PSY+MED}\}$ and eight probability distributions $T_{s\_a}$, where T is the probability that the patient is a “Responder” when in state s and selecting action a.

There are three pre-treatment features (i.e., Gender, Location and Cultural level). We have a total of 12 possible combinations of these pre-treatment features (i.e., 2 Genders × 2 Locations × 3 Cultural levels = 12), as shown in Table 2. These pre-treatment features represent the information about the patient. As an example, [M-D-L] stands for Gender = Male, Location = Downtown, and Cultural Level = Low.
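The 12 profiles can be enumerated as a Cartesian product. The codes `M`, `D`, `O`, `L` and `H` appear in the text. The code `F` for the second gender is an assumption, and the text does not expand `O`, so it is kept as a bare code.

```python
from itertools import product

GENDERS = ["M", "F"]               # "F" is assumed; only "M" appears in the text
LOCATIONS = ["D", "O"]             # D = Downtown; "O" is used in the text without expansion
CULTURAL_LEVELS = ["L", "M", "H"]  # L = Low; "M" and "H" appear in M-D-M and M-D-H

# One label per combination, e.g. "M-D-L".
profiles = ["-".join(combo) for combo in product(GENDERS, LOCATIONS, CULTURAL_LEVELS)]
```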

Table 2 Features of the (SMART) design schematic for DTR


This model is further divided into three stages. Stage-1 comprises s₀, stage-2 comprises {s₁, s₂, s₃ and s₄}, and the final stage contains the states {s₅ and s₆}. Two actions (PSY and MED) are possible at stage-1 and four actions (PSY, MED, TM, PSY+MED) are possible at stage-2, as shown in Table 2. The actions (treatments) selected at stage-1 depend on the pre-treatment features, whereas the actions at stage-2 depend on the response of the patient (e.g., Responder or Non-Responder). The final stage contains only the terminal states s₅ and s₆, which represent the outcomes of the treatment (success or failure).
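The seven-state layout can be encoded as plain dictionaries, together with a structural check for a demonstration of the form [state, action, state, action, state]. The string labels and the helper `is_valid_demonstration` are hypothetical. The check covers only the stage structure described above, not the transition probabilities.

```python
STATES = {
    "s0": "initial",
    "s1": "Responder to PSY",
    "s2": "Non-Responder to PSY",
    "s3": "Responder to MED",
    "s4": "Non-Responder to MED",
    "s5": "success (terminal)",
    "s6": "failure (terminal)",
}
STAGE1_ACTIONS = {"PSY", "MED"}
STAGE2_ACTIONS = {"PSY", "MED", "TM", "PSY+MED"}
RESPONSE_STATES = {"PSY": {"s1", "s2"}, "MED": {"s3", "s4"}}
TERMINAL_STATES = {"s5", "s6"}


def is_valid_demonstration(demo):
    """Check that a demonstration follows the three-stage structure."""
    if len(demo) != 5:
        return False
    start, a1, mid, a2, end = demo
    return (
        start == "s0"
        and a1 in STAGE1_ACTIONS
        and mid in RESPONSE_STATES[a1]  # the response state must match the first treatment
        and a2 in STAGE2_ACTIONS
        and end in TERMINAL_STATES
    )
```

A check like this can filter malformed trajectories before feature expectations are computed from them.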

5.2 Dataset description

In order to validate the proposed approach, we have generated a dataset by adopting the SMART method [3, 27, 28].

The DTRs referred to are shown in Fig. 3. In this trial, each participant is randomly assigned to one of two possible initial treatments: psychology (PSY) or medicine (MED). As already described in a previous section, the participants are classified as Non-Responders (NR) or Responders (Res) to the initial treatment according to whether they do (or do not) experience more than two heavy-drinking days during the two-month time period.

There are 12 possible combinations of pre-treatment features (see Table 2). We have selected 500 demonstrations (DTR trials) for each of these combinations. An example of a demonstration for the pre-treatment parameter M-O-L is “[s₀, PSY, s₂, MED, s₅]”, where s₂ and s₅ represent the “Non-Responder” and “Success” states, respectively (see Fig. 3).

5.3 Results

Figure 4 shows the difference between the expert feature expectations and the estimated feature expectations, as defined in (18), against the number of iterations. The results have been obtained with a discount factor γ = 0.01. As an example, we considered the demonstrations of only three pre-treatment parameters (M-D-M, M-D-L, and M-D-H) to plot this graph. The trajectories in the graph show that we are on the right track in terms of learning the real reward function since, with each iteration, the difference between the expert feature expectations and the estimated feature expectations gets smaller.

These trends show the quick convergence behavior of the algorithm. The final difference value reduces below the threshold in less than 30 iterations; i.e., the last estimated reward values generate feature expectations very close to those generated by the expert policy.

On the other hand, the expected rewards at each state are estimated by exercising the proposed model for a given number of demonstrations, as shown in Table 3. These rewards provide the best representation of the demonstration for each pre-treatment feature. Consider “M-O-L” as the pre-treatment features. There were three types of repeated demonstration for “M-O-L” with the following shares:

[s₀, MED, s₄, PSY, s₅] = 32 %

[s₀, MED, s₃, TM, s₅] = 37 %

[s₀, MED, s₄, PSY+MED, s₅] = 31 %

Table 3 Rewards collected at each state against demonstrations

There were some negative demonstrations that ended at state s₆ (the failure state). We only selected the positive demonstrations which end at state s₅ (the success state). The estimated rewards at each state for these demonstrations are presented in Table 3.
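Selecting the positive demonstrations and computing their shares is a simple counting exercise. The sketch below uses a handful of made-up trajectories, not the paper's dataset, and the function name is hypothetical.

```python
from collections import Counter


def demonstration_shares(demos, success_state="s5"):
    """Keep demonstrations ending in the success state and return their relative shares."""
    positive = [tuple(d) for d in demos if d[-1] == success_state]
    if not positive:
        return {}
    counts = Counter(positive)
    return {traj: n / len(positive) for traj, n in counts.items()}


# Illustrative input only: three repeats of one trajectory, one of another, one failure.
demos = (
    [["s0", "MED", "s3", "TM", "s5"]] * 3
    + [["s0", "MED", "s4", "PSY", "s5"]]
    + [["s0", "MED", "s4", "PSY", "s6"]]
)
shares = demonstration_shares(demos)
```

The failed trajectory is dropped before the shares are computed, so the shares sum to one over the positive demonstrations only.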

Next, we aimed to test how good these reward functions are. To achieve that objective, we used an RL algorithm (Value Iteration) with the rewards already estimated through the proposed model. Values for the different pre-treatment features were obtained by the RL algorithm and are listed in Table 4. These values represent the preference of the RL agent at each stage. For example, let’s consider a patient with the characteristics M-O-L. In stage-1, the value of the treatment MED is greater than the value of the treatment PSY. Therefore, MED is preferred at stage-1. Similarly, at stage-2 there are four possible actions and among these, TM has the highest value. Therefore, TM is the preferred action in stage-2. Similarly, Success has the highest value in the final stage. Hence, the trajectory reproduced by the RL algorithm using the estimated reward function is:

$$ [s_{0}, \text{MED}, s_{3}, \text{TM}, s_{5}] $$

This reproduced trajectory is the same as the demonstration with the highest percentage share. This means that the estimated reward function is the best explanation for these demonstrations. All the other rows in Table 4 represent the values of each action for every pre-treatment feature.

Table 4 Values at each state collected by the Value Iteration algorithm


If the RL agent selects MED at stage-1, the possible actions it can take at the second stage are PSY, TM and PSY+MED (see Fig. 3). For this case, the aggregated values of each possible combination of actions at both stages are represented in Fig. 5. “MED PSY” means that the MED and PSY actions are chosen at stage-1 and stage-2, respectively. The graph shows that MED TM has a higher aggregated value for all the pre-treatment features compared with MED PSY and MED PSY+MED.

On the other hand, Fig. 6 represents the aggregated values considering the PSY action at stage-1. The possible actions at stage-2 that the RL agent can take in this case are MED, TM and PSY+MED. The graph shows that the PSY TM actions, at stage-1 and stage-2 respectively, provide the highest aggregated values for all the pre-treatment features. Note that the trends in Figs. 5 and 6 represent the policies of the RL algorithm.

6 Testing estimated rewards on existing models

As discussed earlier, it is very hard to find the true rewards for many RL problems, especially in a large state space. In many problems, researchers have to start with random rewards, that are difficult to tune in complex environments. Q-learning [29] and SARSA [30] are two important RL algorithms that are used in health care and personalized medicine. Usually, in existing models, rewards are randomly assigned to each state at the beginning. On the other hand, in the proposed approach, we estimate the best fitting rewards for each state.

In this section, we present a learning comparison for both models (Q-learning and SARSA), contrasting the existing approach with the rewards estimated via the proposed algorithm. The convergence performance of both RL models on the same DTR problem is presented in Fig. 7.

The learning curves of the Q-learning model, for both cases, are presented in Fig. 7(a). It can be seen that learning from the “IRL rewards” is faster than learning from the “existing approach”. With the IRL rewards, almost 415 episodes were needed to learn the DTR environment, while in the other case almost 809 episodes were required.

Similarly, the case of learning the DTR environment by adopting the SARSA model is presented in Fig. 7(b). Once again, learning is much faster in the case where the rewards obtained through the proposed algorithm are used. The SARSA model took almost 712 episodes to learn the environment when the “existing approach” was used, whereas it took almost 252 episodes when the “IRL rewards” were used. Both models performed much better in terms of learning with the rewards that were obtained through the proposed IRL approach.

It is important to note that speeding up the learning process of an RL model is crucial in the case of a DTR. As already mentioned in the previous sections, statistical approaches (e.g., SMART) perform the execution of DTR trials (demonstrations). After the trials are completed, one can train an intelligent agent to support decision making. However, training an RL model from scratch takes many new “attempts”, the number of which can be reduced by using the proposed approach. To conclude, the proposed approach can be effectively adopted:

1. to automatically estimate the “value” of each DTR (which helps in finding the optimal DTR); and
2. to reduce the time taken and the number of attempts needed to train the RL model.

7 Related work

The application of Machine Learning (ML) techniques in biomedicine [31], minimizing medication errors during home treatment [32, 33], risk management [34], communication [35,36,37] and healthcare [38,39,40] has increased extensively in recent years. DTRs [41, 42] generalize personalized medicine to time-varying treatment settings in which the treatment is frequently tailored to a patient’s dynamic state. DTRs are alternatively known as adaptive treatment strategies [43,44,45] or treatment policies [46, 47]. Behavior Cloning (BC) and RL [48, 49] are two methods exploited to learn DTRs.

BC [50] learns the policy through supervised learning by the direct mapping of states to actions. It can avoid interacting with the environment. However, without considerable improvement during the training, BC introduces a compounding error [51] over the length of the trajectories. BC can effectively recover the doctor’s policies when the Electronic Health Record (EHR) is optimal and plentiful. However, these methods are not suitable if the ground truth of the treatment is unclear [7].

On the other hand, the RL [52] and Deep Reinforcement Learning (DRL) [53] methods learn a policy directly by maximizing the long-term reward of the patients [14, 48]. However, the learned policy is highly reliant on the accuracy of the pre-defined reward function. The true reward function is crucial in RL in terms of finding the optimal DTR. In addition, without a true reward function these models are not explainable to a domain expert, which is important in this sensitive application.

8 Conclusions and future directions

In this paper, we have proposed a PB-IRL approach to learn the underlying reward function for the automatic analysis of DTRs from their trials (demonstrations). The reward function explains the behavior of experts (physicians) more effectively in comparison with RL policies. We have also shown that an RL-based decision support system can be directly and efficiently trained by adopting this reward function.

The reward function for the demonstrations is estimated by learning the appropriate policies. Such policies can easily be found by mixing all the estimated policies according to their mixture weights (as explained in Section 4). In a large state space, this is an easy and rapid way to obtain an optimal policy and hence the true reward function for the given demonstrations. In order to validate the approach, we generated some demonstrations, applied the proposed algorithm, and estimated the reward function for these demonstrations. To check the quality of the estimated rewards, we ran the Value Iteration RL algorithm on them and learned the corresponding policies. By comparing these learned policies with the demonstrated policies, we verified the accuracy of the algorithm. The results show that the proposed algorithm provides better dynamic treatment regimes and helps existing models to learn the environment faster.
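The validation loop summarized above (run Value Iteration on an estimated reward, then compare the resulting policy with the demonstrated one) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the three-state chain, the state-only reward vector `R_hat`, the `expert_policy`, and the helper `policy_agreement` are hypothetical stand-ins, not the environment, rewards, or metric used in the paper.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Return (V, greedy policy) for transitions P[a, s, s'] and state rewards R[s]."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        Q = R[None, :] + gamma * (P @ V)  # Q[a, s], Bellman optimality backup
        V_new = Q.max(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R[None, :] + gamma * (P @ V)
    return V, Q.argmax(axis=0)

def policy_agreement(pi_a, pi_b):
    """Fraction of states in which two deterministic policies pick the same action."""
    return float(np.mean(np.asarray(pi_a) == np.asarray(pi_b)))

# Hypothetical 3-state chain: action 0 steps back, action 1 steps forward.
P = np.zeros((2, 3, 3))
P[0, 0, 0] = P[0, 1, 0] = P[0, 2, 1] = 1.0
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = 1.0

R_hat = np.array([0.0, 0.0, 1.0])    # stand-in for an IRL-estimated reward
expert_policy = np.array([1, 1, 1])  # stand-in for the demonstrated policy

V, learned_policy = value_iteration(P, R_hat)
agreement = policy_agreement(learned_policy, expert_policy)
print(learned_policy, agreement)
```

State-wise agreement is only one possible comparison; on larger problems one might instead compare the expected returns of the two policies, since different policies can be equally good under the same reward.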

The proposed IRL technique mitigates the problems associated with RL, but there are some limitations to be aware of.

1. The proposed technique can be applied only to problems for which expert trajectories/demonstrations are available, e.g., autonomous car driving, where the trajectories of taxi drivers are available, or DTRs, where the doctors' policies are available.
2. IRL algorithms often make the strong assumption that any observed behavior is optimal. This assumption can be challenging in a real-time environment.

IRL methods consider only the positive demonstrations or trajectories (i.e., those that result in the recovery or survival of the patients) and learn a policy to recover these trajectories. The information embedded in negative trajectories (e.g., those that result in unsuccessful treatments) has been largely ignored, although it could help the learned policy to avoid repeating mistakes. Both positive and negative trajectories can be used to deduce the best practice and avoid errors in dynamic treatment regimes. Techniques that exploit information from both kinds of trajectory may therefore increase the likelihood of patient survival and provide better dynamic treatment regimes.

Abbreviations

PB-IRL: Projection Based Inverse Reinforcement Learning
SMART: Sequential Multiple Assignment Randomized Trial
MDP: Markov Decision Process
RL: Reinforcement Learning
DRL: Deep Reinforcement Learning
DTR: Dynamic Treatment Regime
IRL: Inverse Reinforcement Learning
IL: Imitation Learning
LD: Learning from Demonstration
ML: Machine Learning
LP: Linear Programming
BC: Behavior Cloning

References

Chakraborty B, Murphy SA (2014) Dynamic treatment regimes. Ann Rev Stat Appl 1:447–464
Isern D, Moreno A, Sánchez D, Hajnal Á, Pedone G, Varga LZ (2011) Agent-based execution of personalised home care treatments. Appl Intell 34(2):155–180
Murphy SA (2005) An experimental design for the development of adaptive treatment strategies. Stat Med 24(10):1455–1481
Lu K, Liao H (2022) A survey of group decision making methods in healthcare industry 4.0: bibliometrics, applications, and directions. Appl Intell 5:1–25
Moodie EEM, Richardson TS, Stephens DA (2007) Demystifying optimal dynamic treatment regimes. Biometrics 63(2):447–455
Wang Y, Wu P, Liu Y, Weng C, Zeng D (2016) Learning optimal individualized treatment rules from electronic health record data. In: 2016 IEEE international conference on healthcare informatics (ICHI), IEEE, pp 65–71
Utomo CP, Kurniawati H, Li X, Pokharel S (2019) Personalised medicine in critical care using bayesian reinforcement learning. In: International conference on advanced data mining and applications, Springer, pp 648–657
Zhang Z, et al. (2019) Reinforcement learning in clinical medicine: a method to optimize dynamic treatment regime over time. Ann Transl Med 7(14):345
Naeem M, Rizvi STH, Coronato A (2020) A gentle introduction to reinforcement learning and its application in different fields. IEEE Access 8:209320–209344
Nahum-Shani I, Qian M, Almirall D, Pelham WE, Gnagy B, Fabiano GA, Waxmonsky JG, Yu J, Murphy SA (2012) Q-learning: a data analysis method for constructing adaptive interventions. Psychol Meth 17(4):478
Coronato A, Naeem M, Pietro GD, Paragliola G (2020) Reinforcement learning for intelligent healthcare applications: a survey. Artif Intell Med 109:101964
Cortès U, Sànchez-Marrè M, Ceccaroni L, Poch M, et al. (2000) Artificial intelligence and environmental decision support systems. Appl Intell 13(1):77–91
Chua K, Calandra R, McAllister R, Levine S (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In: Advances in neural information processing systems, pp 4754–4765
Raghu A, Komorowski M, Celi LA, Szolovits P, Ghassemi M (2017) Continuous state-space models for optimal sepsis treatment-a deep reinforcement learning approach. arXiv:1705.08422
Bothe MK, Dickens L, Reichel K, Tellmann A, Ellger B, Westphal M, Faisal AA (2013) The use of reinforcement learning algorithms to meet the challenges of an artificial pancreas. Expert Rev Med Devices 10(5):661–673
Raghu A, Komorowski M, Ahmed I, Celi L, Szolovits P, Ghassemi M (2017) Deep reinforcement learning for sepsis treatment. arXiv:1711.09602
Russell S (1998) Learning agents for uncertain environments. In: Proceedings of the eleventh annual conference on computational learning theory, pp 101–103
Michini B, How JP (2012) Bayesian nonparametric inverse reinforcement learning. Lect Notes Comput Sci (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7524 LNAI(PART 2):148–163
Abbeel P, Ng AY (2004) Apprenticeship learning via inverse reinforcement learning. In: Proceedings of the twenty-first international conference on machine learning, p 1
Jin H, Nath SS, Schneider S, Junghaenel D, Wu S, Kaplan C (2021) An informatics approach to examine decision-making impairments in the daily life of individuals with depression. J Biomed Inform 122:103913
Sutton RS, Barto AG, Klopf H (2016) Reinforcement learning: an introduction second edition in progress. MIT press, Cambridge
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT press, Cambridge
Grounds M, Kudenko D (2005) Parallel reinforcement learning with linear function approximation. In: Adaptive agents and multi-agent systems III. Adaptation and multi-agent learning, Springer, pp 60–74
Shah SIH, Coronato A (2021) Learning tasks in intelligent environments via inverse reinforcement learning. In: 2021 17th international conference on intelligent environments (IE), IEEE, pp 1–4
Ng AY, Russell SJ, et al. (2000) Algorithms for inverse reinforcement learning. In: ICML, vol 1, p 2
Hrbacek K (1979) Nonstandard set theory. Am Math Mon 86(8):659–677
Lavori PW, Dawson R (2000) A design for testing clinical strategies: biased adaptive within-subject randomization. J R Stat Soc Ser A (Stat Soc) 163(1):29–38
Lavori PW, Dawson R (2004) Dynamic treatment regimes: practical design considerations. Clin Trials 1(1):9–20
Moodie EEM, Chakraborty B, Kramer MS (2012) Q-learning for estimating optimal dynamic treatment rules from observational data. Can J Stat 40(4):629–645
Yu C, Liu J, Nemati S, Yin G (2021) Reinforcement learning in healthcare: a survey. ACM Comput Surv (CSUR) 55(1):1–36
Deepika SS, Geetha TV (2021) Pattern-based bootstrapping framework for biomedical relation extraction. Eng Appl Artif Intell 99:104130
Ciampi M, Coronato A, Naeem M, Silvestri S (2022) An intelligent environment for preventing medication errors in home treatment. Expert Syst Appl 193:116434
Naeem M, Coronato A (2022) An ai-empowered home-infrastructure to minimize medication errors. J Sensor Actuator Netw 11(1):13
Paragliola G, Naeem M (2019) Risk management for nuclear medical department using reinforcement learning algorithms. J Reliab Intell Environ 5(2):105–113
Naeem M, Pietro GD, Coronato A (2021) Application of reinforcement learning and deep learning in multiple-input and multiple-output (mimo) systems. Sensors 22(1):309
Shah SIH, Alam S, Ghauri SA, Hussain A, Ansari FA (2019) A novel hybrid cuckoo search-extreme learning machine approach for modulation classification. IEEE Access 7:90525–90537
Shah SIH, Coronato A, Ghauri SA, Alam S, Sarfraz M (2022) Csa-assisted gabor features for automatic modulation classification. Circ Syst Sig Process 41(3):1660–1682
Cinque M, Coronato A, Testa A (2013) A failure modes and effects analysis of mobile health monitoring systems. In: Innovations and advances in computer, information, systems sciences, and engineering, Springer, pp 569–582
Naeem M, Paragliola G, Coronato A (2021) A reinforcement learning and deep learning based intelligent system for the support of impaired patients in home treatment. Expert Syst Appl 168:114285
Khan AF, Jalil A, Haq IU, Shah SIH (2021) Automatic localization of macula and identification of macular degeneration in retinal fundus images. In: 2021 International conference on electrical, communication, and computer engineering (ICECCE), IEEE, pp 1–6
Murphy SA (2003) Optimal dynamic treatment regimes. J R Stat Soc Ser B (Stat Methodol) 65(2):331–355
Robins JM (2004) Optimal structural nested models for optimal sequential decisions. In: Proceedings of the second seattle symposium in biostatistics, Springer, pp 189–326
Lavori PW, Dawson R (2008) Adaptive treatment strategies in chronic disease. Annu Rev Med 59:443–453
Oetting AI, Levy JA, Weiss RD, Murphy SA (2011) Statistical methodology for a smart design in the development of adaptive treatment strategies. Causality Psychopathol Find Determinants Disord Cures 8:179–205
Thall PF, Sung H-G, Estey EH (2002) Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. J Am Stat Assoc 97(457):29–39
Lunceford JK, Davidian M, Tsiatis AA (2002) Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials. Biometrics 58(1):48–57
Wahed AS, Tsiatis AA (2006) Semiparametric efficient estimation of survival distributions in two-stage randomisation designs in clinical trials with censored data. Biometrika 93(1):163–177
Wang L, Zhang W, He X, Zha H (2018) Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2447–2456
Zhang Y, Chen R, Tang J, Stewart WF, Sun J (2017) Leap: learning to prescribe effective and safe treatment combinations for multimorbidity. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1315–1324
Pomerleau DA (1991) Efficient training of artificial neural networks for autonomous navigation. Neural Comput 3(1):88–97
Ross S, Gordon G, Bagnell D (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp 627–635
Ding S, Zhao X, Xu X, Sun T, Jia W (2019) An effective asynchronous framework for small scale reinforcement learning problems. Appl Intell 49(12):4303–4318
Lin E, Chen Q, Qi X (2020) Deep reinforcement learning for imbalanced classification. Appl Intell 50(8):2488–2502

Acknowledgements

This work has been partly supported by the AMICO project, which has received funding from the National Programs (PON) of the Italian Ministry of Education, Universities and Research (MIUR): code ARS0100900 (Decree n. 1989, 26 July 2018).

Funding

Open access funding provided by Università Parthenope di Napoli within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

Department of ICT and Engineering, Parthenope University, Naples, Italy
Syed Ihtesham Hussain Shah
ICAR, National Research Council, Naples, Italy
Syed Ihtesham Hussain Shah, Giuseppe De Pietro, Giovanni Paragliola & Antonio Coronato
Università Telematica Giustino Fortunato, Benevento, Italy
Antonio Coronato

Corresponding author

Correspondence to Syed Ihtesham Hussain Shah.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Giuseppe De Pietro, Giovanni Paragliola and Antonio Coronato contributed equally to this work.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Shah, S.I.H., De Pietro, G., Paragliola, G. et al. Projection based inverse reinforcement learning for the analysis of dynamic treatment regimes. Appl Intell 53, 14072–14084 (2023). https://doi.org/10.1007/s10489-022-04173-0

Accepted: 12 September 2022
Published: 21 October 2022
Issue Date: June 2023
DOI: https://doi.org/10.1007/s10489-022-04173-0
