Prospect Theory-inspired Automated P2P Energy Trading with Q-learning-based Dynamic Pricing

Ashutosh Timilsina
Department of Computer Science
University of Kentucky
Lexington, USA
ashutosh.timilsina@uky.edu

Simone Silvestri
Department of Computer Science
University of Kentucky
Lexington, USA
simone.silvestri@uky.edu

Abstract

The widespread adoption of distributed energy resources and the advent of smart grid technologies have allowed traditionally passive power system users to become actively involved in energy trading. Because traditional centralized, grid-driven energy markets offer minimal profitability to these users, recent research has shifted focus towards decentralized peer-to-peer (P2P) energy markets. In these markets, users trade energy with each other, with higher benefits than buying from or selling to the grid. However, most research on P2P energy trading largely overlooks user perception in the trading process, assuming constant availability, participation, and full compliance. As a result, these approaches may lead to negative attitudes and reduced engagement over time. In this paper, we design an automated P2P energy market that takes user perception into account. We employ prospect theory to model user perception and formulate an optimization framework that maximizes the buyers’ perception while matching demand and production. Given the non-linear and non-convex nature of the optimization problem, we propose a Differential Evolution-based Algorithm for Trading Energy, called $DEbATE$. Additionally, we introduce a risk-sensitive Q-learning algorithm, named Pricing mechanism with Q-learning and Risk-sensitivity ($PQR$), which learns the optimal price for sellers considering their perceived utility. Results based on real traces of energy consumption and production, as well as realistic prospect theory functions, show that our approach achieves a $26\%$ higher perceived value for buyers and generates $7\%$ more reward for sellers, compared to a recent state-of-the-art approach.

Peer-to-peer energy trading, differential evolution, dynamic pricing, prosumer, prospect theory, Q-learning.

Pre-print

I Introduction

Distributed Energy Resources (DER), such as rooftop solar panels and wind turbines, have seen widespread proliferation among consumers in recent years [12]. In addition, the advent of Smart Grid (SG) technologies, Advanced Metering Infrastructures (AMI), and home energy management systems has added flexibility in energy generation and consumption for consumers. This, in turn, has allowed traditionally passive consumers to become actively involved in energy trading by sharing the excess energy generated at their premises with either the grid or other buyers [25, 16]. These active consumers with energy production capabilities are referred to as prosumers [16], a portmanteau of “producers” and “consumers”. The role of prosumers in the energy market has been recognized to some extent with the adoption of incentive schemes such as the Feed-in-Tariff (FiT) mechanism [26, 27]. FiT allows prosumers to sell excess energy to the grid and buy from the grid when required [27]. However, existing energy trading modalities offer limited benefits to participating prosumers. This is due to the low prices at which energy is purchased by the grid, as well as the low limits on the amount of energy that can be purchased [26, 27, 16].

I-A Literature Review and Motivation

Peer-to-peer (P2P) energy trading is a recently proposed decentralized modality for energy sharing that aims to overcome the limitations of centralized techniques, and it has been gaining significant traction [27, 26]. Specifically, P2P energy trading allows prosumers to trade energy among each other at a negotiated price, with or without the involvement of the grid [26]. It generates better monetary incentives for prosumers compared to existing mechanisms while also reducing their grid dependency [27]. Additionally, the increased local energy generation and consumption resulting from P2P trading reduces overall system energy loss while providing an effective way to achieve demand side management [32]. Benefits also extend to the grid operator, in the form of savings on investments that would otherwise be required to develop and maintain transmission infrastructure in a centralized power distribution architecture [16, 26].

P2P energy trading has received attention from the research community in recent years. The works in [29, 28] present game-theoretic approaches in a P2P setting, while [3] proposes a greedy rule-based P2P mechanism that assigns energy among prosumers using mid-market pricing. Similarly, the physical aspects of P2P energy trading, such as power loss minimization and voltage regulation, have been explored in [15, 17]. These works, however, largely overlook user behavior in designing their solutions. As established in [16, 27, 25], accommodating user behavioral modeling in P2P energy trading ensures sustained participation from prosumers while incentivizing their contribution. In fact, the papers [29, 28] consider prosumers to be actively involved and fully compliant with the system as rational decision-makers. The first concern with this assumption is that the continuous online presence of participating prosumers might not always be possible in real-world applications. Secondly, research on user behavioral models and decision making [11, 2] has found users to have bounded rationality. Therefore, requiring constant active participation overwhelms users and incentivizes non-rational decisions [7]. In the worst case, it might even result in users opting to terminate their participation altogether [2, 25]. In that light, the works in [25, 1] incorporate bounded rationality and user preferences into P2P energy trading. However, they require continuous human participation and assume a simplistic linear model for user perception. Conversely, the authors of [27] limit their focus to coalition formation in a game-theoretic setting and do not explicitly consider user behavioral modeling.

As a result, a prosumer-centric P2P energy trading model that effectively incorporates the prosumers’ decision-making behavior and their perceived loss/gain value from trading is still lacking in the existing literature. Such a trading modality is expected to require minimal active participation from users while also ensuring their sustained involvement through the adoption of user behavioral modeling. To this end, the framework of Prospect Theory (PT) [13] can be used to model non-rational user behavior in decision making under uncertainty. It is often regarded as a fairly accurate mathematical representation of human behavior [13, 9, 8].

Recently, there have been a few efforts to integrate PT into energy-related applications to capture the irrationality of users [20, 8, 30, 31]. In relation to P2P energy trading, the authors of [31] have proposed a PT-based distributed energy trading model to optimize trading decisions for prosumers in a competitive market. Although these papers model user behavior in some ways, they require active participation from users and also assume that such behavior (e.g., the parameters of PT) is homogeneous across all users. Social science studies, such as the one conducted in Italy [6] to investigate the social acceptance of nuclear energy using an online survey, show that users exhibit significant heterogeneity in their preferences for the sources of energy. Neuroscience studies have also stressed the heterogeneity of humans with respect to PT parameters [10]. Not capturing such heterogeneity provides little benefit in terms of user behavioral modeling.

I-B Paper Contributions

In this paper, we design a PT-based optimization framework for prosumer-centric P2P energy trading, as shown in Fig. 1. The framework aims at matching energy production and consumption (step $1$ in Fig. 1) to maximize the perceived utility of individual buyers while taking into account the intrinsic heterogeneity of human perception. Given that the optimization problem is non-linear and non-convex, we further devise a Differential Evolution-based [23] metaheuristic algorithm called $DEbATE$ to solve the problem (energy allocation, step $2$). In order to ensure minimal active participation of prosumers, we employ a Reinforcement Learning (RL) framework, called $PQR$, in tandem with $DEbATE$ to automate the pricing mechanism for sellers (pricing mechanism, step $3$). In doing so, $PQR$ learns the selling price for each seller using a PT-based risk-sensitive Q-learning algorithm [21]. The output of the algorithms is then returned to the prosumers for executing the physical energy transactions (step $4$). Using real datasets for energy production and consumption, paired with recent survey data for PT perception modeling, results show that $DEbATE$ paired with $PQR$ achieves a $26\%$ higher buyers’ perceived value and a $7\%$ higher sellers’ reward compared to a state-of-the-art approach.

The major contributions of the paper are the following:

We develop a PT-inspired optimization framework for P2P energy trading;
We design a metaheuristic algorithm, $DEbATE$, to solve the non-linear energy allocation problem;
We design a dynamic pricing mechanism, the $PQR$ algorithm, using a risk-sensitive Q-learning approach;
Experiments using real data show the superiority of the proposed approach compared to the state of the art.

Fig. 1: P2P Energy Trading System Overview.

II System Model and Problem Formulation

We consider a P2P energy trading system as shown in Fig. 1. The system consists of prosumers that can exchange energy among each other through an existing distribution network. The grid serves as backup for prosumers to either buy or sell energy, if the local energy trading is insufficient or not possible. Let $P$ be the set of all prosumers participating in the P2P energy market. We refer to $B_{t} \subset P$ as the set of Buyers, i.e. the set of prosumers that have higher self-consumption than generation at a timeslot $t$ , and consumers without energy generation capabilities. Similarly, $S_{t} \subset P$ is the set of Sellers, i.e., prosumers that have excess generation at a timeslot $t$ . For simplicity of notation, we drop the subscript $t$ in the following.

We model the perceived loss and gain of prosumers using the prospect theory (PT) value function. Specifically, let the excess energy generation of seller $i \in S$ be $r_{i}$ and the demand of buyer $j \in B$ be $w_{j}$. Then, let $x_{ij} \in [0, 1]$ represent the fraction of $w_{j}$ that buyer $j$ is willing to buy from seller $i$ at a price of $ρ_{i}$ per kWh. There is an energy loss during the physical energy transfer through wires [32], which depends on the wire length between $i$ and $j$ and is directly proportional to the amount of energy exchanged. The loss is modeled as a fraction $l_{ij} \in [0, 1]$ of the energy exchanged. Let $ρ_{gs}$ and $ρ_{gb}$ be the prices at which the grid sells and buys energy, respectively. We adopt a modified PT value function to model realistic user perception in an energy market [13]. The function quantifies the perceived utility of humans towards gain and loss based on the degree of deviation from a reference point. In our problem, it captures the difference between the total actual buying cost $y_{j}$ and the buyer’s desired total reference cost $ρ_{j} w_{j}$, where $ρ_{j}$ is the reference price of buyer $j$ for purchasing energy. This utility function is formulated as

v(y_{j}) = \begin{cases} k_{+,j} (ρ_{j} w_{j} - y_{j})^{ζ_{+,j}}, & y_{j} < ρ_{j} w_{j} \\ - k_{-,j} (y_{j} - ρ_{j} w_{j})^{ζ_{-,j}}, & y_{j} \geq ρ_{j} w_{j} \end{cases}

(1)

where $k_{+,\cdot}, k_{-,\cdot}, ζ_{+,\cdot}, ζ_{-,\cdot}$ are the parameters that control the degree of loss-aversion and risk-sensitivity. These parameters are found to be highly heterogeneous and vary from person to person based on factors like gender and age group [4, 19]. $y_{j}$ is the total actual cost of buying energy for the $j$-th buyer, given by

y_{j} = \sum_{i \in S} ρ_{i} x_{ij} w_{j} + ρ_{gs} \left(1 - \sum_{i \in S} x_{ij}\right) w_{j}

Note that, similar to the PT value function in [13], the utility function in Eq. (1) is concave in the gain domain (i.e. case $y_{j} < ρ_{j} w_{j}$ ) while convex in loss domain (i.e. case $y_{j} \geq ρ_{j} w_{j}$ ).
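As a minimal sketch of Eq. (1) and of the cost $y_{j}$, the two functions below evaluate a buyer's actual cost and perceived utility. The function names and the numbers in the usage lines are illustrative assumptions, not values from the experiments.

```python
def actual_cost(x_j, seller_prices, w_j, rho_gs):
    """Total cost y_j: P2P purchases plus the remaining demand bought from the grid."""
    p2p = sum(rho_i * x_ij * w_j for rho_i, x_ij in zip(seller_prices, x_j))
    grid = rho_gs * (1.0 - sum(x_j)) * w_j
    return p2p + grid


def buyer_value(y_j, rho_j, w_j, k_plus, k_minus, zeta_plus, zeta_minus):
    """Perceived utility v(y_j) of Eq. (1) around the reference cost rho_j * w_j."""
    reference = rho_j * w_j
    if y_j < reference:
        # Gain domain: the buyer paid less than the reference cost.
        return k_plus * (reference - y_j) ** zeta_plus
    # Loss domain: the buyer paid at least the reference cost.
    return -k_minus * (y_j - reference) ** zeta_minus


# Hypothetical buyer: 10 kWh demand, half bought from seller 0, a quarter from seller 1.
y = actual_cost(x_j=[0.5, 0.25], seller_prices=[0.10, 0.08], w_j=10.0, rho_gs=0.12)
v = buyer_value(y, rho_j=0.09, w_j=10.0, k_plus=2.25, k_minus=2.25,
                zeta_plus=0.88, zeta_minus=0.88)
print(y, v)
```

In this example the buyer pays more than the reference cost, so the perceived utility falls in the loss domain and is negative.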

The problem of matching demand and production of heterogeneous prosumers is formalized as follows.


maximize	$f(y) : \sum_{j \in B} v(y_{j})$		(2)
s.t.	$\sum_{j \in B} (1 + l_{ij}) x_{ij} w_{j} \leq r_{i},$	$\forall i$	(2a)
	$\sum_{i \in S} x_{ij} \leq 1,$	$\forall j$	(2b)
	$x_{ij} = 0$ if $l_{ij} \geq l_{max},$	$\forall i, j$	(2c)
	$ρ_{gb} \leq ρ_{i}, ρ_{j} \leq ρ_{gs},$	$\forall i, j$	(2d)
	$x_{ij} \in [0, 1],$	$\forall i, j$	(2e)

The problem maximizes the sum of the perceived utility of buyers in Eq. (2). The constraint in Eq. (2a) ensures that the energy sold by each seller, including the losses in the electric lines, does not exceed that seller’s excess generation. The constraint in Eq. (2b) ensures that the energy demand of each buyer is not exceeded, while constraint (2c) prevents any exchange between a seller and a buyer whose loss reaches the threshold $l_{max}$. Finally, constraint (2d) bounds the energy prices between the buying and selling prices of the grid.
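The allocation constraints can be checked directly for a given candidate. The sketch below is an illustration under assumptions: the name `is_feasible`, the nested-list layout `x[i][j]` (seller `i`, buyer `j`), and the tolerance are hypothetical, and the price bound (2d) is left out because prices are treated as fixed within a trading period.

```python
def is_feasible(x, w, r, loss, l_max, tol=1e-9):
    """Check constraints (2a), (2b), (2c), and (2e) for an allocation x[i][j]."""
    n_sellers, n_buyers = len(r), len(w)
    for i in range(n_sellers):
        # (2a): energy drawn from seller i, including line losses, within r_i.
        sent = sum((1.0 + loss[i][j]) * x[i][j] * w[j] for j in range(n_buyers))
        if sent > r[i] + tol:
            return False
        for j in range(n_buyers):
            if not 0.0 <= x[i][j] <= 1.0:                 # (2e)
                return False
            if loss[i][j] >= l_max and x[i][j] != 0.0:    # (2c)
                return False
    for j in range(n_buyers):
        # (2b): the fractions bought by buyer j cannot exceed the full demand.
        if sum(x[i][j] for i in range(n_sellers)) > 1.0 + tol:
            return False
    return True


# Hypothetical two-seller, two-buyer instance.
w = [4.0, 6.0]                       # buyer demands (kWh)
r = [5.0, 3.0]                       # seller excess generation (kWh)
loss = [[0.01, 0.03], [0.02, 0.01]]  # loss fractions l_ij
print(is_feasible([[0.5, 0.0], [0.25, 0.3]], w, r, loss, l_max=0.025))
print(is_feasible([[0.5, 0.1], [0.25, 0.3]], w, r, loss, l_max=0.025))
```

The second allocation trades over the link between seller 0 and buyer 1, whose loss of 3% is above the threshold, so it violates (2c).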

Note that the problem in Eq. (2) is a non-linear, non-convex optimization problem. Hence, we propose a heuristic based on the Differential Evolution Algorithm (DEA) [23], described in the following section. Additionally, in the above problem, the selling price is considered fixed for a trading period. However, the reference price $ρ_{j}$ of buyer $j$ is a personal value which may under- or over-estimate the competitiveness of the market. In order to maximize the sellers’ perceived objectives through prospect theory, we resort to the risk-sensitive Q-learning algorithm [21].

Input: set of buyers $B$, sellers $S$, fitness function $f(\cdot)$, max iterations $G_{max}$, population size $NP$, crossover probability $CR$, differential weight $F$
Output: best identified feasible solution $x^{*}$

1 Update set of buyers $B$ and sellers $S$; $count = 0$;
2 Generate initial population $X = \{x_{k} \mid k = 1, \dots, NP\}$;
3 while $count < G_{max}$ do
4     for each $x_{k} \in X$ do
5         Choose $3$ different vectors $x_{a}, x_{b}, x_{c} \in X$ at random and $R \sim U(1, |S| \times |B|)$;
6         Create mutated solution $\bar{x}_{k} = x_{k}$;
          /* Mutation and Crossover */
7         for each $i \in \{1, \dots, |S|\}$, $j \in \{1, \dots, |B|\}$ do
8             Select $u \sim U(0, 1)$;
9             if $u < CR$ or $(i \times j) = R$ then
10                $\bar{x}_{ij}^{(k)} = x_{ij}^{(a)} + F \times (x_{ij}^{(b)} - x_{ij}^{(c)})$;
11                $\bar{x}_{ij}^{(k)} = \min(1, \max(0, \bar{x}_{ij}^{(k)}))$;
12        end for
          /* Check Constraints */
13        $\forall i, j$, if $l_{ij} \geq l_{max}$ then $\bar{x}_{ij} = 0$;
14        $\forall i$, if $\sum_{j} (1 + l_{ij}) \bar{x}_{ij} w_{j} > r_{i}$ then $\bar{x}_{ij} = \frac{\bar{x}_{ij} r_{i}}{\sum_{\hat{j}} (1 + l_{i\hat{j}}) \bar{x}_{i\hat{j}} w_{\hat{j}}}$;
15        $\forall j$, if $\sum_{i} \bar{x}_{ij} > 1$ then $\bar{x}_{ij} = \frac{\bar{x}_{ij}}{\sum_{\hat{i}} \bar{x}_{\hat{i}j}}$;
          /* Compare fitness */
16        if $f(\bar{x}_{k}) > f(x_{k})$ then
17            $X = (X \setminus \{x_{k}\}) \cup \{\bar{x}_{k}\}$;
18    end for
19    $count = count + 1$;
   end while
   /* Find the best solution to execute trading */
20 Let $x^{*} = \arg\max_{x_{k} \in X} f(x_{k})$;
21 Execute transactions for each prosumer according to $x^{*}$;

Algorithm 1 DEbATE

III The DEbATE and PQR Heuristics

In this section, we describe the Differential Evolution-based Algorithm for Trading Energy (DEbATE) (Alg. 1), designed for the problem presented in Section II, and the Pricing mechanism with Q-learning and Risk-sensitivity (PQR), designed to dynamically adjust the sellers’ prices.

III-A DEbATE

$DEbATE$ is executed at each trading period (e.g., 12 hours) to solve the non-linear optimization problem in Eq. (2). It uses differential evolution to determine the amount of energy to be traded between prosumers that maximizes the perceived utility of buyers. $DEbATE$ initially updates the list of buyers ($B$) and sellers ($S$) based on the expected production and consumption for the current trading period. These can be predicted accurately with recent approaches [14, 5]. The differential evolution-based optimization begins on line $2$, where an initial population $X$ of size $NP$ is generated. An element $x_{k} \in X$, with $k = 1, 2, \dots, NP$, is a candidate solution vector of variables $x_{ij}$, each representing the fraction of buyer $j$’s demand to be supplied by seller $i$. These variables correspond to the decision variables of our optimization problem.

The while loop (lines $3$-$19$) is the differential evolution loop, which searches for a solution to the non-linear optimization problem with Eq. (2) as the fitness function. The loop is executed for $G_{max}$ iterations. At each iteration, for each candidate solution $x_{k} \in X$, the algorithm creates a mutated solution $\bar{x}_{k}$, initialized as $\bar{x}_{k} = x_{k}$. The mutated solution is subsequently updated through mutation and crossover with $3$ random candidates $x_{a}, x_{b}, x_{c} \in X$ (line $5$). A value $R \in [1, |S| \times |B|]$ is also selected at random; it is used in the following for loop to ensure that at least one component is mutated. The for loop in line $7$ iterates over the components (dimensions in evolutionary terms) of $\bar{x}_{k}$. During each iteration, a value $u \in [0, 1]$ is sampled uniformly at random (line $8$). Subsequently, component $ij$ of $\bar{x}_{k}$ is mutated if $u < CR$, i.e., with crossover probability $CR$ (line $9$). The mutation occurs irrespective of $u$ if $(i \times j) = R$. A mutation combines the corresponding components of $x_{a}$, $x_{b}$, and $x_{c}$ with the differential weight parameter $F \in [0, 2]$, as in line $10$. The mutated component $\bar{x}_{ij}^{(k)}$ is then clipped to $[0, 1]$ to satisfy the constraint in Eq. (2e) (line $11$).

After the mutated solution is finalized, it is checked, and adjusted if needed, to meet the constraints in Eqs. (2a)-(2c) of the optimization problem. Specifically, line $13$ ensures that no exchange occurs (i.e., $\bar{x}_{ij}^{(k)} = 0$) between users having a loss of at least $l_{max}$. Lines $14$-$15$ ensure that the production of each seller and the demand of each buyer are not exceeded, respectively, by proportionally scaling down the affected components. Finally, in line $16$, the fitness $f(\cdot)$ of the mutated solution $\bar{x}_{k}$ is compared against that of the original candidate solution $x_{k}$. If $f(\bar{x}_{k}) > f(x_{k})$, then $\bar{x}_{k}$ replaces $x_{k}$ in the set of candidate solutions $X$. At the end of the while loop, $DEbATE$ selects the best solution $x^{*}$ in $X$ (line $20$) and executes the transactions accordingly (line $21$). In the following Theorem 1, we show that $DEbATE$ has polynomial complexity and is hence computationally efficient.

Theorem 1.

The complexity of the $D E b A T E$ algorithm is $O (G_{m a x} \times N P \times | S | | B |)$ .

Proof.

The complexity is dominated by the while loop (lines $3$-$19$), which is executed $G_{max}$ times. Within this loop, the for loop (lines $4$-$17$) performs $|X| = NP$ iterations. In each iteration, the inner for loop (lines $7$-$12$) iterates over the sets $S$ and $B$ and only contains constant-time operations. Similarly, checking the constraints (lines $13$-$15$) requires iterating over the same sets. Finally, calculating the function $f(\cdot)$ (line $16$) has cost $|B|$. Overall, the complexity is $O(G_{max} \times NP \times (|S| |B| + 3 |S| |B| + |B|)) = O(G_{max} \times NP \times |S| |B|)$. ∎
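The following Python sketch illustrates one generation of the loop in Alg. 1 (lines 4-17), under several stated assumptions. Candidates are nested lists `x[i][j]`; the forced-mutation component is identified by a flattened index `i * n_b + j` instead of the product $(i \times j)$; the three donor vectors are drawn from candidates other than $x_{k}$, so `NP` must be at least 4; and the demo fitness uses the same illustrative PT parameters for every buyer, whereas the paper uses heterogeneous ones. All function names are hypothetical.

```python
import random


def repair(x, w, r, loss, l_max):
    """Adjust a candidate in place to meet constraints (2a)-(2c), as in lines 13-15."""
    n_s, n_b = len(r), len(w)
    for i in range(n_s):
        for j in range(n_b):
            if loss[i][j] >= l_max:          # (2c): no trade over lossy links
                x[i][j] = 0.0
    for i in range(n_s):                     # (2a): scale down to seller capacity
        sent = sum((1.0 + loss[i][j]) * x[i][j] * w[j] for j in range(n_b))
        if sent > r[i]:
            for j in range(n_b):
                x[i][j] *= r[i] / sent
    for j in range(n_b):                     # (2b): scale down to buyer demand
        total = sum(x[i][j] for i in range(n_s))
        if total > 1.0:
            for i in range(n_s):
                x[i][j] /= total
    return x


def make_fitness(w, seller_prices, rho_ref, rho_gs, k=2.25, zeta=0.88):
    """Build f(.) of Eq. (2); k and zeta are shared by all buyers for brevity."""
    def fitness(x):
        total = 0.0
        for j, w_j in enumerate(w):
            share = sum(x[i][j] for i in range(len(x)))
            y = sum(seller_prices[i] * x[i][j] * w_j for i in range(len(x)))
            y += rho_gs * (1.0 - share) * w_j
            d = rho_ref[j] * w_j - y         # positive in the gain domain
            total += k * d ** zeta if d > 0 else -k * (-d) ** zeta
        return total
    return fitness


def debate_generation(pop, fitness, w, r, loss, l_max, CR=0.9, F=0.8, rng=random):
    """Run one pass over the population: mutate, repair, and keep improvements."""
    n_s, n_b = len(r), len(w)
    for k in range(len(pop)):
        others = [idx for idx in range(len(pop)) if idx != k]
        a, b, c = rng.sample(others, 3)
        forced = rng.randrange(n_s * n_b)    # component that always mutates
        trial = [row[:] for row in pop[k]]
        for i in range(n_s):
            for j in range(n_b):
                if rng.random() < CR or i * n_b + j == forced:
                    v = pop[a][i][j] + F * (pop[b][i][j] - pop[c][i][j])
                    trial[i][j] = min(1.0, max(0.0, v))   # clip for (2e)
        repair(trial, w, r, loss, l_max)
        if fitness(trial) > fitness(pop[k]):
            pop[k] = trial
    return pop
```

Because both scaling steps in `repair` only shrink components, applying the buyer-side scaling after the seller-side scaling cannot reintroduce a violation of (2a). A full run would call `debate_generation` $G_{max}$ times and then pick the candidate with the highest fitness.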

/* Pricing with Risk-sensitive Q-learning */
1 Collect transaction information for each prosumer from $DEbATE$ (Alg. 1) for the current timestep $t$;
2 for each $i \in S$ do
3     Select an action $a \in \{+δ, -δ, 0\}$ based on exploration and exploitation;
4     $s = ρ_{i}$; $s_{new} = s + a$; $R_{i} = (ρ_{i} + a) \sum_{j \in B} x_{ij} w_{j}$;
5     Update $Q(s, a)$ as in Eq. (3);
6     $ρ_{i} = s_{new}$;
7     Send information on updated price $ρ_{i}$ to seller $i$;
8 end for

Algorithm 2 PQR

III-B PQR

After $DEbATE$ determines the solution to the energy allocation problem, the selling price of each seller is updated through the $PQR$ algorithm. In order to learn the optimal selling price dynamically over time, we model the sellers as independent learning agents. Note that, to preserve privacy and avoid conflicts between prosumers, these agents do not have access to information about other sellers or buyers. The state space in the Q-learning formulation consists of the prices between the grid buying price ($ρ_{gb}$) and the grid selling price ($ρ_{gs}$), discretized by a step size $δ$, i.e., $ρ_{i} \in \{ρ_{gb}, ρ_{gb} + δ, ρ_{gb} + 2δ, \dots, ρ_{gb} + (\frac{ρ_{gs} - ρ_{gb}}{δ} - 1) δ, ρ_{gs}\}$.

The action space consists of a price-increasing action, a price-decreasing action, and a no-change action, i.e., $a \in \{+δ, -δ, 0\}$, where $δ$ is the amount by which the price is increased or decreased. The reward of seller $i$ is the total revenue generated in the current trading period, i.e., $R_{i} = (ρ_{i} + a) \sum_{j \in B} x_{ij} w_{j}$. For updating Q-values, we modify the approach proposed in [21] by considering the following Q-learning update rule, which includes the PT-based perceived utility of sellers.

Q^{(n e w)} (s, a) = Q^{(o l d)} (s, a) + α v (y_{i})

(3)

v(y_{i}) = \begin{cases} k_{+,i} (y_{i})^{ζ_{+,i}}, & y_{i} > 0 \\ - k_{-,i} (- y_{i})^{ζ_{-,i}}, & y_{i} \leq 0 \end{cases}

(4)

where $y_{i} = R_{i} + γ \max_{a} Q(s_{new}, a) - Q(s, a)$ is the Temporal Difference (TD) error of the $i$-th seller for the current iteration, and $v(y_{i})$ is a transformation of the TD error that captures each seller’s personalized perceived utility of loss and gain. $α$ is the learning rate for updating Q-values in Eq. (3). The action is selected based on an $ϵ$-greedy exploration-exploitation strategy [24]. Specifically, $ϵ$ is the probability of exploration and is initially set to $1$. It is then decreased over time using an $ϵ$-decay value, as the system learns the optimal policy. Based on the selected action, the new selling price, reward, and Q-value are updated as per Eqs. (3) and (4). The updated selling price is then sent to the respective seller $i$ for the next trading period.
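A single seller's update from Eqs. (3) and (4) can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: states are indices into a list of discretized prices, actions that would leave the price grid are simply not offered, the discount factor `gamma=0.9` is a placeholder because the text does not report it, and the PT parameters in `pt` are illustrative. The names `pqr_step`, `pt_transform`, and `valid_actions` are hypothetical.

```python
import random
from collections import defaultdict

ACTIONS = (-1, 0, 1)  # move one grid step down, keep the price, or move one step up


def pt_transform(td, k_plus, k_minus, zeta_plus, zeta_minus):
    """Eq. (4): perceived utility of a TD error."""
    if td > 0:
        return k_plus * td ** zeta_plus
    return -k_minus * (-td) ** zeta_minus


def valid_actions(state, n_states):
    """Assumption: actions that would leave the price grid are not offered."""
    return [a for a in ACTIONS if 0 <= state + a < n_states]


def pqr_step(Q, state, traded_energy, prices, epsilon,
             alpha=1e-4, gamma=0.9, pt=(2.25, 2.25, 0.88, 0.88), rng=random):
    """One epsilon-greedy action and one risk-sensitive Q update for a seller."""
    n = len(prices)
    options = valid_actions(state, n)
    if rng.random() < epsilon:
        action = rng.choice(options)                        # explore
    else:
        action = max(options, key=lambda a: Q[(state, a)])  # exploit
    new_state = state + action
    # Reward: revenue at the new price for the energy allocated by DEbATE.
    reward = prices[new_state] * traded_energy
    best_next = max(Q[(new_state, a)] for a in valid_actions(new_state, n))
    td = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * pt_transform(td, *pt)     # Eq. (3)
    return new_state, reward


# Hypothetical usage: three price levels, seller currently at the middle one.
Q = defaultdict(float)
state, reward = pqr_step(Q, state=1, traded_energy=10.0,
                         prices=[0.06, 0.07, 0.08], epsilon=1.0)
```

The caller would multiply `epsilon` by the decay value after each trading period and pass the energy allocated to the seller, $\sum_{j \in B} x_{ij} w_{j}$, as `traded_energy`.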

The system runs both $D E b A T E$ and $P Q R$ sequentially at every trading period. Input of $D E b A T E$ is updated based on the prices computed by $P Q R$ . $P Q R$ then takes as input the reward from executing energy transactions by $D E b A T E$ .

IV Experimental Results

IV-A Experimental Setup

Fig. 2: Normalized objective value vs. number of iterations.

In this section, we evaluate the performance of DEbATE and PQR, hereafter jointly referred to as $DEbATE-PQR$, against a recent state-of-the-art approach referred to as $Rule$ [3]. $Rule$ allocates energy using a greedy heuristic that assigns the cheapest sellers to buyers based on their registration order in the system, while the final price of each transaction follows mid-market pricing, i.e., the midpoint of the seller’s and the buyer’s asking prices. We consider a system with $40$ prosumers, split evenly between buyers and sellers. This is considered a representative number of prosumers in a microgrid or a set of houses supplied by a single distribution transformer. We use a realistic dataset for buyers’ energy consumption obtained from [18]. Similarly, we consider sellers equipped with $4$ kW rooftop solar panels located in Lexington, Kentucky, USA. The energy generated is estimated using NREL’s PVWatts Calculator [22], given the solar irradiance in Lexington and the size of the solar panels. Losses are assigned uniformly at random from the set $\{1\%, 2\%, 3\%, 4\%\}$, and the maximum loss threshold is $l_{max} = 2.5\%$.

We assume that prosumers complete a survey before joining the system to estimate their individual prospect theory parameters, similar to [19, 4, 10], and we use the realistic prospect theory parameters determined in those studies. Specifically, for each individual prosumer we sample the risk-aversion parameter for gains $ζ_{+} \in [0.60, 0.88]$, the risk-seeking parameter for losses $ζ_{-} \in [0.52, 1.0]$, and the loss-aversion parameters for gain and loss $k_{+}, k_{-} \in [2.10, 2.61]$. The grid energy buying price is set to $ρ_{gb} = \$0.06$ and the selling price to $ρ_{gs} = \$0.12$. The reference price for each seller is initially sampled at random from the range $[0.09, 0.12]$. It is then updated using $PQR$ at each iteration. The reference price for each buyer is selected in the range $[0.06, 0.10]$ and considered static for the duration of the experiments, which is $365$ days. The parameters of the $PQR$ algorithm are set as follows: learning rate $α = 10^{-4}$, step size for discretizing the state space $δ = \$0.001$, and $ϵ$-decay $= 0.965$.
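To make the size of the learning problem concrete, the short sketch below builds the discretized price grid implied by these settings and counts how many decay steps it takes for the exploration probability to fall below 1%. Two points are assumptions for illustration only: that $ϵ$ is multiplied by the decay value once per trading period, and the 1% threshold itself.

```python
rho_gb, rho_gs, delta = 0.06, 0.12, 0.001   # grid prices and step size from the setup

# Index the grid with integers to avoid accumulating floating-point error.
n_steps = round((rho_gs - rho_gb) / delta)
prices = [round(rho_gb + n * delta, 3) for n in range(n_steps + 1)]
print(len(prices), prices[0], prices[-1])   # 61 0.06 0.12

# Assumption: epsilon starts at 1 and is multiplied by the decay once per trading period.
epsilon, decay, periods = 1.0, 0.965, 0
while epsilon >= 0.01:
    epsilon *= decay
    periods += 1
print(periods)  # number of trading periods until epsilon drops below 0.01
```

With three actions per state, each seller's Q-table therefore has at most $61 \times 3 = 183$ entries.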

IV-B Results

We consider several experimental scenarios and performance metrics, as discussed in the following.

Experimental Scenario 1: We first run experiments to study the convergence of DEbATE. We consider different system sizes by scaling the number of sellers and buyers. Fig. 2 shows the normalized objective value as a function of the number of iterations using a population size $NP = 20$. The plot, averaged over 10 runs, shows that $10{,}000$ iterations are sufficient for the algorithm to converge in the considered settings. As a result, in the following scenarios we set $G_{max} = 10{,}000$ and the population size $NP = 20$.

Experimental Scenario 2: In the second experimental scenario, we study the performance of the considered approaches over time. Two performance metrics are considered, namely the buyers’ objective value and the sellers’ cumulative reward. These are represented in Figs. 5 and 5, respectively, with a moving average of $10$ days. In these experiments we consider $15$ buyers and $15$ sellers. The benefits of $DEbATE-PQR$ over $Rule$ are more prominent from April through October, when energy demand and production are higher. The greedy nature of $Rule$ penalizes the quality of the resulting matching, significantly reducing the buyers’ perceived value. Note that the buyers’ objective values are negative because buyers are paying higher prices than their reference purchase price. Therefore, transactions are seen as losses from a prospect theory perspective. Nevertheless, our approach optimizes the energy assignment to maximize the buyers’ perceived value. Additionally, our approach is able to generate higher rewards than $Rule$ by dynamically learning the prices for sellers through the $PQR$ algorithm. The sellers’ reward decreases after mid-September for both approaches due to the reduced energy production during winter.

Fig. 6: Obj. values for buyer vs. network size.

We further study the performance over time by considering the evolution of average and individual sellers’ prices. We consider a smaller system of $5$ sellers and $5$ buyers for ease of representation of the results. Fig. 5 shows the individual prices. $D E b A T E - P Q R$ is able to learn and adjust the price over time to improve the buyers’ perceived value considering their competitiveness. The competitiveness is a function of a buyer’s reference price, their production, and their location in the system (e.g., loss w.r.t. sellers). As a result, our approach is able to improve the perception of both buyers and sellers while ensuring the competitiveness of the market.

Experimental Scenario 3: In this scenario we test the scalability with respect to the system size. Specifically, we increase the system proportionately from $5$ sellers and $5$ buyers to $20$ sellers and $20$ buyers. Figs. 7-7 show the buyers’ total perceived value and the sellers’ reward, respectively, over a year. By considering the loss-averse and risk-seeking PT-value functions, $DEbATE-PQR$ achieves an increasing advantage as the system size increases compared to $Rule$, for both sellers and buyers. As a numerical example, $DEbATE-PQR$ achieves as much as a $26\%$ increase in buyers’ perceived value while ensuring a $7\%$ profit improvement for sellers.

V Concluding Remarks

In this paper, we bring together the concept of perceived utility from behavioral economics and reinforcement learning into the P2P energy trading scene. Unlike existing literature, we propose an automated and dynamic P2P energy trading problem that maximizes the perceived value for buyers while simultaneously learning the optimal selling price. Given the non-linear and non-convex nature of the problem, we propose a novel differential evolution-based metaheuristic algorithm, called $D E b A T E$ . $D E b A T E$ is paired with a prospect theory enhanced Q-learning algorithm, called $P Q R$ , to adjust the selling price over time. Results show the advantages of the proposed approaches with respect to a state of the art solution using real energy consumption and production data.

Acknowledgment

This work is supported by the NSF grant EPCN-1936131 and NSF CAREER grant CPS-1943035.

References

[1] V. Agate, A. R. Khamesi, S. Silvestri, and S. Gaglio (2020) Enabling peer-to-peer user-preference-aware energy sharing through reinforcement learning. In IEEE ICC.
[2] D. E. Agosto (2002) Bounded rationality and satisficing in young people’s web-based decision making. Journal of the American Society for Information Science and Technology 53 (1), pp. 16–27.
[3] M. I. Azim, S. Pourmousavi, W. Tushar, et al. (2019) Feasibility study of financial P2P energy trading in a grid-tied power network. In IEEE PESGM.
[4] V. Baláž, V. Bačová, E. Drobná, et al. (2013) Testing prospect theory parameters. Ekonomicky časopis 61.
[5] E. Casella, E. Sudduth, and S. Silvestri (2022) Dissecting the problem of individual home power consumption prediction using machine learning. In IEEE SMARTCOMP.
[6] D. Contu, E. Strazzera, and S. Mourato (2016) Modeling individual preferences for energy sources: the case of IV generation nuclear energy in Italy. Ecological Economics 127, pp. 37–58.
[7] P. E. Earl (2016) Bounded rationality in the digital age. In Minds, Models and Milieux, pp. 253–271.
[8] G. El Rahi, S. R. Etesami, W. Saad, N. B. Mandayam, and H. V. Poor (2017) Managing price uncertainty in prosumer-centric energy trading: a prospect-theoretic Stackelberg game approach. IEEE Transactions on Smart Grid 10 (1), pp. 702–713.
[9] G. El Rahi, W. Saad, A. Glass, et al. (2016) Prospect theory for prosumer-centric energy trading in the smart grid. In IEEE PES ISGT.
[10] C. R. Fox and R. A. Poldrack (2009) Prospect theory and the brain. In Neuroeconomics, pp. 145–173.
[11] G. Gigerenzer and R. Selten (2002) Bounded rationality: the adaptive toolbox. MIT Press.
[12] IEA org. https://www.iea.org/reports/electricity-information-overview/
[13] D. Kahneman and A. Tversky (2013) Prospect theory: an analysis of decision under risk. In Handbook of the Fundamentals of Financial Decision Making: Part I, pp. 99–127.
[14] W. Kong, Z. Y. Dong, Y. Jia, D. J. Hill, Y. Xu, and Y. Zhang (2017) Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Transactions on Smart Grid 10 (1), pp. 841–851.
[15] M. Nasimifar, V. Vahidinasab, and M. S. Ghazizadeh (2019) A peer-to-peer electricity marketplace for simultaneous congestion management and power loss reduction. In IEEE Smart Grid Conference.
[16] Y. Parag and B. Sovacool (2016) Electricity market design for the prosumer era. Nature Energy 1, pp. 16032.
[17] A. Paudel, L. Sampath, J. Yang, and H. B. Gooi (2020) Peer-to-peer energy trading in smart grid considering power losses and network fees. IEEE Transactions on Smart Grid 11 (6), pp. 4727–4737.
[18] (Website)
[19] M. O. Rieger, M. Wang, and T. Hens (2017) Estimating cumulative prospect theory parameters from an international survey. Theory and Decision 82 (4), pp. 567–596.
[20] W. Saad, A. L. Glass, N. B. Mandayam, and H. V. Poor (2016) Toward a consumer-centric grid: a behavioral perspective. Proceedings of the IEEE 104 (4), pp. 865–882.
[21] Y. Shen, M. J. Tobia, T. Sommer, and K. Obermayer (2014) Risk-sensitive reinforcement learning. Neural Computation 26 (7).
[22] (Website)
[23] R. Storn and K. Price (1997) Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11 (4), pp. 341–359.
[24] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT Press.
[25] A. Timilsina, A. R. Khamesi, V. Agate, and S. Silvestri (2021) A reinforcement learning approach for user preference-aware energy sharing systems. IEEE Transactions on Green Communications and Networking 5 (3), pp. 1138–1153.
[26] W. Tushar, T. K. Saha, C. Yuen, et al. (2020) Peer-to-peer trading in electricity networks: an overview. IEEE Transactions on Smart Grid.
[27] W. Tushar, T. K. Saha, C. Yuen, P. Liddell, R. Bean, and H. V. Poor (2018) Peer-to-peer energy trading with sustainable user participation: a game theoretic approach. IEEE Access 6, pp. 62932–62943.
[28] W. Tushar, T. K. Saha, C. Yuen, T. Morstyn, H. V. Poor, R. Bean, et al. (2019) Grid influenced peer-to-peer energy trading. IEEE Transactions on Smart Grid 11 (2), pp. 1407–1418.
[29] W. Tushar, C. Yuen, H. Mohsenian-Rad, T. Saha, H. V. Poor, and K. L. Wood (2018) Transforming energy networks via peer-to-peer energy trading: the potential of game-theoretic approaches. IEEE Signal Processing Magazine 35 (4), pp. 90–111.
[30] Y. Wang, L. Zhang, Q. Ding, and K. Zhang (2020) Prospect theory-based optimal bidding model of a prosumer in the power market. IEEE Access 8, pp. 137063–137073.
[31] Y. Yao, C. Gao, T. Chen, et al. (2021) Distributed electric energy trading model and strategy analysis based on prospect theory. International Journal of Electrical Power & Energy Systems 131, pp. 106865.
[32] T. Zhu, Z. Huang, A. Sharma, J. Su, D. Irwin, A. Mishra, D. Menasche, and P. Shenoy (2013) Sharing renewable energy in smart microgrids. In ACM/IEEE ICCPS.