ACTOR-CRITIC REINFORCEMENT LEARNING FOR ENERGY OPTIMIZATION IN HYBRID PRODUCTION ENVIRONMENTS

This paper presents a centralized approach for energy optimization in large-scale industrial production systems based on an actor-critic reinforcement learning (ACRL) framework. The objective of the on-line capable self-learning algorithm is to optimize the energy consumption of a production process while meeting certain manufacturing constraints such as a demanded throughput. Our centralized ACRL algorithm works with two artificial neural networks (ANN) for function approximation using Gaussian radial-basis functions (RBF), one for the critic and one for the actor. This kind of actor-critic design enables the handling of both discrete and continuous state and action spaces, which is essential for hybrid systems where discrete and continuous actuator behavior is combined. The ACRL algorithm is validated by way of example on a dynamic simulation model of a bulk good system for the task of supplying bulk good to a subsequent dosing section while consuming as little energy as possible. The simulation results show the applicability and capability of our machine learning (ML) approach for energy optimization in hybrid production environments.


INTRODUCTION
Energy has become a highly valued and much discussed resource in recent years. The main reason for this trend is the shift from environmentally polluting energy production to a green energy supply that focuses on renewable energy sources in order to sustainably reduce the emissions of detrimental greenhouse gases. This reorganization of the energy sector entails risks and costs which partly have to be borne by the consumer, e.g. the industrial production sector. Therefore, energy-efficient handling and facility operation is demanded, not least to stay competitive on the market. These circumstances especially affect large-scale industrial plants with an extremely high number of energy consumers (like pumps, valves, conveyors etc.), as is generally the case in the process industry and the basic materials industry [1]. In this case, even minor energy savings can lead to decreasing emissions and costs [2,3].
Generally, reinforcement learning (RL) is a goal-oriented learning technique which learns the optimal policy by (long-term) rewarded trial-and-error interactions with the environment, imitating the natural learning behavior of a child or an animal. Machine learning (ML) techniques like RL became particularly popular with the success in playing the game of Go [4,5], demonstrating super-human performance of the technical system. Considering practical real-life applications, the main benefits of RL methods are their on-line capability and their ability to cope with uncertainties and changes in system dynamics [6], which makes the framework especially attractive for (technical) problems that are hard to describe analytically.
In the literature, actor-critic reinforcement learning (ACRL) methods that combine the strengths of actor-only and critic-only RL methods [7,8], i.e. merge policy-based with value-based methods, are nowadays more present than ever before. They address various research directions and domains, such as spoken dialogue systems (SDS), i.e. task-completion dialogue policy learning with an adversarial advantage actor-critic (A2C) approach [9] or ACRL with experience replay (ACER) for dialogue systems with large action spaces [10], a decentralized collaborative multi-agent RL (MARL) approach based on ACRL methods especially for continuous state and action spaces [11], and ACRL for optimal control of multiple-model discrete-time systems [12], to name just a few. In particular, ACRL with suitably chosen Gaussian radial-basis function neural networks (RBFNNs) as function approximators is efficient and notably suitable for continuous domains or hybrid systems [6,13-15], as found in the process industry or manufacturing. The ability to cope with large continuous action spaces within a hybrid system environment is one of the most significant benefits of an RBFNN-based ACRL structure.
Other important ANN types in artificial intelligence research are spiking neural networks (SNNs) [16-18] and recurrent neural networks (RNNs) [19]. SNNs operate on sequences of spikes and have their particular strengths in applications that require very fast processing of huge amounts of data [18], e.g. in the field of robotics. A large list of engineering applications for SNNs in combination with different learning scenarios, e.g. RL, can be found in [17]. However, SNNs are still difficult to train because by their nature they are not back-propagation capable. In [19] an approximate dynamic programming (ADP) approach for the energy management of a microgrid based on deep RNN learning is proposed, which guarantees convergence while using linear models to approximate the value function.
RL in general, and recently deep RL using deep neural networks (DNN), has already been applied to energy management systems with special emphasis on distributed smart grids and microgrids [20,21], on electric vehicles [22,23] and on energy optimization in electric water heaters [24]. An example of using ANNs for the function approximation of a Q-function for RL in the context of energy optimization can be found in [25]. ACRL approaches in particular are presented in, e.g., [26] for adapting variable-speed wind turbine controllers to changing wind conditions with continuous-valued state and action spaces, or in [27], where a transfer actor-critic learning framework for energy-efficient radio access networks is proposed. In contrast, the adequate application of ACRL techniques to energy optimization in the manufacturing and process industry domain with its inherent challenges still leaves open research questions. The learning set-up with an appropriately pre-selected state space and action set, well-suited timings like the episode duration, hyperparameter tuning and the incorporation of process constraints are crucial and form the basis for successful learning behavior.
In this paper we present a centralized approach for energy optimization based on the ideas of ACRL with RBFNN function approximators, focusing on the challenges of the application to hybrid manufacturing systems. We give a detailed description of the learning set-up for the actor and the critic network used in our ACRL algorithm. Furthermore, we develop a bulk good process model of our physical laboratory testbed for co-simulation purposes, which serves as the application example for our approach. In relation to our exemplary plant, we define the Markov decision process (MDP) for the energy optimization problem, which can be scaled easily to larger systems. The results obtained show typical learning behavior and outperform the baseline model with regard to energy consumption and throughput rate. A preliminary version of this paper has been presented in [1].
This paper is organized as follows. In Section 2 we state the learning problem for energy management and optimization in manufacturing systems. Section 3 describes our ACRL-based approach using Gaussian RBFNNs for function approximation. In Section 5 we present an application example for energy optimization using ACRL with RBFNNs and discuss the results obtained from a simulation model of a laboratory bulk good system. Section 6 gives the conclusions and points toward further work.

PROBLEM STATEMENT
The considered general structure of the production environment is illustrated in the schematic of Fig. 1. As illustrated, we consider a distributed production process with a number of possibly different subsystems interacting with each other. The interaction is assumed to take place on the physical level by exchanging energy and material flows and on the cyber level by exchanging information and control signals. To this end, each subsystem has its own control system with sensing and monitoring devices to measure its production performance and a certain number of energy consumers like electrical drives, valves and compressors to actuate the subsystem. The considered energy consumers have either discrete behavior like DOL motors or on-off valves, continuous behavior like VSD drives, or hybrid behavior like, e.g., vacuum pumps. We consider different forms of energy consumption like electrical energy or instrument air. The energy consumption of the consumers is assumed to be continuously measured by suitable energy metering devices in the local control systems and then communicated to a centralized control system for further analysis.
After describing the general system set-up, we will now state the problem to be solved: We consider a distributed system with $k$ subsystems, indexed by $1 \le i \le k$, as illustrated in Fig. 1 for $k = 3$. The system dynamics of the $i$-th subsystem are given by [...]. Then, the optimization problem is stated as follows. Given a predefined production episode $t \in [0, T]$, find the optimal energy consumption $\min [...]$ (1)-(4), where $Y_s$ is the required performance over the considered production episode previously defined. Note that the previously scheduled production performance can depend either on the performance $y_i$ of each module or on the performance of only a subset of modules. This relation is formally modeled by the function $r$. For instance, in certain processes only the last subsystem's output is responsible for the overall performance while the other modules influence this output indirectly by suitable supply actions.
Some remarks on the previously defined problem are in order. The performance outputs $y_i(t)$ can be arbitrarily defined based on the given process objectives and available process measurements.
Examples include product concentrations in chemical plants, mass flows in bulk good plants or processing times in manufacturing plants.The length of the considered production episode is closely related to these requirements as performance parameters might only be accessible after some processing time.Typical examples include batch operations.Hence, the episode should be at least as long as the processing times.The number of samples per episode should be determined such that the important dynamics of the energy consumption and the process parameters are represented.Particularly, the processing times and operation points of actuators strongly influence the energy consumption of the overall system and need to be carefully examined.
Note, that the problem description mainly focuses on discrete and hybrid processes where operations and system behavior are not solely continuous.Such hybrid systems containing discrete and continuous dynamics and actuation are quite common in the process and basic materials industries due to discontinuous and delaying components like buffers, reactors or conveyors as well as on-off actuators and actuators with discontinuous actions.We will introduce an example of such a process in Sec. 5.

ACTOR-CRITIC REINFORCEMENT LEARNING: AN INTRODUCTION
In general, RL is a machine learning technique that is based on animal trial-and-error learning. The learner, also called agent, acts on its environment and learns from rewards gained from these interactions within a given time horizon called episode. Analytically, the environment can be formulated as a Markov decision process (MDP) with its states, the possible actions to take and the resulting rewards as an evaluation of the chosen actions. Formally, the MDP is described by the tuple $(\mathcal{S}, \mathcal{A}, P, R, p_0)$ with the set of states $\mathcal{S}$, the set of actions $\mathcal{A}$, the transition model $P$, the reward function $R$ and the initial state distribution $p_0$. The decision which action to take next given the current state depends on the agent's policy $\pi: \mathcal{S} \to \mathcal{A}$. The policy can be chosen based on the agent's past experiences or even randomly. The goal of the reinforcement learning problem is to find the optimal policy by interacting with the environment, i.e. the policy which results in the highest possible cumulated reward. Hence, we want to maximize the return, i.e. the cumulated discounted reward, where the discount factor $\gamma$ takes values between 0 and 1.
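As a minimal sketch of the discounted return mentioned above, the following Python function sums the rewards of one finite episode, weighting the reward at step $t$ by $\gamma^t$. The function name, the finite horizon and the exact indexing convention are assumptions made for illustration; the reward values are arbitrary.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one finite episode."""
    g = 0.0
    # Accumulate backwards so that each step is a single multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative reward sequence (not taken from the paper).
episode_rewards = [1.0, 0.0, -0.5, 2.0]
print(discounted_return(episode_rewards, gamma=0.9))
```

With $\gamma = 0$ only the immediate reward counts, while $\gamma = 1$ yields the plain sum of all rewards, so the discount factor controls how far-sighted the agent is.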
For our optimization problem stated in Sec. 2, we choose an actor-critic (AC) framework with artificial neural network (ANN) function approximators in order to obtain self-learning system behavior. By using a neural network approximation within our ACRL algorithm that learns the unknown system dynamics, this approach allows us to avoid a theoretically complicated analysis of the conditions for optimality. ACRL methods combine notions of policy iteration (PIT) with adaptive function approximation [8]. Compared to general Q-learning, the ACRL method has the advantages of reduced variance in function approximation, efficient computation in continuous domains and a high similarity to neural mechanisms in mammalian brains [28]. In contrast to the Q-learning approach in [29], where the state and action space has to be discretized, this approach can cope not only with a hybrid state space but also with a hybrid action space in which continuous and discrete behavior are merged. The fundamental idea of the ACRL method is the partition into a critic part for policy evaluation (PE) and an actor part for policy improvement (PI). In our algorithm we use two normalized RBFNNs, one as policy approximator within the actor and another one as state-action value function approximator within the critic. In this context, the critic evaluates the actor's policy using the SARSA($\lambda$) method, which updates the state-action value estimate and calculates a temporal difference (TD) error between the state-action values at the next and the current state. Independently of the critic's PE, the actor updates the currently followed policy according to its own assessment of the TD error with a second RBFNN. Fig. 2 gives a general overview of the ACRL algorithm structure.
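ACRL is usually introduced with policy gradient methods augmented by a suitable evaluation of the policy. In policy gradient methods, a class of parameterized randomized policies is defined. Then the gradient of the average reward with respect to the policy parameters $\vartheta$ is estimated from the observed states, actions and rewards. The policy is finally improved by adjusting its parameters in the direction of the estimated gradient. The average reward $J(\vartheta)$ is usually defined as [...]. Hence, the optimal parameters are obtained from $\vartheta^{*} = \arg\max_{\vartheta} J(\vartheta)$. By means of the policy gradient theorem [30], the gradient can be calculated as $\nabla_{\vartheta} J(\vartheta) = \mathbb{E}_{\pi_{\vartheta}}\left[\nabla_{\vartheta} \log \pi_{\vartheta}(a \mid s)\, A^{\pi}(s,a)\right]$, where $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$ is the advantage function, i.e. the difference between the state-action value function $Q^{\pi}(s,a)$ and the state value function $V^{\pi}(s)$. The update of the policy parameters is finally obtained by $\vartheta_{k+1} = \vartheta_{k} + \alpha \nabla_{\vartheta} J(\vartheta_{k})$ with the learning rate $\alpha$. Note that the functions $Q^{\pi}$ and $V^{\pi}$ determine the expected reward to be gained when starting in state $s$ and, in the case of $Q^{\pi}$, taking action $a$, and then following policy $\pi$.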
The functions $Q^{\pi}$, $V^{\pi}$ and the advantage function $A^{\pi}$ all allow a given policy to be evaluated and serve as critics during the policy learning. This ACRL variant is well known as advantage actor-critic (A2C) [31] and has even been extended to asynchronous advantage actor-critic (A3C) [32]. However, none of the above-mentioned functions is available during learning; they have to be estimated. Different approaches are possible, including Monte-Carlo (MC) methods or temporal difference (TD) learning. In this work, we use the well known [...] with a parameterization of the form $\theta^{T}\phi(s,a)$, where $\theta$ are the learning parameters and $\phi(s,a)$ are, in general, continuously differentiable nonlinear functions of the states and actions.
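A critic of the kind described in this section, i.e. SARSA($\lambda$) with a state-action value estimate that is linear in its parameters, could be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the function name, the use of accumulating eligibility traces and the feature vectors in the usage example are hypothetical.

```python
def sarsa_lambda_step(theta, z, phi, phi_next, reward, alpha, gamma, lam):
    """One SARSA(lambda) critic update for Q(s, a) = theta^T phi(s, a).

    theta    : critic parameters (list of floats)
    z        : eligibility trace, same length as theta
    phi      : features of the current state-action pair
    phi_next : features of the next state-action pair
    Returns the updated (theta, z) and the TD error delta.
    """
    q = sum(t * p for t, p in zip(theta, phi))
    q_next = sum(t * p for t, p in zip(theta, phi_next))
    # TD error between the value at the next and at the current pair.
    delta = reward + gamma * q_next - q
    # Accumulating trace (an assumption; replacing traces are also common).
    z = [gamma * lam * zi + pi for zi, pi in zip(z, phi)]
    theta = [ti + alpha * delta * zi for ti, zi in zip(theta, z)]
    return theta, z, delta


# Hypothetical two-feature example.
theta, z = [0.0, 0.0], [0.0, 0.0]
theta, z, delta = sarsa_lambda_step(
    theta, z, phi=[1.0, 0.0], phi_next=[0.0, 1.0],
    reward=1.0, alpha=0.1, gamma=0.9, lam=0.9,
)
print(theta, z, delta)
```

The returned TD error is the quantity that an actor would use as its learning signal, which is why the function hands it back together with the updated parameters.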

RADIAL-BASIS FUNCTION NEURAL NETWORKS FOR FUNCTION APPROXIMATION IN A2C
In order to represent hybrid systems, we use radial-basis function neural networks (RBFNN) for function approximation within the critic and the actor, which should be chosen in a certain interdependency [7]. A simple RBFNN generally consists of an input layer, a hidden neural layer with RBFs and an output layer with linear neurons whose inputs are weighted. The advantage of RBFs for our application is the possibility to use a locally limited activation function (radial function) like the Gaussian function with special approximation properties. The Gaussian function is defined as $\phi_j(x) = \exp\left(-\frac{\lVert x - c_j \rVert^{2}}{2\sigma_j^{2}}\right)$. Hence, the weighted output function of the network is calculated as $f(x) = \sum_{j=1}^{L} w_j \phi_j(x)$, where $w_j$ are the weights, $L$ is the number of basis functions, $c_j$ is the mean and $\sigma_j^{2}$ the variance of the $j$-th basis function.
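The forward pass of such a network can be sketched in a few lines of Python. The sketch assumes isotropic Gaussian basis functions with one scalar variance each and returns the plain weighted output together with its normalized variant; all names and the numerical values in the usage example are hypothetical.

```python
import math


def gaussian_rbf(x, center, variance):
    """Gaussian basis function exp(-||x - c||^2 / (2 * variance))."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * variance))


def rbf_outputs(x, centers, variances, weights):
    """Return (weighted output, normalized output) of a single-output RBF net."""
    acts = [gaussian_rbf(x, c, v) for c, v in zip(centers, variances)]
    weighted = sum(w * a for w, a in zip(weights, acts))
    # Normalization divides by the summed activations; this sum can underflow
    # to zero if x lies very far from every center.
    normalized = weighted / sum(acts)
    return weighted, normalized


# Hypothetical one-dimensional input with two basis functions.
print(rbf_outputs([0.5], centers=[[0.0], [1.0]],
                  variances=[1.0, 1.0], weights=[1.0, 3.0]))
```

The normalized output is a convex combination of the weights, so it always stays between the smallest and the largest weight, whereas the plain weighted output decays towards zero away from the centers.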
Hence, the normalized output, which improves accuracy, can be written as $\bar{f}(x) = \frac{\sum_{j=1}^{L} w_j \phi_j(x)}{\sum_{j=1}^{L} \phi_j(x)}$. For the sake of simplicity regarding a future PLC implementation, we reduce the learning task of the RBF network to learning the weights $w_j$. Finally, the resulting ACRL algorithm executes as follows:
1. Initialize the learning parameters $\theta$, $\vartheta$, $z$ to zero and choose the first action $a_1$.
2. Execute the system using the chosen action $a$ and observe state $s_t$ and reward $r_t$.
Note that by using the ideas of natural actor-critic (NAC) algorithms [30], the update law of the policy parameters can be further simplified to [...]. In Step 3 of the above algorithm, the actions have to be drawn from the policy distribution, which has not been defined so far. As we deal with hybrid action spaces, i.e. both discrete and continuous actions, the distribution has to be chosen differently for the two classes of actions. To this end, we split the action set $\mathcal{A}$ into a discrete and a continuous part [...]. Then, we draw the discrete and continuous actions independently from corresponding distributions. In the discrete case, we apply Gibbs sampling using the softmax function [...]. In the continuous case, we use the multivariate Gaussian distribution [...] with positive definite matrix $\Sigma$, often chosen to [...]
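Drawing a hybrid action as described above can be illustrated with the following Python sketch. It assumes that the actor supplies one preference value per discrete action and a mean per continuous action, and it uses a diagonal covariance matrix, i.e. independent Gaussian components; this is a simplifying assumption, as the general case only requires a positive definite matrix. All names and numbers are hypothetical.

```python
import math
import random


def softmax(preferences):
    """Numerically stable softmax over a list of action preferences."""
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def sample_hybrid_action(discrete_prefs, cont_mean, cont_std, rng):
    """Draw a discrete action index and a continuous action vector independently."""
    probs = softmax(discrete_prefs)
    # Discrete part: sample an index according to the softmax probabilities.
    d = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Continuous part: independent Gaussians (diagonal covariance assumed).
    c = [rng.gauss(mu, sd) for mu, sd in zip(cont_mean, cont_std)]
    return d, c


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
print(sample_hybrid_action([0.2, 1.5], cont_mean=[0.6], cont_std=[0.1], rng=rng))
```

In such a scheme the spread of the Gaussian and the sharpness of the softmax govern how much the agent explores, so both would typically be reduced as learning progresses.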

LABORATORY TESTBED
After introducing the general ACRL approach, we will now focus on the application to a laboratory testbed as schematically illustrated in Fig. 3. As depicted, the testbed consists of four interacting modules forming a bulk good handling system. Modules 1 and 2 represent typical supply, buffer and transportation stations. Module 1 consists of a container and a continuously controlled belt conveyor from which the bulk good is carried to a mini hopper, which is the interface to module 2. Module 2 consists of a vacuum pump, a buffer container and a vibration conveyor. The vacuum pump itself transports the material from module 1 into an internal container. The material is then released to the buffer container by a pneumatically actuated flap and then charged via the vibration conveyor into a mini hopper, which is the interface to the dosing station, module 3. Module 3 contains a further vacuum pump and a dosing unit composed of a buffer container with a weighing system and a rotary feeder. The dosed material is finally transported by a third vacuum pump to module 4 and then filled into transport boxes. Additionally, every module is equipped with its own PLC-based control system; these control systems communicate with each other via a suitable communication protocol. Each module has a set of sensors to monitor the module's state; in particular, each buffer is equipped with min/max level sensors and each mini hopper with overflow sensors. The electrical energy consumption is measured by energy metering modules. As the energy consumption of the vacuum pump and the vibration conveyor is influenced by instrument air consumption, we take this ancillary consumption into account. Note that the testbed mimics to some extent typical large-scale systems which are modularized into smaller subsystems with their own control systems and suitable communication interfaces. It is also worth mentioning that such a system set-up is especially suited for distributed control and decentralized optimization approaches. Furthermore, due to the system structure with different buffer containers as well as the inherently discontinuous behavior of the vacuum pumps, this process constitutes a typical hybrid system with a mixture of discrete and continuous behavior. For this reason, a learning-based energy management optimization is particularly beneficial, as it allows the energy-optimal operation strategy to be enhanced. In the following experiments we will concentrate mainly on modules 1 and 2, which are the supply units for the subsequent dosing station. In particular, the target is to supply the dosing unit with the required amount of material continuously processed by the dosing station while keeping the energy consumption of all the actuators within the supply stations as low as possible. Note that there exists no pre-programmed sequence of actuator operations in the PLC when starting the learning process. However, to ensure safe operation of the process, some interlocks to avoid buffer overflows are implemented at the basic PLC level using the available sensor information described above [29].

BULK GOOD PROCESS MODELING
To allow for fast development times and reduce the effort to gain machine data, we additionally derive a simulation model. Hence, the ACRL algorithm can be analyzed using a co-simulation approach before testing on the real plant. To this end, we briefly state the basic system equations of the physical model based on mass-flow balance equations as well as the equations for the energy consumption used in the reward calculation. Note that the simulation model is set up as a modular model where subsystems can arbitrarily be plugged in and removed.
To define the mass-flow balances, we define a state equation for each storage element, i.e. buffer and hopper, using the sum of differences between input and output mass flows [...]. More specifically, for the first module we derive the mass-flow differential equations (MFDE) for the buffer (bf1) and hopper (hp1), respectively: [...], where $n_{bc}$ denotes the speed of the belt conveyor, [...]. Regarding the energy consumption, we have to deal with two different energy sources, namely electrical energy $E_{ei}$ and pneumatic energy $E_{pi}$ in terms of instrument air. The consumptions for Modules 1-3 are calculated as follows: [...], resulting in the overall energy consumption [...]. Note that all above listed functions and constants used in the simulation model rely on measurements and regression analysis based on real process data.
Additionally, it is worth mentioning that the vacuum pumps exhibit a specific behavior.After switching on, first an evacuation period occurs where conveying of product is not possible.Afterward the conveyed product follows a polynomial function until the buffer in the vacuum pump is full which results in a sudden drop of mass flow.The ACRL-algorithm should be able to cope with this specific system behavior.
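To illustrate the mass-flow balance idea, the following Python sketch integrates the stored mass of a single buffer with the explicit Euler method. It is a strongly simplified stand-in for the simulation model: the function names, the clipping to the physical limits and the pump profile (no flow during an evacuation period, constant flow afterwards instead of the polynomial characteristic described above) are assumptions for illustration only.

```python
def simulate_buffer(m0, inflow, outflow, dt, steps, capacity):
    """Euler integration of dm/dt = inflow(t) - outflow(t) for one buffer.

    The stored mass is clipped to [0, capacity] (an assumption that mimics
    an empty and a full buffer). Returns the list of masses over time.
    """
    m = m0
    trajectory = [m]
    for k in range(steps):
        t = k * dt
        m = m + dt * (inflow(t) - outflow(t))
        m = min(max(m, 0.0), capacity)
        trajectory.append(m)
    return trajectory


def pump_flow(t, evacuation_time=2.0, rate=0.5):
    """Hypothetical pump inflow: zero while evacuating, constant afterwards."""
    return 0.0 if t < evacuation_time else rate


# Buffer that is emptied at a constant rate while the pump starts up.
print(simulate_buffer(1.0, pump_flow, lambda t: 0.2, dt=1.0, steps=4, capacity=5.0))
```

During the evacuation period the buffer level falls because only the outflow is active, which is the kind of delayed, discontinuous effect a learning agent has to account for when deciding when to switch the pump on.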

ENERGY MANAGEMENT SET-UP FOR THE A2C APPROACH
After the introduction to ACRL in Sec. 3 and a detailed description of our application example, we will now formulate our ACRL learning framework for energy management and optimization. As application example we built a bulk good process simulation model of our laboratory testbed, which has a modularized system architecture like the system set-up illustrated in Fig. 1. To this end, we need to define the MDP for the energy optimization problem by specifying the system states, the set of possible actions and the rewards to be gained. The definition of the state and action space can be seen in Table 1. As we are interested in energy optimization during an industrial process, the state of the MDP needs to mainly represent the energy flows in the system as described in Sec. 2. To this end, we assign a set of states $\mathcal{S}_{i,j}$ to each sensor measurement of the $i$-th subsystem. A typical example of such a state set is $\mathcal{S}_{i,j} = \{\text{full}, \text{empty}\}$ for the discrete states of the buffer sensors. However, continuously operating sensors with more than two states are also located in the system, like the sensors in the mini hoppers. These continuous sensor states are covered by Gaussian basis functions in the hidden layer of the RBF network. Finally, the resulting set of states of subsystem $i$ then yields [...]. The action space $\mathcal{A}_{i,j}$ is defined by the actuators' behavior, which can also be either discrete or continuous. Continuous actuator behavior is likewise captured with Gaussian basis functions. Furthermore, the rewards for each state transition have to be defined. The appropriate definition of rewards is of major importance, as the energy optimization problem has to fulfill different, partly counteracting objectives [29]. On the one hand, the energy consumption should be minimized; on the other hand, the plant has to supply at least a certain product amount, which costs energy. From the energy point of view a standstill would be the optimal solution. Hence, the reward function should contain both the energy consumption during the sample period [...]
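The trade-off between energy consumption and product supply can be made concrete with a small Python sketch of a per-sample reward. The structure, the weights and all argument names are hypothetical and are not the reward definition used in this work; the sketch merely shows how a throughput term prevents the standstill from being the best choice.

```python
def reward(energy, delivered, demanded, overflow,
           w_energy=1.0, w_throughput=1.0, overflow_penalty=10.0):
    """Hypothetical per-sample reward balancing energy, throughput and overflow.

    energy    : energy consumed during the sample period
    delivered : product amount supplied during the sample period
    demanded  : product amount required during the sample period
    overflow  : True if a buffer or hopper overflow occurred
    """
    # Only a shortfall is penalized; supplying more than demanded is neutral.
    shortfall = max(demanded - delivered, 0.0)
    r = -w_energy * energy - w_throughput * shortfall
    if overflow:
        r -= overflow_penalty
    return r


# Standstill (no energy, nothing delivered) versus meeting the demand.
print(reward(0.0, 0.0, 1.0, False), reward(0.5, 1.0, 1.0, False))
```

With these illustrative weights the standstill receives a lower reward than an operation that meets the demand at moderate energy cost, so the relative size of the weights decides which objective dominates.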

RESULTS
In this section we present the results of our ACRL approach applied to a bulk good process simulation model.
The results are obtained with the following parameter settings: discount factor $\gamma = 0.9$, learning rate $\alpha = 0.1$, vanishing for a high number of episodes, and trace decay rate $\lambda = 0.9$. The centers and variances that determine the Gaussian functions rely on measurements at the laboratory bulk good system. Note that empirical investigations revealed that variations in the location and the width of the RBFs do not have a significant influence on the results. Furthermore, we fix the number of Gaussian functions to two for discrete behavior and to three for the continuous case.
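A learning rate that starts at 0.1 and vanishes for a high number of episodes could, for example, be realized with a simple hyperbolic decay. The schedule below is an assumption for illustration; the decay law and the decay constant actually used are not specified here.

```python
# Parameter values stated above.
GAMMA = 0.9    # discount factor
LAMBDA = 0.9   # trace decay rate
ALPHA_0 = 0.1  # initial learning rate


def learning_rate(episode, alpha0=ALPHA_0, decay=0.01):
    """Hypothetical schedule alpha0 / (1 + decay * episode), tending to zero."""
    return alpha0 / (1.0 + decay * episode)


for episode in (0, 100, 450):
    print(episode, learning_rate(episode))
```

A slowly decaying schedule keeps the updates large during the exploration phase and damps them once the operating sequence has settled.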
The resulting learning curves in Fig. 4 and Fig. 5 show the energy consumption and the volume output over the number of episodes, respectively, where one episode comprises 15 s. As indicated, a typical learning behavior with a notable exploration period at the beginning can be observed, followed by more exploitation until around learning episode 250. From episode 250 to 450 no visible changes occur anymore, which indicates that the optimal operating sequence has been reached. In the end, a cyclic process sequence is the optimal result obtained from the RBF-A2C learning algorithm, which is quite plausible considering the type of actuators in the bulk good handling process, in particular the vacuum pumps. The vacuum pumps exhibit a specific operation with a short evacuation period at the beginning of an operation sequence where no material can be transported, followed by a suction period where the material is transported linearly with the suction time. Thus, the operation behavior of the vacuum pumps is periodic by nature and, as they are the dominant actuators in the system, it can be assumed that they have the biggest influence on the process itself. Hence, the baseline model also shows this kind of system behavior. In comparison to the baseline, the energy consumption is considerably reduced (by about 22%) while a slightly higher volume output is generated, which is a noticeable improvement. A plausible reason for this could be a deceleration or increased shut-off times of the conveyors in contrast to a continuous operation in the baseline.
Fig. 6 shows the reward gained during the learning procedure. Due to the reward definition, which requires a multi-objective balancing between three parameters (high throughput demand, low energy, no overflow), in combination with the periodic process behavior, the reward does not necessarily converge to the highest end value but also shows periodic behavior. In relation to the graphs of the energy consumption and the volume output, the exploration is clearly visible until episode 200. After episode 200 the reward signal becomes more and more consistent and finally ends up in a periodic graph.

CONCLUSIONS
We have introduced a novel approach for energy optimization in large-scale industrial process plants. The approach is based on a formulation of the optimization problem in the form of advantage actor-critic reinforcement learning (A2C) with RBFNNs as function approximators, which makes it possible to account for hybrid system behavior. The approach is applied to a bulk good process simulation model with very promising results: the energy consumption of the production process is reduced compared to the baseline model by learning from subsequent operation sequences while maintaining the production quality and performance. In future research, the developed A2C approach can be implemented on an industrial PLC for validation on the real laboratory bulk good testbed. Moreover, as the current approach requires a centralized learning process, a distributed approach to the ACRL problem, potentially involving game-theoretical ideas and leading to a game-based coordination of the multi-agent system (MAS), could be a forward-looking topic for further research. In this context, the investigation of the deep deterministic policy gradient (DDPG) method [34] would also be an interesting direction for future research.

Figure 1 - General Production System Set-Up [29]

Figure 2 - Actor-Critic Reinforcement Learning Schematic with RBF Neural Networks for Function Approximation

Figure 3 - Laboratory Testbed Schematic and Modeling Set-Up