REINFORCEMENT LEARNING BASED ANTI-COLLISION ALGORITHM FOR RFID SYSTEMS

An efficient collision arbitration protocol facilitates fast tag identification in radio frequency identification (RFID) systems. The EPCGlobal Class 1 Generation 2 (EPC-C1G2) protocol is the current standard for collision arbitration in commercial RFID systems. However, the main drawback of this protocol is that it requires excessive message exchanges between the tags and the reader. This wastes the energy of already resource-constrained RFID readers. Hence, in this work, a reinforcement learning based anti-collision protocol (RL-DFSA) is proposed to address the energy-efficient collision arbitration problem in RFID systems. The proposed algorithm continuously learns and adapts to changes in the environment by devising an optimal policy. RL-DFSA was evaluated through extensive simulations and compared with the variants of the EPC-C1G2 algorithm that are currently used in commercial readers. The results show that RL-DFSA performs equal to or better than the EPC-C1G2 protocol in delay, throughput and time system efficiency in both sparse and dense environments, while requiring an order of magnitude fewer control message exchanges between the reader and the tags.


INTRODUCTION
Radio frequency identification (RFID) technology has found widespread acceptance in security, logistics, retailing and inventory management [1]. An RFID system is the most efficient and reliable way to identify an entity and collect data [2]. An RFID system consists of one or multiple readers with numerous tags that communicate using a shared communication channel. Among the three types of tags available in the market (passive, active and semi-active), passive tags are the least complex and the cheapest. They use the backscattered electromagnetic energy from the reader's signal to communicate their ID information. The communication protocol for the RFID system has to be simple since the tags are computationally constrained. Thus, the reader assumes full responsibility for managing or reducing collisions in the network. There are three types of collisions in an RFID system, namely, tag-to-tag, reader-to-tag and reader-to-reader [3]. The focus of this work is to propose a solution for reader-to-tag collisions using a reinforcement learning technique.
Collision arbitration protocols for the RFID system can be divided into two categories: deterministic (tree-based) and probabilistic (Aloha-based). The tree-based protocols use a binary tree search method in which tags are repeatedly split into subsets until each subset has only one tag. Tree-based anti-collision algorithms (ACA) are found to be efficient when the number of tags is small. However, long identification delays for a large number of tags and high protocol complexity are the drawbacks of these protocols [4]. On the other hand, Aloha-based ACAs use time slots and a random transmission strategy to reduce the collision probability. They are known for minimal complexity and ease of implementation in the context of RFID applications. In fact, the EPC-Global Class 1 Generation 2 (EPC-C1G2) standard uses a variant of Aloha for its operation. However, the theoretical maximum throughput of a slotted Aloha ACA is only 37% [5]. Also, these protocols cannot guarantee a low identification delay in a dynamic environment such as a warehouse [6].
In this paper, we present an efficient Aloha-based ACA which adapts its frame size dynamically using a reinforcement learning mechanism. Framed slotted Aloha (FSA) is selected due to its simplicity and its ability to handle a large number of tags or nodes when combined with capable algorithms [7]. The performance of the proposed algorithm (RL-DFSA) was evaluated using Monte-Carlo simulations and compared with algorithms that are currently used in commercial settings. RL-DFSA reduces collisions and improves throughput and delay significantly as compared to the algorithms currently employed in commercial RFID readers. Besides, it is energy efficient since its control message overhead is an order of magnitude lower than that of the best performing algorithm in commercial readers.
The remainder of this paper is organized as follows. Section 2.0 discusses the current RFID standard and related works. In Section 3.0, the complete methodology of the proposed RFID anti-collision protocol is presented in detail. Section 4.0 presents the results and discussion of the proposed protocol in relation to selected protocols from the literature. Finally, Section 5.0 concludes the paper and outlines future works.

BACKGROUND INFORMATION
In FSA, a frame is divided into slots of the same length. At the beginning of each frame, the interrogator (reader) broadcasts the frame size to the tags. Each tag then selects a slot at random and sends its ID information to the reader in that slot. Because of this random slot selection policy, excessive collisions are bound to happen if the reader selects a frame size that is not optimal for the tag population. The average throughput, $U$, of FSA for a population of $N$ tags and a frame size $L$ is

$$U = N\left(1 - \frac{1}{L}\right)^{N-1},$$

and the normalized throughput, $U_{norm}$, is given by

$$U_{norm} = \frac{U}{L} = \frac{N}{L}\left(1 - \frac{1}{L}\right)^{N-1}.$$

The normalized throughput is maximized when $L = N$. However, readers are not privy to the tag population and FSA has a fixed frame size. Due to these limitations, a variant of FSA called dynamic frame slotted Aloha (DFSA), which adapts the frame size dynamically based on an estimate of the backlog of tags, was proposed in the literature [8]. Depending on the accuracy of the backlog estimation method, the number of collisions in the subsequent frames varies for better or worse. Besides, the throughput of DFSA also drops when there is a large number of tags to read. Therefore, a variation of DFSA called the Q-algorithm was proposed as the standard protocol in current generation RFID systems.
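The claim that the normalized throughput peaks at L = N can be checked numerically. The following is a minimal sketch, assuming the standard expression for the expected share of slots holding exactly one reply; the function names and the search bound `max_frame` are illustrative and not taken from the paper.

```python
def fsa_normalized_throughput(num_tags: int, frame_size: int) -> float:
    """Expected fraction of slots in a frame that carry exactly one tag reply."""
    if num_tags < 1 or frame_size < 1:
        raise ValueError("num_tags and frame_size must be positive")
    return (num_tags / frame_size) * (1.0 - 1.0 / frame_size) ** (num_tags - 1)


def best_frame_size(num_tags: int, max_frame: int = 4096) -> int:
    """Brute-force search for the frame size with the highest normalized throughput."""
    return max(
        range(1, max_frame + 1),
        key=lambda frame_size: fsa_normalized_throughput(num_tags, frame_size),
    )


if __name__ == "__main__":
    n = 100
    best = best_frame_size(n)
    # The peak value approaches 1/e (about 0.37) as the tag population grows.
    print(best, fsa_normalized_throughput(n, best))
```

For 100 tags the search returns a frame size of 100, and the peak value is close to the 37% limit quoted earlier for slotted Aloha.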

EPCGLOBAL CLASS 1 GENERATION 2 STANDARD
EPC-Global Class 1 Generation 2 (EPC-C1G2) is the RFID air interface protocol which, through standardization, enables interoperation of RFID devices across the globe [9]. It uses a variant of dynamic frame slotted Aloha (DFSA) known as the Q-algorithm, which operates on a per-slot basis to arbitrate collisions and dynamically adapt the frame size.
The operation of the Q-algorithm is shown in Fig. 1. The Q-algorithm operates using two parameters, namely, a floating-point parameter, $Q_{fp}$, and a step size, cq. The rounded value of $Q_{fp}$, denoted $Q$, is used to set the frame size, and cq is used to increase or decrease $Q_{fp}$ in the event of a collision or an empty slot, respectively. An interrogation process is initiated by the reader with the broadcast of a Query command which contains the frame size. Upon receiving this command, each tag generates a random number in the range 0 to $2^Q - 1$ and sets its counter equal to the generated value. Then, the reader interrogates each slot of the frame one by one using the Query_repeat command. For each Query_repeat command, tags decrease their counter by one. A tag whose counter equals zero transmits its ID information to the reader. However, if more than one tag has a counter equal to zero in the current slot, the reader detects a collision. Consequently, $Q_{fp}$ is increased by the pre-determined value cq. In the case of an empty slot, $Q_{fp}$ is decreased by the same value. $Q_{fp}$ is updated in every slot until its rounded value changes, upon which the reader exits the current frame and broadcasts a new frame size using the Query_adjust command. This process repeats until all tags are identified.
The standard limits the rounded value to the range 0 to 15 for delay concerns. Besides, the reader has the autonomy to decide whether to exit the current frame or continue interrogating it even when the rounded value has changed. One unique feature of the EPC-C1G2 algorithm is that success, collision and empty slots have different time durations as per the standard. Thus, the claim that the throughput of FSA is maximized when $L = N$ is not applicable even though EPC-C1G2 is a variant of FSA. This has been verified analytically by [10], where the optimal frame size for EPC-C1G2 was calculated to be about 1.46 times the contending tag population.
However, the Q-algorithm has several drawbacks. The initial selection of the Q value affects its performance significantly. The reader has no means to know the population of tags in the network a priori to set the Q value appropriately. Besides, the Q adjustment strategy using cq produces excessive protocol overhead and also performs poorly in dense tag environments.
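The slot-by-slot behaviour described above can be sketched as a small simulation. This is a simplified illustration under stated assumptions, not the standard's reference procedure: the function name, the default initial value of 4.0 and the handling of an exhausted frame (a fresh query with the same frame size) are choices made here for the example.

```python
import random


def q_algorithm_round(num_tags, q_init=4.0, step=0.3, seed=0):
    """Simulate one interrogation round in which the reader adjusts the
    frame size as soon as the rounded Q value changes."""
    rng = random.Random(seed)
    q_fp = q_init
    q = round(q_fp)  # note: Python rounds halves to the nearest even integer
    # Each unidentified tag holds a slot counter drawn from [0, 2^Q - 1].
    counters = [rng.randrange(2 ** q) for _ in range(num_tags)]
    stats = {"success": 0, "collision": 0, "empty": 0, "adjust": 0}
    while counters:
        responders = [i for i, value in enumerate(counters) if value == 0]
        if len(responders) == 1:
            stats["success"] += 1
            counters.pop(responders[0])  # the identified tag leaves the round
        elif len(responders) > 1:
            stats["collision"] += 1
            q_fp = min(15.0, q_fp + step)
        else:
            stats["empty"] += 1
            q_fp = max(0.0, q_fp - step)
        new_q = round(q_fp)
        if new_q != q:
            # Frame adjustment: every remaining tag redraws its counter.
            q = new_q
            counters = [rng.randrange(2 ** q) for _ in counters]
            stats["adjust"] += 1
        else:
            # Next slot: counters decrease; collided tags drop below zero and wait.
            counters = [value - 1 for value in counters]
            if counters and all(value < 0 for value in counters):
                # Frame exhausted without a change in Q: query again, same size.
                counters = [rng.randrange(2 ** q) for _ in counters]
    return stats


if __name__ == "__main__":
    print(q_algorithm_round(200))
```

The `adjust` counter gives a feel for the control overhead discussed above: a poorly chosen initial value forces many frame adjustments before the frame size matches the tag population.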

RELATED WORKS
In this section, we discuss some representative past studies on DFSA-based anti-collision algorithms for RFID systems. The objective of the proposed algorithms is either to solve for the optimal frame size or to estimate the tag population. More often, the proposed algorithms try to achieve both objectives, as can be seen from the protocols reviewed in this section. We also explain some shortcomings of these algorithms.
Floerkemeier [11] and Bueno-Delgado et al. [12] proposed solutions for the optimal frame size in the RFID system. The authors of both papers asserted that L = N is the optimal frame size. However, we know from [10] that the optimal frame size for an RFID system is not the same as in traditional networks because the success, empty and collision slots of RFID networks have different durations. On the other hand, Zhen et al. [13] proposed that the optimal frame size should be set to 1.4 times the tag population based on their own experimentation.
Eom et al. [14] proposed an anti-collision protocol which updates the frame size using the estimated backlog tag population. An estimation algorithm is used to calculate the number of collided tags (γ) in each collision slot. The authors reported that the proposed protocol exhibits improved tag estimation accuracy while reducing the total number of slots required for an interrogation round. However, the authors did not distinguish between the three types of slots (success, empty and collision) when evaluating the total number of slots. Therefore, the reported comparison with the other protocols cannot be considered valid.
Chen [15] introduced an anti-collision protocol which dynamically adjusts the frame length by examining only one slot per frame. This reduces the total number of examinations needed for setting the optimal frame size. The protocol updates the frame length using the estimated tag population. The author evaluated the protocol through simulations and reported that its normalized throughput is higher than that of the EPC-C1G2 protocol. However, the comparison is not valid since the author assumed all three types of time slots to have the same duration.
Even though there are numerous anti-collision algorithms available in the literature, in this paper we only compare our proposed algorithm with the EPC-C1G2 protocol and its variants, for the following reasons. Since the optimal frame size is already known from the literature, we can create an upper bound for performance (the Ideal algorithm) as explained in Section 4. Therefore, there is no need to compare the proposed anti-collision algorithm with any algorithm from the literature other than the EPC-C1G2 algorithm and its variants. Besides, the results reported in this paper can be compared with other protocols through their percentage of improvement over the EPC-C1G2 protocol.

REINFORCEMENT LEARNING BASED DYNAMIC FRAME SLOTTED ALOHA (RL-DFSA)
In this section, the proposed RL-DFSA anti-collision algorithm is explained in detail. The RL-based frame adaptation method is primarily motivated by the work of Shaheen [16]. In that work, the author used a Markov decision process (MDP), which is the framework for most reinforcement learning algorithms [17], to analyze the slotted Aloha protocol. However, the author dropped the idea of solving the MDP for a large number of tags due to the enormous number of computations required, and adopted a heuristic-based method instead. Accordingly, in this work, we approached the problem from a different point of view. Rather than calculating the transition probabilities, we used the Q-learning algorithm, which updates the Q-value for each state based on its interaction with the environment. As a result, the computational complexity drops, at the cost of a longer convergence time. We used the Q-learning algorithm since it is known to be one of the most effective and popular algorithms for finding an optimal policy in the absence of the transition probabilities and the reward function [18].

INTRODUCTION TO Q-LEARNING
Q-learning is a model-free reinforcement learning algorithm which learns by interacting with the environment and maintaining a Q-value for each state-action pair. The Q-value denotes the preference of taking an action over all other available actions when the system is in a certain state. Formally, for each state $s_t \in S$ and action $a_t \in A$ we define the Q-value by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a' \in A} Q(s_{t+1}, a') - Q(s_t, a_t) \right],$$

where α is the learning rate, γ is the discount factor and $r_{t+1}$ is the delayed reward. The value α ∈ [0, 1] controls how quickly learning occurs, while γ ∈ [0, 1] controls the willingness to wait for delayed rewards. The objective of a reward function is to lead the learning agent towards the goal by properly rewarding or punishing the agent for the action taken in a certain state. A carefully defined reward function will lead the Q-learning algorithm towards convergence in a relatively short amount of time, depending on the application. Q-learning pseudocode for a single agent is presented in Algorithm 1.
Algorithm 1 Q-learning
1. Set t = 0 and initialize the Q-values Q(s_t, a_t) for all s_t ∈ S and a_t ∈ A.
2. while t < max_iteration do
3.     Observe the current state s_t.
4.     Select the next action a_t = arg max_{a'∈A} Q(s_t, a').
5.     Apply a_t, observe the next state s_{t+1} and the reward r_t ≜ r(s_t, a_t).
6.     Update the Q-value Q(s_t, a_t) and increment t.
end while
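The update performed inside the loop of Algorithm 1 can be written in a few lines. The sketch below assumes the standard tabular Q-learning update with the Q-values stored as a list of rows, one row per state; the function names and default parameter values are illustrative.

```python
def greedy_action(q_table, state):
    """Index of the action with the largest Q-value in the given state."""
    row = q_table[state]
    return max(range(len(row)), key=row.__getitem__)


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Move Q(state, action) towards reward + gamma * max_a' Q(next_state, a')."""
    best_next = max(q_table[next_state])
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


if __name__ == "__main__":
    # Two states, two actions; state 1 already has learned values.
    q = [[0.0, 0.0], [2.0, 1.0]]
    # 0 + 0.5 * (1 + 0.9 * 2 - 0) = 1.4
    print(q_update(q, state=0, action=1, reward=1.0, next_state=1, alpha=0.5))
    print(greedy_action(q, 0))
```

A larger `alpha` makes each observation count for more, while a larger `gamma` gives more weight to the value of the state that follows.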
The goal of the learning agent is to map each state to an action that maximizes its expected discounted reward over time. However, a policy which chooses only the known maximal action without occasional exploration may get stuck in locally optimal solutions. Numerous exploitation-exploration strategies are available in the literature to tackle this problem. In this work, we selected the well-known epsilon-greedy method [19] to balance exploitation and exploration. An agent following this learning strategy occasionally chooses actions which have lower Q-values, with probability ε.
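A minimal sketch of epsilon-greedy selection is given below. It assumes the common variant in which, with probability ε, an action is drawn uniformly at random; the exponential decay schedule and its constants are assumptions made for illustration, since the exact schedule is not specified here.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # exploration: uniform random action
    return max(range(len(q_row)), key=q_row.__getitem__)  # exploitation


def decayed_epsilon(step, eps_start=1.0, eps_min=0.01, decay=0.999):
    """Hypothetical schedule: epsilon shrinks geometrically down to a floor."""
    return max(eps_min, eps_start * decay ** step)


if __name__ == "__main__":
    rng = random.Random(0)
    q_row = [0.1, 0.9, 0.3]
    picks = [epsilon_greedy(q_row, decayed_epsilon(t), rng) for t in range(5000)]
    # Late in the run the best action (index 1) dominates the selections.
    print(picks[-20:])
```

With ε set to zero the rule reduces to the purely greedy selection used in step 4 of Algorithm 1.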

RL-DFSA
This subsection describes the methodology used to adapt FSA using the Q-learning algorithm for RFID systems. We are well aware of the computational restrictions of RFID tags and the complexity of the Q-learning algorithm. Thus, the proposed algorithm is designed to run on the readers only. There are numerous high-end readers, such as GAORFID and RapidRadio, which have a powerful ARM processor and memory card support [20], [21] and can run the proposed algorithm without any trouble. Besides, the algorithm can be made to function in online or offline mode. In online mode, the reader continuously updates the Q-matrix until the end of the interrogation round. This mode also supports a dynamic tag population since the algorithm is actively learning. In offline mode, the algorithm is run on a reader for a certain tag population until convergence is achieved. After convergence, the Q-matrix can be transferred to low-end readers using memory cards so that they can function optimally for that tag population. This reduces the computational complexity since selecting the action with the maximal Q-value from a matrix requires only a small number of operations. The downside of the offline mode is that the low-end reader would produce errors when the tag population grows far beyond what was seen during the training period on the high-end reader. We used the offline mode for evaluating RL-DFSA for the following reasons.
The proposed RL-DFSA has two phases, namely, learning (exploration and exploitation) and testing (exploitation only), because of the technical difficulty of running learning and testing concurrently. For 1000 tags, RL-DFSA requires 20,000 iterations (about 12 minutes) to converge to a near-optimal policy, as shown in Fig. 2. Learning an optimal policy is not possible due to the stochastic nature of our application. Fig. 3 shows that the downward trend of the cumulative reward, as the exploration probability ε is decayed over time, settles at around 20,000 iterations.
Therefore, the learning and testing phases were conducted separately for time reasons. The slow convergence of the algorithm is due to the stochastic nature of our application and to the fact that Q-learning itself is slow, as rightly observed in [22].

BASIC SETUP
In this work, the Q-learning algorithm was integrated into FSA to solve the reader-to-tag collision problem in RFID networks. In FSA, a frame comprises multiple time slots of the same length. During each time slot, the reader interrogates the tags to get their ID information. Only the tags which selected the current time slot for transmission reply in that slot. However, if the frame size is much smaller than the tag population, severe collisions happen at the reader's side and deplete its energy. To make matters worse, readers are not privy to the tag population and therefore cannot set the frame size optimally. Hence, there is a need to estimate the tag population and determine the optimal frame size for RFID networks. In this regard, Q-learning can help the reader to adapt its frame size dynamically using the feedback it gets from the network. Besides, it can also solve the tag estimation problem by experimenting with various tag estimation methods, which are made available as its possible actions as explained in the rest of this subsection.
The problem of determining the optimal frame size for an RFID network was solved analytically by [10], and it was found that the frame size should be set to 1.46 times the tag population. Therefore, in this work, we focused on creating a policy for the reader so that it can adjust its frame size by alternating between various tag estimates. The tag estimates were based on the fact that the number of collided tags in a time slot must be equal to or greater than two. Thus, we defined the action space of the Q-learning algorithm as follows:
Action 1 = 1.46 × 2.0 × number_of_collision, (5)
Action 2 = 1.46 × 2.2 × number_of_collision, (6)
and so on until
Action 11 = 1.46 × 4.0 × number_of_collision. (7)
The number of actions was limited to eleven since increasing it further introduces additional time complexity, which is exponential. The state of the learning agent was set equal to the number of collisions in the previous frame. A reward function was defined using the reward shaping methodology to assist the learning agent in achieving its goal. A metric called the collision ratio was used for this purpose. It is clear from the reward function that the goal of the learning agent is to reduce the number of collisions in order to receive higher rewards.
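The action space in (5) to (7) can be enumerated programmatically. In the sketch below, the eleven multipliers run from 2.0 to 4.0 in steps of 0.2 as described above; rounding the result up to a whole number of slots, and the helper names, are assumptions made here for illustration.

```python
import math

OPTIMAL_RATIO = 1.46  # frame size per contending tag, as reported in [10]
# Assumed tags per collision slot for Actions 1..11: 2.0, 2.2, ..., 4.0
MULTIPLIERS = [round(2.0 + 0.2 * k, 1) for k in range(11)]


def frame_size_for_action(action_index, num_collisions):
    """Frame size for a 1-based action index and the collision count of the
    previous frame, rounded up to an integer (rounding is an assumption)."""
    tags_per_collision = MULTIPLIERS[action_index - 1]
    estimated_backlog = tags_per_collision * num_collisions
    return max(1, math.ceil(OPTIMAL_RATIO * estimated_backlog))


if __name__ == "__main__":
    for action in (1, 2, 4, 11):
        print(action, frame_size_for_action(action, num_collisions=10))
```

Each action therefore corresponds to a different guess of how many tags were involved in a collision slot, and the learning agent decides which guess to trust in each state.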

LEARNING AND TESTING PHASES
The action space must be small so that Q-learning can converge in a reasonable amount of time. Therefore, an initial study was performed to identify the dominant actions based on the cumulative sum of their Q-values. Using this criterion, three actions (1, 2 and 4) were identified as dominant and a new simulation was carried out using these actions. Through this new simulation, an optimal policy and Q-matrix for a tag population of 1000 were obtained. The number of tags was limited to 1000 since increasing it further would increase the simulation time exponentially. Besides, based on our own experiments, the policy obtained using 1000 tags can be used for tag populations of up to 2500. Beyond that, an error is produced since the state exceeds the index range of the Q-matrix. The parameters of the RL-DFSA algorithm for the initial study are presented in Table 1. The initial state of the agent can be any arbitrary value except one, since state one is the goal state. The timing parameters given in Table 3.1 were used for all our simulations. The pseudocode of RL-DFSA is presented in Algorithm 2.
Algorithm 2 RL-DFSA
...
7.     Broadcast frame size and get C, S, E.
8.     Get reward r_t ≜ r(s_t, a_t).
...
10.    Update Q-value.
end while
During the testing phase, the learned Q-matrix was used to select an optimal action in each state. Monte-Carlo simulations with 5000 iterations were carried out for various numbers of tags and the results are presented in Section 4.
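The testing phase can be illustrated with a small frame-level simulation. This is a sketch under stated assumptions rather than the simulator used in this work: the Q-matrix in the example is a hypothetical placeholder, the state is clamped to the last row of the matrix, and frame sizes are rounded up to integers.

```python
import math
import random
from collections import Counter


def simulate_frame(num_tags, frame_size, rng):
    """Return (collisions, successes, empties) for one FSA frame."""
    picks = Counter(rng.randrange(frame_size) for _ in range(num_tags))
    successes = sum(1 for count in picks.values() if count == 1)
    collisions = sum(1 for count in picks.values() if count > 1)
    empties = frame_size - len(picks)
    return collisions, successes, empties


def run_round(num_tags, q_matrix, multipliers, initial_frame=16, seed=0):
    """Exploitation-only round: the state is the collision count of the
    previous frame and the action is the arg max of the matching Q-matrix row."""
    rng = random.Random(seed)
    remaining, frame_size = num_tags, initial_frame
    frames = slots = 0
    while remaining > 0:
        collisions, successes, _ = simulate_frame(remaining, frame_size, rng)
        frames += 1
        slots += frame_size
        remaining -= successes
        if remaining == 0:
            break
        state = min(collisions, len(q_matrix) - 1)  # clamp to the trained range
        row = q_matrix[state]
        action = max(range(len(row)), key=row.__getitem__)
        frame_size = max(1, math.ceil(1.46 * multipliers[action] * collisions))
    return frames, slots


if __name__ == "__main__":
    # Placeholder Q-matrix that always prefers the first action (not learned values).
    toy_q = [[1.0, 0.5, 0.2] for _ in range(1001)]
    print(run_round(200, toy_q, multipliers=[2.0, 2.2, 2.6]))
```

The three multipliers correspond to the dominant actions 1, 2 and 4; in practice the rows of the Q-matrix would come from the learning phase rather than being filled by hand.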

SIMULATION SETTINGS, RESULTS, AND DISCUSSION
In this section, simulation results and discussions for all five algorithms (EPC-Fixed, EPC-Q-Frame, EPC-Q-Slot, Ideal, and RL-DFSA) are presented. In EPC-Fixed, fixed frame sizes of 16 (for sparse) and 128 (for dense) were used to simulate commercial readers with similar characteristics such as Symbol, ThingMagic Mercury 4, Samsys and Intermec [12]. Fixed frame size commercial readers are available in two variants, namely, non-customizable and user-customizable readers. The Q value for non-customizable tag readers is fixed at 4, while for user-customizable tag readers the user can select a Q value in the range of 1 to 7 at the start of the interrogation round [12]. Therefore, in this simulation, frame sizes of 16 and 128 were selected for simulating fixed frame size commercial readers in the sparse and dense environments, respectively.
In the case of EPC-Q-Frame, the initial frame size was set to 16 as per the EPC-Gen2 standard requirement. However, there are no clear rules in the EPC-Gen2 standard for fixing the cq value. Nevertheless, a cq value of 0.3 was selected since it was found to give the most stable performance for sparse and dense networks [24]. EPC-Gen2 also allows the reader to decide whether to continue interrogating the current frame or abandon it when the rounded $Q_{fp}$ value varies due to collision and empty slots. In this regard, EPC-Q-Frame (Algorithm 3) simulates the situation where the reader decides to continue interrogating the current frame even though the rounded $Q_{fp}$ value has changed mid-frame. The new Q value is only broadcast at the beginning of the next frame. In EPC-Q-Slot, the reader abandons the current frame as soon as it detects a variation in the rounded $Q_{fp}$ value.
In the Ideal case, the initial frame size was set to 16 and it is assumed that the reader knows exactly the number of remaining tags in the system after the expiration of the first frame. Subsequent frame sizes were set to 1.46 × remaining_tags, which is the optimal frame size as explained in Section 2. The Ideal case is treated as the upper bound of the performance that can be achieved by an optimal algorithm. Finally, in RL-DFSA, the initial frame size was set to 16 and the subsequent frames were adjusted dynamically based on the optimal action (3 actions available) selected by the reader in each state.
Simulations were performed for a single reader with various numbers of tags (10 to 1000) using Matlab 2017. Our simulation used the 394-kbps tag-to-reader link rate, which obeys the regulation set by EPCGlobal [9]. The timing details presented in Table 3.1 were obtained from [23] since the same frequency as in the present work was used. Besides, the simulation scenario was divided into sparse (10 to 100 tags) and dense (100 to 1000 tags) environments for an easier interpretation of the results. In order to get more reliable and accurate results, Monte-Carlo simulations with 5000 iterations were conducted for each algorithm and the following five performance metrics were recorded.
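The Ideal baseline and the Monte-Carlo averaging can be sketched as follows. This is a simplified illustration in Python rather than the Matlab code used for the reported results; the function names, the default of 200 runs and the rounding of the frame size up to an integer are assumptions.

```python
import math
import random
from collections import Counter


def ideal_round(num_tags, rng, initial_frame=16, ratio=1.46):
    """One interrogation round of a reader that knows the exact backlog
    after every frame and sets the next frame to ratio * remaining tags."""
    remaining, frame_size = num_tags, initial_frame
    frames = slots = 0
    while remaining > 0:
        picks = Counter(rng.randrange(frame_size) for _ in range(remaining))
        successes = sum(1 for count in picks.values() if count == 1)
        remaining -= successes
        frames += 1
        slots += frame_size
        frame_size = max(1, math.ceil(ratio * remaining))
    return frames, slots


def monte_carlo(num_tags, runs=200, seed=0):
    """Average number of frames and slots per round over repeated runs."""
    rng = random.Random(seed)
    results = [ideal_round(num_tags, rng) for _ in range(runs)]
    mean_frames = sum(frames for frames, _ in results) / runs
    mean_slots = sum(slots for _, slots in results) / runs
    return mean_frames, mean_slots


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, monte_carlo(n, runs=50))
```

Averaging over many independent rounds smooths out the randomness of slot selection, which is why a large number of iterations is used for every algorithm and tag count.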

TIME SYSTEM EFFICIENCY (TSE) [10]
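This metric gives the percentage of time successfully spent in identifying tags. It is calculated as follows:

$$\mathrm{TSE} = \frac{\mathrm{Success} \times T_s}{\mathrm{Success} \times T_s + \mathrm{Empty} \times T_e + \mathrm{Collision} \times T_c},$$

where Success, Collision, and Empty denote the number of successful, collided and empty slots in the frame, respectively, and $T_s$, $T_e$, and $T_c$ are the durations of successful, empty and collision slots, respectively.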
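As a worked illustration of this metric, the helper below computes the share of slot time spent in successful slots. The slot durations in the example are hypothetical placeholders chosen only to show the effect of unequal durations; they are not the timing parameters used in the simulations.

```python
def time_system_efficiency(success, collision, empty, t_s, t_c, t_e):
    """Share of the total slot time that is spent in successful slots."""
    total_time = success * t_s + collision * t_c + empty * t_e
    return (success * t_s) / total_time if total_time > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical durations in seconds: long success slots, short empty slots.
    print(time_system_efficiency(60, 25, 15, t_s=2.5e-3, t_c=0.5e-3, t_e=0.2e-3))
    # With equal durations the metric reduces to the fraction of successful slots.
    print(time_system_efficiency(3, 1, 1, t_s=1.0, t_c=1.0, t_e=1.0))
```

Because empty and collision slots are typically shorter than successful ones, the same slot counts yield a higher TSE than the slot-based normalized throughput would suggest.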

THROUGHPUT (tag per second)
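This metric gives the average number of tags per second that can be identified using the given algorithm. It is calculated as follows:

$$\mathrm{Throughput} = \frac{\mathrm{Success}}{\mathrm{Success} \times T_s + \mathrm{Empty} \times T_e + \mathrm{Collision} \times T_c + T_{query}},$$

where $T_{query}$ denotes the duration of the query command issued by the reader.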
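A short helper makes the unit of this metric explicit. It follows one reading of the definition above, namely the number of identified tags divided by the total frame time including the query command; the durations in the example are hypothetical placeholders, not the values used in the simulations.

```python
def throughput_tags_per_second(success, collision, empty, t_s, t_c, t_e, t_query):
    """Identified tags divided by the time taken by all slots plus the query."""
    total_time = success * t_s + collision * t_c + empty * t_e + t_query
    if total_time <= 0:
        raise ValueError("total time must be positive")
    return success / total_time


if __name__ == "__main__":
    # Hypothetical durations in seconds.
    rate = throughput_tags_per_second(
        success=60, collision=25, empty=15,
        t_s=2.5e-3, t_c=0.5e-3, t_e=0.2e-3, t_query=1.0e-3,
    )
    print(f"{rate:.1f} tags per second")
```

Unlike the normalized throughput, this figure depends directly on the slot durations, so two algorithms with the same slot counts can still differ if one issues more query commands.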

AVERAGE FRAME PER ROUND
This metric gives the average number of frames issued by the reader in each interrogation round.

AVERAGE SLOTS PER ROUND
This metric gives the average number of slots required in each interrogation round.

AVERAGE DELAY PER ROUND
This metric gives the average time taken by the reader to finish each interrogation round.

RESULTS AND DISCUSSION
The performance of RL-DFSA in terms of TSE was evaluated by comparing it with the other four algorithms for various numbers of tags, as shown in Fig. 4. As expected, EPC-Fixed performed the worst since the frame size was fixed for both the sparse and dense environments. As the number of tags increases, its TSE drops abruptly due to the increase in collisions. One persistent trend in the TSE and throughput results of EPC-Q-Frame is that its performance deteriorates from 100 to 400 tags and then increases gradually. This is because the Q-algorithm is slow to adapt to rapid changes in the tag population. Such behavior of the Q-algorithm has also been reported by other researchers [25]. In contrast, EPC-Q-Slot performed far better since it abandons the current frame as soon as it detects a variation in the Q value. The performance of RL-DFSA and EPC-Q-Slot is almost identical to the Ideal case in the sparse tag environment. However, unlike EPC-Q-Slot, RL-DFSA adapts the frame size with an order of magnitude fewer message exchanges, as presented in Fig. 4.2. Its superior performance is due to the efficient learning method using feedback received in the form of reward/cost. In the dense tag environment, there is a small gap in TSE between the Ideal case and the EPC-Q-Slot and RL-DFSA algorithms, which indicates that there is still some room for improvement. Overall, RL-DFSA is 6.3% to 250%, 0.4% to 18.6% and 0.4% to 5.7% better in TSE for the sparse tag environment as compared to the EPC-Fixed, EPC-Q-Frame and EPC-Q-Slot algorithms, respectively. Also, for the dense tag environment, RL-DFSA performs 5.3% to 707.4% and 17% to 578.8% better as compared to the EPC-Fixed and EPC-Q-Frame algorithms, respectively. The performance difference relative to the EPC-Q-Slot algorithm is insignificant (less than 1%).
The conventional normalized throughput of the algorithms fails to give an accurate picture of how it may translate to real-life applications. Hence, for RFID systems, the throughput of an algorithm is given as the number of tags that can be identified per second, as shown in Fig. 4.1. A trend similar to that of the TSE can be observed here. RL-DFSA and EPC-Q-Slot perform almost identically in both the sparse and dense environments, except at a very low number of tags where RL-DFSA performed better. However, the performance of EPC-Fixed drops rapidly in the dense environment as the fixed frame size of 128 is insufficient to accommodate large tag numbers. EPC-Q-Frame performs better than EPC-Fixed when the number of tags is small. In the dense tag environment, its performance is unstable for the same reasons mentioned in the discussion of TSE. Overall, RL-DFSA performs far better than the EPC-Fixed and EPC-Q-Frame algorithms in both the sparse and dense tag environments, with a significant performance gap when the number of tags is large, as can be seen in Fig. 4.1 (b).
Energy efficiency is critical in RFID systems as the readers are battery operated [26]. Therefore, an efficient ACA should be able to reduce collisions while guaranteeing a fast tag identification time. In addition, the number of frames required per interrogation round also needs to be kept to a minimum for energy and delay reasons. Fig. 4.2 shows the average number of frames per interrogation round issued by the reader for all five algorithms. EPC-Fixed and EPC-Q-Slot require an order of magnitude more frames than the other three algorithms. In the case of EPC-Fixed, the number of frames required increases with the number of tags due to the lack of dynamic frame size adaptation in the algorithm. It performs much better in the dense tag environment since the frame size is 128 as compared to 16 in the sparse tag environment. As for EPC-Q-Slot, the decision to abandon a frame as soon as there is a difference in the Q value leads to excessive frame adjustment queries, which is much more pronounced in the dense tag environment. Even though EPC-Q-Slot performs similarly to RL-DFSA in TSE and throughput, this excessive overhead makes it ill-suited for RFID applications since it is not energy efficient. Finally, RL-DFSA performed the best due to its efficient learning capability even in a stochastic environment.
In contrast to the number of frames, the number of slots in each frame should be larger than the contending tag population so that the tags can find a unique transmission slot. However, the number of slots cannot be increased indefinitely, as a large number of empty slots would increase the system delay. The performance of EPC-Fixed and EPC-Q-Frame follows the same trend as in the earlier metrics, with EPC-Q-Frame performing better in the sparse environment. Since EPC-Q-Slot uses a slot-by-slot frame updating mechanism, its performance should be the upper bound for this metric. In both sparse and dense tag environments, RL-DFSA performs almost identically to EPC-Q-Slot, which shows its superior performance even though it uses a frame-by-frame updating mechanism. This is mainly because the actions of the Q-learning algorithm are optimal, as they were carefully selected from the initial study. Consequently, the performance of RL-DFSA is equal to that of the EPC-Q-Slot algorithm, as shown in Fig. 4.3.
Fig. 4.4 shows the delay performance of all five algorithms. The algorithms ranked from best to worst in delay performance are as follows: Ideal, RL-DFSA, EPC-Q-Slot, EPC-Q-Frame, and EPC-Fixed. Due to its static frame size, the EPC-Fixed algorithm took much longer than the other four algorithms to finish an interrogation round. The difference is much more pronounced in the dense tag environment. EPC-Q-Frame performed much better since it has a dynamic frame adaptation mechanism. The RL-DFSA algorithm performed better than the EPC-Q-Slot algorithm despite using a frame-by-frame adaptation mechanism. In fact, RL-DFSA is 7.9% to 9.3% and 4.9% to 5.5% better in delay performance as compared to EPC-Q-Slot in the sparse and dense tag environments, respectively. This shows the superiority of the employed learning algorithm, which was able to learn an optimal policy even in a dynamic environment.

CONCLUSION
Energy efficiency is crucial in internet of things applications and RFID systems are no exception. Hence, great care must be taken when designing the algorithm so that it has low overhead while being efficient at its intended task. In this work, we proposed an ACA which uses the Q-learning algorithm to select the optimal frame size based on the number of collisions detected in the previous frame. The proposed RL-DFSA was trained with 1000 tags during the learning period and the resultant Q-matrix was used for evaluating the performance of the algorithm for numbers of tags varying from 10 to 1000. Its performance was compared with four algorithms, namely, EPC-Fixed, EPC-Q-Frame, EPC-Q-Slot, and Ideal. Through extensive simulations, it is concluded that RL-DFSA performs equal to or better than commercial algorithms on various performance metrics. Specifically, the number of frames required by RL-DFSA is an order of magnitude lower than that of the best performing algorithm currently used by commercial readers. Hence, RL-DFSA is shown to be an efficient anti-collision algorithm which is also energy efficient. The energy efficiency claim is valid since the computational cost is more than 70 times cheaper than the communication cost, depending on the processor architecture. However, RL-DFSA still has some room for improvement. The algorithm has a slow convergence speed and its computational cost is relatively higher than that of the commercial algorithms. Thus, the application of another Q-learning derivative such as Speedy-Q can be investigated to speed up the learning process. Finally, RL-DFSA should be implemented on a software-defined radio platform and evaluated in real-life applications.

Figure 2 - Q-learning performance for 1000 tags

Figure 4 - TSE of the algorithms for various numbers of tags

Figure 4.2 - Average number of frames required per interrogation round for various algorithms

Figure 4.3 - Average number of slots required per interrogation round for various algorithms

Figure 4.4 - Average time required per interrogation round for various algorithms: (a) sparse tag environment (10 to 100 tags); (b) dense tag environment (100 to 1000 tags)