Citation: ZHANG Longfei, FENG Yanghe, LIANG Xingxing, LIU Shixuan, CHENG Guangquan, HUANG Jincai. Sample strategy based on TD-error for offline reinforcement learning[J]. Chinese Journal of Engineering. doi: 10.13374/j.issn2095-9389.2022.10.22.001

Sample strategy based on TD-error for offline reinforcement learning

doi: 10.13374/j.issn2095-9389.2022.10.22.001
Abstract: Offline reinforcement learning learns a policy from pre-collected expert or other empirical data without interacting with the environment. It is attractive compared with online reinforcement learning because of its lower interaction cost and trial-and-error risk. However, because Q-value estimation errors cannot be corrected promptly through interaction, offline reinforcement learning often suffers from severe extrapolation error and low sample utilization. To address this, this paper proposes an effective sampling strategy for offline reinforcement learning based on the temporal-difference (TD) error: the TD-error serves as the priority measure for prioritized sampling, and a combination of prioritized and uniform sampling improves sample efficiency while mitigating out-of-distribution error. Meanwhile, building on a dual Q-value estimation network, the paper examines three ways of computing the target value, namely the minimum, the maximum, and a convex combination of the two Q-networks, and compares the performance of the algorithm under each corresponding TD-error measure. Furthermore, an importance sampling mechanism is used to eliminate the training bias introduced by preferential sampling. Compared with existing offline reinforcement learning methods that incorporate sampling strategies on the D4RL benchmark, the proposed algorithm achieves better final performance, data efficiency, and training stability. To confirm the contribution of each component, two ablation experiments were conducted. Experiment 1 shows that the algorithm combining uniform and prioritized sampling outperforms both uniform-only and priority-only sampling in sample utilization and policy stability. Experiment 2 compares the three TD-error calculation methods based on the dual Q-value estimation network, i.e., the maximum, the minimum, and a convex combination of the two Q-values: the variant taking the minimum of the two networks achieves the best overall performance and data utilization, although its policy variance is higher. The approach described in this paper can be combined with any offline reinforcement learning method based on Q-value estimation; it is stable, straightforward to implement, and highly scalable, and it supports the application of reinforcement learning techniques in real-world settings.
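
To make the sampling scheme described in the abstract concrete, the sketch below illustrates, under stated assumptions, how TD-error-based priorities over a fixed offline dataset can be mixed with uniform sampling and how importance-sampling weights correct the resulting bias; the TD-error here uses the minimum of two target Q-networks, the variant the paper reports as performing best. This is an illustrative sketch, not the authors' implementation: the class and parameter names (MixedPriorityBuffer, alpha, beta, mix, eps) and their default values are assumptions chosen for clarity and are not taken from the paper.

    # Illustrative sketch only (not the authors' released code): TD-error-based
    # prioritized sampling for a fixed offline dataset, mixed with uniform
    # sampling and corrected by importance-sampling weights.
    import numpy as np


    def td_error(reward, gamma, done, q_pred, q1_target, q2_target):
        # TD-error with the target taken as the minimum of the two target
        # Q-networks (the best-performing of the three variants discussed).
        target = reward + gamma * (1.0 - done) * np.minimum(q1_target, q2_target)
        return target - q_pred


    class MixedPriorityBuffer:
        def __init__(self, size, alpha=0.6, beta=0.4, mix=0.5, eps=1e-6):
            self.priorities = np.ones(size)      # |TD-error| + eps per transition
            self.alpha, self.beta = alpha, beta  # priority / IS-correction exponents
            self.mix, self.eps = mix, eps        # share of prioritized sampling

        def sample(self, batch_size, rng=None):
            rng = rng or np.random.default_rng()
            n = len(self.priorities)
            p_prio = self.priorities ** self.alpha
            p_prio /= p_prio.sum()
            # Blend prioritized and uniform distributions so transitions with
            # inflated (out-of-distribution) TD-errors do not dominate training.
            probs = self.mix * p_prio + (1.0 - self.mix) / n
            idx = rng.choice(n, size=batch_size, p=probs)
            # Importance-sampling weights remove the bias introduced by
            # preferential sampling; normalize by the maximum for stability.
            weights = (n * probs[idx]) ** (-self.beta)
            weights /= weights.max()
            return idx, weights

        def update_priorities(self, idx, td_errors):
            self.priorities[idx] = np.abs(td_errors) + self.eps

In a typical training loop of this kind, sample() would draw a minibatch of transition indices, each transition's Q-loss would be scaled by the returned weight, and update_priorities() would then be called with the new absolute TD-errors so that the priorities track the current value estimates.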

     
