Citation: XU Cong, LI Qing, ZHANG De-zheng, CHEN Peng, CUI Jia-rui. Research progress of deep reinforcement learning applied to text generation[J]. Chinese Journal of Engineering, 2020, 42(4): 399-411. doi: 10.13374/j.issn2095-9389.2019.06.16.030
[1] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. 2nd Ed. Massachusetts: MIT Press, 2018
[2] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529 doi: 10.1038/nature14236
[3] Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484 doi: 10.1038/nature16961
[4] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436 doi: 10.1038/nature14539
[5] Littman M L. Reinforcement learning improves behaviour from evaluative feedback. Nature, 2015, 521(7553): 445 doi: 10.1038/nature14540
[6] Li Y X. Deep reinforcement learning: an overview[J/OL]. arXiv Preprint (2017-09-15) [2019-06-16]. https://arxiv.org/abs/1701.07274
[7] Baroni M, Zamparelli R. Nouns are vectors, adjectives are matrices: representing adjective-noun constructions in semantic space // Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Cambridge, 2010: 1183
[8] Mitchell J, Lapata M. Vector-based models of semantic composition // Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics. Columbus, 2008: 236
[9] Su P H, Gašić M, Mrkšić N, et al. On-line active reward learning for policy optimisation in spoken dialogue systems // Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, 2016: 2431
[10] Vinyals O, Le Q. A neural conversational model[J/OL]. arXiv Preprint (2015-07-22) [2019-06-16]. https://arxiv.org/abs/1506.05869
[11] Wen T H, Vandyke D, Mrkšić N, et al. A network-based end-to-end trainable task-oriented dialogue system[J/OL]. arXiv Preprint (2017-04-24) [2019-06-16]. https://arxiv.org/abs/1604.04562
[12] Wen T H, Gašić M, Kim D, et al. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking // Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Prague, 2015: 275
[13] Henderson M, Thomson B, Williams J. The second dialog state tracking challenge // Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Philadelphia, 2014: 263
[14] Eric M, Manning C D. Key-value retrieval networks for task-oriented dialogue[J/OL]. arXiv Preprint (2017-07-14) [2019-06-16]. https://arxiv.org/abs/1705.05414
[15] Lowe R, Pow N, Serban I V, et al. The Ubuntu Dialogue Corpus: a large dataset for research in unstructured multi-turn dialogue systems // Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Prague, 2015: 285
[16] Brown P F, Della Pietra V J, Della Pietra S A, et al. The mathematics of statistical machine translation: Parameter estimation. Comput Linguist, 1993, 19(2): 263
[17] Koehn P, Och F J, Marcu D. Statistical phrase-based translation // Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Edmonton, 2003: 48
[18] Zhang J J, Zong C Q. Deep neural networks in machine translation: an overview. IEEE Intell Syst, 2015, 30(5): 16 doi: 10.1109/MIS.2015.69
[19] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks // Advances in Neural Information Processing Systems. Montréal, 2014: 3104
[20] Cho K, van Merriënboer B, Bahdanau D, et al. On the properties of neural machine translation: encoder–decoder approaches // Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, 2014: 103
[21] Luong M T, Pham H, Manning C D. Effective approaches to attention-based neural machine translation // Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, 2015: 1412
[22] Wu Y H, Schuster M, Chen Z F, et al. Google’s neural machine translation system: bridging the gap between human and machine translation[J/OL]. arXiv Preprint (2016-10-08) [2019-06-16]. https://arxiv.org/abs/1609.08144
[23] He Z J. Baidu translate: research and products // Proceedings of the ACL 2015 Fourth Workshop on Hybrid Approaches to Translation (HyTra). Beijing, 2015: 61
[24] Cho K, van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation // Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, 2014: 1724
[25] Xu K, Ba J L, Kiros R, et al. Show, attend and tell: Neural image caption generation with visual attention // Proceedings of the 32nd International Conference on Machine Learning. Lille, 2015: 2048
[26] Das A, Kottur S, Gupta K, et al. Visual dialog[J/OL]. arXiv Preprint (2017-08-01) [2019-06-16]. https://arxiv.org/abs/1611.08669
[27] Hodosh M, Young P, Hockenmaier J. Framing image description as a ranking task: Data, models and evaluation metrics. J Artif Intell Res, 2013, 47: 853 doi: 10.1613/jair.3994
[28] Young P, Lai A, Hodosh M, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist, 2014, 2: 67
[29] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context // Proceedings of the European Conference on Computer Vision. Zurich, 2014: 740
[30] Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning // Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. Phoenix, 2016: 2094
[31] Schaul T, Quan J, Antonoglou I, et al. Prioritized experience replay[J/OL]. arXiv Preprint (2016-02-25) [2019-06-16]. https://arxiv.org/abs/1511.05952
[32] Wang Z, Schaul T, Hessel M, et al. Dueling network architectures for deep reinforcement learning // Proceedings of the 33rd International Conference on Machine Learning. New York, 2016: 1995
[33] Schulman J, Levine S, Moritz P, et al. Trust region policy optimization // Proceedings of the 32nd International Conference on Machine Learning. Lille, 2015: 1889
[34] Kandasamy K, Bachrach Y, Tomioka R, et al. Batch policy gradient methods for improving neural conversation models[J/OL]. arXiv Preprint (2017-02-10) [2019-06-16]. https://arxiv.org/abs/1702.03334
[35] Bhatnagar S, Sutton R S, Ghavamzadeh M, et al. Natural actor-critic algorithms. Automatica, 2009, 45(11): 2471 doi: 10.1016/j.automatica.2009.07.008
[36] Grondman I, Busoniu L, Lopes G A D, et al. A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Trans Syst Man Cybern Part C Appl Rev, 2012, 42(6): 1291 doi: 10.1109/TSMCC.2012.2218595
[37] Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning // Proceedings of the 33rd International Conference on Machine Learning. New York, 2016: 1928
[38] Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning[J/OL]. arXiv Preprint (2016-02-29) [2019-06-16]. https://arxiv.org/abs/1509.02971
[39] Kulkarni T D, Saeedi A, Gautam S, et al. Deep successor reinforcement learning[J/OL]. arXiv Preprint (2016-06-08) [2019-06-16]. https://arxiv.org/abs/1606.02396
[40] Xu C, Li Q, Zhang D, et al. Deep successor feature learning for text generation[J/OL]. Neurocomputing (2019-04-25) [2019-06-16]. https://doi.org/10.1016/j.neucom.2018.11.116
[41] Zhang J W, Springenberg J T, Boedecker J, et al. Deep reinforcement learning with successor features for navigation across similar environments[J/OL]. arXiv Preprint (2017-07-23) [2019-06-16]. https://arxiv.org/abs/1612.05533
[42] Bowling M, Burch N, Johanson M, et al. Heads-up limit hold’em poker is solved. Science, 2015, 347(6218): 145 doi: 10.1126/science.1259433
[43] Liu X, Xia T, Wang J, et al. Fully convolutional attention localization networks for fine-grained recognition[J/OL]. arXiv Preprint (2017-03-21) [2019-06-16]. https://arxiv.org/abs/1603.06765
[44] Zoph B, Le Q V. Neural architecture search with reinforcement learning[J/OL]. arXiv Preprint (2017-02-15) [2019-06-16]. https://arxiv.org/abs/1611.01578
[45] Theocharous G, Thomas P S, Ghavamzadeh M. Personalized ad recommendation systems for life-time value optimization with guarantees // Proceedings of the 24th International Joint Conference on Artificial Intelligence. Buenos Aires, 2015: 1806
[46] Cuayáhuitl H. SimpleDS: A simple deep reinforcement learning dialogue system // Dialogues with Social Robots. Singapore: Springer, 2017: 109
[47] He D, Xia Y C, Qin T, et al. Dual learning for machine translation // Advances in Neural Information Processing Systems. Barcelona, 2016: 820
[48] Zhang X X, Lapata M. Sentence simplification with deep reinforcement learning // Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, 2017: 584
[49] Narasimhan K, Kulkarni T D, Barzilay R. Language understanding for text-based games using deep reinforcement learning // Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, 2015: 1001
[50] Williams R J, Zipser D. A learning algorithm for continually running fully recurrent neural networks. Neural Comput, 1989, 1(2): 270 doi: 10.1162/neco.1989.1.2.270
[51] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput, 1997, 9(8): 1735 doi: 10.1162/neco.1997.9.8.1735
[52] He J, Chen J, He X, et al. Deep reinforcement learning with a natural language action space // Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, 2016: 1621
[53] Guo H. Generating text with deep reinforcement learning[J/OL]. arXiv Preprint (2015-10-30) [2019-06-16]. https://arxiv.org/abs/1510.09202
[54] Papineni K, Roukos S, Ward T, et al. BLEU: a method for automatic evaluation of machine translation // Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, 2002: 311
[55] Sutton R S, McAllester D A, Singh S P, et al. Policy gradient methods for reinforcement learning with function approximation // Advances in Neural Information Processing Systems. Denver, 2000: 1057
[56] Ranzato M A, Chopra S, Auli M, et al. Sequence level training with recurrent neural networks[J/OL]. arXiv Preprint (2016-05-06) [2019-06-16]. https://arxiv.org/abs/1511.06732
[57] Li J W, Monroe W, Shi T L, et al. Adversarial learning for neural dialogue generation[J/OL]. arXiv Preprint (2017-09-24) [2019-06-16]. https://arxiv.org/abs/1701.06547
[58] Lin C Y. ROUGE: A package for automatic evaluation of summaries // Proceedings of the ACL-04 Workshop on Text Summarization Branches Out. Barcelona, 2004: 8
[59] Rennie S J, Marcheret E, Mroueh Y, et al. Self-critical sequence training for image captioning[J/OL]. arXiv Preprint (2017-11-16) [2019-06-16]. https://arxiv.org/abs/1612.00563
[60] Vedantam R, Zitnick C L, Parikh D. CIDEr: Consensus-based image description evaluation // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, 2015: 4566
[61] Banerjee S, Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments // Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, 2005: 65
[62] Wang L, Yao J L, Tao Y Z, et al. A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization // Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, 2018: 4453
[63] Wu Y X, Hu B T. Learning to extract coherent summary via deep reinforcement learning // Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. New Orleans, 2018: 5602
[64] Li J W, Monroe W, Ritter A, et al. Deep reinforcement learning for dialogue generation[J/OL]. arXiv Preprint (2016-09-29) [2019-06-16]. https://arxiv.org/abs/1606.01541
[65] Takanobu R, Huang M, Zhao Z Z, et al. A weakly supervised method for topic segmentation and labeling in goal-oriented dialogues via reinforcement learning // Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, 2018: 4403
[66] Bahdanau D, Brakel P, Xu K, et al. An actor-critic algorithm for sequence prediction[J/OL]. arXiv Preprint (2017-03-03) [2019-06-16]. https://arxiv.org/abs/1607.07086
[67] Su P H, Budzianowski P, Ultes S, et al. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management[J/OL]. arXiv Preprint (2017-07-05) [2019-06-16]. https://arxiv.org/abs/1707.00130
[68] Wang Z Y, Bapst V, Heess N, et al. Sample efficient actor-critic with experience replay[J/OL]. arXiv Preprint (2017-07-10) [2019-06-16]. https://arxiv.org/abs/1611.01224
[69] Peters J, Schaal S. Natural actor-critic. Neurocomputing, 2008, 71(7-9): 1180 doi: 10.1016/j.neucom.2007.11.026
[70] Chen L, Su P H, Gašić M. Hyper-parameter optimisation of Gaussian process reinforcement learning for statistical dialogue management // Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Prague, 2015: 407
[71] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets // Advances in Neural Information Processing Systems. Montréal, 2014: 2672
[72] Yu L T, Zhang W N, Wang J, et al. SeqGAN: Sequence generative adversarial nets with policy gradient // Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. San Francisco, 2017: 2852
[73] Pfau D, Vinyals O. Connecting generative adversarial networks and actor-critic methods[J/OL]. arXiv Preprint (2017-01-18) [2019-06-16]. https://arxiv.org/abs/1610.01945
[74] Serban I V, Sankar C, Germain M, et al. A deep reinforcement learning chatbot[J/OL]. arXiv Preprint (2017-11-05) [2019-06-16]. https://arxiv.org/abs/1709.02349
[75] He D, Lu H Q, Xia Y C, et al. Decoding with value networks for neural machine translation // Advances in Neural Information Processing Systems. Long Beach, 2017: 177
[76] Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning // Proceedings of the 33rd International Conference on Machine Learning. New York, 2016: 1928
[77] Casanueva I, Budzianowski P, Su P H, et al. Feudal reinforcement learning for dialogue management in large domains // Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, 2018: 714
[78] Dayan P, Hinton G E. Feudal reinforcement learning // Advances in Neural Information Processing Systems. Denver, 1993: 271
[79] Xiong W, Hoang T, Wang W Y. DeepPath: a reinforcement learning method for knowledge graph reasoning // Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, 2017: 564
[80] Buck C, Bulian J, Ciaramita M, et al. Ask the right questions: active question reformulation with reinforcement learning[J/OL]. arXiv Preprint (2018-03-02) [2019-06-16]. https://arxiv.org/abs/1705.07830
[81] Feng J, Huang M L, Zhao L, et al. Reinforcement learning for relation classification from noisy data // Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. New Orleans, 2018: 5779
[82] Zhang T Y, Huang M L, Zhao L. Learning structured representation for text classification via reinforcement learning // Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. New Orleans, 2018: 6053