Abstract: With the recent successes of Google's artificial intelligence system (AlphaGo) in the game of Go, deep reinforcement learning (DRL) has attracted increasing attention. DRL combines the perception ability of deep learning in complex environments with the decision-making ability of reinforcement learning in complex scenarios. Natural language processing (NLP) involves a very large number of words and sentences that must be represented, and its text generation subtasks, such as dialogue systems, machine translation, and image captioning, contain many decision problems that are difficult to model. For these reasons, DRL can be applied effectively to a variety of NLP tasks, including named entity recognition, relation extraction, dialogue systems, image captioning, and machine translation, where it helps improve existing model architectures or training pipelines, and notable results have already been obtained. DRL is not a single algorithm or method but a paradigm: many researchers cast NLP tasks in this paradigm and achieve better performance. In particular, when text generation is formulated under the reinforcement learning paradigm, the process of producing a predicted sequence from a given source sequence can be regarded as a Markov decision process (MDP). In an MDP, the agent interacts with the environment by receiving a sequence of observations and scalar rewards and then produces the next action, i.e., the next word. This endows the text generation model with decision-making ability, which makes text generation combined with reinforcement learning an attractive and promising research field. This paper gives a comprehensive introduction and a systematic overview. First, we present the basic DRL methods and their variants. Then, we survey the main applications of DRL in text generation tasks, trace their development, and summarize the merits and drawbacks of these applications. Finally, we discuss future research directions and challenges for combining DRL with NLP.
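The MDP view of text generation described above can be made concrete with a short policy-gradient sketch. The following Python/PyTorch example is purely illustrative and is not taken from any of the surveyed papers: the state is the prefix generated so far, each action is the next word, and a delayed, sequence-level reward (a stand-in for a metric such as BLEU) drives a REINFORCE update. All names and sizes (ToyDecoder, sentence_reward, the vocabulary and reference sentence) are assumptions made for the sketch.

```python
# Minimal sketch (not from the surveyed papers) of text generation as an MDP:
# state = the prefix generated so far, action = the next word, and a delayed,
# sequence-level reward drives a REINFORCE policy-gradient update.
import torch
import torch.nn as nn

VOCAB, HIDDEN, MAX_LEN = 50, 32, 8  # illustrative sizes


class ToyDecoder(nn.Module):
    """Tiny recurrent policy pi(a_t | s_t): next-word distribution given the prefix."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRUCell(HIDDEN, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, token, hidden):
        hidden = self.rnn(self.embed(token), hidden)
        return self.out(hidden), hidden  # unnormalized scores over the next word


def sentence_reward(tokens, reference):
    """Stand-in for a sequence-level metric such as BLEU: recall of reference tokens."""
    return len(set(tokens) & set(reference)) / len(set(reference))


policy = ToyDecoder()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
reference = [3, 7, 11, 19]  # hypothetical target sentence (word ids)

for step in range(200):
    token = torch.zeros(1, dtype=torch.long)  # <bos> as word id 0
    hidden = torch.zeros(1, HIDDEN)
    log_probs, generated = [], []
    for _ in range(MAX_LEN):  # one episode = one generated sentence
        scores, hidden = policy(token, hidden)
        dist = torch.distributions.Categorical(logits=scores)
        token = dist.sample()               # the agent's action: pick the next word
        log_probs.append(dist.log_prob(token))
        generated.append(int(token))
    reward = sentence_reward(generated, reference)   # reward arrives only at the end of the episode
    loss = -torch.stack(log_probs).sum() * reward    # REINFORCE: maximize expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the literature this survey covers, the scalar reward is typically an automatic metric such as BLEU, ROUGE, or CIDEr computed against reference sentences, and a baseline (for example, self-critical sequence training) is usually subtracted from the reward to reduce the variance of this gradient estimator.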
Table 1. Summary of dialogue datasets

Dataset                                                           Number of dialogues   Number of slots   Scenes   Multi-turn
Cambridge restaurants database                                    720                   6                 1        Yes
San Francisco restaurants database                                3577                  12                1        Yes
Dialog system technology challenge 2                              3000                  8                 1        Yes
Dialog system technology challenge 3                              2265                  9                 1        Yes
Stanford multi-turn multi-domain task-oriented dialogue dataset   3031                  79/65/140         3        Yes
The Twitter dialogue corpus                                       1300000               —                 —        Yes
The Ubuntu dialogue corpus                                        932429                —                 —        No
Opensubtitle corpus                                               70000000              —                 —        No