A survey of multimodal machine learning

CHEN Peng; LI Qing; ZHANG De-zheng; YANG Yu-hang; CAI Zheng; LU Zi-yi

doi:10.13374/j.issn2095-9389.2019.03.21.003

Volume 42 Issue 5

May 2020

CHEN Peng, LI Qing, ZHANG De-zheng, YANG Yu-hang, CAI Zheng, LU Zi-yi. A survey of multimodal machine learning[J]. Chinese Journal of Engineering, 2020, 42(5): 557-569. doi: 10.13374/j.issn2095-9389.2019.03.21.003

CHEN Peng^{1, 2},
LI Qing^{1, 2},
ZHANG De-zheng^{3, 4},
YANG Yu-hang^{1},
CAI Zheng^{1},
LU Zi-yi^{1}

1. School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China
2. Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, Beijing 100083, China
3. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
4. Beijing Key Laboratory of Knowledge Engineering for Materials Science, Beijing 100083, China

Corresponding author: E-mail: liqing@ies.ustb.edu.cn
Received Date: 2019-03-21
Publish Date: 2020-05-01

Abstract

“Big data” is typically collected from different sources that have different data structures. With the rapid development of information technologies, today's valuable data resources are increasingly multimodal. As a result, multi-modal learning, which builds on classical machine learning strategies, has become a valuable research topic that enables computers to process and understand “big data”. Human cognition involves perception through different sense organs: signals from the eyes, ears, nose, and hands (tactile sense) together constitute a person’s understanding of a specific scene or of the world as a whole. It is reasonable to believe that multi-modal methods, with their greater ability to process complex heterogeneous data, can further promote the progress of information technologies. The concept of multimodality stems from psychology and pedagogy hundreds of years ago and has become popular in computer science during the past decade. In contrast to “media”, a “mode” is a more fine-grained concept associated with a typical data source or data form. The effective utilization of multi-modal data can help a computer understand a specific environment in a more holistic way. In this paper, we first introduce the definition and main tasks of multi-modal learning. On this basis, we then briefly introduce the mechanism and origin of multi-modal machine learning. Subsequently, statistical learning methods and deep learning methods for multi-modal tasks are comprehensively summarized. We also introduce the main styles of data fusion in multi-modal perception tasks, including feature representation, shared mapping, and co-training. Additionally, novel adversarial learning strategies for cross-modal matching or generation are reviewed. The paper outlines the main methods for multi-modal learning, with a focus on future research issues in this field.

Keywords:
- multi-modal learning
- statistical learning
- deep learning
- adversarial learning
- feature representation
