In this paper, we propose a framework for learning through drawing. Our goal is to learn the correspondence between spoken words and abstract visual attributes from a dataset of spoken descriptions of images. We manipulate the learned latent representations of GANs to edit semantic concepts in their generated outputs, and we use these GAN-generated images to train a model with a triplet loss.
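The triplet objective mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes squared Euclidean distances between embeddings and an arbitrary margin of 0.2, and all names are hypothetical. The anchor (e.g. a spoken-word embedding) is pulled toward a positive image embedding and pushed away from a negative one by at least the margin.

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on plain Python vectors.

    Returns max(0, d(a, p) - d(a, n) + margin), where d is the
    squared Euclidean distance.
    """
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


# The positive lies much closer to the anchor than the negative,
# so the hinge is inactive and the loss is zero.
a = [1.0, 0.0]
p = [0.9, 0.1]
n = [0.0, 1.0]
print(triplet_loss(a, p, n))  # → 0.0
```

In training, such a loss encourages matched (word, attribute) pairs to embed closer together than mismatched pairs, which is what makes GAN-edited images useful as controlled positives and negatives.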
Acknowledgements
This research was partially supported by the Toyota Research Institute.