Synthetic Data for Deep Learning

When Deep Learning was just about entering the domain of Mathematical Finance, it was already evident to many that data would soon assume a more significant role in financial modelling, be it to make predictions, to analyse the market or to train models. What was much more difficult to anticipate, at least for the majority of users, was the emerging role of synthetic data in this context. Given the terabytes of data footprints produced daily across an array of domains, we were (quite unsurprisingly) speaking about big data, rather than synthetic data, for a long time. It was only as deep learning applications became more prevalent that use cases for, and thus the need for, high-quality synthetic data became important.
Though interest in this area has recently grown through many channels, for many of those following research in Quantitative Finance it was the work on Deep Hedging (Buehler et al. 2019) that kickstarted interest in synthetic data. Deep Hedging was one of the applications that brought the advantages of synthetic data (or, as recently coined in the domain, 'Market Generators') to light. The interest in market generation has opened up new avenues for financial modelling and has led to a surge in research activity in the domain, with a number of recent contributions (for example Bonnier et al. 2019, Wiese et al. 2020, Acciaio et al. 2021, Buehler et al. 2021), and doubtless more to come. It is thus no surprise that several of the 2022 Risk Awards went to Hans Bühler and the JP Morgan team. In more traditional Machine Learning applications (think robotics or autonomous driving) the case for synthetic (training) data was made far earlier than in finance. Nevertheless, even in these more classical domains, only fairly recent developments have focussed attention (or indeed an entire book) on the study of synthetic data for deep learning: in fact, the author joyfully exclaims in the preface that he 'managed to release one of the first books specifically devoted to the subject of synthetic data'. And while Nikolenko goes on to convince the reader why this subject deserves our attention in general, we devote this book review to discussing why it deserves the attention of quants and quantitative finance researchers specifically.
Synthetic Data for Deep Learning by Sergey I. Nikolenko appeared in 2019 in the series Springer Optimization and Its Applications. The book is slightly outside the range of our regular themes and is, strictly speaking, not a book on Quantitative Finance: to put it differently, it was not written with the aim of addressing questions arising in finance. It is a book on deep learning in its more traditional sense, and the applications are aligned with the typically known use cases of deep learning, from computer vision problems to optical flow estimation and navigation, none of which are related to finance in any obvious form. So why then was it selected for the spotlight in this journal, especially if several of the topics discussed in the book are not (yet) on the standard agenda of its community? It is a distinctive flavour of quantitative finance that one can, and perhaps indeed should, get inspiration by doing some window-shopping in other disciplines and immersing oneself in the new methods on display there. Such excursions can (and in the past quite frequently did) add to the quantitative modeller's tool-kit. It may well be that the book will provide some inspiration for useful new tools in this spirit. With this in mind, we consider Synthetic Data for Deep Learning a great read for those who are currently interested or active in developing synthetic data solutions in quantitative finance, in particular those who do not mind being left hungry for answers to questions that arise along the way.
The author, Sergey I. Nikolenko, is beyond doubt very active in the area, with a previously published monograph that shares its title, 'Deep Learning', with the famous 2016 reference work (Goodfellow et al. 2016) by Goodfellow, Bengio and Courville. Nikolenko heads a laboratory at the Steklov Institute of Mathematics in St. Petersburg and has been commercially active for a number of companies, among them large international businesses as well as smaller providers of Machine Learning solutions.
Praise and criticism: This book has some potential to inspire new tools for finance, since it showcases several applications and an array of challenges where synthetic data is used for deep learning. It also displays some of the typical pitfalls in those applications. However, if the book does inspire new tools, they are quite certainly not ready-made for quantitative finance and will require some adaptation by the prepared reader. It should also be pointed out, as the author does himself, that this is not an introductory textbook. Although it contains several introductory chapters on deep neural networks and the corresponding optimisation problems, as well as on deep generative models and neural architectures for computer vision, these chapters are better seen as a reminder than as a thorough introduction. The book is targeted at 'a somewhat prepared reader', who will only use the introductory chapters as reference material. It is disappointing to this reviewer that the author, who is clearly practically oriented and experienced, has not shared more practical examples and, in particular, code. It would have been immensely useful to have snippets of code available alongside the chapters, or pointers to repositories, which would have helped the interested reader to actively engage with the concepts.
Structure and contents: The book is structured into ten chapters, in addition to an introductory and a concluding chapter. The introductory chapter takes up what the author promises to do: it explains 'the data problem' to convince readers that synthetic data deserves their attention in the first place. The concluding chapter (Chapter 12) points to possible directions for future work and to domain adaptation. The contents cover three main directions for the use of synthetic data in machine learning and provide a high-level walkthrough of typical challenges and solutions (albeit from a perspective unrelated to finance):
(1) using synthetically generated datasets to train machine learning models directly;
(2) using synthetic data to augment existing real datasets so that the resulting hybrid datasets are better suited for training the models;
(3) using synthetic data to resolve privacy issues that make the use of real data difficult.
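The first two of these uses translate readily into the quant setting. As a minimal sketch (not taken from the book), the 'market generator' below is deliberately naive: plain geometric Brownian motion with made-up parameters stands in for the learned generative models the literature discusses, and it serves only to illustrate training on purely synthetic paths and augmenting a small 'real' sample with them.

```python
import numpy as np

rng = np.random.default_rng(0)

def gbm_paths(n_paths, n_steps, s0=1.0, mu=0.0, sigma=0.2, dt=1 / 252):
    """Simulate geometric Brownian motion paths: a toy 'market generator'."""
    z = rng.standard_normal((n_paths, n_steps))
    log_increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.cumsum(log_increments, axis=1))

# (1) Train directly on synthetic data: here 'training' is simply
# estimating annualised volatility from the synthetic paths.
synthetic = gbm_paths(n_paths=1000, n_steps=252)
log_returns = np.diff(np.log(synthetic), axis=1)
est_sigma = log_returns.std() * np.sqrt(252)

# (2) Augment a small 'real' dataset (here itself simulated, as a
# stand-in for market data) with synthetic paths before training.
real = gbm_paths(n_paths=50, n_steps=252, sigma=0.25)
hybrid = np.vstack([real, synthetic])
```

In practice the generator would of course be a trained model rather than a parametric simulator, but the division of labour (generate, then train on synthetic or hybrid data) is the same.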
The discussions around synthetic data highlight a question relevant for finance from a regulatory perspective, namely whether realism in synthetic data is always as necessary as we tend to think in order to train models efficiently. The book revisits this question several times, with different viewpoints in different applications (though, here too, answers in the financial context are yet to be delivered). It is somewhat inconvenient for the curious reader that these themes are presented mainly from a computer vision angle. The book does touch on financial aspects (Chapter 11), albeit in a rudimentary fashion, in the context of the privacy issues of point (3) above; in this chapter, a number of blanks are left to the finance-inclined reader's imagination. However, topics around synthetic data and questions of data privacy in finance are currently picking up and gaining momentum among quants and quantitative finance researchers, and there is a lot to say in this area: we can expect to see much more finance-related research here in the future.
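Privacy-preserving data release of the kind alluded to in point (3) typically builds on differential privacy. As one hedged, self-contained illustration (not drawn from the book), the snippet below shows the classical Laplace mechanism for releasing a mean privately; the 'client returns' data and all parameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_mean(values, lower, upper, epsilon):
    """Release the mean of a sensitive dataset with epsilon-differential
    privacy: clip each record to [lower, upper], then add Laplace noise
    calibrated to the sensitivity of the mean."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(values)  # max effect of one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Hypothetical sensitive data: annual client portfolio returns.
returns = rng.normal(loc=0.05, scale=0.02, size=10_000)
private_estimate = dp_mean(returns, lower=-0.5, upper=0.5, epsilon=1.0)
```

Full synthetic-data generators with formal privacy guarantees compose many such noisy queries, but the clip-and-add-noise primitive is the same building block.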
What readers can expect to gain from this book will depend strongly on their previous knowledge. The author elegantly acknowledges this by distinguishing between 'unprepared' and 'prepared' readers. As pointed out previously, it is not ideal as an introductory textbook: the finer details and tedious discussions are often circumvented for the sake of brevity and a neat presentation. On the plus side, readers may find themselves gaining good high-level insights while skimming through chapters. However, the same readers, whether prepared or unprepared, may at times find it somewhat difficult to fill in the details or to reproduce the numerical results independently. Nevertheless, all readers of this book will gain a broader perspective on generative modelling and its historical evolution. Perhaps even more importantly, the book gives its readers a good starting point for communicating and building bridges across disciplines, through an increased awareness of the terminology used in other areas of application and of the typical challenges and pitfalls that arise there. We are convinced that such awareness will bear numerous benefits in an ever more interdisciplinary research landscape.