An Introduction with Applications in Data Science

This is a textbook in probability in high dimensions with a view toward applications in data sciences. It is intended for doctoral and advanced master's students and beginning researchers in mathematics, statistics, electrical engineering, computer science, computational biology and related areas, who are looking to expand their knowledge of theoretical methods used in modern research in data sciences.

Data sciences are moving fast, and probabilistic methods often provide a foundation and inspiration for such advances. A typical graduate probability course is no longer sufficient to acquire the level of mathematical sophistication that is expected from a beginning researcher in data sciences today. This book aims to partially fill this gap. It presents some of the key probabilistic methods and results that should form an essential toolbox for a mathematical data scientist. This book can be used as a textbook for a basic second course in probability with a view toward data science applications. It is also suitable for self-study.

The essential prerequisites for reading this book are a rigorous course in probability theory (at the Master's or Ph.D. level), an excellent command of undergraduate linear algebra, and general familiarity with basic notions about Hilbert and normed spaces and linear operators. Knowledge of measure theory is not essential but would be helpful.

Once this book is completed, it is going to be published by Cambridge University Press. If you want to be notified once the book is available, please send me an e-mail.

I am still writing this book. The current draft of the textbook is available:

(Warning: large file, please be patient with download.)

As of now, the technical material is almost complete. References and a lot of "chat" will be added; please also ignore the typos in the current version.

This draft is updated periodically. Use it at your own risk, and only for your personal and classroom needs. Please do not distribute the copy.

Got ideas on how this textbook can be improved? Want to suggest useful topics or exercises? Please let me know; I will be happy to hear what you think.

Here are a few useful sources, which cover some of the material that is going to be included in the textbook. Some of them require more advanced background than this textbook does.

- R. Vershynin, Four lectures on probabilistic methods for data science. 2016 PCMI Summer School, AMS, to appear.
- R. Vershynin, Introduction to the non-asymptotic analysis of random matrices. Compressed sensing, 210–268, Cambridge Univ. Press, Cambridge, 2012.
- P. Rigollet, High-dimensional statistics. Lecture notes, Massachusetts Institute of Technology, 2015.
- A. Bandeira, Ten lectures and forty-two open problems in the mathematics of data science, Lecture notes, 2016.
- S. Boucheron, G. Lugosi and P. Massart, Concentration inequalities, Oxford University Press, 2013.
- T. Tao, Topics in random matrix theory, AMS, 2012.
- M. Ledoux, Concentration of measure phenomenon, AMS, 2001.
- Y. Plan, Probability in high dimensions, graduate course at UBC.
- R. Vershynin, High-dimensional probability, graduate course at UM.
- R. van Handel, Probability in high dimension, ORF 570 Lecture notes, Princeton University, 2014.
- D. Chafai, O. Guedon, G. Lecue, A. Pajor, Interactions between compressed sensing, random matrices and high-dimensional geometry, preprint.
- R. Vershynin, Lectures in geometric functional analysis, unpublished, 2009.

- **May 23, 2017.** An "Appetizer" has been added to the front of the book. It presents the so-called Maurey's empirical method, which is an elegant and elementary application of probability to bound covering numbers of sets. Chapter 7 is now polished.
- **April 27, 2017.** Chapter 6 is now polished.
- **April 20, 2017.** Chapter 5 is now polished. I cleaned up the guarantees of covariance estimation both in this chapter and those that appeared earlier in Chapter 4.
- **February 23, 2017.** Chapter 4 is now polished. I added an application to error correction codes in Section 4.3 and rewrote the application for covariance estimation in Section 4.7.
- **February 9, 2017.** Chapter 3 is now polished. I added a section (3.7) on kernel methods and Krivine's proof of Grothendieck's inequality, which gives (almost) the best known bound on the constant.
- **January 20, 2017.** Chapter 2 is now polished.
- **January 4, 2017.** Chapter 1 has been polished. The difficulty of exercises will be indicated by the number of coffee cups one may need to solve them.
- **December 21, 2016.** Numerous typos and inaccuracies have been fixed throughout the book. It was then converted into the publisher's style, which miraculously reduced the number of pages by 50!
- **December 20, 2016.** A short version of this book, condensed into just four lectures, can be found here.
- **November 15, 2016.** Two big sections have been added to Chapter 8: VC dimension and applications in statistical learning theory.
- **October 24, 2016.** A few applications have been added to Chapter 3: Grothendieck's inequality, semidefinite programming, and maximum cut for graphs.