- David Blei, Princeton (2012 Jamon Lecture Awardee)
- Felix Herrmann, Univ of British Columbia
- Jonathan Goldstein, Microsoft
Big Data and Bayes: Stochastic Variational Inference and Scalable Topic Models
Harveys Emerald Bay A
Probabilistic topic modeling provides a suite of tools for analyzing large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. We can use topic models to explore the thematic structure of a corpus and to solve a variety of prediction problems about documents.
Most topic models are based on hierarchical mixed-membership models, where each document expresses a set of components (called topics) with individual per-document proportions. The computational problem is to condition on a collection of observed documents and estimate the posterior distribution of the topics and per-document proportions. In modern data sets, this amounts to posterior inference with billions of latent variables.
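To make the mixed-membership structure concrete, here is a minimal generative sketch of a single LDA-style document. The symbols (K topics beta, per-document proportions theta, per-word assignments z) and all hyperparameter values are illustrative assumptions for this sketch, not the speaker's code; the talk covers a broader family of such models.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, n_words = 5, 1000, 120          # topics, vocabulary size, document length

beta = rng.dirichlet(0.01 * np.ones(V), size=K)   # global topics: distributions over words
theta = rng.dirichlet(0.1 * np.ones(K))           # one document's topic proportions
z = rng.choice(K, size=n_words, p=theta)          # per-word topic assignments
words = np.array([rng.choice(V, p=beta[k]) for k in z])  # observed words
```

Posterior inference reverses this process: given only `words` for every document in the corpus, estimate `beta`, `theta`, and `z`.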
How can we cope with such data? In this talk, I will describe stochastic variational inference, an algorithm for computing with topic models that can handle very large document collections (and even endless streams of documents). I will demonstrate our algorithm with models fitted to millions of articles. I will show how stochastic variational inference can be generalized to many kinds of hierarchical models, including models of images and social networks, and Bayesian nonparametric models. I will highlight several open questions and outstanding issues.
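For readers who want to see the shape of the algorithm, below is a minimal sketch of one stochastic variational inference update for LDA, following the standard formulation: sample a document, fit its local variational parameters by coordinate ascent, then take a decreasing-step-size natural-gradient step on the global topic parameters. Variable names such as `lam`, `gamma`, and `phi`, and all hyperparameter values, are illustrative assumptions, not the speaker's implementation.

```python
import numpy as np
from scipy.special import digamma

def expect_log_dirichlet(param):
    # E[log x] for x ~ Dirichlet(param), computed row-wise.
    return digamma(param) - digamma(param.sum(axis=-1, keepdims=True))

def svi_step(lam, doc, D, t, alpha=0.1, eta=0.01, kappa=0.7, tau0=1.0, inner=50):
    """One SVI update from a single sampled document.

    lam : (K, V) global variational Dirichlet parameters for the topics
    doc : (ids, cts) unique word ids and their counts in the document
    D   : total number of documents in the (possibly huge) corpus
    t   : iteration counter, used for the decreasing step size
    """
    K, V = lam.shape
    ids, cts = doc
    Elog_beta = expect_log_dirichlet(lam)[:, ids]        # (K, n_unique)

    # Local step: fit this document's variational parameters
    # (gamma for proportions, phi for word-topic assignments).
    gamma = np.ones(K)
    for _ in range(inner):
        log_phi = expect_log_dirichlet(gamma)[:, None] + Elog_beta
        phi = np.exp(log_phi - log_phi.max(axis=0))
        phi /= phi.sum(axis=0)
        gamma = alpha + phi @ cts

    # Global step: pretend this document was seen D times to form a noisy
    # natural-gradient estimate, then take a Robbins-Monro step of size rho.
    lam_hat = np.full((K, V), eta)
    lam_hat[:, ids] += D * (phi * cts)
    rho = (tau0 + t) ** (-kappa)
    return (1.0 - rho) * lam + rho * lam_hat
```

Because each update touches only one document (or a small minibatch), the same loop can run over an endless stream of documents, which is what makes the approach scale.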
David Blei is an associate professor of Computer Science at Princeton University. He received his PhD from U.C. Berkeley in 2004 and was a postdoctoral fellow at Carnegie Mellon University. His research focuses on probabilistic topic models, Bayesian nonparametric methods, and approximate posterior inference. He works on a variety of applications, including text, images, music, social networks, and scientific data.
Randomized Sampling in Exploration Seismology
Harveys Emerald Bay A
Present-day wave-equation-based exploration seismology increasingly relies on faithful sampling and simulation of seismic wavefields. This reliance on full sampling and high-fidelity wavefield simulations strains our acquisition and processing systems, and overcoming this impediment is becoming one of the main challenges faced by our industry. Using randomized dimensionality-reduction techniques, we propose a new strategy in which acquisition and computational costs are no longer dictated by the sampling grid but by transform-domain compressibility. To arrive at these results, we draw on recent insights from stochastic optimization, statistical physics, and compressive sensing: by carrying out the inversions on random subsets of the data, we minimize the number of required passes through the data and thereby reduce processing costs. By incorporating these ideas into our problem formulations, we are able to control the skyrocketing acquisition- and computation-related costs.
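The flavor of the approach can be conveyed with a toy sketch: solve a sparsity-promoting inverse problem, min_x 0.5||Ax - b||^2 + mu||x||_1, while touching only a random subset of rows (e.g., a few shots) per iteration rather than the full data. Everything here, from the solver structure to the parameter choices, is an illustrative assumption rather than SLIM's production code.

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of the l1 norm: promotes transform-domain sparsity.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def randomized_ista(A, b, mu=0.05, batch=32, n_iters=400, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the full gradient
    for _ in range(n_iters):
        rows = rng.choice(m, size=batch, replace=False)   # one random "shot" subset
        r = A[rows] @ x - b[rows]
        grad = (m / batch) * A[rows].T @ r   # unbiased estimate of the full gradient
        step = (batch / m) / L               # conservative step for the scaled gradient
        x = soft_threshold(x - step * grad, step * mu)
    return x

# Toy usage: recover a sparse vector from random measurements,
# never forming a full pass over all rows in any single iteration.
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 200))
x_true = np.zeros(200)
x_true[rng.choice(200, 10, replace=False)] = 1.0
x_hat = randomized_ista(A, A @ x_true)
```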
Felix J. Herrmann received his Ph.D. degree in Engineering Physics from the Delft University of Technology (the Netherlands) in 1997. Felix was a visiting scholar at Stanford's Mathematics Department in 1998, a post-doctoral fellow at MIT's Earth Resources Laboratory from 1999 to 2002, and a senior fellow at UCLA's Institute for Pure and Applied Mathematics in 2004. Felix is currently an associate professor at the Department of Earth & Ocean Sciences of the University of British Columbia. Felix is director of the UBC-Seismic Laboratory for Imaging and Modeling (SLIM), which he founded in 2003. His research interests include theoretical and applied aspects of exploration seismology, compressive sensing, and large-scale optimization. Felix is the principal investigator of the industry- and NSERC-supported research programs SINBAD and DNOISE. Felix serves on the editorial boards of the Journal of Applied Geophysics and Geophysical Prospecting. Felix also serves on the advisory boards of the UBC-Pacific Institute for the Mathematical Sciences, the UBC-Institute for Applied Mathematics, and on the Academic Advisory Committee of the Harbin Institute of Technology (China). He is a member of the Institute for Computing, Information, and Cognitive Science (ICICS); the European Association of Geoscientists & Engineers (EAGE); the Canadian Society of Exploration Geophysicists (CSEG); the Society of Exploration Geophysicists (SEG); the Society for Industrial and Applied Mathematics (SIAM); and the American Geophysical Union (AGU).
Temporal Analytics on Big Data
Harveys Emerald Bay A
Many "Big Data" machine learning problems are often fundamentally temporal in nature, as are many analytics tasks over such data. For instance, display advertising uses Behavioral Targeting (BT) to select ads for users based on prior searches, page views, etc. Prior work on BT has focused on techniques that scale well for offline data using M-R. However, this approach has limitations for BT-style applications that deal with temporal data: (1) many queries are temporal and not easily expressible in M-R, and moreover, the set-oriented nature of M-R front-ends such as SCOPE is not suitable for temporal processing; (2) as commercial systems mature, they may need to also directly analyze and react to real-time data feeds since a high turnaround time can result in missed opportunities, but it is difficult for current solutions to naturally also operate over real-time streams.
This talk therefore advocates combining time-oriented data processing systems with an M-R framework for solving big machine learning problems. In particular, the talk focuses on one such framework, called TiMR (pronounced "timer"). Users write and submit analysis algorithms as temporal queries: these queries are succinct, scale-out-agnostic, and easy to write. They scale well on large-scale offline data using TiMR, and they work unmodified over real-time streams. TiMR also incorporates new cost-based query fragmentation and temporal partitioning schemes for improving efficiency. The talk also demonstrates the feasibility of this approach for BT, with new temporal algorithms that exploit new targeting opportunities. Experiments using real data from a commercial ad platform show that TiMR is very efficient and incurs orders-of-magnitude lower development effort. The presented BT solution is simple and succinct, and it performs up to several times better than current schemes in terms of memory, learning time, and click-through-rate/coverage.
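As a flavor of the kind of temporal query involved, the sketch below computes a per-user click-through-rate feature over hopping windows in plain Python. It illustrates only the windowing semantics; TiMR itself expresses such queries in a streaming algebra and executes them, unmodified, over M-R or live streams, none of which is reproduced here. The event schema, window sizes, and function name are assumptions made for this sketch.

```python
from collections import defaultdict

def hopping_window_ctr(events, window=3600, hop=900):
    """events: iterable of (ts, user_id, kind) with integer-second timestamps
    and kind in {"impression", "click"}. Yields (window_start, user_id, ctr)."""
    counts = defaultdict(lambda: [0, 0])   # (window_start, user) -> [imps, clicks]
    for ts, user, kind in events:
        # Assign the event to every hop-aligned window [s, s + window) covering it.
        s = max((((ts - window) // hop) + 1) * hop, 0)
        while s <= ts:
            if kind == "impression":
                counts[(s, user)][0] += 1
            elif kind == "click":
                counts[(s, user)][1] += 1
            s += hop
    for (wstart, user), (imps, clicks) in sorted(counts.items()):
        if imps:
            yield wstart, user, clicks / imps
```

Expressing the same logic directly in a set-oriented M-R front-end would require manually materializing the window-to-event mapping, which is the mismatch the talk highlights.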
Jonathan Goldstein is currently the architect for Microsoft StreamInsight, a streaming product based on the algebra and query processing algorithms of the CEDR research project at Microsoft Research, which he led for the four years prior to joining the product team. Prior to working on streaming, Jonathan worked on query optimization, audio fingerprinting, similarity search, and database compression. His work on similarity search has been recognized in the SIGMOD influential papers anthology, and he was recently awarded the SIGMOD Test of Time award for his earlier query optimization work. Jonathan received his B.S. from SUNY Stony Brook in 1993 and his Ph.D. from the University of Wisconsin in 1999.