- Christopher Ré, Stanford University
- Raghu Ramakrishnan, Microsoft Research
- Derek Murray, Microsoft Research
- Michael I. Jordan, UC Berkeley
- Joseph M. Hellerstein, UC Berkeley
- John Langford, Microsoft Research
The Thorn in the Side of Big Data: Too Few Artists
Harvey's Emerald Bay B
A new generation of data processing systems, including web search, Google's Knowledge Graph, IBM's Watson, and several different recommendation systems, combines rich databases with software driven by machine learning. The spectacular successes of these trained systems have been among the most notable in all of computing and have generated excitement in health care, finance, energy, and general business. But building them can be challenging even for computer scientists with PhD-level training. If these systems are to have a truly broad impact, building them must become easier. This talk describes our recent thoughts on one crucial pain point in the construction of trained systems: feature engineering. For street-art lovers, this talk will argue that current systems require artists, like Banksy, but we need them to be usable by the Mr. Brainwashes of the world. As an example, this talk will also describe some recent work on building trained systems to support research in paleobiology and feature selection for enterprise analytics.
Christopher (Chris) Ré is an assistant professor in the Department of Computer Science at Stanford University. The goal of his work is to enable users and developers to build applications that more deeply understand and exploit data. Chris received his PhD from the University of Washington in Seattle under the supervision of Dan Suciu. For his PhD work in probabilistic data management, Chris received the SIGMOD 2010 Jim Gray Dissertation Award. Chris's papers have received four best-paper or best-of-conference citations, including best paper in PODS 2012, best-of-conference in PODS 2010 (twice), and best-of-conference in ICDE 2009. Chris received an NSF CAREER Award in 2011 and an Alfred P. Sloan fellowship in 2013.
Scale-out Beyond Map-Reduce
Harvey's Emerald Bay B
The amount of data being collected is growing at a staggering pace. The default is to capture and store any and all data, in anticipation of potential future strategic value, and vast amounts of data are being generated by instrumenting key customer and systems touchpoints. Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous.

These differences in data scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on allowing a wide range of data to be stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools like SQL and BI and newer tools for graph analytics and machine learning. These new systems use scale-out architectures for both data storage and computation. Hadoop has become a key building block in the new generation of scale-out systems. Early versions of analytic tools over Hadoop, such as Hive and Pig for SQL-like queries, were implemented by translation into Map-Reduce computations. This approach has inherent limitations, and the emergence of resource managers such as YARN and Mesos has opened the door for newer analytic tools to bypass the Map-Reduce layer. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit.

In this talk, I will examine this architectural trend, and argue that resource managers are a first step in re-factoring the early implementations of Map-Reduce, and that more work is needed if we wish to support a variety of analytic tools on a common scale-out computational fabric. I will then present REEF, which runs on top of resource managers like YARN and provides support for task monitoring and restart, data movement and communications, and distributed state management. Finally, I will illustrate the value of using REEF to implement iterative machine learning algorithms. This is joint work with the CISL team at Microsoft.
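To make the "poor fit" concrete, here is a minimal single-machine sketch (not REEF's or Hadoop's API) of an iterative computation, one-dimensional k-means. Each pass over the data corresponds to one map phase (assignment) and one reduce phase (center recomputation); translated naively to Map-Reduce, every pass becomes a separate job that rereads the input from storage, which is exactly the overhead the abstract refers to.

```python
def kmeans_1d(points, centers, iters=10):
    """Toy 1-D k-means; each loop iteration would be one Map-Reduce job."""
    for _ in range(iters):
        # "map": assign each point to its nearest center
        assign = [min(range(len(centers)), key=lambda c: abs(p - centers[c]))
                  for p in points]
        # "reduce": recompute each center as the mean of its assigned points
        for c in range(len(centers)):
            mine = [p for p, a in zip(points, assign) if a == c]
            if mine:
                centers[c] = sum(mine) / len(mine)
    return centers

print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 5.0]))  # ≈ [1.0, 9.0]
```

A framework that keeps per-iteration state resident, as REEF's distributed state management aims to, avoids paying the job-launch and data-reload cost on every pass of this loop.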
Raghu Ramakrishnan heads the Cloud and Information Services Lab (CISL) in the Data Platforms Group at Microsoft. From 1987 to 2006, he was a professor at the University of Wisconsin-Madison, where he wrote the widely-used text “Database Management Systems” and led a wide range of research projects in database systems (e.g., the CORAL deductive database, the DEVise data visualization tool, SQL extensions to handle sequence data) and data mining (scalable clustering, mining over data streams). In 1999, he founded QUIQ, a company that introduced a cloud-based question-answering service. He joined Yahoo! in 2006 as a Yahoo! Fellow, and over the next six years served as Chief Scientist for the Audience (portal), Cloud and Search divisions, driving content recommendation algorithms (CORE), cloud data stores (PNUTS), and semantic search (“Web of Things”). Ramakrishnan has received several awards, including the ACM SIGKDD Innovations Award, the SIGMOD 10-year Test-of-Time Award, the IIT Madras Distinguished Alumnus Award, and the Packard Fellowship in Science and Engineering. He is a Fellow of the ACM and IEEE.
Timely dataflow in Naiad
Harvey's Emerald Bay B
Scaling out machine learning algorithms is more challenging than it needs to be. If you are lucky, your problem fits in the domain of an existing specialized system like GraphLab. But if it doesn’t, you face the choice between sacrificing performance using a general-purpose framework like Hadoop, or wasting time writing a new system from scratch.
In this talk, I will explain how the Naiad project aims to make life easier for distributed machine learning practitioners. Naiad is an open-source distributed computing framework that is optimized for both high-throughput and low-latency parallel computation. It is based on the “timely dataflow” abstraction, which supports both asynchronous and fine-grained synchronous execution in the same program, and allows programmers to write efficient applications with predictable semantics. I will also present some of the high-level DSLs that we have built on top of Naiad, which further simplify application development, and show that applications written on Naiad can achieve the performance of specialized systems.
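As a rough intuition for timely dataflow (this is a toy single-process sketch, not Naiad's actual API), records carry a logical timestamp, or "epoch"; an operator updates per-epoch state asynchronously as records arrive, and releases an epoch's result only when the system notifies it that no more records for that epoch can arrive. The names below (`EpochCount`, `on_recv`, `on_notify`) are illustrative, not Naiad's.

```python
from collections import defaultdict

class EpochCount:
    """Toy operator: counts records per epoch, emitting on completion."""
    def __init__(self):
        self.pending = defaultdict(int)   # epoch -> partial count

    def on_recv(self, epoch, record):
        # asynchronous ingestion: update state immediately, emit nothing
        self.pending[epoch] += 1

    def on_notify(self, epoch):
        # completion notification: the epoch is closed, so its
        # aggregate is final and safe to emit downstream
        return self.pending.pop(epoch)

op = EpochCount()
for e, w in [(0, "big"), (0, "data"), (1, "naiad"), (0, "learning")]:
    op.on_recv(e, w)
print(op.on_notify(0))  # prints 3: epoch 0 saw three records
```

The separation between asynchronous state updates and explicit completion notifications is what lets one program mix fine-grained synchronous coordination with asynchronous execution.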
Derek Murray is a researcher at the Microsoft Research Silicon Valley lab. His principal interests are in large-scale distributed and parallel computing. To that end, he is currently working on the Naiad project, in which the team is building a new distributed system that unifies iteration and incremental computation. Previously, he worked on CIEL (a distributed execution engine that was the subject of his PhD dissertation at the University of Cambridge Computer Lab) and Steno (which optimizes the code generated for declarative queries in DryadLINQ).
On the Computational and Statistical Interface and Big Data
Harvey's Emerald Bay B
The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend the statistical and computational sciences. That classical perspectives from these fields are not adequate to address emerging problems in "Big Data" is apparent from their sharply divergent nature at an elementary level---in computer science, the growth of the number of data points is a source of "complexity" that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of "simplicity" in that inferences are generally stronger and asymptotic results can be invoked. We wish to blend these perspectives. I present two research vignettes that pursue such a blend, the first involving the deployment of resampling methods such as the bootstrap on parallel and distributed computing platforms, the second introducing a methodology of "algorithmic weakening," whereby hierarchies of convex relaxations are used to control statistical risk as data accrue. [Joint work with Venkat Chandrasekaran, Ariel Kleiner, Purna Sarkar, and Ameet Talwalkar.]
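The first vignette builds on the classical bootstrap, which can be sketched in a few lines (a minimal serial illustration, not the parallel/distributed methodology of the talk): resample the data with replacement many times, recompute the statistic on each resample, and use the spread of those estimates to gauge the statistic's variability.

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.mean, reps=2000, seed=0):
    """Bootstrap standard error of `stat` via resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [stat([rng.choice(data) for _ in range(n)])
                 for _ in range(reps)]
    return statistics.stdev(estimates)   # spread of resampled estimates

data = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
print(round(bootstrap_se(data), 3))
```

Each of the `reps` resamples is independent, which is what makes the method a natural candidate for the parallel and distributed deployment the abstract describes; the hard part, addressed in that work, is doing so without each worker needing to touch a full-size resample of a massive dataset.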
Michael I. Jordan is the Pehong Chen Distinguished Professor in the Department of Electrical Engineering and Computer Science and the Department of Statistics at the University of California, Berkeley. He received his Masters in Mathematics from Arizona State University, and earned his PhD in Cognitive Science in 1985 from the University of California, San Diego. He was a professor at MIT from 1988 to 1998. His research interests bridge the computational, statistical, cognitive and biological sciences, and have focused in recent years on Bayesian nonparametric analysis, probabilistic graphical models, spectral methods, kernel machines and applications to problems in distributed computing systems, natural language processing, signal processing and statistical genetics. Prof. Jordan is a member of the National Academy of Sciences, a member of the National Academy of Engineering and a member of the American Academy of Arts and Sciences. He is a Fellow of the American Association for the Advancement of Science. He has been named a Neyman Lecturer and a Medallion Lecturer by the Institute of Mathematical Statistics, and has received the ACM/AAAI Allen Newell Award. He is a Fellow of the AAAI, ACM, ASA, CSS, IMS, IEEE and SIAM.
Big Empathy: Growing Up
Harvey's Emerald Bay B
In recent years, Machine Learning research has expanded from its initial focus on algorithms and statistics to embrace challenges in data processing and distributed systems. This movement is increasing the intellectual and practical scope of the field in positive ways. But in this talk I will agitate for yet more breadth. A mature computing research field eventually embraces the problems of its users and developers as part of its research agenda, and it is arguably time for Machine Learning to step up to these challenges. First, more attention should be paid to the day-to-day work done by analysts working with the community's methods and models. Addressing their human-computer interaction challenges is important for fostering broader adoption, and raises interesting research challenges. Second, more attention should be paid to the software engineering challenge of deploying these algorithms—particularly the challenges of simultaneously providing generality, high performance, and maintainability. Again, the practical problems here drive new research questions. To illustrate my point, I'll refer back to some of the work done in my own field of Data Management, which has a long tradition of including these issues in its research agenda.
Joseph M. Hellerstein is a Chancellor's Professor of Computer Science at the University of California, Berkeley, whose work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Research Fellow and the recipient of three ACM-SIGMOD "Test of Time" awards for his research. In 2010, Fortune Magazine included him in their list of the 50 smartest people in technology, and MIT's Technology Review magazine included his Bloom language for cloud computing on their TR10 list of the 10 technologies "most likely to change our world".
Hellerstein is the co-founder and CEO of Trifacta. He serves on the technical advisory boards of a number of computing and Internet companies including EMC, SurveyMonkey, Platfora, Captricity, and GraphLab, and previously served as the Director of Intel Research, Berkeley.
Tutorial on Vowpal Wabbit
This is a tutorial on Vowpal Wabbit. We'll start with some basics and pointers to previously presented material, then dive into detail on new systems, including: (a) The online bootstrap. This is an efficient way to get a sense of the variance in your predictions. (b) Imperative Searn. This is a much friendlier approach to Searn-style structured prediction that is extremely efficient and effective in comparison to many other structured prediction systems such as CRF++ and SVMstruct. (c) The internal reduction interface. The internal interface for learning-reduction design has been developing for some time, and it is now about ready for general use.
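The online bootstrap trades the classical "store and resample" approach for a single-pass scheme: each arriving example is fed to each of K replicate learners a Poisson(1)-distributed number of times, and the spread of the K predictions estimates variance. The sketch below (illustrative names, not VW's API; Oza-style online bagging, which is in the same spirit as VW's online bootstrap) uses running means as a stand-in for the replicate learners.

```python
import math
import random

def poisson1(rng):
    """Sample Poisson with mean 1 via Knuth's method."""
    L, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def online_bootstrap_means(stream, k=10, seed=0):
    """Maintain k bootstrap replicates of the mean in a single pass."""
    rng = random.Random(seed)
    sums, counts = [0.0] * k, [0] * k
    for x in stream:                 # no stored dataset, no resampling pass
        for i in range(k):
            w = poisson1(rng)        # replicate i sees x this many times
            sums[i] += w * x
            counts[i] += w
    return [s / c for s, c in zip(sums, counts) if c]

reps = online_bootstrap_means([2.0, 2.5, 1.5, 3.0, 2.2, 1.8, 2.6, 2.4])
print(min(reps), max(reps))          # spread reflects estimator variance
```

Because the Poisson weights are drawn per example as it streams by, the memory cost is K counters rather than K copies of the data, which is what makes the technique practical inside an online learner.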
John Langford studied Physics and Computer Science at the California Institute of Technology, earning a double bachelor's degree in 1997, and received his Ph.D. from Carnegie Mellon University in 2002. Since then, he has worked at Yahoo!, Toyota Technological Institute, and IBM's Watson Research Center. He is also the primary author of the popular Machine Learning weblog, hunch.net, and the principal developer of Vowpal Wabbit. Previous research projects include Isomap, Captcha, Learning Reductions, Cover Trees, and Contextual Bandit learning. For more information visit http://hunch.net/~jl.