Keynote Talks
Day 1
- Morning Session
- Nicolas Pinto, Harvard
- Yann LeCun and Clement Farabet, NYU
- Afternoon Session
- Nuria Oliver, Telefonica Research
- Miguel Araujo and Charles Parker, BigML
- Daniel Whiteson, Univ of California, Irvine
Day 2
- Morning Session
- Chris Re, Univ of Wisconsin
- Alex Smola, Yahoo!
- Afternoon Session
- Matei Zaharia, Univ of California, Berkeley
- Jeff Hammerbacher, Cloudera
- Carlos Guestrin, CMU

Nicolas Pinto
The Rowland Institute at Harvard
GPU Metaprogramming: A Case Study in Large-Scale Convolutional Neural Networks
Hardware Accelerated Learning
Montebajo: Theater
December 16, 7:40AM
Large-scale parallelism is a common feature of many neuro-inspired algorithms. In this short paper, we present a practical tutorial on ways that metaprogramming techniques – dynamically generating specialized code at runtime and compiling it just-in-time – can be used to greatly accelerate a large data-parallel algorithm. We use filter-bank convolution, a key component of many neural networks for vision, as a case study to illustrate these techniques. We present an overview of several key themes in template metaprogramming, and culminate in a full example of GPU auto-tuning in which an instrumented GPU kernel template is built and the space of all possible instantiations of this kernel is automatically grid-searched to find the best implementation on various hardware/software platforms. We show that this method can, in concert with traditional hand-tuning techniques, achieve significant speed-ups, particularly when a kernel will be run on a variety of hardware platforms.
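The auto-tuning idea in the abstract can be illustrated with a small CPU-only analogue. This is a minimal sketch, not the speaker's GPU code: it generates Python source for a single sliding-window filter from a template, compiles each variant at runtime, checks it against a reference, and grid-searches the variants by measured time. The function names (generate_source, compile_kernel, time_kernel, autotune) and the two tuning parameters (unroll_taps, use_comprehension) are assumptions made for illustration.

```python
import itertools
import time


def generate_source(n_taps, unroll_taps, use_comprehension):
    """Build the source of one specialized kernel (hypothetical template)."""
    if unroll_taps:
        # Fully unrolled sum of products, specialized for this filter length.
        expr = " + ".join(f"taps[{k}] * signal[i + {k}]" for k in range(n_taps))
    else:
        expr = f"sum(taps[k] * signal[i + k] for k in range({n_taps}))"
    lines = [
        "def convolve(signal, taps):",
        f"    n_out = len(signal) - {n_taps} + 1",
    ]
    if use_comprehension:
        lines.append(f"    return [{expr} for i in range(n_out)]")
    else:
        lines += [
            "    out = [0] * n_out",
            "    for i in range(n_out):",
            f"        out[i] = {expr}",
            "    return out",
        ]
    return "\n".join(lines)


def compile_kernel(source):
    """Compile generated source at runtime and return the kernel function."""
    namespace = {}
    exec(source, namespace)
    return namespace["convolve"]


def time_kernel(kernel, signal, taps, repeats=3):
    """Return the best wall-clock time over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(signal, taps)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(n_taps, signal_length=2000):
    """Grid-search every template instantiation and keep the fastest one."""
    signal = list(range(signal_length))
    taps = list(range(1, n_taps + 1))
    reference = [
        sum(taps[k] * signal[i + k] for k in range(n_taps))
        for i in range(signal_length - n_taps + 1)
    ]
    results = {}
    for params in itertools.product([False, True], repeat=2):
        kernel = compile_kernel(generate_source(n_taps, *params))
        if kernel(signal, taps) != reference:
            raise AssertionError(f"variant {params} gives wrong results")
        results[params] = (time_kernel(kernel, signal, taps), kernel)
    best_params = min(results, key=lambda p: results[p][0])
    return best_params, results[best_params][1]


best_params, best_kernel = autotune(n_taps=5)
```

Which variant wins depends on the interpreter and machine, which is the point of tuning per platform. A filter bank would apply one such kernel per filter; on a GPU the same loop would generate, compile, and time device kernels instead of Python functions.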
Slides - here
Bio
Nicolas Pinto is a Research Scientist working with David Cox at Harvard University’s Rowland Institute and James DiCarlo at MIT’s McGovern Institute for Brain Research. He is also a lecturer in Computer Science teaching massively parallel programming at Harvard’s School of Engineering and Applied Sciences and Harvard’s Division of Continuing Education. His research interests lie at the intersection of brain and computer sciences, with the goal of dramatically accelerating the development of computational theories of how the visual cortex works. Nicolas received two M.S. degrees from France and a PhD from MIT.


Yann LeCun and Clement Farabet
NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision
Hardware Accelerated Learning
Montebajo: Theater
December 16, 9:45AM
We present a scalable hardware architecture to implement general-purpose systems based on convolutional networks. We will first review some of the latest advances in convolutional networks, their applications and the theory behind them, then present our dataflow processor, a highly-optimized architecture for large vector transforms, which represent 99% of the computations in convolutional networks. It was designed with the goal of providing a high-throughput engine for highly-redundant operations, while consuming little power and remaining completely runtime reprogrammable. We present performance comparisons between software versions of our system executing on CPU and GPU machines, and show that our FPGA implementation can outperform these standard computing platforms.
Bio
Yann LeCun is Silver Professor of Computer Science and Neural Science at the Courant Institute of Mathematical Sciences and the Center for Neural Science of New York University. He received the Electrical Engineer Diploma from Ecole Supérieure d'Ingénieurs en Electrotechnique et Electronique (ESIEE), Paris in 1983, and a PhD in Computer Science from Université Pierre et Marie Curie (Paris) in 1987. After a postdoc with Geoffrey Hinton at the University of Toronto, he joined AT&T Bell Laboratories in Holmdel, NJ, in 1988, and became head of the Image Processing Research Department at AT&T Labs-Research in 1996. He joined NYU as a professor in 2003, after a brief period as Fellow at the NEC Research Institute in Princeton. His current interests include machine learning, computer perception and vision, mobile robotics, and computational neuroscience. He has published over 140 technical papers and book chapters on these topics as well as on neural networks, handwriting recognition, image processing and compression, and VLSI design. His handwriting recognition technology is used by several banks around the world to read checks. His image compression technology, called DjVu, is used by hundreds of web sites and publishers and millions of users to access scanned documents on the Web, and his image recognition methods are used in deployed systems by companies such as Google, Microsoft, NEC, France Telecom and several startup companies for document recognition, human-computer interaction, image indexing, and video analytics. He has been on the editorial board of IJCV, IEEE PAMI, IEEE Trans on Neural Networks, was program chair of CVPR'06, and is chair of the annual Learning Workshop. He is on the science advisory board of Institute for Pure and Applied Mathematics, and is the co-founder of MuseAmi, a music technology company.
Clement Farabet received a Master's Degree in Electrical Engineering with honors from Institut National des Sciences Appliquées (INSA) de Lyon, France in 2008. His Master's thesis work was developed at the Courant Institute of Mathematical Sciences of New York University with Professor Yann LeCun. He then joined Professor Yann LeCun's laboratory in 2008, as a research scientist. In 2009, he started collaborating with Yale University's e-Lab, led by Professor Eugenio Culurciello. In 2010, he started the PhD program at Universite Paris-Est, with Professors Michel Couprie and Laurent Najman, in parallel with his research work at Yale and NYU. His research interests include intelligent hardware, embedded super-computers, computer vision, machine learning, embedded robotics, and more broadly artificial intelligence. His current work aims at developing a massively-parallel yet low-power processor for general-purpose vision. Algorithmically, most of this work is based on Prof Yann LeCun's Convolutional Networks, while the hardware has its roots in dataflow computers and architectures as they first appeared in the 1960s.

Nuria Oliver
Telefonica Research, Barcelona
Towards Human Behavior Understanding from Pervasive Data: Opportunities and Challenges Ahead
Applications
Montebajo: Theater
December 16, 4:00PM
We live in an increasingly digitized world where our -- physical and digital -- interactions leave digital footprints. It is through the analysis of these digital footprints that we can learn and model some of the many facets that characterize people, including their tastes, personalities, social network interactions, and mobility and communication patterns. In my talk, I will present a summary of our research efforts on transforming these massive amounts of user behavioral data into meaningful insights, where machine learning and data mining techniques play a central role. The projects that I will describe cover a broad set of areas, including smart cities and urban computing, psychographics, socioeconomic status prediction and disease propagation. For each of the projects, I will highlight the main results and point at technical challenges still to be solved from a data analysis perspective.
Slides - here
Bio
http://www.nuriaoliver.com/bio.htm


Miguel Araujo and Charles Parker
Big Machine Learning Made Easy
Tools and Software
Montebajo: Theater
December 16, 5:25PM
While machine learning has made its way into certain industrial applications, there are many important real-world domains, especially domains with large-scale data, that remain unexplored. There are a number of reasons for this, and they occur at all places in the technology stack.
One concern is ease-of-use, so that practitioners with access to big data who are not necessarily machine learning experts are able to create models. Another is transparency. Users are more likely to want models they can easily visualize and understand. A flexible API layer is required so users can integrate models into their business process with a minimum of hassle. Finally, a robust back-end is required to parallelize machine learning algorithms and scale up or down as needed.
In this talk, we discuss our attempt at building a system that satisfies all of these requirements. We will briefly demonstrate the functionality of the system and discuss major architectural concerns and future work.
Slides - here
Bio
Miguel Araujo holds a B.S. and an M.S. in computer science from Universidad Antonio de Nebrija and San Diego State University. He is a Machine Learning addict and an active open source hacker who enjoys coding in Python. Miguel is a contributor to open source projects such as django-rules, django-crispy-forms, and requests-oauth. He is the Head of Web Development at BigML, where he is tasked with the efforts to make Machine Learning for Big Data easily accessible and understandable to non-machine-learning experts. He has been a speaker at conferences such as DjangoCon.us 2011.
Charles Parker received his Ph.D. in Computer Science in 2007 under Professor Prasad Tadepalli at Oregon State University in the area of statistical machine learning. From 2007 to 2011 he worked at the Eastman Kodak Company, focused on machine learning theory and applications in computer vision, audio recognition, and text processing. He currently works at BigML, Inc., helping to develop a scalable cloud infrastructure for machine learning predictive analytics. His work has appeared in the AAAI Conference on Artificial Intelligence, the International Conference on Machine Learning, the International Conference on Data Mining, and other notable conferences.

Daniel Whiteson
Dept of Physics and Astronomy, UCI
Machine Learning's Role in the Search for Fundamental Particles
Applications
Montebajo: Theater
December 16, 6:45PM
High-energy physicists try to decompose matter into its most fundamental pieces by colliding particles at extreme energies. But to extract clues about the structure of matter from these collisions is not a trivial task, due to the incomplete data we can gather regarding the collisions, the subtlety of the signals we seek and the large rate and dimensionality of the data. These challenges are not unique to high energy physics, and there is the potential for great progress in collaboration between high energy physicists and machine learning experts. I will describe the nature of the physics problem, the challenges we face in analyzing the data, the previous successes and failures of some ML techniques, and the open challenges.
Slides - here
Bio
Daniel Whiteson is an Associate Professor in the Department of Physics & Astronomy at UC Irvine. His research area is experimental particle physics, using data from the world's most powerful colliders to answer questions about the fundamental nature of matter and interactions at the smallest scales. He has a long-standing interest in machine learning and has collaborated with machine learning researchers to apply new ideas to the problems of particle physics. He did his PhD at UC Berkeley and went to college at Rice University.

Chris Re
Hazy: Making Data-driven Statistical Applications Easier to Build and Maintain
Models and Algorithms
Montebajo: Theater
December 17, 7:30AM
The main question driving my group’s research is: how does one deploy statistical data-analysis tools to enhance data-driven systems? Our goal is to find abstractions that one needs to deploy and maintain such systems. In this talk, I describe my group’s attack on this question by building a diverse set of statistical-based data-driven applications: a system whose goal is to read the Web and answer complex questions, a muon detector in collaboration with a neutrino telescope called IceCube, and social-science applications involving rich content (OCR and speech data). Even in this diverse set, my group has found common abstractions that we are exploiting to build and to maintain systems. Of particular relevance to this workshop is that I have heard of applications in each of these domains referred to as “big data.” Nevertheless, in our experience in each of these tasks, after appropriate preprocessing, the relevant data can be stored in a few terabytes -- small enough to fit entirely in RAM or on a handful of disks. As a result, it is unclear to me that scale is the most pressing concern for academics. I argue that dealing with data at TB scale is still challenging, useful, and fun, and I will describe some of our work in this direction. This is joint work with Benjamin Recht, Stephen J. Wright, and the Hazy Team.
Slides - here
Bio
Christopher (Chris) Ré is currently an assistant professor in the department of Computer Sciences at the University of Wisconsin-Madison. The goal of his work is to enable users and developers to build applications that more deeply understand data. In many applications, machines can only understand the meaning of data statistically, e.g., user-generated text or data from sensors. To attack this challenge, Chris's recent work is to build a system, Hazy, that integrates a handful of statistical operators with a standard relational database management system. To support this work, Chris received the NSF CAREER Award in 2011.
Chris received his PhD from the University of Washington, Seattle under the supervision of Dan Suciu. For his PhD work in the area of probabilistic data management, Chris received the SIGMOD 2010 Jim Gray Dissertation Award. His PhD work produced two systems: Mystiq, a system to manage relational probabilistic data, and Lahar, a streaming probabilistic database. The contributions of these systems are techniques to efficiently evaluate queries on probabilistic data, such as multisimulation and extensional plans for aggregates, and to efficiently represent probabilistic data using materialized views and approximate lineage.
Chris's papers have received three best-of-conference citations (two in PODS 2010 and one in ICDE 2009). Chris was recently granted his first patent.
http://pages.cs.wisc.edu/~chrisre/bio.txt

Alex Smola
Real Time Data Sketches
Models and Algorithms
Montebajo: Theater
December 17, 9:25AM
I will describe a set of algorithms for extending streaming and sketching algorithms to real time analytics. These algorithms capture frequency information for streams of arbitrary sequences of symbols. The approach uses the Count-Min sketch as its basis and exploits the fact that the sketching operation is linear. It provides real time statistics of arbitrary events, e.g., streams of queries as a function of time. In particular, we use a factorizing approximation to provide point estimates at arbitrary (time, item) combinations. The service runs in real time and scales perfectly in terms of throughput and accuracy, using distributed hashing. The latter also provides performance guarantees in the case of machine failure. Queries can be answered in constant time regardless of the amount of data to be processed. The same distribution techniques can also be used for heavy hitter detection in a distributed scalable fashion.
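For readers unfamiliar with the Count-Min sketch, the following is a minimal sketch of the basic data structure and of the linearity property the abstract relies on: sketches built over separate streams (for example, separate time intervals) can be added cell by cell to obtain the sketch of the combined stream. It does not implement the factorizing approximation or the distributed service described in the talk. The class name, the default width and depth, and the use of salted BLAKE2 hashes are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        # One independent-looking hash per row, obtained by salting BLAKE2b.
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=salt
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(
            self.table[row][self._column(row, item)]
            for row in range(self.depth)
        )

    def merge(self, other):
        """Linearity: the sketch of two streams is the cell-wise sum."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        merged = CountMinSketch(self.width, self.depth)
        merged.table = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.table, other.table)
        ]
        return merged


# One sketch per time interval, merged to answer a query over both.
morning = CountMinSketch()
afternoon = CountMinSketch()
for query in ["spark", "hadoop", "spark"]:
    morning.add(query)
for query in ["spark", "graphlab"]:
    afternoon.add(query)
whole_day = morning.merge(afternoon)
```

Each estimate reads one counter per row, so the query cost depends only on the sketch depth and not on how many items have been inserted.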
Slides - here
Bio
Alex is a Principal Researcher at Yahoo. He was previously at the Australian National University and NICTA, from 1999-2008, where he most recently led the machine learning program (comprising 8 academics and 8 PhD students). He received his PhD at the Technische Universitaet Berlin (summa cum laude) in 1998, and worked from 1995-1996 at AT&T Bell Laboratories in Holmdel (USA) with Vladimir Vapnik, on "Regression with SV Machines". Alex's current research focus is on nonparametric methods for estimation, in particular kernel methods and exponential families. This includes support vector machines, Gaussian processes, and conditional random fields. Kernels are also very useful for the representation of distributions, that is, two-sample tests, independence tests and many applications to unsupervised learning. In practice this work requires similarity measures on discrete objects (graphs, strings, automata), large scale optimization and numerical analysis (interior point methods, matrix factorization), and learning theory (uniform convergence bounds, design of priors, etc.). Alex is presently working on problems in bioinformatics, pattern recognition, document analysis, computer vision and optimization for parallel processing.

Matei Zaharia
AMP Lab, University of California, Berkeley
Spark: In-Memory Cluster Computing for Iterative and Interactive Applications
Tools and Software
Montebajo: Theater
December 17, 4:20PM
MapReduce and its variants have been highly successful in supporting large-scale data-intensive cluster applications. However, these systems are inefficient for applications that share data among multiple computation stages, including many machine learning algorithms, because they are based on an acyclic data flow model. We present Spark, a new cluster computing framework that extends the data flow model with a set of in-memory storage abstractions to efficiently support these applications. Spark outperforms Hadoop by up to 30x in iterative machine learning algorithms while retaining MapReduce's scalability and fault tolerance. In addition, Spark makes programming jobs easy by integrating into the Scala programming language. Finally, Spark's ability to load a dataset into memory and query it repeatedly makes it especially suitable for interactive analysis of big data. We have modified the Scala interpreter to make it possible to use Spark interactively as a highly responsive data analytics tool.
At Berkeley, we have used Spark to implement several large-scale machine learning applications, including a Twitter spam classifier and a real-time automobile traffic estimation system based on expectation maximization. We will present lessons learned from these applications and optimizations we added to Spark as a result.
Spark is open source and can be downloaded at http://www.spark-project.org.
Slides - here
Bio
Matei Zaharia is a fifth-year graduate student at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in cloud computing, operating systems and networking. He is also a committer on Apache Hadoop. He is funded by a Google PhD fellowship. Before joining Berkeley, Matei got his undergraduate degree at the University of Waterloo in Canada.

Jeff Hammerbacher
Machine Learning and Apache Hadoop
Tools and Software
Montebajo: Theater
December 17, 5:30PM
Talk was given by Josh Wills
We'll review common use cases for machine learning and advanced analytics found in our customer base at Cloudera and ways in which Apache Hadoop supports these use cases. We'll then discuss upcoming developments for Apache Hadoop that will enable new classes of applications to be supported by the system.
Slides - here
Bio
Jeff Hammerbacher is a founder and the Chief Scientist of Cloudera. Jeff was an Entrepreneur in Residence at Accel Partners immediately prior to founding Cloudera. Before Accel, he conceived, built, and led the Data team at Facebook. Before joining Facebook, Jeff was a quantitative analyst on Wall Street. Jeff serves as a Director of Sage Bionetworks and as a Mentor for Rock Health. He teaches "Introduction to Data Science" at the University of California, Berkeley and served as a Contributing Editor for O'Reilly's "Beautiful Data". Jeff earned his Bachelor's Degree in Mathematics from Harvard University.

Carlos Guestrin
GraphLab 2: The Challenges of Large Scale Computation on Natural Graphs
Tools and Systems
Montebajo: Theater
December 17, 7:00PM
Two years ago we introduced GraphLab to address the critical need for a high-level abstraction for large-scale graph structured computation in machine learning. Since then, we have implemented the abstraction on multicore and cloud systems, evaluated its performance on a wide range of applications, developed new ML algorithms, and fostered a growing community of users. Along the way, we have identified new challenges to the abstraction, our implementation, and the important task of fostering a community around a research project. However, one of the most interesting and important challenges we have encountered is large-scale distributed computation on natural power law graphs. To address the unique challenges posed by natural graphs, we introduce GraphLab 2, a fundamental redesign of the GraphLab abstraction which provides a much richer computational framework. In this talk, we will describe the GraphLab 2 abstraction in the context of recent progress in graph computation frameworks (e.g., Pregel/Giraph). We will review some of the special challenges associated with distributed computation on large natural graphs and demonstrate how GraphLab 2 addresses these challenges. Finally, we will conclude with some preliminary results from GraphLab 2 as well as a live demo. This talk represents joint work with Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Alex Smola, and Joseph Hellerstein.
Slides - here
Bio
Carlos Guestrin is an Assistant Professor at Carnegie Mellon's Computer Science and Machine Learning Departments. Previously, he was a senior researcher at the Intel Research Lab in Berkeley. Carlos received his MSc and PhD in Computer Science from Stanford University in 2000 and 2003, respectively, and a Mechatronics Engineer degree from the Polytechnic School of the University of Sao Paulo, Brazil, in 1998. Carlos has conducted research in (1) learning and control in large-scale structured environments; (2) distributed multiagent coordination; (3) robust, efficient and resource-aware algorithms for sensor networks; (4) sensor placement and tasking; (5) query specific probabilistic modeling. For this work he has received best paper awards at the prestigious NIPS 2003, VLDB 2004, IPSN 2005, IPSN 2006 and KDD 2007 conferences, the IJCAI-JAIR Best Paper Prize 2007, and runner-up best paper awards at the UAI 2005, ICML 2005 and NIPS 2007 conferences. Carlos is also a recipient of the Alfred P. Sloan Fellowship, the IBM Faculty Fellowship, the NSF Career Award, the ONR Young Investigator Award, the Siebel Scholarship and the Stanford Centennial Teaching Assistant Award. He is also a member of the DARPA Information Science and Technology (ISAT) Study Group.