## Papers

The accepted papers are organized as follows:

### Contributed Talks

**Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, David Andersen and Alexander Smola**.*Parameter Server for Distributed Machine Learning*

We propose a parameter server framework to solve distributed machine learning problems. Both data and workload are distributed into client nodes, while server nodes maintain globally shared parameters, which are represented as sparse vectors and matrices. The framework manages asynchronous data communications between clients and servers. Flexible consistency models, elastic scalability and fault tolerance are supported by this framework. We present algorithms and theoretical analysis for challenging nonconvex and nonsmooth problems. To demonstrate the scalability of the proposed framework, we show experimental results on real data with billions of parameters.

PDF

**Yarin Gal and Zoubin Ghahramani**.*Pitfalls in the use of Parallel Inference for the Dirichlet Process*

Recent work by Lovell, Adams, and Mansingka [2012] and Williamson, Dubey, and Xing [2013] has suggested an alternative parametrisation for the Dirichlet process in order to derive non-approximate parallel MCMC inference for it. This approach to parallelisation has been picked up and implemented in several different fields [Chahuneau et al., 2013, Pan et al., 2013]. In this paper we show that the suggested approach is impractical due to an extremely unbalanced distribution of the data. We characterise the requirements of efficient parallel inference for the Dirichlet process and show that the proposed inference fails most of these conditions (while approximate approaches often satisfy most of them). We present both theoretical and experimental evidence of this, analysing the load balance for the inference and showing that it is independent of the size of the dataset and the number of nodes available in the parallel implementation. We end with preliminary suggestions of alternative paths of research for efficient non-approximate parallel inference for the Dirichlet process.

PDF

**Yingyu Liang, Maria-Florina Balcan and Vandana Kanchanapally**.*Distributed PCA and k-Means Clustering*

This paper proposes a distributed PCA algorithm, with the theoretical guarantee that any good approximate solution on the projected data for k-means clustering is also a good approximation on the original data, while the projected dimension required is independent of the original dimension. When combined with the distributed coreset-based clustering approach in [3], this leads to an algorithm in which the number of vectors communicated is independent of the size and the dimension of the original data. Our experimental results demonstrate the effectiveness of the algorithm.

PDF

### Posters

**Julien-Charles Lévesque, Christian Gagné and Robert Sabourin**.*Ensembles of Budgeted Kernel Support Vector Machines for Parallel Large Scale Learning*

In this work, we propose to combine multiple budgeted kernel support vector machines (SVMs) trained with stochastic gradient descent (SGD) in order to exploit large databases and parallel computing resources. The variance induced by budget restrictions of the kernel SVMs is reduced through the averaging of predictions, resulting in better generalization performance. The variance across training runs results in a diversity of predictions, which can help explain the better performance. Finally, the proposed method is intrinsically parallel, which means that parallel computing resources can be exploited in a straightforward manner.

PDF

**Zhen Qin, Vaclav Petricek, Nikos Karampatziakis, Lihong Li and John Langford**.*Efficient Online Bootstrapping for Large Scale Learning*

Bootstrapping is a useful technique for estimating the uncertainty of a predictor, for example, confidence intervals for prediction. It is typically used on small to moderate sized datasets, due to its high computation cost. This work describes a highly scalable online bootstrapping strategy, implemented inside Vowpal Wabbit, that is several times faster than traditional strategies. Our experiments indicate that, in addition to providing a black box-like method for estimating uncertainty, our implementation of online bootstrapping may also help to train models with better prediction performance due to model averaging.

PDF
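
The abstract above does not say how the bootstrap is made online. One widely used single-pass scheme, assumed here purely for illustration, gives every incoming example an independent Poisson(1) weight in each bootstrap replicate instead of resampling the dataset. The sketch below applies that idea to a running mean as a stand-in for a real learner; `poisson1` and `OnlineBootstrapMean` are hypothetical names, and none of this is Vowpal Wabbit code.

```python
import math
import random


def poisson1(rng):
    """Draw one Poisson(1) sample with Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


class OnlineBootstrapMean:
    """Keeps n_replicates weighted running means, updated in one pass."""

    def __init__(self, n_replicates, seed=0):
        self.rng = random.Random(seed)
        self.weight = [0.0] * n_replicates
        self.total = [0.0] * n_replicates

    def update(self, x):
        # Each replicate sees the example w ~ Poisson(1) times (assumed scheme).
        for i in range(len(self.weight)):
            w = poisson1(self.rng)
            self.weight[i] += w
            self.total[i] += w * x

    def estimates(self):
        # One mean per replicate; their spread approximates the uncertainty.
        return [t / w for t, w in zip(self.total, self.weight) if w > 0]
```

The spread of `estimates()` across replicates can then be read as a rough uncertainty estimate, and their average as a simple form of model averaging.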

**Arun Kumar, Nikos Karampatziakis, Paul Mineiro, Markus Weimer and Vijay Narayanan**.*Distributed and Scalable PCA in the Cloud*

Principal Component Analysis (PCA) is a popular technique with many applications. Recent randomized PCA algorithms scale to large datasets but face a bottleneck when the number of features is also large. We propose to mitigate this issue using a composition of structured and unstructured randomness within a randomized PCA algorithm. Initial experiments using a large graph dataset from Twitter show promising results. We demonstrate the scalability of our algorithm by implementing it on both Hadoop and a more flexible platform named REEF.

PDF

**Nedim Lipka**.*Towards Distributed Reinforcement Learning for Digital Marketing with Spark*

A variety of problems in digital marketing can be modeled as Markov decision processes and solved by dynamic programming with the goal of calculating the policy that maximizes the expected discounted reward. Algorithms such as policy iteration require a state transition model and a reward model, which can be estimated from a given data set. In this paper, we compare the execution times for estimating the transition function in a map-reduce fashion as the data set becomes large in terms of the number of records and features. To this end, we create different-sized Spark and Hadoop clusters in the Amazon cloud computing environment. The in-memory clustering system Spark outperforms Hadoop and runs up to 71% faster. Furthermore, we study the execution times of policy iteration running on Spark clusters and show the execution time reduction gained by increasing the number of instances in the cluster.

PDF
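
Estimating a transition function in a map-reduce fashion, as the abstract above describes, can be pictured as counting transitions per data partition and then merging the counts. The following is a minimal single-process sketch under stated assumptions: records are hypothetical `(state, action, next_state)` tuples, the estimate is a plain relative frequency, and the function names are invented for illustration. It uses neither Spark nor Hadoop.

```python
from collections import Counter, defaultdict
from functools import reduce


def map_partition(records):
    """Map step: count (state, action, next_state) triples in one partition."""
    return Counter((s, a, s2) for s, a, s2 in records)


def reduce_counts(left, right):
    """Reduce step: merge two partial count tables."""
    return left + right


def estimate_transitions(partitions):
    """Return {(state, action): {next_state: estimated probability}}."""
    counts = reduce(reduce_counts, map(map_partition, partitions), Counter())
    totals = Counter()
    for (s, a, _), n in counts.items():
        totals[(s, a)] += n
    model = defaultdict(dict)
    for (s, a, s2), n in counts.items():
        model[(s, a)][s2] = n / totals[(s, a)]
    return dict(model)
```

Because the reduce step is an associative, commutative sum of counters, the partitions could in principle be processed on separate workers in any order.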

**Tuukka Ruotsalo, Jaakko Peltonen, Manuel J. A. Eugster, Dorota Glowacka, Giulio Jacucci, Aki Reijonen and Samuel Kaski**.*Lost in Publications? How to Find Your Way in 50 Million Scientific Documents*

Researchers must navigate big data. Current scientific knowledge includes 50 million published articles. How can a system help a researcher find relevant documents in her field? We introduce IntentRadar, an interactive search user interface and search engine that anticipates the user's search intents by estimating them from the user's interaction with the interface. The estimated intents are visualized on a radial layout that organizes potential intents as directions in the information space. The intent radar helps users direct their search by allowing feedback to be targeted on keywords that represent the potential intents. Users can provide feedback by manipulating the position of the keywords on the radar. The system then learns and visualizes improved estimates and corresponding documents. IntentRadar has been shown to significantly improve users' task performance and the quality of retrieved information without compromising task execution time.

PDF

**Michael Kane and Bryan Lewis**.*cnidaria: A Generative Communication Approach to Scalable, Distributed Learning*

This paper presents a scalable software framework that facilitates large-scale learning and numerical computing. Unlike existing MapReduce frameworks, our design is not limited to embarrassingly parallel computing challenges. The framework sits on top of existing storage infrastructures, and results of a computation may be left out on the cluster (a reduce step is not required). Unlike existing distributed numerical frameworks, the proposed framework is elastic and works with both dense and sparse data representations. This generality is achieved through a generative communication scheme whose expressions are either consumed by the distributed computing environment or used to move data, in a peer-to-peer (P2P) fashion, between nodes in a cluster/cloud. This approach integrates advances in both the cloud computing and the distributed numerical computing communities and can be applied to a general class of learning challenges.

PDF

**Anshumali Shrivastava and Ping Li**.*Beyond Pairwise: Provably Fast Algorithms for Approximate k-Way Similarity Search*

We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity. We show that approximate $R^{3way}$ similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend the traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of $R^{3way}$ search is shown on the Google Sets application as well as in an application for improving retrieval quality.

PDF
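
To make the k-way similarity in the abstract above concrete: the k-way resemblance of sets is the size of their common intersection divided by the size of their union, and under a random permutation of the universe the probability that all k sets share the same minimum element equals that ratio. The sketch below computes the exact value and a minwise-hashing estimate of it. It illustrates only this estimation building block, not the search algorithm of the paper; the function names are hypothetical, and the sets are assumed non-empty with sortable elements.

```python
import random


def resemblance(*sets):
    """Exact k-way resemblance: |intersection| / |union|."""
    union = set().union(*sets)
    if not union:
        return 0.0
    common = set.intersection(*(set(s) for s in sets))
    return len(common) / len(union)


def minhash_estimate(sets, n_hashes=200, seed=0):
    """Estimate k-way resemblance as the fraction of random permutations
    under which every set has the same minimum element."""
    rng = random.Random(seed)
    universe = sorted(set().union(*sets))
    matches = 0
    for _ in range(n_hashes):
        # Random ranks stand in for a random permutation of the universe.
        rank = {x: rng.random() for x in universe}
        mins = [min(s, key=rank.__getitem__) for s in sets]
        if all(m == mins[0] for m in mins):
            matches += 1
    return matches / n_hashes
```

For `{1, 2, 3, 4}`, `{2, 3, 4, 5}` and `{3, 4, 5, 6}` the exact 3-way resemblance is 2/6, and the estimate approaches it as `n_hashes` grows.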

**Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho and Eric Xing**.*Petuum: A System for Iterative-Convergent Distributed ML*

A major bottleneck to applying advanced ML programs at industrial scales is the migration of an academic implementation, often specialized for a small, well-controlled computer platform such as desktop PCs and small lab-clusters, to a big, less predictable platform such as a corporate cluster or the cloud. This poses enormous challenges: how does one train huge models with billions of parameters on massive data, especially when substantial expertise is required to handle many low-level systems issues? We propose a new architecture of systems components that systematically addresses these challenges, thus providing a general-purpose distributed platform for Big Machine Learning. Our architecture specifically exploits the fact that many ML programs are fundamentally loss function minimization problems, and that their iterative-convergent nature presents many unique opportunities to minimize loss, such as via dynamic variable scheduling and error-bounded consistency models for synchronization. Thus, we treat data, parameter and variable blocks as computing units to be dynamically scheduled and updated in an error-bounded manner, with the goal of minimizing the loss function as quickly as possible.

PDF

**Haiqin Yang, Junjie Hu, Michael Lyu and Irwin King**.*Online Imbalanced Learning with Kernels*

Imbalanced learning, or learning from imbalanced data, is a challenging problem in both academia and industry. Streaming imbalanced data have become common and raise volume, velocity, and variety issues for learning. To tackle these issues, online learning algorithms have been proposed that learn a linear classifier by maximizing the AUC score. However, these linear classifiers ignore the learning power of kernels. In this paper, we therefore propose online imbalanced learning with kernels (OILK) to exploit the non-linearity and heterogeneity embedded in the imbalanced data. Unlike previously proposed work, we optimize the AUC score to learn a non-linear representation via the kernel trick. To reduce the computational and storage cost, we also investigate different buffer update policies, including first-in-first-out (FIFO) and reservoir sampling (RS), to maintain a buffer with a fixed budget on the number of support vectors. We demonstrate the properties of our proposed OILK through detailed experiments.

PDF
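
The two buffer update policies named in the abstract above can be sketched generically. FIFO keeps the most recent items, while reservoir sampling (the standard "Algorithm R") keeps a uniform random sample of everything seen so far. The classes below are hypothetical illustrations of these policies on arbitrary items and are not taken from the OILK implementation.

```python
import random
from collections import deque


class FifoBuffer:
    """Keeps only the most recent `capacity` items."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def add(self, item):
        # deque with maxlen silently drops the oldest item when full.
        self.items.append(item)


class ReservoirBuffer:
    """Keeps a uniform random sample of `capacity` items from the stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace a random slot with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item
```

Both buffers cap memory at `capacity` items; they differ in whether old examples can survive, which is the trade-off a budgeted kernel learner has to weigh.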

**Alex Beutel, Abhimanu Kumar, Evangelos Papalexakis, Partha Pratim Talukdar, Christos Faloutsos and Eric Xing**.*FLEXIFACT: Scalable Flexible Factorization of Coupled Tensors on Hadoop*

Given multiple data sets of relational data that share a number of dimensions, how can we efficiently decompose our data into the latent factors? Factorization of a single matrix or tensor has attracted much attention, for example in the Netflix challenge, with users rating movies. However, we often have additional side information, such as demographic data about the users in the Netflix example above. Incorporating the additional information leads to the coupled factorization problem. So far, it has been solved for relatively small datasets. We provide a distributed, scalable method for decomposing matrices, tensors, and coupled data sets through stochastic gradient descent on a variety of objective functions. We offer the following contributions: (1) Versatility: Our algorithm can perform matrix, tensor, and coupled factorization, with flexible objective functions including the Frobenius norm, Frobenius norm with l1-induced sparsity, and non-negative factorization. (2) Scalability: FLEXIFACT scales to unprecedented sizes in both the data and model, with up to billions of parameters. FLEXIFACT runs on standard Hadoop. (3) Convergence: Proofs show that FLEXIFACT converges on this variety of objective functions, even with projections.

PDF

**Faraz Makari Manshadi and Rainer Gemulla**.*A Distributed Approximation Algorithm for Mixed Packing-Covering Linear Programs*

Mixed packing-covering linear programs capture a simple but expressive subclass of linear programs. They commonly arise as linear programming relaxations of a number of important combinatorial problems, including various network design and generalized matching problems. In this paper, we propose an efficient distributed approximation algorithm for solving mixed packing-covering problems which requires a poly-logarithmic number of passes over the input. Our algorithm is well-suited for parallel processing on GPUs, in shared-memory architectures, or on small clusters of commodity nodes. We report results of a case study for generalized bipartite matching problems.

PDF

**Artem Sokolov and Stefan Riezler**.*Task-driven Greedy Learning of Feature Hashing Functions*

Randomly hashing multiple features into one aggregated feature is routinely used in large-scale machine learning tasks to both increase speed and decrease memory requirements, with little or no sacrifice in performance. In this paper we investigate whether using a learned (instead of a random) hashing function improves performance. We show experimentally that with increasing difference between the dimensionalities of the input space and the hashed space, learning hashes is increasingly useful compared to random hashing.

PDF
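
The random-hashing baseline that the abstract above refers to can be sketched as follows: each named feature is mapped to one of a fixed number of buckets by a hash function, and features that collide are summed into one aggregated feature. This is a minimal sketch, assuming MD5 as the fixed pseudo-random hash and a dictionary of named feature values as input; `hash_features` is a hypothetical name, and the learned hashing functions studied in the paper are not shown.

```python
import hashlib


def hash_features(features, n_buckets):
    """Project {feature name: value} into a dense vector of n_buckets sums."""
    vec = [0.0] * n_buckets
    for name, value in features.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as an integer hash (assumption).
        index = int.from_bytes(digest[:8], "little") % n_buckets
        # Colliding features are aggregated by summation.
        vec[index] += value
    return vec
```

The smaller `n_buckets` is relative to the number of distinct features, the more collisions occur, which is the regime in which the abstract reports that learned hashes help most.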

**Ahmed Elgohary, Ahmed Farahat, Mohamed Kamel and Fakhri Karray**.*Approximate Nearest Centroid Embedding for Kernel $k$-Means*

This paper proposes an efficient embedding method for scaling kernel k-means on cloud infrastructures. The embedding method allows for approximating the computation of the nearest centroid to each data instance and, accordingly, it eliminates the quadratic space and time complexities of the cluster assignment step in the kernel k-means algorithm. We show that the proposed embedding method is effective under memory and computing power constraints, and that it achieves better clustering performance compared to other approximations of the kernel k-means algorithm.

PDF

**Yisheng Liao, Alex Rubinsteyn, Russell Power and Jinyang Li**.*Learning Random Forests on the GPU*

Random Forests are a popular and powerful machine learning technique, with several fast multi-core CPU implementations. Since many other machine learning methods have seen impressive speedups from GPU implementations, applying GPU acceleration to random forests seems like a natural fit. Previous attempts to use GPUs have relied on coarse-grained task parallelism and have yielded inconclusive or unsatisfying results. We introduce CudaTree, a GPU Random Forest implementation which adaptively switches between data and task parallelism. We show that, for larger datasets, this algorithm is faster than highly tuned multi-core CPU implementations.

PDF

**Shravan Narayanamurthy, Markus Weimer, Dhruv Mahajan, Tyson Condie, Sundararajan Sellamanickam and S. Sathiya Keerthi**.*Towards Resource-Elastic Machine Learning*

PDF

**Ignacio Arnaldo, Kalyan Veeramachaneni and Una-May O'Reilly**.*Building Multiclass Nonlinear Classifiers with GPUs*

The adoption of multiclass classification strategies that train independent binary classifiers becomes challenging when the goal is to retrieve nonlinear models from large datasets and the process requires several passes through the data. In such a scenario, the combined use of a search and score algorithm and GPUs makes it possible to obtain binary classifiers in reduced time. We demonstrate our approach by training a ten-class classifier over more than 400K exemplars following the exhaustive Error Correcting Output Code strategy, which decomposes the task into 511 binary problems.

PDF
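
The figure of 511 binary problems in the abstract above follows from the exhaustive Error Correcting Output Code construction: every split of the classes into two non-empty groups is one binary problem, and a split and its complement count once, giving 2^(k-1) - 1 problems for k classes. The sketch below enumerates these splits; the function name is hypothetical and the classifier training itself is not shown.

```python
def exhaustive_ecoc_columns(n_classes):
    """Enumerate all distinct binary splits of n_classes classes.

    Each column assigns label 0 or 1 to every class. Class 0 is fixed to
    label 0 so that a split and its complement are generated only once,
    and the all-zero column is skipped because it separates nothing.
    """
    columns = []
    for bits in range(1, 2 ** (n_classes - 1)):
        column = [0] + [(bits >> (c - 1)) & 1 for c in range(1, n_classes)]
        columns.append(column)
    return columns


print(len(exhaustive_ecoc_columns(10)))  # 511
```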

**John Canny and Huasha Zhao**.*BIDMach: Large-scale Learning with Zero Memory Allocation*

This paper describes recent work on the BIDMach toolkit for large-scale machine learning. BIDMach has demonstrated single-node performance that exceeds that of published cluster systems for many common machine-learning tasks. BIDMach makes full use of both CPU and GPU acceleration (through a sister library BIDMat), and requires only modest hardware (commodity GPUs). One of the challenges of reaching this level of performance is the allocation barrier. While it is simple and expedient to allocate and recycle matrix (or graph) objects in expressions, this approach is too slow to match the arithmetic throughput possible on either GPUs or CPUs. In this paper we describe a caching approach that allows code with complex matrix (graph) expressions to run at massive scale, i.e. multi-terabyte data, with zero memory allocation after initial start-up. We present a number of new benchmarks that leverage this approach.

PDF

**Shohei Hido, Satoshi Oda and Seiya Tokui**.*Jubatus: An Open Source Platform for Distributed Online Machine Learning*

Distributed computing is essential for handling very large datasets. Online learning is also promising for learning from rapid data streams. However, it is still an unresolved problem how to combine them for scalable learning and prediction on big data streams. We propose a general computational framework called loose model sharing for online and distributed machine learning. The key is to share only models rather than data between distributed servers. We also introduce Jubatus, an open source software platform based on the framework. Finally, we describe the details of implementing classifier and nearest neighbor algorithms, and discuss our experimental evaluations.

PDF