`sparsecomputation`

is a python package to construct a sparse matrix of pairwise similarities (similarity matrix). Sparse similarity matrices generated with sparse computation provide a substantial speedup without loss in accuracy for supervised machine learning algorithms that rely on pairwise similarity matrices / kernel matrices, such as k-nearest neighbors, kernel-based support vector machines, and supervised normalized cut.

Similarity matrices have `O(n^2)`

entries for a dataset with `n`

objects (observations). Computing all pairwise similarities is computationally intractable for large datasets. Sparse computation overcomes this computational burden by identifying all pairs of objects that are close to each other in space. These pairs are identified by projecting the objects onto a low-dimensional space and using closeness in the low-dimensional space as a proxy for closeness in the original space. Sparse computation uses grids to determine efficiently close objects in the low-dimensional space.

Sparse computation takes as input a two-dimensional numpy array of size `n x d`

. Each row represents one object with `d`

features. Spase computation returns a list of 2-tuples, where each tuple are the indices of the objects between which a pairwise similarity should be computed.

The latest version of this software is available on GitHub. The source code is also available in ZIP format here.

You can use `pip`

to install the package:

To find the relevant pairwise similarities with Sparse Computation, use the following code:

# load sample dataset import sklearn.datasets data, _ = sklearn.datasets.load_iris(return_X_y=True) # load Sparse Computation from sparsecomputation import ApproximatePCA, SparseComputation # Project data to a low-dimensional space of dimension 3 # with a fast, scalable version of PCA. apca = ApproximatePCA(dimLow=3) # Select pair if objects are closer than ~distance~ (1 / resolution). # Controls sparsity (small distance - highly sparse matrix) sc = SparseComputation(apca, distance=0.05) # Input: n x d matrix - data # Output: list of relevant pairs (undirected) pairs = sc.select_pairs(data) # Out: [(101, 142), (1, 9), (1, 34), (1, 37), ...]

The default implementation for sparse computation is based on block shifting. You can select an alternative implementation by setting the `method`

parameter of the `SparseComputation`

object.

For more details on the techniques read the following papers. Please cite these works if you use this implementation in academic work:

- Dorit S. Hochbaum, Philipp Baumann (2016). Sparse computation for large-scale data mining.
*IEEE Transactions on Big Data*, 2(2), 151-174. - Philipp Baumann, Dorit S. Hochbaum and Quico Spaen (2017). High-Performance Geometric Algorithms for Sparse Computation in Big Data Analytics.
*2017 IEEE International Conference on Big Data*, Boston MA. - Philipp Baumann, Dorit S. Hochbaum, and Quico Spaen (2016). Sparse-Reduced Computation.
*Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods*.

Original Matlab implementation by Philipp Baumann. Original Python implementation by Titouan Jehl and Quico Spaen. Updated implementation with the new block shifting algorithm by Quico Spaen and Philipp Baumann.

The source code is subject to the following academic license. Note that this is not an open source license.

Copyright 2017. The Regents of the University of California (Regents). All Rights Reserved.

Permission to use, copy, modify, and distribute this software and its documentation for educational, research, and not-for-profit purposes, without fee and without a signed licensing agreement, is hereby granted, provided that the above copyright notice, this paragraph and the following two paragraphs appear in all copies, modifications, and distributions. Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 510, Berkeley, CA 94720-1620, (510) 643-7201, for commercial licensing opportunities. Created by Quico Spaen, Titouan Jehl, Philipp Baumann, and Dorit S. Hochbaum, Department of Industrial Engineering and Operations Research, University of California, Berkeley.

IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED “AS IS”. REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.