You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

    import numpy as np
    from scipy import sparse
    from sklearn.metrics.pairwise import cosine_similarity

    A = np.array([[...], [...], [...]])  # three rows; the entries are elided
    A_sparse = sparse.csr_matrix(A)

    similarities = cosine_similarity(A_sparse)
    print('pairwise dense output:\n {}\n'.format(similarities))

    # can also output sparse matrices
    similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
    print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

If you want column-wise cosine similarities, simply transpose your input matrix beforehand: A_sparse.transpose()

The following method is about 30 times faster than the baseline it was timed against. It works pretty quickly on large matrices (assuming you have enough RAM). See below for a discussion of how to optimize for sparsity.

The method proceeds in these steps (the comments from the code):

    # base similarity matrix (all dot products)
    # replace this with A.dot(A.T).toarray() for sparse representation
    # squared magnitude of preference vectors (number of occurrences)
    # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
    # cosine similarity (elementwise multiply by inverse magnitudes)

If your problem is typical for large-scale binary preference problems, you have a lot more entries in one dimension than the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the 'item' dimension.

If this is the case, list your 'items' in rows and create A using scipy.sparse. Then replace the first line as indicated.

If your problem is atypical, you'll need more modifications. Those should be pretty straightforward replacements of basic numpy operations with their scipy.sparse equivalents.

I took all these answers and wrote a script to 1. validate each of the results (see assertion below) and 2. see which is the fastest.

    import numpy as np
    import scipy.sparse as sp
    from scipy.spatial.distance import squareform, pdist
    from sklearn.metrics.pairwise import linear_kernel
    from sklearn.preprocessing import normalize
    from sklearn.metrics.pairwise import cosine_similarity

    A = np.random.randint(0, 2, (10000, 100)).astype(float).T

    # ... (the definitions of data, rows and cols are elided)
    Asp = sp.csr_matrix((data, (rows, cols)), shape=(rows.max()+1, cols.max()+1))

    # Define a function to calculate the cosine similarities a few different ways
    def calc_sim(A, method=1):
        ...
            return 1 - squareform(pdist(A, metric='cosine'))
        ...
            # keepdims=True keeps the norms as a column so the division broadcasts per row
            Anorm = A / np.linalg.norm(A, axis=-1, keepdims=True)
        ...
            # Just a version of method 4 that takes in sparse arrays
            ...
            cosine = np.array(similarity.multiply(inv_mag))
        ...

    # Assert that all results are consistent with the first model ("truth").
    # The method that takes sparse input is checked with Asp, the others with A:
    np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))

Building off of Vaali's solution:

    def sparse_cosine_similarity(sparse_matrix):
        ...
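The dot-product method and the validate-against-pdist idea discussed in this thread can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not any answerer's original code: the helper names `inverse_magnitudes`, `cosine_dense` and `cosine_sparse` are hypothetical, the 3x4 matrix is made up for illustration, and scaling rows with `sp.diags` is one possible way to keep the sparse computation sparse.

```python
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import pdist, squareform


def inverse_magnitudes(square_mag):
    """Return 1/sqrt(x) per entry, using 0 where x == 0 (instead of inf)."""
    square_mag = np.asarray(square_mag, dtype=float)
    inv_mag = np.zeros_like(square_mag)
    nonzero = square_mag > 0
    inv_mag[nonzero] = 1.0 / np.sqrt(square_mag[nonzero])
    return inv_mag


def cosine_dense(A):
    """Row-wise cosine similarity of a dense 2-D array (hypothetical helper)."""
    similarity = A @ A.T                               # all pairwise dot products
    inv_mag = inverse_magnitudes(np.diag(similarity))  # diagonal holds squared row norms
    # scale entry (i, j) by 1/|row i| and 1/|row j|
    return similarity * inv_mag[:, np.newaxis] * inv_mag[np.newaxis, :]


def cosine_sparse(A_sp):
    """Same result for a scipy.sparse matrix; the output stays sparse."""
    square_mag = np.asarray(A_sp.multiply(A_sp).sum(axis=1)).ravel()
    normalized = sp.diags(inverse_magnitudes(square_mag)) @ A_sp  # unit-length rows
    return normalized @ normalized.T


# Made-up binary "preference" matrix: 3 items (rows), 4 users (columns).
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1]], dtype=float)
A_sp = sp.csr_matrix(A)

# Use pdist as the reference, as the benchmark script in the thread does.
reference = 1 - squareform(pdist(A, metric="cosine"))
dense_result = cosine_dense(A)
sparse_result = cosine_sparse(A_sp).toarray()

np.testing.assert_allclose(dense_result, reference, atol=1e-12)
np.testing.assert_allclose(sparse_result, reference, atol=1e-12)
print(np.round(dense_result, 3))
```

In this toy matrix every row has two nonzero entries, so rows that share one entry get a similarity of 0.5 and rows that share none get 0. Note that pdist returns NaN for all-zero rows, whereas the helpers above return 0 for them, so the comparison against pdist is only meaningful when every row has at least one nonzero entry.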