(Incomplete) Text retrieval on Spark

April 16, 2017

Background:
Clustering documents to find similar documents that are likely on the same topic.

• Document D is represented as a word vector
– (w1, w2, ..., wk), where wi = 1 if word i appears in D and 0 otherwise
(alternatively, wi can be the term frequency, or term frequency * inverse document frequency)
• Measure the similarity of two documents D1 and D2 by the cosine of their vectors: Cosine(D1, D2)
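
The cosine measure above can be sketched in plain Python. This is only an illustration: the dict-based representation ({word id: weight}) and the function name `cosine` are assumptions made for the example, not part of the Spark API.

```python
import math

def cosine(d1, d2):
    """Cosine similarity of two sparse word vectors given as {word_id: weight}."""
    # Iterate over the shorter vector; only shared words contribute to the dot product.
    if len(d1) > len(d2):
        d1, d2 = d2, d1
    dot = sum(w * d2[i] for i, w in d1.items() if i in d2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm1 * norm2)

# Binary word vectors: wi = 1 if word i appears in the document.
doc1 = {1: 1.0, 2: 1.0, 3: 1.0}
doc2 = {2: 1.0, 3: 1.0, 4: 1.0}
print(cosine(doc1, doc2))  # two shared words out of three each: 2/3
```

If the vectors are already scaled to unit length, the cosine reduces to the dot product alone.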

Preparation:
A) Data
1. Data set:
the Bag-of-words dataset at UCI
https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

2. Data description:
It contains 5 collections of documents; each collection consists of 2 files:
1) a vocabulary file, e.g. “vocab.enron.txt”: a list of words such as “aaa”, “aaas”, etc.
2) a document-word file, e.g. “docword.enron.txt.gz”, whose format is as follows:

The 1st line is the number of documents in the collection (39861).
The 2nd line is the number of words in the vocabulary (28102); these words appear in 10+ documents.
The 3rd line (3710420) is the number of nonzero (document, word) counts, i.e. the number of data lines that follow.
From the 4th line on, each line is <document id> <word id> <tf>.
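
A minimal sketch of reading this layout in plain Python, before any Spark is involved. The function name `parse_docword` and the tiny sample are made up for illustration; the sample is not real Enron data.

```python
from collections import defaultdict

def parse_docword(lines):
    """Parse docword-format lines.

    Returns (num_docs, vocab_size, num_entries, docs), where docs maps
    document id -> {word id: term frequency}.
    """
    it = iter(lines)
    num_docs = int(next(it))     # 1st header line
    vocab_size = int(next(it))   # 2nd header line
    num_entries = int(next(it))  # 3rd header line
    docs = defaultdict(dict)
    for line in it:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        doc_id, word_id, tf = (int(p) for p in parts)
        docs[doc_id][word_id] = tf
    return num_docs, vocab_size, num_entries, dict(docs)

# Hypothetical sample in the same layout.
sample = """2
5
3
1 2 4
1 5 1
2 2 3
""".splitlines()

num_docs, vocab_size, num_entries, docs = parse_docword(sample)
print(docs)
```

For the real file, an open handle such as `gzip.open("docword.enron.txt.gz", "rt")` could be passed in place of `sample`, since the function only iterates over lines.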

B) Spark Framework
Since the number of words in the vocabulary may be very large, a sparse representation (SparseVector / csc_matrix) is preferable for storing the unit vectors:

https://spark.apache.org/docs/latest/mllib-data-types.html
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseVector
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix

Spark also provides a TF-IDF method:
https://spark.apache.org/docs/latest/ml-features.html#tf-idf
To install scipy on EC2, execute “pip install --user numpy scipy”.
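
To make the weighting concrete, here is a plain-Python sketch of TF-IDF that does not call Spark. It uses the smoothed IDF formula from Spark's TF-IDF guide, log((m + 1) / (df + 1)), where m is the number of documents. The final scaling to unit length is an assumption, added because the vectors here are meant to be stored as unit vectors. The function name `tfidf_unit_vectors` is hypothetical.

```python
import math

def tfidf_unit_vectors(docs):
    """docs: {doc_id: {word_id: tf}} -> {doc_id: {word_id: weight}} scaled to unit L2 norm."""
    m = len(docs)
    # Document frequency: in how many documents each word appears.
    df = {}
    for counts in docs.values():
        for word_id in counts:
            df[word_id] = df.get(word_id, 0) + 1
    # Smoothed IDF as described in Spark's TF-IDF guide.
    idf = {w: math.log((m + 1) / (n + 1)) for w, n in df.items()}
    result = {}
    for doc_id, counts in docs.items():
        weights = {w: tf * idf[w] for w, tf in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        if norm > 0.0:
            # Assumption: scale to unit length so cosine becomes a dot product.
            weights = {w: v / norm for w, v in weights.items()}
        result[doc_id] = weights
    return result

# Hypothetical counts: word 1 appears in both documents, so its IDF is 0.
counts = {1: {1: 2, 2: 1}, 2: {1: 1, 3: 3}}
vecs = tfidf_unit_vectors(counts)
print(vecs)
```

A word that occurs in every document gets weight 0 under this formula, so it does not affect the similarity between documents.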

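SparseVector(size, indices of nonzero entries, nonzero values)
A SparseVector is written as a 3-tuple: the first value is an int giving the length of the vector, the second is a list of the indices of the nonzero elements, and the third is a list of the values of those nonzero elements.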
e.g.

(1936,[42,78,112,236,359,382,651,709,712,844,1031,1156,1158,1160,1245,1392,1402,1440,1548,1627,1683,1783],[0.179729117132,0.179583376927,0.179664410961,0.179632017557,0.17955632811,0.179729117132,0.179612568587,0.179612568587,0.35914728303,0.179599597214,0.179632017557,0.179590329305,0.179664410961,0.179729117132,0.359180658611,0.179583376927,0.179590329305,0.179612568587,0.179564652822,0.35914728303,0.179567149889,0.179560657182])
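
The (size, indices, values) layout can be illustrated without Spark. The helpers `to_sparse_triple` and `to_dense` below are hypothetical names for this sketch, not Spark functions.

```python
def to_sparse_triple(dense):
    """Return (size, indices, values) for the nonzero entries of a dense list."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [float(dense[i]) for i in indices]
    return len(dense), indices, values

def to_dense(triple):
    """Expand a (size, indices, values) triple back into a dense list."""
    size, indices, values = triple
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

triple = to_sparse_triple([0, 3, 0, 4])
print(triple)            # (4, [1, 3], [3.0, 4.0])
print(to_dense(triple))  # [0.0, 3.0, 0.0, 4.0]
```

Only the nonzero entries are stored, which is why this layout suits a vocabulary with tens of thousands of words where each document uses only a few of them.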

>>> from numpy import array
>>> from pyspark.mllib.linalg import SparseVector
>>> a = SparseVector(4, [1, 3], [3.0, 4.0])
>>> a.squared_distance(a)
0.0
>>> a.squared_distance(array([1., 2., 3., 4.]))
11.0
>>> b = SparseVector(4, [2], [1.0])
>>> a.squared_distance(b)
26.0
>>> b.squared_distance(a)
26.0
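
For intuition, the squared distance can be computed directly from (size, indices, values) triples. This is a plain-Python sketch with a hypothetical helper, not how Spark implements it.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two (size, indices, values) triples."""
    size_a, idx_a, val_a = a
    size_b, idx_b, val_b = b
    if size_a != size_b:
        raise ValueError("vectors must have the same size")
    da = dict(zip(idx_a, val_a))
    db = dict(zip(idx_b, val_b))
    # Only indices that are nonzero in at least one vector can contribute.
    return sum((da.get(i, 0.0) - db.get(i, 0.0)) ** 2 for i in set(da) | set(db))

a = (4, [1, 3], [3.0, 4.0])
dense = (4, [0, 1, 2, 3], [1.0, 2.0, 3.0, 4.0])
print(squared_distance(a, a))      # 0.0
print(squared_distance(a, dense))  # 11.0
```

The cost depends on the number of nonzero entries in the two vectors rather than on the vocabulary size.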

