(Incomplete) Text retrieval on Spark
Background:
Clustering documents to find similar documents that are likely on the same topic
• Document D represented as a word vector
– (w1, w2,..., wk), where wi = 1 if word i appears in D and 0 otherwise
(or wi may be the term frequency, or term frequency * inverse document frequency)
• Measure the similarity of two documents D1 and D2 by the cosine of their vectors: Cosine(D1, D2)
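The word-vector representation and cosine measure described above can be sketched in plain Python. This is only an illustration: the vocabulary, the two toy documents, and the helper names `to_word_vector` and `cosine` are hypothetical and are not taken from the dataset or from Spark.

```python
import math

def to_word_vector(words, vocabulary):
    """Binary word vector: 1 if the vocabulary word appears in the document."""
    present = set(words)
    return [1 if w in present else 0 for w in vocabulary]

def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for documents with no known words
    return dot / (norm1 * norm2)

# Hypothetical toy vocabulary and documents, for illustration only.
vocabulary = ["spark", "vector", "cluster", "email"]
d1 = to_word_vector(["spark", "vector", "spark"], vocabulary)
d2 = to_word_vector(["vector", "cluster"], vocabulary)
similarity = cosine(d1, d2)
```

The two toy documents share one of their two words each, so the cosine comes out as 0.5. With term-frequency or TF-IDF weights the same `cosine` function applies unchanged.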
Preparation:
A) Data
1. Data set:
the Bag-of-words dataset at UCI
https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
2. Data description:
It contains 5 collections of documents, each consisting of 2 files (shown here for the Enron collection):
1) vocabulary file “vocab.enron.txt” = a list of words, e.g., “aaa”, “aaas”, etc.
2) document-word file “docword.enron.txt.gz”, whose format looks like:
39861
28102
3710420
1 118 1
1 285 1
1 1229 1
1 1688 1
1 2068 1
...
The 1st line is the # of documents in the collection (39861).
The 2nd line is the # of words in the vocabulary (28102) (these words appear in 10+ documents).
The 3rd line (3710420) is the # of nonzero counts, i.e. the # of <document id> <word id> <tf> lines that follow.
From the 4th line on, each line is <document id> <word id> <tf>.
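A minimal sketch of reading this format, assuming three header lines followed by whitespace-separated integer triples. The helper names `parse_docword` and `group_by_document` are hypothetical, and the sample is just the truncated excerpt shown above, so it holds far fewer triples than the header announces.

```python
def parse_docword(lines):
    """Parse docword lines into (num_docs, vocab_size, third_header, triples)."""
    stripped = (line.strip() for line in lines)
    it = (line for line in stripped if line)  # skip blank lines
    num_docs = int(next(it))      # 1st header line
    vocab_size = int(next(it))    # 2nd header line
    third_header = int(next(it))  # 3rd header line, kept as-is
    triples = []
    for line in it:
        doc_id, word_id, tf = (int(x) for x in line.split())
        triples.append((doc_id, word_id, tf))
    return num_docs, vocab_size, third_header, triples

def group_by_document(triples):
    """Collect the triples into {doc_id: {word_id: tf}}."""
    docs = {}
    for doc_id, word_id, tf in triples:
        docs.setdefault(doc_id, {})[word_id] = tf
    return docs

# Truncated excerpt of docword.enron.txt, for illustration only.
sample = """39861
28102
3710420
1 118 1
1 285 1
1 1229 1
1 1688 1
1 2068 1
"""
num_docs, vocab_size, third_header, triples = parse_docword(sample.splitlines())
docs = group_by_document(triples)
```

For the real file, the same function could be fed the lines of `gzip.open("docword.enron.txt.gz", "rt")` from the standard library. On a cluster the grouping step would more likely be done with Spark than with an in-memory dict.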
B) Spark Framework
Since the # of words in the vocabulary may be very large, a SparseVector / csc_matrix is preferable for storing the unit vectors:
https://spark.apache.org/docs/latest/mllib-data-types.html
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseVector
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix
The TF-IDF method provided by Spark:
https://spark.apache.org/docs/latest/ml-features.html#tf-idf
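To make the weighting concrete, here is a plain-Python sketch of TF-IDF using the smoothed IDF formula given in the Spark documentation, log((N + 1) / (df + 1)). It does not call Spark itself. The function name `tf_idf` and the three toy documents are hypothetical, and the input layout {doc_id: {word_id: tf}} is an assumption chosen for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: {doc_id: {word_id: tf}} -> {doc_id: {word_id: tf * idf}}."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word_id for counts in docs.values() for word_id in counts)
    # Smoothed IDF, log((N + 1) / (df + 1)).
    idf = {w: math.log((n_docs + 1) / (d + 1)) for w, d in df.items()}
    return {
        doc_id: {w: tf * idf[w] for w, tf in counts.items()}
        for doc_id, counts in docs.items()
    }

# Hypothetical toy collection: word 285 appears in two of three documents.
docs = {1: {118: 2, 285: 1}, 2: {285: 3}, 3: {1229: 1}}
weights = tf_idf(docs)
```

With this smoothing, a word that appears in every document gets an IDF of log(1) = 0, so it contributes nothing to the similarity between documents.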
To install scipy on EC2, execute “pip install --user numpy scipy”.
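SparseVector(size, indices of nonzero entries, nonzero values)
A SparseVector is written as a 3-tuple: the first value is an int giving the length of the vector, the second is a list of the indices of its nonzero elements, and the third is a list of the values of those nonzero elements.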
e.g.
(1936,[42,78,112,236,359,382,651,709,712,844,1031,1156,1158,1160,1245,1392,1402,1440,1548,1627,1683,1783],[0.179729117132,0.179583376927,0.179664410961,0.179632017557,0.17955632811,0.179729117132,0.179612568587,0.179612568587,0.35914728303,0.179599597214,0.179632017557,0.179590329305,0.179664410961,0.179729117132,0.359180658611,0.179583376927,0.179590329305,0.179612568587,0.179564652822,0.35914728303,0.179567149889,0.179560657182])
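A sketch of how such a triple might be built for one document and scaled to unit length, in plain Python rather than with Spark's own class. The helpers `to_unit_sparse` and `sparse_dot` are hypothetical, and the code assumes that word ids in the data file start at 1 while vector indices start at 0.

```python
import math

def to_unit_sparse(size, weights):
    """weights: {word_id: weight} -> (size, indices, values) with unit length."""
    # Assumption: word ids are 1-based, vector indices are 0-based.
    items = sorted((w - 1, v) for w, v in weights.items() if v != 0)
    norm = math.sqrt(sum(v * v for _, v in items))
    indices = [i for i, _ in items]
    values = [v / norm for _, v in items]  # empty when there are no items
    return size, indices, values

def sparse_dot(u, v):
    """Dot product of two (size, indices, values) triples of the same size."""
    assert u[0] == v[0], "vectors must have the same size"
    lookup = dict(zip(v[1], v[2]))
    return sum(x * lookup.get(i, 0.0) for i, x in zip(u[1], u[2]))

# Hypothetical documents over a 6-word vocabulary.
d1 = to_unit_sparse(6, {1: 3.0, 4: 4.0})
d2 = to_unit_sparse(6, {4: 2.0})
cos = sparse_dot(d1, d2)  # for unit vectors the dot product is the cosine
```

Storing unit vectors means the cosine of two documents reduces to a sparse dot product, which only touches the indices the two documents share.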
>>> from numpy import array
>>> from pyspark.mllib.linalg import SparseVector
>>> a = SparseVector(4, [1, 3], [3.0, 4.0])
>>> a.squared_distance(a)
0.0
>>> a.squared_distance(array([1., 2., 3., 4.]))
11.0
>>> b = SparseVector(4, [2], [1.0])
>>> a.squared_distance(b)
26.0
>>> b.squared_distance(a)
26.0
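Squared distance ties back to the cosine measure: for two unit vectors u and v, ||u - v||^2 = 2 - 2 * Cosine(u, v), so ranking by squared distance and ranking by cosine agree. The sketch below checks this in plain Python on (size, indices, values) triples. The function `squared_distance` here is a hypothetical stand-in written for the example, not Spark's implementation.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two (size, indices, values) triples."""
    assert u[0] == v[0], "vectors must have the same size"
    diff = dict(zip(u[1], u[2]))
    for i, x in zip(v[1], v[2]):
        diff[i] = diff.get(i, 0.0) - x
    return sum(d * d for d in diff.values())

# Same numbers as the SparseVector example: [0, 3, 0, 4] against [1, 2, 3, 4].
a = (4, [1, 3], [3.0, 4.0])
dense = (4, [0, 1, 2, 3], [1.0, 2.0, 3.0, 4.0])
dist_a_dense = squared_distance(a, dense)

# Two hypothetical unit vectors whose cosine (dot product) is 0.8.
u = (4, [1, 3], [0.6, 0.8])
v = (4, [3], [1.0])
dist_uv = squared_distance(u, v)
cos_uv = 0.8
identity_gap = abs(dist_uv - (2 - 2 * cos_uv))
```

Here `dist_a_dense` reproduces the 11.0 from the session above, and `identity_gap` is zero up to floating-point rounding.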