This looks like a job for MapReduce…but I just can't figure it out

Question!

I've been battling with this for about two days now, and any help would be tremendously appreciated. I currently have a very large MongoDB collection (over 100M documents) in the following format:

[_id]
[date]
[score]
[meta1]
[text1]
[text2]
[text3]
[text4]
[meta2]

This isn't the exact data in there; I've obfuscated it a little for the purposes of this post, but the schema is identical. And no, the format of that data cannot be changed; that's just the way it is.

There are a TON of duplicate entries in there: a job running once a day adds millions of entries to the database that may have the same data in the text fields but different values for the score, meta1, and meta2 fields. So I need to eliminate the duplicates and shoehorn everything into one collection without duplicate texts:

First, I'm going to concatenate the text fields and hash the result, so I have no duplicates containing the same text fields (this part is easy and already works).

Here's where I'm struggling: the resulting collection should have an entry for each unique meta1, which will in turn contain the dates and scores matching it.

So if I have the following three documents in my collection now:

[_id] => random mongoid
[date] => 12092010 
[score] => 3
[meta1] => somemetadatahere
[text1] => foo
[text2] => bar
[text3] => foo2
[text4] => bar2
[meta2] => uniquemeta2data

[_id] => random mongoid
[date] => 12092010
[score] => 5
[meta1] => othermetadata
[text1] => foo
[text2] => bar
[text3] => foo2
[text4] => bar2
[meta2] => uniquemeta2data1

[_id] => random mongoid
[date] => 12102010
[score] => 7
[meta1] => somemetadatahere  (same meta1 as the first document)
[text1] => foo
[text2] => bar
[text3] => foo2
[text4] => bar2
[meta2] => uniquemeta2data

They should be reduced to this collection (indents are nested documents/arrays). The keys in the datas array come from the values of the meta1 field in the original collection:

[_id]=> (md5 hash of all the text fields)
[text1] => foo
[text2] => bar
[text3] => foo2
[text4] => bar2    
[datas]
    [somemetadatahere]
        [meta2] => uniquemeta2data
        [scores]
            [12092010]=>3
            [12102010]=>7
    [othermetadata]
        [meta2] => uniquemeta2data1   
        [scores]
            [12092010]=>5

This seems like a perfect use case for a MapReduce job, but I'm having trouble wrapping my head around exactly how to do this.

Is anyone up for the challenge of helping me figure this out?



Answers

I think the MapReduce problem seems straightforward, which means I probably misunderstand your problem. Here is how I see it anyway.

Divide up the original collection based on the text hash, and have each reduce step focus on combining its resulting subset.

Here's some code from http://www.dashdashverbose.com/2009/01/mapreduce-with-javascript.html

I will try to edit this to fit your question.

function myMapper(key, value) {
    var ret = [];
    var words = normalizeText(value).split(' ');
    for (var i = 0; i < words.length; i++) {
        ret.push({key: words[i], value: 1});
    }
    return ret;
}
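Adapting that shape to your schema, here is a minimal sketch of the map and reduce logic (plain JavaScript with a tiny driver standing in for MongoDB's group-by-key step; hashOf() is just a placeholder for the MD5 hash you said already works):

```javascript
// Placeholder for the MD5-of-concatenated-text-fields step from the question.
function hashOf(doc) {
  return [doc.text1, doc.text2, doc.text3, doc.text4].join('|');
}

// Map: emit one value per document, keyed by the text hash. The value
// already has the final shape (datas keyed by meta1), so reduce only merges.
function mapDoc(doc, emit) {
  var value = {
    text1: doc.text1, text2: doc.text2,
    text3: doc.text3, text4: doc.text4,
    datas: {}
  };
  value.datas[doc.meta1] = { meta2: doc.meta2, scores: {} };
  value.datas[doc.meta1].scores[doc.date] = doc.score;
  emit(hashOf(doc), value);
}

// Reduce: merge all values that share a text hash into one document.
function reduceDocs(key, values) {
  var merged = { text1: null, text2: null, text3: null, text4: null, datas: {} };
  values.forEach(function (v) {
    merged.text1 = v.text1; merged.text2 = v.text2;
    merged.text3 = v.text3; merged.text4 = v.text4;
    Object.keys(v.datas).forEach(function (meta1) {
      if (!merged.datas[meta1]) {
        merged.datas[meta1] = { meta2: v.datas[meta1].meta2, scores: {} };
      }
      Object.keys(v.datas[meta1].scores).forEach(function (date) {
        merged.datas[meta1].scores[date] = v.datas[meta1].scores[date];
      });
    });
  });
  return merged;
}

// Tiny driver simulating MongoDB's grouping of emitted values by key.
function runMapReduce(docs) {
  var buckets = {};
  docs.forEach(function (doc) {
    mapDoc(doc, function (key, value) {
      (buckets[key] = buckets[key] || []).push(value);
    });
  });
  var out = {};
  Object.keys(buckets).forEach(function (key) {
    out[key] = reduceDocs(key, buckets[key]);
  });
  return out;
}

// The three example documents from the question:
var docs = [
  { date: '12092010', score: 3, meta1: 'somemetadatahere',
    text1: 'foo', text2: 'bar', text3: 'foo2', text4: 'bar2',
    meta2: 'uniquemeta2data' },
  { date: '12092010', score: 5, meta1: 'othermetadata',
    text1: 'foo', text2: 'bar', text3: 'foo2', text4: 'bar2',
    meta2: 'uniquemeta2data1' },
  { date: '12102010', score: 7, meta1: 'somemetadatahere',
    text1: 'foo', text2: 'bar', text3: 'foo2', text4: 'bar2',
    meta2: 'uniquemeta2data' }
];
var result = runMapReduce(docs);
console.log(JSON.stringify(result, null, 2));
```

One detail worth noting for real MongoDB: the reduce function may be called repeatedly on partial results, so the merged value deliberately has the same shape as the values emitted by map.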


Basically, this is the same problem as the well-known word-frequency problem in MapReduce, but instead of words you use hashes (and a reference to the original entry):

  • Map: take the hash of each entry and map it onto the pair (hash, 1). (To retrieve the original entry later, create an object and store the original entry as a property.)
  • Reduce: all entries with the same hash will be collected into the same bucket; count the values for each pair (hash, 1).
  • Output: the hashes, the original entry (stored in the object), and the count.

Analogy: the cat sat on the mat

Map:

  • the -> (hash(the), 1)
  • cat -> (hash(cat), 1)
  • sat -> (hash(sat), 1)
  • on -> (hash(on), 1)
  • the -> (hash(the), 1)
  • mat -> (hash(mat), 1)

Intermediate (grouped by key):

  • (hash(the), [1, 1])
  • (hash(cat), [1])
  • (hash(sat), [1])
  • (hash(on), [1])
  • (hash(mat), [1])

Reduce:

  • (hash(the), 2)
  • (hash(cat), 1)
  • (hash(sat), 1)
  • (hash(on), 1)
  • (hash(mat), 1)
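The analogy above can be run as a small sketch; swap the words for document hashes and the same emit/group/reduce shape applies:

```javascript
// Map: split the sentence and emit (word, 1) for each word.
function mapWords(sentence, emit) {
  sentence.split(' ').forEach(function (word) {
    emit(word, 1);
  });
}

// Reduce: sum the 1s collected for each word.
function reduceCounts(key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

function wordFrequencies(sentence) {
  var buckets = {};
  mapWords(sentence, function (key, value) {
    (buckets[key] = buckets[key] || []).push(value);
  });
  var counts = {};
  Object.keys(buckets).forEach(function (key) {
    counts[key] = reduceCounts(key, buckets[key]);
  });
  return counts;
}

console.log(wordFrequencies('the cat sat on the mat'));
// { the: 2, cat: 1, sat: 1, on: 1, mat: 1 }
```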
By: jvdbogae


This video can help you solve your question :)
By: admin