I've been battling with this for about 2 days now, and any help would be tremendously appreciated. I currently have a very large MongoDB collection (over 100M documents) in the following format:
[_id] [date] [score] [meta1] [text1] [text2] [text3] [text4] [meta2]
This isn't the exact data in there; I've obfuscated it a little for the purpose of this post, but the schema is identical. And no, the format of that data cannot be changed; that's just the way it is.
There are a TON of duplicate entries in there: a job runs once a day, adding millions of entries to the database that may have the same data in the text fields but different values for the score, meta1, and meta2 fields. So I need to eliminate the duplicates and shoehorn everything into one collection without duplicate texts:
First, I'm going to concatenate the text fields and hash the result, so I have no duplicates containing the same text fields (this part is easy and already works).
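For reference, the hashing step can be sketched roughly like this in Python. This is only an illustration, not the code that is actually running: the function name `text_key`, the `TEXT_FIELDS` tuple, and the unit-separator character between fields are all assumptions made for this example. The separator is there so that field boundaries still matter after concatenation.

```python
import hashlib

# Names of the text fields from the schema above.
TEXT_FIELDS = ("text1", "text2", "text3", "text4")


def text_key(doc):
    """Return an MD5 hex digest computed over the four text fields of doc."""
    # Assumption: "\x1f" (ASCII unit separator) never occurs in the text.
    # Joining with it keeps ("ab", "c", ...) and ("a", "bc", ...) distinct,
    # which plain concatenation would not.
    joined = "\x1f".join(str(doc[field]) for field in TEXT_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

Two documents that differ only in date, score, meta1, or meta2 get the same key, so the digest can serve as the `_id` of the deduplicated document.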
Here's where I'm struggling: The resulting collection will have an array of each unique meta1, which will in turn be an array containing the dates and scores matching it.
So if I have the following three documents in my collection now:
    [_id] => random mongoid
    [date] => 12092010
    [score] => 3
    [meta1] => somemetadatahere
    [text1] => foo
    [text2] => bar
    [text3] => foo2
    [text4] => bar2
    [meta2] => uniquemeta2data

    [_id] => random mongoid
    [date] => 12092010
    [score] => 5
    [meta1] => othermetadata
    [text1] => foo
    [text2] => bar
    [text3] => foo2
    [text4] => bar2
    [meta2] => uniquemeta2data1

    [_id] => random mongoid
    [date] => 12102010
    [score] => 7
    [meta1] => somemetadatahere (same meta1 as the first document)
    [text1] => foo
    [text2] => bar
    [text3] => foo2
    [text4] => bar2
    [meta2] => uniquemeta2data
They should be reduced to this collection (indents are nested documents/arrays). The keys in the datas array come from the values of the meta1 field in the original collection:
    [_id] => (md5 hash of all the text fields)
    [text1] => foo
    [text2] => bar
    [text3] => foo2
    [text4] => bar2
    [datas]
        [somemetadatahere]
            [meta2] => uniquemeta2data
            [scores]
                => 3
                => 7
        [othermetadata]
            [meta2] => uniquemeta2data1
            [scores]
                => 5
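To make the target shape concrete, here is a rough in-memory sketch in plain Python of the grouping I'm after. It is not MongoDB code: `map_doc`, `reduce_values`, and `dedupe` are hypothetical names made up for this post, and the separator used when hashing is an assumption. It also assumes that meta2 is the same for every document sharing a text hash and meta1 (the first one seen is kept), and it collects only scores, as in the expected output above. The reduce step returns the same shape that the map step emits, because MongoDB may call reduce more than once for the same key.

```python
import hashlib
from collections import defaultdict

TEXT_FIELDS = ("text1", "text2", "text3", "text4")


def map_doc(doc):
    """Emit (key, value) for one source document, like a map function would."""
    # Assumption: "\x1f" never occurs in the text fields.
    joined = "\x1f".join(doc[field] for field in TEXT_FIELDS)
    key = hashlib.md5(joined.encode("utf-8")).hexdigest()
    value = {field: doc[field] for field in TEXT_FIELDS}
    # One entry per meta1, holding its meta2 and a list of scores.
    value["datas"] = {
        doc["meta1"]: {"meta2": doc["meta2"], "scores": [doc["score"]]}
    }
    return key, value


def reduce_values(values):
    """Merge values that share a key; the output has the same shape as the input."""
    merged = {field: values[0][field] for field in TEXT_FIELDS}
    merged["datas"] = {}
    for value in values:
        for meta1, entry in value["datas"].items():
            # Assumption: the first meta2 seen for a given meta1 wins.
            target = merged["datas"].setdefault(
                meta1, {"meta2": entry["meta2"], "scores": []}
            )
            target["scores"].extend(entry["scores"])
    return merged


def dedupe(docs):
    """Group mapped documents by key and reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        key, value = map_doc(doc)
        grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}
```

Feeding the three sample documents through `dedupe` yields a single entry whose `datas` has the two meta1 keys, with the scores for `somemetadatahere` merged into one list.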
This seems like a perfect use case for a MapReduce job, but I'm having trouble wrapping my head around exactly how to do this.
Is anyone up for the challenge of helping me figure this out?