CSL862: Assignment 3 on Map-Reduce
Assignment
- Generate at least 50K random sentences, each at most 140 characters long,
from a vocabulary of 20-30 words
- Challenge version: download at least 50K tweets using Twitter's APIs
- Find all sets of sentences that are at least 90% similar to each other, i.e. at least 90% of their words match
- Formulate using MapReduce and implement in parallel
- Challenge version: use Google Scholar to find an efficient algorithm for the above (it exists)
- Challenge ++: implement the above in parallel using MR
(Use Hadoop on AWS)
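As a starting point, the sentence generation and the similarity search described above can be sketched locally in Python before porting to Hadoop. This is only an illustrative sketch, not the required solution. It assumes "90% of the words match" means a Jaccard similarity of at least 0.9 between the sets of distinct words of two sentences. It uses the brute-force inverted-index formulation (not the efficient algorithm the challenge version asks for) and simulates the shuffle phase with in-memory dictionaries. All function names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def generate_sentences(vocab, count, max_chars=140, seed=0):
    """Build `count` random sentences of at most `max_chars` characters."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(count):
        words = [rng.choice(vocab)]
        length = len(words[0])
        # Pick a random target length so sentences vary in size.
        target = rng.randint(length, max_chars)
        while True:
            word = rng.choice(vocab)
            if length + 1 + len(word) > target:
                break
            words.append(word)
            length += 1 + len(word)
        sentences.append(" ".join(words))
    return sentences


def map_words(sid, sentence):
    """Job 1 map: emit (word, sentence id) once per distinct word."""
    for word in set(sentence.split()):
        yield word, sid


def reduce_pairs(word, sids):
    """Job 1 reduce: emit ((a, b), 1) for each pair sharing this word."""
    for a, b in combinations(sorted(sids), 2):
        yield (a, b), 1


def similar_pairs(sentences, threshold=0.9):
    """Return sorted id pairs whose word-set Jaccard similarity >= threshold."""
    postings = defaultdict(list)  # simulated shuffle for job 1
    for sid, sentence in enumerate(sentences):
        for word, i in map_words(sid, sentence):
            postings[word].append(i)

    shared = defaultdict(int)  # job 2: sum the 1s per pair
    for word, sids in postings.items():
        for pair, one in reduce_pairs(word, sids):
            shared[pair] += one

    sizes = [len(set(s.split())) for s in sentences]
    result = []
    for (a, b), inter in shared.items():
        union = sizes[a] + sizes[b] - inter
        if inter / union >= threshold:
            result.append((a, b))
    return sorted(result)


if __name__ == "__main__":
    vocab = ["map", "reduce", "hadoop", "cloud", "node", "key", "value",
             "shuffle", "sort", "split", "job", "task", "data", "block",
             "cluster", "input", "output", "merge", "disk", "file"]
    sample = generate_sentences(vocab, count=200)
    print(len(similar_pairs(sample)), "similar pairs among", len(sample))
```

The output is a list of similar pairs; grouping them into sets (for example, as connected components) would be a further step. With only 20-30 words in the vocabulary, each word's posting list covers a large share of the sentences, so the pair-emitting reducer grows quadratically. That cost is one reason to look for the more efficient algorithm mentioned in the challenge version.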
Note:
- To be done in groups of two or three.
- The last date for submission is Nov 7.