How to Build an Inverted Index With MapReduce
MapReduce is a parallel programming model developed at Google for processing large data sets. It splits the input into chunks that are processed in parallel rather than in sequential order. A map function turns each input record into intermediate key-value pairs, and a reduce function then combines all the values that share a key, hence the name. An inverted index maps each word to the documents that contain it, so here the map function emits a word as the key and the document name as the value, and the reduce function gathers the document names for each word. You can build an inverted index with MapReduce to support a keyword search, for example.
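To make the idea of an inverted index concrete before bringing in Hadoop, here is a minimal in-memory sketch in plain Java. The class name InMemoryInvertedIndex, the build method and the two sample documents are hypothetical and used only for illustration; the sketch assumes words are separated by whitespace and compared in lowercase.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class name; plain Java, no Hadoop required.
class InMemoryInvertedIndex {

    // Builds a map from each word to the sorted set of document names containing it.
    static Map<String, Set<String>> build(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            StringTokenizer itr = new StringTokenizer(doc.getValue().toLowerCase());
            while (itr.hasMoreTokens()) {
                index.computeIfAbsent(itr.nextToken(), w -> new TreeSet<>())
                     .add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Sample documents (made up for this example): name -> contents.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "the quick brown fox");
        docs.put("b.txt", "the lazy dog");

        Map<String, Set<String>> index = build(docs);
        System.out.println(index.get("the"));  // prints [a.txt, b.txt]
        System.out.println(index.get("fox"));  // prints [a.txt]
    }
}
```

A keyword search is then a single lookup in the index rather than a scan of every document. The MapReduce version builds the same kind of mapping, but spreads the work across many machines.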
- Difficulty: Moderate
Instructions
1
Type the following code for the map function:
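    // Put these imports at the top of InvertedIndex.java; Steps 1 to 3 use them.
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // The code in Steps 1 to 3 goes inside "public class InvertedIndex { ... }".
    public static class InvertedIndexerMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text>
    {
        // Reused for every record to avoid allocating new Text objects.
        private final Text word = new Text();
        private final Text location = new Text();

        public void map(LongWritable key, Text val,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException
        {
            // The name of the file this line came from is the index location.
            FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
            String fileName = fileSplit.getPath().getName();
            location.set(fileName);

            // Emit one (word, file name) pair per lowercased token in the line.
            String line = val.toString();
            StringTokenizer itr = new StringTokenizer(line.toLowerCase());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, location);
            }
        }
    }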
2
Type the following code for the reduce function:
    public static class InvertedIndexerReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text>
    {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException
        {
            // Join every file name received for this word into one
            // comma-separated string.
            boolean first = true;
            StringBuilder toReturn = new StringBuilder();
            while (values.hasNext()) {
                if (!first) {
                    toReturn.append(", ");
                }
                first = false;
                toReturn.append(values.next().toString());
            }
            // Emit the word with its list of file names.
            output.collect(key, new Text(toReturn.toString()));
        }
    }
3
Type the following code to complete the inverted index:
    public static void main(String[] args) throws IOException
    {
        // Both the input path and the output path are required.
        if (args.length < 2) {
            System.out.println("Usage: InvertedIndex <input path> <output path>");
            System.exit(1);
        }

        JobConf conf = new JobConf(InvertedIndex.class);
        conf.setJobName("InvertedIndex");

        // Both the key (word) and the value (file names) are text.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(InvertedIndexerMapper.class);
        conf.setReducerClass(InvertedIndexerReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submit the job and wait for it to finish.
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
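To see how the three steps fit together, the sketch below simulates the map, shuffle and reduce phases in a single JVM, without Hadoop. It is only an illustration of the data flow under stated assumptions: the class InvertedIndexFlowSketch, its method names and the sample lines are hypothetical, and the shuffle step is a simplified stand-in for the grouping that Hadoop performs between the mapper and the reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical class; simulates the job's data flow without Hadoop.
class InvertedIndexFlowSketch {

    // Map phase: emit one (word, fileName) pair per token, like the mapper in Step 1.
    static List<String[]> map(String fileName, String line) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            pairs.add(new String[] { itr.nextToken(), fileName });
        }
        return pairs;
    }

    // Shuffle phase (simplified): group the emitted values by key, sorted by key.
    static Map<String, List<String>> shuffle(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Reduce phase: join all file names for one word with ", ", like the reducer in Step 2.
    static String reduce(List<String> fileNames) {
        return String.join(", ", fileNames);
    }

    public static void main(String[] args) {
        // Sample input lines (made up for this example).
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(map("a.txt", "To be or not to be"));
        pairs.addAll(map("b.txt", "Be quick"));

        for (Map.Entry<String, List<String>> e : shuffle(pairs).entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

In this simulation the word "be" maps to "a.txt, a.txt, b.txt": the reducer appends every value it receives, so a file name appears once for each occurrence of the word. If you want each file listed only once, collect the names in a set before joining them. The sketch keeps values in the order they were emitted; a real Hadoop job does not promise any particular order of values within a key.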