
FileSetInputFormat

A Hadoop input format for sending lists of files as keys to a mapper. Set the list of files, and an input split will be created per file. Each map task gets only one input key: the filename for its split.


An input format for processing individual Path objects one at a time. Useful for distributing large jobs that are inherently single-machine across the entire cluster. Each unit of work still runs on one machine, but it is assigned to an under-utilized node at runtime, and the job gains Hadoop's built-in resilience to task failure.

This input format takes a set of paths and produces a separate input split for each one. For example, if you need to unzip a collection of five files in HDFS, each file has to be unzipped on a single machine, but the five files can all be unzipped on different machines. Using this input format lets you process each file as you wish in your mapper.

  1. In your main/run method of your Hadoop job driver class, add:
<pre><code>Job job = new Job(new Configuration());
...
job.setInputFormatClass(FileSetInputFormat.class);
FileSetInputFormat.addPath("/some/path/to/a/file");
FileSetInputFormat.addPath("/some/other/path/to/a/file");
// Also see FileSetInputFormat.addAllPaths(Collection&lt;Path&gt; paths);
</code></pre>
  2. Then, make your mapper take a Path as the key and a NullWritable as the value:
<pre><code>public static class MyMapper extends Mapper&lt;Path, NullWritable, ..., ...&gt; {
  protected void map(Path key, NullWritable value, Context context) throws IOException, InterruptedException {
    // Do something with the path, e.g. open it and unzip it to somewhere.
    ...
  }
}
</code></pre>
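As an illustration of the "open it and unzip it" step in the mapper, here is a minimal sketch of a stream-based decompression helper. It assumes the files are gzip-compressed, which the README does not specify, and it uses only the Java standard library. The class `GunzipSketch` and its methods are hypothetical names, not part of FileSetInputFormat. Inside `map`, the two streams would presumably come from Hadoop's `FileSystem` (for example `key.getFileSystem(context.getConfiguration()).open(key)` for the input); the demo below substitutes in-memory streams so it runs without a cluster.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for illustration; not part of FileSetInputFormat. */
final class GunzipSketch {
    private GunzipSketch() {}

    /**
     * Decompresses a gzip stream into {@code out} and returns the number of
     * decompressed bytes written. Closes the input stream, not the output.
     */
    static long gunzip(InputStream compressed, OutputStream out) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }

    /** Demo-only helper: gzips a byte array in memory. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(raw);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for the streams a mapper would open on HDFS.
        byte[] original = "hello, cluster".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        long written = gunzip(new ByteArrayInputStream(gzip(original)), restored);
        System.out.println(written + " bytes: " + restored.toString("UTF-8"));
    }
}
```

Because each map task receives exactly one Path, a helper like this would run once per file, on whichever node the task was scheduled.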

The Path keys are the same paths passed in to FileSetInputFormat.addPath and FileSetInputFormat.addAllPaths. Duplicates are stripped, and one InputSplit is generated per unique Path. That's it!
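The duplicate-stripping behavior can be pictured with a small standalone sketch. This is not the library's actual implementation: plain strings stand in for Hadoop `Path` objects, `SplitPlanSketch` and `planSplits` are hypothetical names, and keeping first-seen order is an assumption the README does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of "one split per unique path". */
final class SplitPlanSketch {
    private SplitPlanSketch() {}

    /**
     * Returns one entry per unique requested path. Each entry represents the
     * single key that one map task would receive.
     */
    static List<String> planSplits(List<String> requestedPaths) {
        // LinkedHashSet drops duplicates while preserving insertion order
        // (the ordering is an assumption made for this sketch).
        Set<String> unique = new LinkedHashSet<>(requestedPaths);
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> requested = Arrays.asList(
                "/some/path/to/a/file",
                "/some/other/path/to/a/file",
                "/some/path/to/a/file"); // added twice
        for (String split : planSplits(requested)) {
            System.out.println("split -> " + split);
        }
    }
}
```

In this sketch, adding the same path twice still yields a single entry, so only one map task would be launched for it.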
