8.2. MapReduce

This tab covers MapReduce settings. Here you can set properties for the JobTracker and TaskTrackers, as well as some general and advanced properties. Click the name of a group to expand or collapse its display.

 

Table 3.6. MapReduce Settings: JobTracker

Name Notes

JobTracker host

The host that has been assigned to run JobTracker. This value is prepopulated based on your choices on previous screens.

JobTracker new generation size

Default size of the Java new generation for JobTracker (Java option -XX:NewSize)

JobTracker maximum new generation size

Maximum size of the Java new generation for JobTracker (Java option -XX:MaxNewSize)

JobTracker maximum Java heap size

Maximum Java heap size for JobTracker in MB (Java option -Xmx)
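The three JobTracker memory settings above map onto standard JVM flags. As a rough sketch of how they might combine into one option string: the function name is hypothetical, and the sketch assumes all three values are given in MB (the table states MB only for the heap size).

```python
def jobtracker_jvm_opts(new_size_mb, max_new_size_mb, max_heap_mb):
    """Build a JVM option string from the three JobTracker memory settings.

    Hypothetical helper for illustration; assumes every value is in MB.
    """
    # The new generation cannot start larger than its own maximum,
    # and it lives inside the heap, so it cannot exceed -Xmx either.
    if new_size_mb > max_new_size_mb:
        raise ValueError("new generation size exceeds its maximum")
    if max_new_size_mb > max_heap_mb:
        raise ValueError("maximum new generation size exceeds heap size")
    return (
        f"-XX:NewSize={new_size_mb}m "
        f"-XX:MaxNewSize={max_new_size_mb}m "
        f"-Xmx{max_heap_mb}m"
    )


if __name__ == "__main__":
    # Illustrative values only, not recommended defaults.
    print(jobtracker_jvm_opts(200, 200, 1024))
```

The sanity checks reflect the fact that the new generation is part of the heap; the numbers in the usage line are placeholders.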


 

Table 3.7. MapReduce Settings: TaskTracker

Name Notes
TaskTracker hosts The hosts that have been assigned to run TaskTrackers. This value is prepopulated based on your choices on previous screens.
MapReduce local directories Directories for MapReduce to store intermediate data files
Number of Map slots per node Number of slots on a TaskTracker that simultaneously running Map tasks can occupy.
Number of Reduce slots per node Number of slots on a TaskTracker that simultaneously running Reduce tasks can occupy.
Java options for MapReduce tasks Java options for the TaskTracker child processes
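Because slots are configured per node, the cluster-wide capacity follows from the number of TaskTracker hosts. A minimal sketch of that arithmetic, assuming every TaskTracker uses the same slot settings; the function name and host names are hypothetical.

```python
def cluster_slot_capacity(tasktracker_hosts, map_slots_per_node, reduce_slots_per_node):
    """Return total Map and Reduce slots across all TaskTracker hosts.

    Hypothetical helper; assumes identical slot settings on every host.
    """
    node_count = len(tasktracker_hosts)
    return {
        "map": node_count * map_slots_per_node,
        "reduce": node_count * reduce_slots_per_node,
    }


if __name__ == "__main__":
    # Placeholder host names and slot counts, for illustration only.
    hosts = ["tt1.example.com", "tt2.example.com", "tt3.example.com"]
    print(cluster_slot_capacity(hosts, map_slots_per_node=4, reduce_slots_per_node=2))
```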

 

Table 3.8. MapReduce Settings: General

Name Notes
MapReduce Capacity Scheduler

The scheduler to use for scheduling MapReduce jobs

Cluster's Map slot size (virtual memory)

The virtual memory size of a single Map slot in the MapReduce framework. Use -1 for no limit

Cluster's Reduce slot size (virtual memory)

The virtual memory size of a single Reduce slot in the MapReduce framework. Use -1 for no limit

Upper limit on virtual memory for single Map task

Upper limit on virtual memory for single Map task. Use -1 for no limit.

Upper limit on virtual memory for single Reduce task

Upper limit on virtual memory for single Reduce task. Use -1 for no limit.

Default virtual memory for a job’s map-task

Virtual memory for single Map task. Use -1 for no limit.

Default virtual memory for a job's reduce-task

Virtual memory for single Reduce task. Use -1 for no limit.

Map-side sort buffer memory

The total amount of Map-side buffer memory to use while sorting files (Expert-only configuration)

Limit on buffer

Percentage of sort buffer used for record collection (Expert-only configuration)

Job log retention (hours)

The maximum time, in hours, to retain user logs after a job completes.

Maximum number tasks for a Job

Maximum number of tasks for a single Job. Use -1 for no limit.

LZO compression

Check to enable LZO compression in addition to Snappy

Snappy compression

Check to enable Snappy compression

Enable Job Diagnostics

Check to enable tools for tracing the path and troubleshooting the performance of MapReduce jobs

 

Table 3.9. MapReduce Settings: Advanced

Name Notes
MapReduce system directories MapReduce system directories
io.sort.record.percent  
io.sort.factor  
mapred.tasktracker.tasks.sleeptime-before-sigkill The time, in milliseconds, that the TaskTracker waits before sending SIGKILL to a task process. The usual recommended default is 5 seconds, a value of 5000 here. In this case the setting is used only to kill tasks very quickly (0.25 second), to guarantee that no task JVMs are left around for later jobs.
mapred.job.tracker.handler.count The number of server threads for the JobTracker. This should be roughly 4% of the number of TaskTracker nodes.
mapreduce.cluster.administrators  
mapred.reduce.parallel.copies  
tasktracker.http.threads  
mapred.map.tasks.speculative.execution If true, then multiple instances of some map tasks may be executed in parallel
mapred.reduce.tasks.speculative.execution If true, then multiple instances of some reduce tasks may be executed in parallel
mapred.reduce.slowstart.completed.maps  
mapred.inmem.merge.threshold The threshold, in terms of the number of files, for triggering the in-memory merge process. When the threshold is hit, we initiate the merge and spill to disk. A value of less than or equal to 0 means no threshold is set and ramfs's memory consumption triggers the merge.
mapred.job.shuffle.merge.percent The threshold, expressed as a percentage of the total memory allocated to storing in-memory map outputs (defined in mapred.job.shuffle.input.buffer.percent), for triggering the in-memory merge process.
mapred.job.shuffle.input.buffer.percent The percentage of memory to be allocated from the maximum heap size for storing map outputs during the shuffle.
mapred.output.compression.type If the job outputs are to be compressed as SequenceFiles, how should they be compressed? Acceptable values are: NONE, RECORD, or BLOCK.
mapred.jobtracker.completeuserjobs.maximum  
mapred.jobtracker.restart.recover A value of true enables job recovery on restart; false starts afresh
mapred.job.reduce.input.buffer.percent The percentage of memory relative to the maximum heap size. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.
mapreduce.reduce.input.limit The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, the job fails. A value of -1 means that no limit is set.
mapred.task.timeout The number of milliseconds before a task is terminated if it neither reads an input, writes an output, nor updates its status string.
jetty.connector  
mapred.child.root.logger  
mapred.max.tracker.blacklists If a node is reported blacklisted by this number of successful jobs within the timeout window, it will be graylisted.
mapred.healthChecker.interval  
mapred.healthChecker.script.timeout  
mapred.job.tracker.persist.jobstatus.active Indicates whether persistence of job status is active
mapred.job.tracker.persist.jobstatus.hours The number of hours job status information is persisted in DFS. Job status information is available after it drops off the memory queue and between JobTracker restarts. A value of zero means that job status information is not persisted at all.
mapred.jobtracker.retirejob.check  
mapred.jobtracker.retirejob.interval  
mapred.job.tracker.history.completed.location  
mapreduce.fileoutputcommitter.marksuccessfuljobs  
mapred.job.reuse.jvm.num.tasks The number of tasks to run per JVM. A value of -1 indicates no limit.
hadoop.job.history.user.location  
mapreduce.jobtracker.staging.root.dir The path prefix for the staging directories. The next level is always the user's name. It is a path in the default file system.
mapreduce.tasktracker.group The group that the TaskTracker uses for accessing the task controller. The mapred user must be a member; other users should not be members.
mapreduce.jobtracker.split.metainfo.maxsize If the size of the split metainfo file is larger than this value, the JobTracker will fail the job during initialization.
mapred.jobtracker.blacklist.fault-timeout-window Sliding window in minutes
mapred.jobtracker.blacklist.fault-bucket-width Bucket size in minutes (15-minute buckets)
mapred.queue.names Comma-separated list of queues configured for this JobTracker
Custom MapReduce Configs Use this text box to enter values for mapred-site.xml properties not exposed by the UI. Enter in "key=value" format, with a newline as a delimiter between pairs.
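The "key=value" format for custom configs can be checked before pasting it into the text box. The sketch below is only an illustration of the format described above, not the installer's own parser: the function name is hypothetical, and skipping blank lines and trimming whitespace are assumptions.

```python
def parse_custom_configs(text):
    """Parse newline-delimited key=value pairs into a dict.

    Hypothetical helper; blank lines are skipped and surrounding
    whitespace is trimmed (both are assumptions, not documented behavior).
    """
    configs = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore empty lines between pairs
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"line {line_no}: expected key=value, got {raw!r}")
        configs[key.strip()] = value.strip()
    return configs


if __name__ == "__main__":
    # Property names come from the Advanced table; the values are placeholders.
    sample = "mapred.task.timeout=600000\nmapred.reduce.parallel.copies=5"
    print(parse_custom_configs(sample))
```

Splitting on the first "=" keeps values such as Java option strings intact when they contain "=" themselves.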

