Share knowledge and thoughts: Hadoop questions and answers

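Q. What is the default block size in HDFS?
A. 64 MB in Hadoop 1.x; Hadoop 2.x and later default to 128 MB. The value is configurable through dfs.blocksize (dfs.block.size in Hadoop 1.x).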

Q. What is the difference between a block and a split?
A. A block is the physical unit of storage in HDFS: a contiguous chunk of a file's bytes stored on a DataNode's disk. A split is the logical chunk of input processed by one map task. By default the split size equals the block size, but the two can differ in content: when a record straddles a boundary, the whole record is treated as part of the first split, and the reader for the next split skips it.
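
The boundary rule can be illustrated without a cluster. The following is a minimal sketch in plain Java, not Hadoop's actual record reader: the class and method names are hypothetical, and it assumes newline-delimited records and the convention that a record belongs to the split containing its first byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
class SplitBoundaryDemo {

    // Returns the newline-delimited records owned by the split covering bytes [start, end).
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Back up one byte and discard everything through the next newline:
            // a line that began before 'start' belongs to the previous split.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        // Read every line that starts inside this split, even if it ends beyond 'end'.
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // Three 8-byte "splits" over a 20-byte file.
        System.out.println(readSplit(data, 0, 8));   // [alpha, bravo]
        System.out.println(readSplit(data, 8, 16));  // [charlie]
        System.out.println(readSplit(data, 16, 20)); // []
    }
}
```

Here "bravo" starts in the first split and ends in the second, so the first split's reader reads past its own end to finish it, and the second split's reader skips the leftover bytes.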

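Q. How can you make sure that a file is not split?
A. Write your own InputFormat class (typically by extending an existing FileInputFormat subclass), override isSplitable, and always return false. Note that the Hadoop method name is spelled with a single 't'.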
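
A minimal sketch of such a class, assuming the newer org.apache.hadoop.mapreduce API and plain text input. The class name NonSplittableTextInputFormat is made up for illustration; in the Hadoop API the method to override is spelled isSplitable. It needs the Hadoop client libraries on the classpath to compile.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical class name; extends the stock text input format.
public class NonSplittableTextInputFormat extends TextInputFormat {

    // Returning false tells the framework to hand each file to a single mapper,
    // regardless of how many blocks the file occupies.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
```

It would then be selected in the driver with job.setInputFormatClass(NonSplittableTextInputFormat.class).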

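Q. When does the reduce method of a reducer get called?
A. Not until all mappers have finished processing their input. Reducers may start copying map output earlier, but reduce() itself runs only after every map task has completed.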

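Q. What if the output types of your mapper do not match the input types of your reducer?
A. The code compiles, but the job fails at run time, typically with a ClassCastException (or with an IOException reporting a type mismatch when the emitted types differ from those declared on the job).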

Q. Are keys and values in sorted order when passed to reducer?
A. Keys are sorted but values are not.
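
A small simulation makes the distinction concrete. This is only a sketch in plain Java, not Hadoop's shuffle implementation: the class name is hypothetical, and a TreeMap stands in for the framework's sort-by-key step while values are kept in whatever order they arrive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
class ShuffleOrderSketch {

    // Groups (key, value) pairs the way a reducer sees them:
    // keys in sorted order, values per key in arrival order.
    static Map<String, List<Integer>> group(String[] keys, int[] values) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[] keys = {"b", "a", "b", "a"};
        int[] values = {3, 9, 1, 2};
        System.out.println(group(keys, values)); // {a=[9, 2], b=[3, 1]}
    }
}
```

The keys come out as a, b, but the values for each key are not sorted. In a real job the arrival order of values is not guaranteed at all.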

Q. Where is the intermediate data emitted by a mapper written?
A. To the local file system of the node where the mapper is running, not to HDFS.

Q. What is the default replication factor?
A. 3

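Q. How does Hadoop decide where and how to store replicated data?
A. Through rack-aware placement. With a replication factor of 3, data block 'd' is stored on three different nodes n1, n2, n3 spread over two different racks r1, r2: under the default policy the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that second rack.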
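
The two-rack layout can be sketched in a few lines of plain Java. This is a simplified illustration, not the NameNode's code: the class and method names are hypothetical, nodes are picked deterministically (first eligible) instead of randomly, and node load and free space are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Hadoop.
class ReplicaPlacementSketch {

    // Picks three nodes for one block: the writer's node, then two nodes on one other rack.
    static List<String> place(Map<String, List<String>> racks, String writerRack, String writerNode) {
        List<String> chosen = new ArrayList<>();
        chosen.add(writerNode); // replica 1: local to the writer

        // Find a different rack with at least two nodes for replicas 2 and 3.
        for (Map.Entry<String, List<String>> rack : racks.entrySet()) {
            if (!rack.getKey().equals(writerRack) && rack.getValue().size() >= 2) {
                chosen.add(rack.getValue().get(0)); // replica 2: remote rack
                chosen.add(rack.getValue().get(1)); // replica 3: same remote rack, different node
                return chosen;
            }
        }
        throw new IllegalStateException("need a second rack with at least two nodes");
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("r1", List.of("n1", "n4"));
        racks.put("r2", List.of("n2", "n3"));
        System.out.println(place(racks, "r1", "n1")); // [n1, n2, n3]
    }
}
```

Losing a whole rack still leaves at least one replica, while only one copy has to cross the rack boundary during the write.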

Q. Can you configure the number of mappers for your input file?
A. You can configure how many mappers run in parallel on a node, but you can't directly set the total number of mappers: it is decided by the number of splits, and ultimately by the block size. So by changing the block size (or the split size) you can control the number of mappers.
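
The relationship between block size and mapper count can be worked through numerically. The sketch below is plain Java with hypothetical names. It follows the split-size rule max(minSize, min(maxSize, blockSize)) and the 10% slack on the last split as I recall them from FileInputFormat; treat both as assumptions to check against your Hadoop version.

```java
// Hypothetical illustration class; not part of Hadoop.
class SplitCountSketch {

    // Assumed slack factor: the last split may be up to 10% larger than the split size.
    static final double SPLIT_SLOP = 1.1;

    // Split size is the block size, clamped between the configured minimum and maximum.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (and so map tasks) for a single splittable file.
    static int countSplits(long fileSize, long splitSize) {
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // final, possibly short or slightly oversized, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 1024 * mb; // a 1 GB input file
        System.out.println(countSplits(fileSize, computeSplitSize(64 * mb, 1, Long.MAX_VALUE)));  // 16
        System.out.println(countSplits(fileSize, computeSplitSize(128 * mb, 1, Long.MAX_VALUE))); // 8
    }
}
```

Doubling the block size from 64 MB to 128 MB halves the number of mappers for the same 1 GB file.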

Wednesday, January 1, 2014