World's most popular travel blog for travel bloggers.

[Solved]: Naive Bayes MapReduce

, , No Comments
Problem Detail: 

I have the following question to answer in a MapReduce assignment sheet

The Naive Bayes classifier is a widely-used tool for analyzing data. Consider a data set that has $n$ data items, each of which has the same $m+1$ attributes. Each data item is stored as a tuple $$(i,(X_0,i, X_1,i, . . . X_m,i))$$

where $i$ is the key (index) of the item, and the associated value is a tuple with the $m + 1$ attributes of the item. For simplicity, assume that all the attributes are binary, i.e. each $Xj,i$ is either $0$ or $1$.

To build the Naive Bayes classifier, we need to compute the set of conditional probabilities $$Pr[X_j = x|X0]$$

These probabilities can be computed from counts.

$$Pr[X_j = x|X_0 = b] = \frac{C[X_j = x \text{ AND } X_0 = b]}{C[X_0 = b]}$$

where $C[\text{cond}]$ counts the number of tuples from the input that satisfy the condition $\text{cond}$.

Describe how the probabilities needed for a Naive Bayes classifier could be computed using the MapReduce model. Assume that $n$ is large (say, $n = 10^9$), while $m$ is small(say, $m = 10$).

How many rounds does your solution require?

My approach -

I assume that $X_j$ here means the sum of $X_j$ attribute column fields across all rows. So using the MapReduce model, we can take subsets of tuples and split them over different machines for the Map stage. For each tuple, if $X_0$ is $1$ then we can emit the key-value pair $(0,1)$, if $X_j$ is $1$ we can emit $\mathrm{KVP} (j, 1)$ and if either $X_0$ or $X_j$ are $1$ then we can emit $(0j, 1)$. As $m$ is small this production of key-value pairs for each tuple should take constant time.

After this once similar keys are shuffled together, in the Reduce stage we can simply count up the totals for each key and use the following formula -

$\mathsf{COUNT}(A \text{ AND } B) = \mathsf{COUNT}(A) + \mathsf{COUNT}(B) - \mathsf{COUNT}(A \text{ OR } B)$

and then get the corresponding probabilities. This solution thus requires one round of MapReduce.

Does this seem like a reasonable solution or am I interpreting the question wrong?

EDIT - More specifically, I have the following doubts

  1. In the context of the Naive Bayes classifier, is my assumption that $X_j$ is the sum of the values of entries in the $X_j$ column across all rows correct?

  2. In the Map stage is it possible to generate multiple key-value pairs of different types like I did in my answer above? And would it still be counted as one 'round' of MapReduce?

Asked By : Noble Six Taniguchi

Answered By : TenaliRaman

Your approach is fine. The idea essentially is the same that gets used in Word Count example in Hadoop [1].

[1] http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/38119

 Ask a Question

 Download Related Notes/Documents

0 comments:

Post a Comment

Let us know your responses and feedback