<?xml version="1.0"?>
	
<items>
	
		
	<item>
		<title>
			Bayes Classifiers 101
		</title>
	        <apropos id="186" author="timm" dob="1189900513" />
       		<link>http://menzies.us/cs591o/?lecture=186</link>
       		<category>lectures</category>
       		<category>bayes</category>
		<description>
			<![CDATA[


<p>A Bayes classifier is a simple statistical-based learning scheme.</p>

<p>Advantages:</p>

<ul>
<li>Tiny memory footprint</li>
<li>Fast training, fast learning</li>
<li>Simplicity</li>
<li>Often works surprisingly well</li>
</ul>

<p>Assumptions:</p>

<ul>
<li>Learning is done best via statistical modeling</li>
<li>Attributes are
<ul>
<li>equally important</li>
<li>statistically independent (given the class value)</li>
<li>This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known)</li>
</ul></li>
<li>Although based on assumptions that are almost never correct, this scheme works well in practice
<a href="http://menzies.us/cs591o/?doc=187">[Domingos97]</a>:</li>
</ul>

<p><center>
<a href="http://menzies.us/cs591o/img/bayesVsOthers.png"><img width=400 src="http://menzies.us/cs591o/img/bayesVsOthers.png"></a>
</center></p>

<h2>Example</h2>

<p>weather.symbolic.arff</p>

<pre><code>outlook  temperature  humidity  windy  play
-------  -----------  --------  -----  ----
rainy    cool         normal    TRUE   no
rainy    mild         high      TRUE   no
sunny    hot          high      FALSE  no
sunny    hot          high      TRUE   no
sunny    mild         high      FALSE  no
overcast cool         normal    TRUE   yes
overcast hot          high      FALSE  yes
overcast hot          normal    FALSE  yes
overcast mild         high      TRUE   yes
rainy    cool         normal    FALSE  yes
rainy    mild         high      FALSE  yes
rainy    mild         normal    FALSE  yes
sunny    cool         normal    FALSE  yes
sunny    mild         normal    TRUE   yes
</code></pre>

<p>This data can be summarized as follows:</p>

<pre><code>          
           Outlook            Temperature           Humidity   
====================   =================   =================  
          Yes    No            Yes   No            Yes    No 
Sunny       2     3     Hot     2     2    High      3     4
Overcast    4     0     Mild    4     2    Normal    6     1 
Rainy       3     2     Cool    3     1
          -----------         ---------            ----------
Sunny     2/9   3/5     Hot   2/9   2/5    High    3/9   4/5 
Overcast  4/9   0/5     Mild  4/9   2/5    Normal  6/9   1/5
Rainy     3/9   2/5     Cool  3/9   1/5

            Windy        Play
=================    ========
      Yes     No     Yes   No
False 6      2       9     5
True  3      3
      ----------   ----------
False  6/9    2/5   9/14  5/14
True   3/9    3/5
</code></pre>
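<p>The counts in this table can be rebuilt mechanically. The following is an illustrative Python sketch (not part of the lecture's AWK code), keyed to the fourteen rows of weather.symbolic.arff shown above:</p>

```python
# Illustrative sketch: rebuild the frequency-count table for the
# weather.symbolic data (attribute-value x class counts).
from collections import Counter

rows = [  # (outlook, temperature, humidity, windy, play)
    ("rainy","cool","normal","TRUE","no"),   ("rainy","mild","high","TRUE","no"),
    ("sunny","hot","high","FALSE","no"),     ("sunny","hot","high","TRUE","no"),
    ("sunny","mild","high","FALSE","no"),    ("overcast","cool","normal","TRUE","yes"),
    ("overcast","hot","high","FALSE","yes"), ("overcast","hot","normal","FALSE","yes"),
    ("overcast","mild","high","TRUE","yes"), ("rainy","cool","normal","FALSE","yes"),
    ("rainy","mild","high","FALSE","yes"),   ("rainy","mild","normal","FALSE","yes"),
    ("sunny","cool","normal","FALSE","yes"), ("sunny","mild","normal","TRUE","yes"),
]
attrs = ("outlook", "temperature", "humidity", "windy")
count = Counter()   # (class, attribute, value) -> frequency
n     = Counter()   # class -> frequency
for *values, klass in rows:
    n[klass] += 1
    for attr, value in zip(attrs, values):
        count[klass, attr, value] += 1

print(n["yes"], n["no"])                   # 9 5
print(count["yes","outlook","sunny"])      # 2
print(count["no","outlook","sunny"])       # 3
print(count["yes","humidity","normal"])    # 6
```

Dividing each count by the class total (9 or 2/5 style fractions) gives exactly the 2/9, 3/5, ... entries in the summary table.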

<p>So, what happens on a new day:</p>

<pre><code>Outlook       Temp.         Humidity    Windy         Play
Sunny         Cool          High        True          ?
</code></pre>

<p>First, find the likelihood of the two classes:</p>

<ul>
<li>For "yes" = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053</li>
<li>For "no" = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206</li>

<li>Conversion into a probability by normalization:
<ul>
<li>P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205</li>
<li>P("no") = 0.0206 / (0.0053 + 0.0206) = 0.795</li>
</ul></li>
</ul>

<p>So, we aren't playing golf today.</p>
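<p>The arithmetic above can be checked with a few lines of Python (an illustrative sketch, using the fractions read off the summary table):</p>

```python
# Illustrative sketch: likelihood-then-normalize for the new day
# (sunny, cool, high, TRUE), using fractions from the count table.
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
like_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

# Normalize the two likelihoods into probabilities
p_yes = like_yes / (like_yes + like_no)
p_no  = like_no  / (like_yes + like_no)
print(round(like_yes, 4), round(like_no, 4))   # 0.0053 0.0206
print(round(p_yes, 3), round(p_no, 3))         # 0.205 0.795
```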

<h2>Bayes' rule</h2>

<p>More generally, the above is just an application of Bayes' Theorem.</p>

<ul>
<li>Probability of event H given evidence E: <pre>
             Pr[E | H ] * Pr[H]
Pr[H | E] =  -------------------
                  Pr[E] 
</pre></li>
<li>A priori probability of H= Pr[H]
<ul>
<li>Probability of event before evidence has been seen</li>
</ul></li>
<li>A posteriori probability of H= Pr[H|E]
<ul>
<li>Probability of event after evidence has been seen</li>
</ul></li>
<li>Classification learning: what's the probability of the  class given an instance?
<ul>
<li>Evidence E = instance</li>
<li>Event H = class value for instance</li>
</ul></li>
<li>Naive Bayes assumption: evidence can be split into independent parts (i.e., the attributes of the instance):</li>

<pre>
            Pr[E1 | H] * Pr[E2 | H] * ... * Pr[En | H] * Pr[H]
Pr[H | E] = ---------------------------------------------------
                               Pr[E]
</pre>
</ul>

<ul>

<li><p>We used this above. Here's our evidence:</p>
<pre>
Outlook       Temp.         Humidity    Windy         Play
Sunny         Cool          High        True          ?</pre>
</li>
<li><p>Here's the probability for "yes":</p></li>

<pre>
Pr[yes | E] = Pr[Outlook     = Sunny | yes] *
              Pr[Temperature = Cool  | yes] *
              Pr[Humidity    = High  | yes] *
              Pr[Windy       = True  | yes] * Pr[yes] / Pr[E]
            = (2/9 * 3/9 * 3/9 * 3/9 * 9/14) / Pr[E]
</pre>
</ul>
<p>Return the classification with the highest probability.</p>
<ul>
<li>Probability of the evidence Pr[E]:
<ul>
<li>Constant across all possible classifications;</li>
<li>So, when comparing N classifications, it cancels out</li>
</ul></li>
</ul>

<h2>Numerical errors</h2>

<p>These arise from multiplying many small numbers:</p>

<ul>
<li>Use the standard fix:  don't multiply the numbers, add the logs</li>
</ul>
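<p>Here's a small Python sketch (illustrative only) showing why the fix is needed: a long product of conditional probabilities underflows to zero in floating point, while the equivalent sum of logs does not:</p>

```python
# Illustrative sketch: product-of-probabilities underflow vs the log fix.
import math

probs = [0.01] * 200       # 200 hypothetical small conditional probabilities
product = 1.0
for p in probs:
    product *= p           # 0.01^200 = 1e-400: underflows to exactly 0.0

log_sum = sum(math.log(p) for p in probs)   # stays a perfectly ordinary float
print(product)             # 0.0
print(round(log_sum, 1))   # -921.0
```

Since log is monotonic, comparing log-likelihoods picks the same winning class as comparing the raw products would.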

<h2>Missing values</h2>

<p>Missing values are a problem for any learner. NaiveBayes' treatment of missing values is particularly elegant.</p>

<ul>
<li>During training: instance is not included in frequency count for attribute value-class combination</li>

<li>During classification: attribute will be omitted from  calculation</li>
</ul>

<pre>
Example: Outlook    Temp.    Humidity    Windy    Play
         ?          Cool     High        True     ?
</pre>

<ul>
<li>Likelihood of "yes" = 3/9 * 3/9 * 3/9 * 9/14 = 0.0238</li>
<li>Likelihood of "no" = 1/5 * 4/5 * 3/5 * 5/14 = 0.0343</li>
<li>P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%</li>
<li>P("no") = 0.0343 / (0.0238 + 0.0343) = 59%</li>

</ul>
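<p>The missing-value arithmetic above can be sketched in Python (illustrative only; the Outlook factor is simply dropped from each product):</p>

```python
# Illustrative sketch: classification with Outlook missing.
# Each class keeps only the factors for attributes that are present.
like_yes = (3/9) * (3/9) * (3/9) * (9/14)   # Cool, High, True, prior
like_no  = (1/5) * (4/5) * (3/5) * (5/14)

p_yes = like_yes / (like_yes + like_no)
print(round(like_yes, 4), round(like_no, 4))   # 0.0238 0.0343
print(round(100 * p_yes))                      # 41
```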

<h2>The "low-frequencies problem"</h2>

<p>What if an attribute value never occurs with some class value? In the table above, for example, "Outlook = overcast" never occurs with class "no".</p>

<ul>
<li>The conditional probability will be zero! Pr[Outlook = overcast | no] = 0</li>
<li>The a posteriori probability will also be zero! Pr[no | E] = 0      (no matter how likely the other values are!)</li>
</ul>

<p>So use an estimator for low-frequency attribute ranges:</p>

<ul>
<li>Add a little <em>"m"</em> to the count for every attribute value-class combination
<ul>
<li>The Laplace estimator</li>
<li>Result: probabilities will never be zero!</li>
</ul></li>
</ul>

<p>And use an estimator for low-frequency classes:</p>

<ul>
<li>Add a little <em>"k"</em> to class counts
<ul>
<li>The M-estimate</li>
</ul></li>
</ul>

<p>Magic numbers: m=2, k=1</p>
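<p>The two estimators can be sketched as follows (an illustrative Python translation of the smoothed prior and conditional formulas using the magic numbers above; the function names are ours, not from the lecture code):</p>

```python
# Illustrative sketch of M-estimate / Laplace-style smoothing, m=2, k=1:
#   prior(class)   = (n[class] + k) / (instances + k * num_classes)
#   P(value|class) = (count[class,value] + m * prior) / (n[class] + m)
M, K = 2, 1

def prior(n_klass, instances, klasses):
    return (n_klass + K) / (instances + K * klasses)

def conditional(count, n_klass, pr):
    return (count + M * pr) / (n_klass + M)

# Pr[Outlook=overcast|no] is 0/5 in the raw counts,
# but becomes non-zero once smoothed:
pr_no = prior(5, 14, 2)                  # (5+1)/(14+2) = 0.375
print(round(conditional(0, 5, pr_no), 3))
```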

<h2>Pseudo-code</h2>

<p>Here's the pseudo-code of the
NaiveBayes classifier preferred by
<a href="http://menzies.us/cs591o/?doc=188">[Yang03]</a>
(p4).</p>
<p>For full details see
<a href="http://unbox.org/wisp/branches/tims-our/minerc.lib/nbd.awk">this code</a>.
</p>

<pre><code>function train(   i) {
   Instances++
   if (++N[$Klass]==1) Klasses++  
   for(i=1;i<=Attr;i++) 
     if (i != Klass)
      if ($i !~ /\?/)  
         symbol(i,$i,$Klass) 
}
function symbol(col,value,klass) {
   Count[klass,col,value]++;
}
</code></pre>

<p>When testing, find the likelihood of each hypothetical 
class and return the one that is most likely.</p>

<p>Simple version</p>
<pre><code>function likelihood(l,         klass,i,inc,temp,prior,what,like) {  
   like = -10000000000;    # smaller than any likelihood
   for(klass in N) {  
      prior=N[klass] / Instances; 
      temp= prior
      for(i=1;i<=Attr;i++) {  
         if (i != Klass)
            if ( $i !~ /\?/ ) 
                temp *= Count[klass,i,$i] / N[klass]
      }
      l[klass]= temp
      if ( temp >= like ) {like = temp; what=klass}
   }
   return what
}
</code></pre>

<p>More realistic version (handles certain low-frequency cases). </p>
<pre><code>function likelihood(l,         klass,i,inc,temp,prior,what,like) {  
   like = -10000000000;    # smaller than any log
   for(klass in N) {  
      prior=(N[klass]+K)/(Instances + (K*Klasses)); 
      temp= log(prior)
      for(i=1;i<=Attr;i++) {  
         if (i != Klass)
            if ( $i !~ /\?/ ) 
                temp += log((Count[klass,i,$i]+M*prior)/(N[klass]+M))
      }
      l[klass]= temp
      if ( temp >= like ) {like = temp; what=klass}
   }
   return what
}
</code></pre>

<h2>Handling Numerics</h2>

<p>The above code assumes that the attributes are discrete.
The usual approximation is to assume a "gaussian" (i.e. a "normal" or "bell-shaped" curve)
for the numerics.</p>

<p>The probability density function for the normal  distribution is defined by the mean and standardDev (standard deviation)</p>

<p>Given:</p>

<ul>
<li>n: the number of values;</li>
<li>sum: the sum of the values; i.e. sum = sum + value;</li>
<li>sumSq: the sum of the square of the values; i.e. sumSq = sumSq + value*value</li>
</ul>

<p>Then:</p>

<pre><code>    function mean(sum,n)  {
        return sum/n
    }
    function standardDeviation(sumSq,sum,n)  {
        return sqrt((sumSq-((sum*sum)/n))/(n-1))
    }
    function gaussianPdf(mean,standardDev,x) {
       pi = 1068966896 / 340262731  # good to 17 decimal places
       return exp(-1*(x-mean)^2/(2*standardDev*standardDev)) / \
              (standardDev * sqrt(2*pi))
    }
</code></pre>
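<p>As a quick sanity check, here is an illustrative Python translation of the three functions above, applied to the "yes"-class temperatures from the numeric weather data given below:</p>

```python
# Illustrative sketch: mean / standardDeviation / gaussianPdf in Python.
import math

def mean(total, n):
    return total / n

def standard_deviation(sum_sq, total, n):
    # sample standard deviation from running sums, as in the AWK version
    return math.sqrt((sum_sq - total * total / n) / (n - 1))

def gaussian_pdf(mu, sd, x):
    return math.exp(-(x - mu)**2 / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

# temperature values for class "yes" in the numeric weather data
temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
mu = mean(sum(temps_yes), len(temps_yes))
sd = standard_deviation(sum(t*t for t in temps_yes), sum(temps_yes), len(temps_yes))
print(round(mu, 1), round(sd, 1))            # 73.0 6.2
print(round(gaussian_pdf(mu, sd, 66), 4))    # 0.034
```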

<p>For example: </p>

<pre><code>outlook  temperature  humidity  windy  play
-------  -----------  --------  -----  ----
sunny    85           85        FALSE  no
sunny    80           90        TRUE   no
overcast 83           86        FALSE  yes
rainy    70           96        FALSE  yes
rainy    68           80        FALSE  yes
rainy    65           70        TRUE   no
overcast 64           65        TRUE   yes
sunny    72           95        FALSE  no
sunny    69           70        FALSE  yes
rainy    75           80        FALSE  yes
sunny    75           70        TRUE   yes
overcast 72           90        TRUE   yes
overcast 81           75        FALSE  yes
rainy    71           91        TRUE   no
</code></pre>

<p>This generates the following statistics:</p>

<pre><code>
             Outlook            Temperature              Humidity
=====================    ==================     ==================
           Yes    No            Yes     No             Yes     No
Sunny       2      3             83     85              86     85
Overcast    4      0             70     80              96     90
Rainy       3      2             68     65              80     70
                                ...    ...             ...    ...
          -----------          -----------            -----------
Sunny     2/9    3/5    mean      73   74.6   mean    79.1   86.2
Overcast  4/9    0/5    std dev  6.2    7.9   std dev 10.2    9.7
Rainy     3/9    2/5

              Windy            Play
===================     ===========
           Yes   No     Yes     No
False       6     2      9       5
True        3     3
            -------     ----------
False     6/9   2/5     9/14  5/14
True      3/9   3/5
</code></pre>

<p>Example density value:</p>

<ul>
<li>f(temperature=66|yes) = gaussianPdf(73,6.2,66) = 0.0340</li>
<li>Classifying a new day:</li>
</ul>

<pre>
Outlook    Temp.    Humidity    Windy    Play
Sunny      66       90          true     ?
</pre>

<ul>
<li>Likelihood of "yes" = 2/9 * 0.0340 * 0.0221 * 3/9 * 9/14 = 0.000036</li>

<li>Likelihood of "no" = 3/5 * 0.0291 * 0.0380 * 3/5 * 5/14 = 0.000136
<ul>
<li>P("yes") = 0.000036 / (0.000036 + 0.000136) = 20.9%</li>
<li>P("no") = 0.000136 / (0.000036 + 0.000136) = 79.1%</li>
</ul></li>
</ul>

<p>Note: missing values during training are not included in the calculation of the mean and standard deviation.</p>

<p>BTW, an alternative to the above is to apply some discretization policy
to the data; e.g. <a href="Refs.html#Yang03">[Yang03]</a>.
Such discretization is good practice since
it can dramatically improve the performance of a NaiveBayes
classifier (see <a href="http://menzies.us/cs591o/?doc=135">[Dougherty95]</a>).</p>

<h2>Not so "Naive" Bayes</h2>

<p>Why does Naive Bayes work so well?
<a href="http://menzies.us/cs591o/?doc=187">[Domingos97]</a>
offers one analysis:</p>

<ul>
<li>They offer one example with three attributes
where a "naive" and an "optimal" Bayes classifier perform nearly
the same.</li>
<li>They generalize from that example to conclude that
"Naive" Bayes is only really naive in a vanishingly
small number of cases.</li>
</ul>

<p>Their three-attribute example is given below. For
the generalized argument, see
<a href="http://menzies.us/cs591o/?doc=187">their paper</a>.
</p>

<p>Consider a Boolean concept, described by three attributes A, B and C.</p>

<p>Assume that the
two classes, denoted by + and -, are equiprobable:</p>

<pre> P(+) = P(-) = 1/2 </pre>
<p>Let A and C be independent, and
let A = B (i.e., A and B are completely dependent). Therefore B should be ignored,
and the optimal classification procedure for a test instance is to assign it to
(i) class + if
<pre> P(A|+) * P(C|+) -  P(A|-) * P(C|-) > 0, </pre>

<p>(ii) to class - if the inequality has the opposite
sign, and (iii) to an arbitrary class if the two sides are equal.


<p>Note that the naive Bayesian
classifier will take B into account as if it were independent of A, and this is
equivalent
to counting A twice.
Thus, the Bayesian classifier will assign the instance to class + if 
<pre> P(A|+)^2 *  P(C|+) -  P(A|-)^2  * P(C|-) > 0, </pre>
<p>and to - otherwise. 

<p>Applying Bayes' theorem, P(A|+) can be reexpressed as 
<pre> P(A) * P(+|A)/P(+) </pre>  
<p>and  similarly for the other probabilities. 
<p>Since P(+) = P(-), after canceling like terms this 
leads to the equivalent expressions 
<pre> P(+|A) * P(+|C ) - P(-|A) * P(-|C ) > 0 </pre> 
<p>for the optimal decision, and 
<pre> P(+|A)^2 *  P(+|C ) - P(-|A)^2 * P(-|C) > 0  </pre>

<p>for the Bayesian classifier. Let 
<pre>
P(+|A) = p 
P(+|C) = q. 
</pre> 

<p>Then class + should be selected when</p>

<pre> pq - (1 - p)*(1 - q) > 0    </pre>  

<p>which is equivalent to</p>

<pre> q > 1 - p   [Optimal Bayes] </pre>

<p>With the Bayesian classifier, it will be
selected when</p>

<pre> p^2 * q  - (1 - p)^2 *  (1 - q) > 0  </pre> 

<p>which is equivalent to</p>

<pre>q > (1 - p)^2 / (p^2 + (1 - p)^2)    [Simple Bayes]</pre>

<p>The two
curves are shown in the following figure. The remarkable fact is that, even though the independence
assumption is decisively violated because B = A, the Bayesian classifier disagrees with
the optimal procedure only in the two narrow regions that are above one of the curves
and below the other; everywhere else it performs the correct classification.</p>

<p>Thus, for all
problems where (p, q) does not fall in those two small regions, the Bayesian classifier is
effectively optimal.</p>
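<p>This claim is easy to check numerically. The following illustrative Python sketch sweeps a grid of (p, q) values and measures how often the two decision rules disagree:</p>

```python
# Illustrative sketch: compare the optimal rule (from q > 1 - p) with the
# naive rule over a grid of (p, q) = (P(+|A), P(+|C)) values.
steps = 201
disagree, total = 0, 0
for i in range(steps):
    for j in range(steps):
        p, q = i / (steps - 1), j / (steps - 1)
        optimal = p*q - (1 - p)*(1 - q) > 0
        naive   = p*p*q - (1 - p)*(1 - p)*(1 - q) > 0
        total += 1
        disagree += (optimal != naive)

frac = disagree / total
print(round(frac, 3))   # the two rules agree over most of the unit square
```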

<p><center>
<a href="http://menzies.us/cs591o/img/optimalBayes.png"><img width=400 src="http://menzies.us/cs591o/img/optimalBayes.png"></a>
</center></p>

			]]>
		</description>
	</item>
	
</items>
	
	
