<?xml version="1.0"?>
	
<items>
	
		
	<item>
		<title>
	Iterative Dichotomization
		</title>
	        <apropos id="184" author="timm" dob="1189469142" />
       		<link>http://menzies.us/cs591o/?lecture=184</link>
       		<category>lecture</category>
       		<category>trees</category>
		<description>
			<![CDATA[
<h3>How to generate a tree</h3>
<p>
<ul>
    <li> Given a bag of mixed-up stuff.
          <ul>
			<li> Need a measure of "mixed-up"
		  </ul>
    <li> Split: Find something that divides up the bag in two new sub-bags
          <ul>
				<li> And each sub-bag is less mixed-up;
          		<li> Each split is the root of a sub-tree.
		  </ul>
    <li> Recurse: repeat for each sub-bag
          <ul>
			<li> i.e. on just the data that falls into each part of the split
                <ul>
					<li> Need a Stop rule
                	<li> Condense the instances that fall into each sub-bag
				</ul>
		  </ul>
    <li> Prune back the generated tree.
</ul></p>
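The five steps above can be sketched as one generic recursive learner. This is a minimal illustration only: <code>measure</code>, <code>split</code>, <code>condense</code>, and <code>stop</code> are plug-in stand-ins (the toy lambdas below are not any particular learner's definitions), and pruning is left as a comment.

```python
def grow(bag, measure, split, condense, stop):
    """Recursively divide a bag of instances into a tree."""
    if stop(bag):                          # Stop rule
        return ("leaf", condense(bag))     # condense the instances in this bag
    test, sub_bags = split(bag, measure)   # find a less mixed-up division
    return ("node", test,
            [grow(b, measure, split, condense, stop) for b in sub_bags])

# Toy usage: "instances" are just class labels; stop when a bag is pure.
tree = grow([1, 1, 2], measure=None,
            split=lambda bag, m: ("x <= 1", [[1, 1], [2]]),
            condense=lambda bag: bag[0],
            stop=lambda bag: len(set(bag)) == 1)
print(tree)  # -> ('node', 'x <= 1', [('leaf', 1), ('leaf', 2)])
# Pruning would then walk the finished tree and collapse sub-trees whose
# removal does not hurt the chosen error estimate.
```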
<p>
Different tree learners result from different selections of
&lt;measure, split, condense, stop, prune&gt;:
	<ul>
    <li> CART: (regression trees)
          <ul>
				<li> measure: standard deviation
						<ul><li>Three "normal" curves with 
							<a href="http://upload.wikimedia.org/wikipedia/commons/1/1b/Normal_distribution_pdf.png">different standard deviations</a>
							<li>Expected values 
								<a href="http://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Standard_deviation_diagram.svg/350px-Standard_deviation_diagram.svg.png">under the normal curve</a>
						</ul>
          		<li> condense: report the average of the instances in each bag.
			</ul>
    <li> M5Prime: (model trees)
          <ul>
				<li> measure: standard deviation
				<li>
           condense: generate a linear model of the form a + b*x1 + c*x2 + d*x3 + ...
		</ul>
	<li>
     J48: (decision trees)
          <ul>
			<li> measure: <a href="http://upload.wikimedia.org/math/a/e/f/aef122e9c7f64d071b2acb4d17e88000.png">"entropy"</a>
          	<li> condense: report majority class
		 </ul>
	</ul>
<h3>Example: entropy and decision trees</h3>
<img align="center" width="500"
 src="http://www.csee.wvu.edu/~timm/cs591o/old/images/splits.jpg">
<p>
Q: which attribute is the best to split on?</p>
<p>
A: the one which will result in the smallest tree:
</p><p> Heuristic: choose the attribute that produces the "purest" nodes
(purity = not-mixed-up)
</p>
<p>e.g. Outlook= sunny</p>

<ul>
<li>info([2,3])= entropy(2/5,3/5) = -2/5 * log(2/5) - 3/5 * log(3/5) = 0.971 bits</li>
</ul>

<p>Outlook = overcast</p>

<ul>
<li>info([4,0]) = entropy(1,0) = -1 * log(1) - 0 * log(0) = 0 bits (taking 0 * log(0) = 0)</li>
</ul>

<p>Outlook = rainy </p>

<ul>
<li>info([3,2]) = entropy(3/5, 2/5) = -3/5 * log(3/5) - 2/5 * log(2/5) = 0.971 bits</li>
</ul>

<p>Expected info for Outlook =  Weighted sum of the above</p>

<ul>
<li>info([2,3],[4,0],[3,2]) = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.693 bits</li>
</ul>

<p>Computing the information gain</p>

<ul>
<li>i.e. information before splitting minus information after splitting</li>
<li>e.g. gains for the attributes of the weather data:</li>
<li>gain("Outlook") = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits</li>
<li>gain("Temperature") = 0.029 bits</li>

<li>gain("Humidity") = 0.152 bits</li>
<li>gain("Windy") = 0.048 bits</li>
</ul>
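The arithmetic above can be checked with a few lines of Python. Here <code>log</code> is log base 2 (so the results are in bits), and each branch is a count pair [yes, no]:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution, taking 0 * log(0) = 0."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def expected_info(branches):
    """Weighted sum of branch entropies: the info *after* splitting."""
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * entropy(b) for b in branches)

before = entropy([9, 5])                          # info([9,5])  ~ 0.940
after  = expected_info([[2, 3], [4, 0], [3, 2]])  # ~ 0.693
gain   = before - after                           # ~ 0.247 bits
```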

<h3>Problem: Highly-branching attributes </h3>

<p>Problematic: </p>

<ul>
<li>attributes with a large number of values 
(extreme case: ID code)</li>
<li>Subsets are more likely to be pure if there 
is a large number of values</li>
<li>Information gain is biased towards choosing     
attributes with a large number of values</li>

<li><p>This may result in overfitting (selection of an attribute that is non-optimal for prediction); e.g.</p>

<pre><code>    ID code    Outlook     Temp.    Humidity    Windy    Play
      A          Sunny       Hot      High        False    No
      B          Sunny       Hot      High        True     No
      C          Overcast    Hot      High        False    Yes
      D          Rainy       Mild     High        False    Yes
      E          Rainy       Cool     Normal      False    Yes
      F          Rainy       Cool     Normal      True     No
      G          Overcast    Cool     Normal      True     Yes
      H          Sunny       Mild     High        False    No
      I          Sunny       Cool     Normal      False    Yes
      J          Rainy       Mild     Normal      False    Yes
      K          Sunny       Mild     Normal      True     Yes
      L          Overcast    Mild     High        True     Yes
      M          Overcast    Hot      Normal      False    Yes
      N          Rainy       Mild     High        True     No
</code></pre></li>
<li><p>If we split on <em>ID code</em> we get 14 sub-trees (one per instance), each holding a single class;</p></li>
<li>info("ID code") = info([0,1]) + info([1,0]) + ... + info([0,1]) = 0 bits</li>
<li>So the info gain is 0.940 - 0 = 0.940 bits, the maximum possible</li>

</ul>

<p>The gain ratio</p>

<ul>
<li>Gain ratio: a modification of the information gain that reduces its bias</li>
<li>Gain ratio takes number and size of branches into account when choosing an attribute</li>
<li>It corrects the information gain by taking the     intrinsic information of a split into account</li>
<li><em>Intrinsic information</em>: entropy of the distribution of instances into branches (i.e. how much info do we need to tell which branch an instance belongs to)</li>
</ul>
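A hedged sketch of the correction, reusing the entropy arithmetic from the worked example (the 0.940 and 0.247 gains come from the weather data above; the rest is straightforward division):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# Intrinsic information is the entropy of the branch-size distribution.
split_info = entropy

# ID code: 14 branches of one instance each -> large intrinsic information,
# so the maximal gain of 0.940 bits is heavily discounted.
ratio_id      = 0.940 / split_info([1] * 14)   # 0.940 / log2(14) ~ 0.247
# Outlook: 3 branches of sizes 5, 4, 5.
ratio_outlook = 0.247 / split_info([5, 4, 5])  # 0.247 / 1.577 ~ 0.157
```

Note that on this tiny data set the ID code still outranks Outlook (roughly 0.247 vs 0.157): the gain ratio reduces the bias toward highly-branching attributes but does not eliminate it.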

<h3>Other tree learning</h3>

<h4>Regression trees</h4>

<ul>
<li>Differences from decision trees:</li>
<li>Splitting criterion: minimizing intra-subset variation</li>
<li>Pruning criterion: based on a numeric error measure</li>
<li>Leaf node predicts the average class value of the training instances reaching that node</li>

<li>Can approximate piecewise constant functions</li>
<li><p>Easy to interpret:</p>

<pre><code>    curb-weight &lt;= 2660 : 
    |   curb-weight &lt;= 2290 : 
    |   |   curb-weight &lt;= 2090 : 
    |   |   |   length &lt;= 161 : price=6220
    |   |   |   length &gt;  161 : price=7150
    |   |   curb-weight &gt;  2090 : price=8010
    |   curb-weight &gt;  2290 : 
    |   |   length &lt;= 176 : price=9680
    |   |   length &gt;  176 : 
    |   |   |   normalized-losses &lt;= 157 : price=10200
    |   |   |   normalized-losses &gt;  157 : price=15800
    curb-weight &gt;  2660 : 
    |   width &lt;= 68.9 : price=16100
    |   width &gt;  68.9 : price=25500

</code></pre></li>
<li><p>More sophisticated version: model trees</p></li>
</ul>
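The printed regression tree reads off directly as nested conditionals. A hedged transcription (thresholds and leaf prices copied verbatim from the output above; the function name is ours):

```python
def predict_price(curb_weight, length, normalized_losses, width):
    """Transcription of the regression tree printed above."""
    if curb_weight <= 2660:
        if curb_weight <= 2290:
            if curb_weight <= 2090:
                return 6220 if length <= 161 else 7150
            return 8010
        if length <= 176:
            return 9680
        return 10200 if normalized_losses <= 157 else 15800
    return 16100 if width <= 68.9 else 25500

print(predict_price(2500, 180, 160, 66))  # -> 15800
```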

<h4>Model trees</h4>

<ul>
<li><p>Regression trees with linear regression functions at each node</p>

<pre><code>    curb-weight &lt;= 2660 : 
    |   curb-weight &lt;= 2290 : LM1 
    |   curb-weight &gt;  2290 : 
    |   |   length &lt;= 176 : LM2 
    |   |   length &gt;  176 : LM3 
    curb-weight &gt;  2660 : 
    |   width &lt;= 68.9 : LM4
    |   width &gt;  68.9 : LM5
    .
    LM1:  price = -5280 + 6.68 * normalized-losses 
                        + 4.44 * curb-weight
                        + 22.1 * horsepower - 85.8 * city-mpg 
                        + 98.6 * highway-mpg
    LM2:  price = 9680
    LM3:  price = -1100 + 91 * normalized-losses
    LM4:  price = 9940 + 47.5 * horsepower
    LM5:  price = -19000 + 13.2 * curb-weight

</code></pre></li>
<li><p>Linear regression applied to instances that reach a node 
after full regression tree has been built</p></li>
<li>Only a subset of the attributes is used for LR:</li>
<li>attributes occurring in the subtree (and possibly attributes occurring on the path to the root)</li>
<li>Fast: overhead for LR not large because usually only a small subset of attributes is used in tree</li>
</ul>

<p>Building the tree</p>

<ul>
<li>Splitting criterion: standard deviation reduction across the sub-bags <em>Ti</em> produced by a candidate split</li>

<li>SDR = sd(T) - sum( |Ti| / |T| * sd(Ti) )
<ul>
<li>where |T| is the number of instances in tree T.</li>
</ul></li>
<li>Termination criteria (important when building trees for numeric prediction):</li>
<li>Standard deviation becomes smaller than a certain fraction of the sd of the full training set (e.g. 5%)</li>
<li>Too few instances remain (e.g. fewer than four)</li>
</ul>
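The SDR formula is a one-liner. The numbers below are illustrative only (they echo the leaf prices printed earlier, not any real training set):

```python
from statistics import pstdev

def sdr(parent, subsets):
    """sd(T) - sum( |Ti|/|T| * sd(Ti) ) for one candidate split."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)

# Hypothetical target values for one bag of instances:
prices = [6220, 7150, 8010, 9680, 16100, 25500]
low, high = prices[:4], prices[4:]     # e.g. a split on curb-weight
print(sdr(prices, [low, high]) > 0)    # a useful split reduces sd -> True
```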

<p>Smoothing (Model Trees)</p>

<ul>

<li>Naive method for prediction outputs value of LR for corresponding leaf node</li>
<li>Performance can be improved by smoothing predictions using internal LR models</li>
<li>Predicted value is weighted average of LR models along path from root to leaf</li>
<li>Smoothing formula: p' = (np+kq)/(n+k)</li>
<li><em>p'</em> is the value passed up the tree</li>
<li><em>p</em> is the value passed up from the node below</li>
<li><em>q</em> is the value predicted by the linear model at this node</li>

<li><em>n</em> is the number of training instances that reach the node below</li>
<li><em>k</em> is a smoothing constant; default = 2</li>
</ul>
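The smoothing formula in code, with hypothetical numbers (the prices and counts below are made up for illustration):

```python
def smooth(p, q, n, k=2):
    """p' = (n*p + k*q) / (n + k): blend the value passed up (p) with
    this node's own linear-model prediction (q)."""
    return (n * p + k * q) / (n + k)

# e.g. a leaf prediction p=10200 flows through a node whose own model
# predicts q=9680 and which covers n=8 training instances:
print(smooth(10200, 9680, 8))  # -> 10096.0
```

With few instances (small <em>n</em>) the node's own model dominates; with many, the value from below passes through almost unchanged.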

			]]>
		</description>
	</item>
	
</items>
	
	
