<?xml version="1.0"?>

<items>

	<item>
		<title>
			Welcome to Data Mining
		</title>
	        <apropos id="137" author="timm" dob="1187221456" />
       		<link>http://menzies.us/cs591o/?lecture=137</link>
       		<category>lecture</category>
       		<category>one</category>
		<description>
			<![CDATA[
			<h2> It's all about the data, baby, yeah!</h2>
<a href="http://menzies.us/cs591o/img/dianem16.jpg">
<img 
						class="rthumb250" src="http://menzies.us/cs591o/img/dianem16250.jpg"></a>
						<p>Other subjects teach algorithms. Elegant beautiful algorithms
						with provable properties.</p>
						<p>Here, we're about algorithms working on data. And real-world data
						has many, many quirks. What seem like
						good ideas, in theory, may be irrelevant in practice. </p>
						<p>So the real lessons of data mining are 
						<em>not</em> about the algorithms
						(though the algorithms are exciting to
						study).
						Rather, what data mining really reveals
						is the strange wonderful state of
						the world around us.
						<br clear=all></p>
		
			<h2>Too much "doing", not enough "learning"
					</h2>
<a href="http://menzies.us/cs591o/img/catignore.jpg"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/catignore.jpg"></a>
				<p>Sadly, all too often, organizations collect data and never analyze it. 
					<ul><li>All that data, lying buried and ignored
							<li>E.g. in five case studies with NASA data I've found clear predictors
							for software quality (effort and defects).<li> In four of those five cases,
								that data source is no longer active.
								<li>This is incredible- that data was <em>very</em> expensive to
									collect yet no one ever looked at it seriously.
									<li>Ideally, we should assess corporate data warehouses not by 
										the amount of data that goes <em>in</em> but by the insightful
										conclusions
										that come <em>out</em>.
<br clear=all></li>
							</ul></p>

	<h2> Take more risks, sooner</h2>
<img 
						class="rthumb250" src="http://menzies.us/cs591o/img/risk250.jpg">
						<p>
In this subject, you need to know your algorithms.
But, more importantly, you also need to know
how those algorithms work, in practice, on real-world data</p>
<p>
You should spend <em> as little time as possible</em>
on pretty interfaces and "power tools" that, supposedly,
will make you more productive when you (eventually) process
real data.</p>
<p>Instead, you should do whatever it takes to get the code 
munching on the data. And be prepared for surprises- surprises
that fundamentally change the task and how you tackle it.</p>
<p>So I encourage you to take risks, as early as possible. 
Get the data into the code. See what comes back.
Show that you can show the advantage
<em>and</em> disadvantages of some methods.</p>
<p>Even spectacular failures are fine- provided that you "fail" the right way
(e.g. show that standard theory would predict a success, that failure
was found even after much diligent effort on your part, that you applied
good evaluation criteria, etc). 
<br clear=all></p>
			<h2>
					Much of data "mining" is data "pre-processing" </h2>
<a href="http://menzies.us/cs591o/img/effort.gif"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/effort.gif"></a>
				<p>				<ul>
<li>In order to get you to the coal face, faster, so you can take
more risks, earlier, this subject will teach you lots of
scripting tricks.
Much of data mining is really data pre-processing. 
The methods and algorithms are important
(e.g. the <a
href="http://www.cs.queensu.ca/home/mcconell/WekaPart2.html">WEKA</a> toolkit),
but students also need to learn the scripting skills required for the pre- and post-processing.
<br clear=all></li>
</ul></p>

			<h2>
Bias makes us blind, bias lets us see</h2>
<a href="http://menzies.us/cs591o/img/monk_no_evil.jpg"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/monk_no_evil.jpg"></a>
<p>
<ul>
<li>The output of a data miner is always biased by the data selected
for the learning, the learning method applied, etc etc. </li><li> They <em>must</em>
be biased since, otherwise, there would be no way to decide what
bits are  most important and which bits can be ignored. </li>
<li>Paradoxically,
bias  blinds us to some things while letting us see (predict) the
future.  </li>
<li>So all theories are biased (but only some  admit it).  But
we should always be aware of the domain-specific nature of the
conclusions drawn from a learner.
<br clear=all></li>
</ul>
</p>

			<h2>

					No idea is absolutely "right", but many more are useless.</h2>
<a href="http://menzies.us/cs591o/img/truefasle.jpg"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/truefasle.jpg"></a>
				<p>Different learners use different biases to learn their theories.
					<p>So the <em>same</em> data can generate <em>different conclusions</em>, depending on the bias of the
						learner.

					<ul><li>So are we just making stuff up?
<li>Knowledge relativism? All ideas are valid? </li>
<li>No!</li>
<li>Sure, one data set supports many theories.
<ul>
<li>But there are many many more theories that are unsupported.</li>
</ul></li>

<li>While no idea is  <em>right</em> ...
<ul>
<li>... some things are <em>useful</em> (perform well on test data) ...</li>
<li>... and many many many more ideas are <em>useless</em> 
<ul><li><a href="http://en.wikipedia.org/wiki/Not_even_wrong">"This idea isn't even wrong"</a> -- Wolfgang Pauli </ul> </li>
</ul>
											<li>Sherlock Holmes was  nearly right: <ul><li>"Eliminate all other factors, and the one which 
												remains must be the truth."</ul>
												<li>Just need to add the plural:
													<ul>
													<li>"Eliminate all other factors, and the one<u>S</u> which 
														remain must be the truth<u>S</u>."
														<br clear=all></li> </ul></ul></p>


			<h2>

					Dumb apes get by.</h2>
<a href="http://menzies.us/cs591o/img/Neandertal_misconception.gif"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/Neandertal_misconception.gif"></a>
				<p>
<ul>
<li>Here's a puzzle. 
People aren't real bright (just look at how badly they <a href="http://groups.google.com/group/comp.risks">write software</a>).
Yet, somehow,
people have built the most amazing things, like
the international
airline network and the Internet. How? </li>
<li>Maybe the real world
is not as complex as our egos imagine. And seemingly naive
probes tell us  most of what can be found using supposedly
more sophisticated methods.
<br clear=all></li> </ul></p>
			
<h2>
You are responsible</h2>
<a href="http://menzies.us/cs591o/img/responsble.jpg"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/responsble.jpg"></a>
<p>
<ul>
<li>Very successful data miners can be surprisingly simple. This raises the question:
"why aren't they used more often so we can control the world around us, better?"</li>
<li>The answer is that sometimes the world is very, very complicated and no single simple solution
will suffice. But often, the world is 
a surprisingly simple place (otherwise, dumb apes would not get by)
which means, in turn, that we <em>should</em> be able to predict and control and select
the future that we want.  </li>
<li>So the curse of data mining is that once you learn how to do it, you become responsible for the future of the human race.
Are you ready for that?
<br clear=all></li> </ul></p>

			]]>
		</description>
	</item>

	<item>
		<title>
Bias and Ethics

		</title>
	        <apropos id="150" author="timm" dob="1187446229" />
       		<link>http://menzies.us/cs591o/?main=150</link>
       		<category>lecture</category>
       		<category>two</category>
		<description>
			<![CDATA[
			
			<h2>Introduction</h2>
<p>Here are some problems:</p>
<p>
<ul>
	<li>What eggs to select for IVF? 
		<a href="http://menzies.us/cs591o/?doc=?122">(p2)</a>   </li>
<li>What will software cost to develop?</li>
<li>What diseases does a patient have?
		<a href="http://menzies.us/cs591o/?doc=122">(p25)</a>   </li>
	<li>Which loan applications to fund <a href="http://menzies.us/cs591o/?doc=122">(p22)</a>?</li>
<li>What houses will have the best resale value?</li>
<li>Which parts of the program need more inspection?</li>
<li>What products are best to sell to what markets? 
		<a href="http://menzies.us/cs591o/?doc=122">(p26)</a>   </li>
	<li>Which cows to keep and which to send to the abattoir? 
		<a href="http://menzies.us/cs591o/?doc=122">(p2)</a>   </li>
	<li>How to teach a satellite to distinguish between cloud shadows and oil spills 
		<a href="http://menzies.us/cs591o/?doc=122">(p23)</a>?   </li>
	<li>How much electricity will be needed in two hours (i.e. which coal-powered generators to fire up)? 
		<a href="http://menzies.us/cs591o/?doc=122">(p24)</a>?   </li>
</ul>
</p>
<p>In the modern era, these problems have data mining solutions.</p>

<ul>
<li>Lots of data: the world's databases are doubling in size every 20 months;
<ul>
<li>Internet, Radio Frequency Identification (RFID) tracking, on-line shopping (patterns of sales tracked at Amazon)</li>
</ul></li>
</ul>

<p><img class=rthumb src="http://menzies.us/cs591o/img/spiderman.jpg">
But, as Spider-Man says, with great power comes great responsibility; e.g.</p>
<p>
<ul>
<li>Can consumers audit how their loan applications are analyzed?</li>
<li>Is it right that our learners tell us which embryos are <em>not</em> implanted?</li>
<li>How can a Hindu protest at vital ethical decisions (like deciding which cows are <br />
<a href="http://www.beliefnet.com/story/82/story_8229_1.html">Aghanya--that  which must not be slaughtered</a>)
being made by a machine?</li>
</ul>
</p>
<p>But before we can discuss the ethical implications of data mining, we must first understand both the power and the limitations of the technology. </p>

<p>So we'll get to the ethics after discussing the technology.
<br clear=all></p>

<h2>The Technology</h2>

<h3>Data (Arff format)</h3>

<pre>
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
...
</pre>
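<p>ARFF files in this symbolic form are simple enough to read by hand. Below is a minimal sketch of a reader (it handles only the subset shown above: symbolic attributes, comma-separated rows; it is not WEKA's own loader):</p>

```python
def parse_arff(text):
    """Minimal reader for symbolic ARFF: returns (attribute names, data rows)."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):    # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith('@attribute'):
            attributes.append(line.split()[1])  # the name is the second token
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            rows.append([v.strip() for v in line.split(',')])
    return attributes, rows

weather = """\
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
"""
```

<p>Here, parse_arff(weather) returns the five attribute names plus the four data rows shown.</p>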

<h3>Summarized by J48 (decision tree learner)</h3>
	<ul>
		<li> Find the attribute value that best divides up the data
			<li>Split the data on that value
				<li>Recurse on each subset
					<li>Each split is a <em>node</em> in a <em>decision tree</em>.</ul>

<pre>
outlook = sunny
|   humidity = high: no 
|   humidity = normal: yes 
outlook = overcast: yes 
outlook = rainy
|   windy = TRUE: no 
|   windy = FALSE: yes 
</pre>
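<p>A decision tree like this is just nested if-then-else. The learned theory above, transcribed into code (an illustration only; temperature never appears in the tree, so it is not needed):</p>

```python
def play(outlook, humidity, windy):
    """The J48 tree above, written as nested conditionals."""
    if outlook == 'sunny':
        return 'no' if humidity == 'high' else 'yes'
    if outlook == 'overcast':
        return 'yes'
    # otherwise, outlook == 'rainy'
    return 'no' if windy == 'TRUE' else 'yes'
```

<p>For example, play('sunny', 'high', 'FALSE') returns 'no', matching the first training row.</p>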

<p>How good is this theory?
	<ul><li>10 times, <ul><li>divide data into 90% train, 10% test, <li>learn on train, <li>apply on test.</ul><li>Report average
							results across the 10 repeats</ul></p>
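<p>A sketch of that evaluation loop, where learn and apply_ are hypothetical stand-ins for any learner and its prediction function (the demo "learner" below just predicts the majority class):</p>

```python
import random
from collections import Counter

def ten_way(rows, learn, apply_):
    """10 times: hold out a random 10% for testing, learn on the other 90%,
    then return the mean accuracy across the 10 repeats."""
    scores = []
    for _ in range(10):
        shuffled = random.sample(rows, len(rows))   # shuffled copy
        cut = max(1, len(shuffled) // 10)
        test, train = shuffled[:cut], shuffled[cut:]
        theory = learn(train)
        hits = sum(apply_(theory, row) == row[-1] for row in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

# demo: a trivial learner that always predicts the majority class
rows = [('a', 'yes')] * 9 + [('b', 'no')]
majority = lambda train: Counter(r[-1] for r in train).most_common(1)[0][0]
score = ten_way(rows, majority, lambda theory, row: theory)
```

<p>Note that the split is random, so repeated runs report slightly different averages.</p>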

<pre>
 a b   <-- classified as
 5 4 | a = yes
 3 2 | b = no
 </pre>

 <p>To get some summary statistics out of this:
	 <ul><li><pre>
   a   b   <-- classified as
 A=5 C=4 | a = yes
 B=3 D=2 | b = no
 </pre>
 <li>Accuracy = (A+D) / (A+B+C+D)
	 <li>Recall= prob(detection) = pd = D / (B+D)
		 <li>Precision = prec = D/(D+C)
			 <li>prob(false Alarm) = pf = C / (A+C)
				 <li>F = harmonic mean of prec,pd = 2*pd*prec/(pd+prec)
				 </ul>
				 <pre>
 TP Rate   FP Rate   Precision   Recall  F-Measure   Class
   0.556     0.6        0.625     0.556     0.588    yes
     0.4     0.444      0.333     0.4       0.364    no
</pre>
</p>
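<p>Checking those formulas against the confusion matrix above (A=5, B=3, C=4, D=2, with "no" as the target class):</p>

```python
# cells of the confusion matrix above, with "no" as the target class
A, B, C, D = 5, 3, 4, 2

accuracy = (A + D) / (A + B + C + D)
pd   = D / (B + D)                  # recall = prob(detection)
prec = D / (D + C)                  # precision
pf   = C / (A + C)                  # prob(false alarm)
f    = 2 * pd * prec / (pd + prec)  # harmonic mean of pd and prec
```

<p>Rounded, these give pd=0.4, pf=0.444, prec=0.333, F=0.364, and accuracy 0.5: the "no" row of the table above.</p>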
<h3>Summarized by Naive Bayes</h3>
<p>A Naive Bayes classifier collects internal statistics; it does not generate an explicit explanation:</p>

<pre>
Class yes: Prior probability = 0.63
outlook:  Discrete Estimator. Counts =  3 5 4  (Total = 12)
temperature:  Discrete Estimator. Counts =  3 5 4  (Total = 12)
humidity:  Discrete Estimator. Counts =  4 7  (Total = 11)
windy:  Discrete Estimator. Counts =  4 7  (Total = 11)
</pre><pre>
Class no: Prior probability = 0.38
outlook:  Discrete Estimator. Counts =  4 1 3  (Total = 8)
temperature:  Discrete Estimator. Counts =  3 3 2  (Total = 8)
humidity:  Discrete Estimator. Counts =  5 2  (Total = 7)
windy:  Discrete Estimator. Counts =  4 3  (Total = 7)
</pre>

<p>Results of the Naive Bayes classifier:</p>
	<ul><li>10 times, <ul><li>divide data into 90% train, 10% test, <li>learn on train, <li>apply on test.</ul><li>Report average
							results across the 10 repeats</ul>

<pre>
  a b   <-- classified as
  7 2 | a = yes
  4 1 | b = no

 TP Rate   FP Rate   Precision   Recall  F-Measure   Class
   0.778     0.8        0.636     0.778     0.7      yes
     0.2     0.222      0.333     0.2       0.25     no
</pre>
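<p>A sketch of how those internal statistics turn into a prediction: multiply each class's prior by the frequency of each attribute value within that class, then pick the class with the highest score. The counts below are transcribed from the WEKA output above, assuming the values are listed in the order of the @attribute declarations:</p>

```python
# Laplace-smoothed counts transcribed from the WEKA output above;
# value order assumed to follow the @attribute declarations.
COUNTS = {
    'yes': {'outlook':     {'sunny': 3, 'overcast': 5, 'rainy': 4},
            'temperature': {'hot': 3, 'mild': 5, 'cool': 4},
            'humidity':    {'high': 4, 'normal': 7},
            'windy':       {'TRUE': 4, 'FALSE': 7}},
    'no':  {'outlook':     {'sunny': 4, 'overcast': 1, 'rainy': 3},
            'temperature': {'hot': 3, 'mild': 3, 'cool': 2},
            'humidity':    {'high': 5, 'normal': 2},
            'windy':       {'TRUE': 4, 'FALSE': 3}},
}
PRIORS = {'yes': 0.63, 'no': 0.38}

def classify(outlook, temperature, humidity, windy):
    """Return the class with the highest prior * product-of-frequencies."""
    example = {'outlook': outlook, 'temperature': temperature,
               'humidity': humidity, 'windy': windy}
    scores = {}
    for klass, tables in COUNTS.items():
        score = PRIORS[klass]
        for attribute, value in example.items():
            counts = tables[attribute]
            score *= counts[value] / sum(counts.values())
        scores[klass] = score
    return max(scores, key=scores.get)
```

<p>For example, classify('sunny', 'cool', 'high', 'TRUE') returns 'no', while classify('overcast', 'hot', 'high', 'FALSE') returns 'yes'.</p>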

<h2>Definitions</h2>

<p>In general, these are hard questions:

<ul>
<li>What does learning mean? </li>
<li>What does intelligence mean?</li>
<li>Does a slipper <em>learn</em> the shape of your foot?</li>
</ul>
</p>

<p>But a simpler kind of "learning" is much easier to understand
<pre>
                 PERFORMANCE systems
                 (e.g. Naive Bayes)

            .------------->-----------.
            |                         |
            |                         v
TRAINING    |                     PREDICTIONS
data -->  LEARNING             (on test data)
            |                         ^
            |                         |
            .----> GENERALIZATION -->-.

               EXPLANATION systems
                  (e.g. J48)
</pre>
(This is the kind of learning explored in this class.)</p>
<p>Explanation systems build some human-readable intermediary, 
 then use that intermediary to make predictions. </p>

<p>Performance  systems  don't bother with explaining themselves, they just make predictions.</p>

<p>In the above, decision trees are an explanation system and Naive Bayes is a performance system.</p>

<h2>The Overfitting Problem</h2>

<p>Fixation on irrelevant details &rArr; overfitting. </p>
<p>Cause: <em>noise</em>; i.e. spurious signals not connected
	to the output class</p>
<p>Symptoms: 

<ul>
<li>Overly complex theory</li>
<li>Poorer performance on future examples</li>
</ul>
</p>
<p>Solutions:

<ul>
<li>Testing: assess a learned theory via its performance on data not seen during training (as done above).</li>
<li>Pruning: <em>after</em> a theory is built, try throwing bits away; e.g. prune a decision tree back from the leaves, see how that changes performance.
	<pre>
confidence limit for pruning = 0.1 (very selective)

c0.1 

Horsepower <= 82: _0 (9.0)
Horsepower > 82
|   Horsepower <= 190: _20 (70.0/4.0)
|   Horsepower > 190: _40 (14.0/5.0)

  a  b  c  d   <-- classified as
  9  1  0  0 |  a = _0
  0 65  5  0 |  b = _20
  0  7  5  0 |  c = _40
  0  0  1  0 |  d = _60

TP Rate   FP Rate   Precision   Recall  F-Measure   Class
  0.9       0          1         0.9       0.947    _0
  0.929     0.348      0.89      0.929     0.909    _20
  0.417     0.074      0.455     0.417     0.435    _40
  0         0          0         0         0        _60
</pre><pre>
confidence limit for pruning = 0.25 (default, less selective)

c0.25 

Horsepower <= 82: _0 (9.0)
Horsepower > 82
|   Horsepower <= 190: _20 (70.0/4.0)
|   Horsepower > 190
|   |   Drive_train_type = 1
|   |   |   Highway_MPG <= 26: _40 (4.0/1.0)
|   |   |   Highway_MPG > 26: _20 (2.0)
|   |   Drive_train_type = 0: _40 (7.0/1.0)
|   |   Drive_train_type = 2: _20 (1.0)

  a  b  c  d   <-- classified as
  9  1  0  0 |  a = _0
  0 65  5  0 |  b = _20
  0  7  5  0 |  c = _40
  0  0  1  0 |  d = _60

TP Rate   FP Rate   Precision   Recall  F-Measure   Class
  0.9       0          1         0.9       0.947    _0
  0.929     0.348      0.89      0.929     0.909    _20
  0.417     0.074      0.455     0.417     0.435    _40
  0         0          0         0         0        _60
</pre>
</li>
</ul>
</p>

<h2>Generalization as Search</h2>

<p>Given a set of possible concept descriptors</p>

<ul>
<li>e.g. age  > 6; wealthy
...and a set of combination methods</li>
<li>e.g. and, or, not, if-then-else, if-unless, etc</li>
</ul>

<p>How to search the space of possible descriptors*combinations?</p>

<p>Problem:  the space of descriptors*combinations is usually impossibly large for a complete search</p>

<p>Solution: heuristic search (cut corners) </p>

<ul>
<li>e.g. 1R: only ever learn decision trees of depth one 
<ul>
<li>i.e. always try combinations on a single attribute</li>
<li>i.e. never try <em>combinations</em> of attributes</li>
</ul></li>
<li>BTW, works ok, but usually beaten by other methods</li>
</ul>
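<p>1R is small enough to write out in a few lines. Below is a toy version run on the four weather rows shown earlier (an illustration, not WEKA's OneR):</p>

```python
from collections import Counter

# the four weather.symbolic rows shown earlier:
# (outlook, temperature, humidity, windy) -> play
ROWS = [(('sunny',    'hot',  'high', 'FALSE'), 'no'),
        (('sunny',    'hot',  'high', 'TRUE'),  'no'),
        (('overcast', 'hot',  'high', 'FALSE'), 'yes'),
        (('rainy',    'mild', 'high', 'FALSE'), 'yes')]
NAMES = ['outlook', 'temperature', 'humidity', 'windy']

def one_r(rows, names):
    """For each attribute, map each value to its majority class,
    then keep the attribute whose rule makes the fewest errors."""
    best = None
    for i, name in enumerate(names):
        tallies = {}                          # value -> Counter of classes
        for values, klass in rows:
            tallies.setdefault(values[i], Counter())[klass] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in tallies.items()}
        errors = sum(klass != rule[values[i]] for values, klass in rows)
        if best is None or errors < best[0]:
            best = (errors, name, rule)
    return best

errors, attribute, rule = one_r(ROWS, NAMES)
```

<p>On these four rows, 1R picks outlook (sunny&rarr;no, overcast&rarr;yes, rainy&rarr;yes) with zero errors.</p>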

<h2>Different kinds of bias</h2>

<h3>Search bias</h3>

<p>When growing a theory, what descriptors*combinations do you look at <em>first</em>?</p>

<ul>
<li>Depending on your <em>goals</em>, your search will find <em>different</em> conclusions.
<ul>
<li>e.g. are you seeking the theory with <em>highest</em> performance on new test data, or the 
<em>smallest</em> theory, or a combination of both (see minimum description length [p179][wt])?</li>
<li>e.g. decision trees: what goes into the root (thereby affecting all sub-trees)?</li>
<li>e.g. greedy search (only consider the next best thing) vs &epsilon;-greedy search (consider anything within &epsilon; of the current best thing).</li>
</ul></li>
</ul>

<h3>Overfitting Avoidance Bias</h3>

<p>When shrinking a theory, how do you control pruning?</p>

<ul>
<li>What trade-offs do you allow for size vs performance?</li>
<li>What do you try pruning first, then second, then...
<ul>
<li>e.g. if throwing away a decision tree sub-branch reduces performance by 5% but decision tree size by 50%, is that a <em>good</em> prune?</li>
</ul></li>
</ul>

<h3>Sample Bias</h3>

<p>Learn from available examples, not the space of all possible examples.</p>

<ul>
<li>e.g. this data only comes from the east coast and they do things differently over there.</li>
</ul>

<h3>Language Bias</h3>

<p>What is the space of legal combinations and legal descriptors?</p>

<ul>
<li>e.g. many learners can't understand numbers
<ul>
<li>then they can never learn <em>day > Thursday</em>, and must report a clumsier theory: <em>day=Friday OR day=Saturday OR day=Sunday</em></li>
</ul></li>
<li>e.g. Classification learners find connections from many independent attributes to
a single dependent attribute (the class)
<ul>
<li>While association rule learners find connections between sets of attributes.</li>
<li>So classification learners can't predict for <em>sets</em> of attributes.</li>
</ul></li>
</ul>

<h3>Evaluation bias</h3>
<p>In the above examples, we assessed our theories by mean performance,
	the size of the learned theory, and the explainability of the learned 
	theory.</p>
<p>Different measures yielded different conclusions about the <em>best</em>
	learner.</p>
<p>Even widely-used measures are surprisingly problematic.
</p>
<p>
<ul>
	<li>e.g. 10 hobos with no jobs enter Bill Gates's office.
		<li>Q: What is their <em>mean</em> income? A: ((10*0)+<a href="http://evan.snew.com/ecgi/gates.cgi?180825988598757340145090831950708#Worth">$24 billion</a>)/11<br>
				= $2.2 billion.
				<li>Note that this number is uninformative regarding the social status of both the hobos and Bill Gates.
			</ul></p>
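<p>The arithmetic in code, using the $24 billion figure quoted above; the median is shown as a contrast that the mean cannot give:</p>

```python
from statistics import mean, median

# ten hobos with zero income, plus Bill Gates
incomes = [0] * 10 + [24_000_000_000]

avg = mean(incomes)     # about $2.2 billion: describes nobody in the room
mid = median(incomes)   # $0: the typical occupant
```

<p>The median (zero) says far more about the typical occupant of the room than the mean does.</p>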
			<p>There are many ways to evaluate the performance 
				of a learner, but we'll need some <a href="http://menzies.us/cs591o/?doc=152">theory</a> first.</p>
					<p>In the meantime:
						<ul><li>We'll keep using means (simplest); 
								<li>But we'll know we can do better;
									<li>And we'll be interested in how <a href="http://menzies.us/cs591o/?doc=151">changing the 
											evaluation criteria</a> changes our opinion about the learner.
								</ul></p>
<h3>Many, many biases</h3>

<p>From <a href="http://menzies.us/cs591o/?doc=153">"Separate-and-conquer rule learning"</a></p>

<p><center> 
		<a href="http://menzies.us/cs591o/img/bias.png"><img width=500 src="http://menzies.us/cs591o/img/bias.png"></a>
</center></p>

<h2>Ethics</h2>

<p>Issue: different learners use different heuristics</p>

<ul>
<li>Consequently, they learn different theories.</li>
</ul>

<p>So the <em>same</em> data can generate <em>different theories</em>, depending on the search bias (the heuristic search method) used to explore the space of descriptors*combinations</p>

<ul>
<li>i.e. there is no one <em>best</em> theory.</li>
</ul>

<p>More generally, if you use induction on historical data to predict the future, then:</p>

<ul>
<li>There is no such thing as an "unbiased opinion". </li>
<li>All theories are biased (but only some  admit it).</li>
<li>They <em>must</em> be biased else there is no way to decide what bits are  most important and which bits can be ignored.</li>
<li>So bias  blinds us and, paradoxically, lets us see (predict) the future.</li>
</ul>

<p>Anyway, you have an ethical dilemma</p>

<ul>
<li>You can generate different theories: which do you report?</li>
<li>Who gets hurt or helped by the different theories?</li>
</ul>

<p>Do the <em>users</em> of your theories fully appreciate the bias and limitations of the learning methods used to generate this theory?</p>

<ul>
<li>Can your users audit the bias?</li>
<li>Are you using an explanation system where the learned theory can be browsed, or just a performance system that only knows how to <em>make</em> conclusions and not how to <em>explain</em> them? </li>
</ul>

<p>Also, <em>should</em> the users audit the bias? </p>

<ul>
<li>Should you be able to access and audit the theory that assigns your credit rating? </li>
<li>Should we tell spammers what kinds of emails will get rejected? </li>
<li>Should we tell bombers how we will screen airline passengers for explosives?</li>
</ul>

<h2>How to  handle bias</h2>

<p>Data miners: you have responsibilities to your users:</p>

<ul>
	<li>Explore the space of possible theories: any common conclusions?</li>
<li>Ensure reproducibility; document the biases;</li>
<li>And always remember : above all, do no harm.</li>
</ul>

<p>Users of theories from data miners</p>

<ul>
<li>DO NOT be a passive consumer of other people's conclusions;
<ul>
<li>ALWAYS be an active reviewer of ideas.</li>
</ul></li>
<li>Require:
<ul>
<li>Access to the data used for training;</li>
<li>Access to the software used for learning;</li>
</ul></li>
</ul>

<p>And if our users demand the <em>right</em> theory, we can't give it to them (the bias problem).</p>

<ul>
<li>Knowledge relativism? All ideas are valid? </li>
<li>No!</li>
<li>Sure, one data set supports many theories.
<ul>
<li>But there are many many more theories that are unsupported.</li>
</ul></li>
</ul>

<p>So while no idea is  <em>right</em> ...</p>

<ul>
<li>... some things are <em>useful</em> (perform well on test data) ...</li>
<li>... and many many many more ideas are <em>useless</em> 
<ul><li><a href="http://en.wikipedia.org/wiki/Not_even_wrong">"This idea isn't even wrong"</a> -- Wolfgang Pauli</ul>  </li>
</ul>

			]]>
		</description>
	</item>



</items>
