<?xml version="1.0"?>

<items>

	<item>
		<title>
			Welcome to Data Mining
		</title>
	        <apropos id="137" author="timm" dob="1187221456" />
       		<link>http://menzies.us/cs591o/?lecture=137</link>
       		<category>lecture</category>
       		<category>one</category>
		<description>
			<![CDATA[
			<h2> It's all about the data, baby, yeah!</h2>
<a href="http://menzies.us/cs591o/img/dianem16.jpg">
<img 
						class="rthumb250" src="http://menzies.us/cs591o/img/dianem16250.jpg"></a>
						<p>Other subjects teach algorithms. Elegant beautiful algorithms
						with provable properties.</p>
						<p>Here, we're about algorithms working on data. And real-world data
						has many, many quirks. What seem like
						good ideas, in theory, may be irrelevant in practice. </p>
						<p>So the real lessons of data mining are 
						<em>not</em> about the algorithms
						(though the algorithms are exciting to
						study).
						Rather, what data mining really reveals
						is the strange wonderful state of
						the world around us.
						<br clear=all></p>
		
			<h2>Too much "doing", not enough "learning"
					</h2>
<a href="http://menzies.us/cs591o/img/catignore.jpg"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/catignore.jpg"></a>
				<p>Sadly, all too often, organizations collect data and never analyze it. 
					<ul><li>All that data, lying buried and ignored
							<li>E.g. in five case studies with NASA data I've found clear predictors
							for software quality (effort and defects).<li> In four of those five cases,
								that data source is no longer active.
								<li>This is incredible- that data was <em>very</em> expensive to
									collect yet no one ever looked at it seriously.
									<li>Ideally, we should assess corporate data warehouses not by 
										the amount of data that goes <em>in</em> but by the insightful
										conclusions
										that come <em>out</em>.
<br clear=all></li>
							</ul></p>

	<h2> Take more risks, sooner</h2>
<img 
						class="rthumb250" src="http://menzies.us/cs591o/img/risk250.jpg">
						<p>
In this subject, you need to know your algorithms.
But, more importantly, you also need to know
how those algorithms work, in practice, on real-world data</p>
<p>
You should spend <em> as little time as possible</em>
on pretty interfaces and "power tools" that, supposedly,
will make you more productive when you (eventually) process
real data.</p>
<p>Instead, you should do whatever it takes to get the code 
munching on the data. And be prepared for surprises- surprises
that fundamentally change the task and how you tackle it.</p>
<p>So I encourage you to take risks, as early as possible. 
Get the data into the code. See what comes back.
Show that you can show the advantage
<em>and</em> disadvantages of some methods.</p>
<p>Even spectacular failures are fine- provided that you "fail" the right way
(e.g. show that standard theory would predict a success, that failure
was found even after much diligent effort on your part, that you applied
good evaluation criteria, etc). 
<br clear=all></p>
			<h2>
					Much of data "mining" is data "pre-processing" </h2>
<a href="http://menzies.us/cs591o/img/effort.gif"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/effort.gif"></a>
				<p>				<ul>
<li>In order to get you to the coal face, faster, so you can take
more risks, earlier, this subject will teach you lots of
scripting tricks.
Much of data mining is really data pre-processing. 
The methods and algorithms are important
(e.g. the <a
href="http://www.cs.queensu.ca/home/mcconell/WekaPart2.html">WEKA</a> toolkit),
but students also need to learn the scripting skills required for the pre- and post-processing.
<br clear=all></li>
</ul></p>

			<h2>
Bias makes us blind, bias lets us see</h2>
<a href="http://menzies.us/cs591o/img/monk_no_evil.jpg"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/monk_no_evil.jpg"></a>
<p>
<ul>
<li>The output of a data miner is always biased by the data selected
for the learning, the learning method applied, etc etc. </li><li> They <em>must</em>
be biased since, otherwise, there would be no way to decide what
bits are  most important and which bits can be ignored. </li>
<li>Paradoxically,
bias  blinds us to some things while letting us see (predict) the
future.  </li>
<li>So all theories are biased (but only some  admit it).  But
we should always be aware of the domain-specific nature of the
conclusions drawn from a learner.
<br clear=all></li>
</ul>
</p>

			<h2>

					No idea is absolutely "right", but many more are useless.</h2>
<a href="http://menzies.us/cs591o/img/truefasle.jpg"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/truefasle.jpg"></a>
				<p>Different learners use different biases to learn their theories.
					<p>So the <em>same</em> data can generate <em>different conclusions</em>, depending on the bias of the
						learner.

					<ul><li>So are we just making stuff up?
<li>Knowledge relativism? All ideas are valid? </li>
<li>No!</li>
<li>Sure, one data set supports many theories.
<ul>
<li>But there are many many more theories that are unsupported.</li>
</ul></li>

<li>While no idea is  <em>right</em> ...
<ul>
<li>... some things are <em>useful</em> (perform well on test data) ...</li>
<li>... and many many many more ideas are <em>useless</em> 
<ul><li><a href="http://en.wikipedia.org/wiki/Not_even_wrong">"This idea isn't even wrong"</a> -- Wolfgang Pauli </ul> </li>
</ul>
											<li>Sherlock Holmes was  nearly right: <ul><li>"Eliminate all other factors, and the one which 
												remains must be the truth."</ul>
												<li>Just need to add the plural:
													<ul>
													<li>"Eliminate all other factors, and the one<u>S</u> which 
														remain must be the truth<u>S</u>."
														<br clear=all></li> </ul></ul></p>


			<h2>

					Dumb apes get by.</h2>
<a href="http://menzies.us/cs591o/img/Neandertal_misconception.gif"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/Neandertal_misconception.gif"></a>
				<p>
<ul>
<li>Here's a puzzle. 
People aren't real bright (just look at how badly they <a href="http://groups.google.com/group/comp.risks">write software</a>).
Yet, somehow,
people have built the most amazing things, like
the international
airline network and the Internet. How? </li>
<li>Maybe the real world
is not as complex as our egos imagine. And seemingly naive
probes tell us  most of what can be found using supposedly
more sophisticated methods.
<br clear=all></li> </ul></p>
			
<h2>
You are responsible</h2>
<a href="http://menzies.us/cs591o/img/responsble.jpg"><img 
						class="rthumb250" src="http://menzies.us/cs591o/img/responsble.jpg"></a>
<p>
<ul>
<li>Very successful data miners can be surprisingly simple. This raises the question:
"why aren't they used more often so we can control the world around us, better?"</li>
<li>The answer is that sometimes the world is very, very complicated and no single simple solution
will suffice. But often, the world is 
a surprisingly simple place (otherwise, dumb apes would not get by)
which means, in turn, that we <em>should</em> be able to predict and control and select
the future that we want.  </li>
<li>So the curse of data mining is that once you learn how to do it, you become responsible for the future of the human race.
Are you ready for that?
<br clear=all></li> </ul></p>

			]]>
		</description>
	</item>

	<item>
		<title>
Bias and Ethics

		</title>
	        <apropos id="150" author="timm" dob="1187446229" />
       		<link>http://menzies.us/cs591o/?main=150</link>
       		<category>lecture</category>
       		<category>two</category>
		<description>
			<![CDATA[
			
			<h2>Introduction</h2>
<p>Here are some problems:</p>
<p>
<ul>
	<li>What eggs to select for IVF? 
		<a href="http://menzies.us/cs591o/?doc=?122">(p2)</a>   </li>
<li>What will software cost to develop?</li>
<li>What diseases does a patient have?
		<a href="http://menzies.us/cs591o/?doc=122">(p25)</a>   </li>
	<li>Which loan applications to fund <a href="http://menzies.us/cs591o/?doc=122">(p22)</a>?</li>
<li>What houses will have the best resale value?</li>
<li>Which parts of the program need more inspection?</li>
<li>What products are best to sell to what markets? 
		<a href="http://menzies.us/cs591o/?doc=122">(p26)</a>   </li>
	<li>Which cows to keep and which to send to the abattoir? 
		<a href="http://menzies.us/cs591o/?doc=122">(p2)</a>   </li>
	<li>How to teach a satellite to distinguish between cloud shadows and oil spills 
		<a href="http://menzies.us/cs591o/?doc=122">(p23)</a>?   </li>
	<li>How much electricity will be needed in two hours (i.e. which coal-powered generators to fire up)? 
		<a href="http://menzies.us/cs591o/?doc=122">(p24)</a>?   </li>
</ul>
</p>
<p>In the modern era, these problems have data mining solutions.</p>

<ul>
<li>Lots of data: the world's databases are doubling in size every 20 months;
<ul>
<li>Internet, Radio Frequency Identification (RFID) tracking, on-line shopping (patterns of sales tracked at Amazon)</li>
</ul></li>
</ul>

<p><img class=rthumb src="http://menzies.us/cs591o/img/spiderman.jpg">
But, as Spider-Man says, with great power comes great responsibility; e.g.</p>
<p>
<ul>
<li>Can consumers audit how their loan applications are analyzed?</li>
<li>Is it right that our learners tell us which embryos are <em>not</em> implanted?</li>
<li>How can a Hindu protest at vital ethical decisions (like deciding which cows are <br />
<a href="http://www.beliefnet.com/story/82/story_8229_1.html">Aghanya--that  which must not be slaughtered</a>)
being made by a machine?</li>
</ul>
</p>
<p>But before we can discuss the ethical implications of data mining, we must first understand both the power and the limitations of the technology. </p>

<p>So we'll get to the ethics after discussing the technology.
<br clear=all></p>

<h2>The Technology</h2>

<h3>Data (Arff format)</h3>

<pre>
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
...
</pre>
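<p>ARFF files in this symbolic form are simple enough to read by hand. Below is a minimal sketch of a reader (it handles only the subset shown above: symbolic attributes, comma-separated rows; it is not WEKA's own loader):</p>

```python
def parse_arff(text):
    """Minimal reader for symbolic ARFF: returns (attribute names, data rows)."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):    # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith('@attribute'):
            attributes.append(line.split()[1])  # the name is the second token
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            rows.append([v.strip() for v in line.split(',')])
    return attributes, rows

weather = """\
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
"""
```

<p>Here, parse_arff(weather) returns the five attribute names plus the four data rows shown.</p>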

<h3>Summarized by J48 (decision tree learner)</h3>
	<ul>
		<li> Find the attribute value that best divides up the data
			<li>Split the data on that value
				<li>Recurse on each subset
					<li>Each split is a <em>node</em> in a <em>decision tree</em>.</ul>

<pre>
outlook = sunny
|   humidity = high: no 
|   humidity = normal: yes 
outlook = overcast: yes 
outlook = rainy
|   windy = TRUE: no 
|   windy = FALSE: yes 
</pre>
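<p>A decision tree like this is just nested if-then-else. The learned theory above, transcribed into code (an illustration only; temperature never appears in the tree, so it is not needed):</p>

```python
def play(outlook, humidity, windy):
    """The J48 tree above, written as nested conditionals."""
    if outlook == 'sunny':
        return 'no' if humidity == 'high' else 'yes'
    if outlook == 'overcast':
        return 'yes'
    # otherwise, outlook == 'rainy'
    return 'no' if windy == 'TRUE' else 'yes'
```

<p>For example, play('sunny', 'high', 'FALSE') returns 'no', matching the first training row.</p>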

<p>How good is this theory?
	<ul><li>10 times, <ul><li>divide data into 90% train, 10% test, <li>learn on train, <li>apply on test.</ul><li>Report average
							results across the 10 repeats</ul></p>
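<p>A sketch of that evaluation loop, where learn and apply_ are hypothetical stand-ins for any learner and its prediction function (the demo "learner" below just predicts the majority class):</p>

```python
import random
from collections import Counter

def ten_way(rows, learn, apply_):
    """10 times: hold out a random 10% for testing, learn on the other 90%,
    then return the mean accuracy across the 10 repeats."""
    scores = []
    for _ in range(10):
        shuffled = random.sample(rows, len(rows))   # shuffled copy
        cut = max(1, len(shuffled) // 10)
        test, train = shuffled[:cut], shuffled[cut:]
        theory = learn(train)
        hits = sum(apply_(theory, row) == row[-1] for row in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

# demo: a trivial learner that always predicts the majority class
rows = [('a', 'yes')] * 9 + [('b', 'no')]
majority = lambda train: Counter(r[-1] for r in train).most_common(1)[0][0]
score = ten_way(rows, majority, lambda theory, row: theory)
```

<p>Note that the split is random, so repeated runs report slightly different averages.</p>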

<pre>
 a b   <-- classified as
 5 4 | a = yes
 3 2 | b = no
 </pre>

 <p>To get some summary statistics out of this:
	 <ul><li><pre>
   a   b   <-- classified as
 A=5 C=4 | a = yes
 B=3 D=2 | b = no
 </pre>
 <li>Accuracy = (A+D) / (A+B+C+D)
	 <li>Recall= prob(detection) = pd = D / (B+D)
		 <li>Precision = prec = D/(D+C)
			 <li>prob(false Alarm) = pf = C / (A+C)
				 <li>F = harmonic mean of prec,pd = 2*pd*prec/(pd+prec)
				 </ul>
				 <pre>
 TP Rate   FP Rate   Precision   Recall  F-Measure   Class
   0.556     0.6        0.625     0.556     0.588    yes
     0.4     0.444      0.333     0.4       0.364    no
</pre>
</p>
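<p>Checking those formulas against the confusion matrix above (A=5, B=3, C=4, D=2, with "no" as the target class):</p>

```python
# cells of the confusion matrix above, with "no" as the target class
A, B, C, D = 5, 3, 4, 2

accuracy = (A + D) / (A + B + C + D)
pd   = D / (B + D)                  # recall = prob(detection)
prec = D / (D + C)                  # precision
pf   = C / (A + C)                  # prob(false alarm)
f    = 2 * pd * prec / (pd + prec)  # harmonic mean of pd and prec
```

<p>Rounded, these give pd=0.4, pf=0.444, prec=0.333, F=0.364, and accuracy 0.5: the "no" row of the table above.</p>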
<h3>Summarized by Naive Bayes</h3>
<p>A Naive Bayes classifier collects internal statistics; it does not generate an explicit explanation:</p>

<pre>
Class yes: Prior probability = 0.63
outlook:  Discrete Estimator. Counts =  3 5 4  (Total = 12)
temperature:  Discrete Estimator. Counts =  3 5 4  (Total = 12)
humidity:  Discrete Estimator. Counts =  4 7  (Total = 11)
windy:  Discrete Estimator. Counts =  4 7  (Total = 11)
</pre><pre>
Class no: Prior probability = 0.38
outlook:  Discrete Estimator. Counts =  4 1 3  (Total = 8)
temperature:  Discrete Estimator. Counts =  3 3 2  (Total = 8)
humidity:  Discrete Estimator. Counts =  5 2  (Total = 7)
windy:  Discrete Estimator. Counts =  4 3  (Total = 7)
</pre>

<p>Results of the Naive Bayes classifier:</p>
	<ul><li>10 times, <ul><li>divide data into 90% train, 10% test, <li>learn on train, <li>apply on test.</ul><li>Report average
							results across the 10 repeats</ul>

<pre>
  a b   <-- classified as
  7 2 | a = yes
  4 1 | b = no

 TP Rate   FP Rate   Precision   Recall  F-Measure   Class
   0.778     0.8        0.636     0.778     0.7      yes
     0.2     0.222      0.333     0.2       0.25     no
</pre>
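<p>A sketch of how those internal statistics turn into a prediction: multiply each class's prior by the frequency of each attribute value within that class, then pick the class with the highest score. The counts below are transcribed from the WEKA output above, assuming the values are listed in the order of the @attribute declarations:</p>

```python
# Laplace-smoothed counts transcribed from the WEKA output above;
# value order assumed to follow the @attribute declarations.
COUNTS = {
    'yes': {'outlook':     {'sunny': 3, 'overcast': 5, 'rainy': 4},
            'temperature': {'hot': 3, 'mild': 5, 'cool': 4},
            'humidity':    {'high': 4, 'normal': 7},
            'windy':       {'TRUE': 4, 'FALSE': 7}},
    'no':  {'outlook':     {'sunny': 4, 'overcast': 1, 'rainy': 3},
            'temperature': {'hot': 3, 'mild': 3, 'cool': 2},
            'humidity':    {'high': 5, 'normal': 2},
            'windy':       {'TRUE': 4, 'FALSE': 3}},
}
PRIORS = {'yes': 0.63, 'no': 0.38}

def classify(outlook, temperature, humidity, windy):
    """Return the class with the highest prior * product-of-frequencies."""
    example = {'outlook': outlook, 'temperature': temperature,
               'humidity': humidity, 'windy': windy}
    scores = {}
    for klass, tables in COUNTS.items():
        score = PRIORS[klass]
        for attribute, value in example.items():
            counts = tables[attribute]
            score *= counts[value] / sum(counts.values())
        scores[klass] = score
    return max(scores, key=scores.get)
```

<p>For example, classify('sunny', 'cool', 'high', 'TRUE') returns 'no', while classify('overcast', 'hot', 'high', 'FALSE') returns 'yes'.</p>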

<h2>Definitions</h2>

<p>In general, these are hard questions:

<ul>
<li>What does learning mean? </li>
<li>What does intelligence mean?</li>
<li>Does a slipper <em>learn</em> the shape of your foot?</li>
</ul>
</p>

<p>But a simpler kind of "learning" is much easier to understand
<pre>
                 PERFORMANCE systems
                 (e.g. Naive Bayes)

            .------------->-----------.
            |                         |
            |                         v
TRAINING    |                     PREDICTIONS
data -->  LEARNING             (on test data)
            |                         ^
            |                         |
            .----> GENERALIZATION -->-.

               EXPLANATION systems
                  (e.g. J48)
</pre>
(This is the kind of learning explored in this class.)</p>
<p>Explanation systems build some human-readable intermediary, 
 then use that intermediary to make predictions. </p>

<p>Performance  systems  don't bother with explaining themselves, they just make predictions.</p>

<p>In the above, decision trees are an explanation system and Naive Bayes is a performance system.</p>

<h2>The Overfitting Problem</h2>

<p>Fixation on irrelevant details &rArr; overfitting. </p>
<p>Cause: <em>noise</em>; i.e. spurious signals not connected
	to the output class</p>
<p>Symptoms: 

<ul>
<li>Overly complex theory</li>
<li>Poorer performance on future examples</li>
</ul>
</p>
<p>Solutions:

<ul>
<li>Testing: assess a learned theory via its performance on data not seen during training (as done above).</li>
<li>Pruning: <em>after</em> a theory is built, try throwing bits away; e.g. prune a decision tree back from the leaves, see how that changes performance.
	<pre>
confidence limit for pruning = 0.1 (very selective)

c0.1 

Horsepower <= 82: _0 (9.0)
Horsepower > 82
|   Horsepower <= 190: _20 (70.0/4.0)
|   Horsepower > 190: _40 (14.0/5.0)

  a  b  c  d   <-- classified as
  9  1  0  0 |  a = _0
  0 65  5  0 |  b = _20
  0  7  5  0 |  c = _40
  0  0  1  0 |  d = _60

TP Rate   FP Rate   Precision   Recall  F-Measure   Class
  0.9       0          1         0.9       0.947    _0
  0.929     0.348      0.89      0.929     0.909    _20
  0.417     0.074      0.455     0.417     0.435    _40
  0         0          0         0         0        _60
</pre><pre>
confidence limit for pruning = 0.25 (default, less selective)

c0.25 

Horsepower <= 82: _0 (9.0)
Horsepower > 82
|   Horsepower <= 190: _20 (70.0/4.0)
|   Horsepower > 190
|   |   Drive_train_type = 1
|   |   |   Highway_MPG <= 26: _40 (4.0/1.0)
|   |   |   Highway_MPG > 26: _20 (2.0)
|   |   Drive_train_type = 0: _40 (7.0/1.0)
|   |   Drive_train_type = 2: _20 (1.0)

  a  b  c  d   <-- classified as
  9  1  0  0 |  a = _0
  0 65  5  0 |  b = _20
  0  7  5  0 |  c = _40
  0  0  1  0 |  d = _60

TP Rate   FP Rate   Precision   Recall  F-Measure   Class
  0.9       0          1         0.9       0.947    _0
  0.929     0.348      0.89      0.929     0.909    _20
  0.417     0.074      0.455     0.417     0.435    _40
  0         0          0         0         0        _60
</pre>
</li>
</ul>
</p>

<h2>Generalization as Search</h2>

<p>Given a set of possible concept descriptors</p>

<ul>
<li>e.g. age  > 6; wealthy
...and a set of combination methods</li>
<li>e.g. and, or, not, if-then-else, if-unless, etc</li>
</ul>

<p>How to search the space of possible descriptors*combinations?</p>

<p>Problem:  the space of descriptors*combinations is usually impossibly large for a complete search</p>

<p>Solution: heuristic search (cut corners) </p>

<ul>
<li>e.g. 1R: only ever learn decision trees of depth one 
<ul>
<li>i.e. always try combinations on a single attribute</li>
<li>i.e. never try <em>combinations</em> of attributes</li>
</ul></li>
<li>BTW, works ok, but usually beaten by other methods</li>
</ul>
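<p>1R is small enough to write out in a few lines. Below is a toy version run on the four weather rows shown earlier (an illustration, not WEKA's OneR):</p>

```python
from collections import Counter

# the four weather.symbolic rows shown earlier:
# (outlook, temperature, humidity, windy) -> play
ROWS = [(('sunny',    'hot',  'high', 'FALSE'), 'no'),
        (('sunny',    'hot',  'high', 'TRUE'),  'no'),
        (('overcast', 'hot',  'high', 'FALSE'), 'yes'),
        (('rainy',    'mild', 'high', 'FALSE'), 'yes')]
NAMES = ['outlook', 'temperature', 'humidity', 'windy']

def one_r(rows, names):
    """For each attribute, map each value to its majority class,
    then keep the attribute whose rule makes the fewest errors."""
    best = None
    for i, name in enumerate(names):
        tallies = {}                          # value -> Counter of classes
        for values, klass in rows:
            tallies.setdefault(values[i], Counter())[klass] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in tallies.items()}
        errors = sum(klass != rule[values[i]] for values, klass in rows)
        if best is None or errors < best[0]:
            best = (errors, name, rule)
    return best

errors, attribute, rule = one_r(ROWS, NAMES)
```

<p>On these four rows, 1R picks outlook (sunny&rarr;no, overcast&rarr;yes, rainy&rarr;yes) with zero errors.</p>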

<h2>Different kinds of bias</h2>

<h3>Search bias</h3>

<p>When growing a theory, what descriptors*combinations do you look at <em>first</em>?</p>

<ul>
<li>Depending on your <em>goals</em>, your search will find <em>different</em> conclusions.
<ul>
<li>e.g. are you seeking the theory with <em>highest</em> performance on new test data, or the 
<em>smallest</em> theory, or a combination of both (see minimum description length [p179][wt])?</li>
<li>e.g. decision trees: what goes into the root (thereby affecting all sub-trees)?</li>
<li>e.g. greedy search (only consider the next best thing) vs &epsilon;-greedy search (consider anything within &epsilon; of the current best thing).</li>
</ul></li>
</ul>

<h3>Overfitting Avoidance Bias</h3>

<p>When shrinking a theory, how do you control pruning?</p>

<ul>
<li>What trade-offs do you allow for size vs performance?</li>
<li>What do you try pruning first, then second, then...
<ul>
<li>e.g. if throwing away a decision tree sub-branch reduces performance by 5% but decision tree size by 50%, is that a <em>good</em> prune?</li>
</ul></li>
</ul>

<h3>Sample Bias</h3>

<p>Learn from available examples, not the space of all possible examples.</p>

<ul>
<li>e.g. this data only comes from the east coast and they do things differently over there.</li>
</ul>

<h3>Language Bias</h3>

<p>What is the space of legal combinations and legal descriptors?</p>

<ul>
<li>e.g. many learners can't understand numbers
<ul>
<li>then they can never learn <em>day > Thursday</em>, and must report a clumsier theory: <em>day=Friday OR day=Saturday OR day=Sunday</em></li>
</ul></li>
<li>e.g. Classification learners find connections from many independent attributes to
a single dependent attribute (the class)
<ul>
<li>While association rule learners find connections between sets of attributes.</li>
<li>So classification learners can't predict for <em>sets</em> of attributes.</li>
</ul></li>
</ul>

<h3>Evaluation bias</h3>
<p>In the above examples, we assessed our theories by mean performance,
	the size of the learned theory, and the explainability of the learned 
	theory.</p>
<p>Different measures yielded different conclusions about the <em>best</em>
	learner.</p>
<p>Even widely-used measures are surprisingly problematic.
</p>
<p>
<ul>
	<li>e.g. 10 hobos with no jobs enter Bill Gates's office.
		<li>Q: What is their <em>mean</em> income? A: ((10*0)+<a href="http://evan.snew.com/ecgi/gates.cgi?180825988598757340145090831950708#Worth">$24 billion</a>)/11<br>
				= $2.2 billion.
				<li>Note that this number is uninformative regarding the social status of both the hobos and Bill Gates.
			</ul></p>
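<p>The arithmetic in code, using the $24 billion figure quoted above; the median is shown as a contrast that the mean cannot give:</p>

```python
from statistics import mean, median

# ten hobos with zero income, plus Bill Gates
incomes = [0] * 10 + [24_000_000_000]

avg = mean(incomes)     # about $2.2 billion: describes nobody in the room
mid = median(incomes)   # $0: the typical occupant
```

<p>The median (zero) says far more about the typical occupant of the room than the mean does.</p>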
			<p>There are many ways to evaluate the performance 
				of a learner, but we'll need some <a href="http://menzies.us/cs591o/?doc=152">theory</a> first.</p>
					<p>In the meantime:
						<ul><li>We'll keep using means (simplest); 
								<li>But we'll know we can do better;
									<li>And we'll be interested in how <a href="http://menzies.us/cs591o/?doc=151">changing the 
											evaluation criteria</a> changes our opinion about the learner.
								</ul></p>
<h3>Many, many biases</h3>

<p>From <a href="http://menzies.us/cs591o/?doc=153">"Separate-and-conquer rule learning"</a></p>

<p><center> 
		<a href="http://menzies.us/cs591o/img/bias.png"><img width=500 src="http://menzies.us/cs591o/img/bias.png"></a>
</center></p>

<h2>Ethics</h2>

<p>Issue: different learners use different heuristics</p>

<ul>
<li>Consequently, they learn different theories.</li>
</ul>

<p>So the <em>same</em> data can generate <em>different theories</em>, depending on the search bias (the heuristic search method) used to explore the space of descriptors*combinations</p>

<ul>
<li>i.e. there is no one <em>best</em> theory.</li>
</ul>

<p>More generally, if you use induction on historical data to predict the future, then:</p>

<ul>
<li>There is no such thing as an "unbiased opinion". </li>
<li>All theories are biased (but only some  admit it).</li>
<li>They <em>must</em> be biased else there is no way to decide what bits are  most important and which bits can be ignored.</li>
<li>So bias  blinds us and, paradoxically, lets us see (predict) the future.</li>
</ul>

<p>Anyway, you have an ethical dilemma</p>

<ul>
<li>You can generate different theories: which do you report?</li>
<li>Who gets hurt or helped by the different theories?</li>
</ul>

<p>Do the <em>users</em> of your theories fully appreciate the bias and limitations of the learning methods used to generate this theory?</p>

<ul>
<li>Can your users audit the bias?</li>
<li>Are you using an explanation system where the learned theory can be browsed, or just a performance system that only knows how to <em>make</em> conclusions and not how to <em>explain</em> them? </li>
</ul>

<p>Also, <em>should</em> the users audit the bias? </p>

<ul>
<li>Should you be able to access and audit the theory that assigns your credit rating? </li>
<li>Should we tell spammers what kinds of emails will get rejected? </li>
<li>Should we tell bombers how we will screen airline passengers for explosives?</li>
</ul>

<h2>How to  handle bias</h2>

<p>Data miners: you have responsibilities to your users:</p>

<ul>
	<li>Explore the space of possible theories: any common conclusions?</li>
<li>Ensure reproducibility; document the biases;</li>
<li>And always remember : above all, do no harm.</li>
</ul>

<p>Users of theories from data miners</p>

<ul>
<li>DO NOT be a passive consumer of other people's conclusions;
<ul>
<li>ALWAYS be an active reviewer of ideas.</li>
</ul></li>
<li>Require:
<ul>
<li>Access to the data used for training;</li>
<li>Access to the software used for learning;</li>
</ul></li>
</ul>

<p>And if our users demand the <em>right</em> theory, we can't give it to them (the bias problem).</p>

<ul>
<li>Knowledge relativism? All ideas are valid? </li>
<li>No!</li>
<li>Sure, one data set supports many theories.
<ul>
<li>But there are many many more theories that are unsupported.</li>
</ul></li>
</ul>

<p>So while no idea is  <em>right</em> ...</p>

<ul>
<li>... some things are <em>useful</em> (perform well on test data) ...</li>
<li>... and many many many more ideas are <em>useless</em> 
<ul><li><a href="http://en.wikipedia.org/wiki/Not_even_wrong">"This idea isn't even wrong"</a> -- Wolfgang Pauli</ul>  </li>
</ul>

			]]>
		</description>
	</item>



</items>
