<?xml version="1.0"?>

<items>

	
	<item>
		<title>
			All Projects
		</title>
	        <apropos id="131" author="timm" dob="1187220141" />
       		<link>http://menzies.us/cs591o/?main=131</link>
       		<category>projects</category>
		<description>
			<![CDATA[
			<p>Students are to work in groups of four.</p>
			<p>For each project, the group is to
				hand in one set of answers bound together in some binder.<ul><li>The submission must start
				with a front page listing the names,
				emails, and student IDs of all your group members.
				<li>Mark one group member as "secretary" (all emails to the group will be sent to the secretary)</ul>
			</p>
			<p>In addition, projects 3 and 4 will include a verbal presentation. <ul><li>Slides from that presentation
				are also to be submitted.
				<li>Note that the slides are a quick 20-minute briefing about the longer report (so the report
				will expand, by over 100%, on the material in the slides).</ul></p>
			]]>
		</description>
	</item>

	<item>
		<title>
			Project 1: priming the pump
		</title>
	        <apropos id="132" author="timm" dob="1187220365" />
       		<link>http://menzies.us/cs591o/?main=132</link>
       		<category>projects</category>
		<description>
			<![CDATA[
				<p>
<em>Ready...  </em><ul>
<li>Install some code using the instructions in 
<a href="http://unbox.org/wisp/trunk/our/INSTALL">http://unbox.org/wisp/trunk/our/INSTALL</a>
<li>To check if the install works, see the example in
<a href="http://unbox.org/wisp/trunk/our/INSTALL.log">http://unbox.org/wisp/trunk/our/INSTALL.log</a>.
</ul></p>
<p><em>...Steady... </em><ul>
<li>Read the file $HOME/opt/oursh/shrc and answer the questions Q1..Q20.
Hand in those answers, plus the outputs from demo1 .. demo19</li>
<li>Read the file $HOME/opt/ourgawk/gawrkrc and answer the questions Q1..Q18.
Hand in those answers, plus the outputs from demo1 .. demo18</li>
</ul>
</p>
<p><em>Go! </em><ul>
<li>Read the file $HOME/opt/ourmine/minerc and hand in the outputs to
demo3..demo19.</li>
<li>Carefully document demo3.. demo19. What does each
do? What does each show us?
</li>
</ul>
</p>
<p><em>Bonus marks:</em>
Let's find out how little data we need to learn an effective theory.
Modify "someArff" such that (a) it splits a data file of N instances
into B bins of 50 instances each, and (b) the training file
comprises the first I bins and the test file contains
just bin I+1. Then generate win/loss/tie tables
comparing the performance of learning from 50, 100, 150, 200, ... instances.
</p>
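As a rough illustration of the bonus task's bookkeeping, here is a standalone Python sketch. The real tool is "someArff" (a course script); the function names and the 50-instance bin size below are just this sketch's assumptions:

```python
# Sketch: split N instances into bins of 50, then train on the
# first I bins and test on bin I+1.  "someArff" is the course tool;
# this Python stand-in only illustrates the split-and-grow idea.

def bins_of(instances, size=50):
    """Split instances into consecutive bins of `size` (a trailing partial bin is dropped)."""
    return [instances[i:i + size] for i in range(0, len(instances) - size + 1, size)]

def train_test_splits(instances, size=50):
    """Yield (train, test) pairs: train = the first I bins, test = bin I+1."""
    bs = bins_of(instances, size)
    for i in range(1, len(bs)):
        train = [row for b in bs[:i] for row in b]
        yield train, bs[i]

# Each (train, test) pair feeds one learn/assess run; the win/loss/tie
# tables then compare learning from 50, 100, 150, ... instances.
for train, test in train_test_splits(list(range(200))):
    print(len(train), len(test))   # prints: 50 50 / 100 50 / 150 50
```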
			]]>
		</description>
	</item>

	<item>
		<title>
				Project 2: standing tall
		</title>
	        <apropos id="133" author="timm" dob="1187220501" />
       		<link>http://menzies.us/cs591o/?main=133</link>
       		<category>projects</category>
		<description>
			<![CDATA[
			<p>
			For several classic data mining papers:
			<ol>
				<li>Succinctly summarize the paper
					<ol type="a">
						<li>Show that you understand the point of the paper
						<li>Describe each of the  data mining  technologies discussed in the paper
						<li>Describe the conclusions reached by the paper
						<li>Write down finding(s) that would refute the paper.
					</ol>
				<li>Reproduce/refute/improve the experiments of the paper
					<ol type="a">
						<li>Assess the results using <code>winLossTie</code>,
						<li>As far as possible, use the same data sets described in the paper
						<li>Also, try to use data sets <em>other than</em> those described in the paper
						<li>Comment on whether or not, after running your own experiment,
						    you want to modify the conclusions of the paper
					</ol>
			</ol>
			</p>
			<p>Hand in <em>one</em> bound report with the following four (or five) sections.</p>
			<h2>For full marks... </h2>
			<p>Process these papers:
				<ul>
					<li>Holte's <a href="http://menzies.us/cs591o/?doc=136">ONER vs J48 experiments</a>:
						see <tt>oner</tt> &amp; <tt>j48</tt> in <tt>minerc</tt> (hint: stop at page 6);
					<li>Dougherty et al.'s <a href="http://menzies.us/cs591o/?doc=135">nbins vs binLogging vs FayyadIrani</a> experiment
						(note: you'll need to script up nbins and binLogging;
						for FayyadIrani, see <code>discretizeViaFayyadIrani</code>);
					<li>Cohen's <a href="http://menzies.us/cs591o/?doc=189">J48 vs RIPPER experiments</a>
						(hint: see <tt>j48</tt> &amp; <tt>jrip</tt> in <tt>minerc</tt>);
					<li>Menzies et al.'s comparison of
						<a href="http://menzies.us/cs591o/?doc=191">PRISM vs NB vs other methods</a>.
						Hint: you will need<br>
						<em>weka.classifiers.rules.Prism -p 0 -t $1 -T $2</em>
				</ul>
			</p>
			<h2>For bonus marks...</h2>
			<p>Process this paper:
				<ul>
					<li><a href="http://menzies.us/cs591o/?doc=192">Hall &amp; Holmes</a> study on feature subset selection
			<li>Warning: this is a LOT of work.
			<li>To make this study practical, please restrict your FSS
			to infogain, relief, cfs, and wrapper.
			<li>And rather than step through all attributes, try 100%,
			50%, 25%, 12.5%, etc. (once the attributes are sorted by the FSS).

				</ul>
			</p>
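The halving schedule in that last bullet can be sketched as follows. Assume some FSS (infogain, relief, cfs, or wrapper) has already ranked the attributes best-first; the attribute names in this Python sketch are invented:

```python
# Evaluate only the top 100%, 50%, 25%, 12.5%, ... of a best-first
# ranked attribute list, rather than stepping through every prefix.
import math

def halving_subsets(ranked_attributes):
    """Yield successively halved prefixes of a best-first ranked attribute list."""
    n = len(ranked_attributes)
    while n >= 1:
        yield ranked_attributes[:n]
        if n == 1:
            break
        n = math.ceil(n / 2)

ranked = ["a%d" % i for i in range(1, 9)]           # 8 hypothetical attributes
print([len(s) for s in halving_subsets(ranked)])    # prints [8, 4, 2, 1]
```

Each yielded subset is then one learner run, so an N-attribute study costs about log2(N) runs instead of N.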
		]]>
		</description>
	</item>


	<item>
		<title>
			Project 3: how are you going?
		</title>
	        <apropos id="138" author="timm" dob="1187227931" />
       		<link>http://menzies.us/cs591o/?main=138</link>
       		<category>projects</category>
		<description>
			<![CDATA[
			<p>
				Project three is a progress report on <a href="http://menzies.us/cs591o?main=140">project four</a>.
				</p>
<p>
				To get 15 marks, students need to show that they have reached at least level 3 of the 
				<a href="http://menzies.us/cs591o/?lecture=139">Data Maturity Model</a>.
				</p>
			]]>
		</description>
	</item>


	<item>
		<title>
			Project 4: strutting your stuff
		</title>
	        <apropos id="140" author="timm" dob="1187229321" />
       		<link>http://menzies.us/cs591o/?main=140</link>
       		<category>projects</category>
		<description>
			<![CDATA[
			<p>
This project demonstrates the students' understanding of data mining methods.
</p><p>
This project will be marked via an end-of-semester presentation and a written report due one week after the presentation.
</p><p>
For this project, students can pick one of two goals:
<ol>
	<li>
(For students with an A-grade average): do something AMAZING! with data mining.
Even spectacular failures are allowed here, just as long as you fail in the right way (i.e. important questions are explored or recognized).
IMPORTANT: students performing this project must demonstrate good progress (in their project 3 submission)
or they will be switched to the other project.
Your AMAZING topic could be:
<ul>
<li>Something wonderful of your own idea;
<li>Update TAR3 to choose rules that maximize not just lift, but lift*support.
<li>Repeat the STATLOG experiments (a big shoot-em-up between N learners)
<a href="http://mlearn.ics.uci.edu/databases/statlog/">http://mlearn.ics.uci.edu/databases/statlog/</a>
<br>
<a href="http://www.amsta.leeds.ac.uk/~charles/statlog/">http://www.amsta.leeds.ac.uk/~charles/statlog/</a>.
<li>Machine learning and visualization: try mapping N dimensions
into 3, then visualize the results. Compare two methods: PCA and
<a href="http://citeseer.ist.psu.edu/faloutsos95fastmap.html">fastmap</a>.
Fastmap is meant to be faster and more scalable
than PCA. Can you test that?
<li>
Port TAR3 to Java and get it working in WEKA.
<li>Stack nearest neighbor with some learner; i.e., to classify
a test instance, find its nearest neighbors in the training set,
and build a theory just on those nearest neighbors.
<li>
Build a lift-based rule covering algorithm.
<ul>
<li>
Discretize all the data.
<li>
Score the classes by some utility function F. 
<li>
While training data remains, do:
<ul>
<li>
Score each  instance by its class score.
<li>
Sort the T training instances by F,
then split them into two classes: the BEST 20% and the REST 80%.
<li>
Create two  frequency tables
for BEST and REST. 
<li>
Sort each attribute range by <em>a<sup>2</sup>/(a+b)</em> 
where <em>a=BEST.a/T</em>
and <em>b=REST.b/T</em>. 
<li>
Rule = Rule + attributeRangeFirstInTheSort 
<li>
Reject all training data that contradicts the rule.
</ul>
<li>
Apply Rule to the test set.
</ul>
</ul>
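The lift-based covering loop in the last option above might be sketched as below. This is only a Python stand-in built on several assumptions: the data is already discretized, each instance is a ({attribute: value}, score) pair with the class utility F already applied, the BEST split is fixed at 20%, and the `grow_rule` helper is invented for this sketch:

```python
# Sketch of the lift-based rule covering loop described above.
from collections import Counter

def grow_rule(data, rounds=3, best_frac=0.2):
    """Greedily grow one conjunctive rule from the highest-lift attribute ranges.
    `data` is a list of ({attribute: discrete_value}, class_score) pairs;
    discretization and class scoring by the utility F are assumed done already."""
    rule = {}  # attribute -> required value
    for _ in range(rounds):
        if not data:
            break
        T = len(data)
        # Sort by score, then split into the BEST 20% and the REST 80%.
        data = sorted(data, key=lambda inst: inst[1], reverse=True)
        cut = max(1, int(T * best_frac))
        best_rows, rest_rows = data[:cut], data[cut:]
        # Frequency tables of attribute ranges seen in BEST and REST.
        best = Counter((k, v) for row, _ in best_rows for k, v in row.items())
        rest = Counter((k, v) for row, _ in rest_rows for k, v in row.items())
        # Rank each range by a^2/(a+b), where a = BEST freq / T, b = REST freq / T.
        def lift(r):
            a, b = best[r] / T, rest[r] / T
            return a * a / (a + b)
        attr, val = max(best, key=lift)
        rule[attr] = val
        # Reject all training data that contradicts the rule so far.
        data = [inst for inst in data
                if all(inst[0].get(k) == v for k, v in rule.items())]
    return rule

# Toy usage: "x=good" dominates the high-scoring rows.
data = [({"x": "good"}, 1.0) for _ in range(20)] + \
       [({"x": "bad"},  0.0) for _ in range(80)]
print(grow_rule(data, rounds=1))   # prints {'x': 'good'}
```

The final step in the pseudocode, applying the rule to the test set, is then just the same `all(...)` membership check run over the test instances.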
<li>
	OR
conduct a "good" analysis of one or more of the data sets found at
<ul>
	<li>
The data sets in <a href="http://promisedata.org/repository">PROMISE repository of 
	software engineering data sets</a>;
<li>
The <a href="http://mlearn.ics.uci.edu/MLRepository.html">UCI machine learning repository</a>;
<li>
	The <a href="http://kdd.ics.uci.edu/">Knowledge Discovery in Databases Archive</a> (caution: some of these data sets are too large to load into WEKA)
	<li>
		The <a href="http://www.mlnet.org/resources/datasets-index.html">ML net</a>
		resources page (caution: contains some dead links)
		<li>
			The <a href="http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/">Time Series Library</a> (caution: many of these time series are single attribute with no target class).
		</ul>
		Here "good" means at least <a href="http://menzies.us/cs591o/?lectures=139">DMM level 4</a> (and to get bonus marks,
			students must achieve <a href="http://menzies.us/cs591o/?lectures=139">DMM Level 5</a>).
	</ol></p>
			]]>
		</description>
	</item>




</items>


