<?xml version="1.0"?>

<items>

	
	<item>
		<title>
			All Projects
		</title>
	        <apropos id="131" author="timm" dob="1187220141" />
       		<link>http://menzies.us/cs591o/?main=131</link>
       		<category>projects</category>
		<description>
			<![CDATA[
			<p>Students are to work in groups of four.</p>
			<p>For each project, the group is to
				hand in one set of answers bound together in some binder.<ul><li>The submission must start
				with a front page listing the names,
				emails, and student IDs of all your group members.
				<li>Mark one group member as "secretary" (all emails to the group will be sent to the secretary)</ul>
			</p>
			<p>In addition, projects 3 and 4 will include a verbal presentation. <ul><li>Slides from that presentation
				are also to be submitted.
				<li>Note that the slides are a quick 20-minute briefing about the longer report (so the report
				will expand, by over 100%, on the material in the slides).</ul></p>
			]]>
		</description>
	</item>

	<item>
		<title>
			Project 1: priming the pump
		</title>
	        <apropos id="132" author="timm" dob="1187220365" />
       		<link>http://menzies.us/cs591o/?main=132</link>
       		<category>projects</category>
		<description>
			<![CDATA[
				<p>
<em>Ready...  </em><ul>
<li>Install some code using the instructions in 
<a href="http://unbox.org/wisp/trunk/our/INSTALL">http://unbox.org/wisp/trunk/our/INSTALL</a>
<li>To check if the install works, see the example in
<a href="http://unbox.org/wisp/trunk/our/INSTALL.log">http://unbox.org/wisp/trunk/our/INSTALL.log</a>.
</ul></p>
<p><em>...Steady... </em><ul>
<li>Read the file $HOME/opt/oursh/shrc and answer the questions Q1..Q20.
Hand in those answers, plus the outputs from demo1 .. demo19</li>
<li>Read the file $HOME/opt/ourgawk/gawrkrc and answer the questions Q1..Q18.
Hand in those answers, plus the outputs from demo1 .. demo18</li>
</ul>
</p>
<p><em>Go! </em><ul>
<li>Read the file $HOME/opt/ourmine/minerc and hand in the outputs to
demo3..demo19.</li>
<li>Carefully document demo3.. demo19. What does each
do? What does each show us?
</li>
</ul>
</p>
<p><em>Bonus marks:</em>
Let's find out how little data we need to learn an effective theory.
Modify "someArff" such that (a) it splits a data file of N instances
into B bins of 50 instances each, and (b) the training file
comprises the first I bins and the test file contains
just bin I+1. Then generate win/loss/tie tables
comparing the performance of learning from 50, 100, 150, 200, ... instances.
</p>
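As a rough illustration of the bonus task's bookkeeping, here is a standalone Python sketch. The real tool is "someArff" (a course script); the function names and the 50-instance bin size below are just this sketch's assumptions:

```python
# Sketch: split N instances into bins of 50, then train on the
# first I bins and test on bin I+1.  "someArff" is the course tool;
# this Python stand-in only illustrates the split-and-grow idea.

def bins_of(instances, size=50):
    """Split instances into consecutive bins of `size` (a trailing partial bin is dropped)."""
    return [instances[i:i + size] for i in range(0, len(instances) - size + 1, size)]

def train_test_splits(instances, size=50):
    """Yield (train, test) pairs: train = the first I bins, test = bin I+1."""
    bs = bins_of(instances, size)
    for i in range(1, len(bs)):
        train = [row for b in bs[:i] for row in b]
        yield train, bs[i]

# Each (train, test) pair feeds one learn/assess run; the win/loss/tie
# tables then compare learning from 50, 100, 150, ... instances.
for train, test in train_test_splits(list(range(200))):
    print(len(train), len(test))   # prints: 50 50 / 100 50 / 150 50
```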
			]]>
		</description>
	</item>

	<item>
		<title>
				Project 2: standing tall
		</title>
	        <apropos id="133" author="timm" dob="1187220501" />
       		<link>http://menzies.us/cs591o/?main=133</link>
       		<category>projects</category>
		<description>
			<![CDATA[
			<p>
			For several classic data mining papers:
			<ol>
				<li>Succinctly summarize the paper
					<ol type="a">
						<li>Show that you understand the point of the paper
						<li>Describe each of the  data mining  technologies discussed in the paper
						<li>Describe the conclusions reached by the paper
						<li>Write down finding(s) that would refute the paper.
					</ol>
				<li>Reproduce/refute/improve the experiments of the paper
					<ol type="a">
						<li>Assess the results using <code>winLossTie</code>,
						<li>As far as possible, use the same data sets described in the paper
						<li>Also, try to use data sets <em>other than</em> those described in the paper
						<li>Comment on whether or not, after running your own experiment,
						    you want to modify the conclusions of the paper
					</ol>
			</ol>
			</p>
			<p>Hand in <em>one</em> bound report with the following four (or five) sections.</p>
			<h2>For full marks... </h2>
			<p>Process these papers:
				<ul>
					<li>Holte's <a href="http://menzies.us/cs591o/?doc=136">ONER vs J48 experiments</a>:
						see <tt>oner</tt> &amp; <tt>j48</tt> in <tt>minerc</tt> (hint: stop at page 6);
					<li>Dougherty et al.'s <a href="http://menzies.us/cs591o/?doc=135">nbins vs binLogging vs FayyadIrani</a> experiment
						(note: you'll need to script up nbins and binLogging;
						for FayyadIrani, see <code>discretizeViaFayyadIrani</code>);
					<li>Cohen's <a href="http://menzies.us/cs591o/?doc=189">J48 vs RIPPER experiments</a>
						(hint: see <tt>j48</tt> &amp; <tt>jrip</tt> in <tt>minerc</tt>);
					<li>Menzies et al.'s comparison of
						<a href="http://menzies.us/cs591o/?doc=191">PRISM vs NB vs other methods</a>.
						Hint: you will need<br>
						<em>weka.classifiers.rules.Prism -p 0 -t $1 -T $2</em>
				</ul>
			</p>
			<h2>For bonus marks...</h2>
			<p>Process this paper:
				<ul>
					<li><a href="http://menzies.us/cs591o/?doc=192">Hall &amp; Holmes</a> study on feature subset selection
			<li>Warning: this is a LOT of work.
			<li>To make this study practical, please restrict your FSS
			to infogain, relief, cfs, and wrapper.
			<li>And rather than step through all attributes, try 100%,
			50%, 25%, 12.5%, etc. (once the attributes are sorted by the FSS).

				</ul>
			</p>
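The halving schedule in that last bullet can be sketched as follows. Assume some FSS (infogain, relief, cfs, or wrapper) has already ranked the attributes best-first; the attribute names in this Python sketch are invented:

```python
# Evaluate only the top 100%, 50%, 25%, 12.5%, ... of a best-first
# ranked attribute list, rather than stepping through every prefix.
import math

def halving_subsets(ranked_attributes):
    """Yield successively halved prefixes of a best-first ranked attribute list."""
    n = len(ranked_attributes)
    while n >= 1:
        yield ranked_attributes[:n]
        if n == 1:
            break
        n = math.ceil(n / 2)

ranked = ["a%d" % i for i in range(1, 9)]           # 8 hypothetical attributes
print([len(s) for s in halving_subsets(ranked)])    # prints [8, 4, 2, 1]
```

Each yielded subset is then one learner run, so an N-attribute study costs about log2(N) runs instead of N.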
		]]>
		</description>
	</item>


	<item>
		<title>
			Project 3: how are you going?
		</title>
	        <apropos id="138" author="timm" dob="1187227931" />
       		<link>http://menzies.us/cs591o/?main=138</link>
       		<category>projects</category>
		<description>
			<![CDATA[
			<p>
				Project three is a progress report on <a href="http://menzies.us/cs591o?main=140">project four</a>.
				</p>
<p>
				To get 15 marks, students need to show that they have reached at least level 3 of the 
				<a href="http://menzies.us/cs591o/?lecture=139">Data Maturity Model</a>.
				</p>
			]]>
		</description>
	</item>


	<item>
		<title>
			Project 4: strutting your stuff
		</title>
	        <apropos id="140" author="timm" dob="1187229321" />
       		<link>http://menzies.us/cs591o/?main=140</link>
       		<category>projects</category>
		<description>
			<![CDATA[
			<p>
This project demonstrates the students' understanding of data mining methods.
</p><p>
This project will be marked via an end-of-semester presentation and a written report due one week after the presentation.
</p><p>
For this project, students can pick one of two goals:
<ol>
	<li>
(For students with an A-grade average): do something AMAZING! with data mining.
Even spectacular failures are allowed here, just as long as you fail in the right way (i.e. important questions are explored or recognized).
IMPORTANT: students performing this project must demonstrate good progress (in their project 3 submission)
or they will be switched to the other project.
Your AMAZING topic could be:
<ul>
<li>Something wonderful of your own idea;
<li>Update TAR3 to choose rules that maximize not just lift, but lift*support.
<li>Repeat the STATLOG experiments (a big shoot-em-up between N learners)
<a href="http://mlearn.ics.uci.edu/databases/statlog/">http://mlearn.ics.uci.edu/databases/statlog/</a>
<br>
<a href="http://www.amsta.leeds.ac.uk/~charles/statlog/">http://www.amsta.leeds.ac.uk/~charles/statlog/</a>.
<li>Machine learning and visualization: try mapping N dimensions
into 3, then visualize the results. Compare two methods: PCA and
<a href="http://citeseer.ist.psu.edu/faloutsos95fastmap.html">fastmap</a>.
Fastmap is meant to be faster and more scalable
than PCA. Can you test that?
<li>
Port TAR3 to Java and get it working in WEKA.
<li>Stack nearest neighbor with some learner; i.e., to classify
a test instance, find its nearest neighbors in the training set,
and build a theory just on those nearest neighbors.
<li>
Build a lift-based rule covering algorithm.
<ul>
<li>
Discretize all the data.
<li>
Score the classes by some utility function F. 
<li>
While training data remains, do:
<ul>
<li>
Score each  instance by its class score.
<li>
Sort the T training instances by F,
then split them into two classes: the BEST 20% and the REST 80%.
<li>
Create two  frequency tables
for BEST and REST. 
<li>
Sort each attribute range by <em>a<sup>2</sup>/(a+b)</em> 
where <em>a=BEST.a/T</em>
and <em>b=REST.b/T</em>. 
<li>
Rule = Rule + attributeRangeFirstInTheSort 
<li>
Reject all training data that contradicts the rule.
</ul>
<li>
Apply Rule to the test set.
</ul>
</ul>
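The lift-based covering loop in the last option above might be sketched as below. This is only a Python stand-in built on several assumptions: the data is already discretized, each instance is a ({attribute: value}, score) pair with the class utility F already applied, the BEST split is fixed at 20%, and the `grow_rule` helper is invented for this sketch:

```python
# Sketch of the lift-based rule covering loop described above.
from collections import Counter

def grow_rule(data, rounds=3, best_frac=0.2):
    """Greedily grow one conjunctive rule from the highest-lift attribute ranges.
    `data` is a list of ({attribute: discrete_value}, class_score) pairs;
    discretization and class scoring by the utility F are assumed done already."""
    rule = {}  # attribute -> required value
    for _ in range(rounds):
        if not data:
            break
        T = len(data)
        # Sort by score, then split into the BEST 20% and the REST 80%.
        data = sorted(data, key=lambda inst: inst[1], reverse=True)
        cut = max(1, int(T * best_frac))
        best_rows, rest_rows = data[:cut], data[cut:]
        # Frequency tables of attribute ranges seen in BEST and REST.
        best = Counter((k, v) for row, _ in best_rows for k, v in row.items())
        rest = Counter((k, v) for row, _ in rest_rows for k, v in row.items())
        # Rank each range by a^2/(a+b), where a = BEST freq / T, b = REST freq / T.
        def lift(r):
            a, b = best[r] / T, rest[r] / T
            return a * a / (a + b)
        attr, val = max(best, key=lift)
        rule[attr] = val
        # Reject all training data that contradicts the rule so far.
        data = [inst for inst in data
                if all(inst[0].get(k) == v for k, v in rule.items())]
    return rule

# Toy usage: "x=good" dominates the high-scoring rows.
data = [({"x": "good"}, 1.0) for _ in range(20)] + \
       [({"x": "bad"},  0.0) for _ in range(80)]
print(grow_rule(data, rounds=1))   # prints {'x': 'good'}
```

The final step in the pseudocode, applying the rule to the test set, is then just the same `all(...)` membership check run over the test instances.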
<li>
	OR
conduct a "good" analysis of one or more of the data sets found at
<ul>
	<li>
The data sets in <a href="http://promisedata.org/repository">PROMISE repository of 
	software engineering data sets</a>;
<li>
The <a href="http://mlearn.ics.uci.edu/MLRepository.html">UCI machine learning repository</a>;
<li>
	The <a href="http://kdd.ics.uci.edu/">Knowledge Discovery in Databases Archive</a> (caution: some of these data sets are too large to load into WEKA)
	<li>
		The <a href="http://www.mlnet.org/resources/datasets-index.html">ML net</a>
		resources page (caution: contains some dead links)
		<li>
			The <a href="http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/">Time Series Library</a> (caution: many of these time series are single attribute with no target class).
		</ul>
		Here "good" means at least <a href="http://menzies.us/cs591o/?lectures=139">DMM level 4</a> (and to get bonus marks,
			students must achieve <a href="http://menzies.us/cs591o/?lectures=139">DMM Level 5</a>).
	</ol></p>
			]]>
		</description>
	</item>




</items>


