<?xml version="1.0"?>

<items>

	
	<item>
		<title>
			DMM: The data maturity model
		</title>
	        <apropos id="139" author="timm" dob="1187227986" />
       		<link>http://menzies.us/cs591o/?lecture=139</link>
       		<category>lectures</category>
		<description>
			<![CDATA[
			<h3>Why the DMM?</h3>
			<ul>
				<li>For the 5 levels of the DMM, see <a href="#levels">below</a>.</li>
				<li>For notes on the motivation, background, and status of the DMM, read on.</li>
			</ul>
			<h4>Motivation</h4>

<p>The promise of the DMM is the creation of <strong>good</strong> data sets that software engineering researchers and managers can use to explore software engineering issues.</p>

<p>Our premise is that good data is both used (in the past) and usable (in the future). Hence, to define <strong>good</strong> data, this <em>Data Maturity Model</em> (DMM) must also define:</p>

<ul>
<li>how the data <em>might</em> be used;</li>
<li>how the data <em>has</em> been used;</li>
<li>what problems <em>exist</em> (past or present) with this data;</li>
<li>how past problems with this data were <em>solved</em>;</li>
<li><em>who</em> has used the data;</li>
<li>what was <em>learned</em> from the data;</li>
<li>etc.</li>
</ul>

<h4>Goals</h4>

<p>The DMM is intended to be an  <em>operational</em> definition of <strong>good</strong>; i.e. </p>

<ul>
<li><em>No voluminous documentation</em>: it should be clear and succinct;</li>
<li><em>Anyone  can join</em>: even newly created data sets have a place on the DMM;</li>
<li><em>No standards without support tools</em>: every portion should be supported by tools;</li>
<li><em>No illusionary standards</em>: numerous public domain examples should exist for every portion of it;</li>
<li><a name="HolesAreAllowed"></a><em>Holes are allowed</em>: while the DMM lists numerous requirements, some can be missed and the data set can still mature.</li>
<li><em>No vacuous statements</em>: data mining analysts should be able to use it to determine what is the next best thing they could do with their data;</li>
<li><em>Everyone can play</em>: the analysis required to mature a data set does not require excessive resources (e.g. years of training, expensive tools, or very long processing times).</li>
</ul>

<h3>The 5 levels of the DMM</h3>

<p><img class="thumb" align="right" width="170" src="http://menzies.us/cs591o/img/modellevels.jpg"></p>

<ul>
<li>The lower the level, the less effort has gone into creating and using the data; i.e. level 1 is lazier than level 5.</li>
<li>The higher the level, the more the data has been used and the more useful it is; i.e. level 5 is better than level 1.</li>
<li>Each level has steps.</li>
<li>Reaching Level 1 means achieving all its steps.</li>
<li>To reach the higher levels <em>I&gt;1</em>:
<ul>
<li>the lower level <em>I-1</em> must be reached;</li>
<li>but only <em>ENOUGH%</em> of the steps for level <em>I</em> must be achieved.</li>
</ul></li>
</ul>

<h3>How much is enough?</h3>

<p>For a standard to be practical, it can't be too dogmatic. </p>

<p>Hence,  <em>ENOUGH=66%</em> (remember: <a href="DMM.html#HolesAreAllowed">holes are allowed</a>).</p>
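<p>A minimal sketch of this scoring rule in Python (the function name and its dict-of-fractions input are our own illustration, not part of the DMM itself):</p>

```python
ENOUGH = 0.66  # fraction of a level's steps needed, for levels 2..5

def dmm_level(achieved):
    """Return the highest DMM level a data set has reached.

    `achieved` maps each level (1..5) to the fraction of that level's
    steps completed (0.0 .. 1.0).  Level 1 demands all of its steps;
    each higher level I demands that level I-1 is reached and that
    ENOUGH of level I's own steps are achieved.
    """
    if achieved.get(1, 0.0) < 1.0:
        return 0                      # not even Level 1 yet
    level = 1
    for i in range(2, 6):
        if achieved.get(i, 0.0) >= ENOUGH:
            level = i
        else:
            break                     # holes are allowed; gaps in levels are not
    return level
```

<p>So a data set that meets all of Level 1, three-quarters of Level 2, but only half of Level 3 sits at Level 2.</p>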

<h3>Levels</h3>

<h4>Level 1- Initial</h4>

<ol>
<li>Data is in some defined data format (csv, xml, arff, ...).</li>
<li>Data has been run through any automatic learner.</li>
<li>The learned theory has been automatically applied to some data to return some conclusion without human intervention. Note that manual browsing of some on-screen visualization does <em>not</em> constitute automatic application.</li>
</ol>
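<p>The three Level 1 steps can be walked through with a toy example. Here the csv data, attribute names, and the deliberately trivial majority-class learner are all invented for illustration; any format and any learner would do:</p>

```python
import csv, io
from collections import Counter

# Step 1: data is in a defined format (csv).
DATA = """outlook,temp,play
sunny,hot,no
sunny,mild,no
rainy,mild,yes
rainy,cool,yes
overcast,cool,yes
"""

def learn_majority(rows, klass):
    """A deliberately simple learner: always predict the majority class."""
    majority = Counter(r[klass] for r in rows).most_common(1)[0][0]
    return lambda row: majority

rows = list(csv.DictReader(io.StringIO(DATA)))
theory = learn_majority(rows, "play")        # step 2: run any learner
conclusions = [theory(r) for r in rows]      # step 3: apply it automatically
```

<p>No human is in the loop between learning the theory and drawing the conclusions, which is the point of step 3.</p>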

<h4>Level 2- Repeatable</h4>

<ol>
<li>A theory learned from some data D1 has been run on some other data D2, where D1 &ne; D2; e.g. via an N-way cross-validation study.</li>
<li>Data is in the public domain; e.g. on a web site with free registration or, better yet, no registration.</li>
<li>Data has been run through learners that are in the public domain.</li>
<li>Someone other than the original users has processed this data.</li>
</ol>
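<p>For step 1, an N-way cross-validation repeatedly learns on one subset D1 and tests on a disjoint subset D2. A bare-bones sketch (the function is our own, not from any particular toolkit):</p>

```python
def cross_val_splits(data, n=3):
    """Yield (train, test) pairs for an N-way cross-validation:
    a theory learned from D1 (train) is assessed on D2 (test),
    with D1 and D2 disjoint."""
    folds = [data[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```
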

<h4>Level 3- Defined  </h4>

<ol>
<li>A goal for the learning is recorded; e.g. a business situation has been specified in which solutions of type X are useful but solutions of type Y are not.</li>
<li>The meaning of most attributes is defined; e.g. comments explain as much as is known about how those values were collected, what they mean, etc.</li>
<li>The meaning of each instance is defined; e.g. how is one instance different from another? how was each instance collected? to what extent do we trust the data collection process?</li>
<li>Statistics are available on the distribution of each attribute. Statistics include information on how many missing values exist (and some explanation is offered for the missing values).</li>
<li>Attribute subsets are identified that have differing effects on the goals; e.g. if the goal is cheap defect detection, then the attributes could be grouped into the cost of their data collection.</li>
<li>Instance subsets are identified that domain knowledge tells us are very different from the other instances; e.g. "instances 100 to 211 come from the east coast division and they do things very differently over there".</li>
</ol>
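<p>Step 4's per-attribute statistics might look like the sketch below (our own helper; it assumes a numeric attribute and that missing values are recorded as None):</p>

```python
from statistics import mean, pstdev

def attribute_report(rows, attr):
    """Summarize one numeric attribute's distribution, including a
    count of missing values (assumed to be recorded as None)."""
    values = [row[attr] for row in rows]
    present = [v for v in values if v is not None]
    return {"n": len(present),
            "missing": len(values) - len(present),
            "mean": mean(present),
            "sd": pstdev(present)}
```

<p>An explanation for the missing values (why are they missing? broken sensors? unrecorded fields?) should accompany the numbers.</p>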

<h4>Level 4- Managed    </h4>

<ol>
<li>Simple attribute distribution studies have been performed; e.g. outliers determined by manually browsing graphs of the distributions of individual attributes.</li>
<li>Data is run through multiple pre-processors; e.g. RemoveOutliers, BinLogging, NBins, LogTransforms, etc.</li>
<li>Data is run through multiple learners.</li>
<li>Data with different attribute/instance subsets has been run through different learners after different pre-processing.</li>
<li>Prior results with this data set are identified.</li>
<li>Results are compared to prior results; e.g. using some widely used measure like pred(25), discussing similarities, differences, and advances over previous work.</li>
<li>The results from learning with different attributes/instances/pre-processing/learners have been compared in some way (e.g. via t-tests or delta diagrams).</li>
<li>Some trade-off study has been performed; e.g. ROC curves, where the learning goals are used to comment on where in the ROC curves this learner should fall.</li>
<li>Some straw-man study has been performed; e.g. results compared to those of much simpler learners.</li>
<li>Some reduction studies have been performed; e.g. incremental cross-validation or feature subset selection.</li>
</ol>
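<p>Steps 2 to 4 amount to running a grid of pre-processors crossed with learners and tabulating the results for later comparison. A sketch (the grid runner and the toy stand-ins for pre-processors and learners are our own; real ones would replace them):</p>

```python
import math

def run_grid(data, preprocessors, learners):
    """Run every learner after every pre-processor and tabulate the
    results for later comparison (e.g. via t-tests)."""
    results = {}
    for pname, pre in preprocessors.items():
        cooked = pre(data)
        for lname, learner in learners.items():
            results[(pname, lname)] = learner(cooked)
    return results

# Toy stand-ins: each pre-processor transforms a list of numbers,
# and each "learner" just reduces the list to a score.
preprocessors = {"identity": lambda xs: xs,
                 "log":      lambda xs: [math.log(x + 1) for x in xs]}
learners = {"mean": lambda xs: sum(xs) / len(xs),
            "max":  lambda xs: max(xs)}
```
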

<h4>Level 5- Optimized </h4>

<ol>
<li>Issues with the current high-water mark for this learner are identified.</li>
<li>Any differences in learner performance have been analyzed and explained; e.g. via studies on synthetic data sets and/or lesion studies, such as finding where the current learner stops working as the variance in the continuous variables is increased.</li>
<li>The limits of the current approach have been stated.</li>
<li>A future direction for processing this data is defined.</li>
<li>Going beyond the list of problems, a tentative solution has been proposed.</li>
</ol>

			]]>
		</description>
	</item>

</items>


