<?xml version="1.0"?>

<items>

	
	<item>
		<title>
			DMM: The data maturity model
		</title>
	        <apropos id="139" author="timm" dob="1187227986" />
       		<link>http://menzies.us/cs591o/?lecture=139</link>
       		<category>lectures</category>
		<description>
			<![CDATA[
			<h3>Why the DMM?</h3>
			<ul>
				<li>For the 5 levels of the DMM, see <a href="#levels">below</a>.</li>
				<li>For notes on the motivation, background, and status of the DMM, read on.</li>
			</ul>
			<h4>Motivation</h4>

<p>The promise of the DMM is the creation of <strong>good</strong> data sets that software engineering researchers and managers can use to explore software engineering issues.</p>

<p>Our premise is that good data is both used (in the past) and usable (in the future). Hence, to define <strong>good</strong> data, this <em>Data Maturity Model</em> (DMM) must also define:</p>

<ul>
<li>how the data <em>might</em> be used;</li>
<li>how the data <em>has</em> been used;</li>
<li>what problems <em>exist</em> (past or present) with this data;</li>
<li>how past problems with this data were <em>solved</em>;</li>
<li><em>who</em> has used the data;</li>
<li>what was <em>learned</em> from the data;</li>
<li>etc.</li>
</ul>

<h4>Goals</h4>

<p>The DMM is intended to be an  <em>operational</em> definition of <strong>good</strong>; i.e. </p>

<ul>
<li><em>No voluminous documentation</em>: it should be clear and succinct;</li>
<li><em>Anyone  can join</em>: even newly created data sets have a place on the DMM;</li>
<li><em>No standards without support tools</em>: every portion should be supported by tools;</li>
<li><em>No illusionary standards</em>: numerous public domain examples should exist for every portion of it;</li>
<li><a name="HolesAreAllowed"></a><em>Holes are allowed</em>: while the DMM lists numerous requirements, some can be missed and the data set can still mature.</li>
<li><em>No vacuous statements</em>: data mining analysts should be able to use it to determine what is the next best thing they could do with their data;</li>
<li><em>Everyone can play</em>: the analysis required to mature a data set does not require excessive resources (e.g. years of training, expensive tools, or very long processing times).</li>
</ul>

<h3>The 5 levels of the DMM</h3>

<p><img class="thumb" align="right" width="170" src="http://menzies.us/cs591o/img/modellevels.jpg"></p>

<ul>
<li>The lower the level, the less effort has gone into creating and using the data; i.e. level 1 is lazier than level 5.</li>
<li>The higher the level, the more the data has been used and the more useful it is; i.e. level 5 is better than level 1.</li>
<li>Each level has steps.</li>
<li>Reaching Level 1 means achieving all its steps.</li>
<li>To reach the higher levels <em>I&gt;1</em>:
<ul>
<li>the lower level <em>I-1</em> must be reached;</li>
<li>but only <em>ENOUGH%</em> of the steps for level <em>I</em> must be achieved.</li>
</ul></li>
</ul>

<h3>How much is enough?</h3>

<p>For a standard to be practical, it can't be too dogmatic. </p>

<p>Hence,  <em>ENOUGH=66%</em> (remember: <a href="DMM.html#HolesAreAllowed">holes are allowed</a>).</p>
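<p>A minimal sketch of this scoring rule in Python (the function name and its dict-of-fractions input are our own illustration, not part of the DMM itself):</p>

```python
ENOUGH = 0.66  # fraction of a level's steps needed, for levels 2..5

def dmm_level(achieved):
    """Return the highest DMM level a data set has reached.

    `achieved` maps each level (1..5) to the fraction of that level's
    steps completed (0.0 .. 1.0).  Level 1 demands all of its steps;
    each higher level I demands that level I-1 is reached and that
    ENOUGH of level I's own steps are achieved.
    """
    if achieved.get(1, 0.0) < 1.0:
        return 0                      # not even Level 1 yet
    level = 1
    for i in range(2, 6):
        if achieved.get(i, 0.0) >= ENOUGH:
            level = i
        else:
            break                     # holes are allowed; gaps in levels are not
    return level
```

<p>So a data set that meets all of Level 1, three-quarters of Level 2, but only half of Level 3 sits at Level 2.</p>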

<h3>Levels</h3>

<h4>Level 1- Initial</h4>

<ol>
<li>Data is in some defined data format (csv, xml, arff, ...).</li>
<li>Data has been run through any automatic learner.</li>
<li>The learned theory has been automatically applied to some data to return some conclusion without human intervention. Note that manual browsing of some on-screen visualization does <em>not</em> constitute automatic application.</li>
</ol>
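<p>The three Level 1 steps can be walked through with a toy example. Here the csv data, attribute names, and the deliberately trivial majority-class learner are all invented for illustration; any format and any learner would do:</p>

```python
import csv, io
from collections import Counter

# Step 1: data is in a defined format (csv).
DATA = """outlook,temp,play
sunny,hot,no
sunny,mild,no
rainy,mild,yes
rainy,cool,yes
overcast,cool,yes
"""

def learn_majority(rows, klass):
    """A deliberately simple learner: always predict the majority class."""
    majority = Counter(r[klass] for r in rows).most_common(1)[0][0]
    return lambda row: majority

rows = list(csv.DictReader(io.StringIO(DATA)))
theory = learn_majority(rows, "play")        # step 2: run any learner
conclusions = [theory(r) for r in rows]      # step 3: apply it automatically
```

<p>No human is in the loop between learning the theory and drawing the conclusions, which is the point of step 3.</p>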

<h4>Level 2- Repeatable</h4>

<ol>
<li>A theory learned from some data D1 has been run on some other data D2, where D1 &ne; D2; e.g. via an N-way cross-validation study.</li>
<li>Data is in the public domain; e.g. on a web site with free registration or, better yet, no registration.</li>
<li>Data has been run through learners that are in the public domain.</li>
<li>Someone other than the original users has processed this data.</li>
</ol>
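<p>For step 1, an N-way cross-validation repeatedly learns on one subset D1 and tests on a disjoint subset D2. A bare-bones sketch (the function is our own, not from any particular toolkit):</p>

```python
def cross_val_splits(data, n=3):
    """Yield (train, test) pairs for an N-way cross-validation:
    a theory learned from D1 (train) is assessed on D2 (test),
    with D1 and D2 disjoint."""
    folds = [data[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```
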

<h4>Level 3- Defined  </h4>

<ol>
<li>A goal for the learning is recorded; e.g. a business situation has been specified in which solutions of type X are useful but solutions of type Y are not.</li>
<li>The meaning of most attributes is defined; e.g. comments explain as much as is known about how those values were collected, what they mean, etc.</li>
<li>The meaning of each instance is defined; e.g. how is one instance different from another? how was each instance collected? to what extent do we trust the data collection process?</li>
<li>Statistics are available on the distribution of each attribute. Statistics include information on how many missing values exist (and some explanation is offered for the missing values).</li>
<li>Attribute subsets are identified that have differing effects on the goals; e.g. if the goal is cheap defect detection, then the attributes could be grouped into the cost of their data collection.</li>
<li>Instance subsets are identified that domain knowledge tells us are very different from the other instances; e.g. "instances 100 to 211 come from the east coast division and they do things very differently over there".</li>
</ol>
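<p>Step 4's per-attribute statistics might look like the sketch below (our own helper; it assumes a numeric attribute and that missing values are recorded as None):</p>

```python
from statistics import mean, pstdev

def attribute_report(rows, attr):
    """Summarize one numeric attribute's distribution, including a
    count of missing values (assumed to be recorded as None)."""
    values = [row[attr] for row in rows]
    present = [v for v in values if v is not None]
    return {"n": len(present),
            "missing": len(values) - len(present),
            "mean": mean(present),
            "sd": pstdev(present)}
```

<p>An explanation for the missing values (why are they missing? broken sensors? unrecorded fields?) should accompany the numbers.</p>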

<h4>Level 4- Managed    </h4>

<ol>
<li>Simple attribute distribution studies have been performed; e.g. outliers determined by manually browsing graphs of the distributions of individual attributes.</li>
<li>Data is run through multiple pre-processors; e.g. RemoveOutliers, BinLogging, NBins, LogTransforms, etc.</li>
<li>Data is run through multiple learners.</li>
<li>Data with different attribute/instance subsets has been run through different learners after different pre-processing.</li>
<li>Prior results with this data set are identified.</li>
<li>Results are compared to prior results; e.g. using some widely used measure like pred(25), discussing similarities, differences, and advances over previous work.</li>
<li>The results from learning with different attributes/instances/pre-processing/learners have been compared in some way (e.g. via t-tests or delta diagrams).</li>
<li>Some trade-off study has been performed; e.g. ROC curves, where the learning goals are used to comment on where in the ROC curves this learner should fall.</li>
<li>Some straw-man study has been performed; e.g. results compared to those of much simpler learners.</li>
<li>Some reduction studies have been performed; e.g. incremental cross-validation or feature subset selection.</li>
</ol>
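<p>Steps 2 to 4 amount to running a grid of pre-processors crossed with learners and tabulating the results for later comparison. A sketch (the grid runner and the toy stand-ins for pre-processors and learners are our own; real ones would replace them):</p>

```python
import math

def run_grid(data, preprocessors, learners):
    """Run every learner after every pre-processor and tabulate the
    results for later comparison (e.g. via t-tests)."""
    results = {}
    for pname, pre in preprocessors.items():
        cooked = pre(data)
        for lname, learner in learners.items():
            results[(pname, lname)] = learner(cooked)
    return results

# Toy stand-ins: each pre-processor transforms a list of numbers,
# and each "learner" just reduces the list to a score.
preprocessors = {"identity": lambda xs: xs,
                 "log":      lambda xs: [math.log(x + 1) for x in xs]}
learners = {"mean": lambda xs: sum(xs) / len(xs),
            "max":  lambda xs: max(xs)}
```
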

<h4>Level 5- Optimized </h4>

<ol>
<li>Issues with the current high-water mark for this learner are identified.</li>
<li>Any differences in learner performance have been analyzed and explained; e.g. via studies on synthetic data sets and/or lesion studies, such as finding where the current learner stops working as the variance in the continuous variables is increased.</li>
<li>The limits of the current approach have been stated.</li>
<li>A future direction for processing this data is defined.</li>
<li>Going beyond the list of problems, a tentative solution has been proposed.</li>
</ol>

			]]>
		</description>
	</item>

</items>


