
M5P

Contact: Stefan Kramer

Categories: Prediction

Exposed methods:

predict
Input: Instances (feature vectors with real-valued target values)
Output: Tree of regression models
Input format: Weka's ARFF format
Output format: Plain text, model binary
User-specified parameters: The user can choose whether to build a regression tree instead of a model tree, the minimum number of instances allowed at a leaf node, whether the tree should be pruned, and whether to use unsmoothed predictions.
Reporting information: Performance measures (Correlation coefficient, mean absolute error, root mean squared error, relative absolute error, root relative squared error)
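
As an informal sketch (not part of the component itself), the exposed functionality corresponds roughly to WEKA's Java API as shown below; the file names and parameter values are hypothetical, and the option setters match the user-specified parameters listed above:

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.M5P;
    import weka.core.Instances;
    import weka.core.SerializationHelper;
    import weka.core.converters.ConverterUtils.DataSource;

    public class M5PPredictExample {
        public static void main(String[] args) throws Exception {
            // Load an ARFF file; the last attribute is assumed to hold the
            // real-valued target.
            Instances data = DataSource.read("train.arff"); // hypothetical file name
            data.setClassIndex(data.numAttributes() - 1);

            // Configure the user-specified parameters.
            M5P m5p = new M5P();
            m5p.setBuildRegressionTree(false); // false: model tree, true: regression tree
            m5p.setMinNumInstances(4.0);       // minimum number of instances at a leaf
            m5p.setUnpruned(false);            // prune the tree
            m5p.setUseUnsmoothed(false);       // use smoothed predictions

            // Build the tree of regression models and print it as plain text.
            m5p.buildClassifier(data);
            System.out.println(m5p);

            // Predict the target value of a single instance.
            double prediction = m5p.classifyInstance(data.instance(0));
            System.out.println("Prediction: " + prediction);

            // Reporting information: cross-validated performance measures.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new M5P(), data, 10, new Random(1));
            System.out.println("Correlation coefficient:     " + eval.correlationCoefficient());
            System.out.println("Mean absolute error:         " + eval.meanAbsoluteError());
            System.out.println("Root mean squared error:     " + eval.rootMeanSquaredError());
            System.out.println("Relative absolute error:     " + eval.relativeAbsoluteError());
            System.out.println("Root relative squared error: " + eval.rootRelativeSquaredError());

            // The model binary is obtained via Java serialization.
            SerializationHelper.write("m5p.model", m5p);
        }
    }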

Description:

M5P [WAN97] is a reconstruction of Quinlan's M5 algorithm [QUI92] for inducing trees of regression models. M5P combines a conventional decision tree with the option of linear regression functions at the nodes.

First, a decision-tree induction algorithm is used to build a tree; instead of maximizing the information gain at each inner node, however, a splitting criterion is used that minimizes the intra-subset variation in the class values down each branch. The splitting procedure stops if the class values of all instances that reach a node vary only slightly, or if only a few instances remain.
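
Concretely, the splitting criterion of [QUI92] is the standard deviation reduction (SDR): a split of the instance set T into subsets T_1, ..., T_k is chosen to maximize

    SDR = sd(T) - sum_i ( |T_i| / |T| ) * sd(T_i)

where sd denotes the standard deviation of the class values (the notation here is ours).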

Second, the tree is pruned back from each leaf. When pruning, an inner node is turned into a leaf with a regression plane.

Third, to avoid sharp discontinuities between adjacent subtrees, a smoothing procedure is applied: the prediction of the leaf model is passed along the path back to the root and, at each node on the way, is combined with the value predicted by the linear model for that node.
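
In the formulation of [QUI92], the smoothed prediction p' at a node is computed as

    p' = (n*p + k*q) / (n + k)

where p is the prediction passed up from the branch below, q is the value predicted by the linear model at the current node, n is the number of training instances reaching the branch below, and k is a smoothing constant (15 in [QUI92]).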

Techniques devised by Breiman et al. [BRE84] for their CART system are adapted to deal with enumerated attributes and missing values. All enumerated attributes are turned into binary variables, so that all splits in M5P are binary. For missing values, M5P uses a technique called “surrogate splitting”, which finds another attribute to split on in place of the original one and uses it instead. During training, M5P uses the class value as the surrogate attribute, in the belief that this is the attribute most likely to be correlated with the one used for splitting. When the splitting procedure ends, all missing values are replaced by the average values of the corresponding attributes of the training examples reaching the leaves. During testing, an unknown attribute value is replaced by the average value of that attribute over all training instances that reach the node, with the effect of always choosing the most populous subnode.
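
M5P performs this preprocessing internally. As an illustrative sketch, the same two transformations are available as standalone WEKA filters (the file name below is hypothetical):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NominalToBinary;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;

    public class M5PPreprocessingSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("train.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Turn every enumerated (nominal) attribute into binary
            // indicator attributes, so that all splits become binary.
            NominalToBinary n2b = new NominalToBinary();
            n2b.setInputFormat(data);
            Instances binarized = Filter.useFilter(data, n2b);

            // Replace missing values by the mean (numeric attributes) or
            // mode (nominal attributes) of the training data.
            ReplaceMissingValues rmv = new ReplaceMissingValues();
            rmv.setInputFormat(binarized);
            Instances complete = Filter.useFilter(binarized, rmv);

            System.out.println(complete.numAttributes() + " attributes after binarization");
        }
    }
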
M5P generates models that are compact and relatively comprehensible.
For further information, we refer to the original publications [WAN97], [QUI92], [BRE84].

Background (publication date, popularity/level of familiarity, rationale of approach, further comments)
Published in 1997. M5P uses features from the well-known CART system and
reimplements Quinlan's well-known M5 algorithm with modifications, and it seems to
outperform it. M5P can deal effectively with enumerated attributes and missing values.
Smoothing substantially increases prediction accuracy.

Bias (instance-selection bias, feature-selection bias, combined instance-selection/feature-selection bias, independence assumptions?, ...)
Feature-selection bias

Lazy learning/eager learning
Eager learning

Interpretability of models (black box model?, ...)
Good (the output is a model tree)

Type of Descriptor:

Interfaces:

Priority: Low

Development status:

Homepage:

Dependencies:
External components: WEKA


Technical details

Data: No

Software: Yes

Programming language(s): Java

Operating system(s): Linux, Win, Mac OS

Input format: Weka's ARFF format
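
As a minimal illustration, an ARFF input for M5P might look as follows (relation, attribute names and values are invented; '?' marks a missing value):

    @relation example
    @attribute temperature numeric
    @attribute outlook {sunny, overcast, rainy}
    @attribute activity numeric
    @data
    21.5, sunny, 0.42
    ?, rainy, 0.17
    18.0, overcast, 0.31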

Output format: Plain text, model binary

License: GPL


References

[BRE84] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth, Belmont, CA (1984)
[QUI92] Quinlan, J.R.: Learning with Continuous Classes. In: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348 (1992)
[WAN97] Wang, Y., Witten, I.H.: Induction of Model Trees for Predicting Continuous Classes. In: Poster Papers of the 9th European Conference on Machine Learning (1997)
