.TH C4.5 1
.SH NAME
A guide to the verbose output of the C4.5 decision tree generator

.SH DESCRIPTION
This document explains the output of the program
.I C4.5
when it is run with the verbosity level (option
.BR \-v )
set to values from 1 to 3.

.SH TREE BUILDING

.B Verbosity level 1

To build a decision tree from a set of data items, each of which belongs
to one of a set of classes,
.I C4.5
proceeds as follows:
.IP " 1." 7
If all items belong to the same class, the decision
tree is a leaf which is labelled with this class.
.IP " 2."
Otherwise,
.I C4.5
attempts to find the best attribute
to test in order to divide the data items into
subsets, and then builds a subtree from each subset
by invoking this procedure recursively (see the sketch below).
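.PP
The recursion is the classic divide-and-conquer scheme. The following
is a minimal sketch of it in Python; the names (build_tree, best_split,
majority_class) are illustrative only and do not appear in the C4.5
sources.
.nf

    from collections import Counter

    def majority_class(items):
        # items: list of (attribute_values, class_label) pairs
        return Counter(c for _, c in items).most_common(1)[0][0]

    def build_tree(items, best_split):
        # best_split(items) returns (test, subsets) or None; a
        # gain-based version is sketched below.
        labels = [c for _, c in items]
        if len(set(labels)) == 1:         # step 1: one class, so leaf
            return ("leaf", labels[0])
        split = best_split(items)         # step 2: choose the best test
        if split is None:                 # no useful test: majority leaf
            return ("leaf", majority_class(items))
        test, subsets = split
        return ("node", test,
                [build_tree(s, best_split) for s in subsets])

.fi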
.HP 0
The best attribute to branch on at each stage is selected by
determining the information gain of a split on each of the attributes.
If the selection criterion being used is GAIN (option
.BR \-g ),
the best
attribute is the one that divides the data items with the highest gain
in information, whereas if the GAINRATIO criterion (the default) is
being used (and the gain is at least the average gain across all
attributes), the best attribute is the one with the highest ratio of
information gain to potential information.

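.PP
Both criteria derive from the entropy of the class distribution.
The following minimal sketch (in Python, with unit item weights for
simplicity, whereas C4.5 itself uses the fractional weights described
below) computes the potential information, gain and gain ratio of a
candidate partition:
.nf

    from collections import Counter
    from math import log2

    def info(labels):
        # Entropy of a class distribution, in bits.
        n = len(labels)
        return -sum(k / n * log2(k / n)
                    for k in Counter(labels).values())

    def gain_and_ratio(labels, subsets):
        # 'subsets' is the partition of 'labels' induced by a test.
        n = len(labels)
        inf = -sum(len(s) / n * log2(len(s) / n)
                   for s in subsets if s)         # potential info
        gain = info(labels) - sum(len(s) / n * info(s)
                                  for s in subsets if s)
        ratio = gain / inf if inf > 0 else 0.0
        return inf, gain, ratio

.fi
Under GAIN the attribute with the highest gain is selected; under
GAINRATIO, among attributes whose gain is at least the average gain,
the one with the highest ratio is selected.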

For discrete-valued attributes, a branch corresponding to each value of
the attribute is formed, whereas for continuous-valued attributes, a
threshold is found, thus forming two branches (see the sketch below).
If subset tests are being used (option
.BR \-s ),
branches may be formed, each
corresponding to a subset of the values of the discrete attribute being tested.

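.PP
A minimal sketch of the threshold search for a continuous attribute,
reusing gain_and_ratio from the sketch above (best_threshold is an
illustrative name; note that C4.5 reports the threshold at an actual
data value rather than at a midpoint):
.nf

    def best_threshold(pairs):
        # pairs: (numeric value, class label) with known values.
        # Returns (cut, inf, gain) of the best binary split, or None.
        pairs = sorted(pairs, key=lambda p: p[0])
        labels = [c for _, c in pairs]
        best = None
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                 # no cut between equal values
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            inf, g, _ = gain_and_ratio(labels,
                                       [labels[:i], labels[i:]])
            if best is None or g > best[2]:
                best = (cut, inf, g)
        return best

.fi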

The verbose output shows the number of items from which a tree is being
constructed, as well as the total weight of these items. The weight
of an item is the probability that the item would reach this point in the
tree; it will be less than 1.0 for items with an unknown value
of some previously-tested attribute.

For the best attribute, the following are shown:

.nf
    cut  - threshold (continuous attributes only)
    inf  - the potential information of a split
    gain - the gain in information of a split
    val  - the gain or the gain/inf (depending on the
           selection criterion)
.fi

Also shown is the proportion of items at this point in the tree
with an unknown value for that attribute. Items with an unknown value
for the attribute being tested are distributed across all values
in proportion to the relative frequency of these values in the
set of items being tested (see the sketch below).

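.PP
The weighting scheme can be sketched as follows (split_with_unknowns
and value_of are illustrative names, not C4.5 routines; each item
carries an explicit weight):
.nf

    def split_with_unknowns(items, value_of, branches):
        # items: (weight, item) pairs reaching this node;
        # value_of(item) gives a branch key, or None if unknown.
        known = [(w, it) for w, it in items
                 if value_of(it) is not None]
        unknown = [(w, it) for w, it in items
                   if value_of(it) is None]
        total = sum(w for w, _ in known)   # assumed non-zero here
        subsets = {}
        for b in branches:
            sub = [(w, it) for w, it in known if value_of(it) == b]
            freq = sum(w for w, _ in sub) / total
            # Unknown-value items go down every branch, with weight
            # scaled by that branch's relative frequency.
            subsets[b] = sub + [(w * freq, it) for w, it in unknown]
        return subsets

.fi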
If no split gives a gain in information, the set of items is made
into a leaf labelled with the most frequent class of the items reaching
this point in the tree, and the message
.IP
no sensible splits  \fIr1\fR/\fIr2\fR
.PP
is given, where
.I r1
is the total weight of the items reaching this point in the tree, and
.I r2
is the weight of those that do not belong to the class of this leaf.

If a subtree is found to misclassify
at least as many items as would replacing the subtree with a leaf,
the subtree is replaced and the following message is given:
.IP
Collapse tree for \fIn\fR items to leaf \fIc\fR
.PP
where
.I c
is the class assigned to the leaf.

.B Verbosity level 2

When determining the best attribute to test,
also shown are the threshold (continuous attributes only),
information gain and potential information for a split on
each of the attributes.
If a test on a continuous attribute has no gain, or there are
insufficient cases with known values of the attribute on which
to base a test, appropriate messages are given.
(Sufficient here means at least twice MINOBJS, an integer
which defaults to 2 but can be set with option
.BR \-m .)
The average gain across all attributes is also shown.

If subset tests on discrete attributes are being used, then for each
attribute examined, the output shows the combinations of attribute
values that are formed (i.e. at each stage, the combination with the
highest gain or gain ratio), together with the potential information,
the gain, and the value of the criterion (gain or gain ratio);
a sketch of this greedy search follows.
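.PP
The search over value subsets is greedy, which is what the staged
output reflects. A sketch of the idea only (greedy_subsets and
eval_partition are illustrative names; eval_partition stands for
whichever selection criterion is in force):
.nf

    def greedy_subsets(values, eval_partition):
        # Start with one group per value, then repeatedly merge the
        # pair of groups that most improves the criterion.
        partition = [{v} for v in values]
        best = eval_partition(partition)
        improved = True
        while improved and len(partition) > 2:
            improved = False
            for i in range(len(partition)):
                for j in range(i + 1, len(partition)):
                    trial = ([g for k, g in enumerate(partition)
                              if k not in (i, j)]
                             + [partition[i] | partition[j]])
                    score = eval_partition(trial)
                    if score > best:
                        best, keep, improved = score, trial, True
            if improved:
                partition = keep
        return partition, best

.fi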

.B Verbosity level 3

When determining the best attribute to test,
also shown is the frequency distribution table giving
the total weight of items of each class with:
.IP " -" 4
each value of the attribute (discrete attributes), or
.IP " -"
values below and above the threshold (continuous attributes), or
.IP " -"
values in each subset formed so far (subset tests).
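.PP
The table is just a weight total indexed by class and value group;
a minimal sketch (freq_table is an illustrative name):
.nf

    def freq_table(rows):
        # rows: (weight, value_group, class_label) triples, where a
        # value group is a value, a side of the threshold, or a subset.
        table = {}
        for w, v, c in rows:
            table.setdefault(c, {})
            table[c][v] = table[c].get(v, 0.0) + w
        return table        # table[class][group] = total weight

.fi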

.SH TREE PRUNING

.B Verbosity level 1

After the entire decision tree has been constructed,
.I C4.5
recursively
examines each subtree to determine whether replacing it with
a leaf or with a branch would be beneficial.
(Note: the numbers treated below as counts of items actually
refer to the total weight of the items mentioned.)

Each leaf is shown as:
.nf

    \fIc\fR (\fIr1\fR:\fIr2\fR/\fIr3\fR)

.fi
with:
.nf
    \fIc\fR  - the most frequent class at the leaf
    \fIr1\fR - the number of items at the leaf
    \fIr2\fR - the misclassifications at the leaf
    \fIr3\fR - \fIr2\fR adjusted for additional errors
.fi

Each test is shown as:
.nf

    \fIatt\fR:[\fIn1\fR%  N=\fIr4\fR  tree=\fIr5\fR  leaf=\fIr6\fR+\fIr7\fR  br[\fIn2\fR]=\fIr8\fR]

.fi
with:
.nf
    \fIn1\fR - percentage of the examples at this subtree that
         are misclassified
    \fIr4\fR - the number of items in the subtree
    \fIr5\fR - misclassifications of this subtree
    \fIr6\fR - misclassifications if this were a leaf
    \fIr7\fR - adjustment to \fIr6\fR for additional errors
    \fIn2\fR - number of the largest branch
    \fIr8\fR - total misclassifications if the subtree were
         replaced by its largest branch
.fi

If replacing the subtree with a leaf or with its largest branch
reduces the number of errors, the subtree is replaced
by whichever of the two gives the fewest errors.
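.PP
The adjustment for additional errors (\fIr3\fR and \fIr7\fR above) is
derived from the upper confidence limit of the binomial error rate,
using a default confidence level of 25%. The sketch below uses a plain
normal approximation; C4.5's own routine treats small and zero error
counts specially:
.nf

    from math import sqrt

    def added_errors(n, e, z=0.674):
        # n items at a leaf, e of them misclassified; z = 0.674 is
        # roughly the one-sided normal deviate for a 25% confidence
        # level.  Returns the estimated extra errors on unseen items.
        if n == 0:
            return 0.0
        f = e / n
        upper = f + z * sqrt(f * (1 - f) / n)
        return min(n, upper * n) - e

.fi
For example, added_errors(16, 1) comes to about 0.65 extra errors
beyond the one observed.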

.SH THRESHOLD SOFTENING

.B Verbosity level 1

In softening the thresholds of tests on continuous attributes
(option
.BR \-p ),
upper and lower bounds for each test are calculated.
For each such test, the following are shown:
.IP " *" 4
Base errors - the number of items misclassified when the threshold has
its original value
.IP " *"
Items - the number of items tested (with a known value for this
attribute)
.IP " *"
se - the standard deviation of the number of errors
.HP 0
For each of the distinct attribute values, the following are shown:
.IP " *" 4
Val <= - the attribute value
.IP " *"
Errors - the number of errors with this value as the threshold
.IP " *"
+Errors - the increase in errors over Base errors (Errors minus
Base errors)
.IP " *"
+Items - the number of items between this value and the original
threshold
.IP " *"
Ratio - the ratio of +Errors to +Items
.HP 0
The lower and upper bounds are then calculated so that the
number of errors with each as the threshold would be one standard
deviation above the base errors (see the sketch below).
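.PP
Here se is the standard deviation of a binomially distributed error
count. A hedged sketch of the bound computation (soften is an
illustrative name; C4.5 interpolates between adjacent attribute values
using the Ratio column, which this sketch omits):
.nf

    from math import sqrt

    def soften(base_errors, n_items, candidates):
        # candidates: (value, errors) for each attribute value tried
        # as a threshold, sorted by value.
        se = sqrt(base_errors * (n_items - base_errors) / n_items)
        target = base_errors + se
        within = [v for v, err in candidates if err <= target]
        lower = min(within) if within else None
        upper = max(within) if within else None
        return se, lower, upper

.fi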

.SH SEE ALSO

c4.5(1)