source: proiecte/Parallel-DT/R8/Doc/verbose.1 @ 24

.TH C4.5 1
.SH NAME
A guide to the verbose output of the C4.5 decision tree generator

.SH DESCRIPTION
This document explains the output of the program
.I C4.5
when it is run with the verbosity level (option
.BR v )
set to values from 1 to 3.

.SH TREE BUILDING

.B Verbosity level 1

To build a decision tree from a set of data items, each of which belongs
to one of a set of classes,
.I C4.5
proceeds as follows:
.IP "    1." 7
If all items belong to the same class, the decision
tree is a leaf labelled with this class.
.IP "    2."
Otherwise,
.I C4.5
attempts to find the best attribute
to test in order to divide the data items into
subsets, and then builds a subtree from each subset
by recursively invoking this procedure.
.HP 0
The best attribute to branch on at each stage is selected by
determining the information gain of a split on each of the attributes.
If the selection criterion being used is GAIN (option
.BR g ),
the best
attribute is that which divides the data items with the highest gain
in information, whereas if the GAINRATIO criterion (the default) is
being used (and the gain is at least the average gain across all
attributes), the best attribute is that with the highest ratio of
information gain to potential information.
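
The two criteria can be illustrated with a short Python sketch.
This is not the C4.5 source (which works in C with weighted item
counts); the names entropy and gain_and_ratio are ours, and item
weights and unknown values are ignored for brevity:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def gain_and_ratio(labels, attr_values):
        # attr_values[i] is the value of the tested attribute for item i
        base = entropy(labels)
        n = len(labels)
        split_info = 0.0        # "potential information" (inf in the output)
        remainder = 0.0
        for v in set(attr_values):
            sub = [labels[i] for i, a in enumerate(attr_values) if a == v]
            p = len(sub) / n
            split_info -= p * log2(p)
            remainder += p * entropy(sub)
        gain = base - remainder                  # GAIN criterion (option g)
        ratio = gain / split_info if split_info > 0 else 0.0   # GAINRATIO
        return gain, ratio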

For discrete-valued attributes, a branch corresponding to each value of
the attribute is formed, whereas for continuous-valued attributes, a
threshold is found, thus forming two branches.
If subset tests are being used (option
.BR s ),
branches may be formed, each
corresponding to a subset of values of the discrete attribute being tested.
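
The search for such a threshold might be sketched as follows.
This is an illustration under assumptions rather than the actual
C4.5 routine; in particular, reporting the lower of the two
neighbouring values as the cut is an assumption here, and item
weights are again ignored:

    from collections import Counter
    from math import log2

    def _ent(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        base = _ent([c for _, c in pairs])
        best_cut, best_gain = None, 0.0
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue          # a cut can only fall between distinct values
            left = [c for _, c in pairs[:i]]
            right = [c for _, c in pairs[i:]]
            gain = base - (len(left) * _ent(left) + len(right) * _ent(right)) / n
            if gain > best_gain:
                # take the lower of the two neighbouring values as the cut
                best_cut, best_gain = pairs[i - 1][0], gain
        return best_cut, best_gain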

The verbose output shows the number of items from which a tree is being
constructed, as well as the total weight of these items.  The weight
of an item is the probability that the item would reach this point in the
tree and will be less than 1.0 for items with an unknown value
of some previously-tested attribute.

The following are shown for the best attribute:

    cut  -  threshold (continuous attributes only)
    inf  -  the potential information of a split
    gain -  the gain in information of a split
    val  -  the gain or the gain/inf (depending on the selection criterion)

Also shown is the proportion of items at this point in the tree
with an unknown value for that attribute.  Items with an unknown value
for the attribute being tested are distributed across all values
in proportion to the relative frequency of these values in the
set of items being tested.
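
The weighting scheme can be pictured with the Python sketch below
(the data layout and the name partition are ours, not C4.5's): an
item whose value of the tested attribute is unknown is sent down
every branch, with its weight scaled by that branch's relative
frequency among the known-valued items:

    def partition(items, attr):
        # items are dicts carrying a 'weight' entry; attr is the tested
        # attribute; items with attr set to None are treated as unknown
        known = [it for it in items if it.get(attr) is not None]
        unknown = [it for it in items if it.get(attr) is None]
        total = sum(it['weight'] for it in known)
        subsets = {}
        for it in known:
            subsets.setdefault(it[attr], []).append(dict(it))
        for value, subset in subsets.items():
            freq = sum(it['weight'] for it in subset) / total
            for it in unknown:
                copy = dict(it)
                copy['weight'] = it['weight'] * freq   # weight drops below 1.0
                subset.append(copy)
        return subsets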

If no split gives a gain in information, the set of items is made
into a leaf labelled with the most frequent class of items reaching
this point in the tree, and the message

        no sensible splits
.IR r1 / r2

is given, where
.I r1
is the total weight of items reaching this point in the tree, and
.I r2
is the weight of those items that do not belong to the class of this leaf.
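
The quantities in this message correspond to the following sketch
(a hypothetical Python helper, not code from C4.5):

    def leaf_summary(items):
        # items are (class, weight) pairs; returns the leaf class, r1 and r2
        weight_by_class = {}
        for cls, w in items:
            weight_by_class[cls] = weight_by_class.get(cls, 0.0) + w
        leaf_class = max(weight_by_class, key=weight_by_class.get)
        r1 = sum(weight_by_class.values())
        r2 = r1 - weight_by_class[leaf_class]
        return leaf_class, r1, r2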

If a subtree is found to misclassify
at least as many items as does replacing the subtree with a leaf, then
the subtree is replaced and the following message is given:

        Collapse tree for
.I n
items to leaf
.I c

where
.I c
is the class assigned to the leaf.


.B Verbosity level 2

When determining the best attribute to test,
also shown are the threshold (continuous attributes only),
information gain and potential information for a split on
each of the attributes.
If a test on a continuous attribute has no gain or there are
insufficient cases
with known values of the attribute on which
to base a test, appropriate messages are given.
(Sufficient here means at least twice MINOBJS, an integer
which defaults to 2 but can be set with option
.BR m .)
The average gain across all attributes is also shown.

If subset tests on discrete attributes are being used,
then for each attribute examined, the combinations of
attribute values that are formed (i.e. at each stage, the
combination with the highest gain or gain ratio) are shown,
together with the potential information, the gain, and the gain or gain ratio.
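
The value-merging behind these messages can be sketched as below.
This is a rough reconstruction under assumptions, not the C4.5
code: it starts with one subset per attribute value and repeatedly
merges the pair of subsets whose combination scores best, stopping
once merging no longer helps:

    def form_subsets(values, evaluate):
        # evaluate(subsets) returns the gain or gain ratio of a split on
        # the given list of value subsets (it is supplied by the caller)
        subsets = [frozenset([v]) for v in set(values)]
        best_score = evaluate(subsets)
        while len(subsets) > 2:
            candidates = []
            for i in range(len(subsets)):
                for j in range(i + 1, len(subsets)):
                    merged = [s for k, s in enumerate(subsets) if k not in (i, j)]
                    merged.append(subsets[i] | subsets[j])
                    candidates.append((evaluate(merged), merged))
            score, merged = max(candidates, key=lambda c: c[0])
            if score < best_score:
                break                    # stop once merging no longer helps
            best_score, subsets = score, merged
        return subsets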


.B Verbosity level 3

When determining the best attribute to test,
the frequency distribution table is also shown, giving
the total weight of items of each class with:

    - each value of the attribute (discrete attributes), or
    - values below and above the threshold (continuous attributes), or
    - values in each subset formed so far (subset tests).



.SH TREE PRUNING

.B Verbosity level 1

After the entire decision tree has been constructed,
.I C4.5
recursively
examines each subtree to determine whether replacing it with
a leaf or a branch would be beneficial.
(Note: the numbers treated below as counts of items actually
refer to the total weight of the items mentioned.)

Each leaf is shown as:

.IR        c ( r1 : r2 /
.IR r3 )

  with:
        \fIc\fR   -  the most frequent class at the leaf
        \fIr1\fR  -  the number of items at the leaf
        \fIr2\fR  -  misclassifications at the leaf
        \fIr3\fR  -  \fIr2\fR adjusted for additional errors

Each test is shown as:

.IR        att :[ n1 "%  N=" r4 tree=
.IR r5  leaf= r6 +
.IR r7  br[ n2 ]= r8 ]

  with:
        \fIn1\fR  -  percentage of items at this subtree that are misclassified
        \fIr4\fR  -  the number of items in the subtree
        \fIr5\fR  -  misclassifications of this subtree
        \fIr6\fR  -  misclassifications if this were a leaf
        \fIr7\fR  -  adjustment to \fIr6\fR for additional errors
        \fIn2\fR  -  number of the largest branch
        \fIr8\fR  -  total misclassifications if the subtree is replaced by its largest branch

If replacing the subtree with a leaf or the largest branch
reduces the number of errors, then the subtree is replaced
by whichever of these results in the fewest errors.
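
One way to picture the additional-errors adjustment and the
replacement decision is the Python sketch below.  It is an
approximation under assumptions, not C4.5's own routine: the extra
errors are estimated here from a one-sided normal confidence bound
on the observed error rate (C4.5 itself uses a binomial limit at a
configurable confidence level, 25% by default), and each of the
three error figures passed to best_replacement is assumed to be
already adjusted:

    from math import sqrt

    def added_errors(n, e, z=0.67):
        # z is roughly the one-sided normal deviate for a 25% confidence level
        if n <= 0:
            return 0.0
        p = min(1.0, e / n)
        return z * sqrt(p * (1.0 - p) * n)    # predicted extra errors

    def best_replacement(subtree_err, leaf_err, branch_err):
        # the subtree is kept unless the leaf or branch predicts fewer errors
        options = {'subtree': subtree_err, 'leaf': leaf_err, 'branch': branch_err}
        return min(options, key=options.get)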


.SH THRESHOLD SOFTENING

.B Verbosity level 1

In softening the thresholds of tests on continuous attributes
(option
.BR p ),
upper and lower bounds for each test are calculated.
For each such test, the following are shown:
.IP "  *" 4
Base errors - the number of items misclassified when the threshold has
its original value
.IP "  *"
Items - the number of items tested (with a known value for this
attribute)
.IP "  *"
se - the standard deviation of the number of errors
.HP 0
For each of the different attribute values, the following are shown:
.IP "  *" 4
Val <=   - the attribute value
.IP "  *"
Errors   - the number of errors with this value as the threshold
.IP "  *"
+Errors  - the difference between Errors and Base errors
.IP "  *"
+Items   - the number of items between this value and the original
threshold
.IP "  *"
Ratio    - the ratio of +Errors to +Items
.HP 0
The lower and upper bounds are then calculated so that the
number of errors with each as threshold would be one standard
deviation above the base errors.
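
A simplified Python sketch of this bound calculation follows.  It
is an illustration under assumptions, not the C4.5 routine: it
takes the most distant attribute values on either side of the
original threshold whose error count stays within Base errors plus
one standard deviation, and does not interpolate between values
(the program may interpolate to hit the target exactly):

    def soft_bounds(candidates, base_errors, se, threshold):
        # candidates: (value, errors) pairs, one per distinct attribute
        # value, where errors is the error count with that value used as
        # the threshold; se is the standard deviation of the errors
        target = base_errors + se
        below = [v for v, e in candidates if v < threshold and e <= target]
        above = [v for v, e in candidates if v > threshold and e <= target]
        lower = min(below) if below else threshold
        upper = max(above) if above else threshold
        return lower, upper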


.SH SEE ALSO

c4.5(1)