.TH C4.5 1
.SH NAME
A guide to the verbose output of the C4.5 decision tree generator

.SH DESCRIPTION
This document explains the output of the program
.I C4.5
when it is run with the verbosity level (option
.BR v )
set to values from 1 to 3.

.SH TREE BUILDING

.B Verbosity level 1

To build a decision tree from a set of data items each of which belongs
to one of a set of classes,
.I C4.5
proceeds as follows:
.IP " 1." 7
If all items belong to the same class, the decision
tree is a leaf which is labelled with this class.
.IP " 2."
Otherwise,
.I C4.5
attempts to find the best attribute
to test in order to divide the data items into
subsets, and then builds a subtree from each subset
by recursively invoking this procedure for each one.
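The recursive procedure above can be sketched as follows. This is a minimal illustration, not the C4.5 implementation itself: the data layout ((attributes, class) pairs) and the helpers `best_attribute` and `majority_class` are assumptions of the sketch, with `best_attribute` standing in for the gain/gain-ratio selection described below.

```python
# Minimal sketch of the recursive tree-building procedure described above.
# Items are (attribute_dict, class_label) pairs.
from collections import Counter

def majority_class(items):
    # Most frequent class among the items (used when no attribute remains).
    return Counter(cls for _, cls in items).most_common(1)[0][0]

def best_attribute(items, attributes):
    # Placeholder: C4.5 selects by information gain or gain ratio here.
    return attributes[0]

def build_tree(items, attributes):
    classes = {cls for _, cls in items}
    # Step 1: if all items belong to one class, return a leaf for that class.
    if len(classes) == 1:
        return ("leaf", classes.pop())
    if not attributes:
        return ("leaf", majority_class(items))
    # Step 2: pick the best attribute, split the items, and recurse.
    att = best_attribute(items, attributes)
    remaining = [a for a in attributes if a != att]
    branches = {}
    for value in {item[0][att] for item in items}:
        subset = [item for item in items if item[0][att] == value]
        branches[value] = build_tree(subset, remaining)
    return ("test", att, branches)
```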
.HP 0
The best attribute to branch on at each stage is selected by
determining the information gain of a split on each of the attributes.
If the selection criterion being used is GAIN (option
.BR g ),
the best
attribute is that which divides the data items with the highest gain
in information, whereas if the GAINRATIO criterion (the default) is
being used (and the gain is at least the average gain across all
attributes), the best attribute is that with the highest ratio of
information gain to potential information.
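The two criteria can be illustrated with the standard entropy formulas: a split's gain is the class entropy before the split minus the weighted class entropy of its subsets, and the potential (split) information is the entropy of the subset sizes themselves. The counts below are illustrative, not taken from any C4.5 run.

```python
# Worked example of the GAIN and GAINRATIO criteria on one discrete split.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_and_gain_ratio(class_counts, subsets):
    # class_counts: class counts before the split, e.g. [9, 5]
    # subsets: class counts within each branch of the split
    n = sum(class_counts)
    before = entropy(class_counts)
    after = sum(sum(s) / n * entropy(s) for s in subsets)
    gain = before - after
    # Potential information: entropy of the branch sizes themselves.
    split_info = entropy([sum(s) for s in subsets])
    return gain, gain / split_info

# A three-way split of 14 items (9 of one class, 5 of the other):
g, gr = gain_and_gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]])
# g is about 0.247 bits; gr (gain / potential information) is about 0.156.
```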

For discrete-valued attributes, a branch corresponding to each value of
the attribute is formed, whereas for continuous-valued attributes, a
threshold is found, thus forming two branches.
If subset tests are being used (option
.BR s ),
branches may be formed
corresponding to a subset of values of a discrete attribute being tested.

The verbose output shows the number of items from which a tree is being
constructed, as well as the total weight of these items. The weight
of an item is the probability that the item would reach this point in the
tree, and will be less than 1.0 for items with an unknown value
of some previously-tested attribute.

Shown for the best attribute is:

    cut  - threshold (continuous attributes only)
    inf  - the potential information of a split
    gain - the gain in information of a split
    val  - the gain or the gain/inf (depending on the
           selection criterion)

Also shown is the proportion of items at this point in the tree
with an unknown value for that attribute. Items with an unknown value
for the attribute being tested are distributed across all branches
in proportion to the relative frequency of the corresponding values in the
set of items being tested.
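The fractional-weight scheme just described can be sketched directly: an item whose value for the tested attribute is unknown is sent down every branch, carrying a fraction of its weight proportional to the frequency of each known value at that node. The function and data below are illustrative assumptions, not C4.5 code.

```python
# Sketch of the fractional-weight distribution described above.

def distribute(weight, value_frequencies):
    # value_frequencies: total weight of items with each known value
    # at this node; the unknown-valued item's weight is split in the
    # same proportions.
    total = sum(value_frequencies.values())
    return {v: weight * f / total for v, f in value_frequencies.items()}

# An item of weight 1.0 tested on an attribute whose known values occur
# with weights 6, 3 and 1 among the items at this node:
shares = distribute(1.0, {"red": 6.0, "green": 3.0, "blue": 1.0})
# shares: red 0.6, green 0.3, blue 0.1 - and each fraction becomes the
# item's weight in the corresponding subtree, so later weights fall
# below 1.0 exactly as the text describes.
```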

If no split gives a gain in information, the set of items is made
into a leaf labelled with the most frequent class of items reaching
this point in the tree, and the message:

no sensible splits
.IR r1 / r2

is given, where
.I r1
is the total weight of items reaching this point in the tree, and
.I r2
is the weight of those which do not belong to the class of this leaf.

If a subtree is found to misclassify
at least as many items as does replacing the subtree with a leaf, then
the subtree is replaced and the following message is given:

Collapse tree for
.I n
items to leaf
.I c

where
.I c
is the class assigned to the leaf.


.B Verbosity level 2

When determining the best attribute to test,
also shown are the threshold (continuous attributes only),
the information gain and the potential information for a split on
each of the attributes.
If a test on a continuous attribute has no gain, or there are
insufficient cases
with known values of the attribute on which
to base a test, appropriate messages are given.
(Sufficient here means at least twice MINOBJS, an integer
which defaults to 2 but can be set with option
.BR m ).
The average gain across all attributes is also shown.

If subset tests on discrete attributes are being used,
then for each attribute being examined, the combinations of
attribute values that are formed (i.e. at each stage, the
combination with the highest gain or gain ratio) and the
potential information, gain, and gain or gain ratio are shown.


.B Verbosity level 3

When determining the best attribute to test,
also shown is the frequency distribution table giving
the total weight of items of each class with:

    - each value of the attribute (discrete attributes), or
    - values below and above the threshold (continuous attributes), or
    - values in each subset formed so far (subset tests).
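The frequency distribution table for a discrete attribute can be sketched as follows: for every (class, value) pair, accumulate the total weight of the items. The triple layout used here is an assumption for the example.

```python
# Sketch of the class-by-value frequency table printed at verbosity
# level 3 for a discrete attribute. Items are (value, class, weight)
# triples; weights may be fractional for items with unknown values of
# previously-tested attributes.
from collections import defaultdict

def frequency_table(items):
    table = defaultdict(float)  # (class, value) -> total weight
    for value, cls, weight in items:
        table[(cls, value)] += weight
    return dict(table)

tab = frequency_table([
    ("sunny", "no", 1.0), ("sunny", "no", 0.5),
    ("rain", "yes", 1.0), ("sunny", "yes", 1.0),
])
# The "no" class has total weight 1.5 for value "sunny" (note the
# fractional contribution), and "yes" has weight 1.0 for "rain".
```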


.SH TREE PRUNING

.B Verbosity level 1

After the entire decision tree has been constructed,
.I C4.5
recursively
examines each subtree to determine whether replacing it with
a leaf or a branch would be beneficial.
(Note: the numbers treated below as counts of items actually
refer to the total weight of the items mentioned.)

Each leaf is shown as:

.IR c ( r1 : r2 /
.IR r3 )

with:
    \fIc\fR  - the most frequent class at the leaf
    \fIr1\fR - the number of items at the leaf
    \fIr2\fR - the number of misclassified items at the leaf
    \fIr3\fR - \fIr2\fR adjusted for additional errors

Each test is shown as:

.IR att :[ n1 "% N=" r4 tree=
.IR r5 leaf= r6 +
.IR r7 br[ n2 ]= r8 ]

with:
    \fIn1\fR - the percentage of items at this subtree that are misclassified
    \fIr4\fR - the number of items in the subtree
    \fIr5\fR - the number of misclassifications of this subtree
    \fIr6\fR - the number of misclassifications if this subtree were a leaf
    \fIr7\fR - the adjustment to \fIr6\fR for additional errors
    \fIn2\fR - the number of the largest branch
    \fIr8\fR - the total misclassifications if the subtree is replaced by its largest branch

If replacing the subtree with a leaf or with its largest branch
reduces the number of errors, then the subtree is replaced
by whichever of the two results in the fewer errors.
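In terms of the quantities shown in the test display, the decision is a three-way minimum. This is a simplification: it compares one adjusted error estimate per option, whereas C4.5 applies its additional-error adjustment throughout, and keeps the subtree on ties.

```python
# Sketch of the pruning decision described above: keep the subtree,
# replace it with a leaf, or replace it with its largest branch,
# whichever gives the fewest estimated errors.

def prune_choice(subtree_errors, leaf_errors, leaf_adjustment, branch_errors):
    options = {
        "keep subtree": subtree_errors,                      # r5
        "replace with leaf": leaf_errors + leaf_adjustment,  # r6 + r7
        "replace with largest branch": branch_errors,        # r8
    }
    return min(options, key=options.get)

choice = prune_choice(subtree_errors=4.0, leaf_errors=3.0,
                      leaf_adjustment=0.5, branch_errors=5.0)
# choice == "replace with leaf": 3.5 estimated errors beats 4.0 and 5.0.
```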


.SH THRESHOLD SOFTENING

.B Verbosity level 1

When softening the thresholds of tests on continuous attributes
(option
.BR p ),
upper and lower bounds for each test are calculated.
For each such test, the following are shown:
.IP " *" 4
Base errors - the number of items misclassified when the threshold has
its original value
.IP " *"
Items - the number of items tested (with a known value for this
attribute)
.IP " *"
se - the standard deviation of the number of errors
.HP 0
For each of the distinct attribute values, shown are:
.IP " *" 4
Val <= - the attribute value
.IP " *"
Errors - the number of errors with this value as the threshold
.IP " *"
+Errors - the increase in errors over Base errors (Errors minus Base errors)
.IP " *"
+Items - the number of items between this value and the original
threshold
.IP " *"
Ratio - the ratio of +Errors to +Items
.HP 0
The lower and upper bounds are then calculated so that the
number of errors with each bound as the threshold would be one standard
deviation above the base errors.
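The bound calculation can be sketched as a scan over the candidate values on one side of the original threshold: the soft bound is the first candidate whose error count reaches one standard deviation above the base errors. This is an illustrative simplification; C4.5 interpolates between candidate values (which is what the Ratio column supports) rather than picking a candidate exactly.

```python
# Sketch of the soft-bound calculation described above.

def soft_bound(candidates, base_errors, se):
    # candidates: (value, errors) pairs ordered moving away from the
    # original threshold on one side; se is the standard deviation of
    # the number of errors at the original threshold.
    target = base_errors + se
    for value, errors in candidates:
        if errors >= target:
            return value
    return candidates[-1][0]  # target never reached: farthest candidate

# Base errors 10 with se 2.0: the bound is the first candidate value
# whose error count reaches 12.
bound = soft_bound([(6.5, 10.5), (5.0, 11.0), (4.2, 12.5)],
                   base_errors=10.0, se=2.0)
# bound == 4.2
```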


.SH SEE ALSO

c4.5(1)