.TH C4.5 1
.SH NAME
A guide to the verbose output of the C4.5 decision tree generator

.SH DESCRIPTION
This document explains the output of the program
.I C4.5
when it is run with the verbosity level (option
.BR \-v )
set to values from 1 to 3.

.SH TREE BUILDING

.B Verbosity level 1

To build a decision tree from a set of data items, each of which belongs
to one of a set of classes,
.I C4.5
proceeds as follows:
.IP " 1." 7
If all items belong to the same class, the decision
tree is a leaf which is labelled with this class.
.IP " 2."
Otherwise,
.I C4.5
attempts to find the best attribute
to test in order to divide the data items into
subsets, and then builds a subtree from each subset
by invoking this procedure recursively (see the sketch below).
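.PP
The recursion is the classic divide-and-conquer scheme. The following
is a minimal sketch of it in Python; the names (build_tree, best_split,
majority_class) are illustrative only and do not appear in the C4.5
sources.
.nf

    from collections import Counter

    def majority_class(items):
        # items: list of (attribute_values, class_label) pairs
        return Counter(c for _, c in items).most_common(1)[0][0]

    def build_tree(items, best_split):
        # best_split(items) returns (test, subsets) or None; a
        # gain-based version is sketched below.
        labels = [c for _, c in items]
        if len(set(labels)) == 1:         # step 1: one class, so leaf
            return ("leaf", labels[0])
        split = best_split(items)         # step 2: choose the best test
        if split is None:                 # no useful test: majority leaf
            return ("leaf", majority_class(items))
        test, subsets = split
        return ("node", test,
                [build_tree(s, best_split) for s in subsets])

.fi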
.HP 0
The best attribute to branch on at each stage is selected by
determining the information gain of a split on each of the attributes.
If the selection criterion being used is GAIN (option
.BR \-g ),
the best
attribute is the one that divides the data items with the highest gain
in information, whereas if the GAINRATIO criterion (the default) is
being used (and the gain is at least the average gain across all
attributes), the best attribute is the one with the highest ratio of
information gain to potential information.

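.PP
Both criteria derive from the entropy of the class distribution.
The following minimal sketch (in Python, with unit item weights for
simplicity, whereas C4.5 itself uses the fractional weights described
below) computes the potential information, gain and gain ratio of a
candidate partition:
.nf

    from collections import Counter
    from math import log2

    def info(labels):
        # Entropy of a class distribution, in bits.
        n = len(labels)
        return -sum(k / n * log2(k / n)
                    for k in Counter(labels).values())

    def gain_and_ratio(labels, subsets):
        # 'subsets' is the partition of 'labels' induced by a test.
        n = len(labels)
        inf = -sum(len(s) / n * log2(len(s) / n)
                   for s in subsets if s)         # potential info
        gain = info(labels) - sum(len(s) / n * info(s)
                                  for s in subsets if s)
        ratio = gain / inf if inf > 0 else 0.0
        return inf, gain, ratio

.fi
Under GAIN the attribute with the highest gain is selected; under
GAINRATIO, among attributes whose gain is at least the average gain,
the one with the highest ratio is selected.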

For discrete-valued attributes, a branch corresponding to each value of
the attribute is formed, whereas for continuous-valued attributes, a
threshold is found, thus forming two branches (see the sketch below).
If subset tests are being used (option
.BR \-s ),
branches may be formed, each
corresponding to a subset of the values of the discrete attribute being tested.

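.PP
A minimal sketch of the threshold search for a continuous attribute,
reusing gain_and_ratio from the sketch above (best_threshold is an
illustrative name; note that C4.5 reports the threshold at an actual
data value rather than at a midpoint):
.nf

    def best_threshold(pairs):
        # pairs: (numeric value, class label) with known values.
        # Returns (cut, inf, gain) of the best binary split, or None.
        pairs = sorted(pairs, key=lambda p: p[0])
        labels = [c for _, c in pairs]
        best = None
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                 # no cut between equal values
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            inf, g, _ = gain_and_ratio(labels,
                                       [labels[:i], labels[i:]])
            if best is None or g > best[2]:
                best = (cut, inf, g)
        return best

.fi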

The verbose output shows the number of items from which a tree is being
constructed, as well as the total weight of these items. The weight
of an item is the probability that the item would reach this point in the
tree; it will be less than 1.0 for items with an unknown value
of some previously-tested attribute.

For the best attribute, the following are shown:

.nf
    cut  - threshold (continuous attributes only)
    inf  - the potential information of a split
    gain - the gain in information of a split
    val  - the gain or the gain/inf (depending on the
           selection criterion)
.fi

Also shown is the proportion of items at this point in the tree
with an unknown value for that attribute. Items with an unknown value
for the attribute being tested are distributed across all values
in proportion to the relative frequency of these values in the
set of items being tested (see the sketch below).

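.PP
The weighting scheme can be sketched as follows (split_with_unknowns
and value_of are illustrative names, not C4.5 routines; each item
carries an explicit weight):
.nf

    def split_with_unknowns(items, value_of, branches):
        # items: (weight, item) pairs reaching this node;
        # value_of(item) gives a branch key, or None if unknown.
        known = [(w, it) for w, it in items
                 if value_of(it) is not None]
        unknown = [(w, it) for w, it in items
                   if value_of(it) is None]
        total = sum(w for w, _ in known)   # assumed non-zero here
        subsets = {}
        for b in branches:
            sub = [(w, it) for w, it in known if value_of(it) == b]
            freq = sum(w for w, _ in sub) / total
            # Unknown-value items go down every branch, with weight
            # scaled by that branch's relative frequency.
            subsets[b] = sub + [(w * freq, it) for w, it in unknown]
        return subsets

.fi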
If no split gives a gain in information, the set of items is made
into a leaf labelled with the most frequent class of the items reaching
this point in the tree, and the message
.IP
no sensible splits  \fIr1\fR/\fIr2\fR
.PP
is given, where
.I r1
is the total weight of the items reaching this point in the tree, and
.I r2
is the weight of those that do not belong to the class of this leaf.

If a subtree is found to misclassify
at least as many items as would replacing the subtree with a leaf,
the subtree is replaced and the following message is given:
.IP
Collapse tree for \fIn\fR items to leaf \fIc\fR
.PP
where
.I c
is the class assigned to the leaf.

.B Verbosity level 2

When determining the best attribute to test,
also shown are the threshold (continuous attributes only),
information gain and potential information for a split on
each of the attributes.
If a test on a continuous attribute has no gain, or there are
insufficient cases with known values of the attribute on which
to base a test, appropriate messages are given.
(Sufficient here means at least twice MINOBJS, an integer
which defaults to 2 but can be set with option
.BR \-m .)
The average gain across all attributes is also shown.

If subset tests on discrete attributes are being used, then for each
attribute examined, the output shows the combinations of attribute
values that are formed (i.e. at each stage, the combination with the
highest gain or gain ratio), together with the potential information,
the gain, and the value of the criterion (gain or gain ratio);
a sketch of this greedy search follows.
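.PP
The search over value subsets is greedy, which is what the staged
output reflects. A sketch of the idea only (greedy_subsets and
eval_partition are illustrative names; eval_partition stands for
whichever selection criterion is in force):
.nf

    def greedy_subsets(values, eval_partition):
        # Start with one group per value, then repeatedly merge the
        # pair of groups that most improves the criterion.
        partition = [{v} for v in values]
        best = eval_partition(partition)
        improved = True
        while improved and len(partition) > 2:
            improved = False
            for i in range(len(partition)):
                for j in range(i + 1, len(partition)):
                    trial = ([g for k, g in enumerate(partition)
                              if k not in (i, j)]
                             + [partition[i] | partition[j]])
                    score = eval_partition(trial)
                    if score > best:
                        best, keep, improved = score, trial, True
            if improved:
                partition = keep
        return partition, best

.fi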

.B Verbosity level 3

When determining the best attribute to test,
also shown is the frequency distribution table giving
the total weight of items of each class with:
.IP " -" 4
each value of the attribute (discrete attributes), or
.IP " -"
values below and above the threshold (continuous attributes), or
.IP " -"
values in each subset formed so far (subset tests).
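.PP
The table is just a weight total indexed by class and value group;
a minimal sketch (freq_table is an illustrative name):
.nf

    def freq_table(rows):
        # rows: (weight, value_group, class_label) triples, where a
        # value group is a value, a side of the threshold, or a subset.
        table = {}
        for w, v, c in rows:
            table.setdefault(c, {})
            table[c][v] = table[c].get(v, 0.0) + w
        return table        # table[class][group] = total weight

.fi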

.SH TREE PRUNING

.B Verbosity level 1

After the entire decision tree has been constructed,
.I C4.5
recursively
examines each subtree to determine whether replacing it with
a leaf or with a branch would be beneficial.
(Note: the numbers treated below as counts of items actually
refer to the total weight of the items mentioned.)

Each leaf is shown as:
.nf

    \fIc\fR (\fIr1\fR:\fIr2\fR/\fIr3\fR)

.fi
with:
.nf
    \fIc\fR  - the most frequent class at the leaf
    \fIr1\fR - the number of items at the leaf
    \fIr2\fR - the misclassifications at the leaf
    \fIr3\fR - \fIr2\fR adjusted for additional errors
.fi

Each test is shown as:
.nf

    \fIatt\fR:[\fIn1\fR%  N=\fIr4\fR  tree=\fIr5\fR  leaf=\fIr6\fR+\fIr7\fR  br[\fIn2\fR]=\fIr8\fR]

.fi
with:
.nf
    \fIn1\fR - percentage of the examples at this subtree that
         are misclassified
    \fIr4\fR - the number of items in the subtree
    \fIr5\fR - misclassifications of this subtree
    \fIr6\fR - misclassifications if this were a leaf
    \fIr7\fR - adjustment to \fIr6\fR for additional errors
    \fIn2\fR - number of the largest branch
    \fIr8\fR - total misclassifications if the subtree were
         replaced by its largest branch
.fi

If replacing the subtree with a leaf or with its largest branch
reduces the number of errors, the subtree is replaced
by whichever of the two gives the fewest errors.
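.PP
The adjustment for additional errors (\fIr3\fR and \fIr7\fR above) is
derived from the upper confidence limit of the binomial error rate,
using a default confidence level of 25%. The sketch below uses a plain
normal approximation; C4.5's own routine treats small and zero error
counts specially:
.nf

    from math import sqrt

    def added_errors(n, e, z=0.674):
        # n items at a leaf, e of them misclassified; z = 0.674 is
        # roughly the one-sided normal deviate for a 25% confidence
        # level.  Returns the estimated extra errors on unseen items.
        if n == 0:
            return 0.0
        f = e / n
        upper = f + z * sqrt(f * (1 - f) / n)
        return min(n, upper * n) - e

.fi
For example, added_errors(16, 1) comes to about 0.65 extra errors
beyond the one observed.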

.SH THRESHOLD SOFTENING

.B Verbosity level 1

In softening the thresholds of tests on continuous attributes
(option
.BR \-p ),
upper and lower bounds for each test are calculated.
For each such test, the following are shown:
.IP " *" 4
Base errors - the number of items misclassified when the threshold has
its original value
.IP " *"
Items - the number of items tested (with a known value for this
attribute)
.IP " *"
se - the standard deviation of the number of errors
.HP 0
For each of the distinct attribute values, the following are shown:
.IP " *" 4
Val <= - the attribute value
.IP " *"
Errors - the number of errors with this value as the threshold
.IP " *"
+Errors - the increase in errors over Base errors (Errors minus
Base errors)
.IP " *"
+Items - the number of items between this value and the original
threshold
.IP " *"
Ratio - the ratio of +Errors to +Items
.HP 0
The lower and upper bounds are then calculated so that the
number of errors with each as the threshold would be one standard
deviation above the base errors (see the sketch below).
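.PP
Here se is the standard deviation of a binomially distributed error
count. A hedged sketch of the bound computation (soften is an
illustrative name; C4.5 interpolates between adjacent attribute values
using the Ratio column, which this sketch omits):
.nf

    from math import sqrt

    def soften(base_errors, n_items, candidates):
        # candidates: (value, errors) for each attribute value tried
        # as a threshold, sorted by value.
        se = sqrt(base_errors * (n_items - base_errors) / n_items)
        target = base_errors + se
        within = [v for v, err in candidates if err <= target]
        lower = min(within) if within else None
        upper = max(within) if within else None
        return se, lower, upper

.fi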

.SH SEE ALSO

c4.5(1)