| 1 | .EN |
---|
| 2 | .TH C4.5 1 |
---|
| 3 | .SH NAME |
---|
| 4 | .PP |
---|
| 5 | c4.5 \- form a decision tree from a file of examples |
---|
| 6 | .SH SYNOPSIS |
---|
| 7 | .PP |
---|
| 8 | .B c4.5 |
---|
| 9 | [ \fB-f\fR filestem ] |
---|
| 10 | [ \fB-u\fR ] |
---|
| 11 | [ \fB-s\fR ] |
---|
| 12 | [ \fB-p\fR ] |
---|
| 13 | [ \fB-v\fR verb ] |
---|
| 14 | [ \fB-t\fR trials ] |
---|
| 15 | [ \fB-w\fR wsize ] |
---|
| 16 | [ \fB-i\fR incr ] |
---|
| 17 | [ \fB-g\fR ] |
---|
| 18 | [ \fB-m\fR minobjs ] |
---|
| 19 | [ \fB-c\fR cf ] |
---|
| 20 | .SH DESCRIPTION |
---|
| 21 | .PP |
---|
| 22 | .I C4.5 |
---|
| 23 | is a program for inducing classification rules in the form |
---|
| 24 | of decision trees from a set of given examples. |
---|
| 25 | .PP |
---|
| 26 | All files read and written by C4.5 are of the form |
---|
| 27 | .I filestem.ext |
---|
| 28 | where |
---|
| 29 | .I filestem |
---|
| 30 | is a file name stem that identifies the induction task and |
---|
| 31 | .I ext |
---|
| 32 | is an extension that defines the type of file. |
---|
| 33 | The program expects to find at least |
---|
| 34 | two files: a |
---|
| 35 | .B names file |
---|
| 36 | .I filestem.names |
---|
| 37 | defining class, attribute and attribute value names, and a |
---|
| 38 | .B data file |
---|
| 39 | .I filestem.data |
---|
| 40 | containing a set of objects, each of which is described by its |
---|
| 41 | values of each of the attributes and its class. |
---|
| 42 | .PP |
---|
| 43 | The program can generate trees |
---|
| 44 | in two ways. In |
---|
| 45 | .I batch |
---|
| 46 | mode (the default), the program generates a single tree |
---|
| 47 | using all the available data. |
---|
| 48 | In |
---|
| 49 | .I iterative |
---|
| 50 | mode, |
---|
| 51 | the program starts with a randomly-selected subset of the |
---|
| 52 | data (the |
---|
| 53 | .IR window ), |
---|
| 54 | generates a trial decision tree, adds some misclassified |
---|
| 55 | objects, and continues until the trial decision tree |
---|
| 56 | correctly classifies all objects not in the window or |
---|
| 57 | until it appears that no progress is being made. |
---|
| 58 | Since iterative mode starts with a randomly-selected subset, |
---|
| 59 | multiple trials with the same data can be used to generate |
---|
| 60 | more than one tree. |
---|
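The iterative mode described above can be sketched in Python as follows. This is only an illustration of the windowing loop, not the program's actual code: `build_tree` and `classify` are hypothetical stand-ins supplied by the caller, and the rule used here for "no progress" (the number of misclassified objects outside the window fails to decrease) is an assumption.

```python
import random

def windowing(objects, build_tree, classify, wsize, incr, seed=0):
    """Grow a window of training objects until the trial tree fits the rest.

    objects    -- list of (attribute_values, class_label) pairs
    build_tree -- hypothetical callable: list of objects -> tree
    classify   -- hypothetical callable: (tree, attribute_values) -> label
    wsize/incr -- initial window size and maximum objects added per cycle
    """
    rng = random.Random(seed)
    order = list(range(len(objects)))
    rng.shuffle(order)                      # randomly selected initial window
    window, rest = order[:wsize], order[wsize:]
    best_errors = None
    while True:
        tree = build_tree([objects[i] for i in window])
        wrong = [i for i in rest
                 if classify(tree, objects[i][0]) != objects[i][1]]
        # Stop when everything outside the window is classified correctly,
        # or (assumed rule) when the error count has stopped improving.
        if not wrong or (best_errors is not None and len(wrong) >= best_errors):
            return tree, window
        best_errors = len(wrong)
        added = wrong[:incr]                # add some misclassified objects
        window = window + added
        rest = [i for i in rest if i not in set(added)]
```

Because the initial window is drawn at random, calling the function with different `seed` values corresponds to running several trials on the same data.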
| 61 | .PP |
---|
| 62 | All trees generated in the process are saved in |
---|
| 63 | .IR filestem.unpruned . |
---|
| 64 | After each tree is generated, it is |
---|
| 65 | .I pruned |
---|
| 66 | in an attempt to simplify it. |
---|
| 67 | The `best' pruned tree (selected by the program if there is |
---|
| 68 | more than one trial) |
---|
| 69 | is saved in machine-readable form in |
---|
| 70 | .IR filestem.tree . |
---|
| 71 | .PP |
---|
| 72 | All trees produced, both pre- and post-simplification, are evaluated |
---|
| 73 | on the training data. If required, they can also be evaluated |
---|
| 74 | on unseen data in file |
---|
| 75 | .IR filestem.test . |
---|
| 76 | |
---|
| 77 | .SH FILE FORMATS |
---|
| 78 | The |
---|
| 79 | .B names file |
---|
| 80 | .I filestem.names |
---|
| 81 | is a series of entries defining names of attributes, |
---|
| 82 | attribute values and classes. The file is free-format |
---|
| 83 | with the exception that the vertical bar `|' causes the |
---|
| 84 | remainder of that line to be ignored. |
---|
| 85 | Each entry is terminated by a period which may be |
---|
| 86 | omitted if it is the last character of a line. |
---|
| 87 | .PP |
---|
| 88 | The file |
---|
| 89 | commences with the names of the classes, separated by |
---|
| 90 | commas and terminated with a period. Each name consists of |
---|
| 91 | a string of characters that does not include comma, question mark |
---|
| 92 | or colon (unless preceded by a backslash). A period may be |
---|
| 93 | embedded in a name provided it is not followed by a space. |
---|
| 94 | Embedded spaces are also permitted but multiple whitespace is |
---|
| 95 | replaced by a single space. |
---|
| 96 | The rest of the file consists of a single entry for each |
---|
| 97 | attribute. An attribute entry begins with the attribute name |
---|
| 98 | followed by a colon, and then either the word `ignore' (indicating |
---|
| 99 | that this attribute should not be used), the word `continuous' |
---|
| 100 | (indicating that the attribute has real values), |
---|
| 101 | the word `discrete' followed by an integer |
---|
| 102 | .I n |
---|
| 103 | (indicating that the program should assemble |
---|
| 104 | a list of up to |
---|
| 105 | .I n |
---|
| 106 | possible values), or a list |
---|
| 107 | of all possible discrete values separated by commas. (This last |
---|
| 108 | form for discrete attributes is recommended as it |
---|
| 109 | enables input to be checked.) Each |
---|
| 110 | entry is terminated with a period (but see above). |
---|
| 111 | .PP |
---|
| 112 | The |
---|
| 113 | .B data file |
---|
| 114 | .I filestem.data |
---|
| 115 | contains one line per object. Each line contains |
---|
| 116 | the values of the attributes in order followed by the |
---|
| 117 | object's class, with all entries separated by commas. |
---|
| 118 | The rules for valid names in the |
---|
| 119 | .B names file |
---|
| 120 | also hold for the names in the |
---|
| 121 | .BR "data file" . |
---|
| 122 | An unknown value of an attribute is indicated by a |
---|
| 123 | question mark `?'. |
---|
| 124 | If a |
---|
| 125 | .B test file |
---|
| 126 | .I filestem.test |
---|
| 127 | is used, it has the same format as the data file. |
---|
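As an illustration of the formats described above, the following Python sketch shows a small names file and data file and a reader for the data lines. The stem "golf", the attribute names and the values are all invented for the example, and `parse_data` is a hypothetical helper, not part of C4.5. For simplicity it does not handle backslash-escaped commas.

```python
# Hypothetical task files for a stem called "golf"; the contents are invented.
NAMES = """\
Play, Don't Play.            | classes

outlook: sunny, overcast, rain.
temperature: continuous.
windy: true, false.
"""

DATA = """\
sunny, 85, false, Don't Play
overcast, ?, true, Play
"""

def parse_data(text, n_attributes):
    """Split each non-empty line into attribute values and a class.

    Simplification: backslash-escaped commas are not handled.
    An unknown value, written `?', is returned as None.
    """
    objects = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Collapse runs of whitespace inside each comma-separated field.
        fields = [" ".join(f.split()) for f in line.split(",")]
        if len(fields) != n_attributes + 1:
            raise ValueError("expected %d fields: %r" % (n_attributes + 1, line))
        values = [None if f == "?" else f for f in fields[:-1]]
        objects.append((values, fields[-1]))
    return objects

objects = parse_data(DATA, n_attributes=3)
```

Each element of `objects` pairs the attribute values, in the order the attributes appear in the names file, with the object's class.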
| 128 | |
---|
| 129 | .SH OPTIONS |
---|
| 130 | Options and their meanings are: |
---|
| 131 | .PP |
---|
| 132 | .TP 12 |
---|
| 133 | .BI \-f filestem\^ |
---|
| 134 | Specify the filename stem (default |
---|
| 135 | .BR DF ). |
---|
| 136 | .TP |
---|
| 137 | .B \-u |
---|
| 138 | Evaluate trees produced on unseen cases in file |
---|
| 139 | .IR filestem.test . |
---|
| 140 | .TP |
---|
| 141 | .B \-s |
---|
| 142 | Force `subsetting' of all tests based on discrete attributes |
---|
| 143 | with more than two values. C4.5 will construct a test with |
---|
| 144 | a subset of values associated with each branch. |
---|
| 145 | .TP |
---|
| 146 | .B \-p |
---|
| 147 | Use probabilistic thresholds for continuous attributes (see Quinlan, 1987a). |
---|
| 148 | .TP |
---|
| 149 | .BI \-t trials\^ |
---|
| 150 | Select iterative mode with the specified number of trials. |
---|
| 151 | .TP |
---|
| 152 | .BI \-v verb\^ |
---|
| 153 | Set the verbosity level [0-3] (default 0). |
---|
| 154 | This option generates more voluminous output that may help to |
---|
| 155 | explain what the program is doing (but don't count on it); |
---|
| 156 | see the manual entry for |
---|
| 157 | .IR verbose . |
---|
| 158 | .PP |
---|
| 159 | The following options are also available but need not |
---|
| 160 | be used except for experimentation with tree construction: |
---|
| 161 | .TP 12 |
---|
| 162 | .BI \-w wsize\^ |
---|
| 163 | Set the size of the initial window |
---|
| 164 | (default is the larger of 20 percent of the number of data |
---|
| 165 | objects and twice the square root of that number). |
---|
| 166 | .TP |
---|
| 167 | .BI \-i incr\^ |
---|
| 168 | Set the maximum number of objects that can be |
---|
| 169 | added to the window at each iteration |
---|
| 170 | (default is 20 percent of the initial window size). |
---|
| 171 | .TP |
---|
| 172 | .B \-g |
---|
| 173 | Use the gain criterion to select tests. The default |
---|
| 174 | uses the gain ratio criterion. |
---|
| 175 | .TP |
---|
| 176 | .BI \-m minobjs\^ |
---|
| 177 | In all tests, at least two branches must contain a minimum number |
---|
| 178 | of objects (default 2). This option allows the minimum |
---|
| 179 | number to be altered. |
---|
| 180 | .TP |
---|
| 181 | .BI \-c cf\^ |
---|
| 182 | Set the pruning confidence level (default 25%). |
---|
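The defaults for the window options above can be worked through with a short Python sketch. The function names are hypothetical, and the integer truncation and the floor of one object per iteration are assumptions. This page states only the percentages and the square-root rule.

```python
import math

def default_window(n_objects):
    """Default initial window: the larger of 20% of the data and 2*sqrt(n)."""
    return max(n_objects // 5, round(2 * math.sqrt(n_objects)))

def default_increment(wsize):
    """Default growth limit per iteration: 20% of the initial window size."""
    return max(1, wsize // 5)   # the floor of 1 is an assumption
```

For small data sets the square-root term dominates (25 objects give a window of 10), while for large ones the 20 percent term does (1000 objects give a window of 200).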
| 183 | .SH FILES |
---|
| 184 | .PP |
---|
| 185 | .in 8 |
---|
| 186 | c4.5 |
---|
| 187 | .br |
---|
| 188 | filestem.data |
---|
| 189 | .br |
---|
| 190 | filestem.names |
---|
| 191 | .br |
---|
| 192 | filestem.unpruned (unpruned trees) |
---|
| 193 | .br |
---|
| 194 | filestem.tree (final decision tree) |
---|
| 195 | .br |
---|
| 196 | filestem.test (unseen data) |
---|
| 197 | .in 0 |
---|
| 198 | .PP |
---|
| 199 | .SH SEE ALSO |
---|
| 200 | .PP |
---|
| 201 | consult(1) |
---|
| 202 | .PP |
---|
| 203 | .SH BUGS |
---|