.EN
.TH C4.5 1
.SH NAME
.PP
c4.5 \- form a decision tree from a file of examples
.SH SYNOPSIS
.PP
.B c4.5
[ \fB-f\fR filestem ]
[ \fB-u\fR ]
[ \fB-s\fR ]
[ \fB-p\fR ]
[ \fB-v\fR verb ]
[ \fB-t\fR trials ]
[ \fB-w\fR wsize ]
[ \fB-i\fR incr ]
[ \fB-g\fR ]
[ \fB-m\fR minobjs ]
[ \fB-c\fR cf ]
.SH DESCRIPTION
.PP
.I C4.5
is a program for inducing classification rules in the form
of decision trees from a set of given examples.
.PP
All files read and written by C4.5 are of the form
.I filestem.ext
where
.I filestem
is a file name stem that identifies the induction task and
.I ext
is an extension that defines the type of file.
The program expects to find at least
two files: a
.B names file
.I filestem.names
defining class, attribute and attribute value names, and a
.B data file
.I filestem.data
containing a set of objects, each of which is described by its
values of each of the attributes and its class.
.PP
The program can generate trees
in two ways. In
.I batch
mode (the default), the program generates a single tree
using all the available data.
In
.I iterative
mode,
the program starts with a randomly-selected subset of the
data (the
.IR window ),
generates a trial decision tree, adds some misclassified
objects, and continues until the trial decision tree
correctly classifies all objects not in the window or
until it appears that no progress is being made.
Since iterative mode starts with a randomly-selected subset,
multiple trials with the same data can be used to generate
more than one tree.
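.PP
The iterative mode described above can be sketched as follows.
This is a minimal illustration in Python, not the program's own code:
the names
.I windowing
and
.I build_tree
are hypothetical, the tree builder is supplied by the caller, and a fixed
cap on the number of rounds stands in for the unspecified `no progress' test.
.PP
.nf
```python
import random


def windowing(objects, build_tree, window_size, increment, seed=0, max_rounds=50):
    """Grow a window of training objects until the trial tree fits the rest.

    objects is a list of (attribute_values, class_name) pairs.
    build_tree is a caller-supplied function that takes a list of such pairs
    and returns a function mapping attribute_values to a class name.
    """
    rng = random.Random(seed)
    order = list(range(len(objects)))
    rng.shuffle(order)                      # randomly selected initial subset
    window = order[:window_size]
    outside = order[window_size:]

    classify = None
    for _ in range(max_rounds):             # assumed stand-in for "no progress"
        classify = build_tree([objects[i] for i in window])
        wrong = [i for i in outside if classify(objects[i][0]) != objects[i][1]]
        if not wrong:
            break                           # every object outside the window is correct
        added = wrong[:increment]           # add some misclassified objects
        window.extend(added)
        added_set = set(added)
        outside = [i for i in outside if i not in added_set]
    return classify, [objects[i] for i in window]
```
.fi
.PP
Changing
.I seed
changes the initial window, which is why repeated trials on the same
data can yield different trees.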
.PP
All trees generated in the process are saved in
.IR filestem.unpruned .
After each tree is generated, it is
.I pruned
in an attempt to simplify it.
The `best' pruned tree (selected by the program if there is
more than one trial)
is saved in machine-readable form in
.IR filestem.tree .
.PP
All trees produced, both pre- and post-simplification, are evaluated
on the training data. If the
.B \-u
option is given, they are also evaluated
on unseen data in file
.IR filestem.test .
.SH FILE FORMATS
The
.B names file
.I filestem.names
is a series of entries defining names of attributes,
attribute values and classes. The file is free-format
with the exception that the vertical bar `|' causes the
remainder of that line to be ignored.
Each entry is terminated by a period, which may be
omitted if it would be the last character of a line.
.PP
The file
commences with the names of the classes, separated by
commas and terminated with a period. Each name consists of
a string of characters that does not include comma, question mark
or colon (unless preceded by a backslash). A period may be
embedded in a name provided it is not followed by a space.
Embedded spaces are also permitted but multiple whitespace is
replaced by a single space.
The rest of the file consists of a single entry for each
attribute. An attribute entry begins with the attribute name
followed by a colon, and then either the word `ignore' (indicating
that this attribute should not be used), the word `continuous'
(indicating that the attribute has real values),
the word `discrete' followed by an integer
.I n
(indicating that the program should assemble
a list of up to
.I n
possible values), or a list
of all possible discrete values separated by commas. (The last
form, an explicit list of values, is recommended for discrete
attributes as it enables input to be checked.) Each
entry is terminated with a period (but see above).
.PP
The
.B data file
.I filestem.data
contains one line per object. Each line contains
the values of the attributes in order followed by the
object's class, with all entries separated by commas.
The rules for valid names in the
.B names file
also hold for the names in the
.BR "data file" .
An unknown value of an attribute is indicated by a
question mark `?'.
If a
.B test file
.I filestem.test
is used, it has the same format as the data file.
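.PP
The data file layout described above can be illustrated with a short
Python sketch. The function name
.I parse_data_line
is hypothetical and the code is not part of C4.5. It assumes that no entry
contains a backslash-escaped comma and it leaves every value as a string,
since attribute types are only known from the names file.
.PP
.nf
```python
def parse_data_line(line, n_attributes):
    """Split one data-file line into (attribute_values, class_name).

    Unknown values, written as a question mark, become None.
    Runs of whitespace inside an entry collapse to a single space.
    Assumption: no entry contains a backslash-escaped comma.
    """
    entries = [" ".join(field.split()) for field in line.split(",")]
    if len(entries) != n_attributes + 1:
        raise ValueError(
            "expected %d entries, got %d" % (n_attributes + 1, len(entries))
        )
    values = [None if entry == "?" else entry for entry in entries[:-1]]
    return values, entries[-1]
```
.fi
.PP
Given a made-up line such as `sunny, 85, ?, false, no play' and four
attributes, the sketch returns the four attribute values, with the unknown
third value as None, together with the class `no play'.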
.SH OPTIONS
Options and their meanings are:
.PP
.TP 12
.BI \-f " filestem"
Specify the filename stem (default
.BR DF ).
.TP
.B \-u
Evaluate trees produced on unseen cases in file
.IR filestem.test .
.TP
.B \-s
Force `subsetting' of all tests based on discrete attributes
with more than two values. C4.5 will construct a test with
a subset of values associated with each branch.
.TP
.B \-p
Use probabilistic thresholds for continuous attributes (see Quinlan, 1987a).
.TP
.BI \-t " trials"
Set iterative mode with specified number of trials.
.TP
.BI \-v " verb"
Set the verbosity level [0-3] (default 0).
This option generates more voluminous output that may help to
explain what the program is doing (but don't count on it);
see the manual entry for
.IR verbose .
.PP
The following options are also available but need not
be used except for experimentation with tree construction:
.TP 12
.BI \-w " wsize"
Set the size of the initial window
(default is the maximum of 20 percent of the number of data
objects and twice the square root of that number).
.TP
.BI \-i " incr"
Set the maximum number of objects that can be
added to the window at each iteration
(default is 20 percent of the initial window size).
.TP
.B \-g
Use the gain criterion to select tests. The default
uses the gain ratio criterion.
.TP
.BI \-m " minobjs"
In all tests, at least two branches must contain a minimum number
of objects (default 2). This option allows the minimum
number to be altered.
.TP
.BI \-c " cf"
Set the pruning confidence level (default 25%).
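.PP
The defaults stated above for
.B \-w
and
.B \-i
can be written out as arithmetic. The Python sketch below is only an
illustration: the function names are hypothetical, and rounding down to a
whole number of objects is an assumption, since the text above does not
say how fractions are handled.
.PP
.nf
```python
import math


def default_window(n_objects):
    """Larger of 20 percent of the data and twice the square root of its size."""
    # Assumption: both terms are truncated to whole objects.
    return max(n_objects // 5, int(2 * math.sqrt(n_objects)))


def default_increment(window_size):
    """20 percent of the initial window size, truncated (assumed rounding)."""
    return window_size // 5
```
.fi
.PP
The two terms of the window default are equal at 100 objects. With fewer
objects the square-root term is the larger one, and with more objects the
20 percent term is.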
.SH FILES
.PP
.in 8
c4.5
.br
filestem.data
.br
filestem.names
.br
filestem.unpruned (unpruned trees)
.br
filestem.tree (final decision tree)
.br
filestem.test (unseen data)
.in 0
.PP
.SH SEE ALSO
.PP
consult(1)
.PP
.SH BUGS