.EN
.TH C4.5 1
.SH NAME
.PP
c4.5 \- form a decision tree from a file of examples
.SH SYNOPSIS
.PP
.B c4.5
[ \fB-f\fR filestem ]
[ \fB-u\fR ]
[ \fB-s\fR ]
[ \fB-p\fR ]
[ \fB-v\fR verb ]
[ \fB-t\fR trials ]
   [ \fB-w\fR wsize ]
[ \fB-i\fR incr ]
[ \fB-g\fR ]
[ \fB-m\fR minobjs ]
[ \fB-c\fR cf ]
.SH DESCRIPTION
.PP
.I C4.5
is a program for inducing classification rules in the form
of decision trees from a set of given examples.
.PP
All files read and written by C4.5 are of the form
.I filestem.ext
where
.I filestem
is a file name stem that identifies the induction task and
.I ext
is an extension that defines the type of file.
The program expects to find at least
two files: a
.B names file
.I filestem.names
defining class, attribute and attribute value names, and a
.B data file
.I filestem.data
containing a set of objects, each of which is described by its
value for each of the attributes and by its class.
.PP
The program can generate trees
in two ways.  In
.I batch
mode (the default), the program generates a single tree
using all the available data.
In
.I iterative
mode,
the program starts with a randomly-selected subset of the
data (the
.IR window ),
generates a trial decision tree, adds some misclassified
objects to the window, and continues until the trial decision tree
correctly classifies all objects not in the window or
until it appears that no progress is being made.
Since iterative mode starts with a randomly-selected subset,
multiple trials with the same data can be used to generate
more than one tree.
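.PP
The iterative (windowing) loop described above can be sketched in Python as follows.
This is an illustration only, not the program's own code: the names
windowing, default_window_size, train and classify are hypothetical,
the rounding of the default sizes is an assumption, and the
"no progress" test is replaced here by a simple iteration cap.
The default window size and increment follow the values documented
for the \-w and \-i options.
.nf
```python
import math
import random


def default_window_size(n):
    """Larger of 20% of n and twice sqrt(n); the rounding is an assumption."""
    size = int(max(0.2 * n, 2 * math.sqrt(n)))
    return max(1, min(n, size))


def windowing(objects, train, classify, wsize=None, incr=None, seed=0, max_iter=100):
    """objects: list of (attribute_values, class) pairs.

    train(window) returns a model; classify(model, attribute_values)
    returns a class.  Both are supplied by the caller (hypothetical API).
    """
    n = len(objects)
    wsize = default_window_size(n) if wsize is None else min(wsize, n)
    incr = max(1, wsize // 5) if incr is None else incr  # 20% of the initial window

    order = list(range(n))
    random.Random(seed).shuffle(order)       # randomly selected initial window
    window_idx, rest_idx = order[:wsize], order[wsize:]

    model = train([objects[i] for i in window_idx])
    for _ in range(max_iter):                # crude stand-in for "no progress"
        wrong = [i for i in rest_idx
                 if classify(model, objects[i][0]) != objects[i][1]]
        if not wrong:
            break                            # all objects outside the window are correct
        moved = wrong[:incr]                 # add some misclassified objects
        moved_set = set(moved)
        window_idx = window_idx + moved
        rest_idx = [i for i in rest_idx if i not in moved_set]
        model = train([objects[i] for i in window_idx])
    return model, [objects[i] for i in window_idx]
```
.fi
Because the initial window depends on the seed, calling the function
with different seeds mirrors the way multiple trials can yield
different trees.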
.PP
All trees generated in the process are saved in
.IR filestem.unpruned .
After each tree is generated, it is
.I pruned
in an attempt to simplify it.
The `best' pruned tree (selected by the program if there is
more than one trial)
is saved in machine-readable form in
.IR filestem.tree .
.PP
All trees produced, both pre- and post-simplification, are evaluated
on the training data.  If required, they can also be evaluated
on unseen data in file
.IR filestem.test .
.SH FILE FORMATS
The
.B names file
.I filestem.names
is a series of entries defining names of attributes,
attribute values and classes.  The file is free-format
with the exception that the vertical bar `|' causes the
remainder of that line to be ignored.
Each entry is terminated by a period, which may be
omitted if it would be the last character of a line.
.PP
The file
commences with the names of the classes, separated by
commas and terminated with a period.  Each name consists of
a string of characters that does not include a comma, question mark
or colon (unless preceded by a backslash).  A period may be
embedded in a name provided it is not followed by a space.
Embedded spaces are also permitted, but any run of whitespace is
replaced by a single space.
The rest of the file consists of a single entry for each
attribute.  An attribute entry begins with the attribute name
followed by a colon, and then either the word `ignore' (indicating
that this attribute should not be used), the word `continuous'
(indicating that the attribute has real values),
the word `discrete' followed by an integer
.I n
(indicating that the program should assemble
a list of up to
.I n
possible values), or a list
of all possible discrete values separated by commas.  (This last
form for discrete attributes is recommended as it
enables input to be checked.)  Each
entry is terminated with a period (but see above).
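.PP
As an illustration of the names file layout, the following Python
sketch reads a simplified subset of the format.  It is not the reader
used by the program: the function name parse_names and the returned
structure are hypothetical, and the sketch assumes one entry per line,
no backslash escapes, and no names that end in a period.
.nf
```python
def parse_names(text):
    """Return (classes, attributes) from the text of a names file.

    attributes maps each attribute name to one of:
      ("ignore",), ("continuous",), ("discrete", n), ("values", [v1, v2, ...])
    """
    entries = []
    for line in text.splitlines():
        line = line.split("|", 1)[0]       # '|' makes the rest of the line a comment
        line = " ".join(line.split())      # collapse runs of whitespace
        if line.endswith("."):
            line = line[:-1].rstrip()      # the period is optional at end of line
        if line:
            entries.append(line)

    # The first entry lists the class names, separated by commas.
    classes = [name.strip() for name in entries[0].split(",")]

    attributes = {}
    for entry in entries[1:]:
        name, spec = entry.split(":", 1)
        spec = spec.strip()
        words = spec.split()
        if spec in ("ignore", "continuous"):
            kind = (spec,)
        elif len(words) == 2 and words[0] == "discrete" and words[1].isdigit():
            kind = ("discrete", int(words[1]))
        else:
            kind = ("values", [v.strip() for v in spec.split(",")])
        attributes[name.strip()] = kind
    return classes, attributes
```
.fi
Listing the discrete values explicitly, as recommended above, is what
lets a reader of this kind reject a data value that is not in the list.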
.PP
The
.B data file
.I filestem.data
contains one line per object.  Each line contains
the values of the attributes in order followed by the
object's class, with all entries separated by commas.
The rules for valid names in the
.B names file
also hold for the names in the
.BR "data file" .
An unknown value of an attribute is indicated by a
question mark `?'.
If a
.B test file
.I filestem.test
is used, it has the same format as the data file.
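.PP
A line of the data file can be split into attribute values and a class
along the following lines.  This Python sketch is illustrative only:
the function names parse_data_line and read_data are hypothetical,
values are kept as strings (converting continuous values would need
the names file), and backslash escapes are assumed not to occur.
.nf
```python
def parse_data_line(line):
    """Return (attribute_values, class_name) for one line of a data file."""
    # Entries are separated by commas; runs of whitespace collapse to one space.
    fields = [" ".join(field.split()) for field in line.split(",")]
    *values, class_name = fields           # the class is the last entry
    # A question mark marks an unknown attribute value.
    values = [None if value == "?" else value for value in values]
    return values, class_name


def read_data(text):
    """Parse every non-blank line of a data (or test) file."""
    return [parse_data_line(line) for line in text.splitlines() if line.strip()]
```
.fi
The same functions apply to a test file, since it shares the data
file format.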
.SH OPTIONS
Options and their meanings are:
.PP
.TP 12
.BI \-f filestem\^
Specify the filename stem (default
.BR DF ).
.TP
.B \-u
Evaluate trees produced on unseen cases in file
.IR filestem.test .
.TP
.B \-s
Force `subsetting' of all tests based on discrete attributes
with more than two values.  C4.5 will construct a test with
a subset of values associated with each branch.
.TP
.B \-p
Use probabilistic thresholds for continuous attributes (see Quinlan, 1987a).
.TP
.BI \-t trials\^
Set iterative mode with the specified number of trials.
.TP
.BI \-v verb\^
Set the verbosity level [0-3] (default 0).
This option generates more voluminous output that may help to
explain what the program is doing (but don't count on it);
see the manual entry for
.IR verbose .
.PP
The following options are also available but need not
be used except for experimentation with tree construction:
.TP 12
.BI \-w wsize\^
Set the size of the initial window
(default is the larger of 20 percent of the number of data objects
and twice the square root of that number).
.TP
.BI \-i incr\^
Set the maximum number of objects that can be
added to the window at each iteration
(default is 20 percent of the initial window size).
.TP
.B \-g
Use the gain criterion to select tests.  The default
uses the gain ratio criterion.
.TP
.BI \-m minobjs\^
In all tests, at least two branches must contain a minimum number
of objects (default 2).  This option allows the minimum
number to be altered.
.TP
.BI \-c cf\^
Set the pruning confidence level (default 25%).
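.PP
To illustrate the difference between the gain criterion (\-g) and the
default gain ratio criterion, the following Python sketch computes both
for a test on one discrete attribute, using the usual textbook
definitions (information gain, and gain divided by the split
information).  The function names are hypothetical, unknown values are
not handled, and the program's own calculation may differ in detail.
.nf
```python
import math
from collections import Counter


def entropy(items):
    """Entropy, in bits, of the distribution of the given items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_and_ratio(values, labels):
    """Return (gain, gain_ratio) for splitting on a discrete attribute.

    values[i] is the attribute value of object i; labels[i] is its class.
    """
    n = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)

    # Expected class entropy after the split, weighted by branch size.
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder

    # Split information: the entropy of the branch sizes themselves.
    split_info = entropy(values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio
```
.fi
For four objects in two classes, an attribute with two values that
separates the classes perfectly and an attribute with a distinct value
for every object both have a gain of 1 bit, but their gain ratios are
1 and 0.5 respectively: the ratio penalizes tests with many branches.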
.SH FILES
.PP
.in 8
c4.5
.br
filestem.data
.br
filestem.names
.br
filestem.unpruned  (unpruned trees)
.br
filestem.tree   (final decision tree)
.br
filestem.test   (unseen data)
.in 0
.PP
.SH SEE ALSO
.PP
consult(1)
.PP
.SH BUGS