Changes between Version 43 and Version 44 of Parallel-DT
- Timestamp:
- Jan 18, 2010, 10:29:00 PM
Parallel-DT
 * simultaneously compute the entropy gains of each attribute at each processor and select the best attribute for child node expansion
 * depending on the branching factor of the tree desired, create child nodes for the same number of partitions of attribute values, and split the training cases accordingly
 * repeat the above steps until no more nodes are available for expansion

[[Image(SyncronusTreeConstruction-DepthFirstExpansionStrategy.jpg)]]

The figure above shows the overall picture.
The advantage of this approach is that it does not require any movement of the training data items. However, the algorithm suffers from high communication cost and load imbalance: for each node in the decision tree, after collecting the class distribution information, all the processors need to synchronize and exchange the distribution information.

=== Partitioned Tree Construction ===

In this approach, whenever feasible, different processors work on different parts of the classification tree. In particular, if more than one processor cooperates to expand a node, then these processors are partitioned to expand the successors of this node. Consider the case in which a group of processors ''Pn'' cooperates to expand node ''n''.
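The per-node attribute selection of the synchronous scheme can be sketched in a few lines. This is a minimal single-process simulation, assuming the training data is row-partitioned across processors (modelled here as a list of partitions); the toy dataset and names such as `best_attribute` are illustrative and not part of the page. Summing the local tallies stands in for the global exchange (e.g. an all-reduce) that real processors would perform.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Entropy of a class-count distribution."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c)

def best_attribute(partitions, attributes):
    """Each 'processor' tallies local class counts per attribute value;
    the tallies are summed (simulating the synchronization/exchange step)
    and every processor then picks the same highest-gain attribute."""
    # Global class distribution at this node (sum of local tallies).
    node_counts = Counter()
    for part in partitions:
        node_counts.update(label for _, label in part)
    node_h = entropy(node_counts)
    n = sum(node_counts.values())

    best, best_gain = None, -1.0
    for a in attributes:
        # attribute value -> class counts, accumulated across all partitions
        dist = {}
        for part in partitions:                      # local tally ...
            for row, label in part:
                dist.setdefault(row[a], Counter())[label] += 1
        # ... followed by the (simulated) global gain computation
        split_h = sum(sum(c.values()) / n * entropy(c) for c in dist.values())
        gain = node_h - split_h
        if gain > best_gain:
            best, best_gain = a, gain
    return best

# Two "processors", each holding half of a toy dataset.
p0 = [({"windy": 0, "humid": 1}, "no"), ({"windy": 1, "humid": 1}, "no")]
p1 = [({"windy": 0, "humid": 0}, "yes"), ({"windy": 1, "humid": 0}, "yes")]
print(best_attribute([p0, p1], ["windy", "humid"]))  # -> humid
```

Note that only class-count vectors cross the (simulated) processor boundary, never the training rows themselves, which is exactly why this scheme needs no data movement.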
The algorithm consists of the following steps:

'''Step 1''': processors in ''Pn'' cooperate to expand node ''n''

'''Step 2''': once node ''n'' is expanded into successor nodes ''n1'', ''n2'', ..., ''nk'', the processor group ''Pn'' is also partitioned, and the successor nodes are assigned to processors as follows:

'''Case 1''': if the number of successor nodes is greater than |''Pn''|
 1. partition the successors into |''Pn''| groups such that the total number of training cases corresponding to each node group is roughly equal, and assign each processor to one node group
 1. shuffle the training data such that each processor has the data items that belong to the nodes it is responsible for
 1. now the expansion of the subtrees rooted at a node group proceeds completely independently at each processor, as in the serial algorithm

'''Case 2''': otherwise
 1. assign a subset of processors to each node such that the number of processors assigned to a node is proportional to the number of training cases corresponding to the node
 1. shuffle the training cases such that each subset of processors has the training cases that belong to the nodes it is responsible for
 1. processor subsets assigned to different nodes develop subtrees independently

[[Image(PartitionedTreeConstruction.jpg)]]

The advantage of this approach is that once a processor becomes solely responsible for a node, it can develop its subtree of the classification tree independently, without any communication overhead. However, the approach has a number of disadvantages. The first is that it requires data movement after each node expansion until one processor becomes responsible for an entire subtree. The second is the poor load balancing inherent in the algorithm.

== Project activity ==