Changes between Version 43 and Version 44 of Parallel-DT


Timestamp:
Jan 18, 2010, 10:29:00 PM
Author:
andrei.minca
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • Parallel-DT

    v43 v44  
    3535   * simultaneously compute the entropy gains of each attribute at each processor and select the best attribute for child node expansion
    3636   * depending on the branching factor of the tree desired, create child nodes for the same number of partitions of attribute values, and split the training cases accordingly
     37   * repeat the above steps until no more nodes are available for expansion (a sketch of the attribute-selection step follows)
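
The attribute-selection step can be made concrete with a small serial sketch. The function below computes the information (entropy) gain of one candidate attribute from its class-distribution counts. It is a minimal illustration assuming categorical attributes and two classes; the names (''entropy_gain'', ''NUM_CLASSES'') and the toy counts in ''main'' are assumptions, not code from this project.

{{{
#!c
#include <math.h>
#include <stdio.h>

#define NUM_CLASSES 2

/* Entropy of a class distribution given absolute counts. */
static double entropy(const int counts[NUM_CLASSES], int total)
{
    double e = 0.0;
    for (int c = 0; c < NUM_CLASSES; c++)
        if (counts[c] > 0) {
            double p = (double)counts[c] / total;
            e -= p * log2(p);
        }
    return e;
}

/* Information gain of splitting on an attribute with num_values
 * distinct values; counts[v][c] = cases with value v and class c. */
double entropy_gain(int num_values, const int counts[][NUM_CLASSES])
{
    int class_totals[NUM_CLASSES] = {0};
    int total = 0;
    for (int v = 0; v < num_values; v++)
        for (int c = 0; c < NUM_CLASSES; c++) {
            class_totals[c] += counts[v][c];
            total += counts[v][c];
        }

    double gain = entropy(class_totals, total);   /* entropy before the split */
    for (int v = 0; v < num_values; v++) {
        int value_total = 0;
        for (int c = 0; c < NUM_CLASSES; c++)
            value_total += counts[v][c];
        if (value_total > 0)                      /* subtract weighted child entropy */
            gain -= (double)value_total / total * entropy(counts[v], value_total);
    }
    return gain;
}

int main(void)
{
    /* Toy attribute with 3 values; counts[v] = {class 0, class 1}. */
    const int counts[3][NUM_CLASSES] = { {2, 3}, {4, 0}, {3, 2} };
    printf("gain = %.4f\n", entropy_gain(3, counts));
    return 0;
}
}}}

Each processor can evaluate such gains only over its local training cases, which is why the counts must first be summed across all processors before the best attribute can be chosen.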
    3738
    3839[[Image(SyncronusTreeConstruction-DepthFirstExpansionStrategy.jpg)]]
    3940
    4041The figure above shows the overall picture.
    41 The advantage of this approach is that it does not require any movement of the training data items. However, this algorithm suffers from high communication cost and load imbalance. For each nod ein the decision tree, after collecting the class dstribution information, all the processors need to syncronize and exchange the distribution information.
     42The advantage of this approach is that it does not require any movement of the training data items. However, this algorithm suffers from high communication cost and load imbalance. For each node in the decision tree, after collecting the class distribution information, all the processors need to synchronize and exchange the distribution information.
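
This per-node synchronization amounts to a global sum of the locally collected class-distribution counts, after which every processor can evaluate identical entropy gains. Below is a minimal sketch of that exchange, assuming an MPI implementation; the table sizes and the placeholder local counts are illustrative assumptions, not the project's actual data layout.

{{{
#!c
#include <mpi.h>
#include <stdio.h>

#define NUM_ATTRS   4   /* candidate attributes at this node (assumed) */
#define NUM_VALUES  3   /* values per attribute (assumed uniform)      */
#define NUM_CLASSES 2

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each processor scans only its local slice of the training cases
     * and tallies class-distribution counts per attribute value.
     * (A single placeholder count stands in for the local scan.) */
    int local[NUM_ATTRS][NUM_VALUES][NUM_CLASSES] = {0};
    local[0][rank % NUM_VALUES][0] = 1;

    /* The synchronization step: sum the local tables so that every
     * processor holds the global class distribution. One such
     * collective runs per node expansion, which is the source of the
     * high communication cost mentioned above. */
    int global[NUM_ATTRS][NUM_VALUES][NUM_CLASSES];
    MPI_Allreduce(local, global, NUM_ATTRS * NUM_VALUES * NUM_CLASSES,
                  MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* From the global table, all processors compute the same entropy
     * gains and agree on the best attribute without extra messages. */
    if (rank == 0)
        printf("class-0 count for attribute 0: %d\n",
               global[0][0][0] + global[0][1][0] + global[0][2][0]);

    MPI_Finalize();
    return 0;
}
}}}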
    4243
    4344=== Partitioned Tree Construction ===
    4445
    45 In this approach, whenever feasible, deifferent processors work on different parts of the classification tree.
     46In this approach, whenever feasible, different processors work on different parts of the classification tree. In particular, if more than one processor cooperates to expand a node, then these processors are partitioned to expand the successors of this node. Consider the case in which a group of processors ''Pn'' cooperates to expand node ''n''. The algorithm consists of the following steps:
     47
     48'''Step 1''' processors in ''Pn'' cooperate to expand node ''n''
     49'''Step 2''' once node ''n'' is expanded into successor nodes ''n1'', ''n2'', ..., ''nk'', the processor group ''Pn'' is also partitioned, and the successor nodes are assigned to processors as follows:
     50   '''Case 1''': if the number of successor nodes is greater than |''Pn''|
      51               1. partition the successors into |''Pn''| groups such that the total number of training cases corresponding to each node group is roughly equal. Assign each processor to one node group (a sketch of one way to do this partitioning follows the list).
      52               2. shuffle the training data such that each processor has the data items that belong to the nodes it is responsible for
      53               3. now the expansion of the subtrees rooted at a node group proceeds completely independently at each processor, as in the serial algorithm
     54   '''Case 2''': otherwise
      55               1. assign a subset of processors to each node such that the number of processors assigned to a node is proportional to the number of training cases corresponding to the node
      56               2. shuffle the training cases such that each subset of processors has the training cases that belong to the nodes it is responsible for
     57               3. processor subsets assigned to different nodes develop subtrees independently.
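
Case 1, step 1 leaves the grouping method open; one simple heuristic is to repeatedly assign the largest remaining successor node to the currently least-loaded processor group. The sketch below illustrates that greedy assignment under this assumption; the ''Successor'' struct, all names, and the sample counts are hypothetical rather than project code.

{{{
#!c
#include <stdio.h>
#include <stdlib.h>

typedef struct { int node_id; int num_cases; } Successor;

/* Sort successor nodes by training-case count, largest first. */
static int by_cases_desc(const void *a, const void *b)
{
    return ((const Successor *)b)->num_cases
         - ((const Successor *)a)->num_cases;
}

/* Partition k successor nodes (ids 0..k-1) into num_groups groups with
 * roughly equal training-case totals; group_of[id] gets the group. */
void partition_successors(Successor *succ, int k, int num_groups, int *group_of)
{
    int *load = calloc(num_groups, sizeof *load);
    qsort(succ, k, sizeof *succ, by_cases_desc);

    for (int i = 0; i < k; i++) {
        int g = 0;                       /* find the least-loaded group */
        for (int j = 1; j < num_groups; j++)
            if (load[j] < load[g])
                g = j;
        group_of[succ[i].node_id] = g;
        load[g] += succ[i].num_cases;
    }
    free(load);
}

int main(void)
{
    /* Five successor nodes split between |Pn| = 2 processor groups. */
    Successor succ[] = { {0, 120}, {1, 45}, {2, 80}, {3, 55}, {4, 100} };
    int group_of[5];
    partition_successors(succ, 5, 2, group_of);
    for (int id = 0; id < 5; id++)
        printf("node %d -> group %d\n", id, group_of[id]);
    return 0;
}
}}}

After the assignment, the training cases are shuffled so that each group holds exactly the cases of its own nodes, and each group then develops its subtrees independently.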
    4658
    4759[[Image(PartitionedTreeConstruction.jpg)]]
     60
      61The advantage of this approach is that once a processor becomes solely responsible for a node, it can develop a subtree of the classification tree independently, without any communication overhead. However, this approach has a number of disadvantages. The first is that it requires data movement after each node expansion until one processor becomes responsible for an entire subtree. The second is the poor load balancing inherent in the algorithm.
    4862
    4963== Project activity ==