Improved Random Forest Algorithm for Stream Big Data Processing
Jing Li1, Yingchun Liu2
1. College of Computer Science and Technology, Huaqiao University, Xiamen, 361021, China
2. Industrial and Commercial Bank of China, Peony Card Center, Beijing, 100140, China
Abstract: Stream computing is an important form of Big Data computing, and Random Forest is currently one of the most widely applied classification algorithms. In practice, a Random Forest must cope not only with a huge number of features but also with data patterns that change constantly over time. Without a mechanism for self-renewal and adaptation, the accuracy of a Random Forest gradually declines over time. To address this problem, this paper analyzes the characteristics of the Random Forest algorithm and proposes a new pruning approach based on the accuracy of the individual decision trees. To adapt to changes in the data, a new random method based on margin is presented. The new method updates itself continuously and can be applied in streaming Big Data environments. Experiments on real customer data show that the new method achieves higher classification accuracy.
Keywords: Random forest; Big data; Stream computing