Web Content Extraction Technology

Chua sheng WU

Software College, University of Science and Technology Liaoning, CHINA

Abstract: In the information era, knowledge is growing explosively and the information on the Internet is vast and varied. Accessing this information directly on cell phones is inconvenient because of the devices' limitations. We parse a web page as a DOM (Document Object Model) tree and extract the valuable information by considering three factors: structure, content, and programming habits. To illustrate, 28 websites are used to show the feasibility of the method for web information extraction, and we design a mobile client to present the extracted web content on cell phones. In practice, browsing the corresponding news websites with the extraction technology described in this article consumed only 8% of the cell phone traffic used by an existing mobile phone browser, and the user experience improved. The method can help people avoid excessive cell phone traffic costs, redundant information, complicated operations, and similar problems.

Keywords: Information extraction; Android; Soup; DOM
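
The approach summarized in the abstract (treat the page as a DOM tree, then weigh structure, content, and programming habits) can be illustrated with a minimal sketch. This is not the author's implementation: the scoring weights, the hint lists, and all names (`TreeBuilder`, `score`, `extract_main_text`) are hypothetical choices for illustration, and "programming habits" is interpreted here, as an assumption, to mean the conventional `class`/`id` names that developers give to containers. The sketch uses only Python's standard `html.parser` and assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

# Hypothetical hint lists and weights; none of these come from the paper.
NOISE_HINTS = ("nav", "menu", "footer", "sidebar", "advert", "comment")
CONTENT_HINTS = ("content", "article", "post", "main", "story")
BLOCK_TAGS = {"div", "article", "section", "td", "main"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIP_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []  # child Node objects and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.current.tag not in SKIP_TAGS:
            self.current.children.append(text)


def text_stats(node, in_link=False):
    """Return (characters outside links, characters inside links) for a subtree."""
    in_link = in_link or node.tag == "a"
    plain = link = 0
    for child in node.children:
        if isinstance(child, str):
            if in_link:
                link += len(child)
            else:
                plain += len(child)
        else:
            p, l = text_stats(child, in_link)
            plain, link = plain + p, link + l
    return plain, link


def score(node):
    plain, link = text_stats(node)
    s = plain - 2 * link  # content: favour running text over link lists
    s += 25 * sum(1 for c in node.children
                  if isinstance(c, Node) and c.tag == "p")  # structure
    names = ((node.attrs.get("class") or "") + " "
             + (node.attrs.get("id") or "")).lower()
    if any(h in names for h in NOISE_HINTS):  # naming habits: likely noise
        s -= 200
    if any(h in names for h in CONTENT_HINTS):  # naming habits: likely content
        s += 50
    return s


def walk(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def collect_text(node):
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        else:
            parts.extend(collect_text(child))
    return parts


def extract_main_text(html):
    """Return the text of the highest-scoring block element, or '' if none."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in walk(builder.root) if n.tag in BLOCK_TAGS]
    if not candidates:
        return ""
    return " ".join(collect_text(max(candidates, key=score)))
```

Calling `extract_main_text(page_html)` on a news page returns the text of the single best-scoring container, dropping navigation bars and footers. Sending only that text to a mobile client is the kind of reduction that would save traffic. A real extractor would also have to handle wrapper elements that enclose the whole page and malformed markup, both of which this sketch ignores.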
