Hong Kong New Century Cultural Publishing House
Journal: International Intelligent Information and Management Science (English-language journal)

Web Content Extraction Technology

Chuasheng WU

Software College, University of Science and Technology Liaoning, CHINA 


Abstract: In the information era we face a knowledge explosion, and the information available on the Internet is highly varied. Accessing this information directly on cell phones is inconvenient because of their limitations. We parse a web page as a DOM (Document Object Model) tree and extract the valuable information by considering three factors: structure, content, and programming habits. To illustrate, 28 websites are used to show the feasibility of the method for web information extraction, and we design a mobile client to present the extracted web content on cell phones. In practice, browsing the corresponding news websites with the extraction technology described in this article consumed only 8% of the cell phone traffic used by an existing mobile phone browser, and the user experience improved. The method can help people avoid excessive cell phone traffic costs, redundant information, complicated operations, and similar problems.

Keywords: Information extraction; Android; Soup; DOM
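
The abstract describes treating a page as a DOM tree and picking out the valuable part using structure and content. The following is a minimal sketch of that general idea, not the author's implementation: it builds a small tree with Python's standard `html.parser` module and selects the block element whose text is least dominated by links. The scoring rule (total characters minus twice the link characters), the candidate tag set, and all names such as `extract_main_text` are assumptions made for illustration; the "programming habits" factor is not modeled.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}                     # never counted as content
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
CANDIDATE_TAGS = {"div", "article", "section", "td"}  # assumed content containers


class Node:
    """One element or text node in a simplified DOM tree."""

    def __init__(self, tag, parent=None, data=""):
        self.tag = tag          # element name, "#text", or "#document"
        self.parent = parent
        self.data = data        # text content, only used by "#text" nodes
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML; tolerant of stray closing tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, then close it.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(Node("#text", self.current, text))


def text_len(node, inside_link=False):
    """Return (total_chars, link_chars) for the subtree rooted at node."""
    if node.tag in SKIP_TAGS:
        return 0, 0
    if node.tag == "#text":
        n = len(node.data)
        return n, (n if inside_link else 0)
    inside_link = inside_link or node.tag == "a"
    total = link = 0
    for child in node.children:
        t, l = text_len(child, inside_link)
        total += t
        link += l
    return total, link


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def collect_text(node):
    """Collect text chunks in document order, skipping script/style."""
    if node.tag in SKIP_TAGS:
        return []
    if node.tag == "#text":
        return [node.data]
    parts = []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_main_text(html):
    """Return the text of the best-scoring candidate block, or ''."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_score = None, 0
    for node in iter_nodes(builder.root):
        if node.tag not in CANDIDATE_TAGS:
            continue
        total, link = text_len(node)
        score = total - 2 * link   # assumed heuristic: penalize link-heavy blocks
        if score > best_score:
            best, best_score = node, score
    return " ".join(collect_text(best)) if best else ""
```

Calling `extract_main_text(page_html)` on a page with a navigation block and a story block would be expected to return only the story text, since link-heavy blocks score below zero. A real extractor would need more signals than this single score, for example to keep an outer container with a plain-text footer from outscoring the article inside it.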