An Enhancing XML Big Data Mining Approach on Spark System
With the development of cloud computing, intelligent mobile applications, and the IoT, XML has emerged as a popular standard for data exchange among them, and XML data has grown into large-volume data sets. XML is semi-structured data and can be modeled as a tree. As data sharing becomes more common, XML features such as parent-child and ancestor-descendant relationships are widely used to share information in XML big data. Through these relationships, XML big data forms massive tree structures, which leaves mining behavior on XML big data less constrained: users can query the tree-structured data through multiple access paths. However, this makes it more difficult to mine frequent patterns from the data. Therefore, improving the performance of frequent pattern discovery in tree-structured XML big data has become an important issue. Several XML pattern mining studies have focused on enhancing XML mining performance. However, these studies model XML data as a tree and thus cannot improve the mining performance on big XML data. They also do not apply the inclusion-exclusion principle from combinatorial mathematics to reduce the mining time and I/O cost of generating candidate XML patterns. Thus, the mining performance on tree-structured XML big data cannot be enhanced effectively. In addition, existing studies do not design their algorithms to mine XML big data on a cloud computing framework, which hurts system performance. This research therefore proposes a new approach for effectively mining XML frequent patterns on the Spark system. Built on Spark, the approach can achieve higher mining and query performance for XML big data.
Index Terms - Cloud computing, XML frequent patterns, Spark, Hadoop, XML mining.