Final Thesis: Design and Implementation of Graph-based Storage for Wikipedia Articles

Abstract: Despite lacking crucial features, Wiki Markup is still the primary data format in Wikipedia. The Wiki Object Model (WOM) offers a modern alternative based on a tree structure. A graph-based storage is a likely candidate for integrating WOM as the primary data format in Wikipedia. Managing the immense revision history of Wikipedia articles is one of many problems with this approach. In most cases, the difference between a revision and its successor is small, so storing every revision in full leaves many redundancies in the database. To reduce these redundancies, an algorithm was designed that connects nearby revision graphs and reuses parts of the predecessor graph. Moreover, strategies for traversing WOM resources are introduced, and user-defined edges between two arbitrary nodes can be established. Multiple tests with real Wikipedia articles, run under different configurations, evaluate performance and storage savings. With the graph-based storage, redundancies between nearby revisions are reduced to a minimum, and all the advantages provided by WOM are retained.
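
The idea of reusing parts of the predecessor graph can be illustrated with a minimal sketch. The abstract does not specify the thesis's algorithm, so the code below swaps in a generic technique (structural sharing of identical subtrees, sometimes called hash-consing) purely for illustration. The names `Node`, `RevisionStore`, `add_revision` and `node_count` are hypothetical and are not part of Sweble, WOM, or any graph database API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a WOM tree node (label plus ordered children)."""
    label: str
    children: Tuple["Node", ...] = ()


class RevisionStore:
    """Illustrative store that keeps one canonical copy of each distinct subtree."""

    def __init__(self) -> None:
        # Maps a subtree (compared structurally) to its single stored instance.
        self._nodes: Dict[Node, Node] = {}
        self.revisions: List[Node] = []

    def _intern(self, node: Node) -> Node:
        # Intern children first so that equal subtrees resolve to shared objects.
        children = tuple(self._intern(child) for child in node.children)
        candidate = Node(node.label, children)
        # Reuse the stored subtree if an equal one exists, else store this one.
        return self._nodes.setdefault(candidate, candidate)

    def add_revision(self, root: Node) -> Node:
        shared_root = self._intern(root)
        self.revisions.append(shared_root)
        return shared_root

    def node_count(self) -> int:
        return len(self._nodes)


# Two revisions that differ only in the text of the second paragraph.
rev1 = Node("article", (
    Node("p", (Node("text:Hello"),)),
    Node("p", (Node("text:World"),)),
))
rev2 = Node("article", (
    Node("p", (Node("text:Hello"),)),
    Node("p", (Node("text:World!"),)),
))

store = RevisionStore()
r1 = store.add_revision(rev1)
r2 = store.add_revision(rev2)

# Stored separately, the two revisions would need 10 nodes; with sharing, 8.
print(store.node_count())
# The unchanged first paragraph is the same stored object in both revisions.
print(r1.children[0] is r2.children[0])
```

In this sketch only the changed paragraph, its text node, and the new root are added for the second revision. A real graph database would persist nodes and edges rather than in-memory objects, and the thesis's algorithm may detect and link shared parts quite differently.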

Keywords: Sweble, WOM, Wikipedia, graph database

PDFs: Master Thesis, Work Description

Reference: Daniel Knogl. Design and Implementation of Graph-based Storage for Wikipedia Articles. Master Thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg: 2016.