Final Thesis: Wikipedia in times of BigData – Einsatz von NoSQL-Systemen zur Verwaltung und Analyse von Wikipediainhalten (Using NoSQL Systems to Manage and Analyze Wikipedia Content)

Abstract: This thesis examines the power of modern NoSQL systems in conjunction with WOM to analyse and manage Wikipedia's article data. The Wiki Object Model (WOM) is a machine-readable, semi-structured representation of the markup produced by different wiki engines. The systems studied are native XML databases (Sedna, BaseX, eXist-db, X-Hive, Oracle Berkeley DB XML and Apache Xindice), XML-enabled databases (PostgreSQL, MySQL, MariaDB), JSON-based document stores (Couchbase Server, MongoDB), wide-column stores (Apache HBase) and multi-model databases (ArangoDB, OrientDB). As a foundation for the subsequent suitability analysis, the expected data volume, query profile and workload of the analytical database are investigated. The listed systems are then evaluated with regard to their functional and non-functional suitability for the given use case. The functional part is derived from each system's capability to execute the given query profile (query-driven schema design). The non-functional part is dictated by the amount of data that needs to be processed and by the expected workload. The list of candidate systems is narrowed down to two, whose resource consumption and functional suitability for the given use case are described in further detail. This research is the technical foundation for choosing and deploying the first prototype of an analytical database that unlocks the structured data of Wikipedia.

Keywords: Wikipedia analysis, big data, NoSQL, Sweble

PDFs: Master Thesis

Reference: Patrick Kaltenmaier. Wikipedia in times of BigData – Einsatz von NoSQL-Systemen zur Verwaltung und Analyse von Wikipediainhalten. Master Thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, 2016.