Report on the Research Project “Reengineering Wikitext”, Supported by BaCaTec (July 2010)

Original project description for BaCaTec

MediaWiki is the software that runs Wikipedia. Content is written in Wikitext, a markup and programming language that has evolved over eight years without any formal description; its semantics are defined solely by the implementation of the MediaWiki software. The open source research group at the Friedrich-Alexander-University of Erlangen-Nürnberg is reengineering this language to create a new, alternative parser and execution engine that will allow for significantly improved machine processing of the content found in Wikipedia. To this end, the research group is collaborating with the Wikimedia Foundation in San Francisco, CA, U.S.A., a collaboration that has led to the BaCaTec travel grant.

Results of sponsorship

BaCaTec supported the travel of Prof. Riehle and Ph.D. student Hannes Dohrn to California to meet with the Wikimedia Foundation, provider of MediaWiki, the software running Wikipedia. Hannes Dohrn, who is also a research associate of the group, subsequently designed and implemented a MediaWiki-compatible parser for Wikitext, the content (markup) language of Wikipedia. Google (Mountain View, California) later supported this work with a US$50,000 grant to the research group. From a practical perspective, the result is a full parser in wide use in research and industry. The source code is available at http://sweble.org as the Sweble open source project. From a research perspective, we have gained a platform from which we expect further research results, including the dissertation of Dipl.-Inf. Hannes Dohrn on wiki content transformations. A couple of published papers and reports detail this work; they are available on the group’s website at / and on Prof. Riehle’s website at http://dirkriehle.com/publications.