Category Archives: 2.4 Project Reports

NetzDatenStrom Project Has Started

The research project NetzDatenStrom has started. NetzDatenStrom is funded by the Federal Ministry for Economic Affairs and Energy within the 6th energy research program. Research and development experts, vendors of network control systems, grid operators, and IT experts from the energy sector work together to integrate standard Big-Data solutions into existing network control systems. The project runs for three years and is carried out by a consortium comprising the network control system vendors PSI AG, KISTERS AG and BTC AG, the grid operator EWE NETZ GmbH, the OFFIS-Institute for Information Technology (consortium leader), Friedrich-Alexander-University Erlangen-Nuremberg, and the Institute for Multimedia and Interactive Systems at the University of Lübeck. NetzDatenStrom is supported by openKONSEQUENZ and may contribute results to the openKONSEQUENZ platform.

The official kick-off meeting and workshop took place on October 27th at OFFIS in Oldenburg. The first task is to specify practical and fundamental Big-Data use cases and to establish a foundation for the upcoming work steps. Within NetzDatenStrom, the Open Source Research Group will work on the integration of external data sources into existing network control systems and investigate the exploitation potential of open source software developed in a consortium.

Reference: http://www.openkonsequenz.de/index.php?option=com_flexicontent&view=item&cid=8&id=91:projekt-netzdatenstrom-nds&Itemid=101

HD-Diff released as part of Sweble 2.0

HD-Diff is a tree-based algorithm to compute the differences between two documents. The algorithm was presented in a paper at the DocEng 2014 conference.

Unlike other tree-based differencing algorithms, HD-Diff can look into text nodes, split them when necessary, and produce a very fine-grained edit script. It is especially suited for tree-based text documents (e.g. office documents or WOM v3-based wiki articles) in which changes often happen to the text inside text nodes and not just to the overall document structure.
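As a rough illustration of the core idea (and not the HD-Diff reference implementation, which matches non-overlapping common substrings across the whole tree), the following self-contained Java sketch splits a changed text value at its unchanged prefix and suffix so that the edit script mentions only the fragment that actually changed; the class name and example strings are purely illustrative:

// Conceptual sketch: instead of replacing a whole text node, split it so
// that only the changed fragment shows up in the edit script.
public class TextNodeSplitSketch {

    public static void main(String[] args) {
        System.out.println(describeEdit("the quick brown fox", "the quick red fox"));
        // prints: keep "the quick ", replace "brown" with "red", keep " fox"
    }

    static String describeEdit(String a, String b) {
        // Longest common prefix and suffix give a coarse but correct split;
        // HD-Diff goes further and uses common substrings anywhere in the text.
        int prefix = 0;
        while (prefix < a.length() && prefix < b.length()
                && a.charAt(prefix) == b.charAt(prefix))
            prefix++;
        int suffix = 0;
        while (suffix < a.length() - prefix && suffix < b.length() - prefix
                && a.charAt(a.length() - 1 - suffix) == b.charAt(b.length() - 1 - suffix))
            suffix++;
        String removed = a.substring(prefix, a.length() - suffix);
        String inserted = b.substring(prefix, b.length() - suffix);
        return "keep \"" + a.substring(0, prefix) + "\", replace \"" + removed
                + "\" with \"" + inserted + "\", keep \"" + a.substring(a.length() - suffix) + "\"";
    }
}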

The reference implementation of the generic HD-Diff algorithm is made available as part of the Sweble 2.0 release in the module hddiff. An adapter for WOM v3 documents is made available in the module hddiff-wom-adapter.

Additional information on the hddiff project can be found at GitHub, on our HD-Diff project page and in our paper.

Fine-grained Change Detection in Structured Text Documents (DocEng 2014)

Abstract: Detecting and understanding changes between document revisions is an important task. The acquired knowledge can be used to classify the nature of a new document revision or to support a human editor in the review process. While purely textual change detection algorithms offer fine-grained results, they do not understand the syntactic meaning of a change. By representing structured text documents as XML documents we can apply tree-to-tree correction algorithms to identify the syntactic nature of a change. Many algorithms for change detection in XML documents have been proposed, but most of them focus on the intricacies of generic XML data and emphasize speed over the quality of the result. Structured text requires a change detection algorithm to pay close attention to the content in text nodes; recent algorithms, however, treat text nodes as black boxes. We present an algorithm that combines the advantages of the purely textual approach with the advantages of tree-to-tree change detection by redistributing text from non-overlapping common substrings to the nodes of the trees. This allows us to not only spot changes in the structure but also in the text itself, thus achieving higher quality and a fine-grained result in linear time on average. The algorithm is evaluated by applying it to the corpus of structured text documents that can be found in the English Wikipedia.

Keywords: XML, WOM, structured text, change detection, tree matching, tree differencing, tree similarity, tree-to-tree correction, diff

Reference: Hannes Dohrn and Dirk Riehle. “Fine-grained Change Detection in Structured Text Documents.” In Proceedings of the 2014 ACM Symposium on Document Engineering (DocEng ’14). ACM, New York, NY, USA, 87-96. DOI=10.1145/2644866.2644880

The paper is available as a PDF file.

Sweble 2.0 released!

Two years after our first public release of the Google-Sponsored Sweble 2.0 Alpha, we are happy to announce the release of Sweble 2.0!

The most important innovation in the alpha release was the introduction of the engine component, which allowed full Mediawiki template expansion. Since then, many new features and bug fixes have been added to the software. Here are the highlights:

  • In the post-processing phase Sweble normalizes and fixes the AST according to the rules found in Section 12.2 of the WHATWG HTML specification (as of April 2012). This improves the quality of the resulting AST and guarantees that a rendered AST looks just like the HTML produced by Mediawiki when viewed in a modern browser.
  • The Wiki Object Model v2 (WOM) has been replaced by a complete rewrite called WOM v3. WOM v3 implements the org.w3c.dom Java interfaces and thus extends the Document Object Model.
  • WOM v3 allows full round-trip support of Mediawiki articles. After parsing and converting an article to WOM v3, all formatting information from the original wiki markup is preserved. The original formatting can be restored when the tree is converted back into wiki markup, even after alterations to the WOM tree.
  • Since WOM v3 implements the org.w3c.dom interfaces, it can be processed by standard Java facilities:
    • A WOM v3 tree can be serialized to XML and deserialized to a WOM v3 in-memory document using a javax.xml.transform.Transformer or a javax.xml.parsers.DocumentBuilder.
    • With a javax.xml.transform.Transformer one can also transform a WOM v3 document using an XSLT script (see the sketch after this list).
  • The module sweble-wom3-swc-adapter converts the AST produced by the sweble-wikitext-parser to a WOM v3 document and can restore wiki markup formatting to a WOM document.
  • The module sweble-wom3-json-tools offers serialization of WOM v3 documents to and from JSON.
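Because WOM v3 documents are org.w3c.dom documents, the standard JAXP facilities apply to them directly. The following minimal sketch uses an ordinary DOM document as a stand-in (obtaining a WOM v3 tree via the Sweble adapters is not shown) and demonstrates both points above: it serializes the tree to XML with an identity Transformer and then applies a small XSLT stylesheet with the same API; the example markup and stylesheet are purely illustrative.

import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomRoundTripExample {
    public static void main(String[] args) throws Exception {
        // Any org.w3c.dom.Document works here; a WOM v3 document obtained
        // from the Sweble adapters could be used in its place.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader("<article><p>Hello</p></article>")));

        // 1. Serialize the DOM tree to XML with an identity Transformer.
        StringWriter xml = new StringWriter();
        TransformerFactory tf = TransformerFactory.newInstance();
        tf.newTransformer().transform(new DOMSource(doc), new StreamResult(xml));
        System.out.println(xml);

        // 2. Apply an XSLT stylesheet to the same DOM tree.
        String xslt =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "  <xsl:template match='p'><xsl:value-of select='.'/></xsl:template>"
          + "</xsl:stylesheet>";
        Transformer t = tf.newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter text = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(text));
        System.out.println(text);
    }
}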

China: Experience, Travelling, Working and Research

Would you like to get inspired by Master's student Bilal Zaghloul? Then read his fascinating report about his Master's thesis work in Beijing, China!

“When it came to my mind to travel to China…”, Bilal writes, “…I had little idea about the life in China. However, I knew that China is one of the fastest-growing economies in the world, particularly in the software industry. Therefore, it appeared to me like a good idea to spend a few months there, so I can get to know more about the country and the culture. Immediately after we (Prof. Riehle, Prof. Zhou, and me) agreed on a research topic, I started my preparation for traveling (e.g. Visa). As travel got closer, I started to wonder about traveling to China. Some doubts came to my mind: ‘I have no idea about the life in Beijing’, and ‘I know nothing about the Chinese language’. In this report, I want to share with you some important notes of my six-month experience in China.”

Read more about Bilal’s experiences…

Design and Implementation of Wiki Content Transformations and Refactorings

Abstract: The organic growth of wikis requires constant attention by contributors who are willing to patrol the wiki and improve its content structure. However, most wikis still only offer textual editing and even wikis which offer WYSIWYG editing do not assist the user in restructuring the wiki. Therefore, “gardening” a wiki is a tedious and error-prone task. One of the main obstacles to assisted restructuring of wikis is the underlying content model which prohibits automatic transformations of the content. Most wikis use either a purely textual representation of content or rely on the representational HTML format. To allow rigorous definitions of transformations we use and extend a Wiki Object Model. With the Wiki Object Model installed we present a catalog of transformations and refactorings that helps users to easily and consistently evolve the content and structure of a wiki. Furthermore we propose XSLT as language for transformation specification and provide working examples of selected transformations to demonstrate that the Wiki Object Model and the transformation framework are well designed. We believe that our contribution significantly simplifies wiki “gardening” by introducing the means of effortless restructuring of articles and groups of articles. It furthermore provides an easily extensible foundation for wiki content transformations.

Keywords: Wiki, Wiki Markup, WM, Wiki Object Model, WOM, Transformation, Refactoring, XML, XSLT, Sweble.

Reference: Hannes Dohrn and Dirk Riehle. “Design and Implementation of Wiki Content Transformations and Refactorings.” In Proceedings of the 9th International Symposium on Open Collaboration (WikiSym + OpenSym 2013). ACM, 2013.

The paper is available as a PDF file.

Report on BaCaTec Supported (July 2010) Research Project “Reengineering Wikitext”

Original project description for BaCaTec

MediaWiki is the software that runs Wikipedia. Content is written in Wikitext, a markup and programming language that has evolved over eight years without any formal description; its semantics is defined only by the implementation of the MediaWiki software. The open source research group at the Friedrich-Alexander-University of Erlangen-Nürnberg is reengineering this language to create a new alternative parser and execution engine that will allow for significantly improved machine processing of the content found in Wikipedia. To this end, the research group is collaborating with the Wikimedia Foundation in San Francisco, CA, U.S.A., which has led to the BaCaTec travel grant.

Results of sponsorship

BaCaTec supported the travel of Prof. Riehle and Ph.D. student Hannes Dohrn to California to meet and talk to the Wikimedia Foundation, provider of Mediawiki, the software running Wikipedia. Hannes Dohrn, a Ph.D. student and research associate (“wissenschaftlicher Mitarbeiter”) of the group, subsequently designed and implemented a Mediawiki-compatible parser for Wikitext, the content (markup) language of Wikipedia. Google (Mountain View, California) later supported this work with a US$ 50,000 grant to the research group. From a practical perspective, the result is a full parser in wide use in research and industry. The source code can be found at http://sweble.org as the Sweble open source project. From a research perspective we have gained a platform from which to expect more research results, including the dissertation of Dipl.-Inf. Hannes Dohrn on wiki content transformations. A couple of published papers and reports detail this work on the group’s website and on Prof. Riehle’s website at http://dirkriehle.com/publications.

Sweble on GitHub and Ohloh

The Sweble Project can now be found on GitHub and Ohloh.

The GitHub repositories mirror the primary repositories hosted on our servers. Commits pushed to our repositories will be pushed to GitHub after a short delay.

Please visit us on Ohloh and let us know if you’re using Sweble!

Google-Sponsored Sweble 2.0 Alpha Released

We released an early 2.0 (alpha) version of the Sweble Wikitext parser and related libraries in our Git repository. The Sweble Wikitext parser aims to provide a Mediawiki-compliant Wikitext parser implementation in Java. This includes full Mediawiki template expansion but does not cover all of the parser functions and tag extensions (yet).

We would like to thank Google, and in particular Chris DiBona's Open Source Programs Office, for sponsoring the development of our Wikitext parser.

Stay tuned for more Sweble components for Wikitext handling and domain-expert programming.

Sweble 1.1.0 released

Sweble 1.1.0 fixes some bugs and introduces a couple of new features and modules. For a full list of changes, please refer to the change reports of the individual modules. The release can be found on Maven Central. Jars with dependencies will soon be available from our downloads page.

Fixed bugs (excerpt contains only bugs filed in our bug tracker):

  • Can not parse image block with nested internal link. Fixes 9.
  • The LinkTargetParser is now decoding XML references and URL encoded entities (%xx) before checking titles for validity. Fixes 10.
  • Tests fail under Windows due to encoding and path separator differences. Fixes 11.
  • mvn license:check fails under Windows. Fixes 12.
  • LazyRatsParser.java: type parameters of <T>T cannot be determined. Fixes 13.
  • NPE on Spanish wikipedia dump. Fixes 14.
  • Template expansion does not expand anonymous parameters correctly. Fixes 18.

Notable new features/modules (excerpt):

  • Added submodule ptk-json-tools: Library for serializing and deserializing ASTs to JSON and back.
  • Added submodule ptk-xml-tools: Library for serializing and deserializing ASTs to XML and back.
  • Added submodule swc-article-cruncher: A framework for processing Wikitext pages that spreads the work over multiple processors.
  • Added submodule swc-dumpreader: A library for reading Wikipedia XML dumps.
  • Added submodule swc-example-basic: Example demonstrating parsing of an article and conversion to HTML.
  • Added submodule swc-example-serialization: Example demonstrating the serialization and deserialization of ASTs to JSON, XML and native Java object streams.
  • Added submodule swc-example-xpath: Example demonstrating XPath queries in ASTs (a standalone sketch follows below).
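As a rough, assumption-laden illustration of the kind of query such an example runs, the following standalone sketch evaluates an XPath expression with the standard javax.xml.xpath API against a small hand-written XML document. The XML shown is not the actual AST serialization produced by ptk-xml-tools, and the swc-example-xpath module queries the AST directly rather than an XML serialization; treat the element names and structure here as hypothetical.

import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical, simplified XML view of a parsed page; the real
        // ptk-xml-tools output format differs.
        String xml =
            "<page title='Example'>"
          + "  <section level='2'><heading>History</heading></section>"
          + "  <section level='2'><heading>Usage</heading></section>"
          + "</page>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Select all level-2 section headings.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList headings = (NodeList) xpath.evaluate(
                "//section[@level='2']/heading", doc, XPathConstants.NODESET);

        for (int i = 0; i < headings.getLength(); i++)
            System.out.println(headings.item(i).getTextContent());
    }
}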