Tag Archives: Wikitext Parser

Sweble 2.0 released!

Two years after our first public release of the Google-Sponsored Sweble 2.0 Alpha, we are happy to announce the release of Sweble 2.0!

The most important innovation in the alpha release was the introduction of the engine component, which allowed full MediaWiki template expansion. Since then, many other new features and bug fixes have been added to the software. Here are the highlights:

  • In the post-processing phase Sweble normalizes and fixes the AST according to the rules found in the WHATWG HTML Spec, Section 12.2 of April 2012. This improves the quality of the resulting AST and guarantees that a rendered AST looks just like the HTML produced by MediaWiki when viewed in a modern browser.
  • The Wiki Object Model v2 (WOM) has been replaced by a complete rewrite called WOM v3. The WOM v3 implements the org.w3c.dom Java interfaces and thus implements and extends the Document Object Model.
  • The WOM v3 allows full round-trip support of MediaWiki articles. After parsing and converting an article to WOM v3, all formatting information from the original wiki markup is preserved. Even after the WOM tree has been altered, the original formatting can be restored when the tree is converted back into wiki markup.
  • Since the WOM v3 implements the org.w3c.dom interfaces it can be processed by standard Java facilities:
    • A WOM v3 tree can be serialized to XML and deserialized to a WOM v3 in-memory document using a javax.xml.transform.Transformer or a javax.xml.parsers.DocumentBuilder.
    • With a javax.xml.transform.Transformer one can also transform a WOM v3 document using an XSLT script.
  • The module sweble-wom3-swc-adapter converts the AST produced by the sweble-wikitext-parser to a WOM v3 document and can restore wiki markup formatting to a WOM document.
  • The module sweble-wom3-json-tools offers serialization of WOM v3 documents to and from JSON.
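
To illustrate what "standard Java facilities" means in practice, here is a minimal sketch that serializes, re-parses and XSLT-transforms a document using only javax.xml classes. It does not use any Sweble API: a plain org.w3c.dom document stands in for a WOM v3 document, and the element names (article, p), the class name DomRoundTripSketch and the stylesheet are made up for this illustration. The assumption is that a real WOM v3 document could be passed to the same calls, since it implements the same interfaces.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class DomRoundTripSketch {

    // Hypothetical stylesheet: emits the text of the first <p> below <article>.
    static final String TEXT_XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/article\"><xsl:value-of select=\"p\"/></xsl:template>"
        + "</xsl:stylesheet>";

    // Stand-in for a WOM v3 document: a plain DOM tree with made-up element names.
    static Document buildSampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element article = doc.createElementNS(null, "article");
        article.setAttribute("title", "Example");
        Element para = doc.createElementNS(null, "p");
        para.setTextContent("Hello, wiki!");
        article.appendChild(para);
        doc.appendChild(article);
        return doc;
    }

    // Serialize any org.w3c.dom.Document to XML with an identity Transformer.
    static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Parse XML text back into an in-memory DOM document with a DocumentBuilder.
    static Document deserialize(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Run an XSLT script over the DOM tree and return the result as a string.
    static String applyXslt(Document doc, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildSampleDocument();
        String xml = serialize(doc);
        System.out.println(xml);
        Document copy = deserialize(xml);
        System.out.println(copy.getDocumentElement().getAttribute("title"));
        System.out.println(applyXslt(doc, TEXT_XSLT));
    }
}
```

Nothing in this sketch is specific to the tree's implementation, which is the point of building WOM v3 on org.w3c.dom: existing XML tooling only ever sees the DOM interfaces.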

Sweble 1.1.0 released

Sweble 1.1.0 fixes some bugs and introduces a couple of new features/modules. For a full list of changes, please refer to the change reports of the individual modules. The release can be found on Maven Central. Jars with dependencies will soon be available from our downloads page.

Fixed bugs (excerpt contains only bugs filed in our bug tracker):

  • Can not parse image block with nested internal link. Fixes 9.
  • The LinkTargetParser is now decoding XML references and URL encoded entities (%xx) before checking titles for validity. Fixes 10.
  • Tests fail under Windows due to encoding and path separator differences. Fixes 11.
  • mvn license:check fails under Windows. Fixes 12.
  • LazyRatsParser.java: type parameters of <T>T cannot be determined. Fixes 13.
  • NPE on Spanish wikipedia dump. Fixes 14.
  • Template expansion does not expand anonymous parameters correctly. Fixes 18.

Notable new features/modules (excerpt):

  • Added submodule ptk-json-tools: Library for serializing and deserializing ASTs to JSON and back.
  • Added submodule ptk-xml-tools: Library for serializing and deserializing ASTs to XML and back.
  • Added submodule swc-article-cruncher: A framework for processing Wikitext pages spreading the work over multiple processors.
  • Added submodule swc-dumpreader: A library for reading Wikipedia XML dumps.
  • Added submodule swc-example-basic: Example demonstrating parsing of an article and conversion to HTML.
  • Added submodule swc-example-serialization: Example demonstrating the serialization and deserialization of ASTs to JSON, XML and native Java object streams.
  • Added submodule swc-example-xpath: Example demonstrating XPath queries in ASTs.

Sweble is available on Maven Central

We are finally deploying releases of Sweble and related software to Maven Central. This has many advantages for users of our software, among others:

  • You don’t have to refer to our Maven repositories any more in your own poms (if you only use our releases; snapshots are still only available from our repositories).
  • Releasing your own software on Maven Central becomes easier if you depend on Sweble.

For now, only an updated version of our original 1.0.0 release of the Sweble software is available under the version number 1.0.0.1. However, we hope to provide a new release of the current development branch on Maven Central soon.

With version 1.0.0.1 of Sweble we’ve also started to auto-generate Maven sites for all of our software modules. These sites provide documentation of the individual projects and can be found in the Documentation menu of the Sweble Blog.

Design and Implementation of the Sweble Wikitext Parser: Unlocking the Structured Data of Wikipedia

We will be presenting our paper on the design and implementation of the Sweble Wikitext Parser at the WikiSym 2011 conference! The conference will take place in Mountain View, CA in October.

For those of you who want to take a peek before the conference, we’ve put a pre-print version of the paper in the Sweble Wiki’s downloads section.

We still have some days left for fine-tuning the paper; if you have any suggestions for improvement, we would love to hear from you.

Using CrystalBall, the Sweble Parser Demo

CrystalBall is our parser demo, so you don’t have to get down to code to check out the parser. It is a simple and easy way to see how we interpret Wikitext.

The general Sweble Parser documentation is on the wiki, naturally. Here are a few examples, though, for the hurried among you. Please note that we have not invested in style sheets to make HTML output look nice or like Wikipedia.org output (not our project goal).

Parsing the generic article (page) ASDF:

Some other articles:

The ultimate parser deathmatch Wikipedia article page (courtesy of Luca de Alfaro of WikiTrust fame):

And finally some XPath queries:

Have fun! And please let us know if your favorite article doesn’t do what you think it should do!

Announcing the Open Source Sweble Wikitext Parser v1.0

We are happy to announce the general availability of the first public release of the Sweble Wikitext parser, available from http://sweble.org.

The Sweble Wikitext parser

  • can parse all complex Wikitext, incl. tables and templates
  • produces a real abstract syntax tree (AST); a DOM will follow soon
  • is open source made available under the Apache Software License 2.0
  • is written in Java utilizing only permissively licensed libraries

You can find all relevant information and code at http://sweble.org – this also includes demos, in particular the CrystalBall demo, which lets you query a Wikipedia snapshot using XQuery. (The underlying storage mechanism is not particularly well-performing, so you may have to wait a little if load is high.)


Sweble Website Launch

Finally, our Sweble [1] project site has launched! And with it, one of our first projects is going Open Source: the Wikitext Parser, developed at the Open Source Research Group at the University of Erlangen-Nürnberg.

[1] The Sweble project develops and provides libraries and components for a MediaWiki-compatible wiki software. An important focus is wiki content analysis.