Features of the Sweble Wikitext Parser

This page gives a quick overview of the features of our parser software.

Parser

  • Close to 100% coverage of the original MediaWiki Wikitext grammar. Highlights:
    • Proper handling of italic/bold apostrophes,
    • Proper handling of tables,
    • Proper handling of nested templates and template parameters,
    • Proper handling of scopes (tables, internal link titles, …),
    • Automatically fixing wrong nesting of HTML tags (using the same algorithm as modern browsers).
  • Generates a native AST as output. Advantages:
    • easy machine processability,
    • fully captures the semantic content of a wiki page,
    • arbitrary other output formats are easy to produce from an AST (the effort depends mostly on the complexity of the output format),
    • optional round-trip support: the AST can be converted back into the exact original Wikitext.
  • Issues warnings for suspicious syntax (optional, INCOMPLETE)
    • Example: [[Target|Title]
      Triggers the warning: "This looks like a Internal Link, however the finishing `]]’ is missing."
  • Can perform automatic correction of suspicious syntax (optional, INCOMPLETE)
    • Example: [[Target|Title]
      Is automatically fixed by interpreting it as if the missing `]’ were there.
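
The warning and auto-correction behaviour for the [[Target|Title] example, together with the round-trip idea, can be illustrated with a small, self-contained sketch. This is not the Sweble API: the class LinkSketch, its nested types, the autoCorrect flag, and the warning text are all hypothetical names chosen for illustration, and the real parser handles far more cases.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only; every name here is hypothetical, not part of Sweble. */
final class LinkSketch {

    /** Hypothetical AST node for an internal link. */
    static final class InternalLink {
        final String target;
        final String title;
        private final String source; // exact original text, kept for round-tripping

        InternalLink(String target, String title, String source) {
            this.target = target;
            this.title = title;
            this.source = source;
        }

        /** Round trip: reproduce the original Wikitext character for character. */
        String toWikitext() {
            return source;
        }
    }

    /** Parse outcome: a link node (or null if none was recognized) plus any warnings. */
    static final class Result {
        final InternalLink link;
        final List<String> warnings;

        Result(InternalLink link, List<String> warnings) {
            this.link = link;
            this.warnings = warnings;
        }
    }

    /**
     * Recognizes "[[Target|Title]]". If exactly one closing bracket is present,
     * a warning is recorded; with autoCorrect enabled the input is interpreted
     * as if the missing bracket were there.
     */
    static Result parse(String src, boolean autoCorrect) {
        List<String> warnings = new ArrayList<>();
        if (!src.startsWith("[[")) {
            return new Result(null, warnings);
        }
        String body;
        if (src.endsWith("]]")) {
            body = src.substring(2, src.length() - 2);
        } else if (src.endsWith("]")) {
            warnings.add("Looks like an internal link, but the closing ]] is missing");
            if (!autoCorrect) {
                return new Result(null, warnings);
            }
            body = src.substring(2, src.length() - 1);
        } else {
            return new Result(null, warnings);
        }
        int bar = body.indexOf('|');
        String target = bar < 0 ? body : body.substring(0, bar);
        String title = bar < 0 ? body : body.substring(bar + 1);
        return new Result(new InternalLink(target, title, src), warnings);
    }
}
```

In this sketch, LinkSketch.parse("[[Target|Title]", true) yields a link node with one warning attached, and calling toWikitext() on that node returns the input unchanged, including the missing bracket. With autoCorrect set to false, only the warning is produced.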

Engine

We also provide a clone of the MediaWiki engine that serves as a backend for the actual parser.

  • 100% MediaWiki compatible template expansion,
  • support for some of the most often used parser functions and variables in the English Wikipedia.

Wikitext Object Model (WOM)

The Wikitext Object Model (WOM) is an abstract description of the semantic content of a wiki page. It consists of

  • a schema (XSD) for XML which addresses the question of how a Wikitext document is structured and
  • a set of Java interfaces that address the question of how to work with and alter a Wikitext document.
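
To give a feel for the two halves of such an object model, the following sketch pairs a small node interface (for working with and altering a document) with an XML serialization (for describing its structure). The interface and class names, and the element names "p" and "b", are assumptions made for this example; the actual WOM interfaces and schema may look quite different.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy object model; all names are hypothetical and not taken from the WOM. */
final class WomSketch {

    /** Minimal node interface: inspect children, alter the tree, serialize to XML. */
    interface Node {
        String name();
        List<Node> children();
        void append(Node child);
        String toXml();
    }

    /** An element with a name and an ordered list of child nodes. */
    static final class Element implements Node {
        private final String name;
        private final List<Node> children = new ArrayList<>();

        Element(String name) {
            this.name = name;
        }

        public String name() {
            return name;
        }

        public List<Node> children() {
            return Collections.unmodifiableList(children);
        }

        public void append(Node child) {
            children.add(child);
        }

        public String toXml() {
            StringBuilder sb = new StringBuilder();
            sb.append('<').append(name).append('>');
            for (Node child : children) {
                sb.append(child.toXml());
            }
            return sb.append("</").append(name).append('>').toString();
        }
    }

    /** A leaf node holding plain text. */
    static final class Text implements Node {
        private final String value;

        Text(String value) {
            this.value = value;
        }

        public String name() {
            return "text";
        }

        public List<Node> children() {
            return Collections.emptyList();
        }

        public void append(Node child) {
            throw new UnsupportedOperationException("text nodes are leaves");
        }

        public String toXml() {
            // Escape the characters that are significant in XML character data.
            return value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }
}
```

A paragraph could then be built and altered through the interface alone: create an Element named "p", append a Text node "Hello ", append an Element "b" containing the Text "world", and toXml() on the paragraph gives <p>Hello <b>world</b></p>.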