Final Thesis: Processing Open Transport Data: Design and Implementation of an Extension for a Data Pipeline Modeling Language

Abstract: Open transport data enables innovation by offering a vast amount of information for developers, researchers, urban planners, and entrepreneurs to create new applications, services, and business models. However, the lack of specific guidelines for making data “open” has led to a diverse landscape of proprietary and heterogeneous open data platforms. Open transport data formats, such as static and real-time General Transit Feed Specification (GTFS), provide a standardized way to share information about public transit systems, including schedules and vehicle positions. Since this data is difficult to access and often volatile, archiving GTFS data has several advantages, including enabling passenger routing and traffic flow analysis. The JValue research project aims to democratize collaborative data engineering by providing, among other components, Jayvee, a domain-specific language for data pipeline modeling. This thesis focuses on extending Jayvee to support processing of GTFS static and real-time data. The development process involves defining functional requirements through a Request for Comments (RFC) process and implementing the extension incrementally by introducing new language features such as a data extractor for HTTP content, an interpreter for ZIP files, and a filesystem component. In the evaluation phase, a demonstrator showcases proper system execution and the periodic archival mechanism. As a result, users of Jayvee can now access, process, and periodically archive GTFS static and real-time data. Future improvements include automating the handling of optional fields and tables, providing user-friendly pre-configured GTFS dataset layouts, and introducing a concept for composite pipelines. This engineering thesis serves as a guide for the open transport data research community on how to extend open source software like Jayvee to reduce barriers to accessing and processing open transport data.

Keywords: Open Data, Data Engineering, Domain-specific Language

PDF: Master Thesis

Reference: Johannes Noah Schilling. Processing Open Transport Data: Design and Implementation of an Extension for a Data Pipeline Modeling Language. Master Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2023.