Ph.D. Theses

Automatic Provenance Capturing for Research Publications

By Linyun Fu
Advisor: Peter Fox
December 1, 2015

Provenance is critical for readers of research publications: it lets them interpret important content correctly and evaluate the credibility of the reported results by examining the software used, the sources of and changes to the data, and the responsible agents. It also enables readers to reproduce the scientific conclusions by following or adapting the process that led to the reported results. However, creating proper provenance for research publications can be costly for authors who lack the necessary knowledge and technical support. First, it requires knowing which logical provenance information to capture for the report-creation process, which imposes extra learning overhead on the authors. Second, it may also require technical knowledge of the physical configuration of the platform on which the programs execute, such as the operating system or even the computer hardware, in order to obtain provenance information that is useful for reproducing and validating the content. This usually entails even more learning overhead. Even if the authors already know what provenance should be recorded and how to record it, the actual recording work usually distracts them from authoring the publication, so they are insufficiently motivated to do it.

Existing frameworks and systems for capturing the provenance of computational experiments are either specifically tailored for scientific workflow systems or based on a model that is not detailed enough to reproduce the published results. Authors who are not familiar with any workflow system must learn to use one before they can create provenance that is detailed enough for reproducibility.

In this thesis, we specify a paradigm for preparing research publications, based on the invocation of operations, that overcomes many of the provenance capture challenges described above. Under this paradigm, publications are created on a portable, provenance-aware platform that transparently captures the proper provenance information. We created the PROV-PUB-O ontology to capture the proper provenance knowledge for authoring processes based on invocations of operations, and to describe and locate the published results within research publications. To evaluate the usability of PROV-PUB-O, we created the Ontology Usability Scale (OUS), which is the first set of metrics for ontology usability evaluation.

We then elaborate a provenance capture framework that enables this paradigm and fulfills the following requirements. First, the captured provenance must be stored in such a way that the reproducibility of the reported results can be determined, and that the "false paths" in the provenance graph that cause a given result not to be reproducible can be found. Second, the authoring platform must use a front end that supports a variety of the programming languages and modes that researchers actually use to create results, so that the learning overhead is kept to a minimum. Third, provenance capture must require no or minimal involvement from the users. A prototype platform is implemented to demonstrate the specified framework. Our use case is Chapter 4 of the 2014 U.S. National Climate Assessment report (NCA2014), and we show that the prototype captures the reproduction-enabling provenance of the tables and figures in this chapter.
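
The first requirement can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the thesis's own design: the provenance graph is reduced to a hypothetical "derived from" mapping, a "false path" is taken to mean a dependency path from a result to a node that can no longer be retrieved or re-executed, and the node names and the helpers `false_paths` and `is_reproducible` are invented. It does not use PROV-PUB-O or the prototype platform.

```python
from typing import Dict, List, Set

# Hypothetical provenance graph: each node maps to the nodes it was derived from.
# Assumed to be acyclic, as a record of past derivations normally is.
DERIVED_FROM: Dict[str, List[str]] = {
    "figure_1": ["plot_call"],
    "plot_call": ["clean_data", "plot_script"],
    "clean_data": ["raw_data"],
    "plot_script": [],
    "raw_data": [],
}

# Hypothetical set of nodes that can no longer be retrieved or re-executed.
UNAVAILABLE: Set[str] = {"raw_data"}


def false_paths(
    result: str,
    derived_from: Dict[str, List[str]],
    unavailable: Set[str],
) -> List[List[str]]:
    """Return every dependency path from `result` that ends at an unavailable node."""
    paths: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        path = path + [node]  # copy, so sibling branches do not share state
        if node in unavailable:
            paths.append(path)
            return
        for parent in derived_from.get(node, []):
            walk(parent, path)

    walk(result, [])
    return paths


def is_reproducible(
    result: str,
    derived_from: Dict[str, List[str]],
    unavailable: Set[str],
) -> bool:
    """Under the assumption above, a result is reproducible if it has no false paths."""
    return not false_paths(result, derived_from, unavailable)
```

Calling `false_paths("figure_1", DERIVED_FROM, UNAVAILABLE)` on this toy graph returns the single path that runs from the figure through the plotting call and the cleaned data to the missing raw data, which is the kind of diagnosis the requirement asks the stored provenance to support.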
