Managing large data sets

(Based on GitHub issue 144)

Goal

Rework the distribution model for the DaCapo suite and its internal structure to allow for very large data sets (including data in the GB range). A secondary goal is to reconsider other aspects of the distribution model and internal structure as part of this process.

Background

The current distribution model is a single, fully self-contained jar that holds all libraries and data. This has the distinct advantage of extreme simplicity from the user's point of view: the user simply downloads the jar and types java -jar dacapo.jar <benchmark>. The rationale for this is laid out in the DaCapo paper. Ease of use is a first-order principle because it encourages correct, and thus methodologically sound, use, which is the overriding concern of the project. Complexity is antithetical to that goal.

However, future releases of DaCapo need to support very large data sets, and the current model will not scale to them, in part because a very large data set would have to be unpacked from the jar each time the jar is used. So we need to rethink the distribution model.

Proposal

Implement two packaging approaches and evaluate them both, before selecting one:

Minimal change for user. Under this model, the user's experience of the dacapo jar is unchanged unless they use large data sets.

Advantage: only those who use very large data sets will notice any change at all.
Disadvantage: lack of uniformity between use of large data sets and all other data sets.

Complete change. Under this model, all data (and possibly jars) will be packaged differently.

Advantage: uniform treatment of all benchmark sizes, and an opportunity to move all data and jars out of the existing jar.
Disadvantage: major change in use for all users.

It should be straightforward to accommodate both approaches.

The requirements for external storage of any data or jar should be as follows:

The data/jars reside at one of a number of standard paths, or else at a user-specified path (if the user chooses a non-standard location).
The benchmark harness will search the standard paths (and the non-standard path, if provided) and only prompt the user if the data/jars cannot be found. Once the data/jars are installed, command-line use of the suite should be identical to what it was before.
If the data/jars cannot be found, the harness will prompt the user to install them and, if the user accepts, carry out the installation automatically.
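
The search behaviour described in these requirements could be sketched roughly as follows. This is a minimal illustration only, not harness code: the class DataLocator, its methods, the example standard paths, the entry name "large-data", and the choice to try the user-specified path before the standard ones are all assumptions made for the example.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper; none of these names come from the DaCapo code base. */
public class DataLocator {

    /** Builds the ordered search list: the user-specified path first (if any), then the standard paths. */
    public static List<Path> searchPaths(Path userPath, List<Path> standardPaths) {
        List<Path> paths = new ArrayList<>();
        if (userPath != null) {
            paths.add(userPath);
        }
        paths.addAll(standardPaths);
        return paths;
    }

    /** Returns the first location where the requested entry exists, or empty if none does. */
    public static Optional<Path> locate(List<Path> candidates, String entryName) {
        for (Path dir : candidates) {
            Path entry = dir.resolve(entryName);
            if (Files.exists(entry)) {
                return Optional.of(entry);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed standard locations, for illustration only.
        List<Path> standard = List.of(
                Path.of(System.getProperty("user.home"), ".dacapo"),
                Path.of("/usr/local/share/dacapo"));
        Path userPath = args.length > 0 ? Path.of(args[0]) : null;

        Optional<Path> found = locate(searchPaths(userPath, standard), "large-data");
        if (found.isPresent()) {
            System.out.println("Using data at " + found.get());
        } else {
            // Only at this point would the harness prompt the user to install.
            System.out.println("Data not found; the harness would offer to install it here.");
        }
    }
}
```

The point of the sketch is that the prompt is reached only after every candidate path has been tried, so a user with the data already installed sees no change on the command line.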

Requirements

We will need to define a standard file structure for the extra data/jars. That structure should have some coherence with the internal jar structure. It should also be robust to version changes and to the reality that researchers may very well use multiple benchmark versions concurrently.
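
One way to meet the multiple-versions requirement is to make the suite version a directory level of its own, so that installations of different versions never share files. The sketch below assumes a purely hypothetical layout of root/version/dat/benchmark/size. The class DataLayout, the method dataDir, the directory name "dat", and the root "dacapo-data" are illustrative choices, not a decided structure.

```java
import java.nio.file.Path;

/** Hypothetical layout helper; the structure shown is an assumption, not a decided design. */
public class DataLayout {

    /** Resolves <root>/<suiteVersion>/dat/<benchmark>/<size>. */
    public static Path dataDir(Path root, String suiteVersion, String benchmark, String size) {
        return root.resolve(suiteVersion).resolve("dat").resolve(benchmark).resolve(size);
    }

    public static void main(String[] args) {
        Path root = Path.of("dacapo-data");
        // Two versions of the same benchmark data resolve to disjoint directories.
        System.out.println(dataDir(root, "9.12", "h2", "large"));
        System.out.println(dataDir(root, "next", "h2", "large"));
    }
}
```

Keying the path on the version first means an older release can keep running against its own data while a newer release is installed alongside it.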

Edited May 26, 2018 by Steve Blackburn