WIP: External data and lib jars

John Zhang requested to merge 11-managing-large-data-sets into master

Summary

  • addresses #11;
  • use Git LFS to track zipped huge data;
  • automatically separates and builds the dacapo-huge-data.jar when building benchmarks;
  • add ant targets to update MD5 and URL of huge data;
  • update existing benchmarks to adapt to work with this feature (separate huge data; update build script);
  • use the manifest file to carry the URL and MD5 of external data package;
  • add --extdata-install and --extdata-set-location flags;
  • use jar to unpack;
  • use ~/.dacapo-config.properties to save the location of unpacked data directory;
  • use ${DATA} in benchmark config file to refer to files in external data;
  • by default ${DATA} is resolved to ${dacapo-parent}/data, otherwise the saved location in ~/.dacapo-config.properties;
  • supports externally packaged libs/jars by first searching the external data directory, then scratch if not found;
  • checks per-file MD5 integrity when running with --extdata-set-location flag;
  • also fixes a relative path output problem in luindex.

Documentation

Installing external data

External data structure

The external data should contain the following structure:

dat
├── <bench1>-huge.zip
├── <bench2>-huge.zip
├── ...
jar
├── <jar1>
├── <jar2>
├── ...

Building

When building DaCapo, the following properties should be specified in dacapo.properties:

# External data
dacapo.externdata.url=
dacapo.externdata.buildjar=true

dacapo.externdata.url is the URL from which users can download the data jar. dacapo.externdata.buildjar is a flag that decides whether the huge data is put into a separate jar (dacapo-huge-data.jar). NOTE: if it is set to false, the huge data jar is not built, and the huge data is not included in dacapo.jar either.

If dacapo.externdata.url is not known at build-time, it can be set after the build is finished by running:

$ ant -Dbuild.target-jar=... -Ddacapo.externdata.url=... update-externdata-url

Similarly, the MD5 of dacapo-huge-data.jar is calculated automatically at the end of a dist build. To set the MD5 after building a single benchmark (for testing, say), run:

$ ant -Dbuild.target-jar=... update-externdata-md5

This will automatically calculate the MD5 of dacapo-huge-data.jar and update the manifest information in ${build.target-jar}.

The URL and MD5 are embedded in the manifest of the resulting dacapo.jar:

ExternData-URL: ...
ExternData-MD5: ...
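
Reading the two attributes back could look roughly like the sketch below. It uses only the standard java.util.jar API; the class name ExternDataManifest and its method are hypothetical and are not taken from this merge request. In practice the Manifest would come from opening dacapo.jar with java.util.jar.JarFile and calling getManifest().

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

/** Hypothetical helper for illustration; not part of DaCapo. */
final class ExternDataManifest {
    // Attribute names as shown in the manifest excerpt above.
    static final String URL_KEY = "ExternData-URL";
    static final String MD5_KEY = "ExternData-MD5";

    /** Returns {url, md5}; an element is null if the attribute is absent. */
    static String[] read(Manifest manifest) {
        Attributes main = manifest.getMainAttributes();
        return new String[] { main.getValue(URL_KEY), main.getValue(MD5_KEY) };
    }
}
```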

User install

Two flags have been added to the commandline interface with regard to the external data.

--extdata-install <install_path>          Download and install
                                          external data.
--extdata-set-location <ext_data_loc>     Path to external data
                                          location. Note, this
                                          directory should
                                          contain "data" and "jar"
                                          sub-directories.

install

The user can download and install the external data with the --extdata-install flag. This downloads the external data from the URL embedded in the manifest to the given install path, checks the MD5 sum, unpacks the archive, and records that path as the external data location.

The downloading is done through code, and unpacking is done by launching jar as a process.
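
As a rough illustration of the unpacking step, launching jar as a child process could be sketched as follows. The class JarUnpacker is hypothetical, and the sketch assumes a jar executable is available on the PATH; how DaCapo actually locates and invokes jar is not shown in this description.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of DaCapo. */
final class JarUnpacker {
    /** Builds the command line "jar xf <archive>" (assumes jar is on the PATH). */
    static List<String> command(Path archive) {
        return Arrays.asList("jar", "xf", archive.toAbsolutePath().toString());
    }

    /** Extracts the archive into targetDir and returns the exit code of jar. */
    static int unpack(Path archive, Path targetDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(archive))
                .directory(targetDir.toFile()) // jar extracts into its working directory
                .inheritIO()                   // forward jar's output to the console
                .start();
        return process.waitFor();
    }
}
```

A non-zero return value from unpack would indicate that the extraction failed.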

The following is an example output:

$ java -jar dacapo.jar --extdata-install extdata
Downloading file:/Users/johnz/Repos/dacapo/dacapobench/benchmarks/dacapo-data-huge.jar to extdata/dacapo-data-huge.jar...Done.
MD5 check OK!
Extract extdata/dacapo-data-huge.jar at extdata...Done.
Extracting extdata/dat/lusearch-huge.zip
Extracting extdata/dat/batik-huge.zip
External data location has been set at extdata.

set location

The --extdata-set-location flag assumes that the user has already downloaded and unpacked the data package on the local machine, and simply records its location.

MD5 checking

When the location of the external data is set with the --extdata-set-location flag, DaCapo checks the MD5 of every file listed in the bundled MD5 list, dacapo.jar!/META-INF/huge-data-md5s.list, against the corresponding file under the target directory. This ensures that the user's copy of the data is intact.
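
A minimal sketch of such a per-file check is given below. The class Md5Check is hypothetical. Because the format of huge-data-md5s.list is not described here, the sketch takes the expected checksums as an already parsed map from relative path to hex MD5, which is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of DaCapo. */
final class Md5Check {
    /** Returns the MD5 of a file as a lowercase hex string. */
    static String md5Hex(Path file) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Returns the relative paths that are missing or whose MD5 differs. */
    static List<String> mismatches(Path root, Map<String, String> expected) throws IOException {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, String> entry : expected.entrySet()) {
            Path file = root.resolve(entry.getKey());
            if (!Files.isRegularFile(file) || !md5Hex(file).equalsIgnoreCase(entry.getValue())) {
                bad.add(entry.getKey());
            }
        }
        return bad;
    }
}
```

An empty result from mismatches means every expected file is present and unchanged.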

If the check fails, the directory location is still set, but the following warning is printed:

WARNING: MD5 checking failed. Your huge data does not match expected release.
Please download and install the latest huge data using --extdata-install flag;
otherwise please note the changes in research publication.

remembering the location

The location will be written to ~/.dacapo-config.properties file:

#
#Thu Jun 28 15:12:45 AEST 2018
Extern-Data-Location=/tmp/data

This file is read when resolving the location of data directory.
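
Saving and reading back this property can be sketched with java.util.Properties as below. The class LocationStore is hypothetical; only the key name Extern-Data-Location is taken from the file shown above. The path of the config file is passed in so that the sketch does not touch the real home directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical helper for illustration; not part of DaCapo. */
final class LocationStore {
    // Key name as it appears in ~/.dacapo-config.properties above.
    static final String KEY = "Extern-Data-Location";

    /** Writes the location, keeping any other properties already in the file. */
    static void save(Path configFile, String location) throws IOException {
        Properties props = new Properties();
        if (Files.exists(configFile)) {
            try (InputStream in = Files.newInputStream(configFile)) {
                props.load(in);
            }
        }
        props.setProperty(KEY, location);
        try (OutputStream out = Files.newOutputStream(configFile)) {
            props.store(out, null); // store() adds a timestamp comment line
        }
    }

    /** Returns the saved location, or null if the file or the key is missing. */
    static String load(Path configFile) throws IOException {
        if (!Files.exists(configFile)) {
            return null;
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(configFile)) {
            props.load(in);
        }
        return props.getProperty(KEY);
    }
}
```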

Using external data

Benchmarks can reference the external data in their size configuration, for example:

size default args "${DATA}/luindex/william","${DATA}/luindex/kjv"
  output stdout  digest 0xc90792fce1594b4b9ea1b01d593aefe801e6e58b,
         stderr  digest 0xda39a3ee5e6b4b0d3255bfef95601890afd80709,
	 "index/segments_1"  bytes 136;

In Config.preprocessArgs() (called by all benchmarks), ${DATA} will be replaced by the resolved data directory. This function will also check the existence of the file under the data directory. When the referenced file is not found, it will call ExternData.failDataNotFound(), which prints out the following and exits with -1.

ERROR: failed to find external data for size 'small'.
Please check that you have installed the external data properly (current: /tmp/data)
Please run DaCapo with `--extdata-install` flag to download and install the external data,
or with `--extdata-set-location` to set the location of unpacked external data directory.
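
The substitution and existence check described above can be pictured with the following sketch. It is not the actual Config.preprocessArgs() code; the class DataArgs and its methods are hypothetical, and it assumes that only arguments containing the literal token ${DATA} refer to external data.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not the real Config.preprocessArgs(). */
final class DataArgs {
    static final String TOKEN = "${DATA}";

    /** Replaces every ${DATA} token in the arguments with the data directory. */
    static String[] substitute(String[] args, Path dataDir) {
        String[] out = new String[args.length];
        for (int i = 0; i < args.length; i++) {
            out[i] = args[i].replace(TOKEN, dataDir.toString()); // literal, not regex
        }
        return out;
    }

    /** Returns the substituted paths of ${DATA} arguments that do not exist. */
    static List<String> missing(String[] args, Path dataDir) {
        List<String> notFound = new ArrayList<>();
        for (String arg : args) {
            if (arg.contains(TOKEN)) {
                String resolved = arg.replace(TOKEN, dataDir.toString());
                if (!Files.exists(Paths.get(resolved))) {
                    notFound.add(resolved);
                }
            }
        }
        return notFound;
    }
}
```

A caller would report an error like the one above whenever missing returns a non-empty list.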

${DATA} will be resolved according to the following:

  • if Extern-Data-Location cannot be found in ~/.dacapo-config.properties, it will be resolved to ${dacapo-parent}/data, where ${dacapo-parent} is where dacapo.jar resides;
  • otherwise, it is resolved to ${Extern-Data-Location}/data.
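
The two rules above amount to a small fallback, sketched here under the assumption that the config file has already been loaded into a java.util.Properties object. The class DataLocation is hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical helper for illustration; not part of DaCapo. */
final class DataLocation {
    /**
     * Resolves the data directory: the saved Extern-Data-Location if present,
     * otherwise the directory containing dacapo.jar, with "data" appended.
     */
    static Path resolve(Properties config, Path dacapoParent) {
        String saved = config.getProperty("Extern-Data-Location");
        Path base = (saved == null) ? dacapoParent : Paths.get(saved);
        return base.resolve("data");
    }
}
```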

Jar dependencies

Jar dependencies can optionally be packaged externally and installed to ${Extern-Data-Location}/jar.

Each jar dependency of a benchmark is looked up first in the external jar path above, then in scratch/jar if it is not found there. This supports a gradual transition of moving all the jars out of the DaCapo target jar.

An error message similar to the above is printed if a jar dependency is not found.
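
The lookup order can be sketched as follows. The class JarLookup is hypothetical, and the two directories are passed in explicitly rather than derived from DaCapo's own configuration.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not part of DaCapo. */
final class JarLookup {
    /**
     * Looks for the jar in the external jar directory first, then in the
     * scratch jar directory. Returns null if it is in neither.
     */
    static Path find(String jarName, Path externJarDir, Path scratchJarDir) {
        Path external = externJarDir.resolve(jarName);
        if (Files.isRegularFile(external)) {
            return external;
        }
        Path scratch = scratchJarDir.resolve(jarName);
        return Files.isRegularFile(scratch) ? scratch : null;
    }
}
```

A null result corresponds to the case where the error message is printed.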

Other

This merge request also fixes a problem in luindex where the relative path was always calculated from the scratch directory. This does not work for the example config above, which uses the data directory. luindex now chooses between scratch and data based on the path prefix.

Checklist before merging

  • squash into 1 commit.
Edited by John Zhang
