WIP: External data and lib jars
Summary
- addresses #11;
- use Git LFS to track zipped huge data;
- automatically separates and builds the `dacapo-huge-data.jar` when building benchmarks;
- add ant targets to update MD5 and URL of huge data;
- update existing benchmarks to work with this feature (separate huge data; update build script);
- use the manifest file to carry the URL and MD5 of external data package;
- add `--extdata-install` and `--extdata-set-location` flags;
- use `jar` to unpack;
- use `~/.dacapo-config.properties` to save the location of unpacked data directory;
- use `${DATA}` in benchmark config file to refer to files in external data;
- by default `${DATA}` is resolved to `${dacapo-parent}/data`, otherwise the saved location in `~/.dacapo-config.properties`;
- supports externally packaged libs/jars by first searching the external data directory, then scratch if not found;
- checks per-file MD5 integrity when running with `--extdata-set-location` flag;
- also fixes a relative path output problem in `luindex`.
Documentation
Installing external data
External data structure
The external data should contain the following structure:
dat
├── <bench1>-huge.zip
├── <bench2>-huge.zip
├── ...
jar
├── <jar1>
├── <jar2>
├── ...
Building
When building DaCapo, the following properties should be specified in `dacapo.properties`:
# External data
dacapo.externdata.url=
dacapo.externdata.buildjar=true
`dacapo.externdata.url` is the URL from which the user can download the data jar;
`dacapo.externdata.buildjar` is the flag deciding whether to put huge data into a separate jar (`dacapo-huge-data.jar`).
NOTE: if this is set to `false`, the huge data jar will not be built, AND the huge data will not be included in `dacapo.jar` either.
If `dacapo.externdata.url` is not known at build-time, it can be set after the build is finished by running:
$ ant -Dbuild.target-jar=... -Ddacapo.externdata.url=... update-externdata-url
Similarly, the MD5 of `dacapo-huge-data.jar` will be calculated automatically at the end of a `dist` build.
If, perhaps for testing reasons, one wants to set the MD5 after building a single benchmark, it can be done by running:
$ ant -Dbuild.target-jar=... update-externdata-md5
This will automatically calculate the MD5 of `dacapo-huge-data.jar` and update the manifest information in `${build.target-jar}`.
The URL and MD5 information will be embedded in the manifest file in the resulting `dacapo.jar`.
ExternData-URL: ...
ExternData-MD5: ...
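As a rough illustration of how these two manifest attributes could be read back at run time, here is a minimal sketch using `java.util.jar.Manifest`. The class name `ExternDataManifest` and its fields are hypothetical and are not taken from this merge request; only the attribute names `ExternData-URL` and `ExternData-MD5` come from the description above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

/** Hypothetical helper (not the actual DaCapo class) holding the two manifest values. */
final class ExternDataManifest {
    final String url;
    final String md5;

    private ExternDataManifest(String url, String md5) {
        this.url = url;
        this.md5 = md5;
    }

    /** Parses a manifest stream and extracts the external data attributes. */
    static ExternDataManifest read(InputStream manifestStream) {
        try {
            Manifest manifest = new Manifest(manifestStream);
            Attributes main = manifest.getMainAttributes();
            // getValue returns null when an attribute is absent
            return new ExternDataManifest(
                    main.getValue("ExternData-URL"),
                    main.getValue("ExternData-MD5"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a running jar, the stream would typically come from the `META-INF/MANIFEST.MF` resource; how DaCapo actually locates its own manifest is not shown here.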
User install
Two flags have been added to the command-line interface for external data.
--extdata-install <install_path>        Download and install external data.
--extdata-set-location <ext_data_loc>   Path to external data location. Note, this
                                        directory should contain "data" and "jar"
                                        sub-directories.
install
The user can download and install the external data by using the `--extdata-install` flag.
This will download the external data from the URL embedded in the manifest to `${ext_data_loc}`, check the MD5 sum, unpack, and set the location to `${ext_data_loc}`.
The download is done in code, and unpacking is done by launching `jar` as a process.
The following is an example output:
$ java -jar dacapo.jar --extdata-install extdata
Downloading file:/Users/johnz/Repos/dacapo/dacapobench/benchmarks/dacapo-data-huge.jar to extdata/dacapo-data-huge.jar...Done.
MD5 check OK!
Extract extdata/dacapo-data-huge.jar at extdata...Done.
Extracting extdata/dat/lusearch-huge.zip
Extracting extdata/dat/batik-huge.zip
External data location has been set at extdata.
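The unpacking step, launching `jar` as a process, might look roughly like the sketch below. It assumes the JDK `jar` tool is on the `PATH`; the class and method names are hypothetical and the real implementation in this merge request may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;

/** Hypothetical helper (not the actual DaCapo code) that unpacks an archive with the JDK jar tool. */
final class JarUnpacker {
    /** Builds the "jar xf <archive>" command, to be run inside targetDir. */
    static ProcessBuilder unpackCommand(Path archive, Path targetDir) {
        // The archive path is made absolute because the process runs in a different working directory
        return new ProcessBuilder("jar", "xf", archive.toAbsolutePath().toString())
                .directory(targetDir.toFile())
                .inheritIO();
    }

    /** Runs the command and returns the exit code of the jar process. */
    static int unpack(Path archive, Path targetDir) {
        try {
            return unpackCommand(archive, targetDir).start().waitFor();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Delegating to `jar` avoids reimplementing archive extraction, at the cost of requiring a JDK rather than a bare JRE on the user's machine.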
set location
The `--extdata-set-location` flag assumes that the user has the data package downloaded and unpacked on the local machine, and simply records its location.
MD5 checking
When setting the location of the external data/resource with the `--extdata-set-location` flag, DaCapo checks the MD5 of every expected file under the target directory against an included MD5 list file, `dacapo.jar!/META-INF/huge-data-md5s.list`.
This ensures that the user has kept the data intact.
If the check fails, DaCapo still sets the directory location, but prints the following warning message:
WARNING: MD5 checking failed. Your huge data does not match expected release.
Please download and install the latest huge data using --extdata-install flag;
otherwise please note the changes in research publication.
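The per-file check could be sketched as follows. The format of `huge-data-md5s.list` is not described in this merge request, so the sketch assumes one entry per line of the form `<md5-hex> <relative-path>`; that format and the class name `Md5ListCheck` are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical checker; the "<md5-hex> <relative-path>" line format is an assumption. */
final class Md5ListCheck {
    /** Streams the file through MD5 and returns the lowercase hex digest. */
    static String md5Hex(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the entries that are malformed, missing, or whose MD5 differs. */
    static List<String> findMismatches(List<String> listLines, Path root) {
        List<String> bad = new ArrayList<>();
        for (String line : listLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue; // skip blank lines
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length < 2) {
                bad.add(trimmed); // malformed entry
                continue;
            }
            Path file = root.resolve(parts[1]);
            if (!Files.isRegularFile(file) || !md5Hex(file).equalsIgnoreCase(parts[0])) {
                bad.add(parts[1]);
            }
        }
        return bad;
    }
}
```

A non-empty result would correspond to the warning case above: the location is still recorded, but the user is told the data does not match the expected release.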
remembering the location
The location will be written to the `~/.dacapo-config.properties` file:
#
#Thu Jun 28 15:12:45 AEST 2018
Extern-Data-Location=/tmp/data
This file is read when resolving the location of data directory.
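Reading and writing such a file maps naturally onto `java.util.Properties`. The following is a minimal sketch under that assumption; the class name `DacapoConfigFile` is hypothetical, and only the key `Extern-Data-Location` is taken from the example above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical helper (not the actual DaCapo class) for the user config file. */
final class DacapoConfigFile {
    static final String KEY = "Extern-Data-Location";

    /** Loads the properties file, or returns empty properties if it does not exist. */
    private static Properties load(Path configFile) {
        Properties props = new Properties();
        if (Files.isRegularFile(configFile)) {
            try (InputStream in = Files.newInputStream(configFile)) {
                props.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return props;
    }

    /** Returns the saved location, or null if none has been recorded. */
    static String loadLocation(Path configFile) {
        return load(configFile).getProperty(KEY);
    }

    /** Records the location, preserving any other entries already in the file. */
    static void saveLocation(Path configFile, String location) {
        Properties props = load(configFile);
        props.setProperty(KEY, location);
        try (OutputStream out = Files.newOutputStream(configFile)) {
            // An empty comment yields a bare '#' line before the timestamp line
            props.store(out, "");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would pass something like `Paths.get(System.getProperty("user.home"), ".dacapo-config.properties")` as the config file path.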
Using external data
Benchmarks can reference the external data in their size configuration, for example:
size default args "${DATA}/luindex/william","${DATA}/luindex/kjv"
output stdout digest 0xc90792fce1594b4b9ea1b01d593aefe801e6e58b,
stderr digest 0xda39a3ee5e6b4b0d3255bfef95601890afd80709,
"index/segments_1" bytes 136;
In `Config.preprocessArgs()` (called by all benchmarks), `${DATA}` will be replaced by the resolved data directory.
This function will also check the existence of the file under the data directory.
When the referenced file is not found, it will call `ExternData.failDataNotFound()`, which prints out the following and exits with `-1`.
ERROR: failed to find external data for size 'small'.
Please check that you have installed the external data properly (current: /tmp/data)
Please run DaCapo with `--extdata-install` flag to download and install the external data,
or with `--extdata-set-location` to set the location of unpacked external data directory.
`${DATA}` will be resolved according to the following:
- if `Extern-Data-Location` cannot be found in `~/.dacapo-config.properties`, it will be resolved to `${dacapo-parent}/data`, where `${dacapo-parent}` is where `dacapo.jar` resides;
- otherwise, it is resolved to `${Extern-Data-Location}/data`.
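The resolution rule and the `${DATA}` substitution can be sketched in a few lines. This is an illustrative stand-in, not the body of `Config.preprocessArgs()`; the class `DataResolver` and its method names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical illustration of the ${DATA} resolution rule described above. */
final class DataResolver {
    /**
     * @param externDataLocation value of Extern-Data-Location, or null if not configured
     * @param dacapoParent       directory in which dacapo.jar resides
     */
    static Path resolveDataDir(String externDataLocation, Path dacapoParent) {
        if (externDataLocation == null) {
            return dacapoParent.resolve("data"); // default: ${dacapo-parent}/data
        }
        return Paths.get(externDataLocation).resolve("data"); // ${Extern-Data-Location}/data
    }

    /** Replaces every literal ${DATA} in a benchmark argument with the resolved directory. */
    static String substitute(String arg, Path dataDir) {
        return arg.replace("${DATA}", dataDir.toString());
    }
}
```

With `Extern-Data-Location=/tmp/data`, an argument such as `${DATA}/luindex/kjv` would then refer to a file under `/tmp/data/data`.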
Jar dependencies
Jar dependencies can optionally be packaged externally and installed to `${Extern-Data-Location}/jar`.
For each jar dependency in the benchmark, DaCapo first searches the external data jar path above, then searches `scratch/jar` if not found. This supports a gradual transition of moving all the jars out of the DaCapo target jar.
An error message similar to the above is printed if a jar dependency is not found.
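The search order for a jar dependency can be illustrated with a small lookup helper. The class `JarLocator` is hypothetical and only demonstrates the "external first, then scratch" order; it does not reproduce the actual lookup code or its error handling.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical lookup showing the external-then-scratch search order. */
final class JarLocator {
    /**
     * @param jarName       file name of the dependency
     * @param externJarDir  the external jar directory, ${Extern-Data-Location}/jar
     * @param scratchJarDir the fallback directory, scratch/jar
     */
    static Optional<Path> find(String jarName, Path externJarDir, Path scratchJarDir) {
        for (Path dir : new Path[] { externJarDir, scratchJarDir }) {
            Path candidate = dir.resolve(jarName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate); // first match wins
            }
        }
        return Optional.empty(); // caller reports the missing dependency
    }
}
```

Because the external directory is consulted first, a jar that exists in both places is taken from the external data, which is what allows jars to be moved out of the target jar one at a time.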
Other
This merge request also fixes a problem in `luindex` where the relative path is calculated from the scratch directory only.
This does not work for the example config above, which uses the data directory.
The fix chooses between scratch and data based on the path prefix.
Checklist before merging
- squash into 1 commit.