Java Committee Meeting Notes
from SPEC Benchathon, 11/10/97 - 11/14/97


MEETING SUMMARY

Benchathon Activities
	No porting was done!
	Problem investigation
	Benchmark improvements: validation, workload size, tools
	Versions: v11 going into benchathon, v14 going out,
		v15 with any remaining changes agreed at meeting

Meeting
	Reviewed status and problems
	Discussed analysis data and desired additional data
	Discussed benchmark scaling: problem size vs. iteration count, 
		impacts on memory size and I/O content
		benchmarks to be packaged in zip archives to minimize 
		http connection overhead impact on initial run time
	Eliminated benchmark candidates by consensus:
		203_linpack, 204_newmst, 207_tsp,
		212_gcbench, 223_diners, 225_shock
	Set schedule
		Keep benchmark gate open to 12/31/97
			Intel expects up to 5 additional candidates, 
			real applications
			Everyone urged to solicit new candidates
		Member vote planned in March, release in April
	Discussed how to group and report benchmarks and composite metrics
	Discussed other run rule issues

MEETING AGENDA


STATUS/PROBLEMS

By the time of the meeting most of the problems running benchmarks had been resolved in the benchathon. On _222_mpegaudio a validation problem remained that appeared to be a difference in floating-point accuracy: a checksum value differed by one least significant bit, possibly because some or all intermediate calculations were performed at 80-bit precision instead of 64-bit. That is not strictly IEEE compliant and thus not Java compliant, but it is currently an issue of vigorous standards debate. The subcommittee therefore determined by vote, with Sun as the lone dissenter, that while SPEC will follow whatever standard emerges in the run rules, it will not enforce the current standard in the benchmarks themselves.
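
As an illustration of the scale of the discrepancy (a hedged sketch only, not the actual harness validation code; the class and method names are made up): two doubles that differ by one least significant bit can be distinguished, or deliberately tolerated, by comparing their raw bit patterns.

    // Illustration only: comparing checksums exactly versus allowing a
    // one-bit (one-ULP) difference of the kind seen on _222_mpegaudio.
    // Not the actual SPEC harness code.
    public class ChecksumCompare {
        // True only if the two checksums are bit-for-bit identical.
        static boolean exactlyEqual(double expected, double actual) {
            return Double.doubleToLongBits(expected)
                == Double.doubleToLongBits(actual);
        }

        // True if the checksums differ by at most one least significant bit.
        // (Assumes both values are positive and finite, which is enough for
        // this illustration.)
        static boolean withinOneUlp(double expected, double actual) {
            long a = Double.doubleToLongBits(expected);
            long b = Double.doubleToLongBits(actual);
            return Math.abs(a - b) <= 1;
        }

        public static void main(String[] args) {
            double expected = 0.3;        // checksum computed one way
            double actual = 0.1 + 0.2;    // same value computed another way
            System.out.println("exact match:    " + exactlyEqual(expected, actual));
            System.out.println("within one bit: " + withinOneUlp(expected, actual));
        }
    }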

The other problem discussed at length was timing variability, from one run to another and from execution to execution in an autorun sequence. Most of the problems were observed with V10 and earlier, where a large console buffer was occupying memory and causing excessive garbage collection activity. In V11 and later the amount of output sent to the console is greatly reduced; you can direct console output to your Java console (which may be a file or the system console), or discard console output entirely. However, there were some indications that timing variability remains in some cases with V11, particularly on lower-memory systems.
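
As a rough sketch of the technique behind those options (the class below is hypothetical, not the V11 harness code): console output can be sent to a file, or swallowed entirely, by substituting the stream the benchmark writes to.

    import java.io.*;

    // Hypothetical sketch of redirecting or discarding console output so a
    // large on-screen console buffer never builds up and triggers extra
    // garbage collection.  Not the actual V11 harness code.
    public class ConsoleRedirect {
        // An output stream that throws everything away.
        static class NullOutputStream extends OutputStream {
            public void write(int b) { /* discard */ }
        }

        public static void main(String[] args) throws IOException {
            PrintStream target;
            if (args.length > 0 && args[0].equals("discard")) {
                // Discard console output entirely.
                target = new PrintStream(new NullOutputStream());
            } else {
                // Send console output to a file instead of the screen.
                target = new PrintStream(new FileOutputStream("console.out"));
            }
            System.setOut(target);

            System.out.println("benchmark output goes to the selected target");
            target.flush();
        }
    }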

Anirudha Rahatekar (Intel) suggested some additional controls and instrumentation around the benchmark executions in order to better control the memory environment and reduce variability. These are noted in the "DEVELOPMENT RELEASES" section below and will be available in V15.
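
The specific controls will appear in V15; as a rough idea of the kind of instrumentation discussed (the wrapper below is hypothetical, not the V15 code), the harness could force a garbage collection before each timed execution and record heap occupancy alongside the run time.

    // Hypothetical sketch of memory instrumentation around one benchmark
    // execution; not the actual V15 harness code.
    public class MemoryInstrumentedRun {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();

            // Try to start each execution from a comparable memory state.
            System.gc();
            System.runFinalization();

            long usedBefore = rt.totalMemory() - rt.freeMemory();
            long start = System.currentTimeMillis();

            runBenchmark();   // stand-in for one benchmark execution

            long elapsed = System.currentTimeMillis() - start;
            long usedAfter = rt.totalMemory() - rt.freeMemory();

            System.out.println("elapsed ms:  " + elapsed);
            System.out.println("heap before: " + usedBefore + " bytes in use");
            System.out.println("heap after:  " + usedAfter + " bytes in use");
        }

        // Placeholder workload so the sketch is self-contained.
        static void runBenchmark() {
            int[] scratch = new int[1 << 20];
            for (int i = 0; i < scratch.length; i++) scratch[i] = i * 3;
        }
    }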

Members agreed to perform some tests of variability and share the results with the group. (If you don't want to release absolute numbers, then you can at least give relative percentages.)
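
One simple way to express variability as relative percentages (a sketch only; the statistic each member reports is up to them, and the run times below are made up):

    // Sketch: turn a set of run times into percentages of the mean so
    // variability can be shared without disclosing absolute numbers.
    public class Variability {
        public static void main(String[] args) {
            double[] times = { 1040, 1015, 1102, 1027, 1061 };   // example ms values

            double sum = 0, min = times[0], max = times[0];
            for (int i = 0; i < times.length; i++) {
                sum += times[i];
                if (times[i] < min) min = times[i];
                if (times[i] > max) max = times[i];
            }
            double mean = sum / times.length;

            // Spread of run times as a percentage of the mean run time.
            System.out.println("spread: " + (100.0 * (max - min) / mean) + "% of mean");

            // Each individual run relative to the mean.
            for (int i = 0; i < times.length; i++) {
                System.out.println("run " + i + ": " + (100.0 * times[i] / mean) + "%");
            }
        }
    }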


ANALYSIS - WHAT DO WE HAVE/WANT/WHEN?

We discussed a chart of static and dynamic benchmark characteristics: http://pro.specbench.org/private/osg/java97/benchspec/stats9711.html

Intel offered to provide additional profile information. The profiles in this chart were collected with the ordinary JDK profile flag, which requires that you measure with JIT turned off. Michael Greene saw substantial differences on some of the benchmarks depending on whether JIT was enabled or not, so it was considered important to be able to look at both. Walter Bays thought that he might also be able to get some profiles with JIT but wasn't sure.


BENCHMARK CANDIDATE ELIMINATION

The following benchmark candidates were eliminated by consensus. Here I note some of the strongest issues raised with each; this is not necessarily an exhaustive list.
203_linpack
small, Fortran-like, very high locality

204_newmst
small

207_tsp
small

212_gcbench
Feared susceptible to spoofing. Small, high locality. Sun reported significant speedup from dead code elimination (a sketch of that kind of attack appears after this list). Two vendors reported very short run times with advanced gc algorithms. Editorial note: this was probably the best loved of the eliminated benchmarks, because it is so hard on bad garbage collectors.

223_diners
Suspected synchronization problems, with different systems doing different amounts of work.

225_shock
High locality in Thread.create(), and small application method.
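
Returning to _212_gcbench, the spoofing concern noted above can be illustrated with a sketch (this is not the benchmark's own code): if the objects a loop allocates are never used afterward, an optimizer that proves this may remove the allocations entirely, and the garbage collection load being timed disappears.

    // Illustration of the dead-code-elimination concern raised against
    // _212_gcbench; not taken from the benchmark itself.
    public class DeadCodeSketch {
        static Object sink;   // a visible field the optimizer cannot ignore

        public static void main(String[] args) {
            long start = System.currentTimeMillis();

            for (int i = 0; i < 100000; i++) {
                // Nothing ever reads this array, so an aggressive optimizer
                // may legally delete the allocation -- and with it the
                // garbage-collection work the benchmark meant to create.
                Object[] garbage = new Object[32];
            }

            // Storing a result in a visible field is one way a benchmark can
            // force at least some of the work to be kept.
            sink = new Object[32];

            System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
        }
    }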

HOW/WHETHER TO GROUP BENCHMARKS

Nik Shaylor suggested that some benchmarks be combined in groups in the same manner that _215_richards_gf, _216_richards_g, _217_richards_gns, _218_richards_dna, _219_richards_dac, _220_richards_dav, and _221_richards_dai were combined into _224_richards. In this way the better synthetic programs could be retained in the suite, while counting for less than real applications in the overall score, and being less vulnerable to possible benchmark optimization "attacks".

There was general agreement that real applications are more important than synthetic ones, although it was noted that for commercial applications the source code is typically not available for inspection to see what the program is doing. The situation is more like BAPCo than traditional SPEC CPU benchmarks, and we need to look at their rationales.

There was no agreement on how benchmarks might be grouped. Many felt that there should be some solid basis on which to group benchmarks based on program characteristics or application area. Some suggestions were application/synthetic, integer/floating point, or some combination of those divisions. How or whether to combine sub-metrics into a composite metric in these cases was discussed with no resolution.
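
For reference only, one approach in the spirit of other SPEC suites would be a geometric mean of per-benchmark or per-group ratios, which keeps any single component from dominating the composite; the sketch below uses made-up numbers and is not a committee decision.

    // Illustration of a possible composite metric: the geometric mean of
    // performance ratios.  The ratios are examples, not real results.
    public class CompositeMetric {
        public static void main(String[] args) {
            double[] ratios = { 1.8, 2.3, 0.9, 1.5 };   // reference / measured time

            // Geometric mean: the n-th root of the product of the ratios,
            // computed here via logarithms to avoid overflow.
            double logSum = 0.0;
            for (int i = 0; i < ratios.length; i++) {
                logSum += Math.log(ratios[i]);
            }
            double geomean = Math.exp(logSum / ratios.length);

            System.out.println("composite (geometric mean): " + geomean);
        }
    }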


RE-OPEN BENCHMARK CANDIDATE GATE?

Walter Bays proposed an accelerated release schedule. Michael Greene offered that if the gate were held open until the end of December, there was a strong likelihood he could obtain as many as five additional benchmark candidates that were real applications. The later schedule was proposed and accepted by the committee.

TIME SCHEDULE

     "Early"                    "Late"
Nov
     Close gate                 continue benchmark search
     Analyze
Dec
     subcommittee vote
     OSSC vote                  Close gate
Jan  begin member vote          Analyze
     Annual meeting             Annual meeting
     end member vote
Feb                             subcommittee vote
     release                    OSSC vote
Mar                             begin member vote
Apr                             end member vote
                                release

In the next month everyone is encouraged to redouble their efforts to acquire additional benchmark candidates, particularly real applications. Benchmarks should be fitted into the SPEC tool harness; Walter has an action item to send out a guide to the steps needed to do this. To give everyone as much time as possible to examine the candidates, and to improve their chances of being accepted into the suite, do not wait for the last day: send information on any prospective candidates as soon as you have it, and get the benchmark out to committee members as soon as possible. Michael Greene now "owns" the benchmark numbers 232 through 236.


RUN RULES

We agreed that the SPEC tools (with graphical user interface) will be required to report results. The "batch mode" flag implemented in V11 should still allow automated, assembly-line test operation while using the SPEC tools. The benchmarks will run on embedded systems without a graphical display, but these are not the primary target of the benchmarks and such results may not be reported. Members interested in such measurements are encouraged to use the client benchmark work as a starting point and investigate defining appropriate metrics for their environment, possibly using scaled-down tools from Java client.

We discussed running short versions (e.g. 10%) of the benchmarks for systems without JIT and with small memories, such as embedded systems or low-end NCs. No resolution was reached. It was thought that perhaps some intermediate problem size (e.g. 20%) would be more appropriate. Attention would have to be paid to both memory size and run time. Perhaps one follow-on benchmark would be able to address both embedded systems and low-end NCs.


PROBLEM SIZE VS. ITERATION COUNT

HP raised the issue of which benchmarks increase run time by scaling problem size, and which by iteration count. Problem size was generally felt to be the better method, subject to memory size constraints. The fear is that adaptive optimizers and JITs will be able to "learn" a benchmark too well if it loops more than a real application does, e.g. through overly optimistic branch prediction success.
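
As a sketch of the distinction (the loop bodies below are placeholders, not code from any candidate): scaling by iteration count repeats the same small problem, whose branch and data behavior an adaptive optimizer can learn, while scaling by problem size presents more data for the same increase in run time, at the cost of a larger memory footprint.

    // Placeholder workloads contrasting the two scaling strategies discussed;
    // not taken from any benchmark candidate.
    public class ScalingSketch {
        // Iteration-count scaling: the same small problem run many times, so
        // branch outcomes and data reuse repeat and can be "learned".
        static long byIterationCount(int iterations) {
            long checksum = 0;
            int[] data = new int[1000];              // fixed, small problem
            for (int it = 0; it < iterations; it++) {
                for (int i = 0; i < data.length; i++) {
                    data[i] = (i * 7 + it) & 0xff;
                    checksum += data[i];
                }
            }
            return checksum;
        }

        // Problem-size scaling: one pass over a larger data set, which also
        // grows the memory footprint -- the constraint noted above.
        static long byProblemSize(int size) {
            long checksum = 0;
            int[] data = new int[size];              // problem grows with run time
            for (int i = 0; i < data.length; i++) {
                data[i] = (i * 7) & 0xff;
                checksum += data[i];
            }
            return checksum;
        }

        public static void main(String[] args) {
            System.out.println(byIterationCount(1000));
            System.out.println(byProblemSize(1000 * 1000));
        }
    }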

COMPOSITE METRIC

See above under "GROUP BENCHMARKS".

DEVELOPMENT RELEASES

Changes from V11 to V14 release:
_202_jess 
     Longer 100% workload 
_205_raytrace/ 
     Removed spurious output - KMD/NS 
_213_javac/ 
     New longer 100% workload - KMD/NS 
_214_deltablue/ 
     New shorter 100% workload - KMD/NS 
_222_mpegaudio 
     Fixed validation problem per subcommittee vote on floating point accuracy 
_224_richards/ 
     Restored printout of subunit timings for academic purposes - KMD/NS 
_227_mtrt 
     Fixed validation problem: validation depended on thread order and now 
     does not 
Removed 6 benchmarks eliminated by subcommittee consensus, and put them into a "Removed" group. There were other changes to some of them as well but these are not particularly important now.
_203_linpack
_204_newmst
     New workload. Some doubles changed to float - KMD 
_207_tsp
_212_gcbench
     New longer 100% workload. Sized to still fit in 30MB heap space - KMD/NS 
_223_diners
_225_shock

These remain in the "Removed" category in case someone wants to work on revising/combining them to try again for the committee's approval. Even though they are removed, we all owe these benchmark authors a big thank-you for their effort, and for the beneficial effects their benchmarks have already had on JVMs during suite development. Should any author not wish to have his code remain in the "Removed/work-in-progress" category, we will pull it from the next release and all SPEC members will be asked to delete all copies of that benchmark in their possession. I will be contacting the authors on that subject soon. As a corollary, if any of these benchmarks provides you with useful insights into your systems' performance and you would like to retain access to it, then it is in your interest to contact the author and work with her on addressing its shortcomings for the suite. (Note also that some of these are freely available on the net.)

Version 15 should include changes to the tool harness agreed in last week's meeting, primarily aimed at the issue of timing variability.