From Java to Assembly in Java's 1-Billion-Row Challenge (Ep. 4)

https://www.youtube.com/watch?v=XRUMbGweHsY

-XX:CompileThreshold=1 -XX:-TieredCompilation is an interesting test.

Even then, the interpreted version ran 9969 times.
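A minimal driver for repeating this experiment, as a sketch. The class name CompileThresholdDemo and its find method are hypothetical stand-ins, not the challenge code. Run it with java -XX:CompileThreshold=1 -XX:-TieredCompilation -XX:+PrintCompilation CompileThresholdDemo to see when find gets queued for compilation. According to the java tool documentation, CompileThreshold defaults to 10,000 interpreted invocations for the server JVM and is ignored when tiered compilation is enabled, which is why -XX:-TieredCompilation is needed. The same documentation says methods are compiled as a background task by default and keep running in the interpreter until that finishes. Comparing against a run with -Xbatch, which disables background compilation, may help explain the count above, though that is an assumption rather than something established here.

```java
// Hypothetical driver for the CompileThreshold experiment; find is a stand-in, not the 1BRC code.
public class CompileThresholdDemo {
    // Linear search over data: returns the index of target, or -1 if absent.
    static int find(int[] data, int target) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] data = new int[1024];
        for (int i = 0; i < data.length; i++) {
            data[i] = i * 3;
        }
        // Call find many times so -XX:+PrintCompilation has something to report.
        long checksum = 0;
        for (int iter = 0; iter < 20_000; iter++) {
            checksum += find(data, (iter % data.length) * 3);
        }
        // Printing the result keeps the calls from being optimized away.
        System.out.println("checksum = " + checksum);
    }
}
```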

JVM Flags Explanation https://chatgpt.com/s/t_68fe297b69308191a32d33d5108e0a59

Nice comments on the video:

@alexlindgren858 fyi, those "ignored_X" fields are the object header in Java. Two of them are ints that together make up the mark word, which afaik holds information for the GC. The ignored_c field is the klass word, so it points to the object's Klass* metadata. Now, the klass word is not used during runtime at all (afaik), unless you explicitly call getClass() on the object, at which point it will fetch the object's Klass* metadata. You can actually trick the JVM into casting an object illegally without getting a ClassCastException by manually changing the klass word to something else using Unsafe. For example, the klass word 8088 (or 8808, I can't remember exactly) will turn the object into an int array.
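A sketch of reading those header words directly, assuming a 64-bit HotSpot JVM with default settings (compressed class pointers on, compact object headers off), where bytes 0 to 7 hold the mark word and bytes 8 to 11 hold the compressed klass word. It relies on sun.misc.Unsafe, an unsupported API whose memory-access methods recent JDKs deprecate for removal. HeaderPeek, A and B are hypothetical names. The sketch only reads the header; overwriting the klass word as described above can crash the JVM, so it is not shown.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch: peek at HotSpot object header words via sun.misc.Unsafe (unsupported API).
// ASSUMPTION: 64-bit HotSpot, compressed class pointers on, compact object headers off,
// so offset 0 starts the 8-byte mark word and offset 8 holds the 4-byte klass word.
public class HeaderPeek {
    static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Unsafe not available", e);
        }
    }

    // Two unrelated hypothetical classes with the same shape (one int field each).
    static class A { int value; }
    static class B { int value; }

    // Low 32 bits of the mark word (lock bits, GC age, part of the identity hash).
    static int markWordLow(Object o) {
        return UNSAFE.getInt(o, 0L);
    }

    // Compressed klass pointer: the same for every instance of one class (under the assumption above).
    static int klassWord(Object o) {
        return UNSAFE.getInt(o, 8L);
    }

    public static void main(String[] args) {
        A a1 = new A();
        A a2 = new A();
        B b = new B();
        System.out.printf("a1 klass word: 0x%08x%n", klassWord(a1));
        System.out.printf("a2 klass word: 0x%08x%n", klassWord(a2));
        System.out.printf("b  klass word: 0x%08x%n", klassWord(b));

        // Requesting the identity hash typically stores it in the mark word.
        System.out.printf("a1 mark (low) before hash: 0x%08x%n", markWordLow(a1));
        System.identityHashCode(a1);
        System.out.printf("a1 mark (low) after hash:  0x%08x%n", markWordLow(a1));
    }
}
```

Under those assumptions, both instances of A should report the same klass word and B a different one, while the low mark word bits usually change once the identity hash has been computed.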

@alskidan All that „crap” was the previous versions of this function. That’s why its final incarnation was called find@10 (that’s the 10th compilation). The JVM has a tracing compiler. It can unroll loops, inline function calls and do a lot of pretty sophisticated stuff. For example, it can detect if a function tends to return the same result on every invocation and just inline the answer. That is both Java’s weakness (slow startup times, slow warmup times) and its strength (potentially much faster code than C/C++ would have produced). All of that means it’s not that valuable to look at the assembly output. There are also counterintuitive things: Java favors shorter functions, while C favors longer ones. Java can inline, unroll loops and do other optimizations across modules, while C loves unity builds (single translation units). Best of luck to you guys, perf optimization on the JVM is a very deep rabbit hole ;)
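To see the "shorter functions" point in practice, a small sketch (InlineDemo and its methods are hypothetical). The square method compiles to 4 bytes of bytecode, well under HotSpot's default MaxInlineSize of 35 bytes, so it is a natural inlining candidate once sumOfSquares gets hot. Running java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InlineDemo prints the JIT's inlining decisions; the exact output varies between JDK versions.

```java
// Hypothetical example for observing HotSpot inlining decisions.
public class InlineDemo {
    // 4 bytes of bytecode (iload_0, iload_0, imul, ireturn): an easy inlining candidate.
    static int square(int x) {
        return x * x;
    }

    // Hot loop whose call to square is a likely inlining site once compiled.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += square(i);
        }
        return total;
    }

    public static void main(String[] args) {
        long result = 0;
        // Repeat enough times for the JIT to compile sumOfSquares.
        for (int round = 0; round < 10_000; round++) {
            result += sumOfSquares(1_000);
        }
        System.out.println(result);
    }
}
```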

@Bobbias For anyone curious, the numbers in the bytecode listing are byte offsets.

Many opcodes are a single byte because the JVM is stack-based, so the offset often increases by only 1; however, branches and some other opcodes take extra operand bytes.

if_icmpgt, for example, takes 2 additional bytes that form a signed 16-bit offset relative to the address of the branch instruction itself. This is why the first comparison in the bytecode is at offset 7 and the next instruction is at offset 10.

You might also notice the aload_N instructions and wonder about the numbers. These load an object reference from the local variable at index N and push it onto the operand stack (astore_N does the reverse). N ranges from 0 to 3, giving one-byte access to the first 4 local variables; for indices greater than 3, the generic aload instruction is used, which takes the index as an extra operand byte.
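To make the offset arithmetic concrete, here is a sketch of a tiny disassembler that understands only a handful of opcodes (the class and method names are hypothetical, and it is nowhere near a full javap). The first byte array is hand-assembled to match what javac should emit for static int max(int a, int b) { if (a <= b) return b; return a; }, where the 3-byte if_icmpgt at offset 2 pushes the next instruction to offset 5. The second corresponds to a static method returning its fifth Object parameter, which needs the 2-byte aload 4 because aload_N only covers indices 0 to 3.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: walk raw JVM bytecode and label each instruction with its byte offset,
// the way javap -c numbers its listing. Only a few opcodes are supported.
public class BytecodeOffsets {

    static String mnemonic(int opcode) {
        switch (opcode) {
            case 0x1a: return "iload_0";
            case 0x1b: return "iload_1";
            case 0x19: return "aload";
            case 0x2a: return "aload_0";
            case 0x2b: return "aload_1";
            case 0x2c: return "aload_2";
            case 0x2d: return "aload_3";
            case 0xa3: return "if_icmpgt";
            case 0xac: return "ireturn";
            case 0xb0: return "areturn";
            default:
                throw new IllegalArgumentException("unsupported opcode 0x" + Integer.toHexString(opcode));
        }
    }

    // Number of operand bytes that follow the opcode byte.
    static int operandBytes(int opcode) {
        switch (opcode) {
            case 0x19: return 1; // aload: unsigned 8-bit local variable index
            case 0xa3: return 2; // if_icmpgt: signed 16-bit branch offset
            default:   return 0; // iload_N, aload_N, ireturn, areturn
        }
    }

    static List<String> disassemble(byte[] code) {
        List<String> listing = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc] & 0xff;
            StringBuilder line = new StringBuilder(pc + ": " + mnemonic(opcode));
            if (opcode == 0x19) {
                line.append(' ').append(code[pc + 1] & 0xff);
            } else if (opcode == 0xa3) {
                // The branch offset is relative to the address of the branch opcode itself.
                short delta = (short) (((code[pc + 1] & 0xff) << 8) | (code[pc + 2] & 0xff));
                line.append(' ').append(pc + delta);
            }
            listing.add(line.toString());
            pc += 1 + operandBytes(opcode);
        }
        return listing;
    }

    public static void main(String[] args) {
        // static int max(int a, int b) { if (a <= b) return b; return a; }
        byte[] maxBody = { 0x1a, 0x1b, (byte) 0xa3, 0x00, 0x05, 0x1b, (byte) 0xac, 0x1a, (byte) 0xac };
        // static Object fifth(Object a, Object b, Object c, Object d, Object e) { return e; }
        byte[] fifthBody = { 0x19, 0x04, (byte) 0xb0 };
        disassemble(maxBody).forEach(System.out::println);
        disassemble(fifthBody).forEach(System.out::println);
    }
}
```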

tivrfoa/java-assembly-with-casey-muratori.md