Math/implicit casts have poor performance under LSL-Luau
tracked
Frionil Fang
The Luau VM seems to have generally worse math performance regardless of the language, though that may just be inherent to the VM, and absolute number-crunching speed is not of prime importance to most scripts (though it is for some of mine, sadly... the better data structure handling performance will probably make up for it even there). What is curiously poor, however, is any kind of implicit casting under LSL-Luau, to the level of seeming like a bug.
Simple test script: compile as LSL-Luau.
default
{
    touch_start(integer _)
    {
        integer i;
        integer a;
        llResetTime();
        for(i = 0; i < 1000000; ++i) {
            a += 1;
        }
        llOwnerSay((string)llGetTime());
    }
}
This should run in roughly 9-10 seconds on touch.
Now, change the type of the variable a from integer to float, resulting in it having to do an implicit cast during the addition (the second addend is still an integer), and it will instead run in ~26 seconds.
The same effect can be observed for other situations with implicit casting (integer a, multiply by float; float a, multiply by integer), resulting in much slower performance.
For comparison, the same script compiled under LSL-Mono will run in roughly 3-4 seconds, varying slightly between float and integer but with little difference from implicit casts.
Harold Linden
Is there still performance divergence under current versions of the server? We've changed a lot about how script scheduling works since then, which was probably a big driver of extra slowness under LSL/Luau. LSL on the Luau VM now outperforms LSL/Mono on my dev instance, but I might just be holding it wrong.
One thing to note, we expect that float operations will be somewhat slower under LSL/Luau because we've had to add additional opcodes to Luau to achieve 100% parity with how LSL/Mono handles floats. Sometimes floating point temporaries are 64-bit in Mono, and can get truncated to 32 bits at various points. We haven't spent a ton of time optimizing those paths since getting them "correct". Similarly, implicit int->float casts are more expensive because they currently take multiple instructions, but we could introduce a Luau FASTCALL for that case which would only use a single instruction.
As an example of how things are currently compiled to Luau bytecode:
float foo(float val) {
    // 2 is an int here
    return val - 2;
}
becomes this Luau bytecode:
Function 0 (_ffoo):
GETIMPORT R2 2 [lsl.cast]
LOADK R3 K3 [2]
LOADN R4 2
CALL R2 2 1
SUB R1 R0 R2
LSL_DOUBLE2FLOAT R1 R1
RETURN R1 1
whereas making that 2 a 2.0 gives:
Function 0 (_ffoo):
LOADK R2 K0 [2]
SUB R1 R0 R2
LSL_DOUBLE2FLOAT R1 R1
RETURN R1 1
Clearly room for improvement here as well, that LOADK/SUB pair could also just be SUBK.
Frionil Fang
Harold Linden
For just about all my mathy projects, LSL-Luau still runs much slower. They naturally aren't pure math, so they also depend on data structure speeds etc. and aren't fully representative of this issue, but e.g. a game of life (integer math, comparisons, bit ops) runs at 1/3-1/4 the speed of the LSL version, and a fractal plotter (float math) runs at 1/5th the speed.
I did some synthetic tests and put them in a "run it 100k times" timing loop. This is of course imperfect since the server frame timing of 1/45 seconds makes large steps in the value, but good for a ballpark.
The timing loop itself (while completely empty) already runs at 1/2 speed under LSL-Luau. For comparison, an equivalent for loop in pure SLua is roughly 33% faster than LSL (I don't remember exactly, but it may have been slower than LSL-Mono in the first releases). What I presume is getting used for LSL-Luau under the hood, though, is a "while condition do" loop so the loop variables can be modified, and that runs about 33% slower than an empty LSL for loop.
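Roughly, I'd expect that lowering to look something like this (my own guess at the shape, not verified compiler output):

-- Presumed shape of "for(i = 0; i < 1000000; ++i) { a += 1; }" after lowering
-- to a Luau while loop (an assumption, not actual LSL-Luau compiler output)
local i = 0
local a = 0
while i < 1000000 do
    a += 1
    i += 1
end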
Iterated complex number squaring+addition runs at ~33% speed when using floats. As you described above, I'm not expecting float math to reach the same level, but 33% is not super good - there are no constants or casts involved either. With the loop delay subtracted from the results, the LSL-Luau version totals at the same ~33% speed.
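For reference, the inner step of that test is just a handful of multiplies and adds per iteration, roughly this shape in SLua/Luau terms (an illustrative sketch with my own names, not the exact benchmark script):

-- One iteration of complex squaring plus addition: z = z*z + c
-- zr/zi and cr/ci are the real/imaginary parts (illustrative names only)
local function step(zr, zi, cr, ci)
    local r = zr * zr - zi * zi + cr
    local i = 2 * zr * zi + ci
    return r, i
end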
Changing the vars to integers (doesn't really make sense for the formula and will just quickly overflow the ints or leave them at 0, but it's still just a couple steps of multiply+add) runs at ~25% speed. With loop delay subtracted, the difference is very dramatic and LSL-Luau reaches only 5% speed. To be fair: within the granularity of the test and casual methodology, one basic integer multiplication has 0% speed since it has barely any perceptible impact vs. an empty loop under LSL-Mono, but a very noticeable one under LSL-Luau.
Bit ops I can't even expect to perform well under LSL-Luau, thanks to SLua not having them natively, but a mask & leftshift operation runs at ~10% speed under LSL-Luau, and 0.5% with the loop contribution removed. For fairness I converted them into a modulo+multiplication for a second test, as I do with SLua where applicable since it's much faster: ~33% speed, ~13% with loop contribution removed.
Frionil Fang
The bit op thing makes me wonder: my hunch is some bit ops would get used especially in performance-critical spots, so would it be possible to automatically convert applicable constant masks and shifts to modulo, multiply and floor divide at compile time? Of course that works only for masks that are power-of-2-minus-one, so it might be debatable whether that one makes sense to bother with, but shift-by-constant might?
E.g. mask 255 -> modulo 256, leftshift 4 -> *16, rightshift 2 -> //4?
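In SLua/Luau terms the rewrites I mean would be roughly this (my own sketch, ignoring 32-bit wraparound and the other edge cases a real compile-time transform would have to guard):

local x = 1234
local masked  = x % 256   -- x & 255, only valid when the mask is 2^n - 1
local shifted = x * 16    -- x << 4, ignoring 32-bit overflow/wrap
local down    = x // 4    -- x >> 2, floor division matches the arithmetic shift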
Harold Linden
Frionil Fang
Ah, okay this is interesting, thanks! It seems that Mono on Agni / Aditi has much higher throughput for numeric calculations than on my devserver. I suspect that the RelWithDebInfo build used internally might be quite a bit slower; I was under the impression it was literally just the release build with the debug symbols left in. When Mono is built with RelWithDebInfo, SLua is generally faster.
I'll try and do a local -O3 -DNDEBUG=1 build of the SLua library. The one on Aditi is compiled RelWithDebInfo with -O1 and has a ton of assert()s that might be impacting the metrics. Mono doesn't appear to pay that price on Aditi since it uses the Release build.
> The bit op thing makes me wonder: my hunch is some bit ops would get used especially in performance-critical spots, would it be possible to automatically convert applicable constant masks and shifts to modulo, multiply and floor divide at compile time?
Yep, certainly possible, though we'd have to do profiling to see if it's "profitable" perfwise and bytecode-wise. We have a proper LSL AST we can work with to convert expressions into a more optimal form now, though we've left most of that off until we can be sure the base code doesn't misbehave.
Frionil Fang
Harold Linden
Makes sense, appreciate all the details!
Harold Linden
Frionil Fang
Sure thing! If you have a benchmarking script I can run it and get numbers once I get things optimized correctly; we just don't want to run non-debug builds on Aditi right now because the assert()s have been extremely helpful in rooting out weird bugs.
Signal Linden
tracked
I wouldn't place too much stock in raw performance metrics at the moment, as it's going to fluctuate a lot. It's heavily influenced by things like the script scheduler (completely different under SLua compared to Mono), VM optimization (we only use an -O1 optimized debug build at the moment), expression folding (we don't enable this yet), JIT (we don't use the Luau JIT at all yet), and GC (we really haven't tuned the GC for the smaller size-restricted heaps SLua will have vs Luau's typical workload at all yet), among many other things.
Most things will become faster as we near beta, and some things will become slightly slower (tons of repeated heap allocations in quick succession near the memory limit), but things will definitely change either way.
Another thing to note, I would expect LSL's raw floating-point performance under the SLua VM to be slightly slower than Mono's due to hacks we had to implement for compatibility with LSL on Mono's behavior. LSL-on-Mono treats floating point values as F32 most of the time, except FP temporaries are F64, so we had to add a bytecode instruction to truncate to F32 in cases where Mono would do it implicitly. Note that this doesn't affect SLua proper, since we don't need to behave the same as LSL did on Mono.
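For illustration only, the visible effect of that truncation can be reproduced in plain Luau by round-tripping a value through an F32 with string.pack (this is just a sketch of the effect, not how the actual bytecode instruction is implemented):

-- Emulate collapsing an F64 temporary down to F32 (illustrative only)
local function toF32(x)
    return string.unpack("<f", string.pack("<f", x))
end

print(string.format("%.17g", 0.1 + 0.2))        -- 0.30000000000000004 (F64 temporary)
print(string.format("%.17g", toF32(0.1 + 0.2))) -- 0.30000001192092896 (truncated to F32)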
Kristy Aurelia
Maybe Lua does the Python thing, where everything is immutable, so a += 1 actually creates a new variable.
Or, try equivalent Lua code and compare performance when i and a do and do not have the local keyword.
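Something along these lines in plain Luau (a rough sketch; timing via os.clock, names chosen to mirror the LSL script):

-- Globals: i and a go through environment lookups on every access (illustrative benchmark)
local t0 = os.clock()
i = 0
a = 0
while i < 1000000 do
    a += 1
    i += 1
end
print("globals:", os.clock() - t0)

-- Locals: li and la live in VM registers
local t1 = os.clock()
local li = 0
local la = 0
while li < 1000000 do
    la += 1
    li += 1
end
print("locals:", os.clock() - t1)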