Relax interrupt checks for constant-time / cheap library functions
planned
Harold Linden
This should remove any undue runtime overhead from bit32.* calls and friends that have FASTCALL implementations.
Harold Linden
Frionil Fang Just for funsies I did some local benchmarking of the various ways you can do bitmask extraction with math / bit32. bit32.extract() appears to out-perform the others both on my REPL and on my dev server.

Inworld, as currently compiled:
[20:14] i: math: 14.916932307183743
[20:14] i: bit32.extract: 13.758476413786411
[20:14] i: bit32.rshift(bit32.band()): 15.959130708128214
On the REPL, as currently compiled (-O1 -g):
math: 0.012317874992731959
bit32.extract: 0.011214333324460313
bit32.rshift(bit32.band()): 0.013357166666537523
And on the REPL with the compile flags we'll eventually use for release (-O2 -DNDEBUG=1 -fno-math-errno):
math: 0.005713874998036772
bit32.extract: 0.00464233334059827
bit32.rshift(bit32.band()): 0.009546374989440665
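The shape of the test was roughly this, if anyone wants to repro on a stock Luau REPL (a sketch, not the exact script; the field position, width, and loop count are illustrative):

```lua
-- Rough shape of the benchmark: extract an 8-bit field at bit 12 from a
-- varying value, accumulating a checksum so the loop body stays live.
local N = 1000000

local t0 = os.clock()
local acc = 0
for i = 1, N do
    acc += math.floor(i / 4096) % 256
end
print("math:", os.clock() - t0, acc)

t0 = os.clock()
acc = 0
for i = 1, N do
    acc += bit32.extract(i, 12, 8)
end
print("bit32.extract:", os.clock() - t0, acc)

t0 = os.clock()
acc = 0
for i = 1, N do
    acc += bit32.rshift(bit32.band(i, 0xFF000), 12)
end
print("bit32.rshift(bit32.band()):", os.clock() - t0, acc)
```

Accumulating and printing acc matters: the compiler can remove pure computations whose results never get used, more on that below.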
I expect quite a lot of things will get much faster as we're able to remove our debugging instrumentation. You should be able to repro most of this with a local Luau build, since it doesn't use LL builtins. I'd be interested to see your tests showing mod + floor being faster than bit32.extract(), though.

Frionil Fang
Harold Linden
Looking at your test script, I finally figured out why the arithmetic was working so noticeably faster, sorry to have been barking up the wrong tree: the compiler was sneakily optimizing a variable I did not realize it could (I'd only ever seen it optimize actual constant expressions; I've learned something today!). With the optimization removed, by making extra sure the value is a variable the compiler can't draw conclusions about, bex often comes out on top a little, but due to region fluctuations it also loses sometimes, so they're quite closely tied.
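Roughly, the trap looked like this (made-up values, not my actual script):

```lua
-- The compiler can prove 'v' never changes, so it may fold the whole
-- arithmetic away and the loop ends up measuring almost nothing.
local v = 0xDEADBEEF
for i = 1, 1000000 do
    local field = math.floor(v / 4096) % 256
end

-- Reading the value out of a table (and using the result) keeps the
-- compiler from drawing conclusions about it.
local t = { 0xDEADBEEF }
local sum = 0
for i = 1, 1000000 do
    sum += math.floor(t[1] / 4096) % 256
end
print(sum)
```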
Curiously, on the region I'm on, both scripts already give results over 2x faster than the inworld numbers you showed above, e.g.:
Starting
[03:51] Object: math: 5.755074562039226
[03:51] Object: bit32.extract: 5.808989208191633
[03:51] Object: bit32.rshift(bit32.band()): 7.822642065118998
[03:51] Object: Starting
[03:51] Object: math: 5.781822419259697
[03:51] Object: bit32.extract: 5.73350048577413
[03:52] Object: bit32.rshift(bit32.band()): 7.772382610011846
vs. my version's results (32-bit value, extracted from position 12, 1 million instead of 1 million + 1 iterations, extract called via a local function variable, but otherwise the same logic):
bex: 5.689974206034094, 219000000
[04:01] Object: arith: 5.777321552392095, 219000000
[04:02] Object: bex: 5.688577990978956, 219000000
[04:02] Object: arith: 5.799790379125625, 219000000
[04:02] Object: bex: 5.711005394347012, 219000000
[04:02] Object: arith: 5.712780042085797, 219000000
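For reference, my loop was shaped roughly like this (a reconstruction for a plain Luau REPL; the input values and the 8-bit width are placeholders, and the second number printed above is what looks like an accumulated checksum that keeps the work live):

```lua
-- bex = bit32.extract called through a local function variable.
local bex = bit32.extract

local t0 = os.clock()
local sum = 0
for i = 1, 1000000 do
    sum += bex(i, 12, 8) -- width of 8 assumed for illustration
end
print("bex: " .. (os.clock() - t0) .. ", " .. sum)
```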
Frionil Fang
The constant-variable optimization may also have kicked in in some places outside synthetic testing. I could swear I also tested this with random inputs previously (is that optimizable by the compiler? The input value was random, but never used outside the one division line in the loop), but that's unlikely to be the case in general, so I should "de/reoptimize" some arithmetic in an actual project to bex and see if it makes an impact. Based on the above, it should be at least theoretically better on average, but maybe not reliably measurable yet. If it improves further as engine settings change, all the better.
Frionil Fang
Well, testing bex vs. arithmetic on an actual project was, as expected, a little inconclusive; it should be calling them several hundred times per second at least, which is of course a small fraction of the total code being run.
What made it hairy is that the newer object, regardless of the version of the code, appeared to be getting a bigger time slice on a region that had nearly no script time in use, regardless of whether both copies were running in parallel or sequentially. Basically, if the version using arithmetic was rezzed last, it was about 10% faster than the other, and vice versa for the extract version, which is very strange: surely a near-empty region with 21+ ms of spare time shouldn't be throttling one object based on its creation time? But I honestly have no idea what's happening there.
Either way, the only conclusion I can draw here is that bex should be slightly better or about the same as arithmetic, I've just been outsmarted by the compiler all this time and apologize for the inconvenience.
Harold Linden
Frionil Fang
> Either way, the only conclusion I can draw here is that bex should be slightly better or about the same as arithmetic, I've just been outsmarted by the compiler all this time and apologize for the inconvenience.
No worries! I expect everyone's unused to a compiler that, uh... actually does optimizations you have to account for if you want to do benchmarking, since the old LSL compiler did none and left it all to the JIT. The compiler is now allowed to remove pure expressions whose results never get used, among other things.
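A toy example of the sort of thing it can now do (illustrative, not from our test suite):

```lua
-- The loop body is pure and its result is never used, so the compiler is
-- free to delete it; timing this loop may measure nothing at all.
for i = 1, 1000000 do
    local _ = bit32.extract(i, 12, 8)
end

-- Using the result (accumulating it, then printing it) keeps the work live.
local acc = 0
for i = 1, 1000000 do
    acc += bit32.extract(i, 12, 8)
end
print(acc)
```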
Although, now we know that it's "profitable" to convert mask & shift to a bit32.extract() in our LSL -> Luau bytecode compiler where we know the upper bits will be masked off (due to LSL's right shift actually being an arithmetic shift rather than a logical one); there's a sketch of the equivalence at the end of this comment.

> What made it hairy is that the newer object, regardless of the version of the code, appeared to be getting a bigger time slice on a region that had nearly no script time in use, regardless of whether both copies were running in parallel or sequentially
Yeah that is some weirdness in the scheduler code external to both Mono and SLua. We've tried to touch that as little as possible so people can do an apples-to-apples performance comparison in-world, but there are cases where its behavior doesn't make much sense.
Early in the alpha we noticed that the Mono implementation overran its timeslice by as much as 40% on average, by design. We had to modify SLua to behave the same, otherwise SLua looked slower when really Mono was "cheating" a bit by skipping timer checks that would normally have caused it to yield back.
Once things are more settled we're going to revisit the common script scheduling code to see if it still makes sense, since it hasn't been touched in years.
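To illustrate the rewrite mentioned above (a sketch; field position and width chosen arbitrarily): an LSL expression like (v >> 12) & 0xFF sign-extends through the shift, but the mask throws the sign-extended bits away, so the result is the same as a single extract of the 8-bit field at bit 12:

```lua
-- LSL's >> is an arithmetic shift; modeled here with bit32.arshift.
local function lsl_shift_and_mask(v)
    return bit32.band(bit32.arshift(v, 12), 0xFF)
end

-- The rewrite: one builtin call grabbing the same 8-bit field at bit 12.
local function extract_rewrite(v)
    return bit32.extract(v, 12, 8)
end

-- Spot-check the equivalence, including negative (sign-extended) inputs.
for _, v in ipairs({ -2147483648, -1, 0, 0x12345678, 2147483647 }) do
    assert(lsl_shift_and_mask(v) == extract_rewrite(v))
end
print("equivalent")
```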
Harold Linden
planned