UUID corruption
complete
nya Resident
Here are some specifics on the parameters, not a method to reproduce the issue, as I cannot narrow that down myself.
- Upon link change, if the local `sitter_id = ll.AvatarOnSitTarget()` is truthy, I store `player_id = sitter_id`, otherwise `NULL_KEY`
- Taking permissions on `player_id` for controls, and taking various controls once `run_time_permissions` is fired
- Using `ll.HMAC(SECRET_KEY, message, 'sha224')` with a few values in the message, including `tostring(player_id)`
- Over time, for unknown reasons, `player_id` starts shifting:
"e760b472-baf4-43c2-97f9-e36bbc5ee3f6" -- correct
-- eventually starts changing...
"0cf16f0f-002a-8d09-97f9-e36b00000000"
"7c6f7e12-002a-8d09-97f9-e36b00000000"
"7c6f7e12-002a-8d09-fc07-421200000000"
If it helps I have a version of my work that triggers the issue, which I don't mind sharing.
Harold Linden
complete
Harold Linden
nya Resident We've rolled out a new version of the server that should address this. Do you still see any weird behavior with `uuid`s?
nya Resident
Harold Linden seems to be fixed, great stuff.
Harold Linden
nya Resident Thanks again for sending this in! Please open a new issue if you see any further problems :)
Harold Linden
in progress
Harold Linden
Updated our test suite to try and repro this, and tracked it down to a potential use-after-free:
==10541==ERROR: AddressSanitizer: use-after-poison on address 0x629003d50480 at pc 0x000108b3bf60 bp 0x00016b3a5a90 sp 0x00016b3a5a88
READ of size 1 at 0x629003d50480 thread T0
#0 0x108b3bf5c in lsl_tostring_uuid(lua_State*) llsl.cpp:948
#1 0x108bb3d04 in luau_precall(lua_State*, lua_TValue*, int) lvmexecute.cpp:3394
#2 0x108b1b12c in luaD_call(lua_State*, lua_TValue*, int) ldo.cpp:278
#3 0x108b017f8 in luaL_callmeta(lua_State*, int, char const*) laux.cpp:284
#4 0x108b02db4 in luaL_tolstring(lua_State*, int, unsigned long*) laux.cpp:567
#5 0x105fc6ed4 in DOCTEST_ANON_FUNC_14() SLConformance.test.cpp:378
#6 0x107540b90 in doctest::Context::run() doctest.h:6936
#7 0x107546720 in main main.cpp:453
#8 0x19cb60270 (<unknown module>)
My current guess is that the string table is getting expanded sometime after we take the pointer to the underlying `TString *`, and the memory gets clobbered. I'm going to see if there's a smarter way we can keep a reference to that string instance without storing a pointer to it directly, as it appears my assumptions about `TString *` lifetimes were incorrect.
Harold Linden
We may have a fix for this, but it involved rewriting a lot of how `uuid`s are implemented. I wouldn't be surprised if some weird behavior crops up. It'll go out in a future roll.
Essentially, we used a table with weak keys to create a `uuid -> str` mapping that kept the underlying `str` alive as long as the associated `uuid` was alive. That was because `lua_unref()` is not guaranteed by Luau to be legal in destructors, so we had to rely on weak tables to defer collection.
Since we patched the `table` implementation to allow tables to be keyed on `uuid`s by underlying value, this had the side effect of making `uuid`s with the same value clobber existing references in the weak table. Creating a second `uuid` with the same value that was garbage-collected earlier than the first instance would lead to a use-after-free, because the garbage collector thought there were no more references to the underlying `TString *` left once the second instance had gone away.
Clearly, allowing `uuid` entries in tables to be keyed on underlying value rather than pointer value was a mistake. Instead, we've switched to a weak-valued `str -> uuid` table: every time we push a `uuid`, we check whether we already have a `uuid` object for that string value in the weak table, and reuse it if one exists. This is basically the same as how Luau interns `string` instances, and it lets us use the pointer value of the `uuid` as the key in tables, same as `string`s. It's much simpler, and it saves memory, since calling `uuid("val")` when there's already a `uuid` instance for that string is basically free memory-wise.
TL;DR: Embracing value interning is smart in memory-constrained environments; also, memory management is hard.
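The weak-valued interning scheme can be modeled in plain Lua with a `__mode = "v"` table. This is a sketch of the idea only, not the actual VM internals; `make_uuid` is a hypothetical stand-in for the internal constructor:

```lua
-- Weak-valued intern table: an entry disappears once no reference
-- to the uuid object remains anywhere in the script.
local interned = setmetatable({}, { __mode = "v" })

-- Hypothetical stand-in for the VM-internal uuid constructor.
local function make_uuid(str)
    return { value = str }  -- the real object is userdata wrapping a TString
end

local function push_uuid(str)
    local existing = interned[str]
    if existing then
        return existing  -- same value, same object identity
    end
    local u = make_uuid(str)
    -- Using str as the key also pins the underlying string for as
    -- long as the uuid object is alive.
    interned[str] = u
    return u
end

local a = push_uuid("e760b472-baf4-43c2-97f9-e36bbc5ee3f6")
local b = push_uuid("e760b472-baf4-43c2-97f9-e36bbc5ee3f6")
assert(rawequal(a, b))  -- interned: one object, usable as a table key
```

Because both lookups return the same object, the pointer value can safely serve as a table key, just like interned strings.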
Harold Linden
nya Resident
To clarify, is it the *stored* UUID that's shifting, or are you taking new values from repeated `ll.AvatarOnSitTarget()` calls that end up returning different values later in the script run? We're not seeing any obvious memory corruption in Valgrind or ASAN running our test suite, but we'll keep poking around unless you have a more minimal repro case.
nya Resident
Harold Linden it's the stored one; I'll make another attempt at putting together a minimal one.
Harold Linden
nya Resident: No worries if you can't!
The way UUIDs are currently handled is slightly gnarly: we wanted to make it possible to store them efficiently and take advantage of Lua's string interning, so there's a global weak table that's meant to keep the underlying `TString` storage alive as long as any related `uuid` instances are alive.
There are essentially two different ways of storing UUIDs in the new VM, to account for "canonical" UUIDs that can be stored in a more efficient binary form, and "UUID-typed strings" to account for cases like `ll.MessageLinked()` where people needed to shove non-UUID data into a `key` for efficiency. It helps a lot with memory pressure in scripts that handle a lot of real UUIDs, but I wouldn't be surprised if there were subtle flaws in our logic there that caused memory to be released early and get clobbered :)
nya Resident
Harold Linden just got back, and managed to put together a smaller script that has the same behavior.
- Add script to an object in any of the SLua regions, 1 prim is enough
- Sit on the object
- Cause a few control events (WASD keys); this step might not be necessary
- Wait up to 60 seconds, see if it sends any messages about "player_id" changing
- If not, try a few more key presses and pause for a moment, or pick up the object, rez it, and try again
So far, changing the order or removing anything from this script seems to make it go away -- that, or it's just random enough for me to believe that's the case!
Also, it seems like it only happens once per script reset with this version of the script. Resetting the script only temporarily restores the correct UUID.
-- sat down --
[06:51] Object: player_id changed to: e760b472-baf4-43c2-97f9-e36bbc5ee3f6
[06:52] Object: player_id changed to: 04533e1d-e069-8d09-97f9-e36b00000000
-- reset --
[06:57] Object: player_id changed to: e760b472-baf4-43c2-97f9-e36bbc5ee3f6
[06:58] Object: player_id changed to: dc653e1d-e069-8d09-97f9-e36b00000000
-- reset --
[06:58] Object: player_id changed to: e760b472-baf4-43c2-97f9-e36bbc5ee3f6
[06:58] Object: player_id changed to: ac563e1d-e069-8d09-97f9-e36b00000000
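To make the repro concrete, here is a minimal sketch of the kind of script the steps above describe. It is reconstructed from the description only; `SECRET_KEY` is a placeholder and the exact event signatures are assumptions, since the actual repro script isn't included here:

```lua
local SECRET_KEY = "placeholder-secret"  -- hypothetical value
local player_id = NULL_KEY
local last_seen = tostring(NULL_KEY)

function changed(change)
    if bit32.btest(change, CHANGED_LINK) then
        local sitter_id = ll.AvatarOnSitTarget()
        if sitter_id ~= NULL_KEY then
            player_id = sitter_id
            ll.RequestPermissions(player_id, PERMISSION_TAKE_CONTROLS)
        else
            player_id = NULL_KEY
        end
    end
end

function run_time_permissions(perm)
    if bit32.btest(perm, PERMISSION_TAKE_CONTROLS) then
        ll.TakeControls(CONTROL_FWD + CONTROL_BACK + CONTROL_LEFT + CONTROL_RIGHT, true, true)
    end
end

function control(id, level, edge)
    -- Exercise the stored uuid, as the original setup does via HMAC.
    ll.HMAC(SECRET_KEY, "move|" .. tostring(player_id), "sha224")
end

function timer()
    -- Watch the *stored* uuid for shifts and report them.
    local now = tostring(player_id)
    if now ~= last_seen then
        last_seen = now
        ll.Say(0, "player_id changed to: " .. now)
    end
end

ll.SitTarget(vector(0, 0, 0.25), ZERO_ROTATION)
ll.SetTimerEvent(1.0)
```

This only exercises the sim-side `ll` API, so it has to be run in-world to reproduce anything.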
Harold Linden
nya Resident:
Yep, that smells like the underlying `TString` getting released too early in some cases. We'll take a look at this, thanks!
Harold Linden
tracked
Jamesp1989 Resident
I'm not sure if it's related, but sims restarted and have come back in a broken state: invisible terrain, unknown location, and one sim that didn't come back up at all.
Harold Linden
Jamesp1989 Resident
It was indeed related, there were scripts that were checking for the bug and ended up causing cascading exceptions in some internal code due to hitting the memory limit. We're working on fixing both those cases.
Frionil Fang
I was testing this issue and made an object that generates 1000 keys (`ll.GenerateKey()`) into a table on touch, and then immediately discards the table.
According to `gcinfo()`, the memory usage kept creeping upwards very quickly, suggesting `uuid` memory is not being garbage-collected properly. If I mash it a little harder, the script just runs out of memory; for a test I let the script sit at ~1500 kB without quite letting it run out, and shortly after, the region SLua Tideland crashed.
Frionil Fang
I retried this and just letting the script sit at 1500+ kB memory didn't crash, but after forcing it to run out of memory by generating non-GC'd keys several times in a row and recompiling the script, the region became sluggish (events didn't seem to fire at normal rate anymore) and eventually crashed again.
The script in question is below. I think the texture animation etc. is not relevant; I was just following leads and didn't remove it:
local key = NULL_KEY
local t = {}

function timer()
    ll.SetText(tostring(key), vector(1, 1, 1), 1)
end

function touch_start()
    ll.SetLinkTextureAnim(LINK_SET, ANIM_ON + LOOP, ALL_SIDES, math.random(1, 4), math.random(1, 4), 0, 0, 4)
    for i = 0, 999 do
        t[#t + 1] = ll.GenerateKey() --utf8.codepoint(math.random(0,65535))
    end
    ll.OwnerSay(tostring(gcinfo()))
    t = {}
end

function changed(change)
    if bit32.btest(change, CHANGED_LINK) then key = ll.AvatarOnSitTarget() end
end

ll.SetTimerEvent(0.1)
ll.SitTarget(vector(0, 0, 0.25), ZERO_ROTATION)
Harold Linden
Frionil Fang
Thanks for the details! Yep, I expect there are some issues in the GC code related to freeing string instances for UUIDs that have gone away. They're currently stored in a weakly-keyed table so as to keep them alive for the lifetime of the `uuid` instance, but there might be something amiss there; we're looking into it.
Harold Linden
Frionil Fang
Looking into it, the case you mention here is probably a separate issue, more similar to https://secondlife.canny.io/slua-alpha/p/creating-a-buffer-of-10221024-bytes-causes-a-crash . At a high level, there were some function calls that we didn't realize could throw Lua errors as an `std::exception`, and one of those cases was pushing data into a freshly reset script from the outside before the garbage collector had gotten around to actually cleaning things up from the last run. The crashdump was helpful in figuring out what was going on, so thanks for this!
We haven't yet tuned Luau's GC to play nice with very tight allocation-heavy loops under strict memory limits (which is part of why we set such a high memory limit for now), but we're adding some guards to make sure the sim at least doesn't totally fall over when that limit gets hit.
I'm glossing over a lot of details of how allocations / GC are handled here, but hopefully the behavior of things will be more obvious when we get the source code out in the beta phase!
Harold Linden
Thanks, we'll have a look at this!
Have you noticed these UUIDs shifting strictly across sim restarts, or does it happen during normal operation as well?
Jamesp1989 Resident
Harold Linden the object we're observing this on shows it during normal operation, even with a user just sitting there.