llRegex* functions

tracked

Chaser Zaks

list llRegexMatch(string pattern, string input, integer flags);
https://www.boost.org/doc/libs/1_31_0/libs/regex/doc/regex_match.html
list llRegexSearch(string pattern, string input, integer flags);
https://www.boost.org/doc/libs/1_31_0/libs/regex/doc/regex_search.html
list llRegexSplit(string pattern, string input, integer flags);
https://www.boost.org/doc/libs/1_31_0/libs/regex/doc/regex_token_iterator.html
string llRegexReplace(string pattern, string replace, string input, integer count, integer flags);
https://www.boost.org/doc/libs/1_31_0/libs/regex/doc/regex_replace.html
This would also implement the long-requested ability to replace substring instances.
In all instances EXCEPT replace, list would return the following:
[
integer matches,
integer sizeOfMatch,
... match data ...,
integer sizeOfMatch#,
... match# data ...,
...
]
If named groups are disabled, group data would be returned as a list of matches.
If named groups ARE enabled, each group would be returned as a tuple of [string groupName, string groupValue].
If there are zero matches, return an empty list. (So that if(emptyList) works)
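
As a rough illustration of the list layout above, here is a Python model (not LSL) built on Python's `re` module. The name `regex_search_flat` is hypothetical. The sketch assumes unnamed groups, that each match's data starts with the full match followed by its capture groups, and that groups which did not participate come back as empty strings; the proposal does not specify any of that.

```python
import re

def regex_search_flat(pattern, text):
    """Model of the proposed layout: [matches, size, data..., size, data...]."""
    out = []
    count = 0
    for m in re.finditer(pattern, text):
        # Assumption: match data is the full match followed by each group.
        data = [m.group(0)] + [g if g is not None else "" for g in m.groups()]
        out.append(len(data))   # sizeOfMatch for this match
        out.extend(data)        # ... match data ...
        count += 1
    # Zero matches gives an empty list, so an if(emptyList) style check works.
    return [count] + out if count else []
```

A caller can walk the result by reading a size, consuming that many entries, and repeating until the list ends.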

January 10, 2024

Nexii Malthus

The big gotcha for regex is that it can be used for a Denial of Service attack, commonly known as ReDoS, so LL would have to implement some safeguard against it, which could get quite complex, as much as I want this kind of functionality myself.

So the simplest implementations, hooking directly into the native regex engines, aren't as straightforward as people might think. It depends on whether it's even possible to wrap a regex engine somehow, or whether the engine itself offers functionality to prevent DoS (even unintended).

SuzannaLinn Resident

With the introduction of llFindNotecardTextSync():

list llFindNotecardTextSync( string name, string pattern, integer start, integer count, list options );

a similar approach could be applied to other data formats, using the same parameters and returning the same strided list (without NAK):

list llListFindText( list src, string pattern, integer start, integer count, list options );

(Each item in the list is treated like a line in a notecard)

list llStringFindText( string src, string pattern, integer start, integer count, list options );

(Substrings separated by "\n" are treated as lines. If no "\n" exists, the entire string is considered as line 0)

Although only one of these functions is strictly necessary, because:

llListFindText( srcList, ... ) == llStringFindText( llDumpList2String( srcList, "\n" ), ... )
llStringFindText( srcString, ... ) == llListFindText( llParseStringKeepNulls( srcString, [ "\n" ], [ ] ), ... )

Having both functions, however, would improve clarity.

By adopting this approach, users would only need to learn a single function format, as all three functions share the same behavior. Additionally, any future enhancements to the options parameter would apply consistently across all three functions.

Pazako Karu

I feel like a fake for this could be made, abusing linksetdata's implementation. It may have to be the only lsd in the object though, or you'll pick up stray matches from other kvps.

rhet0rica Resident

For those who missed it, a related Canny ticket (with far fewer votes...) was marked as complete today: https://feedback.secondlife.com/scripting-features/p/feature-request-synchronous-notecard-text-find-count-functions

Bleuhazenfurfle Resident

I spent rather a lot of time writing up my thoughts on a regex function over in the jira…  which is presently inaccessible.  Basically:
* Named matches are nice, but I'm dubious they're worth the complication, personally.  Could also just "name" the numeric match groups with their match number, when in named match mode.  Or, thirdly, perhaps a function (or option) that inspects the pattern and gives you the numeric index that will correspond to each of the names (whether that's viable, will depend on the specific regex engine they're using).
* In any case, we most definitely need match indexes, and the option to get them entirely instead of the strings; it's trivial to go from indexes to string, but not necessarily the other way, plus with indexes you can capture between groups, and various other combinations.  I'd personally like three separate options for what to include: start index, match string, and end index.  (The string alone can be the default if not specified.)  Also, match groups can overlap, so with strings alone you end up with several copies of the same chunk of string in your result list.
* It needs an offset index to start the match at, and the character position after the last match should be the last element in the results list (if it's always there, then it's easy to account for in loops, too — omitting it for no matches would be okay).  This supports a "find all" mode, without the risk of getting back an insanely long list (as could happen with llRegexSplit there).
Those last two are because in a limited memory environment like LSL, you want to be able to view into your data, without copying it until you have to, especially if that data happens to be trying to scrape a web page, for example.  And one thing a lot of people spend a lot of time doing with any reasonably large script, is trying to save bytes here and there, so the last thing you want is to get back a list of indeterminate size.
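
A small Python model of the offset-and-indexes idea, as a sketch only. The name `regex_find_indexes` and its parameters are hypothetical, and it reports whole-match positions rather than per-group ones to keep the example short.

```python
import re

def regex_find_indexes(pattern, text, offset=0, max_matches=1):
    """Return [start, end, start, end, ..., resume] without copying any substrings."""
    compiled = re.compile(pattern)
    out = []
    pos = offset
    for _ in range(max_matches):
        m = compiled.search(text, pos)
        if m is None:
            break
        out.extend([m.start(), m.end()])
        # Step past empty matches so the scan always makes progress.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    # The last element is the position to resume from; no matches gives [].
    return out + [pos] if out else []
```

A "find all" loop then calls the function repeatedly, feeding the last element back in as the next `offset`, so the result list never grows beyond `2 * max_matches + 1` entries per call.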

Kadah Coba

Yes please. LSL has to use WAAAAY too many strings and the existing ways of parsing eat up a lot of cycles when native regex could likely be a lot faster in many cases.

Spidey Linden

marked this post as tracked

Issue tracked. We have no estimate when it may be implemented. Please see future updates here.

KyliaDaden Resident

One additional question: Does the regex feature in Boost implement time limit for executing a regex?
A bad regex can take down infrastructure. Example case: Cloudflare's outage due to a bad regex
The regex implementation in Mono, as I mentioned in BUG-234987 and BUG-234898, has a variant wherein a timeout can be specified. This will not prevent bad regexes from being written, but this can help prevent bad regexes taking down simulators.
For example: This particular 'flavor' of the regex.Match() method (the one with the TimeSpan parameter)
So for me, exposing Mono's RegularExpressions facility rather than Boost's not only lowers the "impedance mismatch", it also adds a nice 'safety valve'.

KyliaDaden Resident

I'm all for Regex functions, but my original requests in the JIRA were more aligned with what's available in Mono/.Net, so hopefully much simpler to implement (since the scripting engine is already Mono anyway).

I think it is imperative to minimize the "impedance mismatch".

Chaser Zaks

KyliaDaden Resident: Scripting engine uses mono, the actual implementation of LSL functions is in C++. LL uses libboost in the viewer, so I suspect they do the same for the simulator.

KyliaDaden Resident

Chaser Zaks: Well since the scripts are being run by Mono anyways, why not just expose the Regex functions of Mono, rather than marshal the call to pass to Boost? Likely better performance

Remember that scripts run purely on server side, so whatever is being used in the viewer may or may not be used in the simulator. The only thing we can really be sure of, is that (newer) scripts are executed by Mono.

Bleuhazenfurfle Resident

KyliaDaden Resident: Because LSO is a thing, still?

I haven't heard LL suggest that LSO gets cross-converted to Mono, meaning it's likely running a custom engine implemented in C/++, and hence, lacks .Net/Mono-specific regex functions.

KyliaDaden Resident

Bleuhazenfurfle Resident We can start sunsetting LSO by making new functions only available in Mono.

The idea of sunsetting LSO was brought to the UG some time ago; it is currently shelved, but not forgotten. Some objections, notably from Combat scripters, are starting to get addressed. The Combat 2.0 User Group will also try to iron out better ways to support Combat scripters' needs.

Bleuhazenfurfle Resident

KyliaDaden Resident: As much as I agree, it is still a thing for the time being, no decision has been announced, and so it must be factored in.
Besides which, LSD happens (almost certainly) on the server, which is not Mono, so that ship has very likely pretty much sailed already, too.  (You don't really want two different regex engines — we have whatever engine they used for LSD.)
That said, I don't think it's too late to make changes, either, it's still new enough there aren't too many regex-using "legacy" scripts (esp. by scripters who've left SL — or even FL —  and hence won't be updating their scripts).
This is also where I think people on here come into the picture…  Moving too soon, or leaving things too late, both tend to produce technical debt, and LL can't sit around for a day or two chatting about every single new idea, trying to tease out all the specifics.  (Granted, reading every single word written about an idea could often take a day or two on its own, especially the way I write…  Sorry LL…)

VriSeriphim Resident

I am of 2 minds on this request.
On the one hand, this would simplify a lot of complex LSL code.
On the other hand, regular expressions take a lot of time to learn how to use.
An alternative could be to provide a feature that bash provides: globbing. For example foo*bar or fuz?baz, where * represents 0 or more characters and ? represents 1 character.
Even better, "globbing" with capture using ( and ) to denote where to capture.
"globbing" would be a reasonable compromise between complex LSL code and the difficulty of regular expression.

KyliaDaden Resident

VriSeriphim Resident: Regex is already available in many other programming languages, and as usual, usage of Regex is completely optional.
Globbing with capture is uncommon, though, and it would need something similar to a regex state machine to implement anyway. That is not simple to implement.
I'd say just expose the Regex feature of Mono/.Net and be done with it. Simpler to implement.

Bleuhazenfurfle Resident

KyliaDaden Resident: A glob pattern can be converted to a regex reasonably easily. That's why I included it as a bonus option in a regex proposal I made on Jira a while back. It's an easy thing that would help a lot of people for whom regex is just unfathomably complicated.
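
A minimal sketch of that conversion in Python, used here only because it is easy to run. It assumes a glob with just `*` and `?`, and the names `glob_to_regex` and `glob_match` are made up for illustration.

```python
import re

def glob_to_regex(glob):
    """Translate a simple glob (* and ? only) into a regex string."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append(".*")           # zero or more characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    return "".join(parts)

def glob_match(glob, text):
    """True when the whole of `text` matches the glob."""
    return re.fullmatch(glob_to_regex(glob), text, re.DOTALL) is not None
```

Because every non-wildcard character is escaped, regex metacharacters in the glob (such as `.` or `+`) stay literal.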

KyliaDaden Resident

Bleuhazenfurfle Resident: Regex is already exposed through the llLinksetDataFindKeys() function. Adding yet-another-way-to-perform-nonspecific-string-matches will instead cause potential confusion. A standard glob also doesn't have a group-capturing feature; it's just match or no match. You would need another step to extract the portion you are interested in, or a custom glob syntax that no one else has implemented, resulting in longer development time + greater possibility of bugs.

I'd say just standardize on Regex. It's quite standardized, and there are more free tools for making regexes than you can shake a stick at.

Bleuhazenfurfle Resident

KyliaDaden Resident: You seem a little bit confused.
The point of the Jira I was referring to was an extensible grab-bag of all the searching methods people need, also acknowledging the varying skills (and lacks thereof) of those people, with the intention that, once implemented, it could be applied liberally (also, once implemented, adding a method to the core matcher would automatically add it to every place matching is done throughout LSL).
The key to it is that "glob" or "regex" become just another option for finding stuff, everywhere that finding stuff is a thing; llListFindListEx could find prefixes, globs, regexes, with or without case sensitivity, even numeric matching (think, those annoying numbers that get added to the end of inventory item names) and all by just adding an extra integer parameter for match type.
And that Jira was actually accepted with one of those "we might get around to it some century" messages.  Regex (and glob) were just one of the options it offered, as a "catch-all" when anchored exact(-ish, it also supported case insensitivity) matching wasn't enough.  I was hoping to head off the situation we're in now, where some things only match exactly, while others only match by regex, and implementing all the options would mean five or six functions at every point.  (Sometimes, a simple exact match is what you want, even in linkset data, and a heck of a lot faster too!)  And the way I'd suggested implementing it, at least, would open the door for something truly amazing: reverse matching (which you've probably all had to do, even if you don't realise it, though it would probably need to be ignored for regex).  A big focus, though, was that most people find regex scary, and trying to sanitise a regex pattern is really hard!  (We also desperately need a function for that!)  They could still add it, just give us a xxxFindEx function or something, and the existing search functions can be considered "convenience functions for the [we think] most common case".
On the topic of globs specifically, I have watched many people struggle with regex, long before linkset data existed (people often ask for help with other languages, just because they know we be smart cookies).  So in my more generalised search suggestion, I included glob matches as a "bonus feature".  It is vastly easier to handle for those who find regex intimidating, doesn't require the nutso sanitisation regex does (it's pretty hard for a script to make a pathological glob expression that SL would even blink at), and in the context of my original proposal, would shimmy right on through the same functions (which were in fact regular yes/no matches).  Globs are also _really_ easy on the implementation side (and hence very fast)…  about 25 years ago I wrote a glob matcher in assembler, that also did numeric matching.  If I could do it in assembler in a day, LL should be able to crank it out in an hour flat, with testing.
Oh, and, I do happen to agree that globs don't really lend themselves to group matching, either, but I do still think they're a very good idea, and pretty easy to toss in, really.  You just shove the current position on a stack when you hit an open bracket, and shove the relevant substring on a list with a close bracket.  Done.

KyliaDaden Resident

Bleuhazenfurfle Resident 
"You seem a little bit confused." <= This is ad hominem. Don't do that again.
Now for some actual responses:
> about 25 years ago I wrote a glob matcher in assembler, that also did numeric matching.  If I could do it in assembler in a day, LL should be able to crank it out in an hour flat, with testing.
It's still additional work that's more complex than a simple marshaling between the LSL VM and the Mono engine.
Also, your experience is just one data point. We can't assume everyone is at the same level of expertise, especially people that have to juggle a huge stack of backlogs.
> The key to it is that "glob" or "regex" become just another option for finding stuff, everywhere that finding stuff is a thing ... all by just adding an extra integer parameter for match type.
Adding an "alternative way of performing matching" seems good at a glance. But that means implementing 2 code paths. More maintenance work, plus the function itself might have to return overly complex results.
Sooner or later any programmer will likely have to deal with regular expressions. Yes, it's not easy. But it's not an unsurmountable barrier. People can learn, and learning regex has benefits outside writing LSL scripts.
Ultimately, mastering regexes _potentially_ results in simpler code = easier maintenance = also more memory available.

Bleuhazenfurfle Resident

KyliaDaden Resident: *stomps my feets*  You're not the boss of me, SO THERE!  *grins*
  (It still seemed like you were likely thinking about the wrong Jira post — I did rather heavily rip into one of yours on this very topic, after all.)  It was also much more polite than a simple "Durrr".  Maybe a facepalm emoji would have been more appropriate?
(And btw, this is the short version of this post.  When I tried to actually fully address what you said/assumed/etc., it ran much much longer, and I've cut it down to about a third — there is sooooo much more I could say, on several entirely separate levels, including a bunch of fun anecdotes I rather regret cutting out.)
Pretty much your entire response was stating the "obvious" — stuff that's already been covered more than once, _and_ I put that in quotes because I'm fairly certain you're making a bunch of incorrect assumptions there also (if not being actively selective).  That you're continuing so weakly, really rings as stomping your feet crying, "but I want my regex!", and grasping at any random mostly irrelevant factoid you can find.  My intention, was just to second the idea of globs being an option — they really do get underestimated quite badly.
Yes it's additional work, so is reading this forum.  Yes I am just me, but like you I'm in the trenches helping other scripters with their issues (more so in the groups than the forums), plus I have a fairly good grasp on how things likely work behind the scenes — I've been around the block a few times — and always do my best to balance LL's apparent limitations as I guesstimate them to be, with the capability and needs of the people I've helped, and my fellow Got Gud scripters — speaking of which, your "scripters should Get Gud" really is not an appropriate response here.
Back on the topic, the point is globs are a simpler, safer, and vastly more sim-friendly way to do 80% of what regex presently gets used for (that's _the_ 80%, the reality is almost certainly vastly higher).  The genesis of me including glob in my original Jira (classed as a "bonus feature", specifically for most of the reasons you've been going on about here), is because I very very often find myself wanting to match something like: prefix <stuff> userData <more stuff>
And therein lies an issue; completely besides your "scripters should Get Gud at regex" argument, there's the entire thing of regex sanitisation (which you didn't even bother to acknowledge again).  That "userData" takes a fair chunk of grinding to get into shape, something which a LOT of scripters (Got Gud ones included) just don't even bother to do.  Globs can cope with that far, far better, plus they're vastly easier to sanitise when we do decide it's worth doing; for a glob it's probably actually worth leaving unsanitised as a feature, since there's basically nothing nasty that can be done as long as the fixed part is anchored at one end or the other.
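
To make the sanitisation problem concrete, here is a short Python sketch of the "prefix <stuff> userData <more stuff>" case. Python's `re.escape` does the escaping here; LSL has no equivalent today, and `build_pattern` is a hypothetical helper name.

```python
import re

def build_pattern(prefix, user_data):
    """Build 'prefix <stuff> userData <more stuff>' with the user data matched literally."""
    return re.escape(prefix) + ".*" + re.escape(user_data) + ".*"

# Unsanitised: the "+" in the user data is read as a regex quantifier,
# so "a+b" wrongly matches text that only contains "aab".
unsanitised = "cmd:" + ".*" + "a+b" + ".*"
wrong_hit = re.fullmatch(unsanitised, "cmd: set aab now") is not None

# Sanitised: only a literal "a+b" in the text matches.
right_hit = re.fullmatch(build_pattern("cmd:", "a+b"), "cmd: set a+b now") is not None
no_hit = re.fullmatch(build_pattern("cmd:", "a+b"), "cmd: set aab now") is None
```

Without a built-in escape function, a script has to do the equivalent of `re.escape` by hand for every metacharacter, which is the grinding described above.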
Also, wasn't it you who raised the point that the .Net regex API comes with a timeout in your Jira on the topic…?!?  That right there should hint that an alternative that doesn't need one is well worth considering.
As a final note, yes, regexes _can_ do (almost) everything, just like a sledge hammer can be used to nail up a piece of wood, but I'd hate to see your medical bill by the time you're done.  This side-thread (not even entirely on topic) was about providing a _very_ useful alternative.
