server and region reorganization
tracked
TgBianca Resident
Hiya everyone,
I wrote feedback a few weeks ago about the excessive number of disconnects and crashes in SL, and we were told during the round table meeting that this would be fixed and that LL is now working seriously on it. But our sailors haven't stopped thinking about the reasons for the disconnects and crashes either, and we are now very sure that LL won't ever solve the problem with software changes. To explain why, we need to go back some years and look at when the disconnects started to become a frustrating problem.
LL moved SL to Amazon servers about 6-7 years ago, changed the crossing process for that, and the disconnects began. During the 6-9 month test period I had at least 10 crashes EVERY DAY and came very close to leaving SL, as a lot of friends did. It got better over the years, but it NEVER stopped, and a lot of new friends are starting to leave SL out of frustration.
LL is doing a great job of making SL much more beautiful and exciting with new connected continents, much better graphics like PBR, and many other things.
That is like giving us a Porsche or Bentley to drive, but unfortunately they forgot to improve the roads as well and still leave us driving on dirt tracks at only 25 km/h.
In SL terms: our sailors cut their graphics settings to a minimum, reduce draw distance (DD) and view angle to a minimum, reduce the scripts in their active avatar to a minimum, and more and more of them race with the old default avatar to reduce the risk of crashing.
For a while we thought it was working, but it was an illusion and conditions didn't really change much.
After our races we talk a lot about them in our bar, and the idea I want to write about here grew out of those conversations.
Our problem is the organization of the servers. There are about 5 regions on one server, and they are randomly recombined every week with each rolling restart.
That means, for example, regions A, M, G, O and W are on server 1 this week, next week those 5 regions can be on servers 1, 5, 8, 15 and 20, and server 1 can host regions H, D, P, Y and B.
The regions are all equipped differently, so the servers behave differently every week. In my last feedback I wrote that some regions in our race area are statistically worse than others and we crash there more often, but we didn't understand why that changed weekly. Now we know.
I am not a server expert, but I am a shopping expert, and I can tell you that buying all your stuff in one big shop is much faster than buying it in 5 different shops. 5 shops can be a bit cheaper, but when time is my priority I am willing to pay a bit more and use the time I save for nicer things I enjoy more.
Here is our idea:
Always put neighboring regions like A, B, C, D and E on server 1, regions F, G, H, I and J on server 2, and so on.
- I don't have to change servers when I cross from region A to B, C, D or E. Server changes happen much less often, because neighboring regions are on the same server and don't get reshuffled every week. Crossings will be faster and much safer.
- If I still crash in some regions, it is easier to analyze why.
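A toy model can make the difference between the two layouts concrete. The sketch below is an illustration under stated assumptions, not how LL actually places regions: a hypothetical 10 x 10 block of regions, 5 regions per server, and a straight route along one row. It counts how many of the route's crossings also change servers, once with a random shuffle (as after a weekly rolling restart) and once with runs of five neighboring regions grouped onto the same server. The grid size, route, and grouping rule are all assumptions.

```python
import random

GRID = 10        # hypothetical 10 x 10 block of regions
PER_SERVER = 5   # "about 5 regions on one server"

def random_placement(seed):
    """Shuffle all regions onto servers, like a weekly rolling restart (toy model)."""
    regions = [(x, y) for y in range(GRID) for x in range(GRID)]
    rng = random.Random(seed)
    rng.shuffle(regions)
    return {r: i // PER_SERVER for i, r in enumerate(regions)}

def grouped_placement():
    """Put each run of 5 neighboring regions in a row on the same server."""
    return {(x, y): y * (GRID // PER_SERVER) + x // PER_SERVER
            for y in range(GRID) for x in range(GRID)}

def server_changes(placement, route):
    """Count crossings along the route where the next region is on another server."""
    return sum(placement[a] != placement[b] for a, b in zip(route, route[1:]))

route = [(x, 0) for x in range(GRID)]   # sail straight across row 0: 9 crossings
print("grouped:", server_changes(grouped_placement(), route))   # grouped: 1
for week in range(3):
    print(f"random week {week}:", server_changes(random_placement(week), route))
```

In this model the grouped layout changes servers only once in nine crossings. In a shuffled layout a neighboring region shares your server with probability 4/99 (the 4 other regions on your server among the 99 remaining), so almost every crossing changes servers. Real placement also has to respect host capacity and load, which this toy ignores.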
I suggest trying this idea on a small area first, for example the Blake Sea area, if you need to test it on RC Channel first, because I think those regions are all on RC Channel.
Our racing community can support the tests in our race area, which is all on Main Channel, and I will collect the feedback from our community for you.
This would be a game changer, and all your awesome work to make SL more beautiful could be enjoyed much more safely and more often. Isn't that a great feeling?
Cheers
Bianca♥
Signal Linden
Hi, thanks for the write-up.
Since moving to AWS we have gone through a couple of iterations of how we place regions on hosts. In a nutshell, there have been a few ideas in our backlog for improving region placement by colocating simulators on the same host, rack, or availability zone based on in-world topography (region adjacency) and other factors.
This isn't something that we're going to commit to in detail, as many of the restrictions around how we place regions are related to internal infrastructure details, but I can say we'd love to improve the placement. So I'll mark this as tracked.
Lisa Hyandi
Signal Linden
Thank you so much for considering it
Schickzal Resident
Hello everyone, here and not here!
I came back in late 2024/early 2025 after a few years away. I'm back with fiber and a very, very powerful computer, and I notice that changing SIMs on my canoe is still just as problematic, as are the disconnections when there are a lot of people on a SIM. You're teaching me how SL's servers are managed. Fortunately, not every other company in the world manages its servers the same way. I don't think your request will change anything, but I'm still giving it my vote.
Everyone, please take a look at this and vote for it too if you want: https://feedback.secondlife.com/feature-requests/p/library-pbr-environments-permission-fix-request
Kelsie Dakota
Since I've been driving around Second Norway, prompted by a friend who lives there, I've discovered the nuisance of worn objects randomly detaching. It's enough to put you off SL, or at least off crossing regions in SL, which seems to trigger this problem. It's crazy that this problem still exists.
Kerra Rhiadra
I agree that this seems to be the change that has caused the most problems with region crossing: since the cloud server migration that began in 2020, region servers have been grouped randomly rather than logically (that's less than 5 years ago, I just want to point out, since the initial post says 6-7 years).
The most telling case is mainland waterways, which are mostly devoid of objects, yet I've noticed that their regions can be on different simulator versions in a criss-cross pattern. Flipping between regions that handle crossings differently is already a recipe for disaster, often was, and still is.
As a network engineer and server admin, I understand that, given the way cloud servers inherently work, it is less feasible to arrange simulators logically on the same server, but I would think that an effort to do so, at minimum for mainland regions, would go a long way towards improving the overall SL experience for the people who actually want to travel it the most.
TgBianca Resident
Yes Kerra, maybe it only started 5 years ago and my memory mixed up the dates a bit, but even living with this many crashes for "only" 5 years is far too long.
Paul Hexem
Being dropped at crossings has nothing to do with the regions not communicating with each other; it's about dropped communication between server and viewer. See here: https://community.secondlife.com/forums/topic/503010-obscure-question-when-does-the-simulator-send-establishagentcommunication-to-the-viewer/
Lisa Hyandi
That is what I said before, and what a simple network capture on our own computers can confirm.
It is our viewer that has to communicate with a new server at every crossing, even when just crossing into a neighboring region.
This proposal just reduces that need.
A small step, we all know!!!
Beatrice Voxel
Paul Hexem, which is why this problem is exacerbated by randomly shuffled servers.
If I am on server A, in region 1, and I cross to region 2, which is also on server A:
* Any persistent connections to Server A can be maintained, I just need to update info in my viewer for the new region.
* Server A still has my profile information, what I'm wearing, what I'm sitting on, all of the inventory "stuff" that typically gets lost in region transits.
There is still considerable data that needs to be transferred, but most of it pertains to the new region, NOT my avatar itself.
However, if region 2 is on Server B:
* There are no persistent connections, my viewer must establish new connections to Server B in order to get the region information.
* My viewer also has to tell Server B to get all of my profile / inventory from Server A, everything about my avatar, my attachments, and what I'm 'riding'.
If any of those connections fail, they must be retried, and since there is no teleport "hold screen" all of this has to happen in real time. If the viewer and server can't sync up to the point where the viewer can render, we have a viewer crash.
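The difference between the two cases can be put in rough numbers. The sketch below is a simplified, hypothetical model: it treats each crossing as the list of steps described in the post above, assumes every attempt at a step fails independently with the same made-up probability, and allows an optional number of retries per step. None of these figures are measured SL values, and the step names are illustrative only.

```python
# Steps each crossing needs, following the breakdown in the post above
# (a simplified, hypothetical model, not Second Life's actual protocol).
SAME_SERVER = ["send new region info to viewer"]
CROSS_SERVER = [
    "viewer opens connection to server B",
    "server B fetches avatar, attachments and seat from server A",
    "send new region info to viewer",
]

def crossing_success(steps, p_fail, retries=0):
    """Probability that every step succeeds, assuming each attempt fails
    independently with probability p_fail and may be retried `retries` times."""
    per_step = 1.0 - p_fail ** (1 + retries)
    return per_step ** len(steps)

for retries in (0, 2):
    same = crossing_success(SAME_SERVER, 0.02, retries)
    cross = crossing_success(CROSS_SERVER, 0.02, retries)
    print(f"retries={retries}: same server {same:.6f}, different server {cross:.6f}")
```

With the made-up 2% failure rate and no retries, the same-server crossing succeeds 98% of the time and the cross-server one about 94%. With two retries per step, both rise above 99.99%, which also illustrates the point made further down the thread that robust retries may matter as much as placement.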
Christi Maeterlinck
I've only noticed these crashes in the last 3 months or so, but wow! They're annoying. At first I thought they happened because I had too many other apps running at the same time, e.g. Excel and Mail, on my 18 GB MacBook Pro and was running out of memory, but why would that be a problem now, all of a sudden? Perhaps yours is a better, or complementary, explanation, Bianca, and I support your proposal.
Kim Farleigh
LL, it's about time (!)
This pesky disconnect bug has been happening since March 2019, not only to sailors, pilots and other vehicle drivers; it happens with normal teleports as well.
No matter what server architecture, organisation and network communication we have here, there is a bug that needs fixing.
So, LL, please get on it and finally get this out of the way, once and for all! Give this top priority over developing fancy, shiny graphics features.
My two cents.
Jerrod Diavolo
I am sorry, but as a professional sysadmin and programmer I have to disagree with the theory that this will make a major difference. Connection issues between servers are extremely rare, whether they are VMs on the same host, in the same rack, in the same data center building, or even in different data centers in different parts of the world.
SL crossing issues are much, much more frequent than actual connection issues like that. Since a region crossing is essentially a hand-off of the viewer from one server to another, it involves establishing one or more new connections between the viewer and the new region.
Even connections from consumer internet lines to servers at the other end of the world should not usually fail at the rates we see with region crossings, at least if they are given enough time for retries.
It is very likely that the crossing issues we see are at least two independent problems. One is some sort of timeout in the code that aborts the crossing fatally if things don't go right very quickly, instead of a robust retry mechanism that simply keeps the entire vehicle and all avatars on it in the crossing until it is done. The other is the double-crossing problem, where apparently a second crossing is not properly prevented from starting while the first one hasn't finished yet.
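The two fixes described above can be sketched as a small state machine. This is a generic illustration in Python, not Second Life's simulator code; `Vehicle`, `CrossingInProgress`, `begin_crossing`, `complete_crossing` and the `handoff` callable are hypothetical names. It refuses to start a second crossing while one is pending, and it retries the hand-off with capped exponential backoff until it succeeds or a generous deadline expires, instead of aborting after one short timeout.

```python
import time

class CrossingInProgress(Exception):
    """Raised when a new crossing is requested before the previous one finished."""

class Vehicle:
    """Toy model of a vehicle (with seated avatars) handing off between regions."""

    def __init__(self, region):
        self.region = region
        self.pending = None   # target region of an unfinished crossing, if any

    def begin_crossing(self, target):
        # Guard against the "double crossing" case: refuse to start a second
        # hand-off while the first one has not completed.
        if self.pending is not None:
            raise CrossingInProgress(f"{self.region} -> {self.pending} not finished")
        self.pending = target

    def complete_crossing(self, handoff, max_wait=10.0, delay=0.05):
        """Retry the hand-off with backoff until it succeeds or max_wait expires.

        `handoff` is a hypothetical callable returning True once the new region
        has accepted the vehicle and everyone seated on it.
        """
        deadline = time.monotonic() + max_wait
        while True:
            if handoff(self.pending):
                self.region, self.pending = self.pending, None
                return True
            if time.monotonic() >= deadline:
                return False          # crossing stays pending; caller decides what to do
            time.sleep(delay)
            delay = min(delay * 2, 1.0)   # exponential backoff, capped at 1 s


attempts = iter([False, False, True])   # simulated flaky hand-off: fails twice
boat = Vehicle("Region A")
boat.begin_crossing("Region B")
try:
    boat.begin_crossing("Region C")     # second crossing before the first is done
except CrossingInProgress as e:
    print("refused:", e)                # refused: Region A -> Region B not finished
print(boat.complete_crossing(lambda target: next(attempts)), boat.region)  # True Region B
```

Keeping the crossing pending on failure, rather than discarding it, is the design choice that matters here: the caller can keep waiting or roll back to the old region, and the vehicle's region only changes once the hand-off has actually succeeded.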
What we essentially have here is a kind of distributed transaction mechanism. These are hard even for people who have put much more thought into them than LL has for SL (e.g. look at the old blog posts about most database systems at https://aphyr.com/tags/jepsen to see how badly those go even in distributed database systems that make much stronger promises).
TgBianca Resident
Then I just have one question: why did the disconnects become a real problem only after LL left their organized servers and moved to the unorganized ones? And the disconnects haven't been this bad for just a few months; they have been this bad for years!!! Sorry, but LL has been trying to solve them with software alone for years and hasn't succeeded.
Lisa Hyandi
Jerrod Diavolo
I agree that the optimal solution is the one you propose, but until we have it, this proposal could minimize how often viewers have to talk to a different server at every crossing.
It is not a matter of communication within the same data center, but of communication over the internet between our viewer and a different server at every crossing, starting with DNS name resolution (have you ever tried logging the DNS queries made during a crossing, and their possible failures?).
Jessica Hultcrantz
If you had cared to follow Animats' and Monty's work on region-viewer connections, you would have had an answer.
TL;DR: the uplift made the servers respond to the viewers too quickly for a protocol that is prone to race conditions and sensitive to out-of-order packets. The old in-house setup with physical servers had its own slowness, which kept the protocol's problems less exposed.
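Sensitivity to out-of-order packets is usually handled with sequence numbers and a reorder buffer. The sketch below shows that generic technique only; it is not Second Life's actual message protocol, and `ReorderBuffer` and the message texts are made up. Messages that arrive early are held back until every earlier sequence number has been delivered, and duplicates are dropped, so a later step is never acted on before the step it depends on.

```python
class ReorderBuffer:
    """Release messages in sequence-number order, whatever order they arrive in."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq   # next sequence number we are allowed to release
        self.waiting = {}           # seq -> message that arrived too early

    def receive(self, seq, message):
        """Accept one packet and return the messages that are now safe to act on."""
        if seq < self.next_seq or seq in self.waiting:
            return []               # duplicate or already released: ignore it
        self.waiting[seq] = message
        ready = []
        while self.next_seq in self.waiting:
            ready.append(self.waiting.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
print(buf.receive(1, "second message"))   # [] : held until message 0 arrives
print(buf.receive(0, "first message"))    # ['first message', 'second message']
print(buf.receive(1, "second message"))   # [] : duplicate, dropped
```

A real protocol also needs retransmission for packets that never arrive, which this sketch leaves out.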
Jerrod is right: regrouping servers won't help against the underlying communication troubles. The viewer still needs to disconnect from the old region and reconnect to the new one; that's how SL is designed.
A lot of hippos are buried in the viewer code; some need love and care to age well.
Try a different viewer and make sure your UDP bandwidth isn't set too high; SL likes slow, it is old. Sounds silly, but it might help.
Btw, in the old days regions were randomly spread over physical servers to distribute load. If you were unlucky, your favorite sea region got bundled up with an overcrowded dance club and lag was a fact.
Pilix Nagy
Speed also often seems to be a factor in the stability of crossings: the slower you're going, the safer (note safER, not totally safe even then) crossings tend to be. So, given that many sailors are experiencing this problem in often slower sailboats, imagine how much worse it is for those of us who fly planes, drive fast cars, or even run speedboats.
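The effect of speed is easy to put in numbers, because a standard Second Life region is 256 m across: the faster you go, the more crossings (and, with shuffled servers, the more server changes) you make per hour. The speeds below are illustrative assumptions for a sailboat and a plane travelling straight along one axis, not measured values.

```python
REGION_SIZE_M = 256   # a standard Second Life region is 256 m on each side

def seconds_per_region(speed_m_s):
    """Time spent in one region when travelling straight along one axis."""
    return REGION_SIZE_M / speed_m_s

def crossings_per_hour(speed_m_s):
    """Region boundaries crossed per hour at a constant speed."""
    return 3600 * speed_m_s / REGION_SIZE_M

# Illustrative speeds (assumptions, not measurements).
for label, speed in [("sailboat at 8 m/s", 8), ("plane at 40 m/s", 40)]:
    print(f"{label}: a crossing every {seconds_per_region(speed):.1f} s, "
          f"{crossings_per_hour(speed):.1f} crossings per hour")
# sailboat at 8 m/s: a crossing every 32.0 s, 112.5 crossings per hour
# plane at 40 m/s: a crossing every 6.4 s, 562.5 crossings per hour
```

At these assumed speeds the plane crosses a region boundary five times as often as the sailboat, so it gets five times as many chances per hour to hit a failing hand-off.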
If this sort of stable reorganization can at least reduce the number of times people crash, by reducing the overall number of server changes on any given trip, I think that would be a boon for the whole community.
There's a lot of money involved in all forms of travel: people buy boats, planes, cars, etc., and a lot of land is held for airports, smaller airstrips and marinas, not to mention privately owned parcels where people park their boats, planes, cars, etc., so they can easily hop into whatever vehicle they prefer and travel right from home. Improving conditions in every way possible means smoother experiences, more people wanting to travel, more people wanting to own sims or parcels they can travel out from, and people continuing to buy vehicles, which in turn supports SL's creators and keeps them here as well.
castedone2 Resident
I have to strongly concur that the issue described above is a major one, not only for sailors, but also, for aviators and drivers (anyone who operates a vehicle that crosses region boundaries).
In addition to being a Cruise Director for one of SL's larger sailing groups, I am also a pilot, so I regularly sail or fly across the connected mainlands and estates. This issue is ongoing, and as was pointed out, sailors and pilots (and their crew or passengers) have to take rather severe measures just to avoid partial or full unsits at each region crossing. And while reducing draw distance, camera angle, and scripts to bare minimums significantly degrades and negates the amazing enhancements that have been brought to our SL world, it does not solve the basic issue.
There was a time when we could put in a support ticket asking the Labs to place a set number of regions in a nearby geographic area on the same server. We did that near the Marina I operate to improve sailing in and around it, and at the time it did help. But now the assignment of regions to servers seems to be random.
I mentioned that I am a pilot, and I notice this issue at a number of airports that span more than one region. The same airport can be served by two different servers (even Main Channel and RC).
Grouping adjacent regions helped in the past. I support the Labs trying this again.