Lex Neva's thoughts blog of Lex Neva in Second Life

June 11, 2008

Two Suggestions to Help SL Scale

Filed under: Scaling — Tags: , — lex @ 4:51 pm

LL has a problem.

This year has been especially rough. April saw a huge drop in availability, the worst month we’d had in a long time. June is seeing its own trouble, with downtime or severe stability issues every single day for the last 12 days [as of 6/10/08; source]. March was my all-time best month of sales, but in April my sales dropped to more than 30% below normal levels. I know this wasn’t just my products, but a trend that affected many retailers in SL. April’s availability drop also brought with it a huge number of failed transactions and lost inventory, with frequent warnings from LL to “avoid transactions as necessary” during database trouble. The bottom line: customer confidence is way down, and many customers and retailers are losing money, while LL scrambles to keep the grid afloat.

I have two ideas that could alleviate the situation, which I’ll present below.

About “Suggestions”

I’ve been in SL for almost three and a half years now. I’ve seen my share of rockiness, downtime, failed transactions, miscellaneous brokenness, bugfixes that introduce new bugs, and bugfixes for THOSE bugs introducing MORE bugs, all of which break at least some of my content. I’ve seen a heck of a lot of complaining that LL engineering is incompetent, and I have just one thing to say to that: bullshit. Just go look at the comments on LL’s blog, and you’ll see that it’s very easy to talk about how LL is full of incompetent morons, and if they’d just do X, Y, or Z, then all of our grid problems would go away, and since they’re not, they clearly should all be fired because they don’t have a single functioning neuron among them. Techies are often especially prone to fall into this, because we know just enough to make valid-sounding suggestions, and we can use that to illustrate that obviously LL is made of idiots because they haven’t done what we suggested.

It’s much harder to actually build a virtual world like Second Life, much less make it scale. I’ve always had a lot of respect for LL, simply because they’ve created the world as it stands. I have no doubt in my mind that LL is full of a bunch of incredibly bright, talented, and experienced programmers, because if it wasn’t, then the virtual world as we know it would not exist. So when I see people make ridiculous suggestions (“you should use Clustering™!”), I get really frustrated, because there’s a very big chance that either someone in LL has already thought about that and determined it wasn’t feasible, or it doesn’t even apply to the situation, or LL has already been doing it for years. In any case, they’re not stupid, and if someone can think of it within 5 minutes looking from the outside, it’s almost certain that someone at LL already has.

I also know that abusing LL doesn’t make much sense. It’s like whipping a horse when what you really need is ten more. You have to remember that these employees are actual people, and like I said above, they’re also very bright people. Telling them they’re stupid is not only wrong, it’s counterproductive.

With that in mind, I’m not making the suggestions below lightly. If someone in LL has already thought of them, then just take this as a vote in the “for” column. While I’ve become pretty frustrated with the way LL has been run of late, I will not fall into the crowd that rants and screams and calls them stupid. It’s not constructive.

Suggestion #1: Gridwide Transaction Kill-Switch

On to the actual suggestions. The first one stems from that common phrase I’ve learned to dread, “Please refrain from transactions until we give an all-clear.” There are several variations, including the ever-helpful “We’re experiencing database problems, please refrain from transactions as necessary”. How do you know what’s necessary? If your transaction just failed, refrain from it :P

LL makes these recommendations through their blog and in-world announcements. The blog announcements used to be on the main LL blog, but that blog was becoming so clogged with announcements of instability that they created a special Grid Status Reports Blog. In-world announcements take the form of those easy-to-miss blue boxes in the upper right, which often don’t go through during times of grid instability. The people this message needs to reach, the Residents, aren’t getting it.

This is a huge disconnect. On the one side, you’ve got LL Operations, and maybe an Operations Liaison(?) who actually posts on the blog. These people know that transactions WILL fail. Their job is to get them going as fast as possible, not to worry about making sure people know they’re not working. On the other side, you’ve got Residents, both customers and retailers, who really need this information. Customers need to know that they shouldn’t buy things or rez their no-copy items, and as a retailer, I’d love to be able to incorporate some kind of automated off-switch in my vendors to prevent transactions during grid instability. In the middle is the SL Support Team, which is getting flooded with angry Residents who demand refunds and whatnot. They (I think) make an announcement or two in-world, mostly to get some of the heat off them, I suspect.

My proposal is that if LL knows transactions will fail, don’t allow them. I’ve outlined my proposal in VWR-4431 on the public JIRA. The idea is that, rather than posting a message somewhere that transactions will fail, instead enforce a blanket ban on sensitive transactions at either the client or server level. Kill-switches like this already exist for the profile Web tab and streaming media, where LL can shut these off grid-wide in case a vulnerability is found in Mozilla Firefox.
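
To make the idea concrete, here is a rough sketch in Python of what a server-side check might look like. It says nothing about how LL's servers actually work: the Grid class, the transactions_enabled flag, and the pay method are all hypothetical names invented for illustration. The only point is that the switch is checked before anything is touched, so a blocked transaction changes nothing.

```python
from dataclasses import dataclass, field


class TransactionsDisabled(Exception):
    """Raised when a sensitive transaction is attempted while the switch is off."""


@dataclass
class Grid:
    # Hypothetical grid-wide flag that operations staff would flip during trouble.
    transactions_enabled: bool = True
    # Hypothetical ledger: resident name -> balance in L$.
    balances: dict = field(default_factory=dict)

    def pay(self, payer: str, payee: str, amount: int) -> None:
        # Check the kill-switch first, before any state is modified.
        if not self.transactions_enabled:
            raise TransactionsDisabled("transactions are disabled grid-wide")
        if amount <= 0:
            raise ValueError("amount must be positive")
        if self.balances.get(payer, 0) < amount:
            raise ValueError("insufficient funds")
        self.balances[payer] -= amount
        self.balances[payee] = self.balances.get(payee, 0) + amount
```

A vendor script or the viewer could catch the refusal and tell the customer to come back later, instead of taking the money and failing to deliver.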

This would prevent customers from losing money when their purchases aren’t delivered or losing inventory when they try to rez no-copy items. It would also limit load on the system, which might well aid in fixing the problem. It’s much easier to fix a broken pipe when you turn off the water first.

Most importantly, it would go a long way to restoring customer confidence. Right now, people are afraid to purchase items. Everyone’s experienced inventory loss and failed item deliveries. As evidenced by the huge drop in sales across the market, many people have grown frustrated and simply stopped purchasing. Ultimately, transactions should be made more reliable in and of themselves, but at the very least, with a kill-switch, customers would feel a lot safer.

Suggestion #2: Limit Growth in the Userbase

My second suggestion is a tried and true method of managing growth in online services, first demonstrated (I believe) by LiveJournal. LiveJournal’s userbase was growing much faster than its server architecture could handle, so they implemented an invitation system to manage growth while they scaled up the server architecture [LJ’s invitation system]. New users could only join LiveJournal if they received an invitation from an existing user or paid for an account. By limiting the number of invitations users could send, LiveJournal controlled growth. In my opinion, LL could greatly benefit from such a system.
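
For the curious, the mechanics can be sketched in a few lines of Python. This is not LiveJournal's actual implementation, which I haven't seen; the InviteSystem class, its method names, and the default of three invitations per user are assumptions made up for the example. It only captures the two rules described above: you get in with an invitation or by paying, and each user can send a limited number of invitations.

```python
import secrets


class InviteSystem:
    """Toy invitation-gated signup (hypothetical; not any real service's API)."""

    def __init__(self, invites_per_user: int = 3):
        # Assumed allotment; the real number would be tuned to server capacity.
        self.invites_per_user = invites_per_user
        self.remaining = {}      # user name -> invitations left to send
        self.open_codes = set()  # issued codes that have not been redeemed yet

    def add_user(self, name: str) -> None:
        self.remaining[name] = self.invites_per_user

    def issue_invite(self, inviter: str) -> str:
        # Unknown users and users who have used up their allotment get nothing.
        if self.remaining.get(inviter, 0) <= 0:
            raise PermissionError("no invitations left")
        self.remaining[inviter] -= 1
        code = secrets.token_hex(8)
        self.open_codes.add(code)
        return code

    def sign_up(self, name: str, code: str = None, paid: bool = False) -> None:
        if name in self.remaining:
            raise ValueError("name already taken")
        if paid:
            pass  # paid accounts skip the invitation requirement
        elif code in self.open_codes:
            self.open_codes.remove(code)  # each code works exactly once
        else:
            raise PermissionError("a valid invitation or a paid account is required")
        self.add_user(name)
```

The growth rate is then set by a single number, invites_per_user, which could be raised as the servers catch up.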

I won’t go so far as to say that LL should not have opened the grid to free accounts, and I definitely don’t feel that the grid should be limited to only Premium accounts. I am a Basic account user. I contribute monthly to the payment for a sim, and I’ve produced and released quite a few open source tools. I know that if it weren’t for Basic accounts, I and a lot of other people who positively contribute to SL would not have joined.

On the other hand, there’s been an explosion of growth in the user base dating back to the announcement of free accounts. SL’s popularity has “tipped”, and while LL has made some pretty huge steps toward scalability, every step seems to be immediately subsumed by the growing numbers of users. Concurrency is on the rise, while availability and reliability are decreasing. Something needs to be done.

The best part about this kind of invitation system, as demonstrated spectacularly by Google’s Gmail, is that, rather than curbing the popularity of a service, it can create a furor. Suddenly invitations are a hot commodity. When you tell people they’ll have to wait their turn to get in, they start to realize that they should want to get in. And when you do finally open the floodgates after making the system more robust, users will flood in.

Along with a boost due to exclusivity, such a system would improve user satisfaction. Right now, new users join SL in droves, but they also leave in droves. Retention is low. Users who join during the current instability might well get frustrated and turn around, and all they’ll remember of SL is that it’s unstable, unreliable, frustrating, and possibly even a waste of their money. It will be a long time before they’re willing to give it another try. On the other hand, with an invitation system, they’ll learn about SL, decide to give it a chance, and get turned away at the door. They might be a little upset, but these users won’t wait nearly as long before they try to join again. And when they do get in, the system will be much more robust and stable, so they’re much more likely to be satisfied. LL would have fewer signups, but they’d retain more users.

We’d also have another side bonus: fewer throwaway griefer accounts. The need to pay or get an invitation code to join will deter some of the casual griefers from creating throwaway accounts. Careful allotment of invitation codes will make it hard for griefers to create accounts to “farm” invitations. Throwaway accounts won’t be completely eliminated, but they might well be significantly reduced. Griefing might be reduced as well, since there’s more of a consequence for getting caught.

I hate to use a buzz-word here, but an invitation system would be a “win-win” proposition. LL would buy time to grow their architecture at a reasonable pace, and in the process, they’d get greater stability, higher user retention, and a sense of exclusivity, with even the possibility of reducing griefing. Invitation systems are a well-tested method of controlling growth in order to scale gracefully, and that’s something SL desperately needs.

[I recently suggested these ideas to Torley Linden during an office hour, and Torley seemed interested and was going to forward my suggestions to other Lindens. I posted this here to try to draw more attention from Lindens and Residents.]

4 Comments »

  1. I just found a little gem in the SL blog comments that verifies my guesses about how Grid Status Reports get made. Here’s the comment:

    The folks who post to the Status Reports are, themselves, the people working on the problems - the team designates one individual who updates that report. Sometimes they don’t post details because they don’t yet know the cause of problems. Sometimes issues are not widespread, for instance whatever had some experiencing login failure just now, and in cases like that the Status Report might not always be updated. (I don’t know what happened there, I’m simply listing some reasons I can think of why the page may not be up to date in terms of your own experience of the grid.)

    Thanks for your patience!
    - Katt

    Comment by lex — June 11, 2008 @ 6:42 pm

  2. I appreciate the defense, and the suggestions. We’re just starting to have a system somewhat like point #1 – if you remember when profiles were selectively disabled. I’m not sure how much we’re talking publicly about central service engineering, so I won’t go into much more detail. I can’t comment on #2, that’s much more than a technical proposition.

    Thanks again for the positive comments in the introduction.

    Comment by Poppy — August 14, 2008 @ 8:37 pm

  3. Ooh, sweet, a Linden comment! Thanks for reading. I’m glad to hear that #1 is underway.

    Comment by lex — August 14, 2008 @ 8:41 pm

  4. I just read your article on Second Life in 2600. Your concerns about other support representatives, I would normally consider to be valid, but I used to work for the company that LL has outsourced to. I used to work there with the guy who knows the creator of SL as a close personal friend, and who helped him set up support with this company. I also know most of the people on the LL SL Support team. I can personally vouch for their intelligence and attention to detail (at least the ones I know). To the outsourcer this was a VERY important, if low revenue contract, to prove that they had the ability to provide people who actually know what they are doing as opposed to just idiots with good CS skills, like most hire. Not to toot my own horn, but I’m pretty damn good, and these people easily trumped me most of the time. LL does pay attention to its customers, but due to internal politics, and arrogance, and whatever else, sometimes things don’t get implemented. Just thought a little heads up on the people on the support line might be in order to help alleviate any concerns. As for how long LL will stay with said outsourcer, is anyone’s guess. Keeping people with minimum x knowledge and troubleshooting abilities who are willing to accept x wage can be tough. That being said a lot of the people on the support line are also SL developers of some sort or another.

    Comment by I must leave this Anonymous — April 15, 2009 @ 10:47 pm
