Thursday 16 December 2021

Upgraders of the lost ARC

One of the things that my recent discussion of segmented, rigged mesh bodies and the problems they cause for viewers threw into stark perspective was that the current complexity metrics are not in the least bit helpful. In particular, the mainstay of the current toolset: "ARC".

So what is ARC?

Avatar Rendering Complexity (ARC) is a calculated value based upon an algorithm determined some years ago by Linden Lab. It scores an item based on a preset "cost" per feature that, in theory, aligns to the overall rendering impact of that item. It has not been adjusted for many years and, even assuming it was ever broadly accurate, it has been outrun by changes in hardware and in content/content creation. The calculation assigns a base cost and/or multiplier based on the type of features used, such as legacy bump mapping, flexiprims and alpha.

Why is ARC wrong?

In part, it is down to bit-rot: it was seemingly based on tests in the past, but times and features have changed; technology has moved on but the code has not kept pace. However, it is more than just being out of date. Even if the algorithm was broadly right for one machine, how well would that transfer to another? We can all appreciate that the performance profile of a laptop with "onboard" graphics will differ massively from a desktop with a dedicated GPU. Every one of us has a different setup. Too many things depend not just upon your machine, but upon your settings and circumstances too.

Fine, ARC is not correct, but it's better than nothing, right?

This has been said time and again but, frankly, I am far from convinced. Back in the summer, the Lab released their first look at the "performance floater". This features a new presentation of ARC, ranking the "worst offenders" and allowing you to derender them. This is a great feature, in theory, except it is frequently pointing the finger at the wrong people. Because ARC is flawed, those at the top of the "complexity" list may not be the ones affecting your FPS at all; worse still, because the numbers are misleading, they can encourage entirely the wrong changes. I have seen people remove all their flexiprim hair and replace it with rigged mesh; their complexity plummets, but the reality is that they are more than likely taking far longer to draw now. A segmented mesh body will frequently have a lower ARC score than an unsegmented one, yet we have seen in my recent posts that the reality is quite different. The result of this misleading information is that people swap out perfectly good assets for dreadful ones, making the overall performance worse, all the while being lied to by ARC.

I would argue, therefore, that false information is worse than no information. So what can we do?

Say hello to ART

ART is my cutesy name for the Avatar Render Time. It is a personalised score that reflects the impact of an asset on you and your machine, right now, with your current settings. It is not a like-for-like replacement for ARC, nor a tweaking of the algorithm; it is instead a consistent measure of the specific cost of rendering avatars on your machine, with your settings.

Here is an example of the new screen that will be available in the next Firestorm release, taken at a busy music venue.




This display is derived from the initial work of the Lab as can be seen in their "performance floater" project viewer released in the summer. The presentation has been "more or less" retained, but importantly, I have moved away from ARC. There are a few aspects of this screen that are worthy of note.

The frame summary



Alongside the display of the overall FPS number, we have a breakdown of the time taken to render each frame. 

The frame time is expressed in milliseconds (ms); a millisecond is 1/1000th of a second. In this case, the entire scene was drawn in 70ms, 70/1000ths of a second. 15 FPS is roughly 1000/70 (the FPS shown is an average over recent frames), and therefore if we want to increase the number of frames per second, we must reduce the time each frame takes.
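If you prefer to see that relationship written out, here is a minimal sketch (illustrative only, not viewer code):

```cpp
// Illustrative only (not viewer code): the FPS <-> frame time relationship.
#include <iostream>

int main()
{
    const double frame_time_ms = 70.0;         // time taken to draw one frame
    const double fps = 1000.0 / frame_time_ms; // ~14.3, displayed as ~15 FPS
    std::cout << fps << " FPS\n";

    // To reach a 20 FPS target we must shave the difference off each frame:
    const double target_ms = 1000.0 / 20.0;    // 50ms per frame
    std::cout << "need to save " << (frame_time_ms - target_ms) << "ms\n"; // 20ms
}
```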

Alongside the frame time, we have the proportion of time spent on various (quite broad) categories.

UI - All the viewer user interface, the floaters and menus that you have open. It is worth looking at how much certain views cost if you are looking to maximise your FPS.

HUDS - The HUDs that you are wearing. I am a little wary of this number; HUDs can carry a very high texture overhead, and while this increases the HUD rendering cost, it also increases pressure on the rest of the rendering as those textures displace ones in the scene from the graphics card. This interdependency is not something that we can accurately reflect at present.

Tasks - The time spent keeping your viewer connected to Second Life, processing all the messages and so forth. This should be comparatively low and stable.

This leaves Scenery, Avatars and the mysterious "Swap".

Scenery - The scenery number is a gross over-simplification, it effectively means all the rendering that I could not assign directly to a specific avatar (or HUD or UI). It will thus mostly be the environment around you but it can include certain overheads that cannot be easily assigned elsewhere.

Avatars - No prizes for this one; the time spent rendering each avatar, and we'll examine that more closely in a bit. 

Swap - Swap is the most obscure stat here; it is uncompromisingly technical. I'll put a footnote at the end to explain it for those who care.

What we note is that in our scene, a relatively simple skybox venue, almost 20% (around 14ms) of our time was spent drawing the static mesh and prims that we see. Meanwhile, the avatars took a whopping 72% (50ms).

The nearby avatar list

Below this, we see the list of "nearby avatars". This list is built based on the draw distance and as such includes avatars that may be out of your direct line of sight. 



Here we can see that there were 23 avatars being "handled" by my viewer, and we shift into another "scary unit", the microsecond (µs). A microsecond is a millionth of a second, a tiny amount of time, used here to allow for the very wide range of rendering times different avatars can consume.

In this snapshot we see that our most complex avatar took more than 4,800 microseconds; this is almost 5 milliseconds. That single avatar (out of 23) accounted for almost 10% of the time spent on all the avatars combined.

Furthermore, we see the fallacy of ARC writ large (and this was not contrived, but pure chance): the avatar with the lowest ARC, the one we'd traditionally have considered "the best behaved", is in fact the slowest to be drawn. Meanwhile, we have "Gavrie", who would be second from the top based purely on ARC, showing as one of the more efficient avatars.

We can now choose to manually derender the worst offenders, by right-clicking, or use the "Maximum render time" slider at the top to set a cap of our choosing.

So how is Avatar Render Time determined? 

To determine the time, we take measurements during the rendering process, combining these to find the total time dedicated to drawing each avatar. This number varies a little from frame to frame; it will change when you look away from, or zoom in to, a given person due to the way that the viewer reduces the textures and geometry being drawn. As such, we cannot say that "Gavrie" is always the better behaved, just that in recent frames their render time has been significantly lower. What we learn from this is that removing or reducing the cost of a given avatar will save us up to that amount of time next frame and in turn increase our frame rate.
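For the curious, here is a rough sketch of the idea (the names are hypothetical; this is not the actual Firestorm code):

```cpp
// A sketch of per-avatar render time accumulation; names are hypothetical.
#include <chrono>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

std::map<std::string, long long> art_us; // per-avatar render time, microseconds

void drawAvatarTimed(const std::string& avatar_id)
{
    const auto start = Clock::now();
    // drawAvatar(avatar_id); // the real work: body, attachments, rigging...
    const auto end = Clock::now();
    art_us[avatar_id] +=
        std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
// At the end of the frame, art_us holds each avatar's total for that frame;
// the floater then smooths these totals over recent frames before display.
```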

About 17ms is consumed by the top 5 avatars. If we were to remove them completely, that would reduce our frame time accordingly from 70ms to 53ms, or roughly 19 FPS: an almost 25% increase.

What drives the avatar render time?

A large proportion of the time is spent on attachments. Don't forget that for the vast majority of us, our bodies, heads, hair, hands and feet are all attachments. In my previous blogs, I have illustrated the outright cost of these. There is also an underlying cost of just being an avatar; ironically, your invisible basic system body has a rendering cost that is higher than I had expected (even though it is not drawn most of the time), but for now at least that is not something we can influence. Therefore, we need to look at our attachments and see what we can do to improve ourselves.

The performance floater has an "attachments" view.


Here we can assess whether we are doing the best we can to keep lag down. If I were showing nearer the top of the avatar list, I might want to review my attachments and see whether I could remove any of the higher cost ones.

It is worth noting that once again the familiar ARC number is misleading. Faced with the ARC alone I might change my hair and leave the harness straps on my legs.


In fact, those straps are slightly more expensive than the hair in spite of being 1/10th of the alleged complexity.

So does the render time tell us everything we need to know?

Not really, no. What the render time is telling us is how long this avatar or this attachment took to draw on our screen. It is a personalised measurement, and for FPS improvement, that is what we need to know.

Of course, if we are not actually looking at the avatar, or the avatar is far away, then the cost will drop. You can see this in the attachment view by simply camming away from yourself. 

Importantly, the render cost that I see will differ from the render cost you see. Even if we share a camera viewpoint and graphics settings, the difference between my PC and your PC will dictate how similar (or not) our results are.

Render Time addresses:

  • What (or more typically who) is causing us to slow down. 
  • What items we are wearing that might slow us and others down.
  • The effect of settings changes (shadows on/off, water reflection, draw distance).

It does not (directly) address:

  • The total lack of accountability in SL for what we wear.
    It is not usable by region owners to restrict entry and manage the lag on their estate.
  • The non-rendering impact of complex assets.
    There can be overheads in preparing complex assets: transmission time on the network, time to unpack and validate the asset, time to fetch things such as textures. These all happen over the course of many frames.
  • Helping creators to know "up-front" what the performance of their creation might be.
    Creators have become adept at "optimising" their creations to lower ARC. Sadly, that is frequently a red herring. ART, meanwhile, cannot tell them exactly how an item will render on their customers' machines. It can, however, more accurately guide their decisions.

In summary

This new feature is intended to guide you in optimising your attachments and understanding good avatar content. You will be able to make informed decisions about your outfits and the things that you are demoing. It will also allow you to identify those people who are causing your FPS to take a dive, allowing you to derender them.

The feature will launch alongside another experimental feature, "auto-tuning": a means of letting the viewer manage your settings in the hope that it can maintain a higher FPS for you. I'll discuss the auto-tuning feature in another post. At the moment the two are closely linked, but as they mature it might make sense to separate them.

Footnote - Swap

Swap is the time spent by the OpenGL drivers in what is known as "SwapBuffer". When rendering the scene, the viewer draws to an offscreen canvas and, once finished, will "swap" this with the current display canvas, allowing us to draw one while showing the other. The reality is more nuanced than this, with a lot of the draw commands occurring during the swap process, and a long swap time can indicate that you are GPU bound (also commonly known as fill bound), with the CPU having to wait for the previous frame to finish before it can proceed. At present this is not that common, but certain settings will exacerbate matters: heavy texture load, very high shadow quality and low GPU VRAM can all play a part.
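For the technically minded, a measurement of this kind might look roughly like the following sketch (illustrative only; SDL_GL_SwapWindow is a real SDL2 call, but the viewer's actual windowing code differs):

```cpp
// Illustrative sketch: timing the buffer swap to spot GPU-bound frames.
#include <SDL.h>
#include <chrono>

double timeSwapMs(SDL_Window* window)
{
    const auto start = std::chrono::steady_clock::now();
    SDL_GL_SwapWindow(window); // present the back buffer; may block on the GPU
    const auto end = std::chrono::steady_clock::now();
    // A consistently long time here suggests the CPU is waiting for the GPU
    // to finish the previous frame, i.e. you are GPU (fill) bound.
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```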

Friday 19 November 2021

I don't wanna Mesh with nobody

OK, I had not planned on this blog, but a forum post raised that ugly spectre of LOD Factors and, in the light of a few things that have happened recently, I thought it would be worth a "quick" (yeah right, like Beq does quick) update post.

A few months ago I posted a couple of blogs calling upon the mesh body makers to give us more options for low render overhead bodies, demonstrating (in extreme terms) the true cost of those individual "alpha-cut" segments on your body and of your poorly optimised rigged hair. By way of illustration, here are a couple of screenshots.

There are a couple of things to note before you look:

1) Yes, I am running on a stupidly high spec machine. I am also standing in a deliberately uncluttered space. Ignore the FPS.

2) This is on a version of Firestorm that is at least one release away. It incorporates some of the impressive performance changes made by the Linden Lab team alongside some of our own tweaks. Right now, you can test the LL version by going to the project viewer download page on secondlife.com. It has bugs, many of them, but by heck, they are fast bugs and the more you can find now, the faster we get our hands on the goodies.

3) What you are seeing here is a new floater (also building on the initial work done at the lab) that will be part of the very next Firestorm release. A proper discussion of this is for another blog on another day, but what you see in the screenshot is the time in microseconds that each of my avatar's attachments is taking to render. On a less powerful machine, these numbers will be considerably higher, but the relative magnitudes will remain broadly the same.

The first image is my regular avatar. Beq is wearing a gorgeous Lelutka EVOX head, a zippy Slink Redux Hourglass body (with the petite augment), some arse kicker boots for good measure and a bunch of other bits and pieces. The Redux Slink body is notable for having a very limited number of "segments".


Meanwhile....

Beq slips on a demonstration copy of a well-known, very popular and very segmented body...


What this is showing is that the segmented bodies are significantly slower to draw than the unsegmented ones. Note too that while I happen to be demonstrating the Maitreya body here, it is certainly not the worst out there (far from it); it is, however, the most prevalent, and the one that I hope beyond all other hope the creator will provide a cut-free option for. Many new bodies now require alpha layers, and as such the pushback from creators against supporting BOM properly, due to the "extra work" alphas require, is no longer credible.

Soon (though in experimental form, as I want to get it into your hands to find out where the rough edges are) the feature shown here will be in Firestorm, and you'll be able to assess the rendering cost of your outfit and, by implication, the impact it has on others.

My original blog was really a plea to body makers to give us options: alongside their segmented models, to provide a weight-compatible uncut version, just so that we have the choice and don't have to give up a hard-earned wardrobe to be a good citizen. Siddean at Slink did exactly this. The Redux models are shipped with the original models, making the argument of "oh, but this tiny niche requirement means I must have alpha cuts" a moot point.

Since those blogs were written back in the summertime, we've seen a few things happen. Inithium, makers of the increasingly popular Kupra body range, all of which have no cuts and perform well, is launching a male body that is also without cuts (uncut meaning something quite different in the male space ;-) ). Siddean Munroe, creator of the Slink bodies, has launched an entirely new product range, Cinnamon and Chai: two bodies, both of which are minimally cut and in my tests perform at least as well as the Redux.

It can cost you nothing too

In my first write-up I inadvertently overlooked the entirely free option too: the open-source Ruth2 mesh body, which has an uncut, BOM-only version. You can read all about Ruth2 and her male counterpart Roth2 on Austin Tate's blog, and the bodies are available on the marketplace.

In summary

Your choices for higher-performing bodies are increasing. Your CPU, and more importantly your friends' CPUs, will thank you for it. And watch out: not all new bodies (or reborn old bodies) are improving matters, with some creators rolling out new versions of problem bodies.



A point of clarification is needed too. While the cost of rendering the body is high, other attachments all add up, and the basic cost of just being an avatar is also a factor, so in the real world you can't have 5 or 6 Beqs for every sliced body in your nightclub; but you can ultimately have more Beqs, or just better FPS.

Looking ahead

Perhaps more significant for performance (though this is very tentative and it remains to be seen in practice) is news from Runitai Linden at yesterday's content creator user group meeting that he is hoping to have some changes that directly address a large part of the problem that heavily sliced mesh bodies cause. My other blogs explain the concept of "batching" and drawcalls; Runitai's changes will (hopefully) bring improved batching to rigged mesh, cutting down the number of draw calls required. These are very early days, and Runitai cautioned that the changes make many shortcuts and assumptions that may not survive contact with the real world.

I have everything from fingers to toes crossed in the hope this can happen. I would still urge you all to consider a low-slice body next time you have a choice; request one from your favourite body maker or, at the very least, defer that purchase if the creator is not offering an option and take another look at the other offerings. While Runitai's rendering gymnastics may pull us out of the fire, we should never have been in there in the first place, and even if these changes help a lot and make the difference less extreme, it will likely remain the case that a segmented body is just harder to draw and an unwanted burden.


Love 

Beq

x

Monday 13 September 2021

Find me some body to love...benchmarking your lagatar.

This is essentially part 2 of the "why we need to get rid of the segmented bodies" blog.

Hypothesis - Mesh segmentation leads to significant rendering performance issues.

Before we start, just a heads up, this part is the data dump. It's all about the process of gathering data. As such it is somewhat less accessible than the last one. 

Still here? Enjoy.

A few months ago, I decided to quantify the real cost of sliced up bodies. Initially, I did some simple side-by-side tests in-world.

The first attempts were compelling but unsatisfactory. Using an alt, I ran some initial tests in a region in New Babbage that happened to be empty. I de-rendered everything, then had Beq TP in wearing my SLink Redux body. I recorded for a few minutes, sent her away, let things return to normal, then had Liz TP in wearing her Maitreya body.

The results were quite stark. Running with just my alt alone (the baseline), I saw 105 FPS (remember, this is basically an empty region). With SLink Redux, it dipped a little then recovered to 104 FPS. With Maitreya, it dropped to 93 FPS.

So this was a good start, but I wanted something a bit more robust and repeatable. Moreover, I wanted to test the principle. This is not about pointing out "X body is good, and Y body is bad"; it is about demonstrating why design choices affect things.

I needed to test things rigorously and in isolation. This meant using a closed OpenSim grid where I had full control of any external events. It also meant I needed to get test meshes that behaved the same way as proprietary bodies. 

Testing proprietary bodies against one another is problematic. 

  1. They won't rez (typically). You need to get lots of friends with specific setups.
  2. If they did rez, they are mostly too complex for SL Animesh constraints (100K tris).
  3. Bodies vary in construction, number of meshes, number of triangles, with and without feet, etc., making it less clear what is driving the results.
  4. Being proprietary, you can't test outside of SL either, of course, which means you are then exposed to SL randomness (people coming and going - I don't have the luxury of my own region) 

So I asked my partner Liz (polysail) to make me a custom mesh that we could adapt to our needs, and thus SpongeBlobSquareBoobs was born.



"SpongeBlob" is a headless rigged mesh body that consists of 110,000 triangles. Why 110K? It is the upper end of what can be uploaded into SL/OpenSim, given the face/triangle/vertex limits. Body triangle counts are harder to average because some have feet/hands attached; others do not. Another reason why we wanted to have a completely independent model.

The coloured panels shown in this photo are vertex colours (i.e. not textures) randomly assigned to each submesh. This picture is most likely the 192 mesh x 8 face "worst case" test model. We used no textures so that the texture bind cost was not part of this test (that's a different experiment for another day, perhaps).

The single most important fact to keep in mind when you read through this data is:

    Every single SpongeBlob is the same 110K triangles. They vary only by how they are sliced.

Apparatus and setup

So if SpongeBlob gives us the "Body Under Test" (BUT), what are we testing with?

Data Recording

The data is recorded using Tracy, a profiling tool available to anyone who can self-compile a viewer. It works by recording how long certain code sections take (much like the "fast timers" you see in the normal viewer's developer menu). This data gets streamed to a "data capture program" that runs locally (on the same machine or the same LAN). The capture program, or another visualiser tool, can then be used to explore the data. I recorded things like the drawcall time, though once we understand how the pipeline works, all we really need is the FPS, as I'll explain later, so you could use any FPS tool if you want to try this yourself in a simpler form.
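For those who build their own viewers, instrumenting a code section with Tracy looks roughly like this (a minimal sketch; build with TRACY_ENABLE defined and the Tracy client linked, per the Tracy documentation):

```cpp
// A minimal sketch of Tracy instrumentation in C++.
#include <tracy/Tracy.hpp>

void renderFrame()
{
    ZoneScopedN("renderFrame");   // times the whole function
    {
        ZoneScopedN("drawCalls"); // times just this block
        // ... issue draw calls ...
    }
    FrameMark; // marks the frame boundary so Tracy can derive FPS
}
```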

Environment and noise control

The accuracy of the tests relies on removing as much noise as we can. We all know that SL framerates are jittery, so we do our best to stabilise things by eliminating anything that is not under test.

To this end, I used an OpenSim system (the DreamGrid Windows setup, as it is extremely quick and easy to set up). With my own OpenSim grid, running on an older PC, I created a 256x256 region with no neighbours. This means I have an SL-like region size, and I have removed any potential viewer overhead of managing connections to multiple regions.

The region was left empty; no static scenery was used, meaning that the region rendering overhead was constrained pretty much to just the land, sea and sky.

Settings

The plan was to record using several different machines of varying capabilities, so I made sure to keep the settings as similar as possible across those. 

We are interested in the rendering costs of different body "configurations", and these are only comparable in the same context (i.e. on the same hardware). Still, we'd like to look for trends, similarities, and differences across different hardware setups, so I tried to ensure that I used the same core settings. The key ones are as follows:

FPS limiting off - clearly...

Shadows (sun/moon & local) - This deliberately increases the render load and helps lift the results above the measurement jitter.

Midday - Are there implications if the shadows are longer? Let's avoid that by fixing the sun location.

Max non-impostors - unlimited. This ensures we don't impostor any of the test avatars.

ALM on - we want materials to be accounted for even though we are not using them. It ought not to affect our results, really.

Camera view - I needed to ensure that I was rendering the same scene. To achieve this, I used a simple rezzing system that Liz and I have tweaked over time. It uses a simple HUD attachment on the observer that controls the camera. A controller "cube" sends a command to the HUD telling it where to position the camera and what direction to point in. 

Test Setup

Each test involves rezzing a fixed set of BUTs (16) in a small grid. These cycle through random animations. The controller cube that is used to position the camera is also responsible for rezzing the BUTs. Every time the cube is long-clicked, it will delete the existing BUTs and rez the next set.

Each avatar model is an Animesh. This full test cannot be run in SL due to the Second Life Animesh limit of 100K triangles. Using Animesh removes any other potential implications to the rendering caused by being an actual avatar agent.

This is a typical view being recorded.


Consistency and repeatability

It was important to remove as many errors as possible, so scripting things like the rezzing and camera made a lot of sense. We also made sure that the viewer was restarted between each test of a given BUT.

Tests were run for at least 5 minutes, and I would exclude the first 2 minutes to ensure that all the body parts had been downloaded, rezzed and cached as appropriate. There are implications to the slicing of bodies that alter the initial load and rendering time (you see this with the floating clouds of triangles when you TP to a busy store/region), but this is not what we are testing.

Hardware

Running the tests on a single machine tells us that the findings apply to that machine, and within reason, we can extend the conclusion across all machines in the same or similar class. But, of course, in Second Life, we have a wide range of machines and environments. So it was important to us to get as much data as we could. 

We thus ran the tests across various machines that we have access to. 
As a developer, most of my machines, even older ones, tend to have been "high end" in their day. So we should note that potential bias when drawing conclusions.

Here is the list of hardware tested, along with the "code names".


Methodology and Test Runs

Using the above setup, we would run through a specific set of configurations. Those were as follows.


The baseline test is simply an empty scene. Thus we establish the cost of rendering the world and any extraneous things; this includes any cost to having the observing avatar present.

You can see that every mesh has the same number of triangles but is split into more and more objects. Once we reach 192 objects, we continue scaling using multiple texture faces (thus creating submeshes). 

I will include in an appendix a test that shows the broad equivalence of submeshes versus actual meshes. There is no appreciable benefit to one as opposed to the other in these tests (there may be other implications we do not investigate).

By changing the number of meshes and faces, we are scaling up the number of submeshes that the pipeline has to deal with and thus the number of drawcalls. If you remember the analogy I gave in the first part of this blog, you'll recall that the hypothesis is that the process of parcelling up all the contextual information for drawing a set of triangles far outweighs the time spent processing the triangles alone.

If this hypothesis is correct, we will see a decline in FPS as the number of submeshes increases. As we reduce the number of triangles in each call, we also demonstrate that the number of triangles is not dominant. 

Results

So what did we find?

The first graph I will share is the outright FPS plotted against the total submeshes in the scene.



This graph tells us a few things. 
1) The raw compute power of a high-end machine such as the "beast" is very quickly cut down to size by the drawcall overhead.
2) The desktop machines with their dedicated GPUs have a similar profile.
3) The laptops, lacking a discrete, dedicated GPU, are harder to make out at this scale.

If we normalise the data by making the FPS a percentage of the baseline FPS for that machine, we will rescale vertically and hopefully have a clearer view of the lower end data.


This is very interesting (well, to a data nerd like me). 
We can see that the profiles of all the machines tested are similar, suggesting that the impact is across the board.
We can also see that the laptops continue to be segregated from the desktops. The impact of the drawcalls, while pronounced and clearly disruptive, is not as extreme as it is for the dedicated GPUs. This would seem to support the hypothesis that machines with onboard graphics are additionally penalised by the triangle count, giving their graphs that vertical offset from the rest. As we have not explicitly measured this, we cannot draw too much from it, but there is clearly pressure on those less powerful machines.

What may be surprising to some and is certainly interesting is that all the desktops are impacted similarly. The shiny new RTX3070TI suffers just as much as the rather ancient GTX670. What we get is more headroom on the modern card. 

The next graph is really another interpretation of the same FPS data. Now, though, we are looking at frame time as opposed to frames per second. To illustrate: to achieve 25 FPS, we have a time budget of 1/25th of a second per frame. We tend to measure that in milliseconds (where a millisecond is 1/1000th of a second); thus, 25 FPS requires us to render one entire frame every 40 milliseconds (ms).



Here we can see the anticipated trend quite clearly. 

What did we expect?

If the cost of a drawcall dwarfs the cost of triangles, then every extra drawcall will add a more or less fixed cost to the overall frame time. If the triangle count were to have a stronger influence, we'd see more of a curve to the graphs as the influence of the triangles per draw call decreases along with their number.

The drawcall is the dominant factor, though interestingly we see some curvature on the laptop plot.

The curve we see in "Liz's laptop" is initially convex; is this what we expected? Probably so. If the total drawcall cost is the time spent packing the triangles (T) plus the time spent on the rest of the drawcall overhead (D), then initially T+D is steep, but as T decreases and D remains more or less static, we go back to the linear pattern. We can also see a slight kink, suggesting that we may have a sweet spot for this machine where the number of triangles and the drawcall work together optimally.
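To make the T+D reasoning concrete, here is a toy cost model (my own simplification; the constants are placeholders, not measured values):

```cpp
// A toy cost model: total triangles stay fixed, only the slicing varies.
double bodyRenderCostUs(int submeshes, int total_triangles,
                        double d_us,          // fixed overhead per drawcall (D)
                        double t_us_per_tri)  // packing cost per triangle (T)
{
    const double tris_per_call = static_cast<double>(total_triangles) / submeshes;
    return submeshes * (d_us + tris_per_call * t_us_per_tri);
    // = submeshes * d_us + total_triangles * t_us_per_tri
    // The triangle term is constant (it's the same 110K triangles every time),
    // so the cost grows linearly with the submesh count, matching the straight
    // lines in the frame time graph.
}
```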

We see other slight kinks in other graphs. We need to be careful of over-analysing, given the limited sample points along the horizontal axis and those error bars that show quite a high degree of variance in the laptop frames.

Conclusions

Let's use our table from the last blog to examine the typical mesh count for current bodies in use.
Body               | Total faces | Average visible faces | # times slower than best in class (higher is worse)
-------------------|-------------|-----------------------|----------------------------------------------------
Maitreya Lara      | 304         | 230                   | 12.78
Legacy             | 1471        | 340                   | 18.89
Belleza Freya      | 1116        | 190                   | 10.56
SLink HG redux     | 149         | 30                    | 1.67
Inthium Kupra      | 83          | 18                    | 1.00
Signature Geralt   | 903         | 370                   | 8.22
Signature Gianni   | 1159        | 431                   | 9.58
Legacy Male        | 1046        | 174                   | 3.87
Belleza Jake       | 907         | 401                   | 8.91
Aesthetic          | 205         | 205                   | 4.56
SLink Physique BOM | 97          | 45                    | 1.00


The implication is clear: a body that has ten times the number of submeshes will take more or less ten times as long to render. However, we do not walk around as headless, naked bodies (well, most of us don't; never say never in SL), so we need to be far more aware of the number of submeshes in everything we wear. After your body, the next biggest offender is very likely to be your hair. There are many, often very well known, makes of hair that have every lock of hair as a separate mesh.

We need proper, trusted guidance and tools.

Ultimately, there are choices to be made, and the biggest problem here is not the content; it is the lack of good advice on making that content. Where is the wiki page that explains to creators that every submesh that they make adds overhead? 

This is ultimately not the creators' fault; it comes back to the platform's responsibility, inadequate guidance and enforcement, and incorrect metrics (yes, ARC, I'm looking at you!).

Definitions:

BUT: Body Under Test. The specific configuration of our model that is subject to this test.

FPS: Frames Per Second, how many times per second a screen image is rendered. Too slow, and things are laggy and jittery. People get very wrapped up in how many FPS they should/could get. In reality, it depends on what you are doing. However, you'd like to be able to move about with relative smoothness.

Jitter/noise: These are different terms for essentially the same thing: inaccuracies in the measurements that cannot be corrected. Noise and jitter are variances introduced by things outside of the actual measurement. FPS is a noisy metric; it varies quite wildly from frame to frame, but when we average it over a few frames, it becomes more stable.

Appendix A: Is invisible mesh hurting my FPS?

I mentioned in the last blog that the concerns over invisible mesh were largely over-hyped, in large part due to an optimisation introduced by TPVs courtesy of Niran.

To test this, I set half of the faces of a 192x8 body to be transparent and ran a benchmark. I then ran the same benchmark with a 192x4 body. In theory, they should be practically the same.


Results: 

No. As we had hypothesised, there is no perceivable difference at this level between the two. As noted in the earlier blog, we are just measuring the direct rendering impact. There are other indirect impacts, but for now, they are a lesser concern.

Appendix B: Which is better, Separate meshes or multiple faces?

To test whether there was any clear benefit between breaking a mesh up into multiple faces or multiple objects, I ran benchmarks against three models that equated to the same number of submeshes passing through the pipeline:
96x2, 48x4 and 24x8.



Results:

As can be seen, there is no clear benefit. The raw numbers would suggest that the 96x2 was slightly slower. That would be plausible, as there is an expectation of an object having a higher overhead in terms of metadata and other references, but two factors weaken this.
1) The error bars - the variance in the measurements places the numbers close enough for there to be reasonable doubt over any outright difference. 
2) The 24x8 is slower than the 48x4. Once again, well within the observed variance, but it casts further doubt on any argument that there is a significant measurable difference. 

This may be something that I look at again, to see if there is a more conclusive way of conducting this experiment. For the purposes of this blog, which is about determining whether construction choices affect overall performance, it is quite clear that it is the number of submeshes and not their organisation that is the driver.

Saturday 11 September 2021

Why crowds cause lag, why "you" are to blame and how "we" can help fix it.

Everything is slow, and we're to blame...



OK, buckle up; this one is going to be a long one.....

We all know the deal, you go to a shopping event, and you wade through a tar pit of lag until you can click the "render friends only" button and remove all the avatars. 

More Avatars = More Lag

Why do avatars cause so much load?

If you look through the blogs and forums, you'll find that there is much conjecture over the causes, and you can easily find an "expert" who'll explain to you the problem of onion skin meshes, triangle counts, poor LODs, and invisible duplicate meshes to name but the most common. 

As with many myths and pseudo-scientific speculation, there is an air of plausibility, and often enough, a grain of truth. However, while many of these may contribute to lag, the hard experimental evidence points to something so large that it eclipses them all.

We'll examine each of these "usual suspects" as an appendix at the end of this post. For now, let's cut to the chase.

The number one cause of lag is...Alpha cuts

Alpha cuts? You know, that nice little convenience feature, the one that lets you hide parts of your body? The ones where, over the years, people have nagged and pushed for more and more detail in the alpha slices. Every one of those little areas is a "submesh", and (as I'll explain) these are the number one cause of avatar-induced lag, without any shadow of a doubt. Until BOM, of course, it was not a convenience feature; it was the only way to alpha a mesh body, a requirement because of a shortcoming that dates back to the first use of mesh bodies. But these days we have BOM, and for the majority of uses, this same alpha effect can be accomplished with more precision and far more efficiently using an alpha layer.


Why are these specifically such a problem?

Every mesh object in SL is made up of one or more "submeshes". Each submesh represents a texture face on that model (allowing it to be coloured, textured, or made invisible independently from the rest.)

A mesh object can have at most 8 texture faces (submeshes); after that, if we need more independent faces, we have to add another object and start to add faces to that, repeating until we are done. 

In the viewer rendering pipeline, every submesh results in a separate package of data (known as a drawcall) being sent to the GPU (this parcel of goodies gets unwrapped by the GPU and used to draw part of the final image that will appear on your screen).
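Sketched as (pseudo) code, with hypothetical helper names (glDrawElements is the real OpenGL call; everything else is illustrative):

```cpp
// A simplified picture of the per-submesh work in a draw loop.
void drawMeshObject(const MeshObject& obj)
{
    for (const Submesh& sub : obj.submeshes)
    {
        bindTexture(sub);  // per-face texture/material state      (overhead)
        setUniforms(sub);  // colour, alpha mode, skinning matrices (overhead)
        glDrawElements(GL_TRIANGLES, sub.index_count, GL_UNSIGNED_SHORT,
                       sub.index_offset); // the drawcall itself
    }
    // 8 submeshes mean 8 rounds of the setup above; 300 submeshes mean 300,
    // no matter how few triangles each slice actually holds.
}
```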

This is important because "drawcalls" have a substantial overhead. You can picture it like this.

We have a production line in our living room, making little sets of triangles and sending them off to our client (the GPU). We cheerfully pack triangles into a box, along with all the necessary paint and decorations needed to make them look pretty, wrap them up securely, put a bow on top, walk to the post box, and pop the box in the mail to the GPU. 

How long does this take us?

If we break this process down, we find that packing the triangles themselves is remarkably quick; we can pack 10,000 triangles rapidly (for illustration, we'll say 5 seconds). But packing it into the box with all the other paraphernalia and walking to the post box with it takes an awful lot longer (let's say 5 minutes just for this illustration). In fact, it takes so much longer that it doesn't really matter how many triangles we are cramming into the boxes; the time spent dispatching them will dwarf it.

If we had a mesh body of 110,000 triangles to send to the GPU, placing it in one large box will take us:
11 x 5 seconds to pack triangles into a box = 55 seconds (let's call it 1 minute)
  1 x 5 minutes to send that box = 5 minutes
The total time to send our body was 6 minutes.

If instead, we chop up the body and send it out in 220 separate parts:
11 x 5 seconds to pack the triangles = 55 seconds (it is the exact same number of triangles)
220 x 5 minutes = 1100 minutes
Total time to send our body is now 1101 minutes, or 18 hours and 21 minutes.
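If you'd rather let the computer do the arithmetic, here is the same toy model in a few lines (purely illustrative numbers from the analogy above):

```cpp
// The box-packing analogy as a tiny cost function.
#include <iostream>

double totalMinutes(int boxes, int triangles)
{
    const double pack_min = (triangles / 10000.0) * 5.0 / 60.0; // 5s per 10K tris
    return pack_min + boxes * 5.0;                              // 5 min per box
}

int main()
{
    std::cout << totalMinutes(1, 110000)   << " minutes\n"; // ~6 minutes
    std::cout << totalMinutes(220, 110000) << " minutes\n"; // ~1101 minutes
}
```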

That mesh body your avatar is wearing is very likely to be in the render time equivalent of hours rather than minutes. We'll be looking at some "real" numbers next time.

To put this in context, here are some "typical" numbers collected in an in-world survey, jumping around places in SL. 

Body               | Total faces | Average visible faces | # times slower than best in class (higher is worse)
-------------------|-------------|-----------------------|----------------------------------------------------
Maitreya Lara      | 304         | 230                   | 12.78
Legacy             | 1471        | 340                   | 18.89
Belleza Freya      | 1116        | 190                   | 10.56
SLink HG redux     | 149         | 30                    | 1.67
Inthium Kupra      | 83          | 18                    | 1.00
Signature Geralt   | 903         | 370                   | 8.22
Signature Gianni   | 1159        | 431                   | 9.58
Legacy Male        | 1046        | 174                   | 3.87
Belleza Jake       | 907         | 401                   | 8.91
Aesthetic          | 205         | 205                   | 4.56
SLink Ph. Male BOM | 97          | 45                    | 1.00

Notes:
I quote the average number of visible faces instead of the total faces because the drawcall cost of fully transparent "submeshes" is avoided in almost all viewers. This number varies depending on the outfits worn, so it is only fair to judge by the "typical" number visible during our sampling. 

As you can see, we are all paying a remarkably high price for the convenience of alpha slicing.

Bodies are, of course, the number one offender, closely followed by certain brands/styles of hair and heads.

Don't blame the body makers.

Let's be clear about why we have ended up this way: it is the lack of a clear understanding of how bad this was. In this ignorance, many designers, body makers and, most importantly, you and I, the customers, have not only let this continue, we have encouraged it, demanding more cuts, more flexibility. I've seen blog posts and group messages, written out of complete ignorance, asserting that "[body creators] who have forced their users to adopt BOM and removed alpha cuts" had "got it wrong".

No, they got it right; this was entirely what was hoped for; it's just that people were not ready for it, and arguably the tooling and support were not either.

So what now? How do I get a lower lag body? I like my wardrobe 

Write to the CSR of your favourite body and ask for a cut-free edition. 

Let's face facts. We all like our wardrobes, we don't like to give them up, and we don't necessarily need to. The same body, the same weights, can work without the cuts. None of your clothing goes to waste, though you will need alpha masks to take the place of those HUD-based alphas and the "auto-alpha" scripts.

If enough of us ask for an efficient version, then one of two things will happen:
EITHER:
The body makers you contact will provide uncut versions alongside the cut versions and share some of the knowledge as to why the uncut ones are better. 

I sincerely hope that they will; in the end, this ought to be less painful for them - they have to spend a lot of time slicing those models up and tweaking things.

OR:
New bodies will fill the gap for performance, and slowly people will move over. We already have two female cut-free bodies for those willing to move. 

There is, of course, a second part to this. Those lower lag bodies need alpha layers, and clothing creators need to start offering alpha layers; with the recent success of the Kupra body, this is finally happening anyway, and over time it will help make clothes more flexible, as they will no longer have to "cut the cloth" to where the alpha cuts are. For older clothes, it is worth noting that many outfits do not need an alpha, and many more can work with standard off-the-shelf alphas. So all those old outfits are probably fine. What is more, if you place the alpha layers into the outfit folders with the clothing, then they will be automatically worn and replaced.

Finally, remember, this is not do-or-die. We can still choose to wear an older laggy body when we want a specific older outfit that we haven't managed to get an alpha for and for which standard alphas do not work.

This is all very well, Beq, but what about X?

There are undoubtedly a bunch of you going, yep, but my XYZ outfit needs blah and what about this corner case over here?

None of those are going away. You will still have the choices. If there is an apparent reason why you need to have a more complex body, then nobody is stopping you. All I am doing here is waving a red flag to highlight just how much damage this "convenience" function is doing. 

For example, time and again people raise the "oh, but I must have onions cos BOM has no materials" argument. The appendix talks about this. If you need onion skins, you can have them, but if you have them on a 12 mesh body, you'll now have 24 meshes; if you have them on a 240 mesh body, you'll now have 480... kill the cuts. After that, I'll moan about onion skins, but it'll have far fewer teeth :-)

If we can move to a saner "normal" where bodies aren't made up of hundreds of tiny fragments, then we can all afford to have extra onion skins for our materials, etc., without breaking the bank.

Can't the viewer make this better?

Not entirely, no. In the course of investigating this, I have identified a few things that can be optimised. But, even if I make the rendering cost of tiny meshes more efficient, all we do is move the scale a little; things would still be 10 or more times slower, and more importantly, we are still wasting time doing a largely pointless activity.

People who know me will have heard me mutter, "The fastest way to do an unnecessary task is to avoid doing it at all", when looking at optimisation. In fact, an amused friend recently sent me a link to this exact engineering philosophy coming up in a SpaceX interview.

In the long term, the picture does change a bit. One day, we are promised that the viewer will migrate to a more modern pipeline; when we reach that point, the overhead of these so-called "draw calls" will be diminished, and the argument may flip back towards total triangles, etc. 

However, that is not yet. In fact, that is likely to be a good couple of years away, and keep in mind that some people do not even have a machine that can run a modern rendering pipeline. 

To put it bluntly: we can act, all of us, me, you and your friends; we can act now to fix this problem and move to a far saner world where we can go to large events without having to disable all our settings. Or we can wait and hope things get better naturally, before everyone abandons SL for being a laggy swamp and goes someplace else.

Finally, there are a few "nice to haves" that we could really do with: a mix of viewer features and server-side features.

1) An easier way to make alphas; this is something I have proposed to the lab in a Jira feature request but is not currently on the "book of work".

2) We also need a few UI/LSL tweaks to allow HUDs to wear/remove alphas without things like RLV. 

3) Perhaps this is really #1. We need solid, reliable tools that tell us, both as creators and most importantly as users, whether an item is efficient or not.

None of these is required for things to start moving forward, but at the same time, all of these "nice to haves" are entirely doable. Nothing here is out of reach.

Come on then, Beq, prove your point.

OK, so there's a lot of blame being assigned here, mostly to ourselves as users, so I'd better have a good argument to back this up... I think I do, so before you flame me in the comments, let me show you why. My next post will share some empirical data gathered through weeks of performance testing that supports my arguments; I'll show you the extensive testing done and explain how you might try to do something similar to see it for yourself.

UPDATE: Part 2 is now online - warning, statistics ahead: 9/10 of you will be bored, the other half will love it.

Appendix

But what about ...?

Here's a quick appraisal of the typical perspective on lag. The high-level summary is that most of these are valid observations and identify some form of inefficiency. Whether they contribute to FPS loss is another question, and until we get rid of the massive issue around draw calls, their effect is moot.

The poor LODs and poor LOD swapping:

There are several issues here that get conflated, and while each is a valid problem, they do not, on the whole, impact your rendering performance (hear me out on this).

For a long time now, I have been ranting about how poor LODs affect the quality of our lives in SL. This has not changed; moreover, the bugs that affect Rigged mesh rendering mean that they rarely get shown even when a creator has provided good LODs. 

So that's two issues, 
1) Creators do not provide proper LODs 
2) The LODs are not shown. 

Number 1 is moot if number 2 is not fixed, and number 2 is not being fixed by the Lab because of number 1. That is to say, if we make rigged mesh behave as it "should" (thus fixing point 2), then your clothes would vanish very quickly because of point 1, which leads to grumpy people.

We are, as they say, between a rock and a hard place...

But does it affect the FPS? Yes, to some extent. If you have a high detail LOD that is being drawn at a distance where most of the triangles resolve to a small number of pixels, then the GPU has to shade each vertex, and this can mean that a pixel in the final render is shaded more than once; this is known as "overdraw". It means that your GPU is doing a lot more work than it needs to. Simply put, if every pixel on your screen was shaded once and then had to be shaded again, then clearly it takes twice as long as shading just once would. This is a great example of a GPU bottleneck, which we rarely see in SL due to the draw call problems that dwarf everything else. These are real problems; they are just hidden from view right now.

So it's those feet, those invisible feet!!!?

Or increasingly "OMG those lag inducing multi-style hairs". 

You'll see this on many blogs that discuss complex bodies and the impact of rendering, and to be fair, this is entirely plausible, and I have believed it myself. However, tests prove otherwise. 

Due to limitations in the Bento body, we do not have usable toe bones nor the morph targets that would allow us to distort the foot meshes to fit our shoes. As a result, we have multi-pose feet on our bodies. 

Once upon a time, I used to wear SLink single pose feet; I would wear the flat feet with a sandals outfit and the high feet with a high-heels outfit, etc. At some point, market pressure for convenience won out, and body makers started to package all the feet together. Now, when I am wearing my feet, the mesh comprises many feet, all bundled up; if I have 6 poses for my feet, then 5 of these will be fully transparent, leaving just the one set visible.



We also see a similar trend in hair; rigged hairs with "style HUDs" that allow us to alter the appearance.

I am as guilty as anyone of thinking that this causes significant (wasted) effort and thus lag in the viewer. Undoubtedly, there is overhead; let's not ignore that all of this data has to be downloaded, unpacked and held in RAM. This is all wasteful, BUT it is not a significant contributor to the rendering lag, which is what we are focussed upon today. This is primarily because most, if not all, viewers now have an optimisation in place that prevents the rendering of fully transparent meshes (thanks, I believe, to NiranV Dean's nifty optimisation).
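The idea behind that optimisation, sketched as pseudo-C++ (not the actual patch; the names are hypothetical):

```cpp
// A sketch of the idea only: fully transparent submeshes are skipped before
// any drawcall is issued.
for (const Submesh& sub : obj.submeshes)
{
    if (sub.alpha == 0.0f) // face is fully transparent?
        continue;          // no state setup, no drawcall, no GPU work at all
    drawSubmesh(sub);      // only the visible faces pay the drawcall cost
}
```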

A problem for the future? Probably. A serious issue right now... No, there are far bigger fish to fry.

Is it Triangle counts?

OK, this is the high-poly elephant in the room, I guess. 
"OMG, XYZ body is 500,000 triangles. What a nightmare, no wonder it lags me out."

This, as it turns out, is highly subjective, and while triangles ultimately do impose a certain load, it does not at the present time matter anywhere near as much as it should. 

There are so many uninformed, technically inept, or just plain lazy creators out there who throw absurdly dense meshes into SL*, and, as with the poor LODs above, they are undoubtedly a source of lag and of load, causing overdraw and unnecessary, or at least inappropriate, data storage and transfer. This is predominantly going to manifest as GPU load, with a side-helping of RAM pressure and cache misses.

Thus, the full answer is more subtle and depends a little on your computer. In general, if you have a dedicated graphics card (GPU), then the chances are it doesn't matter that much in terms of FPS right now. Remember our box packing antics? We need to pack an awful lot of triangles before it starts to approach the cost of dispatching the box. If you have a machine that relies on the onboard graphics, then the story is slightly different, but even then, the number of triangles (within reason) is not the number one cause of lag. 

I'll illustrate this more clearly when we get to the benchmarks and numbers (which may be tomorrow's post).


* In their defence, creators (clothing creators in particular) are under pressure to deliver new content at a high rate and for surprisingly low returns (for many). If you consider the amount of real-world time that has to go into producing an item, and then consider the Second Life price and shelf life, you can start to appreciate that corners are cut to meet demands. So, once again, we are in part to blame; we, the consumers, do, after all, create the marketplace and the demand. Far too many of us accept inefficient content, and perhaps more importantly, we have limited tools to identify well-made, efficient content.


Is it Onion skin meshes?

"So.. it's those damnable onion skins, I knew it." 
Well, yes and no. The onion skins are indeed one of the contributors, but they are to some extent guilty by association with the actual FPS murderer. Every onion skin layer is a complete copy of the skin, and each of its submeshes is another draw call, so a layer doubles the number of drawcalls. But doubling 6 meshes into 12 is not anything like as much of a problem as doubling 60 into 120, or 120 into 240. For every onion layer you add to a body made of N meshes, you add another N drawcalls.

It is for this reason that people who moan at me, "Oh, but we must support materials, and we need layers", get an unconcerned shrug. If we lived in a world where all our bodies were 10 meshes and some people needed 20, that would be fine. You are not going to notice, especially when most people will have them empty. Moreover, you could have onion layers as an optional wearable, thus avoiding the cost for the majority of us and giving optionality to the others. Ironically, Maitreya detached the onion layers in just this manner and got nothing but grief for it!

End of Appendix.