Monday 13 September 2021

Find me some body to love...benchmarking your lagatar.

This is essentially part 2 of the "why we need to get rid of the segmented bodies." blog.

Hypothesis - Mesh segmentation leads to significant rendering performance issues.

Before we start, just a heads up, this part is the data dump. It's all about the process of gathering data. As such it is somewhat less accessible than the last one. 

Still here? Enjoy.

A few months ago, I decided to quantify the real cost of sliced up bodies. Initially, I did some simple side-by-side tests in-world.

The first attempts were compelling but unsatisfactory. Using an alt, I ran some initial tests in a region in New Babbage that happened to be empty. I de-rendered everything, then had Beq TP in wearing my SLink Redux body. I recorded for a few minutes, sent her away, let things return to normal, then had Liz TP in wearing her Maitreya body.

The results were quite stark. Running with just my Alt alone (the baseline), I saw 105 FPS (remember, this is basically an empty region). With SLink Redux, it dipped a little then recovered to 104FPS. With Maitreya, it dropped to 93FPS.

So this was a good start, but I wanted something a bit more robust and repeatable. Moreover, I wanted to test the principle. This is not about pointing out "X body is good, and Y body is bad"; it is about demonstrating why design choices affect things.

I needed to test things rigorously and in isolation. This meant using a closed OpenSim grid where I had full control of any external events. It also meant I needed to get test meshes that behaved the same way as proprietary bodies. 

Testing proprietary bodies against one another is problematic. 

  1. They won't rez (typically). You need to get lots of friends with specific setups.
  2. If they did rez, they are mostly too complex for SL Animesh constraints (100K tris)
  3. Bodies vary in construction, # meshes, # triangles, with and without feet etc. Making it less clear what is driving the results.
  4. Being proprietary, you can't test outside of SL either, of course, which means you are then exposed to SL randomness (people coming and going - I don't have the luxury of my own region) 

So I asked my partner Liz (polysail) to make me a custom mesh that we could adapt to our needs, and thus SpongeBlobSquareBoobs was born.



"SpongeBlob" is a headless rigged mesh body that consists of 110,000 triangles. Why 110K? It is the upper end of what can be uploaded into SL/OpenSim, given the face/triangle/vertex limits. Body triangle counts are harder to average because some have feet/hands attached; others do not. Another reason why we wanted to have a completely independent model.

The coloured panels shown in this photo are vertex colours (i.e. not textures) randomly assigned to each submesh. This picture is most likely the 192 mesh x 8 face "worst case" test model. We used no textures so that the texture bind cost was not part of this test (that's a different experiment for another day, perhaps)

The single most important fact to keep in mind when you read through this data is:

    Every single SpongeBlob is the same 110K triangles. They vary only by how they are sliced.

Apparatus and setup

So if SpongeBlob gives us the "Body Under Test" (BUT), what are we testing with?

Data Recording

The data is recorded using Tracy, a profiling tool available to anyone who can self-compile a viewer. It works by recording how long certain code sections take (much like the "fast timers" you see in the normal viewer's developer menu). This data gets streamed to a "data capture program" that runs locally (same machine or same LAN). The capture program or another visualiser tool can then be used to explore the data. I recorded things like the DrawCall time, though once we understand how the pipeline works, all we really need is the FPS, as I'll explain later, so you could use any FPS tool if you want to try this yourself in a simpler form.

Environment and noise control

The accuracy of the tests relies on removing as much noise as we can. We all know that SL framerates are jittery, so we do our best to stabilise things by removing as much untested noise as possible. 

To this end, I used an OpenSim system (I used the DreamGrid windows setup as it is extremely quick and easy to set up). With my own OpenSim grid, running on an older PC, I created a 256x256 region with no neighbours. This means I have an SL-like region size, and I have removed any potential viewer overhead of managing connections to multiple regions.

The region was left empty, no static scenery was used, meaning that the region rendering overhead was constrained pretty much to just the land, sea and sky.

Settings

The plan was to record using several different machines of varying capabilities, so I made sure to keep the settings as similar as possible across those. 

We are interested in the rendering costs of different body "configurations", and these are only comparable in the same context (i.e. on the same hardware). Still, we'd like to look for trends, similarities, and differences across different hardware setups, so I tried to ensure that I used the same core settings. The key ones are as follows:-

FPS limiting off - clearly...

Shadows (sun/moon & local) - This deliberately increases the render load and helps lift the results above the measurement jitter.

Midday - Are there implications if the shadows are longer? Let's avoid that by fixing the sun location.

Max-Nonimposters - unlimited. This ensures we don't impostor any of the tests.

ALM on - we want materials to be accounted for even though we are not using them. It ought not affect our results, really.

Camera view - I needed to ensure that I was rendering the same scene. To achieve this, I used a simple rezzing system that Liz and I have tweaked over time. It uses a simple HUD attachment on the observer that controls the camera. A controller "cube" sends a command to the HUD telling it where to position the camera and what direction to point in. 

Test Setup

Each test involves rezzing a fixed set of BUTs (16) in a small grid. These cycle through random animations. The controller cube that is used to position the camera is also responsible for rezzing the BUTs. Every time the cube is long-clicked, it will delete the existing BUTs and rez the next set.

Each avatar model is an Animesh. This full test cannot be run in SL due to the Second Life limit of 100K triangles. Using Animesh removes any other potential implications to the rendering caused by being an actual avatar agent.

This is a typical view being recorded.


Consistency and repeatability

It was important to remove as many errors as possible, so scripting things like the rezzing and camera made a lot of sense. We also made sure that the viewer was restarted between each test of a given BUT.

Tests were run for at least 5 minutes, and I would exclude the first 2 minutes to ensure that all the body parts had been downloaded, rezzed and cached as appropriate. There are implications to the slicing of bodies that alter the initial load and rendering time (you see this with the floating clouds of triangles when you TP to a busy store/region), but this is not what we are testing.

Hardware

Running the tests on a single machine tells us that the findings apply to that machine, and within reason, we can extend the conclusion across all machines in the same or similar class. But, of course, in Second Life, we have a wide range of machines and environments. So it was important to us to get as much data as we could. 

We thus ran the tests across various machines that we have access to. 
As a developer, most of my machines, even older ones, tend to have been "high end" in their day. So we should note that potential bias when drawing conclusions.

Here is the list of hardware tested along with the "Code names."


Methodology and Test Runs

Using the above setup, we would run through a specific set of configurations. Those were as follows.


The baseline test is simply an empty scene. Thus we establish the cost of rendering the world and any extraneous things; this includes any cost to having the observing avatar present.

You can see that every mesh has the same number of triangles but is split into more and more objects. Once we reach 192 objects, we continue scaling using multiple texture faces (thus creating submeshes). 

I will include in an appendix a test that shows the broad equivalence of submeshes versus actual meshes. There is no appreciable benefit to one as opposed to the other in these tests (there may be other implications we do not investigate)

By changing the number of meshes and faces, we are scaling up the number of submeshes that the pipeline has to deal with and thus the number of drawcalls. If you remember the analogy I gave in the first part of this blog, you'll recall that the hypothesis is that the process of parcelling up all the contextual information for drawing a set of triangles far outweighs the time spent processing the triangles alone.

If this hypothesis is correct, we will see a decline in FPS as the number of submeshes increases. As we reduce the number of triangles in each call, we also demonstrate that the number of triangles is not dominant. 

Results

So what did we find?

The first graph I will share is the outright FPS plotted against the total submeshes in the scene.



This graph tells us a few things. 
1) The raw compute power of a high-end machine such as the "beast" is very quickly cut down to size by the drawcall overhead.
2) That the desktop machines with their dedicated GPUs have a similar profile
3) The laptops, lacking a discrete, dedicated GPU, are harder to see.

If we normalise the data by making the FPS a percentage of the baseline FPS for that machine, we will rescale vertically and hopefully have a clearer view of the lower end data.


This is very interesting (well, to a data nerd like me). 
We can see that the profiles of all the machines tested are similar, suggesting that the impact is across the board.
We can also see that the laptops continue to be segregated from the desktops. The impact of the drawcalls, while pronounced and clearly disruptive, is not as extreme as for the dedicated GPUs. This would seem to support the hypothesis that those machines with onboard graphics are additionally penalised by the triangles giving the graph that vertical offset from the rest. As we have not explicitly measured this, we cannot draw too much from this, but there is clearly pressure on those less powerful machines. 

What may be surprising to some and is certainly interesting is that all the desktops are impacted similarly. The shiny new RTX3070TI suffers just as much as the rather ancient GTX670. What we get is more headroom on the modern card. 

The next graph is really another interpretation of the same FPS data. Now though, we are looking at the frame time as opposed to frames per second. To illustrate this, to achieve 25FPS, we have a time budget of 1/25th of a second per frame. We tend to measure that in milliseconds (where a millisecond is 1/1000th of a second); thus, 25fps requires us to render one entire frame every 40 milliseconds (ms).



Here we can see the anticipated trend quite clearly. 

What did we expect?

If the cost of a drawcall dwarfs the cost of triangles, then every extra drawcall will add a more or less fixed cost to the overall frame time. If the triangle count were to have a stronger influence, we'd see more of a curve to the graphs as the influence of the triangles per draw call decreases along with their number.

The drawcall is the dominant factor though interestingly, we see some curvature on the laptop plot.

The curve we see in "Liz's laptop" is initially convex; is this what we expected? Probably so. If the total drawcall cost is the time spent packing the triangles (T) plus the time spent on the rest of the drawcall overhead (D), then initially T+D is steep, but as T decreases and D remains more or less static, we go back to the linear pattern. We can also see a slight kink, suggesting that we may have a sweet spot for this machine where the number of triangles and the drawcall work together optimally.

We see other slight kinks in other graphs. We need to be careful of over-analysing, given the limited sample points along the horizontal axis and those error bars that show quite a high degree of variance in the laptop frames.

Conclusions

Let's use our table from the last blog to examine the typical mesh count for current bodies in use.
BodyTotal facesaverage visible faces# times slower than best in class (higher is worse)
Maitreya Lara30423012.78
Legacy147134018.89
Belleza Freya111619010.56
SLink HG redux149301.67
Inthium Kupra83181.00
Signature Geralt9033708.22
Signature Gianni11594319.58
Legacy Male10461743.87
Belleza Jake9074018.91
Aesthetic2052054.56
SLink Physique BOM97451.00


The implication is clear. A body that has ten times the number of submeshes will take more or less ten times as long to render. However, we do not walk around as headless naked bodies (well, most of us don't - never say never in SL), but we need to be far more aware of the number of submeshes in the items we wear. After your body, the next biggest offender is very likely to be your hair. There are many, often very well known, makes of hair that have every lock of hair as a separate mesh. 

We need proper, trusted guidance and tools.

Ultimately, there are choices to be made, and the biggest problem here is not the content; it is the lack of good advice on making that content. Where is the wiki page that explains to creators that every submesh that they make adds overhead? 

This is ultimately not the creators' fault; it comes back to the platform's responsibility, inadequate guidance and enforcement, and incorrect metrics (yes, ARC . I'm looking at you!). 

Definitions:

BUT: Body Under Test, The specific configuration of our model that is subject to this test.

FPS: Frame Per Second, how many times per second a screen image is rendered. Too slow, and things are laggy and jittery. People get very wrapped up in how many FPS they should/could get. In reality, it depends on what you are doing. However, you'd like to be able to move about with relative smoothness. 

Jitter/noise: These are different terms for essentially the same thing, inaccuracies in the measurements that cannot be corrected. Noise and Jitter are variances introduced by things outside of the actual measurement. FPS is a noisy metric, it varies quite wildly from frame to frame, but when we average it over a few, it becomes more stable. 

Appendix A: Is invisible mesh hurting my FPS?

I mentioned in the last blog that the concerns over invisible mesh were largely over-hyped, in large part due to an optimisation introduced by TPVs courtesy of Niran.

To test this, I set half of the faces of a 192x8 body to be transparent and ran a benchmark. I then ran the same benchmark with a 192x4 body. In theory, they should be practically the same.


Results: 

No, as we had hypothesised, there is no perceivable difference at this level between the two. As noted in the earlier blog, we are just measuring the direct rendering impact. There are other indirect impacts, but for now, they are a lesser concern.

Appendix B: Which is better, Separate meshes or multiple faces?

To test whether there was any clear benefit between breaking a mesh up into multiple faces or multiple objects, I ran benchmarks against three models that equated to the same number of submeshes passing through the pipeline. 
96x2 48x4 and 24x8.



Results:

As can be seen, there is no clear benefit. The raw numbers would suggest that the 96x2 was slightly slower. That would be plausible as there is an expectation of an object having a higher overhead in terms of metadata and other references, but two factors weaken this. 
1) The error bars - the variance in the measurements places the numbers close enough for there to be reasonable doubt over any outright difference. 
2) The 24x8 is slower than the 48x4. Once again, well within the observed variance, but it casts further doubt on any argument that there is a significant measurable difference. 

This may be something that I look at again to see if there is a more conclusive way of conducting this experiment. For the purposes of this blog, which is for determining whether the construction choices affect the overall performance, it is quite clear that it is the number of submeshes and not their organisation that is the driver.

No comments:

Post a Comment