Load testing ASP.NET Core SignalR

Last time around I messed with SignalR I touched briefly on load testing. This time around I’ll deep dive into SignalR load testing, specifically to test the tool supplied in source, Crankier, and build my own load testing tools to investigate the limits of SignalR applications.

Why?

I have a SignalR-based application I’m building that I intend to gradually test with increasingly large audiences. Prior to these real-human tests I’d like to have confidence in, and an understanding of, the application’s connection limits and latency expectations. An application demo falling over due to load that could have been investigated with robots 🤖 instead of people 🤦‍♀ is an experience I want to skip.

I started load testing SignalR three months ago and it took me down a crazy rabbit hole of learning - this post will summarise both the journey and the findings.

Crankier

Crankier is an ASP.NET Core SignalR port of Crank, which was a load testing tool shipped with ASP.NET SignalR. At the moment the only thing Crankier does is attempt to hold open concurrent connections to a SignalR hub. There’s also a BenchmarkServer which we can use to host a SignalR hub for our load testing purposes.

At the very least, we can clone the aspnetcore repo and run both of these apps as a local test:


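```bash
git clone https://github.com/aspnet/AspNetCore
# The clone directory is named AspNetCore; the casing matters on case-sensitive file systems.
cd AspNetCore/src/SignalR/perf/benchmarkapps
```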

Start the server:


```bash
cd BenchmarkServer
dotnet run
```

Start crankier:


```bash
cd Crankier
dotnet run -- local --target-url http://localhost:5000/echo --workers 10
```

I put the server on an Azure App Service and pointed Crankier at it too. Here are the results:

Server	Max connections
Local	8113
B1 App Service	350
S1 App Service	768
P1V1 App Service	768

These results look alarming, but there’s reasoning behind them. 8113 turns out to be the maximum number of HTTP connections my local machine can make, even to itself. But if that’s the case, why can’t I reach that number with the App Services? The limit on B1 is stated on the Azure subscription service limits page, but the same page notes Unlimited for Web sockets per instance for S1 and P1V1. It turns out (via Azure Support) that 768 is the (undocumented) connection limit per client. I’ll need more clients!
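As a rough sketch of what "more clients" means, the number of clients needed for a given target is a ceiling division by the 768 per-client limit. The function name and the 10000-connection target below are illustrative assumptions, not figures from the tests.

```bash
# Ceiling division: how many clients are needed to reach a target
# connection count when each client is capped at per_client connections.
clients_needed() {
  target=$1
  per_client=$2
  echo $(( (target + per_client - 1) / per_client ))
}

# Hypothetical target of 10000 connections with the 768-per-client limit.
clients_needed 10000 768   # prints 14
```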

Hosting Crankier inside a container

I want to spawn multiple clients to test connection limits, and containers seem like a great way to do this. Sticking Crankier inside a container is pretty easy; here’s the Dockerfile I built to do this. I’ve pushed the image to Docker Hub, so we can skip building it and run it directly. The run command above now becomes:


```bash
docker run staff0rd/crankier --target-url http://myPath/echo --workers 10
```

Using this approach I can push the App Services above 768 concurrent connections, but I still need one client per 768 connections. I want to chase higher numbers, so I’ll swap the App Services out for Virtual Machines, which I’ll directly run the Benchmark server on.

Hosting BenchmarkServer inside a container

Now that I have multiple clients it’s no longer clear how many concurrent connections I’ve reached. I’ll extend the benchmark server to echo some important information every second:

Total time we’ve been running
Current connection count
Peak connection count
New connections in last second
New disconnections in the last second

I’ve forked aspnetcore here to implement the above functionality.

Additionally, I’ll put it inside a container so I don’t have to install any of its dependencies on the VM, like the dotnet sdk. Here’s the Dockerfile, but as usual it’s on docker hub, so we can now start the server like so:


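```bash
# --name gives the container a fixed name; without the flag, docker would treat "crankier" as the image.
docker run -d -p 80:80 --name crankier staff0rd/crankier-server
```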

Container client results

Now we can raise 10-20 (or more!) containers at a time asynchronously with the following command, where XX is incremented each time:


```bash
az container create -g crankier --image staff0rd/crankier --cpu 1 --memory 1 \
    --command-line "--target-url http://myPath/echo --workers 10 --connections 20000" \
    --no-wait --name myContainerXX
```
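Incrementing XX by hand gets tedious, so here is a minimal sketch of a loop that generates the commands. It only prints them (a dry run) so they can be reviewed first; the function name and the zero-padded numbering are my own assumptions, while the flags are the ones used in the command above.

```bash
# Dry run: print one `az container create` command per client container,
# named myContainer01, myContainer02, ... (the numbering scheme is an assumption).
print_container_commands() {
  count=$1
  target_url=$2
  i=1
  while [ "$i" -le "$count" ]; do
    name=$(printf 'myContainer%02d' "$i")
    printf 'az container create -g crankier --image staff0rd/crankier --cpu 1 --memory 1 --command-line "--target-url %s --workers 10 --connections 20000" --no-wait --name %s\n' "$target_url" "$name"
    i=$(( i + 1 ))
  done
}

# Print the commands for three containers.
print_container_commands 3 http://myPath/echo
```

Once the printed commands look right, piping the output to `bash` would run them.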

Over many tests I found that the operating system didn’t seem to make much of a difference, so I stuck with Linux as it’s cheaper. I didn’t detect any difference between running the server inside a container with docker vs running it directly via dotnet installed on the VM, so future tests stick to running the server inside docker only.

Here are the results:

VmSize	Max connections	Time	Os
B2s	64200	15m	Windows Datacenter 2019
B2s	64957	18m	Windows Datacenter 2019
B2s	57436	> 5m	Ubuntu 16.04
B2ms	107944	7m	Ubuntu 16.04 (50+ containers)

Overall these results are a bit lower than I was expecting, and two problems remained:

By default Crankier only holds connections open for 5 minutes before dropping them, so any test running longer than 5 minutes was dropping its connections.
Some containers were maxing out at only 500 concurrent connections. If I raised 10 containers, only 1 or 2 of them would crank past 500.

The first is easily solved by passing --send-duration 10000 to hold connections open for 10000 seconds, but the second would require a move to VMs as clients.

Crankier running on VMs

I found that VMs were much more reliable in bringing up many connections, but my problem was that they weren’t as easily automated as containers. So, I built the automation myself with these scripts:

clientVM.json
- An ARM template that specifies the structure of the VM to bring up per client.
startUpScript.sh
- Install docker on the VM once it’s initialised and docker pull staff0rd/crankier in preparation.
Up.ps1
- Asynchronously raise count VMs of size vmSize
RunCommand.ps1
- Running commands on VMs is not quick, so this script enables faster command running using ssh & powershell jobs. We can use this to send commands to all the VMs and get the result back.

Using the scripts above I quickly found that Azure places a limit of 20 cores per region by default. As a workaround, I raise ten 2-core VMs per region. Here’s an example of raising 30 VMs:


```powershell
.\Up.ps1 -count 10 -location australiaeast
.\Up.ps1 -count 10 -offset 10 -location westus
.\Up.ps1 -count 10 -offset 20 -location eastus
```
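The counts and offsets above follow from simple arithmetic: 20 cores per region divided by 2 cores per VM gives 10 VMs per region, and each region's offset is the running total so far. A small sketch of that calculation, where the function name and output format are assumptions of mine rather than part of the scripts:

```bash
# Split a total VM count across regions given a per-region core limit.
# Prints one line per region: name, VM count, and starting offset.
plan_regions() {
  total=$1
  core_limit=$2
  cores_per_vm=$3
  shift 3
  per_region=$(( core_limit / cores_per_vm ))
  offset=0
  for region in "$@"; do
    remaining=$(( total - offset ))
    if [ "$remaining" -le 0 ]; then
      break
    fi
    count=$per_region
    if [ "$remaining" -lt "$count" ]; then
      count=$remaining
    fi
    echo "$region count=$count offset=$offset"
    offset=$(( offset + count ))
  done
  # Fail loudly if the listed regions cannot hold every VM.
  if [ "$offset" -lt "$total" ]; then
    echo "not enough regions: $(( total - offset )) VMs unplaced" >&2
    return 1
  fi
}

plan_regions 30 20 2 australiaeast westus eastus
# australiaeast count=10 offset=0
# westus count=10 offset=10
# eastus count=10 offset=20
```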

I can monitor the progress of bringing the VMs up with Get-Job and Get-Job | Receive-Job. Once the jobs are completed I can clear them with Get-Job | Remove-Job. Because the VMs are all brought up asynchronously it takes about 5 minutes total to bring them all up. After they’re up, we can send commands to them:


```powershell
.\RunCommand.ps1 -command "docker run --name crankier -d staff0rd/crankier --send-duration 10000 --target-url http://mypath/echo --connections 10000 --workers 20"
```

If we’ve set the client’s target-url correctly, we should now see the server echoing the incoming connections:


```yaml
[00:00:00] Current: 178, peak: 178, connected: 160, disconnected: 0, rate: 160/s
[00:00:02] Current: 432, peak: 432, connected: 254, disconnected: 0, rate: 254/s
[00:00:02] Current: 801, peak: 801, connected: 369, disconnected: 0, rate: 369/s
[00:00:03] Current: 1171, peak: 1171, connected: 370, disconnected: 0, rate: 370/s
[00:00:05] Current: 1645, peak: 1645, connected: 474, disconnected: 0, rate: 474/s
[00:00:05] Current: 2207, peak: 2207, connected: 562, disconnected: 0, rate: 562/s
[00:00:06] Current: 2674, peak: 2674, connected: 467, disconnected: 0, rate: 467/s
[00:00:08] Current: 3145, peak: 3145, connected: 471, disconnected: 0, rate: 471/s
[00:00:08] Current: 3747, peak: 3747, connected: 602, disconnected: 0, rate: 602/s
[00:00:10] Current: 4450, peak: 4450, connected: 703, disconnected: 0, rate: 703/s
```
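Scanning these lines by eye works for a short run, but for a saved log a small filter helps. The sketch below assumes the log format shown above and pulls out the highest peak and the highest connection rate; the function name is hypothetical.

```bash
# Read server log lines on stdin and report the highest "peak" value
# and the highest "rate" value seen.
summarise_server_log() {
  awk '
    match($0, /peak: [0-9]+/) {
      peak = substr($0, RSTART + 6, RLENGTH - 6) + 0
      if (peak > max_peak) max_peak = peak
    }
    match($0, /rate: [0-9]+/) {
      rate = substr($0, RSTART + 6, RLENGTH - 6) + 0
      if (rate > max_rate) max_rate = rate
    }
    END { printf "peak=%d max_rate=%d\n", max_peak, max_rate }
  '
}

summarise_server_log <<'EOF'
[00:00:02] Current: 432, peak: 432, connected: 254, disconnected: 0, rate: 254/s
[00:00:02] Current: 801, peak: 801, connected: 369, disconnected: 0, rate: 369/s
EOF
# prints: peak=801 max_rate=369
```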

Monitoring client VM connections

RunCommand.ps1 lets us send any command we like to every VM, so we can use docker logs to get the last line logged from every VM to monitor their status:


```powershell
.\RunCommand.ps1 -command "docker logs --tail 1 crankier"
```

Output:


```bash
{"ConnectingCount":10,"ConnectedCount":8038,"DisconnectedCount":230,"ReconnectingCount":0,"FaultedCount":34,"TargetConnectionCount":10000,"PeakConnections":8038}
{"ConnectingCount":10,"ConnectedCount":8026,"DisconnectedCount":211,"ReconnectingCount":0,"FaultedCount":34,"TargetConnectionCount":10000,"PeakConnections":8026}
{"ConnectingCount":10,"ConnectedCount":7984,"DisconnectedCount":187,"ReconnectingCount":0,"FaultedCount":32,"TargetConnectionCount":10000,"PeakConnections":7986}
...
```
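To turn the per-VM status lines into a single number, the ConnectedCount values can be summed. This is a minimal sketch that assumes the output has been saved or piped as one JSON object per line, as shown above; the function name is hypothetical, and it uses grep and awk rather than a JSON parser to avoid extra dependencies.

```bash
# Sum the ConnectedCount field across status lines read from stdin.
# The leading quote in the pattern keeps DisconnectedCount from matching.
total_connected() {
  grep -o '"ConnectedCount":[0-9]*' | awk -F: '{ sum += $2 } END { print sum + 0 }'
}

total_connected <<'EOF'
{"ConnectingCount":10,"ConnectedCount":8038,"DisconnectedCount":230,"ReconnectingCount":0,"FaultedCount":34,"TargetConnectionCount":10000,"PeakConnections":8038}
{"ConnectingCount":10,"ConnectedCount":8026,"DisconnectedCount":211,"ReconnectingCount":0,"FaultedCount":34,"TargetConnectionCount":10000,"PeakConnections":8026}
EOF
# prints 16064
```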

Here’s an example of killing the containers:


```powershell
.\RunCommand.ps1 -command "docker rm -f crankier"
```

Results

Over the last three months I’ve raised ~980 VMs, slowly enhancing how I test and capture data. The lines below represent some of those tests; the later ones also include the full log of the test.

Standard_D2s_v3 server

Time from start	Peak connections	Logs
15:35	93,100
07:38	100,669
24:16	91,541
24:04	92,506	https://pastebin.com/QPLgDeZt
07:54	100,730	https://pastebin.com/FB9skzJE
13:31	91,541	https://pastebin.com/sDLdm0bh

Average 80% CPU/RAM

Standard_D8s_v3 server

Time from start	Peak connections	Logs
02:34	107,564
05:55	111,665
03:43	132,175
25:33	210,746
13:03	214,025	https://pastebin.com/wkttPAaS

Average 40% CPU/RAM

Standard_D32s_v3 server

Time from start	Peak connections	Logs
11:05	236,906	https://pastebin.com/mm3RZM1y
10:28	245,217	https://pastebin.com/6kAPJB9R

Average 20% CPU/RAM

The logs tell an interesting story, including the limits on new connections per second, and how long it takes before Kestrel starts slowing down with:

Heartbeat took longer than “00:00:01” at “05/21/2019 09:37:19 +00:00”

and the time until SignalR starts throwing the following exception (GitHub issue - possible fix),

Failed writing message. Aborting connection.

System.InvalidOperationException: Writing is not allowed after writer was completed

Findings

My original target was guided by this tweet from April 2018, which suggests 236k concurrent connections at 9.5GB. From the tests above it doesn’t look like ASP.NET Core SignalR is currently (dotnet 3.0.100-preview6-011744) capable of such a number at such low memory. B2ms, which has 8GB, peaked at 107k, and D2s_v3 was similar. However, with D8s_v3 and D32s_v3 peaking at 214k and 245k respectively, it’s clear that CPU and memory are not currently the limiting factor. With the tools I’ve created to automate the deployment of both server and clients, it will be relatively trivial to re-test once .NET Core 3 reaches RTM.
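To put the memory comparison in per-connection terms, here is a back-of-the-envelope sketch. It assumes GB means GiB and, for the B2ms case, divides the VM's total RAM by the peak, so that figure is an upper bound rather than measured process memory. The function name is hypothetical.

```bash
# Approximate memory per connection in KiB (integer division),
# given memory in MiB and a connection count.
kib_per_connection() {
  mem_mib=$1
  connections=$2
  echo $(( mem_mib * 1024 / connections ))
}

kib_per_connection 9728 236000   # tweet: 9.5GB at 236k connections, prints 42
kib_per_connection 8192 107944   # B2ms: 8GB total RAM at its peak, prints 77
```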

Taking it further

I’ve sunk quite a bit of time into this load testing project. It’s resulted in three new containers and a few merges to aspnet/aspnetcore. Even so, there are still things to do.

The functionality from the forked BenchmarkServer should instead be moved into Crankier itself. The server logs are missing important metrics, namely total CPU and memory usage; however, there doesn’t seem to be a nice way to grab these in current .NET Core (currently I monitor top in another ssh session to the server). Finally, one could leverage Application Insights and, along with echoing to stdout, also push telemetry to App Insights via TelemetryClient. This would result in pleasant graphs and log querying rather than pastebin log dumps.

A final note

Having become acquainted with Crankier I can appreciate its limits. The current implementation only tests concurrent connections and not messaging between client and server, and without messaging it does not reflect “real” load on a SignalR application. To test your own application, not only should messaging be tested, but specifically the messaging that your Hub implementations expect. Instead of extending Crankier to test your own Hub methods, it’s much easier to use Microsoft.AspNetCore.SignalR.Client to write your own class that uses HubConnection to call your application’s Hub methods directly, acting as an automated user specific to your application.

Chasing concurrent connection counts in this manner has been fun, but doesn’t reflect what production should look like. Ramping VM size to achieve higher connection counts ignores that one VM is one single point of failure for your application. In production, using something like Azure SignalR Service would be a better approach to scaling concurrent connections.