Load testing ASP.NET Core SignalR
The last time I messed with SignalR I touched briefly on load testing. This time I'll dive deeper into SignalR load testing: specifically, trying out Crankier, the load testing tool supplied in the aspnetcore source, and building my own load testing tools to investigate the limits of SignalR applications.
Why?
I have a SignalR-based application I’m building that I intend to gradually test with increasing-sized audiences of people. Prior to these real-human tests I’d like to have confidence and an understanding of what the connection limits and latency expectations are for the application. An application demo falling over due to load that could have been investigated with robots 🤖 instead of people 🤦♀ is an experience I want to skip.
I started load testing SignalR three months ago and it took me down a crazy rabbit hole of learning - this post will summarise both the journey and the findings.
Crankier
Crankier is an ASP.NET Core SignalR port of Crank, which was a load testing tool shipped with ASP.NET SignalR. At the moment the only thing Crankier does is attempt to hold open concurrent connections to a SignalR hub. There’s also a BenchmarkServer which we can use to host a SignalR hub for our load testing purposes.
At the very least, we can clone the aspnetcore repo and run both of these apps as a local test:
git clone https://github.com/aspnet/AspNetCore
cd aspnetcore/src/SignalR/perf/benchmarkapps
Start the server:
cd BenchmarkServer
dotnet run
Start crankier:
cd Crankier
dotnet run -- local --target-url http://localhost:5000/echo --workers 10
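For reference, the /echo endpoint these commands point at is a SignalR hub mapped by the BenchmarkServer. A minimal sketch of an equivalent hub and its mapping on ASP.NET Core 3.0 might look like the following; the class name, method name, and Startup wiring here are illustrative assumptions, not the actual BenchmarkServer code:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.SignalR;
using Microsoft.Extensions.DependencyInjection;

// A bare-bones echo hub; clients call Echo and receive the same payload back.
public class EchoHub : Hub
{
    public Task Echo(string message) =>
        Clients.Caller.SendAsync("echo", message);
}

public class Startup
{
    public void ConfigureServices(IServiceCollection services) =>
        services.AddSignalR();

    public void Configure(IApplicationBuilder app)
    {
        app.UseRouting();
        // Exposes the hub at /echo, matching the --target-url used above.
        app.UseEndpoints(endpoints => endpoints.MapHub<EchoHub>("/echo"));
    }
}
```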
I also put the server on an Azure App Service and pointed Crankier at it. Here are the results:
Server | Max connections |
---|---|
Local | 8113 |
B1 App Service | 350 |
S1 App Service | 768 |
P1V1 App Service | 768 |
These results look alarming, but there's reasoning behind them. 8113 turns out to be the maximum number of HTTP connections my local machine can make, even to itself. But if that's the case, why can't I reach that number against the App Services? The B1 limit is stated in the Azure subscription service limits, but the same page lists Web sockets per instance as "Unlimited" for S1 and P1V1. It turns out (via Azure Support) that 768 is an undocumented connection limit per client. I'll need more clients!
Hosting Crankier inside a container
I want to spawn multiple clients to test connection limits, and containers seem like a great way to do this. Sticking Crankier inside a container is pretty easy; here's the Dockerfile I built to do it. I've pushed the image to Docker Hub, so we can skip building it and run it directly. The run command above now becomes:
docker run staff0rd/crankier --target-url http://myPath/echo --workers 10
Using this approach I can push the App Services above 768 concurrent connections, but I still need one client per 768 connections. I want to chase higher numbers, so I’ll swap the App Services out for Virtual Machines, which I’ll directly run the Benchmark server on.
Hosting BenchmarkServer inside a container
Now that I have multiple clients it’s no longer clear how many concurrent connections I’ve reached. I’ll extend the benchmark server to echo some important information every second:
- Total time we’ve been running
- Current connection count
- Peak connection count
- New connections in last second
- New disconnections in the last second
I’ve forked aspnetcore here to implement the above functionality.
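As a rough sketch of how that per-second logging can be wired up (illustrative only, not the exact code from the fork; the class names and locking approach are my own):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Shared singleton; the hub calls Connected()/Disconnected() from its
// OnConnectedAsync/OnDisconnectedAsync overrides.
public class ConnectionCounter
{
    private readonly object _lock = new object();
    private int _current, _peak, _connected, _disconnected;

    public void Connected()
    {
        lock (_lock)
        {
            _current++;
            _connected++;
            if (_current > _peak) _peak = _current;
        }
    }

    public void Disconnected()
    {
        lock (_lock)
        {
            _current--;
            _disconnected++;
        }
    }

    // Snapshot the totals and reset the per-second counters.
    public (int Current, int Peak, int Connected, int Disconnected) Reset()
    {
        lock (_lock)
        {
            var snapshot = (_current, _peak, _connected, _disconnected);
            _connected = 0;
            _disconnected = 0;
            return snapshot;
        }
    }
}

// Writes one line per second; registered with services.AddHostedService<CounterLogger>().
public class CounterLogger : BackgroundService
{
    private readonly ConnectionCounter _counter;
    private readonly DateTime _started = DateTime.UtcNow;

    public CounterLogger(ConnectionCounter counter) => _counter = counter;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);
            var (current, peak, connected, disconnected) = _counter.Reset();
            var elapsed = DateTime.UtcNow - _started;
            Console.WriteLine(
                $"[{elapsed:hh\\:mm\\:ss}] Current: {current}, peak: {peak}, " +
                $"connected: {connected}, disconnected: {disconnected}, rate: {connected}/s");
        }
    }
}
```

Registering `ConnectionCounter` as a singleton and `CounterLogger` as a hosted service keeps the counting out of the hot path apart from a brief lock on connect and disconnect.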
Additionally, I’ll put it inside a container so I don’t have to install any of its dependencies on the VM, like the dotnet sdk. Here’s the Dockerfile, but as usual it’s on docker hub, so we can now start the server like so:
docker run -d -p 80:80 --name crankier staff0rd/crankier-server
Container client results
Now we can raise 10-20 (or more!) containers at a time asynchronously with the following command, where `XX` is incremented each time:
az container create -g crankier --image staff0rd/crankier --cpu 1 --memory 1 \
--command-line "--target-url http://myPath/echo --workers 10 --connections 20000" \
--no-wait --name myContainerXX
Over many tests I found that the operating system didn't seem to make much of a difference, so I stuck with Linux as it's cheaper. I also didn't detect any difference between running the server inside a container with Docker and running it directly via dotnet installed on the VM, so future tests run the server inside Docker only.
Here’s the results:
VM size | Max connections | Time | OS |
---|---|---|---|
B2s | 64200 | 15m | Windows Datacenter 2019 |
B2s | 64957 | 18m | Windows Datacenter 2019 |
B2s | 57436 | > 5m | Ubuntu 16.04 |
B2ms | 107944 | 7m | Ubuntu 16.04 (50+ containers) |
Overall these results are a bit lower than I was expecting, and two problems remained:
- By default Crankier only holds connections open for 5 minutes before dropping them, so any test running longer than 5 minutes was dropping its connections; and
- Some containers maxed out at only 500 concurrent connections. If I raised 10 containers, only 1 or 2 of them would crank past 500.

The first problem is easily solved by passing `--send-duration 10000` to hold connections open for 10,000 seconds, but the second required a move to VMs as clients.
Crankier running on VMs
I found that VMs were much more reliable at bringing up many connections, but they weren't as easily automated as containers. So I built the automation myself with these scripts:
- clientVM.json
  - An ARM template that specifies the structure of the VM to bring up per client.
- startUpScript.sh
  - Installs docker on the VM once it's initialised and runs `docker pull staff0rd/crankier` in preparation.
- Up.ps1
  - Asynchronously raises `count` VMs of size `vmSize`.
- RunCommand.ps1
  - Running commands on VMs is not quick, so this script enables faster command execution using ssh and PowerShell jobs. We can use it to send commands to all the VMs and get the results back.
Using the scripts above I quickly found that Azure places a limit of 20 cores per region by default. As a workaround, I raise ten 2-core VMs per region. Here’s an example of raising 30 VMs:
.\Up.ps1 -count 10 -location australiaeast
.\Up.ps1 -count 10 -offset 10 -location westus
.\Up.ps1 -count 10 -offset 20 -location eastus
I can monitor the progress of bringing the VMs up with `Get-Job` and `Get-Job | Receive-Job`. Once the jobs are completed I can clear them with `Get-Job | Remove-Job`. Because the VMs are all brought up asynchronously, it takes about 5 minutes in total to bring them all up. After they're up, we can send commands to them:
.\RunCommand.ps1 -command "docker run --name crankier -d staff0rd/crankier --send-duration 10000 --target-url http://mypath/echo --connections 10000 --workers 20"
If we've set the client's `target-url` correctly, we should now see the server echoing the incoming connections:
[00:00:00] Current: 178, peak: 178, connected: 160, disconnected: 0, rate: 160/s
[00:00:02] Current: 432, peak: 432, connected: 254, disconnected: 0, rate: 254/s
[00:00:02] Current: 801, peak: 801, connected: 369, disconnected: 0, rate: 369/s
[00:00:03] Current: 1171, peak: 1171, connected: 370, disconnected: 0, rate: 370/s
[00:00:05] Current: 1645, peak: 1645, connected: 474, disconnected: 0, rate: 474/s
[00:00:05] Current: 2207, peak: 2207, connected: 562, disconnected: 0, rate: 562/s
[00:00:06] Current: 2674, peak: 2674, connected: 467, disconnected: 0, rate: 467/s
[00:00:08] Current: 3145, peak: 3145, connected: 471, disconnected: 0, rate: 471/s
[00:00:08] Current: 3747, peak: 3747, connected: 602, disconnected: 0, rate: 602/s
[00:00:10] Current: 4450, peak: 4450, connected: 703, disconnected: 0, rate: 703/s
Monitoring client VM connections
`RunCommand.ps1` lets us send any command we like to every VM, so we can use `docker logs` to get the last line logged from every VM and monitor their status:
.\RunCommand.ps1 -command "docker logs --tail 1 crankier"
Output:
{"ConnectingCount":10,"ConnectedCount":8038,"DisconnectedCount":230,"ReconnectingCount":0,"FaultedCount":34,"TargetConnectionCount":10000,"PeakConnections":8038}
{"ConnectingCount":10,"ConnectedCount":8026,"DisconnectedCount":211,"ReconnectingCount":0,"FaultedCount":34,"TargetConnectionCount":10000,"PeakConnections":8026}
{"ConnectingCount":10,"ConnectedCount":7984,"DisconnectedCount":187,"ReconnectingCount":0,"FaultedCount":32,"TargetConnectionCount":10000,"PeakConnections":7986}
...
Here’s an example of killing the containers:
.\RunCommand.ps1 -command "docker rm -f crankier"
Results
Over the last three months I've raised ~980 VMs, slowly refining how I test and capture data. The rows below represent some of those tests; the later ones also include the full log of the test.
Standard_D2s_v3 server
Time from start | Peak connections | Logs |
---|---|---|
15:35 | 93,100 | |
07:38 | 100,669 | |
24:16 | 91,541 | |
24:04 | 92,506 | https://pastebin.com/QPLgDeZt |
07:54 | 100,730 | https://pastebin.com/FB9skzJE |
13:31 | 91,541 | https://pastebin.com/sDLdm0bh |
Average 80% CPU/RAM
Standard_D8s_v3 server
Time from start | Peak connections | Logs |
---|---|---|
02:34 | 107,564 | |
05:55 | 111,665 | |
03:43 | 132,175 | |
25:33 | 210,746 | |
13:03 | 214,025 | https://pastebin.com/wkttPAaS |
Average 40% CPU/RAM
Standard_D32s_v3 server
Time from start | Peak connections | Logs |
---|---|---|
11:05 | 236,906 | https://pastebin.com/mm3RZM1y |
10:28 | 245,217 | https://pastebin.com/6kAPJB9R |
Average 20% CPU/RAM
The logs tell an interesting story, including the limits on new connections per second, and how long it takes before Kestrel starts slowing down with:
Heartbeat took longer than “00:00:01” at “05/21/2019 09:37:19 +00:00”
and the time until SignalR starts throwing the following exception (GitHub issue - possible fix):
Failed writing message. Aborting connection.
System.InvalidOperationException: Writing is not allowed after writer was completed
Findings
My original target was guided by this tweet from April 2018, which suggests 236k concurrent connections at 9.5GB. From the tests above it doesn't look like ASP.NET Core SignalR is currently (dotnet `3.0.100-preview6-011744`) capable of such a number at such low memory. `B2ms`, which has 8GB, peaked at 107k, with `D2s_v3` similar. However, with `D8s_v3` and `D32s_v3` peaking at 214k and 245k respectively, it's clear that CPU and memory are not currently the limiting factor. With the tools I've created to automate deployment of both the server and the clients, it will be relatively trivial to re-test once .NET Core 3.0 reaches RTM.
Taking it further
I've sunk quite a bit of time into this load testing project. It's resulted in three new containers and a few merges to aspnet/aspnetcore. Even so, there are still things to do.
The functionality from the forked BenchmarkServer should instead be moved into Crankier itself. The server logs are also missing important metrics, total CPU and memory usage, but there doesn't seem to be a nice way to grab these in current .NET Core (for now I monitor `top` in another ssh session on the server). Finally, one could leverage Application Insights: along with echoing to stdout, push telemetry to App Insights via `TelemetryClient`, which would give pleasant graphs and log querying instead of pastebin log dumps.
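As a sketch of that last idea, a small reporter could push the same numbers to App Insights. The usage below is standard Microsoft.ApplicationInsights, but the class, the metric names, and reading the working set from the current process are my assumptions rather than code from the fork:

```csharp
using System.Diagnostics;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Extensibility;

// Hypothetical reporter: pushes the same numbers we echo to stdout into App Insights.
public class TelemetryReporter
{
    private readonly TelemetryClient _telemetry;

    public TelemetryReporter(string instrumentationKey) =>
        _telemetry = new TelemetryClient(new TelemetryConfiguration(instrumentationKey));

    public void Report(int currentConnections, int peakConnections)
    {
        _telemetry.TrackMetric("CurrentConnections", currentConnections);
        _telemetry.TrackMetric("PeakConnections", peakConnections);

        // Process-level memory as a stand-in for the numbers I currently read from top;
        // a proper CPU percentage would need deltas of TotalProcessorTime over wall time.
        var process = Process.GetCurrentProcess();
        _telemetry.TrackMetric("WorkingSetMB", process.WorkingSet64 / 1024d / 1024d);
    }
}
```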
A final note
Having become acquainted with Crankier I can appreciate its limits. The current implementation only tests concurrent connections, not messaging between client and server, and without messaging it doesn't reflect "real" load on a SignalR application. To test your own application, not only should messaging be tested, but specifically the messaging your `Hub` implementations expect. Rather than extending Crankier to exercise your own `Hub` methods, it's much easier to use Microsoft.AspNetCore.SignalR.Client to write your own class that uses `HubConnection` to call your application's `Hub` methods directly, acting as an automated user specific to your application.
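A minimal sketch of such a client follows; the hub URL and the SendMessage/ReceiveMessage method names are placeholders for whatever your own `Hub` actually exposes:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR.Client;

// An application-specific load client: one instance per simulated user.
public class MyAppLoadClient
{
    private readonly HubConnection _connection;

    public MyAppLoadClient(string url)
    {
        _connection = new HubConnectionBuilder()
            .WithUrl(url) // e.g. "https://myapp.example.com/myhub"
            .Build();

        // Observe what the server pushes back, rather than just holding the socket open.
        _connection.On<string>("ReceiveMessage", message =>
            Console.WriteLine($"Received: {message}"));
    }

    public async Task RunAsync()
    {
        await _connection.StartAsync();

        // Exercise the same hub methods a real user would, at a realistic rate.
        for (var i = 0; i < 100; i++)
        {
            await _connection.InvokeAsync("SendMessage", $"load-test message {i}");
            await Task.Delay(TimeSpan.FromSeconds(1));
        }

        await _connection.StopAsync();
    }
}
```

Spinning up many instances of a class like this gives load that exercises your actual hub methods and message sizes, not just the connection handshake.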
Chasing concurrent connection counts in this manner has been fun, but doesn’t reflect what production should look like. Ramping VM size to achieve higher connection counts ignores that one VM is one single point of failure for your application. In production, using something like Azure SignalR Service would be a better approach to scaling concurrent connections.