Yesterday we started testing a scale deployment using the HaLowLink1, connected via Ethernet to another WiFi AP that acts as the DHCP server. Everything seems fairly stable with a small number of devices connected over HaLow, but when we associated about 20 they all went off air, and the only way to get the network back online was to reboot the HaLowLink1 (not the DHCP box). This points to something in the HaLowLink1 not being able to handle lots of devices. The connected devices are Heltec HaLow-WiFi dongles. Simply switching them all on seems to make the whole network collapse. With just a few of the dongles running in our lab, the network has run for days on end; yesterday’s scale test is the first time we have seen this happen, and it is easily repeatable. Has anyone else seen such a problem?
How many HaLowLink1s do you have? I’d be curious to see an over-the-air pcap from when things start going downhill. You can put one in monitor mode and capture as you add devices to the network. There are two sides to every connection: the AP and the client. So a pcap could show that the clients are actually violating the standard by not supporting some feature that helps as airtime contention rises.
We have run 100s of stations against the HL1; we definitely wouldn’t expect any issues with only 20. As @dwrice0 mentioned, it would be interesting to get some more details. Make sure you’ve upgraded to the latest version of the firmware.
When you say collapse, do you lose access to the HL1 as well, or is it just that the HaLow network doesn’t appear to be functional? Does ‘collapse’ mean you’re losing association, or just failing to pass much traffic?
Possibly an even simpler thing to do than getting a pcap would be to tail the logs on the HL1 while the stations are associating, and perhaps bring the stations up in stages. You can also go to the ‘Status → Realtime Graphs’ page to get a sense of what’s going on.
I think I have found a workaround and a potential cause for this issue. The test consisted of 20 Heltec WiFi bridges connected to the HaLowLink1, which we pinged regularly to check whether they were still connected. We also pinged about 50 WiFi devices connected to the bridges, although the issue shows up with just the bridges and no WiFi end points. The pinging code sent pings asynchronously, so they could have been issued in quick succession. The failure is very easy to reproduce on the bench: run the automated pinger and switch on the bridge devices so they connect, and after a short time they are no longer reachable. If I reboot the bridges they never reconnect; however, if I reboot the HaLowLink1 the bridges reconnect, until it all happens again.
I then ran pings to each of the bridges using ping in 20 command-line windows and that was reliable, so it does look like the async pinging code overwhelmed the HaLowLink1. I have since added some jitter delay to the pinging tool and run it for 3 days without problems.
Here is the C# code with jitter added, for reference, in case anyone else is considering using pings to automatically monitor the connection state of their devices. The firmware development team could also use it as a stress test: remove the random delay and it will generate a storm of pings. Hope it helps.
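    using System;
    using System.Net.NetworkInformation;
    using System.Threading.Tasks;

    // Fragment of a larger class: pingCancellation, deviceIcons, lastSeen,
    // PingIntervalMs and UpdateDeviceStatus are defined elsewhere (not shown).
    private readonly Random jitterRandom = new Random();

    private void StartPinging()
    {
        Task.Run(async () =>
        {
            while (!pingCancellation.IsCancellationRequested)
            {
                foreach (var ip in deviceIcons.Keys)
                {
                    // Next(20, 200) returns 20..199, so each ping is launched
                    // 120..299 ms after the previous one (never negative).
                    int jitter = jitterRandom.Next(20, 200);
                    int delay = 100 + jitter;
                    await Task.Delay(delay); // Stagger the pings instead of sending a burst

                    // Fire and forget: the ping completes in the background.
                    _ = PingDeviceAsync(ip);
                }

                await Task.Delay(PingIntervalMs); // Pause between sweeps
            }
        });
    }

    private async Task PingDeviceAsync(string ip)
    {
        try
        {
            PingReply reply;
            using (Ping ping = new Ping())
            {
                reply = await ping.SendPingAsync(ip, 1000); // 1 s timeout
            }

            if (reply.Status == IPStatus.Success)
            {
                // Several pings can complete concurrently, so lastSeen should
                // be a thread-safe collection such as ConcurrentDictionary.
                lastSeen[ip] = DateTime.Now;
            }
        }
        catch
        {
            // Ignore exceptions; a failed ping simply leaves lastSeen unchanged.
        }

        UpdateDeviceStatus(ip);
    }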
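Jitter spreads the pings out in time, but it does not limit how many are outstanding if replies are slow. One possible variation, which has not been tested against the HaLowLink1, is to cap the number of pings in flight with a SemaphoreSlim. The sketch below is only an illustration: the class BoundedPinger, its method names and the default values for maxInFlight and maxJitterMs are hypothetical and not part of the original tool, and it assumes .NET 6 or later (for Random.Shared).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.NetworkInformation;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original monitoring tool.
public static class BoundedPinger
{
    // Sends one ICMP echo request; any exception counts as "unreachable".
    public static async Task<bool> PingOnceAsync(string host, int timeoutMs = 1000)
    {
        try
        {
            using var ping = new Ping();
            PingReply reply = await ping.SendPingAsync(host, timeoutMs);
            return reply.Status == IPStatus.Success;
        }
        catch
        {
            return false;
        }
    }

    // Runs probe(host) for every distinct host with at most maxInFlight probes
    // outstanding at once, plus a random 0..maxJitterMs delay before each probe.
    public static async Task<Dictionary<string, bool>> ProbeAllAsync(
        IEnumerable<string> hosts,
        Func<string, Task<bool>> probe,
        int maxInFlight = 4,      // assumed value, tune for your network
        int maxJitterMs = 200)    // assumed value, tune for your network
    {
        using var gate = new SemaphoreSlim(maxInFlight);

        async Task<KeyValuePair<string, bool>> ProbeOne(string host)
        {
            await gate.WaitAsync();   // waits asynchronously until a slot is free
            try
            {
                await Task.Delay(Random.Shared.Next(0, maxJitterMs + 1));
                return new KeyValuePair<string, bool>(host, await probe(host));
            }
            finally
            {
                gate.Release();       // free the slot even if the probe throws
            }
        }

        // Distinct() avoids duplicate dictionary keys; ToList() starts every task.
        List<Task<KeyValuePair<string, bool>>> tasks =
            hosts.Distinct().Select(h => ProbeOne(h)).ToList();
        KeyValuePair<string, bool>[] results = await Task.WhenAll(tasks);
        return results.ToDictionary(r => r.Key, r => r.Value);
    }
}
```

A sweep over the bridges would then look something like `var up = await BoundedPinger.ProbeAllAsync(bridgeIps, ip => BoundedPinger.PingOnceAsync(ip), maxInFlight: 4);`, where bridgeIps is a placeholder for your own list of addresses. Passing the probe in as a delegate also makes it possible to exercise the throttling logic with a fake probe and no network.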
Thanks for the detailed information. That’s quite concerning, and we’ll try to duplicate it ourselves. That the stations don’t reconnect would appear to indicate some part of the system has crashed on the AP.
Would it be possible for you to send us the logs from the AP when the stations drop off and won’t connect? You can get them either via ‘Status → System log’ in the menu or by running logread in the terminal.