Yesterday we started testing a scale deployment using the HaLowLink1 connected via ethernet to another WiFi AP that acts as the DHCP server. It all seems fairly stable when a small number of devices are connected over HaLow, but when we associated about 20 they all went off air, and the only way we could get the network back online was to reboot the HaLowLink1 (not the DHCP box). This points to something in the HaLowLink1 not being able to handle lots of devices. The connected devices are Heltec HaLow-WiFi dongles. Simply switching them all on seems to make the whole network collapse. With just a few of the dongles running in our lab, the network has run for days on end; yesterday’s scale test is the first time we saw this happen, and it is easily repeatable. Has anyone else seen such a problem?
How many HaLowLink1s do you have? I’d be curious to see an over-the-air pcap when things start going downhill. You can put one in monitor mode and capture as you add devices to the network. There are two sides to every connection: the AP and the client. So a pcap could show that the clients are actually behaving in a way that violates the standard, for example by not supporting some feature that helps as airtime contention rises.
We have run 100s of stations against the HL1; we definitely wouldn’t expect any issues with only 20. As @dwrice0 mentioned, it would be interesting to get some more details. Make sure you’ve upgraded to the latest version of the firmware.
When you say collapse, do you lose access to the HL1 as well, or is it just that the HaLow network doesn’t appear to be functional? Does ‘collapse’ mean you’re losing association, or just failing to pass much traffic?
Possibly an even simpler thing to do than getting a pcap would be to tail the logs on the HL1 while the stations are associating, and perhaps bring them up in stages. You can also go to the ‘Status → Realtime Graphs’ page to get a sense of what’s going on.
I think I have found a workaround and a potential issue on this topic. The test consisted of 20 Heltec WiFi bridges connected to the HaLowLink1, which we pinged regularly to check that they were still connected. We also pinged about 50 WiFi devices that were connected to the bridges, although the issue presents itself with just the bridges and no WiFi end points. The pinging code sent pings asynchronously, so they could have been issued in quick succession. The failure is very easy to reproduce on the bench: just run the automated pinger and switch on the bridge devices so they connect, and after a short time they aren’t reachable. If I reboot the bridges they never reconnect; however, if I reboot the HaLowLink1 the bridges reconnect until it all happens again.
I ran pings to each of the bridges using ping in 20 command line windows and it was reliable, so it does look like the async pinging code overwhelmed the HaLowLink1. I have since added some jitter delay to the pinging tool and run it for 3 days without problems.
Here is the C# code with jitter added, for reference if anyone else is considering using pings to automatically monitor the connection state of their devices. It could also serve as a test for the firmware development team: remove the random delay if a storm of pings is required. Hope it helps.
    // Requires: using System; using System.Net.NetworkInformation; using System.Threading.Tasks;
    // pingCancellation, deviceIcons, lastSeen, PingIntervalMs and UpdateDeviceStatus
    // are defined elsewhere in the class.
    private readonly Random jitterRandom = new Random();

    private void StartPinging()
    {
        Task.Run(async () =>
        {
            while (!pingCancellation.IsCancellationRequested)
            {
                foreach (var ip in deviceIcons.Keys)
                {
                    int jitter = jitterRandom.Next(20, 200); // 20-199 ms of random jitter
                    int delay = 100 + jitter;                // 120-299 ms between pings
                    await Task.Delay(delay);                 // Space the pings out
                    _ = PingDeviceAsync(ip);                 // Fire and forget; do not wait for the reply
                }
                await Task.Delay(PingIntervalMs);            // Pause before the next sweep
            }
        });
    }

    private async Task PingDeviceAsync(string ip)
    {
        try
        {
            PingReply reply;
            using (Ping ping = new Ping())
            {
                reply = await ping.SendPingAsync(ip, 1000); // 1s timeout
            }
            if (reply.Status == IPStatus.Success)
            {
                lastSeen[ip] = DateTime.Now;
            }
        }
        catch
        {
            // Ignore exceptions, treated as failure
        }
        UpdateDeviceStatus(ip);
    }
Thanks for the detailed information. That’s quite concerning, and we’ll try to duplicate it ourselves. That the stations don’t reconnect would appear to indicate some part of the system has crashed on the AP.
Would it be possible for you to send us the logs from the AP when the stations drop off and won’t connect? You can get them either via ‘Status → System log’ in the menu or using logread in the terminal.
Hello, I have been trying to build a test network here to more closely resemble the network you described and would appreciate some more detail if you can help.
I have a conventional 2.4 GHz WiFi AP acting as my DHCP server. My laptop is connected to this via ethernet. This AP is attached to the HaLowLink1 AP via ethernet to the WAN port. The HaLowLink1 AP is modified from a default device by going to ‘Wizard’ and choosing ‘HaLow Wi-Fi devices will get an IP on your existing router’s network.’
The ethernet/HaLow client bridge devices (for you, Heltec; for me, more HaLowLink1s) are then connected via HaLow to the HaLow AP, using WDS connections, and bridged to both ethernet and 2.4 GHz WiFi. They are configured as DHCP clients.
I also connected ‘end point’ devices via ethernet and 2.4 GHz WiFi to the ‘bridge devices’. These are also DHCP clients.
Does this accurately represent the network you are trying to build?
I replicated your async ping script in Python. I was able to generate ping traffic at high rates to all devices. I have not been able to replicate the error you described. I will try to use your C# code to see if there is any different behaviour.
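For anyone who wants to try the same comparison, here is a minimal Python sketch of an async pinger with switchable jitter. It is not the script used in the test above; the names ping_offsets_ms, sweep and system_ping are made up for illustration, the timing values simply mirror the C# code earlier in the thread, and the ping flags assume a Linux-style ping binary.

```python
import asyncio
import random


def ping_offsets_ms(count, base_ms=100, jitter_ms=(20, 200), rng=None):
    """Return the send time (ms from sweep start) of each ping in one sweep.

    With jitter_ms=None every ping gets offset 0, i.e. one burst.
    """
    if rng is None:
        rng = random.Random()
    offsets, now = [], 0
    for _ in range(count):
        if jitter_ms is not None:
            # Default gap is 100 ms plus 20-199 ms of jitter, as in the C# code.
            now += base_ms + rng.randrange(*jitter_ms)
        offsets.append(now)
    return offsets


async def sweep(hosts, ping, jitter_ms=(20, 200), rng=None, time_scale=0.001):
    """Ping every host once, spacing the sends by the offsets above.

    `ping` is any coroutine function that takes a host and returns True/False.
    Replies are not awaited until every ping has been sent.
    """
    offsets = ping_offsets_ms(len(hosts), jitter_ms=jitter_ms, rng=rng)
    tasks, previous = [], 0
    for host, offset in zip(hosts, offsets):
        await asyncio.sleep((offset - previous) * time_scale)
        previous = offset
        tasks.append(asyncio.create_task(ping(host)))
    results = await asyncio.gather(*tasks)
    return dict(zip(hosts, results))


async def system_ping(host, timeout_s=1):
    """Send one ping via the system binary (assumes Linux-style -c/-W flags)."""
    proc = await asyncio.create_subprocess_exec(
        "ping", "-c", "1", "-W", str(timeout_s), host,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    return await proc.wait() == 0
```

Something like asyncio.run(sweep(hosts, system_ping)) gives the jittered behaviour, and passing jitter_ms=None sends the whole sweep as a burst, which is the case that appeared to trigger the failure.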
Are you able to give some more information on the two networks and traffic types you created: the one where you did see an issue, and the one where you did not? Was the only difference between the working and non-working networks that you introduced some jitter in the mass ping timing? How often are you sending pings, and to how many devices?
Some other things to double check that may be causing issues, although the symptoms you described do not point at these:
- A network loop. If I plug two of the bridge devices together via ethernet, things go badly. This could also be the case for the 2.4 GHz WiFi network. You may be able to enable STP.
- Multiple DHCP servers. This can cause the appearance of network loss as devices have migrated away to another address or subnet. Sometimes an OpenWrt device can be misconfigured as both a DHCP server and client.
- Excessive broadcast traffic. An ethernet network of 70 devices may handle home network protocols like mDNS without noticeable issue, but the HaLow link may be impacted when there are 70 devices broadcasting and engaging in 70^2 peer-to-peer negotiations. What kind of traffic is running on the network?
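To put a rough number on the broadcast point, here is the arithmetic as a short Python sketch. It assumes the worst case where every device exchanges traffic with every other device; real mDNS traffic is multicast and will not follow this exactly, so treat it only as an indication of how quickly the load grows. The function name pairwise_exchanges is made up for illustration.

```python
def pairwise_exchanges(devices):
    """Ordered device pairs, assuming every device talks to every other one."""
    return devices * (devices - 1)


for n in (20, 70):
    print(n, "devices:", pairwise_exchanges(n), "pairs")
# 20 devices: 380 pairs
# 70 devices: 4830 pairs (just under 70^2 = 4900)
```

Going from 20 to 70 devices is 3.5 times as many devices but more than 12 times as many pairs.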