The internet's early warning system
It’s 6:30am and the people of Tallahassee in Florida are waking up, checking their phones, getting ready to set off for work. Few people are streaming content via Netflix at this time of the morning – and it’s just as well, because, if they were, the customers of one of the country’s biggest broadband providers would notice that streams are stuttering.
The early hour means the broadband provider’s tech support lines aren’t yet lighting up with customer complaints, however. And all is quiet on social media. But inside the broadband provider’s network control centre, the alarm has been raised.
Tens of thousands of SamKnows’ Agents have been quietly probing the Netflix service around the clock and, shortly before 4am, discovered that streaming performance had dipped around 70% below the bounds normally recorded at that time of night. SamKnows FaultFinder raised the alarm within 20 minutes, showing the network engineers precisely how far streaming performance had dipped below the norm, which customers were being affected, and the network equipment they’re using.
Within the hour, the network engineers were able to diagnose a problem with a CDN cache server, ensuring that it was resolved before the evening peak, when tens of thousands of customers might have had their night in watching Squid Game ruined. A significant potential problem was resolved before many even realised there was a problem at all.
That’s the power of FaultFinder, a SamKnows product that has been years in gestation, but which emerged from the turbulence of the Covid lockdowns. Here’s the story of FaultFinder – how it was developed, what it does, and the reasons it’s the most important product in the company’s history.
Finding the needles
The idea for SamKnows FaultFinder had been “simmering in the back of my mind for many years,” said the company’s founder Sam Crawford.
SamKnows has its software Agents or Whiteboxes in tens of millions of homes around the world, each recording thousands of different performance measurements every day. The problem with collecting so much data is knowing what to do with it. “Billions and billions of rows of measurement data arrive into our platform every day, and often it’s like finding a needle in a haystack when looking for an issue,” said Crawford. “You don’t even know if there’s zero needles in the haystack, or a million needles.”
Sam has frequently been asked to help identify problems on SamKnows’ customer networks, but it would be a “manual, painstaking process” to find the faults. “I’ve always thought that it would be nice to automate the process somehow, because we already have all of the ingredients,” said Crawford. “It’s just a case of putting the data we already have to better use and automating the analysis of it to discover anomalies.”
There was no arguing with the theory, but finding the time and resources to make it happen was a problem.
Then, in the early part of 2020, Covid struck and Sam suddenly found himself with more time on his hands than he was used to. “Lockdown certainly provided an opportunity to explore some stuff technically, because we weren’t travelling as much – or at all,” he said. “We were working from home, we had fewer journeys to the office, and so I had plenty of time to experiment with new things.”
It wasn’t only time that Sam had on his hands, but a new data pipeline that allowed SamKnows to process those billions of lines of measurement data in real-time. For the first time in years, Sam not only had the capacity to work on developing FaultFinder with his team, but he had the technical capability to pull it off, too.
What’s gone wrong and why?
FaultFinder’s job is to discover the problems on a broadband network and, crucially, what’s causing those problems – ideally before the broadband providers’ customers even notice that something is wrong.
It first takes that vast trove of performance metrics to calculate the expected performance levels on a particular broadband network. Those range from high-level metrics, such as the average download speed across the entire network, to very granular data, such as the average ping times when playing Fortnite. “We have this constant stream of incoming measurement data, and the predictability of it allows us to build trends of what is the expected performance based upon the historical average,” said Sam Crawford. “We know what the expected volumes should be, because they’re scheduled measurements and we have the historical volumes to act as our trend line. We’re basically comparing the current working set of measurements against our historical trends.”
Crucially, FaultFinder doesn’t raise the alarm when the performance of a given metric dips below a fixed threshold – say 10% slower than the average download speed – because setting fixed thresholds is of very limited use, as Sam explains.
“There’s no good, fixed set of bounds,” he said. “Some of these factors change by time of day. Some operators will, quite reasonably, have some congestion during peak hours, so you’ll see throughput drops slightly, latency increases slightly, packet loss increases slightly during those peak hours.”
“So, you don’t want to come up with a fixed set of thresholds that have to encompass all of the peaks and drops, because you’ll miss stuff while also being too sensitive at other parts of the day.”
Instead, FaultFinder learns about the network’s performance from historical trends, then applies a standard deviation multiplier to those trends to calculate an acceptable tolerance for each metric. This means, for example, that a significant dip in download speeds around the 8pm evening peak is less likely to raise an alert than a sudden fall at 4am, when the network would normally be less congested.
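The article doesn't describe FaultFinder's actual implementation, but the idea of a time-of-day baseline with a standard deviation multiplier can be sketched roughly as follows. The function names, the sample speeds and the multiplier of 3 are all assumptions made for illustration.

```python
from statistics import mean, stdev

def hourly_baseline(history):
    """Group (hour_of_day, value) pairs by hour and return {hour: (mean, stdev)}."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    # stdev needs at least two samples, so skip hours with too little history
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def is_anomalous(baseline, hour, current, k=3.0):
    """True if `current` sits more than k standard deviations from that hour's mean."""
    mu, sigma = baseline[hour]
    return abs(current - mu) > k * sigma

# Hypothetical download speeds in Mbit/s: steady at 4am, noisy at the 8pm peak
history = (
    [(4, v) for v in (98, 100, 102, 100, 99, 101)]
    + [(20, v) for v in (60, 90, 75, 85, 65, 80)]
)
baseline = hourly_baseline(history)

# The same reading of 70 Mbit/s is unremarkable at 8pm but well out of bounds at 4am
peak_alert = is_anomalous(baseline, 20, 70.0)
night_alert = is_anomalous(baseline, 4, 70.0)
```

Because the tolerance is derived from each hour's own variability, the noisy evening peak gets a wide band and the quiet early hours get a narrow one, which is the behaviour a single fixed threshold can't provide.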
Matching the metadata
FaultFinder isn’t only comparing current performance against historical expectation; it’s keeping a watchful eye on the number of tests being successfully completed, too. “If there’s a big drop in measurement sample size then that’s an indicator that something else has gone wrong,” said Sam. It might indicate something outside of the broadband network’s control, such as a localised power cut. Or, it might indeed be a problem in the operator’s network, or a configuration issue that needs fixing. FaultFinder not only raises the alarm, but helps the networks to identify the cause of the issue.
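A check on measurement volume could look something like the sketch below. It assumes the expected count is simply the average of past counts for the same time window, and that a 30% shortfall is worth flagging; neither the names nor the threshold come from SamKnows.

```python
from statistics import mean

def volume_drop(received, historical_counts):
    """Fraction by which the received sample count falls short of the historical average."""
    expected = mean(historical_counts)
    if expected <= 0:
        raise ValueError("no historical volume to compare against")
    return 1.0 - received / expected

def volume_alert(received, historical_counts, max_drop=0.3):
    """True if the shortfall exceeds max_drop (an assumed threshold)."""
    return volume_drop(received, historical_counts) > max_drop

# Hypothetical counts of completed tests in the same window on previous days
past_counts = [1000, 980, 1020, 1000]

normal_window = volume_alert(950, past_counts)   # small shortfall, no alert
outage_window = volume_alert(400, past_counts)   # large shortfall, alert
```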
The chances of successfully finding the culprit are further improved if the networks enter extra metadata into FaultFinder, which might identify which router the end customer is using, or even which version of the router’s firmware the device is running, for instance. “Some of the ISPs will give us about 30 or 40 different metadata fields,” said Sam. “By combining the measurements we collect with the metadata that’s provided by the operator, we can track down specific things, with super-high granularity and with really high certainty.”
Sam cites the real-world example of a broadband provider that rolled out a new version of the firmware for the routers in its customers’ homes. The firmware contained a bug that saw wired throughput drop down to 10Mbits/sec, irrespective of how fast the customer’s connection was meant to be.
Using a regular analytics system, the broadband provider would likely have seen a slow trend downwards in average download speed as the firmware was rolled out across its fleet, but it would likely have taken a long time to work out the actual cause of this. It would almost certainly have required engineer visits to customer homes to identify the flaw.
With FaultFinder, however, the software would report that a drop in download speeds is strongly correlated to a particular piece of metadata – the firmware version number. The network managers would be sent the alert and wouldn’t have to wait for an investigation to establish the cause of the fault. As a result, the time taken to remedy what would potentially be a very damaging issue for the broadband provider’s performance and reputation is massively reduced. “It’s basically finding the needle in the haystack for you,” said Sam.
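One simple way to tie a drop to a piece of metadata is to split the measurements by each metadata field and see which value stands furthest below everything else. The sketch below does exactly that; it is an illustration under assumed field names and made-up data, not a description of how FaultFinder computes its correlations.

```python
from collections import defaultdict
from statistics import mean

def most_suspect(rows, metric, fields):
    """Return (field, value, gap, sample_count) for the metadata value whose
    mean metric is furthest below the mean of all other rows, or None."""
    results = []
    for field in fields:
        groups = defaultdict(list)
        for row in rows:
            groups[row[field]].append(row[metric])
        for value, samples in groups.items():
            others = [x for v, s in groups.items() if v != value for x in s]
            if not others:
                continue  # only one value for this field, nothing to compare
            gap = mean(samples) - mean(others)
            results.append((field, value, gap, len(samples)))
    return min(results, key=lambda r: r[2]) if results else None

# Hypothetical measurements: firmware 1.3 caps throughput at about 10 Mbit/s
rows = [
    {"firmware": "1.2", "router": "A", "download_mbps": 95.0},
    {"firmware": "1.2", "router": "B", "download_mbps": 105.0},
    {"firmware": "1.2", "router": "A", "download_mbps": 100.0},
    {"firmware": "1.3", "router": "A", "download_mbps": 10.0},
    {"firmware": "1.3", "router": "B", "download_mbps": 10.0},
    {"firmware": "1.3", "router": "B", "download_mbps": 9.0},
]

suspect = most_suspect(rows, "download_mbps", ["firmware", "router"])
```

In this toy data the router model explains little of the drop, while the firmware version separates the slow connections cleanly, so the firmware field is the one reported.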
Working in real-time
Alerting broadband providers to faults on their network before they even know about them simply wouldn’t be possible if it wasn’t for the real-time data. SamKnows uses the Kafka data streaming platform, hosted on Google’s cloud infrastructure. This constant stream of incoming data gives SamKnows the edge over any of its rivals.
“Some of our customers really value having the measurement data arriving in real-time, particularly the people who are relying on it for network operations,” said Sam.
“I was talking to somebody who worked at another company that collects measurements, specifically about data pipelines. They said they don’t have a real-time data pipeline. They have a 24-hour batch delayed mechanism for loading all of their data. That’s fine for their use case, but it means it’s impossible for them to do something like FaultFinder. Who wants to be told about a major issue that’s occurring across your customer base 24 hours late?”
Indeed, part of Sam’s ongoing focus is further improving the time it takes FaultFinder to raise the alarm once a problem is detected. “We can identify some kinds of faults far more quickly than others,” he said, “potentially down to a minute or so, or maybe even into the seconds in the future.”
Better still, it’s the most serious faults that are the ones Sam believes will be easiest to detect quickly. “Let’s say we have 6 million Agents on one ISP, reporting measurements to us,” he said. “Even if they’re only running a latency test once every couple of hours, since we have so many Agents out there, it doesn’t matter that they’re only running once every couple of hours. It means we have a lot of measurements arriving every second, so we can spot a lack of measurements very quickly.”
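The arithmetic behind that claim is easy to check. The sketch below takes Sam's figure of 6 million Agents and assumes "every couple of hours" means one test per Agent every two hours, spread evenly over time.

```python
def arrivals_per_second(agents, interval_seconds):
    """Average measurements per second if each Agent tests once per interval."""
    return agents / interval_seconds

def seconds_for_samples(samples, rate_per_second):
    """How long it takes, on average, to receive a given number of measurements."""
    return samples / rate_per_second

# 6 million Agents, one latency test every two hours (assumed interval)
rate = arrivals_per_second(6_000_000, 2 * 60 * 60)

# Time needed to expect 1,000 measurements at that rate
wait = seconds_for_samples(1_000, rate)
```

At roughly 830 measurements per second, a thousand results would normally arrive in a little over a second, so a few seconds of silence is already a strong signal.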
Even if an increase in latency might take longer to detect, the far bigger problem of dropped connections would be spotted almost immediately. The network operations team can get on with fixing the problem without delay.
The one thing Sam is wary of, however, is raising false alarms. While he’s determined to send alerts for serious faults in under a minute, this does increase the potential for mistakes to be made on the back of relatively small samples. Sam is working on a couple of measures so alerts can be accelerated without compromising accuracy.
“In the future, it would be nice to incorporate machine learning,” said Sam. He’s also looking to incorporate user feedback, letting the customers tell SamKnows if the reported fault was a genuine anomaly and if it had a real-world impact on their broadband. “It would be nice to add in a feedback loop so we can take into account user feedback from anomalies. We can take that feedback, incorporate it into our future models, and then increase sensitivity to try to catch those kinds of anomalies sooner in the future.”
The FaultFinder flagship
Continually adding new features and reducing the time it takes to flag faults is part of the ongoing development of FaultFinder. The product might only have existed in Sam Crawford’s imagination a year or two ago, but now it’s central to the future of the company, according to CEO Alex Salter. Salter describes FaultFinder as the “product that will define us.”
“It makes use of everything we have ever developed and solves a previously impossible puzzle by providing precisely what you need, at the exact moment you need it,” he added.
And although its primary job is, well, finding faults, it’s also capable of delivering good news, telling ISPs when their service is exceeding past performance.
After what seems like a couple of years of relentless bad news, isn’t it nice to be told something is working better than you thought it was?