Resiliency & Network Performance: When a Low Probability, High Impact Event Happens
3:52PM Jan 27, 2021
Speakers:
Tim Lordan
Nick Feamster
Matt Tooley
Keywords:
internet
isp
network
matt
traffic
content delivery networks
people
streaming
nick
home
capacity
question
performed
paper
latency
work
applications
lessons
fcc
added
At the top of the hour now, we have a panel on resiliency and network performance: when a low probability, high impact event happens. When I was coming up, many years ago, reading about the original internet architecture, people talked a lot about how the internet was made to be extremely resilient, that even a nuclear war, which I grew up worried about in my youth, would not affect the internet too much; it would just route around the damage. Thankfully, we've never had an opportunity to test that theory. But COVID-19, with everybody working from home and streaming Netflix from home, has really created this kind of low probability but high impact event. Typically, systems aren't designed to handle massive surges in traffic. Cloudflare did a report at the beginning of the lockdown showing that under the surge, the internet performed as expected, whereas other systems probably wouldn't have. For example, on 9/11, the phone network didn't respond very well to everybody picking up the phone and calling their loved ones to see if they were safe; the network wasn't prepared for that. In the earthquake in Washington, DC a decade ago, which was extremely rare, the federal government decided to let everybody out on early release from work, and the streets of DC were not prepared for all those employees getting in their cars and driving home at the same time, so there was a massive traffic jam. DC's systems weren't designed for that, but the internet was designed for this. So today we wanted to have two experts. Nick Feamster is a computer science professor at the University of Chicago. You may recall Nick from when he was working at Princeton's computer science department, which is very well known in internet policy circles.
He's taken his talents to Chicago, which is great. Nick is going to be here talking about his recent paper, as is Matt Tooley, who is vice president of broadband technology at NCTA, the Internet & Television Association. He also works with the Broadband Internet Technical Advisory Group, or BITAG. And actually, Matt got his master's at the University of Chicago, so there's kind of a trend here: it looks like University of Chicago people are really taking over internet policy. If you were watching yesterday, Genevieve Lakier, the professor who was really great on that feisty Section 230 panel, is also from the University of Chicago. So let me just start off with Nick. Nick, you just wrote this big paper on characterizing service provider response to the COVID-19 pandemic in the United States. I'll put a link to the paper in the chat if you haven't read it, but just give us a real quick rundown of how the internet performed during COVID.
I think the main thing to take away, Tim, is that the internet performed very well during COVID. There was, of course, a bit of a stress test, particularly in mid March, with the shift to working from home and schooling from home as stay-at-home orders went into effect in various states across the country. The traffic shifts are evident: you can see substantial increases in traffic volume, particularly in certain types of services like streaming video, and you can certainly see changes in traffic ratios as well, because everybody was doing video; at certain interconnects there was an increase in upstream traffic from the ISPs to those service providers. And there was certainly a brief period, evident in the FCC's Measuring Broadband America data, which was one of the sources we looked at, as well as other sources that I think Matt will talk about, where you can see some blips on some ISPs, particularly in isolated areas. Most evident, I would say, were the blips in latency, although there were corresponding differences in speed. But in response, the rate at which Internet Service Providers added capacity was, I don't know if I can go so far as to say unprecedented, but it was being added at an incredible rate, much faster than the rate at which capacity had been added to the interconnects over the previous three or four years. And I might add, capacity is steadily added at those interconnects anyway, but during this particular period the ISPs responded quite quickly. We saw that in performance: after that initial stress in mid March, things sort of returned, I won't say to normal, but the internet has sustained the stress quite well.
So essentially we had a stress test, which isn't a test at all; it's a real-time event, and it's global. Do you have any sense of how the internet performed in other countries, not just the US, or of the global rise in traffic?
I think Matt can speak to that as well. Some of the data in the BITAG report actually brings to bear some of what we're seeing in other countries a little more than our report, which is really focused on US internet service providers. But there's actually some pretty good research on that published at the Internet Measurement Conference this year. People did look at traffic volumes at internet exchange points and so forth, and similar trends were observed in Europe, at least. Matt?
Matt, let me get to you. You're, I guess, the co-author of the BITAG report. Can you comment on the larger perspective on global traffic and how the network performed?
Yeah, so I'm the lead editor; it's a group paper. But one of the papers we looked at, which I think Nick referred to, was published at the Internet Measurement Conference by a Facebook team. They looked at performance across their global network, and they saw some interesting things in terms of video delivery and quality. Some of the developing countries in Sub-Saharan Africa had issues, and there were issues in India and in South America with streaming video. They didn't see that in the United States, which is consistent with a lot of the data that Nick and I looked at: the networks overall in the US worked very well. There are corner cases and exceptions, but as a whole it worked. They also showed a few interesting things they could observe. Typically, video traffic comes from a content delivery network directly into the network, and when there was poor video quality, what they started to see was that the traffic would overflow the direct interconnection and start to come in through the public internet, or transit. So that's indicative of a problem they could see. But it goes to your leading question: when people ask me whether the internet worked as designed through all of this, when you see traffic overflow onto public transit, that's exactly what it's designed to do. It's kind of a circuit breaker these days for how it works.
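The overflow behavior Matt describes can be sketched in a few lines. This is a toy model with made-up numbers, not how any particular ISP provisions its interconnects:

```python
def split_traffic(demand_gbps: float, direct_capacity_gbps: float):
    """Fill the direct interconnect first; any excess spills to transit."""
    direct = min(demand_gbps, direct_capacity_gbps)
    transit = max(0.0, demand_gbps - direct_capacity_gbps)
    return direct, transit

# Normal load: everything fits on the direct CDN interconnect.
print(split_traffic(80.0, 100.0))   # (80.0, 0.0)

# Surge: the direct link saturates and the remainder
# "circuit-breaks" onto public internet / transit paths.
print(split_traffic(130.0, 100.0))  # (100.0, 30.0)
```

The visible symptom in the Facebook data, traffic appearing on transit paths, is just the second case: the safety valve doing its job.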
You know, we've been doing this conference for 17 years. I think it grew out of the congressional internet caucus briefings that we were doing on the Hill way back in 1996. And the idea was that the internet architecture was fundamentally different from other networks, which members of Congress back in the 90s couldn't understand, and maybe still don't. I think what we wanted to do today was highlight what you're saying: that the internet and its architecture worked as promised, and we'd really never had an opportunity to test it.
Yeah. But one thing to think about is, if you went back 20 years, or even 10 years, how much the internet and its core have changed. An event such as this pandemic, and the shelter-in-place orders we saw in the spring, would certainly have produced a very different experience 20 years ago compared to how it works now. One of the reasons is that, when you look at the statistics, something like 90% of traffic these days is delivered via content delivery networks, at least for consumer ISPs, which helps greatly in spreading out that load. We didn't have nearly such a pervasive infrastructure of content delivery networks in the early days of the internet.
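A toy sketch of how a CDN spreads that load: each request is steered to the closest edge cache rather than to a single origin. The node names and latencies below are made up for illustration:

```python
# Hypothetical edge caches and round-trip latencies (ms) for one client.
EDGES_MS = {"edge-east": 12, "edge-central": 28, "edge-west": 45}
ORIGIN_MS = 90  # the single origin server, far away

def route(edge_latencies: dict, origin_ms: float = ORIGIN_MS):
    """Steer the request to the lowest-latency edge; fall back to origin."""
    if not edge_latencies:
        return "origin", origin_ms
    best = min(edge_latencies, key=edge_latencies.get)
    return best, edge_latencies[best]

print(route(EDGES_MS))  # nearest edge serves the request
print(route({}))        # no edges reachable: the distant origin takes the hit
```

Because demand is satisfied at many edges close to users, a surge like the one in spring 2020 never converges on one set of links, which is the "spreading out that load" Matt describes.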
And CDNs are companies like Akamai and others, right? Right. Nick, one of the big things that young people at home doing a lot of gaming worry about is latency, and that's always been a factor when it comes to broadband and internet access. One of the reasons geostationary satellite internet is really troublesome is that it just has terrible latency. Can you talk about how latency performed during COVID?
Absolutely. And Tim, it's important to note that there are many different ways to look at those statistics, and the paper that I think you may have linked in the chat shows a few different ways to slice it. You can look, for example, at the FCC's Measuring Broadband America data, which is a rich data set for exploring things like latency. You could look across the entire data set at averages, and we've shown in the paper that during the period from mid March onward, the averages in the FCC data do bump up just a little bit, a couple of milliseconds, before dropping back down to normal. But averages, as we know, hide things; they hide a lot of outliers. So I think it's helpful to dig in further, and we do dig in further in the paper. What you can see by digging in is that, by and large, most customers were fine. But there's a tail there. If you look at basically the worst 5% of customers, and the worst 1%, the so-called tail, then certain subscribers on certain ISPs did see some pretty substantial effects. Some of the data on that is in the paper; I don't want to call out names right now, but you can go look at the plots, and in that sort of outlier, the 1% for particular ISPs, those effects were pretty substantial. I would note that we looked at this over a multi-year period, and for some of those ISPs for which the effects were substantial, it wasn't a total meltdown; similar kinds of effects could be observed for some of those ISPs around Christmas time, when everybody goes home and starts browsing the web, watching movies, and shopping on the internet because they're not at work. Some ISPs saw the same kinds of effects then. So by and large: on average, the effect is observable; for most users, latency was fine; and for the unlucky 1% on a couple of ISPs, it was pretty noticeable, but still pretty short-lived.
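The point about averages hiding the tail is easy to illustrate numerically. A short sketch with synthetic latency samples (made up, not the Measuring Broadband America data):

```python
import random

random.seed(0)

# Synthetic latencies (ms): 98% of users are fine, 2% are an unlucky tail.
samples = ([random.gauss(20, 3) for _ in range(980)] +
           [random.gauss(200, 20) for _ in range(20)])

def percentile(values, p):
    """Nearest-rank percentile."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

mean = sum(samples) / len(samples)
print(f"mean: {mean:.1f} ms")                     # the tail barely moves the average
print(f"p95:  {percentile(samples, 95):.1f} ms")  # still looks like a typical user
print(f"p99:  {percentile(samples, 99):.1f} ms")  # the unlucky tail shows up here
```

The mean stays near the typical user's experience while the 99th percentile exposes the unlucky tail, which is exactly why the paper digs past averages into the worst 5% and 1%.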
Yeah, I mean, it wasn't like people were screaming that their internet was awful, right? In my home, I have three teenagers, and I was expecting everything to grind to a halt, and it didn't. And different applications impose different loads, right? Zoom is incredibly efficient as far as video goes, while we're all video conferencing from home, but other applications are different. Matt, can you talk about the different modes of different applications, and how ISPs and the network respond to that?
I mean, I think it's a good point that video today is adaptive. Zoom does a really good job, as do most of the other video conferencing services, and so does video streaming: it adapts to the bandwidth and throughput available to the end user, and often you don't even notice as the end user that it's adapting. That's an outgrowth of engineers figuring out how to make things work, as opposed to trying to make it work like the telephone network with a constant bit rate. And that was a good thing. One of the interesting things to note: everybody got concerned about what all this video conferencing would do to the network, and it certainly had very high growth in uptake. But when you look at the absolute numbers, the bulk of the traffic that got put onto the network was still video streaming, and the downstream was the largest contributor to the growth in traffic. So I think it's just a testament to the adaptiveness of how the video conferencing and video streaming folks work. It's phenomenal.
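The adaptation Matt describes can be sketched as a simple bitrate-ladder selection. The ladder values and safety margin below are toy numbers, not any real player's logic:

```python
# Toy adaptive-bitrate (ABR) selection: choose the highest rung of a
# bitrate ladder that fits within a safety margin of measured throughput.
LADDER_KBPS = [300, 750, 1500, 3000, 6000]  # available encodings, low to high

def select_bitrate(measured_kbps: float, safety: float = 0.8) -> int:
    """Return the highest bitrate that fits within safety * throughput."""
    budget = measured_kbps * safety
    chosen = LADDER_KBPS[0]  # always fall back to the lowest rung
    for rung in LADDER_KBPS:
        if rung <= budget:
            chosen = rung
    return chosen

# As the home connection gets congested, the stream steps down
# instead of stalling, which is why viewers rarely notice.
for throughput in (8000, 4000, 1000, 400):
    print(throughput, "kbps measured ->", select_bitrate(throughput), "kbps stream")
```

This is the opposite of the telephone-network model Matt contrasts it with: rather than demanding a constant bit rate, the application bends its demand to whatever the network can deliver at that moment.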
And the follow-up to Matt, I think, illustrates a very important gap, particularly as we think about internet policy in this area, because there are the tests currently being run by the FCC and others, and then there is the user experience with applications like Zoom and so on. There's quite a gap right now between the design of those tests and how particular applications perform. You asked me about gaming; now we're talking about video conferencing. We don't have great standardized tests, or even an infrastructure in place, to answer questions like how Zoom is performing across the US. Zoom probably knows, but then you could ask the same question about Netflix or YouTube or Hulu or whatever. From a policy perspective, I think we need to think hard about what the next generation of testing looks like in this space, because what we have right now is quite limited.
I think we have a question in the Q&A from Sally Braun, who's asking a question that I had too: how is capacity added, how is bandwidth created? She admits she's not an expert, so she's looking for just a user's guide to how capacity gets added.
Sure, I'll take a stab, at least for the cable networks that I'm most familiar with. You might first think adding capacity means actually going out and pulling a new cable or something, but often there are other things being done; we typically call it an augmentation. You can make an adjustment to the network. Sometimes you're just rerouting things in the core, which the operators did a lot during this period to adapt to the shifts, but you can also add bandwidth through techniques where you basically readjust how the existing capacity is used.
So I guess the question I have, as we're getting towards the end: God knows we'd rather not have had COVID, but it's certainly better than a nuclear war. What lessons can our network engineers, like you guys, talking with others, take from this? What did we learn, and what are we going to do differently as we build these networks out? Is that community coming together? Because this has been a terrible gift.

I'll jump in, and Nick can come in too. There's lots of evidence of collaboration across the board between operators and content providers. There was a great discussion at one of the North American Network Operators' Group (NANOG) meetings that they live streamed, where the engineers explained how they had their war rooms set up, texting and communicating back and forth to make sure that content from content provider A was able to get to the ISP networks, and making any necessary adjustments in real time. There's evidence of collaboration even in Europe, with regulators asking the streaming providers to please slow down their streams for a little while to make sure nothing went wrong over there, because they were concerned. So I think one of the lessons is just that the collaboration works, and the community works. It's usually always working together, but this really put a fine point on it: everybody came together in this situation.
Yeah, Tim, it's a good question. I'd take it back to where you started, with the observations about the early design being resilient to fairly catastrophic attacks. There are certain aspects of the internet, particularly the design of some of the protocols, that exhibit that type of resilience: if you take out a particular node, the routing protocols will figure out how to route around it or find a new path, and in response to congestion, certain protocols will adapt so that the available capacity can be more effectively shared. That's good in theory, but it really goes back to what Matt said: if the capacity is not there, it doesn't matter whether the protocols can find another route. The operators have to work together to make sure that capacity exists, and to make sure the applications respond in ways that don't overload the existing capacity. I think one of the big lessons we can take away is the testament to the collaborative, cooperative nature of the parties involved; the internet being a huge federation of independently operated networks, the cooperation was tremendous. And then I'd leave it where I started, on measurement: there are a lot of things we actually don't have a great handle on. What if people had started complaining about the video conferencing? Thank heavens they didn't. What if people had started complaining about their online classrooms?
Thank heavens they didn't, because the research community doesn't have, the FCC doesn't have, we don't have a global repository of data that would allow us to answer questions about that. Individual companies do, but I think we're still wanting something like a version two, a modern-day Measuring Broadband America that looks at application quality.
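Nick's "route around the damage" point can be sketched with a toy shortest-path recomputation. The four-node topology and link costs below are made up:

```python
from heapq import heappush, heappop

def shortest_path(graph, src, dst):
    """Dijkstra over an adjacency dict {node: {neighbor: cost}}."""
    queue, seen = [(0, src, [src])], set()
    while queue:
        cost, node, path = heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, w in graph.get(node, {}).items():
            if nbr not in seen:
                heappush(queue, (cost + w, nbr, path + [nbr]))
    return None  # no path at all: protocols can't help if no route exists

# Made-up four-node topology with link costs.
topo = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "D": 1},
    "C": {"A": 5, "D": 1},
    "D": {"B": 1, "C": 1},
}
print(shortest_path(topo, "A", "D"))    # cheap path via B

# Take node B out entirely; the routing "routes around the damage".
failed = {n: {m: w for m, w in nbrs.items() if m != "B"}
          for n, nbrs in topo.items() if n != "B"}
print(shortest_path(failed, "A", "D"))  # pricier path via C, but still connected
```

The recomputed path is worse but exists, which mirrors Nick's caveat: the protocol finds another route only if the operators have provisioned capacity for one.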
Well, we don't have time, or we could do an entire panel, or an entire day, on broadband measurement and mapping, and I'm not going to open that Pandora's box right now; we'll save it for another time. But one thing about people who do meat-and-potatoes internet policy issues like we do: sometimes we don't spend enough time in the underlying technical groups. Matt mentioned BITAG, and there are others. Are you seeing people coming together, across all these different levels, including the FCC, and saying: there are a lot of lessons to be learned here, this is an interesting example, how can we build better or more structures to collaborate and share information on network performance going forward?
I would say yes. It's still a bit of a constellation, and Matt probably has a perspective different from mine, but I would say absolutely. As an academic researcher, I receive funding from both the National Science Foundation, so our taxpayers, and also industry folks like CableLabs and Comcast and other ISPs, so I've had the opportunity to have a foot in the research community and, with a direct perspective on operations, in groups like BITAG and NANOG. The Internet Architecture Board and other standards bodies like the IETF are also looking at this; I spend less time in those, but I think Matt and others in this community are there. So there are a lot of touch points. I wouldn't say everybody's always in the same room, but there's definitely a lot of dialogue on this problem.
There have been, and still are, lots of groups looking at how we capture the lessons learned and what we need to do going forward to be better prepared. I've helped with quite a few reports along these topics, and I'll probably write a few more going forward. It's a constant theme these days in the different working groups.

Before I ask you guys to close with a minute of parting thoughts: you're both University of Chicago guys, and clearly you both like to play guitar. Do you ever collaborate with low latency on the network, with jam sessions?

We haven't yet. We collaborate on technical things; we haven't tried creative things yet. So maybe one day.
Low Latency DOCSIS is here, so it's definitely on the horizon, and I'm looking forward to future jam sessions.
You know, obviously I didn't know before that you guys both do these things, or we would have had to finish with a session. But in lieu of that, can you next year?

Hopefully we're in person, but if not, during one of your stretch breaks we can jam. All right, all right.
Just one minute of closing thoughts on lessons learned, and maybe things that didn't go well.
I don't know if there's a big takeaway for lessons learned other than, I think, we saw that the internet worked, and that the free markets prevailed.
I would echo that. There was certainly a lot of fanfare; we saw some pretty high-profile articles paying testament to the designers of the internet for the resilience we saw, and certainly credit is due there. But I think credit is even more due to the operators of the services and the ISPs, because, as I said, the protocols are one thing, but if the capacity isn't there, if we're not putting in the infrastructure and provisioning, it doesn't matter how good the protocol design is. And I think we saw that working really well.

I just wanted to add one thing that went unnoticed by a lot of people. To make all this work, and this goes to the earlier question about augmentation, there were people behind the scenes working hard, running up and down data centers doing things, and crews out in trucks doing things. A bunch of these people were essential workers, out there working really hard to make sure everybody's broadband was working.
Right now we're having some fun with the chat, but we're using Slack to share this information. The Slack is at a Bitly link: bit.ly/SOTNSlack, all one word at the end. For those who aren't on Slack, I apologize; Mark Bohannon can email me for the paper. I recommend Nick's paper and Matt's work to everyone, and I just want to thank you guys for participating in this kind of overview. This was really helpful for me, and I don't think this was widely known, which is why we wanted to host you today. Great.
Well, thank you. Thanks for the opportunity to chat, Tim. It's been fun. We'll see you again soon.
All right, thanks. Well, as I said, we're having a little bit of a challenge with the chat feature, meaning, like, you know,