Archives
Plenary session
17 November 2015
11 a.m.
CHAIR: Welcome back everyone. Please settle down. Take your seats and we'll start with our first presentation from our friends from Facebook, who will tell us how they monitor their networks.
DAVID ROTHERA: Morning everyone, my name is David and, along with Jose, I am going to give you a talk this morning on some of the trials and tribulations that we have had monitoring the network within Facebook. We both work within the network engineering team, which is split between Dublin and Menlo Park in California, roughly 50:50.
So, within Facebook, and more specifically within network engineering at Facebook, we like to say that we run a zero-impact network. That's all well and good, but what does it mean? In theory, if we're doing our jobs correctly and well, you should never actually see what's happening behind the scenes.
In case you are wondering, that's myself and Jose, and those are a couple of our team mates.
So, what are we going to talk about today? We are going to talk about some numbers of scale, we are going to talk about some of our tooling, we're going to tell some tales from things that we have actually seen, we have got some future goals and we have got some time for Q&A at the end.
Facebook scale. What does that mean? Over one-and-a-half billion people use the site every single month and over a billion of them use it every single day, which is a cool number, but perhaps the cooler number is the fact that over 80% of those are actually outside of the US and Canada. What does that mean for us? It means that we have got a truly global network, and although we have data centres in North America and Europe, it means we have to have a network stretching to all four corners of the world. If we look at machine-to-machine traffic and machine-to-user traffic, we can see that although the machine-to-user traffic is growing, it's growing at a slightly more steady and manageable rate, something that's easy to cope with. The machine-to-machine traffic, on the other hand, is growing exponentially over time. As we add more and more services that are media rich, and that traffic travels between clusters in a data centre and between data centres, not only within continents but between continents, it means the traffic is growing to numbers of scale that are outside the scope of most people.
At Facebook we like to say that we just build robots and then let those robots build the networks.
So, for anybody that's playing Buzzword Bingo, we took software-defined networking and we turned it into Facebook-defined networking, but I'll give you the point anyway.
So, let's look at this diagram, probably with lots of names you have never heard of, and whilst I'd love to spend the next eight hours going over those, we don't have time. We'll focus on a couple of these. The first is FBNet. That's our database that allows us to model absolutely everything within the Facebook network: devices, line cards on those devices, ports on those line cards, circuits between those ports and other ports, BGP neighbours; as I say, literally everything.
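To make the "model everything" idea concrete, here is a minimal sketch of what such an inventory model might look like. The class and field names are illustrative assumptions, not FBNet's actual schema.

```python
# Minimal sketch of an FBNet-style inventory model.
# Class and field names are illustrative, not Facebook's actual schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Port:
    name: str                         # e.g. "et-0/0/1"
    circuit_id: Optional[str] = None  # circuit terminating on this port, if any

@dataclass
class LineCard:
    slot: int
    ports: List[Port] = field(default_factory=list)

@dataclass
class BGPNeighbor:
    peer_ip: str
    peer_asn: int

@dataclass
class Device:
    hostname: str
    role: str                         # e.g. "backbone", "spine", "tor"
    line_cards: List[LineCard] = field(default_factory=list)
    bgp_neighbors: List[BGPNeighbor] = field(default_factory=list)

# With a model like this, tooling can answer questions such as
# "which devices terminate circuit X?" without touching the network.
def devices_on_circuit(devices: List[Device], circuit_id: str) -> List[str]:
    return [
        d.hostname
        for d in devices
        if any(p.circuit_id == circuit_id
               for lc in d.line_cards for p in lc.ports)
    ]
```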
We have NetNORAD, which is our packet loss alerting and monitoring system. This works by building a mesh of hosts all over our network and sending different classes of probes between them to correlate when we have got issues between clusters and also between data centres.
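As a rough illustration of the probing idea (not NetNORAD's actual implementation), a host can fire UDP probes at a set of peers and report the loss rate per target. The target addresses, port and probe counts below are made-up examples.

```python
# Rough sketch of mesh loss probing: send UDP probes to each target and
# count replies. Targets, port and probe count are made-up examples.
import socket

TARGETS = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]   # hypothetical responders
PORT = 31337
PROBES_PER_TARGET = 50
TIMEOUT_S = 0.2

def probe_loss(target: str) -> float:
    """Return the fraction of probes that got no echo back."""
    lost = 0
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(TIMEOUT_S)
    for seq in range(PROBES_PER_TARGET):
        sock.sendto(str(seq).encode(), (target, PORT))
        try:
            sock.recvfrom(64)        # assumes the target echoes probes back
        except socket.timeout:
            lost += 1
    sock.close()
    return lost / PROBES_PER_TARGET

if __name__ == "__main__":
    for t in TARGETS:
        loss = probe_loss(t)
        if loss > 0.01:              # alert threshold: more than 1% loss
            print(f"loss to {t}: {loss:.1%}")
```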
We have Megazord, that's our alarm correlation engine, built to take alarms which are completely unlinked and unrelated and link them to things that we know about, mostly things from FBNet. We have drain services, built to take traffic away from either a device or a subset of things within a device, whether it be in the data centre or in the backbone.
So, let's take some of those tools that I talked about and see how they are actually used in the real world.
Now, I'm sure everybody that operates circuits starts off with the manual approach. You know, you get an e-mail from a vendor saying they are going to be doing some work, you wake up at 4 a.m. and you move traffic away from that link. Then the work is carried out and, once it's completed, you go on there, check that the link is still clean and move traffic back on. That's fine when you have got 1, 5, 10, 50 circuits, but soon it doesn't really scale. So you move on to more of a hybrid approach: you build tools, scripts, that's what we're doing, to make your life easier. But still it doesn't continue to scale, even with things that make your life easier. You are still having to be up at 4 a.m. and, as I mentioned, we have a network that goes all over the world, so we don't really have just one 4 a.m.
We moved instead to make it fully automated. How does that look?
We receive an e-mail from a vendor saying that there is going to be maintenance on a number of circuits. We take that e-mail and we parse all of the data out of it, such as the circuits affected, start time, end time, what the expected impact is, things like that. And then we create a task. That task is used to collaborate between teams in regards to that maintenance and also so that we have all the information humanly visible. We pass this off to a system called Poltergeist, which is a crontab on steroids; it's designed to do things at certain times according to what's in that task, and it calls the drain services. Those drain services go to the links, maybe an hour before the maintenance starts, and move traffic away from them. Once the maintenance has ended, Poltergeist knows it's finished and we put traffic back on to those links as long as they're clean. All this is brilliant when you know what's going on. But what about when you don't know what's going on?
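A minimal sketch of the parsing step, assuming a vendor e-mail with simple "Circuit ID / Start / End / Impact" fields; real notifications vary wildly, and the format and regexes here are purely illustrative.

```python
# Sketch of parsing a vendor maintenance notification and deriving a
# drain time an hour before the window starts. The e-mail format and
# field names are assumptions for illustration only.
import re
from datetime import datetime, timedelta

SAMPLE_EMAIL = """\
Planned work notification
Circuit ID: ABC-123456, DEF-654321
Start: 2015-11-17 04:00 UTC
End:   2015-11-17 08:00 UTC
Impact: traffic interrupting
"""

def parse_maintenance(text: str) -> dict:
    circuits = re.search(r"Circuit ID:\s*(.+)", text).group(1)
    start = re.search(r"Start:\s*([\d-]+ [\d:]+)", text).group(1)
    end = re.search(r"End:\s*([\d-]+ [\d:]+)", text).group(1)
    fmt = "%Y-%m-%d %H:%M"
    return {
        "circuits": [c.strip() for c in circuits.split(",")],
        "start": datetime.strptime(start, fmt),
        "end": datetime.strptime(end, fmt),
        "impact": re.search(r"Impact:\s*(.+)", text).group(1).strip(),
    }

maint = parse_maintenance(SAMPLE_EMAIL)
drain_at = maint["start"] - timedelta(hours=1)   # drain an hour early
print(f"Drain {maint['circuits']} at {drain_at}, undrain after {maint['end']}")
```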
So... let's say you get a number of links that go down. We are going to use Megazord to correlate those links: things such as the vendor, where they go from and to, things that we know can link them together. We look them up in a subset of FBNet where we store things like fibre information. Then the system is going to check FBNet to see how we contact that vendor: ideally through e-mail, but in future we'd like to be able to use APIs and things like that in a perfect world. Again we're going to create a task; essentially everybody uses these things, basically just for keeping a record and so that you can collaborate between people and teams.
We're going to contact the carrier, as I mentioned, most of the time via e-mail or, if you really need to, via phone, but let's just say e-mail. Once they have gone and fixed the circuit and all is well and good — I'm sure everybody has seen a circuit that comes back up and then goes back down and does that for a few hours while they are trying to finish fixing it — we enter a hold-down period to make sure everything is stable. Once everything is cool, we close off the event.
Okay. So, I'm now going to hand over to Jose, who is going to talk to you about something a bit different that we have seen within our network.
JOSE LEITAO: David described some of the more common scenarios and some of the foundational systems that we have. Those were two scenarios that probably most of you have faced: how do you manage planned issues with links, and how do you manage the shark, right? Let's talk about something more special, and I'm not talking about my T-shirt.
So if you look at this graph — this is one of my favourite graphs ever, because this is essentially a graph of pain. I don't know if you can actually see it with the projector, but if you are watching via the streaming you'll probably be able to see it better. This is a graph that represents over four months of the free memory on a set of large routing platforms that we have. And as you can probably deduce from this, this is an incident, or an event, that we call the memory leak debacle. As you can see, the free memory decreases, then it gets to a particular point, something happens, then it comes back, and this process starts again and again and again.
And essentially what was happening here is, as I mentioned before, a set of large routing platforms had a series of memory leaks that would essentially start eating away the memory. If you left the device like this until it collapsed, what would happen is that initially, after a period of time, it would get to a particular threshold and you would essentially lose management, so the device could not be controlled any more. You could not SSH to it or do anything more sophisticated — forget about things like NETCONF — basically your management of the device is gone, but it's still moving traffic. You say okay, that's fine, that's acceptable, we'll figure it out. If you left the device to continue in that state, after a couple of hours it would start doing what we call grey failure: the device would process network control traffic sometimes and sometimes it wouldn't, so it would cause all sorts of havoc on the network.
So in this particular situation we went to the vendor, because the platform wasn't our own. We went to this particular vendor, which I'm going to keep anonymous — I'm going to be very generic with the terms, so there are a few slides up ahead where it doesn't sound like networking lingo; this is intentional, because you would figure it out quickly if I started throwing out the proper designations. They gave us a work-around to recover this, and this work-around was interesting: it had a risk of actually crashing the box, because obviously the box is not in a sane state.
That's essentially what's happening there. The work-around meant that we needed to take traffic off the device and basically reload the active CPUs. And you would need to do this for the entire fleet, for that entire subset of devices, until the vendor could provide you with a fix, could actually tell you: this is how you solve it, this is the root cause. As you can see here, it actually starts before this — this is from October to the end of January of this year. It was actually happening a little bit before, but this snapshot is just four months. We had to live with this particular platform in that state for four months. How do you solve this with humans? Lots of them, on coffee, right. And as I said, this was very manual, this had a high risk, and there was a good chance it would actually crash the box, depending on what stage the device was at when you got to it. In some places this would have been: you are on call, there is this issue, and you need to go and check the fleet, find the worst offenders and go through this whole process. So how did we actually manage the situation?
David described some of the systems; I am going to try to explain what everything is. For example, we have a key-value historical store where we keep data from the devices and servers and all the applications, and essentially we are collecting all sorts of things, including what the free memory of those active CPUs is, right? We have that data, and on top of it we have automation where you can define detectors — you can do simple maths or complicated maths or whatever you want — which essentially says: according to a particular threshold, or a percentage, or a rate of change, or a prediction, if this goes above or below it, I want you to fire an alarm. That's what we did. We said: this is what we declare as a safe threshold; we know that after devices go below this point we have, you know, six hours, eight hours, a day to react. So we defined that safe threshold and, below that, you fire the alarm. That alarm would fire and get picked up by FBAR, which is our automated human, essentially a system that runs a series of logic to perform actions after it sees an alarm. So think of this as what most places tend to have, which is a runbook — there is a runbook and if you see this set of conditions you are going to do this — but we have the same thing done by a machine. In this case, we wrote a remediation, and this had a series of actions: essentially check that the device is in that particular state, check that this device is actually running the version where we expect this to happen — essentially that a series of pre-checks were actually met — and if any of them didn't look right, it would go to a human, because computers are dumb and humans are a lot smarter, right. But assuming that everything checked out and the device was on the version and the code and what we expected to see, we would go ahead with this very intrusive work-around. It would go and call the drain services, because the first step here, besides ensuring that the device is in the state it is expected to be in, is to actually remove traffic from the device in case the work-around doesn't go very well.
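A highly simplified, self-contained sketch of that remediation flow — pre-checks, drain, intrusive work-around, post-checks. The "device" here is a simulated dict and every helper is a hypothetical stand-in for the drain services and FBAR actions described in the talk, not Facebook's real API.

```python
# Simplified sketch of an auto-remediation flow. The device is simulated
# as a dict; the helpers are hypothetical stand-ins, not real services.
SAFE_FREE_MEM_MB = 2048
AFFECTED_VERSIONS = {"12.3R4", "12.3R5"}   # invented version strings

def escalate_to_human(device, reason):
    print(f"ESCALATE {device['name']}: {reason}")

def drain(device):
    device["drained"] = True               # pretend the drain service succeeded
    return True

def undrain(device):
    device["drained"] = False

def reload_routing_engines(device):
    device["free_mem_mb"] = 8192           # pretend the reload freed the memory

def remediate(device):
    # Pre-checks: only act when the device looks exactly like the known issue.
    if device["free_mem_mb"] > SAFE_FREE_MEM_MB:
        return                             # transient alarm, nothing to do
    if device["os_version"] not in AFFECTED_VERSIONS:
        escalate_to_human(device, "unexpected OS version")
        return

    if not drain(device):                  # take traffic off before anything risky
        escalate_to_human(device, "drain failed")
        return

    reload_routing_engines(device)         # the intrusive work-around

    if device["free_mem_mb"] > SAFE_FREE_MEM_MB:
        undrain(device)                    # post-check passed: traffic back on
    else:
        escalate_to_human(device, "memory still low after work-around")

remediate({"name": "bb01.example", "os_version": "12.3R4", "free_mem_mb": 900})
```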
So, FBAR would call the drain services and the drain services would remove traffic from the device with their own set of sanity checks and pre-checks and all this good stuff. When the traffic was actually off the device — so the drain services had actually gone through this whole process and everything was fine — we actually start working on the device: FBAR would start applying those work-arounds, which essentially, as I mentioned before, means you need to reload the active CPU, the standby CPU needs to take over, and then when we have that redundancy again this repeats. So we would wait, and when that was all done the device would be in a clean state. The device would be back, as we mentioned before, at let's say square one, right. And we knew that, for example, on some devices this would happen every week, on similar devices every two weeks, on some every three. This would bring us back to that point.
So when all of that happened, it came back to FBAR, which did a series of checks and, if everything was okay, put traffic back on that device.
This is essentially how we survived this situation for four or five months without any sort of human intervention. In the beginning there were a couple of cases where the logic wasn't accounting for a particular set of things, but those were sent to a human, they fixed the issue and we moved on. That gave us and the vendor the time to actually root-cause this properly and give us a software patch for it.
Those are some real-world cases of how we have used this automation and these tools. So again, to give you an idea of scale, I'm going to go over some numbers over 30 days for some of the most interesting systems — which I don't think you can see; maybe from my perspective I can't see anything, but hopefully you at least can.
So, emitters are our SNMP and syslog servers that essentially process all those messages, because networking devices are chatty and we don't care about most of what they are sending, but there is a percentage that we do care about. Basically the emitters will go through 3.3 billion messages every 30 days, will actually discard most of them, and only 1% actually result in an alarm.
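A toy sketch of that filtering idea: match incoming syslog lines against a small list of patterns worth alarming on and drop everything else. The patterns and sample messages are illustrative, not the real rule set.

```python
# Toy sketch of an "emitter": keep only the syslog messages worth
# alarming on and drop the rest. Patterns are illustrative examples.
import re

ALARM_PATTERNS = [
    re.compile(r"LINK-3-UPDOWN.*down"),    # interface went down
    re.compile(r"BGP.*NEIGHBOR.*Down"),    # BGP session dropped
    re.compile(r"%SYS-2-MALLOCFAIL"),      # memory allocation failures
]

def process(line: str):
    for pattern in ALARM_PATTERNS:
        if pattern.search(line):
            return {"severity": "alarm", "message": line}
    return None                            # discarded, like most of the input

messages = [
    "Nov 17 11:00:01 rtr1 %LINK-3-UPDOWN: Interface et-0/0/1, changed state to down",
    "Nov 17 11:00:02 rtr1 %SYS-5-CONFIG_I: Configured from console",
]
alarms = [a for a in (process(m) for m in messages) if a]
print(f"{len(alarms)} alarm(s) out of {len(messages)} messages")
```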
Out of those alarms, 99.6% of them get automatically resolved, which means only 0.4% get seen by a human, and FBAR runs around 750,000 times on networking alarms. That doesn't mean unique alarms; it can be something transient that FBAR acts on after a certain period of time.
Carrier maintenance, which is the name we gave the system for handling the planned events, would actually act on over 300 maintenances. Vendors, which is the system that handles the situation with the shark, would notify transport carriers of 1,100 distinct unique events. And hundreds of thousands of alarms get correlated into around 1,200 master alarms that are most of the time consumed by automation. All this has essentially given us the ability to have a single on-call in charge of the whole network.
And we are actively working in the direction where we have no on-call, or where the on-calls basically only get engaged when automation fails in some new, mysterious, catastrophic way.
So, some lessons and recommendations that we want to share with you, with anyone that's going through this experience or planning to go through this experience.
Lesson number one is: try to reuse existing tools and code when it's possible and when it makes sense. There tends to be this tendency in our industry, or in some particular organisations, of "oh, this wasn't built here, so let's reinvent the wheel". We try not to do that, and I think we tend to be good at it, in the sense that, for example, some of the tools that we have mentioned were actually designed to do something completely different and have been tweaked or adapted to cover a particular need. FBAR, for example, was originally designed to handle server alarms and we adapted it to be able to do networking; I think a lot of what it's doing now, it's doing on networking devices. But we also did it the other way around. For example, David wrote Megazord, and Megazord was originally designed to work on networking alarms, because you have these inter-dependencies between alarms that you don't have on servers. There are some use cases for other teams inside Facebook and they are using Megazord to cover their needs as well.
Lesson number two is that hacks quickly become important tools. What I'm trying to say is — and I think most of us have found ourselves in this situation — you are on call, there is some event, for example a power event, and now you lose a number of networking devices, besides all the servers; we don't care about the servers, right. You just lost, let's say, 500 networking devices and there's an incident going on. They restore the power and you find yourself in this situation where, okay, if I just let this go over the normal channels, automation will eventually tell me what's broken and where I need to look, but we are trying to recover this particular thing now, so how do we check those 500 devices very quickly? Are they actually alive? Do they have any links down? What were the last ten log lines out of them? Things of that nature. As an on-call, you think: okay, I'm just going to get something done very quickly in whatever scripting language of your choice; you write this tool, you run it, it goes and checks those 500 devices, you are happy, you get a real view of what's going on, where we are with this, or whether we need to direct stuff here and there. After the event is closed you say: it would be nice if we could just have that available for everyone, all the time. If they ever have to go through this again, it's something they can call on in the moment; they don't have to hack up a solution while under fire.
And some of the systems that we have have actually come out of exactly that.
Number 3 is that, in order to do this successfully, you should instrument and unit-test and document all the things. At Facebook we try not to have, let's say, dedicated code owners — like, David wrote Megazord so nobody except David can touch it. But in order for that to be possible, you have to have instrumentation: if I go and make a big change to how it works, and it turns out Megazord is now running three times slower, we have the metrics to see that, and we have the documentation that allows me to understand what lives where in the code. Even if you take it from the point of view that you are not thinking about other engineers that are going to work on your code base: if you write something and you go back to check it a year later, you want to be able to figure out why you wrote all the code that is there and what it's actually doing.
Number 4 is: ask for feedback often. Some of the tools that we developed are not used by us on a day-to-day basis. They are tools that are used by people who are on call or people that are in the data centres, so we don't interact with them too much. For example, there was an incident where our automation tried to react to certain events. When people are doing active work on the network — changing or upgrading or doing something on a certain set of devices or in a part of the network — there are tools that will go and remove that from the monitoring point of view: okay, ignore this, anything that happens with this device or this set of devices doesn't matter, somebody is working on it. So there was this incident where half of the maintenance that they were acting on had that enabled and the other half didn't. So automation went in, thinking it was a real issue, and started doing, you know, interesting things. When we got to the bottom of it, we thought: there is a tool to do this, why is the person doing this maintenance not using the tool? We had a conversation, and essentially he told me: what happens is, I do this all by hand because the tool is just too slow for me. I went: okay, have you told anyone about this? He tells me no. We actually went and found out why the tool was slow, and we made it faster. But ideally you should have better feedback loops, and if you don't use those tools every day you should try to interact with whoever does, to make sure the tool is doing what you want it to do.
Number 5 is that networking devices don't have powerful CPUs. This is interesting because some of the folks in our team come from a server background; they come from being SREs or SROs or whatever your company calls this role. They are people that are very used to servers with gigantic CPUs and multiple cores; they want to collect something every second and the server is fine with that. This is not true for networking devices, and this is a lesson that we have learned by actually investigating why devices were doing the things they were doing. In our lingo we call this collection: we gather counters, we gather every set of counters, and over time those kind of creep up on you. If each one of those jobs eats 0.3% of CPU and you are doing that every minute or two minutes and you have 300 of those, you are going to find yourself in a situation where this set of devices is running at 95% CPU and you don't know why — it turns out it's your collection. We did an experiment with a subset of devices and we found that the CPU usage on those devices went down 60%, and the CPU in those devices does other things, like routing, which it's kind of important to have available.
Number 6: the sooner the robots take over, the better. All this automation has allowed us to remain relatively small and to not have to relearn the same mistakes over and over again. Automation can be very silly, but you see an issue once or twice, it gets fixed, and that issue disappears forever. Humans tend to make the same mistake again and again.
Number 7 is: talk is cheap, focus on impact. The idea is that ideas are plentiful; actually getting something out the door and putting in the effort to implement it tends to count more. The last one is: done is better than perfect. This again is something that we tend to say a lot, which is that you can spend six months whiteboarding a solution, but most of the time, when you actually get that solution into the world, you find all these scenarios you didn't consider and you have to go back to the drawing board or make adjustments. So we find that it works better to get something out that covers, like, 90% or 95% of the use cases and actually get feedback early so you can iterate on it, than to do another six months of whiteboarding.
So, this is what we have been doing — or rather, this is what we did. Let's talk a little bit about what we are actively working on.
So, as some of you might know, we released FBOSS, we released Wedge, and 6-pack, which is something similar to a chassis built out of Wedges. My team is heavily involved in ensuring that we have feature parity with whatever else we have on the network: we don't want to go from a particular platform that has all these good counters and capabilities to something that FBOSS doesn't have, so we are heavily involved in that.
The other thing we are working on is PCE, which is taking the brains out of the devices so you can have a more centralised view on things and we are doing that already, but we are actively working on it.
Number 3 is having better visibility into the optical space and the IP world. These are heavily disconnected in most organisations, where you have one set of people that controls one and another that controls the other, and when you have incidents they try to talk to each other and it's typically not very good. So we are working on making sure that we have a better kind of end-to-end visibility in this department.
And the last one is really important. It's something that people tend to overlook when they are thinking about automation, which is that you need to continuously develop your tools. As long as the network is evolving, you need to make sure the tools are keeping up, because all sorts of things will creep up: new edge cases, tools will become slower, you have new dependencies, tools will break because of some change somewhere. You need to ensure that everything evolves with the network.
Before we wrap up: we have this group on Facebook where we have a 350-to-400-engineer community of people that are going through this process of doing network automation, so if any of you find this interesting and want to join, and pose questions or answer questions for other people, this is a good place to do it. That's the name of the group. And I think we have some T-shirts left, so if anybody wants a T-shirt with this, find me or find Brendan, who is all the way in the back. So that's all. Thank you very much. And before we go to the Q&A we'll leave you with one question:
"What would you do if you weren't afraid?"
(Applause)
CHAIR: Okay. Thank you very much. We have time for one or two quick questions. Anyone up for it?
AUDIENCE SPEAKER: I have a question — Marco Canini. My question is regarding the tool development. I assume that you need to do a lot of testing to get things working correctly, while some of the special cases only show up in the real environment. So I wonder, how do you cope with this? How do you manage the real-world testing effort while not necessarily destroying or disrupting traffic?
DAVID ROTHERA: As Jose mentioned, one of our key development principles is to ship early and often. So basically, once you have got a proof of concept, get that out as early as you can. Hopefully what you are doing is something that isn't going to disrupt traffic, so it's something you can get out there early. And, you know, once it's in the wild, as you say — when you are planning it, the chances are you have not accounted for many of the issues you are going to see once it's in the wild. So get it out there; that's really the best way to do it. Otherwise you are going to be sat behind a whiteboard for six months trying to think of all these edge cases and, when you eventually put it live, you are still not going to have thought of everything.
JOSE LEITAO: Typically what we try to do is what I mentioned before. For example, with this memory thing, you know the work-around is to drain and reload the CPUs, right, so you are expecting a set of conditions and you write your code to expect that set of conditions. If you get something that doesn't match that exactly, you abort and you have a feedback loop that goes to a human, or goes to you, or whatever. That's not a perfect system, but between that, unit tests and making sure that you actually maintain what you just wrote, it works out most of the time.
CHAIR: Okay. Thank you very much.
(Applause)
Before we start the next talk I'd like to remind you that we have the Programme Committee elections going on. You can nominate yourself for the Programme Committee until this afternoon, I think 3:00 or 3:30. The candidates will introduce themselves in the last session today and then the election will start and go on until Thursday afternoon, when we announce the results. So if you are interested, go to the RIPE 71 website, look up the Programme Committee and send in a nomination for yourself.
Our next presenter is Karl Brumund from Dyn; he will talk about data centres at a slightly smaller scale.
KARL BRUMUND: Hi. I work for Dyn. We just had Facebook talking about things at a little bit bigger scale than us, so what we're doing is a little bit smaller.
As you know, Dyn has been around the DNS field a long time; we also do e-mail and Internet intelligence. We have about 28 sites around the world, a few hundred probes, and we have got 4 core sites where we are building regional cores. What we're talking about here is the data centre network that we're doing in our core sites; we'll be rolling that out to all of our sites.
I want to talk about the things that we probably shouldn't have done but you know, it was a learning experience.
So the initial design looked like this. It looks pretty good: Clos design, it's redundant, lots of bandwidth. Looks good. Buy the stuff, install it, configure it, what could go wrong? Logical-wise, it's MPLS, because MPLS is great for everything, like BGP, it solves everything. It's great: big ToR switch with MPLS, check. Oh, v6? 6VPE? Yeah... oops. So the feedback at one point, we heard from one of the people: well, v6 wasn't a requirement. You say: what? Don't do this. So we had to kind of start over again, and we thought maybe this time we'll actually engineer it.
So, the previous team is no longer with us; we have a new team on this. We thought, you know, let's actually define the problem. What were we trying to do here? We have a bunch of legacy data centres that don't have enough redundancy, enough scale — the usual stuff. It also makes it interesting that we have a lot of legacy apps and servers; we are trying to rebuild something that's brownfield, not greenfield, we're not starting over from scratch. The other thing is, we're not Facebook, we're not building these huge massive things, so it just has to be good enough and fast enough and cheap enough. Scale-wise, we're at about 20 racks; order of magnitude, give me something that kind of handles up to 200 and we can revisit it.
So we thought we'd kind of define our requirements. The main one was scalable, and the fact that we actually can support it. A lot of times, as engineers, we forget about the fact that we have to support these things and we have to have our existing NOC and ops teams support them. The other thing: we want everything to be standard protocols, so if we don't like a box, we can drop another one in. We talk about fast, and cheap in the sense that it can't cost a fortune — we are talking effectively about cost of ownership, both to buy it upfront and to run it.
Also, it has to fit us. Unfortunately, with legacy stuff, we can't drop everything into VMs; that's just how it works.
And talking about "just works", the main thing is: I don't want to get paged at 3 a.m.
So, a couple of things we had to figure out: the routing, to actually make it work this time, including v6; security, where we thought we could do better; and the other thing was service mobility — being able to kind of move stuff around, which I'll talk about a bit later.
The design that we wound up with looks like the previous design, basically because, well, we bought all this stuff and we racked it and stacked it, so we couldn't afford to throw it away either. We can work with this.
So, effectively, what we're now doing is layer 3 for most of the network, with layer 2 hanging down to the servers.
Logical stuff: we still like layer 3, we didn't want to do a layer 2 network. That raises the issue of service mobility, because I don't have my VLANs or subnets spread across racks, so if we have to move an application from one rack to another, that's going to be interesting. I don't want everything on the Internet, and I'm not doing overlays because of legacy. I'm going to need more routing tables, so, yeah, okay, VRF-lite or virtual routers, that will work for us, and okay, that means I'm going to have multiple IGPs and multiple BGPs. And we bought really cheap switches, so there is an issue of RIB and FIB scaling, and there is an issue, we discovered, with crappy CPUs in layer 3 switches. We are still not ready for overlay.
So, with this, we thought: how many routing tables do we need? Okay, we want stuff that's Internet accessible — call that PUBLIC. Stuff that's not Internet facing, databases — call that PRIVATE. We do stuff that's load balanced, where there has to be different routing, so we call that LB. We have to connect our sites, so we'll need something there. We have a test environment, because we don't want to do all testing in production. And we have got a bunch of continuous integration pipeline things, which we want accessible both in the QA environment and in production; we wanted to keep that separate, so we create a separate one. Roughly six. That's a small number, I can count it on my hands, so I'm not too worried.
The logical design looks like this. We have basically got a bunch of different routing tables and effectively we run trunks between the devices; each routing table is linked up. Not everything needs to go down to the top-of-racks, so not everything does. And we connect our remote sites in either into our routers or across IPsec VPNs.
So one of the things we had to answer for BGP was: are we going to do eBGP or iBGP? In that particular case we just went with iBGP; it worked for us, and basically again it came down to the point that staff understand it — again, it's the whole concept that it has to be supportable.
For eBGP, one of the concerns was, again, the small CPUs on our layer 3 switches and the fact that, with multiple routing tables, if you start doing eBGP you are going to end up with a lot of BGP sessions going down from your spine layer to your top-of-rack layer. If you have got 20 racks with a pair of switches in each, from each spine that's 40 sessions times the routing tables, so around 200 sessions — it seemed like a lot. It may have worked, we didn't try it. There was a reference to the Microsoft guys who presented this at NANOG a while ago.
In terms of the IGP, I had to pick one, and in the end we concluded it doesn't matter; it's providing, basically, your loopbacks and point-to-points. So, you know, would people understand it? They understand OSPF; okay, we'll use that. The other reason is, we're not a service provider, so trying to get IS-IS experience is just going to be a little bit hard — we don't want to compete with the tier 1s for it. At the end of the day, any choice would have worked. There is a reference to an Internet draft that talks about design choices for this.
The other thing we had to figure out was route exchange: we have multiple routing tables and obviously things need to communicate. We may have a web server in our public zone and a database in private, and they need to talk to each other. We realised this can get confusing real fast, and one of the things we wanted was to make it manageable and scalable; also, I don't want to have to go back and change the network — I want to be able to configure something once and have it keep working. BGP communities are the answer: I tag routes as they enter my network, on the top-of-racks, and then, based on that, on my spines I have a policy that passes them from one table to the other.
Almost everything is done on the spines for us, and basically we are trying to keep it as simple as possible: if this route has this community, you know, send it over there.
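The decision on the spines boils down to a mapping from community to target routing tables. Here is a tiny sketch of that idea in Python; the community values are made up, and on the real network this logic lives in the switches' BGP routing policy rather than in a script.

```python
# Sketch of the "if this route has this community, send it over there" idea.
# Community values are made-up examples; in practice this is expressed as
# BGP routing policy on the spines, not as a script.
LEAK_MAP = {
    "65000:100": ["PUBLIC"],            # tagged at the ToR as Internet-facing
    "65000:200": ["PRIVATE"],           # internal-only services
    "65000:300": ["PUBLIC", "LB"],      # load-balanced, reachable from both
}

def target_tables(route_communities):
    """Which routing tables should import a route carrying these communities?"""
    tables = set()
    for community in route_communities:
        tables.update(LEAK_MAP.get(community, []))
    return sorted(tables)

print(target_tables(["65000:300"]))     # -> ['LB', 'PUBLIC']
print(target_tables(["65000:999"]))     # unknown community -> leaked nowhere
```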
We have a few interesting details we had to work out. We do pairs of top-of-racks, so one of the issues there, because you are doing iBGP with route reflectors, is that we can have black-holing. If, for example, both our top-of-racks announce a route up, the route reflector picks one — say the lowest IP, the first one. What happens if the link from that switch to a spine is down? The spine gets that route, passed on from the route reflector, and goes: I can't reach it, all I have is a default, send it back up — so eventually you wind up with a loop.
Our solution for that is to do effectively anycast: we take a loopback address per rack and put it on both top-of-racks. We do that, the next hop just works, ECMP load balancing does its thing, and it works the rest of the time. This actually worked out well for us.
Again, another thing that was a result of us picking iBGP was anycast — we run anycast services. One of the problems is that the spines only get the best routes; all they ever get is single routes, and a single route isn't really anycast any more. The thing was, we looked at it and, well, we don't have tonnes of anycast routes: I have got, you know, two separate instances with the same anycast IP on them, in different racks, but we don't have a whole lot. So let's again do something simple: we'll drop them into OSPF. Really, it's not that big a deal — it's a few dozen anycast routes at the end of the day, so why not, we'll throw them into OSPF, it works. Again, it's this concept of simple. Which ones do we put in? The routes that are tagged with the anycast community. Create the policy once, and basically that's it.
Security: changing the topic from before. On the legacy network we did the traditional thing: we had ACLs and firewalls, and there was a problem. Our server people would take stuff out of production and forget to tell the network guys to clean up the rules, then we reused that IP over here — so clearly it's a problem. We figured the best way to deal with it was to get rid of the problem: no more security. Well, no more security in the network.
So, the concept is the network just moves packets. We don't want to be filtering them. What we did instead is push the security directly down onto the instances, and the service owners are responsible for their own security. They own the application, they are writing it, they are maintaining it; they should know their own flows. The blast radius, if a server does get compromised, is limited to a single server: throw an ACL on a router interface and everything in the VLAN can be compromised, whereas here it's literally a single device. We are not keeping ACLs and firewall rules in the network; it's all basically been pushed right down to the server level.
Now, one of the things is how you roll this out, particularly when you have got a whole lot of people — in the case of our developers, we tell them: by the way, you have to do something new and you get to do it. So they weren't really happy with this. Anyway, what we came up with was: when the server first gets built, we install basic security, which in our case is monitoring and SSH, and everything else is blocked; then the server owners add the rules they need. We use Chef and a continuous integration environment; we are familiar with that, so we spent a lot of time educating them on how it's done, how to add rules, how to find out what the sources are and so on. So it took a long time to get them on board. Our first deployment was particularly obstinate, but they came around in the end. It took a whole lot of meetings, but in the end we got them on board.
One other thing is, by doing a layer 3 network we have this problem of service mobility. Layer 2: fine. Overlay: again, fine. In our particular case: not really. What happens if, all of a sudden, something goes down, or we need to upgrade something and we are going to move it to a new rack? We are going to change the IP. For some things that's not a big deal, but we have to update the tables on all the devices that this one may talk to. Again, automation makes it easier, but still, it's a bunch of work, and if you have to do something in a hurry because something failed at 2 a.m. on Sunday morning, you are half asleep and you are trying to do a whole bunch of changes. Not ideal. What if the service IP didn't change? We don't care about the server itself; it's the actual service that matters.
So, what we came up with was this thing here. This looked great; we thought we were really clever. Basically, we go and create a dummy0 interface, put an IP on it, and use exabgp, which announces that IP up into the network — it's really great. And then our first deployment was an application that can't bind its outbound traffic to it. Some applications are fabulous — you can bind them to the interface so they'll source traffic from the right IP — and others can't.
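For illustration, here is roughly what a health-check script driven by exabgp's process API can look like: it announces the service IP while the service answers locally and withdraws it otherwise. The IP, port and interval are placeholders, and the exabgp configuration that actually runs this script isn't shown.

```python
# Rough sketch of a health-check script for exabgp's process API:
# exabgp runs this and reads announce/withdraw commands from stdout.
# Service IP, port and interval are placeholders for illustration.
import socket
import sys
import time

SERVICE_IP = "192.0.2.10/32"   # the address on the dummy interface
CHECK_PORT = 8080              # local port the service listens on
INTERVAL_S = 5

def service_healthy() -> bool:
    try:
        with socket.create_connection(("127.0.0.1", CHECK_PORT), timeout=1):
            return True
    except OSError:
        return False

announced = False
while True:
    healthy = service_healthy()
    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP} next-hop self\n")
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP} next-hop self\n")
        announced = False
    sys.stdout.flush()          # exabgp reads our stdout line by line
    time.sleep(INTERVAL_S)
```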
So, it seemed like a good idea. A bunch of the time it works well, sometimes it doesn't.
Network deployment. And here is where we have some similarities to our friends from Facebook. The network is pretty automated. For example, all our top-of-racks: if you go and make a manual change to them, your change will be overwritten automatically — our automation fully owns their configuration. If we need to grow this network, it's trivial: we need to add another 50 racks, big deal, I don't really care.
And we use a home-grown thing called Kipper; a colleague of mine developed that — go and check out NANOG 63 for more details on it.
In terms of the rest of the network, we are semi-automated. Some things are partially controlled by Kipper, some things are still done manually. We are moving in the direction of basically one hundred percent automation — the idea is that we never have somebody logging into a router again.
I want to end with some things: what did we learn here? This is kind of lots of the obvious stuff. A design, done in advance, that serves as a document — that's a good thing. A design that you can actually implement — that's even better. And designing it right, instead of just the easy way, because the easy way may not be easy a few months later.
Validating things before you deploy them. Kind of simple there.
And integrating legacy stuff — that's really hard. And legacy cruft — that's even harder. That was probably the hardest thing there: trying to mesh with what we had existing, which still needs to keep running, fully supported and up all the time, and trying to bring that into the new, and doing it without any downtime.
Then, of course, everything is as we say here.
Network-wise, we love our cheap layer 3 switches — well, the cheap part of them; that's unfortunately the only good thing about them. We learned things about, you know, TCAM size: you cannot have your full protect-RE filter in there, and you won't find that out from the vendor. RIB size, that is a concern; we haven't hit it yet, but it's something we keep an eye on, particularly because with multiple routing tables, where you copy routes from one table to another, you are effectively multiplying the number of routes, so your RIB size can grow quite easily. And those multiple routing tables, yes, they are a pain, but because we only have a few we think we can manage it.
Automation — we heard Facebook talk about that, and we also think it's really cool. BGP communities — again, kind of talking to the converted here. The other thing we have learned is that there's no such thing as partially in production. We threw that first service onto the network before it was quite ready, and, well, we're now live, and that suddenly means that you just can't make a change at 2 o'clock in the afternoon.
The other big thing is staff experience levels. It's very easy sometimes for us engineers to be very clever — hey, we can do this, this would be really cool. Yes, it will, but can everybody else also support it and understand it? Are you going to spend the rest of your life being the person they call because they don't know what you did?
Security‑wise:
Again, taking security out of the network was probably the best thing that we did. That makes our life so much easier. We looked at some commercial solutions to basically deploy and audit this. They all suck. You know, v6 support — hello, really? Vendors, come on...
So in the end we rolled our own, just because we had to. The other thing is that many of our service owners and developers didn't know their own flows. You guys actually wrote this code — how do you not know who it's talking to? A lot of it is simply because they never had to care before. All they did was throw the code on and it worked, and somebody else took care of the security and made sure that their bits flowed. So now they own their own security; it's been challenging, but once they get on board it works out quite well.
In terms of our users: our users are actually our developers, who deploy our applications on the network. They don't like change. They hated it when we told them they had to do more work. We basically needed to be embedded, so effectively we had network engineering sitting inside dev squads, working with them on a daily basis, to explain this, get them on board and show them that this was really quite easy.
And educating these users in doing that was actually probably much more work than all of our configuration and network design and testing and everything else. It's one of the things where it's easy for us to underestimate how much time it will take.
I just want to wrap things up. There are many different ways to build data centres and networks; in these talks we have seen many of them. This one worked for us; maybe it works for you, maybe some elements of it may work for you. You know, one of the key things is that our network just moves bits to servers. Servers run apps, but the end result is customers buying the services. So really, all those other factors are much more important than the network. We are kind of just here doing pipes.
Thank you.
(Applause)
CHAIR: Thanks a lot. Do we have questions?
AUDIENCE SPEAKER: Blake, today from iBrowse.
First, thanks very much in general. These return-of-experience, lessons-learned talks are often very helpful, because I saw a lot of people in the room going: yeah, I have been there, done that. In general, something that maybe more people need to think about is applying more pressure on their vendors — not necessarily to deal with these problems, but at least from a feature-request point of view. You know, if your crappy little top-of-racks just had some very, very basic MPLS functionality, that would solve a lot of your problems, and other things like that. And I think that, in general, there needs to be more pressure on the vendors; there is not enough communication between people using network equipment and the push-back to their vendors — sorry, the feedback to their vendors — in terms of what worked for us and what didn't. It's more like: okay, here is a box, deal with it and figure it out — that has been the modus operandi for most shops that I have worked with.
KARL BRUMUND: Our experience with that was mostly trying to actually get detailed specifications of what some of these boxes did. If they are using off-the-shelf commodity hardware it should be easy, but for some reason the vendors can't tell you what size the FIB is and what size the RIB is, and how do you find out when you are going to hit the limit? Their answer was: you can't, until it just happens. I am just amazed by some of this.
AUDIENCE SPEAKER: Chris Petrasch, DENIC. You told us that you put security on the servers, but if you have a security issue on the servers, on one of your customers, and it hits the network as well, do you have a process to prevent it, or...
KARL BRUMUND: So, we do have the usual control-plane filtering on all our network devices, but the network does not do data-plane filtering.
AUDIENCE SPEAKER: Okay. Thank you.
CHAIR: Any more questions? We have a couple of more minutes if you like. Okay, then, thanks a lot.
(Applause)
Our next presenter will be Leslie; she will talk about NetDevOps. Before she goes on stage, I'd like to remind you that you can rate the talks. You can go to the RIPE 71 website, to the Plenary programme, click on "rate" and help us understand what kind of talks you like and how much you like them, and make the programme for the next meetings even better. So, Leslie.
LESLIE CARR: Before I get started, I just want to mention that this is a very visual talk so you might want to look up from your laptops every once in a while. I know that e‑mail is super exciting and I couldn't convince the RIPE NCC staff to turn off the wi‑fi, so I just did this...
All right, I am Leslie Carr, I am an operations engineer, and I have been in the industry for about 15 years now. I have worked at a couple of websites you might have heard of before, you know, like Google or Wikipedia, and most recently I was working at Cumulus Networks.
And this talk is really aimed at network engineers that are automation-curious but are running their network in a traditional model, or systems engineers who love their network engineers and want to help them automate. So, as I said before, you should probably look up, but if you are already a network engineer who is automating everything, congratulations, your job is done; go back to your e-mail.
So, I want to take you today through the first steps of taking your infrastructure from the legacy model to working in a DevOps model. What do I mean when I say "DevOps"? I think if you ask 100 people, you get 100 different answers. But really, at the heart of it is that, for a long time, software developers and sysadmins didn't communicate, and this caused a lot of friction between teams. DevOps was the idea that, when your teams work together, all of your production is in a way better state. But the most important foundation is that your infrastructure is now thought of and described as code. You can use some great tools like Puppet or Ansible, so your infrastructure can be managed by code, there's no more logging in, and now everything is in a better place. Here is what we have today, right: we have the dog of ops working with the pony of devs to lead the kitties, which is everyone else. So DevOps is leading the pack — great, problem solved, let's go home. Except the kittens are the network. They are just being pulled along by the pack, because traditional network methodology has been fairly old-fashioned.
So, how do we tend to deal with this? You know, you get a new router, you unbox it. First, you are logging in manually with a password. You are typing in commands on the command line, live, cutting and pasting over console. Some vendors have rollback, but not all of them, so if you make a mistake and, I don't know, you close your terminal window, you are left wondering: what did I just do? RANCID is the only decent tool that really exists to save your configuration or state, and I have to give RANCID props; it's been around for a long time and it just works. And a typo can bring down your whole network. Everyone raise your hands if a typo has brought down your network. It's a lot. For the rest of you, you probably just don't know that a typo brought down your network.
So, why is it like this? In the automation world, we often forget that our tools have been around for a while. Take CFEngine: it's considered to be one of the first real configuration management tools. So, does anyone have an idea when CFEngine was written? No looking at Wikipedia. He said 1995 — that's really close. It was actually written in 1993. That's forever ago; it's 22 years.
But Cisco's onePK tools that enabled automation were only released in early 2014, so easy tools to do this are actually really, really new, and without the help of tools — and, let's be honest, without the help of vendors to help us use those tools — it's no wonder that we have been stuck in the old days.
But I believe that automation is for everyone, and I'm going to show you how to get to this: network and systems engineers working together so our lives are all happy and cuddly.
I have heard some fears and common complaints. It's a fad — but, you know, we just had the wonderful people from Facebook up here; I don't think it's a fad and it's not going away. It's hard — that's true, it's hard, but we run the Internet, right? What can be harder than that? We run the Internet; we can do hard stuff. This will steal my job — that's a very valid fear, but one of the things is, even when you get new technology, budgets almost never go down. So instead of stealing your job, I like to think of it as stealing the boring things that you have to do, so then you have time to do all the fun new projects, or maybe you have time to go and have a beer and tell your boss you're in a meeting.
One wrong move can take everything down with automation — but it also can take everything down without automation. And your gear doesn't support it — sadly, very true, but I'm going to show you how you can get 75% of the way there. So even if your gear doesn't support it today, you are really close.
And the thing I hear the most is really "I don't know where to start" — but now you do, because you are here listening to me.
I think it's time to trust the computers. The Internet is built on trust, right, the BGP peering is really just trusting that you know, that person over there is going to be sending me the correct routes. So, now where can you start?
Look at all these logos. So many logos, right? You have all of these choices. All of these vendors support automation, at least to one degree or another. And look at all these different automation tools. You don't have to be a good programmer; you really don't even have to be a bad programmer. You can cheat: you can use all the cool tools that are out there and all this other code that people have written. I feel like the biggest problem is social, not technical.
So now we're going to start learning. Before we start: as I mentioned before, my most recent job was at Cumulus, so all of my examples today are going to be using Cumulus and Puppet. I know many of you aren't using Cumulus here, but good news, you don't have to cry, because all the major vendors support Puppet, and, like I said before, even if yours doesn't, I'm going to show you how to get most of the way there.
The first, most important thing is to get used to Git — or, as someone mentioned to me in the hallway, another source code repository. If you think about it, it's a source code or text file repository. It has automatic file revision and change management, so it knows when you have changed a file and what you have changed. It's really built for teams to work on the same files at the same time, so you don't have to worry about both of you editing your IP address file or switch 1's configuration. There are a lot of scary words, but it's easy to get started with, even though there are a lot of knobs for advanced users.
So GitHub is really one of my favourite sites. It's really easy to start using. It's free for any public repositories and it's really cheap for private repositories. If you don't know how to run your own GIT repository or you don't care to do it, right, other people can do it better for you. And with private repositories your configurations don't have to be in the public eye.
So, here are some terms that you will hear a lot using Git and that I'll be using again today. Remote repository: if you are using GitHub, think of this as the copy of the files on their server. It's sort of like the master that you always go back to to find the latest revisions.
Local repository is your local copy, like the copy on your laptop. This is where you make all of your changes.
A branch: one of the cool things about Git is that it has the idea of branches. Let's say, you know, today you have been convinced that you need to add IPv6 everywhere — and some of us don't have it everywhere, right. You can make a branch for IPv6 so that you can do all this work, and you and your colleagues can collaborate, but it doesn't actually affect the master; it doesn't affect the running production copy, which is very important.
And then a merge is when you push all those changes from that branch back into production. Like, all right, it's tomorrow, you have reviewed everything, you're ready, you are going to push IPv6 now. You merge the changes back, so now all of your changes can go into production.
Step 2 is that you can't start automating without knowing what you have. So, just start checking in your configurations. Right. We aren't touching them, just checking them in. You are going to go through and you are going to be incredibly afraid when you realise that the person who, you know, rage quit three months ago still has access to this one router because you forgot to log in and remove their access; you are going to find that. Or that firewall change, you pushed it to all but one of your routers. This is great, because now you know and now you can make a difference and change things.
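(As a minimal sketch of that first step, assuming you already have a directory of saved configurations; the names are illustrative:)
    # turn the directory of saved configs into a repository and check everything in
    cd router-configs/
    git init
    git add .
    git commit -m "Initial import of running configurations"

    # push it to a remote repository so the whole team can see it
    git remote add origin git@github.com:example/router-configs.git
    git push -u origin master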
So, GitHub is really pretty for changes. If you look right here, you can see all the changes in action. See, the red are deleted lines, the green are added lines. It's a very beautiful UI, very user friendly.
And then you can templatize your configuration files, because when you think about it a lot of things are the same, right. The same users need to log in everywhere, you probably have the same or very similar firewall rules for, let's say, all of your edge switches or all of your top‑of‑rack switches, you have, you know, ten web servers, five database servers, things like that. And computers don't make typos, which is great; humans make typos and computers can catch them. And one of the nice things is, you don't have to have these configurations pushed out automatically; you can have your configuration management make the config changes for you and then you can cut and paste. This works for all legacy gear, right, no matter how old it is. And if you are still afraid, if you don't trust the computers to make all the changes, this gives you an extra step where humans are still involved.
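(Before the Puppet example on the next slide, a minimal shell sketch of the same idea, using sed instead of Puppet and entirely made-up values:)
    # switch.conf.tmpl -- a template with placeholders for the per-switch values:
    #   interface lo
    #     address LOOPBACK_IP/32
    #   interface swp1
    #     bridge-access WEB_VLAN
    #   interface swp2
    #     bridge-access STORAGE_VLAN

    # render the config for one switch by filling in its values
    sed -e 's/LOOPBACK_IP/10.0.0.1/' \
        -e 's/WEB_VLAN/100/' \
        -e 's/STORAGE_VLAN/200/' \
        switch.conf.tmpl > switch1.conf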
So, I want to just show you a really simple configuration file and template example. I'm going to be using Puppet. We are defining a type called switch config; you can see it has variables for the loopback and a few different ports and VLANs. I wanted to keep it brief so it wouldn't take up too much space on the slide. Here is the actual template file itself. You can see it goes through all of the web servers and adds them to a bridge, goes through all the storage servers and adds them to a bridge, and has the loopback IP. Very basic stuff. And, ta‑da... this is the output of the configuration file. You can see all the variables we put in and it's all out there. Great. We have just transformed our configuration into code. Great news, right. There is the bad news, though: we're still stuck in silos. So raise your hand if your systems and network teams are separate?
All right...
Raise your hand if they work together.
Oh, a lot more work together than I expected. Awesome. Good job. Teams often have what I like to think of as ticket walls. A great example is when you are doing machine turn‑ups, maybe a new data centre. You are going back and forth between the network and the systems teams. Instead of working together, you know, the systems team throws a ticket over the wall to the network team: turn up this port. You know, the network team throws it back: port turned up. Systems team: oh, I forgot, actually, it needs to go in this VLAN, throw a ticket back. Network team: my goodness, I have 5,000 tickets and this is what I'm spending my time on. Throw it back.
So the systems team often winds up feeling resentful. Like the network team is holding them back from their network and the network team is like, I have so much better work to do, like you are just filling my day with junk. So, it causes resentment and it doesn't feel like we're working together. We're all boxed up.
This is where GIT branches can come to the rescue. Right. The best part about GIT branches is that you can give someone permission to make a branch but not to actually merge it back into master. So, there is a VLAN branch, right; cool, your systems person knows what port the server is plugged into. They can make a branch and send you a pull request to merge that branch back into master, so all you have to do is look through what they have done and hit approve or deny. Done. Sweet.
In GitHub, this is nice and easy. When I push a new branch out, it even puts up this great little button, "Compare & pull request". I click that and open a pull request. If I wasn't doing this in a presentation I could even write a little description like, hey, new servers, please be really quick, I'll give you scotch...
And then I get to approve or deny the request. Obviously I'm going to click that nice little merge button. Ta‑da, done. So the best part about this is, GIT allows you to restrict who can approve changes, so you get to have your cake, of not having to do the work yourself, and eat it, too, because you are still getting the security from having silos, but now you have removed the ticket walls, so now we are really working together.
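(Roughly, the systems person's side of that looks like this on the command line; the branch and file names are invented, and the review and merge step happens in the GitHub web UI:)
    # make a branch for the change
    git checkout -b add-server-vlan

    # edit the switch config for the new server's port, then commit and push the branch
    git add switch1.conf
    git commit -m "Put port swp10 into VLAN 200 for the new web server"
    git push -u origin add-server-vlan

    # GitHub then offers the "Compare & pull request" button;
    # the network team reviews the diff and merges it, or asks for changes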
And we are really happy. So, there's a lot more tools you can use. I could make a whole talk just on those tools, but since we all want to go to lunch, I'm just going to mention two more things that I think can make the biggest impact on your day‑to‑day lives.
First is continuous integration systems. They are awesome. Travis CI is a great free online example and Jenkins is also great. A continuous integration system runs checks every time you check in your code. For example, typos. I am incapable of typing some phrases correctly. So, I actually have a list of words that I commonly misspell, and when I check in my code, if Jenkins finds any of these words it doesn't accept the change. You can start writing tests around your configurations to make sure that everything is correct, because even though you still have another human review your code, you know, we're all human, we haven't always had our coffee.
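(As a sketch of what such a check can be, assuming a file of commonly misspelled words; Jenkins or Travis CI simply runs the script and rejects the change if it exits non‑zero. The file and directory names are made up:)
    #!/bin/sh
    # fail the build if any config contains a word from our list of common misspellings
    # misspellings.txt is one word per line, e.g. "recieve" or "interafce"
    if grep -r -i -f misspellings.txt configs/ ; then
        echo "Found a known misspelling, rejecting this change"
        exit 1
    fi
    exit 0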
And virtualising your network is amazing. Every vendor has their own virtualisation platform. Sometimes they're free, sometimes they like to charge you all the money. GNS 3 is a great free open source option that has VMs for almost every platform out there, so now when you make changes you can try them out on your virtualised network to make sure they work. Like, you don't have to trust that a firewall change will actually work correctly, and maybe you are doing a crazy new BGP scheme that is getting implemented in a 2 a.m. maintenance window. You want to make sure it works, right. You are not going to be awake enough at 2 a.m. to debug and figure out what's wrong and change it. You can test it. You wake up at 2 a.m., click the button and go back to bed.
And then the last advanced topic is to take the leap of faith and install the automation clients or have automation push the configurations automatically. So, if your vendor's gear doesn't support having one of these clients like Puppet or Chef or Salt, bug them, call them now at lunch.
And now, we are all happy, right...
So some of you might be thinking that's sounds great, everything is going well, what happens when something goes wrong?
So, remember this file here. Now, we made a typo, right. The loopback has a /3. I know of a couple of big websites where someone typing a /3 instead of a /32 has brought down everything. It's common, we're all human. Now, I'm sure you all use peer review, but maybe your reviewer didn't have their coffee, so you push the configuration out to the website, think it's going well, and...
The website is down, your pager's on fire.
So, your pager is ringing and you might not have the time to go rummaging through all of the changes to figure out where this went wrong, right? GIT roll‑back will come to the rescue. Since GIT keeps track of every incremental change, a GIT revert tells GIT to undo a change so you are back in a safe state. And there is an easy way to do a GIT revert.
Just "git revert HEAD~0", which reverts the last change. It keeps the parent branch as the master and, all done, very easy. Though, just remember that when you make a commit, it's only reflected in your local branch, so you need to make sure to push the fix back to the remote branch. All better, and that was really quick, right.
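(That really is all it is; the remote and branch names here are the usual defaults:)
    # undo the last commit with a new commit, so the history is preserved
    git revert HEAD~0        # HEAD~0 is just HEAD, the most recent commit

    # the revert only exists locally until you push it back to the remote
    git push origin master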
So now we have fixed the problem and we have a little breathing room, and this brings up my favourite part of the DevOps culture, which is the postmortem. If you don't have a postmortem to tell you what went wrong, you are just wishing that it won't happen again.
So, postmortems are great. It's all about finding ways to improve your process to make sure that technical mistakes don't happen again. It's an opportunity to find a way to make it better.
It is not about assigning blame to people. That is very important. Because you want everyone to be honest and feel that they can be honest because, really, we have computers, they can help us move around people.
You have to be completely accurate. Let's say the real root cause is that you were drunk, you went to the data centre at 3 a.m. and you bumped into the rack while you were trying to do a hand‑stand. You need to own up to that. Or, in reality, maybe you had a typo. And the best part is: how do we enact institutional change to prevent this from happening again? For the previous examples, we could install breathalysers on the data centre doors, right. Or maybe, if it was a typo, we can add a test in our continuous integration system to double‑check for that.
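(For the /3 example, such a test could be as small as the one below; the interface naming and config layout are assumptions about what your configs look like, not something from the slides:)
    #!/bin/sh
    # institutional change from the postmortem: loopback addresses must always be /32
    # flag any loopback address line whose prefix length is not 32
    if grep -A1 '^interface lo' configs/*.conf | grep 'address' | grep -v '/32' ; then
        echo "Loopback address without a /32 found, rejecting this change"
        exit 1
    fi
    exit 0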
We're all human, or puppy. We want everyone to feel comfortable when they are messing up, if they mess up, or really when they mess up, because we're all human and we all fail.
So, now we have got rid of a lot of the drudge work, and we have moved from this craziness to the nice relaxed world after automation.
So, just a little reminder.
Step 1: Start trying to use GIT. This website right here has a very easy tutorial.
Then, transform your infrastructure into code using templates.
Break down the ticket silos and the dreaded ticket wall by encouraging your co‑workers to help you with configurations using pull requests.
Then there is the extra credit part, you can use integration tools and virtualisation to test all of your configurations before they deploy.
And, now, you are on the journey to team work and NetDevOps.
Any questions?
(Applause)
AUDIENCE SPEAKER: Aaron Hughes. Hi. First, I think this is a great presentation and thank you for putting it together; I enjoyed the humour and I actually lifted my head from the laptop and watched it.
Second, I think this would be a great start for a best practice document, and I think some useful examples of some templates would be very good, as would describing the differences between, say, GIT and SVN and RCS, etc., and if you use things like... things versus known working tools with vendors. But I think this is really great. I think if we put some effort into putting some meat into it, people could follow a 101 guide on how to actually do this, beyond just, hey, take a look at these things and work with your systems and network teams to automate things.
LESLIE CARR: When we're both back in California, you just bug me to get started on this.
AUDIENCE SPEAKER: Blake [Wizeo]. As Aaron said, thanks for the engaging talk. Just a quick comment about network automation tools in particular, certainly with Ansible and possibly with others. You talk about integrating this with your network devices and so forth; with Ansible, just because that's the one I'm familiar with, the client on the device is called SSH, so any router, including crappy old Ciscos, can do this. There are a few tools out there, like [DaiPon] for example, that can even take something like an Ansible model and translate that into SNMP code, so it's like read a variable, get a variable, set a variable. Your gear can do this.
LESLIE CARR: I forgot to mention [DaiPon], which has some very strong supporters here, so, yeah, it's another great tool you can use.
AUDIENCE SPEAKER: Benedikt Stockebrand. I'm doing more work with enterprises, and apparently you are very, very lucky, because you are apparently dealing with people where it's possible to convince them to use things that they have never used before and that they don't understand; the attitude is, "if I had wanted to be a programmer, I would have chosen that as a career". Depending on the particular situation, it can be really, really difficult to convince people to use this, to the point that you have to somehow get rid of them and replace them with somebody else who is more, say, open‑minded about actually doing some serious programming, like shell script or so, to make this work. That can be really difficult in a lot of places.
LESLIE CARR: One thing I really like about tools like Ansible and Puppet is that they make it a lot easier to use. I feel like the barrier to entry is lowered, and sure, there are some people who will never do that, which is sad. But also, a lot of times, because organisations have these silos, they don't realise that their co‑workers already have the experience and can help teach them. We are like, we don't talk to the server people. If you did, you might find out they are already using one of these tools, and so you already have an in‑house expert who can help you through a lot of this. So, really, I find talking is really your greatest ‑‑ well, talking is your second greatest tool. Googling answers for when something isn't working is your first greatest tool.
AUDIENCE SPEAKER: Benedikt: Fair enough. But from my experience with Puppet, it's quite a pain to use. It's not that I couldn't write something better, but basically only if somebody sponsors me. And getting started with that, for somebody who is familiar with Cisco more than anything else, and possibly somebody who only knows Cisco, which is even worse, is a huge pain. I have had this happen: you show somebody cluster SSH on a Linux box, so you can do this on multiple machines at the same time, and the only response you get is, oh, do you have that for Windows? Then, you know, you really have a problem.
LESLIE CARR: That is a little harder, I find Ansible seems to be the easiest for people to pick up because it really is sort of a fancier cluster SSH model, but there are some people that ‑‑ you still have to have a willingness to learn and experiment with new tools and occasionally fail, and I feel that failure is a great way to learn.
BENEDIKT STOCKEBRAND: Then there is another thing. When people are under stress, they tend to fall back to what they already know. And if you are working in the average IT environment where you have basically about half an hour's worth of work per day, actually, it's fine; but if you are under permanent stress because of this legacy cruft sort of stuff you have to take care of, it happens quickly that people, left to themselves, fall back to their old ways, and that's really, really difficult to keep people away from.
LESLIE CARR: In that case, you really need to have your management and company give you support for the time that you need to transition to a newer model, because there is a slightly higher initial investment of time and energy, but then, in the end, you wind up using much less time and energy.
So, it does require some support and some saying, look, you know, we're going to have to take some time to learn this and do this, but then after we have, all of our day‑to‑day maintenance will take a tenth of the time because I can do it on every switch and router at once. It's not snap your fingers, but I do think that the huge gains of both productivity and the safety you can get from having computers test your code before it's pushed out are really worth it.
CHAIR: Thank you. We have another question...
AUDIENCE SPEAKER: Alexander Lyamin, Qrator Labs. Can you imagine a programmer debugging an application in a live environment? Why should it be different with networks? Now we have nice guidelines for how to operate networks in the modern way, but the next question that calls for a solution is: how do we get the debug information? How do we simulate our networks so we don't actually debug on live applications and a live environment? The sad truth is that we are quite, quite prehistoric in the instrumentation of how we manage our networks, and this is a really, really nice step forward in the right direction, Leslie. I hope more of them are coming. And we need a time series database, because storing all this data from your networks is really, really hard and there is no good time series database. So, let's keep up the discussion on the mailing list.
LESLIE CARR: As a slight side note, I have just started playing with Elasticsearch ‑‑ well, Elasticsearch, Logstash and Kibana ‑‑ for storing logs and all of that, and you can put all of your metrics into Elasticsearch and output them with Graphite. But that's a whole other talk or three. I'd love to talk with you about that afterwards. And, as I mentioned before, GNS 3 is a free open source tool that's great for modelling networks. I know there are many commercial products out there which are great, but it's a lot harder to tell your boss, "I need a million dollars for this", versus, "boss, I need one server to use this on".
AUDIENCE SPEAKER: Hi Leslie, this is Anand from the RIPE NCC. Thank you for this great presentation. I was watching it and, you know, paying great attention to it, and I'm very happy to see that a lot of the ideas you described here are similar to what we do at the RIPE NCC. We use GIT, we use Ansible, and we use branches to test our configurations on, you know, one or a subset of servers, and Ansible allows us to push configurations to selected hosts for testing and then roll them out to the entire network. So I'm very happy to see that our ideas are similar, and I would like to encourage more people to consider this approach because it is a flexible, very easy to work with approach.
CHAIR: Okay. Thanks a lot.
(Applause)
LESLIE CARR: One last thing taking off my speaker hat, putting my Programme Committee hat on, right after this at lunch we are going to have ‑‑ we have two tables reserved for net girls, this is an organisation for women in the networking industry, so, hopefully the tables are labelled. If not, my hair should hopefully stand out. So come find us and we can have a lovely talking lunch.
CHAIR: Okay. Thank you very much. And this concludes the session. Thank you very much, we will continue at two o'clock in this room.
(Lunch break)