Defend Your Own System Through Binary Recompilation
2:57PM Jul 25, 2020
Our second talk for the day is Defend Your Own System Through Binary Recompilation, by David Williams-King. He's currently finishing his PhD in binary security at Columbia University, and is also involved in a startup called Elpha Secure. David is going to talk about binary transformation, and also wants to introduce a new tool called Egalito. David, let's begin.
Hi there, I'm David, and I'm here to talk about defending your system through binary rewriting and binary recompilation. I'm currently working at a startup in New York called Elpha Secure, but the work for this talk was done during my PhD, which I'm just finishing up now at Columbia University. I studied a lot of different binary security techniques, and binary rewriting techniques as well. So let's get right into it. I'm going to talk a bit about what binaries are, in case you're not too familiar with them, then do a quick overview of what current binary security looks like, and then we'll go into two different types of binary transformations.

So what is a binary, really? A binary is just the compiled version of code that was written in a compiled programming language like C or C++. You can see the example there: gcc source.c -o binary. That's how we get a binary file, or an executable file, or machine code, out of a C source file. Binaries are very widespread and they represent a ton of the code that actually runs on a computer today, partly because even if you're using an interpreted language like Python, the interpreter itself is probably written in a compiled language and is therefore a binary. You can see that with the machine code at the bottom there. Current machines run machine code and assembly code; they cannot run higher-level languages directly.

There are many different kinds of binaries. I talked about executables already, but the temporary object files created by gcc are also a kind of binary (same file format, on Linux at least), and shared libraries and static libraries could be considered binaries too. You also get different amounts of metadata included in a binary. Debugging info makes the program much larger, but it's great for a developer running a debugger; it's probably not going to be there in a production environment, though, because the executables get so large. You might have symbols, you might not, and the code might even be obfuscated: if you're a company trying to hide how your code works from your competitors, then you're probably going to strip the symbols, and you might obfuscate the code as well.

A lot of tools like debuggers and profilers operate on binaries and not on source code, because the source code is what the programmer wanted, but the binary is what is actually running, and those are not necessarily the same thing. Sometimes there are compiler bugs or architecture bugs or other things that get in the way. Unfortunately, compilation is a one-way process: you can go from source code to binary, but you can't really go from binary back to source code. The compiler gathers a bunch of information during compilation, uses some of it, and then throws it away. So if you try to go backwards, you have to infer a lot about the original source code, and it's quite a difficult process. In general, binaries are meant to be run, and they're easy to run; they're not meant to be modified or understood directly, and it's quite hard to do either of those things.

Let me walk through the actual requirements for a binary to be able to run. The first is that all of its dependencies have to be available. So if it depends on shared libraries, those have to be present on the system. (You can create a binary that is statically linked, so that it includes the libraries and doesn't require them to be somewhere else on the system.)
Secondly, the architecture that you're running on has to be compatible with whatever the binary was compiled for. There's a little bit of leeway here; for instance, a 64-bit x86 machine can usually run 32-bit x86 code. But for the most part the architectures have to match. And finally, the operating system has to support the system calls that the binary expects to be using. This isn't as stringent a requirement as it sounds, because sometimes operating systems implement the system call interfaces of other operating systems. I have some examples at the bottom there. FreeBSD actually reimplemented the Linux system call interface, because Linux binaries are so prevalent, so now you can run Linux userspace binaries on FreeBSD. Wine (as in "Wine Is Not an Emulator") lets you run Windows binaries on Linux, for instance, and the way they did that was to reimplement the whole Windows system call interface, along with a couple of libraries that would normally be provided by Windows, by Microsoft. Even Windows got in on this: they reimplemented the POSIX system call interface, or rather the Linux system call interface, in order to get Ubuntu on Windows working, so you can now run Linux userspace binaries on a Windows machine.
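As a concrete way to poke at all three requirements, here is a minimal, hypothetical example; the program and the commands in its comments are my illustration using standard Linux tools, not something shown in the talk:

```c
/* hello.c -- a tiny program for poking at the three requirements above.
 *
 *   gcc hello.c -o hello          # dynamically linked: `ldd ./hello` lists deps
 *   gcc -static hello.c -o hello  # statically linked: no shared libs required
 *   file ./hello                  # shows which architecture it targets
 *   strace ./hello                # shows the system calls it asks the OS for
 */
#include <unistd.h>

int main(void) {
    const char msg[] = "hello from a binary\n";
    write(1, msg, sizeof msg - 1);  /* ends up as a write() system call */
    return 0;
}
```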
While they're easy to run, binaries are hard to understand. The main reason is that they have free rein over their own private virtual address space. They can do whatever they want: they can load code from libraries and other files, and they don't have to follow normal function calling conventions, register conventions, or typical data structures. They're there alone in their own personal Turing machine that will run arbitrarily many instructions, and they can do whatever they like. They're basically a black box; you can't really see what's going on inside. System calls are the only interaction with the outside world, and you can observe those. In fact, we can show (and we'll talk more about this later) that fully understanding a binary is equivalent to solving the halting problem, which is impossible.

That sounds grim, but what can we actually observe? Well, first, we can observe the system calls that a binary is trying to make; that makes sense, because they're its interaction with the outside world. The Linux program strace can do that for you. We can also modify or intercept the system calls, for instance with a debugger. Moving on to shared libraries: we can try to observe the shared library calls with something like ltrace. I have an asterisk there, though, because some binaries can be tricky; they don't have to follow the normal calling convention, and they might confuse something like ltrace. You can also analyze which libraries are required with the ldd tool, or change which libraries get loaded, say if you have multiple versions of the same library, or if you want to overload malloc for tracing purposes or something like that (I'll show a small sketch of that trick right after the demo). And you can drill down even deeper if you want and see the individual instructions that get executed, with a debugger like gdb or with rr, the tool I have written down there. But that again just tells you what instructions the binary is currently executing; it doesn't tell you what it will do in the future, and that's the hard part.

I actually want to show you a quick example of shared libraries. In this case, what I'm trying to do is take a binary from a really old Linux system and run it here. That system is about four years older than the one I'm currently on, so you wouldn't really expect this to work, and indeed it doesn't: it complains about a missing shared library. You can use the ldd tool, like I said, to figure out what shared libraries a program needs. It prints the name that the binary requested and then where that library is located on this system. There are quite a few here; this is a graphical program, so there are quite a few (there might be fewer for a command-line program). And you can see there are a few "not found" ones.
In fact, four of them. GLEW and SDL are graphics-related, and Boost, well, Boost is a C++ library and it goes through versions very rapidly, so it's pretty common to end up depending on a version of Boost that you don't have installed. But anyway, we could go run ldd on the original machine where Gource was installed, and that would tell us the paths to these files, and then we can copy them here. That's what I've done; you can see the libraries there. And then we can ask the loader, by setting this environment variable here (LD_LIBRARY_PATH): "hey, when you're looking for shared libraries, please also check in this directory." And that will actually make Gource able to run. Let's see it.
Yeah, so this is Gource running.
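About that malloc-overloading trick mentioned a minute ago: here is a minimal sketch of how it is commonly done with LD_PRELOAD. The shim name, the output format, and the build line are my assumptions, not something shown in the talk:

```c
/* mtrace_shim.c -- hypothetical LD_PRELOAD shim that logs every malloc().
 * Build:  gcc -shared -fPIC mtrace_shim.c -o mtrace_shim.so -ldl
 * Use:    LD_PRELOAD=./mtrace_shim.so ls
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

void *malloc(size_t size) {
    /* Look up the real malloc in the next library (normally libc).
       Caveat: dlsym itself may allocate on some libc versions. */
    static void *(*real_malloc)(size_t) = NULL;
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    void *p = real_malloc(size);

    /* Log via write(2) rather than printf, to avoid re-entering malloc. */
    char buf[64];
    int n = snprintf(buf, sizeof buf, "malloc(%zu) = %p\n", size, p);
    write(2, buf, n);
    return p;
}
```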
Now, about hardening. If you just google "hardening," you'll probably find configuration-level stuff first, like how to set up your firewall or how to configure your boot scripts on a Linux machine: just configuration to harden the machine and make it harder to attack. There are also a few defenses that exist at the compiler level, like guarding against buffer overruns, enabling position-independent executables, and so on. These are all fairly minor defenses that help a little bit, and they have essentially zero percent overhead. So eventually compilers did implement them, and now most Linux distributions will enable them by default on any programs that are not broken by these transformations.

That's at the compiler level, but unfortunately attackers are much more inventive than that. Let's focus on memory errors for a minute, which typically come from memory-unsafe languages like C or C++. What does an attack look like? There are usually two stages. First, the attacker tries to get an initial foothold. This might be by doing a buffer overrun to overwrite a frame pointer, or a return address, or a saved function pointer somewhere: just getting the program to start doing something it was not meant to do. There are defenses, which we call preventative defenses, that target this initial step; they try to make getting that initial foothold really difficult. The second step for an attacker is to escalate to arbitrary code execution: they want to go from that initial foothold to running whatever they want on the target machine. One way to try that is to inject some new code, jump to it, and start running it; due to existing operating system protections, that's kind of obsolete now and doesn't really work anymore. But you can take it to the next level and do return-oriented programming, or ROP, where you have a sequence of code you want to run and you find small pieces of it within the existing code. You say: if I run this bit, then this bit, then this bit, it actually ends up carrying out the attack I want, or the payload attached to the attack. You can also try to make this second step difficult for the attacker; those defenses are typically called mitigations. The attacker did get an initial foothold, but you try to make it very difficult for them to get to arbitrary code execution.
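To ground the "initial foothold" idea, here is the classic shape of that bug in C, along with compile lines showing the compiler-level hardening mentioned above. The file and program names are my own illustration:

```c
/* overflow.c -- the textbook stack-smashing foothold (illustrative only).
 * Unprotected build:  gcc -fno-stack-protector overflow.c -o vuln
 * Hardened build:     gcc -fstack-protector-strong -fPIE -pie overflow.c -o safer
 */
#include <string.h>

void copy_name(const char *input) {
    char name[16];
    strcpy(name, input);   /* no bounds check: a long argument overruns the
                              buffer and overwrites the saved return address */
}

int main(int argc, char **argv) {
    if (argc > 1)
        copy_name(argv[1]);
    return 0;
}
```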
So, in the research arena, there are a lot of different defenses out there.
I want to mention, though, that code reuse is super powerful. The same way a binary is a black box that can do whatever it wants because it's a Turing machine, once an attacker gets an initial foothold with code reuse, they also have access to this Turing-complete code sandbox that can run arbitrarily many instructions to do whatever they want. And to make life even harder for a defender, there are even automated ROP compilers that will take the sequence of code the attacker wants, and figure out exactly where those pieces are and where to run them to carry out the attack. It makes things extremely easy for an attacker.

Alright, so what can the defender do? There are a couple of different categories of defenses: some are preventative, some are mitigations. I want to talk a little bit about one preventative technique, control flow integrity, and one mitigation, re-randomization.

So, here's control flow integrity. The idea is that for any indirect jump, you want to make sure it's going somewhere reasonable. If you have a call instruction, you want to make sure the return instruction is going to go somewhere reasonable, like immediately following a call instruction, for example. In the example on the slide here, there's a sort function that's called from two places, and this sort function calls a function pointer, a less-than or greater-than comparator or something like that. This is an indirect call, and the returns are also indirect, so CFI would be very concerned about making sure they go to and from valid locations, so that the return can't land in the middle of some function and start off a chain there, for example. (I'll sketch the shape of this code in C below.)

Unfortunately, control flow integrity comes in a couple of variants: coarse-grained, where you do some checks but you're not too strict about them, and fine-grained. Here's an example of coarse-grained going wrong. This return statement is supposed to return to here, the point following its own call, but maybe it's allowed to return over here, or over here. If your check is just "are you jumping to somewhere that immediately follows a call instruction," then those would be allowed. And there are other cases where coarse-grained control flow integrity doesn't really work. It might sound like a small detail, but that small amount of slack is enough for an attacker to get an initial foothold and then build a successful attack on top of it. And as a preventative measure, if you get around CFI, you're home free; you can do whatever you want. So once it's broken, you're kind of out of luck.
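Here is my reconstruction in C of the code shape on that slide (the talk shows it only as a diagram). A sort routine reached from two call sites invokes a comparator through a function pointer, so the call is indirect, and so are the returns, which is exactly what CFI instruments:

```c
/* cfi_shape.c -- reconstruction of the slide's example: indirect calls and
 * returns that a CFI scheme would check. Names are illustrative.
 */
#include <stdio.h>
#include <stdlib.h>

static int less_than(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

static int greater_than(const void *a, const void *b) {
    return *(const int *)b - *(const int *)a;
}

int main(void) {
    int xs[] = {3, 1, 2};
    /* Two call sites into qsort; inside, the comparator is reached through
       a function pointer, so that call is indirect, and so is its return. */
    qsort(xs, 3, sizeof xs[0], less_than);
    qsort(xs, 3, sizeof xs[0], greater_than);
    printf("%d %d %d\n", xs[0], xs[1], xs[2]);
    return 0;
}
```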
Now let's talk about a mitigation technique instead: re-randomization. First, the attack. I already mentioned ROP, where you're doing code reuse and stitching together small pieces. There's an even more powerful variant where you don't know the code layout in advance, and you read it at runtime. Let's say the target program is sitting across the network from the attacker; maybe it's a web server. The attacker will first trick this program into returning its own code (there are a lot of memory disclosure bugs that can enable this, and that sort of counts as an initial foothold, I guess). Then the attacker has the code and can say: "hey, if I run this bit, then this bit, then this bit, and so on, I'll carry out the attack I want." You create this ROP gadget chain, inject those addresses back into the original program, and the attack has succeeded.

The idea with re-randomization (the defense is called Shuffler; it's actually my work from a couple of years ago) is to re-randomize the layout of the program you're defending: move the functions to different addresses every couple of milliseconds, maybe every ten milliseconds, for example. Then if the attacker comes along, reads the code, and runs a ROP compiler, well, a ROP compiler actually takes a bit of time to run; it doesn't finish immediately. In that time, the defender has moved all the code around, so by the time the attacker finishes the computation and says "I'd really like to jump here," the code is no longer there. And that means the attack basically fails. I should mention, of course, that this mitigation technique also has its own weaknesses.
So, okay. We talked about binaries having these interfaces, right? There's a system call interface that you can introspect on and deal with, and there's a shared library interface where you can try to see what's happening and try to modify what's happening. But real problems tend to arise really deep in the assembly. They don't happen at these nice boundaries; they happen really deep inside the code. For example, there might be an integer overflow happening, or you might be able to pass a ".." to a web server and end up reading every file on the file system, or maybe there's a compiler or architecture bug like Spectre or Meltdown, where just a small sequence of instructions ends up causing a security problem. Most problems, like I said, are not at these nice clean boundaries; they're really deep inside.

Okay, so we have to deal with the binary itself, but it's hard. Here's why: modern computers are all von Neumann machines, which means there's a single bank of RAM that stores both code and data, so a binary contains both code and data, basically. If we're trying to reverse engineer that binary and turn it back into assembly code or whatever, we have to decide which of those bytes are code and which of the bytes are data. So here's a problem. Say there's a loop, and it's running, and it's code, and we're not sure whether the bytes after it are code or data. If the loop never terminates, then maybe they're data; but if the loop terminates and execution keeps running past it, then of course they're code. So you're basically saying: these bytes are code if and only if the loop terminates. Which means we would be trying to solve the halting problem in order to correctly disassemble a binary. That's undecidable, and therefore binaries are really challenging to deal with. (I'll sketch the code-versus-data ambiguity below.)

Of course, if you just have a small localized problem at a specific virtual address, then you can go in and directly modify the code bytes. You can even do it with a hex editor if you're brave: just go in and change a couple of code bytes. That's how you can patch a known vulnerability. But often what we want to do is not fix one problem but guard against an entire class of problems. Say we want to fix all buffer overruns; the way you would probably do that is by finding every buffer and adding an extra bounds check to it. This type of systemic security transformation is called binary hardening: you want to harden against an entire class of attacks. But again, this would require us to go through the entire code, find every buffer or whatever, and add a check. And because disassembly is undecidable, you really can't do this at scale. You can do the localized fixes, but you can't do hardening, at least in theory.
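Here is a tiny illustration of that code-versus-data ambiguity. This is my example, not the talk's; it is x86-64 Linux only, the function-pointer cast is technically non-portable, and W^X-hardened systems may refuse the writable-and-executable mapping:

```c
/* bytes.c -- the von Neumann ambiguity in miniature: the same bytes can be
 * treated as data or executed as code. The bytes encode "mov eax, 42; ret".
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    unsigned char bytes[] = {0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3};

    /* Interpretation 1: data. Just print the byte values. */
    for (size_t i = 0; i < sizeof bytes; i++)
        printf("%02x ", bytes[i]);
    printf("\n");

    /* Interpretation 2: code. Copy into an executable page and call it. */
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;
    memcpy(page, bytes, sizeof bytes);
    int (*fn)(void) = (int (*)(void))page;
    printf("as code: %d\n", fn());  /* prints 42 */
    return 0;
}
```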
One way we can solve this is, of course, by introducing another level of indirection: we can virtualize the original program. We create a virtual CPU; instead of handing instructions straight to the real CPU, we process them a little first, changing instructions if we want to, and then let the CPU run only a small number of them, just one basic block. When that's done, control goes back to the virtualization framework, which grabs the next basic block and says, "okay, see if you can run this now," so there's a back-and-forth going on.

You may have heard of full system virtualization before; that's things like VMware, KVM, and VirtualBox. What those do is provide a virtual CPU, then you run an operating system on that virtual CPU, and then you can run userspace programs on that operating system. But that's overkill for what we want. We just have a binary and we want to virtualize it, so we can stay in user space and do process-level virtualization. Valgrind is a pretty famous tool that does this; it's great for finding memory leaks. Two other tools are DynamoRIO and Pin. They're very similar, and DynamoRIO is the open-source one, so I'm going to talk a bit more about that.

And actually, this will be my second demo, so let's get into it. Imagine you have a program like this, and you want to run it under DynamoRIO; you just do this. And it runs pretty much normally. It was slower, but not so much that you could notice, even though DynamoRIO was having to take these basic blocks and go back and forth and run them one at a time. It's a very powerful tool, because you can write these little plugins that will do things for you; they can modify dynamic behavior. Here's an example of counting the function calls taken by ls: let's see, there are about 4,000 calls and 200-some indirect calls. What's more interesting is to run this tool and see how many basic blocks there were: 100,000 basic blocks, just for a little ls program. That's how many times DynamoRIO had to go back and forth and translate a basic block, so you can see why it ends up being expensive. A very powerful tool. I could say more about it, but let's go back to the slides for now.
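For flavor, this is roughly what one of those little plugins looks like. It is a sketch in the spirit of the sample clients DynamoRIO ships; the counting policy and the names here are mine. You build it against the DynamoRIO SDK and load it with drrun's -c option:

```c
/* bbcount_client.c -- minimal DynamoRIO client sketch: count how many basic
 * blocks DynamoRIO translates, i.e. how often the back-and-forth happens.
 * Build against the DynamoRIO SDK, then:  drrun -c libbbcount_client.so -- ls
 */
#include "dr_api.h"

static unsigned long bb_count;

static dr_emit_flags_t
on_bb(void *drcontext, void *tag, instrlist_t *bb,
      bool for_trace, bool translating)
{
    if (!translating)       /* don't double-count re-translations */
        bb_count++;
    return DR_EMIT_DEFAULT; /* leave the block's instructions unchanged */
}

static void on_exit(void)
{
    dr_printf("basic blocks translated: %lu\n", bb_count);
}

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    dr_register_bb_event(on_bb);
    dr_register_exit_event(on_exit);
}
```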
Now let's talk about how you can actually do binary transformation. We already saw DynamoRIO-style virtualization. Say we have this little program here, and we're trying to instrument this call instruction: we want to check for a null pointer, so that we're not doing a null pointer dereference through the call. How can we do that?

I already mentioned binary patching. That's where you have a hex editor and you're brave, and you're going to modify the bytes directly. You have to find space for new code, and because there's no space to insert new instructions here, you have to erase some previous instructions and replicate them elsewhere, or whatever. So it's a little messy, but it's very simple, and if you have a single change you want to make, it's definitely the best approach. (There's a small sketch of this approach below.)

We talked about virtualization already. The main point here is that functions exist at a certain address, and then the virtualization framework moves them to a new address so that it can add new instructions; it then has to translate all the addresses back, to pretend that the functions still exist at their original addresses. And of course it only runs one basic block at a time before going back to the virtualization framework to figure out what to do next, so there's a bit of overhead there for sure.

And then there's binary recompilation, which is the nominal subject of my talk. The idea here is that you copy the functions to new addresses, but at the same time you can insert whatever code you want into the middle of a function. Now, that changes the size of the function, and that's a bit of a problem, because it means the function can't exist at its original address; you have to shift everything down. And that means you have to find all references to functions, because you have to update them from old addresses to new addresses. Basically, virtualization defers that decision to runtime; you're dynamically translating addresses. With binary recompilation you do the whole process statically: you figure out in advance where all the functions are, so that you can change their addresses. If that sounds hard, it is.

Let's do a little comparison here. Patching: again, very simple, and great if you just have a single bug to fix, but if you do it at scale your code will look like Swiss cheese, and it's pretty inefficient. Virtualization, like DynamoRIO: the really big pro is that it works on any binary, because it doesn't try to guess what the binary is doing. It doesn't try to figure out where the code is and where the data is; it just asks "what's the next basic block we're going to run?", runs it, and goes back and forth. So it's slow, but it works on every binary. Recompilation: Egalito is the tool that I wrote, which you'll see more of. The big pro is that it's super powerful; it's as if you were back in the original compiler's backend, or inside the linker. You can insert code wherever you want, and you can choose addresses for functions and lay them out however you like. The big con is that Egalito, at least, doesn't work on all binaries, which only makes sense: it's trying to do this process statically, and we know that disassembling binaries is undecidable, so of course it's not going to work on all of them.
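As an aside, the "hex editor if you're brave" approach fits in a few lines. This hypothetical utility just overwrites bytes at a given file offset; knowing which offset and which bytes is the actual hard part:

```c
/* patchbytes.c -- hypothetical brute-force binary patcher: overwrite bytes
 * at a file offset. Usage: ./patchbytes <file> <offset> <hexbyte> [hexbyte...]
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    if (argc < 4) {
        fprintf(stderr, "usage: %s <file> <offset> <hexbyte>...\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r+b");   /* open for in-place modification */
    if (!f) { perror("fopen"); return 1; }
    long offset = strtol(argv[2], NULL, 0);  /* base 0 accepts 0x... offsets */
    fseek(f, offset, SEEK_SET);
    for (int i = 3; i < argc; i++) {
        unsigned char b = (unsigned char)strtol(argv[i], NULL, 16);
        fwrite(&b, 1, 1, f);           /* stamp each byte over the old code */
    }
    fclose(f);
    return 0;
}
```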
So here's the intro slide for Egalito: it's a binary recompiler. It's part of my PhD research; I created it over the last few years of my PhD. It goes from binary, to an intermediate representation, and back to binary again, so it lets you do this recompilation process. The real goal was to let researchers create new security transformations that are too slow or too crazy to get merged into compilers. And like I said, it doesn't work on every binary; in fact, it relies on position-independent binaries.

Let's go into that a little more. Binaries used to be position dependent. Basically, inside a virtual address space, a binary used to say "I have to get loaded at this address," so that it could hardcode addresses and whatnot. And because every binary has its own virtual address space, it doesn't step on anything else, so that's fine. But from a security perspective, if a binary can get loaded at any address, then all its pointers are less predictable, and that's great. Position independence used to be expensive on 32-bit x86; on 64-bit x86 it's now almost free, and once you hit that magical zero percent performance overhead, of course compilers start supporting it. As of 2017 or so, pretty much all Linux distributions finally produce position-independent code by default. So it's pretty reasonable for Egalito to require that. We also require some exception unwind metadata, which is also now in every binary, just so that, for instance, C code can interact with C++ code.
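You can see position independence plus ASLR directly with a one-liner (my example, not from the talk): build as a PIE and the code lands at a different address on every run:

```c
/* where.c -- print the address of main; with a position-independent
 * executable and ASLR, it changes between runs.
 * Build: gcc -fPIE -pie where.c -o where   (then run it twice and compare)
 */
#include <stdio.h>

int main(void) {
    printf("main is at %p\n", (void *)main);
    return 0;
}
```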
Okay, let me talk a bit about what it actually looks like. On the right-hand side we have what a compiler does: you start from source code, you parse it, and you create an intermediate representation, which is where optimizations take place. Maybe you then create a low-level IR, to assign registers to certain variables and so on, and finally you output binary machine code. Egalito starts from the machine code, creates its own IR, and can then create new binaries. That IR can be modified internally by different security transformations or whatever else.

We also create a higher-level IR, because the low-level one remembers what all the registers are and all the exact instructions and so on, which makes it hard to figure out what's going on. If you're trying to do an analysis, it's much nicer to have a higher-level representation. But at the same time, there's no down arrow here, because we don't allow the user to modify this high-level representation. Modifying it might mean that a lot of instructions have to get changed, and we wouldn't know how to do that efficiently; usually if you change something there, you end up with very different, much slower code, because the compiler was very good at optimizing. So we have this dual-level representation: the low-level one is read-write, the high-level one is read-only.

Okay, I've been talking about disassembly and how hard it is. Yes, there's the problem I mentioned of differentiating code versus data; there's some metadata that says "this is the text section and it's code," and most other stuff is data. But we have two other main challenges. First, as I said, we have to find all code pointers, because we need to modify them to point to new addresses, and we have to do that statically. This is where position-independent code really helps: it gives us metadata that says "here's where all the pointers are." Second, we had to identify jump tables, and how many entries are in each one. This is another example of the compiler throwing away information: the compiler is perfectly well aware of how many entries are in each jump table, but it only outputs bounds checks on about half of the jump tables in a binary, because it can prove, through various means, that the bounds check is not actually needed. That makes our life really hard; only about half the time do we have a definitive answer as to how many entries are in a jump table.

Anyway, let me show you a quick demo of Egalito now. Quickly, though, let me show you the speed difference between DynamoRIO and Egalito. This is a Fibonacci program; it takes about 2.6 seconds to run. If I run it now with DynamoRIO, all the basic blocks have to be translated, back and forth, back and forth, and it takes about twice as long to run. And if I instead use Egalito to transform it and then run the program again: 2.6 seconds. It's exactly the same speed, basically. So that's the speed difference between binary recompilation and virtualization.

Egalito is open source. It's kind of like the LLVM framework, but for transforming binaries. There's a lot of existing code, especially existing passes, which are like compiler passes; that's how you normally write code for Egalito. A pass takes in some IR, transforms it however it wants, and passes it on to the next recompiler pass. So yeah, it's as if you were writing some C++ and building new defenses. But if you just want to use Egalito, you can use some of the existing defenses.
One thing I wanted to show was running the ls program here, just transforming it. I didn't add any defenses; I just ran it through the recompiler, and we see that it still works. And of course you can run a bunch of the supported transformations: retpolines, control flow integrity, shadow stacks, whatever. Let's try control flow integrity. There we go; ls still runs. And let's have a look at the code really briefly. What the control flow integrity pass does is find every function that could be called through an indirect invocation, like a function pointer, and it adds a special instruction, endbr ("end branch"), at the beginning of that function. Then every indirect call in the program is modified so that, before it happens, you check that the place you're about to jump to actually starts with an endbr instruction. This is basically a coarse-grained control flow integrity scheme: we make sure that every call is going somewhere somewhat reasonable, but really it could still go to any function whose address is taken.
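In C terms, the inserted check behaves something like the sketch below. This is my illustration of the idea, not Egalito's actual generated code, which is assembly; endbr64 encodes as the bytes f3 0f 1e fa on x86-64:

```c
/* coarse_cfi.c -- the shape of the inserted check: before an indirect call,
 * verify that the target begins with the endbr64 landing pad.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static int is_valid_indirect_target(const void *target) {
    static const uint8_t ENDBR64[4] = {0xf3, 0x0f, 0x1e, 0xfa};
    return memcmp(target, ENDBR64, sizeof ENDBR64) == 0;
}

int main(void) {
    /* Whether main itself starts with endbr64 depends on compiler flags:
       gcc/clang emit it under -fcf-protection, the default on many distros. */
    printf("main %s with endbr64\n",
           is_valid_indirect_target((const void *)main) ? "starts"
                                                        : "does not start");
    return 0;
}
```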
Still, it's a pretty cool little transformation.
What else? Oh yeah, I want to show some results from running Egalito. Here are some speed comparisons against DynamoRIO and Pin, which are both pretty slow, pretty high overhead; again, that's because they have to go back and forth, back and forth, transforming the basic blocks, whereas Egalito does all of its transformations ahead of time and produces a new executable. So it has very good performance; in fact, if you turn on our optimizations, you can even be faster than the original in some cases.

And we have tested on a lot of different executables. This was 800 or so: I got the compiler to tell me what it thought was in each program, did the same with Egalito, and Egalito was right almost all the time. After that experiment, we took pretty much all of the programs in Debian, which is a lot, and tried analyzing them, and that worked pretty well too. That's the analysis half; actually changing the binary is a little less effective. Here we took a number of Debian packages and tried to see whether they would pass their package tests after the transformation. That's only about 60% right now. Part of that is because of brittle tests; some of them break if you change the size of the executables and things like that. But some of it is just because binaries are really strange, they're black boxes, and it doesn't work all the time. I wish it did, but that's the trade-off you make: Egalito is really nice and really efficient, but it might not work the first time you try it on a binary, so you have to work with the framework and see what the binary is actually doing. If you want something that just works right away, then you should use something like DynamoRIO; the virtualization there is slow, but it will work all the time, on every binary.

Anyway, if you want to try some of these evaluations yourself, or transform your own binaries, please do clone Egalito, or download our virtual machine and run some experiments. There are a bunch of different built-in tools; we saw the control flow integrity one in that previous example, and I won't say much more about the rest. Egalito was published at an academic conference, ASPLOS, this year, so you can go read our paper there; the video from that conference is a bit more technical, if you want to learn more. It's open source, so please download it, and if you use it you should join our Slack, where we can help you debug things and so on; just send me an email and I'll add you.

So, in summary: lots of binary defenses are not integrated into compilers, because they have more than zero percent overhead and because they're pretty complicated, so compiler maintainers don't want to be responsible for them. If you want to manipulate a binary, you can do so easily at interfaces like the system call layer and the shared library layer. But if you want to go inside and modify its internals, and you don't want to just patch it in a hex editor, then you have to either virtualize the instructions or use a recompiler,
if the binary is amenable to that. So, in conclusion: users really can fix binary-level issues and harden their own systems against attack if they so choose. That's all I have. Thank you very much for listening.
Well, we're back live, and I'm here with David. Thank you so much for that talk; it's one that really has some technical depth to it, and I appreciate you sharing your knowledge. Why don't you give a little summary, or highlight anything you'd like about the talk, and then I'll present one or two audience questions.
Sure, yeah. One of the main things I wanted to do with this talk was to give the audience an idea of the breadth of different types of binary techniques that exist, you know, not just talk about my tool, but also show some of the other things people are doing. And I also want to say that although binary analysis seems to be dying down a little bit, because computers are getting faster and it's easier to recompile code and so on, there's always a place for it, especially in larger organizations that really care about attacks, or for security researchers, or anyone who's extra concerned about the security of their devices.
Let's get to some audience questions, then.
Yeah, it's been a fascinating talk. I've been watching the live chat, and I invite people watching on the live stream: in the Matrix chat, there's still time to submit questions, and we have time for a few. This first one is about reproducible builds. The question is: how does this work fit in with reproducible builds, and is it something we would do instead of using reproducible builds?
So reproducible builds are when you make sure that if you recompile the same source code, you get exactly the same binary as output; there's no non-determinism in the compiler, basically. That's extremely helpful, especially if you're running a big open source project and you have a big compile farm building new versions of your code, like Debian or Ubuntu would: it doesn't matter which machine compiles the code, you still get the same binary as output. It's great for verification purposes as well, because you can show that no one messed with the code during the process. That's sort of orthogonal to what I'm trying to do here. You can make Egalito deterministic; in fact, it's deterministic by default, so if you were to rewrite a binary and do it again on a different system, you would get the same result. But most of the properties you think of as nice about binaries built this way, that the hashes stay the same, that you can do signing or signature verification, all of that goes out the window once you start doing binary recompilation. If I rewrite a binary, its hash changes, absolutely, and any signature mechanism that was used to prove it hadn't been tampered with is not going to work, because in fact it was tampered with. So I think these are orthogonal things: one is making sure the compiler wasn't messed with, and the other is changing the behavior of the code when you don't have access to the full recompilation pipeline from source all the way to binary.
Fascinating, yeah; the reliance on confirming that the code hasn't changed is clearly somewhat at odds with this technique. So we have another question, and I think you touched on it a bit: how do these techniques apply to systems using code signing?
Yeah, so it's a good question. Um.
So first of all, code signing can actually get in the way in multiple ways. I was asked at one point, "can I use your recompiler to modify code that runs on Cisco routers?", certain Cisco routers. And the answer was no, because they use hardware attestation all the way up, and they just won't run modified code. Unless we got cooperation from Cisco and got them to sign the modified binary, it just wasn't going to run.
So you always have that challenge.
But, you know, if the recompiler is integrated well enough into the system, you can just sign the binaries after they've been rewritten, or you can add your own signature mechanism to whatever is doing the signing checks. So I think it's a matter of how you prove that you're integrated into the system. And in the worst case, you can always go back to the original signing authority and say, "look, we're doing legitimate transformations here; we need a signature from you to make sure the code will keep running."
Yeah, so in some of these applications you really need vendor cooperation, because they're embedded systems or otherwise locked down. We have a last one that I think is more of a curiosity. You're talking here about your doctoral research, so you've probably read it: someone mentioned the Ken Thompson paper, "Reflections on Trusting Trust," which I can tell you're already familiar with; maybe you'd like to comment on it. I'm watching the chat, and for some of the audience this talk has a little more technical depth than they're used to or ready to deal with, but I know the Thompson paper is actually pretty accessible. Maybe you have some thoughts on it.
Yeah, for sure. Ken Thompson: very, very smart guy, obviously. I think the story starts a little earlier than that, though. Basically, when software was first being created and people started to see the military applications of it, one of the generals in the US Army asked: what if someone backdoors your compiler? What if someone modifies the compiler and makes it always put a bug into the resulting code? How would you notice? Because no one reads these binaries; you just hit enter, let the compiler do its thing, and run the result. You never investigate it. That was just sort of an academic concern for a while, and then Ken Thompson wrote about it. And it actually has happened in the past as well, where people backdoor the toolchain and it takes quite a long time to notice, because no one is auditing it. So I think what's very fascinating about binaries is that,
over time, people are realizing: hey, we need to run binaries, but we also need to sign binaries, we need to be able to verify where they come from, and we need to make sure they're not tampered with. There are a lot of different requirements people are starting to place on binaries. And as a security researcher, I think we need to add at least a few more, one of which is being able to disassemble and understand the binary. Let me give an example: say every binary shipped with a copy of some important metadata, like where the jump tables are and how many entries they have. You can do that in a way that doesn't really compromise the secrecy of the code, but it would make an analysis tool, like an antivirus program, able to fully understand the code. Imagine if an antivirus program could statically understand all the code in something; if you don't give it that ability, all it can say is "maybe this is suspicious." I really think binaries could move that way in the future, where they're easy to disassemble and easy to understand, because otherwise we really can't guarantee their security.
Yeah, that's a great answer
And I think this is, to me, a very thoughtful type of question as well. It's not really a debate; it's more of an analysis, and you really have to be thoughtful about it. We have another question I think we have time to take. Your stream dropped out for a few moments before, but I think we got through it. This next question is what you might expect: someone wants to read a little more about your work, and in particular they asked about the experiments with Egalito and Debian packages. Is there a link, or where can people find the best source for further information?
Yeah, of course. You can check out the academic paper; just search for ASPLOS 2020 and Egalito, or go to egalito.org and we link to it from there. The paper has a bunch of appendices that list various experiments, and our virtual machine actually has fully replicable scripts, so you can download the virtual machine, run a script, give it a long time, and it will download the source of all the packages, recompile them, and print the results at the end.
Fantastic. Well, I think we're at the end of our time, so we'll start getting ready for the next talk. We really thank you for being here. Good luck with finishing up all this research and your degree, and we'll look for you out on the technology and security scene with all these further tools and analyses. So thanks again very much, and we'll see you out there on the interwebs. We'll end the session here.