Mastering a data pipeline with Python - Robson Luis Monteiro Júnior
9:47AM Sep 26, 2020
Speakers:
Keywords: data, pipeline, python, code, architecture, needed, layer, engineer, important, talk, tools, test, monitoring, create, jvm, process, exception, spark, people, files
Sometimes it's boring to work alone here. Personally I have problems with that, my fault, so I like to have people around me to interact and talk with, because if I'm alone it's easy for me to get lost, to lose my focus, you know. Yeah. I mean, there are big challenges to working fully remotely. At Microsoft I had that option, so in general I worked two days at home and three at the office when I was at Microsoft, but now it's five days per week at home, and I live in a very small place, so it's quite boring, quite confusing for me.
Yeah, I can understand. I guess it's better when you have the choice. I mean, I'm also the same: when there are people around doing things, you know, sometimes I just take my book and go to Starbucks. I mean, even if there's lots of noise or whatever, I can read a book, you know, I can sometimes finish a book.
I know. And now that things are open again in Germany, at least twice a week I go somewhere, to a beer garden or some brewery or for some coffee. They still have some restrictions because of the coronavirus on sitting down and working from these places, but there are a lot of brewing companies here in Berlin; you can sit there, take a beer, take a wine or even a coffee, and you can use the Wi-Fi network along with the other people.
That's nice. Yeah, I also take a walk every day, like an hour or an hour and a half or something, in the park; there's a park near me. In Berlin there are lots of parks, so you might have that opportunity.
Yeah, that's exactly it, so I love this kind of thing. The one thing that I miss here in Berlin, for example, is the in-person meetups, because you can interact with your network, you can expose yourself. Most of the time, if not all of the time, the meetups take place virtually. In my opinion, I think for next year, 2021, everything will stay virtual; it's more comfortable, more convenient. Probably people will give up on having live in-person meetups or in-person conferences. Maybe PyCon US will be in person, because let's say they have already selected a venue for PyCon, but I think the satellite PyCons around the world will stay virtual for a while, I guess. I don't know, it's something that I see coming from living with the corona situation for so long.
Yeah, well, let's hope for the best about that.
Yes.
Looks like... actually, I'm not sure if we are live right now.
It's.
I think the other one will finish and we'll come here.
Okay. We're live, so basically this is like a bit of background before the talk, so if you want to join, just send your comments in the chat. Before that, oh,
hey there.
Okay. Yeah, I see that people are coming in right now, 50 people or something; it's increasing, people are coming in. Hi everybody. And feel free to shout out in the chat if you can hear us right now.
Anyone from South America there? By the way, Jennifer Cruz.
I think not.
Oh, I was looking at the wrong chat room right now, sorry. Did you click the session? Okay everybody, I'm so sorry, I was looking at the other chat room. Um, so, we are with Robson Jr., if I pronounced it well. And it is quite exciting, because we're going to talk about data pipelines, which is a very hot topic. And the thing is, it's a very new topic, that's why things are not, you know, established, which means the experience of the people working on those technologies is very important. Robson worked at Microsoft before and now he's working at GitHub, and he has amazing things to share with us. So on behalf of everybody: welcome, Robson.
Okay, thank you so much. Good morning, good afternoon, good night, depending on where you are around the world, so welcome to my talk. The idea of this talk is to share all the mistakes that I made when I transitioned from software engineer to data engineer, and the lessons learned, on the side of Python, which has followed me since I started programming. This talk focuses on Python; of course, data engineering is not just Python, but in this talk we'll stay within
the Python data world. Okay.
Again, first, thank you for attending my talk. My name is Robson. Just a second. Okay. Let me introduce myself a bit: I'm a developer, and since I started working I've worked with Python; I learned to program with Python. Of course, along a long career I transitioned to other technologies, but Python always followed me. I've been organizing conferences around the world, but especially in Brazil; I'm from Brazil, South America, and we have a big Python community in Brazil as well. I helped to organize one Python Brasil, in 2015. Those are my contacts: if you want to contact me, feel free to reach out to me on Telegram or Twitter or on my GitHub; my GitHub handle is the same, I forgot to put it here. Feel free to reach out anytime if you want to discuss, to bring any information, to correct me if I'm wrong; I'd love to discuss it with you. Right. The talk today is not about code, it's not about how you can create your code; it's about how Python fits into this data world. Python is well known in the data analysis and data science parts, but sometimes we lack content about Python on the data engineering road, and that's why I bring this topic. I'll talk about the anatomy of a data product, introduce the concepts of the Lambda and Kappa architectures, give a brief overview of the data pipeline itself, and of course cover where Python matters in this whole world. My goal is to help you plan and create data products in Python, because it's very important: if you already have knowledge of Python, if you're interested in Python, why not use Python to progress as a data engineer? Right. Let's understand the anatomy of a data pipeline. In the anatomy, you can see that everything generates data; all kinds of data are generated. Basically, we talk about volume, variety and velocity, because you have many different formats of data. In general you have things like logs and raw data; this part is called the ingress, when the data comes to you. Then you need to process this data, and you need speed, you need velocity in this processing, because you are talking about a huge amount of data. This part produces what we call jobs or datasets, part of your hot data. Right. And here you are handling a big volume of information, so the kind of information you are dealing with is something you need to care about from the beginning. When you talk about the process, you are talking about ETLs as well; ETL is the concept of extract, load and transform: basically, you get the data from one source, you load this data into another component of your architecture, you do some transformations on it, and you load it back into another place. That place is eventually called the egress: it can deliver your data via APIs or a database, an analytical database. And you need veracity: how you can trust your data, how you can manage your data. It doesn't matter if you have terabytes and terabytes of data if you cannot extract some relevant information, some relevant insight, from your data.
That is not enough for a data product. So you have the ingress, the process and the egress; basically, you have three steps.
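As a rough, hedged sketch of that ingress, process and egress shape in plain Python (the file names and fields below are invented, not from the talk), a toy pipeline could look like this:

```python
# Toy sketch of ingress -> process -> egress; file names and fields are made up.
import csv
import json

def ingress(path):
    """Ingress: read raw events (one JSON object per line) from a landing file."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def process(events):
    """Process: keep only valid events and derive a simple field."""
    cleaned = []
    for e in events:
        if "user_id" in e and "amount" in e:
            cleaned.append({
                "user_id": e["user_id"],
                "amount": e["amount"],
                "amount_usd": round(float(e["amount"]), 2),
            })
    return cleaned

def egress(rows, path):
    """Egress: load the processed rows into a CSV that downstream tools can read."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "amount", "amount_usd"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    egress(process(ingress("raw_events.jsonl")), "clean_events.csv")
```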
It works the same way as any other computer program, if you think about it: in a common program you have memory and files, you have functions and variables processing those files and that memory, and you output to a window, to another file, to an API. In the same way, here you have input, process and output. So the conclusion is: a data pipeline is a computer program, it's not different. Of course you have different techniques to build data pipelines, but at the base they are computer programs, so you can basically use the same techniques to build your code, although you will need different techniques to test your program. Right. Another concept to understand is the two kinds of architecture that a data engineer has nowadays. They are not completely different; they complement each other. The most important thing here is to understand the difference between the Lambda and the Kappa architectures. Of course, Lambda comes first, because the idea was proposed a long time ago; then, with the evolution of technologies and new requirements from the market and the community, the Kappa architecture came to simplify things, to bring more speed, as a new way to deal with data. Let's start with Lambda. Lambda is an architecture concept that we also call a batch architecture. Basically, you have a time frame: time after time you go to the data in the ingress layer and bring it into your process layer, and you can split this process into two parts. One is called the speed layer, which means that you consume your data as a stream: when you receive the data, you process it. The other is the batch layer, which means that you consume the data coming from storage; you can add this data to your views (views being how your users will see the data) and then serve it from the serving layer. In the speed layer, you get the data from your ingress, you process it in near real time, and you show it to your users. You also have an extra part in the serving layer, the consolidation layer: it takes the batch views and the real-time views and joins them, consolidating those views in order to serve them and keep them consistent. In both layers you can see results, but in the batch layer you will have some delay, in general delays of minutes or hours; in the speed layer you can see results in seconds or maybe minutes, depending on how you apply the technique. The most important attribute of this architecture is that, in the end, you need to consolidate those layers, because you need to put the datasets from the speed layer and the batch layer into the same state.
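As a rough, hedged sketch of that consolidation idea, here is a toy serving-layer merge of a batch view and a speed view; the keys and counts are invented:

```python
# Toy Lambda sketch: a slow but complete batch view plus a fast, recent speed
# view, consolidated at query time. All data here is invented.
batch_view = {"user_1": 120, "user_2": 45}   # counts computed by the batch layer (e.g. hourly)
speed_view = {"user_1": 3, "user_3": 7}      # counts from events seen since the last batch run

def consolidated_count(user_id):
    """Serving layer: merge batch and speed views so queries see one dataset."""
    return batch_view.get(user_id, 0) + speed_view.get(user_id, 0)

print(consolidated_count("user_1"))  # 123: batch result plus near-real-time delta
```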
So, where do you apply this kind of architecture? First of all, for applications that need the data stored permanently, like big analyses, say of user behavior along a product's life, so you need to store it permanently, or in whatever way your compliance department and your company permit; for example, some companies let you keep data for a year and a half or two years, and you need to be prepared to keep this data stored. It's also important when you attend to the needs of an analysis team: if your team needs to query immutable data, that's a very important case for the Lambda architecture, as is writing and using datasets that require a huge amount of data. This matters because, as I mentioned, consolidating those layers means that, for example, you process a batch of the last hour; when your speed layer comes with new updates on top of previous data, you need to consolidate, to update that information as well, in the tables or in the files. So this job that consolidates the files is very important. But you need to balance whether it's worth it for you or not, because sometimes you can mix in the Kappa architecture, which I'll talk about afterwards. The first pros: it's reliable and safe, meaning it's a very consolidated architecture, so you have a lot of tools, a lot of resources and documentation. It's fault tolerant, because you can reprocess your data, everything from scratch; it's costly, of course, but if you find any bug, if you find any inconsistency, you can fix it and reprocess everything from scratch. It's scalable as well: if you need to put more machines into your cluster, if you need to deploy your code on different kinds of clusters, you can, and you can even decide whether you want to scale horizontally or vertically, depending on the purpose of your job. And you can manage all of this stored cold data in a distributed file system, or some kind of distributed file system; nowadays the most common is Hadoop, and now you have a new project from the same sponsors as Spark, called Delta Lake, that is very interesting to look at. On the other side, you have some cons: when you start modelling your data, you can suffer from premature modelling, because it's hard to migrate your schema. You start consuming some data, you start modelling that data, and then you need to update those schemas step by step; you need a process in your team and in your data to migrate those schemas, because sometimes, as I told you, you need to reprocess all the data, so it can be costly. If you try to predict which kind of data you will need to process in the future and try to prepare the architecture for that, it might be a big mistake. It is one of the mistakes that I committed at first, because I tried to predict everything. As in any other software, premature data modelling and premature optimizations are the worst decisions, and they can be expensive, as I told you, because your data volume increases step by step, so if you need to reprocess, if you need to create new batch cycles, it can become even more expensive per day or per month. And if you don't separate the concerns in your code (the code that deals with the pipeline, the code that deals with the scheduler, the code that deals with schema validation), all those layers can become more complex to manage.
Let's talk about Kappa. Kappa is not a replacement for the Lambda architecture; it's an alternative that provides performance in scenarios where the batch layer is not necessary. Basically, it's an abstraction of the speed layer of the Lambda architecture: it just processes stream data, performing near-real-time or real-time processing, especially for analytics. The main advantage of this approach is that you can keep a single code base, and you can deploy that same code base without having to reprocess all the data, because you are talking about streams, not about storage at all. You can deliver new features, change code, fix bugs and deploy easily, but you cannot store your data permanently unless you introduce new components into the architecture. Basically you have your data, you have your stream, you have your real-time views, and then your queries. Imagine something like log analysis, where you know in real time about some exception; products that help you develop software, like Datadog or Sentry, are typical examples of Kappa for data engineering. Or fraud detection; or, for example, an ad platform where you have milliseconds to decide which kind of ad to deliver to your user. Those are applications of the Kappa architecture, and this kind of architecture uses fewer resources than the Lambda architecture. Another good point is that it's a trend in the market, so you can serve machine learning models on a real-time basis as well. The Kappa architecture is prepared to scale horizontally, and most of the time you only need to reprocess the data when the code changes; so if you introduce a new feature, if you ship a code change that really improves your data, in general you don't need to reprocess your data. But when you do have to reprocess your data, you need to do massive exception management in your code, which can bring more complexity: you need to be more careful with how you deal with exceptions in your code, and the chance that a bug you introduce stops your pipeline is higher. It requires more monitoring and more resources to operate and analyze your pipeline as well. In general you need to work in partnership with your DevOps team, or you need to absorb the DevOps concepts yourself, because you need to recover your pipeline fast: it's a real-time pipeline, the data cannot stop.
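As a rough, hedged sketch of the Kappa idea, here is a toy stream consumer that updates a real-time view per event; a plain generator stands in for Kafka or a similar source, and the events and fields are invented:

```python
# Toy Kappa sketch: everything is treated as a stream and a real-time view is
# updated per event; there is no separate batch layer. Events are invented.
from collections import defaultdict

def event_stream():
    """Stand-in for a real stream source (Kafka topic, log tail, ...)."""
    yield {"user": "user_1", "action": "click"}
    yield {"user": "user_2", "action": "purchase"}
    yield {"user": "user_1", "action": "purchase"}

realtime_view = defaultdict(int)

for event in event_stream():
    # Process each event as it arrives and update the serving view in place.
    if event["action"] == "purchase":
        realtime_view[event["user"]] += 1

print(dict(realtime_view))  # {'user_2': 1, 'user_1': 1}
```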
Let's talk about the qualities of a pipeline. As I told you, it's a computer program, so the concerns are almost the same. First of all, and I think the most important ethical part of being a data engineering professional, is security. When data analysts and data scientists are using the data, there's usually a data engineer behind the scenes doing the magic to prepare that huge amount of data. So the first thing is to partner with the privacy officer of your company, and to know the privacy laws in your country and your society, to define access at all data levels. You have different parts of your data: you have the whole raw data in the data lake, you have the data being processed in your pipeline, and you have your processed data in your database and your analytical database, so the access layer needs to be well defined. Privacy should apply over all the layers, and it starts with you, not with your company, not with your requirements: you need to bring up this idea of privacy first as a professional. Try to use common formats when you store the data: even though you'll see many different kinds of formats, try to settle on a common format like Parquet files, JSON files, even CSV and other formats that you can store and restore. As I said, separation of concerns is very important for security too, because if you mix different kinds of attributes of your architecture, you can bring yourself problems. And avoid hard-coded
keys or credentials inside your code and your pipeline, because your code gets distributed to different servers, and you can end up with a big leak.
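As a small, hedged sketch of that advice about hard-coded keys, here is one common way to pull credentials from the environment instead of from the code; the variable names and the connection string are invented:

```python
# Hedged sketch: read credentials from the environment (or a secret manager)
# instead of committing them with the pipeline code. Names are made up.
import os

DB_USER = os.environ["PIPELINE_DB_USER"]          # fails loudly if missing
DB_PASSWORD = os.environ["PIPELINE_DB_PASSWORD"]
DB_HOST = os.environ.get("PIPELINE_DB_HOST", "localhost")

connection_uri = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/analytics"
```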
For automation, it basically works like any other software component, any computer program: version your code, and use the power of different tech stacks. In this case we're talking about Python, but Python interacts very well with the JVM, so you can leverage that to take things to the next level. Try to introduce code review and linting for your code, and try to automate everything you can on CI/CD servers. As I said, especially with the Kappa architecture you need stronger monitoring, so my advice is to delegate the monitoring: try to delegate the monitoring service to some cloud provider, to some vendor, because it's cheap and fast, whether you need to analyze logs, collect exceptions or run some performance analyzer. But even though I'm telling you to use a cloud tool to help you do the monitoring, try to avoid vendor lock-in: try to create a layer in your application so that you can change your vendor if you want to reduce costs or try another vendor. Monitoring is essential for any data pipeline, because you need to understand which kind of data is coming in, which kind of data is being processed, and which kind of data was delivered; monitoring all those statuses is mandatory. Just as important is testing your code. Applying regression tests to a data pipeline is very hard, especially when you change your schema, because your schema is being validated and evolved all the time, and you need these tests to make sure that you don't introduce problems. Make sure that you always run the tests; it's better for developing your code. Focus on unit tests for the internals of your pipeline: every function you created to manipulate one part of your data must be tested as well. Try to test the third-party components too: you have different pieces of software around the architecture, so you need some kind of mock test, or to shut down part of the infrastructure and make sure that the pipeline recovers itself; and of course do end-to-end tests. To be honest, testing all the third-party components was one of the most important lessons I learned, because you interact with so many different stacks as a data engineer. Mock tests are a pretty good solution to help you avoid problems, because your own code is something you can control, but the third-party components are something quite hard to control. For example, Kubernetes and Docker are pretty awesome tools that help you test this.
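As a tiny, hedged illustration of that monitoring point (knowing how many records come in, get processed and go out), one could log counts per step; the step and field names are invented, and a real pipeline would ship these numbers to a monitoring vendor instead of only logging them:

```python
# Hedged sketch: log in/out record counts for each pipeline step.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_step(name, rows, fn):
    """Run one pipeline step and log how many records went in and came out."""
    out = fn(rows)
    log.info("step=%s records_in=%d records_out=%d", name, len(rows), len(out))
    return out

rows = [{"amount": 10}, {"amount": -1}, {"amount": 5}]
valid = run_step("filter_negative", rows, lambda rs: [r for r in rs if r["amount"] >= 0])
```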
We're almost at the end, so now I'll talk about where Python matters. We probably won't have time for questions afterwards, but I'll show you some of the tools that I've used along my career as a data engineer using Python. As I said, I had to remove all the code samples, but we can talk about them later. The first thing is always to talk about ETL: extract, load and then transform. The most used library is PySpark. PySpark is a wrapper for Apache Spark; Apache Spark is a distributed computing system that runs on the JVM and has a Python API wrapper that you can use to create ETLs and do deep analysis. You can use it for machine learning too, but I mostly used it for my ETLs, for my processing. Before the analytical Spark jobs, I used Dask: Dask is a Python tool, written by the community, open source, that parallelizes analytics computing; basically you get something like pandas across different servers, so you can use pandas and the concepts of pandas dataframes to parallelize your analytics pipelines. Another nice tool is Luigi. Luigi is a framework written by Spotify, a module that lets you create different kinds of pipelines and build a DAG; a DAG is a graph that lets the steps of the pipeline interact with each other. There are also mrjob and Ray, MapReduce and machine learning frameworks to use with Python as well; let's say mrjob is almost legacy, because almost nobody uses MapReduce directly anymore, I would say, and Ray is for when you're doing machine learning. Faust is a brilliant tool for stream processing: it can be used to process the streams that come from Kafka-compliant streams, so you have different tools, clouds or clusters of Kafka that you can use from Python; I used it some time ago, a long time ago. Streamparse, for Apache Storm, is basically the de facto stream framework for analysis. Blaze is a well-known high-performance analysis tool: it gives you pandas-like interfaces for working with bigger data.
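To give a feel for the PySpark option mentioned above, here is a small, hedged ETL sketch; the paths, columns and aggregation are invented, and it assumes a local Spark installation:

```python
# Hedged PySpark sketch: read CSV, transform, write Parquet. Names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-etl").getOrCreate()

orders = spark.read.csv("raw/orders.csv", header=True, inferSchema=True)

daily_totals = (
    orders
    .filter(F.col("status") == "completed")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
)

daily_totals.write.mode("overwrite").parquet("curated/daily_totals")
spark.stop()
```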
And there are a couple of other important tools to mention nowadays; honestly, I've only used them for tests, I never wrote production code on those frameworks. Now I can talk about Airflow. Airflow, for me, is one of the most important tools for the data engineer, written in Python. It's an Apache project, a platform where you can create pipelines, scheduling and monitoring all the data workflows inside your company. It has an extensible library that you can extend in Python: in Airflow you can create new components that interact with different third-party components. It's very easy to extend and completely written in Python, and you have different configurations: you can have a standalone deployment or a distributed deployment, you can deploy on different servers, you can scale horizontally or vertically, so you have different ways to deploy Airflow. It's definitely a tool that's worth working with. Right, on to testing. As I said, test your code: especially when you're talking about unit tests, you're talking about pytest; and spark-testing-base is a Python framework that helps you write PySpark tests.
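As a small, hedged example of that unit-testing advice with pytest, here is a made-up transform function and two tests for it; the function and the expectations are illustrative, not from the talk:

```python
# Hedged pytest sketch: test one small piece of pipeline logic in isolation.
def normalize_amount(record):
    """Round a string or numeric amount to cents, keeping the other fields."""
    return {**record, "amount": round(float(record["amount"]), 2)}

def test_normalize_amount_rounds_to_cents():
    assert normalize_amount({"id": 1, "amount": "10.456"})["amount"] == 10.46

def test_normalize_amount_keeps_other_fields():
    assert normalize_amount({"id": 1, "amount": "1"})["id"] == 1
```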
There is also a data-generation tool that you can use in your end-to-end tests, to make sure that all the things generated by your pipeline are okay. And when I talk about validating or evolving your schema, you should have some validation: there are tools in Python that let you validate the schema of your data.
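As a hedged sketch of what such schema validation can look like in Python, here is a minimal example with Cerberus; the schema and the record are invented:

```python
# Hedged sketch: validate a record against a schema before it moves downstream.
from cerberus import Validator

schema = {
    "user_id": {"type": "string", "required": True},
    "amount": {"type": "float", "min": 0},
    "currency": {"type": "string", "allowed": ["USD", "EUR", "BRL"]},
}

validator = Validator(schema)
record = {"user_id": "u-42", "amount": 19.9, "currency": "EUR"}

if not validator.validate(record):
    # In a pipeline you would route bad records to a dead-letter location.
    print(validator.errors)
```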
A special mention goes to Cerberus.
It's a very nice and lightweight tool, and very extensible, so you can use it to extend and create different tools, different parts of your code, to validate and evolve your schema; of course, it helps you a lot to implement integration tests. Voluptuous is another nice tool, not as extensible as Cerberus, but it's valuable for a small project as well. Okay, I'd like to thank you for attending my talk. If you have further questions, well, I don't know if we have time for that, but
yeah, like, we have two more minutes only. Yeah, so there's a question: as an experienced data engineer, what is the most common error that you see in the pipelines that you examine? This is one of the questions.
Exception handling is the most common error I've faced, that I've committed myself, and that I see people commit.
Don't be afraid of the exceptions inside the pipeline.
Precisely because most of the time you are running your Python code together with another stack, like the JVM when using Spark, handling this kind of exception with the pipeline running across different runtimes is sometimes complicated. For example, I gave another talk about this, I can send you the link, where I explain how Python interacts with Spark; but in general, sometimes the exception is thrown inside the JVM and then Python needs to capture it, and handling that kind of exception is sometimes almost impossible. That's why I mentioned that the common mistake is how you handle your exceptions in your data pipeline: you need to fix it, not just handle it and try to move on; you need to go back into a review of your code. So my advice for you is to start handling your exceptions better, and as a second step you will have good exception handling, so you can monitor and repair your code.
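As a hedged sketch of that Python/JVM boundary, here is one way to catch the JVM-side error that PySpark surfaces through py4j; the path is invented, and whether you retry or fail is up to your pipeline:

```python
# Hedged sketch: catch the Py4J error explicitly so the pipeline fails (or
# retries) in a controlled way instead of dying on an unhandled exception.
from py4j.protocol import Py4JJavaError
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exception-handling").getOrCreate()

try:
    df = spark.read.parquet("curated/daily_totals")
    df.count()  # actions are where JVM-side failures usually surface
except Py4JJavaError as err:
    # The real cause lives on the Java side; log it and decide whether to retry.
    print("Spark job failed on the JVM side:", err.java_exception)
    raise
```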
Yeah. Another one, we have really short amount of time right now but it says, All these tools sound a little intimidating to be honest, and it feels like one needs to immerse quite some time and set up the infrastructure data processing pipeline correctly, what can I do if I want to start a small prototype an idea.
Okay. It's not intimidating, right? The concepts are a bit different, okay, but you can trust me: as I told you, this is computer programming, so you have inputs, processing and outputs. My advice, if you want to start in the data engineering field, is to start by deploying Airflow and creating your first DAG. Then you can understand the concept of a pipeline itself: steps that depend on each other, and how the steps of a DAG interact with each other. For example, you need to load your data, you need to transform your data, you need to send your data to another place: those are three steps, but to process your data you need to have it in one place, so the second step depends on the first, and the third one depends on the second. This is a good first exercise for you to get into the data engineering world and this movement.
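As a hedged sketch of that suggestion, here is a minimal three-step Airflow DAG (load, transform, send) where each task depends on the previous one; the callables and schedule are placeholders, and the import path assumes Airflow 2.x (older 1.10 releases expose PythonOperator from airflow.operators.python_operator):

```python
# Hedged sketch of a first Airflow DAG: three steps that depend on each other.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def load():      print("load raw data")
def transform(): print("transform data")
def send():      print("send data to its destination")

with DAG(
    dag_id="first_pipeline",
    start_date=datetime(2020, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_send = PythonOperator(task_id="send", python_callable=send)

    # transform runs only after load succeeds; send only after transform.
    t_load >> t_transform >> t_send
```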
Yeah, we're out of time right now.
Okay, okay, we ran out of time. But feel free to reach out, here or on my Twitter or on my Telegram; it would be a pleasure to talk to you.
Yeah, and also, for the next sessions you will need to go and click the sessions; it won't just continue here. From the left side, I think, as other people explained before. And don't forget to submit if you want to have a lightning talk as well. See y'all folks later.