Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a-t-l-a-n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service, you can launch your production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40-gigabit connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Ryan Buick about Canvas, a spreadsheet interface for your data that lets everyone on your team explore data without having to learn SQL. So Ryan, can you start by introducing yourself?
Yeah, hey, thanks for having me on. So I'm Ryan, one of the founders of Canvas. We've been around for about a year now. Our mission is, like you said, to really bring the modern data stack to operators and to business teams so that they can make better, faster decisions, and data teams can continue to focus on the work that really matters to them.
Do you remember how you first got started working in the area of data?
It's a bit of a weird story. So it started a couple years ago; I was one of the first product managers at a company called Flexport. If your audience isn't familiar with Flexport, it's basically a tech-enabled freight forwarder, trying to disrupt the millennia-old industry of shipping goods from point A to point B, and all of the complexities around that are pretty topical nowadays. So yeah, I joined as one of the first PMs there, and really didn't have all that much experience with data as a product manager. When I joined, I realized everything was so data-driven; every decision had to be presented with objective evidence of, okay, here's where we're at, here's how we think we're going to improve metric X. Before that, I'd pretty much had data served up to me as a product manager, as an operator. So, honestly, I had a lot of anxiety over it, and I actually went and took a data analytics bootcamp. I spent my first couple months at Flexport, during nights and weekends, learning how to write advanced SQL, how to think about cleaning data, and how to think about working with data in Excel as well. That was super helpful for me in terms of getting exposed to data, and that's sort of how I ended up in this space.
And as far as the Canvas project, I'm wondering if you can describe a bit about what it is and some of the story behind how you decided that that was where you wanted to spend your time and energy and build a product, versus just continuing on your career path of being a product manager.
That story also started at Flexport. I met my co-founder there; we were on the same team, and we had seen just how difficult it was for Flexport, as an acute version of this pain, where there are so many operations employees, and the data team could frankly just never keep up with the amount of requests. I think we did a good job of having strategic dashboards for the business, but there were just so many ad hoc, everyday questions that needed to be answered, and there wasn't really a great interface for those questions to be answered. We saw a data team spending millions of dollars on the modern data stack, and analysts that could help out each individual team, but ultimately you saw really long lines for data requests, right? You saw business teams getting frustrated and just exporting data into spreadsheets, and you start to see these Google Sheets reporting empires stand up from team to team that really create silos within the organization and broken feedback loops. No one really knows what's happening in the CSVs, and the business teams are frustrated that they have to export these CSVs of real-time data daily or weekly. So we worked on an internal data product there that was really almost a spreadsheet on top of a few different tools for a pretty specific use case, but we decided, hey, why not try to make this horizontal and really give non-technical business teams a way to actually work with data in a somewhat independent and confident way. We started really thinking about it during COVID, and decided to jump into it in late 2020. And I'll probably talk a little bit about what Canvas is. Really, Canvas is a data exploration and data visualization tool, primarily for business teams.
So we integrate with the modern data stack; we integrate with Snowflake and BigQuery and Redshift and all the warehouses. But we also integrate with dbt, which is exploding in popularity, obviously. And I think the interesting thing there is that it really gives business teams, for the first time, a reasonable starting place for working with data. You have these big, wide tables that data teams are producing; instead of trying to answer every single ad hoc request that comes through, they're focusing more on creating scalable data models that these teams can use. And so we found the spreadsheet interface is a really nice way to say, hey, data teams, you can actually share these models with your teams, give them an interface to answer their own questions instead of coming to you directly with the question, and be able to explore that, so they can really get 80% of the way there to answering their question. If they get stuck, they can actually collaborate in our tool, and data teams can inspect the spreadsheet work that's being done via SQL, make changes there, and ultimately get people to an outcome faster.
You mentioned that Canvas is intended as a way to bring the promise of the modern data stack to people who don't want to put in all the engineering effort to manage that themselves. I'm wondering if you can talk to the shortcomings that you see in the operating model for what has come to be known as the modern data stack, and how that keeps business users dependent on engineers for being able to answer the questions that they want to be able to iterate on.
Yeah, for sure. So I think, for all of the advancements that the modern data stack has brought, right, we have best-of-breed tools for each part of the stack, there's a lot of focus, a lot of high quality, bringing the developer practices of 10 or 20 years ago to data teams; I think it's been amazing. But there's still a pretty high barrier to entry, right? The modern data stack is pretty expensive; it requires having people that are capable of implementing it, building a reasonable set of data models, and collaborating with the business team on scoping that all out. I think the other piece here is that iteration is pretty slow, right? New fields get added; I talk to line-of-business owners that are frustrated because they don't know what's happening as soon as they want to make a couple changes to their Salesforce objects; they don't really understand that delay. And so I think there's still a lot of work that we need to do in terms of bridging the gap between business and data teams, and making it clear: hey, the work that you want to do, we are trying to enable as a data team, but that needs to be more of a collaborative process, and not blind, with a business owner saying, hey, we need to make these changes, and then data teams are thought of as an afterthought. That never happens with engineering; they're brought in and given a seat at the table upfront. So I think there's a lot of work to do culturally there as well with the modern data stack. And then lastly, I think a lot of the interfaces haven't really caught up with the way that data is now being thought about. Recently there's been a lot of talk about the death of the dashboard, and I think we pretty much agree on that. A lot of the strategic dashboards have been solved.
But a lot of the work that's being done now is more complex questions by operators and by business teams; it's wanting to compare and contrast and diagnose a couple of different tables together in one place. And I think that's really the heart of what we're trying to do at Canvas, which is give a flexible interface on top of this trusted, certified data that data teams are working on, and give them a way to just ask whatever question they have at the moment, rather than going into the bread line and just asking a question. I think those are the shortcomings that we're seeing with the modern data stack, but there are a ton of exciting things happening to sort of bridge that gap.
In terms of what you're building at Canvas, you mentioned the primary interface paradigm is this spreadsheet that everybody has become familiar with over the years. I'm wondering what you see as the reason that the spreadsheet is such a popular and persistent metaphor for being able to work with data, and for making it accessible to people who don't think of themselves as technical. Yeah,
It's a great question. I think it's a few things, right? It's just been around forever. It's the first thing that you're taught in business school, right, how to use Excel. And so I think the sort of institutional powers that be are propagating and holding up Excel as a way to do this. But I also think it's a great way to just visually program; it's the first programming language that most people learn. And I think it's very similar to SQL in a lot of ways; we see them as really two sides of the same coin. Of course, there are some things that break down there, and I'm sure we'll cover those. But yeah, I think it's an easy way to just match the mental model of however you're thinking about a problem, being able to shape that and work with that to get to the answer that you want. And it's also just a lot easier to iterate in some cases; there are cases we see of data teams in our tool where they'd rather use a spreadsheet paradigm than use SQL to answer a quick-and-dirty question. It's familiar, it's fast, it's relatively iterative. So I think that's why it's sticking, and why it's continuing to stick. And that's something that we're really trying to take advantage of, rather than just looking down upon it as this legacy interface; we're really trying to make it better, honestly, in the context of how prevalent data is today in most startups.
And in terms of the existing systems that we have for being able to work with data in these spreadsheet interfaces, with Excel obviously being the most prominent one, but also things like Google Sheets or any number of other platforms and projects that provide this spreadsheet interface, I'm wondering, what are the biggest issues that users encounter as they try to scale the usage of those spreadsheets, both from a technical, you know, data volume and data complexity perspective, but also from an organizational capacity?
The first obvious one is just the analytical performance, right? You can't be taking event data and putting it into Google Sheets; it's going to fall over. So I think that's one of the first use cases that we see, and that we can really help out with, with our tool. I think there's another thing there, which is, like you mentioned, organizational. How many times have we seen spreadsheets with "do not edit" tabs in red? People sort of hold on to these; they're afraid to let go of their Legos, either for privacy concerns, or not wanting someone to break the massive model that they spent weeks building. So there is a sense of, hey, because this is editable, and because this is so easy to make changes to, I actually want to do the reverse and keep it close to the vest, rather than having something that is version-controlled, over a trusted, real-time data set, that can, I think, be more easily shared. And if you have permissions that make sense, it's something that can actually spread within an organization and help people. So I think those are the two biggest things that we see, and a real reason why people are trying to move away from having PII data passed around the org, trying to move away from having your entire sales ops team run off of one Google Sheet. That's not great for either the business or the data teams, and I think they're both looking for a better way to do their jobs.
As you explored the capabilities of this metaphor in your own tool, what are some of the extensions or modifications to that paradigm of the spreadsheet, and the ways that people typically interact with them, that you needed to introduce to be able to account for some of these technical and organizational limitations in the default paradigm that people have used it for?
Yeah, definitely. And this has been an interesting adventure with dbt in particular. A big part of spreadsheets, and the way that information is shared, is that you often want to see and show the work behind your analysis. And so, leaning on a lot of what's being done with the DAG, and being able to show, hey, this spreadsheet is actually composed, visually, of these other items, of these other datasets: these fact and dim tables were used in the creation of this, this pivot was made on top of this to create this chart. That's something we're really trying to invest in, to show people, hey, just because this is a spreadsheet doesn't mean that it's not a powerful analytical tool; it doesn't mean that it's not extensible; it doesn't mean that you can't actually double-click and see the lineage behind it, to understand what breaking changes might happen. So I think that's the first part that has been super interesting: the lineage. Another one that's really interesting is thinking of it in terms of components, rather than these one-off things, and I think this lends itself to the lineage stuff. All of a sudden you now have business users that are capable of creating these repeatable insights, these repeatable, almost, models, if you will. So how do you create a nice graduation path for something that is forked from a model that the data team has created, the business team has made some edits to it, and now you actually want to make this something that's reusable and certify it, so it's available to the rest of the org? This is something that we're pretty excited about, and we're calling them components, sort of in the style of Figma and the design libraries that we see out there.
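The "show the work" lineage idea he describes can be pictured as a small dependency graph over the elements on a canvas. This is a hypothetical sketch in Python with invented table and element names, purely to illustrate the concept, not Canvas's actual implementation:

```python
from collections import defaultdict

class LineageGraph:
    """Tiny DAG: each canvas element records which elements feed into it."""

    def __init__(self):
        self.parents = defaultdict(list)  # element -> upstream elements

    def add_edge(self, upstream, downstream):
        self.parents[downstream].append(upstream)

    def upstream_of(self, element):
        """All ancestors of an element -- the 'show the work' view."""
        seen, stack = set(), [element]
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

g = LineageGraph()
g.add_edge("fct_orders", "orders_pivot")     # fact table feeds a pivot
g.add_edge("dim_customers", "orders_pivot")  # dim table joined in
g.add_edge("orders_pivot", "revenue_chart")  # pivot feeds a chart

g.upstream_of("revenue_chart")
# -> {"orders_pivot", "fct_orders", "dim_customers"}
```

Walking this graph downstream instead of upstream is what would let a tool warn about breaking changes: if `fct_orders` changes, everything reachable from it is at risk.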
And so creating a component library of something that can actually start to accrue value over time, instead of just being another request-response workflow between business and data teams, is something that we're pretty excited about. Obviously, there are some guardrails that you have to create around formulas; they can't just be arbitrary, you have to put in some guardrails so people don't completely choke the pipelines. So that's been an interesting one to see: which ones are relatively non-controversial for business teams to go and work with, and also trying to account for performance across the board, making sure that just because something is available doesn't mean it's going to result in a 45-second query where they bail. You mentioned
working alongside dbt and building on top of the modern data stack. I'm wondering if you can give your perspective on the role that Canvas plays in the overall data ecosystem for an organization. What are the utilities that it might replace? What are the utilities that it's going to augment? Who are the primary personas that are going to be interacting with it, and how are they going to collaborate with the other data roles and stakeholders within the organization? Yeah,
Totally. I think the first thing that we see, really the primary pain point for data teams, is: hey, we cannot actually continue to scale when, for every 10 employees that we hire on the business side, we have to hire another analyst. They're looking for a way to break that correlation. And that's really where we come in and say, okay, well, let's try to set a goal; let's try to reduce your data requests by 25%. And the way that we try to do that is saying, okay, you've already invested in your dbt project, you've already created these wide tables that are a reasonable starting point for business teams to work with, but you don't really have an interface in which that's just available for exploration. You might have an existing strategic BI tool or set of dashboards; you might be pumping this data into different systems with reverse ETL, or customer data activation, I guess, if you've had these systems for a while, but they're still not resulting in any reduction in your data requests. So what's broken there? What can we fix? And that's really our approach, saying, hey, we can complement these systems. Instead of you getting a request for something, why don't you actually respond back with a Canvas link and say, hey, here's the model that you're looking for. I know you're going to have six questions even after I give you back the original query that you wanted; you can actually ask those six additional questions here, just using your spreadsheet skills, by pivoting, by creating some charts, by creating some formulas, by joining with some other tables. And we give them a relatively easy way to do that without even really knowing exactly what a SQL join might be. So that's really the primary persona: working with data teams that are looking for a way to scale. And of course, we have business teams that come to us too.
They want to just be able to have an interface that they can control over their warehouse or over their database, even if they're early, and that's something where we're really just trying to help them move faster and do more with less. Those are sort of the core personas that we're working with today, primarily in B2B SaaS, e-commerce, sort of startup roles. But I think the interesting trend that we're really trying to ride most is that a lot of these operational roles, five or ten years ago, were primarily just doing day-to-day work and executing on tasks. A lot of that work has been automated, right? And so they're moving more and more into strategic roles, whether or not you have "ops" in your title. And so we have this pretty big skills gap, where 90 to 95% of the organization doesn't know SQL, but now they're expected to be in a data-driven role, and they don't really have a home today in terms of the tools for doing everyday tasks with the data that they need. So that's really the trend that we're seeing, and we really want to give these operators a home in the modern data stack. That makes
sense. And there have been a few other efforts to make exploration of data easier to do. You mentioned, obviously, business intelligence systems, and there have been a few iterations of those recently that have tried to reimagine some of the ways that you interact with them. One of the ones that comes to mind is something like Lightdash, which builds on top of dbt models and relies on some of the dimensional modeling to be able to say: here is your domain object, here are the different dimensions along which you can aggregate and slice it, here are sort of the guardrails for that. I'm wondering what you see as the juxtaposition of what you're doing at Canvas with some of the other ways that people have tried to approach this same problem of providing self-service data exploration in a way that is approachable to business users without having to throw them in the deep end of: go ahead and learn SQL, good luck, I'll see you in six months. Yeah,
Totally. I think there have obviously been a ton of tools that help with learning SQL, or making SQL more collaborative, or, like you mentioned, giving people more exploration capabilities. The thing that we're really trying to lean in on is that there are very analytically minded people throughout the org that need to do things beyond just a couple of basic filters and slicing and dicing data. They actually have models that are sitting in Google Sheets that are completely manual, updated with live data every Monday or every month. And so I think that's the work that we're really trying to go after and say, hey, there's a better way to do this, rather than in Google Sheets that are going to start to fall apart after a few weeks' worth of massive data sets, and give you the ability to actually replicate those processes over real-time data, over data that's going to be trusted, and really try to bring those data and business teams together. So I think that's really the gap that we're trying to fill: serving citizen data scientists, I guess, is one way to put it, but really these analytically minded operators that need Excel-like strength, Excel-like power, but with real-time data. That's really where we see ourselves coming in.
Unstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to managing your unstructured assets. Built to handle all of your real-world data, from videos and images to 3D point clouds and geospatial records to industry-specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, and consistency validation, as well as enrichment through sources including machine learning models, third-party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today, that's U-N-S-T-R-U-K, to transform your messy collection of unstructured data files into actionable assets that power your business. And in terms of the actual Canvas platform itself, I'm wondering if you can talk through some of the ways that it is implemented and architected to be able to provide these data manipulation and data exploration capabilities while still being interactive and performant enough for people to be able to iterate on these questions that they have.
Yeah, so one of the goals that we had setting out was, obviously, it has to be fast, right? If you're going to compete with these sort of incumbents, it needs to feel like a native app, but it has to be in the browser. So our stack is pretty exciting, it's pretty new: it's Rust, WebGL, and WebAssembly, modeled after a lot of the modern collaboration tools out there. And of course, we're integrating with the best-of-breed parts of the modern data stack, so integrating with the warehouses, integrating with ETL tools like Fivetran and Airbyte and a couple others, and really trying to make sure that this is something that data teams aren't going to see as some toy for business teams, but rather as a fully fledged, real data tool on top of the modern data stack, something that meets the high standards that they should have.
And as you have explored this problem space and dug further into actually implementing your own spreadsheet interface, figuring out how to provide the right paradigms and metaphors, and building the necessary escape hatches for when you start to exceed the complexity that spreadsheets are able to support, what are some of the ways that the design and goals of the product have changed and evolved since you started working on it?
One of the first things was really realizing that you need to be able to let folks get far enough in their spreadsheet exploration or spreadsheet modeling before they run into a point where they can't go any further. And so you have a couple of different options there. The first thing, and probably the thing that gets people most excited when I demo, is that you can actually open up any of the spreadsheet analysis and we'll generate really nice CTEs, really nice SQL, for data teams to go in and inspect and edit. And that change is bidirectional, so any changes that are made in the SQL will then find an almost one-to-one equivalent in the spreadsheet interface. So that's sort of the first, almost escape hatch, I'll call it: hey, I've gotten far enough, maybe I've made a mistake, or maybe I just need help because I'm a bit over my skis. I can tag the data person and they can come in, and instead of trying to reverse engineer formulas and references in some sheet that you've never seen before, you can actually just open up the hood, see the SQL, and make changes really quickly. So that's pretty much our primary escape hatch right now. I guess I'll stop there.
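The "generate really nice CTEs" idea can be sketched as compiling a chain of spreadsheet-style steps into named CTEs, one per step, so a data person can read the work in order. This is a minimal hypothetical illustration in Python; the step format and the SQL it emits are my own invention, not Canvas's actual output:

```python
def compile_to_sql(base_table, steps):
    """Compile a list of (kind, arg) spreadsheet steps into a CTE chain,
    where each step reads from the previous step's CTE."""
    ctes, current = [], base_table
    for i, (kind, arg) in enumerate(steps, start=1):
        name = f"step_{i}"
        if kind == "filter":
            body = f"SELECT * FROM {current} WHERE {arg}"
        elif kind == "group":  # a pivot-like aggregation
            col, agg = arg
            body = f"SELECT {col}, {agg} AS value FROM {current} GROUP BY {col}"
        else:
            raise ValueError(f"unknown step: {kind}")
        ctes.append(f"{name} AS (\n  {body}\n)")
        current = name
    return "WITH " + ",\n".join(ctes) + f"\nSELECT * FROM {current}"

sql = compile_to_sql("fct_orders", [
    ("filter", "status = 'complete'"),
    ("group", ("region", "SUM(amount)")),
])
```

Because each CTE maps to exactly one spreadsheet step, edits to a single CTE could plausibly be reflected back into the matching step, which is the bidirectional property he describes.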
And as far as being able to manage that translation, what are some of the spots where you've seen the impedance mismatch, and some of the ways that it breaks down? Where somebody has a specific formula that they're trying to build based on their Excel knowledge and it doesn't quite map cleanly to the SQL, or vice versa, where you have a SQL query and you're trying to figure out how to actually map this into an Excel function that is understandable and isn't going to be 3,000 characters long.
I think the first thing, so one of our goals when we sit down and prioritize functionality for the spreadsheet, or functionality for the SQL, is to make sure that there is a close one-to-one representation of that. One of the big decisions that we had to make was: do we want to actually write back to the warehouse, and do we want to actually write back to the data sources? That's something that raises a lot of concerns from the folks that you talk to. So that's been one example of something we wanted to stay away from, along with things like data validation, which obviously doesn't make sense in SQL. But yeah, I think a decent amount of these can come up if you're not careful about prioritizing the right things, which I think we've done a fairly good job of doing to date.
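For one concrete case of that one-to-one representation: a spreadsheet `SUMIF` has a clean SQL counterpart in a conditional aggregate. This is an illustrative sketch (my example, not Canvas's translation layer):

```python
def sumif_to_sql(column, condition):
    """Spreadsheet SUMIF(range, criteria) maps onto a conditional
    aggregate in SQL: sum the column only where the condition holds."""
    return f"SUM(CASE WHEN {condition} THEN {column} ELSE 0 END)"

sumif_to_sql("amount", "region = 'EMEA'")
# -> "SUM(CASE WHEN region = 'EMEA' THEN amount ELSE 0 END)"
```

Functions with no clean counterpart, like interactive data validation, are where the mapping breaks down, which is presumably why those get left out.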
So in terms of the actual workflow for a business user who wants to understand what's happening in their product, what is the actual workflow for being able to figure out: what are the datasets that I have, how do I pull that into a sheet, figure out what the appropriate subset is, build my analysis, share it with people, collaborate? That overall end-to-end journey of figuring out how to answer your own questions from the modeled or partially cleaned datasets that might be available to you.
It all starts with essentially a data library. Data teams have full control over, hey, here are the different levels of models that we've prepared, here are the ones that are probably going to be the most popular. And we use social proof there to show, hey, these are going to be the things that are popular amongst finance teams or marketing teams. So it's part discovery, showing them, hey, here's probably what you're looking for, and allowing them to search for it easily. I think the second part, which has been huge, is leveraging the descriptions in dbt and being able to surface those in the front end. A lot of the frustration, even I had it in my Flexport days, is that you often don't understand what the column is that you're looking at, or how the column was calculated. We can actually take a lot of that metadata and surface it to the business team, so that they can more easily understand what they're looking at, or which column is going to be the right one for them to work with. And then I think a lot of it is just exploration, honestly. Once they drag it onto their canvas, it basically looks like Figma, if Figma was designed for data: I can have one table, I can see its relation to another table in terms of whether it was joined or not, I can write some formulas over it, I can do some pivots, create some charts, show my lineage or hide that away. A lot of it is sort of this iterative prototyping process. And something that's been particularly surprising for me is watching data teams actually want to use it for prototyping, because a lot of that today is honestly done by making SQL exports, then pulling them into sheets and looking for anomalies.
So that was definitely something that surprised me a little bit, seeing how data folks actually want to use this tool as well. And so that's really the journey: pulling in a couple of different tables, creating some pivots, honestly, it's not too complicated, looking for anomalies from there, and then continuing to refine it until you end up with something that can look like a dashboard with a series of charts, or a sheet report like you've had in Google Sheets with a bunch of formulas. It really can go in any direction from there. And then, of course, the most valuable part is actually sharing that. So we have permissions; you can take something and promote it to a shared canvas that other people can see, and that's managed by sort of a higher role, to make sure that governance is there from a human-led perspective; not everyone on the team can have that capability. Ultimately, it's about sharing that and making it easier, and then making it live as part of your data library as a component, in perpetuity. So this isn't just a series of one-off explorations that are happening, but rather this growing knowledge base over time that isn't just contributed to by the data team. You're now getting input from more folks; you're starting to see who's using models more, you're starting to see who's not using models and are able to prune those as you go. And really, the result is hopefully a better knowledge base for the entire org.
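On surfacing dbt descriptions: dbt writes model and column descriptions into its `manifest.json` artifact, so a front end can read them from there. Here is a hedged sketch of that lookup, with an invented model and description for illustration:

```python
def column_descriptions(manifest, model_name):
    """Pull column -> description for one model out of a parsed dbt
    manifest (the dict loaded from manifest.json)."""
    for node in manifest.get("nodes", {}).values():
        if node.get("name") == model_name:
            return {c["name"]: c.get("description", "")
                    for c in node.get("columns", {}).values()}
    return {}

# A tiny stand-in for a parsed manifest.json (real ones have many more keys).
manifest = {"nodes": {"model.analytics.fct_orders": {
    "name": "fct_orders",
    "columns": {"amount": {"name": "amount",
                           "description": "Order total in USD"}}}}}

column_descriptions(manifest, "fct_orders")
# -> {"amount": "Order total in USD"}
```

A tool that shows these next to column headers gives business users exactly the "what is this column, how was it calculated" context he describes.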
And in terms of the iterative process and managing the versions that have been the plague of everybody who's ever dealt with Excel, I'm wondering how you've approached that problem. In particular, being able to manage things like snapshots in time, where you say, I want to build this report to capture sales figures for the month of June, for instance, and I don't want the sheet to automatically update when I get my new sales figures, because I want to maintain this view of the data as it exists at the time I make the analysis, so that I can refer back to it as I do successive iterations of it. And just managing some of those different views and varying requirements depending on the use cases for how the data is being accessed and shared.
Great question. So one of the nice architectural decisions that we've made is that, because it's a multiplayer application, right, you can actually see the multiple cursors moving around in real time, working on the same data as if you were in Figma, every action is taken as part of a ledger. And so we can actually version control each and every stage of the analysis, allow you to revert back to one of those stages, and eventually let you freeze at a particular stage so that you can pin those numbers, like you mentioned, for a particular bookmark in time or snapshot in time. So I think that's one of the other really interesting things about this architecture: you can actually have a system that's living in multiple different states at once, without the headaches or the anxiety of having a sheet that's exposed to a bunch of people who can potentially mess it up.
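The ledger architecture Ryan describes is essentially event sourcing: you never store the spreadsheet state directly, only the append-only list of actions, so any past stage can be reconstructed by replay, and "freezing" is just pinning the replay point. Here's a toy sketch of that idea, with invented class and action names; nothing here reflects Canvas's actual internals.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EventSourcedCanvas:
    """Toy event-sourced document: state is always replayed from a ledger."""

    ledger: list = field(default_factory=list)  # append-only action log
    frozen_at: Optional[int] = None             # pinned version, if frozen

    def apply(self, action: dict) -> None:
        """Record an action (e.g. a cell edit) in the ledger."""
        self.ledger.append(action)

    def state_at(self, version: int) -> dict:
        """Reconstruct the cell state at any past version by replay."""
        cells = {}
        for action in self.ledger[:version]:
            if action["op"] == "set":
                cells[action["cell"]] = action["value"]
            elif action["op"] == "clear":
                cells.pop(action["cell"], None)
        return cells

    def freeze(self) -> None:
        """Pin the canvas to its current version, e.g. June's sales figures."""
        self.frozen_at = len(self.ledger)

    def current(self) -> dict:
        """Live state normally; the pinned snapshot once frozen."""
        version = self.frozen_at if self.frozen_at is not None else len(self.ledger)
        return self.state_at(version)
```

The useful property is that a frozen canvas keeps reporting the old numbers even as new actions land in the ledger, while the unfrozen view and the full history remain available from the same log.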
The other interesting element of the workflow that I'm curious about is how you have seen the analysis being done in Canvas get fed back into things like dbt or the other analytical systems that an organization might be using.
That is something that's really interesting, I think, in terms of the future of how data and business teams are working together. If we're going to start shifting the onus of dashboard creation to business teams, and have data teams really focus on data quality and data usability, I think this is going to be a key part of the stack moving forward: how do you let someone take this insight, take this model, and allow it to graduate back into your systems? For right now, this is a fairly simple interaction between the data and business teams. The data teams are able to see what the business teams are doing, they can collaborate and have a conversation around it, and they can also see which models are being used most often. And so I think the really exciting thing is that, at least for now, we're creating the opportunity for this conversation to happen, whereas before it was completely out of purview. You have heads of data that are like, hey, if it's in a CSV, it's not our problem. But okay, what if you end up with enough teams with their data in CSVs? Eventually it is going to be your problem, because this last mile is broken. And so I think that's one of the things that we're most excited about: let's actually create a mechanism for business teams to be able to own part of this input, and eventually work on productizing and automating that graduation path back into your dbt library. But I think one of the coolest moments we're having is watching this conversation happen between data and business teams that before was never really happening.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days, or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying; you can now know exactly what will change in your database. Datafold integrates with all major data warehouses, as well as frameworks such as Airflow and dbt, and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. In terms of that collaboration, you mentioned that Canvas is an application that allows multiple people to work on the same sheet and edit it, and I'm sure that you have some capabilities for locking it from editing while somebody else is using it, along with the data governance and oversight elements that you mentioned. I'm just wondering, what are some of the other ways that you've seen data teams and business users collaborate in managing some of these component libraries that you mentioned, or in handling the setup of the initial datasets? And maybe some of the feedback from the business teams into the data teams to say, these are the kinds of shapes of data or the domain objects that we want to be able to work with so that we can build our own analysis. Just some of the conversations that that has opened up as far as how to even think about data modeling at the earliest stages of the process.
I think one of the more interesting things has just been when you have, let's call it a biz ops team, right? The biz ops team comes in and sees the gap between what's available in the dbt project or in the warehouse versus what they have in their own systems. That's often a conversation that isn't really happening outside of this. And so you're actually watching the data team and the business team say, okay, we need to answer these ten questions; let's see if we can actually answer eight of them in the tool itself, and then take the leftovers and say, okay, this is what we're going to go scope and put into our dbt project. And so we're seeing this play out live, rather than it being some scoping exercise between these different teams where you don't even know if they're going to end up using those models or not. So I think that's often the first thing that we see: hey, we're going to need more variations on the shape of the data, or we're going to need a few more preset joins to be available here. And I think that's really exciting. Beyond that, it's a matter of trying to automate a lot of the consumption metrics that you're typically forced to build manually in some of these tools. And so what we're really excited about is being able to show, on a per-team basis and on a per-model basis, how models are being consumed, so that you have some objective fact you can bring back to your data team and say, hey, this isn't working, these are working, we should invest more in this area. You can also look at how they're using those models: you can follow the evidence trail from the data library to the actual canvas where a model is being used, so that you can really grok what they're doing with that data. So yeah, I think those are the two things that come to mind.
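Those consumption metrics boil down to aggregating a usage log per model. A small sketch of what that rollup might look like, using a hypothetical event shape (one dict per interaction, with invented field names; this is not a Canvas API):

```python
from collections import Counter


def model_usage(events: list) -> Counter:
    """Count how often each model is used, from a hypothetical usage log.

    Each event is a dict like {"model": "orders", "team": "finance"}.
    """
    return Counter(e["model"] for e in events)


def prune_candidates(events: list, all_models: set) -> set:
    """Models nobody has touched: the objective evidence for pruning."""
    used = set(model_usage(events))
    return all_models - used
```

The same log, grouped by `team` instead of `model`, would give the per-team view Ryan mentions; either way, the point is replacing gut feel about which models matter with counts from actual usage.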
In terms of the applications of Canvas that you've seen as you have opened it up to your customers and entered general availability, what are some of the most interesting or innovative or unexpected uses that you've seen?
I mean, one of the more interesting things was honestly just seeing people wanting to connect their data without talking to us. We're doing a product-led growth strategy, which, for those who don't know, basically means putting up a free trial or freemium tier, which for data tools is, I think, relatively new. So that's been definitely surprising, just watching how many teams will connect their warehouse and kick the tires on it without having to talk to anyone. I think once you get into the tool and into the use cases, it's super interesting seeing just how many different verticals and use cases are out there, but at the end of the day, how similar a lot of the data structure that they need is, and how similar a lot of the common data sources that they need are. And so that's something we're really excited about: we're creating essentially templates that will be out-of-the-box packages, where you can say, hey, if you're looking at Stripe data and you're on this schema, or if you're on this stack, you'll be able to consume that model automatically. And that saves your data team some time on the shallower or simpler models, stuff that doesn't have custom fields in it. So I think that's been an interesting insight: there's so much variety and variability out there, but at the end of the day, a lot of the actual questions being asked are the same, and there's a huge opportunity to automate some of that.
As far as the experience that you've had of starting this company, building up the product, and working with your customers, what are some of the most interesting or unexpected or challenging lessons that you've learned as you've explored further into this space?
I think as a founder, it's definitely, first, the amount of time that it takes to build a product that people want to use, especially in the data space. It's not a small task to really make something that's different, but also something that's reliable; there's just a lot of work that goes into that. And it takes a lot of patience and working as hard as you can, which are two things that are pretty hard for humans. So I think that's been one part of the journey. I think it's also realizing how much it takes to really get embedded with your customer. Get on a texting basis with them, make sure they understand that you're not just a product, but that you're actually a service to them. A lot of the reason why folks buy from early-stage startups is, hey, you're our external data team as a service, you're helping us think about these things, you're working with other companies, you're seeing what's out there. Going in, I thought a lot of it would just be, hey, we build software and we give it to you. But it's often much more than that. They trust you and want to learn from you: hey, what are some best practices? Are we doing the right thing here? What are some things that you're seeing from the business that maybe don't make sense? And so you're a consultant in a way, which I think was really surprising and pretty cool, honestly, because you get a chance to help someone out with not just saving time, but also thinking about their role and thinking about their business.
And for people who are looking for a way to make their data self-serve and easier to access for business users, what are the cases where Canvas is the wrong choice?
You know, if you're looking for hardcore data analysis, like Python and R and tools that are going to be very heavy for the data team, we're not going to be a great fit. I think we complement a lot of tools in that case, and we do have a couple of customers that have multiple front ends for different personas. I honestly think that's how the arc of the modern data stack is bending: it's bending towards best-of-breed tools for whatever use case. So I would say that's probably the clearest branch of the decision tree: if you need it for data science use cases, we're not going to be a great fit.
As you continue to build and iterate on the product, what are some of the things you have planned for the near to medium term, or any capabilities or use cases that you're excited to dig into?
One of the big things, as I alluded to earlier a little bit, is templates. So taking in a lot of the requests that we're seeing in terms of what business users want to do across these different companies, and really working on programmatically implementing models that can help save the data team time from having to build those, and also giving business teams, hey, here are the ten different wide tables that you would want to see without requesting them from your data team. Or, hey, here are some of the metrics that you would want to see, without having to spend half an hour to build them out. There really isn't a great way today, beyond Googling how to build a metric and compiling all the data sources in a sheet offline, to understand how metrics are built, how they're calculated, and what the best practices for them are. So we're really excited that we're taking the learnings from across our customers and creating templates and models and packages that can be consumed pretty much automatically. That's one of the big things that we're doing. We're also providing that to folks who don't necessarily have a warehouse by putting them on a managed data stack, instead of having to turn away customers that don't have a warehouse. So those are two really big things that we're working on that we can't wait to get out there.
All right. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
The biggest gap, in my mind, is really opening up that conversation and that collaboration between the creators of the data and the consumers of that data. For all the advancements that we've made on the tooling side, understanding each other in a systematic way and collaborating across that divide is, I think, still the biggest gap that I see, and it's something that we're most excited about with Canvas.
All right, well, thank you very much for taking the time today to join me and share the work that you've been doing on Canvas. It's definitely a very interesting tool and product, and an interesting approach to a very real problem. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.
Yeah, thanks so much, guys. Have a great one.
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.