Scaling in the Cloud: A Conversation with Jon Tirsen
In this episode of The Business of Cloud Native, host Emily Omier talks with Jon Tirsen, engineering lead for storage at Cash App. This conversation focuses on Cash App's cloud native journey, and how they are working to build an application that is more scalable, flexible, and easier to manage.

The conversation covers:

- How the need for hybrid cloud services and a uniform programming model led Cash App to Kubernetes.
- Some of the major scaling issues Cash App was facing. For example, the company needed to increase user capacity and add new product lines.
- The process of trying to scale Cash App's MySQL database, and the decision to split up their dataset into smaller parts that could run on different databases.
- Cash App's monolithic application, which contains hundreds of thousands of lines of code, and why it's becoming increasingly difficult to manage and grow.
- How Jon's team is trying to balance product/business and technical needs, and deliver value while rearchitecting their system to scale their operations.
- Why Cash App is working to build small, product-oriented teams, and a system where products can be executed and deployed at their own pace through the cloud. Jon also discusses some of the challenges that are preventing this from happening.
- How Cash App was able to help during the pandemic by facilitating easy stimulus transfers through their service, and why it wouldn't have been possible without a cloud native architecture.

Links:

- Cash App: https://cash.app/
- Square: https://squareup.com/us/en
- Jon on Twitter: https://twitter.com/tirsen?lang=en
- Connect with Jon on LinkedIn: https://www.linkedin.com/in/tirsen/?originalSubdomain=au
- The Business of Cloud Native: http://thebusinessofcloudnative.com

Transcript

Announcer: Welcome to The Business of Cloud Native podcast, where we explore how end users talk and think about the transition to Kubernetes and cloud-native architectures.

Emily: Welcome to The Business of Cloud Native. My name is Emily Omier, and I'm here chatting with Jon Tirsen.

Jon: Happy to be here. My name is, as you said, Jon Tirsen, and I work as the engineering lead of storage here at Cash App. I've been at Cash for maybe four or five years now, so I've been with it from the very early days. And before Cash, I was doing a startup, that failed, for five years. It was a travel guide for mobile phones. And before that, I was at Google working on another failed product called Google Wave, which you might remember, and before that, I was at a company called ThoughtWorks, which some of you probably know about as well.

Emily: And in case people don't know, the Cash App is part of Square, right?

Jon: Yes. We're separating all the different products quite a lot these days. So, it used to be called just Square Cash, but now it has its own branding, and its own identity, and its own leadership, and everything. So, we're trying to call it an ecosystem of startups. So, each product line can run its business the way it wants to, to a large degree.

Emily: And so, what do you actually spend your day doing?
Jon: Most of my days, I still write code, and do various operational tasks, and set up systems, and testing, and that sort of thing. I also spend maybe about half my day on more management tasks, which is reviewing documents, writing documents, and talking to people trying to figure out our strategy and so on. So, maybe about half my time, I do real technical things, and then the other half I do more management stuff.

Emily: Where would you say the cloud-native journey started for you?

Jon: Well, a lot of Square used to run on-premises. So, we had our own data centers and things. But especially for Cash App, since we've grown so quickly, it started getting slightly out of control. We were basically outgrowing—we could not physically put more machines into our data centers. So, we started moving a lot of our services over to Amazon, and we wanted to have a shared way of building services that would work both in the Cloud and also in our data centers. So, something like Kubernetes and all the tools around that would give us a more uniform programming model that we could use to deploy apps in both of these environments. We started that two, three years ago. We started looking at moving our workload out of our data centers.

Emily: What were the issues that you were encountering? Give me a little bit more detail about the scaling issues that we were talking about.

Jon: There are two dimensions along which we needed to scale out the Cash App, sort of, system slash [unintelligible] architecture. One thing was that we just grew so quickly that we needed to be able to increase capacity. So, that was across the board: from databases to application servers, and bandwidth, everywhere. We needed to be able to increase our capacity to handle more users, but we were also trying to grow our product. So, at the same time, we also wanted to be able to add new features at an increased pace. So, we wanted to be able to add new product lines in the Cash App.

So, for example, we built the Cash Card, which is a way you can keep your money in the Cash App bank accounts, and then you can spend that money using a separate card, and then we add new functionality around that card, and so on. So, we also needed to be able to scale out the team, to have more people working on the team to build new products for our users, for our customers. Those are the two dimensions: we needed to scale out the system, but we also needed to have more people be able to work productively. So, that's why we started trying to chop up—we have this big monolith, as most companies probably do, with I don't know how many hundreds of thousands of lines of code in there. But we also wanted to move things out of that, to be able to have more people contribute productively.

Emily: And where are you in that process?

Jon: Well, [laughs], we're probably still adding code at an exponential rate to the monolith. We're also adding code at an exponential rate outside of the monolith, but it just feels so much easier to build some code in the monolith than it is outside of it, unfortunately, which is something we're trying to fix, but it's very hard. And it is getting a little bit out of hand, this monolith, now. So, we have, sort of, a moratorium on adding new code to the monolith now, and I'm not sure how much of an effect that has made. But the monolith is still growing, as well as our non-monolith services, of course.

Emily: When you were faced with this scaling issue, what were the conversations happening between the technical side and the business owners? And how was the decision made that the best way to solve this problem is x, is the Cloud, is cloud-native architecture?
Jon: I think the business side—the product owners, product managers—they trust us to make the right decision. So, it was largely a decision made on the technical side. They do still want us to build functionality, and to add new features, and fix bugs, and so on. So, they want us to do that, but they don't really have strong influence on the technical choices we've made. I think that's something we have to balance out. So, how can we keep on giving the product side and the business side what they need? How do we keep on delivering value to them while we try to rearchitect our system so that we can scale out our operations on our side? It's a very tricky balance to find there. And I think so far, maybe we've erred on the side of continuing to deliver functionality, and maybe we need to do more on the rearchitecting side. But yeah, that's a constant rebalancing act we're always dealing with.

Emily: Do you think that you have gotten the increased scalability? How far along are you on reaching the goals that you originally had?

Jon: I think we have a pretty scalable system now, in terms of the number of customers we can service. So, we can add capacity. If we can keep on adding hardware to it, we can grow very far. We've actually noticed that the last few weeks, we've had almost unprecedented growth, especially with the Coronavirus crisis. Every single day, it's almost a record. I mean, there are still issues, of course, and we're constantly trying to stay on top of that growth, but we have a reasonably good architecture there. What I think is probably our larger problem is the other side, the human side. As I said, we are still adding code to this monolith, which is getting completely out of hand to work with. And we're not growing our smaller services fast enough. It's probably time to spend more effort on rearchitecting that side of things as well.

Emily: What are some of the organizational, or people, challenges that you've run into?

Jon: Yeah. So, we want to build smaller teams oriented around products. We see ourselves more as a platform of products these days: we're not just a single product. And we want to build smaller teams. That is, maybe we have one team that is around our card, and one team around our [unintelligible] trading, and so on. And we want to have these smaller teams, and we want them to be able to execute independently.

So, we want to be able to put together a cross-functional team of some engineers, and some UX people, and some product people, and some business people, and then they should be able to execute independently and have their own services running in our cloud infrastructure, and not have to coordinate too much with all of the other teams that are also trying to execute independently. So, each product can do its own thing, and own its own services, and deploy at its own pace, and so on. That's what we're trying to achieve, but as long as they still have to do a lot of work inside of our big monolith, they can't really execute independently. So, one team might build something that actually causes issues with another team's products, and so on, and that becomes very complicated to deal with. So, we're trying to move away from that, and move towards a model where a team has a couple of services that they own, and they can do most of their work inside of those services.

Emily: What do you think is preventing you from being farther along than you are? Farther along towards this idea of teams being totally self-sufficient?
Jon: Yeah, I think it's the million-dollar question, really. Why are we still seeing exponential growth in code size in our monolith, and not in our services? And I think it's a combination of many, many things. One thing, I think, is that we don't have all of the infrastructure available to us in our cloud, in our smaller services. So, say you want to build a little feature, you want to add a little button that does something, and if you want to do that inside our monolith, that might take you two, three days. Whereas if you want to pull up a completely new service—I think we've solved it at the infrastructural layer; it's very quick and easy to pull up a new service, and have it run, and be able to take traffic, and so on—but it's more the domain-specific infrastructure: being able to access all the different data sets that you need to access, and being able to ship information back to the mobile device. All these things are very easy to do inside a monolith, but much harder to do outside of it. So, we have to replicate a big set of what we call product platforms. So, instead of infrastructural platform features, these are more product-specific platform features, like customer information, and being able to send information back to the client, and so on. And all those things have to be rebuilt for cloud services. We haven't really gotten all the way there yet.

Emily: If I understood correctly from the case study with the CNCF, you sort of started the cloud-native journey with your databases.

Jon: Yes, that was the thing that was on fire. Cash App was initially built as a hack week project, and it was never really designed to scale. So, it was just running on a single MySQL database for a really long time. And we actually, literally, put a piece of hardware on fire with that database. We managed to roll it off, of course, so it didn't take down our service, but it was actually smoking in our [laughs] data centers. It melted the servers around it in its chassis. So, that was a big problem, and we needed to solve it very quickly. So, that's where we started.

Emily: Could you actually go into that just a little bit more? I read the case study, but probably most listeners haven't. Why was the database such a big problem? And how did you solve it?

Jon: Yeah, as I said, we only had a single MySQL database. And as most people know, it's very hard to keep on scaling that, so we bought more and more expensive hardware. And since we were a mobile app, we don't get all the benefits from caching and replica reads: most of the time, the user is accessing data that is already on the device, so they don't actually make any calls out to our back end to read the data. Usually, you scale out a database by adding replicas, and caching, and that sort of stuff, but that wasn't our bottleneck. Our bottleneck was that we simply could not write to the database; we couldn't update the database fast enough, or with enough capacity.

So, we needed to shard it, and split up the data set into smaller parts that we could run on separate databases. And we used a thing called Vitess for that, which is a Cloud Native Computing Foundation project, a product in the [unintelligible] CNCF. And with Vitess, we were able to split up the database into smaller parts. It was quite a large project, and especially back then, it was quite early days for Vitess. So, Vitess was used to scale out YouTube, and then it was open-sourced. And then, we started using it. I think, not long after that, it was also used by Slack. So now, currently, Slack uses it for most of its data. And we started using it very early, so it was still kind of early days, and we had to build a lot of new functionality in there, and we had to port [00:15:20 unintelligible] make sure all of our queries worked with Vitess. But then we were able to do shard splitting. So, without having to restart or have downtime in our app, we could split up the database into smaller parts, and then Vitess would handle the routing of queries, and so on.
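For a concrete picture of what "splitting up the data set" means, here is a minimal, illustrative sketch in Python. It is not Cash App's code and not Vitess's implementation; in Vitess this routing is handled for you by "vindexes" defined in a VSchema, but the underlying idea is the same: a deterministic function maps each row's sharding key to one of several underlying MySQL databases. The shard names and customer IDs below are made up.

```python
# Illustrative only: a toy router that picks a shard for each customer ID,
# conceptually similar to what a hash-based vindex does when Vitess routes a
# query to the right underlying MySQL database. Shard names are hypothetical.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # placeholder shard names


def shard_for(customer_id: str) -> str:
    """Deterministically map a sharding key to one shard."""
    digest = hashlib.md5(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]


if __name__ == "__main__":
    for cid in ["customer-123", "customer-456", "customer-789"]:
        print(cid, "->", shard_for(cid))
```

Resharding then comes down to changing which parts of that key space map to which physical databases, which is roughly what lets Vitess split shards while the application keeps running and queries keep flowing through its routing layer.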
Emily: If at all, how did that serve as the gateway to then starting to think about changing more of the application, or moving more into services as opposed to a monolith?

Jon: Yeah, I think that was kind of orthogonal in some ways. So, while we scaled out the database layer, we also realized that we needed to scale out the human side of it, so we'd have multiple teams being able to work independently. And that is something I think we haven't really gotten to completely, yet. So, while we've scaled out the database layer, we're not quite there from the human side of things.

Emily: Why is it important to scale any of this out? I understand the database, but why is it important to get the scaling for the teams?

Jon: Yeah, I mean, it's a very competitive space, what we're trying to do. We have very formidable competitors, both from other apps and also from the big banks. For us, being able to keep on delivering new features for our customers at a high pace, being able to change those features to react to changing customer demands, like during this crisis we are in now, and being able to respond to what our competitors are doing, I mean, that just makes us a more effective business. And we don't always know, when we start a new product line, where exactly it's going to lead us. We sort of look at how our customers are using it and where that takes us. Being able to respond to that quickly is very hard if you have a big monolith that has a million lines of code and takes several hours to compile; it's going to be very hard for you to deliver functionality and make changes to functionality in good time.

Emily: Can you think of any examples where you're able to respond really quickly to something like this current crisis in a way that wouldn't have been possible with the old models?

Jon: I don't actually know the details here. I live currently in Australia, so I don't know. But the US government is handing out these checks, right? So, you get some kind of a subsidy. And apparently, they were going to mail those out to a lot of people, but we actually stepped up and said, look, you can just Cash App them out to people. So, people sign up for a Cash App account, and then they can receive their subsidies directly into their Cash App accounts, or into their bank accounts via our payment rails. And we were able to execute on that very quickly, and I think we are now an official way to get that subsidy from the US government. So, that's something we probably wouldn't have been able to do unless we'd invested in being able to respond that quickly, within just weeks, I think.

Emily: And as Cash App has moved to increasingly service-oriented architectures and increasingly cloud-native, what has been surprisingly easy?
Jon: Surprisingly easy. I don't think I've been surprised by anything being easy, to my recollection. I think most things have been surprisingly hard. [laughs]. I think we are still somewhat in the early days of this infrastructure, and there are so many issues; there are so many bugs; there are so many unknowns. And when you start digging into things, it just surprises you how hard it is.

So, I work in the infrastructure team, and we try to provide a curated experience for our product teams, the product engineering teams. So, we deal with that pain directly: we have to figure out how all these products work together, and how to build functionality on top of them. I think we deal with that pain for our product engineers. But of course, they are also running into things all the time. So, no, it is surprisingly hard sometimes, but it's all right.

Emily: What do you think has been surprisingly challenging, unexpectedly challenging?

Jon: Maybe I shouldn't be, but I am somewhat surprised by how immature things still are. Just as an example: how hard it is, if you run a pod in an EKS (Amazon Kubernetes) cluster, and you just want to authenticate to be able to use other Amazon products like Dynamo, or S3, or something. This is still something that is incredibly hard to do. You would think that having two products from the same vendor inside of the same ecosystem would be a no-brainer: that they would just work together, but no. I think we'll figure it out eventually, but currently, it's still a lot of work to get things to play well together.
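One common approach to the problem Jon describes (not something he covers here, and simplified below) is IAM Roles for Service Accounts (IRSA): the pod's Kubernetes service account is annotated with an IAM role, EKS injects a web identity token into the pod, and the AWS SDK's default credential chain exchanges that token for temporary credentials. A minimal Python sketch, assuming IRSA is already configured for the pod and using placeholder resource names:

```python
# Minimal sketch, assuming IAM Roles for Service Accounts (IRSA) is set up for
# the pod's service account (e.g. via an eks.amazonaws.com/role-arn annotation).
# The bucket and table names are placeholders, not real resources.
import os

import boto3


def main() -> None:
    # With IRSA, EKS injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE into
    # the pod, and boto3's default credential chain picks them up automatically,
    # so no access keys need to be baked into the container image.
    print("role:", os.environ.get("AWS_ROLE_ARN", "<not set>"))

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="example-bucket", MaxKeys=5)
    print("objects:", [obj["Key"] for obj in resp.get("Contents", [])])

    dynamodb = boto3.client("dynamodb")
    table = dynamodb.describe_table(TableName="example-table")
    print("table status:", table["Table"]["TableStatus"])


if __name__ == "__main__":
    main()
```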
Emily: If you had a top-three wish list of things for the community to address, what do you think they would be?

Jon: Yeah, I guess the out-of-the-box experience with all of these tools, so that they just work together really well without having to manually set up a lot of different things, that'd be nice. I think also, and maybe this all exists and we just haven't integrated all these tools, but something that struck me the other day: I was debugging some production issue—it wasn't a major issue, but it was an issue that had been an ongoing thing for two weeks—and I just wanted to see what change happened two weeks ago. What was the delta? What made that change happen? And being able to get that information out of Kubernetes and Amazon—maybe there are some audit logging tools and all this stuff, but it's not entirely clear how to use them, or how to turn them on, and so on. So, really nice, user-friendly, and easy-to-use auditing and audit trail tools would be really nice.

So, that's one wish, I guess, in general: having a curated experience. So, if you start from scratch, and you want to get all of the best-practice tools, and you want to get all the functionality out of a cloud infrastructure, there are still a lot of choices to make, and there are a lot of different tools that you need to set up to make them work together: Prometheus, and Grafana, and Kubernetes, and so on. Having a curated out-of-the-box experience that just makes everything work, so you don't have to think about everything, would be quite nice. So, Kubernetes operators are great, and these CRDs, this metadata you can store and work with inside of Kubernetes, are great, but unfortunately they don't play well with the rest of the cloud infrastructure at Amazon, at AWS.

Amazon was working on this Amazon operator, with which you would be able to configure other AWS resources from inside of the Kubernetes cluster. So, you could have a CRD for an S3 bucket, so you wouldn't need Terraform. So right now, you can have Helm charts and similar to manage the Kubernetes side of things, but then you also need Terraform stuff to manage the AWS side of things. Something that unifies this, so you can have a single place for all your infrastructural metadata, would be nice. And Amazon is working on this, and they open-sourced something like an AWS operator, but I think they actually withdrew it and they are doing something closed-source. I don't know where that project is going. But that would be really nice.

Emily: To go back again to this idea of the business of cloud native: to what extent do you have to talk about this with business stakeholders? What do those conversations look like?

Jon: At Cash App, we usually do not pull product and business people into these conversations, I think, except when it comes to cost [laughs] and budgeting. But they think more in terms of features, and being able to deliver, and having teams be able to execute independently, and so on. And our hope is that we can construct an infrastructure that provides these capabilities to our business side. So, it's almost like a black box. They don't know what's inside. We are responsible for figuring out how to give it to them, but they don't always know exactly what's inside of the box.

Emily: Excellent. The last question is: is there an engineering tool you can't live without?

Jon: I would say all of the JetBrains IDEs for development. I've been using those for maybe 20 years, and they keep on delivering new tools, and I just love them all.

Emily: Well, thank you so much for joining.

Jon: Thanks for inviting me to speak on the podcast.

Announcer: Thank you for listening to The Business of Cloud Native podcast. Keep up with the latest on the podcast at thebusinessofcloudnative.com and subscribe on iTunes, Spotify, Google Podcasts, or wherever fine podcasts are distributed. We'll see you next time.

This has been a HumblePod production. Stay humble.