Exploring Ant Financial’s Cloud-Native Journey with Haojie Hang

 

Some highlights of the show include:
  • The challenges of operating digital commerce at scale, including the need for resource pooling and resiliency — and how this caused Ant Financial to re-think their infrastructure.
  • Ant Financial’s former approach to scaling, which was mostly manual, and highly resource-intensive.
  • How Kubernetes is expediting cloud development for Ant Financial.
  • Haojie’s thoughts on the global engineering skills gap, and China’s growing cloud computing market including driving factors and barriers.
  • Why Ant Financial’s migration has largely been a success — and why achieving operational security is now a top priority for the company.
  • How Ant Financial is managing the disconnect between its engineers and business leaders.
  • The company’s ongoing mission to migrate its systems and applications away from legacy architectures.
Transcript
Announcer: Welcome to The Business of Cloud Native podcast where we explore how end users talk and think about the transition to Kubernetes and cloud-native architectures.
Emily: So, I always start the same way. Can you introduce yourself?
Haojie: Hey, my name is Haojie Hang. I'm a product manager in the CTO office at Ant Financial. I work on the product and strategy side for, basically, the CTO and the other executive leaders, as well as leading a small product team within the org to look at the frontier technology in the cloud and other infrastructure businesses.
Emily: And can you tell me a little bit more about what Ant Financial does? And then, also, what do you do on a day to day basis? What do you do when you get into the office?
Haojie: Yeah, I'll do a quick introduction about the Ant Financial business. It's not just one business or two business, it's a group of businesses that we innovate and we do, mostly in China, but we're also expanding very rapidly all over the world. So, Ant Financial is basically a group of businesses including credit for both consumers and the enterprise, as well as loan businesses, both consumer and enterprise businesses. We say that the parent organization is basically, we call it Alipay, it’s the earliest business we do since 2004 when the business was basically born from Taobao, which is our parent company. So, in short, the Ant Financial Business has a lot of presence in the business of payments business, remittance, credit card, loans, securities, and many other businesses like intelligent technology, blockchain, pretty much everything you can imagine in the FinTech and financial services, we’re in there.
Emily: Tell me a little bit more about the cloud-native journey for Ant Financial. When did it start? Why did it start? What was some of the motivations behind moving to cloud-native?
Haojie: Yeah, it's actually quite interesting. I joined Ant Financial in 2008, but actually, the entire company started to look at cloud-native technology quite early, in 2012. So, back then, people were just looking at these technologies around the world, mostly from the US, they look at this open-source community, look at what other companies are doing, how to use the cloud-native technology to help with their business in the peak time, so during event. There’s online promotion event we're doing every year, called Double 11—Shuāng shíyī in Chinese. Every year, so we have a large amount of promotional events happening online, trying to help merchants and the customer is trying to sell and buy stuff in our Tmall and Taobao platform in very, very discounted price. So, for that promotion event online, we have to think about the resilience, the resource pooling, oftentimes the visits has to increase multiple times, sometimes over 100 times the increase compared to the normal time. So in that case, we have to think about how we can be very resilient and efficient infrastructure to support that business needs. So, this is a very large topic. And then, back then, there was a lot of focus and study in our cloud computing department. So, we started looking at this technology called Mesos in 2012. And then, we do a lot of experiments around this technology, but from the business perspective, it's still hard to justify the benefits of moving to Mesos completely. So, we have multiple teams doing a lot of research in Mesos, in Kubernetes, sometimes in our own technology stack, but there's not enough proof or enough confidence for us to move completely over to that technology, until the emergence of Docker container, this Docker technology. 
Then we started to look at our container infrastructure, really do the investigation around this technology, and understand why this is taking over so quickly over the world, from the business perspective, and from the technology perspective. If you look at the community of Docker, the thing does not really happen until 2015. But we are already in the game for about a year or two. So, we're actually quite happy about our original strategy, but it's just in terms of the research. We're actually a little bit behind in terms of moving to this cloud-native architecture. But as you can see, that I had an interview with CNCF. So, we are very happy about the results that we have right now. Pretty much the entire architecture we run within Ant Financial is, basically, on Kubernetes ecosystem. It's not just using the open-source version of it. We're doing a lot of customization around this open-source framework. Yeah, I can talk more about the details.
Emily: Yeah. Well, let's back up just a little bit. I’m curious what you were doing to manage this scaling before? And how did that change? And what about the whole process changed? Like, how stressful is it now, compared to before?
Haojie: The process was very manual, I would say. We have an extremely large team of engineers, and DevOps, security teams. And oftentimes their responsibilities overlap. So, some engineers are doing security work, some engineers are doing basically operational work. I would say, some people really hated it because they have to be on the computer, looking at the monitor 24/7, making sure transactions succeeded and that, when the peak time happens, there's nothing wrong. Sometimes they have to keep their phone open 24/7, basically to make sure this thing will not fail, right? So, in the previous way, the way we do this operation is quite manual. We don't have a mature system or methodology telling us what we should do first, what we should do second, and what we should do after this. So, basically the collaboration chain was not there. Therefore, when an issue happens, our operation team has to respond very quickly. But then, how can we quickly identify the problem, and mitigate the problem? That's a problem, right? So, we have to make sure every time we respond, we respond in a very effective manner. That's the problem. In the previous process, when something unexpected happened, we had to engage with the entire team: product, engineering, operation, security. Everybody has to get up and look at the problem together, which was quite inefficient. So, we moved to this cloud-native architecture. It's not the standard cloud architecture; we have a lot of innovation on top of this, to make sure it fits our tech community and our businesses. So, we basically did a lot of innovation in the process to make sure that, after we had this transition, people are clear about the roles, what they should be responding to, and who should be doing what. That's quite important.
Emily: And tell me a little bit more about some of the additional layers or some tools that you've built on top of Kubernetes, and how that's helped you be successful.
Haojie: Yeah, so I can give you an example. So, when we look at Kubernetes technology in our intelligent technology, or intelligent cloud business units, we're thinking about how can we use Kubernetes for cloud deployment. Okay, so previously we are using Mesos to do that. But we found this technology that lots of people are familiar with this technology. And then, people are not very sure if Mesos is the right path for container management, resource management or cloud deployment. But when we move to the Kubernetes for cloud deployment, people are actually quite happy. We are seeing a decrease in the amount of—to stand up the cloud. And previously, it took us two to three weeks to build an entire cloud. But after we use the Kubernetes technology, we can do that in a week. Oftentimes, if the scale is smaller, we can do that within three days. That's quite important because people are confident about the community; about this technology. And then, from the users perspective, they are also more willing to invest in Kubernetes. Oftentimes, this is the chicken-egg problem, right? When more companies are hiring more this, these terms appears in the market, in the job descriptions, the more people are willing to learn. So, this is actually what we're seeing a very, very good cycle for me, from both a company perspective and the talent perspective. So, that's actually quite good. But the problem for us is, there's still not enough people, or we say, you know, good talents in the market that we can attract. Basically, we're seeing a shortage of great engineering talent in the market, after the cloud-native transition. So, we're still trying to think about how we can educate the internal audience in the technology community to help them quickly pick up this new technology in the cloud, as well as the practice behind the cloud-native architecture.
Emily: You know, I wanted to talk to you a little bit about the overall situation in China and that there's also this, sort of, skills gap. It sounds like it's just as present in China as it is in other parts of the world.
Haojie: So, I would say in terms of the cloud computing, cloud-native tech community, we had a community forming pretty much as early as the rest of the world. But then, in early days, it was just a marketing term by people saying, oh, this is cloud, we want to learn something. But then really, from the business perspective, there were still not enough customers trying to pay for this technology. Oftentimes, the contract size was not large enough to feed engineers. That's what I say. And then, I think the trend of the serious adoption really happened in the past three to four years, when a lot of startups came out in the cloud businesses. There’s a company called [QingCloud]. There’s a company called [unclear]. There are many other unicorns in China in the cloud computing space, and I think two of them just went public in the A-listed share recently. So, from that perspective, I would say, the cloud computing business has really been maturing rapidly in the past two years, because we see some unicorns really coming out of this game, besides Alibaba. I think, from that perspective, I will say, it's getting better and better. It's just in terms of the pay behavior, right? How much are customers willing to pay for this technology, pay for the services, pay for the products? I think it will still take some time to mature.
Emily: To what extent do you see Chinese companies using cloud-native services and tools from Europe, from the United States, from elsewhere in the world? To what extent is it a segregated market? Like, the rest of the world doesn't use Chinese tools, Chinese companies don't use tools from the rest of the market.
Haojie: It's a good question. To me, I'm coming from an engineering background, I believe open-source community is global. It's a global phenomenon. I think the world is connected. And I would never say that people in China, or Chinese companies, or—they only using technology businesses that created in China, this is not right. And often in many cases, there's not enough options, right? So, I think, even though Chinese companies and startups are trying to innovate very aggressively, but I think the world is still connected, they have to build on top of the innovation that’s already happened in the rest of the world. So, in that case, I think we're still seeing a lot of collaboration across the globe. China, United States, for sure, Europe, other parts of the world. It's just how aggressive people are in terms of investing in frontier technology. And are they really seeing the benefits of using the frontier technology? There’s question of technology innovation versus business innovation, right? Do you see the business value? Can you really see that in the next five years? I would say in China, most of the non-internet sector, they're quite short-sighted. They're still trying to survive, they're trying to make sure they are doing—they can become the top three, top five in their business. So, technology is oftentimes secondary. But for the leaders in that sector, they have to think about that quite early in order to become the top player in their sector. So, I think, the trend is that people are still collaborating academically and engineering side to make sure the right technology gets applied in the right scenario and trying to improve the technology at home.
Emily: It's interesting that you mentioned that some Chinese companies might not focus as much on the technology. Do Chinese companies tend to consider moving to cloud-native, important for their business? Or a strategic move?
Haojie: From the strategy perspective, yes. Every leader would definitely know about cloud computing, cloud-native architecture, they would definitely think about moving. It's just that internal execution when they think about moving seriously, they have to evaluate, do we have enough talent? How much business value am I getting out of this? Is it really helping? What is my budget? And all those kind of fears, problems. So, that's what backing them up because oftentimes they don't have enough budget. That's what I say in those non-internet sector. Because I've lived and worked in the US for a while. I think in the US, the non-internet sector are quite advanced in terms of the technology adoption, especially in cloud-native. It's quite easy for them to recruit, then build a large engineering team to work on cloud infrastructure software. But it's not the case in China because people are still trying to, especially the leader who can make the decision, they're still thinking about the ROI, the rate of returns, the rate of investments, for building a strong software teams, making sure they have the robust infrastructure running at the bottom level. So, they are still trying to figure out the budget, make sure they are profitable enough to afford that.
Emily: Do you feel like Ant Financial got the business benefits out of the transition that it was looking for?
Haojie: Yeah, as I mentioned, the entire organization is quite happy about the move because really, they are, kind of, [unintelligible] move in China. So, basically, even the non-engineering teams started to appreciate this, and talk about this technology, and try to understand it deeper, because they see the entire organization is quite happy, especially from the business perspective. As I mentioned, in the Double 11 event last year, in 2019, the GMV we had was 260 billion RMB in total, which is 25 percent growth compared to the year before. So, that large amount of GMV was supported entirely by our cloud infrastructure. It is quite massive. We don't have an infrastructure just for that business; we have the infrastructure for the entire group, including Ant Financial and Alibaba. Basically the entire business is running in the cloud. We have very, very few siloed data centers and infrastructure. Basically, we have the entire thing running in a single cloud. That's the largest achievement we've had, I think, since 2019; it was one of the strategic goals we had, and we achieved it last year. And this year, we're putting a lot of emphasis on secure operation. It has been one of the primary cloud business goals because when bad things happen, people are literally losing money. Imagine one of the transactions failed. It failed the entire country, right? Like, no one else in China or in the other parts of the world can make a purchase from the Alipay app. This is quite devastating. So, secure operation has been the only thing we focus on this year, I would say. I remember in some meetings, one of the leaders mentioned, “If there’s only one thing we should do this year, it’s secure operation.” We're trying to make sure we operate the entire business safely on top of our new cloud-native architecture, with the minimum amount of incidents and failures.
Emily: And what do you think have been some of the challenges? What has been more difficult than you imagined in making this transition?
Haojie: Yeah, I think for me, the most obvious point is that we still have a large amount of operating team engineers, and support team, and the product, and the entire organization, basically, to making sure the entire thing working seamlessly because I think it's very hard to quantify it. I think the overall efficiency in running the cloud-native architecture, we're still looking at that. Let me try to find a good example.
Emily: Let me ask a question. What's gone unexpectedly well? Was there anything that you thought was going to be really challenging that wasn't?
Haojie: Oh, I think after moving to the cloud-native architecture, the engineers are quite happy. They're working much, much harder. They're trying to do things much more quickly than we imagined. Basically, they are very aggressive, and very happy to see the leadership teams really buying into this technology and wanting to invest seriously in it. They are building not only the engineering team but also the product team, the entire organization, around cloud-native technology. So, oftentimes in order to persuade business leaders to do something serious in the technology, they have to spend a lot of time trying to evangelize to the leadership team to make sure they understand, oh, this is the right direction. We have to do this right. It oftentimes takes from six months to a year for them to really do that. So, for that, I think it's quite successful. Basically, I think the entire engineering culture has changed. People are looking at the open-source community more aggressively. They think about how we should contribute back to the community. What community events should I support? What conferences should I go to? There's more and more discussion like that happening within the organization. And, I think, at large, Ant Financial has become one of the sponsors of the events. We are one of the most active participants in the community, I think, since 2019, along with Alibaba. So, that's the positive side I'm seeing. People really start to form a culture on their own, especially in the open-source community. Trying to be more present, trying to take a more active position in the discussion, both within the company and outside of the company. So, that's actually quite good. We're happy to see engineers are doing their work, and are doing it more aggressively.
Emily: Do you feel like there's any sort of disconnect between the engineering teams and business leaders? Or do you feel like they're mostly on the same page?
Haojie: Yeah, I would say there are still some gaps between the business leaders and the engineers. So, oftentimes, I would say the engineers are quite updated with what's going on in this community, in some new plugins, in some new components coming out of this Kubernetes ecosystem, but then the business leaders don't have enough time to pay attention to this. So, it really depends on how confident they are about this technology, and how much more time they want to put into this personally. I think the business leader will look at the numbers like KPIs, metrics, the number of accidents, the operating efficiencies, things like that, but that’s in the business context, all right. The engineering leaders care more about what kind of new technology we use, what kind of new technology we created on top of this ecosystem, and how many people are happy about using this technology. And how much more can we do from this transition? So, basically, they are disconnected. So, I think the good part in Ant Financial is that most of the business leaders are coming from an engineering background, but they have a strong KPI in their work. And then, most of the engineering leaders have to learn business because, in order to persuade business leaders to invest in this, they have to think from their perspective. I think, in terms of the communication, they're quite up to date. It's just in terms of the execution and the timelines that some disconnects happen. Yeah.
Emily: What would you say that business leaders are looking for that engineering teams might not be thinking about?
Haojie: I think one example that I see is a business leader will think about the team building, the talent building, the culture, and the image that we have in public, especially in China. Yeah, let me give you an example. If the company is not cool enough, from the technology, from the engineering perspective, it’s very hard for the business leader to attract the top engineering talent. Without strong engineering teams, we cannot execute. We cannot innovate. So, that's something they oftentimes think about when they try to invest in technology. But in terms of the execution, after the engineers get on board, and work in Alibaba, in Ant Financial, that's something engineers have to think about. How do they keep the talent? How do they make sure talents are happy? How do we make sure they are satisfied about what they do? So, I would say these two things have to work at the same time. You cannot have a strong image in technology, in frontier technology, but then, after the talent gets on board, they realize, oh, this is just great from outside, but from inside, we are still working on the legacy technology. It operates very inefficiently internally, and how can we make sure people are dealing with this? And I think that's quite important.
Emily: Is there anything that you think is preventing you from moving further along in the cloud-native journey? Anything other than lack of human resources?
Haojie: I would say, how can we securely move away from the legacy architecture, whether it's built privately or built using another vendor’s technology? You know, that kind of transition we're taking very seriously. We still have a large amount of systems and applications running on Oracle, sometimes on MySQL, sometimes on other siloed stacks. We're in one cloud, but we're not 100 percent away from Oracle, MySQL, or that type of what we consider legacy architecture. So, the move will still take some time. And so, how can we make sure the transition is successful? How can we make sure the transition is less painful? That is something we as the leaders and the business executives will think about, because how can we set up the right KPI and the right goal for engineers to feel happy about doing this work? I think that's one of the challenges. Oftentimes, when people are placed into this kind of work, moving from legacy architecture to new architecture, there is just very minimal business value we can see from this transition, right? So, we have to set a strong enough goal to motivate them to do the work. That's something we have to really think about in the long term. Because this is not like we do that for six months, a year. It's going to be an effort for the next three to four years. Imagine, the Alipay business started in 2004, and it's already been 16 years. So, the transition has to happen over time. It's just, how can we make sure that the transition is less painful?
Emily: Tell me just a little bit more about some of the custom capabilities that you built on top of Kubernetes.
Haojie: We have our own internal monitoring architecture, which is quite advanced, I would say. And this kind of monitoring infrastructure is built for both developers and operators. I think that is something we invest in extremely heavily because we cannot find any other alternatives in the market. I'll give you some background about this monitoring infrastructure. So, the entire tech stack was primarily built on the Java stack, starting from 2004. And now a lot of cloud-native technologies are leveraging Go, right? So, the monitoring of Go is quite different from the monitoring of Java. We have different versions of the JDK and JRE that we created. One of them was actually recently open-sourced, called Dragonwell. You can check it out online; there are a lot of posts around that. So, we have to make sure the entire stack, from the application, middleware, the mesh level, container host, all the way down to the compiler, is monitored quite efficiently. Once anything happens, from the operation side or from the technology side, we have to quickly respond to identify in what layer the error happened. In order for that mitigation to be efficient, we have to make sure we are monitoring every single thing in the stack. As I mentioned, from the application, middleware, host level, all the way down to the hardware level. Sometimes a failure in hardware will cause an entire failure in our business. It happens quite often. So, we have to make sure we are monitoring our own technology in a very good manner. And also imagine monitoring that amount of infrastructure at that massive scale. It's very challenging. I think before 2014, we had a lot of failures in our monitoring infrastructure. This is quite ridiculous, but this is what happened. So, we spent a lot of time to make sure we have the supporting infrastructure ready for that kind of business. That's quite important.
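[Editor's illustration: the idea of monitoring every layer so that a failure can be traced to the layer where it happened can be sketched roughly as below. This is not Ant Financial's system, which is not described in detail here. The layer names are taken from the interview; the function name, the input format, and the rule that the lowest unhealthy layer is the most likely origin are all assumptions made for the example.]

```python
from typing import Dict, Optional

# Layers named in the interview, ordered from the top of the stack to the bottom.
LAYERS = ["application", "middleware", "mesh", "container_host", "compiler", "hardware"]


def deepest_failing_layer(health: Dict[str, bool]) -> Optional[str]:
    """Return the lowest layer reporting unhealthy, or None if all are healthy.

    Assumption: failures propagate upward (a hardware fault also breaks the
    application), so the lowest unhealthy layer is the best first suspect.
    Layers missing from `health` are treated as healthy.
    """
    for layer in reversed(LAYERS):
        if not health.get(layer, True):
            return layer
    return None


if __name__ == "__main__":
    # Hypothetical snapshot: the application and the container host both look bad.
    snapshot = {"application": False, "middleware": True, "container_host": False}
    print(deepest_failing_layer(snapshot))
```

[With this hypothetical snapshot the function points at the container host rather than the application, which is the kind of shortcut that only works if every layer is actually being monitored.]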
Emily: Anything else that you'd like to add about either your own experience moving to cloud-native or some observations about how things are going in China in general?
Haojie: I think from the strategy perspective, Chinese companies or startups from China are doing quite well. It's just the market is quite different. For companies to survive and thrive in the Chinese market, they have to go with the customers, right? So, even though the innovation happens at the same level, the customers are not at the same level from what I see. But overall, I think the trend is quite positive. I think eventually, be it five years, or seven years, or ten years, Chinese companies, Chinese customers will be at the same level as the rest of the world: in the US, in the UK, in Australia, in the rest of the world. I think people are more and more aggressive, and they would like to allocate more and more budget into technology business. They realize the benefits of it, especially in the current outbreak, when people cannot go to work, but they still have to do something. The business has to survive. Like, they have to do something in order for the business to survive. So, from the business perspective, how they can build a strong online presence during the outbreak is actually quite important. Before the outbreak, I would say, in the retail business, there were still some people thinking, “Oh, how can we do this in our traditional manner? How can we open as many stores as possible?” They didn't really care about building a store online, on Taobao or Tmall, [unintelligible] seriously. But during the outbreak, people have to stay at home. They have nowhere to go. But then the businesses still have to pay their employees. So, how can you do that? The only thing is going online. In order to go online, they have to build online infrastructure for their customers, for their employees, for them to work. So, honestly, that's one of the trends I'm seeing: that people are paying more and more attention to working remotely, and using software, SaaS software, without on-premise deployments.
In that case, people are able to work wherever they go. Being at home, in the office, on the road, people are really interested in the benefits of SaaS, of cloud. I think that's something that I'm seeing. I think after this year, definitely the market of SaaS will become better and better because not only is the technology there, but the business leaders will understand the value of using Zoom, using Ding Ding, using WeChat, to make sure their employees can work anywhere they want.
Emily: Well, thank you so much. A couple finishing up questions. First of all, what is an engineering tool that you couldn't do your job without?
Haojie: Do you mean, like, just tools for me to do some engineer work?
Emily: Yeah. What's your favorite tool, something you just can't imagine working without?
Haojie: We have a lot of tools innovated within the company. I don't think I can mention that in this podcast.
Emily: Okay. No problem. And then, how can people connect with you if they want to?
Haojie: At work, or outside of work?
Emily: Like, on Twitter or on social media.
Haojie: Yeah, I've had a lot of invitations from LinkedIn, not so much on Twitter because I'm not active on Twitter. But I think people oftentimes get to know me from word of mouth. They get introduced by other friends of mine; they want to understand the technology adoption in China, especially in the cloud. Yeah, people oftentimes reach me from LinkedIn; that's the primary source.
Emily: Well, thank you so much. I really appreciate you taking the time to chat.
Haojie: Thank you, Emily.
Announcer: Thank you for listening to The Business of Cloud Native podcast. Keep up with the latest on the podcast at thebusinessofcloudnative.com and subscribe on iTunes, Spotify, Google Podcasts, or wherever fine podcasts are distributed. We'll see you next time.
This has been a HumblePod production. Stay humble.