Discussing Cloud Native Security with Abhinav Srivastava

This conversation covers:

  • How Frame.io was faced with the decision to be cloud native or cloud-enabled — and the business and technical reasons why Frame.io chose to be cloud native.

  • How Abhinav successfully built a world-class cloud-native security program from the ground up to protect Frame.io users’ sensitive video content. Abhinav also talks about the special security considerations for truly cloud-native applications.

  • Cloud native as a “journey without a destination.” In other words, there is no end point with cloud native transitions, because new technologies are always being developed.

  • Why Abhinav is a firm believer in both infrastructure as code (IaC) and GitOps, and why he thinks the industry should embrace both of these strategies.

  • The challenge of not only maintaining security in this type of environment, but also communicating security issues to various stakeholders with different priorities. Abhinav also talks about the role that specialists like AWS and machine learning experts can play in furthering security agendas.

  • Common misconceptions about cloud native security.

  • Frame.io’s decision to roll out Kubernetes, and why they are also considering adding chaos engineering to fortify against unexpected issues.

  • Tool and vendor overload, and the importance of trying to find the right tools that fit your infrastructure.

Transcript

Announcer: Welcome to The Business of Cloud Native podcast, where we explore how end users talk and think about the transition to Kubernetes and cloud-native architectures.

Emily: Welcome to The Business of Cloud Native. I'm Emily Omier, your host, and today I am chatting with Abhinav Srivastava. Abhinav, can you go ahead and introduce yourself and tell us about where you work and what you do?

Abhinav: Thanks for having me, Emily. Hello, everyone. My name is Abhinav Srivastava. I'm a VP and the head of information security and infrastructure at Frame.io. At Frame, I am building the security and infrastructure programs from the ground up, making sure that we are secure and compliant, and that our services are available and reliable. Before joining Frame.io, I spent a number of years at AT&T Research. There I worked on various cloud and security technologies, wrote numerous research papers, and filed patents. And before joining AT&T, I spent five great years at Georgia Tech on a Ph.D. in computer science. My dissertation was on cloud and virtualization security.

Emily: And what do you do? What does an average day look like?

Abhinav: Right. So, just to finish answering the question of where I work: I work at Frame.io, and Frame.io is a cloud-based video review and collaboration startup that allows users to securely upload their video content to our platform, and then invite teams and clients to collaborate on those uploaded assets. We are essentially building the video cloud, so you can think of us as a GitHub for videos.

What I do when I get to the office, apart from getting my morning coffee: as soon as I arrive at my desk, I check my calendar to see how my day is looking, and I check my emails and Slack messages. We use Slack primarily within the company for communication. And then I do my daily standup with my teams. We follow a two-week sprint across all departments that I oversee. So, a standup gives me a good picture of the current priorities and any blockers.

Emily: Tell me a little bit about the cloud-native journey at Frame.io. How did the company get started with containers, and what are you using to orchestrate now? How have you moved along in the cloud-native journey?

Abhinav: We are a born-in-the-cloud kind of company. We have been hosted on AWS since day one, so we have been in the cloud from the get-go. And once you are in the cloud, it is hard not to use the tools and technologies that are offered, because our goal has always been to build secure, reliable, and available infrastructure. So, we were very, very mindful from the get-go that while we are in the cloud, we can choose to be cloud-native or just cloud-enabled. Cloud-enabled means just using virtual machines, heavyweight virtual machines, not using containers, and hosting our entire workload within those machines.

But we chose to be cloud-native because, again, we wanted to boot up or spin up new containers very fast. As a platform, as I mentioned, we allow users to upload videos, and once the videos are uploaded, we have to transcode them to generate different lower-resolution versions. That use case fits the lightweight container model. So, from the get-go, we started using containerized microservices, an orchestration layer, auto-scaling from AWS, automation, infrastructure as code, and monitoring. All those things were kind of a no-brainer for us, given our use case and given that we wanted to be a very fast uploader and transcoder for all of our customers.

Emily: This actually leads me to another question: have you guys seen a lot of scaling recently as a result of stay-at-home orders and work from home?

Abhinav: Right. So, we are seeing a lot more people who work in production houses moving towards remote collaboration tools, since they have to work from home now. So, they are now moving to these kinds of tools, such as Frame.io, and we do see a lot more customers joining our platform because of that. From the traffic perspective, we did not see much of an increase in web traffic or load on our infrastructure, because we have always had auto-scaling set up and our infrastructure can always meet these peak demands. So, we didn't see any adverse effect on our infrastructure from these remote situations.

Emily: What were some of the other advantages? Like you were talking about that you had the choice to be either cloud-enabled or truly cloud-native? What were the biggest, you know—and I'm interested, obviously in business rationale to the extent you can talk about it—for being truly cloud-native?

Abhinav: So, from a business perspective, again, the goal was to build a secure, available, and reliable production infrastructure to offer Frame.io services. But cloud-native actually helped us get a faster time to market, because our developers are just focusing on the business logic and deploying code. They are not worried about the infrastructure aspect, which is good. Then we’re rolling out bug fixes very quickly through our CI/CD platform, so that, again, we offer better services to our customers.

Cloud-native helped us meet our SLA and uptime so that our customers can access their content whenever they would like to. It also helped us secure our infrastructure and services, and our cost also went down because we scale up and down based on peak demand, and we don't have to provide dedicated resources, so that's good there. It also allowed us to onboard developers to our platform faster because we are using a lot of open source technologies, so the developers can learn quickly, and there are a lot more resources out there for them to learn from. And it also helped us avoid vendor lock-in. We are relying on more and more open-source projects, CNCF [unintelligible] projects, so that has helped us. And more importantly, it is helping us stay competitive because in this industry, at this time, we would like to be available and we would like to be secure. So our customers can keep doing the jobs that they used to do in an office setting or a non-remote setting, and we can continue providing the help that they need.

Emily: How has this changed the security story?

Abhinav: So, obviously, the security story is the same as what we had before because, I mean, we allow people to upload their media content to our platform. That's very sensitive content, so we always wanted to make sure that it stays secure. And for that, we have built a world-class security program from the ground up, with emphasis on product security, cloud security, security data science, and also a compliance and privacy program. So, we are doing what we used to do: making sure that content is still secure, that our infrastructure follows the AWS security best practices, and that we can identify vulnerabilities within our application and fix them. So, again, as I said, it hasn't changed much from a security perspective, as far as Frame.io’s daily operations are concerned.

Emily: How is having a truly cloud-native application different, from a security perspective, from something that isn't cloud-native?

Abhinav: So, security is very important whether you are cloud-enabled or cloud-native; it is very important for all services. Being in the world of microservices and containers actually helped us model application behavior. For example, if you have one very big monolithic application, it does so many things that it's really hard to find out what the normal execution pattern is, or, if it is attacked, how it is going to behave and what abnormal execution looks like. But in the microservices world, each microservice does one job, so you can create a good model of the behavior of that container.

Or, if you are monitoring their runtime behavior, you know what kinds of processes are going to be invoked from that container, what kinds of network connections are going to be made, and what files are going to be accessed by the services within the host, within S3, or in other resources. So, you know their interaction pattern and execution pattern, and you can quantify that, either in the security rules that you create on the infrastructure for those services, or in better anomaly detection or machine learning models for that behavior. And we did both in our infrastructure to keep them secure.
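As a rough illustration of the rule-based half of this idea, here is a minimal sketch in Python. It assumes a hypothetical transcoding container whose expected processes, network destinations, and file paths are known in advance. The names ServiceProfile, Event, and is_anomalous, and every value in TRANSCODER, are invented for this example and do not describe Frame.io's actual setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProfile:
    """Expected behavior of one single-purpose container (hypothetical values)."""
    name: str
    allowed_processes: frozenset
    allowed_destinations: frozenset  # "host:port" strings
    allowed_path_prefixes: tuple     # file paths the service may touch


@dataclass(frozen=True)
class Event:
    kind: str   # "process", "network", or "file"
    value: str


def is_anomalous(profile: ServiceProfile, event: Event) -> bool:
    """Return True when an observed event falls outside the profile."""
    if event.kind == "process":
        return event.value not in profile.allowed_processes
    if event.kind == "network":
        return event.value not in profile.allowed_destinations
    if event.kind == "file":
        # str.startswith accepts a tuple of prefixes
        return not event.value.startswith(profile.allowed_path_prefixes)
    return True  # unknown event kinds are treated as suspicious


# Illustrative profile for a container that only transcodes video.
TRANSCODER = ServiceProfile(
    name="transcoder",
    allowed_processes=frozenset({"ffmpeg"}),
    allowed_destinations=frozenset({"s3.amazonaws.com:443"}),
    allowed_path_prefixes=("/tmp/transcode/",),
)
```

Because the container does one job, the allowlist stays short. A shell spawned inside it, or a connection to an unlisted host, stands out immediately, which is much harder to achieve for a monolith that legitimately does many different things.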

Emily: And how do conversations about security go when you talk with different stakeholders? I'm curious to know if there are any miscommunications, or things that are lost in translation, when you're talking about security with, say, the development team, the business stakeholders, or platform engineers. Is there anything that gets lost in translation?

Abhinav: So, there are two parts to this question. The first is the general discussion around cloud-native services and their security. There are various ways you can deploy a service in the cloud: you can just run a bunch of VMs, or you can deploy it using a cloud-native architecture. But the cloud-native architecture requires you to think about all the stages of the service. For example, what will the SLAs, SLOs, and SLIs look like for this service? How do you monitor the service when it executes? How will you protect these services when you deploy them? What kinds of resources are going to be accessed by this service? How will you create identity and access management rules for it? How would you deploy it, and how would you create network rules so that these services execute under the principle of least privilege?

So, you need to plan properly how a new service is going to interact with other services in the infrastructure. And these non-functional requirements are, many times, described poorly or not written down at all, because as a developer, you would like to create the service and deploy it so that customers can use it. These are the things behind the scenes that we have to think about. And we, as a team, are working very actively to bridge this knowledge and semantic gap so that these things don't get lost in translation when you're thinking about the service.

Emily: What about when you talk to, say, business stakeholders? Is there anything that gets lost in translation?

Abhinav: So, I mean, in the business sense, we always have to keep the discussion at a very high level: What is the use of the service? Where should we deploy it? Who are going to be the users? At that time, we don't want to talk about the underlying infrastructure-related issues, because at the business level we would like to know how the service is going to function, so it is mostly functional requirements. But at the low level, when we are about to design these services, we would like to think about the things we have to worry about in order for that service to deploy securely and reliably.

Emily: How important is security to Frame.io? Not every company thinks the same about security, I should say.

Abhinav: And that's a great question. I think for us, security is very important. I know every company says that, but I think we truly mean it. So, we are close to 150 employees, but I was hired around when I was a [unintelligible] employee as the head of security. That shows that we care about security. And I have been building security from the ground up. We got our SOC 2 Type II compliance when we were around 70 employees, and there are companies out there who are doing SOC 2 when they have a thousand employees. We are GDPR compliant, we are working towards our CCPA compliance, and we are TPN compliant as well. TPN stands for Trusted Partner Network, which is the [same world] media, and entertainment companies, and industry users. And we were among the first few companies who got that certification, also. So, we care about security very much because we allow users to upload their content to our cloud and we make sure that the content remains secure.

Emily: And so, is there any tension that you feel between talking about security or making things as secure as possible, and either business stakeholders or other parts of the IT team?

Abhinav: So, there is definitely tension. [laughs]. If I said no, I would be lying, because engineers, developers, and service creators want to deploy the service. They get satisfaction once customers start using those services. And our job is to put some guardrails or barriers in place so that we can vet the application, we can vet the service, we can do the proper testing, and we can make sure that by deploying the service, we don't increase our exploitable surface.

So, that kind of tension will always be there because, by nature, security's job is to make sure that whatever is deployed is secure and that our infrastructure is secure, and the service owner’s job is to deploy the service. But I think what we are trying to do in the organization is take a risk-based approach, because security is just another business function. The way sales is important, the way engineering is important, security is important in the same way. And just as there is a risk of not meeting sales targets, there is a risk of getting breached.

So, how do we provide a risk-based methodology so that when we talk about security, we talk in terms of risk, in terms of probabilities versus possibilities? Because there is always a possibility of something going wrong, but what's the probability of it happening? And that basically gives us a way of talking to other business stakeholders and saying, “Okay, if you deploy the service, the risk looks high. But it looks high because the likelihood of getting breached is high, while the impact would be very low. So, since risk is the product of impact and likelihood, overall the risk is low.” And sometimes the chance of getting attacked is very low, but the impact could be very high. Again, the risk will be low because the probability of that event actually happening is low.

So, that basically gives us a common language we can use to talk to other business stakeholders, because risk is used as a language across other departments. We try to use the same language to convey cybersecurity risk as well.
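The "risk is the product of impact and likelihood" rule mentioned here can be written down in a few lines. The following is only a sketch: the Finding class, the rank_by_risk helper, and all of the numbers are illustrative assumptions, not a scoring scheme described in the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    name: str
    likelihood: float  # probability of the event, between 0 and 1 (illustrative)
    impact: float      # cost if it happens, in arbitrary units (illustrative)

    def __post_init__(self):
        if not 0.0 <= self.likelihood <= 1.0:
            raise ValueError("likelihood must be between 0 and 1")
        if self.impact < 0:
            raise ValueError("impact must be non-negative")

    @property
    def risk(self) -> float:
        # Risk as the product of likelihood and impact.
        return self.likelihood * self.impact


def rank_by_risk(findings):
    """Order findings from highest to lowest risk."""
    return sorted(findings, key=lambda f: f.risk, reverse=True)


# Made-up examples of the cases discussed above.
FINDINGS = [
    Finding("likely but minor", likelihood=0.8, impact=1.0),
    Finding("rare but severe", likelihood=0.01, impact=100.0),
    Finding("likely and severe", likelihood=0.8, impact=100.0),
]
```

With these made-up numbers, the first two findings both come out with a low score, while the third dominates the ranking. That is the kind of comparison that lets a security team argue priorities in the same terms other departments already use.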

Emily: Since starting with Frame.io and building this security program from the ground up, what surprises have you encountered?

Abhinav: I would say there were many surprises. First of all, I had those surprises because I come from a research and development background. There, the goal was to develop services, to think about new security products, and to think of attacks and come up with defenses for them. Having the responsibility of building the security program from the ground up, and having to adjust to this risk-based mentality, was a big surprise, because it's not the case that just because there is a bug, engineering is going to fix it. You have to show the impact of that bug. You should have a proper [unintelligible] associated with that. You have to show the ways an attack on that bug can be launched. So, just because you care about security doesn't mean that everybody else cares about security. You have to keep the communication going. You have to always be talking, you have to always be adjusting, and you have to use the right language for the person you are talking to.

Emily: What tips do you have about adjusting your language for different audiences and getting them to understand what you're talking about?

Abhinav: So, one thing is to use a risk-based methodology. Rather than just saying, “Oh, we have a bug, or we have a high priority bug,” I think it is better to say, “What is the impact of that bug? How would that bug be exploited in a real setting?” I think those things are important because people care about security, but they have a hundred other things to do as well. So, how do you speak their language?

And also building the right team. If you want to target product security, you have to have a product security specialist who can understand these nuances and who can understand what the different attacks are. Some companies build a security team with many generalists. I took an approach where I'm building a team of specialists.

So, for product security, I have two core product security engineers who have done this many times before. For cloud security, I have a specialist who knows about AWS Cloud and everything. For security data science, I have a machine learning expert. So, for each of the roles that you have in mind, you try to fill the position with the right set of people.

And coming back to cloud-native security: I think one thing is very important in the cloud-native world, as I have realized lately, and that is that infrastructure as code is a very important piece for securing your cloud. It's not that I or the team don’t know about it, but the temptation to do things quickly sometimes leads to resorting to manual work instead of writing your Terraform or CloudFormation. So, you can do things quickly, but then the chances of you making an error are also high. Whereas if you go to Terraform, you can follow the regular CI/CD process, you can have your pull request approved by somebody, and the chances of finding an error quickly are high.

And for security purposes, infrastructure as code is a blessing, because you can put proper guardrails in place to make sure that nobody does manual operations in the infrastructure and everything goes through a proper approval process. As a head of security, if you know that whenever somebody wants to do anything or open any port in the infrastructure, two people are going to look at it, have a dialogue with each other, and find out the real need for opening that port, your life will be a lot simpler.
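To make the port-opening example concrete, here is a minimal sketch of an automated guardrail that could run alongside human review in a CI/CD pipeline. It assumes ingress rules have already been extracted from the infrastructure code into plain dictionaries with cidr, from_port, and to_port keys. That shape, the world_open_ports name, and the approved-port set are hypothetical, not the format of Terraform or CloudFormation output.

```python
import ipaddress


def world_open_ports(rules, approved_ports=frozenset({443})):
    """Return the sorted ports exposed to every address that are not pre-approved.

    `rules` is a list of dicts with "cidr", "from_port", and "to_port" keys
    (a hypothetical, simplified shape chosen for this example).
    """
    exposed = set()
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"])
        # A prefix length of 0 means the rule matches every address
        # (0.0.0.0/0 for IPv4, ::/0 for IPv6).
        if network.prefixlen != 0:
            continue
        for port in range(rule["from_port"], rule["to_port"] + 1):
            if port not in approved_ports:
                exposed.add(port)
    return sorted(exposed)


# Illustrative rules as they might appear in a pull request.
PROPOSED_RULES = [
    {"cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443},
    {"cidr": "0.0.0.0/0", "from_port": 22, "to_port": 22},
    {"cidr": "10.0.0.0/16", "from_port": 5432, "to_port": 5432},
]
```

A check like this does not replace the two-person dialogue about why a port is needed. It only makes sure the question gets asked every time, which is possible because the change arrives as reviewable code rather than as a manual console action.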

Emily: What do you think are some misconceptions about cloud-native security, both inside the engineering department—so developers, for example—and then outside in the rest of the company?

Abhinav: I think the misconception that I see, and it's my opinion, is that the only thing that is important is deploying fast, or moving to production very fast. I think there are so many things that have to be done behind the scenes in order for you to move fast. And if you don't do those things, then either you're going to break your application, or you're going to make your infrastructure insecure. So, for example, say you have a CI/CD setup and you want to deploy some business logic, and you think, “Oh, I can code that thing in AWS Lambda functions.” AWS Lambda is a completely managed service. You went ahead and coded it in Python, and your service is up and running. But in doing that quickly, you forgot to follow the best practices: the Lambda function has to be within the VPC, you need to generate an IAM role that has restricted permissions, and you have to make sure that proper security groups are attached to the Lambda function so that it is not open to the world. Those things are part of the misconception that, “Oh, if I have to do something, AWS allows it, so we can do it quickly.” So this is what we are trying to do: as a team, we are coming up with a set of best practices for each of those resources, writing documents, and sharing them with engineering, saying, “Okay, you want to do it? Sure, go ahead, do it, but just follow these best practices.” So, whether you use SAM or Terraform or whatever you want to deploy your application, make sure that best practices are always followed.
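The three Lambda practices listed here (run inside a VPC, attach security groups, restrict the IAM role) lend themselves to an automated check. The sketch below is an assumption-laden illustration: the function configuration is a plain dictionary loosely modeled on the VpcConfig block that AWS returns for a function, the policy is a standard IAM policy document, and lambda_findings and the wildcard heuristic are invented for this example.

```python
def _as_list(value):
    """IAM allows a single item or a list; normalize to a list."""
    if isinstance(value, (str, dict)):
        return [value]
    return list(value)


def lambda_findings(function_config, policy_document):
    """Return human-readable findings for one function (empty list means clean)."""
    findings = []

    vpc = function_config.get("VpcConfig") or {}
    if not vpc.get("SubnetIds"):
        findings.append("function is not attached to a VPC")
    if not vpc.get("SecurityGroupIds"):
        findings.append("function has no security groups attached")

    for statement in _as_list(policy_document.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        # Heuristic (an assumption of this sketch): "*" or "service:*"
        # is treated as broader than a single-purpose function needs.
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append("IAM role allows wildcard actions")
            break

    return findings


# Illustrative inputs: a quickly deployed function and a hardened one.
QUICK = ({}, {"Statement": {"Effect": "Allow", "Action": "*", "Resource": "*"}})
HARDENED = (
    {"VpcConfig": {"SubnetIds": ["subnet-a"], "SecurityGroupIds": ["sg-a"]}},
    {"Statement": [{"Effect": "Allow", "Action": ["s3:GetObject"],
                    "Resource": "arn:aws:s3:::example-bucket/*"}]},
)
```

Running such a check in the deployment pipeline is one way to let developers keep moving quickly with SAM, Terraform, or any other tool while the written best practices are still enforced.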

Emily: Can you think of any misconceptions about cloud-native security that, say, somebody might have if they're coming from a legacy environment, managing security but in a very different type of environment?

Abhinav: So, I mean, cloud-native security is all about making sure that your microservices are secure: the kind of access pattern they have, the kind of network pattern they have. So, I think one misconception comes up if you are coming from a monolithic world, where you have locked down your services just by assuming that you have a perimeter between the outside world and the inside world, so your firewall rules are just about what goes in and out. But that perimeter is blurred now. There is no such thing as “them versus us.” It's all blurred now.

So, in the microservices world, instead of only thinking about North/South traffic going up and down, you have to think about East/West traffic as well. That means making sure that your service-to-service communication is secure too: make sure you use proper cryptography, and make sure your endpoints are authenticated so that your services are not compromised. Because if one service is compromised and you don't use proper controls among those services, then your other services can be compromised very quickly. And that's the problem when we go from a monolithic application to microservices.

Emily: Do you think that people outside of the security team understand that distinction?

Abhinav: I would say they do, to the extent that they know about it, but when we have to actually implement it, there are always some concerns that it is going to slow down our application or introduce latency. So, people do understand that, okay, the perimeter is going away, but again, when we start implementing it, there is always concern about how it's going to play out.

Emily: Do you think Frame.io is fully cloud-native? Do you think there's anything that you could do to be more, quote-unquote, “cloud native”?

Abhinav: So, in my opinion, it is a journey without any destination. Just like security, you can never say, “I’m secure.” You will have to adjust your controls based on the threats or attacks going on. In the same way, there is no end to the transition to cloud-native because new technologies are coming, and we will have to evaluate new tools that can help us realize our business goals effectively. So, we are cloud-native, but still, we can do a lot more things, given time and resources.

So, some concrete work that we are doing right now is creating more tools for developers to perform tasks themselves, so creating more of a self-serve culture, and, as I said, moving towards more of an infrastructure as code (IaC) model, and so on. And for that, we are setting up guardrails so that they can perform those operations within those boundaries without impacting security and reliability. We are also looking into ways to extend Kubernetes, because Kubernetes is in itself a full cloud platform with a lot of possibilities. So, we are interested in making it more programmable for our environment. But these are ongoing things that we'll have to continue doing.

Emily: Do you have any other next steps that you could share? What's next in your journey?

Abhinav: So, we rolled out Kubernetes in our infrastructure last December, and that move paid off for us. So, we are building more tools on Kubernetes. As I said, we are going towards a more self-service style of architecture where developers can do a lot more things within those guardrails, and we are also looking into ways to introduce chaos engineering in our environment because we do things fast, but we break things fast as well. [laughs]. One small configuration error can create a severity zero alert. So, what we need is good chaos engineering practices to simulate these errors, so that everybody can train on these events and know how to prevent and respond to such problems. That will reduce our incident resolution time as well.

Emily: When—sort of last question: anything else that you would like to add?

Abhinav: Two things, I think. One thing is we all should be going towards IaC and GitOps: infrastructure as code and GitOps. If there is one takeaway from this podcast, it is that that's the way to go. I know manually doing work is tempting, but that creates problems down the road. So, life will be a lot simpler if we go with IaC and GitOps.

Second thing is that I feel this pain, and many other people feel the same way: there are too many tools and vendors out there, so it's really hard to choose what is going to work in your environment. CNCF is helping us by highlighting some of these projects and assigning proper maturity levels, like sandbox, incubating, and graduated, but it is still very challenging to find the right tooling that fits your infrastructure. So, when you choose a new technology, always check how it's going to work with your existing technologies, because it's not that easy to throw away an existing thing. Every tool that you try also complicates your security, because you just do not know how it's going to play out when you deploy this new technology in an environment where your other tools and services are running. So, I think we have to evaluate all tools carefully to make sure that we understand their security and reliability impact on our existing infrastructure.

Emily: What is your can't live without engineering tool or security tool?

Abhinav: Huh, that's a good question. Right now, one tool that I cannot live without is Falco. That is a runtime container monitoring solution. We invested a lot in it, and it is paying off in terms of the kind of alerts it is generating and the kind of visibility it is providing into our infrastructure. And one tool I can't live without, from both a security and an infrastructure perspective, is Slack, because we have done a lot of automation to bring all these alerts into Slack. So, all of our ops happen via Slack. So, I think these are the two tools I’m relying on a lot in terms of visibility and in terms of response.
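The automation itself is not described in the conversation, so the following is only a guess at what one small piece of such a pipeline could look like: turning one JSON alert line from Falco into the message body for a Slack incoming webhook. It assumes the alert carries rule, priority, and output fields and that the webhook accepts a JSON object with a text field. The function name and the sample alert are invented, and actually posting the payload to a webhook URL is left out.

```python
import json


def falco_to_slack_payload(falco_event_json: str) -> dict:
    """Build a Slack message body from one Falco JSON alert line.

    Only the "rule", "priority", and "output" fields are read; missing
    fields fall back to placeholder text instead of raising an error.
    """
    event = json.loads(falco_event_json)
    priority = event.get("priority", "Unknown")
    rule = event.get("rule", "unknown rule")
    output = event.get("output", "")
    return {"text": f"[{priority}] {rule}: {output}"}


# Made-up alert used purely for illustration.
SAMPLE_ALERT = json.dumps({
    "rule": "Terminal shell in container",
    "priority": "Warning",
    "output": "A shell was spawned in container transcoder",
})
```

Keeping the formatting step separate from the network call makes it easy to test and to reuse if alerts later need to go somewhere other than Slack.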

Emily: Well, thank you so much for joining me.

Announcer: Thank you for listening to The Business of Cloud Native podcast. Keep up with the latest on the podcast at thebusinessofcloudnative.com and subscribe on iTunes, Spotify, Google Podcasts, or wherever fine podcasts are distributed. We'll see you next time.

This has been a HumblePod production. Stay humble.
