Interview with Yunong Xiao: How Is Netflix Exploring and Adopting FaaS?

| Author: Yunong Xiao. Published June 21, 2018. Estimated reading time: 42 minutes |

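Serverless architecture came to wide attention in 2014. At the time, the industry generally believed that going serverless could substantially lower IT costs, cutting cloud spending by 10% to 90%, while also making service deployment more efficient.

After several years of maturation, some companies are now running serverless in practice, with clear results. For the ArchSummit Global Architect Summit in Shenzhen on July 6, we invited Yunong Xiao, principal software engineer at Netflix, to share how Netflix has explored FaaS. We hope engineers will find it useful.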

InfoQ: Could you tell us about your current work and responsibilities?

Xiao: I currently lead the FaaS and API platform at Netflix. The Netflix API is a tier-1 service through which every request from every Netflix client flows. It allows us to integrate the hundreds of microservices on the backend into one coherent service for clients to access. We're building a FaaS platform to enable engineers to quickly develop, test, and operate these API services -- which are generally bespoke to each device.

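InfoQ: What improvements has adopting serverless brought to Netflix? In your view, which business scenarios suit a serverless architecture, and which do not? (How much cost can the serverless model save Netflix?)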

Xiao: At Netflix, we design our product with innovation in mind. What this means is that we're constantly A/B testing our product and launching many new features each week. In order to enable this kind of velocity, we require an API services platform that lets client engineers rapidly deploy changes to their services to production. FaaS achieves this by abstracting away all of the platform components usually associated with a service, leaving just the business logic itself -- allowing engineers to focus on developing great new features instead of writing boilerplate code.

Additionally, operating services at more than four 9s of availability is difficult -- even for seasoned server engineers. Thus a serverless model in which we centralize operations lets us provide a platform where even engineers without server and operational experience can develop highly available services.
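To put "four 9s" in perspective, the short calculation below converts an availability target into a yearly downtime budget. It is plain arithmetic rather than anything from Netflix's tooling; the function name `downtime_minutes_per_year` is made up for this illustration, and the year is assumed to have 365 days.

```python
# Yearly downtime budget implied by a given number of "nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Return the minutes of downtime per year allowed at `nines` nines."""
    unavailability = 10 ** (-nines)  # e.g. four nines -> 0.0001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

At four 9s the budget is roughly 52.56 minutes per year, which leaves very little room for a slow manual response to an incident.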

InfoQ: Could you describe the architecture of the API platform in more detail? How does the API platform currently implement serverless?

Xiao: At a very high level, the API platform consists of a FaaS platform which allows engineers to deploy functions with custom business logic as highly available production services.
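The split between "function" and "platform" described here can be pictured with a small Python sketch. It is only an illustration of the idea, not Netflix's platform: `home_screen`, `serve`, and the `metrics` dictionary are hypothetical names, and a real platform would also handle routing, RPC, service discovery, and much more.

```python
import time
from typing import Any, Callable, Dict

Request = Dict[str, Any]
Response = Dict[str, Any]


def home_screen(request: Request) -> Response:
    """Business logic only: the part an engineer would write."""
    return {"status": 200, "body": f"rows for {request['device']}"}


def serve(function: Callable[[Request], Response], request: Request,
          metrics: Dict[str, float]) -> Response:
    """Hypothetical platform wrapper: timing, counting, and error handling
    live here rather than in the function."""
    start = time.perf_counter()
    try:
        return function(request)
    except Exception:
        # The platform, not the function author, turns failures into responses.
        metrics["errors"] = metrics.get("errors", 0) + 1
        return {"status": 500, "body": "internal error"}
    finally:
        metrics["calls"] = metrics.get("calls", 0) + 1
        metrics["last_latency_s"] = time.perf_counter() - start


metrics: Dict[str, float] = {}
print(serve(home_screen, {"device": "tv"}, metrics))
print(serve(home_screen, {}, metrics))  # missing key: handled by the wrapper
print(metrics["calls"], metrics["errors"])
```

The function stays free of operational code, which is the property the answer attributes to the FaaS model.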

InfoQ: Is serverless architecture the ultimate form of microservices? What will your team focus on optimizing next?

Xiao: There are tradeoffs to consider with serverless. By adopting the FaaS model, you are essentially trading customization for velocity and perhaps availability. There are some applications where FaaS for services works really well -- as is the case for the Netflix API, where we run relatively uniform microservices that only need to access and mutate data from downstream services. However, if a service requires customization, such as needing to change various parts of the service platform (e.g. RPC, data access, caching, authentication), then the FaaS model may not provide enough flexibility for such services.

Our focus currently is to finish migrating the legacy API services over to the new stack. After that our focus could include many areas such as performance -- both to reduce cost and improve customer experience -- and other areas such as infrastructure and platform improvements.

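InfoQ: Using concrete examples, what kinds of function dependencies are reasonable in serverless? From a business-logic standpoint, how do you decide which critical paths need alerting and which are allowed to fail? (How do you prevent a mistake from consuming large amounts of resources and running up large costs?)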

Xiao: Functions are deployed as isolated services -- which means we're not deploying functions from different services on the same instance. This is really important for us, as we wouldn't want one misbehaving service to take down all of Netflix. This isolation helps us prevent large scale outages across all of Netflix. We also integrate against our internal metrics, alerting, and monitoring systems, which gives us visibility into the health of each service. The service platform contains modern load-shedding technologies such as concurrency limits and circuit breaking -- these generally help prevent large scale outages. We've also invested heavily in runtime debugging, profiling, and sampling, which provides the observability we need to operate many services at scale. There are many other components in the platform that help us run reliably. Come to the talk, "Going FaaSter: Function as a Service at Netflix", to find out more!
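For readers unfamiliar with the two load-shedding techniques named above, here is a minimal Python sketch of each. It shows the general patterns only and makes no claim about how Netflix implements them; `ConcurrencyLimiter`, `CircuitBreaker`, `Rejected`, and the thresholds are all hypothetical, and the breaker is kept single-threaded for brevity.

```python
import threading
import time


class Rejected(Exception):
    """Raised when a request is shed instead of being executed."""


class ConcurrencyLimiter:
    """Reject new work once `limit` calls are already in flight."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def call(self, fn, *args):
        if not self._slots.acquire(blocking=False):
            raise Rejected("concurrency limit reached")
        try:
            return fn(*args)
        finally:
            self._slots.release()


class CircuitBreaker:
    """Open after `max_failures` consecutive failures, then allow a trial
    call once `reset_after` seconds have passed. Not thread-safe."""

    def __init__(self, max_failures: int, reset_after: float) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise Rejected("circuit open")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result


def flaky():
    raise RuntimeError("downstream unavailable")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for _ in range(3):
    try:
        breaker.call(flaky)
    except Rejected as exc:
        print("shed:", exc)
    except RuntimeError as exc:
        print("failed:", exc)
```

The first two calls reach the failing downstream; the third is rejected immediately, so a struggling dependency stops receiving traffic instead of tying up callers.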

In terms of dependencies we allow users to import third party libraries at will -- but of course this means engineers need to exercise judgement with respect to things like security and performance.

InfoQ: How should one decide between a public cloud FaaS service and a self-built FaaS on a private cloud?

Xiao: This comes down to the classic build vs buy question. I think one should be pragmatic when faced with this decision. When we were first designing our FaaS platform, we considered public options such as Lambda and App Engine. We would be happy to use off-the-shelf solutions if they fit our use case.

As it turns out, we needed a platform that integrated with the existing Netflix service platform components such as metrics, alerts, service discovery, and many others, and integrating these with high-level FaaS platforms would have been difficult.

Additionally, we needed full visibility into the services using the FaaS platform. Building it ourselves meant that we have full control all the way down to the operating system -- and we can give operators (ourselves) the tools and visibility to debug the services and platform.

Obviously there's a huge amount of effort, time, and cost that went into building our own FaaS platform -- so we don't make these decisions lightly. However, at the time we couldn't find an open source or public FaaS option that satisfied our requirements.

This doesn't mean others should follow in our footsteps. If there is an open source or public FaaS option that suits your requirements, then absolutely go and use it. Opportunity cost is also an important metric. Technology is just a means to an end -- and people should absolutely use the best tool for the job -- often this means buying and not building.

InfoQ: What advice do you have for combining CI/CD with FaaS?

Xiao: Providing a robust first-class testing framework is important. We designed our FaaS platform with testing in mind. As a result, we created a testing framework with features such as first-class mocks and tight integration with the developer tooling, to make it very easy for engineers to write unit, integration, and end-to-end tests using the FaaS platform.

One of the main advantages of our test framework is that it allows engineers to test their functions in isolation, either locally or on Jenkins -- without having to deploy code to the cloud. This ease of use incentivizes our customers to write tests -- which helps us improve the reliability of the service.
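Testing a function in isolation, with its downstream dependencies mocked, might look like the following. This sketch uses Python's standard `unittest` and `unittest.mock` rather than Netflix's internal test framework, which is not described in the interview; the function `get_profile_summary` and the `clients.profiles` downstream client are hypothetical.

```python
import unittest
from unittest import mock


def get_profile_summary(request, clients):
    """Hypothetical function: business logic that calls one downstream service."""
    profile = clients.profiles.get(request["profile_id"])
    return {"status": 200, "body": {"name": profile["name"].title()}}


class GetProfileSummaryTest(unittest.TestCase):
    def test_formats_name_from_downstream(self):
        # The downstream client is replaced by a mock, so nothing is deployed
        # and no network call is made.
        clients = mock.Mock()
        clients.profiles.get.return_value = {"name": "ada lovelace"}

        response = get_profile_summary({"profile_id": "p1"}, clients)

        self.assertEqual(response["body"], {"name": "Ada Lovelace"})
        clients.profiles.get.assert_called_once_with("p1")


# Run the test case in-process, as a local run or a CI job could.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GetProfileSummaryTest)
result = unittest.TextTestRunner().run(suite)
```

Because the test needs only the function and a mock, the same file runs unchanged on a laptop and on a CI server.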

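InfoQ: The industry is still far from adopting serverless across the board, and there is no unified standard for building it. How do you make sure you are heading in the right direction? Can you share lessons learned over the years?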

Xiao: Today most serverless solutions are geared towards batch and event-driven tasks which are not latency sensitive. However, we believe serverless should also be considered for production services, since it reduces operational and code complexity by abstracting away the platform and infrastructure.

For us, there was a clear need within the Netflix API organization for a FaaS model which supported service style workloads. We believe, through conversations with other companies, that there is an appetite for service style FaaS platforms. For most teams, services are a means to an end -- they are not opinionated about, and do not care, how the service is implemented, only that it performs the business logic they need reliably and with good developer ergonomics.

I think FaaS is a natural evolution. Many years ago most services used bespoke software up and down the entire stack, running inside data centers owned by each company. We're moving towards a model today where we're commoditizing the components further and further up the stack. We started with the commoditizing of hardware and data centers with IaaS (think AWS EC2), and then moved towards commoditizing some parts of the platform with PaaS (think Heroku, or Google Cloud Platform). The natural evolution of this is toward FaaS, where everything is provided by the platform except for the business logic, which is the function itself.

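InfoQ: With the rise of containers and Kubernetes, many serverless frameworks are now built on these two technologies, such as Fn, Kubeless, OpenFaaS, and IronFunctions. How do you see the opportunities that container technology, and Kubernetes in particular, brings to serverless architecture?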

Xiao: One of the reasons we see so many FaaS platforms built on top of K8s is that K8s abstracts away the infrastructure and platform required for building scalable and reliable services on top of containers. This is powerful, as it means that FaaS frameworks can focus on the function runtime.

This space will continue to evolve and I hope to see additional FaaS frameworks emerge -- especially ones that can fulfill the need for service style workloads at scale (think rich metrics, autoscaling, performance optimizations). I believe K8s will evolve in terms of its ability to run at larger scales -- this would make it an even better fit for use cases exceeding 5000 physical nodes.

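InfoQ: When re-architecting an entire system, do you think it should be replaced incrementally or rewritten completely? How can engineers avoid falling into the trap of the next cool technology?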

Xiao: Engineers should be pragmatic and look to make incremental changes to the architecture. Changing everything at once significantly increases the complexity, risk, and timeline of the project. Making incremental changes means we can shorten the feedback loop, realize gains more quickly for the business, and reduce the risk by changing only a few components at a time.

We should balance the tradeoffs of each decision, seek broad alignment within the company, and mine for dissent. Be judicious when it comes to adopting new technology -- ask yourself the question, "why are you picking this technology?" If you can't answer it in a way that satisfies your team or organization -- then you should think twice. Think about the implications of adopting new technologies. Does it have a broad user and support base? Does it provide a good set of tooling to operate and debug? What about documentation? How about the maintenance cycle? What is the impact on the organization as a whole of adopting a new technology -- will platform teams now need to support this new technology across the entire organization?

For example, we adopted containers for the FaaS platform, for very specific reasons. It allowed us to enable engineers to run their services everywhere, and gave us immutable build artifacts. This decision didn't just impact our team -- it required us to create a new team at Netflix which was tasked with building a container orchestration system. The decision to use new technology can often have rippling and unforeseen consequences up and down the entire company.

InfoQ: What do engineers care about most when developing FaaS services?

Xiao: For the development experience, we focused on the ergonomics of our FaaS platform. This was the biggest piece of feedback from engineers using the FaaS platform. As a result we focused on building developer tooling that allows engineers to develop and debug their functions locally on their dev machines -- including the ability to tail logs and attach debuggers.

InfoQ: In an era when more and more core functionality is deployed to the cloud, where do you think engineers should focus their effort?

Xiao: Engineers should focus on the things that matter to their teams -- for most this no longer means the infrastructure or service platform. For our engineers who use the FaaS platform, this allows them to focus on product innovation -- improving the Netflix experience for our more than 125 million members.

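About the speaker:

Yunong Xiao is a principal software engineer at Netflix in Los Gatos, California, USA, where he leads the design and architecture team for the Netflix API platform. Before that he worked at AWS and Joyent, focusing on distributed systems, and helped plan and build several cloud computing products, such as AWS IAM and Manta. He also maintains restify, an open source Node.js framework. Yunong holds an honours degree in Computer Engineering from the University of Waterloo.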
