How the scheduler in Google's Borg clusters works

Hello, amazing people

A few weeks back we learned that Google uses a system called Borg to run its cluster infrastructure, and thereby the backbone of our modern internet. This week's paper is about the Omega scheduler, which is quite likely used in one form or another in the Borg clusters. (The paper is from 2013, so there has surely been some evolution since then, but the base concepts still apply.) The most interesting part is how it handles state: in contrast to other scheduler architectures, it works with shared state and lock-free optimistic concurrency control. (Before that, Google's clusters were managed by one monolithic scheduler.)

The paper compares different architectures against Omega, giving a nice overview of the cluster scheduler landscape out there.
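To make the shared-state, optimistic idea a bit more concrete, here is a minimal Python sketch. It is not Omega's actual implementation: the names CellState, Machine, try_commit and schedule are made up for illustration, and the per-machine version counter is just one simple way to detect conflicts. Each scheduler plans against a private snapshot of the shared state and then tries to commit its claim; if another scheduler changed the machine in the meantime, the commit is rejected and the scheduler retries with a fresh snapshot.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Machine:
    free_cpus: int
    version: int = 0  # assumption: bumped on every successful commit


@dataclass
class CellState:
    """Hypothetical shared cluster state that every scheduler can read."""
    machines: Dict[str, Machine] = field(default_factory=dict)

    def snapshot(self) -> Dict[str, Machine]:
        # Every scheduler plans against its own private copy.
        return {n: Machine(m.free_cpus, m.version) for n, m in self.machines.items()}

    def try_commit(self, name: str, seen_version: int, cpus: int) -> bool:
        """Apply a claim only if the machine is unchanged since the snapshot.

        In a real concurrent system this check-and-update would have to be
        atomic; this single-threaded sketch gets that for free.
        """
        machine = self.machines[name]
        if machine.version != seen_version or machine.free_cpus < cpus:
            return False  # conflict: someone else changed this machine first
        machine.free_cpus -= cpus
        machine.version += 1
        return True


def schedule(cell: CellState, cpus: int, max_attempts: int = 3) -> Optional[str]:
    """Pick a machine from a snapshot, commit optimistically, retry on conflict."""
    for _ in range(max_attempts):
        view = cell.snapshot()
        candidates = [n for n, m in view.items() if m.free_cpus >= cpus]
        if not candidates:
            return None  # nothing fits in the current view
        choice = max(candidates, key=lambda n: view[n].free_cpus)
        if cell.try_commit(choice, view[choice].version, cpus):
            return choice
    return None


if __name__ == "__main__":
    cell = CellState({"m1": Machine(4), "m2": Machine(2)})
    stale_view = cell.snapshot()          # a second scheduler's older view
    print(schedule(cell, 3))              # first scheduler claims CPUs
    # The second scheduler's commit is based on outdated state and is rejected.
    print(cell.try_commit("m1", stale_view["m1"].version, 1))
```

The point of the design is that nobody holds a lock while making a placement decision; the cost is that work is occasionally redone when two schedulers pick the same resources.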


If you enjoy reading the Weekly CS Paper, I would be really thankful if you would support it with a few bucks: gum.co/weeklycspaper. The newsletter will stay free forever!

Software exists to create business value

I am Simon Frey, the author of the Weekly CS Paper Newsletter. And I have great news: you can work with me.

As CTO as a Service, I will help you choose the right technology for your company, build up your team and be a deeply technical sparring partner for your product development strategy.

Check out my website simon-frey.com to learn more or to contact me directly.

Abstract:

Increasing scale and the need for rapid response to changing requirements are hard to meet with current monolithic cluster scheduler architectures. This restricts the rate at which new features can be deployed, decreases efficiency and utilization, and will eventually limit cluster growth. We present a novel approach to address these needs using parallelism, shared state, and lock-free optimistic concurrency control. We compare this approach to existing cluster scheduler designs, evaluate how much interference between schedulers occurs and how much it matters in practice, present some techniques to alleviate it, and finally discuss a use case highlighting the advantages of our approach – all driven by real-life Google production workloads

Download Link:

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41684.pdf

