Engineering, Automation, and… Chaos Monkeys? 🤔🙊

How large companies like Netflix ensure a seamless delivery for 200M+ subscribers

You’re sitting on your couch, scrolling through Netflix when you finally find a show that catches your eye (no, not The Office which you’ve already watched 34x… jk 😉).

You mouse over to click ▶Play, when the website crashes and a blaring error pops up. You try again and again to no avail, and finally flock over to social media where chaos has broken loose.

Good thing this anecdote isn’t real 😅, and chances are you’ve only experienced one or two issues while trying to stream your favorite movie in the past.

But… it could very well be a common issue for hundreds of millions of users if delivery engineering didn’t exist.

In essence:

Delivery engineering is what prevents this

from becoming this 😔

When you click on the Play button, go on social media, or order takeout online, that’s all delivery engineering ensuring that your customer experience is as seamless and faultless as possible. 🥡

So how do large companies such as Netflix do it, and at such a global scale? 🌐

I talked with Amy Smidutz, Director of Delivery Engineering at Netflix and previously at Amazon and GoDaddy, to find out. Here’s what you’ll get in this newsletter:

  1. 🙊 Chaos engineering and how breaking things in production actually increases resilience

  2. 🤖 Automation in reducing human error

  3. 🙌 How you can get into a career in engineering (and what companies like Netflix look for in individuals)

I write a biweekly newsletter breaking down tech, startups, and investment. Curious about any of the above? Drop your email below to start receiving editions every other week 🤗👇


🎙 Δx podcast

Let’s get into it. In this episode, you’ll learn how large companies like Netflix deliver content seamlessly to hundreds of millions of subscribers. Spoiler alert: it’s not what you might expect.

Counterintuitively, chaos experimentation and breaking things in production can actually help reduce failures for millions and improve resiliency to outages.

You’ll hear about the importance of reducing blast radius, why regional failovers can prevent large outages, and how automating away certain tasks reduces human error. 🙌

💎 Δx takeaways

I’ve been really excited to write to all of you about chaos monkeys and the fascinating field of chaos engineering :) And it’s finally here!

But first, let’s talk about a few of the different ways Netflix increases resilience 💪 for its systems and reduces the impact of failures:

  • 💥 Reducing blast radius: in the event of a failure or outage, how can you minimize the number of people impacted?

  • 🤹‍♀️ Regional failovers every 2 weeks: you always want to have a Plan B

    Netflix may not work for a few people for a few minutes, but this can prevent outages for millions

  • 🐒 And finally… chaos monkeys: terminating EC2 instances for chaos experimentation

What this means is Netflix intentionally tries to break and disable computers in a production network to test resiliency. Here’s a more descriptive example:

“Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices… 🙈 The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.” - Wikipedia

Fascinating, right?
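To make the monkey a little more concrete, here’s a tiny sketch in Python. Everything in it is my own assumption for illustration: the fleet is just a dictionary, the region names and instance IDs are made up, and `chaos_monkey` and `route_request` are hypothetical helpers, not Netflix’s actual tooling. Nothing here talks to real servers. It only shows the idea: kill something at random, then check that users can still be served from a healthy region.

```python
import random


def healthy_regions(fleet):
    """Return the regions that still have at least one running instance."""
    return [region for region, instances in fleet.items() if instances]


def chaos_monkey(fleet, rng):
    """Terminate one randomly chosen instance; return (region, instance_id)."""
    region = rng.choice(healthy_regions(fleet))
    victim = rng.choice(sorted(fleet[region]))
    fleet[region].remove(victim)
    return region, victim


def route_request(fleet, preferred_region):
    """Serve from the preferred region, failing over if it has no instances."""
    if fleet.get(preferred_region):
        return preferred_region
    fallbacks = healthy_regions(fleet)
    if not fallbacks:
        raise RuntimeError("total outage: no healthy region left")
    return fallbacks[0]


# Hypothetical fleet: three made-up regions with two instances each.
fleet = {
    "us-east": {"i-001", "i-002"},
    "us-west": {"i-003", "i-004"},
    "eu-west": {"i-005", "i-006"},
}

rng = random.Random(42)  # seeded so a run can be repeated
for _ in range(3):
    region, victim = chaos_monkey(fleet, rng)
    served_from = route_request(fleet, "us-east")
    print(f"terminated {victim} in {region}; us-east users served from {served_from}")
```

With six instances and only three terminations, at least one region always survives, so every request still finds a home. That is the whole point of the exercise: the failure is real, but its blast radius stays small.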

If all this went a bit over your head - Amy does an incredible job explaining chaos and delivery engineering in simple terms in the 20 min podcast above ⬆

Okay, so now that you’ve endured the chaos monkey, what’s next?

Automation.

“One of the principles of chaos actually is you have to automate it after you do it, because then you get regression testing for resiliency” - Amy Smidutz, DeltaX podcast 🎙
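Here’s how I picture that in code: once you’ve run a chaos experiment by hand, you turn it into a check that runs automatically, so a later change can’t quietly undo your resilience. This is a minimal Python sketch under my own assumptions. The fleet is a plain dictionary with made-up region names and instance IDs, and `service_is_up`, `kill_region`, and `resilience_regressions` are hypothetical helpers, not anything from Netflix.

```python
import copy


def service_is_up(fleet):
    """Steady-state check: at least one instance is running somewhere."""
    return any(instances for instances in fleet.values())


def kill_region(fleet, region):
    """Return a copy of the fleet with every instance in one region removed."""
    damaged = copy.deepcopy(fleet)
    damaged[region] = set()
    return damaged


def resilience_regressions(fleet):
    """Return the regions whose loss alone would take the service down."""
    return [r for r in fleet if not service_is_up(kill_region(fleet, r))]


# Hypothetical setups: one spread across two regions, one packed into a single region.
multi_region = {"us-east": {"i-001"}, "us-west": {"i-002"}}
single_region = {"us-east": {"i-001", "i-002"}}

print(resilience_regressions(multi_region))   # []
print(resilience_regressions(single_region))  # ['us-east']
```

An empty list means the setup survives losing any one region. If someone later collapses everything into a single region, the check starts reporting it, which is the "regression testing for resiliency" idea in miniature.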

And finally, how does Netflix find the right people to do all this? How can you become an engineer or work at your dream company?

Here are some things Netflix looks for in candidates and advice for useful skills to develop:

  1. Learn frameworks to solve problems (e.g. debugging) 💻

  2. Curiosity and passion 💗

  3. Leadership 🧙‍♂️ (Amy mentions how similar this is to engineering; making sure people processes are smooth is just like ensuring engineering is efficient)

All of these skills can be developed at any stage in your career - I for one will be keeping these in mind as I continue to grow and explore :)

📰 Δx change

Let’s talk innovations:

  • New way to generate electricity: This is mind-blowing. Using tiny carbon particles interacting with an organic solvent they float in, an electric current can be generated in a simple and efficient way. Essentially, all you have to do is flow a solvent through a bed of particles (electrochemistry ftw!)

  • 🧠 First 5G remote brain surgery in China: The operation was conducted successfully in nearly real time. With new technology for remote surgery, “you barely feel that the patient is 3,000 km away”.

  • 🤖 Optical chips for AI: Compared to standard electronic architectures, optical chips offer higher speed, lower power consumption, and lower latency.

Hope you’re as excited about the future and positive change as I am :)

If you enjoyed this newsletter and podcast and want to get involved in an episode, don’t forget to take a minute to share and hit reply. As always, hope this was insightful and helpful in your own journey (I’m always one email away if you have any lingering thoughts for discussion or feedback!)


<3,

Ellen X


Thanks for reading!

Woahh, thanks for reading to the *very* end! :) you’re a real one and hope you have an awesome week full of good food, lots of sleep, and vibey music 😄