
# Key Takeaway Points and Lessons Learned from QCon London 2019

QCon returned to London this past March for its thirteenth year in the city, attracting 1,500 senior developers, architects, and team leads. The conference opened with Sarah Wells, Technical Director for Operations and Reliability at the Financial Times, offering a highly rated talk that served as a "runbook" for successfully operating mature microservices. Over the five days of the conference, the major trends in modern software development were well represented across the 18 individually curated tracks. The 2019 QCon London tracks featured deeply technical topics covering Kubernetes, machine learning/AI, developer experience, architecture, modern language innovation, and more.

In addition to featuring practitioners from household names in software like the BBC, Google, Docker, Airbnb, and the FT, the conference had a powerful undercurrent of speakers from the physical sciences. Dr. Jason Box and Paul Johnston spoke about the risk of climate change and what we, as technologists, can do to help. The talk was followed by an unofficial event the next morning called "Breakfast with a Climatologist", in which attendees had the opportunity to have an open discussion with Dr. Box about climate change. In addition to the climate change thread, QCon London saw Diane Davis, an astrodynamicist and principal system engineer at NASA, discussing how Java is being used to visualize and plot trajectories for spacecraft. Finally, keynotes from Ben Goldacre (an award-winning writer, broadcaster and medical doctor who specializes in unpicking scientific claims) and Peter Morgan, who discussed the work being done on Artificial General Intelligence, showed the incredible breadth of expertise that QCon London draws on.
In total, the conference featured 174 speakers (a 9:1 attendee-to-speaker ratio) over the three days of the conference. Some of the top technical sessions of the conference included (videos of these sessions are published each Friday to InfoQ):

Workshop sessions on days 4 and 5 included the latest versions of Java, Electron, programming for the cloud with TypeScript, Go, service meshes, containers and container orchestration.

InfoQ had a number of editors at the event, and you can read the coverage online. We’ve already started making the videos and complete transcripts from the event available online.

This article summarises the key takeaways and highlights from QCon London 2019 as blogged and tweeted directly by the attendees who were there.

## Keynotes

### Building Artificial General Intelligence by Peter Morgan

Twitter feedback on this keynote included:

@lizthegrey: .@PMZepto leads off #QConLondon day 2 with an overview of artificial general intelligence. Starting with why -- because it can help us solve challenges in medicine, physics, applied science, etc.

@kriswager: First up, a keynote on AI by @PMZepto He kindly presented us with a tl;dr summary up front #QConLondon https://t.co/sRRfPdZfjt

@lizthegrey: @PMZepto but the perception in the media is quite dystopic. And even incredible things like Watson, AlphaGo, AlphaStar, are narrow AI rather than general-purpose. [ed: curious what he thinks of AlphaZero though...] We need to start by defining general intelligence. #qconlondon

@wiredferret: #QConLondon @pmzepto: What is intelligence? The things we call AI right now are very narrow - If something can only play chess, it's not generally intelligent.

@wiredferret: #QConLondon @pmzepto: 9 types of intelligence - spatial, naturalist, musical, logical-mathematical - existential, interpersonal, body-kinesthetic, linguistic, intrapersonal.

@wiredferret: #QConLondon @pmzepto: Most of the things we've built have been logical-mathematical, maybe a little linguistic, but very focused on this one thing, as opposed to the broad spectrum that human intelligence has.

@lizthegrey: Our systems don't yet ask "who am I? why am I here?" #qconlondon

@wiredferret: #QConLondon @pmzepto: We're maybe halfway to understanding intelligence in logical/mathematical, linguistic, spatial. Less with other types of intelligence.

@wiredferret: #QConLondon @pmzepto: How do we get to AGI? We need computer science, but also physics, neuroscience, and psychology.

@_3Jane: How can we understand intelligence by dividing it into skills/categories? How far has current AI research progressed in emulating these? Second day #qconlondon keynote by @PMZepto https://t.co/qK10TILiOP

@wiredferret: #QConLondon @pmzepto: So let's look at what we can use to model and learn more about intelligence - how can biological systems teach us? Biological systems are hierarchical, intelligence is emergent - the trick is integrating all the layers of system.

@lizthegrey: 100 bil neurons in the human brain. So many different layers that need to be peeled apart & understood -- ranging from molecular interactions of neurotransmitters to synapses, all the way up to the systems level. Think to try to understand AWS to the electron level. #qconlondon

@tracymiranda: AI we talk about today is mostly logical-mathematical intelligence, calculation not creativity. We still have a long way to go with most types of intelligence especially inter/intrapersonal #qconlondon @PMZepto https://t.co/4nArQJeswz

@danielbryantuk: "It takes a village to build an artificial generalized intelligence" @PMZepto #qconlondon https://t.co/fudRyW3MhT

@wiredferret: #QConLondon @pmzepto: Neurons are a type of information processing system. So are our smart phones. It's the act of taking in information and making meaning out of the inputs.

@lizthegrey: Above the individual neuron layer, there are cortical columns that structurally define the cortex -- 2 million of them. #qconlondon

@wiredferret: #QConLondon @pmzepto: Computation is not a new idea, we just have been making our abacuses smaller and faster for a long time. ;)

@lizthegrey: And then there's structures above that -- discrete parts of the brain etc. -- so we could approach this problem by physical modeling and integration of each of these parts. In comparison: digital computing has evolved from 2700BCE (abacus) to 1830 (Babbage) to today #qconlondon

@lizthegrey: Recent innovations: GPUs (which can do linear algebra well), TPUs (designed for intelligence) rather than just general purpose CPUs. #qconlondon

@wiredferret: #QConLondon @pmzepto: We use different kinds of processors for different kinds of workload. CPUs were designed for office apps and evolved to web. GPUs were built for graphics and evolved to early AI engines.

@berndruecker: "These huge machines won't give us general AI - but they are really good at playing Go" @PMZepto @qconlondon #qconlondon https://t.co/2UCgdMMbCm

@wiredferret: #QConLondon @pmzepto: Neuromorphic computing is biologically inspired - it uses analog signals and spiking neural networks (SNN). Needs to scale to learn more, available to public access.

@lizthegrey: What about quantum computing? Well, it's probably not what's happening in our warm, mushy brains. And it's less far along (~70 qubits). There are digital, neuromorphic, quantum, and biological ways of performing computation. #qconlondon

@danielbryantuk: The data center of the future, via @PMZepto at #qconlondon https://t.co/yRXxCDSxKp

@lizthegrey: The fourth industrial revolution is the open source set of artificial intelligence and machine learning frameworks e.g. TensorFlow. But how can we get towards general intelligence? We need physics. "The principle of least action". Lagrangians of the human brain?
#qconlondon

@lizthegrey: [ed: okay, I'm completely lost] Intelligence needs to not just respond in predictable ways, but build a *model* of the world. #qconlondon

@lizthegrey: [ed: still confused] Active inference = fighting the second law of thermodynamics by decreasing entropy. and somehow we can do that by making our predictions come true. #qconlondon

@lizthegrey: Can we build AGI? Speaker argues yes, we have all the ingredients. But should we? That's a separate ethics talk. [fin] #qconlondon

### Mature Microservices and How to Operate Them by Sarah Wells

Ben Sigelman attended this keynote:

She documented, with evidence, how Financial Times increased their release velocity more than 100x by switching to microservices. It’s all very compelling and hard to ignore.

Twitter feedback on this keynote included:

@lizthegrey: @sarahjwells They've been operating microservices for many years. But sometimes the complexity leads to debugging challenges. They had a bug in their 'streams' service that was 404ing content, but nobody remembered where the service lived, how to back up, etc. #qconlondon

@lizthegrey: It eventually got fixed, but only after 20 minutes of indecision and timidity. "Does the monitoring tell you there's a problem [yes], and does the documentation give you what you need? [no! :(]" -- @sarahjwells #qconlondon

@lizthegrey: There are 10 or more microservices in the line between an editor clicking 'publish' to it showing up on the site. You have to figure out where the underlying problem in the system is to fix things. Being able to identify the USE/RED of each microservice helps. #qconlondon

@lizthegrey: But performance problems can be tricky even if you know where the problem is -- because debugging one query getting slow and backing everything else up is tough. [ed: why didn't they roll back?]
#qconlondon

@danielbryantuk: "Polyglot microservice-based applications that use multiple databases are great for flexibility, but they can be challenging during issues and backups. If you haven't touched a service for several years, it can be challenging to operate the DB" @sarahjwells #qconlondon https://t.co/fNcwWPC5y8

@lizthegrey: Most people read the FT online rather than on paper these days. And there's a paywall for a revenue model. There are a lot of experiments that need to happen for A/B testing to make the site better and improve business metrics! Experimenting requires deploying code. #qconlondon

@lizthegrey: Done correctly, CD and microservices let you deploy thousands of times a year, but it creates a new challenge: The teams deploying the microservices need to learn to operate them too. And when a team gets disbanded or moves on, what happens to their services? #qconlondon

@lizthegrey: "Your next legacy system will be microservices, not a monolith." --@sarahjwells #qconlondon

@lizthegrey: Increasing the speed of microservices: delivery lead time less than an hour, and ability to deploy whenever on demand. But time to restore service is important too -- otherwise we're adding speed just to crash into a wall. #qconlondon

@ctford: Very true, as @sarahjwells points out, that microservices that haven't been worked on recently become "unmaintained, unloved and risky". #QConLondon

@lizthegrey: and what's the change failure rate? [ed: I think this is citing @nicolefv's research, but there are no captions here so I can't read the backscroll :/] #qconlondon

@randyshoup: All of your technical decisions should come down to *business reasons* @sarahjwells keynoting at #QConLondon

@danielbryantuk: Quoting @RisingLinda at #QConLondon in @sarahjwells keynote on how to build an effective culture of experimentation. However... "the word 'experiment' for most organizations really means 'try' " You must have a hypothesis and be able to fail in order to be experimenting https://t.co/iRur2f0UPq

@kriswager: A bunch of measurements of high-performance organizations presented by @sarahjwells, all dependent on continuous delivery #QConLondon The failure rate is much higher at organizations that are not high performance, especially if they try to push the release frequency up https://t.co/yXkOnEHWLY

@lizthegrey: "When it hurts, do it more often." You can't experiment when you're doing launch freezes and 12 releases a year [spoiler: the news won't stop coming just because you have a release window], because you can't tell what had what effect, and you have to wait 6 weeks #qconlondon

@lizthegrey: Continuous integration means you're actually practiced with pushing changes. "If you aren't releasing multiple times per day, consider what's stopping you." Investing in microservices is a big investment, make sure you're getting the benefit from it. #qconlondon

@kriswager: "If you aren't releasing several times per day, you need to figure out what is stopping you" @sarahjwells She went on to point out that it is usually architecture #QConLondon

@lizthegrey: Done right, you need loose coupling to make sure that you can release pieces independently. You also need to change your processes though. You can't review every change, so stop pretending you can. Eliminate "process theatre" like change boards. #qconlondon

@lizthegrey: How fast can we move if we do things right? Hundreds of changes per month = tens of changes per day. 250 times as often as releasing the monolith. #qconlondon

@lizthegrey: Lower the blast radius - make them small and easy to reverse, so that if they do fail, it's less painful. The monolith fails 16% of changes, but microservices fail less than 1% of changes -- and can be easily rolled back.
#qconlondon

@kriswager: "Change approval boards don't reduce failures but it slows everything down" @sarahjwells #QConLondon In my experience it also makes releases bigger, because people don't want to go through the change approval process for small releases. https://t.co/g7k8iEKKkY

@lizthegrey: Let's talk about the costs though - they "transform business problems into distributed transaction problems", citing @drsnooks. Some patterns and approaches can help: use DevOps to have one team focusing on the goal of delighting customers rather than arguing. #qconlondon

@shanehastie: @sarahjwells #qconlondon DevOps is crucial for success - you build it, you run it https://t.co/AMzxs43Jsh

@lizthegrey: Empower your teams to make their own decisions. It'll make your teams happier and also let them move faster. Delegating tool choice means you can't have a single central team supporting everything; teams have to support the choices they've made. #qconlondon

@lizthegrey: But make things someone else's problem where you can because it's easier -- don't run your own Kafka, or run your own backups if you can help it. Heroku, AWS, etc. can help #qconlondon

@lizthegrey: Buy rather than build, *unless* it's critical to your business. Is the customization you can do or the core technology critical to your business? If not, outsource. #qconlondon

@lizthegrey: Quoting @mipsytipsy, you need to embrace failure and lean into resiliency rather than trying to prevent failures. Use your error budget. Retry failures. But avoid the thundering herd with backoff, jitter, and timeouts. #qconlondon

@mbonifazi: @qconlondon IT is not a hospital, sometimes we forget about that @sarahjwells' quote was great! #qconlondon https://t.co/DNGTY019Dg

@danielbryantuk: Lots of wisdom from @sarahjwells keynote on lessons learned from running microservices at @FT, at #qconlondon Optimise for loose coupling, both technically and organizationally.
This comes with tradeoffs, though https://t.co/GQlax3Xz28

@lizthegrey: Don't get inundated with alerts. Only alert on the business capabilities. "Are we able to publish right now?" Use synthetic monitoring to generate events that test the system. #qconlondon

@ctford: Your architecture is in a constant state of "grey failure", says @sarahjwells, but if you understand your appetite for risk and experimentation you know how hard to push. #QConLondon

@lizthegrey: Poke at one end, read out the data on the other end, don't rely on the internal details of the system (which may change over time). And it'll let you know before someone complains because their real publish didn't work. And won't falsely alert on holiday inactivity. #qconlondon

@RogerSuffling: Keep the blast radius small - #microservices @ft @sarahjwells @qconlondon #QConLondon

@lizthegrey: One way of getting eventual consistency is making queries idempotent, and retrying if they timeout or fail. #qconlondon

@lizthegrey: Build observability into your system. You can't just attach a debugger any more. You have to test in production. "Can you infer what's going on in the system by examining its external outputs e.g. logs/metrics/monitoring?" #qconlondon

@lizthegrey: Log aggregation. You need to be able to find the logs for a particular event, even though there are many many logs. Propagate transaction IDs through your services and dependencies so you can find them later. #qconlondon

@ctford: There are a bunch of things you can do to make working with microservices more pleasant. @sarahjwells cites synthetic transactions, zero downtime deployments, continuous delivery, build-it/run-it, distributed tracing. #QConLondon

@lizthegrey: Metrics let us see things over time. But people get overwhelmed with how many they have. Look at what's closest to the user first. Keep it simple - USE/RED #qconlondon

@danielbryantuk: "The way you debug microservices is entirely different from a monolith.
Build metrics, logging and tracing into your services, and keep it simple" @sarahjwells #qconlondon https://t.co/lWBGdKPDBO

@lizthegrey: [ed: and having to switch between metrics/logs/traces is a context switch, so eliminate those context switches. More in my talk!] You're always going to be migrating something or deploying something. Have a template so you have one framework for updating each svc. #qconlondon

@lizthegrey: Service meshes can take on a lot of the platform work that FT did, but it didn't exist when they were developing their platform. Any codebase has bits that don't change much; that can mean that you accumulate crufty microservices. Are you testing you could deploy? #qconlondon

@lizthegrey: Urgent security vulnerabilities won't wait for you to figure it out. So what happens when people move on from the team? #qconlondon

@shanehastie: @sarahjwells #qconlondon Every system must be owned by a team, not an individual and ensure there is ongoing maintenance https://t.co/qhLDtj3JDA

@ctford: An underlying theme of @sarahjwells's talk is to commodify things that don't add business value. She mentions Wardley Maps as an analysis technique and service meshes as a potential tool to help. #QConLondon

@lizthegrey: Make sure that people get concrete benefits out of using the common platforms e.g. common dashboards/alerts -- it'll encourage them to keep the data correct. #qconlondon

@lizthegrey: Practice so that you're not scratching your head at 3am. Failovers and database restores are just the beginning. Make it routine and done every week. #qconlondon

@lizthegrey: And eventually you can add chaos engineering. It's not about randomly unleashing chaos, it's structured experimentation that relies upon knowing your steady state and how to minimize the blast radius of your experiments. #qconlondon

@lizthegrey: What do you expect to happen? Write down your hypothesis. Run the experiment. Maybe you'll find your hypothesis was wrong!
Microservices are harder than monolith but worth it. Maintain your knowledge bases and garden away the weeds. Plan. Remember the business value. #qconlondon

### Restoring Confidence in Microservices: Tracing That's More Than Traces by Ben Sigelman

Twitter feedback on this keynote included:

@danielbryantuk: "You can inspire confidence through use of tooling, whether that's in software or life" @el_bhs #qconlondon https://t.co/0gFqwkOo9a

@jkriggins: In this industry, we celebrate genius too much. If we're relying on people to be brilliant all the time especially in an emergency, we are in for a rough time. I think it's important to apply tooling. @el_bhs opening today's #QConLondon on creating confidence in software https://t.co/vMG7aGb0Np

@lizthegrey: Do all high-leverage systems require a lot of developers? And do they necessarily have to be incredibly complicated? #QConLondon

@lizthegrey: If you have a monolith, you get crises of confidence in ability to deliver software with velocity. We need to be able to break up our large groups of developers into smaller teams. Microservices. #QConLondon

@lizthegrey: Think about who people have to talk to - you spend time talking to your own team, and need to talk up and down by one layer. [ed: same slide as the previous talk, and same comment that this resonates wrt Google SREs' scope & relationships] #QConLondon

@danielbryantuk: Regaining confidence in software development velocity, via the use of microservices and an appropriate organizational structure @el_bhs #QConLondon https://t.co/aTLmhxqiPU

@danielbryantuk: "There are so many ways your microservice system can fail, that you are often overloaded in trying to create hypotheses in order to locate the issue" @el_bhs #QConLondon https://t.co/2hcoIxRSNY

@danielbryantuk: The tracing conundrum -- managing the data volume, cost effectively, at scale @el_bhs #qconlondon https://t.co/rI5hVDaONc

@lizthegrey: You fundamentally have to sample.
Dapper addressed this with non-biased sampling 1/10,000 at the start of the operation. Which means you miss data at the long tail (since chance of a 99th pctl trace getting sampled is 1/1M), but it's usually good enough. #QConLondon

@lizthegrey: It's hard for human beings to process the information in a trace. How can you present the data in a way operators can understand in an emergency? #QConLondon

@danielbryantuk: "Distributed tracing is powerful, and yet traces are too big for our brains." @el_bhs #qconlondon https://t.co/ITu8CM60bE

@lizthegrey: So in order to figure out how to make tracing useful, we need to start with the service level indicator. e.g. request rate, latency, and error rate #QConLondon

@lizthegrey: We have two ways that we need confidence: either gradually improving an SLI over months, or restoring an SLI rapidly when our SLI is very far from healthy. #QConLondon

@danielbryantuk: "There is some confusion between traces and tracing. Distributed tracing is the art and science of making distributed traces valuable" @el_bhs #qconlondon https://t.co/JG4VmnP3zq

@pinglinh: SLI = Service Level Indicators Being confident is being able to control your SLIs @el_bhs @LightStepHQ @qconlondon #QConLondon https://t.co/CNdhxvVIec

@danielbryantuk: "Using histograms of trace span latencies is a very effective way to understand what is happening with systems" @el_bhs #qconlondon https://t.co/yk9bV506Rm

@lizthegrey: Knowing where the bottlenecks *aren't* is important too - because it prevents you from wasting time optimizing things that are out of the critical path and won't actually make your service faster.
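
The sampling arithmetic quoted in the Dapper tweet above can be checked in a few lines. The sketch below is illustrative only: it assumes the sampling decision is made at the start of the operation, independently of how slow the request turns out to be, and it reads "99th pctl trace" as a request in the slowest 1%. The function names are hypothetical and are not part of Dapper or any tracing library.

```python
from fractions import Fraction

# Figures taken from the tweet; everything else here is an assumption.
SAMPLE_RATE = Fraction(1, 10_000)  # uniform, up-front sampling rate
TAIL_FRACTION = Fraction(1, 100)   # share of requests at or beyond the 99th percentile


def tail_sample_probability(sample_rate=SAMPLE_RATE, tail_fraction=TAIL_FRACTION):
    """Chance that an arbitrary request is both in the slow tail and sampled.

    The two events are treated as independent because the sampling decision
    is made before the latency of the request is known.
    """
    return sample_rate * tail_fraction


def expected_tail_traces(total_requests, sample_rate=SAMPLE_RATE,
                         tail_fraction=TAIL_FRACTION):
    """Expected number of slow-tail traces kept out of total_requests."""
    return total_requests * tail_sample_probability(sample_rate, tail_fraction)


if __name__ == "__main__":
    print(tail_sample_probability())         # 1/1000000
    print(expected_tail_traces(50_000_000))  # 50
```

Under these assumptions, a service handling 50 million requests would be expected to retain only about 50 traces from its slowest 1%, which is the long-tail blind spot the tweet describes.
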
#QConLondon

## AI/Machine Learning without a PhD

### How to Prevent Catastrophic Failure in Production ML Systems by Martin Goodson

Twitter feedback on this session included:

@kriswager: A common problem when dealing with ML models is data leakage where the model gets access to data that it shouldn't have access to #QConLondon There are four different types of data leakage, which @martingoodson will address in turn https://t.co/cPPYMArh50

@kriswager: The four types of data leakage, as presented by @martingoodson #QConLondon All of them contaminate the data and change the results either of the training or the actual running of the model https://t.co/Qc7cuEIy9J

@kriswager: How widespread is the problem of data leakage? According to academia, it is pretty bad. Looking at just music genre recognition, all models are tainted. As @martingoodson points out, it is likely at least as bad in the private sector, where there is less openness #QConLondon https://t.co/vXj6DAlnpx

@_3Jane: When using a dataset generated by humans, you need to make sure you're not detecting patterns that said humans' brains used to automate data generation. Martin Goodson of EvolutionAI at #qconlondon talks about ML system failures. https://t.co/JaV6fjkYkj

@kriswager: One important way of finding data leakage is to test in real-world setting as soon as possible in order to see if it occurs according to @martingoodson #QConLondon https://t.co/AsVwyZR5tz

@_3Jane: Just because your model gives good answers in a test environment - doesn't mean it's looking at what it should be looking. So have a way for humans to check its inputs. ML failure talk, at #qconlondon https://t.co/RTcGmb5RdX

### Test Driven Machine Learning by Detlef Nauck

Twitter feedback on this session included:

@ctford: The team only realized a flaw in their design when rewriting from R to Spark and found the new system choking on negative values.
#QConLondon https://t.co/YDVGLZ1ruK

@ctford: Some libraries can warn on nonsense data, for example a variable that's actually constant. But you can't rely on that. #QConLondon https://t.co/c9gLkZ2chA

@ctford: "Machine learning is statistically impressive but individually unreliable." - Dr Nauck #QConLondon

@ctford: "How do we know that our models are any good?" Dr Nauck #QConLondon https://t.co/sNuxAIKqoq

@ctford: Validate at each stage. "You should understand you're writing software at this stage. Data wrangling is software. It should be versioned, it should be tested." Dr Nauck #QConLondon https://t.co/pgYI4cdqj3

@ctford: Cross-validation checks if the selection of data has an impact. #QConLondon https://t.co/96m6RXcxmp

@ctford: Things to look for when cross-validating. #QConLondon https://t.co/hhW1xhqu5M

@ctford: If you have a 1% target class, you can claim 99% accuracy with a constant output! "Accuracy" is a misleading term. #QConLondon https://t.co/GSLKKFKYfH

@ctford: "Test your data!" Dr Nauck #QConLondon

@ctford: You can do simple tests to cross-check your data, for example that probabilities sum to one. #QConLondon https://t.co/wZwHMd5nIl

@ctford: Fail hard rather than continuing on undefined values. #QConLondon https://t.co/gigWlp0N8Z

@ctford: You can do dynamic statistical tests, for example whether values are within three standard deviations of the mean. #QConLondon https://t.co/FQVqn4aSrb

@ctford: The toolspace is maturing.
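
Several of the checks tweeted from this session (probabilities summing to one, failing hard on undefined values, a three-standard-deviation range test, and the constant-output accuracy trap) are simple enough to sketch in plain Python. This is a minimal illustration using only the standard library, not the tooling Dr Nauck presented; all function names, tolerances and sample data are assumptions made for the example.

```python
import math
import statistics


def check_probabilities(probs, tol=1e-9):
    """Fail hard on undefined or out-of-range values, then check the sum is one."""
    for p in probs:
        if p is None or math.isnan(p):
            raise ValueError("undefined probability")
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    if not math.isclose(sum(probs), 1.0, abs_tol=tol):
        raise ValueError(f"probabilities sum to {sum(probs)}, expected 1.0")


def outside_three_sigma(values, k=3.0):
    """Return the values more than k population standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > k * spread]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    check_probabilities([0.2, 0.3, 0.5])               # passes silently
    print(outside_three_sigma([10.0] * 20 + [100.0]))  # [100.0]

    labels = [1] + [0] * 99       # hypothetical data with a 1% target class
    always_negative = [0] * 100   # a "model" that emits a constant output
    print(accuracy(labels, always_negative))           # 0.99
```

The last two lines reproduce the point about accuracy: the constant predictor scores 99% while never identifying the one positive example.
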
#QConLondon https://t.co/xSTNjvrJJc

@manupaisable: Very needed evolution for #AIML Explainability by design - being able to ask a model why it made a given decision and more - @DetlefNauck @QCon #qconlondon https://t.co/bnHEYgsVj1

## Architecting for Failure: Chaos, Complexity, and Resilience

### Amplifying Sources of Resilience: What Research Says by John Allspaw

Twitter feedback on this session included:

@manupaisable: Software engineering is the new kid on the block in the resilience engineering community (which has existed since the early 2000s) by @allspaw @QCon #qconlondon https://t.co/S2y9w0s968

@danielbryantuk: "You never interact with the system below the line; you interact with the representation. In some sense the things below the line don't exist e.g. you can't physically see code running" @allspaw #QConLondon https://t.co/HEbt26FRUp

@danielbryantuk: "Incidents are messy. And people bring their own context" @allspaw #QConLondon https://t.co/YdGIOo8Hyy

@FraNobilia: What #resilience is #not #QConLondon https://t.co/f2tpjWLumu

@sarahjwells: "Things that have never happened before happen all the time" - Scott Sagan "The Limits of Safety", quoted by @allspaw at #qconlondon

@danielbryantuk: "The rise of chaos engineering is almost an implicit acknowledgment that our software systems are too complex to fully understand. You don't need to conduct experiments if you know the answer" paraphrasing @allspaw discussing resilience at #QConLondon https://t.co/JPO6QXhXmg

@danielbryantuk: Resilience is sustained adaptive capacity. Some hints on how to find adaptive capacity from post-incident reviews, via @allspaw at #QConLondon https://t.co/RbvtiIJVul

@monkchips: Why not look at incidents with severe consequences?
- Scrutiny from stakeholders with face-saving agenda tend to block deep inquiry #resilience by @allspaw #QConLondon https://t.co/YskrFn0rAh

@danielbryantuk: Great examples of some (contextual) sources of insight/data for increasing resilience, via @allspaw at #QConLondon https://t.co/4bqFjwY6WG

@danielbryantuk: "Resilience is the story of the outage that didn't happen" A great talk by @allspaw at #QConLondon https://t.co/Q1vVzDuHTM

### An Engineer's Guide to a Good Night's Sleep by Nicky Wrightson

Twitter feedback on this session included:

@lizthegrey: If you're terrible at out of hours support, like @NickyWrightson is, then you need to figure out a way to not get paged. She got paged while she was dreaming, acknowledged the page, and didn't wake up! She instead thought she was James Bond and getting a mission! #QConLondon

@lizthegrey: And no, you're not getting a Matrix-style deja vu, she worked on @sarahjwells's team and thus some of the lessons are similar. But she just switched companies and is about to be put oncall. And she's nervous and panicking about it. #QConLondon

@lizthegrey: Our distributed complex systems are getting _really_ complicated. And citing @martinfowler, you need a mature ops team to manage the cross product of high service counts and high deployment count. A 4 person team at River Island has to support dozens of technologies. #QConLondon

@lizthegrey: We no longer have dedicated people to keep our systems up, and have moved to full service ownership. Now, some context: when she was a developer at FT, her service was so unreliable in 2014 that her consumers added a reliability caching layer. #QConLondon

@lizthegrey: They had to rapidly switch to a 15 minute SLA in 2017 when they went into the critical path. Then they had to move from home rolled containers to k8s in 2018 with the same SLA. and in 2019, out of hours calls to third tier support dropped to 3. NOC had confidence.
#QConLondon

@lizthegrey: What three things changed to go to 0 escalations? (1) Engineering mindset. Operability needs to be paramount rather than an afterthought, citing @mipsytipsy. Enable ownership & necessary relationships. Having secondary oncall rotations helped. Voluntary & compensated #QConLondon

@lizthegrey: Daytime support doing in-hours triage of non-critical alerts, and driving platform improvements based on what they saw. People touched all areas rather than their SME areas #QConLondon

@lizthegrey: and being responsible for operations out of hours made them think about the potential consequences of not doing proper error handling and being woken up. They also designed severity levels to not wake up everyone for minor problems. #QConLondon

@manupaisable: "enable teams to own their own support model" - spot on by @NickyWrightson @QCon #qconlondon @TeamTopologies https://t.co/rZf7ifgivB

@lizthegrey: "A security patching alert is not a 3am emergency." #QConLondon

@sarahjwells: This is the mindset you need as a developer, that you build it you run it encourages - @NickyWrightson at #QConLondon https://t.co/tPfa0cPziD

@lizthegrey: Systems become more complex over time, and decline in quality unless we maintain or reduce it. Both code and process involved. #QConLondon

@lizthegrey: (2) Don't get called overnight for issues that could have been dealt with daytime. Don't let daytime releases wake you up. We have superstitions about 5pm releases, but that's a sign of low confidence in your deployment process. #QConLondon

@lizthegrey: But you do have to think about risks and mitigating them. Deployment times are a function of attention span. If your release takes too long, people will get a cup of coffee or go home and leave releases unattended and unresolved. #QConLondon

@lizthegrey: But if it only takes 12 minutes to do a release, you can watch it. Or you can hook your slow deploys to chatops and other notifications to make sure you're reminded to check.
#QConLondon

@lizthegrey: and verify, verify, verify, and use feature flags. Turn off your experiments at night if they're risky! #QConLondon

@lizthegrey: Quoting @copyconstruct, watch her talk "testing microservices the sane way". Don't run 3am monstrous batch jobs. Do you want to detect out of spec data at 3am because you're deferring the real work until then, or can you move up the sanitizing work to realtime? #QConLondon

@lizthegrey: Use auto-remediation. Don't require humans to recover from container failures, use k8s and serverless to reprovision and restart instead. (but you may need to adapt your apps to support lame duck mode, transactionality, clean restarts, idempotence/queues) #QConLondon

@lizthegrey: Get rid of as much state as you can. If you make things idempotent, you can retry them! And you can even do automatic system failures through replication. But redundancy is not a backup; corruption can spread across and make both copies unreliable. #QConLondon

@lizthegrey: So the EU stack went "bang" and the US cluster started going down immediately after it. They literally were in a pub at the time getting drinks. They bypassed every service in AWS and routed back to legacy datacenters to survive the event. #QConLondon

@lizthegrey: they were able to survive the night and look at it the next day. What had happened? They had a graph query platform living in a container, and a query of death that would consume infinite memory and blow up the VMs. #QConLondon

@lizthegrey: Use the capabilities of your platforms and set namespaces and resource limits; set things up to automate failure recovery whenever possible. "Let the computers do it." --@NickyWrightson #QConLondon

@lizthegrey: (4) what do your customers care about? do they care about all 500 requirements? Know what's important to the business. And be able to react to critical issues before your customers find out. Don't have someone knocking on your door complaining.
#QConLondon @lizthegrey: .@sarahjwells says "only have alerts that you need to take action upon." If you have a large number of 500s, can you/will you do anything? if the answer is no, it just contributes to alert fatigue. Not all services are as important. Janitor job vs. payment processing #QConLondon @lizthegrey: Do synthetic requests to test the health of your system if query volume is sporadic. Otherwise you have Schrodinger's platform where you can't tell if it's dead or alive. #QConLondon @lizthegrey: Use tracing to monitor critical flows. And it may not necessarily look like a vendor solution; however, it's better in the long term to move away from home rolling a polling solution to streaming structured event logs to measure pipeline latency. #QConLondon @kriswager: A reference to a mantra by @sarahjwells, which @NickyWrightson admits doesn't come easy to her, but which she admits is correct. If you don't plan using the alert, then don't send it #QConLondon https://t.co/rOFtWWpuQZ @lizthegrey: Move your instrumentation closer to your code so that it's not a separate service that's tightly coupled; that way when you refactor code you're updating the instrumentation at the same time. #QConLondon @lizthegrey: (5) How do we do a better job of responding at 3am? Practice during the day. Do Chaos Engineering. Practice communication between first line ops team and second tier support team. [real alert fires] "Don't worry, we took it down" defeats the purpose of the test run. #QConLondon @lizthegrey: There's no point chaos testing a monolith if it'll just break and you know it's going to break. Instead, start doing it only when you think your system is resilient enough. But manual simulation of outages can happen prior to full migration. #QConLondon @lizthegrey: When you practice something over and over such as a failover, you naturally create incentives to make it shorter and shorter through automation.
And this also builds confidence in the steadiness of the platform. 3am manual interventions need to be simple. #QConLondon @lizthegrey: Don't make people scramble to do unfamiliar things at 3am, build tools that do it for them. Make sure your alerts have relevant information for diagnosing and responding. Don't make people look up info in 5 separate places. "What should you do to mitigate?" #QConLondon @lizthegrey: .@NickyWrightson's most important piece of advice: don't try to fix things at 3am, just stabilize your system enough to get it into the next day. [ed: amen.] You're smarter during the day when you've had coffee and can put heads together with your colleagues. #QConLondon @lizthegrey: So the lessons all boil down to "do everything during the day if you want to sleep at night." Think proactively about failure cases. Do preventative maintenance during the day. Automate. Understand your flows. Break things and practice. You can own your service [fin] #QConLondon ### Building Resilient Serverless Systems by Johnathan Chapin Twitter feedback on this session included: @danielbryantuk: Some rough edges remain with @awscloud for running globally resilient serverless systems, but these can be coded around, according to @johnchapin at #QConLondon https://t.co/mey3yZK8n4 @dcastelltort: Building resilient serverless system by @johnchapin at #QConLondon . Build your application as a single big one across AWS region / AZ. https://t.co/gHN6SpsCdK ### Learning From Chaos: Architecting for Resilience by Russell Miles Twitter feedback on this session included: @kriswager: "Everything you have tried to build will fail in production" @russmiles "but that is okay, as long as you are aware of it" #QConLondon @kriswager: "Most of software development can be distilled to one simple message: we don't know what we are doing" @russmiles "so how about we embrace that?"
#QConLondon @kriswager: "Admitting you are wrong is the first step towards learning" @russmiles #QConLondon @kriswager: "Essentially we are making it up as we go along. Embrace that" @russmiles #QConLondon @kriswager: "If you strive for feature velocity without reliability you don't have velocity" @russmiles #QConLondon There is no conflict between feature velocity and reliability. Delivering faster might make you reliable, as you have more chances to get it right. @kriswager: "Just knowing what normal is, is a good starting point [for figuring out if your system works]" @russmiles #QConLondon @kriswager: "You are not building simple systems. Sorry, you are not" @russmiles You are likely building at least complex systems #QConLondon https://t.co/WJWMssea8m @kriswager: #QConLondon "Technical debt is something you know that you are accruing. Dark debt is debt that you don't know you are accruing." @russmiles @kriswager: When someone takes the blame for a failure, the learning stops there. People don't look at why the system failed ~ @russmiles Good point I hadn't thought of before #QConLondon @JitGo: Learn about how your systems could be wrong to better understand how they work. But to do that you need to break them... Know anyone good at doing that @russmiles #qconlondon @techiewatt: Chaos engineering is about moving you out of chaos not creating it @russmiles on #chaosengineering at #Qconlondon https://t.co/hn7MKbMiFy ### Streaming Log Analytics with Kafka by Kresten Krab Thorup Twitter feedback on this session included: @lizthegrey: In an error situation, you want to be able to find things instead of having regrets that you didn't index. [ed: this conflates logging to local disk vs centrally *indexing* it. Honeycomb won't be a full logging product, but it does point to where debug logs live...] #QConLondon @gwenshap: "we decided to delegate the "hard parts" of distributed systems to Kafka. 
It's nice stable system that you can use for leader election and sequencing and managing data" #qconlondon @lizthegrey: The box diagram: agents -> ingest service -> digest and storage nodes. Every event coming in needs to both go into the event store, as well as in the real-time component that performs standing user queries like 'grep foo | count()'. #QConLondon @lizthegrey: On the query flow, fresh real-time data (which isn't yet serialized) is served by the digest, and the storage node searches across historical data for ad-hoc queries. #QConLondon @lizthegrey: Historical queries performed with brute-force search. It requires a full stack of performance tuning - compression, datastore, querying efficiency, etc. #QConLondon @lizthegrey: [ed: yup, here we go. yay!] Kafka solves the 'hard parts' of distributed systems: coordination, commit logs, and transient data. This is similar to KSQL feature of Kafka, but they had to implement before it existed. #QConLondon @lizthegrey: Reviewing the basic API of Kafka: you have partitions within a topic which can be replicated and implement acknowledged write = safe semantics. On the other end, we can have as many consumers as we like which read in sequence from the partitions. #QConLondon @lizthegrey: They've written a Zookeeper-like system maintains a replicated copy of the key-value store in each running node. Coordinated using single-partition Kafka queue that tracks atomic changes (and uses snapshots to bootstrap). [ed: why not use Zookeeper if using Kafka?] #QConLondon @lizthegrey: Performance consideration: when throughput is desired, you need to batch things up and have long latencies; but for this usage for smaller infrequent updates, use low latencies. [ed: not sure if I transcribed that right. still new to the Kafka world.] #QConLondon @lizthegrey: Nodes know where their files live in the system using this database. Event store has start time, end time, and metadata for each segment of recorded data. 
And each 10GB compresses to 1GB. [ed: this is convergent evolution-y similar to Honeycomb's architecture...] #QConLondon @lizthegrey: The indices aren't built at store time; instead they're built at query time. To run a query, you fetch the index files associated with the dataset (filtering by metadata tags of the source), and then find the segments with matching time ranges. [ed: yup, same!] #QConLondon @lizthegrey: Parallelizing the chunk decompression and search processes is necessary so that you don't have to start uncompressing from the very beginning single-threaded. #QConLondon @lizthegrey: So Kafka is used to keep the info on which segments live where and what their metadata is. For durability, they also use Kafka -- there's Kafka between ingest and digest. [ed: yup, same!] Validate the data and push to Kafka, and we can ack as soon as Kafka has acked #QConLondon @lizthegrey: On error, the agent can retry while it still has the data, so it preserves durability. They use 3-way replicated Kafka for acknowledged data. But we need to make sure to take it out of Kafka before it expires #QConLondon @lizthegrey: Push new log lines into query engine, and WIP buffer where segments that are being built are stored. Then segments get flushed. Segments store metadata of Kafka sequence numbers. If we have to restart, we replay ingest from our last segment's final sequence. #QConLondon @gwenshap: If Humio stops, we throw away WIP and re-play from Kafka. Retention in Kafka must be long enough to deal with these scenarios. #qconlondon https://t.co/fDTNjss4CU @lizthegrey: Their most important performance metric promised to their customers is ingest latency. They measure the p50 and p99 from initial ingest to the data reaching the WIP buffer. What makes it feel snappy is that they can query from the ingest buffer as well. 200ms delay. 
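The recovery scheme described in these tweets (discard the work-in-progress buffer on restart and replay from the last flushed segment's final sequence number) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain list stands in for a Kafka partition, and the `Digester` class, its method names and the segment size of 3 are hypothetical rather than taken from Humio.

```python
class Digester:
    """Toy digest node: buffers events in memory and flushes fixed-size segments."""

    def __init__(self, segment_size=3):
        self.segment_size = segment_size
        self.segments = []   # durable: (first_offset, last_offset, events)
        self.wip = []        # volatile: (offset, event) pairs, lost on a crash

    def next_offset(self):
        """Offset to resume from: one past the last flushed segment."""
        return self.segments[-1][1] + 1 if self.segments else 0

    def consume(self, log, start):
        """Read the log from `start`, flushing a segment whenever the buffer fills."""
        for offset in range(start, len(log)):
            self.wip.append((offset, log[offset]))
            if len(self.wip) == self.segment_size:
                self.flush()

    def flush(self):
        first, last = self.wip[0][0], self.wip[-1][0]
        self.segments.append((first, last, [event for _, event in self.wip]))
        self.wip = []

    def crash(self):
        """Simulate a restart: everything not yet flushed is thrown away."""
        self.wip = []


log = [f"event-{i}" for i in range(8)]   # stand-in for one Kafka partition

digester = Digester()
digester.consume(log[:5], 0)    # offsets 0-2 flushed, 3-4 still in the WIP buffer
digester.crash()                # WIP buffer lost
resume_from = digester.next_offset()
digester.consume(log, resume_from)

recovered = [e for _, _, events in digester.segments for e in events]
recovered += [event for _, event in digester.wip]
```

Nothing is lost as long as the log still holds every offset from `resume_from` onward, which is why the retention period in Kafka has to outlast the longest expected outage.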
#QConLondon @lizthegrey: Alerts can fire within fractions of a second and trigger automatic remediation, which matters very much for intrusion remediation. You don't want to be 15 minutes behind on your telemetry during an attack. #QConLondon @lizthegrey: What do they measure? They need to make sure the partitions are balanced. They have 24 partitions, and track load distribution. The hashing function decides which data goes where. You don't want a Kafka node to get slow due to excess load. #QConLondon @lizthegrey: Also need to monitor the consumer processing rate - if a partition's consumer is falling behind. They use ingest latency as their key metric. [ed: SLI maybe even?] They increase their parallelism and adjust hash function (re-partitioning) when more than 10s behind. #QConLondon @lizthegrey: The data source is a unique set of tags; it has a cardinality problem. They compress it to Hash(k,v+k,v+k,v) If someone puts user or IP addresses in tags, they get millions of time-series and files, overloading metadata. [ed: mhm. writing the honeycomb ad right there] #QConLondon @lizthegrey: they deal with this by multiplexing high cardinality value types to common streams/data partitions, and remembering which partition goes with which value type. e.g. karsten -> 13, james -> 13, and when searching for karsten, look at dataset 13 and search both/filter. #QConLondon @lizthegrey: For maximum utilization, you can't just take the Kafka defaults; you can't have 100k dynamic topics that perform and scale indefinitely. [ed: it's fascinating that they solved this by multiplexing data across segments, while we wrote a columnstore instead] #QConLondon @lizthegrey: So, overall: they're happy to have used Kafka as a dependency; it provides the stability and fault-tolerance, and customers frequently just put it on their own Kafka setups and take on the operational responsibility for the Kafka themselves. "Give us 3 topics."
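The multiplexing idea from the tweets above (many high-cardinality tag values share a small, fixed number of datasets, and a search filters within the shared dataset) might look roughly like the following Python sketch. The number of datasets, the use of CRC32 as the hash and the helper names are all assumptions made for illustration; the talk did not specify them.

```python
import zlib
from collections import defaultdict

NUM_DATASETS = 16   # assumed; keeps the number of files and streams bounded


def dataset_for(value):
    """Map a tag value to one of a fixed number of datasets with a stable hash."""
    return zlib.crc32(value.encode("utf-8")) % NUM_DATASETS


datasets = defaultdict(list)


def ingest(user, line):
    """Store the event in the shared dataset, keeping the tag on the event itself."""
    datasets[dataset_for(user)].append({"user": user, "line": line})


def search(user):
    """Look only at the dataset the value hashes to, then filter by the exact value."""
    return [e["line"] for e in datasets[dataset_for(user)] if e["user"] == user]


# A thousand distinct users still produce at most NUM_DATASETS datasets.
for i in range(1000):
    ingest(f"user{i}", f"login {i}")
ingest("karsten", "disk full")
ingest("james", "deploy ok")
```

The trade-off is that a search reads some events belonging to other values and has to filter them out, in exchange for metadata that stays bounded no matter how many distinct tag values arrive.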
#QConLondon @lizthegrey: The most frequent brittleness occurs with smaller customers who need to boost the size of their Zookeeper or Kafka. But if customers don't have Kafka/zookeeper then Humio provides disk images they can run. #QConLondon @lizthegrey: Other production bugs they saw: garbage collection pauses on JVM because Kafka relies upon the JVM working correctly. The Kafka and HTTP libraries compress data, you don't want to compress twice or have slowness with byte[] so they wrote their own custom compression. #QConLondon @lizthegrey: Resetting Kafka requires epoch numbers (but cluster id works). [ed: I was hoping this talk would be entirely focused on the last three slides here, of the operational issues encountered running large-scale Kafka... so color me disappointed by time spent earlier] #QConLondon @lizthegrey: Humio moves computation to the data living on disk (or in memory) because of product requirements, whereas KSQL needs to move the data to the KSQL engine. They're not solving the same problem and are complementary. [ed: we also compute where data is for Honeycomb] #QConLondon @gwenshap: Q: Why use Kafka to build coordination and not ZK directly? A: because Kafka makes it easier to build an in-memory model from all updates. #qconlondon @lizthegrey: Audience question asking why meta-zookeeper on Kafka? A: because no subscription to all updates and no maintenance of the data locally on each client mode. and no query language for zookeeper. Thus they needed something more than zookeeper can provide. #qconlondon @lizthegrey: Humio typically uses 16-20GB of memory and keeps most of its information on disk since it's quick about flushing data to disk. the rest of the ~120GB on a machine is all ram filesystem caches that OS does automatically. #QConLondon @lizthegrey: Deployments don't need to be clustered because it's so efficient (because it doesn't try to pre-index, it just is efficient about brute-forcing at query time).
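As a rough illustration of the query path described in this session (pick segments by source and time range from their metadata, then decompress and brute-force search them in parallel), here is a small Python sketch. The `Segment` layout, the use of zlib and a thread pool, and the sample log lines are assumptions for illustration only and say nothing about how Humio is actually implemented.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Segment:
    start: int     # timestamp of the first event in the segment
    end: int       # timestamp of the last event in the segment
    source: str    # metadata tag identifying the data source
    blob: bytes    # zlib-compressed, newline-separated log lines


def make_segment(start, end, source, lines):
    return Segment(start, end, source, zlib.compress("\n".join(lines).encode("utf-8")))


def scan(segment, needle):
    """Decompress one segment and brute-force search its lines."""
    text = zlib.decompress(segment.blob).decode("utf-8")
    return [line for line in text.split("\n") if needle in line]


def query(segments, source, t_from, t_to, needle):
    """Select segments by metadata only, then scan the survivors in parallel."""
    candidates = [
        s for s in segments
        if s.source == source and s.start <= t_to and s.end >= t_from
    ]
    with ThreadPoolExecutor() as pool:
        chunks = pool.map(scan, candidates, [needle] * len(candidates))
        return [line for chunk in chunks for line in chunk]


segments = [
    make_segment(0, 9, "web", ["GET /", "error: timeout"]),
    make_segment(10, 19, "web", ["error: 500", "GET /health"]),
    make_segment(0, 9, "db", ["error: deadlock"]),
    make_segment(20, 29, "web", ["error: late"]),
]
hits = query(segments, "web", 5, 15, "error")
```

Filtering happens at segment granularity here, so a real system would still need to check individual event timestamps inside each matching segment.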
#QConLondon ## Architectures You've Always Wondered About ### Airbnb’s Great Migration: Building Services at Scale by Jessica Tai Marcus Craske attended the session: This went through Airbnb’s migration from their monolith, the monorail, to microservices using Kubernetes. Some key points were about the technical approach, from starting at the persistence layer and working their way up to presentation, as well as the shift in their organisation structure, and how it enabled them to go from releasing every few weeks to thousands per year, and how it enabled them to scale work for a huge growth in engineering teams. An interesting approach I had not seen before was the use of a diffy environment, where requests were put through both an old and new version of a service, and the output compared to spot regressed behaviour. Twitter feedback on this session included: @lizthegrey: @jessicamtai The size of the eng team used to let new hires run through a human tunnel of every engineer before them, but that didn't scale. [ed: I sense a parable coming...] They wound up moving away from the monolith and to a microservices architecture. #QConLondon @lizthegrey: They initially started with the "Monorail" ruby on rails monolith with model, view, and controller all in one process. but moved towards service oriented architectures as the weight of rollbacks and slow deployment velocity accumulated. #QConLondon @lizthegrey: Tenets: services should own their own datastores, rather than having someone else modify their data out from under them. Services should be good at specific business functionality, rather than doing everything or only one thing each. Don't duplicate functionality. #QConLondon @lizthegrey: To propagate data between different services, use an event bus rather than modifying peoples' data for them. #QConLondon
@lizthegrey: To avoid cutting corners, make every service have the correct alerting, o11y, and best practices out of the box. [ed: I worry about saying "every service is production critical" -- some are more critical than others and that should affect your design decisions] #QConLondon @lizthegrey: First revision to the monolith model: use monorail to route, but send requests to dependencies under the hood. Have a strict hierarchy of dependency layers. [ed: yes, yay, @whereistanya, @silvia_esp, and others did this at Google!] #QConLondon @lizthegrey: Data -> Derived data -> Middle tier -> presentation. but it initially still used one shared database. They then intercepted calls for individual data types to a separate data service to proxy it, then could migrate the underlying data. #QConLondon @lizthegrey: Version 3 let them cut monorail out entirely and replace it with an API gateway that calls middleware services for session/auth data, and routes to the microservices for presentation/data. The web rendering service uses the API gateway too! #QConLondon @lizthegrey: But this wasn't an instantaneous migration. They needed to support both the monolith and the microservices at once. They did dual-reads across both stacks and compared the results to make sure they were consistent. #QConLondon @lizthegrey: Dark launch - they could compare offline without impacting user traffic. They then started ramping to 1% to the microservice framework, and increment over time, waiting at each step to gather data. #QConLondon @lizthegrey: Once everything looked good, they shut off the monolith read path.
For the write path comparison, they needed shadow databases instead to dual-write, then compare the writes. #QConLondon @lizthegrey: For the API gateway, they needed to compare the results of the request with context from monorail vs. from the middleware plus services. This already seems cautious enough, but they also limited blast radius by migrating one endpoint at a time. #QConLondon @lizthegrey: Having a partial migration is better than migrating everything at once; it lets you move faster. They also could migrate individual pieces of data at a time, causing the presentation layer to read from the monolith where needed. Standardization helped as well. #QConLondon @lizthegrey: Extracting all of the boilerplate pieces (including testing frameworks, deployment, and observability) into a service framework enabled delivery teams to focus only on business logic. #QConLondon @lizthegrey: They used Thrift for commonly defining RPC interfaces, for instance. and then annotating whether it accepts replaying production traffic, or whether it should use the standard rate-limiting. and documenting who owns services, points of contact, etc. #QConLondon @lizthegrey: Fail-fast mechanisms with configurable timeouts, retries, throttling, backpressure, and circuit-breaking. Baked in by default. Separate work queues for each dependency [to avoid cascading failures to another dep], and graceful degradation when one dep unavailable #QConLondon @lizthegrey: Full cycle testing: local development full-featured environments, replaying prod traffic in staging and diffing staging vs prod responses (including controls for non-determinism between the same version). Diffing lets you both find expected and unexpected changes.
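The response diffing with "controls for non-determinism between the same version" mentioned above can be sketched as follows. The idea shown here is the general one used by Diffy-style tools: send the same request to two instances of the current version and to the candidate, treat any field on which the two current instances disagree as noise, and report only the remaining differences. The function names and the sample responses are hypothetical, and this is not Airbnb's implementation.

```python
def diff_fields(a, b):
    """Return the set of top-level keys whose values differ between two responses."""
    return {key for key in a.keys() | b.keys() if a.get(key) != b.get(key)}


def regressions(primary, secondary, candidate):
    """Differences introduced by the candidate, with non-deterministic fields removed.

    `primary` and `secondary` come from two instances of the same current version,
    so any field they disagree on (request IDs, timestamps) is treated as noise.
    """
    noise = diff_fields(primary, secondary)
    raw = diff_fields(primary, candidate)
    return raw - noise


# Hypothetical responses to the same replayed request.
primary = {"price": 100, "currency": "GBP", "request_id": "a1"}
secondary = {"price": 100, "currency": "GBP", "request_id": "b2"}
candidate = {"price": 100, "currency": "USD", "request_id": "c3"}

changed = regressions(primary, secondary, candidate)
```

Only `currency` is reported: `request_id` differs between every pair of responses, so it is filtered out as noise rather than flagged as a regression.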
#QConLondon @lizthegrey: But the migration wasn't perfectly smooth. There were laggards who preferred the older, "quicker" way, and the team developing the framework was small. Oncall structure needed to change too. Volunteer sysops didn't scale. #QConLondon @lizthegrey: Service ownership given to each team; when creating a service it's a default to specify how you receive alerts for it. [ed: my talk tomorrow is precisely about the pitfalls of "just telling teams to run their own stuff", and how to give an onramp.] #QConLondon @lizthegrey: Initial results of migration: bugfixes are faster, build/deploy is faster and broken change rate is down. Developers are happier. Latency is lower because requests can be parallelized between microservices [ed: conflating language change w/ decomposition]. #QConLondon @lizthegrey: Monorail now frozen except to migration work; increasing percentage of engineers working on microservices. #QConLondon @lizthegrey: Cautions: remote calls are tricky, separating databases make strong consistency hard, and orchestrating across multiple services and teams add complexity. moving towards k8s! [ed: again, solving this complexity is a people, not a tools problem!] #QConLondon @lizthegrey: Be prepared for a long process of migration, and decompose your services incrementally. You can scale your efforts with tools and automatic code/docs generation. Shifting your culture is also required. [ed: yes! finally!] Look both ways before you migrate. [fin] #QConLondon @lizthegrey: Audience question asking about how to enforce contracts between different services. Speaker suggests SLOs! [ed: yup, that's the way!!!] #QConLondon ### What We Got Wrong: Lessons from the Birth of Microservices by Ben Sigelman Twitter feedback on this session included: @lizthegrey: Setting the backdrop: Google was growing quickly and chose to trade expensive but reliable Sun boxes for cheap commodity Linux boxes, and patching over the problems with software.
Everything had to be made from scratch at the time. Couldn't clone from Github. #QConLondon @lizthegrey: They had to DIY infrastructure for very large datasets and QPS, that would scale horizontally to unreliable machines. Nobody had this as open source or as a vendor solution... and Google's culture was descended from DEC/Compaq labs' culture of chaotic autonomy. #QConLondon @kriswager: If you want to start a startup now, you should start by not trying to invent everything yourself, according to @el_bhs, but he points out that this wasn't really an option in the past (eg no Github) #QConLondon @lizthegrey: The org structure was super flat with managers having 120+ reports. Dapper was initially created by 3 people [inaudible/quickly spoken names], none of whom were Ben. #QConLondon @lizthegrey: Ben was working on Ads but joined Cheryl who was completely different from him on every axis in working on this new prototype of Dapper. Without telling his manager. The culture was very audacious which had benefits and drawbacks. #QConLondon @lizthegrey: Cambrian explosion of infrastructure - GFS, BT, MR, Borg, etc. -- in part because it was needed, but also because engineering glorified infra projects. They all shared properties in common: horizontal scale, good APIs, rolling upgrades, and frequent releases. #QConLondon @lizthegrey: But we had a tendency to ship our organization chart. We accidentally created microservices by creating planet-scale infrastructure. It created problems organizationally. #QConLondon @MelanieCebula: 1. 
Know Why "You will inevitably ship your org chart"; most people should adopt microservices b/c of human communication #qconlondon @el_bhs https://t.co/dy4X6HrtI8 @danielbryantuk: "Accidental microservices" at Google, via @el_bhs at #qconlondon "Microservices here largely emerged from the requirement of running applications at planetary scale" https://t.co/SgOjrh7Otv @MelanieCebula: Google mistake 1: #Kubernetes almost didn't get off the ground because it didn't fit into the box of working at massive, planet-scale services #qconlondon @el_bhs @lizthegrey: Lesson 2: independence is not an absolute. ants vs. hippies -- ants have distributed decision making, but they're not disorganized, unlike hippies. #QConLondon @lizthegrey: Platform decisions are multiple choice, but from a fixed, opinionated set = lawful good. k8s is true neutral - it's not opinionated. #QConLondon @MelanieCebula: 2. Independence is not absolute. microservices often portray service owners as independent and get to make their own decisions, and they may not make the best decisions #QConLondon @el_bhs https://t.co/90xvrO7IeI @thetruedmg: Give microservice teams choice over tech stack but from preordained choices #QConLondon @lizthegrey: The word 'serverless' doesn't mean anything to @el_bhs. What matters is getting out of the business of SSHing to a machine, but the current simplistic version of FaaS is too limited compared to the full potential power. #QConLondon @danielbryantuk: Great slide on "microservice platforming" in the world of Dungeons and Dragons, via @el_bhs #QConLondon https://t.co/JNGtD43YXw @cfhirschorn: "You will inevitably ship your org chart" aka Conway's Law. Love this illustration from .@el_bhs #QConLondon https://t.co/cgRNPRBqTD @lizthegrey: Because it's so much more expensive to do a remote RPC than do a function call in the same process, we're making two steps back performance-wise when we make one step forward decoupling things. Read Hellerstein's paper for more.
#QConLondon @lizthegrey: It's a 1000x performance hit to move things across process boundaries. And it magnifies market dominance of proprietary solutions. Think about when it makes sense to deploy functions to the edge, but proceed with caution for your core performance critical services. #QConLondon @Shaleenaa: In terms of Microservices, What We Got Wrong: Lesson 3: Serverless Still Runs On Servers @el_bhs #QConLondon platform lock-in is real https://t.co/pFXSXdiXmt @thetruedmg: Approach serverless in a ms architecture with caution, the two are not synonymous #QConLondon @MelanieCebula: Lesson 3: serverless involves servers! Function calls, two separate processes are a LOT slower (like 1000x), service comms slower, complex function lifecycle. Does wonders for vendor lock-in. Not as the backbone for microservices, but for works for other things #qconlondon https://t.co/65mxjWGm8C @lizthegrey: "You'll never as a human being be able to sort out which is the problem, and which is just correlated to the problem. It's a great way to visualize and rotten way to explain." [ed: YES YES YES] #QConLondon @lizthegrey: The number of failure modes is n^2 in the number of microservices you have due to communications between them, but users still only care about one thing: does it work? [ed: again, yes yes yes. I love it when my competitors and I are on the same page :D] #QConLondon @thetruedmg: Reduce the search space when diagnosing an issue #QConLondon https://t.co/QsbJCyV8pe @MelanieCebula: 4. Beware giant dashboards Microservices are really problematic for understanding root cause, and you need to reduce the search space considerably. #qconlondon @el_bhs https://t.co/olGnLVZIlO @danielbryantuk: "As you add more microservices, I can almost guarantee that your customers won't care.
However, they will care when things break" @el_bhs #qconlondon https://t.co/fOXQ3wOeuR @MelanieCebula: "a giant dashboard showing everything that might vary with your service is a recipe for confusion" @el_bhs #QConLondon @lizthegrey: You have to think about how to scalably and sustainably collect in an economically efficient way. You can't store everything, so be selective about your sampling (either at execution or aggregation time). #QConLondon @lizthegrey: and you have to be able to find the right trace to *find* the critical path in your transaction[s]. There are many ways to do it [and @LightStepHQ and your humble editor from @honeycombio are in agreement on this :)] Measure SLIs w/ precision then do biased sampling. #QConLondon @lizthegrey: So to review: are you doing microservices because of business reasons, or to scratch a CS itch? are you solving at the right scale? are you orchestrating and decomposing well? and can you *detect* and *refine*? [ed: so much applause for all of this from me. :)] [fin] #QConLondon @MelanieCebula: 5. Distributed tracing is more than distributed traces Lots of potential with distributed tracing but currently very primitive, biased sampling over random sampling, #lightstep announcement being made today #qconlondon @el_bhs https://t.co/vMr15bYasf @lizthegrey: I don't usually type up Q&A but this question was great: "Q: How do I decide when to do microservices and adopt the tooling?" "A: Pick tooling that lets you deal with both cases so you can adapt and migrate gradually. Envoy is useful from 1 to hundreds of services" #QConLondon @lizthegrey: "Q: benefits and drawbacks of internal infrastructure?"
"A: it's easy to tap another engineer on the shoulder, and also teams don't oversell the capabilities as external vendors would because they're all working for the same company" #QConLondon ## Career Hacking ### Becoming A Fully Buzzword Compliant Developer by Trisha Gee Twitter feedback on this session included: @kriswager: "Step one on staying on top is denial. No, not really. It's awareness" @trisha_gee #QConLondon She advocates monitoring sites like InfoQ (look at headlines, don't necessarily read the articles, unless it is something new) @danielbryantuk: "Statistically with cloud, failure is happening all the time. Maybe this picture actually is fine, as long as there is appropriate isolation and resilience at all levels" @johnchapin #qconlondon https://t.co/Q72xP5zqIG @kriswager: Step 2: speaking the lingo #QConLondon tongue-in-cheek, @trisha_gee suggests using the buzzwords you've become aware of, dropping the words into conversation @kriswager: Great talk by @trisha_gee. Her basic advice is to be aware of new technologies, pick the ones that interest you and then learn more, up to and including using it #QConLondon ### Surviving the Zombie Apocalypse by Andy Walker Mike Salsbury attended this session: Some of the talks we are finding most interesting on the last day of QCon London 2019 are on the non-technical “career hacking” track. Andy Walker presents Surviving the Zombie Apocalypse, in which he pleads for developers to look after themselves – get enough sleep, watch your diet and exercise. He’s got some interesting points about context switching, minimising interruptions and prioritising your own work – for some things it’s good to know what’s coming, but tackling low-priority issues too early can disrupt your flow and can lead to more work overall. 
### Take Control of Your Career: A Personal Growth Framework by Aaron Randall Mike Salsbury attended this session: “Why are we here?” – existential questions in Aaron Randall’s talk Take Control of Your Career: A Personal Growth Framework. Randall suggests taking matters into your own hands and regularly picking something to improve on. As engineers, it seems natural for us to pick technical skills to improve on. But that’s an easy choice and there are other important aspects that we could pick instead – communication, leadership or business knowledge, for example. The speaker suggests finding your “North Star”, a bigger picture goal that you want to achieve next in your career – this may be a new role, or growing into your current role. Based on this you can assess where you are, where you need to improve and define achievable short-term goals, which you could discuss with your peers and/or manager. Sounds like a good way to ensure your career stays on track and you stay happy with your work. Twitter feedback on this session included: @_3Jane: It's often unclear what the most important skill is to develop next. Technical skills are not necessarily the best option, even though they are the default. How do you turn the light on to see all the options? #QConLondon https://t.co/JIWZmX9SP3 @_3Jane: Growth frameworks consist of skills and roles and smush them into a matrix to guide your growth. [Reminds me of deliberate practice, though these skills are more coarse grained.] #QConLondon https://t.co/WQbVj52G1L @pinglinh: Professional development is yours @AaronJRandall @qconlondon #QConLondon https://t.co/BzVl6IK4WQ @_3Jane: Now a warmup. List 3 things each: Q2. What's going well? Q3. What could be better? #QConLondon @_3Jane: Q4. Describe strengths/weaknesses and rate yourself on your skills in the following areas: technical, communication, leadership. Compare to your job description for inspiration. #QConLondon https://t.co/hZpoIz7KdG @_3Jane: Q5.
Take some areas of opportunity from the previous question, given your North Star (Q1) and translate them into measurable goals. #QConLondon https://t.co/TqYBzQgLC2 @_3Jane: ...now get feedback on your goals from your colleagues and your manager. See if you can explain why your goals make sense. Get support so that you can achieve them (accountability) #QConLondon https://t.co/HMo0e4nioo @_3Jane: ...and here is where you can find the growth framework: #QConLondon https://t.co/WjrULQnsXt ### Using Your Super Powers to Boost Your Career Development by Francisco Jordano Twitter feedback on this session included: @_3Jane: How do you grow your career? is a question that shows up in job interviews. Nobody knows the right answer: the point is to learn what people do apart from the obvious (blog posts, meet-ups, courses.) #QConLondon @_3Jane: To make an interesting superhero movie you need a superhero, a sidekick, a villain... and a diverse team of other heroes. Diversity means resilience. #QConLondon https://t.co/OyWUarl4d7 @_3Jane: Your superhero powers are personality traits that you're great at. For each trait, find a story that demonstrates how it helped you grow your career. #QConLondon https://t.co/jhp0qfekhI @_3Jane: You need feedback from others, because they see parts of you that you're not aware of. They can help you adjust your self image. #QConLondon https://t.co/Z6fmtInmOe @_3Jane: Identify what you're bad at, including how your strengths sometimes work against you. Feedback from environment helps you figure our better ways to deal with what normally defeats you. (Radical Candor referenced here.) #QConLondon https://t.co/1n4GgVBtiZ @_3Jane: As you're a part of the team, the more you help others, the more they will help you. #QConLondon https://t.co/5Nmlo5Wl0B @_3Jane: Q: How do you start a discussion around identifying your strengths? A: Praise people for specific things. Make people aware of their weaknesses in specific situations. 
You need continuous feedback. #QConLondon @_3Jane: Q: how do you deal with senior employees who don't want to train juniors? [Paraphrased] A: this needs to be a part of company culture and understood as expected behavior; the higher you go, the more you're responsible for growing others. #QConLondon ## DevOps & DevEx: Remove Friction, Ship Code, Add Value ### Develop Hundreds of Kubernetes Services at Scale with Airbnb by Melanie Cebula Marcus Craske attended the session: A fantastic extension to the first talk by Airbnb, that went into much more technical detail about how they’ve actually implemented Kubernetes. They’ve got hundreds of services, with deployment configuration living with the code. However they’re using Python scripts to refactor such configuration, at scale, in order to roll out regular security updates. Their scripts also create generic templates for new services. And such scripts have automated integration tests, which are routinely creating and deploying new projects to ensure they work. Twitter feedback on this session included: @danielbryantuk: Current challenges and solutions with deploying to @kubernetesio at @AirbnbEng scale, via @MelanieCebula at #QConLondon https://t.co/sb0cgW5zeh @timanderson: Kubernetes has solved lots of problems for Airbnb but its complexity has been a challenge. It has open issues and "some of those issues are quite frightening" says Melanie Cebula #QConLondon @danielbryantuk: "We constantly have automatically generated paved-road (supported) service templates that we deploy to an environment and run integration tests on. 
This ensures templates and best practices work as expected" @MelanieCebula from @AirbnbEng at #qconlondon https://t.co/s2yRitPckZ @tracymiranda: Automatic refactoring configuration is great for bumping stable versions, dealing with security vulnerabilities, etc especially when you have hundreds of services to deal with @MelanieCebula #qconlondon https://t.co/8izYvNf1p1 @danielbryantuk: "The 'k tool' started as a make file used by the @AirbnbEng @kubernetesio team to help automate deployment, and then it evolved into comprehensive dev tool that is now distributed throughout the org" @MelanieCebula #qConLondon https://t.co/kk9ArP9rBx @danielbryantuk: Great takeaway from @MelanieCebula when deploying apps onto @kubernetesio "code and configuration should be deployed with the same process [and tools] on dev and CI" #qConLondon https://t.co/xhIEQtmnN3 @danielbryantuk: The @kubernetesio kubectl plugins looks very interesting (and Google's krew project too). Thanks for the reference @MelanieCebula! #QConLondon https://t.co/SGd49Yksub @danielbryantuk: "I recommend validating config before deploying to @kubernetesio. This is what we do at @AirbnbEng" @MelanieCebula at #qConLondon https://t.co/XhcohdpBGC @danielbryantuk: Excellent summary and future work slides from @MelanieCebula about working with @kubernetesio at @AirbnbEng via her #qConLondon talk https://t.co/iWWKLOMFLE ### Progressive Delivery by James Governor Marcus Craske attended the session: The new buzzword on the block, as the next frontier after continuous delivery to smoothly roll-out changes in a high-cadence environment. Covering canary releases, A/B testing and feature toggling. Even though I’ve already used those techniques in real production environments at large, my main takeaway was at how we as engineers can sell them to the business, so they gain confidence in shifting a cultural change towards faster releases. 
A key point was how a traditional ITIL process has a lack of empathy for both teams, and more importantly, our customers. Jennifer Riggins attended this session: In the complicated world of distributed systems, what separates the elite performers from the rest? These are the ones that are deploying all the time, but not breaking. These are the Netflixes and Expedias of the world that successfully commit thousands of deploys a day without user disruption. What do they have in common? There are certain practices shared by the few, the proud that are working so fast yet still don’t stop working. Each company has its own mix of chaos, canaries, and colorful code tests that keep their continuous delivery from cutting off customer experience. Founder of RedMonk James Governor offered to this year’s QCon London audience a new umbrella term for this creative experimentation toward systems resiliency: progressive delivery…. Progressive delivery is the next step after you’ve shifted testing left, automated load-testing and deployment, and committed to DevOps and CI/CD (continuous delivery/deployment and integration), or even ideally it’s a part of that journey. Governor says CI/CD is the onramp to everything good in modern software development, but argues that some of the associated disciplines of the early pioneers haven’t got the attention they should. With sophisticated service routing, it becomes easier to adopt experimentation-first approaches like canarying, blue/green deployments, and A/B testing which slow the ripple effect of a new service rollout. Progressive delivery routes traffic to a specific subset of users before being deployed more broadly, so that testing in production doesn’t have to be a massive risk. For Governor, progressive delivery is really progressive experimentation that spreads until it reaches the entire user base (or hopefully reaches it) without a degradation of user experience….
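To make the idea of routing a release to a subset of users more concrete, here is a minimal Python sketch of percentage-based rollout. It is an illustration only, not something shown in the talk: the function names `in_rollout` and `next_stage`, the stage percentages, and the 1% error threshold are all assumptions chosen for the example. Users are bucketed by a stable hash, so anyone included at 5% stays included when the rollout widens to 25%, and the rollout falls back to 0% if the observed error rate is too high.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` (0-100) of buckets.

    The bucket is derived from a stable hash of feature and user id, so the
    same user always gets the same answer for a given feature.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100


def next_stage(current_percent, error_rate, max_error_rate=0.01,
               stages=(1, 5, 25, 50, 100)):
    """Pick the next rollout percentage (hypothetical policy).

    Widen to the next stage while the error rate stays under the threshold;
    roll back to 0 as soon as it does not.
    """
    if error_rate > max_error_rate:
        return 0
    for stage in stages:
        if stage > current_percent:
            return stage
    return 100


if __name__ == "__main__":
    percent = 5
    for user in ("alice", "bob", "carol"):
        version = "new" if in_rollout(user, "new-checkout", percent) else "old"
        print(user, "->", version)
    print("next stage:", next_stage(percent, error_rate=0.002))
```

In a real system the routing decision would more likely live in a service mesh, load balancer, or feature-flag service than in application code, and the health signal would come from monitoring rather than a single number passed in by hand.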
Twitter feedback on this session included: @wiredferret: #QConLondon @monkchips: CI/CD is foundational to moving quickly, delivering software faster, with higher quality. Shift your testing left and work on accelerating. @wiredferret: #QConLondon @monkchips: But if we accelerate past our ability to manage, we end up with thundering herds and we can go very wrong very fast. @wiredferret: #QConLondon @monkchips: What are leading organizations doing? Canary testing - use a small amount of traffic to test delivery. A/B testing to understand what people actually like. Blue/green deployments to move traffic gradually. @thetruedmg: Estimate the blast radius when deploying a new service #QConLondon @wiredferret: #QConLondon @monkchips: I was talking to Sam G from Azure DevOps - he suggested Progressive Experimentation as a way to test the parts of delivery. @jessfraz: We do not invent new terms for things. So what did I do, I invented a new term for a thing @monkchips, #qconlondon @wiredferret: #QConLondon @monkchips: Progressive Delivery - continuous deliver with fine-grained control over the blast radius. Building blocks: User segmentation, traffic management, observability, automation. @wiredferret: #QConLondon @monkchips: Carlos Sanchez - Progressive Delivery is teh next step after Continuous Delivery, where new version are deployed to a subset of users and are evaluate in terms of correctness and performance before rolling them to the totality of the users @wiredferret: #QConLondon @monkchips: At Comcast-scale, it's scary to roll something out all at once. If you screw it up, you have 30k customer service reps who are trying to make it work. @wiredferret: #QConLondon @monkchips: Most people don't really like application changes! We think we do, because technology is cool, but Gmail changes and we're all grrrr! @wiredferret: #QConLondon @monkchips: For business, change is terrifying! 
@mipsytipsy tells us to debug in production, but business in general is so scared of that! @wiredferret: #QConLondon @monkchips: When we talk about cattle versus pets, we assume that all cattle are homogenous, and a change that should happen should propagate like wildfire - but we want firebreaks as well. @wiredferret: #QConLondon @monkchips: SumoLogic rolls out a new service to 5% of their customers at first, then they analyze the logs heavily. We do testing in production. Test the AI models in production, but at a smaller scale. @wiredferret: #QConLondon @monkchips: We used to have pets and cattle, but now let's talk about CloudFlare's model of dogs, canaries, and pigs. Dogs are faithful customers you don't want to break. A canary can be a whole city (Grubhub does cities to) @wiredferret: #QConLondon @monkchips: Think about which users you're rolling out a service to, when, in which order, and why. Do Japanese customers use services differently, Grubhub canary deploys to small cities first. SRE Book Golden Signals - latency, errors, traffic, saturation. @wiredferret: #QConLondon @monkchips: Developer experience - You don't have to start from nothing to start doing progressiver delivery. Weaveworks made Flagger, a K8s operator to move traffic to do canary deployments. @wiredferret: #QConLondon @monkchips: None of this is new - Pete Hodgeson wrote about feature toggles and the different use cases - release toggles, ops, permission, experiment toggles. @wiredferret: #QConLondon @monkchips: We needed an orchestration platform, because Google had the same problem, but a bit earlier than the rest of us, so they made it and then open-sourced it. @wiredferret: #QConLondon @monkchips: Here's the thing Google's network IS homogenous, and they can swap parts out. But not all of us are doing that. But Target is rolling out k8s in each store. It's taken a lot of janky engineering to make it work even a bit. 
@wiredferret: #QConLondon @monkchips: IBM has customers with stateful servers and can't just swap out servers, because some servers must be unique and specialized. @wiredferret: #QConLondon @monkchips: The pace of k8s is really impressive/frightening. We have to keep current for both security and value reasons. And we're not just delivering quickly, we have to learn to consume quickly. @timanderson: Long term support is dying says @monkchips you have to consume at pace as well as deliver at pace #qconlondon @wiredferret: #QConLondon @monkchips: Services meshes give you the abilility to do advanced service routing and traffic shifting, easier rollbacks, automatic metrics, logs, and traces. That all rolls up to .... Progressive Delivery! @wiredferret: #QConLondon @monkchips: Outages at Expedia are caused by code changes... that seems pretty common. Amazon doesn't do production deploys at Christmas. Things break when we change them, so how do we better understand that? @wiredferret: #QConLondon @monkchips: Progressive delivery is a compartmentalization strategy. @wiredferret: #QConLondon @monkchips Just culture is an important part of release velocity and progressive delivery https://t.co/XuRMLINKp6 @wiredferret: #QConLondon @monkchips: Developers today are not in a lot of places. 'GitOps' is putting the desired state in git, let K8s roll it out a bit, monitor it, and align the desired state and current state. @wiredferret: #QConLondon @monkchips: @mipsytips is talking about debugging in production, but we can debug production at 10% rollout. Then scale of data matters. @wiredferret: #QConLondon @monkchips: @copyconstruct If you want to understand observability and how not to make things that break or how to fix them, follow her, @mipsytipsy, and [missed it] from Google. @JitGo: What is progressive delivery? In a nut shell it's continuous delivery with control over how the features are rolled out and affect your users. 
Makes it a lot less scary for business users to understand then #CD @monkchips #QConLondon https://t.co/g717znhtZI ### Who Broke Prod? - Growing Teams Who Can Fail Without Fear by Emma Button Mike Salsbury attended this session: “Who broke Prod?”, by Emma Button, was all about your company’s culture and how to deal with situations when things go wrong. Things can go wrong and will go wrong, so don’t blame but deal with it. Develop a culture where it is okay to fail, then review and take action. One way to deal with it is to be transparent: be honest and share with your organisation what went wrong, e.g. in Slack. Document the steps you used to fix the problem, so that you later have a “How to” guide or a report. This can also encourage people to help you and collaborate. Very important is to use the term “We”, as it is about the whole team or organisation. Make failure visible through physical dashboards and monitors, as this can lead to less blaming and a focus on fixing the issue. But also reward experiments and celebrate people who take action. Praise success: say “Thank you!” and “Well done!”. I think the main thing I take from this talk is to praise and thank my team or individuals in public. ## “Don’t Mess Up The Culture!”—Scaling with Sanity ### Building and Scaling a High-Performance Culture by Randy Shoup Twitter feedback on this session included: @adrianmouat: .@randyshoup quoting Dave Thomas on what a company must do to be successful: build a product, sell a product and get along :) #QConLondon https://t.co/igOUQZRC0l @shanehastie: #Qconlondon @randyshoup what psychological safety actually means - We respect one another https://t.co/NN17zYoHnc @adrianmouat: Psychological safety is the number one indicator of high performing teams, not number PhDs etc. We can only do our best in a culture of mutual trust and respect. 
@randyshoup #QConLondon https://t.co/lWNbgnCSuJ @shanehastie: #Qconlondon @randyshoup Cross functional collaboration Well meaning people tend to agree when given the common context. Make it safe to disagree and commit https://t.co/QfZ8aHLKMm @ctford: Healthy organizations are capable of disagreeing but still committing to supporting their colleagues' success. @randyshoup #QConLondon https://t.co/L6Xg9z4Jan @tastapod: #QConLondon @randyshoup Even though it was Google it wasn't all unicorns and roses... about site reliability of Google App Engine. @tastapod: #QConLondon @randyshoup's model for resolving a problem: 1. Identify the problem 2. Understand the problem 3. Consensus and Prioritization 4. Implementation and Follow-up ...Profit! @shanehastie: #Qconlondon @randyshoup Autonomy and accountability Steve Jobs quote. Start with a goal, give accountability and have accountability https://t.co/bnqsMHF78b @ctford: Goal-setting is a critical part of autonomy. Without that, teams are lost not autonomous. @randyshoup #QConLondon https://t.co/RLDExP5CQv @kriswager: Given the team a goal (with a customer oriented metric) and give them autonomy, but hold them accountable ~ @randyshoup Never hold them accountable without giving them autonomy #QConLondon @ctford: We put multiple skillsets in teams to tighten feedback loops. We also do it so that we can put the team in charge of something important to the business. @randyshoup #QConLondon https://t.co/dy1Y6l2msa @shanehastie: #Qconlondon @randyshoup Full stack teams, shorten the feedback cycles, aligned around a business problem https://t.co/oCwnXiAKoJ @shanehastie: #Qconlondon @randyshoup The team who build it need to own it and maintain it https://t.co/CDa0CDJ4BE @ctford: Don't have one team responsible for the sexy new stuff and another for the old stuff - it puts the boundary of accountability in the wrong place. 
@randyshoup #QConLondon https://t.co/9PjAH6ti7M @shanehastie: #Qconlondon @randyshoup pragmatism and progress help me understand what you're trying to achieve? Engineers are uniquely trained to solve problems - if we're not asking this question we are not doing our job https://t.co/ynSnyQwn71 @CatSwetel: "Engineering is about solving problems. Sometimes we do that with code." @randyshoup #qconlondon @kriswager: We are not doing our job if we don't ask "what problem are you trying to solve?" before coding anything according to @randyshoup often a better solution exists, not necessarily involving code #QConLondon https://t.co/xWsammo9Mc @ctford: Finishing more things protects you from priorities that change mid-stream. @randyshoup #QConLondon https://t.co/GOojCI8yJs @shanehastie: #Qconlondon @randyshoup it doesn't matter how much effort we put into something if it doesn't ship https://t.co/PPFcxeONpy @shanehastie: #Qconlondon @randyshoup quality matters If we don't do it right we're bound to have to do it again https://t.co/u5cSOSuDhz @therealluigi: We don't have time to do it right! Do you have time to do it twice? @randyshoup #qconlondon @shanehastie: #Qconlondon @randyshoup Contrasting technical debt vs investment cycles https://t.co/SFsJkkxHnR @ctford: Important coda from @randyshoup: try to help your organization improve, but remember you're one person and give yourself a break sometimes. #QConLondon ### Discovering Culture Through Artifacts by Mike McGarr Twitter feedback on this session included: @_3Jane: Culture consists of values, behaviors reflect values. Values are hard to verbalize. Accepted and rewarded behaviors may be different from expected behaviors, which people notice by observation. #Qconlondon https://t.co/ow9VwunCqM @_3Jane: Actually, here's everything culture consists of :) We can analyze people, tools and processes before we join the organization and see behaviors and values. 
#QConLondon https://t.co/OJvlSkbUdD @tastapod: #QConLondon @SonOfGarr says culture is the entire pyramid of tools, values and beliefs, accepted and rewarded behaviors, etc. that you see in an organization. https://t.co/5rX13zajc2 @tastapod: #QConLondon @SonOfGarr: Company values project into the company's environment, e.g. Fog Creek's separate offices for engineers, GitLab's entirely distributed structure. @_3Jane: Managers have power. Succinct definition. [Examine power lines within the company if you can. For example, % of women within leadership team is revealing.] #QConLondon https://t.co/y22NQsyK6g @tastapod: #QConLondon @SonOfGarr: When you reward someone, you are rewarding _all_ of their behavior. Be careful of the message you are sending when you reward 'jerks'. Powerful stuff. @_3Jane: Here are some artifacts that you can analyze. However, they are hints, not evidence. Ask yourself: are the statements the company makes representative of its culture or merely aspirational? #QConLondon https://t.co/YNVcJBRbSg @_3Jane: How do they make money? This defines economic forces that shape leaders' behavior. #QConLondon https://t.co/EIGKARdQyH @_3Jane: How large is the engineering team compared to the rest of the company? You want to understand how decisions are made and by whom, how they are communicated, what departments have higher status. #QConLondon https://t.co/PMifuzH1jG @_3Jane: Company that is willing to let go of brilliant jerks sends a signal to the rest of the team is that they are more important. [Equally, company that leaks team players sends an inverse signal.] #qConLondon https://t.co/2XVgZ9t2lb @_3Jane: Languages are good at different tasks, but that's not the only factor of choice. When I'm choosing a new language, I'm also choosing to pull its community beliefs into the company, because communities are self-selecting. 
#QConLondon https://t.co/lG7LZEbZ5J @_3Jane: The relationship between people and company culture: analogous to Theseus' ship paradox. As people leave and new ones join, at which point does the culture shift into something new? #QConLondon @JitGo: Excellent talk by @SonOfGarr on ways to think about #CultureInTech and a model to go and understand what your culture is. From there you get a better understanding of why people do what they do and what is and isn't likely to work. #QConLondon https://t.co/K5ObKx4bvA ### People Are More Complex than Computers by Mairead O'Connor Twitter feedback on this session included: @shanehastie: #Qconlondon @maireadoconnor Collaboration is how we solve complex problems https://t.co/ndOFkxYdCa @shanehastie: #Qconlondon @maireadoconnor the network is what makes us better https://t.co/FHACd2eVHL @shanehastie: #Qconlondon @maireadoconnor bigger is different - not better or worse, just different. https://t.co/yGkCsTrjdG @tastapod: #QConLondon @maireadoconnor: No company has no structure. Instead they have invisible or hidden structures, and this can be dangerous. 
@shanehastie: #Qconlondon @maireadoconnor no structure means implicit structure - growth needs deliberate design https://t.co/kdOPORfaZ7 @shanehastie: #Qconlondon @maireadoconnor run the company like we run software Inspect,learn, adapt, iterate https://t.co/ZyoHNkrUtU @shanehastie: #Qconlondon @maireadoconnor every business problem is a communications problem https://t.co/7vjtkuQkC5 @tastapod: #QConLondon @maireadoconnor: Distributed decision-making, The Advice Process: - State your intention - Collect feedback - Make your decision @shanehastie: #Qconlondon @maireadoconnor an example of the advice process It's not about getting consensus- it's about seeking feedback and advice The decider owns the decision and the outcomes https://t.co/JcTzpZQW3X @shanehastie: #Qconlondon @maireadoconnor it works, and there are still opportunities to learn and improve https://t.co/DHCYWX3KPC @tastapod: #QConLondon @maireadoconnor: Having an associate model at @EqualExperts is like using the cloud: we can flex and expand temporarily when we need it. @shanehastie: #Qconlondon @maireadoconnor break th organizational monoliths Choosing the bounded context is hard https://t.co/yI7ShhzDn9 @tastapod: #QConLondon @maireadoconnor: 'Don't build for control, build for adaptability' works for teams of people as well as software. @shanehastie: #Qconlondon @maireadoconnor cross functional teams are better at solving problems You build it, you run it https://t.co/9K3hlqya3X @tastapod: #QConLondon @maireadoconnor: Can you run the business using cross-functional teams? The answer is Yes! (Ok, the answer is We're working on it.) @shanehastie: #Qconlondon @maireadoconnor Culture Debt is hard to pay back. You need to put active, thoughtful effort into addressing the problems https://t.co/gNLQ43m4dR @tastapod: #QConLondon @maireadoconnor: There is no staging environment for life! 
@shanehastie: #Qconlondon @maireadoconnor you can't experiment on people - you're always testing in production https://t.co/fefrmSorGk @shanehastie: #Qconlondon @maireadoconnor just enough hierarchy, trust that people will do the right thing https://t.co/M3NXIly8jL @shanehastie: #Qconlondon @maireadoconnor some things are just hard. https://t.co/HQ4Oyuw3FV @shanehastie: #Qconlondon @maireadoconnor continuous improvement and learning https://t.co/0vOZ8AYFpN @shanehastie: #Qconlondon @maireadoconnor do an experiment with the advice process https://t.co/Dz7UTnDcq0 ### Variety: The Secret of Scale by Cat Swetel Twitter feedback on this session included: @tastapod: #QConLondon @CatSwetel suggesting your corporate strategy shouldn't be completely resetting every two weeks. Low variety in some areas enables higher variety in others. https://t.co/rm0anw3f0W @_3Jane: Ladder of inference: your actions are restricted by what parts of reality you are able to perceive, and how you interpret them. [Mindfulness practices attempt to unpick the stack and get you to experience unfiltered reality!] #QConLondon https://t.co/wquMA6aFWD @tastapod: #QConLondon @CatSwetel: The more widely you share your strategy and e.g. OKRs, the fewer assumptions people have to make. @_3Jane: Get people to fill out these questions and compare in order to discover differences in their assumptions, or unnecessary restrictions they took on. #QConLondon https://t.co/9Lmj3gtAoG @randyshoup: Diversity is a *risk mitigation strategy* @CatSwetel at #QConLondon https://t.co/fDA4x9CC5k @kriswager: Variety of people is a risk mitigation technique for some companies according to @CatSwetel If people are too alike, they might not even be able to spot threats #QConLondon https://t.co/idwswwwM4K @tastapod: #QConLondon @CatSwetel using Chris Argyris' Ladder of Inference to illustrate how people in organizations have fewer and more limited options than they realize. Variety in people can counter this. 
@_3Jane: One more for the reading stack, quoted in response to how much time things take. I missed some context, but I think the point is people at higher levels of org should be thinking more long term. If they're not, you have a problem. #QConLondon https://t.co/WeysHdNiv2 @_3Jane: Yes, a #WardleyMaps name check! Anonymized map from a real client quoted. MC: I have a theory Wardley Maps are the new Conway's Law #QConLondon https://t.co/xGmcfUKq7C ## Evolving Java & the JVM ### Life Beyond Java 8 by Trisha Gee Tim Anderson attended this session: Java has a problem – the language and platform is evolving faster than ever, but many developers are stuck on the five-year-old Java 8. When Trisha Gee, a developer advocate for Java tool company JetBrains, surveyed Twitter ahead of a talk at QCon conference in London this week, 78 per cent stated they were using Java 8 – and considering that her following is likely to tilt towards the bleeding edge, the reality is likely even higher…. It's a shame, since there are a bunch of strong new features in later versions: JLink to create small Java executables for Docker images; Var for implicit typing; JShell interactive Java; improved collections; optional class; improved garbage collection; modularity and much more. So why have developers not upgraded? Simply, Java 9 introduced major changes, including internal restructuring, new modularity (known as "Project Jigsaw"), and the removal of little-used APIs. These changes broke code, and even developers who are happy to make the necessary revisions have dependency issues. … "I want to explain why it was necessary," said Oracle's Ron Pressler, part of the Java platform group developing the language and lead for Project Loom. "There are billions of lines of code in Java, and Java 9, it did break some things. The reason is that Java is 20-something years old. It will probably be big and popular in another 20 years. We have to think 20 years ahead. 
The way the JDK was structured prior to Java 9 was just unmaintainable. We could not keep Java competitive if we had not done that change. That was an absolute necessity."... At a QCon Java panel, Pressler expressed some frustration. "There is no fundamental reason why your Java code won't run on Java 9+. You may need to change access to old APIs etc. But it's not a different language." At the same time, he acknowledged that the current practice of giving Java a new version number every six months gives the wrong impression. "One of the biggest confusing things that we've done is to give the new six-month releases integer version numbers. So going from Java 9 to Java 10 you think that is a new Java major version. It is not. Java 10 is not a major Java release. It is a small release. The last ever major Java release was Java 9. There will be no more for the foreseeable future." ## Modern CS in the Real World ### Automated Test Design and Bug Fixing @Facebook by Nadia Alshahwan Mike Salsbury attended this session: Nadia Alshahwan presents her team’s effort to automate test design and bug fixing at Facebook with their “Sapienz” system. Sapienz aims to be smarter than a “random fuzzer” testing system and approximating human, manual testing. A testing system that improves through analysis of its random test paths seems an interesting approach, and their auto fix workflow brings tangible benefits to the development process. With the collected test data, the Sapienz team was able to identify the top causes for crashes of their Facebook for Android app, with Null Pointer exceptions leading the board by a long way. Their tool can suggest patches for certain types of issues and submit them for code review, where it auto-assigns as reviewers the developer who introduced the issue and anyone heavily involved in that part of the product. This is certainly something we would like to try back at Caplin! 
### Functional Composition by Chris Ford Twitter feedback on this session included: @nrp_1: "Musical notation is a particular kind of language designed to be executed on a particular type of finite state machine called a 'musician'" says @ctford #QConLondon https://t.co/H99fpbLl62 @iam_ijaz: @ctford : never do a live. programming demo , if you have to do a small one,if you still have to don't do in an unfamiliar language to the audience. But I'm going to break this rule today. #qconlondon @kegilpin: @ctford echoing Fred Brooks while live coding a musical performance: You need to think of your music as data. #QConLondon https://t.co/S6HuX2d2rK ## Modern Operating Systems ### A Journey into Intel’s SGX by Jessie Frazelle Twitter feedback on this session included: @ctford: "Now I'm paranoid too. Never read things." @jessfraz #QConLondon https://t.co/lqXsVFzELc @ctford: "It's not game over if they have access to your cloud environment, at least that's the promise of SCONE." @jessfraz #QConLondon https://t.co/U0gd4d04ne @MelanieCebula: If you actually don't trust cloud providers, then you can just host the service on your own and you don't need the cloud at all, especially if you're already running on prem @jessfraz #QConLondon ### Fine-Grained Sandboxing With V8 Isolates by Kenton Varda Twitter feedback on this session included: @jessfraz: By using the v8 runtime you can lessen the code footprint since you can automatically use anything you want in v8 and it supports wasm as well, pretty cool design #qconlondon @jessfraz: Everything has bugs we need to be thinking about risk management when there are bugs. - @KentonVarda #qconlondon @jessfraz: CloudFlare removed all timing and concurrency primitives (to prevent people from building timers) from their API in order to stop spectre from workers. They also have the freedom to reschedule. Neat! 
@KentonVarda #qconlondon ### LinuxKit by Avi Deitcher Twitter feedback on this session included: @jessfraz: Linuxkit is for runnable, immutable, disposable images Size is smaller, cycle time is faster, start time is faster, easy to debug, good performance, smaller attack surface - @avideitcher #qconlondon ### The Future of Operating Systems on RISC-V by Alex Bradbury Twitter feedback on this session included: @MelanieCebula: RISC-V is an open set architecture (ISA) that encourages custom extension and is open source hardware @lowRISC #qconlondon https://t.co/pD7si1GzYZ @MelanieCebula: "this fundamental interface between hardware and software has [historically] remained non-standard and not open source" @lowRISC #qconlondon @justincormack: If you have an Arm license you still can't add custom instructions for your use case. #QConLondon @justincormack: Current AMD CPUs have 15 or so different ISAs for different parts like power management and so on. These could all be embedded RISC-V cores. @asbradbury at #QConLondon @MelanieCebula: RISC-V has 3 privilege levels rather than 2 (Machine, Supervisor, and User) #QConLondon https://t.co/1Ly3v1tZm0 @justincormack: Ingredients for rapid hardware/software innovation @asbradbury at #QConLondon https://t.co/fvUrTp5mcK @MelanieCebula: end goal: more rapid innovation in hardware (sorely needed post spectre/meltdown and insecure hardware) #qconlondon https://t.co/nUwn4t8s2D ### Unikernels Aren’t Dead, They’re Just Not Containers by Per Buer Twitter feedback on this session included: @justincormack: Why don't our computers have a separate control plane?
@perbu kicks off on unikernels at #qconlondon https://t.co/iCoVdx7gXE ## Operationalizing Microservices: Design, Deliver, Operate ### Complex Event Flows in Distributed Systems by Bernd Ruecker Mike Salsbury attended this session: Bernd Ruecker – Complex Event Flows in Distributed Systems. One of the main topics at this year’s QCon was tracing and monitoring. Be it through microservice architecture or, as described in this talk, decoupled systems using event-based flows, it is easy to lose sight of the larger-scale flow and end up not knowing what is going on anymore. To help us regain sight and control, we need tracing and monitoring tools. But besides just adding the tools, we need to add the right metrics and understand what they do. ### Cultivating Production Excellence - Taming Complex Distributed Systems by Liz Fong-Jones Twitter feedback on this session included: @ctford: You might try and buy DevOps... @lizthegrey #QConLondon https://t.co/HJoCCqFXhd @ctford: But there are no shortcuts to making your observability meaningful. @lizthegrey #QConLondon https://t.co/zKLnoHIZvV @ctford: So you're forced to make up for your monitoring's legibility deficit by relying on a few experts. @lizthegrey #QConLondon https://t.co/MdfKW0Fcu1 @sarahjwells: Need a people-first, not a tools-first approach to operations @lizthegrey - tools can't perform magic, we need to first know what we want to achieve by using them #qconlondon @danielbryantuk: "Without a strategy, no amount of tools will fix your software operations. Invest in people, culture, and process" @lizthegrey #qConLondon https://t.co/6jiLGK1gOH @cfhirschorn: "The company forgot that it is the people that operate the systems and that the systems need to be designed for humans." .@lizthegrey PREACH! #QConLondon @_3Jane: Operational overload due to tool-first approach and employee burnout. Tools won't solve the situation if the company hasn't worked its strategy out.
(Somebody tell this to everyone who screams AI please.) #qconlondon https://t.co/47SLgOdkzB @ctford: ProdEx requires both technical and human considerations. @lizthegrey #QConLondon https://t.co/fJqZb35aeW @ctford: You need to involve your team *and* people adjacent to your team. @lizthegrey #QConLondon https://t.co/InFwM9AK9h @danielbryantuk: "It's not okay to feed the machines with the blood of humans. People are important, and they will quit if you don't invest in them, and specify goals, and conduct planning" @lizthegrey #qConLondon https://t.co/fxTNuA7jGt @ctford: "How do we increase people's ability to touch production? How do we let them ask questions?" @lizthegrey #QConLondon https://t.co/7cHxbQLrlg @danielbryantuk: "To implement production excellence you will need to build everyone's confidence. Encourage the asking of questions, and measure and act on what matters" @lizthegrey #qConLondon https://t.co/Pfk8Wx9J8D @ctford: To figure out what too broken means, ask folks who have a product perspective. @lizthegrey #QConLondon https://t.co/k9VM8nmajC @_3Jane: Our systems are always failing in places... and it's fine. What matters is whether users are happy and base metrics on this (SLI). How do you know which ones are appropriate? Collaborate: ask your product manager! #QConLondon @danielbryantuk: "You should be able to detect system issues and be able to debug them by working *together*. Strive to eliminate complexity. Make the systems and metrics easy to understand" @lizthegrey #qConLondon https://t.co/TEz6WAPyFT @ctford: A good Service Level Objective is just at the point the user says "meh" and presses refresh. @lizthegrey #QConLondon https://t.co/qiuWPenCKL @ctford: An error budget gives you discretion over your focus. @lizthegrey #QConLondon https://t.co/PMna6JeBj9 @jpetazzo: A good reminder from @lizthegrey in her #QConLondon talk (paraphrasing) If a single blade of grass is brown, it's not a problem. We're not going to replant the whole lawn for that. 
Same things for our services: it's not necessarily a problem if a single thing is down. @ctford: You can work with the business to agree on a risk appetite. @lizthegrey #QConLondon https://t.co/93m641z0Yu @_3Jane: Good SLOs just barely keep the users happy. You have an error budget. Trade off errors for engineer sanity (less pagers) and future proofing (experimentation.) Save money on reliability if you're within budget. #QConLondon @danielbryantuk: Great commentary about SLOs and error budgets from @lizthegrey at #QConLondon. These can be used to make data-driven business decisions, such as whether you should push a risky new experiment to production or not https://t.co/vwCzRw2GI4 @ctford: SLOs will tell you when something's wrong, but they won't tell you how to fix it. @lizthegrey #QConLondon https://t.co/RfsWQj3LQN @_3Jane: Perfect is the enemy of good. Start by measuring anything. Then periodically re-evaluate based on user feedback. Improve your capabilities or metrics if needed. [Agile maintenance hey!] #QConLondon https://t.co/PzFMd1rxsr @danielbryantuk: "Outages in complex systems are never the same. You can't possibly instrument for all of the issues. Instead support debugging novel issues in production" @lizthegrey #qConLondon https://t.co/tIveQPbsUR @_3Jane: We can't foresee all possible failure modes. So, we need to make our system observable (it is not if we need to change code in prod to test a hypothesis about what went wrong.) Events in context, so that people can work out later what went wrong #QConLondon https://t.co/kiLpBnlhTq @ctford: Debugging is a team sport. You need collaboration as well as tools. @lizthegrey #QConLondon https://t.co/ITqMq6veJn @_3Jane: ...but the final component is collaboration. Debugging involves multiple teams, including non-engineers. Give them access to debugging tools, help them learn debugging skills. 
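The SLO and error-budget tweets earlier in this session describe a simple piece of arithmetic, which a minimal Python sketch can make concrete. The 99.9% target, the request counts, and the function names are illustrative assumptions, not figures or code from the talk.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates over the window."""
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> int:
    """Failures still allowed before the SLO is breached (negative means breached)."""
    return error_budget(slo_target, total_requests) - failed_requests


# Hypothetical window: 1,000,000 requests measured against a 99.9% SLO,
# with 400 failures observed so far.
allowed = error_budget(0.999, 1_000_000)
left = budget_remaining(0.999, 1_000_000, 400)
print(f"allowed failures: {allowed}, remaining budget: {left}")
```

While the remaining budget is positive, a team has room for risky releases or experiments; once it goes negative, the same number argues for prioritising reliability work.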
#QConLondon https://t.co/HhV50lXMPv @_3Jane: ...followed by a call for empathy from management to not place mothers on call at night. [Have I mentioned cries for sanity?] #QConLondon https://t.co/xvFgO1eHsx @danielbryantuk: Much like history in general, "outages don't repeat, but they often rhyme..." Risk analysis will help you plan and prioritise @lizthegrey #qConLondon https://t.co/jti3ikYR6K @ctford: There's a business case to observability. Make it. @lizthegrey #QConLondon https://t.co/gBzZsCZ8oL @_3Jane: Document things. Stop hero culture. Quantify risks and prioritize mitigation accordingly. Prioritise completing the work. #QConLondon https://t.co/WucF6SUHaM @danielbryantuk: "Lack of observability is systemic risk. So is lack of collaboration and trust" @lizthegrey #qConLondon https://t.co/jVtNwrn9bm ### Reactive Systems Architecture by Jan Machacek, Matthew Squire Mike Salsbury attended this session: The talk Reactive Systems Architecture by Jan Machacek and Matthew Squire was again about regaining sight of our distributed systems by using tools for observability and monitoring. In this talk, the emphasis was mostly on checking the default setups of your tools and environments in terms of timeouts, retries, pool sizes and so on to gain control over the amount of logs and information you might collect. Make sure you configure your incident management and review it from time to time. Another buzzword mentioned in this talk was chaos testing, which seems to mean using all kinds of input to test your system. ### What Lies Between: The Challenge of Operationalizing Microservices by Colin Breck Marcus Craske attended this session: Colin gave a good insight into how Tesla is successfully using Kubernetes in a mechanical environment, for energy, at scale, and some of the important lessons learned.
These include storing KPIs/metrics of services in a time-series database for observability, handling failure in business logic, and using exponential backoff for erroneous requests to prevent a cascading failure (caused by overloading an unhealthy service with too many requests). An important point was made about how observability, as a term, is being abused in the microservice world, and how it can be used in its original definition from control theory. Twitter feedback on this session included: @lizthegrey: @breckcs Many of us have embraced microservices. But are your microservices like homogeneous bricks, or like an uneven stone wall? In either case though, there are cracks between each individual service. Managing the space between, & the people/orgs is the biggest challenge. #QConLondon @lizthegrey: Each service is its own failure domain, which is a blessing and a curse. Will k8s solve the problem for us? No, it won't. Orchestration will manage the boxes, but what we put inside matters. Three challenges: integration, observability, and embracing failure. #QConLondon @lizthegrey: We're not integrating services just because we can, we're trying to solve problems for our customers. This gets even more difficult when you're managing physical infrastructure, or you have single copies up to fleets of hundreds. #QConLondon @lizthegrey: Individual nodes have byzantine fault modes, including reporting ghost events from months in the past, going missing, and so forth. We may get different results if we try to re-process data from the past. It's not deterministic. #QConLondon @lizthegrey: We need to establish taxonomies of assets, and be able to deal with missing data. GROUP_BY queries often omit knowledge of what's missing. Our models and aggregates change over time as we reassign items in the hierarchy. Nothing is static.
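The point about GROUP_BY queries hiding missing data can be illustrated with a small Python sketch. The asset names, readings, and function names below are hypothetical and are not taken from the talk; the sketch only shows why an aggregate over what arrived has to be joined back against a registry of what was expected.

```python
from collections import defaultdict

# Hypothetical asset registry and telemetry; "pump-3" has reported nothing.
ASSETS = ["pump-1", "pump-2", "pump-3"]
READINGS = [("pump-1", 10.0), ("pump-2", 7.5), ("pump-1", 12.0)]


def mean_by_asset(readings):
    """Plain group-by: only assets that reported appear in the result."""
    sums, counts = defaultdict(float), defaultdict(int)
    for asset, value in readings:
        sums[asset] += value
        counts[asset] += 1
    return {asset: sums[asset] / counts[asset] for asset in sums}


def mean_with_missing(assets, readings):
    """Join against the registry so silent assets show up as None."""
    grouped = mean_by_asset(readings)
    return {asset: grouped.get(asset) for asset in assets}


print(mean_by_asset(READINGS))               # pump-3 is silently absent
print(mean_with_missing(ASSETS, READINGS))   # pump-3 is present with None
```

Driving the result from the registry rather than from the data is one way to keep "missing" visible; it assumes the registry itself is kept current as assets are reassigned.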
#QConLondon @lizthegrey: There's lineage and custody issues over the data -- can you see the data from your older thermostat if you replace it with a newer one from a different manufacturer? Is data expected, desired, and/or what's actually flowing in? #QConLondon @lizthegrey: We have to handle the uncertainty by writing code. And being able to evolve it serverside can help! Just like physical assets, our microservices can evolve and their telemetry and set of relationships changes over time. #QConLondon @lizthegrey: Connecting everything to a single event bus doesn't decouple us from the consequences of our data -- it forces everything to consume the whole firehose, schema changes are still hard, and we can't go back in history. #QConLondon @lizthegrey: We can use a hybrid model of writing data to a database, but then also producing a change log for every service that needs everything downstream. IoT allows us to store the data for an item where it actually lives. But the time dimension never goes away. #QConLondon @lizthegrey: Sometimes you need to find out about things quickly, or have a fast ability to query only the most recent data; but other things depend upon just batch database querying. Or more different kinds of queries and latencies. #QConLondon @lizthegrey: be aware of edge-gated systems -- if you miss a message, you may lose state, but you can bound the error by periodically sending the current state checkpointed so that you only need to look back that far to catch up or sync. #QConLondon @lizthegrey: What about security and access controls? The list goes on and on. You need to model uncertainty in business logic; you need to have consistent asset models from varying sources of truth, and make sure you can express temporal and other relationships. #QConLondon @lizthegrey: Now let's talk about observability -- he doesn't like the term. and "o11y" isn't much better, although it's shorter. 
It's a word from control theory about gaining insight from direct/indirect measurements and estimations to ask arbitrary questions. #QConLondon @lizthegrey: Often used today to talk about: Recording requests/responses through e.g. logs metrics, traces Sample statistically for cost However, the dual is controllability. and that asks specific questions of the state space, and event/streaming matters not just interactive. #QConLondon @lizthegrey: current discussion of o11y tends to neglect this. So let's talk about "pigging" of inspecting the pipeline. Our logging, tracing, etc. are valuable point in time tools but don't let us examine the full state of the system. We need continuous tools to understand. #QConLondon @lizthegrey: We don't operate pipelines with pigs, we use analog metrics like flow rates. We need to zoom out to get continuous improvement. Tagging events with static metadata doesn't give us the dynamics and operating parameters of the system. #QConLondon @lizthegrey: Colin finds that making metrics for each processing milestone, and taking rates of counts allows the most flexibility. This is the counter data type; the cardinality can be thousands to millions [ed: I worry that many current metrics systems can't handle this :/] #QConLondon @lizthegrey: The derivative of a count is a rate; averages are dividing two differences at runtime. All of this can be aggregated... [ed: but then you can only aggregate by what you've thought to aggregate by] #QConLondon @lizthegrey: We can also use counters inside of our agents for black-box probing, but for white-box situations we want to have sidecars reporting from inside of our systems. #QConLondon @lizthegrey: If you scrape rather than buffer and push the data from the sidecars, you miss data when the monitoring system goes down. And during a failure, you don't want your monitoring system to go down. 
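The counter tweets above (a rate is the derivative of a count, and an average is one difference divided by another) can be sketched in a few lines of Python. The sample values and function names are illustrative assumptions, and the sketch ignores counter resets, which a real metrics system would have to handle.

```python
def rates(samples):
    """Per-second rates between consecutive (timestamp_seconds, count) samples."""
    result = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        result.append((c1 - c0) / (t1 - t0))
    return result


def window_average(total_start, total_end, count_start, count_end):
    """Average over a window: change in a running total divided by change in a running count."""
    delta_count = count_end - count_start
    if delta_count == 0:
        return None  # nothing was processed in the window
    return (total_end - total_start) / delta_count


# Hypothetical monotonically increasing counter sampled every 10 seconds.
processed = [(0, 100), (10, 150), (20, 250)]
print(rates(processed))

# Hypothetical running latency total (seconds) and request count at two sample times.
print(window_average(100.0, 125.0, 100, 150))
```

Because both functions work only on differences between samples, the raw counters can be aggregated across instances first and the rate or average derived afterwards.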
#QConLondon @lizthegrey: Make sure that you have knobs to turn that can help you achieve your service level objectives. More services let us have more control points e.g. over scaling. You can scale up only the part that's most loaded. #QConLondon @lizthegrey: How do we deal with dimensionality and correlation? We're usually correlating time series, and using multivariate coordinates; we can examine the correlations, and show when we leave the envelope. It looks like systems engineering and process control. #QConLondon @lizthegrey: We'll wind up moving towards controlling sizing of our systems the same way we do industrial controls, to proactively tune our systems to remain within constraints while minimizing cost and predicting load. #QConLondon @lizthegrey: and we can use factorial designs to find more optimal operating conditions based upon experiments, optimizing for throughput or latency. this is testing in production which process industries do. So he's not happy with how o11y is used today: we'll need it later. #QConLondon @lizthegrey: We need to move beyond request-response, and model system dynamics. Okay, now onto failure. #QConLondon @lizthegrey: Our tests are like pigs - they can only go so far in eliminating failure, because of unknown unknowns. They can only check what we can think of as possible errors. #QConLondon @lizthegrey: Type systems won't serve us from logic errors or OOMing our processes. Functional programming won't save us either from OOMing or head of line blocking. Formal verification won't help us find the cracks either. Model checkers don't scale to large state spaces. #QConLondon @lizthegrey: You can't formally type check all of google3. Formal verification can be used for logic and algorithms; types can be used for compile safety, and functional programming can help with immutability. But they only help building each microservice. #QConLondon @lizthegrey: We need to embrace rather than prevent failure. 
Systems drop messages all the time. That's a micro-failure in a distributed system. Error handling is critical. "At scale, failure is normal." #QConLondon @lizthegrey: If we want different behaviors in each of our states in a finite state machine... we can have actors that execute the state machine, and mirror relationships between components and the models themselves. We can test what happens if there's failure. #QConLondon @lizthegrey: Streams embrace runtime failure by decoupling reading and writing, allowing orderly queuing. #QConLondon @lizthegrey: Reactive streams provide pushback and dynamic control without manual tuning. #QConLondon @lizthegrey: Backpressure allows the stream to flex and bend with the system dynamics. We can also implement exponential backoffs that way. But how can we get visibility into long-running, non-discrete streams? We should notice blockages and lack of inputs. #QConLondon @lizthegrey: "A person should not have to intervene in the middle of a night to restart a service." Can handling failure cause failure? Yes, sprinkler systems do sometimes flood buildings when there's no fire. Yes, improving reliability sometimes can worsen reliability. #QConLondon @lizthegrey: MySQL is the worst example of this - a failed login 100 times results in an automatic block, requiring a human to remove the IP address block. Only some clients will see a block, therefore. Argh. Also, cascading failures and retries (oh no). #QConLondon @lizthegrey: Sometimes it's better to fast-fail than to retry. And k8s can increment your workload with its health checks, which continue for the lifetime of the pod. And if they all fail at the same time, because they get slightly slower... your whole service dies. #QConLondon @lizthegrey: We need to handle dynamics and model behavior, rather than assuming individual services are correctly designed and foundational checks are working. And we need to not make things worse. 
The hardest problem of microservices is managing the spaces between. [fin] #QConLondon @lizthegrey: The challenges beyond our microservices are many though. Security, team dynamics, regulation, etc. We need to understand our interactions and take a systems view, embracing system thinking. #QConLondon @lizthegrey: Infrastructure alone isn't enough -- we shouldn't hide these problems from developers, only free them from thinking about the internal implementation. We need programming models for the cloud and runtimes that do the heavy lifting for them. Composable pieces. [fin] #QConLondon ## Security Transformation ### Speed The Right Way: Design and Security in Agile by Kevin Gilpin Twitter feedback on this session included: @kriswager: "Designing secure software is not like building a million cars" @kegilpin commenting on the fact that much of agile comes from car manufacturing #QConLondon @ctford: "The complexities of modern security are beyond that which can be remediated by automation." @kegilpin #qconlondon https://t.co/jx7EP4cJzX @ctford: "Accidents are part of being on the frontier. However we have to be honest about facing our mistakes." @kegilpin #QConLondon @ctford: The Swiss cheese model explains how threats become vulnerabilities when multiple "holes" line up. @kegilpin #QConLondon https://t.co/eunUrvoj2d @ctford: Using the kind of design analysis performed by the aviation industry, we can identify design flaws. @kegilpin #QConLondon https://t.co/e4XaUgMLU4 @ctford: "The order in which boxes are made carries information." @kegilpin #QConLondon https://t.co/4jtcpTpvVx @kriswager: "one problem with diagrams is that they make a lot more sense to people watching it get drawn" @kegilpin on the limits of whiteboard diagrams, which is a common design practice in the field #QConLondon @ctford: "When you use code as the design artifact, you exclude people who aren't programmers." 
@kegilpin #QConLondon https://t.co/bGZCS0ziYU @ctford: What makes a good design medium? @kegilpin #QConLondon https://t.co/oBG121Diab @kriswager: We need cognitive artifacts to document systems and we need to keep them up to date, according to @kegilpin #QConLondon https://t.co/NvcX2L9yHK @kriswager: @kegilpin Documents get outdated because they usually document the design rather than the result, so we need to ensure that design changes get reflected back into the documentation #QConLondon https://t.co/6eDHQQVfq9 @ctford: For gathering effective feedback, be clear to reviewers what kind of feedback you need. @kegilpin #QConLondon https://t.co/zd2ap2ePYY @guypod: Security design investments should grow by the amount of unknown aspects to the task at hand (and of course proportional to risk). #QConLondon @kegilpin https://t.co/lIkbZfqqdo @kriswager: "Verifying security in a new design isn't a matter of writing tests, because tests only verify that the code works like the design" @kegilpin #QConLondon https://t.co/6axco5woaf @ctford: My two takeaways from @kegilpin's talk on design and security: * No redundant system ever fails for a single reason, by definition. * Security is a systems problem, which makes it a social problem. #QConLondon ### The Three Faces of DevSecOps by Guy Podjarny Udi Nachmany attended this session: As Snyk CEO Guy Podjarny recently pointed out in his talk at QCon London, DevSecOps is a highly overused term that few stop to define for themselves in depth. In his approach, DevSecOps is actually an umbrella term for three areas of required transformation: technologies, methodologies and models of shared ownership. ## Surviving Uncertainty: GDPR, Brexit, or Politics? Beyond DR ### Avoiding Getting on the News by Investigating Near Misses by Ed Holland Twitter feedback on this session included: @lizthegrey: So we launched our app and it was a smash success, reaching 10M monthly users. Yaaay. Oh no. 
So we did premortems and risk analysis moving the post-its listing each possible risk around... but it only helped a little bit. #QConLondon @lizthegrey: Also did load testing before launch. Wait for incoming tickets. But can we be more proactive than just waiting to appear on the news? Instead, we need to investigate our near-misses before we make the news. #QConLondon @lizthegrey: Prevent tickets from even being created in the first place. [ed: yup, ticket means that a user is already unhappy or not automatically getting what they need...] #QConLondon @lizthegrey: But if we're updating the heart of the telephony core, we can't just easily experiment. Or can we? We get some "simple" problems, like dealing with leaking file descriptors that might go down all at once simultaneously exhaustion happens in sync... #QConLondon @lizthegrey: Other "simple" problems: internal vs external load balancing, where traffic is unpredictable and sticky, resulting in imbalances of traffic between nodes. #QConLondon @lizthegrey: Ignoring the spike in quiescent servers/overloaded servers close to a server restart. But before it. [ed: I appreciate the anecdotes, but I'm worried there's no overarching set of points beyond "react early to harbingers of mass failure"...] #QConLondon @lizthegrey: They were incorrectly handling the shutdown process and the retries involved; once they fixed it, the upgrade/resilience story got better. #QConLondon @lizthegrey: Another example: processing quotas. some nodes got very high tail latency. why? [ed: didn't quite get the explanation of why this was] #QConLondon ### Balancing Risk and Psychological Safety by Andrea Dobson-Kock Twitter feedback on this session included: @lizthegrey: @andrea_kock Risk is contextual. What's risky in one situation isn't necessarily another. Risk is essential to innovation; but people need psychological safety to take risks.
[ed: I'm writing an article for @Medium on misconceptions of psych safety soon so this is topical!] #QConLondon @shanehastie: #qconlondon @andrea_kock defining risk: hazard + exposure + vulnerability https://t.co/o4EK83OhUX @lizthegrey: Using the parable of Icarus: risk is a product of existing environmental hazards, prolonged exposure to those hazards, and the vulnerabilities that people's brains will ignore the first two despite the rational assessment. #QConLondon @shanehastie: #qconlondon @andrea_kock we don't use rational thought to assess risk - we use system 1 thinking https://t.co/6UWCl1BChz @lizthegrey: Two kinds of thinking: fast reacting (system 1), slower thinking (system 2). If we're lazy, we'll only use the first system, which is automatic and biased. We have to push ourselves into system 2 thinking; it's not automatic. #QConLondon @kriswager: A nice easy rational formula to do risk assessment, but unfortunately our risk assessment is rarely rational - @andrea_kock #QConLondon https://t.co/YdAcJoy1QI @lizthegrey: Illusions, for instance, trick us into using our systems 1 instead of systems 2 abilities. How can we address this pitfall? We have to recognize situations where mistakes are more likely. #QConLondon @shanehastie: #qconlondon @andrea_kock innovation and risk are related - risk pushes innovation It's not about avoiding risk, it's about managing it @lizthegrey: We have to be able to stop experiments when they're going too far or not leading to the results that we're hoping for. How do we get continuous innovation out? We need people to have growth vs fixed mindsets. Is skill a have/do not binary, or is it teachable? #QConLondon @lizthegrey: A belief in lifelong learning characterizes the growth mindset. But fixed mindsets react badly to the setbacks and mistakes that continuous improvement entails. The fear of failure creates change aversion. 
#QConLondon @lizthegrey: Change aversion comes from system 1 instincts to fight or flight when threatened. #QConLondon @lizthegrey: The amygdala reacts to fear and shame. How do we change from fixed to growth mindset? It takes effort. But you and your employees can do it. Make sure your *organization* is focused on learning, creating knowledge, and adapting. (citing Pieter Senge, 1990) #QConLondon @lizthegrey: but you need structure and building blocks to implement this. examples of success: honda, corning, GE. Amy Edmonson et al developed idea of how to create a learning organization, including a supportive learning environment, appreciating difference, openness to ideas. #QConLondon @shanehastie: #qconlondon @andrea_kock what's needed to create a supportive learning environment https://t.co/H31876n4EJ @lizthegrey: and you need time for reflection to innovate. Give people time to pause. Also, give people concrete learning processes and practices based on data. Make sure you can learn from failed experiments. #QConLondon @shanehastie: #qconlondon @andrea_kock Concrete learning processes and practices https://t.co/evoZj2T0ac @lizthegrey: Spread the word about what you learn. The talks that are about failures are even better than the talks about success, because they're learning opportunities for others. #QConLondon @lizthegrey: Make sure leadership reinforces learning. Encourage leads/managers to invite input, listen, and ask questions. Authoritarian, top-down teams have a hard time innovating. #QConLondon @shanehastie: #qconlondon @andrea_kock Psychological safety is a shared group belief that one can speak up and not be punished or humiliated https://t.co/kRokPZCmiz @lizthegrey: Edmonson didn't start off researching psych safety, but instead realized it explained teams of doctors/nurses at hospitals having better or worse performance records than others. 
#QConLondon @lizthegrey: They didn't make fewer mistakes, but instead the change in outcomes came from being able to address the mistakes before they compound into larger mistakes. Admitted number of mistakes didn't correlate to negative outcomes - in fact, the inverse. #QConLondon @lizthegrey: We get comfort if there are high psych safety and low accountability, but it leads to a lack of pushing for higher results. Instead, the goal in a growth industry should be to push people hard and make them safe, ensuring they have the tools and motivation to learn. #QConLondon @shanehastie: #qconlondon @andrea_kock high accountability and high safety is where you need to be - the learning zone https://t.co/jnuULRK0ot @kriswager: Psychological safety is necessary to do proper risk assessment/management according to @andrea_kock #QConLondon You also want demanding goals, otherwise you'll become obsolete https://t.co/aASFPnYy33 @lizthegrey: How do we create the shared belief of psychological safety? It doesn't take being a leader or manager; as a team member, you can set an example for others. Do experiments for curiosity and involve others. acknowledge fallibility. #QConLondon @shanehastie: #qconlondon @andrea_kock We can't prevent risks from happening but we can prepare and train ourselves to deal with the uncertainty from it https://t.co/3Q206KLXZV ### Change Is The Only Constant by Stuart Davidson Mike Salsbury attended this session: He started with huge enthusiasm and confidence and quickly showed us the group of six books he recommended as fundamental to the theories behind his talk. This included the ‘Phoenix Project’ (again), but also ‘The Art of Action’. And it was reading that book on the way home, that I learned about the military strategist Carl von Clausewitz and his ‘Friktion’, which is essentially describing the same Cognitive dissonance in organizations that I had been thinking about a few days ago. 
Stuart had great ideas and slides, including how Skyscanner uses a variation on Spotify’s Tribes and Squads engineering model to manage their teams and initiatives across teams. Ideas I’m keen to take back to Caplin, possibly starting, as The Art of Action did, with an appreciation of the lessons to be learned from an analysis of the battle of Jena-Auerstedt. ### Choosing Kubernetes: Managing Risk in Cloud Infrastructure by Ben Butler-Cole Twitter feedback on this session included: @ctford: "The value you build as a business is exactly the same as the value to your customers of delegating that risk." Ben Butler-Cole #QConLondon @timanderson: Ben Butler-Cole at Neo4J speaking at #qconlondon on managing risk and Kubernetes. Main takeaway: only take responsibility for risk that is within your core expertise. Secondary point: Google Kubernetes Engine has worked out OK. ### Risk of Climate Change and What Tech Can Do by Jason Box, Paul Johnston Twitter feedback on this session included: @lizthegrey: @climate_ice Greenland has 7 meters of ice [ed: think I heard that right... but captions would help], but due to CO2 levels and warming, we're already going to lose all of it and have sea levels rise. #QConLondon @lizthegrey: He works in Copenhagen creating automated ice measurement stations that measure movements of mass, snowfall/rainfall, and energy flows from sunlight/temperature. They check weather forecasts and climate models using this data. #QConLondon @lizthegrey: We can watch glaciers push ice out to sea as warming intensifies and see it break off over the years. #QConLondon @lizthegrey: Fractures propagate at the speed of sound, but takes years to build up and replace the ice that broke off. #QConLondon @lizthegrey: When lakes form, sunlight absorption over that area increases, exacerbating the effects of warming. Water draining away through the ice carves through the ice, creating more heat as well as propagating the heat from the warmer water. 
#QConLondon @lizthegrey: We face the thermal collapse of our ice sheets. Softer ice flows faster. The melt season has increased both in terms of the proportion of the year it's viable, and the amount of heat applied. #QConLondon @lizthegrey: Ice also erodes at the grounding line as water infiltrates, causing the ice to move faster into the sea. Fractures filled with water expand over time (hydrofracture) #QConLondon @lizthegrey: Global dimming temporarily slowed ice loss, but ice erosion came back with a vengeance after we stopped putting particulates into the atmosphere. #QConLondon @lizthegrey: so why is the heating happening? Because of the CO2 and other greenhouse gases we're putting into the atmosphere. Besides burning fuels... Wildfires increase with climate change, exacerbating the rise. But also crop clearing, cement production, etc, #QConLondon @lizthegrey: 90% of the warming is due to human activity rather than normal cyclical behavior. Like having a 1.5-watt lightbulb over every square meter of earth operating 24/7. #QConLondon @lizthegrey: We've overcorrected from avoiding another ice age pre-Industrial Revolution to cooking things, even if the Paris accord is passed. At least 2.7 deg C rise even with Paris Accord, and 5.8 deg C if business as usual. #QConLondon @lizthegrey: We can buy ourselves time if we pull that curve back down and save lives. But we have to treat it as a true crisis. "If your house is on fire, you get out buckets and put out the fire." but we're in denial. #QConLondon @lizthegrey: Our environment of stable sea levels and climate that facilitated civilization existing is about to change. It doesn't make sense to build docks if the sea moves much further inland or away from current shore. #QConLondon @lizthegrey: The sea levels have already begun rising - coastal areas are seeing infrastructure destruction even if Big Ben isn't yet half underwater. So how can we fix this? 
#QConLondon @lizthegrey: We need to halt carbon emissions *and* remove 200-500 giga-tons of carbon from the air. #QConLondon @lizthegrey: One thing we need to do is switch to gravity/hydroelectric powered datacenters (e.g. in Greenland) that are renewable and sustainable with low production cost. Problem is only 2 existing cables to Iceland and North America; would need to expand. #QConLondon @lizthegrey: Buying offsets which fund planting trees can take carbon out of the air. #QConLondon @lizthegrey: You can switch to purchasing renewable energy for your home. Community societies can create small energy companies. #QConLondon @lizthegrey: IoT device production and obsolescence will contribute to 3.5% of carbon emissions. Datacenters are rising from 2% of carbon emissions (equal to aviation) Efficiency won't save us because demand is still increasing. #QConLondon @lizthegrey: Each Bitcoin transaction is 193kg of CO2, or a car trip from London to Edinburgh and back. "That should horrify you." #QConLondon @lizthegrey: And we don't know the carbon footprint of each AWS instance, every Google search query, etc. #QConLondon @danielbryantuk: "Climate change is causing an increase in extreme weather events. It is also a business risk. You can make a difference" @PaulDJohnston #QConLondon https://t.co/vxgkLxixwE @lizthegrey: The business risks of climate change: what if there's a tax on carbon emissions? Best to start using carbon-neutral datacenters now rather than scrambling or paying the fine. [ed: I was pretty proud of what I saw from @gcpcloud. https://t.co/JQQdSn7LdX] #QConLondon @lizthegrey: Be aware of the physical risks to your suppliers of climate change. Move your business to renewable data centers. Encourage suppliers to do the same as well. Remember that computing is as big of an emitter as aviation. #QConLondon @lizthegrey: AWS is confusing -- because they have only 5 carbon-neutral regions, and the rest are not. they're only "planning to" with no timescale. 
They're marketing mostly but not providing public data. Consider moving to sustainable regions. write to them about the rest #QConLondon @danielbryantuk: "Technology matters in regard to climate change. Everyone at #QConLondon can make a difference" @PaulDJohnston https://t.co/KREOxE8HkZ @lizthegrey: .@azure offsets their carbon with certificates, and they offset their flights. Well done if you're already on Azure, but they need to improve direct emissions rather than only offsetting. "We love @GCPcloud. They are brilliant. They are the largest buyer of renewable." #QConLondon @lizthegrey: And GCP actually buys renewables *locally* rather than offsetting elsewhere. Alibaba doesn't seem to care about the issue, and uses coal grid electricity for their Chinese datacenters. #QConLondon @lizthegrey: Oracle claims to be very green -- some regions are 100% renewable like UK, but only 29-33% over next year in total. #QConLondon @danielbryantuk: The current gold standard for climate impact within cloud vendors is @googlecloud, according to @PaulDJohnston at #QConLondon https://t.co/m4Irw8LHzP @lizthegrey: IBM's commitment is very data driven and public; they procure 50% from local sources for datacenters. You can't buy a 100% renewable compute unit from them though. Renewable definitions differ. How do you categorize nuclear? [ed: I say it's carbon-neutral!] #QConLondon @lizthegrey: We need 10x the current amount of renewable capacity to power just our datacenters by 2025. [ed: build nuclear! it's baseline power load perfect for datacenters!] #QConLondon @lizthegrey: Offsetting just kicks the can down the road without managing demand and production. Stop buying things that are polluting with CO2. Tell your providers you want 100% renewable. And look at applications of ML (e.g. Deepmind) to optimize production & consumption. 
#QConLondon @noelwelsh: Sustainable servers by 2024 petition: https://t.co/LXtuYAcri6 #QConLondon @lizthegrey: "Climate change is our problem because we are the first generation to see the effects and the last generation who can fix it. You're now empowered to do something, so go fix it." [fin for real] #QConLondon ## Tech Ethics: The Intersection of Human Welfare & STEM ### Creating a Trusted Narrative for Data Driven Technology by Indra Joshi Twitter feedback on this session included: @lizthegrey: There's loads of software being written to address applying data using AI. "We can definitely get rid of doctors in 10 years..." is not a good VC pitch to make. The government wants to set up skills development, but also rules around AI and ethics. #qconlondon @lizthegrey: When they started writing the code of conduct, they already had the Health and Research Advisory body covering medical research. But does iterative product development play nicely with the process? What if your experiment changes along the way? #qconlondon @lizthegrey: https://t.co/AU6y0j0rdk -- patient data helps improve outcomes. Are people aware of whether their data is used for their own personal benefit vs. broader sharing for research? And professionals have to safeguard the system since users are data illiterate. #qconlondon @lizthegrey: We need both rules of the game and an ecosystem. Think of all of the players -- the government and non-profits committed to free at point of care healthcare, but also businesses and startups providing the technology. #qconlondon @lizthegrey: We need a regulatory feedback loop. We need to balance innovation and regulation. Make sure people are responsible with their development. For instance, mail order prescribing, following the GMC rules for remote prescribing. 
But following the rules is the *minimum* #qconlondon @lizthegrey: They decided that they'd prefer to make phonecalls to patients to have an individual human conversation rather than just taking them at their word on a computer screen that they needed pills. Even though it wasn't legally *required*. #qconlondon @lizthegrey: 10 principles of the NHS, including both beneficence and non-maleficence categories, a partial list: Are we solving a problem, or are we just using the data we have without considering the consequences? Here's an example: #qconlondon @lizthegrey: If you can predict someone will get severe disease, will it make their life better or worse to tell them, if they can't actually receive treatment? (e.g. if they're in rural India)? If people can't take action based on the results, then it just terrorizes them. #qconlondon @lizthegrey: There are real people underlying the data. Is it ethically okay to make use of the data? What are the consequences? Make sure that people using the system directly or indirectly can understand the system. One day it may be you in the A&E department. #qconlondon @lizthegrey: Regulation in individual jurisdictions gets murky at the borders; better to think about more universal principles. and we need to think about solving basic ethical problems like communicating genetic information/diagnoses intergenerationally in families. #qconlondon @jkriggins: If you are developing something for clinicians or the workforce, always consider ethics because the only thing people really ask in the healthcare industry is why. They don't really ask how, just why. @IndraJoshi10 at #QConLondon https://t.co/7iGshY0f38 @lizthegrey: Is someone entitled to the results of their parents' genetic tests if it may impact them or their children? What are the privacy consequences? #qconlondon @lizthegrey: Case studies and stories are how we learn and remember. Q: But how does this interact with technologists who try to make universal, inflexible rules? 
A: Doctors avoid exacerbating high-risk situations. Hard-and-fast "bring child in if they have a fever". #qconlondon @lizthegrey: or "don't discharge an elderly patient at 3am". But the in-betweens are the difficult cases. Empathy is what's critical to decision making when it's a grey area. #qconlondon @lizthegrey: Q: DeepMind, and patients' data being given to it without advance notice. Should there be an opt-out? A: That was a first case that resulted in large publicity due to companies involved. It's the first you've heard of but there are many other companies doing this. #qconlondon @lizthegrey: This isn't "evil Google"; instead, a systematic problem of understanding. There is an obligation to clarify the lines and publish info on governance on data exchange, uses of data for research vs. individual care. There is an opt out system maintained by NHS Digital. #qconlondon @lizthegrey: Q: how does regulation around privacy affect medicine? A: It's primarily a communications problem. But it's been helpful to have regulation, especially the right to explanation. #qconlondon @lizthegrey: A lot of the talks from both the tech and ethics tracks at #qconlondon boil down to the _Jurassic Park_ quote: "Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should." https://t.co/fY6aWTqy37 @lizthegrey: Expert advice on efficiencies doesn't necessarily align with the day to day realities. Example: expert said "put the printer closer to the blood collection", but didn't realize that the paper storage needed to be moved too. So they reverted the consultant's change. #qconlondon @lizthegrey: "Bring people with you and understand their needs, don't just force a solution upon someone and expect people to accept it." #qconlondon @lizthegrey: Expertise doesn't necessarily come from the ivory tower. Patients are the best experts on their own experiences, even if they're not medically trained. 
[fin] #QConLondon

### Effective Ethics for Busy People by Kingsley Davies

Twitter feedback on this session included:

@PaulDJohnston: Tech sector and the "third sector" (charity sector) are both massive worlds, but don't necessarily converge very often < @kings13y This is so so true #QConLondon

@PaulDJohnston: People are responsible for their reaction to events - a learning from Man's Search for Meaning < @kings13y #QConLondon

@PaulDJohnston: People can often find their significance from work, and so if they do, when they retire, can find that they need "pet projects" to find further significance @kings13y #QConLondon

@PaulDJohnston: Other ideas: Adoration e.g. seeing family again. Bravery e.g. finding the courage to get through. @kings13y #QConLondon

@PaulDJohnston: Making the point that if tech is non-diverse and is responsible for building AI/ML tools going forwards, then those biases will be carried forwards @kings13y #QConLondon

@PaulDJohnston: analyze, Act and Iterate @kings13y #QConLondon https://t.co/BZqIszpMjF

### Ethos(s): Enabling Community and Culture by Robyn Bergeron

Twitter feedback on this session included:

@jkriggins: Why is @robynbergeron at #QConLondon? Part of it is bringing her #workingmoms and #opensource community lessons together. Openness, transparency, collaboration lessons. https://t.co/40TXwdNUEi

@charleshumble: I have a sense of imposter syndrome that is probably not as big as yours because, you know, imposter syndrome. @robynbergeron #QConLondon

@charleshumble: Users are potential contributors. If you have no users, you will have no contributors. @robynbergeron on open source at #qconlondon

@charleshumble: "Some universal rules for OSS: Be honest. Be transparent. Keep the bar as low as possible (kindness, listening etc.) Community process or rules should apply equally to everyone. (or should be clear about the circumstances where it won't)."
@robynbergeron #qconlondon

@PaulDJohnston: Monolithic open source projects are less likely to get contributors because everyone has to know everything... ... modular-plugin style projects allow people to be experts in small elements -> greater participation without having to know everything @robynbergeron #QConLondon

@charleshumble: If you are a participant in an Open Source project: Be honest about your intent Speak up when you have concerns about the health or operation of the community @robynbergeron #qconlondon

@garethr: "Honesty and transparency of intent is critical for healthy open source communities" #QCONLondon

@PaulDJohnston: "Sometimes there are business decisions..." @robynbergeron Yup! Even for open source companies #QConLondon

## The Right Language for the Job

### WebAssembly and the Future of the Web Platform by Ashley Williams

Twitter feedback on this session included:

@justincormack: Programming languages, time vs metal. @ag_dubs on Wasm the first new low level language in 30 years #qConLondon standing room only https://t.co/PfavpILhKE

@avideitcher: Wasm is like jquery for cpu architectures, says @ag_dubs at #qConLondon

@timanderson: My dream is that "every platform becomes the web platform" says Ashley Williams #qconlondon

@timanderson: "I love the predictability of not having a GC" says Ashley Williams; yes it does have good points :-) #qconlondon

## Solutions Track I

### Traces Are the Fuel: Making Distributed Tracing Valuable by Ben Sigelman

Twitter feedback on this session included:

@lizthegrey: He starts by critiquing observability dogma [ed: and this critique I agree with!] If Google and Facebook use Metrics, Logs and Distributed Traces, then we should also use those "three pillars", right? No, because each of them is flawed in some way.
#QConLondon @lizthegrey: Each of them is missing a different one of {TCO scales gracefully, accounts for all data without sampling, and is immune to cardinality} Searching through all your logs just doesn't work. It's proportional to number of servers but value doesn't rise in proportion. #QConLondon @lizthegrey: Grouping by tags for metrics only works if your metric store was already set up for the tag you wanted. He needs to explain the cardinality problem because only 5% of audience has heard of it. You can only have ~100 tag values per tag, which is sad face. #QConLondon @lizthegrey: And traces are sampled, so you can't necessarily catch low-probability events. They don't make for good solutions on their own. What would an ideal solution look like? High-throughput, high cardinality, unsampled, and weeks of retention. But we have to pick 3. #QConLondon @lizthegrey: They're ultimately just data, rather than complete solutions. There are three pipes of data coming at us, rather than structural pillars. #QConLondon @lizthegrey: .@el_bhs thinks APM is focusing on the wrong thing for us (we care about services, not top-level applications). we need service-centric observability. We have SLIs for services but they don't align 1:1 necessarily with the application's SLIs. No more monoliths. #QConLondon @lizthegrey: Each service needs to only immediately care about its upstream users and downstream dependencies. We need to only know what data gets sent up and down. Narrows required scope of understanding. [ed: can confirm, most Google SREs know up/down 1-2 layers from their svc] #QConLondon @lizthegrey: But if a downstream dependency multiple layers deep is having a bad day, the decoupling makes debugging it harder. Observability is needed because systems have become more complex. He wishes there was a shorter word for it though. #QConLondon @lizthegrey: Control theory definition of o11y: understand system from its outputs. 
You need it to generate enough valuable outputs to understand it from the outside. And when we decouple, we no longer can find the right stack trace. So we need different tools. Demoing lightstep #QConLondon @lizthegrey: making analogy to chrome request inspector, etc. for people who've never seen a trace display UI. Have to label the critical paths rather than parallel requests that don't matter. Teams used to waste time optimizing performance outside the critical paths. #QConLondon @lizthegrey: Distributed traces alone can find egregious performance problems downstream. You can see the downstream span get super long, even if you don't know why under the hood. #QConLondon @lizthegrey: He's also demonstrating what I know from Dapper as annotations -- micro-logs stored within the span relative to the start of the span for intra-process events. #QConLondon @lizthegrey: Each transaction can generate one distributed trace; it crosses the boundaries between microservices so we don't have to step through one layer at a time like we used to. #QConLondon @lizthegrey: But there's too many traces to centralize in a single way, you can't easily search over them, and hard for humans to interpret them. [ed: this was the largest complaint I had about Dapper before George and I productionized exemplars -- too hard to find right trace] #QConLondon @lizthegrey: Distributed tracing is a *technique* that makes distributed traces (the raw data) valuable. "We're scratching the surface of what's possible." They're "the most important signal for understanding a microservice architecture." [ed: and traces are aggregated events] #QConLondon @lizthegrey: Yup, and he acknowledges that if you can't relate events together and find causality, it's hard to do the analysis you need. [ed: we're 100% agreed there! events in a vacuum are less useful than when you see how they relate] Now a refresher of SLIs. 
#QConLondon @lizthegrey: It's not about the inner workings, it's about the user-visible performance. Two workflows: we need to either gradually improve an SLI over months, or we need to rapidly restore a SLI to health. [ed: applause from here! yay consistent messaging :)] #QConLondon @lizthegrey: "Performance work or incidents that aren't about an SLI are a waste of time. Don't have alerting rules for internal implementation details." [ed: It's almost like @el_bhs has seen my talk already!] #QConLondon @lizthegrey: You need to measure your SLIs precisely to be able to detect problems, and you have to be able to iteratively form hypotheses to explain variance from the expected performance. [ed: omg yes yes yes] #QConLondon @lizthegrey: Service-centric observability is about starting with measuring the SLI, finding variance, and being able to explain it. The tooling approaches are similar both for slow and fast SLI debugging. Now showing how this would work, using tools at hand [e.g. lightstep] #QConLondon @lizthegrey: He notes that he would do this in a vendor-neutral way, but that there's no current vendor-neutral tooling available for demoing. [ed: there is opentracing and opencensus for collecting, but the rendering and data analytics is all everyone's secret sauce...] #QConLondon @lizthegrey: "Performance is a shape, not a percentile. and we don't actually care about p99s, we care about user experience." [ed: omg yes.] #QConLondon @lizthegrey: It's important to be able to view individual traces. [ed: or events!] from a time window and latency range to narrow down on the anomalies you're wanting to study. #QConLondon @lizthegrey: "An average latency of 5ms vs 100ms on different services doesn't matter, what matters is how it compares to each service's SLI." 
--@el_bhs #QConLondon @edith_h: Don't let your architecture diagram be so complex that no one can follow it - there's no winner in the micro service complexity contest @el_bhs #QConLondon https://t.co/OtsKLuL2Wm @lizthegrey: And you can also try to find what's in your critical paths downstream to determine what to speed up, and help you ignore the non-critical spans. #QConLondon @lizthegrey: "The view of a single trace doesn't hold a candle to looking at the aggregate view and nuanced perspective on the whole system from several hundred datapoints." #QConLondon @lizthegrey: This isn't feasible without a lot of distributed traces. Finally, explaining variance. Dimensions/tags can explain variance in traces and metrics; but traces [ed: and events!] allow more exploration of high cardinality data. #QConLondon @lizthegrey: "Exiting your workflow and tool is disruptive. Instead, try to find commonalities between tail latency, high-cardinality tag values, and your SLIs" Make sure you can overlay tags that are overrepresented in the tail. [ed: @FisherDanyel's Bubbleup does this too!] #QConLondon @lizthegrey: So, if you use microservices, use distributed trac*ing* to make sense of your distributed traces. "You don't need to buy Lightstep but you do need something distributed tracing shaped. There are other solutions out there, but don't stop at just visualizing one trace." #QConLondon ## Solutions Track III ### How to Feature Flag (Poorly) & Lessons Learned by Edith Harbaugh Twitter feedback on this session included: @lizthegrey: @edith_h 3/4 of this audience is already using feature flagging! Yay! The number goes up year on year! and about 10% of folks are in process of learning. #qconlondon @lizthegrey: It sucks when nobody likes and cares about the "clever feature" you developed. How could we prevent this catastrophe from happening? And how can we cope with being "that bad vendor" that broke you? 
#qconlondon @lizthegrey: "Product managers can fix everything, you just have to build the right thing, right?" It turns out there's no one right answer, but you can be more agile and iterative.' And she learned the importance of marketing from that too. [ed: all prep for becoming CEO!] #qconlondon @wiredferret: #QConLondon @edith_h: If you're running a production system and a vendor ships you something that breaks your system, they're mad and it's justified. After my time in development, I decided that the problem might be solved by doing product and building the right thing. @lizthegrey: All of these problems of breakage and testing the product could have been addressed through feature flags. We don't need yearly waterfall any more, we have weekly or more frequent releases. #qconlondon @wiredferret: #QConLondon @edith_h: In the last 10 years, the average time to release an application has gone from years to weeks. Feature flags are part of the solution for Microsoft and many others. @wiredferret: #QConLondon @edith_h: Small changes are faster and less risky for release and keeps the spirit of agile. @lizthegrey: By separating deployment from release and getting quick feedback, you can move faster and be more innovative. @LaunchDarkly's goal is to make it easier to manage feature flags. 200 billion features evaluated per day. [up 10x since last printing of brochures!] #qconlondon @lizthegrey: People start out using feature flags as a kill switch to roll back unexpected behavior from new features. It's the worst possible time to try to do a release when something is already on fire. It takes time to roll out. Instead, just press 'stop' on a control panel. #qconlondon @wiredferret: #QConLondon @edith_h: Atlassian takes from minutes to 3 hours to deploy. LaunchDarkly gives them the ability to stop a misbehaving feature instead of doing a redeploy. @lizthegrey: Take your time diagnosing once you've mitigated! [ed: yes! this is in my talk too!] 
Panic releases worst releases. You make the most mistakes then, especially if you're trying to fix forward. #qconlondon @wiredferret: #QConLondon @edith_h: Anti-pattern: Using feature/long-lived branches to do fixes for particular clients. Then the merge party involves a lot of yelling about trying to unify code bases. @lizthegrey: "Friday night merge parties are not actually a party." Merge early and often but control behavior in prod with feature flags. #qconlondon @lizthegrey: Controlled rollouts let you have more flexibility in terms of when things can go wrong, letting you get your lunch instead of having your Nandos visit interrupted. #qconlondon @lizthegrey: Early access betas for your earliest customers don't work if customers have to use the beta service URL instead of the real server, and risk losing anything they do on that copy... #qconlondon @lizthegrey: Feature flagging doesn't have to be per server, it can be per user or customer instead. Your users are more forgiving if they know they can just ask to have a broken thing turned right back off. And people like getting the latest and greatest. #qconlondon @lizthegrey: Your most enthusiastic customers are likely to also be the most forgiving of bugs. You can also block individual users' access to features to prevent competitors or press from seeing experiments. #qconlondon @lizthegrey: "Test in production" provokes mixed feelings in people. Are we being careless cowboys? No. It's the opposite of that. We're finding all the actual issues in a responsible way. "The lab is very different from people running through an airport trying to look up gates." #qconlondon @wiredferret: #QConLondon @edith_h: Kill your staging server! What is a staging server for, and what is the value and purpose? @wiredferret: #QConLondon @edith_h: We have a customer who saved 10s of thousands of dollars by killing off staging and going to production more quickly. 
@lizthegrey: You can also treat subscriptions as feature bundles, lowering the cost of managing your customers and controlling their feature delivery. #qconlondon

@lizthegrey: and when you need to rip something out, flag it down to 0% then remove it knowing nothing is using it. #qconlondon

@lizthegrey: But what are the pitfalls of feature flags? Two words: Knight Capital. It lost $500M in 20 minutes and went out of business. They used confusing flags for multiple purposes, causing the wrong flag to be turned on in a HFT system. #qconlondon

@lizthegrey: Don't use ambiguously named flags. Don't name every flag after fast food. What does "chicken nuggets" mean to you at 4am? #qconlondon

@lizthegrey: You may be called upon to explain them to a customer. Don't make your salespeople and customers unhappy. Define what off and on mean for a flag, and don't overuse/overload them. #qconlondon

@wiredferret: #QConLondon @edith_h: Be really clear if a flag is on or off. Decide on a convention and stick to it, because a name can be very confusing, such as User setting state enabled.

@lizthegrey: Lack of communication and unclear usage may cause teams to get out of sync about why a flag is on, causing them to set it to the wrong value! :( #qconlondon

@lizthegrey: Don't set up conflicting flags. There's no one generic answer, it depends upon your architecture and people. Make sure people are communicating enough so they don't create conflicting behavior. #qconlondon

@lizthegrey: Remove flags you're not actually using so someone doesn't accidentally turn them on. #qconlondon

@wiredferret: #QConLondon @edith_h: Antipattern: Conflicting flags. It's easy to have flags at different levels of control that interact and conflict with each other without being able to see what the team had intended.

@lizthegrey: Make sure you have appropriate controls and permissions to prevent flags from being changed by people who don't need access. And keep on top of your tech debt. Clean up leftover flags. #qconlondon

@wiredferret: #qconlondon @edith_h How do you do flags well? * Flag carefully and mindfully * Lock down access to flag changes * Remove flags when they have passed their useful period https://t.co/E5WGie0eD6

@lizthegrey: Know what the lifecycle of your flags is. Are you using it for ops, for feature management, etc. -- and when will you remove it? Managing many feature flags across many teams gets exponentially hard without tooling. Have ACLs on your flags. #qconlondon
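
The advice in the tweets above (give each flag a clear purpose, lock down who may change it, and remove it once it has served its purpose) can be sketched in a few lines of Python. This is a minimal illustration and not how LaunchDarkly or any other product works: `FlagRecord`, `set_flag`, `stale_flags`, and the sample flag names, owners and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FlagRecord:
    """Hypothetical bookkeeping for one flag; not any vendor's schema."""
    key: str              # descriptive name, not "chicken-nuggets"
    purpose: str          # e.g. "ops" kill switch or "release" toggle
    owner: str            # team to ask about the flag at 4am
    remove_by: date       # agreed date by which the flag should be deleted
    editors: frozenset    # the only users allowed to change the flag
    on: bool = False      # convention: True always means "feature enabled"


def set_flag(record: FlagRecord, user: str, value: bool) -> None:
    """Change a flag only if the user is on its access list."""
    if user not in record.editors:
        raise PermissionError(f"{user} may not change {record.key}")
    record.on = value


def stale_flags(records: list, today: date) -> list:
    """Return the keys of flags that are past their agreed removal date."""
    return [r.key for r in records if r.remove_by < today]


# Usage with made-up data
checkout = FlagRecord("new-checkout", "release", "payments-team",
                      date(2019, 4, 1), frozenset({"edith"}))
banner = FlagRecord("old-banner", "release", "web-team",
                    date(2019, 1, 1), frozenset({"sam"}))
set_flag(checkout, "edith", True)
print(stale_flags([checkout, banner], date(2019, 3, 8)))
```

Keeping the owner and removal date next to the flag itself is one possible way to make leftover flags visible before someone turns them on by accident.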

@wiredferret: #QConLondon @edith_h: Make releases easier and less frightening by making sure you're doing progressive delivery and allowing instant feature-off.

@thetruedmg: Push and Pray - old style deploys to Live. With flags, can deploy and only allow QA to see the changes, who can then make sure this works on Live before switching the flag on for everyone #QConLondon

@lizthegrey: Remember that while developers put in the flags, they're not the only users. QA can test in prod, safely, by seeing the feature flag turned on for them first before it rolls out to the rest of the prod. Marketing/customer success can handle betas on the same platform. #qconlondon
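
The per-user targeting described in this session (QA sees a feature in production first, then it rolls out gradually, with an instant off switch) could look roughly like the sketch below. It is an illustration under stated assumptions, not LaunchDarkly's API: the `Flag` class, `bucket`, `is_on`, the group names and the hashing scheme are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    """A hypothetical feature flag; not any vendor's data model."""
    key: str
    enabled: bool = True                              # kill switch: False means off for everyone
    allowed_groups: set = field(default_factory=set)  # groups that always see the feature, e.g. {"qa"}
    rollout_percent: int = 0                          # share of all other users who see it (0 to 100)


def bucket(flag_key: str, user_id: str) -> int:
    """Map a user to a stable bucket in 0..99 so repeated checks agree."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_on(flag: Flag, user_id: str, groups: set) -> bool:
    """Decide whether this user should see the feature."""
    if not flag.enabled:                 # kill switch wins over everything else
        return False
    if flag.allowed_groups & groups:     # targeted groups, such as QA or beta customers
        return True
    return bucket(flag.key, user_id) < flag.rollout_percent


# Usage: the code is deployed to production, but only QA can see the feature
flag = Flag(key="new-checkout", allowed_groups={"qa"}, rollout_percent=0)
print(is_on(flag, "alice", {"qa"}))        # True: QA tests in production
print(is_on(flag, "bob", {"customer"}))    # False: not released to customers yet
```

Raising `rollout_percent` step by step gives a controlled rollout, and setting `enabled` to `False` turns the feature off without a redeploy. Hashing the flag key together with the user id is one possible design choice; it keeps each user's answer stable between requests.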

@lizthegrey: Audience question: what is it not good for? A: things with giant persistent state e.g. database migrations/schema changes that aren't forward/backward compatible. #qconlondon

Avi Deitcher attended the conference:

Arguably, QCon is one of the best run conferences I have attended. It isn’t just the smooth administration, although that played an important part. Rooms were ready and prepared; audio-visual worked; sessions started on time and ended on time. They had people at the back of the room - where attendees couldn’t easily see them but presenters could - who had very large signs. The first was a yellow yield sign saying, “5 minutes”, while the second was a big red stop sign. This wasn’t just for sessions, but keynotes as well.

Scheduling was important. There were 25 minutes between each session. This not only provided sufficient time to have a few post-presentation follow-up questions, use the facilities, get a hot drink and make it to the next session, even for the next presenter, but enabled - nay, encouraged - networking throughout the day. Most conferences have “mixer” times, usually at the end of the day. Not everyone can stay the whole day; many are tired; people have to fill booths.

Small touches made a big difference. The name badges were much shorter, hanging comfortably on the mid-chest rather than the stomach. I far prefer not having people try to decide how in-shape I am while getting my name (nor do I enjoy doing it to others). However, because you sometimes need to extend the badge, it sits at the end of two extenders, one on each side, to make it easy to pull when needed.

Even better, first names were printed in LARGE CAPITAL LETTERS. This made it very easy to see someone’s name without really looking down at all. One of the surprising effects was that speakers were able to see the names of the first 5-6 rows of the audience. I never have had that experience as a presenter before. Even when I didn’t use the name of someone in the audience, as part of my own involving the audience or responding to a question, just seeing their name created a better connection and thus a warmer and more comfortable presentation. This was quite the pleasant surprise for me, both as a speaker and as an audience member.

At the end of each session, staff stood at the exit to the room with their own three-part badges: green, yellow, red. Each attendee put their RFID-enabled badge over the staff’s, voting if the presentation was right-on; needed some work; or just failed. They further encouraged feedback on that vote on the Website.

However, in many ways, the most important part was not all of the above, as good as it was. The best part was the quality of the presentations. It came down to three key things they do:

• Each day is divided into six tracks, each track has a chair, and each speaker is selected, or curated, by the track chair.
• There are no sponsored presentations or keynotes. There was one sponsored track, but they were very explicit about it.
• Finally, QCon invest in speakers.

They really invest. There are optional Webinars, an interview, and even a rehearsal session with an experienced QCon speaker or the chair (Wes) himself. They believe great speakers are made, not necessarily born, and they probably are right.

I got the opportunity this year to attend the QCon London conference, run by InfoQ, and would absolutely recommend it to anyone interested or working in the software engineering industry who is proficient with technical language and subjects. Most talks were aimed at a general, wider audience, from engineers to CTOs, with abstract high-level material, rather than getting too down and dirty with the technicalities of specific languages or frameworks.

Guy Podjarny attended the conference:

QCon is one of my favourite conferences, and was a great home for such a track (DevSecOps - ed.n.), since its audience, for the most part, is senior enough to have deep conversations, and pragmatic enough to understand the world isn’t black and white and no solution is perfect. On top of that, the conference organisers run an amazing production, including the best collection of audience feedback I’ve seen.

@SonOfGarr: One of the best views from a conference I've ever had. Well done #QConLondon https://t.co/tDw22gpA5F

@MelanieCebula: Love this location! #qconlondon https://t.co/PnzTtQFCaM

@aboutchrisw: First day of #qconlondon has been great. Some interesting talks and of course, lots of goodies! Looking forward to more of the same tomorrow #LifeAtCapgemini https://t.co/GBdSnIRZyg

@Shaleenaa: So far I'm pleased with the quality of talks at #QConLondon, speakers really dive into the history of their story and provide insights on what it means to move forward intelligently

@jessfraz: #qconlondon is awesome :) I keep getting to have great technical conversations and ask smart people lots of questions like @perbu this morning

@jessfraz: The awesome thing about #qconlondon is somehow they got all the speakers you'd want for the right topics, RISC-V, Web Assembly and Rust... really a good event

@kriswager: So many great book recommendations at #QConLondon My to-read pile is going to grow a lot when I get home https://t.co/AS2kCdfrXm

@jessfraz: This has been an awesome conference, after speaking with others I think it's due to the super technical content, quality speakers, and lack of vendor talks #qconlondon so glad I could be here :)

@robjordan: I thought choosing a track to attend at #qconlondon was hard, but this choice blows my mind! https://t.co/2lOAKFV5JH

@timreid: #qconlondon continues to live up to expectations. highlight of the conference so far were contributions by @el_bhs

@TonyPrintezis: "Arguably, QCon is one of the best run conferences I have attended." : I haven't attended #QConLondon, but I TOTALLY agree with this for #QConSF and #QConNYC. They always do a great job.

## Takeaways

Mike Salsbury’s takeaways were:

We’ve attended a host of interesting and inspiring talks at QCon London 2019 and we’ve pencilled in many more talks to catch up on when the recordings become available. There have been some great ideas over the past three days that we can bring back to Caplin, and we’re already looking forward to next year’s QCon with a new range of topics to learn about and share with the team.

Takeaways from QCon London included:

@FraNobilia: #QConLondon has exceeded my expectations with #cultural and #career tracks. Well done @InfoQ

@cfhirschorn: There's nothing better than hanging out with folks you have massively respected *for years* to find out they are all super genuinely nice and approachable human beings. And very intellectually engaging. Thank you #QConLondon

@dunknicoll: Alllllmost home after travelling back from London. #qconlondon was a great experience. Fantastic talks, fantastic food, fantastic venue. If you ever get the chance, definitely check it out!

@MortenStromme: Time to fly home from #London and #QConLondon. Definitely a good conference.

@xjamundx: Finally leaving after an incredible 6 days. Thank you #qconlondon for having me! https://t.co/jF45sA1aPe

@GeertvanHorrik: Back from a great #QConLondon, gained great new insights. It was my first time but definitely worth going back next year. Great to be out of the #Microsoft echo chamber once in a while

@BlancaRojoM: Truly inspired after 2 days at #QConLondon . Thanks for all the amazing talks and the dozens of new things to think about https://t.co/8DfiHaRC1P

## Conclusion

InfoQ produces QCons in 6 cities around the globe. Our focus on practitioner-driven content is reflected in the fact that the program committee that selects the talks and speakers is itself comprised of technical practitioners from the software development community. In early May we’re in São Paulo, then New York in June, Shanghai in October and San Francisco in November. We'll be back in London on March 2nd-6th 2020.
