
Key Takeaway Points and Lessons Learned from QCon New York 2018

Conferences often have a theme that emerges during their course that none of us had predicted. One year at QCon London it was, bizarrely, cat pictures; every presentation you went to had a cat picture in it. Another year, as the microservices movement was just getting going, it seemed that we had mandated a Conway’s law slide in every presentation - we hadn't of course, but there certainly were plenty of them.

This year, at the seventh annual QCon New York, our second year in Times Square, the theme that emerged was diversity and inclusion. The event had a particularly positive atmosphere that made for something truly special, and we got a huge amount of positive feedback from attendees and speakers about it, both during the event and afterwards. This is something the QCon team has worked on for several years, and it felt wonderful to see that work starting to pay dividends.

From a content perspective, attendees at the event got to see keynotes from Guy Podjarny, co-founder of Snyk, talking about "Developers as a Malware Distribution Vehicle", Joshua Bloch giving "A Brief, Opinionated History of the API", and Tanya Reilly, Principal Engineer at Squarespace, giving a thoroughly interesting and unusual talk about the history of fire escapes in New York City and what we as software engineers can learn from them.

In total we had 143 speakers across the 117 sessions, workshops, AMAs, Open Spaces and mini-workshops. Topics included containers and orchestration, machine learning, ethics, modern user interfaces, microservices, blockchain, empowered teams, modern Java, DevEX, Serverless, chaos and resilience, Go, Rust, Elixir, and security.

Videos of most presentations were available to attendees within 24 hours of them being filmed, and we have already begun to publish them on the InfoQ site. You can view the publishing schedule on the QCon New York website.

InfoQ also reported from the event, and recorded podcasts with a number of speakers. This article, however, presents a summary of QCon New York as blogged and tweeted by attendees.

 

Keynotes

A Brief, Opinionated History of the API

by Joshua Bloch

Twitter feedback on this keynote included:

@lizthegrey: Subroutine libraries first appear in Goldstine and von Neumann's 1948 paper on programming methodology *before general-purpose computers physically existed*. #QConNYC

@lizthegrey: Key idea: programs require common operations. Library subroutines reduce duplicated code and number of errors. #QConNYC

@lizthegrey: Maurice v Wilkes: Second ever Turing Award given to Wilkes for subroutine libraries. Why didn't Goldstine and von Neumann get the award? It was vaporware at the time. #QConNYC

@lizthegrey: EDSAC was the real deal -- world's first stored-program computer; was immediately useful. 650 instructions per second. #QConNYC

@lizthegrey: 4 million times slower than a modern PC, and 4 million times less memory; 100 times the power and 1000 times the size. But it changed the world. #QConNYC

@charleshumble: EDSAC was 17 bit. Don’t ask why. I do know but you don’t want to. @joshbloch #QConNYC https://t.co/xGKj1RXS02

@lizthegrey: Why was he successful? Kept it simple and based it on conventional architecture and conservative electronic design. #QConNYC

@lizthegrey: The first two EDSAC programs ran in May 1949, printing the first 100 squares and the first 170 primes. #QConNYC

@lizthegrey: Simple software architecture sufficed: 30 words stored in electromechanical switches of "initial orders" (boot loader). Loaded program from tape into memory. #QConNYC

@danielbryantuk: "Why did Wilkes complete the EDSAC before other early computers were finished? Because he kept it simple" @joshbloch #qconnyc https://t.co/REcLrd8jPP

@jeanneboyarsky: In first program on first computer ”... realization came over me that a good part of the remainder of my life was going to be spent in finding the errors in my own programs” —— Maurice Wilkes. Subroutine library is a partial fix. #qconnyc

@lizthegrey: If you don't have to debug every little thing because you have trusted libraries, you might have an easier time...So Wilkes gave the job to Wheeler. #QConNYC

@lizthegrey: Wheeler devised "coordinating orders" to direct the compiler to insert subroutines into the code. [ed: kind of like macros, almost]. Required no manual intervention, unlike von Neumann's idea #QConNYC

@lizthegrey: Everything fit all on a single tape, and the bootloader stayed at 42 instructions (up from 30), constrained by the phone switches "a tour de force of ingenuity." #QConNYC

@lizthegrey: Arbitrary recursion was permitted from subroutines into other subroutines, and passing functions in as arguments to other functions. Self-modifying code. The linkage technique was called "The Wheeler Jump". Amazing for its time. #QConNYC

@jeanneboyarsky: ”The wheeler jump” —— call function by jumping. Requires self modifying code which would be a security nightmare now. #QConNYC
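
The linkage trick is easier to picture with a toy interpreter. The sketch below is not EDSAC code; it is a hypothetical Python simulation in which the caller records its own address and the subroutine overwrites its final jump instruction so control returns to the caller -- the essence of the Wheeler Jump. All opcodes and addresses are invented for illustration.

```python
# Toy simulation of Wheeler-Jump-style linkage via self-modifying code.
def run(program):
    acc = 0                  # accumulator
    pc = 0                   # program counter
    mem = list(program)      # instructions are mutable, enabling self-modification
    output = []
    while pc < len(mem):
        op, arg = mem[pc]
        if op == "LOAD_PC":          # caller puts its current address into the accumulator
            acc = pc
            pc += 1
        elif op == "JUMP":           # unconditional jump
            pc = arg
        elif op == "PLANT_RETURN":   # subroutine rewrites its own exit jump
            mem[arg] = ("JUMP", acc + 2)   # return to the instruction after the call pair
            pc += 1
        elif op == "PRINT":
            output.append(arg)
            pc += 1
        elif op == "HALT":
            break
    return output

SUB = 4  # address where the subroutine starts
program = [
    ("LOAD_PC", None),               # 0: caller records its own address
    ("JUMP", SUB),                   # 1: jump into the subroutine
    ("PRINT", "back in main"),       # 2: execution resumes here after the return
    ("HALT", None),                  # 3
    ("PLANT_RETURN", 6),             # 4: subroutine plants its return jump at address 6
    ("PRINT", "inside subroutine"),  # 5
    ("JUMP", 0),                     # 6: placeholder, overwritten by PLANT_RETURN
]

print(run(program))  # ['inside subroutine', 'back in main']
```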

@lizthegrey: Large amount of mathematical coverage of the subroutines; written all in one year by one research team. #QConNYC

@danielbryantuk: "The EDSAC subroutine library was in fact a library of code tape" @joshbloch #qconnyc https://t.co/uLvgSXLUXg

@glamcoder: If you ever wondered why a software library is called "library", that's why #qconnyc #keynote #day2 https://t.co/eGZataIwCQ

@lizthegrey: Entire API was contained in _The Preparation of Programs for an Electronic Digital Computer_, the first text on computer programming (WWG abbreviation for the authors' last names) #QConNYC

@lizthegrey: Key ideas presented in Wheeler's 1952 paper: "The Use of Sub-routines in Programmes". Described subroutines, libraries, performance/generality tradeoffs, first-class functions, etc. #QConNYC

@lizthegrey: Quote from the paper: "After coding/testing, there remains the task of writing a description so that people not acquainted with the interior coding can use it easily; *this task may be the most difficult*." #QConNYC

@lizthegrey: 42 years later, Parnas still had to remind people that "Reuse is far easier to say than do, requiring both good design and documentation". #QConNYC

@jeanneboyarsky: 1951 - “The Preparation of programs for digital computers”. Called WWG for last names of authors. World’s first text on computer programming and remained primary text until higher level languages arose. Introduced subroutines to the world. #QConNYC

@lizthegrey: Wheeler's 1952 paper remains accurate: "simplicity of use, correctness of code, accuracy of description, and burying complexity out of sight." #QConNYC

@lizthegrey: Why didn't WWG discuss APIs separate from the library? Because the two were isomorphic. Only one machine architecture and one machine. No portability needed. #QConNYC

@charleshumble: “Wheeler’s 1952 paper was only 2 pages long. And by the way I found a typo in it.” @joshbloch #QConNYC

@lizthegrey: No legacy programs, because there were no earlier programs. No need for backward compatibility. They understood API design principles but didn't see a difference between library implementation and API. #QCOnNYC

@lizthegrey: But the field progressed, and libraries had to be reimplemented on new hardware. Keeping the same API let you preserve code and knowledge. #QConNYC

@lizthegrey: New algorithm implementations made existing APIs faster. First use of API: 1968 paper "Data structures and techniques for remote computer graphics" #QConNYC

@lizthegrey: "System designed to be hardware independent... implementation may be recoded for different hardware, but maintain the same interface with each other and the application program." #QConNYC

@lizthegrey: "Eventual replacement is almost a certainty given the rapid rate of developments in computer technology... flexible, hardware independent systems will ensure that systems don't become prematurely obsolete." #QConNYC

@lizthegrey: Authors separated implementation from the API, to allow implementations to be replaced without harm to clients. Libraries naturally give rise to APIs. Not so much invented as discovered. #QConNYC

@lizthegrey: It took us 20 years from the invention of libraries to the latent discovery of APIs. #QConNYC

@lizthegrey: Two-part test for whether something's an API: does it provide operations defined by inputs and outputs, and can it be reimplemented without compromising its users? #QConNYC

@lizthegrey: C standard library from 1975. K&R's commentary from 1978: "routines are meant to be portable, and programs that only use the standard library can be moved from one system to another without change." #QConNYC

@jeanneboyarsky: If it provides operations defined by inputs/outputs and allows reimplementation without compromising user, it is an API #QConNYC

@lizthegrey: Core libraries become joined to the language. Unix VI system calls from 1975 -- OS kernels have APIs! #QConNYC

@lizthegrey: IBM PC BIOS (1981) -- firmware provided API to underlying hardware.
MS-DOS command-line interface -- is that an API? Well, a script requires access to the commands... #QConNYC

@lizthegrey: Win32 API (1993) still used today. Java class libraries (version 2, 1998), with many implementations including Harmony, Android, etc. #QConNYC [ed: obligatory that these are Bloch's opinions and not necessarily mine or Google's]

@lizthegrey: And then we have web APIs. The first web API was Delicious. Lessons: APIs come in all shapes and sizes, and can live long after the platforms they were created for. They can create entire industries above and below. #QConNYC

@lizthegrey: "APIs are the glue that connects the digital universe." --@joshbloch #QConNYC

@lizthegrey: But API reimplementation is under serious attack, says @joshbloch. In August 2010, Oracle sued Google in Federal Court for reimplementing Java APIs in Android. #QConNYC

@lizthegrey: May 2012: The jury ruled no patent infringement, judge ruled APIs not copyrightable. Oracle appealed, and in May 2014 US Court of Appeals overturned Alsup's ruling. Google petitioned SCOTUS, and in June 2015 SCOTUS declined to re-hear it. #QConNYC

@lizthegrey: "Unfortunately, the case was remanded to the court in California to decide if it was fair use." –@joshbloch May 2016: jury ruled fair use. Appealed by Oracle, but despite academic amicus briefs, May 2018 appeals court reversed jury verdict. #QConNYC

@lizthegrey: "Currently, unless something changes, it it is today the law of the land that API reimplementation isn't allowed without the permission of the API creator." But there are efforts to have a new en banc hearing, supported by academics and industry professionals. #QConNYC

@lizthegrey: [ed: all of the above are, I must stress again, @joshbloch's opinions and not mine or Google's.] What happens if the ruling stands? Payment of licensing fees or field of use restrictions (or outright denial) are possible. #QConNYC

@lizthegrey: Author would get a near-perpetual monopoly on the API. If you think software patents cause problems (20 years), a life plus 70-year or 95-year monopoly on implementations of an API would strangle the industry. #QConNYC

@lizthegrey: None of GNU, non-IBM PCs, Samba, Wine, Android would be possible without this right to reimplement, which had been the case for most of the 70-year history of computers. #QConNYC

@lizthegrey: We'll wind up spending less time coding, more time arguing with lawyers, and wind up reimplementing incompatible things, says @joshbloch. #QConNYC

@lizthegrey: APIs date back to the dawn of the computer age. Don't develop APIs unless they're free to reimplement. Don't work for companies that assert copyright on APIs. And let executives at the companies you work for, and the courts and Congress know your opinions. #QConNYC

Developers as a Malware Distribution Vehicle

by Guy Podjarny

Jeanne Boyarsky attended this keynote:

Developers have more power than ever – can get more done and faster. Can also do more harm….

Must trust the people who write the software.

We ship code faster. Hard to find if developer introduces code maliciously or accidentally….

As we get more power, we need to get more responsible

Causes of  insecure decisions:

  • Different motivations – focus on functionality. Security is a constraint. Need to be cognizant of it
  • Cognitive limitations – we move fast and break things
  • Lack of expertise – don’t always understand security implications
  • Developers are overconfident. Harder to train when they think they know it.
  • “It doesn’t happen to me.” Security breaches happen to everyone.

Mitigations

  • Learn from past incidents
  • Automate security controls
  • Make it easy to be secure
  • Developer education
  • Manage access like the tech giants
  • Challenge access requests. When needed. For how long. What happens if you don’t have access. What can go wrong with access? How would you find out about access being compromised?

Developers have access to user data. Be careful.

Google BeyondCorp

  • All access route through corporate proxy
  • Proxy grants access per device – limits what can do from Starbucks
  • Monitoring access

Microsoft Privileged Access Workstations (PAW)

  • Access to production can only be from a secure machine
  • No internet from the secure machine
  • Your machine is a VM on the secure machine

Twitter feedback on this keynote included:

@lizthegrey: "Not being a malware distribution vehicle is generally useful." -- @guypod #QConNYC

@lizthegrey: Devs in China chose to mirror Xcode to local mirrors e.g. on Baidu filesharing. But some of the mirrors had malware :( #QConNYC

@lizthegrey: XcodeGhost included a malicious CoreServices component that spies on users. Evaded Apple's malware detection. :( #QConNYC

@lizthegrey: Undetected for 4 months from May to September, infecting 300+ apps in China -- WeChat, DiDi, Railway apps... #QConNYC

@lizthegrey: Some apps were compromised through third party libraries. Total of 1.4M active victims per day. Imagine a startup with 1.4M daily users within 4 months. #QConNYC

@lizthegrey: Victims weren't just in China; many in US, Japan, Canada, Australia, and Ireland. #QConNYC

@lizthegrey: Even with a closed App Store environment, _months_ to get users to choose to update to non-infected apps. #QConNYC

@lizthegrey: Eventually, Apple wound up solving the underlying motivation: providing local mirrors so developers in China wouldn't have to use untrusted sources. #QConNYC

@lizthegrey: Developers were a distribution vehicle for the malicious library. Without them, it would have gone nowhere. #QConNYC

@lizthegrey: Second example: 2009. Virus inside of Delphi called Induc. Compromised sysconst.dcu statically linked into every program compiled on machine. #QConNYC

@lizthegrey: Even worse than XcodeGhost -- took 10 months to find, propagated millions of times, more viral and harder to centrally remove; replicated within compiler rather than executables. #QConNYC

@thenewstack: With Delphi’s #Induc and Apple’s #XCodeGhost, #malware can be widely spread through developers, by way of compilers and hidden libraries. @snyksec’s @guypod #QconNYC https://t.co/4y7IHplWq2

@lizthegrey: The moral: You can't trust code that you didn't create yourself. But nobody does this. #QConNYC

@lizthegrey: It's more important to trust the people who write software, according to Thompson.

@lizthegrey: There's no tie between the code on GitHub and the compiled binaries served on NPM. Malicious PyPi packages last year, RubyGems, NPM, etc. -- and malicious docker images too. #QConNYC
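
A generic mitigation for that gap -- not something from the talk -- is to pin and verify a checksum for any downloaded artifact before using it. The URL and digest below are placeholders, purely for illustration.

```python
# Hedged sketch: verify a downloaded artifact against a pinned SHA-256 digest.
import hashlib
import urllib.request

ARTIFACT_URL = "https://example.com/packages/somelib-1.2.3.tar.gz"   # placeholder
EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"  # placeholder

def fetch_and_verify(url: str, expected: str) -> bytes:
    data = urllib.request.urlopen(url).read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected:
        raise ValueError(f"checksum mismatch: got {digest}")
    return data

# fetch_and_verify(ARTIFACT_URL, EXPECTED_SHA256)  # would fail with the placeholder digest
```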

@lizthegrey: These are just the ones that we know about. Attackers are smart and sophisticated, and evolving faster than defenders. -- @guypod #QConNYC

@lizthegrey: Our users trust the code that we ship as developers. We need to pay attention. But we have another power: access to code, systems, and data. #QConNYC

@lizthegrey: Salesforce (@modMasha) ran an internal phishing test, and developers were the second most likely group to click the malicious link. #QConNYC

@thenewstack: With #DevOps, developers have access to production systems, and user data—which is not always a good idea, given that developers are just as likely to click on a phishing email as any other employee — @snyksec’s @guypod #QconNYC

@lizthegrey: Story 2: Uber Hack of 2016: 600k Uber drivers had PII leaked, and some personal info of 57M Uber users. Uber paid a $100k ransom disguised as a bug bounty. #QConNYC

@lizthegrey: Uber didn't report the breach for an entire year until November 2017. How did it happen? S3 tokens pushed to private github repo w/o 2fa; attackers gained access to repo. #QConNYC

@lizthegrey: Tokens used to steal info from S3. Where did this go wrong? Uber said they told people to use 2fa, and to stop using GitHub. But repeat of 2014 incident where public gist contained sensitive URL. #QConNYC
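
A common countermeasure for the token-in-repo failure mode is to scan changes for credential-shaped strings before they are committed. The sketch below is a hypothetical pre-commit-style check, not anything Uber described; the AWS access key ID pattern is the only real-world detail it relies on.

```python
# Hypothetical pre-commit scan for credential-shaped strings.
import re
import sys

PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "Private key block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan(paths):
    findings = []
    for path in paths:
        try:
            text = open(path, errors="ignore").read()
        except OSError:
            continue
        for label, pattern in PATTERNS.items():
            if pattern.search(text):
                findings.append((path, label))
    return findings

if __name__ == "__main__":
    hits = scan(sys.argv[1:])
    for path, label in hits:
        print(f"Possible secret ({label}) in {path}")
    sys.exit(1 if hits else 0)   # non-zero exit blocks the commit
```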

@lizthegrey: Spiderman quote: With great power comes great responsibility. Why do developers keep falling for these? #QConNYC

@lizthegrey: People make insecure decisions because they're motivated by something other than security (e.g. baby photos). Cognitive limitations of number of distinct passwords remembered. Lack of expertise. #QConNYC

@lizthegrey: Developers are, in addition to failing on those three previous things, overconfident and arrogant.  Training developers is harder than training regular employees... 'This couldn't happen here.' --@modMasha #QConNYC

@lizthegrey: We are trustworthy but not infallible. -- @guypod We need to protect ourselves when we inevitably make mistakes. #QConNYC

@lizthegrey: Three lessons: (1) learn from past incidents [eg automate security controls, make security the default, educate developers], (2) manage access like a tech giant [e.g. beyondcorp/Cloud IAP, u2f, access controls] #QConNYC

@lizthegrey: PAWs from Microsoft -- access to production requires a dedicated secure isolated machine; your personal work is a VM inside that secure host machine. #QConNYC

@lizthegrey: Netflix's BLESS: no long-lived access; central SSH certificate authority and bastion servers to mediate access. #QConNYC

@lizthegrey: and the most important question: what's the worst case if it were compromised? How would we detect it? #QConNYC

@lizthegrey: At the end of the day, users trust us. Care about user safety, even if it's hard and slows us down. Don't be a malware distribution vehicle. [fin] #QConNYC

@thenewstack: Tech companies mitigate against security breaches by limiting of employee permissions — see the Google #BeyondCorp central proxy, Microsoft’s Privileged Access Workstations and Netflix’s ssh-based Bless— @snyksec’s @guypod #QconNYC @qconnewyork https://t.co/mu8jpOvbNR

The History of Fire Escapes

by Tanya Reilly

Twitter feedback on this keynote included:

@danielbryantuk: "Fireproof buildings are more effective than fire escapes. Much like building resilient software is more effective than tacking on an ops process and debugging in prod" @whereistanya #qconnyc https://t.co/9qzvPKWJwh

@charleshumble: "An optimistic disaster plan is a useless disaster plan." @whereistanya #qconnyc

@charleshumble: "I read this patent 3 times and I'm pretty sure this person invented a rope for a fire escape. It's the most silicon valley invention of 1846" @whereistanya #qconnyc

@danielbryantuk: Fascinating insight into the failure of fire mitigation in New York city, via @whereistanya at #QConNYC 
"Human error is never the root cause of failing to escape from a building fire" https://t.co/HLQJ8RHP8g

@charleshumble: "fire escapes collapse during times of intense use - such as during actual fires." #qconnyc @whereistanya

@John03000413: Reliability is everyone’s job #qconnyc https://t.co/d9YuveWSU1

@lizthegrey: "You don't want the NYFD rushing into your kitchen every time you burn toast." --@whereistanya #QConNYC

@danielbryantuk: "If you missed my subtle metaphor for issues with handling fires, it's the same with software failure" @whereistanya #QConNYC https://t.co/RTqHDgJ86m

@charleshumble: "The New York fire department recommends you don't cook when drunk or sleepy. I'd like to respectfully suggest that the same applies to a root prompt." #QConNYC @whereistanya

@charleshumble: "Fatigue is a form of encumbrance. Push back on this" @whereistanya #qconnyc

@mpredli: “Software without built-in reliability is a tenement.” @whereistanya at @qconnewyork #QConNYC

Tracks and Talks

Architectures You've Always Wondered About

Canopy: Scalable Distributed Tracing & Analysis @Facebook

by Joe O’Neill & Haozhe Gao

Twitter feedback on this session included:

@thenewstack: To monitor its thousands of services, Facebook captures about a billion traces a day (about ~100TB collected), a dynamic sampling of the total number of interactions per day — @Facebook’s Haozhe Gao and Joe O’Neill #QConNYC https://t.co/iHXCirnp3L

Lyft's Envoy: Embracing a Service Mesh

by Matt Klein

Twitter feedback on this session included:

@micheletitolo: “Lyft’s architecture wasn’t scaling because the developers didn’t trust the infrastructure” - @mattklein123 #qconnyc

@micheletitolo: "People have partial or no implementations of distributed systems best practices" - @mattklein123 #qconnyc

@micheletitolo: "If I want consistent [microservices best practices] I need to implement them in each language" #qconnyc

@micheletitolo: "If developers don't trust it, they won't use it. They would go add their features to the monolith" @mattklein123 

@danielbryantuk: Breaking down @EnvoyProxy with @mattklein123 at #qconnyc "We use Envoy as a middle proxy and an edge proxy too -- this way there is only one tool to learn" https://t.co/1jINCs0BOS

@nWaHmAeT: If your devs are spending 60% on infrastructure instead of business logic, you're doing it wrong. @mattklein123 on removing impediments to a microservices architecture

@danielbryantuk: "Observability is vital for modern microservices-based networking" @mattklein123 #QConNYC https://t.co/B85vN8n23x

@danielbryantuk: "There is no traffic at Lyft that does not go through @EnvoyProxy -- we have 100% coverage, including observability" @mattklein123 #qconnyc https://t.co/e8sBr115tq

@philip_pfo: “Consistency reduces cognitive load and improves operability of a service.” @mattklein123 #QConNYC

@danielbryantuk: "Distributed tracing is not so useful for fire fighting, but it is fantastic for debugging and performance issues. You need to have 100% coverage of communication though, and also make it easy for devs to access data via tooling" @mattklein123 #qconnyc https://t.co/J0KHQ5uF8v

@danielbryantuk: Three key observability dashboards from @mattklein123 at #qconnyc 
- service to service comms (for *all* services)
- edge proxy (including all ingress and failures)
- global health https://t.co/fCyyBJs1In

@danielbryantuk: "@EnvoyProxy is a universal data plane. There is a thin client for ID propagation and best practices etc, but the sidecar proxy is where the magic happens" @mattklein123 #QConNYC https://t.co/g7VI1CTAhH

@danielbryantuk: "Critical mass with @EnvoyProxy has nearly been achieved within the service mesh space -- is it becoming too costly *not* to use?" @mattklein123 acknowledging his bias, but making a good point at #QConNYC https://t.co/tmnASAhOYl

@danielbryantuk: "In the long term I'm not sure people will even be aware that @EnvoyProxy is running. It will most likely be embedded into container and serverless platforms" @mattklein123 #QConNYC https://t.co/BwNVHQhSky

@thenewstack: The next big thing will be connecting asynchronous rest-based systems, such as the ones #Envoy supports, and event-based “real-time” synchronous systems such as #Kafka. Most companies already use both. Envoy will look into supporting Kafka soon — @mattklein123 #qconnyc

Scaling Push Messaging for Millions of Devices @Netflix

by Susheel Aroskar

Twitter feedback on this session included:

@danielbryantuk: Great start to @susheelaroskar's #QConNYC talk -- the architecture of Zuul Push at Netflix, and why this is important (paraphrasing) "this reduced traffic by 12%" in comparison with pull-based comms https://t.co/H8ZYfYPIdY

@kcasella: @susheelaroskar auto-scaling a push notification system requires looking at open connections, not RPS or CPU #qconnyc @WeAreNetflix https://t.co/GzwSECk1CD

@danielbryantuk: An overview of managing a push cluster like Netflix's Zuul Push, @susheelaroskar at #QConNYC https://t.co/Y3EALG5npp

@whereistanya: That was an engaging and informative talk by Susheel Aroskar about how Netflix does push notifications. Key takeaways:
- recycle connections often
- randomize connection lifetime
- use small servers
- autoscale on open connection count
- use websocket-aware or TCP LB
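
Two of those takeaways -- recycling connections and randomizing connection lifetime -- can be sketched generically. The asyncio server below is illustrative only (invented port and lifetime values), not Netflix's Zuul Push code; the jitter keeps a fleet restart from triggering a synchronized reconnect storm.

```python
# Minimal sketch: close long-lived push connections after a randomized lifetime.
import asyncio
import random

MIN_LIFETIME = 25 * 60   # assumed values, seconds
MAX_LIFETIME = 35 * 60

async def push_loop(writer):
    while True:
        writer.write(b"ping\n")        # placeholder for real push messages
        await writer.drain()
        await asyncio.sleep(30)

async def handle_client(reader, writer):
    # Per-connection jitter avoids a reconnect "thundering herd".
    lifetime = random.uniform(MIN_LIFETIME, MAX_LIFETIME)
    try:
        await asyncio.wait_for(push_loop(writer), timeout=lifetime)
    except asyncio.TimeoutError:
        pass                           # lifetime reached: client should reconnect
    finally:
        writer.close()
        await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 7001)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```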

Chaos, Complexity, and Resilience

UNBREAKABLE: Learning to Bend but Not Break at Netflix

by Haley Tucker

Twitter feedback on this session included:

@danielbryantuk: The @netflix team learned from an experiment of shutting off the non-critical service shard that although everything worked, the critical services saw 25% more traffic... via @hwilson1204 at #qconnyc https://t.co/x4L32VJGcJ

@danielbryantuk: An overview of the principles of chaos from @hwilson1204 at #qconnyc (with a hat tip to @caseyrosenthal et al for https://t.co/LTW9u8p4hU) https://t.co/ph2JSQ5kdv

@danielbryantuk: "Limit the impact, and pick your times, when running chaos experiments" @hwilson1204 #qconnyc https://t.co/RtThU0YIgE

@danielbryantuk: Interesting to hear from @hwilson1204 about the value automated canary analysis can provide at Netflix (and some of the tooling is now open sourced in @spinnakerio) #qconnyc https://t.co/0X8hhLPeSV

@danielbryantuk: Great key takeaways from @hwilson1204's #QConNYC talk about chaos and resilience testing at @netflix https://t.co/wlVEzMBtXO

Using Chaos To Build Resilient Systems

by Tammy Butow

Twitter feedback on this session included:

@lizthegrey: @tammybutow How you apply chaos engineering depends upon the scale of your  infrastructure. #QConNYC

@lizthegrey: It's like riding a bicycle; you can't just hop on and ride at full speed. "The hello world of chaos engineering is a CPU attack." --@tammybutow #QConNYC
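
For a rough idea of what that "hello world" looks like, here is a generic, time-boxed CPU burn in Python. It is not Gremlin's agent; the worker count and duration are arbitrary illustration values, and the point is simply to confirm that monitoring and alerting notice the load.

```python
# "Hello world" chaos experiment sketch: peg a few cores for a bounded time.
import multiprocessing
import time

ATTACK_SECONDS = 60   # keep the blast radius small and time-boxed
WORKERS = 2           # number of cores to load

def burn(deadline):
    while time.time() < deadline:
        pass          # tight loop pegs one core

if __name__ == "__main__":
    deadline = time.time() + ATTACK_SECONDS
    procs = [multiprocessing.Process(target=burn, args=(deadline,))
             for _ in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("CPU attack finished; check that your monitoring caught it.")
```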

@lizthegrey: And once you can ride a bicycle, then you can drive a car, and perhaps drive an F1 car as you get more sophisticated at operating wheeled vehicles. It is a journey that could take multiple years. #QConNYC

@lizthegrey: ... and Gremlin's CEO asserts that chaos engineering is about testing that those graceful degradations work. #QConNYC

@charleshumble: "Google is an expert in outages that you don't notice - graceful degradations - and testing this is where chaos engineering comes in." @tammybutow #qconnyc

@lizthegrey: We have to be able to gracefully omit parts of our site that we aren't able to serve instead of leaving holes in our UI. It's a cross-functional effort involving product managers and UX, not just infrastructure engineering. #QConNYC

@lizthegrey: What are the implications of our services not working correctly? Sometimes it's small, but if you work in finance you could cost someone a mortgage and cost them their dream home! (and get you fined by regulators). #QConNYC

@lizthegrey: "If you never get paged, you won't know what to do when a real failure happens or be able to train engineers." so @tammybutow used Chaos Engineering at dropbox to inject faults for people to train on. #QConNYC

@lizthegrey: Always be careful about affecting real customers if you can while doing chaos engineering. #QConNYC

@lizthegrey: Gremlin provides chaos engineering as a service, allowing simulations of packet loss, host shutdown, etc. with a local agent #QConNYC

@lizthegrey: Laying foundations: defining resiliency. Resilient systems are highly available and durable. They can maintain acceptable service and weather the storm even with failures. #QConNYC

@lizthegrey: We need to know what results we want to achieve. Do thoughtful planned experiments to reveal weaknesses in our system. More like vaccines -- controlled chaos. #QConNYC

@charleshumble: "Failure Fridays are dedicated time for teams to collaboratively focus on using chaos engineering practices to reveal weaknesses in your services" @tammybutow #qconnyc

@lizthegrey: Why do we need chaos for distributed systems? Unusual failures are common and hard to debug; systems and orgs scale and chaos engineering helps us learn. #QConNYC

@lizthegrey: We can inject chaos at any layer -- API (e.g. ratelimiting, throttling, handling error codes...), app, ui, cache (e.g. empty cache -> hammered database), database, OS, host, network, power etc. #QConNYC

@lizthegrey: So why run these experiments? Are we confident that our metrics and alerting are as good as they should be? "Alert and dashboard auditing aren't that common but should be practiced more." [ed: yes.] #QConNYC

@lizthegrey: Do we know that customers are getting good experiences? Can we see customer pain? How is our collaboration with support teams? #QConNYC

@lizthegrey: Are we losing money due to downtime, broken features, and churn? #QConNYC

@lizthegrey: How do we run experiments? Need to form a hypothesis, consider the blast radius, run the experiment, measure results, then find/fix issues and repeat at larger scale. #QConNYC

@lizthegrey: Don't forget to have baseline metrics before you start experimenting. Don't run before you can walk, it's okay to start slow. Three key prerequisites: (1) monitoring & observability (e.g. 4 different systems :( :( ) #QConNYC

@lizthegrey: (2) Oncall and incident management. If you don't have any type of alerting and are manually watching dashboard, that's bad. You need a triage and incident management protocol to avoid treating all outages with the same severity. #QConNYC

@lizthegrey: (3) Know the cost of downtime per hour. [ed: or have clear Service Level Objectives so the acceptable budget is defined by/for you!] #QConNYC

@lizthegrey: Tools that @tammybutow recommends: @datadoghq, @getsentry, and old fashioned Wireshark. #QConNYC

@lizthegrey: The most critical thing is having an IMOC rotation, says @tammybutow [ed: although a good end goal is empowering *every* engineer to become an incident commander]. #QConNYC

@lizthegrey: How do we choose what experiments to run? Identify your top 5 critical systems and pick one! Draw the system diagram out. Choose something to attack and determine the scope. #QConNYC

@lizthegrey: Things to measure in advance: availability/errors, KPIs like latency or throughput, system metrics, and customer complaints. We need to verify we can capture failures. Does our monitoring actually work? #QConNYC

@lizthegrey: https://t.co/lM3QZBNWV8 is a toolkit for running your own gameday. example: a chart for how many hosts we can affect and how much latency we're going to add to each. #QConNYC

@lizthegrey: Make sure you have a switch for turning off all chaos experiments in case of emergency. #QConNYC

@lizthegrey: Think about what attacks you can run -- both on individual nodes, as well as on the edges between the nodes, says @tammybutow. #QConNYC

@lizthegrey: Verify that your k8s clusters are as self-healing as you think they are -- will they spin back up correctly if restarted? #QConNYC

@lizthegrey: Resource chaos is also important. Increase consumption of CPU, disk, I/O, and memory to ensure monitoring can catch problems. Make sure that you find limitations before you have to turn away customers. #QConNYC

@lizthegrey: https://t.co/C9QtRdufQY is a known-known experiment that tests situations we can anticipate and is a bicycle for learning. #QConNYC

@lizthegrey: Disk chaos -- issues like logs backing up. we can fill up the log partition on a replica or primary and make sure the system can recover. #QConNYC

@lizthegrey: "Use your experience of past outages to prevent future engineers from being burned in the same way." --@tammybutow #QConNYC

@lizthegrey: Memory chaos: what if we run out of memory? What if it's across all the fleet? Process chaos: kill or crashloop a process, forkbomb... #QConNYC

@lizthegrey: Shutdown chaos: turn off servers, or turn them off after a set lifetime. #QConNYC

@lizthegrey: k8s pods are a natural target for shutdowns and restarts. or simulate a container that's a noisy neighbor that kills the containers on its own host. #QConNYC

@lizthegrey: The average lifetime of a container in prod is 2.5 days, and they die in many different ways. #QConNYC

@lizthegrey: Time chaos and clock skew: simulate time drift and different times. (and @tammybutow points out this could have been used for y2k tests) Network chaos: blackhole services, take down DNS. #QConNYC

@lizthegrey: Reproducing outages on demand lets us be confident we can handle them in the future. #QConNYC

@lizthegrey: What were the motivations for chaos engineering? For one, Dropbox and Uber's worst outages ever (both involving databases). Resources: the gremlin community and https://t.co/czw9Oef1L9. [fin] #QConNYC

Container and Orchestration Platforms in Action

Control Planes: Designing Infrastructure for Rapid Iteration

by Mohit Gupta

Twitter feedback on this session included:

@aspyker: #qconnyc "We used Mesos as Twitter used it. And @clever was right across the street. Worked but the Twitter team was 3x our company size." #qconnyc (foreshadowing of how offloading orchestration is important to small to medium companies)

@aspyker: Key message: design a great control plane (api) that lets you change infrastructure via thin wrappers keeping what your engineers work with stable. #qconnyc @mohitgupta https://t.co/sU0bXfnU2S

@aspyker: Set your own defaults in using open source. If Netflix or Google wrote it, the defaults are likely wrong for your company. @mohitgupta #qconnyc

Forced Evolution: Shopify's Journey to Kubernetes

by Niko Kurtti

Twitter feedback on this session included:

@danielbryantuk: "If you bought a @kubernetesio hoodie then you've used the @Shopify platform... which was running on Kubernetes" @n1kooo #qconnyc https://t.co/XDU1fsclON

@danielbryantuk: "Even though we tiered our services (and SLOs) we soon realized that manual processes for deployment and issue remediation don't scale" @n1kooo #qconnyc https://t.co/1bvI9BVbtM

@danielbryantuk: "We wanted to build a 'paved road' for developers to follow when deploying and operating apps. We also wanted it to be self service, as this is efficient and scalable (and engineers don't want to talk to people too ;-))" @n1kooo #qconnyc https://t.co/nckRCgxGvc

@aspyker: "We wanted kubernetes as the foundation, but didn't want to show kubernetes to developers" @n1kooo (foreshadowing of a different contract/api?) #qconnyc

@aspyker: Engineers on the Shopify platform use a web form for what they need, platform submits back PR to their repo with templates for their "entire" set of dependencies. #qconnyc

@danielbryantuk: Interesting to hear about the use of "cloudbuddies" at @ShopifyEng -- effectively extending @kubernetesio with custom controllers that provide operator-style functionality for making a developer's life easier @n1kooo at #qconnyc https://t.co/EqrKD0pFfS

@danielbryantuk: Very nice developer experience at @ShopifyEng when bootstrapping a new service. Everything is UI-driven and hooks into platform infra. A few clicks and you have all templates generated and the shell app deployed via @n1kooo at #qconnyc https://t.co/wAzFbIvhyX

@danielbryantuk: "Documentation was vitally important for our paved road platform rollout. We focused on how to 'drive a car' rather than 'how to build a car' -- developers typically just want to deploy apps" @n1kooo #QConNYC https://t.co/4e9yZWDcHd

@danielbryantuk: "If you want to build your own PaaS then focus on hitting 80% of use cases, hide complexity, and educate" @n1kooo #qconnyc https://t.co/KKMDtgb3cn 

Developer Experience: Level up Your Engineering Effectiveness

Help! I Accidentally Distributed My System!

by Emily Nakashima & Rachel Myers

Twitter feedback on this session included:

@micheletitolo: Lots of companies intentionally create distributed systems, whether they are good or bad. Splitting out by Nouns can cause lots of problems since they weren't the right boundaries. #QConNYC https://t.co/5l4YU2tQAG

@micheletitolo: SaaS products! Whenever you start using a lot of Saas Products, you are creating a distributed system. At a certain point, you need something custom. #QConNYC https://t.co/a4XVxIy0rQ

@micheletitolo: Buying can help you create a reliable system. Don't be afraid to buy, but make sure to do due diligence. #QConNYC https://t.co/LMMOq08bt6

@micheletitolo: IaaS, PaaS, BaaS, FaaS: more specialized to the left. Fewer use cases, and you'll outgrow their usefulness faster. #QConNYC https://t.co/a0mUIEynav

@micheletitolo: Figure out how to use specialized tools. They increase cognitive load. So does putting a simple service in a complex system #QConNYC

@micheletitolo: Browsers are a distributed system! You can't SSH into it, and little opportunity to instrument. Front-end complexity is ever increasing and therefore bugs are getting more complex. #QConNYC https://t.co/6xxrDQNiGI

@micheletitolo: Vendors are another secret distributed system. Adds significant complexity, especially when things fail. How do you know a tool works well? You need to instrument it. #QConNYC

@micheletitolo: Distributed load means a high Cognitive Load. Let go of the expectation that you will be able to hold a full picture of the system in your head. Differing levels of abstraction mean different levels of familiarity #QConNYC

@micheletitolo: Add enough instrumentation that you don't have to worry about it. Make sure you have tools for breadth and depth. #QConNYC https://t.co/TRiNkoM7hc

@micheletitolo: Final conclusions: 1. We are all distributed systems operators and 2. You need to be able to trust your tools #QConNYC https://t.co/uJ09IudtgA

Succession: A Refactoring Story

by Katrina Owen

Twitter feedback on this session included:

@micheletitolo: Takeaways from A Refactoring Story by @kytrinyx, probably the best refactoring talk I've seen #QConNYC

@micheletitolo: The first thing to ask: should I refactor at all? Is there a reason to change the code? The change defines which axis needs to change. We don’t need infinite flexibility. #QconNYC

@micheletitolo: Rearrange the code to get flexibility we need, and only then add the new feature #QconNYC

@micheletitolo: When extracting, naming is important. If you get the names wrong, the code is harder to change, because names stick around. Use domain concepts, which are less likely to change #QconNYC https://t.co/CL8cq3EO0T

@micheletitolo: Next, dissect your code in place. This gives you information before you commit to a new design #QconNYC

@micheletitolo: Isolate the algorithmic code into methods. Trade off: something that looks incredibly simple, now much more complex. But now you see the bones/structure of the algorithm. #QconNYC

@micheletitolo: First do a parallel implementation, so you can compare new + old. If your tests fail, there’s something else you missed, usually a conditional. Exceptions are the keys to a new insight and unblock abstraction #QconNYC
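
A minimal sketch of that parallel-implementation step, using invented pricing logic rather than anything from the talk: keep returning the legacy answer while comparing it against the refactored path, so any divergence points at a conditional you missed.

```python
# Parallel implementation sketch: run old and new code paths side by side.
def legacy_total(items):
    total = 0
    for item in items:
        if item["taxable"]:
            total += item["price"] * 1.2
        else:
            total += item["price"]
    return total

def refactored_total(items):
    return sum(item["price"] * (1.2 if item["taxable"] else 1.0) for item in items)

def total_with_check(items):
    old = legacy_total(items)
    new = refactored_total(items)
    if abs(old - new) > 1e-9:
        # In a real system this would be a logged warning, not an exception.
        raise AssertionError(f"parallel implementations disagree: {old} != {new}")
    return old   # keep serving the legacy answer until the new path is trusted

print(total_with_check([{"price": 10.0, "taxable": True},
                        {"price": 5.0, "taxable": False}]))  # 17.0
```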

@micheletitolo: There's the "primitive obsession" where we want to use basic types, like hashes and strings, but objects help us encapsulate better. #QConNYC

@micheletitolo: Once you've solved one problem, move on to the next one favoring fixing duplication. Chip it away to reveal underlying complexity. Rinse. Repeat. #QconNYC

@micheletitolo: Use “The flocking rules” to guide you: find the things that are the most alike, select the smallest difference between them, make the smallest change that removes that difference. #QconNYC

@micheletitolo: Last step is tackling assumptions, which are usually hardcoded. These need to be made flexible #QconNYC

@micheletitolo: Take small steps. Each of your small steps need to be safe. The higher the uncertainty the smaller the steps. #QconNYC https://t.co/yCdWOmVx3a

@micheletitolo: Lastly, with legacy codebases this might mean going backwards. That's okay! By going backwards you surface complexity. Don't keep all the details in your head, and make understanding things easier #QconNYC https://t.co/MNzcw4HvmV

Empowered Teams

Empowering Agile Self-organized Teams with Design Thinking

by William Evans

Twitter feedback on this session included:

@micheletitolo: "People don’t listen to what leaders say, they look at what leaders do" - @semanticwill #qconnyc

@micheletitolo: "Most teams aren’t ready for a transformation, mostly because people are overburdened, doing unplanned work etc. Fix the system before introducing new things." #qconnyc

@micheletitolo: "Overburdening disempowers people and prevents them from doing good work" #qconnyc

Ethics in Computing

Data, GDPR & Privacy: Doing It "Right" Without Losing It All

by Amie Durr

Jeanne Boyarsky attended this session:

Goals: send right message to right person at right time using right channel (ex: email, text, etc)…

Build trust without stifling innovation

  • accountability – what do with data, who responsible, continuing to focus on data perception, audit/clean data, make easy to see what data have and how opt out/delete
  • privacy by design – innovate without doing harm, don’t want to get hacked, be user centric, move data to individual so no storing, what is actually PII vs what feels like PII. Anonymize both…

What they did

  • dropped log storage to 30 days. Have 30 days to comply with requests to delete data. So handled by design for log files
  • hash email recipients
  • Remove unused tracking data
  • Communicated with customers
  • Kept anonymized PII data, support inquiries, etc
  • some customers feel 30 days is too long so looking at going beyond law

Twitter feedback on this session included:

@lizthegrey: Senders need to know what recipients have done with the messages they sent. Four key topics: consumer trust, privacy regulations, recent key issues/lessons, and doing it right. #QConNYC

@lizthegrey: Goal of marketing industry: sending the right message to the right person at the right time (and via right channel). #QConNYC

@lizthegrey: Only recently has it become possible to gather enough data to accomplish this. But this requires data handling. Three projects: GDPR compliance, and two feature enhancements. #QConNYC

@lizthegrey: 2/3 of consumers don't trust brands with PII. And employees don't trust their company to be GDPR compliant (63% not confident). *after* GDPR 90% don't believe consent is accurately described yet 31% don't think they're personally responsible. #QConNYC

@lizthegrey: ^^ that's 90% of employees of companies that don't think their employer's GDPR disclosures are accurate. 90%. #QConNYC

@lizthegrey: Do we deserve that trust? Well... ~500M identities known to have been stolen to date this year (e.g. email addresses, hashed passwords). #QConNYC

@lizthegrey: Example: Ticketfly had to shut down after losing 27M peoples' data. Panera: 37M identities stolen, including partial credit card numbers #QConNYC

@lizthegrey: 92M identities stolen off MyHeritage. Fortunately no DNA data stolen, just passwords and emails.
150M (and counting) identities stolen off myfitnesspal. #QConNYC

@lizthegrey: The minimum threshold isn't just not selling your data, it's safeguarding it against a breach. If someone phishes your employees' accounts or gets an S3 token, you don't want to be a Panera. #QConNYC

@lizthegrey: Landscape of regulations: CASL, CAN-SPAM, EU-US Privacy Shield, & GDPR. And Germany and France are crafting separate regulation. #QConNYC

@lizthegrey: It's not about the explicit laws, but instead the idea that our customers' trust matters and that data has an impact upon our brand. Customers will leave us if we break their trust. #QConNYC

@lizthegrey: Demand is on the rise for data scientists/engineers (50% higher than supply). 650% growth in roles over 6 years. #QConNYC

@lizthegrey: Key issues in trust to cover: accountability, privacy by design, and continued innovation. #QConNYC

@lizthegrey: (1) accountability -- {get,stay,show} clean. Audit data inventory, have processes for new data, and providing transparency/opt-out. #QConNYC

@lizthegrey: "We were annoyed at the mess of data marketing was retaining... until we looked at our own logs, in which we kept all production data with no fixed retention." -- @AmieDurr #QConNYC

@lizthegrey: Dropping storage to 30 days makes it the default behavior in compliance with the law, rather than requiring manually cleaning data after an opt-out. #QConNYC

@lizthegrey: Hashing data to make it pseudonymous. Removed unnecessary tracking tags/cookies. Educated customers not to do stupid things like put PII in subject lines/content. #QConNYC
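
The hashing idea can be sketched as a keyed hash that keeps engagement records joinable without retaining raw addresses. The key handling and the HMAC-SHA256 choice below are assumptions for illustration, not the speaker's actual implementation.

```python
# Hedged sketch: pseudonymize email recipients with a keyed hash.
import hashlib
import hmac
import os

# In practice the key would come from a secrets manager, not an env default.
PSEUDONYM_KEY = os.environ.get("PSEUDONYM_KEY", "dev-only-key").encode()

def pseudonymize_email(address: str) -> str:
    normalized = address.strip().lower()
    return hmac.new(PSEUDONYM_KEY, normalized.encode(), hashlib.sha256).hexdigest()

print(pseudonymize_email("Alice@Example.com"))
```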

@lizthegrey: Separate your abuse logging/heuristics from your other data, and communicate clearly about it in your privacy policy. #QConNYC

@lizthegrey: You don't have to throw everything out, but you need to know what you have and what you're using it for to make appropriate decisions. #QConNYC

@lizthegrey: Data protection is a shared responsibility that needs to be continuously done. #QConNYC

@lizthegrey: 7 principles of GDPR; be user centric, and continuously stay clean. #QConNYC

@lizthegrey: Case study: engagement message events from message opens. Used to store the encoded link containing the crypto unhashed & other customer data. Instead migrated that data to no longer live in the links. #QConNYC

@lizthegrey: Best practice is to have the messages forward links for a year, but needed to change behavior: if <30 days, do the join, if >30 days, just pass the link along. #QConNYC

@lizthegrey: Can now no longer see who the messages were to retrospectively unless the user engages. #QConNYC

@lizthegrey: Distinctions between strict PII and possible PII (e.g. try to be non-creepy about geo_ip data even if city level). Don't just follow the law, look after your consumers and what feels right to them. #QConNYC

@lizthegrey: Stop storing non-anonymized PII, and encrypt/encode data you do need to keep. Aggregate. Explicitly include data management in design docs. #QConNYC

@lizthegrey: "If you do not [include privacy in your design docs], you will forget about it. And make the DPO your best friend." -- @AmieDurr #QConNYC

@lizthegrey: If you delete by default there's little you have to do around GDPR's deletion policy. #QConNYC

@lizthegrey: Smart send case study: don't send mails to disengaged users. So we started backfilling 6 months of data for our beta feature... except to discover it wasn't hashed. Switched to an appropriate source. But the team missed privacy by design. #QConNYC

@lizthegrey: Someone proposed a feature to compare subject lines within the same industry. Need to communicate early/often with the DPO. Do we need to update our privacy policy? But don't be afraid to innovate. #QConNYC

@lizthegrey: What's next? Regulation and innovation aren't in opposition to each other. We do both -- learn from your peers, ask questions, innovate, and build a system of trust. #QConNYC

@lizthegrey: If you {get,stay,show} clean, consumers will trust you. [fin] #QConNYC

Ethics in Computing, From Academia to Industry

by Kathy Pham

Twitter feedback on this session included:

@lizthegrey: Why was the US government spending billions of dollars on software that didn't work? In part, because it didn't understand the needs of the community. #QConNYC

@lizthegrey: Call to action: honor all expertise across academia and industry to build better software. Better outcomes from interdisciplinary collaboration (e.g. at academic institutions) #QConNYC

@lizthegrey: The hierarchy of engineering/tech roles over other disciplines being less valued is unhealthy and produces software that doesn't serve people. #QConNYC

@lizthegrey: Amazon reconstructed redlining with same-day delivery by not being critical about the data they were using. #QConNYC

@lizthegrey: The incentives aren't aligned for serving users' needs and we wind up with tools that don't work. #QConNYC

@lizthegrey: In recent news: Google, Microsoft, and Amazon in the headlines for employee protests/mobilization against problematic government contracts #QConNYC

@lizthegrey: How do we train people? If we look at the CS curriculum now, people have many choices of focuses, but no dedicated focus on ethics [ed: may have mis-transcribed].
There are at least 197 courses across 188 universities. #QConNYC

@lizthegrey: They exist, but they may not work -- otherwise we wouldn't be in the state we're in. What can we do to fix this? Embed ethics into the data science curriculum instead of making it a separate class. #QConNYC

@lizthegrey: Make people question what they're building *as* they're building it. It'll take time to tell how effective it'll be. #QConNYC

@lizthegrey: We need to empower and connect everyone. There's a lot of power in individual contributors, especially among engineering individual contributors. We are perceived by management as some of the most valuable employees and can utilize that power. #QConNYC

@lizthegrey: Surprisingly, people with social science and computer science backgrounds have a harder time finding jobs than people from pure CS backgrounds, because they don't look like the "standard template". #QConNYC

@lizthegrey: Use your voice. "With great power comes great responsibility" by Spiderman. #QConNYC

Organizing for Your Ethical Principles

by Liz Fong-Jones

Twitter feedback on this session included:

@lizthegrey: "Data is a vital tool that helps us describe what is happening in our cities, predict what can happen for people in situations, and evaluate what's happening around us." --@QuietStormnat #QConNYC

@lizthegrey: Data determines things for us as individuals -- how fast we get home, how much we pay for healthcare, and so forth. #QConNYC

@lizthegrey: How do we develop a better trust with our users and build tools responsibly? "If we don't do this, the data goes away, and if the data goes away, the innovation goes away." #QConNYC

@lizthegrey: The power of the individual person matters. We want a vision and understanding of how the world should be. We're not living up to the promise of what America should be, but with technology, we can in the future. #QConNYC

@lizthegrey: "Not everyone should follow the same code of ethics. Ethics are shaped by how we grew up. We should be respected for our differences." #QConNYC

@lizthegrey: You have to understand how you define responsible behavior to call other people on it.
It's hard to do data science and do innovative things for social good. #QConNYC

@lizthegrey: Breaches are a large reason for social outrage -- "people aren't screaming because companies aren't complying with legal regulations, they're screaming when companies are _violating their trust_. And that's having an impact." @QuietStormnat #QConNYC

@lizthegrey: We need to balance individual rights with company profits. A growing tension, and outrage when that tension isn't respected and addressed. #QConNYC

@lizthegrey: GDPR says that people have not just the right to know what's being done with their data, but have a say in how their data is used. But it's expensive and hard to be more nuanced about consent. #QConNYC

@lizthegrey: Ethics is driven by culture, measured by compliance, and determined by society. Building trust requires empowering people to speak about data use, have processes to standardize practices, and leverage technology to hold us accountable. --@QuietStormnat #QConNYC

@lizthegrey: "Data is not technology -- data is _people_." --@QuietStormnat #QConNYC

@lizthegrey: First post-USDS project: Community-driven Principles for Ethical Data Sharing (CPEDS). What does responsible use for data look like? How should we share data? #QConNYC

@lizthegrey: Goal was to define a foundation for the principles of other peoples' work. e.g. a Hippocratic Oath for Data Scientists. It's okay for it to fork -- the same has happened in medicine. #QConNYC

@lizthegrey: How are we analyzing data and developing software/algorithms? Is it fair? Fostering diversity? Considering unintended consequences?
"Ethics is not about being perfect, it's about being intentional and responsible." --@QuietStormnat #QConNYC

@lizthegrey: Audience participation exercise: describing how users interact with data, key data lifecycle points of contact, and key questions to check ourselves. #QConNYC

@lizthegrey: e.g. data as "opportunity" or "risk". requires "trust". #QConNYC

@lizthegrey: "When we communicate about the APIs we produce, we need to not only say what it _can_ do, we need to say what it _cannot_ do." --@QuietStormnat #QConNYC

Privacy Ethics – A Big Data Problem

Twitter feedback on this session included:

@lizthegrey: General Data Protection Regulation passed in Europe. Why? "People are more aware of what kind of information companies are storing about them, and are worried about how that data might be used." -- @rags_den #QConNYC

@lizthegrey: Examples: leaks of information from credit bureaus and threat of identity theft. "The cost of storage and compute have decreased so much they're practically free. Companies are recording more and more data for the benefit of their bottom line revenue." #QConNYC

@lizthegrey: Surprising discoveries upon acquiring a company: credit card numbers stored in the clear in images that could be OCRed, accessible to any engineer. #QConNYC

@lizthegrey: Large IoT data lakes. We need to understand how to segregate and anonymize our data sets. [ex: Strava heatmaps exposing military bases] #QConNYC

@lizthegrey: Your client list may or may not be sensitive information. Think law firms, doctors... #QConNYC

@lizthegrey: Application logging as another privacy pain point: verbose logging can let developers quickly resolve problems, but the challenge is that our constraints change. Think recording webform answers. What if a sensitive field is added? #QConNYC
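
One defensive pattern for that logging problem is to scrub payloads before they reach verbose logs. The sketch below uses invented field names and a simple email pattern; note that a denylist like this will always lag behind newly added sensitive fields, which is exactly the changing-constraints problem described above.

```python
# Hedged sketch: mask sensitive-looking fields before a webform payload is logged.
import re

SENSITIVE_FIELDS = {"email", "ssn", "password", "dob"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub(payload: dict) -> dict:
    clean = {}
    for key, value in payload.items():
        if key.lower() in SENSITIVE_FIELDS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str) and EMAIL_RE.search(value):
            clean[key] = EMAIL_RE.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean

print(scrub({"name": "Pat", "email": "pat@example.com",
             "comment": "reach me at pat@example.com"}))
```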

@lizthegrey: Usernames can be correlated to other sites and reveal a huge amount of PII about the user. The mapping from transaction to username is a problem. #QConNYC

@lizthegrey: Aggregation services like Splunk and Sumologic makes it easier to access logs, bypassing controls on the production environment. Can you audit access? #QConNYC

@lizthegrey: Additional challenge: biased algorithms. e.g. Target outing minors who were pregnant by making inferences from their purchases. #QConNYC

@lizthegrey: Recap: privacy vs. security. Privacy relates to your fundamental rights and how your data is used; very contextual. Security relates to how we protect data and implement controls. #QConNYC

@lizthegrey: Solutions: cultural change is needed. Employees need to understand how to handle data and address customer concerns about how data will be used. #QConNYC

@lizthegrey: For the managers in the room: do you treat privacy/ethics/data-handling as part of your performance review process? #QConNYC

@lizthegrey: Is it cheaper to spend 4% of your global revenue on GDPR fines, or invest more in privacy (current industry average: 0.0004% of revenue spent on privacy engineering) #QConNYC

@lizthegrey: Second solution approach: security. Encrypt data in transit, rest, and your backups. [ed: and encrypting backups lets you delete data faster by deleting the key!] #QConNYC

@lizthegrey: Make sure you have robust, unified authentication mechanisms. Have anonymization/masking and pseudonymization processes. #QConNYC

@lizthegrey: Use secure credential storage such as HashiCorp Vault. Retain detailed audit logs on who is accessing things. #QConNYC

@lizthegrey: Solution 3: design. classify all data you store in terms of sensitivity. Salesforce is a common dumping ground for data #QConNYC

@lizthegrey: Ensure that you are implementing multifactor/multiparty authorization to access data. #QConNYC

@lizthegrey: Obscure data by design; don't show everything at once such that people can get access to data they don't immediately need. #QConNYC

@lizthegrey: Don't make aggressive assumptions about the consent that you're getting from users. Permission to use for one use is different from permission to use for marketing etc. #QConNYC

@lizthegrey: Provide visibility & transparency to users. Solution 4: process e.g. privacy impact assessments. #QConNYC

@lizthegrey: Ensure that minimal use of data is made; ask questions about what legitimate business purposes we have for asking for information. Give users self-service access to manage data. #QConNYC

@lizthegrey: Honor user consent when processing data. "Challenge your product manager and ask for the consent database." --@rags_den #QConNYC

@lizthegrey: Store user data only as long as is necessary. What you can delete does depend upon business function/use eg fraud [ed: but you could mask fields, even if not delete...]. #QConNYC

@lizthegrey: Solution 5: Automate. do discovery of data at rest and in motion - label/tag data sources. (this is what Integris does) #QConNYC

@lizthegrey: You can get false positives on noise [ed: e.g. things that look like a phone number but aren't], so have confidence scores. #QConNYC

@lizthegrey: Make sure you have the ability to access all of your data across all your environments and across all of your data formats. #QConNYC

@lizthegrey: Be aware both of the data in fields as well as the metadata (e.g. is "Jordan" a country, name, or a shoe brand?) #QConNYC

Finding the Serverless Sweetspot

Observability to Better Serverless Apps

by Erica Windisch

Twitter feedback on this session included:

@lizthegrey: @ewindisch @IOpipes .@ewindisch has seen the challenges of running infrastructure at scale, and wants to help people go beyond infrastructure -- making sure that our applications are working for our users. #QConNYC

@lizthegrey: "Are you sure?", she says. It's a trick question, because we need to define working. "What is a working application? Up is not online." #QConNYC

@lizthegrey: There are lots of tools that will do tests like pingdom. [ed: omg preach]. "Can you send an HTTP request and does it return a response is the bare minimum and doesn't actually tell you if your app is returning API responses." #QConNYC

@lizthegrey: "Up means that your application needs to be useful for your users. It goes beyond uptime." #QConNYC

@lizthegrey: If we trust your cloud provider, your application should always be up [ed: for some SLA based value of always]. Or if you trust your k8s install to be flawless, that your containers are always running. #QConNYC

@lizthegrey: But again, that's infrastructure-level uptime, not application availability. With serverless, you can assume that uptime of your infrastructure is not your problem -- there's nothing you can do about it other than waiting for your provider to fix it. [ed: or having CREs] #QConNYC

@lizthegrey: "Your code always does what you write it to do, but does it do what you *want* it to do?" --@ewindisch #QConNYC

@lizthegrey: Does it help to know assembly and about file descriptors? Yes. "But you shouldn't *have* to. Making it easier lifts up our developers and teams so that they don't have to worry about things that they shouldn't have to worry about." --@ewindisch #QConNYC

@lizthegrey: Most of the tooling around serverless is completely different from the tooling around containers. We can't use the same tools we use for k8s in the serverless world. Aggregating data into minutes/seconds and smashing it together doesn't work for serverless. #QConNYC

@lizthegrey: We need to be able to store more data, get more insights, and get data on singular requests, rather than being limited by number of custom metrics and number of processes. #QConNYC

@danielbryantuk: "Traditional monitoring focused on deployment on uptime. Now we should be asking what value do you provide to your business?" @ewindisch #qconnyc https://t.co/g91Dh2D1ZW

@lizthegrey: The traditional story centers around deployment and uptime. But what value do we provide to our business, and what do we cost? Can we describe what we save the business? It's hard to show what you mitigated/prevented. #QConNYC

@lizthegrey: Are we *pleasing* our users? This is the metric for if our application is working. We're supposed to test-learn-repeat. Make sure that the changes you're making are resonating with users. #QConNYC

@lizthegrey: One way we can get this data is empowering your data scientists. Even if you're a one-person company, congratulations, YOU are the data scientist. #QConNYC

@lizthegrey: Involve your data scientists the same way that you'd involve operations in software engineering for devops. Make sure data scientists are understanding how users use the application. #QConNYC

@lizthegrey: Key Performance Metrics/Indicators are critical. What is actionable? How many people are using a feature and fall off? #QConNYC

@lizthegrey: The overhead of debugging is irrelevant and will save you in the long term. If you are horizontally scalable with serverless, it's really cheap to run your debugging all the time. You won't run out of CPU #QConNYC

@lizthegrey: Data isn't useful in a vacuum; we can't just throw it into kafka. Be able to correlate the data. If someone compromised your application, what telemetry do you have? You *can't* look at the containers, they're gone already. #QConNYC

@lizthegrey: Can you figure out what's making your database slow by looking at what queries are correlated with the slowness? #QConNYC

@lizthegrey: How many users of our Alexa skills thanked us? How many cursed us? User happiness matters. #QConNYC

@lizthegrey: Application metrics > infrastructure metrics. They both matter, but application metrics are a superset. Knowing if your application is working can't be determined just from the infrastructure metrics. #QConNYC

Serverless Patterns and Anti-Patterns

by Joe Emison

Twitter feedback on this session included:

@danielbryantuk: "Write software that the average developer can support -- because sooner or later..." @JoeEmison #QConNYC https://t.co/xVDeLUZ3KY

@danielbryantuk: Great takeaways on @JoeEmison's #qconnyc serverless patterns and antipatterns https://t.co/c3btCwAuQn

Microservices: Patterns & Practices

Complex Event Flows in Distributed Systems

by Bernd Rücker

Twitter feedback on this session included:

@lizthegrey: @berndruecker 3 hypotheses to discuss today: (1) event-driven architectures decrease coupling, (2) orchestration can be avoided for event systems, and (3) workflow engines are painful and aren't needed with microservices. #QConNYC

@lizthegrey: To simplify, we need to pay for the item, fetch it, and ship it. How do we implement it? We might have bounded contexts of checkout, payment, inventory, and shipment services, implemented as microservices. #QConNYC

@lizthegrey: But we might have a separate set of application processes, infrastructure underlying it, and separate development teams for each microservice. #QConNYC

@danielbryantuk: "Autonomy is a vital theme for microservices" @berndruecker #qconnyc https://t.co/F2Jrvl8jZ7

@lizthegrey: This lets us decouple the request/response dependency. So we have a hammer, everything looks like a nail... can we do the whole chain with events? Checkout broadcasts that an order was placed. #QConNYC

@lizthegrey: Martin Fowler: "it's easy to make decoupled systems with event notification, without realizing that you're losing sight of the larger-scale flow." #QConNYC

@danielbryantuk: "You definitely don't want to turn microservices development into a three-legged race. Teams should be decoupled" @berndruecker #qconnyc https://t.co/ZWckQqs73I

@lizthegrey: We need to have orchestration, it's not evil. Implement Martin Fowler's idea of smart endpoints and dumb pipes. Make the event bus as dumb as possible. #QConNYC

@danielbryantuk: With a hat tip to @samnewman, at #qconnyc @berndruecker states that within a microservice-based system a god service is only created by bad API design, but... "Clients of dumb endpoints easily become a god service" https://t.co/cSbjqiR0OM

@lizthegrey: Smart endpoints potentially must keep transactions running for a long time, rather than immediately handing back success/failure. #QConNYC

@lizthegrey: So we need some kind of state. We can either persist individual things, or we can use state machines. People typically think workflow engines are painful, because they're using the wrong tools. "Death by the property panel." #QConNYC

@lizthegrey: People have built more modern tools (Cadence by Uber, Conductor by Netflix...) #QConNYC

@lizthegrey: There are lightweight open source options (Zeebe, jBPM, Activiti), and some scale horizontally. #QConNYC

@danielbryantuk: "The cloud vendors and silicon valley companies are recognizing the power of orchestration via a workflow engine within a microservices architecture. There are now open source offerings too" @berndruecker #qconnyc https://t.co/MEdUj2teHT

@lizthegrey: So onto our distributed systems, we need to think about failures -- what happens if we never hear back from the request to try the credit card? We need to clean up state. #QConNYC

@lizthegrey: we don't necessarily have ACID transactions, so we need to instead define 'undo'/'compensation' actions for each action that could potentially fail. #QConNYC
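
A minimal sketch of such a compensation ("undo") action, with hypothetical service and method names (the talk did not show code): a step that succeeded earlier is compensated when a later step fails, since there is no distributed ACID transaction to roll back.

```java
// Sketch of a compensation action for a multi-step business transaction.
// Service names and methods are hypothetical, not from the talk.
interface PaymentService {
    String charge(String orderId, long amountCents); // returns a charge id
    void refund(String chargeId);                    // the compensating action
}

final class OrderSaga {
    private final PaymentService payments;

    OrderSaga(PaymentService payments) {
        this.payments = payments;
    }

    void placeOrder(String orderId, long amountCents) {
        String chargeId = payments.charge(orderId, amountCents);
        try {
            ship(orderId);
        } catch (RuntimeException shippingFailed) {
            // We can't roll back a distributed transaction, so we compensate instead.
            payments.refund(chargeId);
            throw shippingFailed;
        }
    }

    private void ship(String orderId) { /* call the shipment service */ }
}
```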

@danielbryantuk: "Workflows live inside service boundaries -- there is no central orchestrator. The beauty of using workflow engines is that you get observability into the business processes" @berndruecker #qconnyc https://t.co/z7WBYBDKxr

@lizthegrey: As a summary, the answer to all three questions is "sometimes". [ed: hah!] [fin] #QConNYC

@danielbryantuk: Great tour de force of using events and (modern) workflow engines by @berndruecker at #qconnyc https://t.co/JaYKKHCPYr

Debugging Microservices: How Google SREs Resolve Outages

by Liz Fong-Jones & Adam Mckaig

Twitter feedback on this session included:

@thenewstack: When you find a new type of performance issue, the temptation is to add a new set of metrics to a dashboard. Most of the time this is not a good idea. Overly busy dashboards can quickly lead to cognitive overload — Google’s @lizthegrey on #microservices debugging #qconnyc https://t.co/V5KJjBhdTg

Designing Events-first Microservices

by Jonas Bonér

Twitter feedback on this session included:

@lizthegrey: @jboner "Before you drink the Kool-Aid, take a step back and think about whether you really need to do microservices." Reason *to* might include scaling your organization. Reasons to not *necessarily* do it are to scale up your system. #QConNYC

@lizthegrey: Many people building microservices wind up building microliths with synchronous, blocking RPCs between microservices directly replacing API calls within the original monolith. And using synchronous datastores. #qconnyc

@lizthegrey: You may have solved the organizational scaling problem, but you haven't gained the additional benefits of microservices. We can do better than that by thinking in events with domain-driven design. #QConNYC

@randyshoup: “The right reason to do Microservices is to scale the organization” @jboner at #QConNYC

@lizthegrey: Modeling events forces you to think about the behavior rather than the structure of the system -- Greg Young. #QConNYC

@danielbryantuk: "When designing microservice systems don't focus on the things; focus on what happens" @jboner #QConNYC https://t.co/meva8EZck0

@lizthegrey: Starting by defining events [ed: important to note: this is different from the definition of event used in the observability space]. An event is a fact of information. They're immutable, and we can disregard/ignore them but not retract/delete them (but can replace old) #QConNYC

@lizthegrey: We can mine the facts to understand causality, etc. -- and "event storming" to crowdsource from experts on the system how the system actually works. #QConNYC

@lizthegrey: We need to catalog intents - when do we share information or transfer control? Do we have state/history/causality on facts? #QConNYC

@lizthegrey: Intents -> Commands, Facts -> Events. Commands are an object form of a method or request. They're an imperative verb e.g. CreateOrder, ShipProduct #QConNYC

@lizthegrey: Reactions = side-effects; events represent something that *has happened* e.g. OrderCreated, ProductShipped. #QConNYC

@lizthegrey: Commands are all about intent, whereas events are intentless. Commands have a single (remote) target, events are targetless and for many observers; forgotten once sent. #QConNYC
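
A small sketch of that command/event distinction using modern Java (16+) records; the type names are illustrative, not from the talk. A command is an imperative request aimed at one handler, while an event is an immutable fact that has already happened and may be observed by many subscribers.

```java
import java.time.Instant;
import java.util.UUID;

record CreateOrder(String customerId, String productId) {}          // intent: "do this"

record OrderCreated(String orderId, String customerId,
                    String productId, Instant occurredAt) {}        // fact: "this happened"

class OrderHandler {
    OrderCreated handle(CreateOrder command) {
        String orderId = UUID.randomUUID().toString();
        // ... persist the new order, then publish the fact for any interested observers ...
        return new OrderCreated(orderId, command.customerId(), command.productId(), Instant.now());
    }
}
```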

@lizthegrey: Our events should define the bounded context and what protocols we use between boundaries. Event-driven services receive and react to facts, and asynchronously publish new facts. #QConNYC

@lizthegrey: Mutable state is okay, but should be contained and non-observable by other parts of the system. #QConNYC

@lizthegrey: We publish facts to the outside world [ed: e.g. the value at time X was value Y] rather than trying to pin down something mutable. #QConNYC

@lizthegrey: The Aggregate is our state persisted to disk; maintains integrity and consistency and becomes our unit of failure/determinism. #QConNYC

@lizthegrey: So what does this look like in practice? A user creates a command which generates an event; the event stream/bus/pubsub relays the events to other event driven services, potentially triggering other actions. #QConNYC

@lizthegrey: This model requires eventual consistency as a basis. Commands are fully asynchronous. [ed: no discussion yet of reliability of this model, and the idea that events might drop...] #QConNYC

@lizthegrey: But what's wrong with CRUD services? They're fine for isolated data, says @jboner, but as soon as you need some kind of cross-service consistency then you have consistency problems and can't join data. #QConNYC

@danielbryantuk: "Strong consistency is the wrong default in distributed systems. We need to embrace reality, and this is often eventually consistent" @jboner #QConNYC https://t.co/gZWGar5iUx

@lizthegrey: Information travels at the speed of light and we'll always have non-zero latency. There is no now, and information is always from the past. #QConNYC

@lizthegrey: Distributed systems are non-deterministic. We live in a scary world where messages get lost, and where systems fail in new ways. [ed: yes! finally :)] #QConNYC

@lizthegrey: We need to model uncertainty and account for it in our business logic. #QConNYC

@lizthegrey: Autonomous components can only promise their own behavior; making everything local improves stability. [ed: this is outsourcing *all* your risk onto your event bus. Make note that GCP's PubSub is 99.95% available, for instance] #QConNYC

@lizthegrey: Think of modeling things as State, Commands, and Events. You'll never fully converge. #QConNYC

@lizthegrey: There is no now, and resilience is by design. [ed: go on, tell me more...] We need to manage failure rather than avoid it. #QConNYC

@danielbryantuk: "A system of microservices is a never ending stream towards convergence" @jboner #QConNYC https://t.co/PM0o83ymnX

@lizthegrey: Good failures are contained to avoid cascading failures, reified into events, signaled async, observed by many, and managed outside the failed context. #QConNYC

@lizthegrey: Events need to be persisted. How do we transition from a CRUD system to events? Atomically double-write both to a table and to the event bus [ed: no details provided on how to do this...]. #QConNYC

@lizthegrey: We don't get full consistency, but we get eventual consistency by subscribing to the event bus. #QConNYC
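
The talk left the "atomic double-write" mechanics open (as noted above); one commonly used technique is the transactional outbox, sketched below with hypothetical table and column names. The entity row and the event row commit in the same database transaction, and a separate relay publishes outbox rows to the event bus.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of the "transactional outbox" technique: write the entity and the event into
// the same database transaction, and let a separate relay publish rows from the outbox
// table to the event bus. Table and column names are hypothetical.
final class OrderRepository {

    void saveAndRecordEvent(Connection db, String orderId, String eventJson) throws SQLException {
        db.setAutoCommit(false);
        try (PreparedStatement insertOrder =
                     db.prepareStatement("INSERT INTO orders(id, status) VALUES (?, 'PLACED')");
             PreparedStatement insertEvent =
                     db.prepareStatement("INSERT INTO outbox(aggregate_id, payload) VALUES (?, ?)")) {
            insertOrder.setString(1, orderId);
            insertOrder.executeUpdate();

            insertEvent.setString(1, orderId);
            insertEvent.setString(2, eventJson);
            insertEvent.executeUpdate();

            db.commit();   // both rows commit atomically; the relay publishes later
        } catch (SQLException e) {
            db.rollback();
            throw e;
        }
    }
}
```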

@lizthegrey: The truth is the log, and the database is the cache of the subset of the log. "It's cheap to store data, why not store the entire log?" [ed: the Privacy/Ethics track members would... disagree with this premise, as well as the latency of going through the entire log] #QConNYC

@danielbryantuk: "Update-in-place strikes systems designers as a cardinal sin..." @jboner #QConNYC https://t.co/ISPDGfsbIm

@lizthegrey: Event sourcing can act as a cure for destructive updates: log all state-changing events to maintain strong consistency and durability. #QConNYC

@lizthegrey: To recover from failures, rehydrate events from the event log and re-run the internal state, and don't run the side effects. [ed: what happens if the side effects didn't run at least once?] #QConNYC

@lizthegrey: We get one source of truth with all history, and smaller durable in-memory state. Avoids in-memory object to stored relational mismatch. #QConNYC
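
A minimal event-sourcing sketch of that replay idea, assuming modern Java (records, sealed interfaces); the aggregate and event names are made up. State is rebuilt by applying past events, while side effects are reserved for handling new commands, never replay.

```java
import java.util.List;

// Sketch: rehydrate an aggregate from its event log. Replay rebuilds in-memory state by
// applying each past event; side effects (emails, payments, ...) only run for new commands.
final class Account {
    private long balanceCents;

    sealed interface Event permits Deposited, Withdrawn {}
    record Deposited(long amountCents) implements Event {}
    record Withdrawn(long amountCents) implements Event {}

    static Account rehydrate(List<Event> history) {
        Account account = new Account();
        history.forEach(account::apply);   // pure state transition, no side effects
        return account;
    }

    private void apply(Event event) {
        if (event instanceof Deposited d) balanceCents += d.amountCents();
        else if (event instanceof Withdrawn w) balanceCents -= w.amountCents();
    }

    long balanceCents() { return balanceCents; }
}
```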

@lizthegrey: Another pattern to deploy is CQRS (https://t.co/n28mfZwoRK) for separating reads and writes. #QConNYC

@wesreisz: Genius... @jboner comparing the event log to an accounting ledger. Would you ever destroy data by overwriting it in a financial ledger? Why do we keep doing it with CRUD? #QConNYC

@lizthegrey: Time travel lets us do historical debugging, auditing, failure recovery, and replication all for free. #QConNYC

@lizthegrey: Key takeaways: use event-first design to modernize and reduce risk. Event logging avoids CRUD/ORM and lets us retain history, balancing strong/eventual consistency. you can start with https://t.co/3RpnUSLJyM or read https://t.co/Q8BdgPjWVw [fin] #QConNYC

Design Microservice Architectures the Right Way

by Michael Bryzek

Twitter feedback on this session included:

@danielbryantuk: "Sometimes the quality of software architecture is only revealed after several years of an application being worked on" @mbryzek #QConNYC https://t.co/BMiUneKnqJ

@danielbryantuk: Common microservices misconceptions, courtesy of @mbryzek at #QConNYC https://t.co/LI2J1xEHcR

@philip_pfo: “Automation tooling enables everyone to benefit from specialist expertise without needing to be a specialist” @mbryzek #QconNYC

@danielbryantuk: "At Flow we have a single developer CLI tool that contains all of our automation and workflow utils" @mbryzek #QConNYC https://t.co/T6BrRMZwMD

@danielbryantuk: "Continuous delivery is a prerequisite to managing microservices architecture. This should be 100% automated, and 100% reliable" @mbryzek #QConNYC https://t.co/roHANR6ZH6

@danielbryantuk: Critical decisions with a microservice architecture, via @mbryzek at #QConNYC https://t.co/3PNrqIev1I

No Microservice Is an Island

by Michele Titolo

Twitter feedback on this session included:

@lizthegrey: "What I noticed is that nobody really defined what a microservice is, yet we've been hearing about distributed systems... microservices are a distributed system." -- @micheletitolo #QConNYC

@lizthegrey: We do it for speed, safety, and to cut costs, even though there are sometimes costs associated with getting started. #QConNYC

@lizthegrey: We all start off with all the pretty straight-looking lines, but it winds up getting messy over time. The more pieces and connections you add, something will probably go wrong. #QConNYC

@lizthegrey: How do we figure out when something goes wrong? And once you've figured it out, how do you fix the problems. We have new challenges we didn't have even in a monolithic world. #QConNYC

@lizthegrey: They're about more than the size of the application. You need an ecosystem. We need to adapt our applications and/or infrastructure. #QConNYC

@lizthegrey: If your infra and tooling and deploys aren't there, you'll always be playing catchup. Like having your deploys take hours. #QConNYC

@lizthegrey: Three key areas to create that foundation: Deployment, Scaling, and Debugging. #QConNYC

@lizthegrey: What are the deployment best practices? Small changes, frequent releases, and consistent releases that are standardized and have supported tooling (no manual releases). #QConNYC

@lizthegrey: Limit your number of special snowflakes. Two quickly leads to three or four. People see what things you allow to happen inside your system. #QConNYC

@lizthegrey: Invest in your deployment tooling. Automate as much as possible, don't have humans touching your deploys. #QConNYC

@lizthegrey: Do staged deployments, so that you can feel confident that things are going right. Canarying requires having a routing service upstream, and deploying to an instance serving a small percentage of traffic first. #QConNYC

@lizthegrey: And then you roll out to more and more servers, and then the old version is gone. You know that it can handle your production traffic/ecosystem and everything is good. #QConNYC

@lizthegrey: The other alternative is blue/green or red/black, which has two parallel deployments of the application and duplicates the entire environment. It costs twice as much to do. #QConNYC

@lizthegrey: We do a cutover between the old and new systems. Both of these techniques allow doing automatic rollbacks. How do we know whether a deployment is successful? #QConNYC

@lizthegrey: We need to have features and tools to validate the success of our system -- not just the one service. Robust unit/integration tests are needed. "You need to test the space between." -- @micheletitolo #QConNYC

@lizthegrey: Frequent deployments mean frequent testing means good tooling. You need standardized healthchecks. Everything should use the same technique (port, url, contents) #QConNYC
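
As an illustration of a standardized healthcheck (the port, path, and response body here are assumed conventions, not from the talk), a tiny endpoint built on the JDK's bundled HTTP server lets every service expose the same check so the platform can aggregate health without per-service logic:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch: a standardized healthcheck endpoint. Every service uses the same port, path,
// and response shape, so tooling can aggregate health checks uniformly.
public final class HealthCheckServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/healthz", exchange -> {
            byte[] body = "{\"status\":\"UP\"}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```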

@lizthegrey: Consume your dependencies and secrets without recompiling, so that you're able to push the same binaries without modification. #QConNYC

@lizthegrey: Onto the system: we need to be able to aggregate healthchecks. You shouldn't need to log into a bunch of servers to figure out if they're working. #QConNYC

@lizthegrey: You need your system to be proactive about alerting you when things are broken. You still can get a coffee, but you might be paged to let you know an automatic rollback happened and you need to investigate. #QConNYC

@lizthegrey: Everyone needs to be able to see the status of deployments of their own team and of other teams. #QConNYC

@lizthegrey: Report progress to the scaling automation to tell it that your individual application is ready to be terminated. 0% CPU is not a good indication. #QConNYC

@lizthegrey: What do we mean by routing traffic? Loadbalancer or service discovery. All public cloud providers have loadbalancing available. Make sure you know how to use it; it's easier than running your own. #QConNYC

@lizthegrey: You can also attach scaling groups with the public cloud's products if you want. #QConNYC

@lizthegrey: For service discovery, we route via convention; we can get from an unknown state of errors that we can't interpret (network?) to a known state of *why* the error happened (e.g. the target service was unavailable) #QConNYC

@lizthegrey: Scaling only solves so many problems. Onto troubleshooting. First we need to know there's a problem. #QConNYC

@lizthegrey: We need to know what qualifies as a problem. Not every exception or timeout matters. It may be unactionable. #QConNYC

@lizthegrey: Nuisance pages suck. It varies per application. RAM, CPU, Latency [ed: :( to RAM/CPU], but you may want less obvious things e.g. if your service has scaled 7 times in the past hour, you might have a memory leak. #QConNYC

@lizthegrey: You can also alert on your Key Performance Indicators or SLAs [ed: yup, this is what I advocated]. Alerts are for known issues that we can think of in advance. #QConNYC

@lizthegrey: You also need dashboards to be able to see in aggregate what is going on in your system. Humans make better connections when they visualize data. #QConNYC

@lizthegrey: Standard debugging 101 -- look at the application causing the page, and pretend SSH doesn't exist. Use your logs (you did set up collection/aggregation, right?), and potentially increase log levels on the fly. #QConNYC

@lizthegrey: If it's going to take time to fix, escalate. Avoid cascading failures. The failure of one application shouldn't bring everything down. #QConNYC

@lizthegrey: Identification: figure out what parts of your system depend upon each other. You need request tracing, so that we can follow requests through the system. #QConNYC

@lizthegrey: Works best as an overlay. Envoy or OpenTracing. Even adding the header manually is better than nothing. #QConNYC

@lizthegrey: Isolation: we need circuit breaking to drop queries if latencies or errors increase. It isolates services to return them to a known state. #QConNYC

@lizthegrey: Circuit breakers let our application recover while people are debugging, rather than hammering people with peak traffic. #QConNYC
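
A hand-rolled sketch of the circuit-breaker idea described above (real systems would more likely use a library such as Hystrix or Resilience4j); the threshold and open interval are illustrative. After too many consecutive failures the circuit "opens" and calls fail fast, giving the downstream service time to recover.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal circuit breaker: trips after a number of consecutive failures and fails fast
// for a cool-down interval before letting traffic through again.
final class CircuitBreaker {
    private final int failureThreshold;
    private final Duration openInterval;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    CircuitBreaker(int failureThreshold, Duration openInterval) {
        this.failureThreshold = failureThreshold;
        this.openInterval = openInterval;
    }

    synchronized <T> T call(Supplier<T> downstream) {
        if (openedAt != null && Instant.now().isBefore(openedAt.plus(openInterval))) {
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = downstream.get();
            consecutiveFailures = 0;    // success closes the circuit again
            openedAt = null;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now();   // trip the breaker
            }
            throw e;
        }
    }
}
```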

@lizthegrey: So now we think we know what's gone wrong, and can deploy a fix. Slowly ramp your traffic. #QConNYC

@lizthegrey: Scaling up can ideally be done automatically with automation. Built-in loadbalancers will do it for you, or you can do it manually if necessary. #QConNYC

@lizthegrey: What happens if the problem isn't one of our apps and instead is an external dependency? External: can't see logs, or can't see source, or we can't deploy a fix on our own. #QConNYC

@lizthegrey: If you can see everyone's source code and can make a PR, but can't commit or deploy, even within the same company, it's the external case. #QConNYC

@lizthegrey: Distributed systems are constantly changing. Scaling happens when our application's load varies. We need health tracking to see our load. #QConNYC

@lizthegrey: Look at things like RAM, CPU, and Latency. But you also need custom metrics such as the queue length. How do we scale? Automation. Please don't hand-scale your systems. #QConNYC

@lizthegrey: Smart systems enable us to scale up when under pressure, and down when the resources are no longer needed. #QConNYC

@lizthegrey: Your metrics can trigger automatic scaling. Scaling up and down are different cases -- for up, we're just deploying more instances, waiting for them to be healthy, and sending traffic to them. #QConNYC

@lizthegrey: For scaling down, detect when instances aren't being fully used to save money. Stop routing traffic to the server and gracefully shut down. #QConNYC

@lizthegrey: Check the status page or monitor it; but you may need to raise an issue if nothing is posted yet. But we can also mitigate by figuring out who to talk to with request tracing to find the failure, or circuit breaking/degrading gracefully. #QConNYC

@lizthegrey: "Do as much as you can to keep as much as you can working." -- @micheletitolo #QConNYC

@lizthegrey: But sometimes everything breaks under your infrastructure (e.g. AWS S3 Outage in 2017). Hopefully you can mitigate, but you should have at least some degree of error handling. "S3 always works"... until it doesn't. #QConNYC

@lizthegrey: To recap, debugging internal apps requires logging, tracing, and circuit breaking. For external apps, trace and circuitbreak. #QConNYC

@lizthegrey: All of these things share in common the issue of visibility. It's harder to see what's going on in large distributed systems and you can't observe what you don't see. #QConNYC

@lizthegrey: "Observability is not free." -- @micheletitolo #QConNYC

@lizthegrey: You need health checks, circuit breakers, logging, alerting, and the ability to shut down gracefully. Your infra should consume healthchecks, do circuit breaking, loadbalancing, log aggregation [and specific log reading], and automated deploys/rollbacks. #QConNYC

@lizthegrey: And have lots of dashboards [ed: or an interactive querying system instead of too many dashboards] #QConNYC

@lizthegrey: Running microservices successfully requires smarter infrastructure. Ending with a @krisnova quote: Infrastructure won't evolve on its own. [fin] #QConNYC

@lizthegrey: Answering an audience question, @micheletitolo says that change can be incremental -- identify gaps and figure out what to automate or measure first based on your pain points. #QConNYC

@lizthegrey: Another audience question: how to monitor your circuit breakers. @micheletitolo says that open source tools like Hystrix can show you a control plane and dashboards for all of your circuits. #QConNYC

@lizthegrey: On the subject of memory leaks: if you can roll back, roll back, otherwise you can either spend a ton of money or go down until you can get a fix into place [or do rolling restarts every few hours if it's a slow leak]. #QConNYC

Modern Java Reloaded

Effective Java, Third Edition - Keepin' it Effective

by Joshua Bloch

Twitter feedback on this session included:

@charleshumble: "The best way to think about type inference in Java is that it is magic. It takes a whole chapter in TLS, is extraordinarily complex and basically, no one knows how it works." @joshbloch #QConNYC

@charleshumble: "The rule for when to use raw types is don't" @joshbloch #QConNYC

@charleshumble: #Java Lambdas lack names and documentation - they should be self-explanatory, and they should not exceed a few lines; one is best. If a lambda would be long or complex, extract it to a method and use a method reference. @joshbloch #QConNYC
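
A small illustration of that advice (my example, not Bloch's): a multi-line lambda has no name or Javadoc, so extract it to a method and pass a method reference instead.

```java
import java.util.ArrayList;
import java.util.List;

// Instead of sorting with a long anonymous lambda, give the logic a name and docs.
public final class LambdaLength {

    // Instead of: words.sort((a, b) -> { /* several lines of scoring logic */ });
    static int byVowelCount(String a, String b) {
        return Integer.compare(countVowels(a), countVowels(b));
    }

    static int countVowels(String s) {
        return (int) s.chars().filter(c -> "aeiou".indexOf(c) >= 0).count();
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(List.of("queue", "rhythm", "idea"));
        words.sort(LambdaLength::byVowelCount); // the name and docs live on the method
        System.out.println(words);              // [rhythm, idea, queue]
    }
}
```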

Invest in Your Java Katalogue

by Don Raab & Aditi Mantri

Jeanne Boyarsky attended this session:

Links to Katas

  • Original Katas – http://codekata.com
  • On GitHub: https://github.com/BNYMellon/CodeKatas

Kata

  • Hands on programming exercises to hone your skills through practice
  • Styles
    • Refactor – code and tests should keep passing
    • Fix the code – write code to make tests pass
    • Fix the tests – write tests using API
    • Sandbox – free form…

How to build a kata

  • Identify what you want to learn (e.g. a library feature)
  • Design a problem to solve. Write unit tests demonstrating how the feature works
  • Implement the code
  • Add helpful comments and hints so it becomes standalone
  • Delete the parts of the code that you want someone to learn (a minimal example follows this list)
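
A minimal "fix the code" style kata sketch, assuming JUnit 5; the exercise and names are made up. The test documents the feature being practised, and the implementation has been deliberately hollowed out for the learner to fill in.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.api.Test;

// Tiny kata: the test specifies the behavior; the stub is what the learner implements.
class GreetingKataTest {

    // Kata: delete this stub's exception and implement it, e.g. with String.join or streams.
    static String greet(List<String> names) {
        throw new UnsupportedOperationException("your kata starts here");
    }

    @Test
    void greetsEveryoneInOrder() {
        assertEquals("Hello Ada, Grace and Barbara!",
                greet(List.of("Ada", "Grace", "Barbara")));
    }
}
```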

Modern User Interfaces: Screens and Beyond

Rethinking HCI With Neural Interfaces @CTRLlabsco

by Adam Berenzweig

Jeanne Boyarsky attended this session:

Intro to Neural Interfaces

  • Interface devices that translate muscle movement into actions
  • Human input/output has high bandwidth compared to typing or the like. We think faster than we can relay information; output is the constraint.
  • Myo – for an amputee, electrodes on the arm control a prosthetic arm.
  • Neural interfaces capture the information you would have sent to a muscle or physical controller
  • Lots of stuff happens in the brain, but you don’t want all of it. You want the intentional part without having to filter out everything else. The motor cortex controls muscles, so it represents voluntary control. Also, you don’t have to implant electrodes in the brain.

Examples

  • Touch type without keyboard presence [not very practical as it is hard to touch type without seeing keys]
  • Mirrors intention of moving muscles even if physical attempt is blocked
  • VR/AR – more immersive experience

Designing for Neural Interfaces

  • Want to maximize control/minimize effort
  • Cognitive limits – what can people learn/retain
  • A mouse is two degrees of freedom, a laser pointer is three. There is also six-degree-of-freedom control in space. The human body has more than six degrees of freedom. Are humans capable of controlling an octopus?
  • How efficient is the input compared to existing control devices?
  • It is possible to control three cursors at once, but it is exhausting. Not a good design
  • Different people find different things intuitive. Which way is up?
  • Don’t translate existing UIs. Can evolve over time.

Smart Speakers: Designing for the Human

by Charles Berg

Jeanne Boyarsky attended this session:

Smart speakers

  • Unlike smart phones, they are fixed in space. Direct voice to it.
  • Place it near where plan to use it.
  • That usage leads to context…..

Calling use case

  • Call dentist – should be seamless
  • Call Walgreens – which one?
  • Hands free calling for friends is frequent

Process

  • Understand context.
  • In a medium to large company, there is a lot of research already going on.
  • Find archived research. Don’t need to do from scratch.
  • Interview – start internal to team, then friends/family
  • Quickly need to expand interview to be more broad.
  • Pitch teammates on ideas based on research
  • Identify lead designer. Then identify themes (summary of research), brainstorm and create user journey map.
  • Physical user testing. Made two rooms with a mattress instead of just talking about it.

Real World Security

Making Security Usable: Product Engineer Perspective

by Anastasiia Voitova

Twitter feedback on this session included:

@charleshumble: It is 2018 and here I am on stage talking about naming, but naming things unambiguously is hard. For example proxy is an easy name for us but, it turned out, not for our customers. We wrote a lot of docs, and we thought it was obvious, but it wasn’t. @vixentael at #qconnyc

@charleshumble: Listening to @vixentael at #qconnyc talking about UX for security products reminds me a bit of some of the conversations I've had with @derekpearcy over the years - making security products easy to use and understand is a really hard problem.

@charleshumble: Short feedback cycles were incredibly helpful in terms of making our product useable for non-security experts. It allows our customers to adopt faster, make fewer mistakes, become less frustrated. For us, the key is making user-facing decisions. @vixentael at #qconnyc

@charleshumble: Making security usable is not the same as making it artificially over-simplified. @vixentael at #qconnyc

@micheletitolo: Making security tools usable increases security #QConNYC https://t.co/QIHhYwNlQE

@micheletitolo: It's hard to gather metrics about security tools... because they are secure. Sitting with customers and helping with integrations was the best way to get data. #QConNYC

@micheletitolo: First big issue was deployment. Multiple servers needed to be deployed and configured. Biggest win: docker-compose support and secure-by-default configurations. #QConNYC

@micheletitolo: Next, integration. Developers wanted to have libraries in their application languages. Even though the libraries were tiny, creating them made integration easier. #QConNYC

@micheletitolo: With secure-by-default it decreased the number of decisions that developers needed to make. The ones implementing their tool often weren't security experts #QConNYC

@micheletitolo: Along with integration was having just the right amount of docs. They split them to focus in two areas: 1. integration, 2. security. #QConNYC https://t.co/MkpGcufE0c

@micheletitolo: They also changed the way some things were named. By doing this, people understood the purpose of different parts of the system without needing to ask follow up questions #QConNYC

@micheletitolo: And finally, one of my favorite points: "Making things usable does not mean over-simplifying them. Customers aren't stupid, they are in fact very smart." This completely changed how they developed tools #QConNYC https://t.co/nkuP6eeBOf

Ask Me Anything and Open Space

AMA w/ Joshua Bloch

Jeanne Boyarsky attended this session:

Law

  • Idea/expression dichotomy – words are copyrightable but not ideas. Methods of operation are not copyrightable, like QWERTY
  • Ideas may be protected by patents, so they can’t be reimplemented for 20 years. Ex: +++ on the Hayes modem
  • There is now a legal strategy of including a patent claim alongside a copyright claim so the case goes to a more desirable court
  • Technically, implementing your own list is a problem, but selective enforcement implies it won’t be enforced.

Microsoft vs Sun

  • They had signed an agreement and fought over it
  • No broader implications

APIs

  • How do you document scalability, performance and other non-logical constraints? Better to specify them, even if implementations get to choose – that way it is specified what callers can’t rely on.
  • Some requirements/behavior are in the spec when you buy a product, not just in the API. Ex: a copy machine can make X copies a second, while the API is "press button to make a copy".
  • Balancing act – you don’t want implementation details in the API that might change, as callers will rely on them. But if there isn’t enough detail, developers assume/infer

Java

  • Eventually languages keel over from their own weight. That's a good thing – just like people, they're not meant to live forever.
  • Need to decide what to learn
  • Need to decide which API to use.
  • Makes room for new entrants when people can’t deal with current languages
  • It's not just a language – also a VM, libraries, etc. It is much harder to move away from a platform; moving off the JVM is harder than changing from C to Java.
  • Josh’s favorite editors – IntelliJ and Emacs
  • Type erasure is present so generics could be migration compatible (vs. a whole new collections library)
  • As languages age, some decisions make sense only due to history and past decisions.
  • Josh wishes Java supported unsigned integers and especially bytes. Gosling designed by gut originally and was usually right; this was a bad call. He felt it would have been unneeded complexity. The solution is a library, since it's not in the language
  • Josh would have added methods to return an arbitrary element and leave it in (or remove it from) a collection.
  • Josh would have returned the Collection itself instead of a boolean status; that would have allowed fluent calls
  • Josh expects modules to get used mainly within the JDK

Java and six months release cycle

  • Platforms require stability. Can’t have major changes each 6 months.
  • Lots of stuff is in there and not used
  • Books are released less frequently than the release cycle, even when the cycle was more reasonable.
  • Changes are minuscule, so not worth updating for each one

Josh’s future

  • Josh teaching OO and APIs courses next semester
  • APIs course may turn into book on API design

Mob Programming Mini Workshop

by  Harold Shinsato

Jeanne Boyarsky attended this session:

Concepts

  • “All the brilliant people working on the same thing, at the same time, in the same space, on the same computer” – Woody Zuill
  • Turn up the good.  A team would rotate who types to solve the problem.
  • Far fewer bugs/tech debt
  • More productive than working individually [that’s not the right comparison]
  • Harold uses it at Montana Code School.
  • Requires kindness, consideration and respect. Takes time to learn. Psychological safety and empathy are important.
  • The team decides what to do and how to do it. Can’t force it. Just like forced pair programming doesn’t work.
  • Important to go in with plan to experiment and see what works.
  • Mobs of three good for coaching. Hunter uses mobs of 5-8.
  • Rotation time varies. Can switch every 15 minutes or just have one driver.
  • Mob worked with 50 people with language puzzles to learn/work through koans.
  • Introverts need to check out regularly. And that is ok (for mobs larger than 5) because flow continues. A teammate catches up on what you miss.
  • Gets people up to speed faster.

Roles

  • Coach
  • Driver – typist. Can’t have ideas. Just types. Smart input device.
  • Navigator – everyone else. Helpful to start with one navigator so people can practice not talking when they know the answer. Need others to practice leadership and develop expertise.

Problems with pairing

  • Like a date. Awkward if doesn’t work out.
  • Mobbing is safer because in a group. No one person is a single point of failure.
  • Mobbing makes it easier to pair later. Harold uses mobbing before students pair

Strong pairing – where person at keyboard can’t make decisions…

In practice

  • Supposed to improve team in long run
  • If someone knows most, they are primary navigator and turn into a leader – helping others.
  • Get better at using tools
  • Mobs of 8 good for teaching
  • Hunter has the most experience with mobs. They switch every 15 minutes.
  • Can speak at higher level if driver understands.

Opinions about QCon

@CGuntur: @jeanneboyarsky @qconnewyork #qconnewyork has consistently added subtle efforts that show they care a lot about both speakers and attendees. One of the best conferences I have been to. #QConNYC

@pasku1: Thank you @QCon for the amazing conference. Such an awesome learning opportunity. #QConNYC

@jeanneboyarsky: glad #qconnyc does videos. i really wanted today’s keynote, but had to be at work. watching it now and learning a lot about NYC history!

@hwilson1204: @susheelaroskar you made the 1 minute recap! https://t.co/3aWryHtqgY

Takeaways

Atomist’s takeaways were:

Two big takeaways for us: 1) everyone is struggling to figure out delivery at scale (hint: more pipelines isn’t the answer), 2) people really grok the idea that your delivery needs to be event-driven (clearly we agree!).

A lot of people have hand-crafted their own delivery by cobbling together tools they already have around the house. Everything is held together with string and duct-tape, but when you grow up with it, it just seems normal. But there’s a better way!

Takeaways from QCon New York included:

@lizthegrey: This is my number one takeaway from today I think. https://t.co/6WbwcuGuiD

@corradoi: #QConNYC Take Away from great talk from Netflix operation Seth Katz. on real time root cause prediction. In a nutshell use statistic, maths and basic algorithms before going for fancy ai frameworks and Machine Learning. Do fundamentals before trying tricks you cannot manage!

@danielbryantuk: "APIs are the glue that connects the digital universe. We need to be able to reimplement these." @joshbloch #QconNYC https://t.co/lt1xRJavQX

@thoweCH: "When you start modeling events, it forces you to think about the behavior of the system. As opposed to thinking about the structure of the system.” - Greg Young https://t.co/rUBBcbRV90

@jxson: "An optimistic disaster plan is a useless disaster plan" https://t.co/9o6CoeziYV

Conclusion

QCons are produced by InfoQ.com. Our focus on practitioner-driven content is reflected in the fact that the program committee that selects the talks and speakers is itself comprised of technical practitioners from the software development community.

Our next English QCon is in San Francisco November 11-19th.
