QCon San Francisco 2024 Day 1: Architectures, Rust, AI/ML for Engineers, Sociotech Resilience

Day One of the 18th annual QCon San Francisco conference was held on November 18th, 2024, at the Hyatt Regency in San Francisco, California. Key takeaways included: the power of calculated risks; debunking the limitations of using Rust for web applications; challenges in re-architecting the Slack platform; and architecting real-time systems around a mainframe.

What follows is a summary of the keynote address and highlighted presentations.

Keynote Address: Dare Mighty Things: What NASA's Bold Endeavors Teach Us about the Power of Calculated RISCs

Khawaja Shams, co-founder and CEO at Momento, presented his keynote address entitled, Dare Mighty Things: What NASA's Bold Endeavors Teach Us about the Power of Calculated RISCs. After a short video on the landing of the Mars Curiosity Rover in August 2012, Shams kicked off his keynote by stating that his presentation was all about the power of calculated "RISCs." He maintained that we all take risks, even with something as seemingly trivial as attending a multi-day conference: upon returning to work, developers usually face a backlog that requires a significant amount of time, along with decisions that were made in their absence. "Fear is the primary reason we avoid taking risks," Shams maintained, and that fear can keep us from championing our personal projects at work. The challenge is how to overcome the fear of taking risks.

Using NASA's Jet Propulsion Lab (JPL) as an example, Shams discussed the risks that the JPL had to take to make the Mars Curiosity Rover landing a success.

Shams stated that we should ask ourselves if the risk is worth it. Projects can take a significant amount of time to come to fruition. For example, it took over nine months for the Mars Curiosity Rover to reach Mars and the maximum latency for sending a message from Earth to Mars is over 22 minutes.
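
The 22-minute figure can be sanity-checked with a short calculation. The sketch below assumes a maximum Earth-Mars separation of roughly 401 million km, a commonly cited approximation that does not come from the keynote; the function name is illustrative.

```rust
/// Speed of light in vacuum, in kilometres per second.
const SPEED_OF_LIGHT_KM_S: f64 = 299_792.458;

/// Assumed maximum Earth-Mars separation in kilometres (approximate,
/// not a figure from the keynote).
const MAX_EARTH_MARS_KM: f64 = 401_000_000.0;

/// One-way signal delay, in minutes, for a signal travelling at the
/// speed of light over `distance_km`.
fn one_way_delay_minutes(distance_km: f64) -> f64 {
    let seconds = distance_km / SPEED_OF_LIGHT_KM_S;
    seconds / 60.0
}
```

Under that assumption, one_way_delay_minutes(MAX_EARTH_MARS_KM) comes out slightly above 22 minutes, which is consistent with the latency quoted in the keynote.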

We need to think bigger, as it's easy to fixate on what can go wrong. Instead, we should balance that with what can go right. During a 2015 shareholder meeting, Jeff Bezos, founder and CEO at Amazon, said:

I believe we are the best place in the world to fail (we have plenty of practice!), and failure and invention are inseparable twins.

Bezos believed that failure is a necessary part of the innovation process and that Amazon's culture encourages experimentation and risk-taking.

The JPL took such a risk by using the RAD750, a radiation-hardened 200 MHz processor based on the PowerPC 750 that was prevalent in Apple computers built in the 1990s.

On Mars, the Jezero Crater, located in the Syrtis Major quadrangle, is about 45.0 km in diameter and was believed to have once been flooded with water. NASA sent the Mars 2020: Perseverance Rover equipped with Terrain Relative Navigation, a new landing technique designed to handle Jezero's rough terrain.

Shams stated that we need to ask ourselves about the problem that we are trying to solve. He discussed the two-way door decision, introduced by Bezos to Amazon employees in 1997, about which Bezos wrote:

If you've made a suboptimal Type 2 decision, you don't have to live with the consequences for that long. You can reopen the door and go back through. Type 2 decisions can, and should, be made quickly by high judgment individuals or small groups.

This concept refers to a business decision that is easily reversible if the initial decision doesn't work out. Shams demonstrated how the Hexacopter and Dragonfly drones were good examples of this.

We need to be wrong...a lot, Shams stated, as we need to embrace failure. Quoting Bezos once more:

If you're good at course correcting, being wrong may be less costly than you think, whereas being slow is going to be expensive for sure.

Shams concluded with Galileo and reminded everyone that while he didn't invent the telescope, he actually improved it.

Highlighted Presentations: Rust for Databases | Re-Architecting Slack | Architecting Real-Time Systems around a Mainframe

Rust: a Productive Language for Writing Database Applications was presented by Carl Lerche, principal engineer at AWS and author of the Tokio library for Rust. Lerche kicked off his presentation by enumerating how Rust is commonly characterized: performance on par with C++; the fastest growing programming language (according to SlashData); used (mostly) at the infrastructure level; good for high-quality code; and regarded as not as productive as other languages. He also maintained that developers should choose the programming language that is best suited for the job. However, this isn't necessarily an easy thing to do.

Lerche responded to a question asked in a Hacker News post, "Why waste your time with Rust's type system if you are building a web app?" by asking whether its premise was true. He acknowledged that Rust is perceived to be harder to learn, but maintained that developers should adopt Rust for performance and keep it for productivity. With this in mind, Lerche then introduced Are we web yet?, a website that describes how Rust is a viable language for web application development, and Toasty, a new ORM for Rust that supports both SQL and some NoSQL databases. Armed with these resources, Lerche maintained, developers can find Rust easier to learn.

He then discussed what he referred to as the hard parts of Rust, namely Traits and Lifetimes, and recommended avoiding them, along with their corresponding first- and second-order bounds, where possible. In particular, he prefers enums over Traits. Lerche provided code examples to demonstrate the pitfalls of using Traits and maintained that "the more complicated the trait bound, the harder it is to see the value for using them."
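
The enum-over-trait preference can be illustrated with a minimal sketch. This is not one of Lerche's examples; the Value trait, the SqlValue enum and the render functions are hypothetical names chosen for illustration. The trait version forces every caller to carry a generic parameter and its bound, while the enum version is a closed set of variants that can be passed around as one plain type.

```rust
// Trait-based version: each function that accepts a value must be
// generic and repeat the bound.
trait Value {
    fn to_sql(&self) -> String;
}

struct IntValue(i64);
struct TextValue(String);

impl Value for IntValue {
    fn to_sql(&self) -> String {
        self.0.to_string()
    }
}

impl Value for TextValue {
    fn to_sql(&self) -> String {
        // Double any single quote so the literal stays well-formed.
        format!("'{}'", self.0.replace('\'', "''"))
    }
}

fn render_generic<V: Value>(column: &str, value: &V) -> String {
    format!("{} = {}", column, value.to_sql())
}

// Enum-based version: one concrete type, no bounds to thread through.
enum SqlValue {
    Int(i64),
    Text(String),
}

impl SqlValue {
    fn to_sql(&self) -> String {
        match self {
            SqlValue::Int(n) => n.to_string(),
            SqlValue::Text(s) => format!("'{}'", s.replace('\'', "''")),
        }
    }
}

fn render(column: &str, value: &SqlValue) -> String {
    format!("{} = {}", column, value.to_sql())
}

// A mixed list of values needs neither generics nor trait objects.
fn render_all(pairs: &[(&str, SqlValue)]) -> Vec<String> {
    pairs.iter().map(|(column, value)| render(column, value)).collect()
}
```

The trade-off is that an enum is closed: adding a new kind of value means editing the enum and its match arms, whereas a trait can be implemented by code outside the crate.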

Lerche concluded by saying Rust can be a productive general purpose language, but there is still some work to actually get there. He also recommended watching this closing keynote from RustConf 2018.

More details from this presentation may be found in this InfoQ news story.

Changing the Model: Why and How We Re-Architected Slack was presented by Ian Hoffman, staff software engineer at Slack. Hoffman kicked off his presentation by comparing the Geocentric Model of the Solar System with the Heliocentric Model of the Solar System. Using the Geocentric Model, astronomers were able to predict, with 90% accuracy, the positions of other planets. However, they also observed that some planets appeared to move backwards. Adding epicycles explained this backward motion but made the model more complex. This tells us that subpar architectures may work for a long time, but they will become increasingly complex. An improved architecture will simplify complex problems, and Hoffman maintained that we should always question our own models.

Hoffman described Slack's V1 Architecture, which was introduced in 2013. Otherwise known as the Workspace Model, it allowed each customer one workspace, which contained data such as users, channels, messages and apps. Each workspace was a closed system. Hoffman then discussed the architectural implications and consequences of this architecture.

The V2 Architecture, codenamed the Enterprise Grid, delivered new features: enterprise customers could have many workspaces; enterprise users could belong to many workspaces; and data could be shared across multiple workspaces. These features were, however, limited to the enterprise. Hoffman discussed the architectural implications and consequences of this architecture as well.

The V3 Architecture, codenamed the Unified Grid, allowed customers to see everything they may access (within the enterprise) in one view. The workspaces still determined the access control. Architectural implications included: the API token no longer determines the workspace shard but instead determines the current org; and the server checks the org and workspace shards, but limits its checks to the workspaces to which the user belongs.
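
A minimal sketch of that shard-selection rule follows. All types and names here (ApiToken, Membership, Shard, shards_to_check) are hypothetical and are not Slack's actual data model; the sketch only assumes that a token identifies an org and a user, and that workspace membership is known to the server.

```rust
use std::collections::HashSet;

/// A shard the server may need to consult (hypothetical representation).
#[derive(Debug, PartialEq)]
enum Shard {
    Org(u32),
    Workspace(u32),
}

/// In this sketch the token resolves to an org, not to a single workspace.
struct ApiToken {
    org_id: u32,
    user_id: u32,
}

/// One row of a hypothetical user-to-workspace membership table.
struct Membership {
    user_id: u32,
    workspace_id: u32,
    org_id: u32,
}

/// Returns the org shard followed by the shards of the workspaces the
/// token's user belongs to within that org, without duplicates.
fn shards_to_check(token: &ApiToken, memberships: &[Membership]) -> Vec<Shard> {
    let mut shards = vec![Shard::Org(token.org_id)];
    let mut seen = HashSet::new();
    for m in memberships {
        let relevant = m.user_id == token.user_id && m.org_id == token.org_id;
        // `insert` returns false if the workspace was already recorded.
        if relevant && seen.insert(m.workspace_id) {
            shards.push(Shard::Workspace(m.workspace_id));
        }
    }
    shards
}
```

Workspaces the user does not belong to never appear in the result, which mirrors the point that the workspaces still determine access control.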

Slack built a prototype of the Unified Grid to better understand its feasibility. The prototype was used for day-to-day activities and was designed so that it could easily be turned on and off via a button. Slack employees were encouraged to use the prototype and provide feedback.

Hoffman concluded with these key takeaways:

  • The org diagram is more complex than the Unified Model, but they discovered benefits.
  • "Explicit is better than implicit."
  • The fastest API call is the one that doesn't happen.
  • We should prototype to get buy-in.
  • Johannes Kepler's model wasn't perfect, but it allowed Galileo to make improvements.
  • Ask the big questions: is our architecture serving us?

The rollout of the Unified Grid started in September 2023 and completed in March 2024. Upgrades of pre-Unified Grid mobile clients were completed in October 2024.

Legacy Modernization: Architecting Real-Time Systems around a Mainframe was presented by Jason Roberts, lead software consultant at Thoughtworks, and Sonia Mathew, director, Product Engineering at National Grid. Roberts and Mathew kicked off their presentation with a Venn diagram containing these intersecting concepts: Domain-Driven Design; Team Topologies; Event-Driven Architecture; and Change Data Capture. All of these concepts were introduced and discussed throughout their presentation.

The Unified Web Portal (UWP), a self-service site for National Grid customers, features billing and the ability to make payments, schedule service and display energy usage. It wasn't always unified, however, since a core problem was separate portals for gas and electricity running on multiple mainframes. A proposed solution to unify the portal was to migrate the data to a database, but this created challenges that included: an ETL error rate that led to data quality issues; ETL runs only a few times per day, which led to data freshness issues; and a single point of failure.

A more emergent problem, however, was the realization that the Extract, Transform, Load (ETL) pipeline was slow, expensive and batch-oriented. A proposed solution was to supplement the cache with synchronous reads/writes to the mainframes, but the drawbacks of this solution included a system that was inelastic and an inherently complex infrastructure.

The outcome of these solutions included: failed interactions on the portal that drove customers to phone the call center; and a high lead time due to silos in the organization bogging down the recovery efforts. Customer experience and satisfaction suffered as a result.

UWP 2.0 was designed with these technical goals: decouple systems; reduce dependency on external teams and third-party solutions; and create an empowered engineering organization. The intent was to reduce call center and software licensing costs and improve customer satisfaction and platform stability.

Domain-Driven Design (DDD), introduced by Eric Evans, was presented through two key definitions. A bounded context is the "delimited applicability of a particular model that gives team members a clear and shared understanding of what has to be consistent and what can develop independently." An entity is "an object fundamentally defined not by its attributes, but by a thread of continuity and identity."

Team Topologies is a functional organizational structure focused on creating an enabled, stream-aligned team. Mathew maintained that an organizational design is architecture.

Event-Driven Architecture is a "software design pattern that uses events to trigger actions and communicate among decoupled services." Roberts and Mathew used the Saga Pattern, a pattern for coordinating a sequence of local transactions across services that each have their own database, for orchestration in UWP 2.0.
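
An orchestrated saga can be sketched in a few lines. This is a generic illustration of the pattern and not the UWP 2.0 implementation: the Step type, the run_saga function and the sample steps (reserving a service slot, taking a payment) are hypothetical, and a plain in-memory log stands in for each service's own database.

```rust
/// One saga step: a local action plus the compensation that undoes it.
struct Step {
    name: &'static str,
    action: fn(&mut Vec<String>) -> Result<(), String>,
    compensate: fn(&mut Vec<String>),
}

/// Runs the steps in order. If a step fails, the steps that already
/// completed are compensated in reverse order and the error is returned.
fn run_saga(steps: &[Step], log: &mut Vec<String>) -> Result<(), String> {
    for (index, step) in steps.iter().enumerate() {
        if let Err(reason) = (step.action)(log) {
            for completed in steps[..index].iter().rev() {
                (completed.compensate)(log);
            }
            return Err(format!("{} failed: {}", step.name, reason));
        }
    }
    Ok(())
}

// Hypothetical local transactions; each writes to the shared log.
fn reserve_slot(log: &mut Vec<String>) -> Result<(), String> {
    log.push("slot reserved".to_string());
    Ok(())
}

fn release_slot(log: &mut Vec<String>) {
    log.push("slot released".to_string());
}

fn take_payment(_log: &mut Vec<String>) -> Result<(), String> {
    // Always fails in this sketch, to exercise the compensation path.
    Err("card declined".to_string())
}

fn refund_payment(log: &mut Vec<String>) {
    log.push("payment refunded".to_string());
}
```

Running the two steps in order leaves the log with "slot reserved" followed by "slot released": the failed payment is never refunded because it never completed, while the earlier reservation is undone.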

Change Data Capture is a "process that tracks, and records change to data in a database or data warehouse in real-time, and then delivers those changes to a downstream system or process." Roberts and Mathew defined a System-of-Reference for Change Data Capture as: a near-real-time replication of the System-of-Record; synchronization of Change Data Capture; and a model using Domain-Driven Design that is decoupled from the upstream structure and semantics.
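
The idea of a System-of-Reference fed by change events can be sketched as follows. The event shape, field names and account model are assumptions made for illustration, not details from the talk; the point is only that upstream-style records (acct_no, bal_cents) are translated into a separate domain model as each change is applied.

```rust
use std::collections::HashMap;

/// A change event as it might arrive from the System-of-Record.
/// Field names imitate an upstream schema and are hypothetical.
enum Change {
    Upsert { acct_no: String, bal_cents: i64 },
    Delete { acct_no: String },
}

/// Domain model kept in the System-of-Reference, named independently
/// of the upstream structure.
#[derive(Debug, PartialEq)]
struct Account {
    balance_cents: i64,
}

/// In-memory stand-in for the System-of-Reference store.
#[derive(Default)]
struct SystemOfReference {
    accounts: HashMap<String, Account>,
}

impl SystemOfReference {
    /// Applies one change. Upserts overwrite, so replaying the same
    /// event leaves the store in the same state.
    fn apply(&mut self, change: Change) {
        match change {
            Change::Upsert { acct_no, bal_cents } => {
                self.accounts.insert(acct_no, Account { balance_cents: bal_cents });
            }
            Change::Delete { acct_no } => {
                self.accounts.remove(&acct_no);
            }
        }
    }
}
```

A real pipeline would read these events from a log or stream and persist the result; the in-memory map only shows the translation step.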

Roberts and Mathew described their delivery and organizational structure, explained how they modeled workflows with finite state machines, and showed how data events in the System-of-Reference alerted them to changes. They overcame a number of challenges, such as batch bottlenecks, cross-cutting concerns, and the so-called "big bang" release.
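
A workflow modeled as a finite state machine and driven by data events might look like the sketch below. The states and events describe a hypothetical service-request workflow and are not taken from the presentation.

```rust
/// States of a hypothetical service-request workflow.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Requested,
    Scheduled,
    Completed,
    Cancelled,
}

/// Data events that could be emitted as records change.
#[derive(Debug, Clone, Copy)]
enum Event {
    AppointmentBooked,
    WorkFinished,
    CustomerCancelled,
}

/// Returns the next state, or an error for a transition the workflow
/// does not allow (for example, cancelling a completed request).
fn transition(state: State, event: Event) -> Result<State, String> {
    match (state, event) {
        (State::Requested, Event::AppointmentBooked) => Ok(State::Scheduled),
        (State::Scheduled, Event::WorkFinished) => Ok(State::Completed),
        (State::Requested, Event::CustomerCancelled)
        | (State::Scheduled, Event::CustomerCancelled) => Ok(State::Cancelled),
        (s, e) => Err(format!("invalid transition: {:?} on {:?}", s, e)),
    }
}
```

Because every allowed transition is listed in one match expression, an unexpected event shows up as an explicit error instead of silently corrupting the workflow.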

Roberts and Mathew concluded with these key takeaways:

  • Change Data Capture can be a foundational strategy
  • Event-driven architecture is a natural fit with Change Data Capture
  • Team Topologies map an organizational design to a system design
  • Domain-Driven Design brings it all together

As a result, they were able to release every two weeks.

Conclusion

QCon San Francisco, a five-day event, consisting of three days of presentations and two days of workshops, is organized by C4Media, a software media company focused on unbiased content and information in the enterprise development community and creators of InfoQ and QCon. For details on some of the conference tracks, check out these Software Architecture and Artificial Intelligence and Machine Learning news items.
