Reliability Content on InfoQ
-
Real-Time Exactly-Once Event Processing at Uber with Apache Flink, Kafka, and Pinot
Uber faced some challenges after introducing ads on UberEats. The events these ads generated had to be processed quickly, reliably, and accurately. Uber met these requirements with a system based on Apache Flink, Kafka, and Pinot that processes streams of ad events in real time with exactly-once semantics. An article describing its architecture was published recently on the Uber Engineering blog.
-
How GitHub Partitioned Its Relational Database to Improve Reliability at Scale
GitHub has been working for the last couple of years on partitioning their relational database and moving the data to multiple independent clusters. This effort led to a 50% load reduction and a significant reduction of database-related incidents, explains GitHub engineer Thomas Maurer.
-
Reviewing the Eight Fallacies of Distributed Computing
In a recent article on the Ably blog, Alex Diaconu reviewed the eight fallacies of distributed computing and offered a number of hints on how to handle them. InfoQ took the opportunity to talk with Diaconu to learn more about how Ably engineers deal with the fallacies.
-
Artificial Intelligence for IT Operations: an Overview
Artificial intelligence for IT operations (AIOps) combines sophisticated methods from deep learning, data stream processing, and domain knowledge to analyse infrastructure data from internal and external sources, with the goal of automating operations and detecting anomalies (unusual system behavior) before they impact the quality of service.
-
Auth0's Move to a Single-Cloud Architecture on AWS
Auth0, a provider of authentication, authorization, and single sign-on services, moved their infrastructure from multiple cloud providers (AWS, Azure, and Google Cloud) to just AWS. The move was driven by an increasing dependency on AWS services, and today their systems are spread across four AWS regions, with services replicated across zones.
-
How DevOps Principles Are Being Applied to Networking
Practices from the DevOps world are being adopted into managing networking services. Vendor hardware, configuration tools and deployment modes have eased programmable configuration and automation of network devices and functions.
-
Using Models in Developing Software for Self-Driving Cars
Models play an important role in developing software for autonomous systems like self-driving cars; they are used to simulate and verify behavior, document the system, and generate code. Jonathan Sprinkle explains how to model software used in autonomous systems, the benefits of modeling, how to use test data to validate the software that drives a car, and techniques for writing reliable code.
-
GitHub’s DGit Improves Reliability, Performance, and Availability
GitHub has been quietly rolling out DGit, short for “distributed Git”, a new distributed storage system built on top of Git with the aim of improving reliability, availability, and performance of using GitHub.
-
Big Data Architecture: Push, Pull, or Search in Place?
A surprisingly common theme at the Splunk Conference is the architectural question, “Should I push, pull, or search in place?”
-
Too Big To Fail: Lessons Learnt from Google and HealthCare.gov
At QCon New York 2015, Nori Heikkinen shared stories of failure and lessons learnt during her time working as a site reliability engineer (SRE) at Google and HealthCare.gov. The discussion of managing large-scale outages included recommendations for preparation, response, analysis and prevention.
-
Testing Impact of Model Driven Development
By using Model Driven Development, component tests could be skipped, and integration and system testing went a lot smoother, said Bryan Bakker in the presentation Model Driven Development (MDD) and its impact on testing. The main results of the MDD approach are a reduction in the amount of testing and increased reliability of the code that was generated from a mathematical model.
-
Mixing Agile with Waterfall for Code Quality
The 2014 CAST Research on Application Software Health (CRASH) report states that enterprise software built using a mixture of agile and waterfall methods is more robust and secure than software built using either agile or waterfall methods alone. InfoQ interviewed Bill Curtis about structural quality factors and about mixing agile and waterfall methods.
-
Martin Thompson Discusses the Reactive Manifesto 2.0
The second version of the Reactive Manifesto was announced at September's GOTO conference in Aarhus. Martin Thompson discusses the need for a revised version of the Manifesto and what its changes mean for the burgeoning reactive community.
-
Applying Security by Design with the CMMI for Development
To enable development of secure products, processes covering the software development life cycle have to include security activities. Winfried Russwurm from Siemens and Peter Panholzer from Limes Security facilitated a workshop at the SEPG Europe 2013 conference where they explored security activities and presented the Application Guide for Improving Processes for Secure Products.
-
Application Reliability in Windows Store Apps
Testing is critical, but not enough. This is the theme of Harry Pierson’s session on application reliability in Windows Store apps.