Amazon recently announced EMRFS, an implementation of HDFS that allows EMR clusters to use S3 with a stronger consistency model. When enabled, this new feature keeps track of operations performed on S3 and provides list consistency, delete consistency and read-after-write consistency for any cluster created with Amazon Machine Image (AMI) version 3.2.1 or greater.
Reasons for building microservices are often about using isolation as a means to handle change. Sharing code between services couples your services to each other, reducing the effectiveness of the isolation and the ability to handle change, David Dawson writes in a series of blog posts questioning the Don’t Repeat Yourself (DRY) principle in connection with microservices.
Five years ago many NoSQL databases were pre-version 1.0 and, when it came to the CAP tradeoff, choosing availability over consistency was in vogue. Fast forward to today and distributed, fault-tolerant transactions are moving to the fore as a new round of NoSQL databases seeks to redefine our NoSQL expectations.
Apache Spark 1.2.0 was released with a Netty-based implementation, High Availability and Machine Learning APIs. It represents the work of 172 contributors from over 60 institutions and comprises more than 1000 patches. InfoQ talks with Patrick Wendell, a Spark committer and PMC member.
Splice Machine version 1.0 supports analytic window functions and integration with the Hadoop ecosystem. The Splice Machine team recently released its Hadoop-based RDBMS data management solution, which can be used for transactional workloads on Hadoop.
There is a strong trend toward microservice-based architectures and frequent discussion comparing them to monoliths, Robert Annett explains. He defines a monolith as an architectural style or a pattern, using three basic viewtypes for characterization.
LinkedIn recently open sourced Cubert, its High Performance Computation Engine for Complex Big Data Analytics. Cubert is a framework written with analysts and data scientists in mind. Developed completely in Java and expressed as a scripting language, Cubert is designed for complex joins and aggregations that frequently arise in the reporting world.
At the 2014 QCon San Francisco conference, LinkedIn's Lin Qiao gave a talk on their Gobblin project (also summarized in a blog post), a unified data ingestion system for their internal and external data sources.
Stripe, the internet payments infrastructure company, recently announced that it is open sourcing a set of internally developed tools based on Apache Hadoop. Timberlake, Brushfire, Sequins and Herringbone all contribute to enriching the available tools for building an Apache Hadoop stack.
Microservices are not a new idea, and over the course of 3-5 years we will end up rebuilding WS-*, the same way Web Services rebuilt everything from CORBA, unless we learn from our mistakes and prevent them from being made again, Greg Young stated in a presentation at the Microservices Conference in London.
Microservices are valuable, but to break things up properly and create the right boundaries, we need to understand our business and its processes, Jeppe Cramon stated in a presentation at the Microservices Conference in London.
Udi Dahan describes how, by looking for highly cohesive, loosely coupled microservices not within a system but across the enterprise, we can end up with a focus on organising services around business capabilities spanning the whole organisation, since this is what the business cares about.
When working with microservices and pushing them to the cloud, people often find it difficult to understand the new architecture because it’s a paradigm shift, Daniel Bryant explains in a presentation at the Microservices Conference in London. As an aid when designing and implementing cloud microservices, Daniel has created the DHARMA principles, the idea being to use them as a checklist.
Databricks has recently announced a new record in the Daytona GraySort contest using the Spark processing engine. The Daytona GraySort contest is a third-party benchmark measuring how fast a system can sort 100 terabytes of data. Databricks posted a throughput of 4.27 TB/min over a cluster of 206 machines for their official run.
When using Domain-Driven Design (DDD) to separate the concerns of a large system into bounded contexts, with each context using its own data store, there is often a need to share some common data. One way of doing that is to let each context publish events about changes, events that others can listen to, Julie Lerman recently explained in MSDN Magazine.
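As a rough illustration of the idea of contexts publishing change events, the following is a minimal in-process sketch in Python. It is not taken from Lerman's article: the `EventBus`, `CustomerContext`, `OrderingContext` and `CustomerRenamed` names are hypothetical, and a real system would more likely use a message broker than a direct function call.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, Dict, List


@dataclass(frozen=True)
class CustomerRenamed:
    """Hypothetical event describing a change in the customer context."""
    customer_id: int
    new_name: str


class EventBus:
    """Toy synchronous publish/subscribe bus (stand-in for a real broker)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[type, List[Callable]] = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: object) -> None:
        # Deliver the event to every handler registered for its type.
        for handler in self._handlers[type(event)]:
            handler(event)


class CustomerContext:
    """Owns customer data in its own store and publishes changes."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus
        self.customers: Dict[int, str] = {}

    def rename(self, customer_id: int, new_name: str) -> None:
        self.customers[customer_id] = new_name
        self._bus.publish(CustomerRenamed(customer_id, new_name))


class OrderingContext:
    """Keeps its own local copy of the customer names it needs."""

    def __init__(self, bus: EventBus) -> None:
        self.customer_names: Dict[int, str] = {}
        bus.subscribe(CustomerRenamed, self._on_customer_renamed)

    def _on_customer_renamed(self, event: CustomerRenamed) -> None:
        self.customer_names[event.customer_id] = event.new_name


bus = EventBus()
customers = CustomerContext(bus)
ordering = OrderingContext(bus)
customers.rename(1, "Ada")
```

In this sketch the ordering context never reads the customer context's store; it only reacts to the events it subscribed to, which is what keeps the two data stores separate.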