When working with Hadoop, with or without Hunk, there are a number of ways you can accidentally kill performance. While some of the fixes require more hardware, sometimes the problems can be solved simply by changing the way you name your files.
Splunk can now store archived indexes on Hadoop. At the cost of performance, this offers a 75% reduction in storage costs without losing the ability to search the data. And with the new adapters, Hadoop tools such as Hive and Pig can process the Splunk-formatted data.
Splunk opened their big data conference with an emphasis on “making machine data accessible, usable, and valuable to everyone”. This is a shift from their original focus: indexing arbitrary big data sources. Reasonably happy with their ability to process data, they want to ensure that developers, IT staff, and normal people have a way to actually use all of the data their company is collecting.
Preparing for problems like partial failure is the best thing you can do when working with distributed systems, Vaughn Vernon explains in a conversation with InfoQ. He refers to a blog post by Jeff Hodges, noting its down-to-earth approach and practical advice, e.g. designing for partial availability and using capped exponential backoff to restore full operation when dependencies are unavailable.
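As an illustration of the capped exponential backoff mentioned above, here is a minimal Python sketch. It is not taken from Hodges' post or from Vernon's remarks: the function names `backoff_delays` and `call_with_backoff`, the default parameters, the choice of `ConnectionError` as the retryable failure, and the use of random jitter are all assumptions made for this example.

```python
import random
import time


def backoff_delays(base=1.0, cap=30.0, attempts=6):
    """Return the delay ceiling for each retry: base * 2**n, capped at `cap`."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


def call_with_backoff(operation, base=1.0, cap=30.0, attempts=6, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` calls have failed.

    Hypothetical helper: only ConnectionError is treated as retryable, and
    each wait is a random value between 0 and the capped exponential ceiling
    (jitter is an assumption added here, not something the summary specifies).
    """
    delays = backoff_delays(base, cap, attempts - 1)
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last failure
            sleep(random.uniform(0, delays[attempt]))
```

The cap keeps the wait from growing without bound during a long outage, so a client reconnects reasonably soon after the dependency recovers. The `sleep` parameter is injectable only so the behaviour can be exercised without real waiting.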
GameAnalytics, maker of a free analytics platform, has recently open sourced gascheduler, an Erlang library that provides a generic scheduler for parallel execution of distributed tasks. InfoQ has spoken to Chris de Vries, one of gascheduler’s creators.
Viewed in a larger architectural context, Command Query Responsibility Segregation (CQRS) is only one of several architectural styles available. Some database technologies solve the same problems in a simpler way, Udi Dahan states when looking into ways of approaching CQRS. And when CQRS is really needed, there is a way to fulfil many of its goals with fewer moving parts.
ELIoT (Extensible Language for the Internet of Things) is a simple and small programming language aiming to make distributed programming easier. A program in ELIoT may appear as a single program, but it actually runs on different computers, so that, e.g., a variable or function declared on one computer is transparently used on another.
To make microservices awesome, Domain-Driven Design (DDD) is needed: the same mistakes that were made 5-10 years ago and solved by DDD are being made again in the context of microservices, David Dawson claimed in his presentation at this year’s DDD Exchange conference in London.
Twitter has replaced Storm with Heron, which provides up to 14 times more throughput and up to 10 times lower latency on a word count topology, and has helped them reduce the needed hardware to a third.
Apache Parquet, the open-source columnar storage format for Hadoop, recently graduated from the Apache Software Foundation Incubator and became a top-level project. Initially created by Cloudera and Twitter in 2012 to speed up analytical processing, Parquet is now openly available for Apache Spark, Apache Hive, Apache Pig, Impala, native MapReduce, and other key components of the Hadoop ecosystem.
In recent months Martin Fowler, among others, has claimed that a microservices architecture should always start with a monolith. Stefan Tilkov is convinced this is wrong: building a well-structured monolith with cleanly separated modules that may later be pulled apart into microservices is extremely hard, if not impossible in most cases.
The latest version of MemSQL, an in-memory database with support for transactions and analytics, includes a new Community Edition for free use by organizations. MemSQL 4, released last week, also supports integration with Apache Spark, Hadoop Distributed File System (HDFS), and Amazon S3.
The NASA Center for Climate Simulation (NCCS) is using Apache Hadoop for high-performance data analytics. Glenn Tamkin from the NASA team recently spoke at the ApacheCon conference and shared the details of the platform they built for climate data analysis with Hadoop.
Big data vendors Hortonworks, IBM, and Pivotal recently announced that their Hadoop-based platform products will use the common Open Data Platform (ODP). The announcement, made at the recent Hadoop Summit Europe conference, covers an open platform that includes Apache Hadoop 2.6 (HDFS, YARN, and MapReduce) and Apache Ambari software.
After three developer previews, six release candidates, and over 1500 closed tickets, the Apache Software Foundation has announced version 1.0 of Apache HBase, a NoSQL database in the Hadoop ecosystem. After more than 7 years of active development, the team behind HBase felt that the project had matured and stabilized enough to warrant a 1.0 version.