Netflix recently introduced Hollow, a Java library and toolset for processing in-memory datasets that aren’t characterized as “big data.” A single producer publishes the dataset, and many consumers get read-only access to it. The communication mechanism between producer and consumers also carries dataset changes in real time.
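The single-producer, many-consumer pattern with change propagation can be sketched in a few lines. This is a minimal illustration in Python, not Hollow's actual Java API: the `Producer` and `Consumer` classes, the `publish`, `deltas_since`, and `refresh` methods, and the delta format are all hypothetical names chosen for this example.

```python
from types import MappingProxyType


class Producer:
    """Hypothetical single producer that publishes versioned deltas."""

    def __init__(self):
        self._state = {}
        self._version = 0
        self._deltas = []  # entries are (version, changed_items, removed_keys)

    def publish(self, new_state):
        # Record only what changed relative to the previously published state.
        changed = {k: v for k, v in new_state.items()
                   if k not in self._state or self._state[k] != v}
        removed = [k for k in self._state if k not in new_state]
        self._version += 1
        self._deltas.append((self._version, changed, removed))
        self._state = dict(new_state)
        return self._version

    def deltas_since(self, version):
        return [d for d in self._deltas if d[0] > version]


class Consumer:
    """Hypothetical consumer that keeps a local in-memory copy up to date."""

    def __init__(self):
        self._data = {}
        self.version = 0

    def refresh(self, producer):
        # Apply every delta published after the version this consumer holds.
        for version, changed, removed in producer.deltas_since(self.version):
            self._data.update(changed)
            for key in removed:
                self._data.pop(key, None)
            self.version = version

    @property
    def view(self):
        # Read-only view: callers can read the data but not mutate it.
        return MappingProxyType(self._data)
```

A consumer that calls `refresh` after each `publish` only processes the changed and removed entries instead of reloading the whole dataset, which is the general idea behind shipping deltas rather than full copies.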
Using neural networks for sequence prediction is a well-known computer science problem with a vast array of applications in speech recognition, machine translation, language modeling, and other fields. Facebook AI Research scientists designed adaptive softmax, an approximation algorithm tailored for GPUs that can be used to efficiently train neural networks over vocabularies of a billion words and beyond.
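One idea behind class-based softmax approximations of this kind is to split the vocabulary by frequency: frequent words are scored directly, while rare words share a single "tail" slot and are only scored when needed. The sketch below shows that two-level factorization in plain Python. It is an illustration under stated assumptions, not the researchers' implementation: the function names are hypothetical, only one tail cluster is used, and nothing here is GPU-specific.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def two_level_probs(head_scores, tail_scores):
    """Combine a head distribution and a tail distribution into one.

    head_scores: scores for the frequent words, plus one final slot
                 that stands for "some word in the tail cluster".
    tail_scores: scores for the rare words inside that cluster.
    """
    head = softmax(head_scores)
    tail = softmax(tail_scores)
    cluster_prob = head[-1]
    # P(rare word) = P(tail cluster) * P(word | tail cluster)
    return head[:-1] + [cluster_prob * p for p in tail]
```

The saving comes from the fact that, for a frequent target word, only the small head softmax needs to be evaluated; the tail scores can be skipped entirely.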
Cloudera announced a partnership with the Broad Institute of MIT and Harvard and detailed some of its experience with the Genome Analysis Toolkit (GATK) pipeline.
Yahoo! has benchmarked three of the main stream processing frameworks: Apache Flink, Spark and Storm.
Chris Atherton gave the closing keynote at the GOTO Berlin 2015 conference, in which she talked about designing software. She suggests that, instead of relying on professional opinions on how software should look or work, it can be better to go out and get data from real users. InfoQ interviewed her about designing and testing user interfaces.
Samsung SAMI is a Data-driven Development (D3) platform for receiving, storing, and sending IoT device data. Any device can send data in various formats, which is then normalized into JSON and stored in the cloud. Other devices can then request that data.
For an organization to be data-driven, it's not enough to just dump mountains of data. That data needs to be accurate and meaningful. Julianna Göbölös-Szabó, a data engineer at Prezi, shared how the company improved the quality of its log data. The solution involved moving from unstructured to structured data with a lightweight, contract-based approach to nudge all teams in the right direction.
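A contract-based check on structured log lines might look roughly like the following. This is a sketch only, not Prezi's tooling: the contract format, the `PAGE_VIEW_CONTRACT` fields, and the `contract_violations` function are assumptions made up for illustration.

```python
import json

# Hypothetical contract: required field name -> expected Python type.
PAGE_VIEW_CONTRACT = {"event": str, "user_id": int, "timestamp": float}


def contract_violations(line, contract):
    """Return a list of problems; an empty list means the line conforms."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}")
    return problems
```

Reporting violations instead of silently dropping lines fits the "nudge" idea: teams see exactly which field broke the contract and can fix the emitter.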
An agile view of Big Data, in which data is viewed as a real-time stream, offers a new look at how data is managed. With an agile data infrastructure, organizations can address Big Data challenges with greater ease, flexibility, and performance. A white paper by codeFutures describes this agile view of Big Data.
Prismatic have added data coercion in the 0.2 release of their Clojure data description library, Schema. With coercion, the library no longer just rejects data that has the wrong types; it can also be configured to modify instances to fit the schema. InfoQ talked to Prismatic's Jason Wolfe about Schema.
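The difference between rejecting and coercing can be shown with a small sketch. It is written in Python for illustration and does not use or mirror Schema's Clojure API: the `SCHEMA` mapping and the `validate` and `coerce` functions are hypothetical.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "score": float, "name": str}


def validate(record, schema):
    """Strict mode: raise if any field has the wrong type."""
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise TypeError(f"{field}: expected {expected.__name__}")
    return record


def coerce(record, schema):
    """Coercing mode: convert mismatched fields, then validate the result."""
    converted = {}
    for field, expected in schema.items():
        value = record[field]
        # Only convert when the type does not already match.
        converted[field] = value if isinstance(value, expected) else expected(value)
    return validate(converted, schema)
```

For example, a record parsed from a query string, where every value arrives as text, fails `validate` but passes through `coerce` with `"7"` turned into `7`.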