Big Data Content on InfoQ

Articles

Cloud

Designing Continuous Authorization for Sensitive Cloud Systems

Most cloud systems make one authorization decision at login. Everything after that runs on trust established at authentication time. For systems handling regulated data, that gap is where breaches happen. This article presents a continuous authorization architecture covering risk-tiered evaluation, behavioral baselines, privacy-preserving audit trails, and a phased, incremental rollout.

Venkata Nedunoori
on Jun 19, 2026
Cloud

Two Misconfigurations That Caused Spark OOM Failures on Kubernetes

After migrating Spark pipelines to Azure Kubernetes Service, two infrastructure settings interacted destructively: spark.kubernetes.local.dirs.tmpfs=true backed shuffle spill with RAM instead of disk, and a hard podAffinity rule forced all executors onto one node. Together, they caused repeated OOM kills invisible to standard diagnostics.

Pranav Bhasker
on Jun 03, 2026
AI, ML & Data Engineering

Time-Series Storage: Design Choices That Shape Cost and Performance

Every time-series database makes a set of storage design decisions: how to lay out rows, when to compress, what to partition on. These decisions determine cost and query performance more than the choice of database itself. This article works through those fundamentals from first principles, using widely available tools like PostgreSQL and Apache Parquet to make each trade-off measurable.

Nirmesh Khandelwal
on May 12, 2026
AI, ML & Data Engineering

From Batch to Micro-Batch Streaming: Lessons Learned the Hard Way in a Delta Index Pipeline

This article describes how a production delta-index pipeline migrated from scheduled batch to micro-batch Spark Structured Streaming. It covers why record-level streaming was rejected, how partition-based watermarks replaced fragile S3 completion markers, overlap-window correctness, and restart-as-design strategies for better predictability in object-store–based ingestion systems.

Parveen Saini
on May 04, 2026
AI, ML & Data Engineering

Autonomous Big Data Optimization: Multi-Agent Reinforcement Learning to Achieve Self-Tuning Apache Spark

This article introduces a reinforcement learning (RL) approach grounded in Apache Spark that enables distributed computing systems to learn optimal configurations autonomously, much like an apprentice engineer who learns by doing. The author also implements a lightweight agent as a driver-side component that uses RL to choose configuration settings before a job runs.

Hina Gandhi
on Jan 30, 2026
Development

How to Compute without Looking: a Sneak Peek into Secure Multi-Party Computation

This article shows how you can compute a function across multiple parties that do not trust each other without forcing them to share their individual inputs. This technique can be used to split secrets among parties, perform logical operations, or count votes in a way that ensures data privacy is preserved.

Debasish Ray Chawdhuri
on Mar 31, 2025
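
As a rough illustration of the vote-counting idea mentioned in the summary above, the sketch below uses additive secret sharing. It is not taken from the article: the modulus `PRIME`, the helper names `share` and `tally`, and the three-party setup are all assumptions chosen for illustration, and the sketch assumes parties that follow the protocol honestly.

```python
import secrets

# Assumed field modulus (2**31 - 1, a Mersenne prime); any modulus larger
# than the number of voters would work for this illustration.
PRIME = 2_147_483_647


def share(value, n_parties):
    """Split value into n_parties random shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def tally(votes, n_parties=3):
    """Count 0/1 votes without any single party seeing an individual vote."""
    per_party = [[] for _ in range(n_parties)]
    # Each voter splits the vote and sends exactly one share to each party.
    for vote in votes:
        for party, piece in enumerate(share(vote, n_parties)):
            per_party[party].append(piece)
    # Each party publishes only the sum of the shares it received.
    partial_sums = [sum(pieces) % PRIME for pieces in per_party]
    # Adding the partial sums reconstructs the total number of "yes" votes.
    return sum(partial_sums) % PRIME
```

Calling `tally([1, 0, 1, 1, 0])` reconstructs the total of three "yes" votes, while each party on its own only ever holds values that look uniformly random.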
Development

Zero-Knowledge Proofs for the Layman

This article introduces zero-knowledge proofs, a kind of cryptography you can use to prove that you know a secret, such as a private key or the solution to a problem, without ever sharing it with an interested party. While many articles exist on the topic, this one does not require any advanced math knowledge.

Debasish Ray Chawdhuri
on Mar 18, 2024
Culture & Methods

Minimising the Impact of Machine Learning on our Climate

This article introduces the field of green software engineering, showing the Green Software Foundation’s Software Carbon Intensity Specification, which is used to estimate the carbon footprint of software, and discusses ideas on how to make machine learning greener. It aims to give you the tools to take an active part in the climate solution.

Sara Bergman
on May 30, 2023
DevOps

Data Protection Methods for Federal Organizations and beyond

The Federal Data Strategy describes a plan to “accelerate the use of data to deliver on mission, serve the public, and steward resources while protecting security, privacy, and confidentiality.” This article covers what it is and how it can be applied to any organization.

Alex Tray
on Jan 18, 2023
Development

Who Moved My Code? An Anatomy of Code Obfuscation

In this article, we introduce the topic of code obfuscation, with emphasis on string obfuscation. Obfuscation is an important practice for protecting source code by making it unintelligible. It is often mistaken for encryption, but the two are different concepts. We present a number of techniques and approaches used to obfuscate data in a program.

Michael Haephrati, Ruth Haephrati
on Nov 09, 2022
Development

Virtual Panel: the New US-EU Data Privacy Framework

Recent rulings by several European courts have set important precedents for restricting personal data transmission from the EU to the US. As a consequence, the US and EU have started working on a new agreement. In this virtual panel, three knowledgeable experts discuss where the existing agreements fall short, and whether a new privacy agreement could improve the current situation.

Chris McLellan, Jeff Jockisch, Stephen Bailey, Sergio De Simone
on Oct 13, 2022
DevOps

Embracing Cloud-Native for Apache DolphinScheduler with Kubernetes: a Case Study

This article shares how Apache DolphinScheduler was updated to use a more modern, cloud-native architecture, including a move to Kubernetes and integration with Argo CD and Prometheus. These changes substantially improve the experience of deploying, operating, and monitoring DolphinScheduler.

Yang Dian
on Jun 24, 2022
