Multi-Cloud Observability Using Fluent Bit

Key Takeaways

  • Fluent Bit allows us to address cost management and compliance considerations that are key to multi-cloud observability.
  • Organizations rarely run cloud-native solutions alone, so any tool that seeks to unify observability, such as Fluent Bit, needs to work well with a wide range of technologies and implementation strategies.
  • Fluent Bit offers cloud neutrality and the ability to work with cloud vendor services, which are important features for enabling multi-cloud observability.
  • Fluent Bit provides the means to support different teams with different specialized observability tools to maximize the value of an organization’s multi-cloud operations.
  • Monitoring and observability technologies and techniques have significantly improved in the last decade. Fluent Bit provides the capabilities to embrace these changes in a non-disruptive manner.

Multi-cloud and hybrid IT operations are no surprise: while the hyperscalers would rather you keep your workloads on their cloud, it isn’t practical. After all, different clouds have different pros and cons, and sometimes it isn’t even the feature differences that drive the selection of clouds. Fluent Bit is part of the Fluent CNCF project and provides the capabilities to gather observability data (representing the classic three pillars of logs, traces, and metrics) and then transform, filter, and route these events to appropriate tools. So before we consider Fluent Bit’s role in a multi-cloud scenario, let’s step back and look at the drivers and challenges involved in multi-cloud.

Divergence and Autonomy

For larger organizations, a multi-cloud approach doesn’t always come from strategic planning and trying to align everything to individual vendors to maximize purchasing power. Mergers and acquisitions have traditionally been cited (for example, in this article by Thoughtworks) as one of the drivers for multi-cloud. However, a probably more common situation arises when local or gray IT makes localized choices, or when departments or business units have a level of autonomy. While autonomy lends itself well to agility, it can present challenges, particularly when a local solution gains traction and gets broader adoption within the organization or is identified as not fully aligned with the wider organization’s compliance obligations.

Figure 1: How a localized solution can evolve to need enterprise-level support

Bringing localized operational processes into the organization’s wider operations can be a source of tension. Operationally, the local team will be happy working with their tools, but specialist teams like security will use their own tools to manage wider visibility and understand cause and effect. Not only that, but these other teams may well be working on different clouds. Herein lies one of the many benefits of technologies that allow observability data to be routed to various tools (an aspect of Fluent Bit we’ll touch upon shortly).

Distributing the observability data can help resolve challenges by enabling different teams to use their preferred tooling. This tooling can be deployed in different clouds, with the corporate-wide/centralized capabilities typically hosted on strategic cloud platforms.

Cost Efficiencies

Regardless of how an organization has arrived at a multi-cloud situation, it needs to have an end-to-end understanding of its IT to recognize when there is an issue and what is cause and effect.

Central IT teams are often seen as an overhead, so there is constant pressure to do more with less, in everything from operational processes to security. Consequently, any improvements that make life quicker and easier are positive, from consolidating common observability tools and dashboards to enabling faster recognition of issues or identifying when to apply preventive actions. Fluent Bit, as we’ll see, can help simplify these challenges. But simply consolidating all our observability data from multiple clouds can be expensive; logs and traces can be voluminous, and metrics are often generated at high frequency. All of this creates network costs, particularly data egress, and even if we’re looking at cents per hour per service, it builds up quickly. So we need to find an effective strategy, which is another area Fluent Bit can help us with.

Fluent Bit

It is worth looking at the background of Fluent Bit to understand how it can help us, as this gives us insight into some of its important characteristics. Fluent Bit comes from the same CNCF project as Fluentd, and in the early years of Fluentd and as part of the CNCF landscape, it was very much the "little brother." With a very small footprint, Fluent Bit was ideal for Internet of Things use cases. It could gather logs and pass them on to be processed, often via its sibling. But Fluent Bit has stepped out from its sibling’s shadow in the last few years. This has been driven by the cloud-native focus on having smaller, more efficient executables. Fluent Bit is built with C and is extensible using C, Go, and WASM; it produces native binaries and can also be customized with Lua scripts run by LuaJIT.

In addition to Fluent Bit’s extensibility, its approach of supporting commonly used standards has been influential. As a result, in addition to being conversant with the OpenTelemetry Protocol (OTLP) and being able to function as an OpenTelemetry Collector, it can seamlessly interoperate with existing Fluentd deployments and has compiled-in "plugins" that allow data to be sourced from and sent to many different types of endpoints, from Log4J, Apache, and Nginx log files as inputs to Kafka, TensorFlow, and social notification channels as outputs.

These technologies are part of the reason why Fluent Bit’s early adoption was around Internet of Things (IoT) use cases, as the code could be compiled for specialist hardware with a small, efficient footprint. As cloud-native and Kubernetes have pushed for smaller, more efficient binaries (22MB on Windows compared to Fluentd’s 100MB), these performance benefits have yielded cost savings at scale (particularly at hyperscale). The cloud has also been a great enabler for AI/ML, as we can "time-share" large fleets of densely packed GPUs, which can use Fluent Bit for monitoring.

It can act as a Prometheus agent and an OpenTelemetry collector for all signal types. It easily fits into a Kubernetes ecosystem, supports traditional environments, and seamlessly plugs in where we might have used Fluentd.

Fluent Bit provides a very adaptable single-agent substitute that can support the tools we need to interrogate and visualize what is happening. This means we can greatly simplify our agent deployment needs: deploying Fluent Bit removes the need to deploy Prometheus Agent, agents for gathering OS-level logs, OpenTelemetry collector, etc.

Fluent Bit’s capabilities allow us to transform events in flight, to derive metrics from logs (e.g., how many errors per minute in a log stream), and to convert log representations to industry-standardized metrics or traces. The ability to perform such transformations allows teams to introduce new observability tools and techniques without creating significant disruption. For example, we can convert traces and metrics to logs if development teams are adopting the auto-instrumentation of code at build time or using service meshes like Istio to generate trace data, but the core operational tooling hasn’t yet caught up to handle trace data. Conversely, legacy apps that no one wants to change can potentially have traces and metrics derived from their logs, giving more insights from an observability stack that can support such data.
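
To make the "errors per minute" idea concrete, here is a minimal Python sketch of the logic involved. It is not Fluent Bit configuration or code: in Fluent Bit this kind of derivation would be expressed in the pipeline configuration. The log line layout (an ISO-8601 timestamp, then a level, then the message) and the function name are assumptions made purely for illustration.

```python
from collections import Counter
from datetime import datetime


def errors_per_minute(log_lines):
    """Count ERROR-level lines per one-minute bucket.

    Assumes a hypothetical line format such as
    '2024-05-01T10:15:42 ERROR payment failed'.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue  # skip lines that do not match the assumed layout
        timestamp, level = parts[0], parts[1]
        if level != "ERROR":
            continue
        try:
            parsed = datetime.fromisoformat(timestamp)
        except ValueError:
            continue  # skip lines with an unparseable timestamp
        # Truncate to the start of the minute to form the bucket key.
        bucket = parsed.replace(second=0, microsecond=0)
        counts[bucket.isoformat()] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        "2024-05-01T10:15:42 ERROR payment failed",
        "2024-05-01T10:15:50 INFO request served",
        "2024-05-01T10:15:59 ERROR upstream timeout",
        "2024-05-01T10:16:01 ERROR upstream timeout",
    ]
    print(errors_per_minute(sample))
```

The resulting per-minute counts are far smaller than the log stream they summarize, which is what makes a derived metric attractive for tooling that cannot, or need not, ingest every raw log line.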

Most importantly, as Fluent Bit is driven through configuration files, it becomes easy to roll out the changes in a simple, iterative manner as we evolve our practices and technology without needing disruptive change.

Fluent Bit in a Multi-cloud Environment

As we’ve already seen, Fluent Bit offers many capabilities to blend and support multiple tools and overlapping technical or operational tooling. We can support both the latest and more mature approaches to monitoring and observability and potentially replace multiple data-collecting agents, all while operating with a small, efficient footprint.

These are all good arguments for Fluent Bit, but what makes it unique for multi-cloud use cases? The technologies used within Fluent Bit (and the fact that it is an open-source solution) mean it can be run almost anywhere, including across multiple clouds.

Data manipulation and filtration are important for multi-cloud; observability data is verbose (particularly traces and logs) and consumes a lot of bandwidth and, therefore, cost. The exact costs for egress can be complicated, as they vary based on several factors:

  • Of the big four cloud vendors, only Oracle doesn’t vary costs by cloud region (where costs do vary, US and European regions tend to be cheaper, and Asia Pacific and Latin America tend to be more expensive).
  • Tiering—vendors will tier their pricing based on volume, how much consumption is committed to, etc.
  • Most vendors will charge for data flowing between their cloud regions.
  • Use cases—in some use cases, such as service migrations or offline data transfers, data egress costs may be reduced or free.

We’ve observed rates as high as 0.16 USD per gigabyte. At this price, one server generating a GB per hour of logs, traces, and metrics around the clock would cost 115 USD per month. But who runs a single server solution? So, if we can establish a strategy that could reduce this by even 10% without compromising the operational efficiency of a support team, we will make considerable savings.
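
The arithmetic behind that figure can be sketched as follows. The 0.16 USD per gigabyte rate and the one gigabyte per hour volume come from the text above; the 30-day month, the fleet of 100 servers, and the function names are assumptions used only to show how quickly the numbers grow.

```python
def monthly_egress_cost(gb_per_hour, usd_per_gb, servers=1, days=30):
    """Estimate monthly egress cost for servers emitting data around the clock."""
    total_gb = gb_per_hour * 24 * days * servers
    return total_gb * usd_per_gb


def monthly_saving(cost, reduction):
    """Saving achieved by cutting egress volume by a fraction (0.10 = 10%)."""
    return cost * reduction


if __name__ == "__main__":
    single = monthly_egress_cost(gb_per_hour=1, usd_per_gb=0.16)
    print(f"one server:  {single:.2f} USD/month")   # 115.20

    # Hypothetical fleet size, chosen only for illustration.
    fleet = monthly_egress_cost(gb_per_hour=1, usd_per_gb=0.16, servers=100)
    print(f"100 servers: {fleet:.2f} USD/month")    # 11520.00
    print(f"10% saving:  {monthly_saving(fleet, 0.10):.2f} USD/month")  # 1152.00
```

Real bills will differ because of the tiering, region, and use-case factors listed above; the sketch applies a single flat rate.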

Fluent Bit’s ability to filter and interact with many different tools means it would be easy to configure it to filter out noisy logs and store them temporarily in lower-performance (and lower-cost) storage local to the servers. The net result is that the logs are retrievable if deemed necessary, but hopefully, 99.99% of the time, they are superfluous to needs. For example, an app may regularly report that it is alive and well. Rather than sending every one of these events across the network, it would be a simple thing to configure Fluent Bit to exclude such log events or, better still, to share from one cloud to the central location just the fact that x of these logs have been observed every 5 minutes, so the absence of information doesn’t become an issue. So we get an architecture more like the following diagram:
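
The heartbeat summarization described above can be sketched in Python as follows. This illustrates the logic only and is not how Fluent Bit is configured; the event shape (a dictionary with an epoch-seconds `ts` and a `message`), the heartbeat pattern, and the function name are all assumptions.

```python
import re
from collections import Counter

# Hypothetical pattern for "I'm alive" messages; a real app will differ.
HEARTBEAT = re.compile(r"alive and well", re.IGNORECASE)
WINDOW_SECONDS = 300  # 5-minute summary windows, as in the text


def summarize_heartbeats(events):
    """Split events into those forwarded as-is and per-window heartbeat counts.

    Each event is assumed to be a dict with 'ts' (epoch seconds) and 'message'.
    """
    forwarded = []
    heartbeat_counts = Counter()
    for event in events:
        if HEARTBEAT.search(event["message"]):
            # Align the timestamp to the start of its 5-minute window.
            window_start = event["ts"] - event["ts"] % WINDOW_SECONDS
            heartbeat_counts[window_start] += 1
        else:
            forwarded.append(event)
    summaries = [
        {"window_start": start, "window_seconds": WINDOW_SECONDS, "heartbeats": count}
        for start, count in sorted(heartbeat_counts.items())
    ]
    return forwarded, summaries


if __name__ == "__main__":
    events = [
        {"ts": 0, "message": "service alive and well"},
        {"ts": 120, "message": "service alive and well"},
        {"ts": 130, "message": "ERROR disk full"},
        {"ts": 310, "message": "service alive and well"},
    ]
    forwarded, summaries = summarize_heartbeats(events)
    print(forwarded)
    print(summaries)
```

Only the error event and two small summary records would cross the cloud boundary here, instead of four full events. A window with no summary at all is itself a signal that the app has stopped reporting.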

Figure 2: Simple top-level architecture to help with data costs

This idea is hardly new; we often see it as a case of "cloud edge" operations. It can become even cleverer if centralized operations teams use multiple tools. Those tools will need overlapping observability data, so we send that data across once and handle the routing to each tool locally within the operations infrastructure, limiting any duplicated traffic.

Observing Geo-Distributed App Instances

There is also a variation on this neutrality we’ve encountered. As mentioned, legislation and compliance requirements may mean the same solution needs to be deployed to multiple clouds. However, we still want a core team to manage and observe the overall performance, and that team doesn’t need locally sensitive data. Instead, it needs consolidated views that show which deployments are fine and which need intervention. So, the intelligence to filter and pull indicator metrics and alerts can be deployed locally with the application, while feeding the global monitoring so the support team can easily see which deployment(s) need attention.

This approach isn’t new; managed service providers often adopt it, keeping all the operational observability dashboards and knowledge guides in a single central cloud but signing into the remote clouds to execute remediation.

Fluent Bit can enable us to achieve this by filtering out or deriving the relevant events in the geographically distributed deployments and forwarding them on to a central service, having masked or removed any sensitive values that can’t be distributed. Not only that, but as we forward data, the central services will need to know the origins of the events, such as which Kubernetes cluster an event came from; this can be injected into the observability events using plugins such as the Kubernetes plugin.
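
As a rough illustration of masking and origin tagging, the following Python sketch prepares an event before it leaves its home cloud. It shows the idea only and is not Fluent Bit plugin code; the set of sensitive keys, the `origin` field layout, and the function name are hypothetical.

```python
# Hypothetical field names treated as sensitive; a real list would come
# from the organization's compliance requirements.
SENSITIVE_KEYS = {"email", "card_number"}
MASK = "***"


def prepare_for_central(event, cluster_name, region):
    """Return a copy of the event that is safe to forward centrally.

    Sensitive values are masked, and the origin of the event is attached
    so the central service knows which deployment produced it.
    """
    outbound = dict(event)  # shallow copy; the local event is left untouched
    for key in SENSITIVE_KEYS:
        if key in outbound:
            outbound[key] = MASK
    outbound["origin"] = {"cluster": cluster_name, "region": region}
    return outbound


if __name__ == "__main__":
    local_event = {
        "level": "ERROR",
        "message": "payment declined",
        "email": "customer@example.com",
    }
    print(prepare_for_central(local_event, cluster_name="prod-eu-1", region="eu-west"))
```

The unmasked event stays in the local deployment, where it remains available to staff who are permitted to see it.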

Standardized data collection—local visualization

The last use case where Fluent Bit can excel is where the solution built needs to be vendor agnostic, often by adopting a CNCF-based technology stack. The collected observability data is then directed to the cloud domain via a centralized node. This means one Fluent Bit node needs to support the local observability product’s specific integration. With major vendors, this may be exposed using industry-standard data definitions such as OpenTelemetry’s OTLP, with which Fluent Bit is conversant.

The team then tailors the monitoring visualization to use the cloud’s local tooling. They may even blend the Fluent Bit data feeds with local cloud-specific monitoring points such as CloudWatch (often provided freely, so no compute costs are incurred for Fluent Bit). This is not multi-cloud in the sense of one organization using multiple clouds at the same time. It makes a lot of sense for software vendors, as they’re extracting their application monitoring in a manner that means there is no need for the customer to intrude on or understand the configurations. When operating in a Kubernetes-style environment, they just tap into the Fluent Bit feed and use the data from Fluent Bit, which is the same across all deployments, to control things via a control tool or even just Kubernetes Operators.

Conclusion

When handling multi-cloud, whether centralizing across multiple clouds for a single enterprise view or distributing the same solution across multiple clouds, we need to control the data flow to help control costs without compromising team effectiveness (while also accommodating organizational responsibilities, accountabilities, and cultural differences).

Data flowing across clouds or cloud regions can often mean crossing legislative boundaries. If our data contains sensitive information such as PII, we are confronted not only with cost but also with legal constraints and obligations (e.g., data sovereignty).

Fluent Bit’s small, efficient footprint, which can be deployed just about anywhere, enables that. We can consolidate just enough data—building aggregated summaries, if necessary—but also leave the observability data in locations that mean we can easily pull on it if needed.

We can get the correct data to different tools for different related observability tasks, from security to specialist app operational views, supporting teams in a manner that makes them as effective as possible.

As Fluent Bit can collect events in near real-time, we can move from detecting an issue once it has started to occur to actually identifying the warning signs as the events are collected and creating the opportunities to be proactive.
