A Reference Architecture for Fine-Grained Access Management on the Cloud

Key Takeaways

  • For highly dynamic cloud environments, VPNs and bastion hosts are inadequate access management mechanisms
  • A non-repudiable user identity established by an identity provider (as opposed to relying on network layer concepts such as IP addresses) is a core underpinning of access management in the cloud
  • Short-lived ephemeral tokens or certificates eliminate the need for users to remember or store static credentials, which may be vulnerable to leaks and attacks
  • Data access by 3rd party SaaS tools, which cannot be controlled using VPNs, can be monitored just as effectively as access by internal users and tools
  • User productivity is improved by providing a consistent access experience across all cloud resources (SSH hosts, databases, S3 buckets, etc.), even as administrators exercise fine-grained control over them


What is access management?

Access management is the process of identifying whether a user, or a group of users, should be able to access a given resource, such as a host, a service, or a database. For example, is it okay for a developer to be able to log in to a production application server using SSH, and if so then for how long? If an SRE is attempting to access a database during off-call hours, should they be allowed to do so? If a data engineer has moved to a different team, should they continue having access to the ETL pipelines’ S3 buckets?

How is access management done today?

Before the proliferation of various infrastructure and data services on the cloud, access management was a relatively simple problem for DevOps and Security teams to solve. VPNs and bastion hosts were (and still are) the preferred mechanisms to cordon off all critical resources at the network level. Users first authenticate with the VPN server, or log on to the bastion host, before they can access any resource on the private network.

This works well when the resources are static and their number relatively small. However, as more and more resources dynamically spring up in different parts of the private network, the VPN / bastion host solutions become untenable.

Specifically, there are three areas where VPNs and bastion hosts fall short as an effective mechanism for access management.

  • They operate at the network layer: Once a user authenticates with a VPN and gains access to the private network, they effectively have access to all services running on it. It’s not possible to manage access at the granularity of each infrastructure or data service based on the user’s identity.
  • Credentials are a vector of attack: Both VPNs and bastion hosts require users to remember and store credentials. Expiring and rotating those credentials as a security policy is difficult, especially when a large number of users are involved, so long-lived credentials become potential vectors of attack.
  • 3rd party SaaS tools cannot be governed: SaaS tools such as Looker, Tableau, and Periscope Data require direct access to data endpoints. Consequently, anyone accessing data using these tools cannot be authenticated using the same mechanisms and credentials as the rest of the cloud infrastructure.

A new architecture for access management on the cloud

In this article, we will define a new reference architecture for cloud-native companies that are looking for a simplified access management solution for their cloud resources, from SSH hosts, databases, data warehouses, to message pipelines and cloud storage endpoints.

The architecture solves the following challenges that VPNs and bastion hosts aren’t able to overcome:

  • Enforcing access authorization at a fine-grained service level
  • Eliminating shared credentials and individual account management
  • Governing access by 3rd party SaaS tools

Additionally, it enables the following business benefits for organizations with sensitive data:

  • Auditability for meeting compliance standards such as FedRAMP and SOC 2, through session recording and activity monitoring across all services
  • Privacy and data governance, through fine-grained authorization policies that restrict or scrub sensitive data based on the identity of the accessor

The architecture is built upon the following three core principles. Implementing them allows DevOps and Security teams to exercise full control over their entire environment while improving user productivity with a simple and consistent experience.

  • Establishing a non-repudiable identity for users accessing resources
  • Using short-lived ephemeral tokens and certificates in place of static credentials and keys
  • Centralizing fine-grained access policies across all resource types in a single place

The following figure shows the reference architecture and its components.

The VPN / bastion host from the previous figure has been replaced with an Access Gateway. The Access Gateway is a collection of microservices responsible for authenticating individual users, authorizing their requests based on attributes such as their identity and group membership, and ultimately granting them access to the infrastructure and data services in the private network.

Next, let’s look at the individual components to see how the core principles outlined before are accomplished.

Access Controller

The key insight underpinning this architecture is the delegation of user authentication to a single service (the Access Controller) rather than placing that responsibility with each service to which the user may need access. This kind of federation is commonplace in the world of SaaS applications. Having a single service be responsible for authentication simplifies user provisioning and de-provisioning for application owners and accelerates application development.

The Access Controller itself will typically integrate with an identity provider, such as Auth0 or Okta, for the actual authentication sequence, thus providing a useful abstraction across a wide array of providers and protocols. Ultimately, the identity provider guarantees non-repudiation of the user’s identity in the form of a signed SAML assertion, a JWT, or an ephemeral certificate. This obviates the need to rely on a trusted subnet as a proxy for the user’s identity. It also allows access policies to be configured down to the granularity of an individual service, unlike VPNs, which permissively grant users access to all services on the network.
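As a concrete illustration of identity coming from a signed assertion rather than from a subnet, here is a minimal Python sketch of a gateway-side JWT check. To stay within the standard library it assumes an HS256 (HMAC-SHA256) shared secret; providers such as Auth0 and Okta commonly sign with asymmetric keys (e.g. RS256), which a real verifier would check against the provider's published keys. `make_token` is a hypothetical stand-in for the identity provider, and the `groups` claim used below is an assumed name.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    # JWT segments use base64url encoding with the "=" padding stripped.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    # Restore the stripped padding before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _sign(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def make_token(claims: dict, secret: bytes) -> str:
    """Hypothetical stand-in for the identity provider: issues an HS256-signed JWT."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_b64url_encode(_sign(signing_input, secret))}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if signature and expiry check out; otherwise raise ValueError.

    The caller's IP address plays no part in establishing identity.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, sig_b64 = parts
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        # Pin the algorithm so a token cannot downgrade itself (e.g. to "none").
        raise ValueError("unexpected algorithm")
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        raise ValueError("token expired")
    return claims
```

For example, `verify_token(make_token({"sub": "alice@example.com", "groups": ["SRE"], "exp": time.time() + 300}, b"demo-secret"), b"demo-secret")` returns the claims dictionary; the same call with a different secret, or after `exp` has passed, raises `ValueError`. The verified `sub` and `groups` claims, not the network a request came from, then become the inputs to authorization.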

An additional advantage of delegating authentication to identity providers is that users can be authenticated using zero trust principles. Specifically, identity provider policies can be created to enforce the following:

  • Disallow access from disreputable geo-locations and IP addresses
  • Disallow access from devices with known vulnerabilities (unpatched OSes, older browsers, etc.)
  • Trigger an MFA challenge immediately after a successful SAML exchange
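In practice these zero trust checks are configured in the identity provider itself rather than written by hand. Purely to show the decision logic, here is a Python sketch; the blocked network, country code, and minimum OS versions are placeholder values, and `LoginContext` with its fields is a hypothetical structure.

```python
import ipaddress
from dataclasses import dataclass
from typing import Tuple

# Placeholder values only; a real deployment would take these from the identity
# provider's policy configuration and threat-intelligence feeds.
BLOCKED_COUNTRIES = {"XX"}                                      # hypothetical code
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]    # documentation range
MIN_OS_VERSION = {"macos": (14, 4), "windows": (10, 0, 19045)}  # illustrative


@dataclass
class LoginContext:
    ip: str
    country: str
    os_name: str
    os_version: Tuple[int, ...]
    saml_ok: bool      # did the SAML exchange succeed?
    mfa_passed: bool


def evaluate_login(ctx: LoginContext) -> str:
    """Return 'deny', 'mfa_required', or 'allow' for a login attempt."""
    if not ctx.saml_ok:
        return "deny"
    addr = ipaddress.ip_address(ctx.ip)
    if ctx.country in BLOCKED_COUNTRIES or any(addr in net for net in BLOCKED_NETWORKS):
        return "deny"  # disreputable location or address
    minimum = MIN_OS_VERSION.get(ctx.os_name)
    if minimum is None or ctx.os_version < minimum:
        return "deny"  # unknown or unpatched OS: treat as a vulnerable device
    if not ctx.mfa_passed:
        return "mfa_required"  # SAML succeeded; step up to an MFA challenge
    return "allow"
```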

How the authentication sequence works:

  1. A user first authenticates with the Access Controller, which in turn delegates the authentication to an identity provider.
  2. Upon a successful login to the identity provider, the Access Controller generates a short-lived ephemeral certificate, signs and returns it to the user. Alternatively, it may generate a token in place of the certificate. As long as the certificate or the token is valid it may be used to connect to any of the authorized infrastructure or data services managed by the Access Gateway. Upon expiration, a new certificate or token must be obtained.
  3. The user passes the certificate obtained in Step (2) to a tool of their choice and connects to the Access Gateway. Depending on which service the user is requesting access for, either the Infrastructure Gateway or the Data Gateway will first validate the user’s certificate with the Access Controller before allowing them access to the service. The Access Controller thus acts as a CA between the users and the services they’re accessing, hence providing a non-repudiable identity for each user.
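The three steps can be sketched in Python as follows. The sketch takes the token variant mentioned in step 2: the Access Controller hands out random, opaque tokens with a short time-to-live, and the gateway checks each one with the controller before connecting. The class names, the 15-minute TTL, and the group check in `Gateway` are assumptions for illustration; the certificate variant would instead have the controller sign a short-lived certificate that the gateway verifies against the controller's CA.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set


@dataclass
class Grant:
    user: str
    groups: Set[str]
    expires_at: float


@dataclass
class AccessController:
    """Hypothetical Access Controller issuing short-lived opaque tokens."""
    ttl_seconds: float = 900.0
    clock: Callable[[], float] = time.time
    _grants: Dict[str, Grant] = field(default_factory=dict)

    def login(self, idp_user: str, idp_groups: Set[str]) -> str:
        # Step 2: called only after the identity provider has authenticated the user.
        token = secrets.token_urlsafe(32)
        self._grants[token] = Grant(idp_user, set(idp_groups), self.clock() + self.ttl_seconds)
        return token

    def validate(self, token: str) -> Optional[Grant]:
        # Step 3: gateways call back here before granting access.
        grant = self._grants.get(token)
        if grant is None:
            return None
        if self.clock() >= grant.expires_at:
            del self._grants[token]  # expired: the user must obtain a new token
            return None
        return grant


class Gateway:
    """Hypothetical Infrastructure or Data Gateway in front of one service."""

    def __init__(self, controller: AccessController, allowed_groups: Set[str]):
        self.controller = controller
        self.allowed_groups = allowed_groups

    def connect(self, token: str) -> bool:
        grant = self.controller.validate(token)
        return grant is not None and bool(grant.groups & self.allowed_groups)
```

With `ac = AccessController(ttl_seconds=60)`, a token from `ac.login("alice@example.com", {"SRE"})` lets `Gateway(ac, {"SRE"}).connect(token)` succeed for one minute; afterwards `connect` returns `False` and the user has to log in again, so a leaked token is only useful briefly.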

Policy Engine

While the Access Controller enforces authentication for users, the Policy Engine enforces fine-grained authorization on their requests. It accepts authorization rules in a human-friendly YAML syntax (check out examples at the end) and evaluates them on user requests and responses.

The Open Policy Agent (OPA), an open-source CNCF project, is a great example of a policy engine. It can be run as a microservice on its own or used as a library in the process space of other microservices. Policies in OPA are written in a language called Rego. For teams that prefer not to write Rego directly, it’s easy to build a simple YAML interface on top of it to simplify policy specifications.
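To illustrate what evaluating rules on user requests means, here is a toy policy evaluator in Python. It is not OPA and does not use Rego; the rules are plain dictionaries standing in for a parsed YAML file, and the rule schema, the service naming (`ssh:*`, `db:*`), and the request fields are all hypothetical. It uses default-deny with deny-overrides semantics, a common choice for access policies.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List

# Rules as they might look after loading a hypothetical YAML policy file.
# The schema is invented for illustration; it is not OPA's or Rego's format.
POLICY: List[Dict[str, Any]] = [
    {"service": "ssh:*", "command": "sudo*", "effect": "deny"},  # any SSH server
    {"service": "*", "require_mfa": True},                       # every service
    {"service": "db:*", "groups": ["Admins", "SRE"], "effect": "allow"},
]


def _matches(rule: Dict[str, Any], request: Dict[str, Any]) -> bool:
    """A rule applies when its service, command, and group conditions all hold."""
    if not fnmatchcase(request["service"], rule.get("service", "*")):
        return False
    if "command" in rule and not fnmatchcase(request.get("command", ""), rule["command"]):
        return False
    if "groups" in rule and not set(rule["groups"]) & set(request.get("groups", [])):
        return False
    return True


def evaluate(request: Dict[str, Any], policy: List[Dict[str, Any]] = POLICY) -> str:
    """Default deny; any matching deny (or unmet MFA requirement) overrides allows."""
    allowed = False
    for rule in policy:
        if not _matches(rule, request):
            continue
        if rule.get("require_mfa") and not request.get("mfa", False):
            return "deny"
        if rule.get("effect") == "deny":
            return "deny"
        if rule.get("effect") == "allow":
            allowed = True
    return "allow" if allowed else "deny"
```

Because rules match on service patterns rather than hosts or subnets, a single entry such as the `ssh:*` rule applies to every SSH server, including ones that did not exist when the policy was written.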

Having an independent policy engine separate from the security models of the infrastructure and data services themselves is advantageous for the following reasons:

  • Security policies can be specified in a service and location agnostic manner
    • E.g. Disallow privileged commands on all SSH servers
    • E.g. Enforce MFA checks for all services (both infrastructure and data)
  • Policies can be maintained in a single place and versioned
    • Policies can be checked into a GitHub repository as code
    • Every change goes through a collaborative review process before being committed
    • A version history exists to make it easy to revert policy changes

Both the Infrastructure Gateway and Data Gateway depend on the Policy Engine for evaluating infrastructure and data activity, respectively, by users.

Infrastructure Gateway

The Infrastructure Gateway manages and monitors access to infrastructure services such as SSH servers and Kubernetes clusters. It interfaces with the Policy Engine to determine granular authorization rules and enforces them on all infrastructure activity during a user session. For load balancing purposes, the gateway may comprise a set of worker nodes, be deployed as an auto-scaling group on AWS, or run as a replica set on a Kubernetes cluster.

HashiCorp Boundary is an example of an Infrastructure Gateway. It’s an open-source project that enables developers, DevOps engineers, and SREs to securely access infrastructure services (SSH servers, Kubernetes clusters) with fine-grained authorization, without requiring direct network access and without the need for VPNs or bastion hosts.

The Infrastructure Gateway understands the various wire protocols used by SSH servers and Kubernetes clients, and provides the following key capabilities:  

Session recording

This involves making a copy of every command executed by the user during a session. The captured commands will typically be annotated with additional information, such as the identity of the user, the various identity provider groups they belong to, the time of the day, the duration of the command, along with a characterization of the response (whether it was successful, whether there was an error, whether data was read or written to, etc.).
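An annotated session record might look like the following Python sketch. The field names and the JSON Lines output are assumptions, not the schema of any particular gateway, and an in-memory list stands in for durable audit storage.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import List


@dataclass
class CommandRecord:
    user: str               # identity asserted by the identity provider
    idp_groups: List[str]   # the user's identity provider groups
    service: str            # e.g. "ssh:web-1" (naming scheme is hypothetical)
    command: str
    started_at: str         # ISO 8601 timestamp
    duration_ms: float
    outcome: str            # e.g. "success" or "error"
    data_access: str        # e.g. "read", "write", or "none"


def record_command(log: List[str], user: str, groups: List[str], service: str,
                   command: str, started: datetime, finished: datetime,
                   outcome: str, data_access: str) -> CommandRecord:
    """Build an annotated record and append it to `log` as one JSON line."""
    record = CommandRecord(
        user=user,
        idp_groups=list(groups),
        service=service,
        command=command,
        started_at=started.isoformat(),
        duration_ms=(finished - started).total_seconds() * 1000,
        outcome=outcome,
        data_access=data_access,
    )
    # A list stands in for durable, append-only audit storage.
    log.append(json.dumps(asdict(record), sort_keys=True))
    return record
```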

Activity monitoring

Monitoring takes the notion of session recording to the next level. In addition to capturing all commands and responses, the Infrastructure Gateway applies security policies on the user’s activity. In the case of a violation, it may choose to trigger an alert, block the offending command and its response, or terminate the user’s session altogether.
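The three possible responses to a violation (alert, block, terminate) can be sketched as a small per-session monitor in Python. The regular-expression rules are hypothetical examples, and matching raw command text is a simplification; a real gateway would work from the parsed wire protocol.

```python
import re
from enum import Enum
from typing import List, Tuple


class Action(Enum):
    ALLOW = "allow"
    ALERT = "alert"          # let the command run, but raise an alert
    BLOCK = "block"          # drop this command and its response
    TERMINATE = "terminate"  # end the whole session


# Hypothetical rules; the first matching pattern decides the action.
RULES: List[Tuple[re.Pattern, Action]] = [
    (re.compile(r"^\s*rm\s+-rf\s+/\s*$"), Action.TERMINATE),
    (re.compile(r"^\s*sudo\b"), Action.BLOCK),
    (re.compile(r"/etc/shadow"), Action.ALERT),
]


class SessionMonitor:
    def __init__(self, rules: List[Tuple[re.Pattern, Action]] = RULES):
        self.rules = rules
        self.alerts: List[str] = []
        self.terminated = False

    def inspect(self, command: str) -> Action:
        if self.terminated:
            return Action.TERMINATE  # nothing runs after a termination
        for pattern, action in self.rules:
            if pattern.search(command):
                if action is Action.ALERT:
                    self.alerts.append(command)
                elif action is Action.TERMINATE:
                    self.terminated = True
                return action
        return Action.ALLOW
```

An alert lets the command through while recording it; a termination is sticky, so every later command in the same session is refused.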

Data Gateway

The Data Gateway manages and monitors access to data services such as hosted databases (MySQL, PostgreSQL, and MongoDB), DBaaS endpoints such as AWS RDS, data warehouses such as Snowflake and BigQuery, cloud storage such as AWS S3, and message pipelines such as Kafka and Kinesis. It interfaces with the Policy Engine to determine granular authorization rules and enforces them on all data activity during a user session.

Similar to the Infrastructure Gateway, the Data Gateway may comprise a set of worker nodes, be deployed as an auto-scaling group on AWS, or run as a replica set on a Kubernetes cluster.

Due to the wider variety of data services compared to infrastructure services, a Data Gateway will typically have support for a large number of wire protocols and grammars.

An example of such a Data Gateway is Cyral, a lightweight interception service that is deployed as a sidecar to monitor and govern access to modern data endpoints such as AWS RDS, Snowflake, BigQuery, AWS S3, Apache Kafka, etc. Its capabilities include:

Session recording

This is similar to recording infrastructure activity: every command executed by the user during a session is copied and annotated with rich audit information.

Activity monitoring

Again, this is similar to monitoring infrastructure activity. For example, a policy can block data analysts from reading customers’ sensitive PII.
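A policy of that kind reduces to a check like the one below, sketched in Python. It assumes the Data Gateway has already parsed the query into a table name and a column list, and the `DataAnalysts` group name and the `PII_COLUMNS` catalog are hypothetical.

```python
from typing import Dict, Iterable, Set

# Hypothetical data catalog: which columns of which tables hold customer PII.
PII_COLUMNS: Dict[str, Set[str]] = {
    "customers": {"email", "ssn", "phone"},
}

RESTRICTED_GROUP = "DataAnalysts"  # hypothetical identity provider group name


def check_read(user_groups: Iterable[str], table: str, columns: Iterable[str]) -> bool:
    """Return True if the read may proceed, False if the gateway should block it.

    Assumes the query has already been parsed into a table and a column list.
    """
    if RESTRICTED_GROUP not in set(user_groups):
        return True
    pii = PII_COLUMNS.get(table, set())
    requested = set(columns)
    # "SELECT *" touches every column, including any PII columns.
    touches_pii = bool(requested & pii) or ("*" in requested and bool(pii))
    return not touches_pii
```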

Privacy enforcement

Unlike infrastructure services, data services grant users read and write access to sensitive data about customers, partners, and competitors, which often resides in databases, data warehouses, cloud storage, and message pipelines. For privacy reasons, a very common requirement for a Data Gateway is the ability to scrub (also known as tokenization or masking) PII such as emails, names, social security numbers, credit card numbers, and addresses.
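The two flavors of scrubbing differ in what survives. The Python sketch below contrasts masking, which keeps a partial human-readable value, with tokenization, approximated here by a keyed HMAC so that the same email always yields the same token. The email regex is deliberately simple, and keyed hashing is only one way to tokenize; vault-based tokenization that can reverse tokens is another.

```python
import hashlib
import hmac
import re
from typing import Optional

# Deliberately simple pattern for illustration; real PII detection needs more than a regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_email(email: str) -> str:
    # Masking: keep the first character and the domain, hide the rest.
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


def tokenize_email(email: str, key: bytes) -> str:
    # Tokenization via keyed hash: the same address always maps to the same token,
    # so counts and joins still work without revealing the address itself.
    digest = hmac.new(key, email.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"


def scrub_emails(text: str, key: Optional[bytes] = None) -> str:
    """Replace every email in free text with a token (if a key is given) or a mask."""
    if key is None:
        return EMAIL_RE.sub(lambda m: mask_email(m.group(0)), text)
    return EMAIL_RE.sub(lambda m: tokenize_email(m.group(0), key), text)
```

For instance, `scrub_emails("mail jane.doe@example.com now")` returns `"mail j***@example.com now"`.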

So how does this architecture simplify access management?

Let’s look at some common access management scenarios to understand how the Access Gateway architecture provides fine-grained control compared to using VPNs and bastion hosts.

Privileged Activity Monitoring (PAM)

Here’s a simple policy to monitor privileged activity across all infrastructure and data services in a single place:

  • Only individuals belonging to the Admins and SRE groups are allowed to run privileged commands on SSH servers, Kubernetes clusters, and databases.
  • While it’s okay to run privileged commands, there are a few restrictions in the form of exceptions. Specifically, the following commands are disallowed:
    • “sudo” and “yum” commands may not be run on any SSH server
    • “kubectl delete” and “kubectl taint” commands may not be run on any Kubernetes cluster
    • “drop table” and “create user” commands may not be run on any database
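The PAM policy above boils down to a group check plus a list of disallowed command prefixes per service type. Here is a Python sketch of that logic; the service-type labels are hypothetical, and comparing leading words is a simplification.

```python
from typing import Dict, Iterable, List, Tuple

PRIVILEGED_GROUPS = {"Admins", "SRE"}

# The policy's exceptions, keyed by a hypothetical service-type label.
DISALLOWED: Dict[str, List[Tuple[str, ...]]] = {
    "ssh": [("sudo",), ("yum",)],
    "kubernetes": [("kubectl", "delete"), ("kubectl", "taint")],
    "database": [("drop", "table"), ("create", "user")],
}


def allow_privileged_command(groups: Iterable[str], service_type: str, command: str) -> bool:
    """Return True if a privileged command may run under the PAM policy."""
    if not PRIVILEGED_GROUPS & set(groups):
        return False  # only Admins and SRE may run privileged commands at all
    # Compare leading words case-insensitively, so "DROP  TABLE" matches ("drop", "table").
    words = tuple(command.lower().split())
    for prefix in DISALLOWED.get(service_type, []):
        if words[:len(prefix)] == prefix:
            return False
    return True
```

A command such as `bash -c "sudo reboot"` would slip past this prefix check, which is one reason a real gateway inspects commands at the protocol level rather than as raw strings.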

Zero Standing Privileges (ZSP) Enforcement

The next policy shows an example of enforcing zero standing privileges, a paradigm in which no one has access to an infrastructure or data service by default. Access may be obtained only upon satisfying one or more qualifying criteria:

  • Only individuals belonging to the Support group are allowed access
  • An individual must be on call to gain access. On-call status may be determined by checking their schedule in an incident response service such as PagerDuty
  • A multi-factor authentication (MFA) check is triggered upon successful authentication
  • They must use TLS to connect to the infrastructure or data service
  • Lastly, if a data service is being accessed, full table scans (e.g. SQL requests lacking a WHERE or a LIMIT clause that end up reading an entire dataset) are disallowed.
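A Python sketch of the zero standing privileges check follows. It treats every criterion in the list as required and denies everything else. The on-call lookup is passed in as a function because no specific PagerDuty API call is assumed here, and the full-table-scan test is the crude rule from the policy (a SELECT with neither WHERE nor LIMIT), which subqueries or comments could fool.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass
class AccessRequest:
    user: str
    groups: Set[str]
    mfa_passed: bool
    tls: bool
    service_kind: str            # "infrastructure" or "data"
    query: Optional[str] = None  # only meaningful for data services


def is_full_table_scan(sql: str) -> bool:
    # Crude heuristic: a SELECT with neither a WHERE nor a LIMIT clause.
    s = sql.lower()
    return bool(re.match(r"\s*select\b", s)) and not re.search(r"\b(where|limit)\b", s)


def zsp_allows(req: AccessRequest, is_on_call: Callable[[str], bool]) -> bool:
    """Deny by default; grant only when every qualifying criterion holds."""
    if "Support" not in req.groups:
        return False
    if not is_on_call(req.user):  # e.g. backed by an on-call schedule lookup (not shown)
        return False
    if not req.mfa_passed or not req.tls:
        return False
    if req.service_kind == "data" and req.query and is_full_table_scan(req.query):
        return False
    return True
```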

Privacy and Data Governance

The last policy shows an example of data governance involving data scrubbing:

  • If anyone from Marketing is accessing PII such as social security numbers (SSNs), credit card numbers (CCNs), or ages, scrub the data before returning it
  • If anyone is accessing PII using the Looker or Tableau services, also scrub the data
  • Scrubbing rules are defined by the specific type of the PII
    • For SSNs, scrub the first 5 digits
    • For CCNs, scrub the last 4 digits
    • For ages, scrub the last digit, i.e., the requester will know the age bracket but never the actual age
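The scrubbing rules can be sketched in Python as a per-type function table applied to each result row. How the gateway learns that a connection comes from Looker or Tableau (for example, from a dedicated service account) is an assumption here, as are the column-to-PII-type mapping and the `*` mask character.

```python
from typing import Any, Callable, Dict, Iterable

SCRUB_GROUPS = {"Marketing"}
SCRUB_CLIENTS = {"Looker", "Tableau"}  # how the client is identified is an assumption


def _mask_digits(value: str, count: int, from_end: bool) -> str:
    """Replace `count` digits with '*', leaving separators such as '-' or ' ' intact."""
    positions = [i for i, ch in enumerate(value) if ch.isdigit()]
    chosen = positions[-count:] if from_end else positions[:count]
    chars = list(value)
    for i in chosen:
        chars[i] = "*"
    return "".join(chars)


# One scrubbing rule per PII type, as listed in the policy.
SCRUBBERS: Dict[str, Callable[[Any], str]] = {
    "ssn": lambda v: _mask_digits(str(v), 5, from_end=False),  # first 5 digits
    "ccn": lambda v: _mask_digits(str(v), 4, from_end=True),   # last 4 digits
    "age": lambda v: str(v)[:-1] + "*",                        # 37 -> "3*"
}


def scrub_row(row: Dict[str, Any], column_types: Dict[str, str],
              groups: Iterable[str], client_app: str) -> Dict[str, Any]:
    """Return a copy of `row` with PII scrubbed when the policy applies."""
    if not (SCRUB_GROUPS & set(groups) or client_app in SCRUB_CLIENTS):
        return dict(row)
    return {
        col: SCRUBBERS[column_types[col]](val) if column_types.get(col) in SCRUBBERS else val
        for col, val in row.items()
    }
```

For a Marketing user, the row `{"ssn": "123-45-6789", "card": "4111 1111 1111 1111", "age": 37}` (with `ssn`, `card`, and `age` typed as `ssn`, `ccn`, and `age`) comes back as `{"ssn": "***-**-6789", "card": "4111 1111 1111 ****", "age": "3*"}`.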


We saw that for highly dynamic cloud environments, VPNs and bastion hosts are inadequate access management mechanisms. A new access management architecture with a focus on a non-repudiable user identity, short-lived certificates or tokens, and a centralized fine-grained authorization engine effectively solves the challenges that VPNs and bastion hosts fail to address. In addition to providing comprehensive security for users accessing critical infrastructure and data services, the architecture helps organizations achieve their audit, compliance, privacy, and governance objectives.

We also discussed a reference implementation of the architecture using well-known, developer-focused open-source solutions such as HashiCorp Boundary and OPA, in conjunction with Cyral, a fast and stateless sidecar for modern data services. Together they can provide a fine-grained, easy-to-use access management solution on the cloud.

About the Author

Manav Mital is the co-founder and CEO of Cyral, the first cloud-native security service that delivers visibility, access control and protection for the Data Cloud. Founded in 2018, Cyral works with organizations of all kinds, from cloud-native startups to Fortune 500 enterprises, as they embrace DevOps culture and cloud technologies for managing and analyzing their data. Manav has an MS in Computer Science from UCLA and a BS in Computer Science from the Indian Institute of Technology, Kanpur.
