Beyond Entitlements for Cloud-native

Summary

Chandra Guntur and Hong Liu show how they use Open Policy Agent with Spring Boot and HOCON to produce a responsibility management solution that scales to volume and performance needs. They also show some hiccups that they faced while deriving the most optimal solution for their needs. A short explanation of some tooling they built for validating the policy files in the IDE is also discussed.

Bio

Chandra Guntur is a Director and Java Advocate in Resilient Systems Engineering, BNY Mellon. Hong Liu is a principal developer in Resilient Systems Engineering, BNY Mellon.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Guntur: We're going to talk about entitlements, and beyond entitlements, for cloud-native. We are going to talk about scalable responsibility management. We have a few toolsets that we use, including Spring Boot and Open Policy Agent. Without further ado, a small disclosure from our workplace that we have to show.

Moving on, I'm going to introduce Hong [Liu] as well. Hong is the principal developer at BNY Mellon and she has about 18 years of experience. She recently started working more focused towards microservices and artificial intelligence. Hong is amazing at creating plugins as well. With that, I'm going to move onto myself. I'm a senior principal architect at BNY Mellon. I have been in this industry doing Java for about a few years, give or take. I'm one of the representatives for BNY Mellon at the JCP executive committee, and I'm also JUG leader for New York JavaSIG. Please do attend New York JavaSIG if you get a chance. They're an amazing group.

The agenda for today is we are going to talk about responsibility management. We will talk about some of the technology choices we opted for. We will also cover the architecture that we used in order to come up with our responsibility management, which will also include a failed pattern and something we used to improve it. We will include code samples on what we are talking about, and in the end we'll also talk about an IntelliJ plugin that we wrote to improve productivity. The session is going to be broken into two parts. I will talk about the first few sections and then Hong is going to pick up and talk more about the architecture and the code samples.

Responsibility Management

Before we begin, I want to take a few minutes of your time, four or five minutes, to just explain the rationale for why do we need responsibility management? The role-based access control system is great. It allows you to go define some sort of data matrix, where you can determine who can get access to what based upon certain conditions. Responsibility management takes it to the next level because it's more functional. Let's take a quick look at some rationales.

Let's assume you work in a corporate place and entitlements are to be provided, or access has to be granted or denied, based upon a user's membership in an LDAP group. We have LDAP groups and someone is setting up users in an LDAP group or taking them out of an LDAP group. Depending on that, you're either whitelisted or blacklisted from getting access to something. A service wants to check this out to use it for their entitlements. They go access the said group, they find out if there is access and they action it accordingly.

What if there was another service that needed the same thing? What will they do? They do use these LDAP groups. They connect to an LDAP directory. They use whatever LDAP groups they need and they find out if entitlements are needed or not. What if there are multiple services that need this? What happens if employees are moving out of the organization, moving out of the company or joining, what do you do? These are questions that are normally asked and, usually, there are systems that cater to movers, leavers, and joiners, that's what they're called. It is quite autonomous in how you interact with these systems. Everybody is trying to connect to the directory; therefore denial-of-service is possible too.

Another scenario: you got LDAP groups, which is awesome. You can also do email groups. There are companies where whether you are a member of an email group determines whether you have access to something or not. Maybe you are in an approval email group, therefore you can approve some tasks, maybe you are not, therefore you cannot, etc. What if multiple services need the same thing? How are you going to handle this? How do you handle movers and leavers? Emails can actually have some sort of a mechanism where someone has left the organization or moved on to a different group. Their email can still stick around in that group for a longer period of time and therefore, they may be able to do things that they should not be allowed to do. How do you control this?

Another scenario, which could be more complex, is when the company grows and you get more and more policies: things start evolving and then you're stacking up more conditions. What if this user was in this LDAP group but not in this LDAP group? Let's give them access. What if this user also has to be in the approval email group? What if this user is allowed to only approve if the order is less than a certain dollar amount? What if there is an HR function associated with this user?

These are the different constraints that you could come up with. I just made up these constraints; there could be so many other constraints that you could come up with. How do you handle this today? What would most people do? They would probably code it out in the system. They would write some sort of rules and we'll talk about what the different options currently being used are shortly. There are different kinds of amounts, different kinds of variables that you can stack into this. Where are these governed? Who's actually maintaining these policies? Who's actually following the SDLC process for these policies to be set up? We don't know. They're a gray area. They're probably in an application, they're probably elsewhere. Who knows? And finally, who's managing the movers, leavers and joiners, and how are you maintaining all those brackets or HR functions? You're accessing different systems.

And finally, you have something called a role-based access control mechanism for a given domain, in a given cost code or organization, for a given environment (production, dev, test, etc.), for a given action, let's say, edit, delete, create, etc. How are you authorizing somebody to go ahead and action something? What is your policy? What do you do? You basically use some sort of a role-based access control system, which is not tied to any of the functions that we covered earlier. These are the use cases that we are talking about.

This is what led us to come up with what is called a responsibility management system. We already asked these questions, but more importantly, the third question is what I'd like to focus on. Who manages these roles and responsibilities? Who manages the user-to-role mappings? Are they done in a central team? Are they a part of your team? Do you have to connect to them? Does every application connect to them? Could it lead to some sort of throttling on services? How are you maintaining high availability? These are questions that you will get asked.

Common Solutions

We were talking about the common solutions that we had for this. I'm going to break this common solution into two parts. There is a data part and there's a logic part. The data part is similar to RBAC solutions, where you have some data stored somewhere in some persistence layer, or in some sort of a service that can access a persistence layer, and you are querying this for information and someone is updating these data points on a constant basis, hopefully. Then there are systems that are usually called integrated services, such as an LDAP directory or Active Directory, and applications and services directly connect to these systems. User approvers, user managers, etc., are governed by proprietary corporate data structures, which are again queried by every system independently. Role-responsibilities and user-role mappings are either locally persisted or there are some kind of proprietary systems that are either bought through a vendor or developed in-house.

This is data. Now, data was easy. Data is just storage, right? I'm not trying to demean data, but data is storage. What if you had functions? What if you had logic? Like, this person is only allowed to approve up to a certain value point. This person should have X number of people under him or her during the time period when the approval is going, etc. How do you handle functions, especially complex functions? In most of what I would term legacy apps (I'm not, again, trying to demean the term legacy, but in apps that were developed earlier in the day), these were coded into the application or service. Newer applications and modern services try to separate it out by creating what is called a service, an independent microservice that can handle this for you.

Some applications use rules engines such as Drools, which are limited in scope in terms of what technologies they can work with. Then there are some proprietary systems that can also be used for evaluations. Again, there is a plethora of choices available for people to go do things. When you have a plethora of choices to do things, there are a plethora of ways you could go wrong or have errors. This is what we're trying to go about and solve at the workplace.

Responsibility Management Cycle

This led us to come up with what we call a responsibility management system. We will talk a little bit about that. Responsibility management was supposed to be a service that we offer, which allows us to cater to both data and logic that people are querying for, make it highly available and be resistant to some sort of back pressure and volume loads. Let's take a look at what we did, but before we go there, the first thing we need to talk about, as we mentioned this term very briefly, is the thing called policies. Responsibility management is made up of policies and these policies have a lifecycle. Let's take a quick look at what the lifecycle looks like.

You first have what is called the policy administration. This is where you will actually create the policies. You will author them and you will store them in some location. Once you’ve got the policy administration, you go into what is called a policy distribution mechanism. You create the policies, they're stored somewhere. You pick up these policies and distribute to whichever application or service needs them. Then you have what is called a policy decision making and a policy enforcement, which is where your policies are being picked up by those services or applications, and they are querying them to get decisions. These decisions are then used to enforce whether or not an action should be taken, prevented, entitled or disentitled. Then eventually in the course of the cycle, you will also need to curate these policies, because things have changed, users have moved on, or the company policies have changed. This is called the policy reconciliation.

This cycle continues on and on, and this is how you maintain policies. This is called a responsibility management cycle. Now, obviously, this is a very short picture. There is a much more complex picture associated with it in the appendix. If we have time, we will go through that as well. Now, this was responsibility management, so let us take a quick look at what things looked like before we introduced a responsibility management service. Before we do that, let's also look at what the right solution for responsibility management looks like.

The Right Solution

You have a responsibility management service that can federate all the calls to integrated services, such as LDAP and Active Directory. It should provide for roles and responsibilities for you. It should provide for user-to-role mapping, and it should provide for a proper SDLC mechanism for you to audit, and provide to auditors on demand. In addition, our responsibility management system should provide for some sort of a policy engine that can evaluate any kind of complex calculation, and should be able to get data from multiple sources. One of the data sources is an ad hoc data source: the service querying this policy engine should be able to provide some data. Some data should be able to be picked up from integrated services such as LDAP and other sources that we have. Then you could also have policies that are generated and stored elsewhere being accessed. It should also cater to mover/leaver logic. It should horizontally scale and be highly available to any system that needs it.

Let's move on to what it looked like before we introduced a responsibility management service. We had a hodgepodge of systems that were connecting individually, trying to access responsibilities on their own. This was very decentralized. Auditing was per application and user management was very bespoke. So anytime an auditor came in and tried to figure out how you handled users, how you handled the system at this time, what the bitemporal milestone associated with this data or this policy was, there were no answers and everything had to be done on a per-application basis. It was highly inefficient and not cost effective.

Let's take a look at what we did with RMS. Obviously, you are now looking at a far simpler picture. Everything now reaches out to a single responsibility management service, which in turn accesses some sort of a role service which is centrally governed, and all users are directly connecting to this responsibility management system, which then enforces all the policies and can reach out to other systems that lie underneath it. This was a much cleaner solution. This was more centralized. Auditing became centralized and user management became centralized. When audit comes in, they have a very easy time going through a single report produced by, let's say, an RMS kind of a service.

Technologies Used

With that, the next thing I wanted to focus on is the technologies we used in order to come up with this solution. We had a few choices that we needed to make and I'm going to talk about two technologies, and then I'll leave it to Hong, who is going to cover the architecture and the code samples. The first technology choice was the payload format that we use for the responsibility management system. It was designed to be a RESTful service. We wanted a RESTful interface with POST and GET operations. We used what is called HOCON, a Human Optimized Configuration Object Notation. I'm going to talk a little bit about what HOCON looks like.

Before that, we had the choice of either doing a POST or a GET operation. Policies are queried; you should not have mutating policies, so POST operations were out of the door, but POST operations are the ones that have request bodies. GET operations don't. Obviously, when you try to create a GET operation with a very large parameter body, there may be some constraints placed on that. Also, it becomes very verbose and not bookmark-able or editable again. JSON and individual query parameters led to a lot of problems for us because it became a very verbose query string.

HOCON

HOCON, Human Optimized Configuration Object Notation, solves that for us. In the next two slides, we're going to see what HOCON looks like. First, why HOCON? HOCON has a very simple syntax. It's very easy to understand. It is a superset of JSON, which means that any JSON that you pass in is automatically treated as HOCON. It allows for the use of comments. Probably anyone who's used YAML and JSON knows that that's one of the biggest drawbacks. You cannot create comments in your code. It allows for multi-line strings. You could have more verbose data being passed around. It allows for something that we'll look at in the next slide called includes and substitutions. These are amazing features and we will see in the next slide why. It has built-in durations. No more confusion about was it a millisecond, a nanosecond or a second or an hour; you actually get durations built into HOCON.

Now for a little bit of a sample on what we are talking about, inclusions and substitutions. The first arrow that you will notice, substitution, is where I am passing in x and I am declaring y to be ${x}, which means the values that it will eventually show would be x equals 10 and y equals 10. Then you have what is called inclusions, wherein I am able to include content or data that lies elsewhere, either in my request payload or on the RMS server. So I may be passing a very trimmed-down version, hoping and expecting that there is some other content that I'm going to merge with at the service endpoint. This is where I am including the content of the first file into the second file, and then getting all the values out. The way it reads in my.conf is a.x equals 10, a.y equals 10, and a.z equals 5 seconds. If you notice, I am not mentioning whether this was a time or whatever; I can very easily understand this to be a duration, because it does say 5s, and this is something that HOCON picks up for you and passes along converted into a duration.
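The substitution, inclusion, and duration behaviour described above can be imitated in a few lines. This is a minimal Python sketch and not a HOCON parser: the dictionaries stand in for the two files on the slide, and the helper names `resolve` and `as_duration` are assumptions made for illustration.

```python
import re
from datetime import timedelta

SUBSTITUTION = re.compile(r"\$\{(\w+)\}")
DURATION = re.compile(r"^(\d+)(ms|s|m|h)$")
UNIT_KEYWORDS = {"ms": "milliseconds", "s": "seconds", "m": "minutes", "h": "hours"}


def resolve(settings):
    """Replace ${name} references with the value of a sibling key (single pass)."""
    resolved = {}
    for key, value in settings.items():
        if isinstance(value, str):
            value = SUBSTITUTION.sub(lambda m: str(settings[m.group(1)]), value)
        resolved[key] = value
    return resolved


def as_duration(text):
    """Turn a value such as '5s' into a timedelta."""
    match = DURATION.match(text)
    if match is None:
        raise ValueError(f"not a duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{UNIT_KEYWORDS[unit]: int(amount)})


# Stand-in for the first (included) file: x = 10, y = ${x}
included = {"x": "10", "y": "${x}"}

# Stand-in for my.conf: the included content is merged under "a", plus z = 5s
my_conf = {"a": {**resolve(included), "z": "5s"}}

print(my_conf["a"]["x"], my_conf["a"]["y"], as_duration(my_conf["a"]["z"]))
```

A real HOCON library resolves substitutions and includes across the whole document and supports more duration units; the sketch only mirrors the three values read out from the slide.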

Here's a small sample of what we're talking about, how HOCON differentiates itself from JSON. This is a JSON payload. You have foo, bar, baz, and baz has the value myvalue. Its equivalent in HOCON would look like a dot notation, foo.bar.baz equals myvalue, or something similar to JSON, where I just have foo, bar, baz and myvalue nested and I don't need the colons. Again, I'm trimming down on the payload content or my parameter content by passing in HOCON.
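To make the equivalence concrete, here is a small Python sketch that expands a dotted path into the nested structure the JSON payload spells out. The `expand` helper and the one-line split on `=` are illustrative assumptions; real HOCON parsing handles far more than this.

```python
import json


def expand(dotted):
    """Expand {'foo.bar.baz': 'myvalue'} into nested dictionaries."""
    root = {}
    for path, value in dotted.items():
        *parents, leaf = path.split(".")
        node = root
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return root


json_payload = '{"foo": {"bar": {"baz": "myvalue"}}}'
hocon_payload = "foo.bar.baz = myvalue"

# Toy parse of the single HOCON line above.
key, value = (part.strip() for part in hocon_payload.split("=", 1))

# Both payloads describe the same document; the HOCON form is shorter.
print(expand({key: value}) == json.loads(json_payload))
print(len(json_payload), len(hocon_payload))
```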

In a larger example, if you notice, the difference between an employee payload on the JSON side and an employee payload on the HOCON side is in the colons. You can do substitutions, as in the case of full name: if you notice, I have employee.firstname and employee.lastname instead of passing in a hard-coded value. Obviously, I'm using time as a duration rather than something I have to explicitly declare. There are some benefits: you actually trim down on what you're sending, and your payload becomes much more human optimized, hence the name HOCON.

So this was one of the first things that we introduced. We decided that our payloads were going to be based on HOCON, rather than JSON or YAML. They were much more readable and obviously with inclusions, we got a lot more benefit by deferring a lot of data on the server side, than carrying all the payload over to the server at every service call. It was amazing. If you want to look it up, HOCON is at GitHub and the link is listed at the bottom of the screen.

Eclipse Collections

The next choice we had to make came from the fact that we heavily built our services in Java and we were finding performance issues with the collections that we used in Java. Therefore, we switched to using Eclipse Collections as our collections library. Eclipse Collections has a rich and concise API. It has clear mutable and immutable hierarchies that you can use. It produces memory-efficient containers. It is optimized for eager APIs, which the Java Collections Framework doesn't have. It improved code readability, and of course we had a very low learning curve because of the number of katas that we had in order to learn Eclipse Collections. With that, I'm going to pass it over to Hong [Liu], who's going to now talk about the architecture and the code samples for Open Policy Agent.

Open Policy Agent (OPA)

Liu: We'll continue on with the choices for RMS. The next topic is Open Policy Agent, OPA. First, what is OPA? OPA is an open-source, general-purpose policy engine that enables policy enforcement. Why did we choose OPA as the policy engine for the RMS? One important reason is that OPA decouples policy decisions from application services. Application services offload policy decisions to OPA, and the policies written in OPA can be automatically enforced at any time. There's no need for application services to recompile or redeploy. An application service can integrate with OPA in one of three ways: as a library, as an independent daemon, or as a sidecar. The RMS uses OPA as a sidecar, and the service calls OPA through a RESTful API call.

Another feature of Open Policy Agent is its language, Rego. Policies in OPA are written in Rego, and Rego is a declarative query language. It is declarative, which supports hot reloading and performance optimization. It's also a query language; actually, it's very powerful in referencing nested documents. This is the high-level view of Open Policy Agent.

Next, we'll take a look at one simple example of how OPA can be used. We have a service called service one. An Open Policy Agent is deployed as a sidecar alongside service one. Service one queries the Open Policy Agent by making a RESTful API call with the query payload. The Open Policy Agent receives the query, evaluates it against the OPA data and the OPA policy to produce a result, and sends it as a decision back to service one.

You can see Open Policy Agent consists of two things. One is data. Data is in a JSON document, and the policy is written in Rego. For this example, service one queries the Open Policy Agent by sending this payload. It checks if client one has read access to bucket2. So what does OPA do? OPA traverses the OPA data in JSON looking for a match. This is a match. If the match exists, the policy will return true; otherwise it will return false by default. This is a very simple example of how OPA can be used. For more information, you can refer to the Open Policy Agent website, openpolicyagent.org.
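The decision flow just described can be modelled in plain Python. This is a sketch under stated assumptions: the data layout and the `allow` function are hypothetical stand-ins for the JSON document and the Rego policy on the slide, which are not reproduced in the transcript. The `input` and `result` wrappers follow the request and response shape of OPA's REST data API.

```python
import json

# Assumed shape of the OPA data document.
opa_data = {
    "access": {
        "client1": {"bucket1": ["read", "write"], "bucket2": ["read"]},
        "client2": {"bucket1": ["read"]},
    }
}


def allow(query, data):
    """Model of a default-false policy: true only when a matching grant exists."""
    grants = data["access"].get(query["client"], {}).get(query["resource"], [])
    return query["action"] in grants


# The caller's document travels in an "input" field;
# the decision comes back in a "result" field.
request_body = {"input": {"client": "client1", "action": "read", "resource": "bucket2"}}
decision = {"result": allow(request_body["input"], opa_data)}

print(json.dumps(decision))
```

In the real system the evaluation happens inside the OPA sidecar; the Python function only mirrors the "match, otherwise false by default" behaviour.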

Responsibility Management System Architecture

Next, let's move onto the architecture of the responsibility management system. The first version of the RMS architecture is a federated responsibility management service. Let's take a look at the details. The policy setup process: the policy needs to be set up in the RMS first, and the policies belong to their corresponding domain. The domain is a very important concept in an RMS. A domain can be an organization, a department, or even an application. Different domains have different policies, and the policies from all domains are stored in the rule repository. Policy information points: these federate the calls to LDAP, Active Directory, the user service, and the role service as integrated services. With policy information points, we can check if a user belongs to an LDAP group, if a user belongs to an email group, or if a user belongs to a user service group.

Responsibility management: that's the main component of an RMS. It pulls the policies from the rule repository and loads them into Open Policy Agent. The RMS service consumers are the clients from multiple domains who are sending requests to the RMS service, and they get the decision back as a response from the service. So this is just a high-level view of RMS architecture version one. There are some key issues with this federated architecture solution. One is that multiple services from multiple domains share one instance of RMS, and therefore one instance of Open Policy Agent. Another issue is that the RMS pulls the policies from the rule repository. Let's go into the details of these issues.

The first issue occurs whenever we have a policy change in one domain; because multiple domains share one instance of RMS, the service will become intermittent. This RMS downtime will be for all domains, instead of one particular domain. The same thing happens with policy bugs. If we have a bug in a policy script, this is going to lead to a loss of service for all domains. The RMS becomes the gatekeeper for testing and coverage.

Another issue with this architecture is when a new policy is added. Because the RMS pulls the policies from the repository, the service needs to figure out a way to pick up those new policies. This is for new policies; how about for new domains? Our observation for onboarding new domains was that policy changes were more frequent, and new bugs would be introduced. Again, the RMS became the gatekeeper for testing and coverage. For segregation and information barriers, you need more work. So these are the issues that come up with architecture version one for RMS. To solve these issues, I'd like to talk about the design of RMS architecture version two: a distributed responsibility management service.

In version two, the policy setup process is the same as in version one. The policy information point is the same. Responsibility management, the main component, is split into two. The first one is the policy administration service. This service provides APIs for the client to add and update role, responsibility, and membership information. This information will be saved into the role service, and then later on the policy administration service will gather data from the role service and generate the JSON document as OPA data.

The policy administration service also provides APIs for the client to publish the policy: the RMS is not pulling anymore, the client is publishing the policy to the service. Once the service has both the data and the policy, it will create policy bundles, and the policy bundles will be stored in a policy bundles repository. We'll get into policy bundles later in this presentation. Another component introduced in version two is the policy distribution service. This service gets a policy bundle from the bundle repository and distributes it to the Open Policy Agent.

Then you can see in version two, Open Policy Agent is no longer one instance. It is deployed as a sidecar alongside an application service instance. Also, Open Policy Agent gets the policy reference data from the policy information points. So this is the RMS architecture version two. What are the benefits of this distributed architecture solution, compared to version one?

In version one, a policy change or policy bug affected all domains; in version two, the RMS downtime or outage is only for that particular domain, not for all domains anymore. Also, in version two, the RMS is no longer the gatekeeper for testing and coverage. For new policies and new domains, because the domains can push policies on demand, the issue with the pulling mechanism from version one is no longer here. There is no additional work for segregation and information barriers either. You can see, version two solves all the issues in version one. Besides this, a new feature called RBAC support, the role-based access control policy library, is introduced in version two as well. We'll get into that later.

Policy Bundles Repository

But first, let's take a look at the policy bundles repository. Policy bundles consist of two things: data and policies. Data is specific to that particular domain, and the policies include the policies specific to that domain, as well as the common policies across all domains. In this example on the right side of the slide, you can see data.json. This is the policy data specific to domain one. The policy Rego file is the policy specific to domain one. All the Rego policies under RMS, including RBAC, LDAP, and Active Directory, are the common policies across all domains. RMS will bundle all this into a tar.gz file. This bundle, the tar.gz, will be stored in the policy bundle repository under its domain, policy name, and version number. In other words, if you want to get a particular bundle from the bundle repository, three parameters need to be specified: domain, policy name, and the version number.
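As a rough illustration of the bundling step, the following Python sketch packs a domain's data file, its domain policy, and one common policy into an in-memory tar.gz and stores it under a key built from domain, policy name, and version. The file names, the placeholder contents, and the `bundle_key` layout are assumptions, not the ones used by the RMS.

```python
import io
import json
import tarfile


def build_bundle(files):
    """Pack {path: text} into an in-memory tar.gz and return its bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path, text in files.items():
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=path)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def bundle_key(domain, policy_name, version):
    """Hypothetical repository key built from the three lookup parameters."""
    return f"{domain}/{policy_name}/{version}/bundle.tar.gz"


# Illustrative file names and placeholder contents.
files = {
    "domain1/data.json": json.dumps({"roles": {}}),
    "domain1/policy.rego": "package domain1\n",
    "rms/rbac.rego": "package rms.rbac\n",
}
repository = {bundle_key("domain1", "orders", "1.0.0"): build_bundle(files)}

# Fetching a bundle requires all three parameters.
stored = repository[bundle_key("domain1", "orders", "1.0.0")]
with tarfile.open(fileobj=io.BytesIO(stored), mode="r:gz") as archive:
    print(sorted(archive.getnames()))
```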

This is version two of the RMS architecture: Open Policy Agent running as a sidecar along with an application service instance. Open Policy Agent pulls the policy bundle from the distribution service. It is the responsibility of Open Policy Agent to specify those three parameters: domain, policy name, and the version number. Let's take a look at how Open Policy Agent is set up. The Docker image of Open Policy Agent consists of three things. One is the Open Policy Agent executable, the runtime, of course. The second is the Open Policy Agent configuration file. The third is the Open Policy Agent startup script. In the agent configuration file, the service URL points to the policy distribution service, and the bundle name specifies the policy domain, policy name, and policy version. These two parameters, together with the polling settings, determine which bundle is distributed and how often the policies are distributed to the Open Policy Agent automatically.

All these variables in red will be replaced with real values from the environment variables. The environment variables are specified as part of deployment. Once the configuration file has been updated with the real values, the startup script will start the OPA executable, the runtime, by pointing to this updated configuration file. In this way, the distribution service will distribute that particular policy, with that domain name and that version number, to the agent.
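The placeholder replacement performed at startup can be sketched with Python's `string.Template`. The template below is an assumption modelled on OPA's documented `services` and `bundles` configuration sections; the placeholder names, the service label `rms`, and the resource path layout are hypothetical, since the actual file from the slide is not in the transcript.

```python
from string import Template

# Assumed configuration template; placeholders play the role of the red variables.
CONFIG_TEMPLATE = Template("""\
services:
  rms:
    url: ${RMS_DISTRIBUTION_URL}
bundles:
  rms:
    service: rms
    resource: ${POLICY_DOMAIN}/${POLICY_NAME}/${POLICY_VERSION}/bundle.tar.gz
    polling:
      min_delay_seconds: ${POLL_MIN_SECONDS}
      max_delay_seconds: ${POLL_MAX_SECONDS}
""")


def render_config(environment):
    """Fill every placeholder; substitute() raises KeyError if one is missing."""
    return CONFIG_TEMPLATE.substitute(environment)


# Stand-in for os.environ as populated by the deployment.
environment = {
    "RMS_DISTRIBUTION_URL": "https://rms.example.internal",
    "POLICY_DOMAIN": "domain1",
    "POLICY_NAME": "orders",
    "POLICY_VERSION": "1.0.0",
    "POLL_MIN_SECONDS": "60",
    "POLL_MAX_SECONDS": "120",
}

config_text = render_config(environment)
print(config_text)
```

A startup script would then write the rendered text to a file and launch the runtime against it, for example with `opa run --server --config-file` followed by that file's path. Failing fast on a missing variable is a design choice of this sketch, not something stated in the talk.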

RBAC Policy Library

Next, let's talk about the RBAC policy library. The RBAC policy library is inspired by two facts. One is that role-based access control is very widely used by application systems. The second is that the Rego language in OPA is very powerful, but developers do need some time to get familiar with the language. The RBAC policy library is an out-of-the-box library providing some common functions, like the "user has responsibility" function. This function will check if the user can take an action on one resource. This function runs as a common policy across all domains. The application policies, those policies specific to one domain, will then be very simple: they call this "user has responsibility" function, passing the input parameters, so the user ID, the action, and the service as a resource.
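The split between a common library function and a thin domain policy can be sketched as follows. The real library is written in Rego; this Python version only models the idea, and the role data layout, the role names, and the function names are assumptions.

```python
# Assumed layout of one domain's role data.
roles = {
    "order_approver": {
        "responsibilities": [{"action": "approve", "resource": "orders"}],
        "members": ["alice"],
    },
    "order_viewer": {
        "responsibilities": [{"action": "read", "resource": "orders"}],
        "members": ["alice", "bob"],
    },
}


def user_has_responsibility(role_data, user_id, action, resource):
    """Common library function: can this user take this action on this resource?"""
    wanted = {"action": action, "resource": resource}
    return any(
        user_id in role["members"] and wanted in role["responsibilities"]
        for role in role_data.values()
    )


def allow(request):
    """Domain-specific policy: a one-line delegation to the common function."""
    return user_has_responsibility(
        roles, request["user"], request["action"], request["resource"]
    )


print(allow({"user": "alice", "action": "approve", "resource": "orders"}))
print(allow({"user": "bob", "action": "approve", "resource": "orders"}))
```

Keeping the lookup logic in one shared function is what lets each domain's own policy stay small, as the talk describes.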

Now we have common policies. We have domain-specific policies. How about the data? In RMS version two, the role, responsibility, and membership information are stored in the role service. RMS will act on the data and generate a JSON document as the OPA data. The data, the JSON, actually includes all the roles for one particular domain. Each role has its responsibilities and the members under it. Now we have OPA data, common policy, and domain-specific policy. RMS will bundle all of this into a policy bundle, and the distribution service will distribute it to the Open Policy Agent running as a sidecar along with the application service instance. That instance is for one particular domain. So this is the architecture of RMS version two.
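Generating the per-domain data document from role service records might look like the sketch below. The row format and the resulting JSON layout are hypothetical; the talk only states that each role carries its responsibilities and its members.

```python
import json

# Hypothetical rows as they might come back from the role service.
responsibility_rows = [
    ("domain1", "order_approver", "approve", "orders"),
    ("domain1", "order_viewer", "read", "orders"),
    ("domain2", "report_viewer", "read", "reports"),
]
membership_rows = [
    ("domain1", "order_approver", "alice"),
    ("domain1", "order_viewer", "bob"),
    ("domain2", "report_viewer", "carol"),
]


def build_opa_data(domain, responsibilities, memberships):
    """Collect every role of one domain with its responsibilities and members."""
    roles = {}
    for row_domain, role, action, resource in responsibilities:
        if row_domain == domain:
            entry = roles.setdefault(role, {"responsibilities": [], "members": []})
            entry["responsibilities"].append({"action": action, "resource": resource})
    for row_domain, role, user_id in memberships:
        if row_domain == domain:
            entry = roles.setdefault(role, {"responsibilities": [], "members": []})
            entry["members"].append(user_id)
    return {"roles": roles}


domain1_data = build_opa_data("domain1", responsibility_rows, membership_rows)
print(json.dumps(domain1_data, indent=2))
```

Filtering by domain at this stage keeps one domain's roles out of another domain's bundle, which matches the segregation benefit claimed for version two.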

OPA IntelliJ Plugin

Next, let's talk about the OPA IntelliJ plugin. The OPA IntelliJ plugin is a functional, work-in-progress policy editor. The editor is coded based on the language reference posted on the Open Policy Agent website. The editor can now parse and validate OPA policies in Rego. We'll be adding new features in the future, for example supporting automatic indentation and run configurations, so the client can run the policy inside IntelliJ. This is a comparison before and after the plugin is installed in IntelliJ. On the left side, you can see the Rego file is just a plain text file, black and white. It's hard to read. There is no validation either. If you're missing a double quote to close the string, no error is reported. On the right side, you can see the whole Rego file is parsed. package, import, default: those are the keywords. You can also see comments, inline comments, as well as arrays, operators, and strings. The same error, the missing double quote for the string, is now reported as an error in the IDE.

You can also get details about this error. After you fix the error, it's well past the validation. This will be marked as success. Another feature provided by this plugin is you can customize the color scheme for the editor. You can pick your own color for the comments - keywords, strings, for your preference.
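To make the missing-quote check concrete, here is a deliberately simplified Python sketch. It is not how the plugin works (the plugin parses Rego against the language reference); it only scans each line for a double-quoted string that never closes. The function name is hypothetical, and raw backtick strings are ignored as a simplifying assumption.

```python
from typing import List, Tuple


def find_unclosed_strings(source: str) -> List[Tuple[int, int]]:
    """Return (line, column) of each double-quoted string left open at end of line."""
    errors = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        open_col = None  # 1-based column of the quote that opened the current string
        i = 0
        while i < len(line):
            ch = line[i]
            if open_col is None:
                if ch == "#":  # rest of the line is a comment
                    break
                if ch == '"':
                    open_col = i + 1
            else:
                if ch == "\\":  # skip the escaped character
                    i += 1
                elif ch == '"':
                    open_col = None
            i += 1
        if open_col is not None:
            errors.append((line_no, open_col))
    return errors
```

An editor would surface each reported position as an inline error marker, which is the behavior shown on the right side of the comparison.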

In Summary

Guntur: That brings us to the conclusion: the need for building responsibility management, and what our architecture and our choices in that architecture were. We went through a bit of a discovery phase while we were doing this, and in summary, this is what we found out. Responsibility management as a service can resolve a lot of issues on several fronts that corporate enterprises face. Choosing a proper payload format matters. For us, HOCON worked. For you, maybe something else will work. Choose an appropriate payload format. The choice of architecture matters. In some cases, a federated architecture is more viable. In others, a distributed one may work better. So you have to choose what kind of service best meets your resilience needs and your latency requirements.

What we did notice was that distributing our policy engines alleviated a lot of back pressure and volume demands. We are now able to scale, and we have very few outages with this policy engine at this point. It's being heavily used. Maintenance-related downtime went down significantly. Also, creating productivity tools, such as the IntelliJ plugin that we just talked about, has improved productivity and reduced the rogue nature of certain policies, which could not be tested until they were published to the responsibility management service.

In version one, I don't know if you recollect from the picture, we were pulling in the policies and therefore we were the guardians of testing and coverage. With this mechanism, with the productivity tool through an IDE editor plugin and through segregation of policies to individual domains, RMS could provide the core features rather than just be a guardian for testing policies. So that took away a lot of the onus of what RMS had to do as its functions, and also gave us a winning strategy with a distributed policy evaluation process.

Audit has become quite simple as a result. We now produce a lot of reports that include reconciliation and recertification mechanisms, based upon the data that we can evaluate. We also have entitlements that are much more controlled in terms of who can access what policies, because policies are distributed through a service which gets validated. We eat our own dog food. Our own policies are evaluated through responsibility management. We are the first consumers of RMS. With that, I think I'm done with the summary.

Questions & Answers

Participant 1: Say I, as a user, want to have access to a service and I'm only defining policies for a certain service. How do the policies ensure that this doesn't affect the entire service? If I screw something up, how do I ensure that there is a barrier between my configurations and other users' configurations?

Guntur: That's a great question. I'll just reiterate the question to make sure that I'm answering the right thing. Let's assume a service set up somebody to have access or have entitlements. How do we prevent it from causing some sort of rogue downstream impacts? Is that the question?

Participant 1: Yes.

Guntur: That question can be answered through a much more complicated slide that I can share real quick from my appendix, which I wanted to hide. We do provide what is shown in the gray box: an access reconciliation, review, and certification process. What this does is: you authored your policies, you distributed your policies, you got those policies evaluated and enforced, which is fine and dandy. This is what a responsibility management system is all about. But periodically, there is an access review and certification going on in terms of discovery of entitlements, recertification, and reconciliation.

So there is testing going on in the background in order to evaluate who has access to what. Is it causing any damage? How do we do damage control? This information is fed to our managed provisioning network, which then uses the enterprise roles and responsibilities and the business functions to update the data. Therefore, the data that you receive will be cleaned up. So even if there is a cycle where someone has got some sort of a rogue entitlement that they should not have, this periodic access control will take care of it. Any person moving within the organization or leaving the company is catered to by the same mechanism.
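The clean-up idea can be sketched in Python as a comparison between what the policy data currently grants and what an authoritative source says people should hold. The function name `reconcile` and both input shapes are assumptions for illustration; the talk does not describe how the provisioning network actually does this.

```python
from typing import Dict, Set, Tuple

# Hypothetical shape: role name -> set of user IDs holding that role
Memberships = Dict[str, Set[str]]


def reconcile(
    current: Memberships, authorized: Memberships
) -> Tuple[Memberships, Memberships]:
    """Drop members the authoritative source no longer lists.

    Returns (cleaned memberships, revoked memberships). This sketch only
    removes stale access; it never grants new access.
    """
    cleaned: Memberships = {}
    revoked: Memberships = {}
    for role, members in current.items():
        allowed = authorized.get(role, set())
        cleaned[role] = members & allowed
        stale = members - allowed
        if stale:
            revoked[role] = stale
    return cleaned, revoked
```

Running this on a schedule means a rogue entitlement, or one held by someone who has moved or left, disappears from the data at the next cycle.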

Participant 2: Can you speak a little bit about the volume, in terms of the number of policies you have, the number of users, the number of requests coming from the agents?

Guntur: Yes. I cannot give you exact numbers, but we have close to 10 domains at this point. I think we are at 10 domains, and these are for infrastructure and core components. Therefore, every part of the bank that's going through some sort of a dynamic compute cycle has to use these. We're talking a few million hits on a daily basis, and these hits do not have much impact because we have internal caching mechanisms, thanks to our hydrating policies and automatic push mechanisms. We are free to actually investigate, and we have other systems that pull in metrics on a daily basis to find out what is going on with the system, so we can react very quickly.

Now, to answer your question in a different way, since we are now a service, or what I would like to call a glorified microservice with many services, we can scale; we can be more elastic, so we can spread out and add more instances as we see volume grow. We are only limited by our host capacity, not by anything else at this point. Since all these systems are being hydrated on a push basis, we don't go into an inconsistent cycle at all. Everything is bitemporally milestoned. We know exactly what effective date each operation can use for its data.

Participant 3: From a resource management perspective and governance, do you do anything to revalidate the resources as well, other than the policies?

Guntur: The question is, rather than just the user data, do we also evaluate the resources that need to be accessed? Sometimes yes, sometimes no. The answer is yes if it is infrastructure that we control. If it is a database, for instance, we have access; we work in the infrastructure area, so they periodically send us whatever has been retired, decommissioned, etc. Now, if it is an application, the onus is on the application itself to figure that out for a certain resource. For instance, you have a button on a UI that may be visible to some users and not visible to others. We cannot control that. That is left as the responsibility of the application itself. For the majority of the domains that we currently use this with, the resources are infrastructure, so the answer there is yes.

Participant 4: Did you ever consider flipping it all: attaching the policy to the user, pre-compiling it at authentication time, supplying that to the services as a default yes or no, and leaving only the attribute-based authorization for runtime? What was the caveat that led you to move away from that?

Guntur: That is the difference between version one and version two. The version one architecture, where there was a centralized policy management system in which we were running the policies internally, evaluating them, caching them, and giving out an on-demand decision, worked great until there was a change to be made in either the data or the policy. That's when the nightmare started. We are a service, we are spread across multiple instances of the service, and things become inconsistent very quickly in that model. We did face problems. Also, scalability becomes an issue because now the policy engine is churning through all this data on its own, and there is a key resource dependency on the policy management system to evaluate and return these decisions on demand.

Now, if you were to distribute this out, where your compute, your resources, and your data are distributed, your only problem is consistency for the time period. So what we also talked about was the bitemporal milestoning of the data and the policy that we have to use. Audit becomes easy. If you do not have the latest data, you will get no response rather than a wrong response. That is the benefit we got out of distributing this policy engine and letting application teams decide for themselves how to evaluate this policy and when. The decision was very easily moved out of our hands and into the hands of the domain that required this policy. That was the benefit we had.
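A minimal Python sketch of what bitemporal milestoning implies for a lookup follows. Each version of the data carries both the period it is effective for and the date it was recorded, and a query that no recorded version covers yields no answer instead of a stale one. The `Version` fields and the `lookup` function are hypothetical names; the talk does not show the actual data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    value: str
    effective_from: date  # start of the business-effective period (inclusive)
    effective_to: date    # end of the business-effective period (exclusive)
    recorded_on: date     # when this version became known to the system


def lookup(versions: List[Version], effective: date, known_on: date) -> Optional[str]:
    """Return the value effective on `effective` as it was known on `known_on`.

    Returns None when no recorded version covers that date, so the caller
    gets no response rather than a wrong one.
    """
    candidates = [
        v
        for v in versions
        if v.effective_from <= effective < v.effective_to and v.recorded_on <= known_on
    ]
    if not candidates:
        return None
    # The most recently recorded version is the latest correction for that period.
    return max(candidates, key=lambda v: v.recorded_on).value
```

Because both dates are kept, an auditor can replay any past decision by asking what was effective on a given day as the system knew it at that time.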

Recorded at:

Oct 02, 2019