
API Friction Complicates Hunting for Cloud Vulnerabilities. SQL Makes it Simple

Key Takeaways

  • Developers spend too much time and effort wrangling APIs. When APIs resolve automatically to database tables, it frees devs to focus on working with the data.
  • SQL is the universal language of data, and a great environment in which to model and reason over data that's frictionlessly acquired from diverse APIs.
  • Postgres is ascendant. More than just a relational database, it has become a platform for managing all kinds of data.
  • SQL has evolved! With features like common table expressions (CTEs)  and JSON columns, it's more capable than you might think if you haven't touched it in a while.
  • The ability to join across diverse APIs, in SQL, is a superpower that enables you to easily combine information from different sources.

Pen testers, compliance auditors, and other DevSecOps pros spend a lot of time writing scripts to query cloud infrastructure. Boto3, the AWS SDK for Python, is a popular way to query AWS APIs and reason over the data they return.

It gets the job done, but things get complicated when you need to query across many AWS accounts and  regions. And that doesn't begin to cover API access to other major clouds (Azure, GCP, Oracle Cloud), never mind services such as GitHub, Salesforce, Shodan, Slack, and Zendesk. Practitioners spend far too much time and effort acquiring data from such APIs, then normalizing it so the real work of analysis can begin.

What if you could query all the APIs, and reason over the data they return, in a common way? That's what Steampipe is for. It's an open-source Postgres-based engine that enables you to write SQL queries that indirectly call APIs within, across, and beyond the major clouds. This isn’t a data warehouse. The tables made from those API calls are transient; they reflect the live state of your infrastructure; you use SQL to ask and answer questions in real time.

The case study we’ll explore in this article shows how to use Steampipe to answer this question: Do any of our public EC2 instances have vulnerabilities detected by Shodan? The answer requires use of an AWS API to enumerate EC2 public IP addresses, and a Shodan API to check each of them.

In the conventional approach you’d find a programming-language wrapper for each API, learn the differing access patterns for each, then use that language to combine the results. With Steampipe it’s all just SQL. These two APIs, like all APIs supported by Steampipe’s suite of API plugins, resolve to tables in a Postgres database. You query within them, and join across them, using the same basic SQL constructs.

Figure 1 illustrates the cross-API join at the heart of our case study. The aws_ec2_instance table is one of the hundreds of tables that Steampipe builds by calling AWS APIs. The shodan_host table is, similarly, one of a dozen tables that Steampipe constructs from Shodan APIs. The SQL query joins the public_ip_address column of aws_ec2_instance to the ip column of shodan_host.

Before we dive into the case study, let’s look more closely at how Steampipe works. Here’s a high-level view of the architecture.

Figure 2: Steampipe architecture

To query APIs and reason over the results, a Steampipe user writes SQL queries and submits them to Postgres, using Steampipe’s own query console (Steampipe CLI) or any standard tool that connects to Postgres (psql, Metabase, etc). The key enhancements layered on top of Postgres are:

  • Postgres foreign data wrappers 
  • Per-API plugins
  • Connection aggregators

Postgres foreign data wrappers

Postgres has evolved far beyond its roots. Nowadays, thanks partly to a growing ecosystem of extensions that deeply customize the core, Postgres does more than you think. Powerful extensions include PostGIS for geospatial data, pglogical for replication over Kafka or RabbitMQ, and Citus for distributed operation and columnar storage.

One class of Postgres extension, the foreign data wrapper (FDW), creates tables from external data. Postgres bundles postgres_fdw to enable queries that span local and remote databases. When Steampipe runs, it launches an instance of Postgres that loads another kind of FDW: steampipe-postgres-fdw, an extension that creates foreign tables from APIs with the help of a suite of plugins.

These foreign tables typically map JSON results to simple column types: date, text, number. Sometimes, when an API response includes a complex JSON structure such as an AWS policy document, the result shows up in a JSONB column.
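The flattening step can be pictured in a few lines of Python. This is a hypothetical sketch, not Steampipe's actual code: scalar fields become simple columns, while nested structures are serialized as if bound for a JSONB column.

```python
import json

def map_to_row(api_result: dict) -> dict:
    """Sketch: scalar fields map to simple columns (text, number,
    boolean); nested structures are serialized, standing in for
    what would land in a JSONB column."""
    row = {}
    for key, value in api_result.items():
        if isinstance(value, (str, int, float, bool)) or value is None:
            row[key] = value              # simple column
        else:
            row[key] = json.dumps(value)  # complex value: JSONB column
    return row

row = map_to_row({
    "instance_id": "i-0518f0bd09a77d5d2",
    "instance_state": "stopped",
    "tags": [{"Key": "env", "Value": "dev"}],  # nested -> JSONB
})
```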

Per-API plugins

The plugins are written in Go, with the help of a plugin SDK that handles backoff/retry logic, data-type transformation, caching, and credentials. The SDK enables plugin authors to focus on an essential core task: mapping API results to database tables. 
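The backoff/retry behavior the SDK handles can be sketched roughly as follows. This is an illustrative Python sketch of the pattern; the real SDK is written in Go and its internals differ.

```python
import random
import time

def with_backoff(call, max_attempts: int = 5):
    """Retry a flaky API call with exponential backoff and jitter,
    the kind of plumbing a plugin SDK takes off authors' hands."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # exponential backoff with a little jitter
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.05)

# A stand-in API call that fails twice before succeeding.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

result = with_backoff(flaky)
```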

These mappings may be one-to-one. The aws_ec2_instance table, for example, closely matches the underlying REST API.

In other cases it's helpful to build tables that consolidate several APIs. A complete view of an S3 bucket, for example, joins the core S3 API with sub-APIs for ACLs, policies, replication, tags, versioning, and more. Plugin authors write hydrate functions to call these sub-APIs and merge their results into tables.

A basic Steampipe query

Here’s how you’d use Steampipe to list EC2 instances.

  1. Install Steampipe
  2. Install the AWS plugin: steampipe plugin install aws
  3. Configure the AWS plugin

The configuration relies on standard authentication methods: profiles, access keys and secrets, SSO. So authenticating Steampipe as a client of the AWS API is the same as for any other kind of client. With that done, here’s a query for EC2 instances.

Example 1: Listing EC2 instances

select
  account_id,
  instance_id,
  instance_state,
  region
from
  aws_ec2_instance;

| account_id   | instance_id         | instance_state | region    |
| 899206412154 | i-0518f0bd09a77d5d2 | stopped        | us-east-2 |
| 899206412154 | i-0e97f373db22dfa3f | stopped        | us-east-1 |
| 899206412154 | i-0a9ad4df00ffe0b75 | stopped        | us-east-1 |
| 605491513981 | i-06d8571f170181287 | running        | us-west-1 |
| 605491513981 | i-082b93e29569873bd | running        | us-west-1 |
| 605491513981 | i-02a4257fe2f08496f | stopped        | us-west-1 |

The documentation for the referenced foreign table, aws_ec2_instance, provides a schema definition and example queries.

Connection aggregators

The above query finds instances across AWS accounts and regions without explicitly mentioning them, as a typical API client would need to do. That’s possible because the AWS plugin can be configured with an aggregator that combines accounts, along with wildcards for regions. In this example, two different AWS accounts – one using SSO authentication, the other using the access-key-and-secret method – combine as a unified target for queries like select * from aws_ec2_instance.

Example 2: Aggregating AWS connections

connection "aws_all" {
  plugin      = "aws"
  type        = "aggregator"
  connections = [ "aws_1", "aws_2" ]

connection "aws_1" {
  plugin  = "aws"
  profile = "SSO…981"
  regions = [ "*" ]

connection "aws_2" {
  plugin     = "aws"
  access_key = "AKI…RNM"
  secret_key = "0a…yEi"
  regions    = [ "*" ]

This approach, which works for all Steampipe plugins, abstracts connection details and simplifies queries that span multiple connections. As we’ll see, it also creates opportunities for concurrent API access.
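Conceptually, an aggregator fans the same query out to every member connection and unions the rows. Here's a rough Python sketch of that idea, with made-up connection names and canned data standing in for real API calls.

```python
def query_connection(conn_name: str, table: str) -> list[dict]:
    """Stand-in for querying one configured connection; a real
    implementation would call the cloud API for that account."""
    canned = {
        "aws_1": [{"instance_id": "i-aaa", "region": "us-east-1"}],
        "aws_2": [{"instance_id": "i-bbb", "region": "us-west-1"}],
    }
    return canned[conn_name]

def query_aggregator(members: list[str], table: str) -> list[dict]:
    """Union the rows from every member connection, so one query
    target transparently spans them all."""
    rows = []
    for conn in members:
        rows.extend(query_connection(conn, table))
    return rows

rows = query_aggregator(["aws_1", "aws_2"], "aws_ec2_instance")
```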

Case Study A: Use Shodan to find AWS vulnerabilities

Suppose you run public AWS endpoints and you want to use Shodan to check those endpoints for vulnerabilities. The logic is simple: enumerate the public IP addresses of your EC2 instances across accounts and regions, then ask Shodan what it knows about each of them.

A conventional solution in Python, or another language, requires you to learn and use two different APIs. There are libraries that wrap the raw APIs, but each has its own way of calling APIs and packaging results. 

Here’s how you might solve the problem with boto3.

Example 3: Find AWS vulnerabilities via Shodan, using boto3

import boto3
from shodan import Shodan

aws_1 = boto3.Session(profile_name='SSO…981')
aws_2 = boto3.Session(aws_access_key_id='AKI…RNM', aws_secret_access_key='0a2…yEi')
aws_all = [ aws_1, aws_2 ]
regions = [ 'us-east-2', 'us-west-1', 'us-east-1' ]

shodan = Shodan('h38…Cyv')

instances = {}

# Collect the public IP addresses of all instances, across
# both accounts and all regions.
for aws_connection in aws_all:
  for region in regions:
    ec2 = aws_connection.resource('ec2', region_name=region)
    for i in ec2.instances.all():
      if i.public_ip_address is not None:
        instances[] = i.public_ip_address

# Ask Shodan about each public IP address.
for instance_id, ip in instances.items():
  try:
    data =
    print(instance_id, data['ports'], data['vulns'])
  except Exception as e:
    print(instance_id, e)

When APIs are abstracted as SQL tables, though, you can ignore those details and distill the solution to its logical essence. Here's how you use Steampipe to ask and answer the question: "Does Shodan find vulnerable public endpoints in any of my EC2 instances?"

Example 4: Find AWS vulnerabilities using Steampipe

select
  a.instance_id,
  s.ports,
  s.vulns
from
  aws_ec2_instance a
left join
  shodan_host s
on
  a.public_ip_address = s.ip
where
  a.public_ip_address is not null;

| instance_id         | ports    | vulns              |
| i-06d8571f170181287 |          |                    |
| i-0e97f373db42dfa3f | [22,111] | ["CVE-2018-15919"] |

There's no reference to either flavor of API; you just write SQL against Postgres tables that transiently store the results of implicit API calls. This isn’t just simpler, it’s also faster. The boto3 version takes 3-4 seconds to run for all regions of the two AWS accounts I’ve configured as per example 2. The Steampipe version takes about a second. When you’re working with dozens or hundreds of AWS accounts, that difference adds up quickly. What explains it? Steampipe is a highly concurrent API client.

Concurrency and caching

If you've defined an AWS connection that aggregates multiple accounts (per example 2), Steampipe queries all of them concurrently. And within each account it queries all specified regions concurrently. So while my initial run of the query in example 4 takes about a second, subsequent queries within the cache TTL (default: 5 minutes) take only milliseconds.
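The caching behavior boils down to a query-keyed map with a time-to-live. Here's an illustrative Python sketch (not Steampipe internals), using the 5-minute default mentioned above.

```python
import time

class TTLCache:
    """Minimal sketch of query-result caching with a TTL, loosely
    analogous to Steampipe's default 5-minute cache."""
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.store = {}  # query -> (timestamp, rows)

    def get(self, query: str):
        entry = self.store.get(query)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]  # fresh: serve rows from cache
        return None          # missing or stale: caller re-queries the APIs

    def put(self, query: str, rows):
        self.store[query] = (time.monotonic(), rows)

cache = TTLCache(ttl_seconds=300)
cache.put("select * from aws_ec2_instance", [{"instance_id": "i-xyz"}])
hit = cache.get("select * from aws_ec2_instance")
```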

It’s often possible, as in this case, to repeat the query with more or different columns and still satisfy the query in milliseconds from cache. That’s because the aws_ec2_instance table is made from the results of a single AWS API call.

In other cases, like the aws_s3_bucket table, Steampipe synthesizes many S3 sub-API calls including GetBucketVersioning, GetBucketTagging, and GetBucketReplication. And it makes those calls concurrently too. Like any other API client, Steampipe is subject to rate limits. But it’s aggressively concurrent so you can quickly assess large swaths of cloud infrastructure. 
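The hydrate-plus-concurrency pattern looks roughly like this in Python. Stub functions stand in for the real S3 sub-API calls named above; the shapes of the returned fragments are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Stubs for the S3 sub-APIs; a real plugin would call AWS here.
def get_bucket_versioning(bucket):  return {"versioning": "Enabled"}
def get_bucket_tagging(bucket):     return {"tags": {"env": "prod"}}
def get_bucket_replication(bucket): return {"replication": None}

def hydrate_bucket(bucket: str) -> dict:
    """Fan the sub-API calls out concurrently, then merge the
    fragments into a single row, hydrate-function style."""
    calls = [get_bucket_versioning, get_bucket_tagging, get_bucket_replication]
    row = {"name": bucket}
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(call, bucket) for call in calls]
        for future in futures:
            row.update(future.result())
    return row

row = hydrate_bucket("my-bucket")
```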

Note that when using a table like aws_s3_bucket, it’s helpful to request only the columns you need. If you really want everything, you can select * from aws_s3_bucket. But if you only care about account_id, instance_id, instance_state, and region, then asking explicitly for those columns (as per example 1) avoids unnecessary sub-API calls.

Case Study B: Find GCP vulnerabilities

If your endpoints only live in AWS, example 4 solves the problem neatly. Now let's add GCP to the mix. A conventional solution requires that you install another API client, such as the Google Cloud Python Client, and learn how to use it.

With Steampipe you just install another plugin: steampipe plugin install gcp. It works just like the AWS plugin: it calls APIs and puts results into foreign tables that abstract API details so you can focus on the logic of your solution.

In this case that logic differs slightly. In AWS, public_ip_address is a core column of the aws_ec2_instance table. In GCP you need to combine results from one API that queries compute instances, and another that queries network addresses. Steampipe abstracts these as two tables: gcp_compute_instance and gcp_compute_address. The solution joins them, then joins that result to Shodan as in example 4. 

Example 5: Find GCP vulnerabilities using Steampipe

with gcp_info as (
  select as name,
    a.address
  from
    gcp_compute_address a
  join
    gcp_compute_instance i
  on
    a.users->>0 = i.self_link
  where
    a.address_type = 'EXTERNAL'
)
select as instance_id,
  s.ports,
  s.vulns
from
  gcp_info g
left join
  shodan_host s on g.address = s.ip;

This query makes use of two language features that can surprise people who haven't looked at SQL in a long while. The WITH clause introduces a common table expression (CTE), a transient table-like object. Queries written as a pipeline of CTEs are easier to read and debug than monolithic queries.

a.users is a JSONB column; the ->> operator extracts its zeroth element as text. Now that JSON is a first-class citizen of the database, relational and object styles mix comfortably. That's especially helpful when mapping JSON-returning APIs to database tables. Plugin authors can move some pieces of API data into plain typed columns and others into JSONB columns. How to decide what goes where? That requires an artful balance of concerns, but the key point is that modern SQL enables flexible data modeling.
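If the ->> operator is unfamiliar, its effect is easy to mirror in any JSON-aware language. Here's the equivalent of a.users->>0 in Python, with a made-up self_link value purely for illustration.

```python
import json

# A hypothetical JSONB payload shaped like the "users" column of
# gcp_compute_address: a JSON array whose zeroth element is the
# self_link of the instance using the address.
users = json.dumps(["projects/demo/zones/us-east1-b/instances/vm-1"])

# SQL: a.users->>0  -- extract the zeroth element, as text
zeroth = json.loads(users)[0]
```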

Case Study C: Find vulnerabilities across clouds

If you've got public endpoints in both AWS and GCP, you'll want to combine the queries we've seen so far. And now you know everything you need to know to do that.

Example 6: Find AWS and GCP vulnerabilities

with aws_vulns as (
  -- insert example 4
),
gcp_vulns as (
  -- insert example 5
)

select 'aws' as cloud, * from aws_vulns
union
select 'gcp' as cloud, * from gcp_vulns;

| cloud | instance_id         | ports    | vulns              |
| aws   | i-06d8571f170181287 |          |                    |
| aws   | i-0e97f373db42dfa3f | [22,111] | ["CVE-2018-15919"] |
| gcp   | 8787684467241372276 |          |                    |

We've arranged example 4 and example 5 as a CTE pipeline. To combine them requires nothing more than a good old-fashioned SQL UNION. 

You also now know everything you need to know to expand the pipeline with CTEs for the Oracle or IBM clouds. While you're at it, you might want to bring more than just Shodan's knowledge to bear on your public IP addresses. There are plugins that do reverse DNS lookup, map IP addresses to geographic locations, and check addresses for reported malicious activity. Each of these maps another API that you don't need to learn how to use, models it as a collection of database tables, and enables you to work with it using the same basic SQL constructs you've seen here.

It's just Postgres

We've said that Steampipe isn’t a data warehouse, and that API-sourced tables remain cached for only a short while. The system is optimized for rapid assessment of cloud infrastructure in real time. But Steampipe is just Postgres, and you can use it in all the same ways. So if you need to persist that realtime data, you can.

Example 7: Persist a query as a table

create table aws_and_gcp_vulns as 
  -- insert example 6 

Example 8: Persist a query as a materialized view

create materialized view aws_and_gcp_vulns as 

  -- insert example 6
  -- then, periodically: refresh materialized view aws_and_gcp_vulns

Example 9: Pull query results into Python

import psycopg2, psycopg2.extras
conn = psycopg2.connect('dbname=steampipe user=steampipe host=localhost port=9193')
cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
cursor.execute('select * from aws_and_gcp_vulns')
for row in cursor.fetchall():
  print(row['cloud'], row['instance_id'], row['vulns'])

Example 10: Connect with psql

psql -h localhost -p 9193 -d steampipe -U steampipe

You can use the same connection details to connect from Metabase, or Tableau, or any other Postgres-compatible tool. 

Bottom line: Steampipe’s API wrangling augments the entire Postgres ecosystem. 

Skip the API grunt work, just do your job

For a DevSecOps practitioner the job might be to inventory cloud resources, check for security vulnerabilities, or audit for compliance. It all requires data from cloud APIs, and acquiring that data in a tractable form typically costs far too much time and effort. With fast and frictionless access to APIs, and a common environment in which to reason over the data they return, you can focus on the real work of doing inventory, security checks, and audits. The requisite API wrangling is a distraction you and your organization can ill afford. Don’t let it get in the way of doing your real jobs, which are plenty hard enough even when you have the data you need.
