Data Manipulation with Functional Programming and Queries in Ballerina

Aug 11, 2022 14 min read

Key Takeaways

Ballerina has been designed as a Data-Oriented programming language and supports a functional programming coding style
Expressing data manipulation logic with functions can be powerful
Using arrow functions with type inference can make code compact and clear
The Ballerina query language is similar to SQL in the sense that a query expression is made up of clauses. Data manipulation expressed with Ballerina query language is easier to read in comparison with other functional programming expressions
The Ballerina “Table” data structure can be more effective than maps in representing indexed data collections

As an adept of Functional Programming (FP), I feel at ease expressing my data manipulation logic by chaining higher-order functions like map, filter, and sort that operate on arrays and maps. As we saw in our previous article, Ballerina, being designed as a Data-Oriented programming language, supports this FP style of coding.

In this article, I’d like to dig deeper into the FP capabilities of Ballerina and also explore an innovative way to express data manipulation logic that the language provides, via the Ballerina query language and a data structure called “table”.

Data manipulation with functional programming

Suppose we want to retrieve enriched search results from in-memory collections of books and authors that we have fetched from a service like OpenLibrary through its JSON API.

A book has three fields:

title
isbn
author_id

An author has three fields:

id
firstName
lastName

A book matches the query if its title contains the query as a substring.

Each search result must contain the following fields:

title (book field)
authorName (calculated field)

The book results need to be sorted by book author names.

In order to fulfill the requirements, I create:

a Book record,
an Author record,
a BookResult record,
an array of Book records,
and a map of Author records, where the map keys are the author ids.

To implement the search logic, I will write a function with the following signature:

function searchBooks(Book[] books, map<Author> authorMap, string query) returns BookResult[] {}

The function’s code involves three steps:

Finding the matching Book records in the array of Book records
Converting each Book record into a BookResult record (data enrichment)
Sorting the BookResult records

The most interesting part is the data enrichment, where we need to “join” the books and the authors. For that, I need to write two functions:

fullName: takes an Author record and returns the author’s full name
enrichBook: takes an Author map and a Book record and returns a BookResult:

function fullName(Author? author) returns string {
    if (author is ()) {
        return "N/A";
    }
    return author.firstName + " " + author.lastName;
}

function enrichBook(map<Author> authorMap, Book book) returns BookResult {
    return {
        title: book.title,
        authorName: fullName(authorMap[book.author_id])
    };
}

Note the type of argument passed to fullName is Author? with a question mark suffix to express the fact that its value might be nil. The reason is that we have to deal with the possibility that the author_id is not found in the Author map.

Now, I am going to chain filter, map and sort with anonymous functions to implement my business logic:

function searchBooks(Book[] books, map<Author> authorMap, string query) returns BookResult[] {
    return books
    .filter(book => book.title.includes(query))
    .map(book => enrichBook(authorMap, book))
    .sort(array:DESCENDING, b => b.authorName);
}
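
To try the function end to end, the pieces above can be assembled into one small program. This is only a sketch: the record definitions are inferred from the field lists given earlier (BookResult is assumed to be a closed record), and the third book, with its placeholder ISBN and unknown author id, is made up to exercise the "N/A" branch.

```ballerina
import ballerina/io;

type Book record {
    string title;
    string isbn;
    string author_id;
};

type Author record {
    string id;
    string firstName;
    string lastName;
};

// Assumption: a closed record with exactly the two result fields.
type BookResult record {|
    string title;
    string authorName;
|};

function fullName(Author? author) returns string {
    if author is () {
        return "N/A";
    }
    return author.firstName + " " + author.lastName;
}

function enrichBook(map<Author> authorMap, Book book) returns BookResult {
    // authorMap[...] yields Author?, which is why fullName accepts nil.
    return {title: book.title, authorName: fullName(authorMap[book.author_id])};
}

function searchBooks(Book[] books, map<Author> authorMap, string query) returns BookResult[] {
    return books
        .filter(book => book.title.includes(query))
        .map(book => enrichBook(authorMap, book))
        .sort(array:DESCENDING, b => b.authorName);
}

public function main() {
    Book[] books = [
        {title: "The Volleyball Handbook", isbn: "978-0736056106", author_id: "bob-miller"},
        {title: "Friendship Bread", isbn: "978-0345525345", author_id: "darien-gee"},
        // Made-up entry whose author id is absent from the map.
        {title: "An Orphaned Handbook", isbn: "000-0000000000", author_id: "nobody"}
    ];
    map<Author> authorMap = {
        "bob-miller": {id: "bob-miller", firstName: "Bob", lastName: "Miller"},
        "darien-gee": {id: "darien-gee", firstName: "Darien", lastName: "Gee"}
    };
    // Two titles contain "Handbook"; "N/A" sorts before "Bob Miller" in descending order.
    io:println(searchBooks(books, authorMap, "Handbook"));
}
```

Note that a book whose author is missing still appears in the results, with "N/A" as its author name.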

Note that Ballerina’s type system is smart enough to infer the types of arguments of the anonymous functions passed to filter, map and sort.

If, for instance, I try to access the isbn field inside the anonymous function passed to sort, the type system will complain that this field does not exist in the expected anonymous record type:

undeclared field 'isbn' in record 'record {| string title; string authorName; |}'

Ballerina’s type inference capabilities make it easy to write code à la FP

For people with experience in FP, code like this is probably easy to read and also easy to write. But for people coming from an OOP background, it might be challenging. Moreover, even for an experienced FP developer, writing the code of a function like searchBooks requires attention to low-level details:

We need to create anonymous functions
We need to call filter before map and sort, in order to avoid unnecessary calculations
We have to manually join authors and books by accessing the book.author_id field inside the authorMap map
The author collection needs to be a map, while the book collection is an array

In short, we have to write code in order to express something that could be expressed declaratively.

Ballerina supports an innovative way to express data manipulation, via a query language. Let’s see it in action.

Ballerina query language

The Ballerina query language is similar to SQL in the sense that a query expression is made up of clauses, such as select, from, where, order by, and join. But, unlike in SQL, we are not limited to SQL operators to express our custom business logic: we are allowed to use any function inside our queries. Moreover, the Ballerina query language syntax makes it very convenient to manipulate data as it deals with records.

Let’s start our exploration of the Ballerina query language, by writing a query that finds book titles that contain a query string:

function searchBooksSimple(Book[] books, string query) returns string[] {
    var res = from var book in books // operate on books
        where book.title.includes(query) // filter books whose title includes the query
        select book.title; // return the book title
    return res;
}

The query is made up of 3 clauses:

The from clause defines the data the query operates on
The where clause defines the condition a record should match in order to be returned by the query
The select clause decides what record fields should be returned (the projection)

A full description of all the clauses is available in the official query language documentation.

Note that inside the query, we can use any piece of the Ballerina language, for instance, calling the includes string method inside the where clause.

But query capabilities go further than calling methods: for instance, we can use Ballerina destructuring syntax to destructure the title field in a book and make the code more compact by rewriting our query like this:

function searchBooksSimple(Book[] books, string query) returns string[] {
    return from var {title} in books // operate on books
        where title.includes(query) // filter books whose title includes the query
        select title; // return the book title
}
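
Because any function can be called inside a query, the matching rule itself can be factored out of the where clause. The sketch below assumes a hypothetical helper, titleMatches, that performs a case-insensitive match; neither the helper nor the searchBooksIgnoreCase name comes from the article.

```ballerina
import ballerina/io;

type Book record {
    string title;
    string isbn;
    string author_id;
};

// Hypothetical helper: case-insensitive substring match (ASCII letters only).
function titleMatches(string title, string query) returns boolean {
    return title.toLowerAscii().includes(query.toLowerAscii());
}

function searchBooksIgnoreCase(Book[] books, string query) returns string[] {
    return from var {title} in books
        where titleMatches(title, query) // a user-defined function inside the where clause
        select title;
}

public function main() {
    Book[] books = [
        {title: "The Volleyball Handbook", isbn: "978-0736056106", author_id: "bob-miller"},
        {title: "Friendship Bread", isbn: "978-0345525345", author_id: "darien-gee"}
    ];
    // "bread" matches "Friendship Bread" despite the different letter case.
    io:println(searchBooksIgnoreCase(books, "bread"));
}
```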

In addition to this convenience, Ballerina query language brings a killer feature, namely joining data sources, like in SQL.

In the FP implementation of searchBooks, from the previous section, we had to manually “join” the Book and the Author records by looking for the corresponding Author record in the authorMap map.

With the Ballerina query language, we don’t need to have an Author map. We can leave the Author records inside an array and leverage the power of the join clause, like this:

function searchBooks(Book[] books, Author[] authors, string query) returns map<anydata>[] {
    return from var {author_id, title} in books // destructuring two fields
        join var author in authors  // joining with authors
        on author_id equals author.id // the joining condition
        where title.includes(query) // filter books whose title includes the query
        select {  // select some fields 
            authorFirstName: author.firstName,
            authorLastName: author.lastName,
            title
        }; 
}

Remark: The return type of the function is an array of maps, because for now, the result is not a BookResult record. We will see in a moment how to plug in the author's full name in the results.

The join clause syntax is similar to SQL, with a slight difference: we create a local variable with var author to hold the corresponding record and reference it in the rest of the query.

We now have all the pieces in place to implement the full search logic using Ballerina query language:

function searchBooks(Book[] books, Author[] authors, string query) returns BookResult[] {
    return from var {author_id, title} in books // destructuring two fields
        join var author in authors  // joining with authors
        on author_id equals author.id // the join condition
        let string authorName = fullName(author) // creating a variable to calculate the author full name
        where title.includes(query) // filter books whose title includes the query
        order by authorName descending // sorting according to authorName field
        select {authorName, title}; // select some fields 
}
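
The query version can be exercised with the same kind of sample data. As before, this is a sketch: the record definitions are inferred from the field lists given earlier, and the third book (placeholder ISBN, unknown author id) is invented for illustration.

```ballerina
import ballerina/io;

type Book record {
    string title;
    string isbn;
    string author_id;
};

type Author record {
    string id;
    string firstName;
    string lastName;
};

// Assumption: a closed record with exactly the two result fields.
type BookResult record {|
    string title;
    string authorName;
|};

function fullName(Author? author) returns string {
    if author is () {
        return "N/A";
    }
    return author.firstName + " " + author.lastName;
}

function searchBooks(Book[] books, Author[] authors, string query) returns BookResult[] {
    return from var {author_id, title} in books
        join var author in authors
        on author_id equals author.id
        let string authorName = fullName(author)
        where title.includes(query)
        order by authorName descending
        select {authorName, title};
}

public function main() {
    Book[] books = [
        {title: "The Volleyball Handbook", isbn: "978-0736056106", author_id: "bob-miller"},
        {title: "Friendship Bread", isbn: "978-0345525345", author_id: "darien-gee"},
        // Made-up entry whose author id matches no Author record.
        {title: "An Orphaned Handbook", isbn: "000-0000000000", author_id: "nobody"}
    ];
    Author[] authors = [
        {id: "bob-miller", firstName: "Bob", lastName: "Miller"},
        {id: "darien-gee", firstName: "Darien", lastName: "Gee"}
    ];
    // Only the book with a matching author survives the join.
    io:println(searchBooks(books, authors, "Handbook"));
}
```

One behavioral difference from the FP version is worth keeping in mind: the join clause behaves like an SQL inner join, so a book whose author_id matches no author is dropped from the results instead of being reported with "N/A".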

Note how natural it is to create a calculated field like authorName and use it later in the query in the order by clause and in the select clause.

A word about the performance of the query. The Ballerina query engine is smart enough to create a temporary index to make the join efficient. In the next section, we will see a more idiomatic way to represent data to be manipulated by queries that alleviates the need for this performance optimization.

Tables as first class components

The two major families of data collections that I use on a day-to-day basis in my programs are sequential collections and indexed collections. For instance, in JavaScript I use arrays for sequential collections and objects for indexed collections (what JavaScript calls objects are in fact hash maps with string keys).

Ballerina has typed arrays and typed string maps. But in addition to them, it has an interesting collection type called table: it’s a hybrid data collection that combines the characteristics of sequential and indexed collections.

Let’s look at how to represent a collection of Book records with arrays, maps and tables. As before, a Book record has 3 fields: title, isbn and author_id.

type Book record {
    string isbn;
    string title;
    string author_id;
};

Now, suppose the OpenLibrary API returns a JSON string, like this one:

[
  {
    "isbn": "978-0736056106",
    "title": "The Volleyball Handbook",
    "author_id": "bob-miller"
  },
  {
    "isbn": "978-0345525345",
    "title": "Friendship Bread",
    "author_id": "darien-gee"
  },
  ...
]

Depending on our application needs, we might decide to represent this collection of books inside our program as either an array or a map. If we need to iterate over the books we will use an array, while if we need to randomly access a book, we will use a map.

Ballerina provides an easy way to convert a JSON string into an array:

Book[] bookArray = check apiResponse.fromJsonStringWithType();

If we want to turn this array into a map, we need to write custom code, using either a foreach statement or reduce.

function mapifyBooks(Book[] books) returns map<Book> {
    map<Book> res = {};
    foreach Book book in books {
        res[book.isbn] = book;
    }
    return res;
}

function mapifyBooks(Book[] books) returns map<Book> {
    return books.reduce(function(map<Book> res, Book book) returns map<Book> {
        res[book.isbn] = book;
        return res;
    },
    {});
}

As an adept of Functional Programming, I tend to prefer the implementation with reduce, but both implementations transform a book array into a book map as expected:

mapifyBooks(bookArray)

{
  "978-0736056106": {
    "isbn": "978-0736056106",
    "title": "The Volleyball Handbook",
    "author_id": "bob-miller"
  },
  "978-0345525345": {
    "isbn": "978-0345525345",
    "title": "Friendship Bread",
    "author_id": "darien-gee"
  },
  ...
}

In a dynamically typed language like JavaScript, one could write a generic mapify function that works on any record type (see for instance Lodash’s keyBy), but in Ballerina, we have to write a specific function for each record type: mapifyBooks for Book records, mapifyAuthors for Author records, etc.

As an example, here is the implementation of mapifyAuthors, using the id key from Author:

{
  "bob-miller": {
    "firstName": "Bob",
    "lastName": "Miller"
  },
  "darien-gee": {
    "firstName": "Darien",
    "lastName": "Gee"
  },
  ...
}
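
A function along these lines can be sketched by mirroring mapifyBooks. The body below is an assumption written by analogy rather than the author's code; the only real change is that the map is indexed by the id field instead of isbn.

```ballerina
type Author record {
    string id;
    string firstName;
    string lastName;
};

// Sketch by analogy with mapifyBooks: index Author records by their id field.
function mapifyAuthors(Author[] authors) returns map<Author> {
    map<Author> res = {};
    foreach Author author in authors {
        res[author.id] = author;
    }
    return res;
}
```

The near-identical shape of mapifyBooks and mapifyAuthors is exactly the duplication discussed here.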

Remark: If one day Ballerina introduces support for generic types, we might be able to avoid this kind of code duplication. It’s challenging because the name of the field (isbn for Book, id for Author) also needs to be dynamic.

The first drawback of maps is that they require custom code to create them from arrays. Another drawback of maps is that the key by which the map is indexed is not necessarily part of the record. Quite often, when a map is returned from a JSON API request, the key is not inside the data itself. For example, an author map JSON might look like this:

{
  "bob-miller": {
    "firstName": "Bob",
    "lastName": "Miller"
  },
  "darien-gee": {
    "firstName": "Darien",
    "lastName": "Gee"
  },
  ...
}

Once again, converting a JSON string like this into a map of Author (with an id field) requires custom code.

Ballerina supports a data collection that is like a database table with a primary key index. Instead of rows, it holds records, and the field used as the primary key in the index must be a read-only field. Let’s adjust our Book and Author record types accordingly:

type Book record {
    readonly string isbn;
    string title;
    string author_id;
};

type Author record {
    readonly string id;
    string firstName;
    string lastName;
};

Here are the type definitions for:

BookTable: a table of Book records with isbn as primary key
AuthorTable: a table of Author records with id as primary key

type BookTable table<Book> key(isbn);
type AuthorTable table<Author> key(id);

Ballerina provides an easy way to create a table from a JSON string, leveraging type inference, similar to how we created an array from a JSON string:

BookTable bookTable = check apiResponse.fromJsonStringWithType();

Another way to create a table is from an array of records, using a simple query, like this:

AuthorTable authorTable = check table key(id) from var author in authorArray select author;

If some records have the same value for the indexed field, the table creation fails at run time. That’s why we need to use the check syntax to capture run-time errors. Once the table is created, we are not allowed to modify the value of the indexed field. That’s why the indexed field must be marked as read-only.

Once we have a table in hand, we can efficiently retrieve a record by its primary key, using the square bracket notation, like in a map:

authorTable["bob-miller"]
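
As a small illustration of key-based access, the sketch below builds an AuthorTable with a table constructor instead of parsing JSON, then looks up one key that exists and one that does not. The "no-such-id" key is made up for the example.

```ballerina
import ballerina/io;

type Author record {
    readonly string id;
    string firstName;
    string lastName;
};

type AuthorTable table<Author> key(id);

public function main() {
    AuthorTable authorTable = table [
        {id: "bob-miller", firstName: "Bob", lastName: "Miller"},
        {id: "darien-gee", firstName: "Darien", lastName: "Gee"}
    ];

    // Member access by key has type Author?: nil when the key is absent.
    Author? found = authorTable["bob-miller"];
    Author? missing = authorTable["no-such-id"];
    io:println(found is Author); // prints true
    io:println(missing is ()); // prints true

    // hasKey and get come from the lang.table library.
    if authorTable.hasKey("darien-gee") {
        Author darien = authorTable.get("darien-gee");
        io:println(darien.firstName); // prints Darien
    }
}
```

As with a map lookup, the result of the square bracket access has to be checked for nil before its fields can be used.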

Tables bring another advantage over maps: the indexed field does not have to be a string, like in maps. Moreover, we can use the combination of several fields for the index.

Property	Map	Table
Key location	External to the value	Inside the record
Key type	string	Any
Key mutability	Mutable	Read-only
Fields in key	One	One or more
Ordered	No	Yes
We operate on tables with queries, exactly like we operated on arrays. We can copy and paste our implementation of searchBooks and replace the argument types:

BookTable instead of Book[]
AuthorTable instead of Author[]

function searchBooks(BookTable books, AuthorTable authors, string query) returns BookResult[] {
    return from var {author_id, title} in books // destructuring two fields
        join var author in authors  // joining with authors
        on author_id equals author.id // the left side refers to books, the right side to authors
        let string authorName = fullName(author) // creating a variable to calculate the author full name
        where title.includes(query) // filter books whose title includes the query
        order by authorName descending
        select {authorName, title}; // select some fields 
}

It works exactly the same as with arrays, except that the join optimization that we mentioned before is not required anymore, as tables are already indexed.

In short, Ballerina tables are the way to go when manipulating data with queries.

Let me conclude this article by mentioning some limitations of the Ballerina query language.

Limitations of Ballerina query language

The Ballerina query language is not yet completely implemented, and some important features will come in the (near) future:

Support for grouping and aggregation (GitHub issue #441)
Allowing non-constant order direction in order by clause (GitHub issue #1118)

Moreover, I think that there is a fundamental limitation when expressing data manipulation with a query, and it has to do with composability. When we use FP, the data manipulation steps, being function calls, are composable, while clauses inside a query expression are not.

Let me give you an example: Suppose we want to add an argument to our searchBooks, to control whether or not we want to sort the results. With FP, it’s just a matter of adding a condition inside the code:

function searchBooksWithCondSort(Book[] books, string query, boolean shouldSort) returns Book[] {
    var filteredBooks = books.filter(book => book.title.includes(query));
    var res = shouldSort ? filteredBooks.sort(array:DESCENDING, b => b.title) : filteredBooks;
    return res;
}

But inside a query, we have to skip a clause depending on the runtime value of a boolean argument. The only way to do that is to have two different queries:

function searchBooksWithCondSort(BookTable books, string query, boolean shouldSort) returns Book[] {
    if (shouldSort) {
        return from var book in books
            where book.title.includes(query)
            order by book.title descending
            select book;
    } else {
        return from var book in books
            where book.title.includes(query)
            select book;
    }
}

It might be acceptable when we have a single boolean argument, but what if we want to add another argument, for instance limiting or not the number of results?
Now, we’d have to write four different queries that deal with the four combinations of the arguments:

with sort and with limit
with sort and without limit
without sort and with limit
without sort and without limit

function searchBooks(BookTable books, string query, boolean shouldSort, boolean shouldLimit) returns Book[] {
    if (shouldSort) {
        if (shouldLimit) {
            return from var book in books
                where book.title.includes(query)
                order by book.title descending
                limit 100
                select book;
        } else {
            return from var book in books
                where book.title.includes(query)
                order by book.title descending
                select book;
        }
    } else {
        if (shouldLimit) {
            return from var book in books
                where book.title.includes(query)
                limit 100
                select book;
        } else {
            return from var book in books
                where book.title.includes(query)
                select book;
        }
    }
}
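
For comparison, here is a sketch of how the same two flags might be handled in the FP style, where each optional step is a separate statement. The function name searchBooksComposed and the maxResults parameter are hypothetical; the cap of 100 mirrors the limit used in the queries above.

```ballerina
import ballerina/io;

type Book record {
    string title;
    string isbn;
    string author_id;
};

function searchBooksComposed(Book[] books, string query, boolean shouldSort,
        boolean shouldLimit, int maxResults = 100) returns Book[] {
    Book[] res = books.filter(book => book.title.includes(query));
    if shouldSort {
        // Optional step 1: sort by title, descending.
        res = res.sort(array:DESCENDING, b => b.title);
    }
    if shouldLimit && res.length() > maxResults {
        // Optional step 2: keep only the first maxResults entries.
        res = res.slice(0, maxResults);
    }
    return res;
}

public function main() {
    Book[] books = [
        {title: "The Volleyball Handbook", isbn: "978-0736056106", author_id: "bob-miller"},
        {title: "Friendship Bread", isbn: "978-0345525345", author_id: "darien-gee"}
    ];
    // Sort and limit to a single result; both titles contain the letter "e".
    io:println(searchBooksComposed(books, "e", true, true, 1));
}
```

Adding a third optional step here costs one more if block, whereas the query version would double the number of queries again.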

I must admit that in most real-life use cases, I would find this lack of composability acceptable. Overall, I think that the rigidity of the query language is an advantage.

Aspect	Functional Programming	Query language
Logic units	Functions	Clauses
Required knowledge	Higher-order functions	SQL
Structure	Flexible	Rigid
Composability	High	Limited

Wrapping up

Ballerina’s flexible type system makes it natural to write data manipulation logic in an FP style, chaining higher-order functions like filter, map and sort. Furthermore, with Ballerina's powerful query language we are able to express data manipulation logic in a way that is easy to read, even for developers with no experience with Functional Programming.

Queries are easy to write, as we have the ability to use SQL-like clauses, like select, where, order by and join, and combine them with regular functions and advanced Ballerina syntax (e.g., destructuring). The natural way to represent data collections in Ballerina is via a table, a data structure that combines the benefits of arrays and maps.

About the Author

Yehonathan Sharvit
