InfoQ Homepage Presentations PRQL: a Simple, Powerful, Pipelined SQL Replacement

PRQL: a Simple, Powerful, Pipelined SQL Replacement

Bookmarks

View Presentation

Speed:

Download

49:13

Summary

Aljaž Mur Eržen discusses PRQL, a language that can be compiled to most SQL dialects, which makes it portable and reusable, important factors of OLAP.

Bio

Aljaž Mur Eržen is a compiler developer at EdgeDB and a core contributor of the PRQL project, with a background of mathematics and data science. He is working on making data languages more predictable, easier to reason about and improve the developer experience.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Mur Eržen: My name is Aljaž Mur Eržen. I'm a compiler developer at EdgeDB. This is PRQL. It is a new language. If you look at it, it feels like SQL. It has the from clause, the join clause, the select clause. It's actually a new language that was started, I think, a year and a half ago, when we published the proposal, and we got a lot of positive feedback. We decided that we should actually make the compiler that compiles this query that you see here to SQL that you can execute on any of your relational databases. We're not the first language that claims that it could replace SQL. There's a lot of criticisms of older languages that came before us. The main question being, why? Why would you want to replace a language that is nearly 50 years old, and is adopted by basically all relational databases? That's what I'll be talking about.

Overview

I have to talk about why SQL is not the ideal relational language from the language design standpoint. I will talk about how we designed a new language that aims to improve all those points that I'll be making about SQL. Then, this is quite unprecedented to compile to SQL, so I have to talk about how the language is being executed. Then a little bit about our project.

Flaws of SQL

Let's first take a deep dive into flaws of SQL. I'm talking from the place of contempt toward data tools, so this may come as a bit of rambling. It all started in 1970s, when a paper was published, talking about the relational model, and how it introduced abstraction over data storage, so you wouldn't have to look at offsets in your files, to determine that you're looking at some IDE, or price, or something, but it instead looks at relations as your data. This was really revolutionary, because up until now, no one has actually made a rigid mathematical model to look at your data. It also introduced tuple relational calculus, which is basically this that I have on the slide. It says that relation is a set of tuples. Then you have a few operations on those relations. For example, Greek letter pi denotes projection, sigma denotes filtering, and the star symbol denotes Cartesian product. This is a rigid mathematical algebra that you can use in your applications. Soon after people realized that it's hard to type symbols that you don't have on your keyboard. Instead of Greek letters, in 1974, there was a paper called Structured English Query Language, or SQL for short. This was later adopted by IBM and shortened to SQL. The main goal of the language was that it was understandable for normal folks, for managers that wanted insights into their data banks. This was accomplished really well, because it has human friendly syntax. If you read the query that I have here, select distinct name from invoices, it doesn't get better than this. Everyone can understand what this query is doing. I have a problem with this, because I think as from coming before all the other things. I think like I'm pulling data from invoices, then I'm selecting a column, and then I'm applying the distinct operation. This disconnect between the syntax and what's actually happening in your database is not a good property for a language to have. This goes further. Again, I have a query, select sum total from invoices, which is perfectly good in English grammar. If I change the syntax just a little bit and add over parenthesis, this is now not an aggregation function anymore, but it's a window function. This will produce vastly different results. The first query that I have, the sum total will produce a single row because this is aggregation. The second one will actually produce a whole relation with normalized totals. This is also not a good property of a language to have because small differences in syntax will produce vastly different results.

Going deeper, I want to talk about name resolution. Here I'm pulling data from albums, and I'm renaming column title into title_alias. The question is, when can I use title and when I can use title_alias, in which clauses, in where I group by, order by? The answer is, it really depends. It depends on the database engine that you're using. For Postgres, I know that you can use title in where, but you have to use title_alias in group by and order by. If I ask you, what about having? You'd have to memorize it for each clause and for each database system that you're using. This is a lot of rules to memorize. It's even more rules when you factor in order by positional references to your columns. You have to factor in correlated subqueries and lateral subqueries. These rules also interact with each other, and there's a lot of them. This becomes quite complex. I have actually implemented a resolver for SQL for our EdgeDB SQL adapter. It's a lot of work. You can generalize these rules a little bit, but still, there's just too much of it. This could be vastly simplified. I also want to claim that SQL is not really composable. For this, I have an example here that is pulling data from tracks and then doing some aggregation on album ID, with grouping on album ID. Now I want to take this relation and join it with another table, say album. If I want to do this, I have to encase this query that I had as a subquery in another query. I could do this with common table expressions, but still, I'd have to do width and then name my query and I have to list all the columns again. This is not really composable. When you just try to do data exploration, you try to just pipe your data into another transforms. This is not the ideal case for doing data analysis.

The last thing that I want to say is that in SQL, there are basically two major types. One is relations, and one is scalars. I have two queries here. The first one is doing select all from emp, which is probably employees. The second one does aggregation minimum employee ID from employees. The first one returns a relation, a lot of columns and a lot of rows. The second one returns only one column and one row. This means that the second query can also be interpreted as a scalar. This comes with the possibility to encase it in this query saying where employee ID is the sub-link. This will work, and it makes sense. It will always work because it has only one column and one row. In some cases, you can have queries that can be both. For example, here, I'm doing where a role is manager. If employees have only one manager, this will work. It can be interpreted as a scalar. As soon as Jim becomes the manager too, this will stop working. If you have it encased in a where clause like this, this will break and produce error saying there are multiple values returned by this query. You could think about this problem as SQL having dynamic typing, and not have types at all. I think that this is a great opportunity for a new language to introduce static typing, so you could in advance say this query that you have may produce multiple results. You should fix this in advance and not in production when you get the errors. I'll stop talking about SQL now. I'll just leave you with these articles here and a few other projects that are trying to compile or replace SQL. They have good justifications on why SQL is not the ideal relational language.

Design of a New Language for Relations

Let's start with the main thing that I want to talk about. This is how to design a new language for relations. I'll do this by just explaining a query that we saw earlier on the slide. Let's start with just, from animals. This is the most basic query that you can have in PRQL. What it does well, it pulls data from table, animals. As an example, the result of that would be a relation that has 101 tuples, each of the tuples containing columns, animal ID, species, height in centimeters, age and keeper ID. The first result is Mus musculus, which is a common house mouse. In PRQL, you can always just attach new transforms to your pipeline, before I had, from animals, and now I'm adding another transform that says filter only the rows that have height in centimeters more than 10. The result from that would be another relation now containing only 15 results. The first result would be Suricata suricatta, which is a meerkat, in English. You can add on more transforms. You can do, take 10, which will take only first 10 rows. You can do join. Here I'm providing two arguments. I'm providing first reference to the table that I want to join with, and then the condition on which rows to keep. Here I'm saying that the keepers ID must match between animals and keepers. This is probably a database from some zoo or something. This is quite verbose. Let's first introduce an alias for this new table that we're joining in, and say, k is keepers. Then we can just refer to k.keeper_id. This is still verbose. Because we are comparing two columns with the same name, you have special syntactic sugar here. We can just say, ==keeper_id. This is still semantically equivalent to what we had before. This is a good example of syntactic sugars we can add into the language, because we are writing a compiler and compilers can do this.

Let's move on, let's add another transform filter. We had filter before, but we can do it again. Why not? Here we are comparing first name to Thomas. Then we're sorting by animals.age. Then we do a select, selecting three columns. First column is animals.age. Then it's species, which is a named column. Then is a keeper, which is also a named column composed of an f string from Python. This is basically doing concatenation of first name, a space, and the last name. Just because we can, we can do select again, and now compose another f string and keep the previous f string keeper. The result of all this computation would be, for example, a relation with two tuples. First one is saying, I'm Suricata suricatta, 3 years old, my keeper is Thomas Edison. The second one is saying, I'm Capra ibex, 12 years old, keeper is Thomas Moor. We already know what Suricata suricatta is, and Capra ibex is an Alpine goat. They're very aggressive when they clash their horns, and quite stubborn to mountain hikers.

The query is now quite large. It's a lot to fit on one slide. It has a few nice properties that I want to explore. First of all, it's always top to bottom. Data flows from the input, and then through all your transforms that you're building. This is very convenient for exploration, because you can pull data from tables and then join in and transform, and transform until you get the result that you're looking for. It's also very easy to extract variables, because you can split your pipeline at any point. The first part is going to be a valid relation that you can put into a variable and reuse it later in different queries. You can also extract few transforms from a pipeline and reuse those 3 transforms, or however much transforms that you want later on. Let's explore this a little bit more. First, to extract a variable, let's say that I'm doing, from animals, and then a bit of filtering, and take 10. I want these first two transforms to be reused in multiple queries. I'm going to pull them out and say that let_big animals, this is a variable, is this first two transforms. Then in the main query, I can just say, from big_animals, take 10. You can see that this is very readable. Similarly, for a function, to extract a function, let's say I want to extract these last two transforms, the sorting and take operation. I can similarly define a new variable, and let's call it take_biggest. This is now a function that takes two arguments. The first one is number n, and then a relation. We pipe this relation into sort and then into take n. In the main query, I would just say, from animals, take_biggest 5. Now this 5 would get substituted or parsed into take. It would resolve to what we had before. I think that at this point you can see that this really feels like a programming language. Because you can refactor your code, you can see that this is a common pattern that we're using. You can pull it out. You can work with your data queries just as you would work with any other language that you're using.

I want to now show you the actual data model, what's happening in the core of PRQL, and how we defined the data that you're operating on. First, we have basic data types, Boolean, integer, float, string. This we're going to change, it's not really important. The important part is that we have two container types. First one is tuples, which can have named fields. They have different types. The compiler must know of all the columns in advance. You have here an example of tuple with integers, floats, and Booleans. The second container type is arrays, which has unnamed items, and all the items must have the same type. The nice thing about arrays is that they can contain a dynamic number of items, so the compiler does not need to know how many items is going to be in the array. Let's combine those two and say that a relation is just an array of tuples. Because arrays need to contain all the items of the same type, this is going to enforce that all the tuples also have all the same types of the columns and names. For example, here, I have three columns. My int, then an unnamed float column, and my Boolean. If you compare this definition to the 1970s definition of the conventional relation definition, it has two major differences. The first one is that arrays here have a defined order. Sets don't have defined order, so there is a disconnect here. Also, arrays can have duplicate items. We made this decision based on the fact that a lot of databases already actually treat relations in this way. Also, it is much more convenient and performant to do this in this way.

Let's go into a few syntactic definitions of how PRQL is defined. You will see that it's very compact. I'll tell you exactly all the syntactic rules that we have. For example, here we have variable definitions saying, let a is 5, and let b is a + 1. This is easy to read. Everyone can read this. Functions are defined with this arrow operator. We have function add_one that maps x into x + 1, and takes two arguments x and y and maps them into x + y. To call a function, you encase it into parenthesis, and the first thing in the parenthesis must be function name that you want to call, and then just list the arguments without commas. It's a bit different than the conventional languages, but it looks like the functional style programming languages. Because PRQL stands for Pipeline Relational Query Language, of course, we have a pipeline operator. Here, I'm having two different forms, because pipeline operator in PRQL can be denoted either by the vertical line, or by the new line. These two examples are semantically equivalent. I'm piping number 5 into function add_one, and then again into add_one, and the result would obviously be 7. I'm saving that into variable named 7.

This is the functions, and now we can move on to transforms. I've talked about transforms a little bit already. In PRQL, we just say that a function that operates on relations is a transform. This is just a shorthand. To show you that actually transforms are just plain functions, I'm going to use this example where I have two variables, one is animals and one is main. Let's say that animals is just some relation that you define either as a literal or something else. Let's focus on the main variable. It is a function call. The first part is the function name that you were referring to. This is a filter which is a function in our standard library. Then you have argument saying that height in centimeters is more than 20, and then the reference to the local variable, animals. Now let's transform this into the form that we had before. First, we're going to pull the animals as the last argument into the front and use the pipe operator. Filter is missing its last argument. Because of that, it's going to evaluate to a function that still awaits this last argument. This means that we can push the animals into this function that was evaluated here. In functional languages, this is called querying. Because vertical line is equivalent to a new line, let's split this up so it's a bit more tidier. Now I want to tell you about the rule in PRQL that says that if your variable is named main, you can just omit the variable definition, and you can just have the expression plainly on the line. This is still semantically equivalent. I don't want to use the local variable, animals, but I want to pull data from the database, so we're going to remove the animals declaration, and say just, from animals. Lastly, we designed PRQL exactly for queries like this. We actually have a rule that you can remove the parentheses here and have just height in centimeters is more than 10. Now you can see that with a few basic syntactical rules, we constructed a language that looks very like SQL, and is very nice to use when dealing with long pipelines of data.

I've talked about transforms a little bit, and this is all the transforms that there are in PRQL. It's 12 of them. We've dealt with from, select, derive, filter. This actually covers the whole of SQL clauses. It covers partitions in window functions. It covers with recursive, and it covers union. This is quite a significant achievement to back all the syntax of SQL into these 12 transforms. This is possible because we designed transforms to be orthogonal. This means that every transform is independent of the transforms that came before it. A good example of this is filter. In SQL, if you do something like from expenses, then filtering, then aggregation, then filter again, you know that the first filter will translate to where, and the second filter will translate to having. If you want this to be all one select statement, having has to be used for the second filtering operation. This really means that where and having cannot be independent of the previous transforms. While in PRQL, we just say, you can use a filter in both cases, and the compiler will figure out if this is happening before or after replication. To move this further, to achieve orthogonality of transforms even more, we have invariants for each of the transforms. These are properties of relations that will not change when you apply these transformations. For example, filter will never change columns. Filter will only remove rows, but will never add new columns or remove columns. Derive and select will never change the number of rows. This is also very obvious. In SQL, this is not the case. In SQL, you can write select, and then an aggregation function, and now this will trigger aggregation and the result will be only a single row. Just changing the select clause will change the number of rows. In PRQL we don't want to do that, so we have separate aggregate transform, and this has the invariant that it will produce exactly one row.

If you ask yourself, in SQL, when you do aggregation, you often want to use aggregation by some other, we want to do group by. How do you do then group by if aggregate will only produce one row? This is actually one of my favorite features of PRQL. I'm very proud of how we designed this, because it really shows you the composability of the language. For basic aggregation, you can see that you just do from expenses and then aggregate the total as the sum of all costs. Aggregate will produce a single row. The result would be something like total is 431. When you want to do grouping, what you essentially want to do is to take your expenses, and split it into different partitions, one partition for each department. Then you want to apply this aggregate function that you had before to each of the partitions, and that will produce a single row for each partition. At the end, you want to compose or merge these rows back together into a single relation. This is exactly what's happening here in the syntax that I'm showing. Group function is taking two arguments, first one is the column that you want to group by, and the second is the function that you want to apply to each of the partitions. The result is what I have on the slide below, is department sales has total of something, and department accounting has a total of something else. The nice part about this design is that you can actually decouple group and aggregation. These are not connected principles anymore. You can use any function that you would want here as the function that is applied to every partition. For example, you can use function take one. What would this do? It would split all expenses into separate partitions and then take just one row of each partition, then merge all the partitions back together. The resulting relation would have exactly the same columns as it had at the beginning. It would be just one random expense from each department. Because this function that we're applying could be any function, it could also be multiple transforms stacked one on top of each other. Here I'm doing sorting by cost decreasing, and then taking just one row. What this would do is, basically, for every department, find their largest cost.

If you want to know how this is implemented in SQL, basically, in Postgres and DuckDB, you can use DISTINCT ON, special keyword just in those two database engines. In other engines, we have to use windowing functions, creating row numbers, and then filtering where the row number is 1. This is very inconvenient, and basically exactly what this query compiles to with our compiler. Now a question for you, what would this query compile to in SQL? I can give you a hint, expenses.* means all the columns of expenses relation. This is distinct, because we are grouping by all the columns, each partition will have exactly the same values in all the columns. When you do take 1, this is basically doing deduplication. This is the beautiful part, I think, that we have a few core transforms that can eliminate the need for other keywords in SQL language. This is a very common operation, so you could define your own function that would do group and take 1. I am showing you that the core of PRQL is very simple, because relations are really not that complicated.

Moving on to lighter topics, we have a lot of ergonomic improvements. The first one being nulls. In PRQL, null is a value. If you want to know if a column is null, you just say, my_col == null. In SQL, we have a special keyword for that, too. More ergonomics come from syntactic space, we have date literals. We have f strings, as I've shown. We have the coalesce operator for looking if a variable is null. We have underscores in numbers. We have trailing commas, so you can comment out lines easily. These are just a few improvements that were made over the last 50 years of language design. It doesn't make sense to use a language that is that old.

Challenges of Compiling Queries

I have to talk about how PRQL will actually get executed. PRQL compiles to SQL. It is meant to replace SQL. It is designed to replace SQL, but not in the form of the database interface, but just the query language. What's the difference? Try this exercise with me. Imagine a database that would not have any query language. The only interface that you have is specifying which table you want to query. The result would be database sending the whole table back to you, to your client code, where you have super-fast Rust code that is going to do all the crunching of the data. That's obviously going to be super slow. A more extreme example of this would be implementing a query like this, where in SQL it would say, count all from albums where title is something. This would produce a single row. In my imaginary example, an operation like this would be implemented by sending all the data from the database to your client, and then aggregating it in your client down to a single number. Here, you can see that what you're basically doing using a query language is moving the processing from the client, from your Rust client, into the database, basically specifying a program to be run within the database. You would want this processing to be as close to the data as possible. This will obviously minimize data transfers. Depending on the database interface that you have, databases can figure out that your query could be executed on multiple computers at once, so you could parallelize your query. For columnar databases, this will also enable vectorized operations to be used, speeding up your query significantly.

When you think about databases in this way, they're basically execution platforms for your data queries. When you're using a new language in the front of the database, this is basically a compilation target. When we think about PRQL and SQL, actually we see that this is analogous to compiling C or Rust code to a binary to be executed on an amd64 processor, or compiling Java down to bytecode to be executed on a JVM. We're compiling PRQL to SQL to be executed on your database. I've said your database a lot. I've said SQL a lot. Actually, there is no one database and no one SQL dialect, there's a lot of dialects. In our case, this means that there are multiple instruction sets that we need to take into account. They differ from each other a lot. There is a standard, but there are differences in syntax, there are differences in available functions, and there are differences in available data types. This complicates our work a lot. This is fine. Databases have different priorities, they have different backwards compatibility guarantees, and there are different implementation limitations. You cannot expect all the databases to conform to the standard exactly to the point. Actually, the differences in databases are fine. The problem is that they are not specified. It's not easy to have a query work on your database. Then when someone asks you, would this query also work on some other database engine? The answer is, "Maybe it would, or maybe it would work but produce a different result." What I'm saying is that if we had a query language before SQL, the compiler could adapt this query to some different SQL result, that would work on every database that you're compiling to. Also, your compiler could say, the function that you're using does not exist on your target architecture. This means that you cannot use it or use something else, or change your database engine. This will greatly speed up your development process, because you will catch these errors before they hit production.

Another problem that we have when compiling to SQL are leaky abstractions. We are basically introducing an abstraction over SQL. Developers that would be writing PRQL should not know SQL at all. That's the goal. Currently, this is not the case. That's because sometimes we don't have the complete knowledge of the database, and the compiler is going to produce invalid SQL. Also, compiler has bugs. Sometimes, it's going to produce invalid SQL. Also, in some cases, you could write better SQL than the compiler would. There are also cases in the other side where you don't know the optimal SQL to write and the compiler can assist you there. One more thing is that when you say divide by zero, you would get a runtime error. The database would report that error. In SQL at offset 33, there is a runtime error, and we need to translate this error into original PRQL source. This is non-trivial, but it's something that all the compilers have to face. TypeScript did this and all the compilers before it did this. You need source maps to do it. It's just something that we still have to accomplish.

PRQL, the Project (Open Source Effort)

I've talked about how we're going to execute. Now, this is the last part, a little bit about our project. We are an open source effort. All the language specifications, the compiler documentation, and all other tooling is under Apache License 2. We're an open community that work on GitHub. We have all the language design work done there. If you have strong opinions on how to improve this language even more, come and join us and comment out your strong opinions on GitHub. One thing that we take very seriously is that we'll never monetize because, first of all, it's very hard to monetize a language. Second of all, it would actually decrease the adoption of the project, especially when integrating with other open source projects. The main product of the project is prqlc, which is the compiler from PRQL to SQL. It currently is able to target Postgres, SQLite, DuckDB, MySQL, and ClickHouse. It has bindings for C, Python, JavaScript, Java, .NET, PHP, and I think also Elixir. The main principle that the compiler follows is, don't connect, infer. This means that the compiler will not connect to the actual database, but it will look at your query that you're writing. It will see that you are referring to table, animals. It will infer that there must be a table, animals. When you say animals.species, it must infer that the animals table contains column species. This is a hard thing to do. I can say that it works really well. If you want to provide more information to the compiler, we are soon adding definitions of the tables, so you could define the type of tables that you have in your database. This would greatly improve the feedback that you get from the compiler. It also is able to fail early. When you're typing in your IDE, you get squiggly lines for syntactic errors, or even semantic errors. This really tightens the loop of your thinking, writing, and waiting for your database to tell you if what you wrote is correct or not.

A little bit about the architecture. The compiler is basically taking PRQL source code and parsing this into the first intermediate representation, which is called PL, which stands for pipeline language. This is basically an ASD similar to other compiler architectures. At this stage, we do all the name resolution. We resolve this function call is coming from standard library, so this is join, transform. Here we determine types of each of your values in the program, and determine if there are any type errors. Then we lower this representation into RQ, which stands for relational query. It's the query of relations. We make sure that you're actually querying relations. It has the format, where it's easy to see all the transforms and see that these three transforms form an atomic select statement, so you can split your long pipeline into atomic SQL statements. Here, we also correct all the things that stem from window function and determine, we need a separate relation here, a separate select query to compute some window function, and so on. This last presentation can be converted directly to SQL query. The result is a compiler written in Rust, which means that it can be easily compiled to WebAssembly, which means that we have a playground that compiles PRQL to SQL easily in the browser. Similarly, we have VS Code extensions which does the same, plus it's able to query DuckDB and SQLite databases locally. At this point, I think that you should be able to see that SQL is not the ideal programming language. It was not designed to be.

It has a lot of things that could be improved. PRQL, I hope that you can see that it improves a lot of these points. Even though it's hard to compile to SQL, there are a lot of challenges and we have a way forward. If you're convinced, please join us at prql-lang.org. We are an open community, especially looking for contributors that know Rust or know how to program compilers.

Questions and Answers

Participant 1: You mentioned the issues you have with, you have all these databases, they support different functions. As the PRQL project, do you try and cater to the least common denominator? Are you happy to say, this query, you cannot run it against this database. Do you error out? Can I fall back on to native SQL syntax?

Mur Eržen: We started with the common denominator of supported features, because we started compiling easy queries at first, but we already ran into problems with how databases deal with arithmetic operations. For some SQL dialects, we already report that this is not supported. You can always fall back to the default behavior of the database, because we have an escape hatch, which is S-strings that you type exactly SQL within your PRQL query that is basically compiled exactly to the SQL. Similarly as in a C program where you can just specify an assembly chunk of code.

Participant 2: Do you support other data processing frameworks like Spark or Flink, where you compile that off SQL?

Mur Eržen: Specifically, Spark or Flink, no, not yet. Actually, we are thinking of targeting some other dialects or some other representations of relational queries. For example, there is Substrate IO, which is the standardization for relational algebra queries. We are thinking about compiling PRQL to that, because there are compilers from that representation into other things. I think that PySpark or Spark is one of them.

Participant 3: You mentioned the SQL support, are you aware of any other IDEs that support [inaudible 00:41:25], or is there grammar that an IDE could use?

Mur Eržen: We have a few different lexers, but I don't think that there are integrations for other editors available right now. Yes, there are other lexers, but I think that we have a three-seater operation and [inaudible 00:41:51], and those could be integrated probably with a lot of editors.

Participant 4: Are there any plans to support Microsoft SQL Server?

Mur Eržen: Yes, there are. We have basically three tiers of support for dialect. The first tier is the dialects that we can test. Actually, Microsoft SQL Server is in the second tier, which should work in most cases, but is not tested.

Participant 5: Do you have any [inaudible 00:42:58] or interfaces to support new dialects?

Mur Eržen: Currently, this would have to be merged into the main compiler. It could easily be forked, just the SQL translation backend to add support for new dialects. Currently, we're at the stage where any work on that would easily be merged into the main compiler, so it would be easy to do that.

Participant 6: Just to use an example of my company, we use Django on our backend. I understand that you talk about how PRQL, for example, has better error mapping. It's in theory, language agnostic. Why would I want to use PRQL as opposed to like a built-in ORM in my day-to-day?

Mur Eržen: ORMs are used mostly for mapping objects to relations. PRQL is most useful when you're dealing with large data queries, which happens with analytical queries. I would actually not recommend PRQL for when you're doing app development, and you just want to map your objects to relations. I would suggest ORM or EdgeDB, which is the company I work at, which does exactly this. PRQL really shines when you have large queries that you want to refactor and you want to treat your database as an execution platform.

Anand: I used with LinkedIn, and for large data processing they used Pig, which is another dataflow language. They used that instead of Hive, which is what that case would use, because Pig gave developers the ability to basically specify almost like their logical or physical plan, because it's specifically in order and you have control over the logic or the physical plan of the query, one that you wouldn't with Hive. That was for analytic queries. I think it totally makes sense.

Participant 7: I was wondering, there is a project you may be familiar with, it's called dbt. It's one of the main frameworks used in handling large datasets particularly if it involved warehouses. It has a similar way of generating new SQL. In this case more about, behind software engineering practices to the way you generate SQL. This adds an additional layer to it. I'm wondering whether there's any synergies there that you can see of combining the compile level of PRQL and the software engineering pipeline or generic data stuff with dbt?

Mur Eržen: We actually are aware of dbt. We have people working with dbt a lot and PRQL. We have a plugin that plugs into dbt. The problem is that dbt needs more insight into what's happening with your data. It's basically you're writing SQL queries, and then you have Jinja templating on top of that. When you refer to some other model, you use Jinja references to that model. Actually, there is a bit of overlap over semantics that PRQL and dbt are doing. The integration between the two would be quite complex. We've spoken actually with the folks at dbt, and they think that we could benefit a lot from the integration, especially on the column level lineage. This is still yet to be done.

Anand: As a dbt user, do you find it hard to manage very complex SQLs, very large SQLs in certain parts.

Participant 7: It's supposed to actually help you manage. I think one of the key points of PRQL, from my perspective, besides standards, is the composability aspect that you mentioned. It's really actually a point that should not be lost in all this conversation for me, particularly if you're adopting some of the functional programming paradigms. I've seen some inserts on this dialect construct here, which is great. For me, it's the composability, and therefore with dbt, although it's not done at the language level, it's done more at the engineering side of it, helps you address some of the composability at scale, of course.

Mur Eržen: For the composability, the big aspect of dbt is to define multiple models that the dbt then knows about. You can say, just build this model again, or the error originates from this model and I'm testing this model. You would have to build this into PRQL. When you define a data query, when you define the pipeline, you would say, this is now a data model that needs to be exposed to dbt. Actually, when we were thinking about it, this really adds a lot of power to the dbt, especially because the compiler can even optimize between different models that you're building. There is some aspect here.

See more presentations with transcripts

Recorded at:

Feb 08, 2024

Aljaž Mur Eržen

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?

PRQL: a Simple, Powerful, Pipelined SQL Replacement

Summary

Bio

About the conference

Transcript

Overview

Flaws of SQL

Design of a New Language for Relations

Challenges of Compiling Queries

PRQL, the Project (Open Source Effort)

Questions and Answers

Related Sponsored Content

This content is in the AI, ML & Data Engineering topic

Related Topics:

Related Editorial

Popular across InfoQ