
Jaxon Repp on HarperDB Distributed Database Platform

In this podcast, Srini Penchikala spoke with Jaxon Repp, head of product at HarperDB, about their distributed database platform, edge persistence, and custom functions.

Key Takeaways

  • HarperDB's distributed database architecture and the use cases it is a good candidate for.
  • HarperDB's support for edge computing and data persistence on devices.
  • Leveraging the operations API and custom functions to interface with the database.
  • A planned custom functions project that provides a GraphQL interface.
  • Database support for authentication, authorization, and data encryption.

Transcript

Introduction [00:38]

Srini Penchikala: Hi everyone, my name is Srini Penchikala. I am the lead editor for the AI/ML and Data Engineering community at InfoQ. Thank you for tuning in. In today's podcast, I will be speaking with Jaxon Repp, head of product at HarperDB. We will discuss the HarperDB database, from its technical architecture and data management capabilities to how it supports cloud-based data management use cases.

Let me introduce our guest first. Jaxon has 25 years of experience architecting, designing, and developing enterprise software. He's the founder of three technology startups and has consulted on multiple IoT and digital transformation initiatives. Jaxon, thank you for joining me today in this podcast. Before we get started, do you have any additional comments about your work at HarperDB that you would like to share with our audience?

Jaxon Repp: No, just that I'm honored to be here, and thank you very much for having me. I'm excited to tell you a little bit more about what we're doing.

HarperDB Architecture [01:30]

Srini Penchikala: Sounds good. Thank you for your time. For our listeners who are new to HarperDB, can you give us a brief description of the database and what type of data management use cases it's a good solution to consider?

Jaxon Repp: HarperDB is a distributed database. At its core, we are both SQL and NoSQL: we can store traditional rows and columns, and we can also store documents, but we combine that with the power of ANSI-compliant SQL. So you can join tables, and even join on nested values within JSON, between multiple tables (the MongoDB corollary would be collections), which is something most document stores have trouble with. So we give you the power of the SQL you already know with the flexibility and the speed of a document store like MongoDB.
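To make that concrete, here is a minimal sketch of a join on a nested JSON value issued through the operations API. The instance URL, credentials, and the dev.orders/dev.customers schema are hypothetical, and the exact nested-attribute syntax is worth verifying against the current SQL docs:

```typescript
// Hypothetical example: join two tables on a value nested inside a JSON document.
// URL, credentials, and schema/table names are placeholders.
const HDB_URL = "http://localhost:9925";
const headers = {
  "Content-Type": "application/json",
  Authorization: "Basic " + Buffer.from("admin:password").toString("base64"),
};

async function runSql(sql: string): Promise<unknown> {
  // Every call to the operations API is a POST with a JSON body naming the operation.
  const res = await fetch(HDB_URL, {
    method: "POST",
    headers,
    body: JSON.stringify({ operation: "sql", sql }),
  });
  if (!res.ok) throw new Error(`HarperDB error: ${res.status}`);
  return res.json();
}

// Join on a nested attribute (orders.customer.id) with dot notation.
runSql(`
  SELECT c.name, o.total
  FROM dev.orders AS o
  INNER JOIN dev.customers AS c ON o.customer.id = c.id
`).then(console.log);
```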

In addition, we have a unique algorithm for replicating data between nodes of HarperDB that we call clustering, but it is a bidirectional configuration at the table level, so you don't need to move all of the data to all of the nodes. You can have certain pieces of data, subsets of tables, residing on, say, an edge server, where the cloud might have everything and another edge server might have a different subset of data. So it's extremely efficient and works with just about any data topology you can imagine.
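As a sketch of what that table-level, bidirectional configuration might look like, here is an illustrative payload for wiring up an edge node through the operations API. The operation name, fields, and node details are assumptions meant to show the shape of the idea; consult the clustering docs for the real interface:

```typescript
// Illustrative sketch only: field names and values here are assumptions.
// Each table gets its own publish/subscribe flags, so an edge node can push
// its events upward while only pulling down the tables it actually needs.
// This object would be POSTed to the operations API like any other operation.
const clusterEdgeNode = {
  operation: "add_node",
  name: "edge-server-1",
  host: "10.0.0.12",
  port: 9932,
  subscriptions: [
    { schema: "dev", table: "sensor_events", publish: true,  subscribe: false },
    { schema: "dev", table: "thresholds",    publish: false, subscribe: true  },
  ],
};
```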

Srini Penchikala: Excellent. Regarding clustering, and also how the data is managed, can you discuss the data storage and architecture of HarperDB? How was your team able to build the product from the ground up, and were there any interesting architectural innovations used in developing the database product?

Jaxon Repp: One of the things we as developers were constantly running into was issues with performance over time. When you're in greenfield development, you build the best data structure you can, and then inevitably something arises that lays all of your logic and all of your best-laid plans to waste. Now you've got a billion rows in a database and you want to add a new index, and you have to shut down for the weekend and hope that thing builds in time for the Monday crush.

We wanted to build a system that didn't duplicate data, would be smaller on disk, still highly efficient and performant, but ultimately lower maintenance as well. So, by default, we index all top-level attributes in a table, so you don't have to worry about managing your indexes, because everything's indexed by default. Obviously, that contributes to slightly more storage on disk, but we don't actually replicate the entire data set in an index, only the hash attributes, the attributes that we're tracking.
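Here is a small sketch of what that means in practice: a record can introduce a brand-new top-level attribute, and that attribute is immediately searchable with no DDL. The URL, credentials, and dev.dogs table are placeholders, and the operation shapes are worth verifying against the current operations API reference:

```typescript
// Placeholder connection details; operation shapes follow the operations API
// docs of the time and should be verified against the current reference.
const HDB_URL = "http://localhost:9925";
const headers = {
  "Content-Type": "application/json",
  Authorization: "Basic " + Buffer.from("admin:password").toString("base64"),
};
const op = (body: object) =>
  fetch(HDB_URL, { method: "POST", headers, body: JSON.stringify(body) }).then((r) => r.json());

// Insert a record that introduces a new top-level attribute ("color").
await op({
  operation: "insert",
  schema: "dev",
  table: "dogs",
  records: [{ id: 42, name: "Harper", color: "brindle" }],
});

// Query by that attribute right away; no index had to be created by hand.
console.log(
  await op({
    operation: "search_by_value",
    schema: "dev",
    table: "dogs",
    search_attribute: "color",
    search_value: "brindle",
    get_attributes: ["id", "name"],
  }),
);
```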

Initially, we did this using symlinks on the Linux file system. Ultimately, we ran into an inode issue, and we rebuilt the entire product a few years ago on top of LMDB, which is a memory-mapped key-value store that is extremely performant and has the additional benefit of allowing for non-blocking writes. That allows us to be performant for reads, i.e. analytics or other workloads where you're doing lots of queries, while at the same time pushing data into the database without suffering a performance hit for it.

Srini Penchikala: You mentioned LMDB being the key-value database that's the foundation for HarperDB. But from a developer standpoint, the database can be used with different types of data structures: document, key-value, wide-column like Cassandra supports, and also graph data. So can you talk a little bit about that? What type of data is best suited for HarperDB, and are there any particular data structures or use cases that are not a good fit for HarperDB?

Jaxon Repp: No. I would say we are highly performant for every sort of data structure. That was the flexibility we ultimately wanted to achieve with this product, because it was one of those things where all of a sudden you have a new value that's going to go into the database, so you need to change the size of that field. That's a massive pain; we didn't want to have to deal with that. So, we love the flexibility of document stores like MongoDB, but ultimately we wanted something that was more performant and allowed us to integrate traditional SQL as well. So, when I mentioned LMDB is our underpinning key-value store, that is what allowed us to have extremely fast, non-blocking persistence, and then our data model, which lies on top of that, is what gives you all of the flexibility to break those things down into what LMDB ultimately stores: a set of keys and values.

Data Consistency Model [05:53]

Srini Penchikala: I know you mentioned the performance as well as the traditional SQL database kind of support. Usually there is a dynamic between these different options, SQL versus NoSQL, where SQL supports transactionality and immediate consistency with some performance hit, whereas NoSQL databases are meant for performance but come at a cost in terms of consistency and transaction management. So, how did you balance these two seemingly conflicting challenges, and how are you able to achieve both?

Jaxon Repp: We are ACID compliant at the node level. So, we adhere to consistency on disk, similar to more established SQL databases, and we are eventually consistent across the cluster. So, we accept that two people might try to write something on opposite sides of the globe, and then it's our job to reconcile that. We do that using a universal clock, and we're investigating CRDTs. We have lots of algorithms to prevent duplication of data. And to be honest, right now, the last writer wins based on a timestamp. But as we replicate data around the system, we have lots of comparisons to make sure that the right data ends up at the right place around that cluster.
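Conceptually, last-writer-wins reconciliation is a simple idea. Here is a toy sketch of the general technique, not of HarperDB's internal implementation:

```typescript
// Toy last-writer-wins merge: the version with the newest timestamp survives.
// This illustrates the general concept only, not HarperDB internals.
interface VersionedRecord {
  id: string;
  value: unknown;
  updatedAt: number; // milliseconds from a shared (universal) clock
  nodeId: string;    // origin node, used as a deterministic tiebreaker
}

function lastWriterWins(local: VersionedRecord, incoming: VersionedRecord): VersionedRecord {
  if (incoming.updatedAt !== local.updatedAt) {
    return incoming.updatedAt > local.updatedAt ? incoming : local;
  }
  // On a timestamp tie, every replica must pick the same winner,
  // so fall back to comparing node ids.
  return incoming.nodeId > local.nodeId ? incoming : local;
}
```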

There are competitive products that sacrifice speed, even in the distributed computing world, where you basically do a global row lock. CockroachDB is a great example: extremely performant on a global level and ACID compliant. They do that by basically locking a row before they write it anywhere. It's great for financial services, so nobody can ever take the same dollar out of the account at the same time from opposite sides of the globe, but for the other 99% of transactions, customers, and applications, eventual consistency is sufficient. That's the portion of the market that we decided to go after. That's not to say we won't add a toggle somewhere down the line that does global row locking, but right now we have plenty on our plate just selling distributed computing and the advantages that come with it.

Srini Penchikala: Most use cases typically need the performance more than the consistency, right? Though there are a few use cases, like you said, financial institutions, where the transaction is more important.

Jaxon Repp: Exactly. Our biggest successes in proofs of concept have been things like AI/ML, classifications at the edge, where anomalies might be replicated up the chain and rolled into a new model, and the new model is then distributed back down to the edge. Or social media, gaming, or entertainment, where you might want a friends list. There's no penalty for adding a friend and getting it a half second later, or dropping a friend and having them hang out there for a half second on the other side of the planet; 110 milliseconds is usually what our global data replication takes. So, when it's not mission critical that we lock everything out, the speed delivered by having the data close to the edge far outstrips every other architecture, when the alternative is a single monolithic data store in a single region and the massive round trips from your distributed API endpoints back to that data source.

Srini Penchikala: Also, you mentioned earlier how HarperDB manages the data consistency at the node level and the cluster level. Can you discuss a little bit more about how the distributed, peer-to-peer, read and write consistency works? And how is the API distribution designed in HarperDB?

Jaxon Repp: Our clustering is peer-to-peer; it's pub/sub. So, one node can initiate a connection to another node, say an edge node that has access up to the cloud, where the cloud might not have access down through a firewall to the edge node. That edge node can subscribe to tables that are managed, perhaps, in the cloud, say, thresholds, while it's doing analysis of sensor data. It records those values first in, first out, keeping maybe 30 minutes' worth of sensor data. It subscribes to the thresholds from the cloud. It's recording the data locally, but not sending it anywhere until a threshold is violated, at which point it creates an event, using perhaps the 30 minutes of leading data. It wraps that event data up in a package and inserts the package into the events table, which is then replicated back up to the cloud. So, you don't move all of the data. You only move the parts that are necessary.
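Here is a toy sketch of that edge pattern. The names, window size, and threshold check are hypothetical; inserting the resulting event into a replicated events table is what would move it up to the cloud:

```typescript
// Buffer recent readings locally (FIFO) and emit an event, including the
// leading data, only when a threshold replicated down from the cloud is
// violated. Hypothetical names; not HarperDB's API.
interface Reading { sensorId: string; value: number; at: number }

const WINDOW_MS = 30 * 60 * 1000; // keep ~30 minutes of leading data
const buffer: Reading[] = [];

function onReading(reading: Reading, threshold: number): { leadingData: Reading[] } | null {
  buffer.push(reading);
  // First in, first out: drop anything older than the window.
  while (buffer.length > 0 && buffer[0].at < reading.at - WINDOW_MS) buffer.shift();

  if (reading.value > threshold) {
    // Package the leading data into one event; inserting this into the
    // events table is what triggers replication up to the cloud.
    return { leadingData: [...buffer] };
  }
  return null; // raw readings stay local
}
```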

Now, when you've got 1,000 edge nodes and they are writing events to the events table, obviously we would have a key in those attributes that says, "I came from machine one, or machine two, or machine three." A lot of times you have multiple applications writing to the same instance of HarperDB, or the same table in HarperDB, and then distributing those out. We effectively take the transaction, which is just the change set to that data, and we replicate that around the network to all the nodes that are supposed to possess that data. So, the result is that the transaction itself doesn't go to every node, only the change in data, which is a lot less. So, we're able to lower network traffic and ultimately lower the load on everybody but ourselves, with the algorithm that says, "This is supposed to be written before this, this overwrites that. That one comes before, but here's how that would reconcile." We do all of that without anybody having to worry about it.

Edge Computing and Data Persistence [10:55]

Srini Penchikala: Also, you mentioned edge computing a little bit. Most modern data use cases include some kind of edge device, whether it's a smartphone, a device in a factory, or even vehicles, i.e. cars, in terms of autonomous vehicle technologies. Can you talk more about that? Can HarperDB be installed on all these different nodes or devices? I know HarperDB can be used for smaller as well as larger applications. Can you talk more about how it works from the edge persistence standpoint?

Jaxon Repp: We've run on machines as small as little Jetson boards or a Raspberry Pi. We have a 100 meg footprint, and we use, as I mentioned, LMDB as an embedded database. So, there's no ongoing process sucking up CPU until such time as you hit the operations API and begin executing a workload. So, we're not resource intensive and we don't have a large footprint. Obviously, your data may have a large footprint, so you're going to be constrained by the size of what you choose to put on disk. But ultimately, we found that as much as we would refine our query engine or the speed of our API (we had Express when we started and now we use Fastify, for example, and Fastify turns out to be twice as fast), there was no substitute for moving the transaction closer to the user.

We can go from five seconds of total query time to 10 milliseconds of total query time when we move the application layer, i.e. our custom functions, and the data out to, say, a Verizon Wavelength server while you're on the Verizon network over 5G. You connect to that application and you see data in single-digit millisecond response times. There's almost no optimization I can do to our internal product that will ever save that much time. So, the edge is a huge play for databases, because we just look faster because of where we are. It's a very, very easy decision to make, as long as your data retention policies and your comfort with eventual consistency meet your application's needs.

Custom Functions [12:52]

Srini Penchikala: Right along with cloud computing and edge computing, another area that's been getting a lot of attention in terms of databases is serverless architecture. HarperDB supports custom functions, so can you talk about how these custom functions work and what they mean for database developers as well as application or API developers? How can they leverage this interface to work with the database?

Jaxon Repp: HarperDB comes with what we call our operations API. It's an HTTP API where you can execute all of your queries, operations, inserts, start jobs, exports, stuff like that. That is, as I mentioned, run on a Fastify server that has access to core HarperDB methods and executes whatever operation you send in. We love that format, because almost every application is already making HTTP calls, and figuring out your connectors or your drivers is always such a pain, whereas HTTP is super easy. It's also very atomic and very fast; there's not a lot of overhead attached to it. We thought it would be great if you could define your own operations. So, we added to HarperDB a standalone Fastify server that you can restart without affecting core HarperDB ingestion and operations.

Then you can use traditional Fastify routes, with all of their hooks and all of their handlers, to effectively set up your own API endpoint. Within those routes and those handlers, you have access to core HarperDB methods. So where you might normally set up a serverless Lambda function on AWS, you would reach through, say, API Gateway, you would hit the Lambda, and the Lambda would then make another HTTP call to HarperDB, making that extra hop. Whereas in this architecture, you hit the "Lambda", basically a Fastify handler sitting directly on the machine next to the data, and instead of making that second hop, it just directly queries the embedded LMDB data store through our data model. That's how you end up with two to four millisecond response times when you're on the edge querying massive data sets. In the social media and gaming space, that's why we're having success: things used to take maybe 11 seconds if you were in South America, because that round trip to US-West-1 takes a long time, up to 11 seconds to get your friends list back. We were able to move our HarperDB server with custom functions to a data center in Buenos Aires and boom, you were seeing 20 millisecond response times at peak.
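For flavor, here is roughly what a custom functions route file might look like. The (server, { hdbCore, logger }) signature and the hdbCore helper reflect HarperDB's custom functions documentation around the time of this podcast; treat the details as illustrative and check the current docs:

```typescript
// Sketch of a custom functions route file (details are illustrative).
// The handler queries the embedded data store directly; there is no second
// HTTP hop from the API layer to the database.
export default async (server: any, { hdbCore, logger }: any) => {
  server.route({
    url: "/dogs/:id",
    method: "GET",
    handler: (request: any) => {
      logger.notify(`fetching dog ${request.params.id}`);
      // Validate/escape route params in real code before building SQL.
      request.body = {
        operation: "sql",
        sql: `SELECT * FROM dev.dogs WHERE id = '${request.params.id}'`,
      };
      return hdbCore.requestWithoutAuthentication(request);
    },
  });
};
```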

GraphQL Support [15:14]

Srini Penchikala: Speaking of APIs to access the database, can you talk about whether HarperDB supports GraphQL or any other type of application API interface?

Jaxon Repp: It's interesting; our number one requested feature is a GraphQL interface. When we launched custom functions, we made them project-based. So you create a folder; there are routes in there, there are helpers. You can include any NPM package you'd like, build that thing as a project, and then distribute that project to all of your nodes. The first project we actually deployed, and it's available in our Git repository, is a RESTful interface. So, truly RESTful compliant: I'm going to go to this schema, this table, this ID, and it's going to return me that result. I can post to it and update it, I can patch; it follows the spec perfectly. And then we thought, "This is a much better solution." Rather than building the GraphQL functionality into the core, we will work on a custom functions project that is a GraphQL-based interface.

Jaxon Repp: So we're working on that now, we're planning it, but what we're really excited about is the fact that since it's the number one requested feature, we hope that as soon as we release version one, the community is going to lovingly adopt it and make it even better. We have a very passionate developer community and a developer Slack, where we are constantly getting questions like, "This is how I would architect the solution, this is the table structure, is this right?" It's refreshing and rewarding to be able to say, "It doesn't have to be that hard anymore. It can be a lot easier than that; just do it this way and it'll solve all your problems," and to watch people who maybe are familiar with, say, MongoDB and that flexibility learn how simple it is to join things or to insert data, or what the performance levels are.

I think almost everybody has probably started an application using MongoDB and eventually run into the bottlenecks that you inevitably run into. But likewise, people who are traditionally using an RDBMS are still able to use their SQL queries. They can move them right over. They can transfer data directly out of Postgres, and they can effectively just move the queries in their app into custom functions, and it still just works. So, it's a very flexible platform, and all of the additional interfaces, including a WebSocket subscription where you could simply subscribe to changes in a table, will be built as example custom functions projects that people can just git clone directly into their custom functions folder, fire up, and it should just work.

Srini Penchikala: Yeah, that's a good point. Reactive architectures are getting a lot of attention, so by leveraging WebSockets, developers can get that side of the development as well. You mentioned performance and scalability a couple of times. Do you have any resources or benchmark studies that InfoQ listeners can check out, especially to see how HarperDB compares with other databases in the SQL and NoSQL spaces?

Jaxon Repp: Absolutely. If you go to our website at harperdb.io, go to product, and then under that there are benchmarks and we have them for most major platforms.

Database Security [18:13]

Srini Penchikala: Another important aspect of any database, or any application in general, is security. Can you briefly talk about what kind of security features HarperDB supports out of the box, in terms of authentication, authorization, data encryption, and so on?

Jaxon Repp: We support basic auth through our operations API, as well as JWT auth, so you can generate a token and then use that for future transactions. Within custom functions, you can leverage the same basic auth or token auth, but you could also write your own auth handler. I wrote one for Azure AD for a customer, so you can implement effectively anything you want in those custom functions. From a security perspective, if you'd prefer to completely control the experience, you can actually shut off the operations API and only use custom functions.

So, you can close down access points to HarperDB and restrict them to only the ones that you want to have. Once you're in and authenticated, we have role-based security, where users can be assigned to roles. Within roles, we allow for basically attribute- and operation-level security. So, I can grant one role read access to a table where it returns all of the columns except the ones with personally identifiable information.

So, when they do "select *", they'll get back some account details and technical details, but they won't get name or address or phone number. Meanwhile, the account team won't get any of the technical details if they don't need them, and they'll get back the name and the address and the phone number. So, you can control data incredibly granularly, including not just reads, but writes, updates, and deletes. An edge node may only have permission to write three attributes into a table, whereas another application that is following up on an alert might have the ability to write, say, three more columns, because now there's more data further down the line. That's incredibly powerful.
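As an illustration, a role like that might be defined with a payload along these lines. The add_role operation and permission shape follow the operations API docs of the time, so treat the exact field names as assumptions to verify:

```typescript
// Illustrative add_role payload: the "support" role can read the customers
// table but never sees the PII attributes. Field names are assumptions to
// verify against the current operations API reference.
const addRole = {
  operation: "add_role",
  role: "support",
  permission: {
    super_user: false,
    dev: {
      tables: {
        customers: {
          read: true,
          insert: false,
          update: false,
          delete: false,
          attribute_permissions: [
            // "select *" for this role returns rows without these attributes.
            { attribute_name: "name",  read: false, insert: false, update: false },
            { attribute_name: "phone", read: false, insert: false, update: false },
          ],
        },
      },
    },
  },
};
```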

We don't currently encrypt data at rest. We have customers who encrypt the full disk, and there's a small performance penalty for that, obviously, but we figured there are lots of people going after that problem. We are looking at encryption strategies like homomorphic encryption, where data is not only encrypted at rest, but stays encrypted while you query it. So, it would be fully encrypted end-to-end: sent in over SSL, then immediately wrapped up and pushed into persistence. It is never unencrypted until you get the result set back, and then that endpoint will simply decrypt it on the way out, transmitted over SSL.

That is a forward-looking project. As we push further into the cloud, and further into simplifying deployment of a global data fabric in future iterations of our HarperDB Cloud offering, the more we can abstract away the complexity and the worries of security, speed, networking, and clustering logic, and the simpler we can make it, the more our customers seem to like it. There's the old paradigm of, "I want to install this on bare metal and I've got a DBA and he's the only person with the keys. And God forbid he gets hit by a bus, but man, he's really in charge of our indexes." We like to turn a lot of those assumptions and traditions on their head, and our customers seem to be nothing but grateful for our efforts.

Srini Penchikala: And also, it helps with faster time to market.

Jaxon Repp: It does. The sales cycle shortens considerably when you can answer, "We already do that," to every single question they ask.

Srini Penchikala: That's how we bring the database tier into the agile development spectrum, right? Since HarperDB is a relatively new database in the industry, where can our listeners find out more about it? Obviously there's your website; are there any other resources you can share where they can learn more about it?

Jaxon Repp: So to start, HarperDB is an NPM package. We are built in Node.js, and we are ultimately an NPM package that you can install just by typing "npm i harperdb". Super easy, assuming you have Node.js and NPM installed, obviously. More details can be found in our docs at docs.harperdb.io. We also have, as I mentioned, HarperDB Cloud, a database-as-a-service offering with a free tier, where you can log in to our management studio at studio.harperdb.io, spin up a free instance, and get started right away. There are code samples in just about every language that you can drop into your application, demonstrating how to set up a schema, how to set up a table, how to push data in, how to import a CSV, all of which you can manage through the studio. And then you can always install instances locally on your own machine. We also have a Docker container. All of those install instructions are, again, at docs.harperdb.io.
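A minimal getting-started sketch against a local instance, assuming the default operations port and placeholder credentials; the schema/table names are made up, and the operation shapes are worth checking against docs.harperdb.io:

```typescript
// Getting-started sketch: create a schema and a table, then insert a record.
// URL, credentials, and names are placeholders; verify the operation shapes
// against the current docs.
const HDB_URL = "http://localhost:9925";
const headers = {
  "Content-Type": "application/json",
  Authorization: "Basic " + Buffer.from("admin:password").toString("base64"),
};
const op = (body: object) =>
  fetch(HDB_URL, { method: "POST", headers, body: JSON.stringify(body) }).then((r) => r.json());

await op({ operation: "create_schema", schema: "dev" });
await op({ operation: "create_table", schema: "dev", table: "dogs", hash_attribute: "id" });
await op({
  operation: "insert",
  schema: "dev",
  table: "dogs",
  records: [{ id: 1, name: "Harper" }],
});
```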

Open Source Strategy [22:41]

Srini Penchikala: In terms of open source and community participation, HarperDB is not fully open source yet. Can you talk about the plans? Will there be an open source version in the future? What is the roadmap?

Jaxon Repp: We considered open-sourcing this product a great deal. But at the end of the day, we've seen a lot of really great databases with incredible features disappear. Some of the names fail me, but they fail me because they are no longer a thing, because it's hard to get people to work on something for free. I mean, even MySQL is fostered somewhat by a multi-billion dollar corporation. We don't have billions of dollars at HarperDB yet. We hope to someday, and we would love to open source the whole product. I think what we will ultimately be able to do is retain the core HarperDB data model and our data store, and then have everything on top of that, including the operations API, the custom functions API, and our clustering mechanism, be open-source modules: you simply drop one in a folder and now you have an operations API. And if other people work on those for us, that would be amazing. We are under no presumption that our upcoming solution for clustering is going to be the right solution forever, and we would love somebody with the aptitude and time to suggest something better, faster, and stronger.

We are focused on meeting the needs of our enterprise customers so we can pay our salaries and develop the best possible experience for our community and customers, and that takes money. So, open source with a service revenue layer is something we aspire to. You've just got to have money in the bank to afford that.

Srini Penchikala: Right, it's not how much a technology costs or doesn't cost, it's what it's worth. So, that's what makes it a valuable contribution to the community.

Jaxon Repp: We contribute back. We are often using the open source libraries that we have incorporated in our product in unique and interesting ways, and ultimately we raise issues. We find packages that have been around, like LMDB, and we find new ways to challenge them, and we work very closely with a lot of the maintainers of the packages that we incorporate. When we find something, we can push a fix; we can do a PR for them. Ultimately, we don't want to put a load on anybody else, in the same way that people hate filing tickets for us. We want to be good stewards of the goodwill that we hopefully engender.

So, we also do lots of interactions with library maintainers, and we do podcasts and webcasts all the time, where we show off their technology. Because obviously, we wouldn't be where we are if we hadn't managed to basically stand on the shoulders of some pretty amazing technological ingredients and systems.

Srini Penchikala: Do you have any additional comments before we wrap up today's discussion?

Jaxon Repp: No, I'm just thankful you guys reached out and asked me to talk about the product and the landscape that we are battling against every day. It is truly transformative when you move data closer to people. Like I said, distributing data closer to the edge is the best thing you can do to optimize performance. Nothing you can do internally will ever beat being a thousand miles closer to somebody, within reason.

Srini Penchikala: Yes. Thank you very much for joining this podcast, Jaxon. It's been a great opportunity to discuss one of the new and innovative databases in the industry, HarperDB, and how it helps with the cloud-native database development efforts that most of our audience are engaged in right now. To our listeners, thank you for listening to this podcast. If you would like to learn more about cloud databases or data engineering topics in general, please check out the AI/ML and Data Engineering community page on the infoq.com website. I encourage you to listen to the recent podcasts and also check out the articles and news items my team has posted on the website. Thank you.
