InfoQ Homepage Podcasts Development Using Data Lakes and Large Language Models

Development Using Data Lakes and Large Language Models

Oct 20, 2023

In this podcast Shane Hastie, Lead Editor for Culture & Methods spoke to Davit Buniatyan, the CEO and founder of Activeloop about developing with large language models and AI.

Key Takeaways

Developers need to understand how large language models work, as they are becoming increasingly important in the field of AI.
It is important to aim for efficiency in machine learning operations, includes optimizing the use of GPUs, reducing the time taken to train models, and minimizing the cost of computation on the cloud.
There is a need to understand how to manage data for AI applications which includes storing unstructured data like images, video, and audio in a way that allows for efficient data transfer over the network and using vector databases.
Prompt engineering and is a key skill for developers working with AI, they need to learn how to condition large language models to control the data they generate.
AI models are commoditizing software development, which that developers need to think more systematically about how to collect training data to teach models to operate effectively and how to handle edge cases, which are situations that are not covered by the model's training data.

Subscribe on:

Transcript

Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture podcast. Today, I'm sitting down with Davit Buniatyan. Did I get that right, Davit?

Davit Buniatyan: Yes, almost.

Introductions [00:20]

Shane Hastie: Welcome. Davit is the CEO and founder of Activeloop. He has a PhD in neuroscience and computer science, which I'm going to ask him to tell us a little bit more about. And we're going to be talking about large language models, we're going to be talking about AI developers, and we'll see where the conversation goes.

Davit, welcome. Thanks for taking the time to talk to us today.

Davit Buniatyan: Shane, thank you very much for having me here. I'm super excited to be here. Unfortunately, I didn't finish my PhD at Princeton, though I was in the computer science department I'd actually dropped to start the company, what I was very, very excited to work on. But yeah, happy to give you shared experience I went through both during the PhD and post-PhD.

Shane Hastie: Cool. So yeah, tell us the story. Who's Davit? What's brought you to where you are today?

Davit Buniatyan: I'm originally from Armenia, did my undergrad in U.K. And then came here to start my PhD at Princeton University, at a neuroscience lab, working on this field called Connectomics, which is a new branch in neuroscience that tries to understand how the brain works. Actually, most of our algorithm that we use today for if you heard about large language models or generative AI, et cetera, and under the hood deep learning are all based on our understanding of how the brain works from 1970s. But the neuroscience evolved so much that actually the understanding that we have in computer science, how the brain works, is way different than the current actually biologically-inspired algorithms should operate.

In our research, the goal is to figure out or reverse engineer how the learning is happening inside the brain while I was a computer scientist at the research lab. So, the problem that we were trying to solve is that we're trying to reconstruct the connectivity of neurons inside a mouse brain. We're taking one millimeter volume of a mouse brain, cutting into very thin slices, imaging each slice. Each slice was a hundred thousand behind a thousand pixels, and then we had 20,000 of those slices. So the dataset size was getting into petabyte scale sitting on top of AWS.

And our goal was to use deep learning and computer vision to be able to separate those images or pixels from each other to separate the neurons from each other, which is called the segmentation process, find the connections or the synapses, and then reconstruct the graph so later neuroscientists can do their own research. And during that process, that whole operation of computation on the cloud was costing us a lot of money, up to seven figure. And we had to reduce the cost by 5X by rethinking how the data should be stored, how it should be streamed from the storage to the computing machines, should we use CPUs or GPUs and what kind of models to use. And those inspired us to actually take a leave of absence from the university, and then later drop, apply to Y Combinator, get into it, move to Bay Area, and start working with early companies to help them to be more efficient in terms of their machine learning operations.

Shane Hastie: What's it mean to be more efficient in machine learning operations?

Efficiency in machine learning operations [03:18]

Davit Buniatyan: Well, I don't know if you've heard, but training the GPTs and all these large language models cost orders of tens of millions of dollars of computing costs, mostly GPU hours time, and they reach to a order of a hundred million. And if you have also seen a lot of companies now raise significant amount of money to be able to actually go and buy the GPUs from Nvidia so that they can have the infrastructure to train these large language models that go into up to hundred billions of parameters of model scale.

And that whole process is actually inefficient. So apparently when we train those models, in average case we use 40 to 60% of the GPU utilization. The rest of the time is actually being underutilized. Currently, to be fair, the bottleneck for GPU training is actually the most of it is because of the networking. So you need to synchronize the model across thousands of GPUs in a data center. However, there are also cases when you train those models, the bottleneck is actually bringing of the data from the storage to the model onto the GPU. And that's where actually we have been focused on solving this problem.

We had actually a customer about four years ago. That was way before GPT was there. There was transformers, which is attentional, you need the paper, was only one year old. And they asked us like, "Hey, can you guys help us to train a large language model on 18 million text documents?" The current approach that they were taking was taking them two months and we said, "Sure, we can do it in a week." And we actually took the data, started in very efficient manner on the cloud and then did a bunch of optimizations for doing this distributed training at scale on the cloud. That was four years ago. So whatever technology was there available and we reduced the training time from two months to a week project.

We cut compute costs and storage costs, and also on top of that we run an inference on 18 million text documents at 300 GPUs. So fairly, that number was large at that time, now it's pretty small because a lot of operations are handled now in orders of thousand GPUs as you see with GPT-4, et cetera. But at that time we did a lot of optimizations, how we can do all this processing time and efficiency. And what we realized is that a lot of time the bottleneck is actually how you move the data over the data center infrastructure.

And while I worked with another company as well who was processing petabyte scale air image data, they had airplanes flying over the fields collecting a lot of air image data to provide insights to the farmers where there's a disease on the field field or dry down area, and we help them to build the data pipelines. And what we realized is that you have all these awesome databases, data warehouses, data lakes, lake houses specialized for analytical workloads, but you don't have one for deep learning and AI.

Deep learning frameworks [06:11]

And we looked into how PyTorch and TensorFlow, those are the deep learning frameworks that under the hood are used to train and deploy those large language models and deep learning models in general and understand how can you design the storage behind the scenes. So those models that usually take a tensor in and output a tensor out and usually tensor, just an N-dimensional array, think of it. And while you have all these formats, storages, data lakes, et cetera, table formats, they all operate on top of columns which are just one dimensional arrays and nobody actually stores unstructured data, especially like images, video, audio inside the tables, all of them put it as a blob storage on top of let's say a cloud.

So what we did is that we said actually you can treat them as another column, but instead of being one dimensional, it can be n-dimensional or in case of let's say images, you have width, height and also the RGB, the color channel. So then you can have a column of, let's say if you have million images, million by 512 by 512, by three, and store this in a way on top of the cloud storage like AWS S3, or GCS, or Azure blob storage so that it can efficiently move the data over the network. Think of it as Netflix for datasets. So this really helps to reduce the time that the GPU is wasted while the data being transferred over the network to the GPU as if the data was local to the machine.

So you no longer need any distributed file system or network file system. You can directly stream the data from very cost-efficient storage, which is S3, as if it was like an in-memory database. And to be fair, people try to store images in traditional databases, even '90s, Oracle tried to have images. But if you talk to any salesperson at Oracle, nobody even they were being best at sales will recommend you to store the images inside their database. And the reason is because storing data in memory is very expensive. And for unstructured data, for blob data, that's why you prefer to have colder storages like object storage solutions like S3. That's a bit on a technological deep dive, but we did Open Source it. Initially the name was Hub, but then we renamed it to Deep Lake, which is a data lake for deep learning applications, and started running webinars and building the community. And it started trending across all GitHub repositories, number two, number one in Python languages.

We built a community of thousands, hundreds of data scientist and data engineers and we really started focusing on these use cases that the data scientists today, they operate with files the same way as you operate with your laptop where you have all these different folders, different files messing around and there's no version control, there's not any structure; even if you operate on top of unstructured data or complex data, et cetera. And Deep Lake actually helps to take all the great learnings and understanding of how databases and data lakes should operate for data scientists mostly while focusing on deep learning applications.

Shane Hastie: What are some practical use cases? So if I think of our audience, the technologists, the enterprise software development community being asked to either implement some of these large language models, bringing in AI, or at least exploring and evaluating. What are some of the practical use cases that are there?

Practical use cases for developers [09:31]

Davit Buniatyan: So that was like two years ago where deep learning specifically was very, very for very niche audience. People didn't know anything about GPT-4 and all the discharge GPT style applications, et cetera. So the use cases were fairly very small or niche. For computer vision, you need to do object recognition, object detection, you want to do classification, et cetera. For language modeling, you want to do semantic understanding of the text or doing search capabilities as well. But it was fairly early days.

And what happened last seven, eight months maybe a year already because of the ChatGPT and GBT-4, people started to realize that we need a database for AI. And the reason is the following, is that first of all, the large language model input is fairly limited. And to be able to overcome that problem is you need to extend the context size and for that you need a vector database. So this whole notion of vector databases has been born earlier this year. I mean it was there for last two or three years, but it became widely adopted in any enterprise software company that's building gen AI solutions sooner or later, going to need vector database in place.

That's number one. Number two is that because of this generative AI hype, a lot of companies need to change their strategy. Because all these large network models are doing is they're actually commoditizing software development. A lot of companies, products that they built over the last seven years can be doable using “AI”. What it means is that while your models and the solutions become commoditized, what keeps the company moat is actually the data. So then the amount of the data that you collect becomes the castle that you built to be able to offer differentiated solutions or products to your customers or end use cases.

And that's where the realization of, oh, now we have to actually protect our data. You have seen Reddit, Twitter, Stack Overflow, all initially being super open about their data use now they convert it into selling their data, or not letting anyone to scrape the data sets themselves. And the main reason is that because economically the data is becoming their key source of differentiation across organizations or across products, et cetera.

I didn't mention yet totally fairly new use cases, but there are many new use cases has been born because of the generative AI. The first one is the generation of images with the models like Stable Diffusion, companies like Midjourney or as with Photoshop with their Firefly that lets you to generate a lot of content. More specifically for on the image side. You have companies like RunwayML that let you to generate videos from the images. That's something that hasn't been possible before.

And from the ChatGPT interface side, what was unique and different is was the first time that any person could interact with a chatbot or a language model that could have a similar level of understanding or discussion that a human would have. And that drastically changed the way perception of anyone interacting with these chatbots or large language models, which enables the use cases such as enterprise search, basically building conversation with the data that an enterprise has, or being able to do code search, or code understanding, or code generation. There are more use cases. We'll see which ones will stick and bring the value, but I do believe that these two ones that I mentioned have the highest value impact so far in the industry.

Shane Hastie: Those are some clear use cases. What are the skills? If we think about our enterprise software developer today, what do they need to learn?

Skills that developers need to learn [13:12]

Davit Buniatyan: As it happened with web development era, we are also seeing the same, then cloud era, then et cetera. Then now we're seeing with gen AI era, while AI, let's say, putting in quotes “deep learning”, was for very niche audience, mostly PhD level folks, being able to train these models. Actually building generative applications became generally available. I had a one user conversation that was using our Deep Lake database and he was like, "Hey guys, you have been building this product for the last two, three years but it was only specialized for enterprises, but now average Joe (pointing to himself) has access to these AI models and access to the vector databases, and we can build solutions now”. We don't need this advanced level of understanding, et cetera, to work operate on top of those models. So I think it changes the field in a way that any developer, or even non-developer, can interact with these large language models because they're mostly trained on human level text, and either build solutions or use them for automating their tasks.

So there's one big problem. First of all, this large language models, they actually hallucinate a lot of data. So when you interact with them, they can come up with the data that didn't exist before, which is great for creative tasks, which is very bad if you're doing any specific operation. So you as a developer who wants to build a solution now need to have some control on top of how this both the models will start to operate. So that's where the so-called prompt engineering comes into the place, where you start to condition these large language models on very specific constraints so that the generation of the data that it outputs is fairly under control. I think prompt generation itself has been used widely because you as either a developer or non-developer, you don't have to worry too much. You just condition, add the text, or system instructions into this large language models and you have some control.

That's the first thing that you can do. Second thing, you can start collecting data, which is your training dataset. They can start to fine-tuning those models. I think OpenAI recently released their fine-tuning guide for GPT 3.5 that lets anyone to fine tune the model. You don't have to go and use PyTorch or TensorFlow and load the data model. Or if you want that you can also go and take Open Source models which are also super widely available and start training your own proprietary models. And you have to be careful there with the licenses. Some of the models are commercially free to use, some of them are very restrictive, but the whole of this Open Source movement is also trying to catch up with private organizations like OpenAI to be able to get similar accuracy what GPT-4 is providing on a closed source manner.

The way we build software changes with these models [15:50]

And then you get into understanding, okay, how the vector databases will work. In fact we realized that there's a gap between this, think of it as a new way of thinking versus the old way of thinking, where the understanding of how you need now need to start building software is also going to change. And there are now pretty famous frameworks like Langchain and Llama Index, they help you to connect a lot of tools together and connect those tools into large language models so you can solve your problems. And we actually in fact released free certification course in collaboration with Intel and Towards AI, which has a big community of data scientists and users that helps you to go from zero to hero. And the main difference of being this dummy for generative AI course, it's actually established on building useful use cases that you can take it in your enterprise.

Shane Hastie: So let's dig a little bit into that course. As a professional developer, I'm probably proficient in Java. In my case I was very proficient in C and C++, I will admit not anymore. But if I wanted to get into this space, I've got that programming background. I understand relational databases, I understand front ends, middleware, backend, so forth. What are you going to teach me?

Learn to think about systems differently [17:10]

Davit Buniatyan: Yeah, let's say the full stack engineering development background. Well first of all, I think in terms of the languages, apparently most of the AI development have been happened on Python, and now there's also a lot of popularity on JavaScript as well. But the main thing is that I think that you're going to learn already the insight is that if you want to get from A to B instead of building all these heuristics, you're going to think about systems differently, about those models that you need to technically teach them how to operate in life. And it's very similar to what happened with self-driving cars like five, seven years ago, it was so easy to build a prototype self-driving car, put it on a street, took your video camera, how this is driving over the street, but you would never trust this to put on a highway, and sit inside. And the main reason was actually these edge cases.

What it took both companies like Cruise, Tesla, and Waymo over the last few years is to collect a lot of edge case data to be able to fully scope out what does it mean to actually drive on a street in any place. I don't know if you've watched recently on Elon Musk taking a live recording of driving him Tesla in Palo Alto, but there was a very key interesting insight. So this is the first time they built an end-to-end neural network that drives the car. And there's not any human engineer wrote If... Else Statement how basically the car should drive. And the Tesla was not stopping at the stop sign. Now you have to wait until one to three seconds. So what they find out from the data they collected from Tesla cars is that in average a human stops at a stop sign using Tesla just 0.5 seconds.

How do you solve this problem? Okay, you can't just drive illegally. So one fair solution was like hey, you can go and add an If statement, say "Hey, if you see a stop sign, you're going to stop here." Wait, I don't know, one or three seconds before you drive. But the way they solve this problem actually was totally different. They said, "Okay, let us go and collect more data where good drivers are stopping long time on stop sign and feed this data an ops sample manner to the model when we do the next training." So the way you think of building systems and software is also changes. You're not thinking, "Hey, let me go in myself condition this edge case and solve this problem." But you think more a systematic approach, "Okay, how can I collect more training data to teach these models to do what I want them to do in a good manner?"

And why self-driving is not yet a solved problem is because nailing down this last mile of edge cases is so tough problem. It's like humans are incapable of writing all the conditions that a car potentially can drive through. You can throw hundreds of engineers on this problem, but maintain this software and put all these conditions on different scenarios and all their combinations becomes super, super difficult. And that's where developers now becoming more like teachers how to tune this junior engineers to do useful work. It's fairly early times you should not expect wonders to happen from these models, but the incremental updates that we are seeing especially over the last year, but it has started actually five years ago, or 10 years ago is we can predict that the change is going to be dramatic next few years, but it will take some time to nail down all these rough edge cases with large language models.

I can give you an example as well here. If you heard about this retrieval augmented generation, which is a fancy word for search use case where let's say you have million documents and you want to be able to text and ask questions and find these million documents. A more specific example is, let's say, you have all the reportings, quarterly reportings of Amazon for the last two years, and you want to be able to ask questions like, "Hey, what was the revenue for the last quarter?" Which is simply a straightforward question that can be answered by a ChatGPT, or GPT-4 having all the context inside. And for you to be able to implement this solution, you have to build, use a vector database, get all the data into vector database and then run a query.

However, let's say you ask this question, give me the quarter reportings of last three quarters, and not any vector search can solve this problem. What you have to do is you have to go to each quarter, get the revenue, take the last three quarters and then sum them together and give this to the GPT-4 so that they can synthesize the final answer to you.

Hallucinations leading to mistrust [21:34]

So this is just a one very basic example where no any search or basic search or large language can answer this question. Apparently if you take all these edge cases, if you get to 70% of accuracy answering questions across the data, that's already remarkable. And the biggest problem is that when the ChatGPT, or GPT-4, or any large language model foundation models don't have the right enough context, they start hallucinating the data, and they come up with new information. So for us, will you trust the system to be in a hospital working with a physician, like having all your records data and maybe some other records and give you 70% accurate answers for your diagnosis? Fairly no, right? That means that we still have the time to really make those systems really production grade, enterprise grade deployable into highly sensitive critical situations.

Shane Hastie: From an ethics perspective, I want to jump on that and say, but people are already doing it.

Davit Buniatyan: Let me ask you, doing what?

Shane Hastie: So the hype, what we hear is, and maybe this is the hype cycle playing out, but 70% accuracy, but we trust the results.

Davit Buniatyan: Do you all be wondered how many times we have junior team members, we ask them, "Hey, go and figure out this problem." And they come up with a solution, they come to me is like, "Hey, we figured out the solution." But the solution looks fairly wrong and it's like, "How do you get to this answer?" And then they say, "Oh, this is the reasoning." It's like, "No, no, no, this is not way possible." And they start arguing back why this is possible. And ask at the end, can they bring a proof of GPT-4 said this, and they put a screenshot in front of me. And I'm like, you can't do that. You can't put this in the context of building. You know how people started trusting Google and internet and Wikipedia for a lot of questions. Obviously most of them are right, most of them, but there are these edge cases are very specific cases that the domain expert or knowledge can actually outweigh this misinformation that exists in the common knowledge.

So in our case it was not critical application, but yes, you're right. If you ask, you'll be wondered how many people now already take the ChatGPT answers as facts. As factual information, you can use it for actions or any items which is knowing how things work behind the scenes and how they have been trained. You'll never ever do that. Though it's not saying that they're not useful. Actually we already passed six months of this hype cycle, and fairly the hype is still going up. So we haven't came to that point of realization, oh, we have these over expectations across this technology, and fairly we're going to go down. I feel like we're going to go down in terms of the expectations at some point and then the most useful use case is going to evolve and become mainstream. But certainly everyone says this as well. This is a totally different wave compared to the blockchains, and more similar to what happened with mobile revolution, with dotcom era, et cetera. So how the internet started in the early days.

Shane Hastie: Where are we going?

Where are we going? [24:38]

Davit Buniatyan: Well, definitely at least to what I believe is not the Terminator-style AI revolution, I think that's the one side of false future that's not going to happen. And then on the other hand, you have nothing going to change. We still developers going to continue building the software ourselves like using Java, Python, whatever current language do we use, et cetera. I think there will be certainly changes and those changes going to have huge positive impact with obviously all the negative consequences taken consideration into.

For example, at least on myself, I see that me having a lot of experience writing a lot of software using the large language models really helped me with code development. Saved so much time writing all this nonsense code that I have to write to get there that automates for me. That's just me using it for a year. I tried, by the way, using this two years ago and the state was fairly not there. I tried this about three years ago with fairly small models and the auto-completion was nowhere near to the usefulness that we have the systems now deployed at scale, so at least we have got this usefulness.

The second thing is I think a lot of interface is going to change. Now you go and give everyone access to generate code and do that in fairly conversational way they're used to, and you'll see a lot of non-technical people will become technical or at least leverage these technologies to uplevel their skills. This could be just even learning from the foundational models so that you can take it from educational perspective. I do believe the large language models will have a huge impact in education space, especially this will let you to get personalized education for each student depending on their speed and performance, how they're moving forward.

For one student you can teach basic math. The other student's super excited into quantum computing and explain how the quantum theory works. So then you can have personalization there and this will definitely have impact in automation. We have seen companies like UiPath, et cetera, on RPA side that help you to automate certain very manual tasks on the computer. And this is the next level where you can actually write a text and ask to generate a plan and start executing this plan, which is this notion of agents, what that operate on self.

Actually, there was a very interesting story recently that Stanford published this simulation life where you have 16 different agents, all of them connected to a large language model living into a very simulated small space. And you give a task to one of these agents, like organize a birthday party for your next day, and then what you see this agent, which all the agents just use behind the scenes large language models, you see this agent goes and asks nearby agents like, "Hey, can you help me with organizing the birthday party?" And the other agent says, "Yes. So I will do that. Let me start creating these invitation letters and inviting all other 14 agents to the birthday party." And some of the agents say, "Hey, I'm busy, I can't come tomorrow," et cetera. And they organize this birthday, which is very exciting. You have this kind of a simulation style life where you have these agents working.

On the other hand you have projects like AutoGPT and BabyAGI, which had a lot of excitement in the space, but they went down because of the error rate they operate. A very simple prompt can actually have a butterfly effect on their reasoning and how they think. So they're nowhere close to be fault tolerant. And I had a friend of mine called me and said, "Hey Davit, I'm really thinking to automate ..." He's in the trucking industry. Trying organization of these phone calls with the truck drivers, et cetera, using AutoGPT. And it's like, "Wait a second, that's nowhere ... You're calling me one week after the AutoGPT was released, it's like we are nowhere close. I tried this tool. It's like we are very, very far from this putting into production usage." So I think if you ask me how the future looks like, we are going to have a lot of now developers thinking and building generative AI applications which are ready to deploy in real life in production, but the development itself of those applications going to be different than what we used to do before.

Shane Hastie: As a developer, my job is not at risk, but I need to learn something new. That's been the case in development since I started in 1982, and it was happening a long time before then too.

AI as assistant rather then replacement [29:03]

Davit Buniatyan: We have senior engineers in our team. They're like, "I'll never use ChatGPT to advise me on how to write the code. I have been doing this last 30 years," or 20 years. Who also touches our ego as well is like, "Oh, there could be an AI that writes better software than us, how this is possible?" And I do think that the most efficient look into this, that the current generation of AI at least is going to be more assistance, and your pair programmers and going to continue where the humans lack the, not just the experience, but also ... I myself, I can write code six hours per day, but then I get super tired, and I can do so stupid mistakes. I believe every developer does these stupid mistakes. At least it saves the time on searching across Google and Stack Overflow so you can be more efficient.

But basically what I also found is that it can really boost you in very highly precision certain tasks that you know roughly that these things should work like this, but you don't have an experience doing this. So it boosts senior engineers to be extra senior. It enables the junior engineers to have all this knowledge that they lack or missed before, and then it helps the non-developers to become developers. So you get this boost across all the fronts and you don't have anyone get replaced here, just get enabled at scale.

Shane Hastie: Thank you so much. Some really, really interesting thoughts and ideas, and thoughts about the future. If people want to continue the conversation, where do they find you?

Davit Buniatyan: Well, they can find me on Twitter, LinkedIn, all the social networks. Fairly, also do a lot of work at Google AI. We have a blog that we publish, a lot of use cases and case studies with folks that we are working on. And I think today the best thing you can do is contribute to the Open Source and learn from the Open Source. And that also will uplevel everyone in terms of what's happening in the AI space especially.

There's a big difference to what's happening now versus what happened 10 or 20 years ago is that AI and Open Source are intertwined together, and this really boosts the innovation. I know this is a longer answer to what you asked for, but it's one more exciting thing I want to share is that I was talking to one of my friends at a large company, I won't share which one. But they basically spend maybe last four years on working on these generative AI applications generating images, and they definitely have state-of-the-art models that no any competition had. But they left that aside and said, "Hey, let us just switch to Open Source." Even though the Open Source models, they lack some performance and accuracy, the change that's happening, they're definitely going to outweigh what we are doing internally. So now our common wisdom of building the systems and models together is way more than any private organization can do. So I think we're living one of the exciting times in the whole humanity life horizon that we had. And yes, super excited to be part of this.

Shane Hastie: Love it. Thank you so much.

Davit Buniatyan: Likewise, Shane. Great chatting with you and thanks for the questions.

Mentioned

About the Author

Davit Buniatyan

Show moreShow less

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.