Dataiku recently released version 4.2 of its Data Science Studio (DSS), a collaborative data-analysis and predictive-analytics platform. The release ships with pre-trained deep-learning models for image processing, which can be further adapted to proprietary datasets through transfer learning.
The DSS platform covers all the steps of an end-to-end data-science project, from connectivity, data wrangling and visualization to machine learning and production deployment. Its machine-learning module supports standard libraries such as scikit-learn, XGBoost, MLlib and H2O. Developers can also connect to a Hadoop cluster and integrate multiple Spark engines.
DSS is tailored for common use cases in predictive analytics such as demand forecasting, lifetime-value optimization, churn analytics and fraud detection. Dataiku customers include companies such as General Electric, L'Oréal and Unilever. The company was named a "visionary" in the Gartner 2018 Magic Quadrant for Data Science Platforms for the second consecutive year.
InfoQ sat down with Florian Douetteau, CEO of Dataiku, to learn more about the company and its flagship product.
InfoQ: Please give us some background on Dataiku's DSS, your data science platform. What's the technology behind it?
Florian Douetteau: Dataiku is software that users download and install on their own infrastructure. For many customers that means the cloud, but for others it's still their own data center (the split is about 50/50).
We are generally guided by the deployment constraints and challenges of our clients, so we have to keep our product as simple as possible. Our architecture is multi-process but also monolithic, in the sense that it's self-contained. Basically, the solution embeds everything it needs, including its databases, SQLite and H2. We code primarily in Java, which is one of the principal languages of big data and is considered a good compromise between performance and productivity.
On the back end, you'll find a web server that does job scheduling, storage and management of metadata, and search indexing. We also have some Python and R processes as well as, obviously, Spark processes. And on the front end, we use a single-page application (SPA) in AngularJS.
InfoQ: Who is the typical user of the platform? Can the marketing or sales department use it or does it require some level of data science expertise?
Douetteau: The great thing about Dataiku is that it is for everyone within an enterprise who uses or interacts with data. Of course, there are lots of features specifically for those with coding and data science expertise - they can use their favorite big data programming languages for more advanced and custom work. But we also have many data scientists on the platform who combine those coding features with the point-and-click visual interface, because at times it can just be more efficient.
The visual interface in Dataiku allows analysts and other non-technical users to go from connecting data sources to data wrangling, applying machine learning models, visualization and more without writing a single line of code. For larger teams with lots of analysts, this is great because it lets the work scale across many people. For small teams that may not have a data scientist, it allows for a lot of flexibility as well.
InfoQ: With the release of DSS 4.2, you include deep-learning-based image recognition. What is your deep learning product strategy?
Douetteau: Throughout 2018 and into 2019, we'll have our sights set on helping businesses remove the roadblocks standing in the way of productionalized data projects while also providing the structure and stability necessary for long-term success. This means an accelerated focus on deep learning, AI, and deployment to production in our product development roadmap.
InfoQ: How do you handle machine learning at scale with big data and deep learning in terms of computing power and storage?
Douetteau: In terms of performance, because Dataiku is software that users download and install on their own infrastructure, it's up to the client to deploy multiple instances to ensure good performance.
And this is exactly what we support - scaling out and adding new nodes. Convincing our customers to trust us with hosting their data would be complicated, especially since we're focused on large, international enterprises. But on top of that, there are also underlying technical issues; for example, when it comes to processing as close to the data as possible, SaaS is not a good solution. On the other hand, this works well in the cloud - we integrate with AWS, Microsoft Azure, and GCP via their managed Hadoop solutions.
Dataiku will be present at several big data and AI events in the US in the coming months, including the next Spark Summit in San Francisco on June 4. A free version of the studio is available on Dataiku's website.