Data Preparation Pipelines: Strategy, Options and Tools
Data preparation is an important part of data processing and analytics use cases. Business analysts and data scientists spend about 80% of their time gathering and preparing data rather than analyzing it or developing machine learning models. Kelly Stirman spoke last week at the Enterprise Data World 2017 Conference about data preparation best practices.
Stirman talked about the differences between data preparation and data integration efforts. Data preparation is primarily done by business analysts using tools like Alteryx, Trifacta, and Paxata. Data integration has long been an essential function for IT and is performed by IT teams with the help of tools from vendors like Informatica, IBM, and SAS, as well as SQL tools.
Data integration is mature and robust, and it integrates with enterprise standards, security, and governance controls. It’s server based, so it’s more centralized and more scalable. But it also has some limitations: it’s for IT users only, and it assumes minimal data quality. It’s mature for enterprise sources but less mature for cloud, third-party apps, Hadoop, and NoSQL databases.
On the other hand, data preparation prioritizes speed and ease of use, and offers faster time to value. It’s based on a data-centric model (vs. a metadata-centric model) and works for both IT and business users. It supports different data processing environments like Hadoop, NoSQL databases, cloud, and machine learning. Its limitations include a less mature tech stack, a limited ecosystem of integrations and skills, less comprehensive security integrations, and a continued need for IT on-boarding and process coordination.
Stirman discussed a variety of open source and commercial tools for use by different types of users like business users, data scientists and software developers, and the pros and cons of each tool. Open source tools like Apache Spark, Pandas (Python) and dplyr (R) help data scientists and developers in the preparation of data.
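To make the open source option concrete, here is a minimal sketch of the kind of preparation step a data scientist might write with pandas. It is not taken from the talk: the sample records, the column names, and the prepare function are all hypothetical, chosen only to show common cleanup tasks such as normalizing text, coercing types, and removing incomplete or duplicate rows.

```python
import pandas as pd

# Hypothetical raw records with typical quality problems:
# stray whitespace, inconsistent casing, a duplicate row,
# a non-numeric amount, and a missing customer name.
raw = pd.DataFrame(
    {
        "customer": [" Alice ", "bob", "bob", "Carol", None],
        "region": ["east", "West", "West", "EAST", "west"],
        "amount": ["100", "250.5", "250.5", "n/a", "75"],
    }
)


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df (illustrative helper, not a library API)."""
    out = df.copy()
    # Normalize text columns so equal values compare equal.
    out["customer"] = out["customer"].str.strip().str.title()
    out["region"] = out["region"].str.lower()
    # Convert amounts to numbers; unparseable values become NaN.
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    # Drop rows missing a customer or a usable amount, then exact duplicates.
    out = out.dropna(subset=["customer", "amount"])
    out = out.drop_duplicates()
    return out.reset_index(drop=True)


clean = prepare(raw)

# A simple downstream analysis on the prepared data.
totals = clean.groupby("region")["amount"].sum()
print(clean)
print(totals)
```

Dropping incomplete rows is only one possible policy. Depending on the use case, a team might instead impute missing values or route bad records to a review queue.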
Some of the factors to consider when looking for data preparation solutions and tools are usability, collaboration, license model, governance, complexity, vendor viability, and ecosystem.