
From Raw Data to Data Science: Adding Structure to Unstructured Data to Support Product Development

Data science is fast becoming a critical skill for developers and managers across industries, and it looks like a lot of fun as well. But it’s pretty complicated - there are a lot of engineering and analytical options to navigate, and it’s hard to know if you’re doing it right or where the bear traps lie. In this series we explore ways of making sense of data science - understanding where it’s needed and where it’s not, and how to make it an asset for you, from people who’ve been there and done it.

This InfoQ article is part of the series "Getting A Handle on Data Science". You can subscribe to receive notifications via RSS.


Key takeaways

  • Levels of data “structure” exist on a scale, from unstructured raw machine logs to analysis-specific data tables, which are highly structured and designed to answer specific, currently relevant information needs.
  • As we move away from raw unstructured data, its volume and variety decrease, and with them the skill set needed to manipulate it, the time needed to prepare it for analysis, and the attention business users must invest to assess its value.
  • Before doing any data modelling or developing data pipelines, it’s useful to analyse the unstructured data and provide insight into a real business problem or two.
  • To add structure to unstructured data, we can look in three areas: the real-life human behaviour behind the data, any meta data we have about the user whose behaviour it is, and the inherent structure of the technology that is recording the behaviour.
  • To implement this structure in data, first define signals: quantitative properties of the concept that help us to identify its presence and nature. Then use the signals to distinguish between the values we’ve defined for our concept.


With unstructured database technologies like Cassandra, MongoDB and even JSON storage in Postgres, unstructured data has become remarkably easy to store and to process. Software and data engineers alike can succeed in a world (mostly) free from data modelling, which is no longer a prerequisite to collecting data or extracting value from it. We can start collecting data from various sources and at large scale without knowing why we’re collecting it or what value it might have towards helping us to achieve our overall product development and business goals.

For some readers, alarm bells will be ringing. Why collect data if you don’t know what its value is? After all, for the information manager, spending months adding structure by gathering business requirements, modelling the data and then surfacing it to business users has worked for decades. Other readers may have read the title and thought, “Why add structure at all?”. After all, for a data engineer the NoSQL revolution means that adding structure to data by modelling it for a relational data store is no longer necessary for most use cases.

While there is a place for both of these concerns, neither the information manager nor the leading edge data engineer is entirely right. The reality is that data will exist at varying levels of structure throughout our data ecosystems, and to make the right decisions in defining each level, we have to understand the costs and benefits of adding structure.

The people costs of completely unstructured data ecosystems

While the infrastructure cost of storing data is relatively immaterial, the people costs can increase exponentially with both data volume and variety.

With unstructured data, you will need more data engineers or new, highly skilled analysts than with traditional relational data stores just to make use of the data at all. For every new business need, analysts and business/product managers will have to spend time understanding the supply of data and how they might change their decisions based on it. After all, structured data is something like a menu, while unstructured data is more like a bulk grocery store.

Even if the analysts know their way around the data, each (new) business question will require data engineering skills for a potentially complex data preparation process, which every analyst knows demands a specific skill set and is immensely time consuming, leaving less time for analysis and supporting decisions.

In short, the people cost implications for unstructured data are that:

  • A complete absence of structure leaves little room for company-wide efficiency gains in the data preparation process, so more time will be spent answering each business question.
  • There is less room for regular support to the business from analysts without advanced data engineering skills, so more expensive analysts will need to be attracted and retained.

So a data ecosystem that relies completely on unstructured data would be a heavy burden on human resources indeed! This is true both in terms of hours spent in response to each business need and in terms of the skills you will need on staff in your business analyst teams.

But what would happen if the data ecosystem relied on structured data only?

The time costs of completely structured data ecosystems

In the traditional information management funnel, we start by collecting requirements, defining a target schema (i.e. what the structure of the data should look like) and developing pipelines from source to schema. Introducing new data into decisions is hardly an isolated process: time is spent preparing test data, evaluating it and forming consensus about whether it should find its way into routine decision making. All of this is usually done away from the immediate data needs of the business (i.e. not a live business case, but a historical or hypothetical one).

In a fully structured data ecosystem, all data sets must be understood and modelled before they are introduced into business decision making. If that work is required before the data can be used, then no data department will be able to keep up with the growing supply of and demand for data by its business users. Data that is absolutely essential to informing decisions about the present and future course of our business could be stuck in an information management pipeline before it gets in the hands of a data analyst providing decision support, never mind before it finds its way into the decision-making and actions of a business manager!

So while a traditional information manager may prefer to bring data through the IM funnel which has worked so well for 20+ years, they will never succeed in making data a source of competitive advantage. The development time for structured data stores will be too long and will require too much distraction from data and business resources.

So we want to design our data ecosystems to balance the agility and low cost-per-kb storage associated with NoSQL solutions against the efficiency gains and lower-cost analysts associated with adding structure via the IM funnel.

The business benefits of structure

People have a natural tendency to organise around information, and as such a business needs a great deal of structure in its information in order to achieve structure in its operations. For example, providing a common view of business performance helps everyone to organise around common objectives (i.e. less debate about what good performance is or how to measure it, and more debate about how to improve performance in specific areas).

Structure such as key performance indicators, standardised user/business/product metrics and even generic consumer behaviour data stores have an essential role to play in helping a business communicate and stay on target. Not only that, but structured data can also support communities of practice amongst your analysts. Since analysts have to operationalise each business concept to be scientific, establishing a common language and common operational definitions in your data is often essential for organisations with distributed data analysis teams. Analysts can then share insights and techniques, rather than debating how to translate common business concepts into data objects.

More generally, structure is a systematic way of collecting information about a specific real-life phenomenon or area. It is a precise set of requirements for that phenomenon: to enforce a structure is to say to any system trying to insert data about the phenomenon, “you need to have the following information (i.e. data objects), which in turn must have the following properties (i.e. data types)”.
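As a toy illustration of that contract (the schema and field names here are invented, not from any particular system), enforcing a structure can be as simple as checking that every inserted record carries the required data objects with the required data types:

```ruby
# A minimal, hypothetical schema: each required field (data object)
# must be present and hold a value of the declared class (data type).
USER_EVENT_SCHEMA = {
  user_id:    Integer,
  event_name: String
}

def conforms_to_schema?(record, schema = USER_EVENT_SCHEMA)
  schema.all? { |field, type| record[field].is_a?(type) }
end

conforms_to_schema?(user_id: 42, event_name: "login")   # => true
conforms_to_schema?(user_id: "42", event_name: "login") # => false, wrong type
```

A relational database enforces exactly this kind of contract at insert time; in a NoSQL store, any such check has to live in the application code.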

Striking balance

So while modern day data engineers and scientists may prefer a world without structure, if an organisation wants to embrace data in its decision making it must also embrace structure in its data.

And so we must find balance: balance between the agility and storage (cost per kb) benefits associated with NoSQL solutions and the efficiency gains in data preparation, the lower cost of business analysts and most importantly the unavoidable need for structure that makes data a common source of value in an organisation.

To do this we must define a data management funnel, whose purpose is neither to proliferate unstructured data nor to lock data up in a lengthy evaluation and structuring exercise. So what would that look like?

Data management funnel

The root problem underlying the tension between structured and unstructured data is one shared by data management and software management: the need to build tangible products quickly, to explore their value and learn about their strengths before significant investment. Data products could be any of the following:

  • Unstructured data set for data scientists
  • Structured data sets for data analysts
  • A data set of model outputs (e.g. price elasticity coefficients)
  • A self-service dashboard for business users or non-specialist analysts
  • A data-enabled product or platform capability (e.g. trained neural network underlying recommendation engines)

In the development of all of these products, we increasingly see software management principles make their way into data science with great success. Using unstructured data and a minimum viable product style project, data teams can evaluate both the value of the data and the extent to which structure must be added in order for the value to be realised by non-specialist analysts. This is often done using live data to solve a real business challenge, unlike the IM funnel where historical or hypothetical needs are used to define the idealised schema.

So what are the key questions we need to consider to adapt a product management style funnel to be a data management style funnel?

Key questions to answer to define your DM funnel stages

  1. When does a data set get applied to a live business problem?
  2. When and how does the data itself get evaluated for its structural needs and value?
  3. When does a structured data set get added to the production supply of data in your organization?

While you may want to define more stages, the following three stages effectively track the question areas that must be addressed.

Three stages of data maturity

1. Raw data: Massive volumes, entirely unstructured. Source: usually comes directly from a machine (e.g. server logs, API outputs). Access: limited, because of the skills needed to manipulate and analyse the data or for privacy reasons. Typical audience: data engineers or data scientists using NoSQL queries or advanced processing in Python.

2. Analysis friendly data: Massive to medium volumes, partially structured or structured data. Source: data objects are calculated in an extract-load-transform style process from raw data into tables or NoSQL collections, ideally organised by analysable units (e.g. a user-based collection with everything we know about users). Access: highly skilled analysts and data scientists only, as the size and variety of formats in this data make it difficult to query.

3. Management Reporting data: Entirely structured. Source: calculated from either the raw data or the analysis friendly data sets and includes both the numbers that business/product managers use to run the business (e.g. KPIs) as well as any business drivers at an aggregate level (e.g. when companies report their financial growth they often report the drivers of price, volume, foreign exchange, acquisitions, etc). Access: Open as data security policy will allow. Typical audience: Business intelligence analysts or even self-service analysis with business/product managers.

At each stage of the funnel, more business logic and terminology is added, and of course more structure is given to the data. Data objects become more predictable in both their format and their range of values, and the types of analysis tools used to consume this data are increasingly hungry for relational database form. There are also governance and privacy concerns, as earlier stages are more likely to contain personally identifiable and sensitive information than later stages. At each stage of the funnel you should ask yourself who has access to the data and what operations they can perform on it (although that subject is not addressed here).

The first stage of the funnel is a land without law, mostly managed for cost per kb, usually only accessible by the central core of data capabilities. Systems and data engineers throw the data in there for exploration by others. Populating data at this stage is low effort and as such this area can get messy quickly!

The final stage is the polar opposite, a land of highly governed, highly structured data, widely available (with the exception of market sensitive data in large publicly listed companies). Business/product managers use this data to measure the performance of their decisions and gain a high level understanding of the drivers, so by the time data reaches this stage it is almost always narrowly defined by definition and variable type.

The middle layer, where raw unstructured data is converted into something of partial structure and potentially high value, is the most important and also the most confusing. It’s this layer where the balance is struck, communities of practice are serviced, and deep insight into the behaviour of your consumers and the reasons behind your performance can be discovered.

To go from the first (raw data) layer to the second (analysis friendly) layer is daunting indeed! A single piece of unstructured raw data can populate an entire analysis friendly schema. For example, an image posted on Twitter can be broken into data objects for three separate SQL tables:

  1. Tweet image pixels, with columns position_x, position_y, red, green, blue
  2. Tweet, with columns post_text, datetime etc
  3. Tweet publisher, with columns account_name, followers, etc

Potentially all of that information is relevant to make use of the tweet data.
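A sketch of that flattening in Ruby (the record layout and field names here are illustrative, not Twitter’s actual API):

```ruby
# Split one unstructured tweet record into rows destined for the
# three tables above: tweet, tweet publisher, and tweet image pixels.
def flatten_tweet(tweet)
  {
    tweet_row:     { post_text: tweet[:post_text], datetime: tweet[:datetime] },
    publisher_row: { account_name: tweet[:account_name], followers: tweet[:followers] },
    # one row per pixel: position_x, position_y, red, green, blue
    pixel_rows:    tweet[:pixels]
  }
end

tweet = { post_text: "hello", datetime: "2017-01-01",
          account_name: "world", followers: 7_000_000_000,
          pixels: [{ position_x: 0, position_y: 0, red: 22, green: 131, blue: 251 }] }
rows = flatten_tweet(tweet)
```

One raw record fans out into three relational shapes, which is exactly why a single unstructured input can populate an entire analysis friendly schema.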

Rather than fleshing out the rest of the nuances of defining a funnel, the rest of this article speaks to those needing to make that first step away from the unstructured land without law, and answers the question: where to begin with raw unstructured data?

Solving the challenges in this layer means unlocking the potential to observe, to understand, to predict and control various aspects of your business, especially consumer behaviour, and therefore it’s worth spending time and effort on. Clearly the potential value of this process outweighs the resource requirements, or else data scientists would be out of a job!

So how can we go from unstructured chaos to something of value to our organisations, and an informed opinion about what structure is needed for that value to be repeated on a regular basis?

Working with unstructured data

The reality today is that it’s extremely difficult to know what the value of a data set is until it’s been analysed in the context of a real business challenge (past or present). Using data in a raw unstructured format and applying it to real business challenges can also be a more effective and efficient way to establish the ideal structure of the data, versus scoping out the data needs of business stakeholders and defining a data schema from there.

So before doing any data modelling or developing data pipelines, try developing an MVDP (minimum viable data product) and provide insight into a real business problem or two.

What are the common types of unstructured data suitable for an MVDP?

Common types of unstructured data

  • Web application data formats (e.g. JSON, HSTORE)
  • Document file formats (e.g. HTML, XML, pdfs)
  • Free form text (e.g. text from within a pdf, the text from a tweet or blog post)
  • Images (e.g. jpg, png)
  • GIS vector and raster (although there are databases with these variable types so I won’t count them as unstructured here)

Most readers will be familiar with web application data and the process of parsing the data from that format into a structured SQL database. The ubiquitous need to transform JSON and other such inputs from web interfaces into fragmented entries in SQL databases underlies much of the motivation to develop NoSQL solutions. Document file formats such as HTML and XML can be parsed with tools like Nokogiri. Most of the time, nearly all of a PDF is free form text (except document scans, of course). So of the common types listed above, images and free form text pose an industry-class problem for extracting value.

When it comes to text and image analysis, the realm of possible tools for analysis is seemingly endless. Rather than advocating for a specific tool, we’re going to understand a specific process. The process involves:

  1. Structuring the business problem conceptually
  2. Implementing the full space of concepts into computable signals
  3. Using data science techniques to establish a mapping from the space of signals to a single structured concept that is relevant to the business problem

Why bother with a process at all, why not just reference existing tools for text and images?

Of course, when analysing images, a perfect image recognition solution would be fantastic. In such an ideal solution, you could apply a method contains_couch and it would return a yes, no, or even better a .png of just the couch, along with a hash of its properties, e.g. {colour: brown}. In that same ideal world, when analysing text, a perfect natural language processing solution would have a method is_sarcastic or contains_political_opinion that would return a boolean answer or better yet, the text containing the sarcasm or reduced form of the political opinion itself. And we’d be done! Unstructured data to highly structured and specific business relevant variables.

In my experience, we’re not quite there yet. We may be close, but we’re not quite there yet. Not everyone can install a python package or ruby gem, query the text or images with the methods described above and get a robust enough result to make decisions on. In the world between these ideal solutions and the raw data, anything that enhances your understanding, moves you towards better products for your user or better decisions for your business/product managers is helpful!

Three places to look for structure conceptually in unstructured data

Part of what underlies the explosion of investment in data is not necessarily that there’s so much of it, but instead that behind most data is a human behaviour intermediated by some piece of technology. What we’re really investing in is systems of observation and understanding into human behaviour, particularly as it pertains to our business. We’re going to use this basic idea to define our approach to adding structure to unstructured data.

Three places to look for structure conceptually:

1. Meta data: Look for structure in the system of observation (i.e. the technology) that created the data. All technology operates in a highly structured way; did the system leave a record of structure (e.g. CSS tags in HTML or meta data in server logs)?

2. Data created by the user: Look for structure in the bespoke user interactions inherent in the software that created the data. Most software has unique and distinct interactions that translate to relevant structures when taken into context of that software. For example, a # doesn’t mean anything in general but on Twitter a hashtag is a functional mechanism that categorises the user created content by topic. It’s also used to tell jokes, e.g. #SorryNotSorry. Similarly, an @mention denotes an interaction between two people via their Twitter accounts. Are there unique features in the content created by users on your site or product? Do they correspond to distinct behaviours or concepts? A record of which of those features were used can provide a huge amount of structure to the behavioural state of the person at that time (e.g. location on the site, intended action from the site, interaction with others, etc).

3. Data about the users themselves: This is sometimes called “properties of the root”, and generally speaking refers to the object external to your system that causes a change in state of the system and at the same time creates the data. In the case of software users, the root is the person who interacted with the software.

Once we have established a structured concept (in one of the three areas above or elsewhere), how can we implement this structure of the concept in the data?

Steps two and three to add structure computationally

First, we define signals. Signals are quantitative properties of the concept that help us to identify the presence and nature of the concept. They are not the concept itself. For example, if all couches were brown, then observing any shade of brown in an image would be one property that signals the presence of the couch.

Second, we use the signals to distinguish between the different structured values we’ve defined for our concept by establishing the relationship between the signals we’ve computed and the structure of our concept. The relationship could be deterministic (e.g. contains_couch: TRUE or False) or probabilistic (e.g. contains_couch: 0.9).
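A sketch of the two styles of relationship, using the couch example (the threshold and the probability mapping are invented for illustration):

```ruby
# Deterministic: the signal fully determines the answer.
def contains_couch_deterministic(brown_ratio)
  brown_ratio > 0.2 # invented threshold
end

# Probabilistic: the signal only yields a degree of belief.
def contains_couch_probability(brown_ratio)
  # invented logistic-style mapping from signal strength to probability
  1.0 / (1.0 + Math.exp(-10 * (brown_ratio - 0.2)))
end

contains_couch_deterministic(0.35) # => true
contains_couch_probability(0.2)    # => 0.5
```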

When the relationship between signals and concept is probabilistic, it’s usually because there isn’t enough information in the signals to define the structure of the concept uniquely. An image that contains the colour brown is a signal of an image with a brown couch, but there will be a chance that it contains a couch and a chance that it does not. For that reason, practitioners will define as many independent signals as is possible, given the unstructured data that they’re working with. In doing so they must strike a balance between three competing factors: the numbers of signals, the independence of those signals and the overall computational cost of computing the signals and the relationship (independence is particularly important since two signals that contribute the same information, e.g. % of image dark_brown and % of image light_brown, won’t improve your ability to identify your concept).

To begin with, any quantitative property that corresponds with the concept could be helpful. If a signal is always present when the concept is present and never present when the concept is absent, then it would be a perfect signal of concept presence. If, furthermore, variations in the signal always corresponded to variations in the concept (e.g. different shades of brown for different colours of couch) then that signal can uniquely define your concept of the couch! In practice, an individual signal rarely reveals the concept as we wish to define it. Instead, it is very common that a large number of signals are used to define a single concept. Hundreds of signals are used in search engines to produce the results for a search, even though you only type a handful of words.

Defining signals is as much an art and leap of creative imagination as it is a science. On the other hand, once signals are defined, establishing how these signals distinguish between different states of your concept is more of a (data) science.

To make it through this final step, the first consideration is what data structure will be best to enforce on the data representing your concept? For example, a binary concept such as contains_a_couch, is_political would be a method that takes the signals as inputs and outputs a boolean response. On the other hand, a concept such as text_topic_list would be a method that takes the signals and outputs an array of topics (they could be topic ids, if you have a denormalized master table for the topics themselves).
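In Ruby terms, those two target structures might look like the following sketch (the signal names and threshold are invented for illustration):

```ruby
# Binary concept: takes signals as inputs, returns a boolean.
def is_political?(political_word_count:, total_word_count:)
  political_word_count.to_f / total_word_count > 0.1 # invented threshold
end

# List concept: takes signals, returns an array of topics.
def text_topic_list(hashtags:)
  hashtags.map { |h| h.delete("#").downcase }
end

is_political?(political_word_count: 5, total_word_count: 20) # => true
text_topic_list(hashtags: ["#Selfie", "#Travel"])            # => ["selfie", "travel"]
```

Deciding the output structure up front matters, because it constrains which mapping techniques (boolean classifiers versus multi-label extractors) are candidates for the final step.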

With a set of signals and a target data structured for your concept, machine learning is high on the list of go-to tools to establish your mapping, but not always necessary. In the example below we use a neural network style classification method based on numerical signals to demonstrate how this could be done.

Hello, World!

In my grad school courses on machine learning and AI, chairs and couches were always the text-book examples. This is because a mundane concept like a chair turns out to be simple to structure conceptually, and therefore nice to explore mathematically, but hugely complex to implement in practice. In our example, the situation will be the opposite: the concepts are incredibly complex to structure in practice but they are idealised in such a way that makes them simple to structure using simple methods.

For our example, we will be analysing the social media behaviour of a character called World. World is not a real person and the examples are heavily stylised to demonstrate the concepts we’ve discussed. The unstructured data will be a social media profile webpage in HTML format. The concept we’re going to use to add structure to this profile page is its Selfie-ish-ness, defined as a floating point number representing the percentage of images on World’s profile page that are selfies. I’ll be using Ruby to demonstrate the methods with code, but they can be implemented in Python or any other language that can process images.

Step 1 - Creating a structured data set based on web page meta data

WARNING! Web scraping is frowned upon and even violates the terms and conditions of the site in many cases. Before scraping any web pages, you should check if there’s an API that has the information and if not, read the terms and conditions to see if you can scrape the data and what the data can be used for if you do.

The first place to look for structure is in the technology that created the data. Web pages are highly structured, especially JavaScript and HTML, and a great deal of structure can come directly from there. In our example, we’ll be using the CSS tags in the .html page to parse out “posts”, “post_text” and “post_image”, all of which we’ll need to define how Selfie-ish the page is.

To do this we’ll parse the HTML with Nokogiri using the following method:

 def generate_structured_post_array(web_page_html)
   post_array = []
   css_selectors = {
     posts: ".post",
     post_img_url: "img_src"
   }
   @profile_html = Nokogiri::HTML(web_page_html)
   @profile_html.css(css_selectors[:posts]).each do |post|
     post_image_url = post[css_selectors[:post_img_url]] # you could open and save the image file to a server
     post_text = post.text.gsub(/\s+/, ' ')              # get rid of extra lines and spaces
     post_data = {
       post_image_url: post_image_url,
       post_text: post_text
     }
     post_array << post_data
   end
   post_array
 end

Already we’ve gone from a web page to an array of post JSON objects containing the relevant information from the account. This is enough structure to begin extracting relevant structured information; for example, the post_array length corresponds to the number of posts. HTML and XML files alike are ideal “unstructured” formats because the creators of these formats (machines or individuals) are using the syntax to add structure to information, so that the machine reading it can use this structure to visualise the information or perform other operations. In other words, the syntax is a contract between the creator and the consumer, and every contract requires a certain amount of predictable structure. It’s this structure that often contains the valuable meta data we need for our analysis.

Step 2 - Using the data generated by the user

So far, we’ve only established how many posts the account has and, for each post, what the post_image_url and post_text are. Images and free form text are traditionally unstructured formats as well, so for each post we need to add even more structure before we can establish whether the post contains a selfie.

Luckily for us, World is a methodical and unique looking individual who has created posts where the signals for a selfie are pretty clear. For example, all the selfies have World in them, and furthermore they are hashtagged with some sort of #selfie variant (e.g. #DangerousSelfie, #Selfie). However, it’s not as easy as looking at the hashtags because, as you can see in post 2, World also uses #SelfieGoals. It’s also true that World posts photos of himself that aren’t selfies. So we’ll need to use both signals to establish whether or not each post is a selfie.

When processing free text, it’s worth trying the simplest possible analysis you can think of, as this can go a long way towards solving your problem and can provide valuable insights into what your next (slightly) more complicated step might be. Never go straight to the most advanced solution (e.g. natural language processing) unless you are highly comfortable with the data and the subject matter area. If not, you can build comfort with both by analysing a few simple properties:

  • Binary single pass variables -> establishing whether or not the text possesses a certain property or doesn’t. For example, does it include the word selfie? Is it more than 100 characters? And so on.
  • Integer variables -> Any property of the text that has a number. e.g. number of “#”, character count, word count (i.e. number of spaces).
  • Text array -> Breaking the full free form text into an array of relevant words (or, more commonly, n-grams). There are popular open source packages that separate out various Twitter-style tokens (e.g. hashtags).
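A minimal sketch of these three signal types on one piece of text:

```ruby
text = "Look at this view! #DangerousSelfie #SelfieGoals"

# Binary single pass variable: does the text mention "selfie" at all?
contains_selfie_word = text.downcase.include?("selfie") # => true

# Integer variables: simple counts.
hashtag_count = text.count("#") # => 2
word_count    = text.split(" ").length

# Text array: pull out the hashtag tokens themselves.
hashtags = text.scan(/#\w+/) # => ["#DangerousSelfie", "#SelfieGoals"]
```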

In our case, we’ll need to separate out all the hashtags then calculate the binary single pass variable of contains_selfie as follows:

 def classification_topics_array(post_text)
   # drop the text before the first "#", then keep the first word after each "#"
   post_text.split("#").drop(1).map { |h| h.split(" ")[0] }.compact
 end

 def classification_possible_selfie(topics_array)
   keyword = "selfie"
   lower_threshold = 1
   keyword_count = 0
   topics_array.each do |t|
     keyword_count += 1 if t.downcase.include?(keyword)
   end
   keyword_count >= lower_threshold
 end

Next, it’s time to establish which images actually contain World. When processing images, once again, it’s worth trying the simplest possible analysis although in this case that may not get you anywhere. Unlike the previous step, building your own image processing capability that will suffice for your purpose may be difficult or impossible given your time and other resource constraints. Still, it’s worth going through the thinking here!

Step 3 - Using what we know about real life

Luckily, World is a unique individual indeed. Taking a step back to think about the selfie concept in real life, we ask ourselves: what quantitative properties of World uniquely define them visually? Well, World is made up of oceans, with a very unique blue, and land, with a very unique green. The scale of World in a photo is never the same, but the relative composition of oceans and land will be. Eureka! We’re in luck. The quantifiable attribute that uniquely defines World is the ratio of World Blue to World Green.

Even here, there will be some variation. When World is laughing, the eyes on World’s face become smaller and more of the blue and green is revealed. When World is scared, eyelids come down and, again, more blue and green are revealed. Neither case preserves the ratio exactly, so instead of defining World by a specific ratio value, we define World by a range (i.e. when the ratio of blue to green is within the tolerances, we can say the photo contains World).
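In code, that tolerance becomes a simple range test on the signal (the bounds here are placeholders, not the article’s final values):

```ruby
# A photo "contains World" only when the blue-to-green ratio falls
# inside the tolerated range (placeholder bounds).
WORLD_RATIO_RANGE = (1.8..2.3)

def within_tolerance?(ratio)
  WORLD_RATIO_RANGE.cover?(ratio)
end

within_tolerance?(2.0)   # => true
within_tolerance?(0.37)  # => false
```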

To implement this, I’m going to use an open source image processing package. There’s a helpful tutorial that will get you started and help you define the method to produce a HEX histogram (left as an exercise):

  def compute_colour_ratio_signal(hex_distribution, top_colour, bottom_colour)
	bottom_colour_volume = 1 # default of 1 avoids dividing by zero when the colour is absent
	top_colour_volume = 0
	hex_distribution.each do |pixel_volume, hex_code|
  	case hex_code
  	when top_colour
    	top_colour_volume = pixel_volume
  	when bottom_colour
    	bottom_colour_volume = pixel_volume
  	end
	end
	# multiply before dividing so the result is a Float, not integer division
	top_colour_volume * 1.0 / bottom_colour_volume
  end

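The HEX histogram itself is left as an exercise above; purely as one possible sketch, here is a pure-Ruby version that assumes you have already extracted each pixel’s hex code from the image, and that returns the [pixel_volume, hex_code] pairs the method above iterates over:

```ruby
# One possible shape for the exercise: count pixels per hex code and
# return [pixel_volume, hex_code] pairs.
def generate_hex_distribution(pixel_hex_codes)
  counts = Hash.new(0)
  pixel_hex_codes.each { |hex| counts[hex] += 1 }
  counts.map { |hex, count| [count, hex] }
end

generate_hex_distribution(["#1683FB", "#1683FB", "#2AFD85"])
# => [[2, "#1683FB"], [1, "#2AFD85"]]
```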
Once the signal is defined (a ratio), it’s quite common to use machine learning to establish the thresholds upon which the classification can be made.

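As a toy illustration of that idea (not the article’s method), thresholds could be derived from the signal values of labelled positive examples:

```ruby
# Toy threshold "learning": span the observed positive signal values,
# padded by a small margin.
def learn_thresholds(positive_ratios, margin = 0.05)
  [positive_ratios.min - margin, positive_ratios.max + margin]
end

lower, upper = learn_thresholds([1.97, 2.12, 1.84])
# lower sits just below 1.84, upper just above 2.12
```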
In this case, I’m going to eyeball it. This gives us the following classification method:

def classification_contains_world(image_url)
	world_blue = "#1683FB"
	world_green = "#2AFD85"
	upper_threshold = 2.3
	lower_threshold = 1.8
	@rmagic_image_object = Magick::Image.read(image_url).first # assumed: the original omitted the right-hand side; this loads the image with RMagick
	@hex_distribution = generate_hex_distribution(@rmagic_image_object) # hint: use the quantize method and loop through all the pixels
	colour_ratio = compute_colour_ratio_signal(@hex_distribution, world_blue, world_green)
	colour_ratio < upper_threshold && colour_ratio > lower_threshold
end

Running the methods against all the images gives the following output:

  • "image 1 had ratio 1.9686841390724195 contains_world and is possible_selfie"
  • "image 2 had ratio 0 DID NOT contains_world and is possible_selfie"
  • "image 3 had ratio 2.1200486683790727 contains_world and is possible_selfie"
  • "image 4 had ratio 1.8354139761802266 but IS NOT possible_selfie"
  • "image 5 had ratio 2.215403012087131 but IS NOT possible_selfie"

In our network, only when both nodes fire as true do we classify the image as a selfie. This happens only for images 1 and 3, and so the data representing our unstructured concept of “selfie-ish-ness” for World is 33% (2/6)!
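That two-node AND, and the resulting share, can be sketched as follows (the classifier outputs are stubbed here to match the run above):

```ruby
# Each image fires two nodes; an image counts as a selfie only when both are true.
results = [
  { possible_selfie: true,  contains_world: true  },  # image 1
  { possible_selfie: true,  contains_world: false },  # image 2
  { possible_selfie: true,  contains_world: true  },  # image 3
  { possible_selfie: false, contains_world: true  },  # image 4
  { possible_selfie: false, contains_world: true  },  # image 5
  { possible_selfie: false, contains_world: true  },  # image 6
]

selfie_count = results.count { |r| r[:possible_selfie] && r[:contains_world] }
selfie_share = selfie_count / results.size.to_f
# selfie_count => 2, selfie_share => roughly 0.33
```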

About the Author

Rishi Nalin Kumar is a co-founder and chief data scientist at a data-led digital marketing advisory and SaaS start-up. He’s also a chapter lead at DataKind UK, a data philanthropy charity that partners with third sector organisations to use data science in the service of humanity. In the past, he’s worked with companies such as Unilever, The Guardian and Spotify both to develop data capabilities, covering a variety of domains from product development to human rights, and to develop their data strategy, helping those organisations to introduce data science and data scientists into their everyday decision making.

