I hope this article helps anyone trying to launch their own AI or data science project in a domain of their choice. I see it as an opportunity to share my experience and what I think helps when you have to collect data yourself. It is also my humble attempt to support everyone who feels irritated at the thought of data collection, and to keep their hopes and enthusiasm for data science and AI alive through that stage.
To be honest, when I first entered the world of AI and data science, I pictured myself using all those cool machine learning algorithms to create solutions that had an impact on lives across the globe.
My thinking was, “After I master all these algorithms, I’m gonna be ready to rock.” But as you might have guessed, reality hit me.
As it turned out, all my confidence came from perfectly cleaned and balanced datasets, the kind that are just waiting to be fed into one of those magical algorithms to produce mind-boggling results. They had filled me with pseudo-confidence.
Reality knocked on my front door when I was assigned to an analytics project aimed at predicting a political candidate’s probability of success. That was the moment I realized I was as hollow as a bubble. I had so many algorithms in my head, but none of them helped. What’s the use of a gun without a bullet? I needed data to use any of the algorithms I knew. With any form of data I could have done something, but there was absolutely nothing!
I had two tasks at hand: collect the data and make the prediction.
Until I actually thought about it, collecting data felt easy. But again, you guessed it, reality came after me, this time in the form of questions. What data should I collect? Which questions or predictors would help me better predict the candidate’s probability of success? How should that probability be defined? What factors influence its value, and what role does social psychology play? Things started going haywire. I was generating more questions than answers.
I spent hours reading about my regional politics and journals of psychology and neuroscience, not to write the survey questionnaire but to decide what to include and what to leave out. One interesting thing I found is that humans retain the memory of a negative action more than that of a positive one (Elizabeth A (2007)), something politicians love to babble about and that sometimes helps them win elections. But we are not talking politics here…
The next question was how to put the questions to the respondents. If they know we are collecting data for a specific party, chances are they will modify their responses and we will lose some highly valuable predictors. We all know that the performance and reliability of a prediction algorithm do not depend on tuning alone; data quality plays a large part. So, after the “what”, the “how” started creating problems too.
People are just one side of the story
Soon, political publications, news articles and magazines led me to the role the opponent plays in a candidate’s success. This led me to another realization: all the research and questionnaire preparation I had done for the voters had to be done for the political opponents too. In other words, there are “what”s and “how”s for the opponents as well. All of this led me to the idea of a data collection framework, so that I would not have to scratch my head again on upcoming projects that would otherwise gift me late nights.
Finally, the reason behind this post. Before going any further we need to make some things clear. From here on, I will be using three terms:
- Organization: the one for which we are doing the analytics project
- Customer: the ones at the receiving end, the users of the solutions the organization creates
- Enabler: the data scientists working on the problem, who act as a bridge between the organization and the customer to deliver the organization’s products and services
Any analytics project has two perspectives, or elements: the organizational perspective and the customer perspective.
The organizational perspective helps us identify and define the “what”s and “how”s of data collection on the organization’s side. The first step is to identify the organization’s domain, which filters out a great many “what”s. The next step is to understand the behavior of the organization: is it a service business or a manufacturer? This helps you narrow down the variables. For example, in the manufacturing sector, time series of employee turnover, scrap volume per day and machine usage hours per day will give you an idea of how the organization behaves. Such behavioral data will guide you to the “what”s and “how”s to a great extent, and with some research on top you will do just fine.
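To make the idea of organizational behavioral data a little more concrete, here is a minimal sketch of how one such daily time series could be recorded. Everything in it is an assumption made for illustration: the `BehaviorSeries` class, the variable name `scrap_volume`, the unit and the three values are made up and do not come from a real project.

```python
from dataclasses import dataclass, field
from datetime import date
from statistics import mean
from typing import Dict


@dataclass
class BehaviorSeries:
    """One organizational behavioral variable tracked per day (hypothetical helper)."""
    name: str
    unit: str
    observations: Dict[date, float] = field(default_factory=dict)

    def record(self, day: date, value: float) -> None:
        # Store one observation per calendar day; a repeated day overwrites the old value.
        self.observations[day] = value

    def average(self) -> float:
        # Simple summary of how the organization "behaves" on this variable.
        return mean(self.observations.values())


# Illustrative values only, not real measurements.
scrap = BehaviorSeries(name="scrap_volume", unit="kg/day")
scrap.record(date(2024, 1, 1), 12.0)
scrap.record(date(2024, 1, 2), 18.0)
scrap.record(date(2024, 1, 3), 15.0)

print(scrap.average())  # prints 15.0
```

The same shape would work for employee turnover or machine usage hours; only the name and unit change.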
Thus, using the organizational perspective is like having the organization dictate what it needs, which basically helps shape your goal. But this view is rather narrow, so we move to the second perspective.
The customer perspective works exactly like the organizational perspective, but with customers. Here we deal with customer behavioral data: What is their gender? Their age group? How many previous associations have they had with competing firms? How financially sound are they? Do they love to experiment with the latest products?
For instance, the number of companies an employee has worked for, the number of disciplinary actions against them and the quality of the units they machine all belong to the customer behavioral data of a worker in a manufacturing industry. Please don’t confuse this with the everyday meaning of the word. By customer I mean whoever receives and uses the solution the organization provides (say, advanced manufacturing units for machining intricate parts in an industrial setting) to overcome a manufacturing challenge. A customer can be an immediate user or an end user.
The definition of customer varies with your objective.
Therefore, the customer shapes the right solution and helps you find the “what”s and “how”s. Again, this perspective is narrow too, so we need the best of both.
Hybrid Perspective — Best of Both
In my opinion this is the best approach, because it lets you see things in all dimensions: you get the big picture. How you approach it depends on how hard it is to acquire the organizational or the customer perspective. If either one is difficult to obtain, it is always better to start with the customer perspective. In my experience, customers can tell you far more about the organization than the organization itself.
As per the framework, the political environment in which my candidate competes is the organization, and it can basically be classified as a service industry, since politics helps govern our state or country. The public image of the political party and the track records of the candidate and of the opponent or opponents are some of the organization’s behavioral data and contribute to the organizational perspective. Age group, unemployment rates, the share of each religion in the population and so on are part of the customer’s behavioral data and contribute to the customer perspective. Combine the two and you have a hybrid perspective at hand, and you can be much more confident than before that you have not knowingly missed anything.
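If it helps to see the framework written down, the political example could be sketched roughly like this. This is only one possible way to lay it out, not a definitive implementation: the `Perspective` class, the `hybrid_plan` function, the `accessible` flag and the variable names are all hypothetical, and listing the customer variables first is my reading of the advice to start with the customer perspective.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Perspective:
    """One side of the framework and the variables we would like to collect (hypothetical)."""
    name: str
    variables: List[str] = field(default_factory=list)
    accessible: bool = True  # assumption: can we realistically collect this side's data?


def hybrid_plan(organization: Perspective, customer: Perspective) -> List[str]:
    """Merge both perspectives into one collection list, customer side first."""
    plan: List[str] = []
    for perspective in (customer, organization):
        if not perspective.accessible:
            continue  # skip a side we cannot reach yet
        for variable in perspective.variables:
            if variable not in plan:  # avoid listing a variable twice
                plan.append(variable)
    return plan


# Variables taken from the political example; the identifiers are made up.
organization = Perspective(
    "organization",
    ["party_public_image", "candidate_track_record", "opponent_track_record"],
)
customer = Perspective(
    "customer",
    ["age_group", "unemployment_rate", "religious_share"],
)

plan = hybrid_plan(organization, customer)
print(plan)
```

Setting `accessible=False` on one side leaves you with a single-perspective plan, which is the narrow view described earlier.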
The ultimate takeaway is that there are only two sides to any analytics project, the organization and the customer, and this framework is general enough to apply to almost any analytics project.