Credit: Google News
Prashanth Southekal, managing principal at DBP Institute, hosted a workshop last month at the Enterprise Data World 2019 Conference on applied machine learning techniques and when to use different ML algorithms.
Machine Learning (ML) enables computers to automatically learn and adapt using large volumes of data. Southekal talked about the five main types of analytics and the three types of machine learning. He also discussed ML algorithms such as Decision Trees, Support Vector Machines (SVM), Logistic Regression, Linear Regression, and Clustering.
InfoQ spoke with Southekal about his conference session and data analytics in the area of applied machine learning.
InfoQ: How do you classify data and what type of analytics is performed on each type of data?
Prashanth Southekal: Broadly, data, especially in business, can be classified in three main ways. Firstly, from the data storage and processing perspective, business data can be classified into structured data and unstructured data. Secondly, from the data integration point of view, business data can be reference data for managing categories like plants and geographies, master data for managing business entities like vendors and products, and transactional data for capturing business events like purchase orders and invoices. Thirdly, from the data analytics perspective, business data can be classified into nominal data for managing categories like product descriptions, ordinal data for capturing ordered data sets like payment terms and delivery priority, and continuous data for handling price and quantity.
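The analytics-oriented classification above (nominal, ordinal, continuous) matters in practice because each type is usually prepared differently before it reaches an algorithm. The following is a minimal sketch of that idea, not something shown in the workshop: the record, the `PRODUCTS` list, the `PRIORITY_RANK` mapping, and the `encode` function are all hypothetical names chosen for illustration.

```python
# Hypothetical purchase-order record with one field of each analytics data type.
record = {"product": "steel pipe", "delivery_priority": "high", "price": 125.50}

# Nominal: categories with no inherent order, so each gets its own 0/1 column.
PRODUCTS = ["copper wire", "steel pipe", "valve"]

# Ordinal: categories with a meaningful order, so a rank preserves that order.
PRIORITY_RANK = {"low": 0, "medium": 1, "high": 2}


def encode(rec):
    """Turn one record into a flat numeric feature list."""
    one_hot = [1 if rec["product"] == p else 0 for p in PRODUCTS]
    rank = PRIORITY_RANK[rec["delivery_priority"]]
    # Continuous: price is already numeric and is passed through unchanged.
    return one_hot + [rank, rec["price"]]


features = encode(record)
print(features)
```

One-hot encoding and rank mapping are common choices for nominal and ordinal fields, but they are assumptions here; the interview does not prescribe a specific encoding.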
Now coming to the second part of your question: what type of analytics is performed on each type of data. In my view, analytics is about using data to answer the questions you have. So the questions you ask are very important in analytics. The response to these questions comes from the algorithms, and the selection of the algorithm is based on the data type. For example, if the question is “Will the shipment be delivered on time?”, the response will be “Yes/No” and the answer will be derived using the logistic regression algorithm. On the other hand, if the question is “How long will it take for the shipment to be delivered?”, the response will be a numeric value, which will potentially be derived using the linear regression algorithm.
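To make the two shipment questions concrete, here is a minimal sketch that fits both models on the same small data set using only the Python standard library. The data values, the function names `fit_linear` and `fit_logistic`, and the learning-rate and epoch settings are assumptions made for illustration; a real project would more likely rely on an established ML library.

```python
import math

# Hypothetical shipment history: distance (hundreds of km), delivery time (days),
# and whether the shipment arrived on time (1 = yes, 0 = no).
distance = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
days = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
on_time = [1, 1, 1, 0, 0, 0]


def fit_linear(xs, ys):
    """Ordinary least squares for one feature: y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - b * mean_x, b


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Logistic regression with one feature, fitted by batch gradient descent."""
    w0 = w1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w0 + w1 * x) - y  # gradient of the log loss
            g0 += error
            g1 += error * x
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
    return w0, w1


# "How long will it take?" -> numeric answer from linear regression.
a, b = fit_linear(distance, days)
predicted_days = a + b * 7.0

# "Will it be delivered on time?" -> yes/no answer from logistic regression.
w0, w1 = fit_logistic(distance, on_time)
p_short = sigmoid(w0 + w1 * 2.0)  # probability of on-time delivery, short route
p_long = sigmoid(w0 + w1 * 5.0)   # probability of on-time delivery, long route
print(predicted_days, p_short >= 0.5, p_long >= 0.5)
```

The same input feature feeds both models; only the type of the target (continuous days versus a yes/no flag) changes which algorithm applies.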
InfoQ: Can you talk about some of the data quality dimensions and how they influence the data quality?
Southekal: Data Quality is the assessment of data’s fitness to serve its purpose in a given context. In my view, there are 12 data quality dimensions: Completeness, Consistency, Validity, Cardinality, Accuracy, Correctness, Accessibility, Security, Timeliness, Redundancy, Coverage, and Integrity. In my book, Data for Business Performance, I have explained these data quality dimensions in detail. However, data quality doesn’t mean that all 12 dimensions should be satisfied all the time. The selection of the data quality dimensions depends on the fitness, purpose, and context.
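As a rough illustration of how two of these dimensions might be measured, the sketch below scores a small vendor table for completeness and redundancy. The definitions used (completeness as the share of populated cells, redundancy as the count of exact duplicate rows) are common working assumptions, not definitions taken from the interview or the book, and the sample rows and function names are hypothetical.

```python
# Hypothetical vendor master data with a missing value, an empty string,
# and one exact duplicate row.
rows = [
    {"vendor_id": "V001", "country": "CA", "payment_terms": "NET30"},
    {"vendor_id": "V002", "country": None, "payment_terms": "NET45"},
    {"vendor_id": "V001", "country": "CA", "payment_terms": "NET30"},
    {"vendor_id": "V003", "country": "US", "payment_terms": ""},
]


def completeness(table):
    """Share of cells that hold a value (assumed definition)."""
    cells = [value for row in table for value in row.values()]
    filled = [value for value in cells if value not in (None, "")]
    return len(filled) / len(cells)


def duplicate_count(table):
    """Number of rows that exactly repeat an earlier row (assumed definition)."""
    seen = set()
    duplicates = 0
    for row in table:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


print(round(completeness(rows), 3), duplicate_count(rows))
```

Which of these scores matters, and what threshold is acceptable, depends on the purpose and context of the data, as the answer above notes.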
InfoQ: What are the considerations when selecting an ML solution?
Southekal: In my view, a solution is deemed to be an ML solution if it meets four key criteria:
- The output is REFINED CONTINUOUSLY, i.e., data ingestion into the ML algorithm is ongoing.
- There is MINIMAL (OR EVEN ZERO) HUMAN INTERVENTION in deriving and applying the output.
- The output is PROBABILISTIC as the solution is geared toward the FUTURE STATE.
- The output provides answers to questions on mainly EVENTS or TRANSACTIONS (over entities or categories).
InfoQ: Can you discuss the four types of ML algorithms you covered in your workshop: Regression, Classification, Clustering, and Association?
Southekal: There are hundreds of ML algorithms, but I selected these four types, i.e., Regression, Classification, Clustering, and Association, as they are very commonly used in business.
- Regression algorithms help in predicting the value of a dependent variable based on a set of independent variables.
- Classification algorithms take the input data and classify each observation into the appropriate group.
- Clustering algorithms help in assigning a set of observations to clusters based on similar characteristics.
- Association algorithms uncover how items are associated with each other.
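Clustering and association are the two families in this list that work without a labeled target, so a short sketch may help show what each produces. The code below uses a simple one-dimensional k-means for clustering and plain support and confidence measures for association. The data, the function names, and the choice of k-means and support/confidence are illustrative assumptions; the interview does not name specific algorithms for these two families.

```python
def kmeans_1d(points, k=2, iterations=10):
    """Tiny 1-D k-means (requires k >= 2); returns (centroids, labels)."""
    low, high = min(points), max(points)
    # Deterministic start: spread initial centroids evenly between min and max.
    centroids = [low + (high - low) * i / (k - 1) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: abs(p - centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """How often consequent appears in transactions that contain antecedent."""
    return support(antecedent | consequent, transactions) / support(
        antecedent, transactions
    )


# Clustering: hypothetical delivery times (days) fall into two natural groups.
delivery_days = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids, labels = kmeans_1d(delivery_days, k=2)

# Association: hypothetical baskets; how strongly does bread imply butter?
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter"},
]
pair_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)
print(labels, round(pair_support, 2), round(rule_confidence, 2))
```

Clustering returns a group label per observation, while association returns the strength of a rule between items, which is why the two answer different business questions.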
InfoQ: Do you have any recommendations for database professionals who want to learn machine learning technologies?
Southekal: Pick a technology that you and your company can easily access or acquire. For example, if you are a Procurement specialist working in a company where the procurement activities are done in SAP ERP, it is better to leverage SAP’s Analytics tools like BI/BOBJ, Leonardo, etc. You will have a head start as the data needed for Analytics is already in your SAP landscape and you have access to the SAP ecosystem. If you are just starting your career, try R or Python, as both are open-source tools with a large community. But always focus on the APPLICATION of the tool to problems, not on learning the tool per se. Along with skills in technology, build good skills in Statistics and Linear Algebra. Statistics is needed for Descriptive Analytics, while Linear Algebra and Statistics are needed for Predictive Analytics and ML. Also, there are plenty of great free learning materials on the internet. Try them first before enrolling in expensive courses.