Machine learning is a data science technique used to extract patterns from data, allowing computers to identify related data and to forecast future outcomes, behaviors, and trends.
An important component of machine learning is that we take some data and use it to make predictions or to identify important relationships; in other words, we look for patterns.
Artificial intelligence: A broad term that refers to computers thinking more like humans.
Machine learning: A subcategory of artificial intelligence that involves learning from data without being explicitly programmed.
Deep learning: A subcategory of machine learning that uses a layered neural-network architecture originally inspired by the human brain.
Raw data, however, is often noisy and unreliable and may contain missing values and outliers. Using such data for modeling can produce misleading results. For the data scientist, the ability to combine large, disparate data sets into a format more appropriate for analysis is an increasingly crucial skill.
Collect Data: Query databases, call web services or APIs, and scrape web pages.
Prepare Data: Clean data and create features needed for the model.
Train Model: Select the algorithm and prepare the training, testing, and validation data sets. Set up training pipelines, including feature vectorization, feature scaling, and parameter tuning, and measure model performance on the validation data using evaluation metrics or graphs.
Evaluate Model: Test & compare the performance of models with evaluation metrics/graphs on the validation data set.
Deploy Model: Package the model and its dependencies. As part of DevOps, integrate the training, evaluation, and deployment scripts into the respective build and release pipelines.
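The train and evaluate steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it uses numpy only, with synthetic data and a linear-regression model fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Collected" data: 100 samples, 2 features, linear target plus noise
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)

# Prepare: split into training and validation sets
X_train, X_val = X[:80], X[80:]
y_train, y_val = y[:80], y[80:]

# Train: fit weights w by least squares
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Evaluate: mean squared error on the validation data set
mse = np.mean((X_val @ w - y_val) ** 2)
print(round(float(mse), 3))
```

A real pipeline would add feature scaling, hyperparameter tuning, and a held-out test set, but the collect / prepare / train / evaluate flow is the same.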
Tabular data is data arranged in rows and columns in a data table; it is the most common type of data in machine learning. Typically, each row describes a single item, while each column describes a property the items can have. For example, in a table of products, each row describes a single product (e.g., a shirt), while each column describes a property the products can have (e.g., the color of the product).
Row: An item or entity.
Column: A property that the items or entities in the table can have.
Cell: A single value.
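The row/column/cell vocabulary can be made concrete with a tiny table held as plain Python data (illustrative only; in practice a library such as pandas is typically used, and the products here are made up):

```python
# Each dict is a row (an item), each key is a column (a property),
# and each value is a cell.
products = [
    {"id": 1, "type": "shirt", "color": "red",  "price": 19.99},
    {"id": 2, "type": "shirt", "color": "blue", "price": 21.50},
]

row = products[0]                        # one item (a shirt)
colors = [p["color"] for p in products]  # one column across all items
cell = products[1]["price"]              # a single value
print(row["type"], colors, cell)
```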
It is important to know that in machine learning we ultimately always work with numbers or specifically vectors.
A vector is simply an array of numbers, such as (1, 2, 3), or a nested array that contains other arrays of numbers, such as (1, 2, (1, 2, 3)).
For now, the main points you need to be aware of are that:
- All non-numerical data types (such as images, text, and categories) must eventually be represented as numbers
- In machine learning, the numerical representation will be in the form of an array of numbers — that is, a vector
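Both points can be shown in a few lines of numpy. The color-to-code mapping below is a hypothetical encoding chosen for illustration:

```python
import numpy as np

# A plain vector of numbers
v = np.array([1, 2, 3])

# Non-numerical data (categories) mapped to numbers, then stored as a vector
colors = ["red", "green", "blue"]
codes = {c: i for i, c in enumerate(colors)}   # hypothetical code assignment
encoded = np.array([codes["blue"], codes["red"]])

print(v.shape, encoded.tolist())
```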
Scaling data means transforming it so that the values fit within some range or scale, such as 0–100 or 0–1. Scaling does not distort the relative differences in the data, since every value is scaled in the same way, but it can speed up the training process.
Two common approaches to scaling data:
Standardization rescales data so that it has a mean of 0 and a standard deviation of 1. The formula for this is:
(x − μ)/σ
Normalization rescales the data into the range [0, 1].
The formula for this is:
(x − x_min)/(x_max − x_min)
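Both formulas are one-liners in numpy; the sketch below applies them to a small sample vector:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Standardization: (x - mean) / std  ->  mean 0, standard deviation 1
standardized = (x - x.mean()) / x.std()

# Normalization: (x - min) / (max - min)  ->  values in [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

print(normalized.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
```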
When we have categorical data, we need to encode it in some way so that it is represented numerically.
There are two common approaches for encoding categorical data:
- Ordinal encoding: convert the categorical data into integer codes ranging from 0 to (number of categories − 1). One potential drawback of this approach is that it implicitly assumes an order across the categories.
- One-hot encoding: transform each categorical value into a column. One drawback of one-hot encoding is that it can potentially generate a very large number of columns.
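Both encodings can be sketched with numpy (the color values are made up; libraries such as scikit-learn provide these encoders ready-made):

```python
import numpy as np

colors = ["red", "blue", "red", "green"]
categories = sorted(set(colors))           # ['blue', 'green', 'red']

# Ordinal encoding: integer codes 0 .. (number of categories - 1)
ordinal = np.array([categories.index(c) for c in colors])

# One-hot encoding: one column per category, a 1 marking the category
one_hot = np.eye(len(categories), dtype=int)[ordinal]

print(ordinal.tolist())    # [2, 0, 2, 1]
print(one_hot.tolist())
```

Note how one-hot encoding turns 3 categories into 3 columns; with thousands of categories this column explosion is the drawback mentioned above.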
An image consists of small tiles called pixels. The color of each pixel is represented with a set of values:
- In grayscale images, each pixel can be represented by a single number, which typically ranges from 0 to 255. This value determines how dark the pixel appears (e.g., 0 is black, while 255 is bright white).
- In colored images, each pixel can be represented by a vector of three numbers (each ranging from 0 to 255) for the three primary color channels: red, green, and blue. These three red, green, and blue (RGB) values are used together to decide the color of that pixel. For example, purple might be represented as
(128, 0, 128), a mix of moderately intense red and blue, with no green.
Color Depth or Depth:
The number of channels required to represent a color in an image.
- RGB depth = 3 (i.e., each pixel has 3 channels)
- Grayscale depth = 1
Encoding an Image:
We need to know the following three things about an image in order to encode it:
- Horizontal position of each pixel
- Vertical position of each pixel
- Color of each pixel
We can fully encode an image numerically by using a vector with three dimensions. The size of the vector required for any given image would be:
Size of Vector = height × width × depth
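The size formula can be checked directly on a small, hypothetical 4×3 RGB image represented as a numpy array:

```python
import numpy as np

# A hypothetical 4x3 RGB image: height x width x depth
image = np.zeros((4, 3, 3), dtype=np.uint8)
image[0, 0] = [128, 0, 128]     # a purple pixel at the top-left

height, width, depth = image.shape
size_of_vector = height * width * depth
print(size_of_vector)           # 4 * 3 * 3 = 36
```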
Image data is often normalized by subtracting the per-channel mean pixel values.
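Per-channel mean subtraction can be sketched as follows; the pixel values are made up, and real pipelines typically use means computed over the whole training set rather than a single image:

```python
import numpy as np

# Hypothetical 2x2 RGB image with float pixel values
image = np.array([[[10., 20., 30.], [30., 20., 10.]],
                  [[50., 60., 70.], [70., 60., 50.]]])

# Per-channel mean, averaged over the height and width axes -> shape (3,)
channel_means = image.mean(axis=(0, 1))

# Subtract each channel's mean from that channel's pixels
normalized = image - channel_means
print(normalized.mean(axis=(0, 1)).tolist())   # each channel now has mean 0
```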
Other Preprocessing Steps:
In addition to encoding an image numerically, we may also need to do some other preprocessing steps. Generally, we would want to ensure the input images have:
- Uniform aspect ratio
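One simple way to enforce a uniform (1:1) aspect ratio is a center crop to a square; the sketch below is numpy-only and assumes an H×W×C array (libraries such as Pillow or OpenCV are normally used for this):

```python
import numpy as np

def center_crop_square(img):
    """Crop an H x W x C image to a centered square (1:1 aspect ratio)."""
    h, w = img.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return img[top:top + side, left:left + side]

img = np.zeros((6, 4, 3), dtype=np.uint8)   # 6x4 image, 3:2 aspect ratio
square = center_crop_square(img)
print(square.shape)                          # (4, 4, 3)
```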