As a data scientist, I have worked on several machine learning and deep learning projects in the computer vision field. In each project, I asked myself how to choose the best dataset, and I realized that an accurate, well-organized description of each candidate dataset makes that choice much easier. In this article, I would like to share the following table (Table 1), which I developed to describe image datasets for classification projects in machine learning.
- General information: Dataset name, link, and size.
- Image dimensions: The range (minimum and maximum) of both width and height gives you a better idea of what the images look like and which transformations, such as resizing, you may need to apply. The average value gives you an intuition about the typical dimensions of most images.
- Number of images: Depending on the problem you want to solve, there is an acceptable number of images that you can work with. If the problem is very complex, however, the number must be large enough to cover all the possible cases.
- Number of classes: The number of classes helps you choose and set up an ML/DL algorithm.
- Number of images per class: It is very important to know whether the dataset is balanced or imbalanced, because this affects the whole process of training and validating the ML/DL model.
- Number of images per extension: Sometimes we are interested in a specific image extension. This information tells you the proportion of images with each extension.
- Image file size: Gives you an intuition about the distribution of image file sizes.
- Notes: Useful for recording any additional information about the dataset, such as permissions or ethical considerations.
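Most of the fields above can be collected automatically. The following is a minimal sketch, not the method used to build the tables in this article. It assumes the dataset is stored as one folder per class (`root/<class_name>/<image files>`), and the function name `describe_image_dataset`, the `IMAGE_EXTENSIONS` set, and the keys of the returned dictionary are all hypothetical choices made for illustration. Image dimensions are read with Pillow if it is installed; otherwise those two fields are left empty.

```python
from collections import Counter
from pathlib import Path
from statistics import mean

# Assumption: these are the extensions we count as images.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff"}


def _summary(values):
    """Return min, max, and mean of a list, or None if the list is empty."""
    if not values:
        return None
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def describe_image_dataset(root):
    """Summarize a dataset laid out as root/<class_name>/<image files>."""
    try:
        from PIL import Image  # optional: only needed for width and height
    except ImportError:
        Image = None

    per_class = Counter()
    per_extension = Counter()
    file_sizes, widths, heights = [], [], []

    for path in sorted(Path(root).rglob("*")):
        extension = path.suffix.lower()
        if not path.is_file() or extension not in IMAGE_EXTENSIONS:
            continue

        per_class[path.parent.name] += 1  # the folder name is the class label
        per_extension[extension] += 1
        file_sizes.append(path.stat().st_size)  # size in bytes

        if Image is not None:
            try:
                with Image.open(path) as img:
                    width, height = img.size
            except OSError:
                continue  # unreadable image: counted, but no dimensions
            widths.append(width)
            heights.append(height)

    return {
        "num_images": sum(per_class.values()),
        "num_classes": len(per_class),
        "images_per_class": dict(per_class),
        "images_per_extension": dict(per_extension),
        "file_size_bytes": _summary(file_sizes),
        "width_px": _summary(widths),
        "height_px": _summary(heights),
    }
```

Calling `describe_image_dataset("path/to/dataset")` returns a dictionary whose entries map onto the rows of the table. Comparing the largest and smallest values in `images_per_class` is a quick way to see whether the dataset is balanced. The general information and notes fields still have to be filled in by hand.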
To make the idea clearer, let me show you a quick demo. The following table (Table 2) describes a COVID-19 dataset from the Kaggle website.
That is all for this article. I hope you find it useful, and I would be glad if you shared your ideas about the topic with me.
Credit: BecomingHuman By: Samer Sallam