Artificial intelligence (AI) is quickly becoming a day-to-day component of software development across the globe. If you’ve been following the trends at all, you’re probably very familiar with the term “algorithm.” That’s because, to the world’s big tech companies like Google, Amazon and Facebook, AI is all about developing and leveraging new algorithms to gain deeper insights from the information being collected on and about all of us.
However you feel about privacy, the tech giants’ emphasis on algorithms has been good for AI and machine learning (ML) businesses in general. Not only are these companies pushing the boundaries of ML, but they’re also putting their algorithms out there as open-source products for the world to use. This gives smaller companies the opportunity to piggyback on the research and computing power of the tech megaliths. With much of the algorithmic legwork done by Google et al, we can start pushing ML forward in new and exciting ways.
But despite what many would have us think, there’s more to AI and ML than algorithms. In fact, algorithms are just the tip of the machine learning iceberg. And they’re not even the hard part. What truly gives AI and ML its power is lurking below the waterline, and it goes by the name of data. And here’s where we start to see a divide between the “haves” of the tech giants and the “have nots” of the rest of us.
We’re all aware that the aforementioned tech giants have plenty of data to work with, having spent years collecting it themselves at scale. Not only do they have huge data sets readily available, but those data sets have been collated in a clean, standard format of their own choosing. They don’t need to spend the time and money on wrestling their data into something that their algorithms can use.
The opposite is true for the rest of us. Our machine learning efforts rely on smaller collections of data that might be cobbled together from a variety of sources and that are seldom in a standard format. Not only that, but our data sets are often corrupted and are almost never marked up to act as training sets for these advanced algorithms. These are serious barriers for smaller companies or research teams seeking to build, deploy, test and validate an AI project. Data cleaning is a whole extra production step — and a costly one at that.
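To make the "wrestling data into a standard format" step concrete, here is a minimal sketch of schema normalization. The source names, field names and formats below are entirely hypothetical, but the pattern of mapping each source's quirks onto one clean schema is the core of this kind of cleaning work.

```python
# Normalize records from two hypothetical sources into one standard schema.
from datetime import datetime

def normalize_record(raw: dict, source: str) -> dict:
    """Map a record from a source-specific schema to a standard one."""
    if source == "crm_export":
        return {
            "name": raw["customer_name"].strip().title(),
            "email": raw["Email"].strip().lower(),
            "signup_date": raw["date"],  # already ISO 8601
        }
    if source == "legacy_csv":
        # The legacy system uses US-style dates and all-caps names.
        return {
            "name": raw["NAME"].strip().title(),
            "email": raw["EMAIL"].strip().lower(),
            "signup_date": datetime.strptime(
                raw["SIGNUP"], "%m/%d/%Y"
            ).date().isoformat(),
        }
    raise ValueError(f"unknown source: {source}")

records = [
    normalize_record(
        {"customer_name": " ada lovelace ",
         "Email": "ADA@Example.com",
         "date": "2023-04-01"},
        "crm_export",
    ),
    normalize_record(
        {"NAME": "ALAN TURING",
         "EMAIL": "alan@example.com ",
         "SIGNUP": "04/02/2023"},
        "legacy_csv",
    ),
]
```

In a real project this mapping layer tends to grow one branch per data source, which is exactly why the article calls cleaning a whole extra production step.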
While this data divide may seem like a reason to despair, it’s not all bad. Where there’s a problem, there’s an opportunity. Amazon, Google and Facebook have generously passed along their algorithms, but there’s absolutely no business case for them to bother with data access, cleaning and training. It’s obvious that the onus is going to fall on us. But with it comes a potentially lucrative opportunity.
In my experience, about 90% of an AI project’s success comes down to identifying, collecting, cleaning and marking up data for the algorithms to work with, and 90% or more of the project’s work lies there as well. Solving these problems is where most of the effort, funding and business value in AI and ML will be found in the coming five to 10 years.
The future of corporate machine learning and AI goes far beyond what the big tech companies are interested in offering. By combining open-source algorithms and productized data preparation platforms, smaller players can help companies with the data problems that the big tech firms won’t touch. It’s good news for us — and good news for the clients that would otherwise be overlooked.
If you’re a smaller AI company, there’s a huge opportunity in leveraging your knowledge of data collection, cleaning and management to offer an AI-powered solution that can simplify your clients’ data wrangling. ML models can be used to automate data matching, reducing project time spent on data cleaning, while also improving the efficacy of data cleaning over time. It’s the perfect AI use case — one that the tech megacorps haven’t yet taken on.
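As a rough illustration of the automated data matching described above, the sketch below flags likely duplicate records using a string-similarity score from Python’s standard library. A production system would use a trained model as the matcher; the similarity function and the 0.85 threshold here are placeholder assumptions standing in for that learned component.

```python
# Simplified record-matching sketch: flag likely duplicates by name similarity.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1] of how alike two normalized strings are."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_duplicates(records: list[dict], threshold: float = 0.85) -> list[tuple]:
    """Return index pairs whose names score at or above the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i]["name"], records[j]["name"]) >= threshold:
                pairs.append((i, j))
    return pairs

customers = [
    {"name": "Acme Corp"},
    {"name": "ACME Corp."},
    {"name": "Globex Industries"},
]
dupes = find_duplicates(customers)  # pairs the two Acme variants
```

The pairwise loop is quadratic, so real matching pipelines add a blocking step to limit comparisons, and the hand-tuned threshold is exactly the part an ML model can learn and improve over time.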