Welcome to the DSC Weekly Digest, a production of Data Science Central. Every week, we pick out the newest and brightest articles on the topic of data … and how it informs our lives, powers our applications, and provides insights into our world.
From the Editor’s Desk
When working with information, context is becoming increasingly important. All too often, data scientists and analysts assume that their data exists in a vacuum, disconnected from anything else. Yet producing data invariably involves dozens or even hundreds of decisions: which sources to use, which interpretations to apply, what shaped the data gathered in the first place, and so on. Loss of context can render good data meaningless and can introduce unexpected patterns that seem mystifying until the underlying biases are resolved.
A few years back, a study attempted to use machine learning on patient records to identify, a priori, whether anything in a person’s medical charts suggested an early-stage cancer that had not yet been detected. The researchers scanned tens of thousands of such records from a given hospital through OCR, and lo and behold, the machine learning algorithm was able to pick up such patterns 99.95% of the time.
While the young researchers were jubilant – their technique had worked, after all – more experienced data scientists raised an eyebrow at a rate so very nearly perfect and began checking the data acquisition chain. Eventually, they made a discovery. Because the form had no space to indicate a cancer diagnosis, nurses at the hospital routinely put a circled C on the record of each patient who had cancer. The algorithm had apparently learned to recognize that mark rather than any early sign of disease. Once this particular bit of context was factored in, the strong correlation disappeared.
Data science is ultimately about more than just running numbers through algorithms. Understanding the data itself is often more than half the battle. Anyone who believes that good data science can be done without digging deeply into the quality, veracity, and context of the data will not succeed in this field. This is why Data Science Central is here.
DSC Featured Articles
Tech Target Articles
Picture of the Week
This email, and all related content, is published by Data Science Central, a division of TechTarget, Inc.
275 Grove Street, Newton, Massachusetts, 02466 US