Add another layer to your #Business literacy. We at Serebral360° would love to know if the Forbes – Entrepreneurs article was helpful, leave a comment, like and share. Let’s dive in and discuss the information and put it to use to grow your business. #BusinessStrategy #ContentMarketing #WebDevelopment #BrandStrategy
While pharma C-suite executives find themselves increasingly seduced by the promise of “digital transformation,” and especially by the idea of leveraging AI, the lived, on-the-ground reality within virtually all pharma R&D organizations couldn’t be further removed.
Novartis CEO Vas Narasimhan candidly alluded to this in January when he reflected on his company’s digital transformation journey, and said,
“The first thing we’ve learned is the importance of having outstanding data to actually base your ML on. In our own shop, we’ve been working on a few big projects, and we’ve had to spend most of the time just cleaning the data sets before you can even run the algorithm. That’s taken us years just to clean the datasets. I think people underestimate how little clean data there is out there, and how hard it is to clean and link the data.”
This is arguably the core problem underlying most datasets in health and in pharma, impeding not only the ability to leverage AI, but also just to understand, at the most basic level, the data within your own organization. Essentially, any effort to glean novel insight from existing data represents a maddeningly difficult challenge.
The enormous magnitude of this challenge was highlighted this week in an outstanding Stat article by Casey Ross, taking readers “behind the scenes” to learn what was required to run the AI-driven analysis of lung cancer diagnosis reported by Google and others. In short, Google’s AI was used to assess chest CT scans to determine whether or not lung cancer was likely present, and compared to the performance of human radiologists; the algorithm seemed to do better than the radiologists when looking at scans representing a single point in time, and no worse than radiologists when they were allowed to view previous scans from the same patient.
According to Ross’s article, a huge challenge in doing this study was extracting the patient cases (from Northwestern’s hospital system) to test the algorithm. Ross quotes Mozziyar Etemadi, “a biomedical engineer and anesthesiologist at the Chicago hospital system,” who told Ross, “It was a pretty crazy engineering challenge. We had to write a ton of software just to communicate between different parts of the [hospital’s] electronic records.”
Ross goes on to astutely observe,
“The struggle underscores one of the biggest barriers to the development and use of AI in medicine: Patient information is held in a crazy array of computer systems and formats that defy efforts to build coherent datasets. The data are needed to train algorithms and, as in the Northwestern example, validate their performance on patient cases they have never seen.”
Behold: this is the dirty little secret bedeviling all “digital transformation” efforts in healthcare and in pharma. Moreover, I don’t think anyone working on digital and data in healthcare found the Northwestern saga remotely surprising; rather, as Ross points out, it is painfully representative.
“It’s About The Data”
“Do I believe this?” Dr. Amy Abernethy, Principal Deputy Commissioner of the FDA, responded rhetorically, when asked about the data curation challenges encountered by the team performing the Google AI study. “Absolutely.”
Abernethy is arguably one of the world’s experts on exactly this sort of data wrangling, having spent her entire professional career (as discussed in this Tech Tonics podcast) working on the challenges of leveraging clinical cancer data, both as an oncologist and clinical researcher at Duke and as a senior executive at the data company Flatiron (both before her present role). As she explains,
“At the end of the day, the challenges for the application of AI to healthcare aren’t really about development of the algorithms – it’s about the data. Sophisticated AI prediction algorithms have been developed in many industries especially those where discrete, voluminous, well-organized and readily-analyzable data are the norm (for example, meteorology and finance). For AI to be applied to healthcare, however, the underlying data need to be organized and readily accessible. That is not the case with most healthcare data.”
Just how bad are the data challenges in clinical data? Pretty bad, Abernethy seems to suggest. “While data quality challenges can be overcome,” she says,
“they have to be characterized and addressed. In electronic health record datasets (EHRs), many of these complexities are magnified because data are inconsistently coded (or not coded at all) making the data difficult (if not impossible) to merge. Data quality problems abound, such as cut and paste errors and missing parts of the patient’s longitudinal record. And many key patient outcomes are not characterized or coded.”
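To make the merging problem concrete, here is a minimal Python sketch of what happens when two systems record the same diagnosis differently. Everything in it is hypothetical: the record layouts, the patient IDs, the `FREE_TEXT_TO_CODE` lookup and the helper names are invented for illustration, and the ICD-10-style codes are used only as plausible examples. It is not a description of how any team quoted here curates data.

```python
# Hypothetical records from two systems; all field names and values are invented.
billing = [
    {"patient_id": "001", "dx": "C50.9"},
    {"patient_id": "002", "dx": "c349"},
    {"patient_id": "003", "dx": None},  # diagnosis never coded
]
notes = [
    {"patient_id": "001", "dx": "breast ca"},
    {"patient_id": "002", "dx": "C34.9"},
    {"patient_id": "003", "dx": "lung cancer"},
]

# Hand-built lookup from free-text labels to codes (assumed; this is the curation step).
FREE_TEXT_TO_CODE = {"breast ca": "C50.9", "lung cancer": "C34.9"}


def normalize_dx(raw):
    """Return a canonical code such as 'C50.9', or None if the value cannot be resolved."""
    if raw is None:
        return None
    text = raw.strip().lower()
    if text in FREE_TEXT_TO_CODE:
        return FREE_TEXT_TO_CODE[text]
    compact = text.replace(".", "").upper()
    # Accept one letter followed by three digits, e.g. 'C509' becomes 'C50.9'.
    if len(compact) == 4 and compact[0].isalpha() and compact[1:].isdigit():
        return f"{compact[:3]}.{compact[3]}"
    return None


def naive_agreement(a, b):
    """Count patients whose raw diagnosis strings match exactly across both systems."""
    by_id = {r["patient_id"]: r["dx"] for r in b}
    return sum(
        1 for r in a
        if r["dx"] is not None and r["dx"] == by_id.get(r["patient_id"])
    )


def normalized_agreement(a, b):
    """Count patients whose diagnoses match after normalization."""
    by_id = {r["patient_id"]: normalize_dx(r["dx"]) for r in b}
    matches = 0
    for r in a:
        code = normalize_dx(r["dx"])
        if code is not None and code == by_id.get(r["patient_id"]):
            matches += 1
    return matches


print(naive_agreement(billing, notes))       # raw strings never line up
print(normalized_agreement(billing, notes))  # two patients line up after cleanup
```

Even in this toy case, the third patient cannot be reconciled because one system never coded the diagnosis, and the lookup table itself had to be written by hand.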
Turns out, this creates problems, particularly when you’re trying to develop algorithms. Abernethy explains,
“At least in the near-term, the smartest AI algorithms in healthcare are developed with labeled datasets, i.e., datasets where key features are consistently labeled in a reliable and codified fashion. For example, if you want to generate algorithms to predict that this patient has breast cancer, it is best to build the prediction algorithm using data that accurately labels the patient case as having breast cancer or not.”
Abernethy’s suggestion (presumably informed by her Flatiron experience): human curation, at least for now:
“One recent example of the challenges of the ‘middleware’ of data for AI was when it was disclosed that real humans were listening to Alexa recordings to support algorithm development. Should we have been surprised? As of now, it is hard to imagine reliable data from the real world where we don’t need human curation and/or cross check. That is what you are seeing here in the Google AI for lung cancer screening example. It turns out that even some of the most sophisticated tech companies in the world – Amazon, Google – have data problems. Data are at the core.”
Data: Asset or Liability?
As Eric Perakslis, a health data guru and Rubenstein Fellow at Duke (his Tech Tonics here), explains, data in both health systems and clinical trials (and, I’d add, other biopharma research) are generally collected for extremely specific purposes. In clinical encounters, the data is logged into an EHR, and used for “building a longitudinal history and tallying up procedures for billing.” In clinical trials, “you are filling in boxes in the physical implementation of a clinical protocol, a statistics database.” Both, Perakslis says, are forms of data management where the goal is efficiency, compliance and effective sausage making.
The challenge, he continues, is:
“Learning anything secondary, also known as knowledge management, from either is usually an afterthought and requires data to be extracted, reformatted and re-homed into an additional structure such as a data warehouse/mart/lake. This requires additional labor and additional risk as duplicate data amplifies cost as well as compliance, security and privacy risks. For these reasons, it is seldom done unless funded and prioritized via leadership. This can be messy and expensive to bean counters but, I’d argue, should be prioritized, because if done correctly, data should be the second most valuable asset, after talent, in any scientific organization.”
Like Abernethy, Perakslis acknowledges the effort involved in actually arriving at usable data:
“Given the archaic infrastructure of most large institutions data curation, cleaning and transformation amounts to manual hand-to-hand combat between highly educated humans and text interfaces. It takes forever, costs a fortune and simply should be avoided. People should think things through from day one and know that it makes more sense to lay plumbing and conduit for the anticipated addition when you design the house, not after.”
Adds Perakslis, “randomized control trials cost anywhere from $30,000-$50,000 or more per patient…. For less than $1,000 more (per patient), you could create data files in a modern lake ready for AI, ML or human mining. Why aren’t we all spending the extra two cents per sausage even if our primary job is sausage making? Treating data as an asset versus a liability should be the key.” (I’ve also discussed the contrast between the positive optionality tech companies see in data vs the negative optionality many biopharma companies perceive; see here.)
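For a rough sense of scale, the arithmetic behind Perakslis’s figures can be sketched in a few lines of Python. This treats his “less than $1,000” as an upper bound and uses only the two per-patient trial costs he quotes; the function name is made up for illustration.

```python
def added_cost_share(trial_cost_per_patient, added_cost=1_000):
    """Return the added data cost as a percentage of the per-patient trial cost."""
    return 100 * added_cost / trial_cost_per_patient


for trial_cost in (30_000, 50_000):
    share = added_cost_share(trial_cost)
    print(f"${trial_cost:,} per patient: at most {share:.1f}% extra")
```

On these numbers, the added expense tops out at roughly 2% to 3.3% of the per-patient cost, and the share shrinks further for trials that cost more than $50,000 per patient.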
Data > Algorithm
“90+% of effort in real-world machine learning projects will end up focused on ‘mundane’ data cleaning and data management, not ‘exciting’ models and algorithms work. To a first order, quality and scale of data are much more important than the particular type of ML algorithm used on the data: it can be surprisingly hard to beat simple logistic regression.”
Haque notes that “academic ML model/algorithm research largely focuses on a handful of manually-curated benchmark data sets, partially to abstract away the huge amount of work it takes to collect a dataset and focus instead on the modeling problem, and partially to provide a means of comparing results in different papers.” Unfortunately, for those seeking to solve the sort of real-world problems like those facing hospitals and pharma R&D groups, “data is not usually served in a nicely prepackaged form.”
Generating Their Own Data
Because of the challenge of the data quality problem, some innovators seeking to leverage AI have decided to develop their own datasets. As Chris Gibson, CEO of Recursion Pharmaceuticals, a leading company in this space, explains,
“To ask complex questions of biology and model it using the power of AI & ML, one must ideally generate data specifically for that purpose. Most data available publicly in large datasets, developed by biopharma companies or large health systems, is not designed to be used to ask the kinds of questions we and others need to ask. Data like this requires a Herculean effort to ‘clean and prepare’ it for machine learning use, and is often collected in ways that introduce worrisome biases.”
Hence, he decided to generate the necessary data, on his own. “Because of these issues,” he explains,
“we’ve spent more than five years at Recursion building out the industry’s largest biological images dataset, generated all in-house. Today we do nearly 350,000 carefully controlled experiments each week in a variety of human cell types from which we generate, among other things, fluorescent microscopy images that have a tremendous depth of information about biological states within them.”
“Machine learning is only as good as the data you feed it. A key rate-limiting factor to the application of machine learning in biomedicine is the lack of high-quality data that’s fit to purpose. At insitro, we are building a bio-data factory that leverages a range of cutting edge technologies to produce high-quality data at enormous scale. This enables us to identify problems where having more predictive models would be transformative, and then generate data specifically to enable machine learning to be applied to those problems.”
Bottom Line: Data Parasite Redux?
While many healthcare systems and pharmaceutical companies seem to be embarking on “digital transformations,” the continued reality for both is that virtually all data continue to be collected in an outmoded fashion. That approach enables the information to fulfill its original, very concretely defined purpose (billing and rudimentary patient history documentation in the case of EHRs, collection of very specific data for pre-defined statistical analyses in the case of clinical trials), but it makes garnering additional value from these data nearly, if not truly, prohibitively difficult.
In the short term, some of these hurdles can be overcome through “Herculean” curation efforts involving an exceptional amount of manual labor. Going forward, it remains to be seen whether stakeholders – meaning hospital systems and biopharma companies – will evolve to a more modern model that would enable far greater downstream utility, or will keep on keeping on, pursuing the occasional curation project but largely avoiding profound changes to the underlying data philosophy and architecture.
Already, innovators seeking to leverage contemporary AI techniques have recognized the need to create their own, largely preclinical, data. It will be interesting to see if these approaches ultimately generate the profound insights anticipated by their founders, a result that could potentially motivate wider adoption.
I am particularly curious about whether any pharma will truly evolve to a contemporary approach to data collection and management, given that this represents a profound organizational challenge since the benefits accrue to future or adjacent stakeholders (who could benefit from new uses for the data) while the burdens and inconveniences are borne largely by those who most directly need the data for the originally intended use, today. (This can be regarded as just another manifestation of the “data parasite” tension of several years ago, see here.)
Appealing as the vision might be, getting to this “data science mindset,” as I’ve written, seems like a really high hurdle for an organization to overcome. Changing this calculus will require several viscerally and financially compelling examples of successfully using data for secondary purposes; optimistic assertions of data’s exceptional potential may motivate C-suite executives to call for change, but are unlikely to motivate those on the front lines to drive it.
May 23, 2019 at 12:11AM
Forbes – Entrepreneurs