Please don’t share your medical data with me
Until very recently, access to data has been a competitive advantage and barrier to entry in applied artificial intelligence (AI). This applied especially in health care, where getting data is harder due to legal constraints. There’s a consensus that data availability and quality remain a significant problem. As Bharat Rao, a principal in KPMG’s Advisory Services practice, put it: “The technology isn’t really the problem…the problem is getting all the data in one place to make it happen.”
This need to access clinical data has caused companies to make large acquisitions, strike commercial partnerships with health systems, academia and tech transfer offices and even launch data-sharing-based business models. There are no fewer than 40 startups using blockchain in health care — and many are doing so with the premise of enabling providers or patients to directly monetize their data by sharing it for analytics.
In contrast, I’ve found success in applying AI in health care by doing the exact opposite:
• Deploy software that builds and serves models separately for each customer. Each environment is fully isolated and controlled by its customer (in the cloud or on-premise).
• Never take data or models outside of a customer’s environment.
• Never share or mix data across customers.
The main reason to operate this way is that this is what customers want. Companies want a business problem solved now (e.g., how fast can you build a state-of-the-art AI model for this?). They do not want to join a data collective and have long strategic discussions around intellectual property ownership, derivative rights on trained models or patient consent requirements.
While initially designed as a compromise (Wouldn’t a model’s accuracy be affected by the lack of data? Wouldn’t implementations be longer? Doesn’t this give away a competitive moat?), this model seems to outperform its more popular alternatives. This happens for two reasons.
Transfer learning has improved by leaps and bounds
The rate of progress in natural language processing and computer vision over the past few years has been staggering. On question-answering challenges (closed text like SQuAD, conversational questions like CoQA and reading comprehension like RACE), models surpassed human performance in 2019. Things are moving just as quickly on image recognition challenges like ImageNet. If you implemented a state-of-the-art solution a month ago, you’re already behind.
Transfer learning is a highly effective way to keep improving the accuracy of NLP models. Language models like BERT, XLNet and ERNIE are all open source, including pre-trained models, which can be tuned or reused without a major compute effort.
Even better, techniques for using them to learn from small datasets have evolved. This NLP article published in May shows how to use BERT on the IMDB review classification task with only 500 samples to achieve a level of accuracy that required 25,000 last year. This CV article, also published in May, addresses an image classification problem given a dataset of 1,350 images.
As a result, there is no need to amass millions of data points in order to train a state-of-the-art model. In health care, a model must be tuned for each provider’s or payer’s population demographics, so every deployment requires labeling several hundred cases anyway, if only to measure local accuracy. That happens to be enough to produce a model at top accuracy without requiring a longer or more costly implementation.
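To make "measure local accuracy" concrete, here is a minimal sketch of how much certainty a few hundred labeled cases buy. It uses the standard Wilson score interval for a proportion. The case counts (270 correct out of 300) and the function name `wilson_interval` are hypothetical, chosen only for illustration.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for an observed accuracy (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1 + z ** 2 / total
    center = (p_hat + z ** 2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical local validation set: 300 labeled cases, 270 predicted correctly.
low, high = wilson_interval(correct=270, total=300)
print(f"observed accuracy 0.900, approx. 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumed numbers, the interval spans a few points on either side of the observed accuracy, which is usually tight enough to decide whether a tuned model is fit for a given population.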
Data moats are an empty premise
The Empty Premise of Data Moats is a 2019 post from VC firm Andreessen Horowitz, which made headlines in the tech world because it was, more than anything, surprising. It challenged a common business model and valuation lever: many companies focused heavily on acquiring more training data and feedback as a way to generate revenue or build a competitive moat.
As the post notes, this often fails in practice for enterprise startups because “data + network effects ≠ data network effects.” In social networks, each new member adds to the value of the network for all members, and adding new members becomes cheaper the bigger you are. In contrast, when adding labeled data to AI model training, the value of each new data point decreases as the dataset grows, and the cost to acquire new data increases. The reverse of a network effect is in play.
This happens in just about every health care AI project. The first 10,000 patient records are critical to building a solid predictive model within a specialty. Adding another 50,000 records provides a minimal accuracy gain. Adding a further 100,000 records on top of that may improve nothing.
The first 10,000 records may be free — obtainable from CMS or academic challenges. The next 50,000 may be licensed claims or images. The next 100,000 are going to get more expensive.
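The diminishing return described above can be sketched with a simple learning curve. The power-law shape and every constant below (`scale`, `exponent`, `floor`) are assumptions made for illustration, not measurements from any real clinical dataset. Only the three dataset sizes come from the text.

```python
def error_rate(n_records, scale=2.0, exponent=0.35, floor=0.05):
    """Hypothetical learning curve: error shrinks with data toward a fixed floor."""
    return floor + scale * n_records ** (-exponent)


def marginal_gains(sizes):
    """Return (records_added, accuracy_gain) for each step between dataset sizes."""
    steps = []
    for previous, current in zip(sizes, sizes[1:]):
        gain = error_rate(previous) - error_rate(current)
        steps.append((current - previous, gain))
    return steps


# Dataset sizes from the text: 10,000, then +50,000, then +100,000 records.
sizes = [10_000, 60_000, 160_000]
for added, gain in marginal_gains(sizes):
    per_10k = gain / added * 10_000
    print(f"+{added:,} records: accuracy +{gain:.4f} ({per_10k:.4f} per 10,000 added)")
```

With any curve of this shape, each extra batch of records buys less accuracy than the one before. If the price per record also rises, the cost of each additional point of accuracy climbs on both sides.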
In addition, clinical AI models tend not to generalize well, as IBM Watson found out when applying clinical guidelines internationally. So the model you trained on that exquisite Mayo Clinic dataset that cost an arm and a leg to license will be of little help when you implement it in another health system.
What do you think?
One benefit of writing for Forbes is that when I’m unsure about my understanding of the industry, I can just ask you. Are you seeing a different reality or the same trends? Please contact me if you can help educate me on this matter. I’ll do my best to return the favor.