The peculiar lack of clinical AI diagnostic solutions
Many research studies have shown that machine learning models can perform as well as trained doctors on a range of tasks – especially in radiology – and can even find patterns a human doctor would miss.
But for many years, AI-supported diagnostic solutions were not actually being used in hospitals. In other fields, we’re used to a quick translation from research to practice: after a breakthrough in face recognition technology, you might see Facebook using the same technology to tag your friends in less than six months. But in medical diagnostics, innovation simply didn’t translate into practice. Why not?
This is exactly the situation Eyal faced in 2014, when Zebra Medical Vision was born: lots of breakthrough research, an obvious need for AI support in diagnostics, plenty of talent, but almost no solutions in day-to-day clinical use.
So they went looking for the reasons behind this reality. They found that research alone couldn’t address some of the critical challenges which needed to be solved before AI diagnostics could be applied in clinical contexts. Doctors and clinicians needed:
- Models trained on datasets that represent the population;
- Seamless integration into clinical workflows;
- A technology platform to connect research and hospitals.
Challenge 1: Building a representative dataset
The first issue Eyal and his team discovered was that most academic research relied on small datasets. Not only were they small, but they also didn’t represent the variety of patient cases clinicians see in hospitals on a day-to-day basis. For real-world diagnostic solutions, representative data is absolutely crucial.
So the first challenge is getting a large, diverse dataset that looks exactly like what you’d expect to see in a hospital.
That’s why, for the first two years, Zebra Medical Vision hardly did any machine learning at all. Instead they focused on developing data partnerships with over 30 hospitals in Israel, the US, and India.
Collecting a full view of the patients
While many studies might simply compare doctors’ and machines’ diagnostic performance using one type of data, Zebra Medical Vision had to go further. They collected data on:
- Radiology images in many different modalities (X-Ray, CT, MRI, Mammography, PET, and Nuclear Medicine);
- Lab analyses;
- Patient admission and discharge;
- Doctors’ diagnoses;
- Clinical outcomes (meaning how patients fared after discharge, possibly years later).
The key factor here was that the samples Zebra-Med collected had to represent the exact same diversity and complexity doctors face in the hospital every day.
Proof-of-concept vs. a clinical diagnostic solution
Let’s say you build a system that can take a radiology image and then correctly judge whether the image shows signs of lung cancer or indicates a healthy patient. At first sight this might seem very useful. But it wouldn’t be useful in a hospital.
So far, this is just a proof-of-concept – you’ve demonstrated that an algorithm can process radiology images and differentiate between two clearly described groups of patients. But does this scenario (lung cancer vs. healthy patient) really represent the problem a doctor would face?
In fact, this isn’t a question that comes up in a hospital. It’s more likely that a patient comes to you because they have some symptoms already, and you need to diagnose the cause: ground-glass opacity, pulmonary embolism, COPD, emphysema, and lung cancer might all present similar symptoms.
If the machine learning model is going to be of any use to the doctor, it needs to be able to differentiate among radiology images of patients with symptoms – not between healthy and unhealthy people.
To put it another way: the data you train the diagnostic model on needs to represent the population you expect to see in the clinic.
But this isn’t the only hurdle to making a useful model for everyday hospital diagnosis.
Challenge 2: Integrating seamlessly into clinical workflows
It’s safe to assume that doctors, like anybody else, want new solutions to make their work easier – not more complicated.
But academic research leaves this issue unresolved. For example, you might develop a very accurate machine learning model that only performs well when the image is taken with specific machine settings – settings a doctor wouldn’t normally use in a routine exam.
Solving the problem in this way is certainly a valuable research result, but it’s still far from a workable diagnostic model for clinical practice, because it requires a non-standard workflow.
In practice, the model has to fit seamlessly into the existing workflow. This means:
Using the data produced in the standard workflow
A clinically useful model should not require the clinician to perform any additional steps. It should only require the exact data – in the exact format – that’s already produced in the hospital’s normal diagnostic workflow.
Working with the software doctors already use
Zebra Medical Vision didn’t build a new software tool for doctors to add to their workflow. Instead, they partnered with diagnostic workstation providers and integrated their predictions directly into the software tools the doctors were already familiar with.
This seamless integration is essential – not only because asking doctors to change their workflow is unrealistic, but because the whole point of using AI is to save time, not to create additional work.
But this wasn’t easy. There’s a wide variety of workstation providers and software, as well as different kinds of environments (e.g., web-based or Windows-based). Plus, a lot of the software is quite old. Nevertheless, Zebra-Med had to integrate with each of them.
Separating integration from computation
Zebra-Med built a smart infrastructure that allows for flexibility while minimizing duplicated work. The hospital only has to install an agent in their local system once, and this agent handles privacy, security, and anonymization, as well as providing a local user interface.
On the other hand, all the heavy lifting and prediction happens in the cloud. This also makes it very simple for Eyal and his team to roll out new versions of the model. They simply update the model on their side, and the agents installed in the hospital communicate with the centrally hosted model. This means hospitals don’t need to do anything to receive updates.
Plus Zebra-Med calculates the predictions in advance, so when a physician clicks on an image, they can see the result immediately.
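To make this split concrete, here’s a minimal sketch in Python of how a hospital-side agent might hand studies to a centrally hosted model and serve precomputed results. All names and behavior here are illustrative assumptions, not Zebra-Med’s actual implementation:

```python
import hashlib
from typing import Optional

class CloudModelService:
    """Stands in for the centrally hosted model (the 'heavy lifting')."""
    def __init__(self, model_version: str = "v1"):
        self.model_version = model_version
        self._results = {}  # anonymized study id -> prediction

    def precompute(self, study_id: str, pixels) -> None:
        # Placeholder: in reality this would run the current model on the image data.
        self._results[study_id] = {"score": 0.87, "model": self.model_version}

    def get_result(self, study_id: str) -> Optional[dict]:
        return self._results.get(study_id)


class HospitalAgent:
    """Installed once on-site: handles anonymization and the local UI,
    and forwards studies to the cloud as soon as they arrive."""
    def __init__(self, cloud: CloudModelService):
        self.cloud = cloud

    def anonymize(self, patient_id: str) -> str:
        # Strip identity before anything leaves the hospital network.
        return hashlib.sha256(patient_id.encode()).hexdigest()[:16]

    def on_new_study(self, patient_id: str, pixels) -> str:
        study_id = self.anonymize(patient_id)
        self.cloud.precompute(study_id, pixels)  # prediction runs ahead of time
        return study_id

    def on_physician_click(self, study_id: str) -> dict:
        # The result is already computed, so it can be shown immediately.
        return self.cloud.get_result(study_id) or {"status": "pending"}


agent = HospitalAgent(CloudModelService(model_version="v2"))
sid = agent.on_new_study("patient-0042", pixels=None)
print(agent.on_physician_click(sid))  # -> {'score': 0.87, 'model': 'v2'}
```

Because the agent only forwards anonymized studies and fetches results, updating the model on the cloud side never requires touching the hospital installation – which is exactly the property described above.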
So let’s say you’ve managed to build a model that accurately represents the population, seamlessly integrates into diagnostic workstations, and fits the doctor’s workflow.
Even then, you still can’t be sure that the doctor would benefit from using your diagnostic model. Here’s why.
Saving the doctor time
Let’s say there are 6 cancer diagnoses for every 1,000 mammograms, and you’ve trained a machine learning model that can assess the images in real time. The doctor now sees the AI model’s assessments next to all of their scans and can prioritize the cases that have been flagged.
Now we face two crucial considerations:
The false-negative rate. How often does the model think a patient is healthy when in fact they are not? This kind of error is obviously very dangerous. If the doctor doesn’t look at a scan because the system says the patient is healthy when they aren’t, then the patient might miss their chance to get treated. This is a serious error that must be avoided at all costs.
The false-positive rate. How often does the system flag a scan when the patient is in fact healthy? This error is much less problematic – the doctor will cross-check the scan because the system flagged it, and will discover it was a false alarm.
But overall, the false positive rate still determines whether the system saves the doctor time. Imagine you see that a scan is flagged. You’ll inspect it diligently, try to understand why the AI system might have deemed it problematic, and spend a lot of time deliberating before you conclude: “No, the AI made a mistake. This patient is actually healthy.” Let’s say this is the case for 49 out of every 50 scans the AI flags.
In this case, it doesn’t matter how much more quickly the AI can make a single assessment if the doctor has to spend extra time checking and refuting most of the cases the AI flagged.
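To make the arithmetic concrete, here’s a minimal sketch using hypothetical numbers in the spirit of the example above (6 cancers per 1,000 mammograms, a model that flags every true case, and a 30% false-positive rate). It shows why, even then, only about 1 in 50 flagged scans is a real find:

```python
def flagged_breakdown(n_scans, prevalence, sensitivity, false_positive_rate):
    """How many flagged scans are true finds vs. false alarms."""
    n_sick = n_scans * prevalence
    n_healthy = n_scans - n_sick
    true_positives = n_sick * sensitivity
    false_positives = n_healthy * false_positive_rate
    flagged = true_positives + false_positives
    precision = true_positives / flagged if flagged else 0.0
    return flagged, precision

# Hypothetical numbers: 6 cancers per 1,000 mammograms, a model that
# catches all of them, and a 30% false-positive rate on healthy scans.
flagged, precision = flagged_breakdown(
    n_scans=1000, prevalence=0.006, sensitivity=1.0, false_positive_rate=0.30)

print(f"Flagged scans: {flagged:.0f}")             # ~304 of 1,000
print(f"Real finds among them: {precision:.1%}")   # ~2% – roughly 1 in 50
```

The low prevalence is what makes this so punishing: even a seemingly modest false-positive rate swamps the handful of true cases.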
To assess whether an AI solution can truly speed up a particular diagnostic workflow, you need to consider:
- Integration. How seamlessly can your model integrate into the current tools? Does an AI assessment fit into the workflow?
- Presentation. Can you present the predictions in a way that makes sense to the doctor?
- Clinical prevalence. How many cases do you expect in a given number of scans? And can you reach a false-positive rate that the doctor can tolerate?
So far, we’ve highlighted how important it is to get real clinical data as the basis for building a model that can perform on the exact population and data present in the clinic. And we’ve shown that simply having an automated diagnostic model doesn't necessarily mean you’re saving a doctor time – seamless integration and a manageable false-positive rate are essential.
But if we want to connect research with the clinic – in a way that satisfies regulatory requirements – we’re still missing something.
Challenge 3: Building the technological backbone of an AI diagnostic solution
A proficient data science and research team is essential, but it’s not sufficient.
Once Eyal and his team had both the data and a means of getting solutions into clinicians’ hands, they had to build a highway on top of this bridge: the technological backbone that would make everything else possible.
Annotating and curating data at scale
Machine learning models learn from the data they’re trained on. So if there are any mistakes in your data – like a wrong diagnosis – then the model will also learn those mistakes. Before you run any experiments, you need to double-check and correctly annotate all the data points in your study.
For their studies, Zebra Medical Vision had to coordinate support from up to 60 different expert annotators worldwide – all working on the same clinical mission. At this scale, Eyal had to build internal tools to collect, compare, and consolidate all those annotations.
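As a rough illustration of what such consolidation tooling has to do – not Zebra-Med’s internal tools – here’s a minimal sketch that merges annotator labels by majority vote and routes disagreements to expert review:

```python
from collections import Counter

def consolidate(annotations: dict[str, list[str]], min_agreement: float = 0.7):
    """annotations maps a study id to the labels given by each annotator.
    Returns a consolidated label per study, plus the studies needing review."""
    consolidated, needs_review = {}, []
    for study_id, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            consolidated[study_id] = label
        else:
            needs_review.append(study_id)  # annotators disagree too much
    return consolidated, needs_review

labels = {
    "study-001": ["fracture", "fracture", "fracture"],
    "study-002": ["fracture", "no fracture", "no fracture"],
}
print(consolidate(labels))
# ({'study-001': 'fracture'}, ['study-002'])
```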
Running thousands of experiments at a time
In the past, a researcher might take the time to design the perfect experiment, implement it, and then assess whether their approach solves the problem. But this doesn’t work for machine learning: there are so many ways to slice the data and build a model, you would never reach the end.
So Eyal and the team needed to build a platform that allows researchers to run not one, but thousands of experiments at the same time. They also built tools to track all of these experiments and compare the results.
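A hedged sketch of what such a platform does at its core – expanding a search space into many runs, recording each one, and ranking the results. The function and parameter names here are illustrative assumptions, not the actual platform:

```python
import itertools
import json
import uuid

def run_experiment(config):
    """Placeholder for training + evaluation; returns a validation score.
    In a real platform this would train and validate a model."""
    return hash(json.dumps(config, sort_keys=True)) % 100 / 100

search_space = {
    "learning_rate": [1e-3, 1e-4],
    "window_level": ["soft_tissue", "bone"],
    "augmentation": [True, False],
}

# Expand the grid and record every run so results stay comparable later.
runs = []
keys = list(search_space)
for values in itertools.product(*search_space.values()):
    config = dict(zip(keys, values))
    runs.append({"id": uuid.uuid4().hex[:8],
                 "config": config,
                 "score": run_experiment(config)})

best = max(runs, key=lambda r: r["score"])
print(f"{len(runs)} experiments tracked; best config: {best['config']}")
```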
Separating research and clinical systems
Many companies talk about feedback loops – where the model improves via user feedback. For clinical solutions, that’s not usually realistic or responsible.
A diagnostic model is considered a medical device, and any change has to go through the process of regulatory approval.
Zebra Medical Vision’s two systems for research (model training and testing) and clinical use are entirely separate. There is a firewall between them, and they’re even hosted in two separate physical locations.
Of course, the team still appreciates getting feedback from doctors so they can understand how the model is performing and put that learning to use in future iterations.
But this is just one of the hurdles of working in a regulated environment.
Fulfilling regulatory requirements
Everything connected to the development and use of a diagnostic model needs to be traceable, including:
- Product definition and system requirements;
- Design;
- Data curation;
- Research, development, and engineering;
- Testing;
- Release processes;
- Customer complaints.
Zebra Medical Vision built a global quality management system which provides this information to the FDA on an ongoing basis, gaining trust through transparency. They also adhere to and are tested on three different ISO standards, as well as SOC 2 Type 2 internal controls reports issued by third-party auditors.
This required a huge investment in security, as well as making sure everything is logged and documented.
We’ve now covered partnerships, data, integration, and the technical backbone. All of these are necessary – but they’re still not sufficient to solve a machine learning diagnostics problem, or what Eyal calls a clinical mission.
The Band: An interdisciplinary group with a clinical mission
Zebra Medical Vision very quickly learned that data scientists can’t build a useful medical device all on their own. What’s more, if you pair clinicians with data scientists you’re still stuck, because even when they think they’re talking about the same thing, they’re usually not. Someone needs to fill the gap.
Translating between two worlds: The role of the clinical information manager
Zebra Medical Vision needed someone with clinical trials management experience – someone who could speak the languages of both the clinicians and the data scientists. They dubbed this role the “Clinical Information Manager” – this person usually has a PhD in biomedical engineering or clinical research.
Each clinical mission also needs a project manager, a research engineer, and an operations engineer.
Eyal calls this unique interdisciplinary team the band: different talents, same clinical mission.
Each band member has a specific role:
- The project manager keeps the project on track.
- The clinician brings knowledge about the complexity of everyday reality in the hospital.
- Data scientists and research engineers formulate and test assumptions on the data, and know how to train and validate machine learning models.
- The clinical information manager understands clinical research and bridges the gap between clinicians and researchers.
- Operations engineers implement the clinical solution in a robust and scalable way.
Having a band that works closely together means everything can be much more dynamic. As Eyal discovered, this flexibility is absolutely essential.
Dynamic problem solving: The mission always changes along the way
In university studies, the problem you’re working on is often fixed. But Eyal and the band found that they very often make discoveries along the way. And because those discoveries improve their understanding of the problem, they often change the clinical mission.
The band is very well suited to adjust to these changes. One reason is Zebra Medical Vision’s technological backbone, which allows them to work fast and to easily modify and rerun experiments – without starting from scratch.
With all this infrastructure and this team in place, Zebra Medical Vision can move at a dazzling pace. They’re already on their way to their seventh FDA-approved diagnostic solution, with more coming soon.
Now let’s look at two examples of diagnostic solutions Eyal and his team developed, and find out what they learned along the way.
The difference between real-world and research approaches to medical image solutions
Example 1: Early warning for coronary artery disease
Half of all cardiovascular-related deaths are due to coronary artery disease. Cellular waste products, proteins, and calcium stick to blood vessel walls and combine with fat to form plaque. If this happens in the arteries that supply blood to the heart muscle, then it can limit or stop the supply of oxygen to the heart and cause a heart attack.
Unfortunately, the buildup of calcifications in coronary arteries is often only diagnosed after a heart attack or similar cardiac event.
The academic research approach: When Zebra Medical Vision looked into this problem, they found an academic paper showing that if you manually segment the areas of interest, use a gated CT scan (a scan focused only on the heart), and measure with and without contrast in a specific protocol, then you can build a model that provides the equivalent of the Agatston score – a calcium-based risk score where 0 means low risk and values above 400 indicate very high risk.
Essentially this is a segmentation problem: you segment the white clouds of calcification in the coronary arteries.
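For intuition, here’s a simplified sketch of how an Agatston-style score can be computed once the calcifications are segmented: each lesion’s area is weighted by its peak density. The 130 HU threshold and the density weights follow the standard Agatston definition, but the code itself is a toy illustration, not Zebra-Med’s model:

```python
import numpy as np

def density_weight(peak_hu: float) -> int:
    """Standard Agatston density weighting by peak Hounsfield units."""
    if peak_hu >= 400: return 4
    if peak_hu >= 300: return 3
    if peak_hu >= 200: return 2
    if peak_hu >= 130: return 1
    return 0

def agatston_like_score(slice_hu: np.ndarray, lesion_masks: list[np.ndarray],
                        pixel_area_mm2: float) -> float:
    """Toy per-slice score: sum over segmented lesions of area x density weight.
    slice_hu: CT slice in Hounsfield units; lesion_masks: boolean masks for
    each segmented calcification (the output of the segmentation step)."""
    score = 0.0
    for mask in lesion_masks:
        calcified = mask & (slice_hu >= 130)   # only voxels above 130 HU count
        if not calcified.any():
            continue
        area = calcified.sum() * pixel_area_mm2
        score += area * density_weight(slice_hu[calcified].max())
    return score

# Tiny example: a 2x2-pixel calcification peaking at 320 HU.
slice_hu = np.full((4, 4), 50.0)
slice_hu[1:3, 1:3] = 320.0
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(agatston_like_score(slice_hu, [mask], pixel_area_mm2=0.25))
# 4 px * 0.25 mm^2 * weight 3 = 3.0
```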
The problem with this approach: This particular CT scan protocol is something you would only run on a patient who is already known to have a risk of a heart attack. And therefore, this would not help all the other patients who are at risk but have no symptoms yet.
Considering how few patients have symptoms before a heart attack, it would be more helpful to find a way to diagnose a much larger group of patients.
The more practical approach: Zebra Medical Vision realized that patients get CT scans for many other diseases. These so-called untargeted or ungated scans can cover many organs, including the heart. Eyal and his team found an approach where they could take these much more frequent scans and still achieve a similar accuracy in predicting heart-disease risk to the model the researchers built for the targeted scans.
This means doctors now have an early-warning system for heart disease running in the background. This system automatically alerts them if a patient is high risk – even though they took the scan for another reason and may never have looked at the heart. Hence many patients are diagnosed earlier and receive preventive treatment.
Today this is one of Zebra Medical Vision’s leading solutions, and it’s proven to be effective on a large part of the population. It has even outperformed other solutions that only work on gated CT scans.
Example 2: Vertebral compression fractures
Early diagnosis and treatment of osteoporosis are essential. But vertebral compression fractures – a reliable sign of osteoporosis – are often missed during routine exams.
Vertebral compression fractures (VCFs) – a condition in which part of a vertebra bone in the spine collapses – are often simply ignored by radiologists. It’s not part of their standard workflow, the diagnosis is often not acute, and they’re tedious to diagnose: The radiologist has to check another section of the scan and then compare each vertebra’s height to its baseline height. In the end, 75% of all VCFs go undiagnosed or unreported.
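As a rough sketch of the measurement a radiologist would otherwise do by hand, the snippet below flags a vertebra whose measured height has dropped well below its expected baseline. The ~20% threshold loosely mirrors the mild grade of the commonly used Genant classification; everything else here is an illustrative assumption, not the actual algorithm:

```python
def flag_compression_fractures(heights_mm: dict[str, float],
                               baseline_mm: dict[str, float],
                               max_height_loss: float = 0.20) -> list[str]:
    """Return vertebrae whose height loss exceeds the threshold.
    heights_mm: measured vertebral body heights; baseline_mm: expected
    heights (e.g. interpolated from neighboring, intact vertebrae)."""
    flagged = []
    for vertebra, height in heights_mm.items():
        loss = 1.0 - height / baseline_mm[vertebra]
        if loss > max_height_loss:
            flagged.append(vertebra)
    return flagged

measured = {"T12": 21.0, "L1": 17.5, "L2": 26.0}
baseline = {"T12": 25.0, "L1": 25.5, "L2": 26.5}
print(flag_compression_fractures(measured, baseline))  # -> ['L1']
```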
But most people in high-risk groups already have a scan somewhere on file. That means the pixels are there and the compression fractures are already captured in the data – they’re just not reported.
Even though the problem isn’t sexy and doesn’t receive much attention, the benefits of diagnosing VCFs – and consequently osteoporosis – are immense for the patients. For example, 50% of patients who fracture a hip die of complications in the next 10 years. Then there’s also the immense burden of rehabilitation.
So Zebra Medical Vision built an algorithm that locates and identifies these compression fractures. This helps the clinicians who really care – those who run osteoporosis prevention and treatment programs – to identify patients with VCFs.
Now hospitals that run screening programs on patients who are at risk of osteoporosis can run this model in the background and are automatically alerted to patients who likely have a fracture. Then doctors can confirm these cases with a manual check.
As Eyal says, “Sometimes we look at the problems that are meaningful and not sexy.” This is exactly the kind of opportunity you find when you look at large datasets with an exploratory mindset – a data scientist’s mindset.
Accomplishing the impossible with naive optimism
In May 2014, at the very beginning, Eyal and his colleagues went to a radiology conference in Dallas. Everyone told them: “There will never be anything like machine vision in radiology. That's a fantasy. You’re nice Israeli guys. Enjoy your vacation, go back to Israel, and find something else to do.”
But their naïve optimism saved them. Eyal and his co-founder thought they could solve this problem, and fast. In the end, even if it took a bit longer, it would be an extremely meaningful challenge, and the impact and value of the project would be undeniable.
This mindset pushed them on through myriad challenges. But as Eyal said, “You need to be extremely stupid – in a positive way.”