Future of Real-World Evidence: Common Data Models and OHDSI Tools Used in the COVID-Real Case Study
This insightful webinar, hosted by Ryan Muse, explores the future of Real-World Evidence (RWE) and the use of Common Data Models (CDMs) and OHDSI tools in clinical research. Gabriel Maeztu, Co-Founder and CTO of IOMED, and Neus Valveny, Sr. Director and Head of RWE at TFS HealthScience, shared their expertise on how these models and tools are transforming the field.
Gabriel Maeztu highlighted IOMED’s role in converting hospital data into standardized CDMs, focusing on the OMOP CDM. He explained how this standardization enables faster and more accurate RWE studies by integrating diverse data sources, thus overcoming data heterogeneity. Maeztu emphasized the importance of normalized data and common vocabularies for improving the efficiency and reproducibility of observational studies.
Neus Valveny presented a global COVID-19 case study, demonstrating the practical application of RWE and CDMs. She discussed the collaborative effort between TFS, IOMED, and various hospitals, which resulted in quick and effective analysis of electronic health record (EHR) data. Valveny highlighted the benefits of CDMs in RWE studies, such as reduced costs, faster results, and the ability to conduct large-scale, federated studies. The webinar concluded with a Q&A session that addressed data privacy, quality control, and data pooling from multiple hospitals, reinforcing the value of CDMs in advancing clinical research.
Webinar Transcript
Speakers:
- Gabriel Maeztu, Co-Founder and CTO, IOMED
- Neus Valveny, Sr. Director and Head of RWE, TFS HealthScience
- Ryan Muse (Moderator)
Speaker 1 – Ryan Muse (00:06):
Good day to everyone joining us and welcome to today’s Xtalks webinar. Today’s talk is entitled Future of Real-World Evidence: Common Data Models and OHDSI Tools Used in the COVID-Real Case Study.
My name is Ryan Muse and I’ll be your Xtalks host for today. Today’s webinar will run for approximately 60 minutes, and this presentation includes a Q&A session with our speakers. Now, the webinar is designed to be interactive, and webinars work best when you’re involved, so please feel free to submit your questions and comments for our speakers throughout the presentation using the questions chat box.
We’ll try to attend to your questions during the Q&A session. This chat box is located in the control panel which is on the right-hand side of your screen. If you require any assistance along the way, please contact me at any time by sending a message using this same chat panel.
Speaker 1 – Ryan Muse (00:54):
At this time, note that all participants are in listen only mode, and please note that the event will be recorded and made available for streaming on XTalks.com.
At this point, I’d like to thank TFS HealthScience (TFS) who developed the content for this presentation. TFS HealthScience is a global contract research organization (CRO) that supports biotechnology and pharmaceutical companies throughout their entire clinical development journey. In partnership with customers, they build solution driven teams working for a healthier future, bringing together nearly 700 professionals. TFS delivers tailored clinical research services in more than 40 countries.
I would like to introduce our speakers for today’s event. With a Ph.D. in human genetics and 20+ years of experience, Neus combines a passion for statistics and epidemiology with experience in designing and running 100+ real-world evidence (RWE) studies. Neus consults with biopharma on the best RWE and late-phase study design and execution, helping them determine the best path forward.
As a medical doctor and mathematician, Gabriel combines his two passions, working at the intersection of medicine and machine learning. He works with large amounts of data to build tools that help physicians to contextualize all the information and make better decisions. Gabriel believes that the cross section between medicine and data-driven technologies will define the medicine of the future.
Without further ado, I will hand the mic over to our first speaker. You may begin when you’re ready.
Speaker 2 – Neus Valveny (02:31):
Thank you, Ryan. Welcome everybody to this webinar. We are really happy to have you here today to speak a little bit about the common data models and the OHDSI tools used in a specific COVID case study.
The agenda for today’s meeting includes a first part where we will talk about the future of real-world evidence (RWE), which we truly believe goes through the common data models, and how these common data models can be leveraged to do a fast analysis of real-world data (RWD). Then we will explain in detail the case study that we performed together with TFS and IOMED, and then we will go to the conclusions about database studies using electronic health records.
Speaker 2 – Neus Valveny (03:23):
A bit of introduction about TFS. Our company, as Ryan said, is a mid-size CRO (Contract Research Organization). We have offices and legal entities in 17+ countries. We have almost 700 employees and we have performed more than 250 studies over the last five years, both clinical trials and non-interventional studies, in more than 80,000 patients.
We provide all types of real-world evidence (RWE) services, including studies of the real-world effectiveness and safety of approved drugs, epidemiological studies, health economics, and studies assessing electronic patient-reported outcomes. Now, I hand it over to Gabriel, who will explain the common data models. Thank you.
Speaker 3 – Gabriel Maeztu (04:14):
So, before starting to talk about common data models, very quickly: what we do at IOMED is transform hospital data and create these common data models. We work with the OMOP CDM, which I will explain in more detail later on.
In the end, the idea is that at IOMED we work with hospitals to create these common data models, and thanks to this we allow CROs like TFS to perform studies faster by accessing clinical data in a federated network of hospitals that are connected using this common data model. So, this is part of what we do. The other part is that we transform all the clinical notes using natural language processing, but I will speak about that in more detail later on.
Speaker 3 – Gabriel Maeztu (05:13):
I will briefly talk about common data models. The idea behind a data model is that we have many use cases that we want to perform on the same data. So, we can imagine ourselves performing an observational study. The main problem is that with every hospital and every system that we want to work with, we face [the challenge] that they have different databases. They have a different representation of the same events.
So, we can imagine that in one hospital, the patients and all their data are structured in a format that is not compatible with a second hospital where we want to perform the same use case. So, this data is very heterogeneous by nature: whenever anyone stores any kind of data, they have to make a decision on how and where to store it, whatever the format might be.
So, if we want to have a common data model, we need to perform this normalization. In this analogy, we can imagine that we are taking all this data from all these different data sources, the different hospitals, and creating a standard format into which we can plug all our different use cases. In this case, we will plug our observational study into multiple sites at the same time.
Speaker 3 – Gabriel Maeztu (06:40):
This is part of what a common data model solves: it provides the same structure for all the data. So, we could imagine that we are providing every hospital with the same plug by transforming all this data. But, just as happens with electricity, this is not enough, because in the end we need not just to put all the data in the same schema and the same format; we also need all the data contained in these schemas to have the same meaning, the same semantics.
To represent that, we need common data vocabularies: a way of representing all the knowledge that is stored in this common schema so that it is normalized too. It’s important to normalize the data and to provide the same place or kind of structure to store it, but it’s just as important to provide the same semantics for each data point that we are storing in these databases.
Speaker 3 – Gabriel Maeztu (07:38):
In the end, across the different data sources, we might find very different ways of representing the same semantics. It’s very important to perform both of these transformations. On the next slide we can see what this would mean for a hospital.
You can imagine, as we can see on the left, that source one, source two and source three could be different hospitals, each one with its own system (it could be Cerner, or it could be Allscripts), whatever provider they have for the EHR (Electronic Health Records). The problem is that we need to transform all the data from those EHRs to this common data schema to have the same database structure. But this is just half of the work. The other half of the work is providing each data point with the same representation. So, we all need to be speaking the same language.
Speaker 3 – Gabriel Maeztu (08:32):
It’s very nice to be talking about whatever disease we want to represent, but we need all those diseases to be represented with the same IDs in the different hospitals that we might be working with. So, that’s why we need to talk not just about a common data model, but also about the two pieces that make up the data model: the data schema and the data vocabularies. So, the structure and the semantics.
Common data models are not new; many different entities have been working on them in recent years. But today we will focus mainly on the common data model that has been developed by the Observational Health Data Sciences and Informatics organization (OHDSI), which is called the OMOP, or Observational Medical Outcomes Partnership, Common Data Model. So, the OHDSI OMOP CDM (Common Data Model), as we will refer to it from now on, is a nice initiative from the OHDSI organization.
Speaker 3 – Gabriel Maeztu (09:37):
OHDSI is an open-science, collaborative effort from many different organizations in the public and private sectors, which are working on providing open community data standards. In the end, the whole community is deciding together how those data schemas and data vocabularies are going to be organized to represent all that data in a large network that can be used for clinical research.
This is an open community whose goal is to give clinical research the reproducibility, collaboration and openness that it really needs nowadays. The network of researchers that is part of OHDSI is working on all these open standards, but also on huge studies that use this network to provide new evidence on top of it.
Next, I want to talk to you in a little more detail about what the OMOP CDM is.
Speaker 3 – Gabriel Maeztu (10:52):
As I said before, I think it’s useful to split it in two, and we’ll talk first about the data schema: the structure or representation where we will store all this information. It’s very important to understand that the data schema in the OMOP CDM is a patient-centric data schema.
What that means is that everything revolves around a person (a person, because it might not be a patient): all the information that might be of interest for any kind of research study will always be linked to an individual from the databases that have been transformed to the OMOP CDM. This schema is optimized for observational research purposes, but it is actually used every day for other use cases, such as analytical tools or predictive analytics that feed on this data to perform the inference that they need for their own use cases.
Speaker 3 – Gabriel Maeztu (12:09):
So, the kind of information that we can find in the OMOP CDM is not just clinical data, although that is the main part of it. We can find things such as drug exposures, so we can see every time that the person has received a drug, or all the observations that clinicians and other practitioners have made. We can also see, for example, all the care sites and the different measurements that a patient might have had in recent years.
So, all the data points in these different domains that I just explained are normalized into what we can see on the right side of the screen: a standard vocabulary, represented in this case by a standard concept ID. It’s just a unique ID that represents the semantics that I was talking about before. Next slide, please. In the end, what all these data vocabularies or concepts are doing is trying to represent these semantics, but not by building them from zero or creating all these concepts again. They leverage the ontologies, terminologies and vocabularies that the healthcare community has already agreed on, and they build up from there to create the representation and the semantics needed to represent all the procedures, drugs, measurements and observations that a hospital can perform during its practice.
Speaker 3 – Gabriel Maeztu (13:52):
So, adopting existing vocabularies is key to the success that the OHDSI OMOP CDM has had, because it has allowed many existing databases that were already in one of these controlled vocabularies to be remapped or transformed into the OMOP CDM. The next slide, please.
Speaker 3 – Gabriel Maeztu (14:18):
So, in the end, these vocabularies or medical dictionaries are just standards that have been adopted by different parts of the medical community. I’m pretty sure that you already know many of them, like ICD-10 or SNOMED or [unclear], but all these different vocabularies have their own space inside the OMOP CDM.
On the next slide we can see how all these concepts have a representation inside the OMOP CDM. In this example, if we look at the vocabulary ID, which is the fourth row, it shows us that this concept comes from the SNOMED vocabulary, but it already has its own concept ID. It has its own representation inside the OMOP CDM.
But the nice thing about the OMOP CDM is that any other vocabularies or terminologies that describe atrial fibrillation, as SNOMED does, are also mapped to the same concept ID.
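As an illustration of the mapping just described, here is a minimal Python sketch in which two hospitals record atrial fibrillation in different source vocabularies and both end up with the same standard concept ID. The lookup table, the function name and the specific codes are assumptions made for this example; a real conversion would read its mappings from the OMOP vocabulary tables rather than hard-code them.

```python
# Illustrative lookup: (source vocabulary, source code) -> standard concept ID.
# The codes and the concept ID are examples for this sketch only.
SOURCE_TO_STANDARD = {
    ("SNOMED", "49436004"): 313217,   # atrial fibrillation
    ("ICD10CM", "I48.91"): 313217,    # unspecified atrial fibrillation
}


def to_standard_concept(vocabulary_id, source_code):
    """Return the standard concept ID for a source code, or 0 when unmapped."""
    return SOURCE_TO_STANDARD.get((vocabulary_id, source_code), 0)


# Two hospitals that code the same diagnosis in different vocabularies.
hospital_a = [("SNOMED", "49436004")]
hospital_b = [("ICD10CM", "I48.91")]

ids_a = {to_standard_concept(v, c) for v, c in hospital_a}
ids_b = {to_standard_concept(v, c) for v, c in hospital_b}

# After normalization both sites expose the same identifier.
print(ids_a == ids_b)
```

Once every site exposes the same identifier, a single query for that concept can run unchanged at each hospital.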
Speaker 3 – Gabriel Maeztu (15:35):
We have a unique representation for each of the semantic meanings that we might be interested in, and those semantic meanings are not just invented by the OMOP CDM; they are agreed upon across all these different terminologies and ontologies. The nice work that is performed by the OMOP CDM is to review all these vocabularies and to create this standard format that all researchers can start working on top of.
So, we can imagine a huge collection: almost 6 million different concepts from more than 78 vocabularies. But they are not just independent vocabularies. All those vocabularies are mapped to each other, so there is just one representation of whatever concept we might be interested in. We can imagine them like a stack of Legos, one for each domain, each drawing on the best vocabulary that has been created for that domain.
For example, for observations, SNOMED CT is used, which is nowadays one of the best representations of the observations and diseases of patients. For the drug domain, RxNorm and its extensions are used to best represent the drugs that can be provided in a hospital. So we have this huge stack of different concepts, and all the normalization behind it, so that we can have a unique concept for each of them.
Speaker 3 – Gabriel Maeztu (17:12):
So, this is really nice, and what it provides us is a normalized way to generate all the evidence that we might want to create from real-world data. It can be used mainly for patient characterization, population-level estimation and patient-level prediction. In fact, the OMOP CDM is more focused on characterization and patient-level estimations, but these are just different ways of using this existing data to perform different kinds of analysis.
For example, on the right [of the slide] we can see how you can define an observation period, because we have these tables inside the OMOP schema, and inside this observation period we can define what we want to measure, for example the incidence of whatever outcome we might be interested in.
This would allow us to perform an incidence study really easily on top of the OMOP CDM. But not just that: we might also be able to perform a case-control design, where we look for subjects that we know are very similar because of the data that represent them, and we could use them as cases and controls to measure whether the outcome of interest, for example, is different between our cases and our controls.
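To make the incidence idea concrete, the following is a small Python sketch that counts first outcome events and person-time within each person's observation period. The data, the function name and the choice to stop time at risk at the first event are assumptions for this example, not the method used by any OHDSI package.

```python
from datetime import date

# Hypothetical observation periods: person_id -> (start, end), both inclusive.
observation_periods = {
    1: (date(2020, 1, 1), date(2020, 12, 31)),
    2: (date(2020, 1, 1), date(2020, 6, 30)),
    3: (date(2020, 3, 1), date(2020, 12, 31)),
}

# Hypothetical first outcome date per person (only person 2 has the outcome).
first_outcome = {2: date(2020, 4, 10)}


def incidence_rate(periods, outcomes):
    """Return (events, person_days, rate per person-year)."""
    events = 0
    person_days = 0
    for person_id, (start, end) in periods.items():
        event_date = outcomes.get(person_id)
        if event_date is not None and start <= event_date <= end:
            events += 1
            end = event_date  # time at risk stops at the first event
        person_days += (end - start).days + 1
    return events, person_days, events / person_days * 365.25


events, person_days, rate = incidence_rate(observation_periods, first_outcome)
print(events, person_days, round(rate, 3))
```

Outcomes that fall outside a person's observation period are ignored, which is why the observation period has to be defined before the incidence can be measured.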
Speaker 3 – Gabriel Maeztu (18:49):
So, this is really nice, and it can be defined not just by technical people who use all these databases. Thanks to the graphical user interface, anyone with a little experience in defining observational studies can define the cohorts of interest that they want to work with by themselves.
There is a tool called Atlas, which you can find by searching for “Atlas” on Google, and with it you can define your own cohort quite easily, just by asking for whatever you might be interested in. For example, you could define the index date and the index period to specify when you want to include a patient in your study. On the next slide we can go even further with this tool, and you can see the link down in the bottom right there.
Speaker 3 – Gabriel Maeztu (19:49):
You can go further and define not just when you want a patient to be included in the cohort, but also all the inclusion and exclusion criteria that a patient must meet to be part of that study. To give just one example, we are talking about the CHARYBDIS study. It’s one of the COVID studies that the OHDSI community has performed during this COVID period.
One of the inclusion criteria was that the patient has at least one occurrence of a measurement of a positive SARS-CoV-2 test. So, all those patients who had at least one of these occurrences within the index period that we previously defined would be included in the cohort. You can add as many inclusion or exclusion criteria as you want to define your own cohort.
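The inclusion rule described here can be sketched in a few lines of Python. This is only an illustration of the logic and not how Atlas or the CHARYBDIS study implements it: the study window, the record layout and the concept label are all hypothetical.

```python
from datetime import date

# Assumed index period for this sketch.
STUDY_START = date(2020, 1, 1)
STUDY_END = date(2020, 6, 30)

# Placeholder label standing in for a real standard concept ID.
POSITIVE_TEST_CONCEPT = "sars_cov_2_positive"

# Hypothetical measurement rows.
measurements = [
    {"person_id": 1, "concept": POSITIVE_TEST_CONCEPT, "date": date(2020, 3, 20)},
    {"person_id": 2, "concept": "other_lab_test", "date": date(2020, 3, 21)},
    {"person_id": 3, "concept": POSITIVE_TEST_CONCEPT, "date": date(2019, 11, 2)},
    {"person_id": 1, "concept": POSITIVE_TEST_CONCEPT, "date": date(2020, 4, 2)},
]


def build_cohort(rows):
    """Map each qualifying person to an index date (earliest qualifying test)."""
    cohort = {}
    for row in rows:
        in_window = STUDY_START <= row["date"] <= STUDY_END
        if row["concept"] == POSITIVE_TEST_CONCEPT and in_window:
            person_id = row["person_id"]
            cohort[person_id] = min(cohort.get(person_id, row["date"]), row["date"])
    return cohort


cohort = build_cohort(measurements)
print(cohort)
```

Person 2 is excluded because the test is not the qualifying measurement, and person 3 because the positive test falls outside the index period. Further inclusion or exclusion criteria would be additional filters applied to the same rows.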
Speaker 3 – Gabriel Maeztu (20:45):
So, these common data models really help us to perform near real-time studies on real-world EHR data, so we can access all the information that is available in these EHR systems once we have transformed it into this common data model. What this gives us is the ability to be fast and to be reproducible across different hospitals with large sample sizes, because we are not limiting ourselves to whatever someone might be able to find in the clinical records. We are in fact using the whole hospital and all the patients that meet the inclusion criteria to make up our sample.
There are some disadvantages. The problem is that most of the data used to feed the common data models is already existing, structured information. So, information that might be in free text, for example in the clinical notes that a physician has about their patient, or imaging tests, cannot be directly analyzed because it cannot find its place in the OMOP CDM.
So, that’s one of the problems. Others, of course, are that patients might be misclassified, or some of the problems might be underestimated because some of the data is lacking. Also, depending on the variables that we want to measure, it is sometimes very difficult to compare between them. Some of these disadvantages are ones that we have been working on recently.
Speaker 3 – Gabriel Maeztu (22:34):
And in fact, we think that this is the next generation of data collection. We are seeing that a new generation of observational data collection has arrived, and it’s here to stay. We’ve seen how in the past CRFs were totally digitalized thanks to eCRFs, and now how EHR data is being used to feed the same observational studies.
But in the end, what has really changed with a normalization process, and what gives us a lot of power to perform huge studies, is who makes the effort to make the data available. Up until now, we were dependent on someone being able to go through all the medical records and store all that data in a database.
Nowadays, thanks to this normalization process with the common data models, it’s a huge effort that is performed by technical companies that are able to perform these transformations, and we are no longer limited by the effort of one person, or many people, looking for all that data.
Speaker 3 – Gabriel Maeztu (23:53):
So, another challenge that we are facing is that almost 80% of the data in a hospital is in an unstructured format; roughly 65 to 85%, depending on the hospital, is stored, for example, as free text. It’s just clinicians writing about their patients, the whole narrative of what’s happening to them. This is very nice for the clinicians because they can understand really easily what’s happening with the patient. But the problem is that if we want to analyze this kind of data, even if we have a common data model, it is not enough, because we need to go a step further.
Speaker 3 – Gabriel Maeztu (24:36):
And that’s what we are mainly working on at IOMED. We use machine learning, and what we have done is create, you can imagine, small bots that are able to read all these clinical records and find all the data points of interest in them. As you can see on the right, there is an extract of a clinical record where we see that a patient has diabetes.
Diabetes in this case would be a condition (the colors are wrong here), but in the end what we are doing with all these machines is extracting all these data points of interest and transforming them into the OMOP CDM representation. So, by having the same semantics that we were explaining before and by putting all the data in the same common data model, we can enrich the available data more than 15 times thanks to mining all the data that’s available in the clinical records.
Speaker 3 – Gabriel Maeztu (25:41):
This is really nice, but one of the main problems that has worried hospitals a lot is having to share their own data with a third party. The nice thing about the OMOP CDM and the OHDSI effort in general is that they are very privacy-conscious and do not want a centralized way of accessing all the data.
In the end, what they do is push the analysis to the data, to all the hospitals. On the left we could imagine the systems in a hospital that might be a data partner inside the OHDSI network, and on the right, we could imagine ourselves, or someone, performing a clinical study. Once we know the study of interest, we can define, for example using Atlas, the cohorts and all the variables that we want to compute.
Speaker 3 – Gabriel Maeztu (26:29):
What is usually done is that the study is computed inside the hospital and just the results of the analysis are shared with the study coordinators. In this sense, all the hospitals that take part in one of these studies are just sharing the analysis and the results of their studies.
So, they can be pretty sure that all the privacy concerns that they usually have, and all the problems that regulators might find in the European Union, are avoided, because all the exploitation of the data, and also the transformation, is performed inside the hospitals’ own installations. And that really helps us to perform studies much faster, because we just have to worry about our study design and about being able to execute those analyses inside the hospitals.
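The federated pattern described here, where the analysis travels to the data and only results travel back, can be sketched as follows. This is a simplified Python illustration under assumed names and toy data; the real OHDSI study packages are written in R and are far more elaborate.

```python
def run_local_analysis(patients):
    """Runs inside one hospital and returns only aggregate counts."""
    n = len(patients)
    hospitalized = sum(1 for p in patients if p["hospitalized"])
    return {"n": n, "hospitalized": hospitalized}


def pool(site_results):
    """Runs at the study coordinator, which only ever sees aggregates."""
    total = sum(r["n"] for r in site_results)
    hospitalized = sum(r["hospitalized"] for r in site_results)
    return {
        "n": total,
        "hospitalized": hospitalized,
        "proportion": hospitalized / total,
    }


# Hypothetical patient-level rows that never leave each hospital.
site_a = [{"hospitalized": True}, {"hospitalized": False}, {"hospitalized": True}]
site_b = [{"hospitalized": False}, {"hospitalized": True}]

pooled = pool([run_local_analysis(site_a), run_local_analysis(site_b)])
print(pooled)
```

The design choice is that `pool` receives no patient-level rows at all, so the coordinator cannot reconstruct individual records from what it is given.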
Speaker 3 – Gabriel Maeztu (27:25):
There are also some challenges. As I told you, there might be some cohort definition problems. Whenever we want to define a cohort, we need to be sure that the kind of data point we need will be found in the hospital’s database. Also, there is a lot of confounding information that might lead to spurious associations; because of the large amount of data that is available, it is easy to over- or underestimate whatever we might be interested in for our study.
There are a lot of biases, the typical ones that we might find in any observational study, and there are also problems whenever some data is missing, for example when a patient has moved to another hospital or has died in a non-hospital environment.
Speaker 3 – Gabriel Maeztu (28:31):
So that kind of information is not recorded in their systems. Also, just as a little detail, how clinicians define their own work is very important, because one of the things that one might find is that hospitals in Europe and in the U.S. perform some kinds of procedures or diagnoses following different rule sets, and that might give them different semantics. So, even if we are provided with the same encoding for each of those diseases, how they perform their diagnoses changes how the data can be represented. Next slide, please.
So, without further delay, I will give the presentation back to you [Neus].
Speaker 2 – Neus Valveny (29:23):
Thank you very much, Gabriel, for this detailed overview of the common data models and the vocabularies that are used by OMOP.
Now we are going to explain how all these tools and new technologies can be applied in a specific example. This was a study that started in March 2020, entitled Real-World Characteristics, Management and Outcomes of Subjects Screened or Diagnosed with COVID-19 in Spain. And this was a truly collaborative effort from multiple stakeholders.
On one side we had five public hospitals that wanted to participate, two from Barcelona and three from the Basque Country. The study included patients diagnosed or hospitalized with COVID-19 and also additional controls diagnosed with influenza prior to the COVID pandemic, in order to compare the characteristics of these patients. It finally included almost 3,000 cases hospitalized in Hospital del Mar.
Speaker 2 – Neus Valveny (30:32):
The design was an observational, retrospective database study. The sponsors were Dr. Cossio from Vall d’Hebron hospital and Dr. Horcajada from Hospital del Mar. The partners executing the study were TFS, in charge of project management, regulatory and medical writing; IOMED, for data extraction, codification and analysis; and the OHDSI community, which provided the cohort definitions using OMOP and Atlas, as Gabriel explained before.
OHDSI also provided the analysis packages, in the R programming language, for analyzing the outcomes, and technical support for all data partners around the world executing their packages, because this is a global study that included partners around the world. We were in charge of some data sources in Spain only. Also, there was some partial funding from the EHDEN initiative to the participating hospitals, and also from Oxford University with funding from the Bill Gates Foundation.
Speaker 2 – Neus Valveny (31:48):
How did this study start? The first lockdown in Spain began on the 14th of March, 2020, and that same week the OHDSI symposium was going to take place in Oxford, which we [TFS] were going to attend. They of course canceled that symposium and replaced it with a global COVID study in order to provide real-world evidence about the new disease that was spreading around the world.
So, the entire OHDSI community decided to focus all its scientific efforts. More than 300 people around the world were collaborating in this global effort. TFS and IOMED wanted to collaborate in this effort, and we engaged three hospitals in the first meeting, which took place on the 17th of March. We must say that the three hospitals were very, very engaged in the study. They had a high interest in getting data from their patients because they didn’t know what was happening with those patients.
Speaker 2 – Neus Valveny (32:50):
They didn’t know the characteristics or the outcomes. They also knew us from prior collaborations: private clinical trials and observational studies. The main challenge that we found at that moment was that they didn’t know about OMOP. They didn’t know about OHDSI, and they didn’t know the power of the electronic health records (EHR) in the hospital or that they could be analyzed quickly. Another challenge was that the IT department of the hospitals had to be involved, as well as the infectious diseases department, because the principal investigator was an infectious disease physician. We needed IT because they had to give us access to the hospital records. The IT departments were very, very busy at that moment, as you can imagine. Also, there were some legal constraints that we will explain in the next slide.
Speaker 2 – Neus Valveny (33:47):
The regulatory pathway that we had to follow for this study was the standard one for a retrospective observational study in Spain. So, one central ethics committee (EC), which is the equivalent of the IRB in the U.S. We needed the approval for secondary data use. We obtained this approval within only five days, when the standard is between one and two months.
At that moment the ethics committee (EC) was focused only on COVID studies. To obtain this approval, we chose to write an umbrella protocol because OHDSI was in parallel developing its own protocols with several objectives. One was characterization; that protocol was named CHARYBDIS. Another protocol, focused on the effectiveness of drugs, was called SCYLLA, et cetera.
So, we wrote an umbrella protocol covering all these cohorts, objectives and endpoints. Then we also asked for an informed consent exemption. This was critical to run the study. We justified to the ethics committee that the people who did the analysis only had access to anonymized data and that access to data would be controlled, because, as you can imagine, we were asking for access to an entire hospital database.
Speaker 2 – Neus Valveny (35:10):
So, this was very important to convince the ethics committee, and they agreed and approved the study. We also had to go to the local ethics committees, which assessed the local aspects only. Again, this was very fast. We had to perform additional steps like registration in the EU register for observational studies, as we recommend for all sponsors. Also, we had to sign an agreement with all the hospitals. The standard timeline for the agreements is between two and four months for these types of studies.
But in this case, we must admit that we required more months. It was a long process because the legal departments were completely overwhelmed with a lot of studies starting at that moment. There were also some comments regarding GDPR compliance for this study, and we ensured that all sponsors, the CRO (Contract Research Organization) and investigators (PIs) were compliant with this European regulation for data protection. Also, we had to use open wording for the sample size because, as you can imagine, the COVID numbers were increasing very fast, so we didn’t know how many patients would be in the study. We used open wording that you can read here [see slide on video].
Speaker 2 – Neus Valveny (36:34):
Once we had all the regulatory process in place, we could go to the technical part. The technical part comprised five steps. One was a separate contract between IOMED and the IT departments at the hospitals so that they could access the hospital data, which was, as you can imagine, not anonymized.
Then this company had a data anonymization process, in which they remove all the personal attributes from the electronic health records (EHR). This is really important. Then they converted the anonymized database to OMOP, as Gabriel explained before. They converted all the terms in the electronic health records (EHR) into a single number or concept ID. So, all the variables, the drugs, the conditions, et cetera are converted into a number. Then, on this converted database, they executed the R package from OHDSI. The same R package was executed in the U.S., Korea, Italy, and in countries around the world.
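As an illustration of the conversion step described above, here is a minimal Python sketch of mapping source terms to a single concept ID. The vocabulary names, source codes, and concept IDs are hypothetical placeholders, not real OMOP concept IDs, and this is not IOMED's actual pipeline. The only convention assumed from OMOP is that 0 marks a term with no matching concept.

```python
# Hypothetical lookup: (source vocabulary, source code) -> standard concept ID.
# All IDs below are placeholders for illustration, not real OMOP concept IDs.
SOURCE_TO_CONCEPT = {
    ("ICD10", "E11"): 1001,            # a condition
    ("ICD10", "I10"): 1002,            # another condition
    ("LOCAL_DRUG", "METF-850"): 2001,  # a drug coded in a local hospital catalog
}

def to_concept_id(vocabulary, code):
    """Return the mapped concept ID, or 0 when no mapping is known."""
    return SOURCE_TO_CONCEPT.get((vocabulary, code), 0)

# Made-up anonymized records as they might look before conversion.
records = [
    {"patient": "p1", "vocabulary": "ICD10", "code": "E11"},
    {"patient": "p1", "vocabulary": "LOCAL_DRUG", "code": "METF-850"},
    {"patient": "p2", "vocabulary": "ICD10", "code": "Z99"},  # no mapping available
]

converted = [
    {"patient": r["patient"], "concept_id": to_concept_id(r["vocabulary"], r["code"])}
    for r in records
]

for row in converted:
    print(row)
```

Because every site maps its own codes to the same set of numbers, the same analysis code can then run unchanged on each converted database.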
Speaker 2 – Neus Valveny (37:39):
And you can find this package mentioned here on this webpage [URL on the slide]. And finally, we obtained such a great amount of results that we had to upload them to a web application, a Shiny app, where you can easily review these results, because otherwise it’s impossible. It’s not user-friendly to review them as you receive them in the output. And the Shiny app is in this link here [on the slide].
On the next slide you can view the QR code. I encourage you to go to this webpage and review these results so that you can see results not only from Spain but from around the world. On the next slide you can see the main characteristics of the results. We want to explain this because it is really important. Once we had access to the hospital records in OMOP, the whole package was run in only one week.
Speaker 2 – Neus Valveny (38:44):
So, this is very important. In this one week we had access to more than five hospitalization and visit information, more than half a million subjects with information from the last 20 years in the hospital. As you can see, this tool is very powerful.
We had information for all inpatient care, outpatient specialist care, emergency room visits, and partial information from other settings. We obtained the outputs containing more than 2 million covariates in four domains. So one domain is the cohorts, the other is the demographic data, the drugs and the conditions.
All these variables were obtained for more than 1000 cohorts or different strata that comprised small variations in the observation period required for the cohort, the prior follow-up, the comorbidities, et cetera. So, for example, we divided results between patients with diabetes and without diabetes, with hypertension and without hypertension, et cetera. And these are in the next slide.
Speaker 2 – Neus Valveny (39:48):
You can see what these results looked like at the beginning, when we got them directly from the package. These were CSV files, and we received seven of them. Some of these CSV files contain the dictionary of the covariates, because the covariates are all defined by a number, and also the cohorts and the features. A cohort is a group of individuals and a feature is a group of covariates.
So, for example, the feature diabetes mellitus comprises several covariates. One covariate is diabetes as a condition. Another covariate, for example, is an antidiabetic drug, et cetera. So, the feature diabetes is composed of multiple concept IDs. And here you can see the main results are in the fourth table here.
And on this slide is the covariate value table. In this table you can see, in column A, the cohort identification number. For example, in cohort 111, the covariate 100001 has a mean value of 0.27. In this case, this is a categorical variable. So, even if it says mean, it’s a proportion. That means that 27% of patients in this cohort had this covariate.
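The point that the "mean" of a yes/no covariate is really a proportion can be checked with a few lines of Python. The cohort below is made up (100 patients, 27 of them with the covariate) purely to mirror the 0.27 value mentioned above.

```python
# Made-up cohort of 100 patients: 1 means the covariate is present, 0 absent.
has_covariate = [1] * 27 + [0] * 73

# The arithmetic mean of a 0/1 indicator is the fraction of patients coded 1.
mean_value = sum(has_covariate) / len(has_covariate)

print(f"mean = {mean_value:.2f}")
print(f"proportion of patients with the covariate = {mean_value:.0%}")
```

The same reading does not apply to continuous covariates such as age, where the mean is an ordinary average.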
Speaker 2 – Neus Valveny (41:14):
On the next slide, you can see how all these results were combined into a single file. That couldn’t happen in Excel because it’s so large. As you can see here, it contains almost 4 million rows, so it cannot be opened in Excel. We opened it, for example, here in Power BI, and here you can see, for example, several prevalences of conditions in different cohorts.
This is an example only. In the next slide you can see the user-friendly format of the results. This is the Shiny app that we were referring to before. In this app there are some dropdown menus where you can pick the database, the cohort that you want to look at, and the strata. You can pick, for example, two cohorts, the target and the comparator, and the domain. And you can also look at the time window.
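Combining the output files into one readable table amounts to joining the covariate value table with the cohort and covariate dictionaries. The sketch below uses only Python's standard library; the column names and the file contents are assumptions made for illustration and are not the package's actual output layout.

```python
import csv
import io

# Tiny in-memory stand-ins for three of the CSV outputs (contents are made up).
cohort_csv = "cohort_id,cohort_name\n111,Example cohort\n"
covariate_csv = "covariate_id,covariate_name\n100001,Example covariate\n"
value_csv = "cohort_id,covariate_id,mean\n111,100001,0.27\n"

def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

# Dictionaries that translate numeric identifiers into readable names.
cohort_names = {r["cohort_id"]: r["cohort_name"] for r in read_rows(cohort_csv)}
covariate_names = {r["covariate_id"]: r["covariate_name"] for r in read_rows(covariate_csv)}

# One combined row per (cohort, covariate) pair, with names instead of numbers.
combined = [
    {
        "cohort": cohort_names[r["cohort_id"]],
        "covariate": covariate_names[r["covariate_id"]],
        "proportion": float(r["mean"]),
    }
    for r in read_rows(value_csv)
]

for row in combined:
    print(row)
```

With millions of rows, the same join would normally be done in a database or a data-frame library rather than in plain lists.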
Speaker 2 – Neus Valveny (42:10):
You can look at that covariate before the index date or after the index date. And here, this is only an example; it has very nice plots where you can compare the prevalence between one cohort and the other. For example, this dot in red is the prevalence of hypertension compared between patients who entered intensive care versus those not entering intensive care. And as you can see, it was much more prevalent. Patients in intensive care had more hypertension.
In the next slide you can see another example. I encourage you to go to the Shiny app and look at several results because it’s really nice to see how powerful it is and how many covariates and how much information is there. For example, if you want to compare the use of anti in patients with and without diabetes, you can see that the use is more than three times higher in patients with diabetes.
Speaker 2 – Neus Valveny (43:09):
They also had more hypertension. This is a comparison between hospitalized patients. In the next slide, I will explain the three publications including results from this study that are already in the public domain. This is the first one. This is a descriptive study of the entire cohorts from around the world.
It contains more than 4 million COVID cases and it has really nice pictures displaying, for example, the age and gender distribution of patients entering the hospital and entering the intensive care units across the data sources. Here, you must remember that not all data sources entered in the same wave or month of COVID.
So, this can explain some of the differences in the age and gender distribution. This is the second manuscript, which was accepted in the British Medical Journal and describes the use of repurposed drugs in the COVID pandemic.
Speaker 2 – Neus Valveny (44:20):
For example, the yellow line is Hospital del Mar. You can see that at the beginning of March, more than 75% of patients were receiving hydroxychloroquine because it seemed that it was effective. But after a few months more evidence was available showing that hydroxychloroquine was not effective. So, all hospitals stopped using it. In the next slide you can see another publication. This is the third one, accepted by British Medical Journal Open, and it’s a comparison of outcomes between patients with hypertension and without hypertension in these cohorts. For example, you can see that the mortality in Hospital del Mar was 14% in patients with hypertension versus less than 4% in patients without hypertension. This clearly indicates that hypertension is a risk factor for worse COVID prognosis. And also, you can compare, for example, the outcomes between several data sources.
Speaker 2 – Neus Valveny (45:27):
For example, Hospital del Mar had similar outcomes versus the database from Veterans Affairs in the U.S. However, the mortality in the Optum database was lower, and you always need to take into account the sociodemographic characteristics of the data sources. In this case, Optum had younger patients and more women than Hospital del Mar. So, a basic quality check that you need to do when you compare results between data sources is to always take a look at least at demographics, because age and gender are correlated with almost all outcomes.
It’s very important to take that into account. I think this was the last one. Yeah, let’s go to the conclusion. We have reviewed database studies using electronic health records and common data models. In this case, in this example, using OMOP, the advantages are that you can have relatively quick results.
Speaker 2 – Neus Valveny (46:32):
Only a few weeks after the whole regulatory process is in place. They have lower costs versus a traditional non-interventional study, because you don’t need an EDC system, you don’t need to transcribe data into a CRF, et cetera. It’s not time consuming for the investigators, other than, of course, involving them in study design, cohort definitions, and so on. We recommend, of course, doing the study design with investigators.
It allows a federated approach, which means that you can replicate the study across many countries and data sources very fast. And also, of course, it allows you to obtain a large amount of data. The main challenges are the potential misclassification or missing data in the electronic health records, and potential biases as in all retrospective studies, for example, indication bias or immortal time bias. So, it’s very important to define the cohorts very well and also to analyze the data very well.
Speaker 2 – Neus Valveny (47:32):
For example, if you want to compare cohorts, you will probably need multivariable analysis or to adjust for confounders, for example with propensity score matching. You may also need to use positive or negative controls, because you cannot adjust for multiplicity: you are doing so many analyses that if you adjust the P values, you simply get lost.
So, for example, OHDSI is using positive and negative controls to ensure that the comparisons are fine. And finally, if you want to extrapolate the ratio, you may need empirical calibration of the model parameters. And that’s all.
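To make the idea of negative controls and empirical calibration more concrete, here is a deliberately simplified Python sketch. It is not the OHDSI implementation: it assumes made-up effect estimates, summarizes the negative controls with a plain mean and standard deviation, and ignores each negative control's own standard error. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

# Made-up log relative risks for negative controls, i.e. exposure-outcome
# pairs where the true effect is assumed to be null (log RR = 0).
negative_control_log_rr = [0.10, 0.25, -0.05, 0.30, 0.15, 0.20, 0.05, 0.35]

# Empirical null: the spread of these estimates reflects systematic error.
# Simplification: each negative control's own sampling error is ignored.
null_mean = mean(negative_control_log_rr)
null_sd = stdev(negative_control_log_rr)

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def traditional_p(log_rr, se):
    """Usual p-value: compares the estimate against a null centered at zero."""
    return two_sided_p(log_rr / se)

def calibrated_p(log_rr, se):
    """Compares the estimate against the empirical null instead of zero."""
    return two_sided_p((log_rr - null_mean) / math.sqrt(null_sd**2 + se**2))

# Made-up estimate for the comparison of interest: RR = 1.5, SE = 0.15.
log_rr, se = math.log(1.5), 0.15
print(f"traditional p = {traditional_p(log_rr, se):.4f}")
print(f"calibrated p  = {calibrated_p(log_rr, se):.4f}")
```

In this made-up example the negative controls themselves lean above zero, so the calibrated p-value is larger than the traditional one: part of the apparent effect is attributed to systematic error rather than to the exposure.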
We encourage all of you to use these common data models to do your real-world evidence (RWE) studies. We really think that initiatives around the world like the European Health Data and Evidence Network, which is providing funding to European hospitals for doing this conversion, will help in this process, because in the next five years we expect more than 100 data partners to have their hospital databases in this model. So, we are looking forward to working with all of you on these types of studies. Thank you very much.
Speaker 1 – Ryan Muse (48:50):
Well thank you very much, both of you, for that insightful presentation. Before we move on to our Q&A session, we have a poll question for audience members. This should be appearing on everyone’s screen right now. You can participate by selecting any of the answers you see in front of you and then clicking submit.
The question that we have for you asks,
“Would you consider this data collection method for your next study?”
Your answer options are:
- Yes
- Yes, but only using structured data. No NLP.
- No, I prefer an eCRF.
We’ll give everyone a few seconds to consider their answer, how it best applies to themselves, their company. The question again being would you consider this data collection method for your next study?
Speaker 1 – Ryan Muse (49:34):
It looks like most of you have voted. Thank you very much for participating. Let’s take a look at your results. 69% of you have said yes, 23% no, and 8% yes, but only using structured data.
So, thank you very much again for that participation. And now I would like to invite the audience to continue sending your questions or comments right now using the questions window for this Q&A portion of the webinar. I’ve already received some questions, so we’ll get ourselves started with those.
The very first question that we have for you asks,
“If a subject from the study must be re-identified, for example has a rare adverse reaction, can this be done and how?”
Speaker 3 – Gabriel Maeztu (50:21):
Sure, happy to answer that. So, this is a process that can be performed. It’s not an easy process because, as we said, once the analysis process has finished and the results go outside the hospital, all the data is anonymized. So we cannot in any way re-identify that patient.
But usually all the hospitals (well, they have to) store the results of all the analyses, and also all the audits. So, they do have all the information about all the participants of that study, but the hospital is the one that has all that data. And they are the ones who are able to identify the patients in case there is a rare adverse effect and you do need to identify the patient. So, it’s possible, but not by the ones working on the study, only by the hospitals, using a very specific process.
Speaker 1 – Ryan Muse (51:23):
Excellent. Thank you very much. The next question that we have for you asks,
“Which quality controls must be done to the outputs? Can code misclassifications be identified and corrected?”
Speaker 3 – Gabriel Maeztu (51:37):
Okay, I’ll take this one too. So yeah, indeed. One of the things that is very, very important is to perform a data verification and a data validation process. So, in that sense, whenever an OMOP CDM transformation process is going on, it’s very important to understand how the information is represented in the hospital and what the semantics behind it are.
So, it’s very typical to work with the hospital to understand how they are storing the information, but also with the clinicians to understand whether, for example, the data represented and the results make sense or not. Because sometimes you might find out that there is some misclassification of something that can be easily redone and rechecked.
So, that’s why a data validation and data verification process is part of these kinds of studies: to check with the clinicians that all the data makes sense and that the results are what would be expected.
Speaker 1 – Ryan Muse (52:41):
Wonderful, thank you very much. Another question we have for you here would like to know,
“Can data from several hospitals be pooled at an individual level and if so, what steps are needed?”
Speaker 2 – Neus Valveny (52:55):
I can take that one, Gabriel. Yes, it’s possible to pool data from several hospitals provided that the ethics committee (EC) and the site contract allow for that. So, of course, when we run the analysis, we need access to individual data from the hospital, but the site contract defines whether this data can be pooled or not with data from the other hospitals.
So, it’s part of the regulatory process to think ahead when you design the study. If you foresee that you will need to pool data, because for example it’s a rare disease and you have only a few cases from each hospital, you need to convince both the ethics committee and the hospital, and specify this very well in the agreement. Then, again, you will fulfill the GDPR or HIPAA or whatever confidentiality law is in place.
Speaker 1 – Ryan Muse (53:55):
Alright, thank you for that. The next question we have asks, well first states that,
“There are differences in the way of diagnosing and therefore coding the pathologies and procedures between countries and continents. What consequences can this have in the study?”
Speaker 3 – Gabriel Maeztu (54:14):
Sure, so that’s a known limitation and it must be checked whenever the study is designed. It’s a design limitation that you must take into account when you are defining the study: for whatever information you’re looking for, you need to understand how that information is managed by the different clinicians in the different hospitals. So, that’s a very important thing that the scientific writing and the design part of the study must address, to be able to really take those kinds of problems into account. But once those problems are identified, before defining the cohorts, you can define different cohorts that might not be exactly the same for each region, so you can really compare them much better, or maybe have more, smaller cohorts. That can even be changed in a second iteration, because one of the nice things about these kinds of studies is that you can iterate easily without having to repeat the whole data collection and acquisition process, which is usually the huge and painful part of any study. You can change a definition, get an okay from everyone, and keep going with the study.
Speaker 1 – Ryan Muse (55:38):
Excellent. Thank you very much for that answer and for all of the answers today. However, we have reached the end of the Q&A portion of this webinar.
Now, if we couldn’t attend to your questions, the team at TFS HealthScience may follow up with you or if you have further questions, you can direct them to the email addresses that are up on your screen. I want to thank everyone for participating in today’s webinar. You’ll be receiving a follow-up email from Xtalks with access to the recorded archive for this event. A survey window will be popping up on your screen as you exit, and your participation is appreciated as it will help us to improve our webinars.
Now I’m about to send you a link in the chat box and with this link you’ll be able to view the recording of this event on this page and you can also share this link with your colleagues when they register for the recording here as well. So, I encourage you to do that now.
Please join me once more in thanking our speakers for their time here today. We hope that you all found the webinar informative. Have a great day everyone, and thanks for coming.
Speaker 2 – Neus Valveny (56:35):
Thank you, Ryan. Thank you all for attending.
Speaker 3 – Gabriel Maeztu (56:40):
Thank you very much. It was a pleasure.