The Harmonization CoPilot is a powerful tool designed to streamline the transformation of electronic health record (EHR) data into the OMOP common data model. This tutorial explores the process of harmonizing data using the Harmonization CoPilot, guiding you through the key steps involved in syntactic and semantic mapping, as well as running the data harmonization process. In this example, we'll transform three clinical data tables into their corresponding OMOP tables: Person, Visit Occurrence, and Measurement.
Overview
Harmonizing data is a multi-step process that involves both schema transformations (aka syntactic mapping) and code transformations (aka semantic mapping). Syntactic mapping establishes the structural foundation by defining how source tables and columns map to the target model. Once that foundation is in place, semantic mapping ensures that the data values themselves align with standard terminologies, allowing fields such as race, ethnicity, service type, and lab test names to conform to OMOP’s standard codes and improving consistency and interoperability. This tutorial walks through three steps:
1. Create a Syntactic Mapping – Map the three source tables to their corresponding OMOP tables.
2. Create the Required Semantic Mappings – Harmonize race, ethnicity, type of service codes, and lab test names to OMOP standard codes.
3. Run the Data Harmonization Code Object – Apply the syntactic and semantic mappings to transform the data.
Step 1: Investigate Your Data
Example data files can be downloaded from github.com/RhinoHealth/user-resources/tree/main/tutorials if you'd like to replicate the steps described below.
The first step in any data harmonization project is to develop a deep understanding of the source data being transformed. The following questions can serve as a guide when reviewing such data:
1. What information am I standardizing? In this example, the goal is to standardize data about patients treated within my health system, their clinical visits (both inpatient and outpatient), and laboratory test results.
2. In which data tables is the information located? Often the relevant data is spread across multiple data warehouse tables (a scenario the Copilot supports). The scenario in this tutorial is straightforward, however, because each type of clinical event is stored in its own table.
3. In which columns is the information located? When transforming data to a common data model, you'll isolate relevant columns in your source data. Columns with information not required by the data model can be ignored.
4. How are the columns encoded? Often, important information like laboratory tests and hospital procedures is represented using vocabularies that are specific to a given hospital or health system. You'll want to identify these as targets for AI-powered semantic mapping. If columns already use standard vocabularies, semantic mapping via generative AI may not be necessary.
Once you've developed a deep understanding of the data, import the relevant tables onto the Rhino Federated Computing Platform.
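While this review often happens in a database client or a spreadsheet, a quick script can help surface locally coded columns. Here is a minimal sketch using pandas, assuming the example CSV files from the tutorial repository (the file names below are illustrative):

```python
import pandas as pd

# Illustrative file names; substitute the example files from the
# RhinoHealth/user-resources tutorial repository.
tables = {
    "My Patient Data": pd.read_csv("my_patient_data.csv"),
    "My Encounter Data": pd.read_csv("my_encounter_data.csv"),
    "My Laboratory Data": pd.read_csv("my_laboratory_data.csv"),
}

for name, df in tables.items():
    print(f"--- {name} ---")
    print(df.dtypes)  # which columns exist, and how they are typed

    # Low-cardinality columns are often locally coded fields (race,
    # service type, test names) -- prime candidates for semantic mapping.
    for col in df.columns:
        if df[col].nunique(dropna=True) <= 20:
            print(f"{col}: {df[col].dropna().unique().tolist()}")
```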
Step 2: Create Syntactic Mapping
Syntactic mapping serves as the blueprint for translating your source data into the OMOP model. It defines how each table and field will be mapped and transformed. Within the Harmonization CoPilot, users can create and manage these mappings through an intuitive interface. Follow these steps to create a new syntactic mapping:
1. Select Data Mappings on the left toolbar.
2. Select Syntactic Mappings on the tab menu.
3. Select the Create Syntactic Mapping button.
4. Select OMOP V5.4 as the target data model.
5. Select Manually Configure. This will let you use the user interface to design your syntactic mapping. Alternatively, you can upload a JSON file that specifies the mapping; we recommend that route only for power users!
6. Select your Source Data Schemas. You'll want to select the auto-generated schemas associated with each of the source datasets you uploaded in the previous step. If your dataset is named 'My Patient Data', for example, the schema will be named 'My Patient Data (v0)'. If you want to transform 3 source datasets, you'll select 3 schemas!
7. Select the Target Tables that you are transforming data into. Refer to the [OMOP Common Data Model documentation](https://ohdsi.github.io/CommonDataModel/cdm54.html) to identify the relevant target tables. In this case, we are transforming 'My Patient Data' into the OMOP 'Person' table, 'My Encounter Data' into the OMOP 'Visit Occurrence' table, and 'My Laboratory Data' into the OMOP 'Measurement' table.
Using the Graphical Interface to Design Your ETL
Once a syntactic mapping is created, the dialog for mapping source to target tables appears automatically. The Target Field column lists all the columns associated with the target tables selected in the previous step. A red asterisk indicates that a target field is required; be sure to provide mappings for all of these before running the harmonization, or you'll run into trouble later!
For each required target field, select one or more source fields to map to it. Click the source field dropdown to choose fields from any of the syntactic mapping's input data schemas. If you select more than one field, make sure they all come from the same schema.
If the source data needs to be modified in any way to meet the requirements of the target field, specify a transformation in the **Transformations** column. You can create and edit transformations by clicking the pencil icon in that column.
Column Transformations: The 'Workhorse' of Data Harmonization
In the Copilot, Transformations do the real work of data harmonization by modifying source data to comply with the specifications of a target data model.
In mapping these three source tables to the OMOP Common Data Model, we used the following transformation types:
Custom Mapping: Maps values from the source to corresponding values in the target dataset, like mapping a number to the day of the week. In this example, a Custom Mapping transformation was used to map the source 'Gender' field to the 'gender_concept_id' field in the OMOP Person table. The transformation was specified as a CSV and pasted into the entry box:
```
Male, 8251
Female, 8329
Other, 8391
```
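Conceptually, a Custom Mapping is a value-for-value lookup. The Copilot applies it for you, but the pandas sketch below shows the equivalent operation on the CSV above (the DataFrame is illustrative):

```python
import io

import pandas as pd

# The same two-column CSV pasted into the Custom Mapping entry box.
mapping_csv = """Male, 8251
Female, 8329
Other, 8391"""

# Parse into a {source value: target value} lookup table.
df_map = pd.read_csv(io.StringIO(mapping_csv), header=None, skipinitialspace=True)
lookup = dict(zip(df_map[0], df_map[1]))

# Apply the lookup to an illustrative source column.
patients = pd.DataFrame({"Gender": ["Female", "Male", "Other"]})
patients["gender_concept_id"] = patients["Gender"].map(lookup)
print(patients)
```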
Semantic Mapping: Applies a semantic mapping to transform values, like mapping input text to an OMOP concept ID. OMOP domains are helpful high-level categories that allow non-experts to select the appropriate set of standardized codes to map non-standard codes to.
In this example, four semantic mappings were created:
1. Source column gender to OMOP Gender domain.
2. Source column race to OMOP Race domain.
3. Source column service_type to OMOP Visit domain.
4. Source column test_name to OMOP Measurement domain.
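Once a semantic mapping has been built (the Copilot builds it with generative AI), applying it is again conceptually a lookup, with unmapped values conventionally falling back to concept 0 ('No matching concept'). Here is a toy sketch; the concept IDs below are placeholders, not real OMOP concepts:

```python
import pandas as pd

# Placeholder concept IDs -- a real semantic mapping would resolve each
# test name to a standard concept in the OMOP Measurement domain.
measurement_mapping = {
    "HGB A1C": 1111111,
    "FASTING GLUCOSE": 2222222,
}

labs = pd.DataFrame({"test_name": ["HGB A1C", "FASTING GLUCOSE", "MYSTERY PANEL"]})

# Unmapped source values fall back to concept 0 ("No matching concept").
labs["measurement_concept_id"] = (
    labs["test_name"].map(measurement_mapping).fillna(0).astype(int)
)
print(labs)
```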
Set Value: Assigns a specific value to all rows in the field, like setting all values to the number 1.
(Helpful Hint: if information is missing for a required OMOP field, use a Set Value of 0, the concept_id for 'No matching concept'.)
Convert Date: Changes the date to a different format. For example, you can use this to convert a date from MM/DD/YY format to YYYY-MM-DD.
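In plain Python, the equivalent logic is a reparse-and-reformat of the date string (the format strings below are illustrative):

```python
from datetime import datetime

def convert_date(value: str, src_fmt: str = "%m/%d/%y", dst_fmt: str = "%Y-%m-%d") -> str:
    """Reparse a date string and re-emit it in another format."""
    return datetime.strptime(value, src_fmt).strftime(dst_fmt)

print(convert_date("03/15/24"))  # -> 2024-03-15
```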
Stable UUID: Generates a deterministic, non-reversible unique identifier from the input. The same input will always generate the same unique identifier.
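Name-based UUIDs (UUIDv5) are one standard way to get this behavior: the identifier is a one-way hash of the input under a fixed namespace, so it is stable across runs but cannot be reversed to recover the input. A sketch with an illustrative namespace (the Copilot's actual implementation may differ):

```python
import uuid

# Fixed namespace: the same namespace + input always yields the same UUID.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.my-health-system.org")

def stable_uuid(source_id: str) -> str:
    """Derive a deterministic, non-reversible identifier from a source ID."""
    return str(uuid.uuid5(NAMESPACE, source_id))

print(stable_uuid("patient-0042"))
print(stable_uuid("patient-0042") == stable_uuid("patient-0042"))  # True
```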
Step 3: Execute Data Harmonization via the User Interface or SDK
With both syntactic and semantic mappings in place, the final step is executing the data harmonization process. The Harmonization CoPilot automates this transformation, applying the defined mappings to convert raw EHR data into the OMOP standard. This can be performed either through the user interface of Rhino's web platform or via Rhino's Python SDK; the latter enables users to build automated data pipelines. The process takes place within the secure environment of the Rhino Client, maintaining compliance with data privacy policies.
Execute Data Harmonization via the User Interface: Ideal for One-Time Harmonization
To initiate the harmonization process, users select the relevant syntactic mapping, input datasets, and associated semantic mappings. The system then processes the data, applying transformations and standardizations in a structured manner. Throughout this process, users can monitor progress and review detailed logs to ensure accuracy. Once completed, the harmonized dataset is stored within the same environment, ready for further analysis or export.