Dataset De-identification

Overview

The Rhino Health Platform supports four approaches to Dataset de-identification:

  1. De-identifying data before importing it as a Dataset
  2. HIPAA Safe Harbor De-identification using the Rhino FCP's built-in capabilities
  3. De-identification using custom Code Objects
  4. Interactive de-identification using Interactive Containers

 

De-identifying Data Before Importing

This approach does not involve the Rhino Health Platform directly: users can always de-identify data with any method or utility of their choice before making it accessible on the Rhino FCP.

The de-identified data can then be copied to the Rhino Client (e.g., via SFTP) and imported as a Dataset into the FCP. When specifying the configuration of the new Dataset, set the import parameter Is Data De-identified? to Yes.

 

HIPAA Safe Harbor De-identification

The FCP supports HIPAA Safe Harbor De-identification for some Dataset data types. It is implemented through the field types defined in the Data Schema, each of which can have specific de-identification logic associated with it. For example, fields of type Age are de-identified by setting all values ≥90 to 90. The Supported Data Schema Data Types user guide gives the full list of FCP data types and their associated de-identification logic.

To apply HIPAA Safe Harbor De-identification, set May Contain PHI to True for the relevant fields in the Data Schema. Then, when importing a Dataset, set the import parameter Is Data De-identified? to No in the Dataset's configuration. This triggers the following de-identification logic:

  • Any field marked in the Data Schema with May Contain PHI as False is copied unchanged to the de-identified data.
  • Any field marked with May Contain PHI as True whose field type has associated de-identification logic (e.g., Age; see Supported Data Schema Data Types) has that logic applied to produce the de-identified values for the field.
  • Any field marked with May Contain PHI as True whose field type has no associated de-identification logic (e.g., String; see Supported Data Schema Data Types) is omitted from the de-identified data (the entire column is removed).
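The three rules above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the FCP's actual implementation: the schema layout, the `DEID_LOGIC` registry, the `deidentify_rows` helper, and the `Float` type name are all hypothetical. Only the Age rule (values ≥90 become 90) is taken from this guide.

```python
# Hypothetical registry mapping a field type to its de-identification function.
# A type that is absent from the registry (e.g., String) is treated as having
# no de-identification logic.
def cap_age(value):
    """Age rule from this guide: all values >= 90 are set to 90."""
    return min(value, 90)


DEID_LOGIC = {"Age": cap_age}


def deidentify_rows(rows, schema):
    """Apply the three rules to a list of row dicts.

    `schema` maps field name -> {"type": str, "may_contain_phi": bool}
    (an assumed layout, used here only for illustration).
    """
    result = []
    for row in rows:
        clean = {}
        for name, spec in schema.items():
            if not spec["may_contain_phi"]:
                # Rule 1: not PHI, copy the value over unchanged.
                clean[name] = row[name]
            elif spec["type"] in DEID_LOGIC:
                # Rule 2: PHI with type-specific logic, apply that logic.
                clean[name] = DEID_LOGIC[spec["type"]](row[name])
            # Rule 3: PHI without logic, the column is omitted entirely.
        result.append(clean)
    return result


schema = {
    "weight_kg": {"type": "Float", "may_contain_phi": False},
    "age": {"type": "Age", "may_contain_phi": True},
    "notes": {"type": "String", "may_contain_phi": True},
}
rows = [
    {"weight_kg": 70.5, "age": 93, "notes": "free text"},
    {"weight_kg": 61.0, "age": 42, "notes": ""},
]
deidentified = deidentify_rows(rows, schema)
```

In this sketch the `notes` column disappears from every row, `age` is capped at 90, and `weight_kg` passes through untouched.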

Any information needed to translate de-identified data back to raw data (e.g., reverse crosswalk tables for newly generated de-identified UIDs) is stored by the Rhino Health Platform along with the raw Dataset data and is treated as PHI.
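To make the idea of a reverse crosswalk concrete, here is a minimal sketch. This guide does not describe how the platform generates de-identified UIDs or stores the crosswalk, so the `pseudonymize_ids` helper and the use of random UUIDs are assumptions made purely for illustration.

```python
import uuid


def pseudonymize_ids(raw_ids):
    """Replace each raw ID with a newly generated random ID.

    Returns the de-identified IDs (in input order) and a reverse crosswalk
    mapping each new ID back to the raw ID. The crosswalk itself is PHI and
    must be stored with the raw data, never with the de-identified data.
    """
    forward = {}
    for raw in raw_ids:
        if raw not in forward:
            # The same raw ID always maps to the same new ID within one run.
            forward[raw] = uuid.uuid4().hex
    new_ids = [forward[raw] for raw in raw_ids]
    reverse_crosswalk = {new: raw for raw, new in forward.items()}
    return new_ids, reverse_crosswalk


raw_ids = ["MRN-001", "MRN-002", "MRN-001"]
new_ids, reverse_crosswalk = pseudonymize_ids(raw_ids)
```

Because the new IDs are random rather than derived from the raw values, the de-identified data alone cannot be mapped back to the raw IDs; only a holder of the crosswalk can do that.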

 

De-identification with Custom Code Objects

Users can apply custom de-identification logic with any library they choose by using Rhino Health’s Generalized Compute (GC) Code capability. Users can package open-source software, third-party tools, or their own code into a container and run it on the FCP.

The code should read the raw Dataset from /input/ (within the container) and write the clean (de-identified) Dataset to /output/ (within the container). For more information on how to use custom code on the FCP, click here.
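A custom Code Object following this convention might look like the sketch below. Only the /input and /output locations come from this guide. The assumptions that the raw Dataset arrives as one or more CSV files, the column names in `PHI_COLUMNS`, and the helper names `deidentify_csv` and `run` are all hypothetical, and dropping columns stands in for whatever de-identification logic is actually needed.

```python
import csv
from pathlib import Path

# Hypothetical names of columns that must not appear in the output.
PHI_COLUMNS = {"patient_name", "mrn"}


def deidentify_csv(src, dst):
    """Copy one CSV file from src to dst, dropping the PHI columns."""
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.DictReader(fin)
        kept = [c for c in (reader.fieldnames or []) if c not in PHI_COLUMNS]
        # extrasaction="ignore" silently skips the dropped columns in each row.
        writer = csv.DictWriter(fout, fieldnames=kept, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)


def run(input_dir="/input", output_dir="/output"):
    """De-identify every CSV file in input_dir into output_dir."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.csv")):
        dst = out_dir / src.name
        deidentify_csv(src, dst)
        written.append(dst)
    return written
```

Inside the container, the entry point would call `run()` with its defaults. The directories are parameters only so that the same logic can be tried locally against temporary folders before the container is built.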

Secure Access can be used to review the de-identified Dataset and verify that no PHI was accidentally left intact. This review can be performed either by members of the site from which the Dataset was imported or by another collaborator (after the owner has provided access via a Secure Access List).

 

It is also possible to limit visibility of the raw Dataset and/or the de-identified Dataset before it has been reviewed or approved. No one can access the patient-level data unless they have been granted access via Secure Access.

To further limit access to aggregate statistical information about the Dataset before it has been approved (e.g., for Categorical fields where PHI may remain), users have two options:

  1. Set the default permissions for viewing Dataset analytics to their own site only. The project lead and other project collaborators will then be unable to view aggregate statistics about the site's Datasets.
  2. Create a separate project for working with the Dataset before it has been validated as de-identified. Once the Dataset has been validated as sufficiently de-identified, the de-identified data can be exported from the “de-identification” project and re-imported into the “real” project.

 

Interactive De-identification with Interactive Containers

In addition to applying custom de-identification logic that will run remotely using GC, users can utilize the FCP's Interactive Container capability to create Code Objects that enable them to perform de-identification in an interactive session.

Users can build a container around any application that runs on Linux and then run that container in an interactive session on the FCP.

Users can connect to the UI of this interactive session and use the application to perform de-identification. The raw Dataset will be available at /input, and after de-identification, users can store the clean (de-identified) Dataset to /output.

As with de-identification using GC, Secure Access can be used to review the de-identified Dataset and verify that no PHI was accidentally left intact. Visibility of the Dataset can also be limited until de-identification has been verified, in the same way as when using GC for de-identification.

Click here for more information about Interactive Containers.

 
