Dataset De-identification – Rhino Federated Computing

Overview

There are several methods for Dataset de-identification using Rhino FCP:

De-identifying data before importing
HIPAA Safe Harbor De-identification using the Rhino FCP's built-in capabilities
De-identification using custom Code Objects
Interactive de-identification using Interactive Containers

De-identifying Data Before Importing

This method doesn’t involve the Rhino FCP directly: users can always de-identify data with any method or utility before making it accessible on the Rhino FCP.

The de-identified data can then be made accessible to the Rhino Client (e.g., via client-mounted storage or SFTP) and imported as a Dataset into the FCP. For more information, see Mounting Storage to Your Rhino Client.

HIPAA Safe Harbor De-identification

The FCP supports HIPAA Safe Harbor De-identification for some Dataset data types. This is implemented via the field types defined in the Data Schema: each field type can have specific de-identification logic associated with it. For example, fields of type Age are de-identified by setting all values ≥90 to 90. The full list of FCP data types and their associated de-identification logic is available on the Supported Data Schema Data Types page.
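As an illustration of the Age rule described above, here is a minimal Python sketch. The function name `deidentify_age` and the handling of missing values are assumptions made for this example; this is not the FCP's actual implementation.

```python
def deidentify_age(age, cap=90):
    """Top-code an age: any value at or above `cap` is reported as `cap`.

    Hypothetical helper for illustration only. Missing values (None) are
    passed through unchanged, which is an assumption of this sketch.
    """
    if age is None:
        return None
    return min(age, cap)


ages = [34, 89, 90, 97, None]
print([deidentify_age(a) for a in ages])  # [34, 89, 90, 90, None]
```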

To apply HIPAA Safe Harbor De-identification, complete the following steps:

When reviewing the Data Schema, set the Sensitive Data option for each field.
When finished, open the Dataset's three-dot menu and select Perform schema-based de-identification. This triggers the de-identification process.
When the process completes, open the same menu again and select Mark as de-identified.

For detailed instructions, see the Sensitive Datasets article.

Any field marked in the Data Schema with Sensitive Data set to No is copied unchanged to the de-identified data.
Any field marked with Sensitive Data set to Yes whose field type has associated de-identification logic (e.g., Age; see Supported Data Schema Data Types) has that logic applied to produce the de-identified data for the field.
Any field marked with Sensitive Data set to Yes whose field type has no associated de-identification logic (e.g., String; see Supported Data Schema Data Types) is omitted from the de-identified data (the entire column is removed).
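The three rules above can be sketched in Python as follows. The schema layout, the `DEID_LOGIC` registry, and the function names are all hypothetical and exist only to illustrate the decision logic; they do not reflect how the FCP is implemented.

```python
def cap_age(value):
    """Top-code ages at 90 (illustrative logic for the Age field type)."""
    return value if value is None else min(value, 90)


# Hypothetical registry: field type -> de-identification function.
DEID_LOGIC = {"Age": cap_age}


def deidentify_rows(rows, schema):
    """Apply schema-based rules to a list of row dicts.

    `schema` maps field name -> {"type": str, "sensitive": bool}
    (an assumed layout for this sketch).
    """
    output = []
    for row in rows:
        clean = {}
        for name, spec in schema.items():
            if not spec["sensitive"]:
                # Rule 1: non-sensitive fields are copied unchanged.
                clean[name] = row[name]
            elif spec["type"] in DEID_LOGIC:
                # Rule 2: sensitive field with logic -> apply that logic.
                clean[name] = DEID_LOGIC[spec["type"]](row[name])
            # Rule 3: sensitive field without logic -> column is omitted.
        output.append(clean)
    return output


schema = {
    "site": {"type": "String", "sensitive": False},
    "age": {"type": "Age", "sensitive": True},
    "name": {"type": "String", "sensitive": True},
}
rows = [{"site": "A", "age": 93, "name": "Jane Doe"}]
print(deidentify_rows(rows, schema))  # [{'site': 'A', 'age': 90}]
```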

Any information needed to translate de-identified data back to the raw data (e.g., reverse crosswalk tables for newly generated de-identified UIDs) is stored by the FCP along with the raw Dataset data and is treated as sensitive.
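To make the idea of a reverse crosswalk table concrete, here is a small Python sketch. It is purely illustrative: the function name `replace_uids` and the use of random UUIDs are assumptions of this example, not a description of how the FCP generates or stores its crosswalks.

```python
import uuid


def replace_uids(uids):
    """Replace each original UID with a newly generated one.

    Returns the de-identified UIDs plus a reverse crosswalk
    (new UID -> original UID). The crosswalk can re-identify records,
    so it must be handled as sensitive data.
    """
    forward = {}
    for uid in uids:
        if uid not in forward:
            # The same original UID always maps to the same new UID.
            forward[uid] = uuid.uuid4().hex
    deidentified = [forward[uid] for uid in uids]
    reverse = {new: old for old, new in forward.items()}
    return deidentified, reverse


new_uids, crosswalk = replace_uids(["p1", "p2", "p1"])
# Looking up each new UID in the crosswalk recovers the original values.
print([crosswalk[u] for u in new_uids])  # ['p1', 'p2', 'p1']
```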

De-identification with Custom Code Objects

Users can apply custom de-identification logic, built on any library they choose, using Rhino's Generalized Compute (GC) Code capability. Users can take open source software, third-party tools, or code they wrote themselves, build a container that includes this software, and run it on the FCP.

The code should read the raw Dataset from /input/ (within the container) and write the clean (de-identified) Dataset to /output/ (within the container). For more information on how to use custom code on the FCP, click here.
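A minimal sketch of such a script in Python is shown below. It assumes the Dataset is a single CSV file; the file name `dataset.csv`, the dropped column names, and the function name are hypothetical placeholders, and dropping columns merely stands in for whatever de-identification logic is actually needed.

```python
import csv
from pathlib import Path


def deidentify_dataset(input_dir="/input", output_dir="/output",
                       filename="dataset.csv",
                       drop_columns=("name", "mrn")):
    """Copy a CSV from input_dir to output_dir without the listed columns.

    The file name and column names are assumptions for illustration.
    """
    src = Path(input_dir) / filename
    dst = Path(output_dir) / filename
    dst.parent.mkdir(parents=True, exist_ok=True)
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.DictReader(fin)
        kept = [c for c in (reader.fieldnames or []) if c not in drop_columns]
        # extrasaction="ignore" silently skips the dropped columns.
        writer = csv.DictWriter(fout, fieldnames=kept, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
    return dst
```

Inside the container, the entry point would call `deidentify_dataset()` with its defaults so that it reads from /input and writes to /output.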

Secure Access can be used to review the de-identified Dataset and verify that no sensitive data was accidentally left intact. The review can be performed either by members of the site from which the Dataset was imported or by another collaborator (after the owner has provided access via a Secure Access List).

It is also possible to limit visibility to the raw Dataset and/or the de-identified Dataset before it has been reviewed or approved. No one will have access to the patient-level data unless they have been granted access via Secure Access.

To further limit access to aggregate statistical information about the Dataset before it has been approved (for example, when Categorical fields may still contain sensitive values):

Users can set the default permissions for viewing Dataset analytics to just their own site, so that the project lead and other project collaborators cannot view aggregate statistics about their Datasets.
Users can create a separate project in which they will interact with the Dataset before it has been validated as being de-identified. Then, once the Dataset has been validated as being sufficiently de-identified, the de-identified data can be exported from the “de-identification” project and re-imported into the “real” project.

Interactive De-identification with Interactive Containers

In addition to applying custom de-identification logic that will run remotely using GC, users can utilize the FCP's Interactive Container capability to create Code Objects that enable them to perform de-identification in an interactive session.

Users can take the application they want to use, as long as it can run on Linux, and build a container that includes this application. Then, this container can be run in an interactive session on the FCP.

Users can connect to the UI of this interactive session and use the application to perform de-identification. The raw Dataset will be available at /input, and after de-identification, users can save the clean (de-identified) Dataset to /output.

As with GC-based de-identification, Secure Access can be used to review the de-identified Dataset and verify that no sensitive data was accidentally left intact. Visibility of the Dataset can also be limited until de-identification has been verified, in the same way as for GC-based de-identification.

Click here for more information about Interactive Containers.