Sensitive Datasets – Rhino Federated Computing

Rhino FCP safeguards sensitive data. Sensitive data is information that must be kept confidential because unauthorized access could pose risks. This includes categories such as Personally Identifiable Information (PII), Protected Health Information (PHI), Payment Card Industry (PCI) data, and other personal data protected by various regulations. Examples of sensitive data are a patient’s social security number, a client’s full name, or a person’s credit card number and security code.

Background

Our Data Security Principles keep all data, including sensitive data, secure. They are summarized below.

De-identified individual data is persisted within the institution’s firewall and can be provided for temporary remote access with proper authorization. De-identified data about individual people (regardless of data type) is always stored behind the institution's firewall, in the Rhino Client. It is shared within the Rhino Client with models for training/validation. It can only be shared with users transiently, meaning it does not persist outside of the firewall, and after obtaining required approvals from the data custodian.
De-identified aggregate data can be stored outside the institution’s firewall. Aggregate statistics of de-identified data can be stored outside of the institution's firewall, such as in the cloud, and provided to authorized users based on their permission settings, so long as they are aggregated enough to sufficiently reduce the risk of re-identification (e.g. using k-anonymization with ‘k’ defined by the institution).
Sensitive data (such as PII/PHI) is only persisted within the institution's firewall, including derived aggregate statistics, and can be provided for temporary remote access with proper authorization. Sensitive data (such as PII and PHI) is only stored behind the institution's firewall, in the Rhino Client. Aggregate statistics of sensitive data are also only stored behind the institutional firewall. The sensitive data is shared (within the Rhino Client) with code/models for processing, and all outputs of such processing are also treated as potentially containing sensitive data. Sensitive data can only be shared with users transiently and after obtaining required approvals from the data custodian, such as ensuring that a Business Associate Agreement (BAA) is in place.
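
To illustrate the second principle, the following minimal sketch shows one way aggregate counts could be thresholded before they leave the firewall. It uses small-cell suppression with a threshold k, which is a simplified stand-in for k-anonymization and not Rhino FCP's actual implementation. The function name, the sample data, and the value of k are all assumptions made for illustration.

```python
from collections import Counter


def k_suppressed_counts(values, k):
    """Count records per category, releasing only categories with at least k records.

    Hypothetical helper: small-cell suppression as a simplified stand-in
    for k-anonymization. Returns (released_counts, suppressed_record_total).
    """
    counts = Counter(values)
    released = {category: n for category, n in counts.items() if n >= k}
    suppressed = sum(n for n in counts.values() if n < k)
    return released, suppressed


# Illustrative data only: three categories of very different sizes.
diagnoses = ["A", "A", "A", "B", "B", "C"]
released, suppressed = k_suppressed_counts(diagnoses, k=2)
print(released)    # {'A': 3, 'B': 2}
print(suppressed)  # 1
```

Category "C" has a single record, so releasing its count could help re-identify that person; the sketch withholds it and reports only how many records were suppressed.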

Working with Sensitive Data in Datasets

Rhino FCP securely handles sensitive information in datasets. You can explicitly indicate which fields contain (or could potentially contain) sensitive data, such as a Patient ID that might be based on a Social Security number or a payment field that includes a credit card number.

Rhino FCP uses Safe Harbor to de-identify sensitive data; what happens to each field depends on the data type indicated for it in the schema.

Importing and Reviewing Sensitive Datasets

When you import a dataset with sensitive data, complete the following steps before you use it.

View the available datasets.
Import a dataset and select whether to autogenerate a schema for it or use an existing schema.
Review the dataset. This includes tailoring the schema for the dataset if needed, reviewing fields that contain sensitive data, and determining whether to make sensitive information available to collaborators.

Viewing Available Datasets

To view datasets, complete the following steps.

Select Datasets from the main menu in your project. Available datasets appear on the page. Use the Filter options in the top right section to filter by creator or schema.

When a dataset contains sensitive information, its entry is highlighted in pink and displays the sensitive data icon.

Importing a Dataset

To import a dataset, do the following.

On the Datasets page, on the Overview tab, select the Import New Dataset button located in the top right corner of the page.
On the Import New Dataset page, enter the following information:

Name: Name of the dataset.
Description: Description of the dataset.
Workgroup: The name of the workgroup that the dataset belongs to.
Data Schema: Indicates the name of the schema for the dataset. If you do not want to use an existing schema, you can choose to auto-generate a schema from the data. Otherwise, choose a schema from the selections offered.
DICOM Data Location (if needed): Indicate the import method (DICOM Server or File System).
DICOM Path (if needed): Indicate the path where the DICOM data is stored.
File Data Location: Indicate the path of the file data.
Sensitive Data: Indicates whether the dataset contains sensitive data. Note that if you import a new dataset, select Auto-generate schema from data in the Data Schema field, and select the Sensitive Data field, you must review the schema by clicking the Review Schema button. If you instead choose an existing schema (rather than auto-generating one), you can select the Sensitive Data field and then select the Import New Dataset button. Rhino FCP behaves this way because it assumes that the schema you chose already indicates which fields are sensitive.

Select Import New Dataset.
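
The rule described for the Sensitive Data field can be summarized as a small decision function. This is only a sketch of the documented behavior, not Rhino FCP code: the function and parameter names are hypothetical, and the handling of datasets without sensitive data is an assumption, since only the sensitive cases are described above.

```python
def next_import_action(auto_generate_schema: bool, sensitive_data: bool) -> str:
    """Return the button to select after filling in the import form (hypothetical helper)."""
    if sensitive_data and auto_generate_schema:
        # An auto-generated schema has no sensitive-field markings yet,
        # so it must be reviewed before the dataset is used.
        return "Review Schema"
    # An existing schema is assumed to already mark its sensitive fields.
    # Assumption: datasets without sensitive data are also imported directly.
    return "Import New Dataset"


print(next_import_action(auto_generate_schema=True, sensitive_data=True))   # Review Schema
print(next_import_action(auto_generate_schema=False, sensitive_data=True))  # Import New Dataset
```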

Reviewing the Schema of a Sensitive Dataset

To review a sensitive dataset, examine the schema and mark sensitive data, examine the analytics, and then review the project settings.

Examine the Schema and Mark Sensitive Data

To examine the schema and mark sensitive data, do the following.

Select Data Schema, then select the schema that your dataset uses. The Schema page opens.

Review the information in these fields:

Name: Name of the field.
Identifier: The string by which this field is identified in the input data. If empty, the Name field will be used for identification.
Description: Description of the field.
Type: The field's data type, such as Float or String. Please refer to the Supported Data Schema Data Types for a full list of supported data types within the Rhino FCP. If you decide to de-identify sensitive data, Rhino FCP uses Safe Harbor, and what happens to each field depends on the data type indicated for it in the schema, so familiarize yourself with how each type is handled.
Type Params: Any parameters for the field type (e.g., min/max values for a ConstrainedInt type). Please refer to the Supported Data Schema Data Types for a full list of supported parameters for each data type within the Rhino FCP.
Unit: The unit in which the values of this field are expected to be expressed. These units also appear in charts showing aggregate values from this field in the Rhino user interface.
Required: Indicates whether the schema field is required or not.
Sensitive Data: Information that must be kept confidential because unauthorized access could pose risks. This includes categories such as Personally Identifiable Information (PII), Protected Health Information (PHI), Payment Card Industry (PCI) data, and other personal data protected by various regulations. Move the slider to the right if the data in the field is sensitive (contains PHI, PII, PCI, or other information that must be safeguarded).
Aggregate Statistics: Move the slider to the right (Enabled) if the field should be included in aggregate statistics in the data analytics page or in federated analytics via the Rhino SDK. Note that if the field contains sensitive data, aggregate statistics cannot be enabled, to avoid inadvertent exposure of sensitive data.
Secure Access: Indicates whether the field should be included in data shared via secure access lists, which are found under the Secure Access tab for your dataset. You can change this setting only if you have permission to do so. Move the slider to the right to enable the option. If it is disabled, no one can see the data; if it is enabled, users with the secure access permission can view the data.

Use care as you adjust the Sensitive Data setting. Marking a field as Sensitive Data automatically disables aggregate statistics for that field. If you later set Sensitive Data back to "No", aggregate statistics remain disabled, so re-enable them manually if needed.
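
The interaction between the two sliders can be modeled in a few lines. This is a minimal sketch of the behavior described above, not the Rhino FCP implementation: the class, its method names, and the default of aggregate statistics being enabled are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class FieldSettings:
    """Hypothetical model of two schema-field sliders."""
    sensitive: bool = False
    aggregate_statistics: bool = True  # assumed default, for illustration only

    def set_sensitive(self, value: bool) -> None:
        self.sensitive = value
        if value:
            # Marking a field as sensitive turns aggregate statistics off.
            self.aggregate_statistics = False
        # Clearing the flag does NOT turn aggregate statistics back on.

    def set_aggregate_statistics(self, value: bool) -> None:
        if value and self.sensitive:
            raise ValueError("aggregate statistics cannot be enabled for a sensitive field")
        self.aggregate_statistics = value


settings = FieldSettings()
settings.set_sensitive(True)    # aggregate statistics are disabled automatically
settings.set_sensitive(False)   # aggregate statistics stay disabled
print(settings.aggregate_statistics)  # False
settings.set_aggregate_statistics(True)  # must be re-enabled explicitly
print(settings.aggregate_statistics)  # True
```

The asymmetry is the point of the sketch: toggling Sensitive Data on and then off again does not restore the earlier aggregate statistics setting.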

Examine Analytics Data

When you select the Analytics tab (available after you select the Datasets menu option), sensitive data is omitted from the Data Completeness and field sections. However, if you have permission to view sensitive data and the secure access option is enabled, the omitted data is shown.

Review Project Settings

If necessary, review the project settings for collaborators. To do this, complete the following steps.

In your project, go to Collaborators > Site Level > Action.
Indicate which personas can do the following:
- Manage Secure Access Lists
- Share Secure Access Lists
- View Sensitive Datasets via Secure Access - Allows collaborators to access sensitive data from this site with views such as the tabular data view. It also allows access from interactive containers, including fields marked as local only.
- View Client Side Logs with Sensitive Data - Allows collaborators to view logs from code runs on sensitive datasets from the site. If collaborator personas are not added, collaborators will not be able to see logs from sensitive datasets.

NOTE: For more information on how to make changes to Site Level permissions, see Viewing Your Project's Permissions Policy.

De-Identifying a Dataset with Sensitive Data

De-identification is the process of removing or modifying personal and sensitive information in datasets to protect individuals' privacy. This is typically done by removing identifiable elements such as names, addresses, and other direct identifiers while retaining the data's usefulness for analysis and research. Note that de-identifying data can result in data loss. Rhino FCP uses Safe Harbor, and what happens to each field depends on the data type indicated for it in the schema, so familiarize yourself with this behavior before you proceed.
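
To make the idea of type-dependent de-identification concrete, here is a minimal sketch that applies a few rules drawn from the HIPAA Safe Harbor method (removing direct identifiers, keeping only the year of a date, truncating ZIP codes to three digits, grouping ages over 89). The type labels, function name, and sample record are hypothetical, and the mapping from Rhino FCP schema data types to transformations is not shown in this article, so treat this as an illustration of the concept rather than the platform's behavior. The ZIP rule also omits Safe Harbor's population-size check.

```python
from datetime import date


def deidentify_value(value, field_type):
    """Apply a Safe Harbor-style rule selected by a hypothetical field type label."""
    if value is None:
        return None
    if field_type == "identifier":  # e.g. names, SSNs: removed entirely
        return None
    if field_type == "date":        # keep only the year
        return value.year
    if field_type == "zip":         # keep only the first three digits
        return str(value)[:3]
    if field_type == "age":         # ages over 89 are grouped into one category
        return "90+" if value > 89 else value
    return value                    # other types pass through unchanged


# Illustrative record: each entry is (value, hypothetical type label).
record = {
    "name": ("Jane Doe", "identifier"),
    "admitted": (date(2023, 4, 12), "date"),
    "zip": ("02139", "zip"),
    "age": (93, "age"),
    "glucose": (5.4, "float"),
}
deidentified = {key: deidentify_value(v, t) for key, (v, t) in record.items()}
print(deidentified)
# {'name': None, 'admitted': 2023, 'zip': '021', 'age': '90+', 'glucose': 5.4}
```

The output shows the data loss mentioned above: the name is gone, and the date, ZIP code, and age are all coarser than in the original record.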

To de-identify the dataset, complete the following steps.

Select Datasets from your project's menu, then select the Overview tab.
Select a dataset that has sensitive data. A dataset with sensitive data is highlighted in pink and has the sensitive data icon.
Inspect the data by doing the following.
- Select the Analytics link; ensure that no sensitive data is shown and that the data looks as expected.
- Select the Data link; ensure that sensitive data appears and that the data looks as expected.
- If needed, also select Secure Access.
- If the information does not look as expected (for example, if a field that should be marked as Sensitive Data is not), select Data Schemas from the main menu, then select the schema for the dataset. For each field in the schema, review its type, whether it is marked as Sensitive Data, whether it should be included in aggregate statistics, and whether secure access should be enabled.

Next, select Datasets and, in the Overview tab, select the three-dot menu to the right of the dataset.
Optionally, select Perform schema-based de-identification if you want to apply schema-based de-identification using Safe Harbor. Keep in mind that if you do this, there could be data loss. You might choose to do this while testing or writing code, but might not choose to do this in production.

In the pop-up window that appears, select the Mark resulting dataset as de-identified checkbox if you want to indicate that Safe Harbor was applied without further checking. If you want to review the result first, leave the checkbox unchecked. Select Confirm to close.


A new dataset is created. If you select it and view analytics, the de-identified data now appears in the analytics.
If you did not select the checkbox in the pop-up window, you can still mark the dataset as de-identified later: select the three-dot menu to the right of the dataset entry in the Datasets Overview tab, then select Mark as de-identified.
