Importing a New Dataset or Dataset Version Using Rhino SDK and SQL

Rhino FCP allows you to easily and securely import datasets that you can use in your project. This article explains how to use the Rhino SDK and SQL to import a new dataset or dataset version.

Note: If you want to know how to do this in the Rhino FCP GUI instead, see Importing a New Dataset in the Rhino FCP GUI.

Prerequisites

Before creating a new dataset, make sure you know the following:

Name: The name you would like to provide for your Dataset.
(Optional) Description: A description that will help you understand the nature of the Dataset.
Workgroup: The workgroup for which you will import your Dataset. Unless granted special permissions, you will only be able to import datasets for your own Workgroup. For more information, please read Importing a Remote Dataset.
Data Schema: The schema that defines the structure of the Dataset. It can either be pre-defined or auto-generated by the system.
[Auto-generate schema from data]: Allows the system to infer the schema for your Dataset. For more information about auto-generating data schemas, please refer to Auto-Generated Data Schemas.
Pre-defined Schema: A data schema that you have defined previously using the create schema function within the FCP. Please refer to Creating a New Schema or Schema Version for a reminder about how to create a schema within the FCP.
Tabular Data File Location: A file path within the Rhino Client (under /rhino_data/) where the CSV file containing the tabular data is located. If a pre-defined schema is used, this file must have the same columns as defined in the Data Schema. Each row should represent a specific data point (e.g. study). Each column represents the features of that data point (e.g. DICOM Study UID, age, gender, clinical features, associated label for ML, etc.). Note: When a Schema is provided, Datasets are validated against the Schema while importing. Therefore you will need to make sure your data types are correct (e.g. no decimal point where integers are expected) and the values are within the parameters defined in the schema.
DICOM Data Location: When a column representing DICOM UIDs is defined in the tabular CSV file, you will also be required to define the location of the DICOM files on the Rhino Client.
Filesystem: If the files are located on the Rhino Client, then you should place the parent folder holding all the files within the text field (starting with /rhino_data/).
DICOM Server: If the DICOM folders are located on a DICOM server, you will need the following information for the DICOM server to connect and import the DICOM files: host, DICOM web port, DICOM web prefix, username, and password.
File Data Location: The path on your Rhino Client (starting with /rhino_data/) where all your file data is contained. The file data is then provided in the tabular data file as a feature of type Filename with relative paths to the absolute path specified in the Dataset import page. If more than one file needs to be associated with each row, then multiple Filename columns can be added to the Data Schema and populated with file paths in the tabular data file.
Is Data Deidentified?: Whether the data that is about to be imported has been de-identified. For more information, please refer to De-Identification of Dataset Data.
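
Because Datasets are validated against a pre-defined Schema during import, it can save a failed import to check the CSV locally first. The following is a minimal sketch using only the Python standard library, not part of the Rhino SDK. The function name check_csv, the sample column names, and the choice of which columns must be integers are all hypothetical and should be adapted to your own schema.

```python
import csv
import io


def check_csv(csv_text, expected_columns, integer_columns=()):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []

    # Compare the header against the columns the schema is expected to define.
    missing = [c for c in expected_columns if c not in header]
    extra = [c for c in header if c not in expected_columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")

    # Data rows are numbered from 1 (the header is not counted).
    for row_no, row in enumerate(reader, start=1):
        for col in integer_columns:
            value = row.get(col)
            if value in (None, ""):
                continue  # empty cells are left to the platform's own validation
            try:
                int(value)
            except ValueError:
                problems.append(f"row {row_no}: {col}={value!r} is not an integer")
    return problems


# Hypothetical sample: the second data row has a decimal point in an integer column.
sample = "StudyUID,Age,Label\n1.2.3,54,1\n1.2.4,61.0,0\n"
problems = check_csv(sample, ["StudyUID", "Age", "Label"], integer_columns=["Age", "Label"])
print(problems)
```

This only covers column names and integer formatting; range checks and other schema constraints are still enforced by the platform at import time.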

Importing a New Dataset using the Rhino SDK

Prerequisites

Before starting this process, you should have already:

Created a Project using the Rhino SDK or UI
Optionally created a Data schema using the Rhino SDK or UI

Import your Python Dependencies

import rhino_health as rh
from rhino_health.lib.endpoints.dataset.dataset_dataclass import DatasetCreateInput
import getpass

Remember to change all lines with CHANGE_ME comments above them in all the blocks below!

Log into the Rhino SDK using your FCP Credentials

Your username will be the email address you log into the Rhino FCP platform with.

print("Logging In")

# CHANGE_ME: MY_USERNAME
my_username = "MY_USERNAME"

my_password = getpass.getpass()
session = rh.login(username=my_username, password=my_password)
print("Logged In")

Get supporting FCP information needed to import your Dataset

At this point, you will need the name of your Project, as well as all the information mentioned at the top of this article. You can also retrieve each object's UUID by following the instructions here: How do I retrieve a Project's, Collaborator's, Schema's, Cohort's, Model's, or Model Results' UUID?

# CHANGE_ME: YOUR_FCP_PROJECT_NAME
project = session.project.get_project_by_name('YOUR_FCP_PROJECT_NAME')

workgroup = session.project.get_collaborating_workgroups(project.uid)[0]

# CHANGE_ME: SCHEMA_NAME and possibly the version number too. To infer (auto-generate) your schema, set dataschema_uid to None instead.
dataschema_uid = session.project.get_data_schema_by_name('SCHEMA_NAME', project_uid=project.uid, version=1).uid
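
Since the schema is optional, the lookup above can be wrapped so that one code path handles both a pre-defined schema and an inferred one. This is a sketch under stated assumptions: resolve_data_schema_uid is a hypothetical helper, it assumes the lookup may return None when nothing matches, and the fake session below is only a stand-in so the example runs without a platform connection. With a real session, pass the object returned by rh.login.

```python
from types import SimpleNamespace


def resolve_data_schema_uid(session, project_uid, schema_name=None, version=1):
    """Return the schema UID, or None to request an auto-generated schema."""
    if schema_name is None:
        return None
    schema = session.project.get_data_schema_by_name(
        schema_name, project_uid=project_uid, version=version
    )
    if schema is None:
        raise ValueError(f"No data schema named {schema_name!r} (version {version}) found")
    return schema.uid


# Stand-in for a logged-in session so the sketch runs offline (not the real SDK).
def _fake_lookup(name, project_uid, version):
    return SimpleNamespace(uid="schema-uid-123") if name == "SCHEMA_NAME" else None


fake_session = SimpleNamespace(
    project=SimpleNamespace(get_data_schema_by_name=_fake_lookup)
)

inferred = resolve_data_schema_uid(fake_session, "project-uid")
predefined = resolve_data_schema_uid(fake_session, "project-uid", "SCHEMA_NAME")
print(inferred, predefined)
```

Raising on a missing schema surfaces a typo in the schema name immediately, rather than as an attribute error on the next line.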

Import your Dataset from files on your Rhino Client

To import/create your Dataset you will need to change several different pieces of information in the code block below. Remember the paths for csv_filesystem_location, image_filesystem_location, and file_base_path will be paths located in your Rhino Client and should start with /rhino_data.

dataset_params = DatasetCreateInput(

  # CHANGE_ME: DATASET_NAME
  name = "DATASET_NAME",

  # CHANGE_ME: DATASET_DESCRIPTION
  description = "DATASET_DESCRIPTION",

  project_uid = project.uid, 
  workgroup_uid = workgroup.uid,
  data_schema_uid=dataschema_uid,

  # CHANGE_ME: DATASET_TABULAR_CSV_FILE
  csv_filesystem_location = "DATASET_TABULAR_CSV_FILE",
  
  # CHANGE_ME: DICOM_FILESYSTEM_LOCATION
  image_filesystem_location = "DICOM_FILESYSTEM_LOCATION",

  # CHANGE_ME: FILE_DATA_FILESYSTEM_LOCATION
  file_base_path = "FILE_DATA_FILESYSTEM_LOCATION",

  method = "filesystem"
)

dataset = session.dataset.add_dataset(dataset_params)
print(f"Created new dataset '{dataset.name}' with uid '{dataset.uid}'")
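
The three location values passed to DatasetCreateInput are expected to sit under /rhino_data on the Rhino Client. A small local check before calling add_dataset can catch a mistyped path early. The sketch below uses only the standard library; is_under_rhino_data is a hypothetical helper, and it only inspects the path string, since the files live on the Rhino Client rather than on the machine running the SDK.

```python
from pathlib import PurePosixPath

RHINO_DATA_ROOT = PurePosixPath("/rhino_data")


def is_under_rhino_data(path):
    """True if the path is absolute, has no '..' parts, and sits at or below /rhino_data."""
    p = PurePosixPath(path)
    if not p.is_absolute() or ".." in p.parts:
        return False
    return p == RHINO_DATA_ROOT or RHINO_DATA_ROOT in p.parents


# Hypothetical example paths.
for candidate in [
    "/rhino_data/study/dataset.csv",
    "/rhino_database/dataset.csv",
    "rhino_data/dataset.csv",
]:
    print(candidate, is_under_rhino_data(candidate))
```

Comparing path components rather than using a plain string prefix avoids accepting look-alike folders such as /rhino_database.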

Creating a New Dataset using the Rhino SDK & SQL Data Ingestion

Note: Importing a Dataset from SQL is currently only supported via the Rhino SDK.

Prerequisites

Have an SQL DB that is open to connections from a Rhino Client, with access credentials for read-only access to this DB.
Have a project where you are either part of the project's lead workgroup and the DB is within your site, or where there is a collaborator in the project that has the DB at their site.
Ensure the required site-level permissions for SQL querying (Import and export Datasets, View Dataset analytics) are enabled for the site that has the DB.
Ensure that you have the Rhino SDK installed on your computer (pip install rhino_health).

To import a Dataset directly from a SQL DB, follow these steps:

Import your Python Dependencies

import rhino_health as rh
from rhino_health.lib.endpoints.sql_query.sql_query_dataclass import (
    SQLQueryImportInput,
    SQLQueryInput,
    SQLServerTypes,
    ConnectionDetails,
)
import getpass

Remember to change all lines with CHANGE_ME comments above them in all the blocks below.

Log into the Rhino SDK using your FCP Credentials

Your username will be the email address you log into the Rhino FCP platform with.

print("Logging In")

# CHANGE_ME: MY_USERNAME
my_username = "MY_USERNAME"

session = rh.login(username=my_username, password=getpass.getpass())
print("Logged In")

Get Supporting FCP Information Needed to Import Your Dataset

At this point, you will need the name of your project to retrieve your Project's UUID and subsequently your Workgroup UUID. You can also retrieve each object's UUID by following the instructions here: How do I retrieve a Project's, Collaborator's, Data Schema's, Dataset's, Code Object's, or Code Run's UID?

# CHANGE_ME: YOUR_FCP_PROJECT_NAME
project = session.project.get_project_by_name('YOUR_FCP_PROJECT_NAME')

workgroup = session.project.get_collaborating_workgroups(project.uid)[0]

Connection Setup

When specifying the connection details, ensure that you provide the server_type using the approved SQLServerTypes enum. This step ensures that your server is supported and compatible with the querying process. For a complete list of supported & compatible servers, refer to the SDK documentation here: SQL Server Types. Replace the following variables below:

sql_db_user - Your SQL database username. Make sure that the user has read-only permissions
sql_db_password - Your SQL database password. For better security, consider using an environment variable (e.g. os.getenv("DB_PASSWORD")), or using getpass.getpass() to type in the password
external_server_url - Your SQL database url & port (i.e. "{url}:{port}")
db_name - Your SQL database name

# CHANGE_ME: SQL_DATABASE_USERNAME
sql_db_user = "SQL_DATABASE_USERNAME"

# CHANGE_ME: SQL_DATABASE_PASSWORD
sql_db_password = "SQL_DATABASE_PASSWORD"

# CHANGE_ME: EXTERNAL_SERVER_URL
external_server_url = "EXTERNAL_SERVER_URL"

# CHANGE_ME: DB_NAME
db_name = "DB_NAME" # Replace this with your DB name.
connection_details = ConnectionDetails(
    server_user=sql_db_user,
    password=sql_db_password,
    
    # CHANGE_ME: Replace POSTGRESQL with one of the supported SQL Server Types outlined in the documentation above
    server_type=SQLServerTypes.POSTGRESQL,
    server_url=external_server_url,
    db_name=db_name
)
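
To avoid hard-coding the database password, the value can come from an environment variable with an interactive prompt as a fallback, as suggested in the variable list above. The sketch below is one way to do that with the standard library; read_db_password and build_server_url are hypothetical helpers, DB_PASSWORD is just the example variable name used earlier, and the host and port shown are placeholders.

```python
import getpass
import os


def read_db_password(env_var="DB_PASSWORD"):
    """Read the password from the environment, prompting only if it is not set."""
    value = os.getenv(env_var)
    if value:
        return value
    return getpass.getpass(f"SQL database password ({env_var} is not set): ")


def build_server_url(host, port):
    """Combine host and port into the '{url}:{port}' form expected for server_url."""
    port = int(port)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"{host}:{port}"


# Placeholder host and port; replace with your own database endpoint.
external_server_url = build_server_url("db.example.org", 5432)
print(external_server_url)
```

With these helpers, sql_db_password = read_db_password() can replace the literal string in the block above, so the password never appears in the script or in version control.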

[Optional] Running Exploratory Queries

You can run SQL queries on the remote DB and receive aggregate statistics on the results of the query.

This involves two inputs:

Define the query you want to run (note that the Rhino FCP does not limit the SQL code that is run, so always connect with a DB user that has read-only permissions)
Define the metrics you would like to calculate on the query results (See the SDK documentation for more information about what Metrics you can calculate: rhino_health.lib.metrics)

# CHANGE_ME: QUERY_STRING, Replace with query you want to run, e.g. "SELECT * FROM <your_table> WHERE <condition>"
starting_query = "QUERY_STRING"

# CHANGE_ME: ["METRICS"], # Define a list of metrics (e.g. [Mean(variable="Height")], etc.) outlined in the documentation above
metric_definitions = ["METRICS"]

Define the Exploratory Query Run Parameters

When defining your SQLQueryInput, your project & workgroup will be used to validate permissions that were set at the project and site level (e.g. the k-anonymization value). For more information on permissions, please refer to Permissions.

query_run_params = SQLQueryInput(
    session=session,
    project=project.uid,
    workgroup=workgroup.uid,
    connection_details=connection_details,
    sql_query=starting_query,
    timeout_seconds=600,
    metric_definitions=metric_definitions
)

Run the Exploratory Query

Run the query on your SQL database and get the metric results.

response = session.sql_query.run_sql_query(query_run_params)
print(response.results)

Import Query Results as a Dataset in the Rhino FCP

You can run SQL queries on the remote database and then have the results of the query stored as a Dataset on the Rhino FCP, allowing further processing, analysis, etc.

This involves two inputs:

The query you want to run (note that the Rhino FCP does not limit the SQL code that is run, so always connect with a DB user that has read-only permissions)
Data needed for the Dataset creation (e.g. Dataset name)

# CHANGE_ME: QUERY_STRING, Replace with query you want to run to generate the data for the Dataset, e.g. "SELECT * FROM <your_table> WHERE <condition>"
query = "QUERY_STRING"

# CHANGE_ME: DATASET_NAME
dataset_name = "DATASET_NAME"

# CHANGE_ME: False, can be either True or False depending on whether the data being queried is de-identified
is_data_deidentified = False

# CHANGE_ME: SCHEMA_NAME & Possibly Version Number too. To auto-generate your schema, set this variable to None
dataschema_uid = session.project.get_data_schema_by_name('SCHEMA_NAME', project_uid=project.uid, version=1).uid

Define the Dataset Import Parameters

As with SQLQueryInput, when you create your SQLQueryImportInput, your project & workgroup will be used to validate permissions that were set at the project and site level (e.g. the k-anonymization value). For more information on permissions, please refer to Permissions.

import_run_params = SQLQueryImportInput(
    session=session,
    project=project.uid,
    workgroup=workgroup.uid,
    connection_details=connection_details,
    dataset_name=dataset_name,
    data_schema_uid=dataschema_uid,
    timeout_seconds=600,
    is_data_deidentified=is_data_deidentified,
    sql_query=query
)

Trigger the Query to Import your Dataset

Run the query on your SQL database and the results will be imported into the Rhino FCP as a Dataset.

response = session.sql_query.import_dataset_from_sql_query(import_run_params)
print(response.results)

Importing Datasets at a Collaborating Site

Given the right permissions, it is possible for users to import Datasets at other sites (and not just their own site). This is useful in case collaborators prefer to only prepare the data and not perform the Dataset import operation on the Rhino FCP.

To perform a remote Dataset import, a user’s role must have the “Import and export Datasets” permission for that site. For example, if you are the Workgroup Admin of the site that created the project (hence you are the Project Lead Admin, or PLA), then the collaborating site must enable the “Import and export Datasets” permission for the PLA persona.

To trigger the Dataset import at the collaborating site via the UI, follow the same process for importing a Dataset to your own site via the UI, and in the Select Workgroup dropdown, select the Workgroup of the collaborator. Make sure to input the data file locations on the collaborator’s Rhino Client.

To trigger the Dataset import at the collaborating site via the SDK, follow the same process for importing a Dataset to your own site via the SDK, and in the “workgroup_uid” field pass in the UID of the collaborator’s workgroup.
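
The earlier examples take the first entry returned by get_collaborating_workgroups, which may not be the collaborator you intend when a project has several. A minimal sketch of selecting the workgroup by name instead is shown below. find_workgroup_by_name is a hypothetical helper, it assumes each workgroup object exposes name and uid attributes, and the SimpleNamespace objects are stand-ins so the example runs without a platform connection.

```python
from types import SimpleNamespace


def find_workgroup_by_name(workgroups, name):
    """Return the single workgroup with the given name, or raise if none or several match."""
    matches = [wg for wg in workgroups if wg.name == name]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one workgroup named {name!r}, found {len(matches)}")
    return matches[0]


# Stand-ins for the objects returned by session.project.get_collaborating_workgroups(project.uid).
workgroups = [
    SimpleNamespace(name="Site A", uid="wg-uid-a"),
    SimpleNamespace(name="Site B", uid="wg-uid-b"),
]

collaborator = find_workgroup_by_name(workgroups, "Site B")
print(collaborator.uid)
```

The resulting collaborator.uid is the value to pass as workgroup_uid, with the data file locations pointing at paths on the collaborator's Rhino Client.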