Before creating a new Dataset, make sure you know the following:
- Name: The name you would like to provide for your Dataset
- (Optional) Description: A description that will help you understand the nature of the Dataset
- Workgroup: The workgroup for which you will import your Dataset. Unless granted special permissions, you will only be able to import Datasets for your own Workgroup. For more information, please read Importing a Remote Dataset
- Data Schema: The schema that defines the structure of the Dataset. It can either be pre-defined or auto-generated by the system
  - [Auto-generate schema from data]: Allow the system to infer the schema for your Dataset. For more information about auto-generating data schemas, please refer to Auto-Generated Data Schemas
  - Pre-defined Schema: A data schema that you have defined previously using the create schema function within the FCP. Please refer to Creating a New Schema or Schema Version for a reminder about how to create a schema within the FCP
- Tabular Data File Location: The file path on the Rhino Client (under /rhino_data/) of the CSV file containing the tabular data. If a pre-defined schema is used, this file must have the same columns as defined in the Data Schema. Each row should represent a specific data point (e.g. a study), and each column a feature of that data point (e.g. DICOM Study UID, age, gender, clinical features, associated label for ML, etc.). Note: When a Schema is provided, Datasets are validated against it during import, so make sure your data types are correct (e.g. no decimal point where integers are expected) and that the values are within the parameters defined in the schema
- DICOM Data Location: When a column representing DICOM UIDs is defined in the tabular CSV file, you will also be required to define the location of the DICOM files.
  - Filesystem: If the files are located on the Rhino Client, enter the path of the parent folder holding all the files (starting with /rhino_data/) in the text field
  - DICOM Server: If the DICOM files are located on a DICOM server, you will need the following information to connect to the server and import the files: host, DICOM web port, DICOM web prefix, username, and password
- File Data Location: The path on your Rhino Client (starting with /rhino_data/) where all your file data is contained. Each file is then referenced in the tabular data file as a feature of type Filename, with a path relative to the base path specified on the Dataset import page. If more than one file needs to be associated with each row, multiple Filename columns can be added to the Data Schema and populated with file paths in the tabular data file
- Is Data Deidentified?: Whether the data that is about to be imported has been de-identified. For more information, please refer to De-Identification of Dataset Data
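Because a Dataset is validated against its Data Schema during import, it can save a round trip to check the CSV locally first. The following is a minimal sketch of such a pre-flight check using only the Python standard library. The column names and the EXPECTED_COLUMNS mapping are hypothetical and are not the FCP Data Schema format; the check only approximates the validation that the FCP performs.

```python
import csv
import io

# Hypothetical description of the expected columns (name -> Python type).
# This is an illustration only, not the FCP Data Schema format.
EXPECTED_COLUMNS = {"SeriesUID": str, "Age": int, "Weight": float}


def check_tabular_data(csv_text, expected_columns):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    found = reader.fieldnames or []
    missing = [name for name in expected_columns if name not in found]
    extra = [name for name in found if name not in expected_columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        for name, expected_type in expected_columns.items():
            value = row.get(name)
            if value is None or value == "":
                continue  # empty cells are left to the real schema validation
            try:
                expected_type(value)  # int("42.0") raises ValueError
            except ValueError:
                problems.append(
                    f"row {row_number}, column {name!r}: "
                    f"{value!r} is not a valid {expected_type.__name__}"
                )
    return problems


# Example: the second data row has a decimal point in an integer column.
sample = "SeriesUID,Age,Weight\n1.2.3,42,70.5\n1.2.4,42.0,81\n"
for problem in check_tabular_data(sample, EXPECTED_COLUMNS):
    print(problem)
```

To check a real file, read its contents (for example with open(path, newline="").read()) and pass the text to check_tabular_data together with a mapping that mirrors your own schema.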
Creating a New Dataset using the Rhino FCP UI
Prerequisites
Before starting this process, you should have already:
- Created a Project
- Made your data available on the Rhino Client
Create a New Dataset
Go to Your Project -> Datasets -> Import New Dataset, and fill in the form using the information listed at the top of this article.
Creating a New Dataset using the Rhino SDK
Prerequisites
Before starting this process, you should have already:
- Created a Project using the Rhino SDK or UI
- Optionally created a Data schema using the Rhino SDK or UI
Import your Python Dependencies
import rhino_health as rh
from rhino_health.lib.endpoints.dataset.dataset_dataclass import DatasetCreateInput
import getpass
Remember to change all lines with CHANGE_ME comments above them in all the blocks below!
Log into the Rhino SDK using your FCP Credentials
Your username will be the email address you log into the Rhino FCP platform with.
print("Logging In")
# CHANGE_ME: MY_USERNAME
my_username = "MY_USERNAME"
my_password = getpass.getpass()
session = rh.login(username=my_username, password=my_password)
print("Logged In")
Get supporting FCP information needed to import your Dataset
At this point, you will need the name of your Project, as well as all the information mentioned at the top of this article. You can also retrieve each object's UUID by following the instructions here: How do I retrieve a Project's, Collaborator's, Schema's, Cohort's, Model's, or Model Results' UUID?
# CHANGE_ME: YOUR_FCP_PROJECT_NAME
project = session.project.get_project_by_name('YOUR_FCP_PROJECT_NAME')
workgroup = session.project.get_collaborating_workgroups(project.uid)[0]
# CHANGE_ME: SCHEMA_NAME (and possibly the version number). To auto-generate your schema, set dataschema_uid to None instead
dataschema_uid = session.project.get_data_schema_by_name('SCHEMA_NAME', project_uid=project.uid, version=1).uid
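The choice between a pre-defined schema and an auto-generated one can be wrapped in a small helper, sketched below. The helper name resolve_data_schema_uid is hypothetical. It only uses the get_data_schema_by_name call shown above, and it assumes (this is not confirmed here) that the lookup returns None when no schema matches.

```python
def resolve_data_schema_uid(session, project_uid, schema_name=None, version=1):
    """Return the UID of a named Data Schema, or None to auto-generate the schema."""
    if schema_name is None:
        # None asks the FCP to infer the schema from the data.
        return None
    schema = session.project.get_data_schema_by_name(
        schema_name, project_uid=project_uid, version=version
    )
    # Assumption: the lookup returns None when nothing matches.
    if schema is None:
        raise ValueError(
            f"No Data Schema named {schema_name!r} (version {version}) in this project"
        )
    return schema.uid
```

With a logged-in session, dataschema_uid = resolve_data_schema_uid(session, project.uid, "SCHEMA_NAME") would select a pre-defined schema, and omitting the name would leave the schema to be inferred.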
Create your Dataset from files on your Rhino Client
To create your Dataset you will need to change several different pieces of information in the code block below. Remember that the paths for csv_filesystem_location, image_filesystem_location, and file_base_path are paths on your Rhino Client and should start with /rhino_data.
dataset_params = DatasetCreateInput(
    # CHANGE_ME: DATASET_NAME
    name="DATASET_NAME",
    # CHANGE_ME: DATASET_DESCRIPTION
    description="DATASET_DESCRIPTION",
    project_uid=project.uid,
    workgroup_uid=workgroup.uid,
    data_schema_uid=dataschema_uid,
    # CHANGE_ME: DATASET_TABULAR_CSV_FILE
    csv_filesystem_location="DATASET_TABULAR_CSV_FILE",
    # CHANGE_ME: DICOM_FILESYSTEM_LOCATION
    image_filesystem_location="DICOM_FILESYSTEM_LOCATION",
    # CHANGE_ME: FILE_DATA_FILESYSTEM_LOCATION
    file_base_path="FILE_DATA_FILESYSTEM_LOCATION",
    method="filesystem"
)
dataset = session.dataset.add_dataset(dataset_params)
print(f"Created new dataset '{dataset.name}' with uid '{dataset.uid}'")
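Since all three locations must point inside /rhino_data on the Rhino Client, a quick local check of the strings can catch typos before the import is submitted. This is a minimal sketch: the helper name is_under_rhino_data and the example paths are hypothetical, and the check inspects only the path text because the files live on the Rhino Client rather than on the machine running the SDK.

```python
from pathlib import PurePosixPath

RHINO_DATA_ROOT = PurePosixPath("/rhino_data")


def is_under_rhino_data(path):
    """Return True if `path` is an absolute POSIX path located inside /rhino_data."""
    candidate = PurePosixPath(path)
    if ".." in candidate.parts:
        return False  # reject paths that could climb out of /rhino_data
    return RHINO_DATA_ROOT in candidate.parents


# Hypothetical example values for the three location parameters.
locations = {
    "csv_filesystem_location": "/rhino_data/my_study/dataset.csv",
    "image_filesystem_location": "/rhino_data/my_study/dicom",
    "file_base_path": "rhino_data/my_study/files",  # missing leading slash
}
invalid = [name for name, path in locations.items() if not is_under_rhino_data(path)]
print("Locations outside /rhino_data:", invalid)
```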
Creating a New Dataset using the Rhino SDK & SQL Data Ingestion
Note: Importing a Dataset from SQL is currently only supported via the Rhino SDK.
Prerequisites
- Have an SQL DB that is open to connections from a Rhino Client, with access credentials for read-only access to this DB.
- Have a project where you are either part of the project's lead workgroup and the DB is within your site, or where there is a collaborator in the project that has the DB at their site.
- Ensure the required site-level permissions for SQL querying (Import and export Datasets, View Dataset analytics) are enabled for the site that has the DB.
- Ensure that you have the Rhino SDK installed on your computer (pip install rhino_health).
To import a Dataset directly from a SQL DB, follow these steps:
Import your Python Dependencies
import rhino_health as rh
from rhino_health.lib.endpoints.sql_query.sql_query_dataclass import (
    SQLQueryImportInput,
    SQLQueryInput,
    SQLServerTypes,
    ConnectionDetails,
)
import getpass
Remember to change all lines with CHANGE_ME comments above them in all the blocks below.
Log into the Rhino SDK using your FCP Credentials
Your username will be the email address you log into the Rhino FCP platform with.
print("Logging In")
# CHANGE_ME: MY_USERNAME
my_username = "MY_USERNAME"
session = rh.login(username=my_username, password=getpass.getpass())
print("Logged In")
Get Supporting FCP Information Needed to Import Your Dataset
At this point, you will need the name of your project to retrieve your Project's UUID and subsequently your Workgroup UUID. You can also retrieve each object's UUID by following the instructions here: How do I retrieve a Project's, Collaborator's, Data Schema's, Dataset's, Code Object's, or Code Run's UID?
# CHANGE_ME: YOUR_FCP_PROJECT_NAME
project = session.project.get_project_by_name('YOUR_FCP_PROJECT_NAME')
workgroup = session.project.get_collaborating_workgroups(project.uid)[0]
Connection Setup
When specifying the connection details, ensure that you provide the server_type using the approved SQLServerTypes enum. This step ensures that your server is supported and compatible with the querying process. For a complete list of supported & compatible servers, refer to the SDK documentation here: SQL Server Types. Replace the following variables below:
- sql_db_user - Your SQL database username. Make sure that the user has read-only permissions
- sql_db_password - Your SQL database password. For better security, consider using an environment variable (e.g. os.getenv("DB_PASSWORD")), or using getpass.getpass() to type in the password
- external_server_url - Your SQL database url & port (i.e. "{url}:{port}")
- db_name - Your SQL database name
# CHANGE_ME: SQL_DATABASE_USERNAME
sql_db_user = "SQL_DATABASE_USERNAME"
# CHANGE_ME: SQL_DATABASE_PASSWORD
sql_db_password = "SQL_DATABASE_PASSWORD"
# CHANGE_ME: EXTERNAL_SERVER_URL
external_server_url = "EXTERNAL_SERVER_URL"
# CHANGE_ME: DB_NAME
db_name = "DB_NAME"  # Replace this with your DB name.
connection_details = ConnectionDetails(
    server_user=sql_db_user,
    password=sql_db_password,
    # CHANGE_ME: Replace POSTGRESQL with one of the supported SQL Server Types outlined in the documentation above
    server_type=SQLServerTypes.POSTGRESQL,
    server_url=external_server_url,
    db_name=db_name
)
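To avoid hard-coding the database password, the environment-variable and getpass options mentioned above can be combined. The sketch below uses only the Python standard library. The helper names read_secret and format_server_url are hypothetical, DB_PASSWORD is the example variable name used earlier, and db.example.org is a placeholder host.

```python
import getpass
import os


def read_secret(env_var, prompt):
    """Return the value of `env_var` if set and non-empty, else prompt without echo."""
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass.getpass(prompt)


def format_server_url(host, port):
    """Build the "{url}:{port}" string, rejecting ports outside 1-65535."""
    port = int(port)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"{host}:{port}"


# Placeholder host and port, for illustration only.
print(format_server_url("db.example.org", 5432))
```

In the connection setup, sql_db_password = read_secret("DB_PASSWORD", "SQL database password: ") would then replace the hard-coded string, prompting only when the environment variable is not set.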
[Optional] Running Exploratory Queries
You can run SQL queries on the remote DB and receive aggregate statistics on the results of the query.
This involves two inputs:
- Define the query you want to run (note that the RHP does not limit the SQL code that is run - always connect with a DB user that has read-only permissions)
- Define the metrics you would like to calculate on the query results (See the SDK documentation for more information about what Metrics you can calculate: rhino_health.lib.metrics)
# CHANGE_ME: QUERY_STRING. Replace with the query you want to run, e.g. "SELECT * FROM <your_table> WHERE <condition>"
starting_query = "QUERY_STRING"
# CHANGE_ME: ["METRICS"]. Define a list of metrics (e.g. [Mean(variable="Height")]) as outlined in the documentation above
metric_definitions = ["METRICS"]
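Because the SQL that is run is not restricted, a client-side sanity check on the query string can serve as an extra safeguard. The sketch below is a deliberately conservative heuristic and not part of the Rhino SDK: the function name looks_read_only and the keyword lists are assumptions, it may reject valid queries (for example, ones with a semicolon inside a string literal), and it is no substitute for connecting with a read-only DB user.

```python
import re

# Assumption for this sketch: read-only statements start with SELECT or WITH.
_READ_ONLY_START = re.compile(r"^\s*(select|with)\b", re.IGNORECASE)
# Keywords that modify data or structure; any occurrence rejects the query.
_WRITE_KEYWORDS = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|merge)\b",
    re.IGNORECASE,
)


def looks_read_only(sql):
    """Heuristically return True for a single SELECT/WITH statement with no write keywords."""
    statements = [part for part in sql.split(";") if part.strip()]
    if len(statements) != 1:
        return False  # empty input or several statements chained together
    statement = statements[0]
    return bool(_READ_ONLY_START.match(statement)) and not _WRITE_KEYWORDS.search(statement)


for candidate in ("SELECT * FROM patients WHERE age > 40", "DROP TABLE patients"):
    print(candidate, "->", looks_read_only(candidate))
```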
Define the Exploratory Query Run Parameters
When defining your SQLQueryInput, your project & workgroup will be used to validate the permissions that were set at the project and site level (e.g. the k-anonymization value). For more information on permissions, please refer to Permissions.
query_run_params = SQLQueryInput(
    session=session,
    project=project.uid,
    workgroup=workgroup.uid,
    connection_details=connection_details,
    sql_query=starting_query,
    timeout_seconds=600,
    metric_definitions=metric_definitions
)
Run the Exploratory Query
Run the query on your SQL database and get the metric results.
response = session.sql_query.run_sql_query(query_run_params)
print(response.results)
Import Query Results as a Dataset in the Rhino FCP
You can run SQL queries on the remote database and then have the results of the query stored as a Dataset on the Rhino FCP allowing further processing, analysis, etc.
This involves two inputs:
- The query you want to run (note that the RHP does not limit the SQL code that is run - always connect with a DB user that has read-only permissions)
- Data needed for the Dataset creation (e.g. Dataset name)
# CHANGE_ME: QUERY_STRING. Replace with the query you want to run to generate the data for the Dataset, e.g. "SELECT * FROM <your_table> WHERE <condition>"
query = "QUERY_STRING"
# CHANGE_ME: DATASET_NAME
dataset_name = "DATASET_NAME"
# CHANGE_ME: False. Can be either True or False depending on whether the data being queried is de-identified
is_data_deidentified = False
# CHANGE_ME: SCHEMA_NAME (and possibly the version number). To auto-generate your schema, set dataschema_uid to None instead
dataschema_uid = session.project.get_data_schema_by_name('SCHEMA_NAME', project_uid=project.uid, version=1).uid
Define the Dataset Import Parameters
As with SQLQueryInput, when you create your SQLQueryImportInput, your project & workgroup will be used to validate the permissions that were set at the project and site level (e.g. the k-anonymization value). For more information on permissions, please refer to Permissions.
import_run_params = SQLQueryImportInput(
    session=session,
    project=project.uid,
    workgroup=workgroup.uid,
    connection_details=connection_details,
    dataset_name=dataset_name,
    data_schema_uid=dataschema_uid,
    timeout_seconds=600,
    is_data_deidentified=is_data_deidentified,
    sql_query=query
)
Trigger the Query to Import your Dataset
Run the query on your SQL database and the results will be imported into the Rhino FCP as a Dataset.
response = session.sql_query.import_dataset_from_sql_query(import_run_params)
print(response.results)