Using SQL to Extract Metrics and Import Data From a Database – Rhino Federated Computing

SQL Queries and Data Ingestion from SQL Databases

Note: Importing a cohort from SQL is currently only supported via the Rhino SDK.

Prerequisites

Have an SQL DB that is open to connections from a Rhino Client, with access credentials for read-only access to this DB.
Have a project where you are either part of the project's lead workgroup and the DB is within your site, or where there is a collaborator in the project that has the DB at their site.
Ensure the required site-level permissions for SQL querying (Import and export Datasets, View Dataset analytics) are enabled for the site that has the DB.
Ensure that you have the Rhino SDK installed on your computer (pip install rhino_health).

To import a Dataset directly from a SQL DB, follow the following steps:

Import your Python Dependencies

import rhino_health as rh
from rhino_health.lib.endpoints.sql_query.sql_query_dataclass import (
    SQLQueryImportInput,
    SQLQueryInput,
    SQLServerTypes,
    ConnectionDetails,
)
import getpass

Remember to change all lines with CHANGE_ME comments above them in all the blocks below.

Log into the Rhino SDK using your FCP Credentials

Your username will be the email address you log into the Rhino FCP platform with.

print("Logging In")

# CHANGE_ME: MY_USERNAME
my_username = "MY_USERNAME"

session = rh.login(username=my_username, password=getpass.getpass())
print("Logged In")

Get Supporting FCP Information Needed to Import Your Dataset

At this point, you will need the name of your project to retrieve your Project's UUID and subsequently your Workgroup UUID. You can also retrieve each object's UUID by following the instructions here: How do I retrieve a Project's, Collaborator's, Data Schema's, Dataset's, Code Object's, or Code Run's UID?

# CHANGE_ME: YOUR_FCP_PROJECT_NAME
project = session.project.get_project_by_name('YOUR_FCP_PROJECT_NAME')

workgroup = session.project.get_collaborating_workgroups(project.uid)[0]

Connection Setup

When specifying the connection details, ensure that you provide the server_type using the approved SQLServerTypes enum. This step ensures that your server is supported and compatible with the querying process. For a complete list of supported & compatible servers, refer to the SDK documentation here: SQL Server Types. Replace the following variables below:

sql_db_user - Your SQL database username. Make sure that the user has read-only permissions
sql_db_password - Your SQL database password. For better security, consider using an environment variable (e.g. os.getenv("DB_PASSWORD")), or using getpass.getpass() to type in the password
external_server_url - Your SQL database url & port (i.e. "{url}:{port}")
db_name - Your SQL database name

# CHANGE_ME: SQL_DATABASE_USERNAME
sql_db_user = "SQL_DATABASE_USERNAME"

# CHANGE_ME: SQL_DATABASE_PASSWORD
sql_db_password = "SQL_DATABASE_PASSWORD"

# CHANGE_ME: EXTERNAL_SERVER_URL
external_server_url = "EXTERNAL_SERVER_URL"

# CHANGE_ME: DB_NAME
db_name = "DB_NAME" # Replace this with your DB name.
connection_details = ConnectionDetails(
    server_user=sql_db_user,
    password=sql_db_password,
    
    # CHANGE_ME: Replace POSTGRESQL with one of the supported SQL Server Types outlined in the documentation above
    server_type=SQLServerTypes.POSTGRESQL,
    server_url=external_server_url,
    db_name=db_name
)

[Optional] Running Exploratory Queries

You can run SQL queries on the remote DB and receive aggregate statistics on the results of the query.

This involves two inputs.

Define the query you want to run (note that the RHP does not limit the SQL code that is run - always connect with a DB user that has read-only permissions)
Define the metrics you would like to calculate on the query results (See the SDK documentation for more information about what Metrics you can calculate: rhino_health.lib.metrics)

# CHANGE_ME: QUERY_STRING, Replace with query you want to run, e.g. "SELECT * FROM <your_table> WHERE <condition>"
starting_query = "QUERY_STRING"

# CHANGE_ME: ["METRICS"], # Define a list of metrics (e.g. [Mean(variable="Height")], etc.) outlined in the documentation above
metric_definitions = ["METRICS"]

Define the Exploratory Query Run Parameters

When defining your SQLQueryInput, your project & workgroup will be used to validate permissions that were set at the project and site level (i.e. k-anonymization value). For more information on permissions, please refer to Permissions.

query_run_params = SQLQueryInput(
    session=session,
    project=project.uid,
    workgroup=workgroup.uid,
    connection_details=connection_details,
    sql_query=starting_query,
    timeout_seconds=600,
    metric_definitions=metric_definitions
)

Run the Exploratory Query

Run the query on your SQL database and get the metric results.

response = session.sql_query.run_sql_query(query_run_params)
print(response.results)

Import Query Results as a Dataset in the Rhino FCP

You can run SQL queries on the remote database and then have the results of the query stored as a Dataset on the Rhino FCP allowing further processing, analysis, etc.

This involves two inputs:

The query you want to run (note that the RHP does not limit the SQL code that is run - always connect with a DB user that has read-only permissions)
Data needed for the Dataset creation (e.g. Dataset name)

# CHANGE_ME: QUERY_STRING, Replace with query you want to run to generate the data for the Dataset, e.g. "SELECT * FROM <your_table> WHERE <condition>"
query = "QUERY_STRING"

# CHANGE_ME: COHORT_NAME
cohort_name = "COHORT_NAME"

# CHANGE_ ME: False, Can be either True or False depending on whether the data being queried is de-identifed
is_data_deidentified = False # 

# CHANGE_ME: SCHEMA_NAME & Possibly Version Number too. To auto-generate your schema, set this variable to None
dataschema_uid = session.project.get_data_schema_by_name('SCHEMA_NAME', project_uid=project.uid, version=1).uid

Define the Dataset Import Parameters

Same as when defining your SQLQueryInput, when creating your SQLQueryImportInput, your project & workgroup will be used to validate permissions that were set at the project and site level (i.e. k-anonymization value). For more information on permissions, please refer to Permissions.

import_run_params = SQLQueryImportInput(
    session=session,
    project=project.uid,
    workgroup=workgroup.uid,
    connection_details=connection_details,
    dataset_name=dataset_name,
    data_schema_uid=dataschema_uid
    timeout_seconds=600,
    is_data_deidentified=is_data_deidentified,
    sql_query=query
)

Trigger the Query to Import your Dataset

Run the query on your SQL database and the results will be imported into the Rhino FCP as a Dataset.

response = session.sql_query.import_dataset_from_sql_query(import_run_params)
print(response.results)