SQL Queries and Data Ingestion from SQL Databases
*Note: Importing a cohort from SQL is currently only supported via the Rhino SDK.
Prerequisites
- Have an SQL DB that is open to connections from a Rhino Client, with access credentials for read-only access to this DB.
- Have a project where you are either part of the project's lead workgroup and the DB is within your site, or where there is a collaborator in the project that has the DB at their site.
- Ensure the required site-level permissions for SQL querying (Import and export Datasets, View Dataset analytics) are enabled for the site that has the DB.
- Ensure that you have the Rhino SDK installed on your computer (pip install rhino_health).
To import a Dataset directly from a SQL DB, follow the following steps:
Import your Python Dependencies
import rhino_health as rh from rhino_health.lib.endpoints.sql_query.sql_query_dataclass import ( SQLQueryImportInput, SQLQueryInput, SQLServerTypes, ConnectionDetails, ) import getpass
Remember to change all lines with CHANGE_ME comments above them in all the blocks below!
Log into the Rhino SDK using your FCP Credentials
Your username will be the email address you log into the Rhino FCP platform with.
print("Logging In") # CHANGE_ME: MY_USERNAME my_username = "MY_USERNAME" session = rh.login(username=my_username, password=getpass.getpass()) print("Logged In")
Get Supporting FCP Information Needed to Import Your Dataset
At this point, you will need the name of your project to retrieve your Project's UUID and subsequently your Workgroup UUID. You can also retrieve each object's UUID by following the instructions here: How do I retrieve a Project's, Collaborator's, Data Schema's, Dataset's, Code Object's, or Code Run's UID?
# CHANGE_ME: YOUR_FCP_PROJECT_NAME project = session.project.get_project_by_name('YOUR_FCP_PROJECT_NAME') workgroup = session.project.get_collaborating_workgroups(project.uid)[0]
Connection Setup
When specifying the connection details, ensure that you provide the server_type using the approved SQLServerTypes enum. This step ensures that your server is supported and compatible with the querying process. For a complete list of supported & compatible servers, refer to the SDK documentation here: SQL Server Types. Replace the following variables below:
- sql_db_user - Your SQL database username. Make sure that the user has read-only permissions
- sql_db_password - Your SQL database password. For better security, consider using an environment variable (e.g. os.getenv("DB_PASSWORD")), or using getpass.getpass() to type in the password
- external_server_url - Your SQL database url & port (i.e. "{url}:{port}")
- db_name - Your SQL database name
# CHANGE_ME: SQL_DATABASE_USERNAME sql_db_user = "SQL_DATABASE_USERNAME" # CHANGE_ME: SQL_DATABASE_PASSWORD sql_db_password = "SQL_DATABASE_PASSWORD" # CHANGE_ME: EXTERNAL_SERVER_URL external_server_url = "EXTERNAL_SERVER_URL" # CHANGE_ME: DB_NAME db_name = "DB_NAME" # Replace this with your DB name. connection_details = ConnectionDetails( server_user=sql_db_user, password=sql_db_password, # CHANGE_ME: Replace POSTGRESQL with one of the supported SQL Server Types outlined in the documentation above server_type=SQLServerTypes.POSTGRESQL, server_url=external_server_url, db_name=db_name )
[Optional] Running Exploratory Queries
You can run SQL queries on the remote DB and receive aggregate statistics on the results of the query.
This involves two inputs:
- Define the query you want to run (note that the RHP does not limit the SQL code that is run - always connect with a DB user that has read-only permissions)
- Define the metrics you would like to calculate on the query results (See the SDK documentation for more information about what Metrics you can calculate: rhino_health.lib.metrics)
# CHANGE_ME: QUERY_STRING, Replace with query you want to run, e.g. "SELECT * FROM <your_table> WHERE <condition>" starting_query = "QUERY_STRING" # CHANGE_ME: ["METRICS"], # Define a list of metrics (e.g. [Mean(variable="Height")], etc.) outlined in the documentation above metric_definitions = ["METRICS"]
Define the Exploratory Query Run Parameters
When defining your SQLQueryInput, your project & workgroup will be used to validate permissions that were set at the project and site level (i.e. k-anonymization value). For more information on permissions, please refer to Permissions.
query_run_params = SQLQueryInput( session=session, project=project.uid, workgroup=workgroup.uid, connection_details=connection_details, sql_query=starting_query, timeout_seconds=600, metric_definitions=metric_definitions )
Run the Exploratory Query
Run the query on your SQL database and get the metric results.
response = session.sql_query.run_sql_query(query_run_params) print(response.results)
Import Query Results as a Dataset in the Rhino FCP
You can run SQL queries on the remote database and then have the results of the query stored as a Dataset on the Rhino FCP allowing further processing, analysis, etc.
This involves two inputs:
- The query you want to run (note that the RHP does not limit the SQL code that is run - always connect with a DB user that has read-only permissions)
- Data needed for the Dataset creation (e.g. Dataset name)
# CHANGE_ME: QUERY_STRING, Replace with query you want to run to generate the data for the Dataset, e.g. "SELECT * FROM <your_table> WHERE <condition>" query = "QUERY_STRING" # CHANGE_ME: COHORT_NAME cohort_name = "COHORT_NAME" # CHANGE_ ME: False, Can be either True or False depending on whether the data being queried is de-identifed is_data_deidentified = False # # CHANGE_ME: SCHEMA_NAME & Possibly Version Number too. To auto-generate your schema, set this variable to None dataschema_uid = session.project.get_data_schema_by_name('SCHEMA_NAME', project_uid=project.uid, version=1).uid
Define the Dataset Import Parameters
Same as when defining your SQLQueryInput, when creating your SQLQueryImportInput, your project & workgroup will be used to validate permissions that were set at the project and site level (i.e. k-anonymization value). For more information on permissions, please refer to Permissions.
import_run_params = SQLQueryImportInput( session=session, project=project.uid, workgroup=workgroup.uid, connection_details=connection_details, dataset_name=dataset_name, data_schema_uid=dataschema_uid timeout_seconds=600, is_data_deidentified=is_data_deidentified, sql_query=query )
Trigger the Query to Import your Dataset
Run the query on your SQL database and the results will be imported into the Rhino FCP as a Dataset.
response = session.sql_query.import_dataset_from_sql_query(import_run_params) print(response.results)