Welcome to the Rhino Health SDK Federated Analytics Guide. This tutorial walks you through configuring and retrieving various federated metrics using the Rhino Health SDK.
Before you start implementing metrics, make sure you can log in to the Rhino Health FCP via the SDK using your Rhino Health account:
import rhino_health as rh
import getpass
# Provide your Rhino Health username
my_username = "your_username@example.com"
# Log in to Rhino Health
session = rh.login(username=my_username, password=getpass.getpass())
Retrieving a Metric
The FCP allows you to quickly and securely retrieve metrics in a federated way across multiple sites. Metric retrieval is always done in two steps:
1. Configuring the metric parameters using a metric configuration object.
2. Making an API call to the relevant endpoint.
The Metric Object
The metric configuration object is a crucial component, serving as a blueprint for metric retrieval. It allows you to specify the metric variables, grouping preferences, and data filters.
For example, let's take a look at the Mean metric, for the "Height" column:
from rhino_health.lib.metrics import Count, Mean, StandardDeviation
# Replace 'dataset_id_1', 'dataset_id_2' with actual Dataset IDs
dataset_uids = ["dataset_id_1", "dataset_id_2"]
# Create the Mean config
mean_config = Mean(variable="Height", group_by={
"groupings": ["Gender"]
},
data_filters=[
{
"filter_column": "Weight",
"filter_type": ">",
"filter_value": 50
},
{
"filter_column": "Weight",
"filter_type": "<",
"filter_value": 80
}])
# Make API call for Mean calculation
response_mean = session.project.aggregate_dataset_metric(dataset_uids, mean_config)
We can see that the Mean object is initialized with the variable of interest, i.e. the name of the Dataset column that this metric will be calculated on.
- Group By: The group_by parameter allows you to organize metrics based on specific categorical variables, providing segmentation. In this example, the mean is grouped by the Gender column in the data.
- Data filters: The data_filters parameter enables you to refine your analysis by setting conditions and filtering the output by certain criteria. In the example above, the mean is calculated only over a specific range of the Weight column.
After the metric object is set, one can use the session.project.aggregate_dataset_metric endpoint to retrieve the metric from the chosen Datasets.
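The filter-then-group-then-aggregate logic that this configuration describes can be sketched in plain Python (the records and values below are invented for illustration; this is local code, not an SDK call):

```python
# Toy records standing in for one site's tabular data (hypothetical values).
records = [
    {"Gender": "F", "Height": 160, "Weight": 55},
    {"Gender": "F", "Height": 165, "Weight": 90},  # filtered out: Weight >= 80
    {"Gender": "M", "Height": 180, "Weight": 75},
    {"Gender": "M", "Height": 175, "Weight": 45},  # filtered out: Weight <= 50
]

# Apply the data filters: 50 < Weight < 80.
filtered = [r for r in records if 50 < r["Weight"] < 80]

# Group by Gender and compute the mean Height per group.
groups = {}
for r in filtered:
    groups.setdefault(r["Gender"], []).append(r["Height"])
means = {gender: sum(v) / len(v) for gender, v in groups.items()}
```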
The Metric Response Object
When retrieving a metric, all results are returned in a MetricResponse object (or one of its derivatives). The MetricResponse object is a structured Python object that includes the specific outcome values, such as statistical measures, in the output attribute, and details about the metric configuration (the type of metric, applied filters, and relevant variables) in the metric_configuration_dict attribute.
For example, printing the results object for the ChiSquare metric will produce the following:
MetricResponse(output={
'chi_square': {
'statistic': 2.2,
'p_value': 0.809,
'dof': 2
}
},
metric_configuration_dict={
'metric': 'chi_square',
'arguments': '{
"data_filters": [],
"count_variable_name": "variable",
"variable": "id",
"variable_1": "Zb",
"variable_2": "E"
}'
})
The metric results are always under the output attribute, keyed by the metric name (in this case chi_square). The metric response values are then stored under the value name (e.g. p_value in the example above). The initial metric configuration used to generate this output can be found under the metric_configuration_dict attribute.
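Assuming a response shaped like the example above, the nested values can be pulled out with ordinary dictionary access; note that in this sketch the arguments field is treated as a JSON string, so it is decoded with json.loads:

```python
import json

# Dicts mimicking the ChiSquare MetricResponse printed above (values copied
# from the example, not real results).
output = {"chi_square": {"statistic": 2.2, "p_value": 0.809, "dof": 2}}
metric_configuration_dict = {
    "metric": "chi_square",
    "arguments": '{"data_filters": [], "variable": "id", "variable_1": "Zb", "variable_2": "E"}',
}

# Results live under output[<metric name>][<value name>].
p_value = output["chi_square"]["p_value"]

# The arguments field is a JSON string, so decode it before use.
arguments = json.loads(metric_configuration_dict["arguments"])
```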
Basic Metrics: Sum, Count, Mean, Standard Deviation
To calculate basic numeric metrics such as count, mean, sum, and standard deviation, use the following syntax to create the metric configuration and retrieve the results:
from rhino_health.lib.metrics import Count, Mean, StandardDeviation
# Replace 'dataset_id_1', 'dataset_id_2' with actual Dataset IDs
dataset_uids = ["dataset_id_1", "dataset_id_2"]
# Calculate Mean
mean_config = Mean(variable="Height")
response_mean = session.project.aggregate_dataset_metric(dataset_uids, mean_config)
# Calculate Standard Deviation
stddev_config = StandardDeviation(variable="Height") # Replace with actual variable name
response_stddev = session.project.aggregate_dataset_metric(dataset_uids, stddev_config)
# Calculate Count
count_config = Count(variable="id")
response_count = session.project.aggregate_dataset_metric(dataset_uids, count_config)
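For intuition, a federated mean can be combined from per-site aggregates (row count and sum) without moving row-level data between sites. A minimal local sketch with invented site numbers:

```python
# Per-site aggregates: (row count, sum of the variable). Values are made up,
# e.g. sums of a Height column at two sites.
site_aggregates = [(100, 17_000.0), (50, 8_750.0)]

# Combine the aggregates into a single federated mean.
total_n = sum(n for n, _ in site_aggregates)
total_sum = sum(s for _, s in site_aggregates)
federated_mean = total_sum / total_n
```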
Two-By-Two Table
The TwoByTwoTable metric facilitates the creation of a two-by-two contingency table, enabling you to analyze the relationship between two binary variables. Here is an example of generating a two-by-two table metric for the columns exposed and detected in the data:
from rhino_health.lib.metrics.epidemiology.two_by_two_table_based_metrics import TwoByTwoTable
# Calculate TBTT
tbtt_config = TwoByTwoTable(variable="detected", detected_column_name="detected", exposed_column_name="exposed")
table_result = session.project.aggregate_dataset_metric(dataset_uids, tbtt_config)
The table results are also stored in a response object, which can be parsed into a pandas DataFrame in order to view the results as a table:
import pandas as pd
# Display the result as a DataFrame
pd.DataFrame(table_result.as_table())
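Conceptually, a two-by-two table cross-tabulates two binary columns. A plain-Python sketch with invented records (not SDK code):

```python
from collections import Counter

# Toy records with binary exposure/outcome columns (hypothetical data).
records = [
    {"exposed": True, "detected": True},
    {"exposed": True, "detected": False},
    {"exposed": True, "detected": True},
    {"exposed": False, "detected": False},
    {"exposed": False, "detected": True},
]

# Count each (exposed, detected) combination.
cells = Counter((r["exposed"], r["detected"]) for r in records)
table = [
    [cells[(True, True)], cells[(True, False)]],    # exposed row
    [cells[(False, True)], cells[(False, False)]],  # unexposed row
]
```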
Odds Metric
The Odds metric calculates the odds of an event occurring rather than not occurring, and can be generated like so:
from rhino_health.lib.metrics import Odds
# Calculate Odds
odds_config = Odds(variable="SeriesUID", column_name="Pneumonia")
odds_results = session.project.aggregate_dataset_metric(dataset_uids, odds_config)
Odds Ratio Metric
The OddsRatio metric is used to calculate the odds ratio between two binary variables and can be generated as follows:
from rhino_health.lib.metrics.epidemiology.two_by_two_table_based_metrics import OddsRatio
odds_ratio_config = OddsRatio(
variable="id",
exposed_column_name="Smoking",
detected_column_name="Pneumonia",
)
odds_ratio_results = session.project.aggregate_dataset_metric(dataset_uids, odds_ratio_config)
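For a two-by-two table with cells a (exposed and detected), b (exposed, not detected), c (unexposed, detected), and d (unexposed, not detected), the odds ratio equals (a·d)/(b·c). A quick local sketch with invented counts:

```python
# Hypothetical 2x2 cell counts:
# a = exposed/detected, b = exposed/not detected,
# c = unexposed/detected, d = unexposed/not detected.
a, b, c, d = 30, 70, 10, 90

odds_exposed = a / b              # odds of detection among the exposed
odds_unexposed = c / d            # odds of detection among the unexposed
odds_ratio = odds_exposed / odds_unexposed  # equals (a * d) / (b * c)
```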
Risk Metric
The Risk metric calculates the ratio between the true outcome and the total population with respect to the detected and exposed columns.
from rhino_health.lib.metrics.base_metric import Risk
risk_config = Risk(
variable="id",
exposed_column_name="Smoking",
detected_column_name="Pneumonia",
)
risk_results = session.project.aggregate_dataset_metric(dataset_uids, risk_config)
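For intuition, risk is the detected fraction within a group, and the risk ratio compares the exposed and unexposed groups. A local sketch with invented two-by-two counts:

```python
# Hypothetical 2x2 cell counts:
# exposed group has a detected and b not detected;
# unexposed group has c detected and d not detected.
a, b, c, d = 30, 70, 10, 90

risk_exposed = a / (a + b)      # detected fraction among the exposed
risk_unexposed = c / (c + d)    # detected fraction among the unexposed
risk_ratio = risk_exposed / risk_unexposed
```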
Prevalence and Incidence
The Prevalence metric calculates the proportion of individuals who have or develop a specific condition over a specified time range, whereas Incidence describes the occurrence of new cases over a period of time. In this example, the prevalence and incidence of pneumonia are calculated within the given time range. Note that the column representing the time data should contain times in UTC format.
from rhino_health.lib.metrics import Prevalence, Incidence
prevalence_config = Prevalence(
variable="id",
time_column_name="Time Pneumonia",
detected_column_name="Pneumonia",
start_time="2023-02-02T07:07:48Z",
end_time="2023-06-10T11:24:43Z",
)
prevalence_results = session.project.aggregate_dataset_metric(dataset_uids, prevalence_config)
incidence_config = Incidence(
variable="id",
time_column_name="Time Pneumonia",
detected_column_name="Pneumonia",
start_time="2023-02-02T07:07:48Z",
end_time="2023-06-10T11:24:43Z",
)
incidence_results = session.project.aggregate_dataset_metric(dataset_uids, incidence_config)
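The time-window logic can be sketched locally with the standard-library datetime module: count the detected cases whose timestamps fall inside the range and divide by the population. The records and dates below are invented:

```python
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    # Parse an ISO-8601 UTC timestamp like "2023-02-02T07:07:48Z".
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


# Toy records: detection flag plus detection time (None if never detected).
records = [
    {"Pneumonia": 1, "Time Pneumonia": "2023-03-01T00:00:00Z"},
    {"Pneumonia": 1, "Time Pneumonia": "2023-08-01T00:00:00Z"},  # outside window
    {"Pneumonia": 0, "Time Pneumonia": None},
    {"Pneumonia": 0, "Time Pneumonia": None},
]

start = parse_utc("2023-02-02T07:07:48Z")
end = parse_utc("2023-06-10T11:24:43Z")

# Cases detected within the window, divided by the whole population.
cases_in_window = sum(
    1 for r in records
    if r["Pneumonia"] and start <= parse_utc(r["Time Pneumonia"]) <= end
)
prevalence = cases_in_window / len(records)
```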
Statistical Tests: Chi-Square, T-Test, One Way ANOVA, Wilcoxon
Chi-Square Test
The Chi-Square test is employed to assess the independence between two categorical variables. In this example, we examine the association between the occurrence of pneumonia and gender across different Datasets. The result includes the Chi-Square statistic, p-value, and degrees of freedom.
from rhino_health.lib.metrics.statistics_tests import ChiSquare
chi_square_config = ChiSquare(variable="id", variable_1="Pneumonia", variable_2="Gender")
result = session.project.aggregate_dataset_metric(dataset_uids, chi_square_config)
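To see what the statistic measures, here is a hand computation of the Chi-Square statistic and degrees of freedom from an observed contingency table (counts invented; computing the p-value additionally requires the chi-squared distribution and is omitted here):

```python
# Observed 2x2 contingency table, e.g. Pneumonia (rows) by Gender (columns).
observed = [[20, 30], [25, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected counts under independence, summed into the chi-square statistic.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi_square += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)  # degrees of freedom
```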
T-Test
The T-test is utilized to determine if there is a significant difference between the means of two groups. The implemented method is the Welch test, which does not assume equality of variances. The result includes the T-Test statistic, p-value, and degrees of freedom.
from rhino_health.lib.metrics.statistics_tests import TTest
t_test_config = TTest(numeric_variable="Height", categorical_variable="Gender")
t_test_result = session.project.aggregate_dataset_metric(dataset_uids, t_test_config)
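For intuition, the Welch statistic and its degrees of freedom can be computed by hand from each group's mean, sample variance, and size. The sample values below are invented:

```python
from statistics import mean, variance

# Toy Height samples for two groups (hypothetical values).
group_a = [170.0, 175.0, 180.0, 185.0]
group_b = [160.0, 162.0, 165.0, 168.0, 170.0]

na, nb = len(group_a), len(group_b)
va, vb = variance(group_a), variance(group_b)  # sample variances (n - 1 denominator)

# Welch's t statistic does not assume equal variances.
t_stat = (mean(group_a) - mean(group_b)) / (va / na + vb / nb) ** 0.5

# Welch-Satterthwaite approximation of the degrees of freedom.
dof = (va / na + vb / nb) ** 2 / (
    (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
)
```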
One-Way ANOVA
The One-Way ANOVA (Analysis of Variance) is applied to assess whether there are any statistically significant differences between the means of three or more independent groups. In this example, we examine the relationship between inflammation level and height. The result will contain the following calculated values: ANOVA statistic value, p-value, DFC, DFE, DFT, MSC, MSE, SSC, SSE, and SST.
from rhino_health.lib.metrics.statistics_tests import OneWayANOVA
anova_config = OneWayANOVA(variable="id", numeric_variable="Height", categorical_variable="Inflammation Level")
anova_result = session.project.aggregate_dataset_metric(dataset_uids, anova_config)
Wilcoxon signed-rank test
The signed-rank test is a non-parametric rank test for statistical hypothesis testing, used either to test the location of a population based on a sample of data, or to compare the locations of two populations using two matched samples.
Our implementation always runs the calculation on a single column of values. To compare two columns, one must first add a column with the differences between the two columns (see example below). Additionally, one must add a column with the absolute values of the column to be used for the calculation (also shown in the example below).
Regarding handling of zeros and ties: For zeros, we use the “signed-rank zero procedure” where the sign for zero values is set to zero. For ties, we use the common “average rank” method where equal values are assigned identical “ranks”, which are equal to the average of the ranks before and after them.
The result of the calculation is the signed-rank test statistic W = Σ sgn(dᵢ)·Rᵢ, where dᵢ is the i-th difference value and Rᵢ is the rank of its absolute value.
The following example shows calculating the Wilcoxon signed-rank test to examine the relationship between weight and maximum weight in a dataset:
from rhino_health.lib.metrics.statistics_tests import Wilcoxon
# Add columns with differences and absolute values thereof.
datasets = [session.dataset.get_dataset(uid) for uid in dataset_uids]
datasets_with_diffs_columns = []
for dataset in datasets:
output_datasets, _run_result = dataset.run_code(
"df['WeightDiff'] = df.MaxWeight - df.Weight\n" +
"df['WeightDiffAbs'] = df.WeightDiff.abs()"
)
datasets_with_diffs_columns.append(output_datasets[0])
# Calculate the test statistic.
new_dataset_uids = [dataset.uid for dataset in datasets_with_diffs_columns]
wilcoxon_config = Wilcoxon(variable="WeightDiff", abs_values_variable="WeightDiffAbs")
wilcoxon_result = session.project.aggregate_dataset_metric(new_dataset_uids, wilcoxon_config)
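The zero and tie handling described above can be sketched in plain Python: zeros contribute a sign of zero, and tied absolute values share their average rank. The difference values below are invented:

```python
# Toy paired differences (e.g. MaxWeight - Weight), hypothetical values.
diffs = [3.0, -1.5, 0.0, 1.5, 4.0]

# Rank the absolute values, assigning tied values their average rank.
abs_vals = [abs(d) for d in diffs]
order = sorted(range(len(abs_vals)), key=lambda i: abs_vals[i])
ranks = [0.0] * len(abs_vals)
i = 0
while i < len(order):
    j = i
    while j + 1 < len(order) and abs_vals[order[j + 1]] == abs_vals[order[i]]:
        j += 1
    avg_rank = (i + j) / 2 + 1  # ranks are 1-based
    for k in range(i, j + 1):
        ranks[order[k]] = avg_rank
    i = j + 1


def sign(d):
    # "Signed-rank zero procedure": zero differences get sign 0.
    return (d > 0) - (d < 0)


# Signed-rank statistic: sum of signs times ranks.
w_statistic = sum(sign(d) * r for d, r in zip(diffs, ranks))
```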
Correlation Coefficients
Pearson Correlation Coefficient
The Pearson correlation coefficient measures linear correlation between two sets of data.
The following example shows calculating the Pearson correlation coefficient to examine the linear relationship between height and weight in a dataset:
from rhino_health.lib.metrics.statistics_tests import Pearson
pearson_config = Pearson(variable_1="Height", variable_2="Weight")
pearson_result = session.project.aggregate_dataset_metric(dataset_uids, pearson_config)
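For intuition, the Pearson coefficient is the covariance divided by the product of the standard deviations. A local sketch with invented (deliberately perfectly linear) samples:

```python
# Toy Height/Weight samples (hypothetical, perfectly linear on purpose).
heights = [150.0, 160.0, 170.0, 180.0]
weights = [50.0, 60.0, 70.0, 80.0]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Sums of cross-products and squared deviations (the 1/n factors cancel).
cov = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
var_h = sum((h - mean_h) ** 2 for h in heights)
var_w = sum((w - mean_w) ** 2 for w in weights)

pearson_r = cov / (var_h * var_w) ** 0.5
```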
Intraclass Correlation Coefficient
The Intraclass correlation coefficient ("ICC") measures the similarity of units organized into groups. More concretely, it compares sets of values with similar distributions. This implementation of ICC calculates the earliest variant of this statistic, as proposed by Ronald Fisher in "Statistical Methods for Research Workers".
The following example shows calculating ICC to examine the relationship between weight and maximum weight in a dataset:
from rhino_health.lib.metrics.statistics_tests import ICC
icc_config = ICC(variable_1="Weight", variable_2="MaxWeight")
icc_result = session.project.aggregate_dataset_metric(dataset_uids, icc_config)
Spearman's rank correlation coefficient
Spearman's rank correlation coefficient is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function.
The following example shows calculating Spearman's rank correlation coefficient to examine the monotonic relationship between height and weight in a dataset:
from rhino_health.lib.metrics.statistics_tests import Spearman
spearman_config = Spearman(variable_1="Height", variable_2="Weight")
spearman_result = session.project.aggregate_dataset_metric(dataset_uids, spearman_config)
Kaplan-Meier
The Kaplan-Meier Metric is a powerful tool for analyzing time-to-event data, such as patient survival rates.
Running Kaplan-Meier Metric
Basic Kaplan-Meier Metric
For configuring the basic Kaplan-Meier metric, you can set up the metric and retrieve results as follows:
from rhino_health.lib.metrics import KaplanMeier
# Set the time and event variables
time_variable = "time_column_name"
event_variable = "event_column_name"
# The uids of the datasets of interest
dataset_uids = ["dataset_uid_1", "dataset_uid_2"]
# Create a KaplanMeier instance
metric_configuration = KaplanMeier(time_variable=time_variable, event_variable=event_variable)
# Retrieve results for your project and datasets
results = session.project.aggregate_dataset_metric(dataset_uids=dataset_uids, metric_configuration=metric_configuration)
Grouping and Filtering
Grouping and filtering the data before the Kaplan-Meier metric calculation can be done using the existing metric grouping and filtering configurations:
from rhino_health.lib.metrics import KaplanMeier
# Set the time and event variables
time_variable = "time_column_name"
event_variable = "event_column_name"
# The uids of the datasets of interest
dataset_uids = ["dataset_uid_1", "dataset_uid_2"]
# Define data filters and groupings
data_filters = [
{"filter_column": "time_column_name", "filter_type": ">", "filter_value": 400},
{"filter_column": "time_column_name", "filter_type": "<", "filter_value": 460},
]
group_by = {"groupings": ["time_cat"]}
# Create a KaplanMeier instance with filters and groupings
metric_configuration = KaplanMeier(time_variable=time_variable, event_variable=event_variable, group_by=group_by, data_filters=data_filters)
# Retrieve results and analyze specific groups
results = session.project.aggregate_dataset_metric(dataset_uids=dataset_uids, metric_configuration=metric_configuration)
group_1_results = results.output["1"]
Working with Kaplan-Meier Metric Results
Extracting Time and Events Vector
The Kaplan-Meier Metric in the Rhino Health Platform provides results that allow you to analyze time-to-event data, create survival models, and visualize Kaplan-Meier curves.
The results of the Kaplan-Meier Metric are stored in a KaplanMeierModelResults object with an "output" attribute that contains time and event vectors. Access these vectors as follows:
# Accessing the vectors using the names of the original time and event data columns
# For non grouped results
time_vector = results.output["time_column_name"]
event_vector = results.output["event_column_name"]
# For grouped results, where the group of interest is "1"
time_vector_group_1 = results.output["1"]["time_column_name"]
event_vector_group_1 = results.output["1"]["event_column_name"]
By obtaining these vectors, you can proceed to create a survival model and gain more insights from your Kaplan-Meier data in any way desired.
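Assuming the time vector holds event or censoring times and the event vector holds 1 for an observed event and 0 for censoring, the Kaplan-Meier estimate is a running product over the event times. A plain-Python sketch with invented data (for real analyses, the statsmodels SurvFuncRight model is preferable):

```python
# Toy time-to-event data: times with 1 = event observed, 0 = censored.
times = [5, 8, 8, 12, 15]
events = [1, 0, 1, 1, 0]

# Kaplan-Meier: at each distinct event time t, multiply the survival estimate
# by (1 - d_t / n_t), where d_t = events at t and n_t = subjects still at risk.
pairs = sorted(zip(times, events))
survival = 1.0
curve = []  # (time, survival estimate) after each observed event time
at_risk = len(pairs)
i = 0
while i < len(pairs):
    t = pairs[i][0]
    d_t = sum(1 for tt, e in pairs if tt == t and e == 1)
    if d_t:
        survival *= 1 - d_t / at_risk
        curve.append((t, survival))
    at_risk -= sum(1 for tt, _ in pairs if tt == t)  # leave the risk set at t
    while i < len(pairs) and pairs[i][0] == t:
        i += 1
```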
Creating a Survival Model
The Rhino Health Platform SDK provides a convenient way to obtain the survival model object, which allows you to explore detailed Kaplan-Meier analysis. The object is a SurvFuncRight object from the statsmodels library:
# For non grouped results
survival_model = results.surv_func_right_model()
# For grouped results, get the survival model where the group of interest is "1"
group = "1"
survival_model = results.surv_func_right_model(group=group)
# Access various properties of the survival model
median_time = survival_model.quantile(0.5) # Median survival time
cumulative_hazard = survival_model.cumulative_hazard_at_times([100, 200, 300]) # Cumulative hazard at specific times
print(survival_model.summary())
The summary output is a table of survival probabilities and their standard errors, together with the number at risk and the number of events at each time point.
Note that to use this feature, you need to have the statsmodels library installed in your Python environment. If you haven't installed it yet, you can do so using `pip`:
pip install statsmodels
Plotting Kaplan-Meier Curves
Visualizing Kaplan-Meier curves is a way to gain insights into your survival data. The Rhino Health Platform SDK KaplanMeierMetricResults object can be used to plot these curves, and the matplotlib.pyplot library is a convenient way to do so:
import matplotlib.pyplot as plt
# Accessing time and event vectors
time_vector = results.output["time_column_name"]
event_vector = results.output["event_column_name"]
# Plot Kaplan-Meier survival curve
plt.figure(figsize=(10, 6))
plt.step(time_vector, event_vector, where='post', label="model 1")
plt.legend(loc="upper left")
plt.title("Kaplan-Meier Survival Curve")
plt.xlabel("Time")
plt.ylabel("Survival Probability")
plt.show()
Differential Privacy for Kaplan-Meier Metric
Differential privacy is a technique used in the FCP to protect patient data by adding noise to query results. Like all FCP metrics, the Kaplan-Meier metric supports differential privacy, and you can configure the privacy enforcement level in your project settings. The default privacy level is Medium, but you can select from None, Low, Medium, or High according to your project's privacy requirements, where:
- None - No noise is added to any of the data.
- Low, Medium - Noise is partially added to the data. Times at which fewer than k (the anonymity threshold) events occur are aggregated and averaged with events occurring at adjacent times, and noise is then added to them.
- High - Noise is added to all of the time data.
To learn more about configuring differential privacy settings, please refer to Specifying a Project's Permissions Policy.
Cox Proportional Hazard
The Cox metric, also known as the proportional hazards ratio or Cox proportional hazards model is a statistical technique used in survival analysis to assess the relationship between the survival time of individuals and one or more predictor variables (covariates). The Cox metric utilizes the Newton-Raphson optimization method to estimate the coefficients (betas) of the covariates in the model.
Note: Check the Rhino user-resources GitHub repo for an example notebook to get started!
Running the Cox Metric
For configuring the basic Cox metric, you can set up the metric and retrieve results as follows:
from rhino_health.lib.metrics import Cox
# Set the time and event variables
time_variable = "time_column_name"
event_variable = "event_column_name"
covariates = ["c1", "c2"]
# The uids of the datasets of interest
dataset_uids = ["dataset_uid_1", "dataset_uid_2"]
# Create a Cox instance
metric_configuration = Cox(time_variable=time_variable, event_variable=event_variable, covariates=covariates, initial_beta="mean", max_iterations=50)
# Retrieve results for your project and datasets
results = session.project.aggregate_dataset_metric(dataset_uids=dataset_uids, metric_configuration=metric_configuration)
The expected results of the Cox metric are the betas vector and the standard error for each coefficient of the beta vector.
The initial_beta argument allows you to specify how the beta coefficients are initialized before the optimization process begins. If set to "mean", the initial beta coefficients are initialized to the mean values of the local beta coefficients calculated at each site (using statsmodels' PHReg implementation). This can provide a starting point that is closer to the optimal solution. Alternatively, setting initial_beta to "zero" initializes all beta coefficients to zero before optimization starts.
The max_iterations argument determines the maximum number of iterations that the Newton-Raphson optimization algorithm will perform to estimate the beta coefficients. The Newton-Raphson method is an iterative process that updates the beta coefficients at each iteration until convergence; if convergence is not achieved within the specified maximum number of iterations, the optimization stops and the current estimates of the beta coefficients are returned. Setting a higher value for max_iterations allows more iterations, potentially leading to better convergence and more accurate coefficient estimates, but it also increases computation time.
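The iterate-until-convergence behavior described above can be illustrated with a one-dimensional Newton-Raphson sketch; the function solved here is a toy example, not the Cox partial likelihood:

```python
def newton_raphson(f, f_prime, x0, max_iterations=50, tol=1e-12):
    """Iterate x <- x - f(x) / f'(x) until the step is tiny or max_iterations is hit."""
    x = x0
    for _ in range(max_iterations):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:  # converged: further updates are negligible
            break
    return x


# Toy example: find the root of x**2 - 2 starting from x0 = 1.0.
root = newton_raphson(lambda x: x ** 2 - 2, lambda x: 2 * x, x0=1.0)
```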
Differential Privacy for Cox
Differential privacy is a technique used in the FCP to protect patient data by adding noise to query results. Like all FCP metrics, the Cox metric supports differential privacy, and you can configure the privacy enforcement level in your project settings. The default privacy level is Medium, but you can select from None, Low, Medium, or High according to your project's privacy requirements, where:
- None - No noise is added to any of the data.
- Low, Medium - Noise is partially added to the data. Times at which fewer than k (the anonymity threshold) events occur are aggregated and averaged with events occurring at adjacent times, and noise is then added to them.
- High - Noise is added to all of the time data.
To learn more about configuring differential privacy settings, please refer to Specifying a Project's Permissions Policy.