Calculating Federated Analytics (Part 2) – Rhino Federated Computing

The Rhino SDK Federated Analytics Guide is a tutorial that shows you how to configure and retrieve federated metrics using the Rhino SDK.

About this Tutorial

Part 1 of this tutorial addresses the following:

Retrieving a Metric
Basic Metrics (Sum, Count, Mean, Standard Deviation)
Two-by-Two Tables
Odds and Risk Metrics
Prevalence and Incidence
Statistical Tests: Chi-Square, T-Test, One Way ANOVA, Wilcoxon

Part 2 of this tutorial addresses the following:

Correlation Coefficients
Kaplan-Meier
Cox Proportional Hazard
Regressions

Note: Part 1 of this tutorial is here.

Correlation Coefficients

Pearson Correlation Coefficient

The Pearson correlation coefficient measures linear correlation between two sets of data.

The following example shows calculating the Pearson correlation coefficient to examine the linear relationship between height and weight in a dataset:

from rhino_health.lib.metrics.statistics_tests import Pearson 

pearson_config = Pearson(variable_1="Height", variable_2="Weight")

pearson_result = project.aggregate_dataset_metric(dataset_uids, pearson_config)
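For intuition about why this statistic works well in a federated setting, note that Pearson's r depends only on six sums (n, Σx, Σy, Σx², Σy², Σxy), each of which can be computed locally at a site and then added together. The following is a minimal sketch of that idea and is not the SDK's implementation; site_sums and pooled_pearson are hypothetical names, and the height and weight values are made up for illustration:

```python
import math

def site_sums(xs, ys):
    """Local sufficient statistics for one site (hypothetical helper)."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(y * y for y in ys),
        sum(x * y for x, y in zip(xs, ys)),
    )

def pooled_pearson(per_site):
    """Add the per-site sums together and return Pearson's r for the pooled data."""
    n, sx, sy, sxx, syy, sxy = (sum(column) for column in zip(*per_site))
    covariance = n * sxy - sx * sy
    variance_x = n * sxx - sx * sx
    variance_y = n * syy - sy * sy
    return covariance / math.sqrt(variance_x * variance_y)

# Illustrative (made-up) height/weight values held at two sites
site_a = site_sums([150, 160, 170], [50, 58, 69])
site_b = site_sums([165, 180], [62, 80])
r = pooled_pearson([site_a, site_b])
```

Only the six sums leave each site in this sketch, never the individual rows.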

Intraclass Correlation Coefficient

The intraclass correlation coefficient ("ICC") measures how strongly units organized into the same group resemble each other. More concretely, it compares sets of values with similar distributions. This implementation of ICC calculates the earliest variation of this statistic, as proposed by Ronald Fisher in "Statistical Methods for Research Workers".

The following example shows calculating ICC to examine the relationship between weight and maximum weight in a dataset:

from rhino_health.lib.metrics.statistics_tests import ICC 

icc_config = ICC(variable_1="Weight", variable_2="MaxWeight")

icc_result = project.aggregate_dataset_metric(dataset_uids, icc_config)
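As a reference point, Fisher's original ICC for N paired values uses a single mean and a single variance pooled over both columns, which is what distinguishes it from the Pearson coefficient. The following is a small single-machine sketch of that formula, not the SDK's code; fisher_icc is a hypothetical name:

```python
def fisher_icc(first, second):
    """Fisher's original intraclass correlation for paired measurements."""
    n = len(first)
    # One mean and one variance, pooled across both columns
    pooled_mean = (sum(first) + sum(second)) / (2 * n)
    pooled_var = (
        sum((a - pooled_mean) ** 2 for a in first)
        + sum((b - pooled_mean) ** 2 for b in second)
    ) / (2 * n)
    cross = sum((a - pooled_mean) * (b - pooled_mean) for a, b in zip(first, second))
    return cross / (n * pooled_var)

identical = fisher_icc([1, 2, 3], [1, 2, 3])     # perfect agreement
shifted = fisher_icc([1, 2, 3], [11, 12, 13])    # same shape, very different level
```

Because the mean is pooled, a constant offset between the two columns lowers the ICC even though the Pearson coefficient of the same columns would be 1.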

Spearman's rank correlation coefficient

Spearman's rank correlation coefficient is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function.

The following example shows calculating Spearman's rank correlation coefficient to examine the monotonic relationship between height and weight in a dataset:

from rhino_health.lib.metrics.statistics_tests import Spearman 

spearman_config = Spearman(variable_1="Height", variable_2="Weight")

spearman_result = project.aggregate_dataset_metric(dataset_uids, spearman_config)
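To make the definition concrete, Spearman's coefficient is the Pearson coefficient computed on ranks, with tied values sharing the mean of their ranks. The sketch below shows this on a single machine with made-up values. It does not show how the platform derives ranks across sites, which this tutorial does not describe, and average_ranks, pearson and spearman are hypothetical helper names:

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))

heights = [150, 160, 170, 180, 190]
weights = [50, 55, 70, 120, 200]   # monotonic in height, but clearly non-linear
rho = spearman(heights, weights)
```

Here the relationship is perfectly monotonic, so rho is 1 while the Pearson coefficient of the raw values is lower.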

Kaplan-Meier

The Kaplan-Meier Metric is a powerful tool for analyzing time-to-event data, such as patient survival rates.

Running Kaplan-Meier Metric

Basic Kaplan-Meier Metric

For configuring the basic Kaplan-Meier metric, you can set up the metric and retrieve results as follows:

from rhino_health.lib.metrics import KaplanMeier

# Set the time and event variables
time_variable = "time_column_name"
event_variable = "event_column_name"

# The uids of the datasets of interest
dataset_uids = ["<uid1>", "<uid2>"]

# Create a KaplanMeier instance
metric_configuration = KaplanMeier(time_variable=time_variable, event_variable=event_variable)

# Retrieve results for your project and datasets
results = project.aggregate_dataset_metric(dataset_uids=dataset_uids, metric_configuration=metric_configuration)

Grouping and Filtering

You can group and filter the data before the Kaplan-Meier calculation by using the existing metric grouping and filtering configurations:

from rhino_health.lib.metrics import KaplanMeier

# Set the time and event variables
time_variable = "time_column_name"
event_variable = "event_column_name"

# The uids of the datasets of interest
dataset_uids = ["<uid1>", "<uid2>"]

# Define data filters and groupings
data_filters = [
    {"filter_column": "time_column_name", "filter_type": ">", "filter_value": 400},
    {"filter_column": "time_column_name", "filter_type": "<", "filter_value": 460},
]

group_by = {"groupings": ["time_cat"]}

# Create a KaplanMeier instance with filters and groupings
metric_configuration = KaplanMeier(time_variable=time_variable, event_variable=event_variable, group_by=group_by, data_filters=data_filters)

# Retrieve results and analyze specific groups
results = project.aggregate_dataset_metric(dataset_uids=dataset_uids, metric_configuration=metric_configuration)
group_1_results = results.output["1"]

Working with Kaplan-Meier Metric Results

Extracting Time and Events Vector

The Kaplan-Meier Metric in the Rhino Platform provides results that allow you to analyze time-to-event data, create survival models, and visualize Kaplan-Meier curves.

The results of the Kaplan-Meier Metric are stored in a KaplanMeierModelResults object with an "output" attribute that contains time and event vectors. Access these vectors as follows:

# Accessing the vectors using the names of the original time and event data columns
# For non-grouped results
time_vector = results.output["time_column_name"]
event_vector = results.output["event_column_name"]

# For grouped results, where the group of interest is "1"
time_vector_group_1 = results.output["1"]["time_column_name"]
event_vector_group_1 = results.output["1"]["event_column_name"]

By obtaining these vectors, you can proceed to create a survival model and gain more insights from your Kaplan-Meier data in any way desired.

Creating a Survival Model

The Rhino Platform SDK provides a convenient way to obtain the survival model object, which allows you to explore detailed Kaplan-Meier analysis. The object is a SurvfuncRight object from the statsmodels library:

# For non-grouped results
survival_model = results.surv_func_right_model()

# For grouped results, get the survival model where the group of interest is "1"
group = "1"
survival_model = results.surv_func_right_model(group=group)

# Access various properties of the survival model
median_time = survival_model.quantile(0.5)  # Median survival time
cumulative_hazard = survival_model.cumulative_hazard_at_times([100, 200, 300])  # Cumulative hazard at specific times
print(survival_model.summary())

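The printed summary is a table indexed by time. For a statsmodels SurvfuncRight object, its columns are the survival probability, its standard error, the number at risk, and the number of events at each time.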

Note that to use this feature, you need to have the statsmodels library installed in your Python environment. If you haven't installed it yet, you can do so using `pip`:

pip install statsmodels

Plotting Kaplan-Meier Curves

Visualizing Kaplan-Meier curves is a way to gain insights into your survival data. The time and event vectors in the Rhino Platform SDK KaplanMeierMetricResults object can be used to plot these curves, and the matplotlib.pyplot library is a convenient way to do so:

import matplotlib.pyplot as plt

# Accessing time and event vectors
time_vector = results.output["time_column_name"]
event_vector = results.output["event_column_name"]

# Plot Kaplan-Meier survival curve
plt.figure(figsize=(10, 6))
plt.step(time_vector, event_vector, where='post', label="model 1")
plt.legend(loc="upper left")
plt.title("Kaplan-Meier Survival Curve")
plt.xlabel("Time")
plt.ylabel("Survival Probability")
plt.show()
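Note that the y-axis of a Kaplan-Meier curve is the estimated survival probability, which has to be derived from the time and event vectors. The following is a minimal sketch of the product-limit estimate, assuming the time vector holds one duration per subject and the event vector holds 1 for an observed event and 0 for a censored one. The kaplan_meier helper and the sample values are hypothetical, and with privacy noise enabled the returned vectors may not match this assumption exactly:

```python
def kaplan_meier(times, events):
    """Product-limit estimate: returns (event_times, survival_probabilities)."""
    at_risk = len(times)
    survival = 1.0
    curve_times, curve_probs = [], []
    for t in sorted(set(times)):
        # Subjects with an observed event at time t
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        # Everyone who leaves the risk set at time t (events and censored)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve_times.append(t)
            curve_probs.append(survival)
        at_risk -= leaving
    return curve_times, curve_probs

times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]   # 1 = event observed, 0 = censored
km_times, km_probs = kaplan_meier(times, events)
# km_times -> [5, 8, 12, 20]; km_probs is approximately [5/6, 2/3, 4/9, 0]
```

Passing [0] + km_times and [1.0] + km_probs to plt.step with where="post" draws the familiar descending staircase. The survival model object described in the previous section offers the same estimate without the manual computation.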

Differential Privacy for Kaplan-Meier Metric

Differential privacy is a technique used in the FCP to protect patient data by adding noise to query results. Like all FCP metrics, the Kaplan-Meier metric supports differential privacy, and you can configure the privacy enforcement level in your project settings. The default privacy level is Medium, but you can select None, Low, Medium, or High according to your project's privacy requirements. The levels behave as follows:

None - No noise is added to any of the data.
Low, Medium - Noise is added to part of the data. Times at which fewer than k events occur (k being the anonymity threshold) are aggregated and averaged with events occurring at adjacent times, and noise is then added to them.
High - Noise is added to all of the time data.

To learn more about configuring differential privacy settings, please refer to Specifying a Project's Permissions Policy.
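To give a feel for what pooling sparse event times and then adding noise can look like, here is a rough sketch. It is one possible interpretation of the description above and is not the FCP's actual mechanism, which this tutorial does not specify in detail. The function names, the greedy binning rule, and the Laplace noise scale are all assumptions made for illustration:

```python
import random
from collections import Counter

def pool_sparse_times(event_times, k):
    """Replace event times with bin means so each bin covers at least k events."""
    bins, current = [], []
    for t, count in sorted(Counter(event_times).items()):
        current.extend([t] * count)
        if len(current) >= k:      # bin is large enough, close it
            bins.append(current)
            current = []
    if current:                    # leftover tail joins the previous bin, if any
        if bins:
            bins[-1].extend(current)
        else:
            bins.append(current)
    return [sum(b) / len(b) for b in bins for _ in b]

def add_laplace_noise(values, scale, seed=None):
    """Add Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    rng = random.Random(seed)
    return [v + rng.expovariate(1 / scale) - rng.expovariate(1 / scale) for v in values]

# Times 1, 2 and 3 each have a single event, so they are pooled into their mean
pooled = pool_sparse_times([1, 2, 3, 10, 10, 10], k=3)
noisy = add_laplace_noise(pooled, scale=0.5, seed=0)
```

In this sketch a time that already has at least k events keeps its own value when no sparse times precede it, which mirrors the idea that only sparse times need to be aggregated.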

Cox Proportional Hazard

The Cox metric, also known as the proportional hazards ratio or Cox proportional hazards model, is a statistical technique used in survival analysis to assess the relationship between the survival time of individuals and one or more predictor variables (covariates). The Cox metric uses the Newton-Raphson optimization method to estimate the coefficients (betas) of the covariates in the model.

Note: Check the Rhino user-resources GitHub repo for an example notebook to get started!

Running the Cox Metric

For configuring the basic Cox metric, you can set up the metric and retrieve results as follows:

from rhino_health.lib.metrics import Cox

# Set the time and event variables
time_variable = "time_column_name"
event_variable = "event_column_name"
covariates = ["c1", "c2"]

# The uids of the datasets of interest
dataset_uids = ["<uid1>", "<uid2>"]

# Create a Cox instance
metric_configuration = Cox(time_variable=time_variable, event_variable=event_variable, covariates=covariates, initial_beta="mean", max_iterations=50)

# Retrieve results for your project and datasets
results = project.aggregate_dataset_metric(dataset_uids=dataset_uids, metric_configuration=metric_configuration)

The Cox metric returns the vector of beta coefficients and the standard error of each coefficient.

The initial_beta argument allows you to specify how the beta coefficients are initialized before the optimization process begins.
If set to "mean", the beta coefficients are initialized to the mean of the local beta coefficients calculated at each site (using the statsmodels PHReg model). This can provide a starting point that is closer to the optimal solution.
Alternatively, setting initial_beta to "zero" initializes all beta coefficients to zero before optimization starts.

The max_iterations argument determines the maximum number of iterations that the Newton-Raphson optimization algorithm will perform to estimate the beta coefficients.
The Newton-Raphson method is iterative: it updates the beta coefficients in each iteration until convergence. If convergence is not achieved within the specified maximum number of iterations, the optimization stops and the current estimates of the beta coefficients are returned.
Setting a higher value for max_iterations allows for more iterations, potentially leading to better convergence and more accurate coefficient estimates. However, it also increases computational time.
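To illustrate how the starting value and the iteration cap interact, the following single-machine sketch runs Newton-Raphson on the Cox partial log-likelihood for a single covariate (with the Breslow treatment if event times are tied). It is not the SDK's federated implementation. The function names, the tolerance, the sample data, and the numeric initial_beta are assumptions made for this example:

```python
import math

def score_and_information(beta, times, events, x):
    """Score (gradient) and observed information of the partial log-likelihood."""
    score, info = 0.0, 0.0
    for t_i, d_i, x_i in zip(times, events, x):
        if not d_i:
            continue  # censored rows contribute only through the risk sets
        risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        w = [math.exp(beta * x_j) for x_j in risk]
        s0 = sum(w)
        s1 = sum(w_j * x_j for w_j, x_j in zip(w, risk))
        s2 = sum(w_j * x_j * x_j for w_j, x_j in zip(w, risk))
        score += x_i - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return score, info

def fit_cox(times, events, x, initial_beta=0.0, max_iterations=50, tol=1e-10):
    """Return (beta, standard_error, iterations_used)."""
    beta, iteration = initial_beta, 0
    for iteration in range(1, max_iterations + 1):
        score, info = score_and_information(beta, times, events, x)
        step = score / info          # Newton-Raphson update
        beta += step
        if abs(step) < tol:          # converged before hitting the cap
            break
    _, info = score_and_information(beta, times, events, x)
    return beta, 1 / math.sqrt(info), iteration

times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 0]   # last subject is censored
x = [1, 0, 1, 0, 0, 1]
beta, se, n_iter = fit_cox(times, events, x)
```

If the cap is reached first, the loop simply returns the latest estimate, which is why a cap that is too low can yield coefficients that have not yet converged.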

Differential Privacy for Cox

Differential privacy is a technique used in the FCP to protect patient data by adding noise to query results. Like all FCP metrics, the Cox metric supports differential privacy, and you can configure the privacy enforcement level in your project settings. The default privacy level is Medium, but you can select None, Low, Medium, or High according to your project's privacy requirements. The levels behave as follows:

None - No noise is added to any of the data.
Low, Medium - Noise is added to part of the data. Times at which fewer than k events occur (k being the anonymity threshold) are aggregated and averaged with events occurring at adjacent times, and noise is then added to them.
High - Noise is added to all of the time data.

To learn more about configuring differential privacy settings, please refer to Specifying a Project's Permissions Policy.

Regressions

NOTE: These regressions are not available as direct SDK methods. Instead, you must create an NVFlare Code object to run them. Please refer to Creating New NVFlare Code or Code Version for setup instructions. We have provided specific code examples for each Regression type in the sections below.

Regression is a statistical method designed to investigate the relationship between one or more predictor (independent) variables and a target (dependent) variable. In the realm of machine learning, regression analysis is a foundational concept comprising various methods used to predict a continuous outcome value (y) based on the input values of one or more features (x). Ultimately, it serves to model and quantify the effect that changes in one variable have on another.

This section provides links to several different types of regressions.

GLM Coefficient Estimation

This example contains files to fit a federated Generalized Linear Model (GLM) to estimate coefficients using Rhino Federated Computing Platform (FCP) and NVIDIA FLARE (NVFLARE). It demonstrates how to:

Fit a federated GLM to estimate coefficients and standard errors using NVFlare, supporting different GLM families (e.g. Binomial, Gaussian, Poisson, etc.).
Use an optimization method for aggregating each of the client's model parameters using NVFlare (with the examples of Newton-Raphson and Iteratively Reweighted Least Squares (IRLS) optimizers).
Read different configurations for the federated server and client from a config file, including the GLM family type, the optimization method, and the formula for the regression.
Package the code in a Docker container that can be used with FCP.
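The aggregation idea behind the optimizers can be sketched without NVFlare: each client computes a gradient and a Hessian on its own rows, and the server sums them and takes one Newton-Raphson step per round. The code below is a simplified illustration for a Binomial (logistic) model with an intercept and one predictor, where Newton-Raphson and IRLS give the same update. It is not the code from the example files, and all names and data values are hypothetical:

```python
import math

def site_gradient_hessian(beta, xs, ys):
    """Gradient and negative Hessian of the logistic log-likelihood on one site."""
    b0, b1 = beta
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        g0 += y - p
        g1 += (y - p) * x
        w = p * (1 - p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (g0, g1), (h00, h01, h11)

def federated_newton(sites, max_iterations=25, tol=1e-10):
    """Server loop: sum the client contributions, then solve the 2x2 Newton system."""
    beta = (0.0, 0.0)
    for _ in range(max_iterations):
        parts = [site_gradient_hessian(beta, xs, ys) for xs, ys in sites]
        g0 = sum(p[0][0] for p in parts)
        g1 = sum(p[0][1] for p in parts)
        h00 = sum(p[1][0] for p in parts)
        h01 = sum(p[1][1] for p in parts)
        h11 = sum(p[1][2] for p in parts)
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        beta = (beta[0] + d0, beta[1] + d1)
        if max(abs(d0), abs(d1)) < tol:
            break
    return beta

# Made-up rows held by two clients: (predictor values, binary outcomes)
site_a = ([0, 1, 2, 3], [0, 0, 1, 0])
site_b = ([1, 2, 3, 4], [0, 1, 1, 1])
beta = federated_newton([site_a, site_b])
```

Because gradients and Hessians are additive across rows, the federated fit matches the fit you would get from pooling all rows in one place.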

Link to Example Files: GLM Coefficient Estimate Example.

Logistic and Linear Regression

This example uses NVIDIA FLARE v2.3 to fit logistic and linear regression models using Rhino Federated Computing Platform (FCP). It shows how to:

Use sklearn with NVIDIA FLARE (NVFlare) v2.3 to fit logistic and linear regression models on FCP.
Package the code in a Docker container that can be used with FCP.

Link to Example Files: Logistic and Linear Regression Example

Poisson Regression

This example uses NVIDIA FLARE v2.3 to fit a Poisson regression model using Rhino Federated Computing Platform (FCP). It shows how to:

Use sklearn with NVIDIA FLARE (NVFlare) v2.3 to fit a Poisson regression model on FCP.
Package the code in a Docker container that can be used with FCP.

Link to Example Files: Poisson Regression Example

Federated GLM with AIC-Based Feature Selection

This example demonstrates how to fit a federated Generalized Linear Model (GLM) and perform Akaike Information Criterion (AIC)-based feature selection using Rhino's Federated Computing Platform (FCP) with NVIDIA FLARE (NVFLARE). It shows how to:

Fit a federated GLM to estimate coefficients and standard errors using NVFlare, supporting different GLM families (e.g. Binomial, Gaussian, Poisson, etc.).
Apply AIC-based backward feature elimination to iteratively remove non-informative predictors and identify a more parsimonious model.
Use an optimization method for aggregating each of the client's model parameters using NVFlare (with the examples of Newton-Raphson and Iteratively Reweighted Least Squares (IRLS) optimizers).
Read different configurations for the federated server and client from a config file, including the GLM family type, the optimization method, and the formula for the regression.
Package the code in a Docker container that can be used with FCP.
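The backward elimination step can be summarized as: compute AIC = 2k - 2·log-likelihood for the current model, try dropping each predictor in turn, and keep the drop that lowers AIC the most until no drop helps. The sketch below shows only that loop and is not the code from the example files. fit_log_likelihood stands in for a full federated GLM fit, and the toy log-likelihood values are made up for illustration:

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def backward_eliminate(features, fit_log_likelihood):
    """Drop one feature at a time for as long as doing so lowers AIC."""
    current = list(features)
    best = aic(fit_log_likelihood(current), len(current) + 1)  # +1 for the intercept
    while current:
        candidates = [
            (aic(fit_log_likelihood([f for f in current if f != drop]), len(current)), drop)
            for drop in current
        ]
        candidate_aic, drop = min(candidates)
        if candidate_aic >= best:
            break                     # no single removal improves the model
        best = candidate_aic
        current.remove(drop)
    return current, best

# Toy stand-in for a model fit: each feature adds a fixed amount of log-likelihood
toy_gain = {"age": 12.0, "bmi": 5.0, "noise": 0.3}

def toy_fit(features):
    return -100.0 + sum(toy_gain[f] for f in features)

selected, final_aic = backward_eliminate(["age", "bmi", "noise"], toy_fit)
```

In this toy setup the "noise" predictor adds less log-likelihood than its one-parameter penalty, so it is removed and the other two predictors are kept.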

Link to Example Files: Federated GLM with AIC-Based Feature Selection

Quantile Regression

This example uses NVIDIA FLARE v2.3 to fit a quantile regression model using Rhino Federated Computing Platform (FCP). It shows how to:

Use sklearn with NVIDIA FLARE (NVFlare) v2.3 to fit a quantile regression model on FCP.
Package the code in a Docker container that can be used with FCP.

Link to Example Files: Quantile Regression Example