Tabular data, such as electronic health records (EHR) or financial transaction data, can be used for both horizontal and vertical federated learning applications. In this article, we will use XGBoost to demonstrate the Federated Training flow.
XGBoost is a software library for training gradient-boosted trees for both classification and regression machine learning problems. It is widely regarded as a performant, cost-effective model for numeric and categorical tabular data applications.
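As a point of reference, the snippet below is a minimal sketch of ordinary, centralized XGBoost training on tabular data. The synthetic dataset, parameters, and round count are illustrative assumptions, not taken from this article:

```python
import numpy as np
import xgboost as xgb

# Synthetic tabular data: 1,000 rows, 10 numeric features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.3}

# One call trains the full ensemble: each boosting round adds one tree.
bst = xgb.train(params, dtrain, num_boost_round=10)
```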
In the Horizontal FL setting, each collaborator holds distinct data samples (rows), and every row has the same set of features and a label. This contrasts with Vertical FL, in which only one collaborator possesses labels and the features may differ between collaborators’ datasets.
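To make the distinction concrete, here is a toy illustration of the two partitioning schemes on a single table; the column names and the two-site split are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 51, 29, 62],
    "blood_pressure": [120, 140, 115, 150],
    "label": [0, 1, 0, 1],
})

# Horizontal FL: both sites hold the same columns (features + label),
# but disjoint sets of rows.
site_a_horizontal = df.iloc[:2]
site_b_horizontal = df.iloc[2:]

# Vertical FL: the sites hold different feature columns for the same
# rows, and only one site holds the labels.
site_a_vertical = df[["age"]]
site_b_vertical = df[["blood_pressure", "label"]]
```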
To implement Horizontal FL with XGBoost, users may take advantage of tree-based model collaboration.
Tree-Based Collaboration
Tree-based federated XGBoost is implemented as follows (see Figure 1):
- Each collaborator performs 1 round of XGBoost training on their own data
- Each collaborator shares the learned tree with the FL Server
- The FL Server bags the locally-trained trees to produce a global model
- The global model is shared with each of the collaborators
- Each collaborator performs 1 round of boosting to obtain a new tree, and the process repeats (a minimal code sketch of this loop follows Figure 1)
Figure 1. Illustrative schematic of tree-based collaboration for Federated XGBoost.
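The following is a simplified, single-process sketch of that loop, assuming three collaborators with synthetic local data. In a real deployment each collaborator runs at its own site and models move through the FL server; here, `bag_trees` is a hypothetical placeholder for the server-side aggregation, which FL frameworks typically perform on the serialized trees of each local model:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.3}
num_rounds = 10  # assumed number of federation rounds

def make_site_data(n_rows=500, n_features=10):
    # Synthetic stand-in for one collaborator's local dataset.
    X = rng.normal(size=(n_rows, n_features))
    y = (X[:, 0] > 0).astype(int)
    return xgb.DMatrix(X, label=y)

collaborator_data = [make_site_data() for _ in range(3)]

def local_boost(dtrain, global_model=None):
    # One round of boosting on local data. Passing xgb_model= continues
    # training from the current global model, so only the newly added
    # tree reflects this site's data.
    return xgb.train(params, dtrain, num_boost_round=1,
                     xgb_model=global_model)

global_model = None
for _ in range(num_rounds):
    # Steps 1-2: each collaborator boosts once and shares its new tree.
    local_models = [local_boost(d, global_model) for d in collaborator_data]
    # Steps 3-4: the FL server bags the locally learned trees into a
    # global model and broadcasts it back to each collaborator
    # (bag_trees is a hypothetical placeholder for that aggregation).
    global_model = bag_trees(local_models)
```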
In this paradigm, only learned trees are ever shared with the FL Server/Rhino Cloud. The final set of aggregated trees (i.e., the final global model) can be stored on the FL server for future use (e.g., model evaluation on test sets and model inference).
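As a minimal sketch of that reuse step, assuming the global model was saved as `global_model.json` (an illustrative file name) and that a held-out test set is available:

```python
import xgboost as xgb

# Load the final global model, i.e., the full set of bagged trees.
bst = xgb.Booster()
bst.load_model("global_model.json")  # illustrative file name

# Run inference on a held-out test set (X_test / y_test assumed to exist).
dtest = xgb.DMatrix(X_test, label=y_test)
preds = bst.predict(dtest)  # predicted probabilities under binary:logistic
```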
For a comprehensive code example showing how to implement tree-based federated XGBoost on the Rhino FCP, please visit our user resources.
In Part 2 of the Federated Training workflow series, we will cover end-to-end examples with protein sequence data and protein model training and fine-tuning.