Ethics in Artificial Intelligence: Introduction to the Fairlearn package
The literature and code in this notebook were inspired by Selbst et al., “Fairness and Abstraction in Sociotechnical Systems”, Fairlearn’s Python package documentation, and Fairlearn’s 2021 SciPy tutorial:
SciPy 2021 Tutorial: Fairness in AI systems: From social context to practice using Fairlearn by Manojit Nandi, Miroslav Dudík, Triveni Gandhi, Lisa Ibañez, Adrin Jalali, Michael Madaio, Hanna Wallach, Hilde Weerts is licensed under CC BY 4.0.
About Fairlearn
Fairlearn is an open-source, community-driven project to help data scientists improve fairness of AI systems. It includes:
A Python library for fairness assessment and improvement (fairness metrics, mitigation algorithms, plotting, etc.)
Educational resources covering organizational and technical processes for unfairness mitigation (user guide, case studies, Jupyter notebooks, etc.)
The project was started in 2018 at Microsoft Research. In 2021 it adopted a neutral governance structure, and it has been completely community-driven since then.
Read more: https://fairlearn.org
Why ethics in AI matters
AI systems can behave unfairly for a variety of reasons:
- Societal biases are reflected in the training data.
- Societal biases are reflected in the decisions made during the development and deployment of these systems.
- Characteristics of the data or of the systems themselves cause unfair behavior.
Motivating example: healthcare
Our scenario builds on previous research that highlighted racial disparities in how health care resources are allocated in the U.S. (Obermeyer et al., 2019). Motivated by that work, in this tutorial we consider an automated system for recommending patients for high-risk care management programs, which Obermeyer et al. (2019) describe as follows:
These programs seek to improve the care of patients with complex health needs by providing additional resources such as greater attention from trained providers, to help ensure that care is well coordinated.
Because the programs are themselves expensive—with costs going toward teams of dedicated nurses, extra primary care appointment slots, and other scarce resources—health systems rely extensively on algorithms to identify patients who will benefit the most.
Convenience restriction
In practice, the modeling of health needs would use large data sets covering a wide range of diagnoses. In this tutorial, we will work with a publicly available clinical dataset that focuses on diabetic patients only (Strack et al., 2014).
Dataset and task
A clinical dataset of hospital readmissions over a ten-year period (1998-2008) for diabetic patients across 130 different hospitals in the US.
Each record represents a hospital admission of a patient diagnosed with diabetes whose stay lasted one to fourteen days.
The features include demographics, diagnoses, diabetic medications, the number of visits in the year preceding the encounter, payer information, whether the patient was readmitted after release, and whether the readmission occurred within 30 days of the release.
Goal:
Develop a classification model that decides whether patients should be suggested to their primary care physicians for enrollment in the high-risk care management program. A positive prediction means a recommendation for the care program.
Decision point: Task definition
A hospital readmission within 30 days can be viewed as a proxy indicating that the patient needed more assistance at the time of release, so it will be the label we wish to predict.
Because of the class imbalance, we will measure performance via balanced accuracy. Another key performance consideration is how many patients are recommended for care, a metric we refer to as the selection rate.
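To make the two metrics concrete, here is a minimal pure-Python sketch on made-up labels. The toy data and helper names are illustrative assumptions and are not part of the tutorial's code. Balanced accuracy is the mean of the per-class recalls, and the selection rate is the fraction of positive predictions.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive class and the recall on the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def selection_rate(y_pred):
    """Fraction of cases that receive a positive prediction."""
    return sum(y_pred) / len(y_pred)


# Toy, imbalanced data: 8 negatives and 2 positives (hypothetical values).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_always_negative = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_always_negative)) / len(y_true)
print(plain_accuracy)                                # 0.8
print(balanced_accuracy(y_true, y_always_negative))  # 0.5
print(selection_rate(y_always_negative))             # 0.0
```

A model that never recommends anyone looks good on plain accuracy here, while balanced accuracy and the selection rate expose that it is useless for the task.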
Ideally, health care professionals would be involved in both designing and using the model, including formalizing the task definition.
Fairness considerations
- Which groups are most likely to be disproportionately negatively affected? Previous work suggests that groups of different race and ethnicity can be affected differently.
- What are the harms? The key harms here are allocation harms, in particular false negatives: failing to recommend a patient who will be readmitted.
- How should we measure those harms?
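One possible way to approach the last question is to compute the false negative rate separately for each group and compare the results. The following is a pure-Python sketch on made-up data; the group labels "A" and "B" and the helper name are hypothetical.

```python
from collections import defaultdict


def false_negative_rate_by_group(y_true, y_pred, groups):
    """Return {group: FN / (FN + TP)}, computed over actual positives only."""
    positives = defaultdict(int)
    missed = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            positives[g] += 1
            if p == 0:
                missed[g] += 1
    return {g: missed[g] / positives[g] for g in positives}


# Hypothetical labels, predictions, and group membership.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

fnr = false_negative_rate_by_group(y_true, y_pred, groups)
print(fnr)  # {'A': 0.25, 'B': 0.75}
```

In this toy example, patients in group "B" who would be readmitted are missed three times as often as those in group "A", which is the kind of allocation harm described above.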
Exploratory data analysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.utils import Bunch
from sklearn.metrics import (
    balanced_accuracy_score,
    classification_report,
    roc_auc_score,
    accuracy_score,
    recall_score,
    confusion_matrix,
    roc_curve,
    ConfusionMatrixDisplay,
)
import process_health_data as phd
import pandas as pd
import seaborn as sns
import numpy as np
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
df = pd.read_csv("https://raw.githubusercontent.com/fairlearn/talks/main/2021_scipy_tutorial/data/diabetic_preprocessed.csv")
display(df.head())
Console output (1/1):
display(df.info())
Console output (1/2):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 race 101766 non-null object
1 gender 101766 non-null object
2 age 101766 non-null object
3 discharge_disposition_id 101766 non-null object
4 admission_source_id 101766 non-null object
5 time_in_hospital 101766 non-null int64
6 medical_specialty 101766 non-null object
7 num_lab_procedures 101766 non-null int64
8 num_procedures 101766 non-null int64
9 num_medications 101766 non-null int64
10 primary_diagnosis 101766 non-null object
11 number_diagnoses 101766 non-null int64
12 max_glu_serum 101766 non-null object
13 A1Cresult 101766 non-null object
14 insulin 101766 non-null object
15 change 101766 non-null object
16 diabetesMed 101766 non-null object
17 medicare 101766 non-null bool
18 medicaid 101766 non-null bool
19 had_emergency 101766 non-null bool
20 had_inpatient_days 101766 non-null bool
21 had_outpatient_days 101766 non-null bool
22 readmitted 101766 non-null object
23 readmit_binary 101766 non-null int64
24 readmit_30_days 101766 non-null int64
dtypes: bool(5), int64(7), object(13)
memory usage: 16.0+ MB
Console output (2/2):
None
df["race"].value_counts(normalize=True).plot(kind='bar', rot=45);
plt.title("Normalized race distribution");
Console output (1/1):
Obs:
African American, Unknown, Hispanic, Other and Asian are underrepresented.
phd.plot_pointplot(df, "race")
Console output (1/1):
Obs:
Readmission within 30 days is correlated with the boolean feature had_inpatient_days.
The error bars for emergency and non-emergency visits are large for the underrepresented groups. Possible implication: they sought care, but were not admitted.
sns.barplot(x="readmit_30_days", y="race", data=df, ci=95);
plt.title("Barplot by race")
Console output (1/2):
Text(0.5, 1.0, 'Barplot by race')
Console output (2/2):
Prepare data for training
df_c = df.copy()
# Set random seed
random_seed = 445
np.random.seed(random_seed)
# Set the target variable, the demographic columns, and the sensitive feature
target_variable = "readmit_30_days"
demographic = ["race", "gender"]
sensitive = ["race"]
Y, A = df_c.loc[:, target_variable], df_c.loc[:, sensitive]
# Next, drop the features that we don't want to use in the model
# and expand the categorical features into 0/1 indicators ("dummies").
X = pd.get_dummies(df_c.drop(columns=[
"race",
"discharge_disposition_id",
"readmitted",
"readmit_binary",
"readmit_30_days"
]))
# Split data into training and testing data
X_train, X_test, Y_train, Y_test, A_train, A_test, df_train, df_test = train_test_split(
X,
Y,
A,
df,
test_size=0.50,
stratify=Y,
random_state=random_seed)
Resampling data
X_train_bal, Y_train_bal, A_train_bal = phd.resample_dataset(X_train, Y_train, A_train)
phd.plot_descriptive_stats(A_train_bal, Y_train_bal, A_test, Y_test);
Console output (1/1):
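The implementation of `phd.resample_dataset` is not shown in this notebook. One common way to balance classes, assumed here purely for illustration, is random undersampling of the majority class. The function name and data below are hypothetical.

```python
import random


def undersample_majority(labels, seed=0):
    """Return sorted row indices with equal numbers of 0 and 1 labels.

    Keeps every minority-class row and a random sample of majority-class rows.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    return sorted(kept)


# Hypothetical imbalanced labels: 8 negatives and 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
idx = undersample_majority(labels)
balanced = [labels[i] for i in idx]
print(len(idx), sum(balanced))  # 4 2
```

In a setting like this one, the same row indices would be applied to the features, the labels, and the sensitive feature so that the three stay aligned.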
Training the model
We will build a pipeline with two main steps:
- StandardScaler
- Logistic regression
unmitigated_pipeline = Pipeline(steps=[
("preprocessing", StandardScaler()),
("logistic_regression", LogisticRegression(max_iter=1000))
])
# Fit data
unmitigated_pipeline.fit(X_train_bal, Y_train_bal)
Console output (1/1):
Pipeline(steps=[('preprocessing', StandardScaler()),
                ('logistic_regression', LogisticRegression(max_iter=1000))])
Y_pred_proba = unmitigated_pipeline.predict_proba(X_test)[:,1]
Y_pred = unmitigated_pipeline.predict(X_test)
Evaluating the model
print("Accuracy score", accuracy_score(Y_test, Y_pred))
print("Balanced accuracy score", balanced_accuracy_score(Y_test, Y_pred))
Console output (1/1):
Accuracy score 0.6200106125818053
Balanced accuracy score 0.5902921897575506
# F1 score, recall and precision report
print(classification_report(Y_test, Y_pred))
Console output (1/1):
              precision    recall  f1-score   support

           0       0.92      0.63      0.75     45204
           1       0.16      0.55      0.24      5679

    accuracy                           0.62     50883
   macro avg       0.54      0.59      0.50     50883
weighted avg       0.83      0.62      0.69     50883
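The report can be cross-checked against the two scores printed earlier. The sketch below uses the rounded per-class recalls and supports from the report, so it only reproduces the scores to two decimals; the variable names are illustrative.

```python
# Per-class recall and support, copied from the classification report.
recall = {0: 0.63, 1: 0.55}
support = {0: 45204, 1: 5679}

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_acc = (recall[0] + recall[1]) / 2

# Plain accuracy is the support-weighted mean of the per-class recalls.
total = support[0] + support[1]
accuracy = (recall[0] * support[0] + recall[1] * support[1]) / total

print(round(balanced_acc, 2))  # 0.59
print(round(accuracy, 2))      # 0.62
```

Plain accuracy is dominated by the much larger negative class, which is why balanced accuracy is the more informative score for this imbalanced task.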
# Generate plot of confusion matrix
cm = confusion_matrix(Y_test, Y_pred, labels=unmitigated_pipeline.classes_)
# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=unmitigated_pipeline.classes_)
disp.plot()
plt.title("Confusion matrix")
plt.show()
Console output (1/1):
print("ROC AUC SCORE", roc_auc_score(Y_test, Y_pred_proba))
fpr, tpr, thresholds = roc_curve(Y_test, Y_pred_proba)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
Console output (1/2):
ROC AUC SCORE 0.6209266753216064
Console output (2/2):
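ROC AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counting one half. The following sketch computes it by brute force on made-up scores. It is quadratic in the number of samples and meant only as an illustration, not as a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs that are ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 0.75
```

Read this way, the score of about 0.62 reported above means the model ranks a readmitted patient above a non-readmitted one roughly 62% of the time.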
coef_series = pd.Series(data=unmitigated_pipeline.named_steps["logistic_regression"].coef_[0], index=X.columns)
coef_series.sort_values().plot.barh(figsize=(4, 12), legend=False);
Console output (1/1):
Evaluating bias with MetricFrame
from fairlearn.metrics import (
MetricFrame,
true_positive_rate,
false_positive_rate,
false_negative_rate,
selection_rate,
count,
false_negative_rate_difference
)
from fairlearn.postprocessing import ThresholdOptimizer, plot_threshold_optimizer
from fairlearn.postprocessing._interpolated_thresholder import InterpolatedThresholder
from fairlearn.postprocessing._threshold_operation import ThresholdOperation
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds, TruePositiveRateParity
metrics_dict = {
"selection_rate": selection_rate,
"false_negative_rate": false_negative_rate,
"balanced_accuracy": balanced_accuracy_score,
}
metricframe_unmitigated = MetricFrame(metrics=metrics_dict,
y_true=Y_test,
y_pred=Y_pred,
sensitive_features=df_test['race'])
# The disaggregated metrics are then stored in a pandas DataFrame:
metricframe_unmitigated.by_group
Console output (1/1):
# You'll probably want to view them transposed:
pd.DataFrame({'difference': metricframe_unmitigated.difference(),
'ratio': metricframe_unmitigated.ratio(),
'group_min': metricframe_unmitigated.group_min(),
'group_max': metricframe_unmitigated.group_max()}).T
Console output (1/1):
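As described in the Fairlearn documentation, with default settings `MetricFrame.difference()` reports the largest gap between any two groups for each metric, and `MetricFrame.ratio()` reports the smallest between-group ratio. Here is a pure-Python sketch of that aggregation for a single metric; the group names and values are made up and are not results from this notebook.

```python
# Hypothetical false negative rates per group (not results from this notebook).
fnr_by_group = {"group_a": 0.40, "group_b": 0.50, "group_c": 0.25}

group_min = min(fnr_by_group.values())
group_max = max(fnr_by_group.values())

difference = group_max - group_min  # largest gap between any two groups
ratio = group_min / group_max       # smallest ratio between any two groups

print(difference)  # 0.25
print(ratio)       # 0.5
```

A difference close to 0 and a ratio close to 1 indicate that the metric is similar across groups.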
metricframe_unmitigated.by_group.plot.bar(subplots=True, layout= [1,3], figsize=(12, 4),
legend=False, rot=-45, position=1.5);
Console output (1/1):
Obs: the unmitigated model reinforces bias against the “Other” and “Unknown” groups, which it tends not to select.
Postprocessing with Threshold Optimizer
# Now we instantiate ThresholdOptimizer with the already fitted logistic regression pipeline
postprocess_est = ThresholdOptimizer(
estimator=unmitigated_pipeline,
constraints="false_negative_rate_parity",
objective="balanced_accuracy_score",
prefit=True,
predict_method='predict_proba'
)
# Fit the postprocessing estimator
postprocess_est.fit(X_train_bal, Y_train_bal, sensitive_features=A_train_bal)
Console output (1/1):
ThresholdOptimizer(constraints='false_negative_rate_parity',
                   estimator=Pipeline(steps=[('preprocessing',
                                              StandardScaler()),
                                             ('logistic_regression',
                                              LogisticRegression(max_iter=1000))]),
                   objective='balanced_accuracy_score',
                   predict_method='predict_proba', prefit=True)
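Conceptually, ThresholdOptimizer post-processes the scores of an already trained model by choosing a separate decision threshold for each sensitive group, so that the chosen constraint is approximately satisfied. The sketch below only applies group-specific thresholds to made-up scores. The thresholds are hand-picked assumptions rather than optimized, and the real algorithm may also randomize between thresholds, which this sketch does not do.

```python
def predict_with_group_thresholds(scores, groups, thresholds):
    """Predict 1 when a score exceeds the threshold assigned to its group."""
    return [int(s > thresholds[g]) for s, g in zip(scores, groups)]


def fnr(y_true, y_pred):
    """False negative rate: share of actual positives predicted as 0."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return preds_for_positives.count(0) / len(preds_for_positives)


# Hypothetical scores: positives in group "B" receive lower scores overall.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.2, 0.55, 0.3, 0.2, 0.1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Compare one shared threshold with a lower, hand-picked threshold for "B".
for thresholds in ({"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.25}):
    y_pred = predict_with_group_thresholds(scores, groups, thresholds)
    by_group = {}
    for g in ("A", "B"):
        rows = [i for i, grp in enumerate(groups) if grp == g]
        by_group[g] = fnr([y_true[i] for i in rows], [y_pred[i] for i in rows])
    print(thresholds, by_group)
# {'A': 0.5, 'B': 0.5} {'A': 0.0, 'B': 0.5}
# {'A': 0.5, 'B': 0.25} {'A': 0.0, 'B': 0.0}
```

With a single threshold the toy model misses half of the positives in group "B"; lowering that group's threshold equalizes the false negative rates without retraining the model.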
# Record and evaluate the output of the trained ThresholdOptimizer on test data
Y_pred_postprocess = postprocess_est.predict(X_test, sensitive_features=A_test)
metricframe_postprocess = MetricFrame(
metrics=metrics_dict,
y_true=Y_test,
y_pred=Y_pred_postprocess,
sensitive_features=A_test
)
pd.concat([metricframe_unmitigated.by_group,
metricframe_postprocess.by_group],
keys=['Unmitigated', 'ThresholdOptimizer'],
axis=1)
Console output (1/1):
def figure_to_base64str(*args):
    # Placeholder stub: returns None instead of encoding the figure.
    return None
metricframe_postprocess.by_group.plot.bar(subplots=True, layout=[1,3], figsize=(12, 4), legend=False, rot=-45, position=1.5)
postprocess_performance = figure_to_base64str(plt)
Console output (1/1):
Conclusions
In this notebook, we saw how to use Fairlearn to assess and mitigate algorithmic bias on a dataset in which bias was identified.