
Ethics in Artificial Intelligence: Introduction to the Fairlearn package

The literature and code in this notebook were inspired by Selbst et al., “Fairness and Abstraction in Sociotechnical Systems”, Fairlearn’s Python package documentation, and Fairlearn’s 2021 SciPy tutorial:

SciPy 2021 Tutorial: Fairness in AI systems: From social context to practice using Fairlearn by Manojit Nandi, Miroslav Dudík, Triveni Gandhi, Lisa Ibañez, Adrin Jalali, Michael Madaio, Hanna Wallach, Hilde Weerts is licensed under CC BY 4.0.

About Fairlearn

Fairlearn is an open-source, community-driven project to help data scientists improve fairness of AI systems. It includes:

A Python library for fairness assessment and improvement (fairness metrics, mitigation algorithms, plotting, etc.)

Educational resources covering organizational and technical processes for unfairness mitigation (user guide, case studies, Jupyter notebooks, etc.)

The project was started in 2018 at Microsoft Research. In 2021 it adopted a neutral governance structure, and it has been completely community-driven since then.

Why ethics in AI matters

AI systems can behave unfairly for a variety of reasons:

Societal biases are reflected in the training data.
Societal biases are also reflected in the decisions made during the development and deployment of these systems.
AI systems can also behave unfairly because of characteristics of the data or of the systems themselves.

Motivating example: healthcare

Our scenario builds on previous research that highlighted racial disparities in how health care resources are allocated in the U.S. (Obermeyer et al., 2019). Motivated by that work, this tutorial considers an automated system for recommending patients for high-risk care management programs, which Obermeyer et al. (2019) describe as follows:

These programs seek to improve the care of patients with complex health needs by providing additional resources such as greater attention from trained providers, to help ensure that care is well coordinated.

Because the programs are themselves expensive—with costs going toward teams of dedicated nurses, extra primary care appointment slots, and other scarce resources—health systems rely extensively on algorithms to identify patients who will benefit the most.

Convenience restriction

In practice, the modeling of health needs would use large data sets covering a wide range of diagnoses. In this tutorial, we will work with a publicly available clinical dataset that focuses on diabetic patients only (Strack et al., 2014).

Dataset and task

Clinical dataset of hospital readmissions over a ten-year period (1998-2008) for diabetic patients across 130 different hospitals in the US.

Each record represents a hospital admission of a patient diagnosed with diabetes whose stay lasted one to fourteen days.

The features include demographics, diagnoses, diabetic medications, the number of visits in the year preceding the encounter, and payer information, as well as whether the patient was readmitted after release and whether the readmission occurred within 30 days of release.

Goal:

Develop a classification model that decides whether patients should be suggested to their primary care physicians for enrollment in the high-risk care management program. A positive prediction means a recommendation for the care program.

Decision point: Task definition

A hospital readmission within 30 days can be viewed as a proxy for the patient having needed more assistance at the time of release, so it is the label we wish to predict.

Because of the class imbalance, we measure performance with balanced accuracy. Another key performance consideration is how many patients are recommended for care, a metric we refer to as the selection rate.
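To make these two metrics concrete, here is a minimal sketch on made-up labels. The toy lists and the helper names are illustrative assumptions, not part of the tutorial; the notebook itself uses scikit-learn's balanced_accuracy_score and Fairlearn's selection_rate, which compute the same quantities. Balanced accuracy is the mean of the recall on each class, and the selection rate is the fraction of patients who receive a positive prediction.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall for binary labels 0/1."""
    recalls = []
    for cls in (0, 1):
        # Indices of the samples that truly belong to this class
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def selection_rate(y_pred):
    """Fraction of samples predicted positive."""
    return sum(y_pred) / len(y_pred)


# Toy data: 8 negatives and 2 positives (imbalanced, like readmissions)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A classifier that always predicts "not readmitted"
y_always_negative = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_always_negative)) / len(y_true)
print(accuracy)                                       # 0.8
print(balanced_accuracy(y_true, y_always_negative))   # 0.5
print(selection_rate(y_always_negative))              # 0.0
```

Plain accuracy rewards the trivial classifier with 0.8, while balanced accuracy exposes it as no better than chance and the selection rate shows that nobody would be recommended for care.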

Ideally, health care professionals would be involved in both designing and using the model, including formalizing the task definition.

Fairness considerations

Which groups are most likely to be disproportionately negatively affected? Previous work suggests that groups defined by race and ethnicity can be affected differently.
What are the harms? The key harms here are allocation harms, in particular false negatives: failing to recommend a patient who will be readmitted.
How should we measure those harms?
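One possible answer to the last question is to compare the false negative rate across groups. The sketch below does this by hand on made-up data; the group names "A" and "B" and the function name are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def false_negative_rate_by_group(y_true, y_pred, groups):
    """FNR = FN / (FN + TP), computed separately for each group."""
    false_neg = defaultdict(int)
    positives = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            positives[g] += 1
            if p == 0:
                # A readmitted patient who was not recommended for care
                false_neg[g] += 1
    return {g: false_neg[g] / positives[g] for g in positives}


# Made-up example: every patient here was readmitted (label 1)
y_true = [1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

fnr = false_negative_rate_by_group(y_true, y_pred, groups)
print(fnr)  # {'A': 0.25, 'B': 0.75}
```

In this toy case the model misses three out of four readmitted patients in group B but only one out of four in group A, which is the kind of disparity the rest of the notebook looks for.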

Exploratory data analysis

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import Bunch

import process_health_data as phd  # local helper module, not shown here

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

df = pd.read_csv("https://raw.githubusercontent.com/fairlearn/talks/main/2021_scipy_tutorial/data/diabetic_preprocessed.csv")
display(df.head())


display(df.info())

Console output (1/2):

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 25 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   race                      101766 non-null  object
 1   gender                    101766 non-null  object
 2   age                       101766 non-null  object
 3   discharge_disposition_id  101766 non-null  object
 4   admission_source_id       101766 non-null  object
 5   time_in_hospital          101766 non-null  int64 
 6   medical_specialty         101766 non-null  object
 7   num_lab_procedures        101766 non-null  int64 
 8   num_procedures            101766 non-null  int64 
 9   num_medications           101766 non-null  int64 
 10  primary_diagnosis         101766 non-null  object
 11  number_diagnoses          101766 non-null  int64 
 12  max_glu_serum             101766 non-null  object
 13  A1Cresult                 101766 non-null  object
 14  insulin                   101766 non-null  object
 15  change                    101766 non-null  object
 16  diabetesMed               101766 non-null  object
 17  medicare                  101766 non-null  bool  
 18  medicaid                  101766 non-null  bool  
 19  had_emergency             101766 non-null  bool  
 20  had_inpatient_days        101766 non-null  bool  
 21  had_outpatient_days       101766 non-null  bool  
 22  readmitted                101766 non-null  object
 23  readmit_binary            101766 non-null  int64 
 24  readmit_30_days           101766 non-null  int64 
dtypes: bool(5), int64(7), object(13)
memory usage: 16.0+ MB

Console output (2/2):

None

df["race"].value_counts(normalize=True).plot(kind='bar', rot=45);
plt.title("Normalized race distribution");

Console output (1/1):

[Figure: bar chart titled "Normalized race distribution"]

Obs:

African American, Unknown, Hispanic, Other and Asian are underrepresented.

phd.plot_pointplot(df, "race")

Console output (1/1):

[Figure: point plots by race produced by phd.plot_pointplot]

Obs:

Readmission within 30 days is correlated with the boolean feature had_inpatient_days.

Underrepresented groups show large error bars for their emergency and non-emergency visits. Implication: they sought care but were not admitted.

sns.barplot(x="readmit_30_days", y="race", data=df, ci=95);
plt.title("Barplot by race")

Console output (1/2):

Text(0.5, 1.0, 'Barplot by race')

Console output (2/2):

[Figure: bar plot of readmit_30_days by race, titled "Barplot by race"]

Prepare data for training

df_c = df.copy()

# Set the random seed for reproducibility

random_seed = 445
np.random.seed(random_seed)

# Set the target variable, demographic features and sensitive feature
target_variable = "readmit_30_days"
demographic = ["race", "gender"]
sensitive = ["race"]

Y, A = df_c.loc[:, target_variable], df_c.loc[:, sensitive]

# Next, drop the features that we don't want to use in our model and
# expand the categorical features into 0/1 indicators ("dummies").
X = pd.get_dummies(df_c.drop(columns=[
    "race",
    "discharge_disposition_id",
    "readmitted",
    "readmit_binary",
    "readmit_30_days",
]))


# Split the data into training and test sets

X_train, X_test, Y_train, Y_test, A_train, A_test, df_train, df_test = train_test_split(
    X,
    Y,
    A,
    df_c,
    test_size=0.50,
    stratify=Y,
    random_state=random_seed,
)
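The two preparation steps above, dummy encoding and a stratified split, can be checked on a small made-up frame. The column values below are invented for illustration; only pd.get_dummies and train_test_split come from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy frame; the values are illustrative only
toy = pd.DataFrame({
    "gender": ["Female", "Male"] * 10,
    "time_in_hospital": list(range(1, 21)),
    "label": [1] * 4 + [0] * 16,  # 20% positives
})

# Categorical columns become one indicator column per category;
# numeric columns pass through unchanged
X_toy = pd.get_dummies(toy.drop(columns=["label"]))
print(sorted(X_toy.columns))
# ['gender_Female', 'gender_Male', 'time_in_hospital']

# stratify keeps the share of positives the same in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, toy["label"], test_size=0.5, stratify=toy["label"], random_state=0
)
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```

Without stratify, a rare positive class could end up unevenly distributed between the two halves, which would distort the evaluation on the test set.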

Resampling data

X_train_bal, Y_train_bal, A_train_bal = phd.resample_dataset(X_train, Y_train, A_train)

phd.plot_descriptive_stats(A_train_bal, Y_train_bal, A_test, Y_test);

Console output (1/1):

[Figure: descriptive statistics plots for A_train_bal, Y_train_bal, A_test and Y_test]
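The source of phd.resample_dataset is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below balances the classes by random undersampling of the majority class. This is an assumption about the technique, and the function undersample_majority is a hypothetical stand-in, not the helper's actual implementation.

```python
import numpy as np
import pandas as pd


def undersample_majority(X, y, A, seed=0):
    """Keep all positives and an equally sized random sample of negatives.

    Hypothetical stand-in; the real phd.resample_dataset may differ.
    Assumes the positive class (1) is the minority class.
    """
    rng = np.random.default_rng(seed)
    pos_idx = y.index[y == 1]
    neg_idx = y.index[y == 0]
    # Draw as many negatives as there are positives, without replacement
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = pos_idx.union(pd.Index(keep_neg))
    # Apply the same row selection to features, labels and sensitive features
    return X.loc[keep], y.loc[keep], A.loc[keep]


# Made-up data: 3 positives among 12 rows
y_toy = pd.Series([1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0])
X_toy = pd.DataFrame({"feature": range(12)})
A_toy = pd.DataFrame({"race": ["A", "B"] * 6})

X_bal, y_bal, A_bal = undersample_majority(X_toy, y_toy, A_toy)
print(int((y_bal == 1).sum()), int((y_bal == 0).sum()))  # 3 3
```

Only the training data is resampled in the notebook; the test set keeps its original class balance so that the evaluation reflects the real distribution.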

Training the model

We will build a pipeline with two main steps:

StandardScaler
Logistic regression

unmitigated_pipeline = Pipeline(steps=[
    ("preprocessing", StandardScaler()),
    ("logistic_regression", LogisticRegression(max_iter=1000))
])

# Fit data
unmitigated_pipeline.fit(X_train_bal, Y_train_bal)

Console output (1/1):

Pipeline(steps=[('preprocessing', StandardScaler()),
                ('logistic_regression', LogisticRegression(max_iter=1000))])

Y_pred_proba = unmitigated_pipeline.predict_proba(X_test)[:,1]
Y_pred = unmitigated_pipeline.predict(X_test)

Evaluating the model

print("Accuracy score", accuracy_score(Y_test, Y_pred))
print("Balanced accuracy score", balanced_accuracy_score(Y_test, Y_pred))

Console output (1/1):

Accuracy score 0.6200106125818053
Balanced accuracy score 0.5902921897575506

# F1 score, recall and precision report
print(classification_report(Y_test, Y_pred))

Console output (1/1):

              precision    recall  f1-score   support

           0       0.92      0.63      0.75     45204
           1       0.16      0.55      0.24      5679

    accuracy                           0.62     50883
   macro avg       0.54      0.59      0.50     50883
weighted avg       0.83      0.62      0.69     50883

# Compute the confusion matrix
cm = confusion_matrix(Y_test, Y_pred, labels=unmitigated_pipeline.classes_)

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=unmitigated_pipeline.classes_)
disp.plot()
plt.title("Confusion matrix")
plt.show()

Console output (1/1):

[Figure: confusion matrix plot titled "Confusion matrix"]

print("ROC AUC SCORE", roc_auc_score(Y_test, Y_pred_proba))


fpr, tpr, thresholds = roc_curve(Y_test, Y_pred_proba)

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

Console output (1/2):

ROC AUC SCORE 0.6209266753216064

Console output (2/2):

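[Figure: "ROC Curve", True Positive Rate against False Positive Rate, with the diagonal as a dashed reference line]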
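One way to read the ROC AUC score is as the probability that a randomly chosen readmitted patient receives a higher score than a randomly chosen patient who was not readmitted. A value of about 0.62 therefore means the model ranks such pairs correctly roughly 62% of the time. The sketch below checks this interpretation on four made-up scores; the function name is illustrative, and the notebook itself relies on scikit-learn's roc_auc_score.

```python
def roc_auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_by_pairs(y_true, scores))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly here, which gives 0.75. The pairwise loop is quadratic and only meant for illustration on small inputs.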

coef_series = pd.Series(data=unmitigated_pipeline.named_steps["logistic_regression"].coef_[0], index=X.columns)
coef_series.sort_values().plot.barh(figsize=(4, 12), legend=False);

Console output (1/1):

[Figure: horizontal bar chart of the logistic regression coefficients, sorted by value]

Evaluating bias with MetricFrame

from fairlearn.metrics import (
    MetricFrame,
    true_positive_rate,
    false_positive_rate,
    false_negative_rate,
    selection_rate,
    count,
    false_negative_rate_difference
)

from fairlearn.postprocessing import ThresholdOptimizer, plot_threshold_optimizer
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds, TruePositiveRateParity

metrics_dict = {
    "selection_rate": selection_rate,
    "false_negative_rate": false_negative_rate,
    "balanced_accuracy": balanced_accuracy_score,
}

metricframe_unmitigated = MetricFrame(metrics=metrics_dict,
                  y_true=Y_test,
                  y_pred=Y_pred,
                  sensitive_features=df_test['race'])

# The disaggregated metrics are then stored in a pandas DataFrame:

metricframe_unmitigated.by_group


# You'll probably want to view them transposed:
pd.DataFrame({'difference': metricframe_unmitigated.difference(),
              'ratio': metricframe_unmitigated.ratio(),
              'group_min': metricframe_unmitigated.group_min(),
              'group_max': metricframe_unmitigated.group_max()}).T
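To clarify what these four summaries mean, the sketch below reproduces them by hand for a single metric with made-up per-group values. It assumes the default behaviour of MetricFrame.difference() and MetricFrame.ratio(), which compare groups with each other: the difference is the largest value minus the smallest, and the ratio is the smallest divided by the largest.

```python
import pandas as pd

# Made-up per-group values for one metric
by_group = pd.Series({"A": 0.40, "B": 0.50, "C": 0.25}, name="selection_rate")

summary = pd.Series({
    "difference": by_group.max() - by_group.min(),  # largest gap between groups
    "ratio": by_group.min() / by_group.max(),       # 1.0 would mean parity
    "group_min": by_group.min(),
    "group_max": by_group.max(),
})
print(summary)
```

For these toy values the difference is 0.25 and the ratio is 0.5. A difference close to 0 and a ratio close to 1 indicate that the metric is similar across groups.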


metricframe_unmitigated.by_group.plot.bar(subplots=True, layout= [1,3], figsize=(12, 4),
                      legend=False, rot=-45, position=1.5);

Console output (1/1):

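[Figure: bar charts of selection_rate, false_negative_rate and balanced_accuracy by race for the unmitigated model]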

Obs: by choosing not to select them, the algorithm reinforced bias against the “Other” and “Unknown” groups.

Postprocessing with Threshold Optimizer

# Now we instantiate ThresholdOptimizer with the already fitted logistic regression pipeline
postprocess_est = ThresholdOptimizer(
    estimator=unmitigated_pipeline,
    constraints="false_negative_rate_parity",
    objective="balanced_accuracy_score",
    prefit=True,
    predict_method='predict_proba'
)

# Fit the postprocessing estimator
postprocess_est.fit(X_train_bal, Y_train_bal, sensitive_features=A_train_bal)

Console output (1/1):

ThresholdOptimizer(constraints='false_negative_rate_parity',
                   estimator=Pipeline(steps=[('preprocessing',
                                              StandardScaler()),
                                             ('logistic_regression',
                                              LogisticRegression(max_iter=1000))]),
                   objective='balanced_accuracy_score',
                   predict_method='predict_proba', prefit=True)
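The idea behind this kind of postprocessing is that the underlying model stays unchanged and only the decision threshold applied to its scores varies by group. The sketch below illustrates that idea with a simplified, deterministic rule on made-up scores. It is not Fairlearn's algorithm: ThresholdOptimizer also optimizes the chosen objective and may randomize between thresholds. The helper threshold_for_target_fnr and the target value of 0.25 are assumptions made for the example.

```python
import numpy as np


def threshold_for_target_fnr(scores, y_true, target_fnr):
    """Largest threshold whose false negative rate stays within target_fnr.

    Predictions are positive when score >= threshold; requires target_fnr < 1.
    """
    pos_scores = np.sort(scores[y_true == 1])
    # Number of positives we may leave below the threshold
    allowed_misses = int(np.floor(target_fnr * len(pos_scores)))
    return pos_scores[allowed_misses]


# Made-up scores: the model scores group B systematically lower
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

results = {}
for g in ("A", "B"):
    mask = group == g
    # FNR when every group shares the threshold 0.55
    shared_fnr = float(np.mean(scores[mask] < 0.55))
    # FNR when the group gets its own threshold
    t = float(threshold_for_target_fnr(scores[mask], y_true[mask], target_fnr=0.25))
    group_fnr = float(np.mean(scores[mask] < t))
    results[g] = (shared_fnr, t, group_fnr)
    print(g, shared_fnr, t, group_fnr)
# A 0.0 0.7 0.25
# B 1.0 0.3 0.25
```

With one shared threshold, every readmitted patient in group B is missed while none in group A is. With group-specific thresholds, both groups end up with the same false negative rate, which is what the false_negative_rate_parity constraint asks for.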

# Record and evaluate the output of the trained ThresholdOptimizer on test data

Y_pred_postprocess = postprocess_est.predict(X_test, sensitive_features=A_test)
metricframe_postprocess = MetricFrame(
    metrics=metrics_dict,
    y_true=Y_test,
    y_pred=Y_pred_postprocess,
    sensitive_features=A_test
)
pd.concat([metricframe_unmitigated.by_group,
           metricframe_postprocess.by_group],
           keys=['Unmitigated', 'ThresholdOptimizer'],
           axis=1)


# Plot the disaggregated metrics of the postprocessed model
metricframe_postprocess.by_group.plot.bar(subplots=True, layout=[1, 3], figsize=(12, 4),
                                          legend=False, rot=-45, position=1.5);

Console output (1/1):

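[Figure: bar charts of selection_rate, false_negative_rate and balanced_accuracy by race after ThresholdOptimizer]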

Conclusions

In this notebook, we saw how to use Fairlearn to assess and mitigate the bias of an algorithm applied to a dataset where bias was identified.