Predicting Medical Insurance Charges with Boosting, SHAP, and Ceteris Paribus Profiles¶

Thisproject builds an interpretable machine-learning workflow for forecasting individual medical insurance charges. I train a gradient boosting regressor, evaluate its predictive performance, and use two post-hoc explanation methods—SHAP and ceteris paribus (CP) profiles—to explain how the fitted model uses the available predictors.

Project goal: predict medical insurance charges from demographic and health-related variables, then translate the model behavior into interpretable evidence that a nontechnical stakeholder could review.

Methods used: data quality checks, one-hot encoding, train/test split, cross-validated hyperparameter tuning, gradient boosting regression, SHAP feature attribution, CP profiles, and robustness checks.

# Core setup
import os
from pathlib import Path
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import shap

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.ensemble import GradientBoostingRegressor

try:
    from sklearn.metrics import root_mean_squared_error
except ImportError:  # Compatibility with older scikit-learn versions
    from sklearn.metrics import mean_squared_error

    def root_mean_squared_error(y_true, y_pred):
        return mean_squared_error(y_true, y_pred, squared=False)

try:
    import joblib
except ImportError:
    joblib = None

try:
    import kagglehub
except ImportError:
    kagglehub = None

warnings.filterwarnings("ignore", category=FutureWarning)
pd.set_option("display.max_columns", 100)
plt.rcParams["figure.figsize"] = (8, 5)

RANDOM_STATE = 479
OUTPUT_DIR = Path("portfolio_outputs")
OUTPUT_DIR.mkdir(exist_ok=True)


def save_plot(filename):
    """Save the active Matplotlib figure for portfolio/web use."""
    filepath = OUTPUT_DIR / filename
    plt.savefig(filepath, dpi=200, bbox_inches="tight")
    return filepath


def load_insurance_data():
    """Load the insurance dataset from a local file, Kaggle, or a public CSV mirror."""
    local_candidates = [
        Path("insurance.csv"),
        Path("data") / "insurance.csv",
        Path("../data") / "insurance.csv",
    ]

    for candidate in local_candidates:
        if candidate.exists():
            print(f"Loaded local dataset: {candidate}")
            return pd.read_csv(candidate)

    if kagglehub is not None:
        try:
            path = kagglehub.dataset_download("mirichoi0218/insurance")
            csv_file = os.path.join(path, "insurance.csv")
            print("Loaded dataset with kagglehub.")
            return pd.read_csv(csv_file)
        except Exception as err:
            print(f"KaggleHub download failed: {err}")

    public_url = "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv"
    print("Loading dataset from public CSV mirror.")
    return pd.read_csv(public_url)


# Load data
df = load_insurance_data()
df.head()

Using Colab cache for faster access to the 'insurance' dataset.
Loaded dataset with kagglehub.

Project Framing¶

This project uses the medical insurance charges dataset. The real-world problem is cost forecasting: given a person’s age, sex, body mass index (BMI), number of children, smoking status, and region, can we predict their expected medical insurance charges? In machine-learning terms, this is a supervised regression problem where charges is the response variable and the remaining columns are predictors.

These explanations are useful for several groups. Pricing analysts and actuarial teams can use them to understand which variables are most strongly associated with large predicted charges. Managers can use them to check whether the model behaves in a substantively reasonable way before relying on it for planning or pricing support. The explanations do not establish causality, but they do show how the fitted model uses the available predictors.

I accessed the data by downloading the CSV file and reading it into Python with pandas. I checked the dataset for missing values and duplicate rows. I used one-hot encoding for the categorical variables (sex, smoker, and region) so they could be used in the boosting model. Because I used one-hot encoding with drop_first=True, the omitted reference categories are sex = female, smoker = no, and region = northeast, so the displayed indicator features are interpreted relative to those baselines. I did not create additional features because the original variables are already easy to interpret, which helps keep the explanation methods clear.

Data Quality and Feature Preparation¶

duplicate_rows_before = df.duplicated().sum()
df = df.drop_duplicates()

quality_summary = pd.DataFrame({
    "missing_values": df.isna().sum()
})

display(quality_summary)
print(f"Duplicates removed: {duplicate_rows_before}")
print(f"Final dataset shape: {df.shape}")

Duplicates removed: 1
Final dataset shape: (1337, 7)

X = df.drop(columns="charges")
y = df["charges"]

# One-hot encode categorical predictors
X_encoded = pd.get_dummies(X, drop_first=True, dtype=int)

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded,
    y,
    test_size=0.20,
    random_state=RANDOM_STATE
)

print(f"Training rows: {X_train.shape[0]}")
print(f"Test rows: {X_test.shape[0]}")
print(f"Encoded predictors: {X_encoded.shape[1]}")
display(X_encoded.head())

Training rows: 1069
Test rows: 268
Encoded predictors: 8

Model Training¶

gbr = GradientBoostingRegressor(random_state=RANDOM_STATE)

param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "min_samples_leaf": [1, 3, 5],
    "subsample": [0.8, 1.0]
}

cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

grid = GridSearchCV(
    estimator=gbr,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=cv,
    n_jobs=-1
)

grid.fit(X_train, y_train)
best_model = grid.best_estimator_

y_pred = best_model.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)

model_summary = pd.DataFrame({
    "metric": [
        "best_cv_rmse",
        "test_rmse",
        "learning_rate",
        "max_depth",
        "min_samples_leaf",
        "n_estimators",
        "subsample"
    ],
    "value": [
        round(-grid.best_score_, 2),
        round(rmse, 2),
        grid.best_params_["learning_rate"],
        grid.best_params_["max_depth"],
        grid.best_params_["min_samples_leaf"],
        grid.best_params_["n_estimators"],
        grid.best_params_["subsample"]
    ]
})

display(model_summary)
model_summary.to_csv(OUTPUT_DIR / "model_summary.csv", index=False)

if joblib is not None:
    joblib.dump(best_model, OUTPUT_DIR / "gradient_boosting_insurance_model.joblib")

I trained a gradient boosting regressor to predict insurance charges. I chose boosting because it is a flexible ensemble method that can capture nonlinear relationships and interactions, but it is not intrinsically interpretable. The main hyperparameters I tuned were the number of trees (n_estimators), the learning rate (learning_rate), the depth of each tree (max_depth), the minimum number of observations allowed in a terminal node (min_samples_leaf), and the subsampling proportion (subsample). I selected these hyperparameters with 5-fold cross-validation and grid search, using cross-validated RMSE as the tuning criterion.

The table above reports the selected hyperparameters and the held-out test RMSE. Saving the table to portfolio_outputs/model_summary.csv makes the result easy to reuse in a website write-up without manually copying values.

Explanation 1: Model-Based SHAP¶

I chose model-based SHAP because the fitted model is a tree ensemble, so TreeSHAP can compute exact or highly efficient feature attributions.

# Use all test rows for SHAP
X_shap = X_test.copy()

# Build the model-based SHAP explainer
explainer = shap.TreeExplainer(best_model)

# Compute SHAP values for all test rows
shap_values = explainer.shap_values(X_shap)

# Global SHAP importance
# This averages the absolute SHAP values across all test rows and ranks the features by overall importance.
shap_importance = pd.DataFrame({
    "feature": X_shap.columns,
    "mean_abs_shap": np.abs(shap_values).mean(axis=0)
}).sort_values("mean_abs_shap", ascending=False)

shap_importance.to_csv(OUTPUT_DIR / "shap_feature_importance.csv", index=False)
display(shap_importance)

shap.summary_plot(shap_values, X_shap, max_display=10, show=False)
plt.title("SHAP Summary Plot")
plt.tight_layout()
save_plot("figure_1_shap_summary.png")
plt.show()

Figure 1. SHAP summary plot for the test observations. Each point is one observation. The x-axis shows the SHAP contribution of that feature to the prediction, measured in dollars of predicted charges relative to the model baseline. Color shows the feature value.

I chose the test case with the highest predicted charge so the local SHAP plot would illustrate which features drive an extreme prediction.

case_idx = int(np.argmax(y_pred))
base_value = float(np.ravel(explainer.expected_value)[0])

case_explanation = shap.Explanation(
    values=shap_values[case_idx],
    base_values=base_value,
    data=X_shap.iloc[case_idx],
    feature_names=X_shap.columns.tolist()
)

shap.plots.waterfall(case_explanation, max_display=10, show=False)
plt.title("Local SHAP Waterfall Plot for a High-Predicted-Charge Case")
plt.tight_layout()
save_plot("figure_2_shap_waterfall.png")
plt.show()

Figure 2. Local SHAP waterfall plot for a high-predicted-charge individual. The plot starts at the model’s baseline prediction and then shows how each feature moves this individual prediction up or down. Red bars increase the predicted charge, while blue bars decrease it. For this case, smoking status is the largest positive contributor, with age and BMI providing the next largest upward contributions.

SHAP supports both global and local explanations. At the global level, SHAP summarizes how important each feature is across many observations and how the direction of a feature value relates to the direction of its contribution. At the local level, SHAP decomposes one individual prediction into additive feature-level contributions relative to the model’s baseline prediction. Because this is a regression problem, the SHAP values are measured in dollars of predicted insurance charges.

Figures 1 and 2 provide these two complementary views. Figure 1 is the global SHAP summary plot. It shows that smoker_yes has the largest average absolute SHAP magnitude and the widest spread of SHAP values on the test observations, indicating that smoking status is the feature the fitted model relies on most strongly in this sample. It also shows that higher values of age and bmi tend to be associated with more positive SHAP contributions, meaning that they often push predicted charges upward. By contrast, the region indicators have much smaller SHAP magnitudes and much less spread, so they appear to play a relatively minor role in this fitted model. Figure 2 is a local waterfall plot for one high-predicted-charge individual. Starting from the model baseline, the largest upward contributions come from smoker_yes, age, and bmi, which together explain most of why this person’s predicted charge is much higher than average.

One possible misreading is to interpret a positive SHAP value as a causal effect. That would be incorrect. A positive SHAP value only means that, in the fitted model, that feature pushed this prediction upward relative to the model baseline. It does not imply that changing the feature in the real world would cause charges to increase by the same amount.

Explanation 2: Ceteris Paribus Profiles¶

I chose ceteris paribus (CP) profiles because they complement SHAP by answering a different interpretability question. I used CP profiles for age and bmi because both are substantively important continuous predictors, and I wanted to examine the shape of the model’s fitted relationship with each variable rather than only its overall importance. To show how the fitted response differs across the prediction spectrum, I plotted CP profiles for nine test observations chosen by predicted charge: three low-prediction, three middle-prediction, and three high-prediction cases.

def make_cp_profiles(estimator, X, feature, obs_idx, grid_resolution=25):
    """
    Build pure ceteris paribus profiles for selected observations.

    For each selected observation:
    - vary only `feature` across a grid
    - hold all other predictors fixed
    - compute predicted values

    Returns a long DataFrame with one row per grid point per observation.
    """
    x_min = X[feature].min()
    x_max = X[feature].max()
    grid = np.linspace(x_min, x_max, grid_resolution)

    rows = []
    for idx in obs_idx:
        x0 = X.iloc[idx].copy()

        for g in grid:
            x_new = x0.copy()
            x_new[feature] = g
            y_hat = estimator.predict(pd.DataFrame([x_new]))[0]

            rows.append({
                "obs_id": idx,
                feature: g,
                "prediction": y_hat
            })

    return pd.DataFrame(rows)


def plot_cp_profiles(estimator, X, feature, title, obs_idx, grid_resolution=25, filename=None):
    """
    Plot pure ceteris paribus profiles:
    one line per selected observation, no PDP overlay.
    """
    cp_df = make_cp_profiles(
        estimator=estimator,
        X=X,
        feature=feature,
        obs_idx=obs_idx,
        grid_resolution=grid_resolution
    )

    plt.figure(figsize=(8, 5))

    for obs in obs_idx:
        one_profile = cp_df[cp_df["obs_id"] == obs]
        plt.plot(
            one_profile[feature],
            one_profile["prediction"],
            alpha=0.8,
            linewidth=1.5,
            label=f"obs {obs}"
        )

    plt.xlabel(feature)
    plt.ylabel("Predicted insurance charges")
    plt.title(title)
    plt.legend(title="Observation", fontsize=8)
    plt.tight_layout()

    if filename is not None:
        save_plot(filename)

    plt.show()

    return cp_df


# Select representative test observations:
# 3 low-prediction, 3 middle-prediction, 3 high-prediction cases
pred_order = np.argsort(y_pred)

low_idx = pred_order[:3]
mid_start = len(pred_order) // 2 - 1
mid_idx = pred_order[mid_start:mid_start + 3]
high_idx = pred_order[-3:]

cp_obs_idx = np.concatenate([low_idx, mid_idx, high_idx]).tolist()
cp_obs_idx

[27, 90, 74, 37, 185, 71, 95, 101, 159]

cp_age_df = plot_cp_profiles(
    estimator=best_model,
    X=X_test.reset_index(drop=True),
    feature="age",
    title="Ceteris Paribus Profiles for Age",
    obs_idx=cp_obs_idx,
    grid_resolution=25,
    filename="figure_3_cp_profiles_age.png"
)

cp_age_df.to_csv(OUTPUT_DIR / "cp_profiles_age.csv", index=False)

Figure 3. CP profiles for age. Each line shows how the model’s predicted insurance charge changes for one selected observation as age varies while all other predictors for that observation are held fixed. Most profiles rise with age, indicating that the fitted model generally associates older age with higher predicted charges, although the slope differs somewhat across individuals.

cp_bmi_df = plot_cp_profiles(
    estimator=best_model,
    X=X_test.reset_index(drop=True),
    feature="bmi",
    title="Ceteris Paribus Profiles for BMI",
    obs_idx=cp_obs_idx,
    grid_resolution=25,
    filename="figure_4_cp_profiles_bmi.png"
)

cp_bmi_df.to_csv(OUTPUT_DIR / "cp_profiles_bmi.csv", index=False)

Figure 4. CP profiles for BMI. Each line shows how the model’s predicted insurance charge changes for one selected observation as BMI varies while all other predictors for that observation are held fixed. The profiles generally increase with BMI and show a sharper upward change around BMI 30–31 for several observations, suggesting a nonlinear pattern in the fitted model.

These plots are ceteris paribus (CP) profiles, so they are local explanations. Each line corresponds to one selected observation and shows how the model’s predicted insurance charge changes when the chosen feature is varied while all other predictors for that same observation are held fixed. Unlike SHAP, CP profiles do not assign an overall importance score to each feature. Instead, they show the functional response of the fitted model for specific observations and make it possible to see whether that response is linear, nonlinear, or heterogeneous across individuals.

In this regression problem, the vertical axis is the model’s predicted insurance charges in dollars. A vertical change of, for example, $2,000 means that when the selected feature is changed and all other predictors are held fixed, the fitted model’s predicted charge changes by about $2,000 for that observation.

Figure 3 shows that predicted charges generally increase as age increases. The age profiles are broadly monotone, which suggests that the fitted model has learned a mostly steady positive relationship between age and predicted charges.

Figure 4 shows that predicted charges also tend to rise as bmi increases, but the BMI profiles are more nonlinear than the age profiles. For some observations, the increase becomes much steeper around BMI 30, which suggests that the fitted model captures threshold-like behavior rather than a simple straight-line effect.

A possible misreading would be to interpret these curves as realistic causal interventions. That would be too strong. These CP profiles show how the fitted model responds when one feature is varied and the others are held fixed. Some of those combinations may be rare or unrealistic in the actual data, so the curves should be interpreted as descriptions of model behavior, not as guaranteed real-world outcomes or causal effects.

Method Description¶

To help explain the boosting model, I use SHAP. The main idea of SHAP is to take one prediction and break it into understandable pieces. The model begins from a baseline prediction, which can be thought of as the typical predicted insurance charge in the reference data. SHAP then shows how each feature for a specific person—such as smoking status, age, or BMI—moves the prediction up or down relative to that baseline. For a single prediction, the SHAP values across features add up to the difference between that prediction and the model’s baseline prediction.

The output of SHAP is a set of feature contributions. For an individual prediction, each feature receives a SHAP value that represents how much that feature contributed to the final predicted charge. A positive SHAP value means the feature pushed the prediction higher, while a negative SHAP value means it pushed the prediction lower. Because this is a regression problem, the SHAP values are measured in the same units as the prediction target, so they are expressed in dollars of predicted insurance charges.

SHAP should be interpreted as describing how the fitted model made its prediction, not as proving causation. For example, if smoking has a large positive SHAP value for a person, that means the model used smoking status to push the prediction upward; it does not by itself prove a causal effect of that exact size in the real world.

Validation and Robustness Checks¶

Correlation Structure and Interpretation¶

numeric_corr = df[["age", "bmi", "children"]].corr().round(3)
print("Correlation matrix for numeric predictors:")
display(numeric_corr)

print("Rows with BMI between 29 and 32:", df["bmi"].between(29, 32).sum())

all_predictors_encoded = pd.get_dummies(df.drop(columns="charges"), drop_first=False, dtype=int)
corr_all = all_predictors_encoded.corr()
upper_triangle = corr_all.where(np.triu(np.ones(corr_all.shape), k=1).astype(bool))

high_corr_pairs = (
    upper_triangle.stack()
    .reset_index()
    .rename(columns={"level_0": "feature_1", "level_1": "feature_2", 0: "correlation"})
)

mechanical_pairs = {
    tuple(sorted(["sex_female", "sex_male"])),
    tuple(sorted(["smoker_no", "smoker_yes"])),
    tuple(sorted(["region_northeast", "region_northwest"])),
    tuple(sorted(["region_northeast", "region_southeast"])),
    tuple(sorted(["region_northeast", "region_southwest"])),
    tuple(sorted(["region_northwest", "region_southeast"])),
    tuple(sorted(["region_northwest", "region_southwest"])),
    tuple(sorted(["region_southeast", "region_southwest"]))
}

high_corr_pairs["pair"] = high_corr_pairs.apply(
    lambda row: tuple(sorted([row["feature_1"], row["feature_2"]])),
    axis=1
)

high_corr_pairs = high_corr_pairs[~high_corr_pairs["pair"].isin(mechanical_pairs)].copy()
high_corr_pairs["abs_correlation"] = high_corr_pairs["correlation"].abs()
high_corr_pairs = high_corr_pairs.sort_values("abs_correlation", ascending=False)

print("Strongest non-mechanical encoded-feature correlations:")
display(high_corr_pairs[["feature_1", "feature_2", "correlation"]].head(8))

numeric_corr.to_csv(OUTPUT_DIR / "numeric_predictor_correlations.csv")
high_corr_pairs[["feature_1", "feature_2", "correlation"]].head(8).to_csv(
    OUTPUT_DIR / "top_encoded_feature_correlations.csv", index=False
)

Correlation matrix for numeric predictors:

Rows with BMI between 29 and 32: 240
Strongest non-mechanical encoded-feature correlations:

I checked the correlation structure of the predictors before interpreting the explanations. For the correlation diagnostic, I examined a broader fully one-hot-encoded version of the predictors (drop_first=False) and then removed mechanically redundant dummy pairs, whereas the fitted model itself used drop_first=True. The numeric predictors show only weak pairwise correlations, so strong linear dependence among age, BMI, and children does not appear to be a major concern. Among the encoded predictors, the strongest remaining non-mechanical association is between bmi and region_southeast, while the other non-mechanical correlations are small. This matters for both explanation methods: SHAP may split attribution across related predictors, and CP profiles can risk extrapolation when one feature is varied while others are held fixed. Therefore, I interpret both SHAP and CP outputs as descriptions of model behavior rather than causal effects.

SHAP can assign small nonzero importance to weak or largely irrelevant predictors because of finite-sample noise, correlation, and model structure. In this analysis, some region indicators have small but nonzero SHAP values, which should not be overread as strong substantive effects. CP profiles can also look structured for weak predictors if the fitted model has learned noise.

CP Grid-Resolution Sensitivity¶

for resolution in [10, 50]:
    plot_cp_profiles(
        estimator=best_model,
        X=X_test.reset_index(drop=True),
        feature="bmi",
        title=f"Ceteris Paribus Profiles for BMI (grid_resolution = {resolution})",
        obs_idx=cp_obs_idx,
        grid_resolution=resolution,
        filename=f"cp_profiles_bmi_grid_resolution_{resolution}.png"
    )

Figure 5. BMI CP profiles with grid_resolution = 10. The feature is evaluated on a relatively coarse grid, so the individual profiles are shown at fewer evaluation points and local changes appear less detailed.

Figure 6. BMI CP profiles with grid_resolution = 50. The feature is evaluated on a finer grid, so local changes in the fitted model are shown more clearly.

One important CP hyperparameter is grid_resolution, which controls how many evaluation points are used across the feature range. Figures 5 and 6 compare the BMI plot using grid_resolution = 10 and grid_resolution = 50.

The main substantive conclusion stayed the same in both plots: predicted charges generally increase with BMI, and there is a noticeable upward jump around BMI 30–31. The higher-resolution plot makes the step-like behavior appear more sharply, but it does not change the overall interpretation. This suggests that the BMI conclusion is qualitatively stable to reasonable changes in grid_resolution.

Comparing the Two Explanation Methods¶

SHAP and CP profiles provide different but complementary information. SHAP gives both global and local explanations: it can rank features by average contribution across many observations and can also decompose one individual prediction into additive feature-level contributions. CP profiles are local rather than global: they do not rank all predictors, but they are better for showing the shape of the fitted response as one selected feature changes while the others are held fixed.

In my results, both methods point to the importance of age and BMI. SHAP ranks them among the more important predictors, and the CP plots show that predicted charges generally rise with both features. SHAP also highlights smoker_yes as the strongest predictor overall, while CP profiles are especially useful for revealing how the fitted relationship with BMI is nonlinear and appears to steepen around BMI 30–31.

Overall, the two methods are consistent, but they answer different questions: SHAP is better for attribution and ranking, while CP profiles are better for visualizing response shape and heterogeneity across selected observations.

	age	bmi	children
age	1.000	0.109	0.042
bmi	0.109	1.000	0.013
children	0.042	0.013	1.000

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520

	metric	value
0	best_cv_rmse	4425.15
1	test_rmse	4860.77
2	learning_rate	0.05
3	max_depth	3.00
4	min_samples_leaf	5.00
5	n_estimators	100.00
6	subsample	0.80

	feature	mean_abs_shap
4	smoker_yes	7522.636672
0	age	3346.413659
1	bmi	1998.731282
2	children	580.393220
3	sex_male	139.618939
7	region_southwest	105.274915
6	region_southeast	33.660371
5	region_northwest	20.122203

	feature_1	feature_2	correlation
17	bmi	region_southeast	0.270057
15	bmi	region_northeast	-0.138178
16	bmi	region_northwest	-0.136138
0	age	bmi	0.109344
35	sex_male	smoker_yes	0.076596
28	sex_female	smoker_no	0.076596
29	sex_female	smoker_yes	-0.076596
34	sex_male	smoker_no	-0.076596