HBV calibration

HBV calibration#

This notebook is used to calibrate the parameters for the HBV model for the Loire river analysis. The period for calibration and validation period is divided in 1990-2014 for calibration and 2015-2019 for calibration.

1. Importing general python modules#

# general python
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

import numpy as np
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import geopandas as gpd
import pandas as pd

#niceties
from rich import print

# Needed
from ipywidgets import IntProgress
from IPython.display import display
from scipy.stats import qmc
from sklearn.metrics import mean_squared_error
from scipy.stats import wasserstein_distance

# general eWaterCycle
import ewatercycle
import ewatercycle.models
import ewatercycle.forcing

# import drought analyser function
%run Drought_analyser.ipynb

2. Defining experiment data and paths#

In this chapter all the variables are defined for the chosen basin. Also the loading and generating path are defined for the forcings and model.

# name of the catchment
basin_name = "FR003882"

# defining dates for calibration
experiment_start_date = "1990-01-01"
experiment_end_date = "2019-12-31"

# defining path for catchment shape file
station_shp = Path.home() / "BEP-Loire" / "book" / "model_loire" / "estreams_cb_FR003882.shp"

# defining destination path for ERA5 data
forcing_path_ERA5 = Path.home() / "forcing" / "loire_river" / "ERA5-90-19"
forcing_path_ERA5.mkdir(exist_ok=True)

# model HBV destination path
model_path_HBV = Path.home() / "tmp" / "HBV_model"

gdf = gpd.read_file("estreams_cb_FR003882.shp")
gdf = gdf.to_crs(epsg=2154)
gdf["area_km2"] = gdf.geometry.area / 1e6  
basin_area = gdf["area_km2"].sum()

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[6], line 13
     11 # defining destination path for ERA5 data
     12 forcing_path_ERA5 = Path.home() / "forcing" / "loire_river" / "ERA5-90-19"
---> 13 forcing_path_ERA5.mkdir(exist_ok=True)
     15 # model HBV destination path
     16 model_path_HBV = Path.home() / "tmp" / "HBV_model"

File /opt/conda/envs/ewatercycle2/lib/python3.12/pathlib.py:1311, in Path.mkdir(self, mode, parents, exist_ok)
   1307 """
   1308 Create a new directory at this given path.
   1309 """
   1310 try:
-> 1311     os.mkdir(self, mode)
   1312 except FileNotFoundError:
   1313     if not parents or self.parent == self:

FileNotFoundError: [Errno 2] No such file or directory: '/home/zoe/forcing/loire_river/ERA5-90-19'

3. Generating ERA5 forcings#

The ERA5 forcings can be generated using the marked out code below. If ERA5 is already generated, it can be loaded using option 2.

ERA5_start_date = experiment_start_date+'T00:00:00Z'
ERA5_end_date = experiment_end_date+'T00:00:00Z'

# Option 1: Generate forcings
#ERA5_forcing = ewatercycle.forcing.sources["LumpedMakkinkForcing"].generate(
#   dataset="ERA5",
#   directory= str(forcing_path_ERA5),
#   start_time=ERA5_start_date,
#   end_time=ERA5_end_date,
#   shape=station_shp,
#)

# Option 2: Load forcings
# get data from stored location 
ERA5_forcing = ewatercycle.forcing.sources["LumpedMakkinkForcing"].load(directory=forcing_path_ERA5)

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[7], line 15
      2 ERA5_end_date = experiment_end_date+'T00:00:00Z'
      4 # Option 1: Generate forcings
      5 #ERA5_forcing = ewatercycle.forcing.sources["LumpedMakkinkForcing"].generate(
      6 #   dataset="ERA5",
   (...)
     13 # Option 2: Load forcings
     14 # get data from stored location 
---> 15 ERA5_forcing = ewatercycle.forcing.sources["LumpedMakkinkForcing"].load(directory=forcing_path_ERA5)

File /opt/conda/envs/ewatercycle2/lib/python3.12/site-packages/ewatercycle/base/forcing.py:276, in DefaultForcing.load(cls, directory)
    271 if not meta.exists():
    272     msg = (
    273         f"Forcing file {meta} not found. "
    274         f"Perhaps you want to use {cls.__name__}(...)?"
    275     )
--> 276     raise FileNotFoundError(msg)
    277 metadata = meta.read_text()
    278 # Workaround for legacy forcing files having !PythonClass tag.
    279 # Remove it so ewatercycle.forcing.source[<forcing name>].load(dir) works.

FileNotFoundError: Forcing file /home/zoe/forcing/loire_river/ERA5-90-19/ewatercycle_forcing.yaml not found. Perhaps you want to use LumpedMakkinkForcing(...)?

4. Defining historical data from eStreams#

The observerd discharges for the Blois station are loaded in the cell below.

q_data = pd.read_csv("FR003882_streamflow_m3s.csv", index_col='date', parse_dates=True)
Q_obs = q_data[experiment_start_date:experiment_end_date]

5. Calibrate the parameters#

In this chapter multiple calibration methods are defined. The first method is RMSE. This method is used to check if the overall calibration method works accordingly. The second and third method are used to specifically calibrate on drought duration and deficit. Because the distribution method gave the best results, this is eventually used as final calibration method.

5.1 Calibration method 1: RMSE#

For the first method ‘Root Mean Squared Error’ is used. This method is used first to check wether the rest of the calibration method works accordinglu, as this should give decent parameters for discharge. The first year of the modelled discharge is skipped for calibration, because the s_0 (initial storages) need time to fill. Also the code checks wether the mean_flow is lower than 100 m3/s, to speed up the process, as these values can never be a good Q_model and thus the parameters are not useful.

def drought_calibration_objective1(modelOutput, observation, start_calibration, end_calibration):
    # Combine modeled and observed data in one DataFrame
    hydro_data = pd.concat([modelOutput.reindex(observation.index, method='ffill'), observation], axis=1)

    # Select calibration period (skip first year)
    start_calibration = str(int(start_calibration[:4]) + 1) + start_calibration[4:]
    hydro_data = hydro_data[hydro_data.index > pd.to_datetime(pd.Timestamp(start_calibration).date())]
    hydro_data = hydro_data[hydro_data.index < pd.to_datetime(pd.Timestamp(end_calibration).date())]
    hydro_data = hydro_data.dropna(subset=[basin_name])

    # Check if the discharge is not too low
    mean_flow = hydro_data['model output'].mean()
    if mean_flow < 100:
        #print(f"Skipping iteration: Mean flow {mean_flow:.2f} m³/s is too low.")
        return np.inf

    # Use RMSE on filtered data
    rms = mean_squared_error(hydro_data[0], hydro_data[1], squared=False)

    return rms

5.2 Calibration method 2: Normal distribution#

This calibration method return the Earth Mover’s Distance (EMD), which calculates the dissimilarity between two distributions (Modelled distribution and observed distribution). This is both done for duration and deficit. These two EMD’s are added and returned in this function. The lower the EMD, the better the result. Also is check is done if the mean flow is not under 100 m3/s to speed up the process, as these discharges won’t be a good fit for this model. And the amount of modelled droughts have to be equal to 0.5-1.5 times the observed droughts. This makes sure that parameter sets which cause too few/many droughts are left out.

def drought_calibration_objective2(modelOutput, observation, start_calibration, end_calibration):
    # Combine modeled and observed data in one DataFrame
    hydro_data = pd.concat([modelOutput.reindex(observation.index, method='ffill'), observation], axis=1)

    # Select calibration period
    start_calibration = str(int(start_calibration[:4]) + 1) + start_calibration[4:]
    hydro_data = hydro_data[hydro_data.index > pd.to_datetime(pd.Timestamp(start_calibration).date())]
    hydro_data = hydro_data[hydro_data.index < pd.to_datetime(pd.Timestamp(end_calibration).date())]
    hydro_data = hydro_data.dropna(subset=[basin_name])
    
    mean_flow = hydro_data['model output'].mean()
    if mean_flow < 100:  # Threshold check
        return np.inf

    # Run drought analyser on both modeled and observed data
    drought_obs = drought_analyser(hydro_data[basin_name], basin_name, 66.5)
    drought_model = drought_analyser(hydro_data['model output'], 'model output', 66.5)
    
    if not (
        0.5 * drought_obs["Drought Number"].iloc[-1] < drought_model["Drought Number"].iloc[-1] < 1.5 * drought_obs["Drought Number"].iloc[-1]
    ):
        return np.inf

    # Extract drought duration and max deficit values
    x_obs, y_obs = drought_obs["Duration (days)"], drought_obs["Max Cumulative Deficit (m3/s)"]
    x_model, y_model = drought_model["Duration (days)"], drought_model["Max Cumulative Deficit (m3/s)"]

    # Compute Earth Mover’s Distance
    duration_emd = wasserstein_distance(x_obs, x_model)
    deficit_emd = wasserstein_distance(y_obs, y_model)

    total_distance = (duration_emd * 0.5) + (deficit_emd * 0.5)

    return total_distance

5.3 Calibration method 3: Fitted polynomial#

This calibration method calculates a 1st-degree polynomial fitted line for modelled and observed droughts based on duration and deficit. This method then return the difference in coefficients for these lines. The lower this difference, the better the parameters. Also a check is added that leaves out parameter sets that return a max. deficit which is too high.

def drought_calibration_objective3(modelOutput, observation, start_calibration, end_calibration):
    # Combine modeled and observed data in one DataFrame
    hydro_data = pd.concat([modelOutput.reindex(observation.index, method='ffill'), observation], axis=1)
    
    # Select calibration period
    start_calibration = str(int(start_calibration[:4]) + 1) + start_calibration[4:]
    hydro_data = hydro_data[hydro_data.index > pd.to_datetime(pd.Timestamp(start_calibration).date())]
    hydro_data = hydro_data[hydro_data.index < pd.to_datetime(pd.Timestamp(end_calibration).date())]
    hydro_data = hydro_data.dropna(subset=[basin_name])
    
    mean_flow = hydro_data['model output'].mean()
    if mean_flow < 100:  # Threshold check (adjust as needed)
        #print(f"Skipping iteration: Mean flow {mean_flow:.2f} m³/s is too low.")
        return np.inf

    # Run drought analyser on both modeled and observed data
    drought_obs = drought_analyser(hydro_data[basin_name], basin_name, 66.5)
    drought_model = drought_analyser(hydro_data['model output'], 'model output', 66.5)
    
    if (
        1.2 * drought_obs["Max Cumulative Deficit (m3/s)"].min() < drought_model["Max Cumulative Deficit (m3/s)"].min()
    ):
        #print(drought_obs["Max Cumulative Deficit (m3/s)"].min(), drought_model["Max Cumulative Deficit (m3/s)"].min())
        return np.inf

    # Fit linear lines to both datasets
    x_obs, y_obs = drought_obs["Duration (days)"], drought_obs["Max Cumulative Deficit (m3/s)"]
    x_model, y_model = drought_model["Duration (days)"], drought_model["Max Cumulative Deficit (m3/s)"]

    if len(x_obs) > 1 and len(x_model) > 1:
        coeffs_obs = np.polyfit(x_obs, y_obs, 1)
        coeffs_model = np.polyfit(x_model, y_model, 1)
    else:
        #print(f"Insufficient data points for polyfit ({len(x_obs)}, {len(x_model)}).")
        return np.inf

    # Calculate sum of difference between polynomial coefficients and distribution
    poly_diff = np.sum(np.abs(coeffs_obs - coeffs_model))

    return poly_diff

5.4 Start calibrating#

For the calibration we need to define sets of parameters. These sets are defined using a range for each parameter and then the sets are semi-randomly generated using a Latin Hypercube sampler, which makes sure that the parameters sets are well distributed. For this research a sample size of 2000 sets is used. A greater size did not result in better parameters. Then, the discharges are generated for each set of parameters and assessed using drought_calibration_objective2. At the end unnecessary data is deleted, as we only need the objective (which is the return value of the objective function).

At this moment N=20, as a sample size of 2000 can take up to 2 hours of calibraing.

# Define initial storage for the model
#               Si,  Su, Sf, Ss, Sp
s_0 = np.array([0,  100,  0,  15, 0])


# Define parameter ranges for the model
p_min = np.array([0,   0.2,  40,    .5,   .001,   1,     .01,  .0001,  .01])
p_max = np.array([8,    1,  800,   4,    .3,     10,    .1,   .01,  0.5])

# Sample random parameter combinations
N = 20
parameters = np.zeros([9, N])

# Create a Latin Hypercube sampler
sampler = qmc.LatinHypercube(d=9)
sample = sampler.random(n=N)

# Scale the sample to match the parameter ranges
parameters = qmc.scale(sample, p_min, p_max)

ensemble = []

for counter in range(N): 
    ensemble.append(ewatercycle.models.HBVLocal(forcing=ERA5_forcing))
    config_file, _ = ensemble[counter].setup(
                            parameters = parameters[counter],
                            initial_storage=s_0,
                            cfg_dir = model_path_HBV,
                               )
    ensemble[counter].initialize(config_file)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[13], line 4
      1 ensemble = []
      3 for counter in range(N): 
----> 4     ensemble.append(ewatercycle.models.HBVLocal(forcing=ERA5_forcing))
      5     config_file, _ = ensemble[counter].setup(
      6                             parameters = parameters[counter],
      7                             initial_storage=s_0,
      8                             cfg_dir = model_path_HBV,
      9                                )
     10     ensemble[counter].initialize(config_file)

NameError: name 'ERA5_forcing' is not defined

# Progress bar for visualization
f = IntProgress(min=0, max=N)
display(f)

# Array to store objective values
objectives = []

# Loop over ensemble members
for ensembleMember in ensemble:
    Q_m = []
    time = []
    while ensembleMember.time < ensembleMember.end_time:
        ensembleMember.update()
        discharge_this_timestep = ensembleMember.get_value("Q")
        Q_m.append(discharge_this_timestep[0])
        time.append(pd.Timestamp(ensembleMember.time_as_datetime.date()))

    # Create DataFrame for model results
    Q_m = convert_Qsim_mmday_to_m3s(np.array(Q_m), basin_area)
    discharge_dataframe = pd.DataFrame({'model output': Q_m}, index=pd.to_datetime(time))

    # Calculate the custom drought-based objective function
    objective_this_model = drought_calibration_objective2(
        discharge_dataframe, 
        Q_obs, 
        experiment_start_date, 
        experiment_end_date
    )
    objectives.append(objective_this_model)

    # Free up memory
    del Q_m, time, discharge_dataframe, objective_this_model
    f.value += 1

# Clean up models to save memory
for ensembleMember in ensemble:
    ensembleMember.finalize()

6. Results#

The best fitting parameters are chosen based on the lowest objective value. This is done in the cell below. If the minimum value in the objective list is equal to np.inf, no real parameter is chosen. This means that the parameter sets need to be adjusted.

# Let's also show the minimal values:
parameters_minimum_index = np.argmin(np.array(objectives))
if np.min(np.array(objectives)) == np.inf:
    print("No real parameter is chosen")

parameters_minimum = parameters[parameters_minimum_index]

print(list(parameters_minimum))

[
    4.457875757565083,
    0.4230702227959797,
    201.57602671208883,
    2.3649785293066605,
    0.26755301186057023,
    8.629648911384976,
    0.08356848543300459,
    0.003707325010773819,
    0.29644921009582514
]