
Selecting hyperparameter values with sequential, human-in-the-loop, search space modification

Get confidence in your hyperparameter value selection using a human-in-the-loop, iterative approach for hyperparameter tuning

Photo by Denisse Leon on Unsplash


Shout out to Zeev Waks who’s been part of the project and the writing of this post.

Intro

Searching for a good set of hyperparameters, also known as hyperparameter tuning or hyperparameter optimization (HPO), is often one of the most time-consuming and costly aspects of machine learning model development. Hyperparameter search spaces tend to become enormous, since they grow combinatorially with each additional parameter, and data scientists tend to test many combinations in search of good results. Because evaluating each hyperparameter set entails training the model (and calculating the evaluation metric on a validation set), the search for optimal hyperparameter combinations can result in the training of hundreds, thousands, or even millions of separate models. Having a strategy for HPO is thus important, as it can have a large impact on cost, development time, and of course prediction performance.

The traditional approach to HPO attempts to find a good set of hyperparameters in one shot, typically using grid, random, or Bayesian search methods (read more about Bayesian optimization here). In fact, AutoML approaches are usually based on one-shot Bayesian search methods, such as sequential model-based optimization (SMBO) and its extensions [1, 2]. The great advantage of Bayesian methods is that they use results from historical runs, i.e., gradually accumulating prior knowledge in the form of hyperparameter value/objective function score pairs, to inform the selection of the next hyperparameter values to evaluate. This leads to good objective function scores much more quickly, saving time and compute cost. Indeed, using a Bayesian approach is probably the best option in most cases.

However, how can you have confidence that a Bayesian approach, such as the popular Tree-structured Parzen Estimators (TPE), did not settle on a suboptimal local minimum, but rather explored the hyperparameter search space sufficiently? And what should you do if the initial search space did not contain the ideal ranges for one or more hyperparameters?

In this blog, we present our experience with an alternative to one-shot HPO in which we perform a small series of exploration runs to iteratively modify the search space before conducting our final HPO run. Our goal was two-fold: to better understand the hyperparameter search space and to gain confidence that our selected hyperparameter values are indeed good and not a suboptimal local minimum.

Our use case is apartment rent prediction, and we use Optuna, a relatively new open-source framework developed by Preferred Networks, Inc.


Use case: rent prediction

At Skyline AI we collect data from a large number of sources and construct machine learning models that help make informed investment decisions. The data and model predictions are displayed in a web-based platform for real estate professionals to use.

One of our models aims to predict the monthly rent of an apartment given its features, such as the number of bedrooms, presence of amenities, and its location. To do so, we use data from hundreds of thousands of apartments in multifamily residences. Each apartment has a feature vector containing dozens of features. As such, this is a supervised regression problem, and we measure the prediction performance using Mean Absolute Percentage Error (MAPE).

Image by author

There are a few other efforts to predict rent, with notable differences such as single home vs multifamily, different geographies, and even different objectives, but diving into this is beyond the scope of this blog. Here is a short list for those interested:

  • House Rent Prediction Based on Joint Model [3]
  • A comparison of apartment rent price prediction using a large dataset: Kriging versus DNN [4]
  • Modeling Housing Rent in the Atlanta Metropolitan Area Using Textual Information and Deep Learning [5]
  • Rent Zestimate (Zillow)

Optuna

There are many frameworks for HPO, and there are some great blogs and reviews that compare them (10 Hyperparameter optimization frameworks, Optuna vs Hyperopt, Getting Accurate Scikit Learn models using Optuna: A Hyper-parameter framework). After evaluating a few options, we chose to use Optuna, a relatively new, Pythonic, open-source framework developed by Preferred Networks, Inc.

Optuna has several nice features including:

  • Ease of use and a Pythonic API
  • Good documentation with many code examples
  • Option to save and resume the search process
  • Ability to define custom hyperparameters, such as whether to use a specific preprocessing step
  • Custom objective functions and user attributes
  • Implementations of several hyperparameter search space samplers, including a Bayesian sampler based on the Tree-structured Parzen Estimator (TPE) algorithm
  • A good visualization suite
  • Unique features like pruning, multi-objective optimization, callbacks, and exception handling
  • Support for distributed runs across machines, as well as parallelism within a single machine (running multiple trials simultaneously on different cores), although the latter currently does not work optimally (known issue)
  • Support for evaluating specific hyperparameter values in a single run, which can be used to test your code (FixedTrial)

Search space

Our hyperparameter search space contained 9 different hyperparameters, spanning different areas of model development, including preprocessing (training data selection, PCA), feature engineering (categorical encoding), model selection (RandomForest, XGBoost), model parameter selection (number of trees, tree depth), and label transformation (natural log or no transformation). We decided to sample only from discrete distributions, since this allows us to quantify the size of our hyperparameter search space and measure sampling coverage.

As a side note, for the categorical encoders, we used a great Python library called "category_encoders" (source code, read more).

Overall, our starting search space contained 5 x 29 x 6 x 2 x 2 x 9 x 16 = 501,120 combinations.
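Quantifying a discrete search space is just the product of the per-hyperparameter cardinalities. A quick sketch (the mapping of each factor to a specific hyperparameter is our assumption; the numbers are the ones stated above):

```python
from math import prod

# Number of discrete values per hyperparameter, as stated in the post.
cardinalities = [5, 29, 6, 2, 2, 9, 16]
search_space_size = prod(cardinalities)
print(search_space_size)  # 501120
```

With a known size, sampling coverage is simply `n_distinct_trials / search_space_size`.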


Baseline result

We built a model using manually chosen hyperparameter values in order to have a baseline result to enable subsequent comparisons. We obtained a MAPE of 9.53%. The chosen hyperparameter values were max rent diff of 5%, no PCA, ordinal encoding, XGBoost with 100 estimators, no max tree depth limitation, and the label as is (no log transformation).


Bayesian one-shot HPO

We started by performing the classic, one-shot HPO approach using Optuna’s TPE implementation (TPESampler), a Bayesian probabilistic search optimization algorithm, on the search space. After 21,900 trials, we obtained a minimal MAPE of 8.80%, which had already been achieved after only 1,114 trials. This is much better than the 9.53% baseline result achieved without HPO, but we noticed a few odd things.

Repetitive evaluation of the same hyperparameter values was very common, which is a known issue (see discussion). Out of the 21,900 trials, there were only 940 distinct hyperparameter combinations, and the best hyperparameter combination was evaluated more than 8,200 times – nearly 2 out of every 5 trials!
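Duplicate rates are easy to measure yourself by hashing each trial's parameter dict. A minimal sketch with made-up trial records (with a real Optuna study you would iterate over `study.trials` and use each trial's `.params`):

```python
from collections import Counter

# Hypothetical trial records; each is a dict of hyperparameter values.
trials = [
    {"regressor": "XGBoost", "n_estimators": 280, "max_depth": 5},
    {"regressor": "XGBoost", "n_estimators": 280, "max_depth": 5},
    {"regressor": "RandomForest", "n_estimators": 100, "max_depth": 8},
    {"regressor": "XGBoost", "n_estimators": 280, "max_depth": 5},
]

# frozenset of items makes equal parameter dicts hash identically.
counts = Counter(frozenset(t.items()) for t in trials)
n_distinct = len(counts)                       # distinct combinations
_, n_repeats = counts.most_common(1)[0]        # evaluations of the top combo
print(n_distinct, n_repeats)  # 2 3
```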

Image by author

Rather than sampling close to 5% of our search space, we in fact only sampled 0.19%. This led us to question whether the TPESampler may have hit a suboptimal local minimum, as the objective function is likely not convex. While unlikely in our case, low sampling coverage can be even more problematic in the case of high dimensionality, as Bayesian optimization methods perform similarly to random search in cases where the dimension of the hyperparameter space is very large [6, 7].

Another issue with the low coverage was that we did not have a good idea of how to modify our search space for a follow-up HPO run in search of a better MAPE. Which hyperparameter ranges should we decrease, which should we increase, and where should we shrink interval steps (or even make ranges continuous) within specific ranges?


Sequential HPO using search space pruning

We next decided to try sequential HPO random searches in order to be more confident that we indeed settle on a good search space (that contains hyperparameter combos with ideal results) and that this leads us to obtain a good local, or even global, minimum of our objective function.

The idea is very simple and is similar to computational Bayesian approaches, except that it involves manual judgment. We randomly sample (using RandomSampler) our search space to find hyperparameter value ranges that yield poor results and prune them out. We can also expand the search space where our initial intuition was wrong (too small a range). The unbiased sampling, an inherent aspect of random search, makes it possible to statistically quantify the odds of removing good hyperparameter values if one desires extra thoroughness, although with a big caveat: it requires assuming independence between hyperparameters. As this is typically not the case, the actual value is most likely incorrect; however, it may still be used as a ballpark figure.

In theory, there could be a scenario where the reduction in search space may have removed the best hyperparameter combination as hyperparameters can interact, however in our case, and probably in many other cases, this does not appear to be a big risk.

Sequential HPO using search space pruning flow. Image by author

To prevent confusion, note that Optuna has pruning functionality that refers to something else. Optuna’s pruning stops poorly performing training runs early in order to save time, similar to early stopping in model training, whereas the sequential HPO pruning we describe here is a process to modify (shrink or expand) the hyperparameter search space.


Sequential pruning – 1st iteration

We ran our first pruning study for 25,056 trials to cover approximately 5% of our hyperparameter search space. We ran it on the same search space as the one-shot HPO (see the "Search space" section). As mentioned above, we used a random search (RandomSampler) in order to explore the search space in an unbiased manner. The best model obtained a MAPE of 8.77%. Surprisingly, this was even a little better than our Bayesian search on the same search space (21,900 trials), which resulted in 8.80%!

Looking at how the objective function results (MAPE on the validation set) vary per individual trial, as visualized by Optuna’s visualization suite, it is clear that the search space can be trimmed.

Image by the author

To trim the search space, we used Optuna’s slice plot visualization to look at the nine individual hyperparameters. Below are the slice plots for the four hyperparameters whose ranges we modified.

Image by author

Based on these, we decided to remove the following hyperparameter values: train_rent_diff = 0.01, regressor = "RandomForest", n_estimators = {40, 60}, max_depth = {2, 16–32 (step=2)}. This resulted in a search space reduction of 88%, from 501,120 to 58,464 combinations, enabling us to search more confidently for a good local minimum using fewer trials while obtaining higher search space coverage.

For extra rigor, given that we are using a random search, it is possible to statistically quantify the odds that we are improperly removing hyperparameter values, if we assume that the hyperparameter values are independent. For example, the top 708 results were all obtained using XGBoost as the model, even though there was a similar number of trials with Random Forest (10,401) as with XGBoost (10,425). Note that 10,425 + 10,401 < 25,056, since there were 4,230 duplicated trials (17%).
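As a back-of-the-envelope sketch of that calculation (our simplification: assume independence, ignore duplicates, and take the null hypothesis that model choice does not affect ranking, so each top trial is XGBoost with probability 10,425 / 20,826):

```python
import math

n_xgb, n_rf = 10_425, 10_401
p_xgb = n_xgb / (n_xgb + n_rf)  # probability a top trial is XGBoost under the null
# log10 of the probability that all 708 top trials are XGBoost by chance.
log10_p_all_xgb = 708 * math.log10(p_xgb)
print(log10_p_all_xgb)  # around -213, i.e. astronomically unlikely by chance
```

Even with the independence caveat, a probability on the order of 10^-213 is a comfortable ballpark justification for pruning Random Forest out.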

Not surprisingly, 3 of the 4 hyperparameters whose search space we modified were among the highest in importance as quantified by Optuna. The importance plot is another built-in capability of the package.

Image by author

Sequential pruning – 2nd iteration

We continued by running a second random search iteration covering 20% of the search space (4 x 29 x 6 x 2 x 1 x 7 x 6 = 58,464 combinations), this time with fewer trials: 11,693. Again, we obtained an improvement in MAPE compared to the previous iteration, although not a substantial one this time: 8.74% vs. 8.77%.

One nice effect of search space pruning is that it eliminated the trials that ended with large objective function scores, of which there were many in the first iteration (see the three plots above).

Image by author

Upon analyzing individual hyperparameter values, we decided to reduce the search space for some (removing train_rent_diff = 0.02, removing the PCA option, removing n_estimators = 80, and keeping only the target encoder option), increase the resolution for others (max_depth in smaller increments of 1 instead of 2, over the range 4–11), and even expand the ranges of the target encoder’s hyperparameters (te_min_samples_leaf and te_smoothing to 1–11 with increments of 2). Below are the plots that we used to make these decisions.

Image by author
Image by the author

After the adjustments, we had confidently reduced the search space by 96% from our starting point of 501,120 possibilities! Here is the resulting search space: 3 x 36 x 1 x 2 x 1 x 11 x 8 = 19,008 combinations.

Sequential pruning – final run

We were now ready for our final run. We used a Bayesian search (TPESampler), given that we no longer needed to modify the search space and therefore no longer wanted unbiased sampling. On the contrary, we wanted good results with as few trials as possible.

We only ran the HPO study for 500 trials, and the best result was found after only 49 trials, with a MAPE of 8.55%. It turns out that this was the global minimum, as verified by grid search (GridSampler). Importantly, this is substantially better than the 8.80% MAPE we received using Bayesian sampling on our initial search space.
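Exhaustive verification becomes feasible once the space is this small. The idea, sketched on a toy objective with hypothetical values (not our real ranges or model):

```python
from itertools import product

# Toy discrete search space; names and values are illustrative only.
space = {
    "n_estimators": [240, 260, 280, 300],
    "max_depth": [4, 5, 6, 7],
    "log_label": [True, False],
}

def objective(params):
    # Stand-in for training the model and computing validation MAPE.
    return (abs(params["n_estimators"] - 280)
            + abs(params["max_depth"] - 5)
            + (0 if params["log_label"] else 10))

# Evaluate every combination and keep the best one.
best = min(
    (dict(zip(space, combo)) for combo in product(*space.values())),
    key=objective,
)
print(best)  # {'n_estimators': 280, 'max_depth': 5, 'log_label': True}
```

In Optuna, the same dict of lists can be handed directly to `optuna.samplers.GridSampler(space)` to perform the exhaustive sweep within a study.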

Image by author

The best model hyperparameter values were: max rent diff of 4%, no PCA, target encoding with smoothing=1 and min_sample_leaf=11, XGBoost with 280 estimators, a max tree depth of 5, and a natural log-transformed label. Note that some of these hyperparameter values were not in our initial search space! This highlights another advantage of iterations: they can uncover search spaces that are too small due to mistaken intuition.

Additional visualizations

Optuna enables us to compare different HPO studies by plotting the empirical Cumulative Distribution Function (i.e., [eCDF](https://en.wikipedia.org/wiki/Cumulative_distribution_function)) of the objective function scores obtained in their trials. It is "empirical" since the eCDF is an estimate of the true CDF, but for analysis purposes it is essentially the same. Using this, we could easily see that in each sequential pruning iteration, the odds of individual trials achieving low objective function scores increased dramatically.
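The quantity being plotted is simple to compute by hand: the fraction of trials whose score falls below a threshold. A sketch with hypothetical MAPE lists (not our real numbers):

```python
def ecdf_at(values, threshold):
    """Fraction of trials whose objective score is below the threshold."""
    return sum(v < threshold for v in values) / len(values)

# Made-up MAPE results from two hypothetical studies.
random_search_mapes = [9.4, 9.8, 10.2, 11.0, 9.6, 12.3, 9.3, 10.8]
bayesian_mapes = [8.9, 9.1, 9.0, 9.2, 8.8, 9.4, 9.3, 9.6]

print(ecdf_at(random_search_mapes, 9.5))  # 0.25
print(ecdf_at(bayesian_mapes, 9.5))       # 0.875
```

In practice, Optuna's `optuna.visualization.plot_edf` accepts a list of studies and draws these curves for you.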

We can see that in each sequential pruning the odds of individual trials getting low MAPEs increases. For example, for the first two pruning iterations, the probability of getting a MAPE value smaller than 9.5 is 2% and 15% respectively, whereas for the final Bayesian run it jumps to 99.6%. Image by author.

Comparing to Bayesian once more

We started with a search space of 501k combinations; however, this space did not contain our best result. The union of the search spaces of all our sequential HPO runs is actually 7x bigger than this starting point, about 3.5M combinations, since beyond pruning we also expanded some hyperparameter value ranges and decreased the step sizes within ranges.

Therefore, for a fair comparison, we ran a one-shot Bayesian search (TPESampler) on the 3.5M-combination search space. The best run had a MAPE of 8.56% (found after 3,362 trials out of 26,500 total), which is very close to the global minimum of 8.55%. There is an important lesson to learn here.


Conclusion

So what is our take-home message, other than Optuna being a great package? Bayesian approaches can be amazing; however, human intuition is not perfect. The initial Bayesian HPO run was on a suboptimal search space and thus yielded a poorer result. Incorrectly estimating the search space can easily happen in real-world settings. In addition, Bayesian searches can get stuck in local minima, sometimes making them worse than random search (as happened in our case on our initial search space: 1st iteration random search vs. Bayesian).

Sequential runs coupled with manual analysis, as shown here, can help us better understand the search space and ensure we get a good result, and, equally important, one that we can be confident in.

By using our sequential HPO with search space pruning, we were able to reduce our MAPE by 10%, from 9.53% to 8.55%. Using Bayesian one-shot HPO on our initial search space (size – 501k, trials – 22k) and on the union of search spaces (size – 3.5M, trials – 26.5k) gave us MAPE values of 8.80% and 8.56%, respectively. Image by author.

Hai Rozencwajg is a Senior Data Scientist at Skyline AI, the company building the AI mastermind to solve RE.


References

[1] M. Wistuba, N. Schilling, and L. Schmidt-Thieme, Hyperparameter Search Space Pruning – A New Component for Sequential Model-Based Hyperparameter Optimization (2015), Part of the Lecture Notes in Computer Science book series (LNCS, volume 9285)

[2] A. Lacoste, H. Larochelle, F. Laviolette, and M. Marchand, Sequential Model-Based Ensemble Optimization (2014), arXiv

[3] Z. Kun, S. LingCong, and L. Ninging, House Rent Prediction Based on Joint Model (2019), Proceedings of the 2019 8th International Conference on Computing and Pattern Recognition, ICCPR ’19

[4] H. Seya and D. Shiroi, A comparison of apartment rent price prediction using a large dataset: Kriging versus DNN (2019), arXiv

[5] X. Zhou, W. Tong, and D. Li, Modeling Housing Rent in the Atlanta Metropolitan Area Using Textual Information and Deep Learning (2019), ISPRS International Journal of Geo-Information

[6] Z. Wang, M. Zoghi, F. Hutter, D. Matheson, and N. De Freitas, Bayesian Optimization in High Dimensions via Random Embeddings (2013), Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI ’13

[7] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh and A. Talwalkar, Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization (2018), arXiv

