As mentioned in the previous article on decision trees, model parameters can be tuned automatically; regression and classification share the same core tuning functionality. The approach is to select an evaluator, pick the metric to optimize for, and then train the pipeline so it performs the parameter search.
Each model selected this way can then be evaluated against the standard metrics exposed by Spark's RegressionMetrics objects (available for both Scala and Python).
A RegressionEvaluator drives the cross-validation over the pipeline, as in the sample below:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")

val pipeline = new Pipeline().setStages(Array(glr))

// Candidate values for the regularization parameter
val params = new ParamGridBuilder()
  .addGrid(glr.regParam, Array(0, 0.5, 1))
  .build()

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setPredictionCol("prediction")
  .setLabelCol("label")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(params)
  .setNumFolds(2) // should always be 3 or more, but this dataset is small

val model = cv.fit(df)
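Once fitted, the resulting CrossValidatorModel records the average metric achieved by each parameter map and keeps the winning model, refit on the whole dataset. A short sketch of inspecting it, continuing from the `params` and `model` values above (`avgMetrics` and `bestModel` are fields of Spark's CrossValidatorModel):

```scala
// Average cross-validated RMSE for each candidate in the grid
params.zip(model.avgMetrics).foreach { case (paramMap, rmse) =>
  println(s"$paramMap => average RMSE $rmse")
}

// The winning pipeline, already refit on the full dataset
val best = model.bestModel
```

Because the evaluator's metric is RMSE, lower average values are better; CrossValidator picks the parameter map with the best metric automatically.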
import org.apache.spark.mllib.evaluation.RegressionMetrics

// RegressionMetrics is an RDD-based API, so convert the prediction
// DataFrame to an RDD of (prediction, label) pairs
val out = model.transform(df)
  .select("prediction", "label")
  .rdd
  .map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))

val metrics = new RegressionMetrics(out)
println(s"MSE = ${metrics.meanSquaredError}")
println(s"RMSE = ${metrics.rootMeanSquaredError}")
println(s"R-squared = ${metrics.r2}")
println(s"MAE = ${metrics.meanAbsoluteError}")
println(s"Explained variance = ${metrics.explainedVariance}")
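To make these numbers concrete, the same quantities that RegressionMetrics reports can be computed by hand from (prediction, label) pairs. A self-contained plain-Scala sketch with made-up values (no Spark required; the sample pairs are purely illustrative):

```scala
object MetricsByHand extends App {
  // Hypothetical (prediction, label) pairs
  val pairs = Seq((2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0))

  val n = pairs.size.toDouble
  val errors = pairs.map { case (p, l) => p - l }

  val mse  = errors.map(e => e * e).sum / n   // mean squared error: 0.375
  val rmse = math.sqrt(mse)                   // root of the MSE
  val mae  = errors.map(math.abs).sum / n     // mean absolute error: 0.5

  // R-squared: 1 - SS_res / SS_tot
  val meanLabel = pairs.map(_._2).sum / n
  val ssRes = errors.map(e => e * e).sum
  val ssTot = pairs.map { case (_, l) => (l - meanLabel) * (l - meanLabel) }.sum
  val r2 = 1.0 - ssRes / ssTot

  println(f"MSE = $mse%.4f, RMSE = $rmse%.4f, MAE = $mae%.4f, R2 = $r2%.4f")
}
```

RMSE is simply the square root of MSE, which is why minimizing either during tuning selects the same model; R-squared instead compares the residual error against the variance of the labels themselves.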