As mentioned in the previous article on decision trees, model parameters can be tuned automatically; regression and classification share the same core tuning functionality. The approach is to select an evaluator, pick the metric to optimize for, and then train the pipeline so it performs the parameter search.
Each model selected this way can then be evaluated against the standard metrics exposed by Spark's RegressionMetrics objects (available for both Scala and Python).
A RegressionEvaluator drives the cross-validation over the pipeline, as in the sample below:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")

val pipeline = new Pipeline().setStages(Array(glr))

// Candidate values for the regularization parameter
val params = new ParamGridBuilder()
  .addGrid(glr.regParam, Array(0, 0.5, 1))
  .build()

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setPredictionCol("prediction")
  .setLabelCol("label")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(params)
  .setNumFolds(2) // should always be 3 or more, but this dataset is small

val model = cv.fit(df)
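Once fitted, the resulting CrossValidatorModel records the average metric achieved by each parameter map and keeps the winning model, refit on the whole dataset. A short sketch of inspecting it, continuing from the `params` and `model` values above (`avgMetrics` and `bestModel` are fields of Spark's CrossValidatorModel):

```scala
// Average cross-validated RMSE for each candidate in the grid
params.zip(model.avgMetrics).foreach { case (paramMap, rmse) =>
  println(s"$paramMap => average RMSE $rmse")
}

// The winning pipeline, already refit on the full dataset
val best = model.bestModel
```

Because the evaluator's metric is RMSE, lower average values are better; CrossValidator picks the parameter map with the best metric automatically.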
import org.apache.spark.mllib.evaluation.RegressionMetrics

// RegressionMetrics is an RDD-based API, so convert the prediction
// DataFrame to an RDD of (prediction, label) pairs
val out = model.transform(df)
  .select("prediction", "label")
  .rdd
  .map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))

val metrics = new RegressionMetrics(out)
println(s"MSE = ${metrics.meanSquaredError}")
println(s"RMSE = ${metrics.rootMeanSquaredError}")
println(s"R-squared = ${metrics.r2}")
println(s"MAE = ${metrics.meanAbsoluteError}")
println(s"Explained variance = ${metrics.explainedVariance}")
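To make these numbers concrete, the same quantities that RegressionMetrics reports can be computed by hand from (prediction, label) pairs. A self-contained plain-Scala sketch with made-up values (no Spark required; the sample pairs are purely illustrative):

```scala
object MetricsByHand extends App {
  // Hypothetical (prediction, label) pairs
  val pairs = Seq((2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0))

  val n = pairs.size.toDouble
  val errors = pairs.map { case (p, l) => p - l }

  val mse  = errors.map(e => e * e).sum / n   // mean squared error: 0.375
  val rmse = math.sqrt(mse)                   // root of the MSE
  val mae  = errors.map(math.abs).sum / n     // mean absolute error: 0.5

  // R-squared: 1 - SS_res / SS_tot
  val meanLabel = pairs.map(_._2).sum / n
  val ssRes = errors.map(e => e * e).sum
  val ssTot = pairs.map { case (_, l) => (l - meanLabel) * (l - meanLabel) }.sum
  val r2 = 1.0 - ssRes / ssTot

  println(f"MSE = $mse%.4f, RMSE = $rmse%.4f, MAE = $mae%.4f, R2 = $r2%.4f")
}
```

RMSE is simply the square root of MSE, which is why minimizing either during tuning selects the same model; R-squared instead compares the residual error against the variance of the labels themselves.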