Rescaling in Flink

Rescaling a running Flink job makes better use of computational resources when the application’s workload varies over time. In theory, you should be able to scale up or down at any point by adding or removing TaskManagers (worker processes) in a Flink cluster.

Currently there is no way of “automagically” rescaling a running Flink job, i.e. changing its parallelism when the number of TaskManagers available to execute its subtasks changes. Even if you spawn more TaskManagers and they register with the JobManager, the already running job keeps its original parallelism.
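One way to observe this is to compare the slots the cluster offers with the parallelism the job actually runs at. The sketch below is only an illustration under assumptions: the JobManager's REST API is reachable at http://localhost:8081, the /overview response carries a slots-total field, and the /jobs/:jobid response lists vertices with a parallelism field (as I understand the REST API documentation). The helper names fetch_json and parallelism_vs_slots are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: REST address of a local JobManager; adjust to your setup.
FLINK_REST = "http://localhost:8081"


def fetch_json(path, base_url=FLINK_REST):
    """GET a REST path from the JobManager and decode the JSON body."""
    with urlopen(base_url + path) as resp:
        return json.load(resp)


def parallelism_vs_slots(overview, job_details):
    """Compare the job's highest vertex parallelism with the cluster's slots.

    `overview` is the decoded /overview response and `job_details` the
    decoded /jobs/:jobid response (field names are assumptions, see above).
    """
    job_parallelism = max(v["parallelism"] for v in job_details["vertices"])
    slots_total = overview["slots-total"]
    return {
        "job_parallelism": job_parallelism,
        "slots_total": slots_total,
        # Slots the cluster offers beyond what the job's widest vertex uses.
        "slots_beyond_job": slots_total - job_parallelism,
    }
```

Calling parallelism_vs_slots(fetch_json("/overview"), fetch_json("/jobs/" + job_id)) before and after starting an extra TaskManager should show slots_total growing while job_parallelism stays the same.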

This post covers some findings about rescaling. The latest version of Flink is 1.11.2 at the time of this writing.

Removal of Job Rescaling from CLI and REST API

According to this mailing list thread, the experimental feature of modifying the parallelism of an already running Flink job was removed.

If you compare the CLI (Command Line Interface) documentation of versions 1.8 and 1.9, you can see that the command below was removed (FLINK-12312) from the CLI:

flink modify <jobID> -p <newParallelism>

In older versions this command rescaled an already running job to the new parallelism specified by the user. Rescaling was also exposed through Flink’s REST API, and interestingly the two endpoints for it are still present in the latest version, 1.11.2 (their specification is linked in the Resources section).

However, the endpoint /jobs/:jobId/rescaling does not work either and returns an error:

{
  "errors": [
    "org.apache.flink.runtime.rest.handler.RestHandlerException: Rescaling is temporarily disabled. See FLINK-12312 (..)"
  ]
}
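To reproduce that response, the request can be sketched as follows. This is a minimal sketch under assumptions: the JobManager's REST API listens at http://localhost:8081, and the rescaling endpoint accepts a PATCH request with a parallelism query parameter (as I read the REST API documentation). The helper names request_rescaling and rescaling_disabled are hypothetical.

```python
import json
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Assumption: REST address of a local JobManager; adjust to your setup.
FLINK_REST = "http://localhost:8081"


def request_rescaling(job_id, parallelism, base_url=FLINK_REST):
    """Send the rescaling request and return the decoded JSON body.

    The body is returned for error responses too, so the "errors" list
    can be inspected by the caller.
    """
    url = f"{base_url}/jobs/{job_id}/rescaling?parallelism={parallelism}"
    try:
        with urlopen(Request(url, method="PATCH")) as resp:
            return json.load(resp)
    except HTTPError as err:
        # urllib raises on non-2xx statuses; the JSON error body is still readable.
        return json.loads(err.read())


def rescaling_disabled(body):
    """True if the response carries the 'temporarily disabled' error shown above."""
    return any(
        "Rescaling is temporarily disabled" in message
        for message in body.get("errors", [])
    )
```

With a running job, rescaling_disabled(request_rescaling(job_id, 4)) is expected to be true on 1.11.2, given the error shown above.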

Reactive Container Mode

There is apparently active development (FLINK-10407) on a feature called Reactive Container Mode which, according to its description, makes a Flink cluster “react to newly available resources (e.g. started by an external service) and make use of them by rescaling the existing job.”

The details of this feature are in its design document, and it may be released in version 1.12, since the release document lists it as the “Reactive-scaling mode” feature.

Rescaling with a Job Restart

Currently the only way to rescale a Flink job is to stop it gracefully with a savepoint and restart it from that savepoint with the new parallelism.
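The two steps can be sketched with the Flink CLI. The sketch below only builds the command lines, so they can be inspected before anything is executed. It assumes the flink binary is on the PATH, that flink stop -p <dir> <jobID> takes a savepoint while stopping the job (available from 1.9, as far as I know), and that flink run -s <savepointPath> -p <parallelism> resumes from it. The savepoint directory, the jar path and the function names are hypothetical placeholders.

```python
import subprocess


def stop_with_savepoint_cmd(job_id, savepoint_dir):
    """Command that stops the job gracefully and writes a savepoint."""
    return ["flink", "stop", "-p", savepoint_dir, job_id]


def restart_cmd(savepoint_path, new_parallelism, jar_path):
    """Command that resubmits the job (detached) from the savepoint with a new parallelism."""
    return [
        "flink", "run", "-d",
        "-s", savepoint_path,
        "-p", str(new_parallelism),
        jar_path,
    ]


def run(cmd):
    """Execute a command and return its standard output (raises if it fails)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

The savepoint path to pass to restart_cmd is reported in the output of the stop command; the sketch leaves reading it to the operator rather than guessing at the output format.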

Netflix implemented their own autoscaling algorithm following this approach, deciding how to scale based on analysis of collected metrics. See the talk Autoscaling Flink at Netflix by Timothy Farkas.

For more on Flink’s Reactive Container Mode, see the talk Future of Apache Flink Deployments: Containers, Kubernetes and More by Till Rohrmann.

Resources

Rescaling endpoints on Flink’s REST API

[FLINK-12312] Temporarily disable CLI command for rescaling

[FLINK-10407] Reactive Container Mode

Flink Design Document: Reactive Container Mode

Deploy Flink Job Cluster on Kubernetes
