Single Scheduler — Multiple Worker Architecture With GRPC and Go — Part 2

Single scheduler (conductor) and multiple workers

Part 2 — The scheduler overview

The scheduler exposes an HTTP server that acts as a gateway to the scheduler-worker cluster. HTTP requests to the scheduler API are translated into gRPC calls and proxied to the workers.

  • The scheduler runs an HTTP server as well as a gRPC server.
  • The scheduler keeps a record of the available workers in the cluster.
  • Workers can register with and deregister from the scheduler.

Data Structures

When a new worker registers with the scheduler, a worker object is created. A pointer to this worker object is kept in the workers map, with the worker ID as the key and the pointer as the value. Worker objects hold the ID of the worker and the address of the worker's gRPC server, which is later used to start/stop/query jobs.

var (
	workersMutex = &sync.Mutex{}
	workers      = make(map[string]*worker)
)

// worker holds the information about registered workers
// - id: uuid assigned when the worker first registers.
// - addr: worker's network address, later used to create a grpc client to the worker
type worker struct {
	id   string
	addr string
}

Configuration for the scheduler is done through the `config.toml` file. I wrote about how I approach configuration in Go projects. You can read more here: https://koraygocmen.medium.com/handling-configuration-in-go-22393a8f5b6d

gRPC Server

  • addr: Address on which the gRPC server will be run.
  • use_tls: Whether the GRPC server should use TLS. If true, crt_file and key_file should be provided.
  • crt_file: Path to the certificate file for TLS.
  • key_file: Path to the key file for TLS.

HTTP Server

  • addr: Address on which the HTTP server will be run.

# Full config file: scheduler/config.toml
[grpc_server]
addr = "127.0.0.1:50000"
use_tls = false
crt_file = "server.pem"
key_file = "server.key"

[http_server]
addr = "127.0.0.1:3000"

gRPC Server

Only two RPCs are needed on the scheduler's gRPC server; they are used to register and deregister workers.

// Scheduler gRPC server definitions
service Scheduler {
	rpc RegisterWorker(RegisterReq) returns (RegisterRes) {}
	rpc DeregisterWorker(DeregisterReq) returns (DeregisterRes) {}
}

message RegisterReq {
	string address = 1;
}

message RegisterRes {
	bool success = 1;
	string workerID = 2;
}

message DeregisterReq {
	string workerID = 1;
}

message DeregisterRes {
	bool success = 1;
}

`RegisterWorker` is called by any worker that comes online. The scheduler assigns a unique worker ID to that worker and records it in the workers map, along with the gRPC address passed in the request. This address is later used to start/stop/query jobs on the worker.

`DeregisterWorker` is called by any worker that goes offline; the worker with the workerID specified in the deregister request is removed from the known workers.
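
For illustration, here is a minimal sketch of how these two RPCs might be implemented on the scheduler. The `pb` package name, its import path, and the use of google/uuid for ID generation are assumptions, not the article's actual code:

// Hypothetical sketch of the scheduler's Register/Deregister handlers.
// Assumes `pb` is the Go package generated from the proto definitions above.
package main

import (
	"context"

	"github.com/google/uuid"
	pb "example.com/scheduler/proto" // assumed import path for the generated code
)

type schedulerServer struct{}

// RegisterWorker mints a unique worker ID and records the worker's
// gRPC address in the workers map, guarded by workersMutex.
func (s *schedulerServer) RegisterWorker(ctx context.Context, req *pb.RegisterReq) (*pb.RegisterRes, error) {
	id := uuid.New().String()

	workersMutex.Lock()
	workers[id] = &worker{id: id, addr: req.Address}
	workersMutex.Unlock()

	return &pb.RegisterRes{Success: true, WorkerID: id}, nil
}

// DeregisterWorker removes the worker from the known workers.
func (s *schedulerServer) DeregisterWorker(ctx context.Context, req *pb.DeregisterReq) (*pb.DeregisterRes, error) {
	workersMutex.Lock()
	delete(workers, req.WorkerID)
	workersMutex.Unlock()

	return &pb.DeregisterRes{Success: true}, nil
}

The mutex matters here: multiple workers can register concurrently, and Go maps are not safe for concurrent writes.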

HTTP API

There are three requests that can be made to the scheduler via HTTP, used to start/stop/query jobs on a specific worker. A job is the execution of a specified file with a specified command on a worker. For example, if you have a Python worker, the scheduler can run a Python script via python3 and query the results later.

Start job

POST “/start”

Request:

  • “Run which file on which worker with what command”
  • Provide: `command` / `path` / `worker_id`

Response:

  • returns `job_id`, which can be used to query or stop the job later.

// example start request
{
  "worker_id": "d5feef60-3029-11e9-b210-d663bd873d93",
  "command": "bash",
  "path": "worker/scripts/count.sh"
}

// example start response - success
{
  "job_id": "4c5ced1c-5ea9-40f8-90ce-63d09cea26f6"
}

// example start response - error
{
  "error": "worker not found"
}

Stop job

POST “/stop”

Request:

  • “Stop which job on which worker”
  • Provide: `worker_id` / `job_id`

Response:

  • returns `success`: boolean response.

// example stop request
{
  "worker_id": "d5feef60-3029-11e9-b210-d663bd873d93",
  "job_id": "4c5ced1c-5ea9-40f8-90ce-63d09cea26f6"
}

// example stop response - success
{
  "success": true
}

// example stop response - error
{
  "error": "worker not found"
}

Query job

POST “/query”

Request:

  • “Query which job on which worker”
  • Provide: `worker_id` / `job_id`

Response:

  • returns `done`: boolean indicating whether the job is done.
  • returns `error`: boolean indicating whether the job had an error.
  • returns `error_text`: the error string thrown by the job.

// example query request
{
  "worker_id": "d5feef60-3029-11e9-b210-d663bd873d93",
  "job_id": "4c5ced1c-5ea9-40f8-90ce-63d09cea26f6"
}

// example query response - success
{
  "done": true,
  "error": true,
  "error_text": "signal: killed"
}

// example query response - error
{
  "error": "worker not found"
}

And finally, the HTTP request and response structs on the scheduler:

// apiStartJobReq expected API payload for `/start`
type apiStartJobReq struct {
	Command  string `json:"command"`
	Path     string `json:"path"`
	WorkerID string `json:"worker_id"`
}

// apiStartJobRes returned API payload for `/start`
type apiStartJobRes struct {
	JobID string `json:"job_id"`
}

// apiStopJobReq expected API payload for `/stop`
type apiStopJobReq struct {
	JobID    string `json:"job_id"`
	WorkerID string `json:"worker_id"`
}

// apiStopJobRes returned API payload for `/stop`
type apiStopJobRes struct {
	Success bool `json:"success"`
}

// apiQueryJobReq expected API payload for `/query`
type apiQueryJobReq struct {
	JobID    string `json:"job_id"`
	WorkerID string `json:"worker_id"`
}

// apiQueryJobRes returned API payload for `/query`
type apiQueryJobRes struct {
	Done      bool   `json:"done"`
	Error     bool   `json:"error"`
	ErrorText string `json:"error_text"`
}

// apiError is used as a generic api response error
type apiError struct {
	Error string `json:"error"`
}
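
To tie it all together, here is a hypothetical sketch of the `/start` handler: decode the payload, look up the registered worker, dial its gRPC server, and proxy the job. The worker-side `StartJob` RPC, the `pb.NewWorkerClient` constructor, and their field names are assumptions; the worker service itself is covered in the next part of this series.

// Hypothetical sketch of the /start HTTP handler. The worker-side StartJob
// RPC and its generated client are assumptions, not the article's actual code.
import (
	"encoding/json"
	"net/http"

	"google.golang.org/grpc"
)

func handleStart(w http.ResponseWriter, r *http.Request) {
	var req apiStartJobReq
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		w.WriteHeader(http.StatusBadRequest)
		json.NewEncoder(w).Encode(apiError{Error: "invalid request payload"})
		return
	}

	// Look up the worker that registered earlier via RegisterWorker.
	workersMutex.Lock()
	wk, ok := workers[req.WorkerID]
	workersMutex.Unlock()
	if !ok {
		w.WriteHeader(http.StatusNotFound)
		json.NewEncoder(w).Encode(apiError{Error: "worker not found"})
		return
	}

	// Dial the worker's gRPC server (TLS omitted for brevity) and proxy the job.
	conn, err := grpc.Dial(wk.addr, grpc.WithInsecure())
	if err != nil {
		w.WriteHeader(http.StatusInternalServerError)
		json.NewEncoder(w).Encode(apiError{Error: err.Error()})
		return
	}
	defer conn.Close()

	client := pb.NewWorkerClient(conn) // assumed generated worker client
	res, err := client.StartJob(r.Context(), &pb.StartJobReq{
		Command: req.Command,
		Path:    req.Path,
	})
	if err != nil {
		w.WriteHeader(http.StatusInternalServerError)
		json.NewEncoder(w).Encode(apiError{Error: err.Error()})
		return
	}

	json.NewEncoder(w).Encode(apiStartJobRes{JobID: res.JobId})
}

The /stop and /query handlers would follow the same pattern with their respective request and response structs.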
