This project implements a weighted round-robin (WRR) scheduler.
```sh
sudo ./build-rpi3.sh
sudo ./setup-images.sh
```
`test/compile-mount-and-copy.sh` compiles the tests using `make`, mounts the image, copies the compiled test executables into the image, and unmounts the image. If you only want to compile the tests, you can run `make` inside the `test` directory.
```sh
test/compile-mount-and-copy.sh
./qemu.sh
# in QEMU
./dummy 20 0 & # dummy at CPU 0
./dummy 20 1 & # dummy at CPU 1
./dummy 20 2 & # dummy at CPU 2
./dummy 20 3 & # dummy at CPU 3
./test # turnaround time test
```
The command lines below show the usage of `dummy` and `test` in QEMU; parameters inside square brackets (`[]`) are optional:

```sh
./dummy [WEIGHT] [CPU_TO_SET_AFFINITY]
./test [NUMBER_TO_FACTORIZE]
```
Several structs were added to represent WRR information. `struct sched_wrr_entity` holds the WRR-related information of a task:
```c
struct task_struct {
    ...
    struct sched_wrr_entity wrr;
    ...
};
```

```c
struct sched_wrr_entity {
    unsigned int weight;
    unsigned int time_slice;
    struct list_head run_list;
    unsigned short on_rq;
};
```
- `weight`: weight of the task.
- `time_slice`: remaining ticks for execution.
- `run_list`: list node used in a runqueue.
- `on_rq`: 1 if the task is on a runqueue, 0 otherwise.
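To illustrate how these fields are used on each timer tick, here is a minimal sketch in the style of the kernel's `task_tick_rt()`. `WRR_BASE_TIMESLICE` and the weight-proportional refill are illustrative assumptions, not necessarily the project's actual constants:

```c
/* Sketch of a tick handler using the fields above (illustrative only). */
static void task_tick_wrr(struct rq *rq, struct task_struct *p, int queued)
{
    struct sched_wrr_entity *wrr_se = &p->wrr;

    if (--wrr_se->time_slice)
        return;    /* ticks still remaining */

    /* Timeslice exhausted: refill in proportion to the weight
     * (WRR_BASE_TIMESLICE is an assumed constant). */
    wrr_se->time_slice = wrr_se->weight * WRR_BASE_TIMESLICE;

    /* Requeue and reschedule only if other tasks are waiting. */
    if (wrr_se->run_list.prev != wrr_se->run_list.next) {
        requeue_task_wrr(rq, p);    /* moves the task to the tail */
        resched_curr(rq);
    }
}
```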
`struct wrr_rq` represents the runqueue of the WRR scheduler:
```c
struct rq {
    ...
    struct wrr_rq wrr;
    ...
};
```

```c
struct wrr_rq {
    struct list_head queue;
    unsigned int nr_running;
    unsigned int total_weight;
};
```
- `queue`: the first node of the runqueue list.
- `nr_running`: number of running tasks on the runqueue.
- `total_weight`: total weight of the tasks on the runqueue.
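For illustration, a minimal sketch of how enqueueing could maintain these counters (simplified; the actual implementation also has to deal with enqueue flags):

```c
/* Sketch: bookkeeping on the WRR runqueue when a task is enqueued. */
static void enqueue_task_wrr(struct rq *rq, struct task_struct *p, int flags)
{
    struct sched_wrr_entity *wrr_se = &p->wrr;
    struct wrr_rq *wrr_rq = &rq->wrr;

    list_add_tail(&wrr_se->run_list, &wrr_rq->queue);
    wrr_se->on_rq = 1;

    wrr_rq->nr_running++;
    wrr_rq->total_weight += wrr_se->weight;
    add_nr_running(rq, 1);    /* keep the global rq count in sync */
}
```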
`wrr_sched_class` is located in `kernel/sched/wrr.c`. The necessary scheduler interface functions are implemented for the WRR scheduler:
- `enqueue_task_wrr()`: enqueue a task to the WRR runqueue.
- `dequeue_task_wrr()`: dequeue a task from the WRR runqueue.
- `yield_task_wrr()`: requeue the current WRR task when yielding.
- `pick_next_task_wrr()`: pick the next task to execute from the WRR runqueue.
- `put_prev_task_wrr()`: requeue the previous WRR task on the runqueue.
- `select_task_rq_wrr()`: select the CPU with the minimum total weight to execute a task.
  - The RCU read lock is used to check the total weight of all CPUs.
- `update_curr_wrr()`: update statistics of the current WRR task.
- `task_tick_wrr()`: update the timeslice of the current WRR task on every tick.
- `get_rr_interval_wrr()`: return the WRR timeslice based on the task's weight.
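These callbacks are wired into a `struct sched_class` instance. A trimmed sketch follows; the exact member set depends on the kernel version, and `.next = &fair_sched_class` is our assumption that CFS remains below WRR in the class chain:

```c
const struct sched_class wrr_sched_class = {
    .next            = &fair_sched_class,  /* assumption: CFS sits below WRR */
    .enqueue_task    = enqueue_task_wrr,
    .dequeue_task    = dequeue_task_wrr,
    .yield_task      = yield_task_wrr,
    .pick_next_task  = pick_next_task_wrr,
    .put_prev_task   = put_prev_task_wrr,
#ifdef CONFIG_SMP
    .select_task_rq  = select_task_rq_wrr,
#endif
    .task_tick       = task_tick_wrr,
    .get_rr_interval = get_rr_interval_wrr,
    .update_curr     = update_curr_wrr,
};
```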
The WRR scheduler replaces the default CFS scheduler:

- `kernel/sched/rt.c`
  - The RT scheduler (`rt_sched_class`) points to the WRR scheduler instead of the CFS scheduler.
- `kernel/sched/core.c`
  - `sched_init()` initializes the WRR runqueue at scheduler initialization.
  - `init_idle()` initializes the `struct sched_wrr_entity` of the idle process (`swapper`).
  - `sched_fork()` initializes the `struct sched_wrr_entity` of a new task, and assigns it to the WRR scheduler if it follows neither the DL nor the RT policy (see the sketch after this list).
  - `rt_mutex_setprio()` assigns all normal tasks to the WRR scheduler.
  - `sched_setscheduler()` assigns all normal tasks to the WRR scheduler, and sets the task's weight to the default when it is newly scheduled as WRR.
  - `normalize_rt_tasks()` sets the task's scheduler policy to WRR.
- `init/init_task.c`
  - `init_task` is assigned to the WRR scheduler.
- `kernel/kthread.c`
  - Kernel threads (`kthread`) are assigned to the WRR scheduler.
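As an illustration of the `sched_fork()` change, the class-assignment hunk might look like the following. This is a sketch adapted from the stock `sched_fork()` logic, not necessarily our exact diff:

```c
/* Sketch: class assignment in sched_fork() (kernel/sched/core.c). */
if (dl_prio(p->prio)) {
    put_cpu();
    return -EAGAIN;
} else if (rt_prio(p->prio)) {
    p->sched_class = &rt_sched_class;
} else {
    /* Normal tasks go to the WRR scheduler instead of CFS. */
    p->sched_class = &wrr_sched_class;
}
```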
There are system calls to update and query the weight of a WRR task:

- `sched_setweight(pid, weight)`: set the weight of a WRR task.
  - The RCU read lock is used to read task data from all CPUs.
  - `task_rq_lock` is used to modify the weight of a task on the runqueue.
- `sched_getweight(pid)`: return the weight of a WRR task.
  - The RCU read lock is used to read task data from all CPUs.
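A sketch of how `sched_setweight` could combine the two locks. Permission checks and some error paths are omitted; `SCHED_WRR` and the 1-20 weight range are assumptions based on the weights used in our tests, and the `task_rq_lock()` signature follows recent 4.x kernels:

```c
SYSCALL_DEFINE2(sched_setweight, pid_t, pid, int, weight)
{
    struct task_struct *p;
    struct rq_flags rf;
    struct rq *rq;

    if (weight < 1 || weight > 20)    /* assumed valid range */
        return -EINVAL;

    rcu_read_lock();
    p = pid ? find_task_by_vpid(pid) : current;
    if (!p || p->policy != SCHED_WRR) {
        rcu_read_unlock();
        return -EINVAL;
    }

    /* Writing requires the runqueue lock, not just the RCU read lock. */
    rq = task_rq_lock(p, &rf);
    if (p->wrr.on_rq)
        rq->wrr.total_weight += weight - p->wrr.weight;
    p->wrr.weight = weight;
    task_rq_unlock(rq, p, &rf);
    rcu_read_unlock();

    return 0;
}
```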
The starting point of WRR scheduler load balancing is the `scheduler_tick()` function in `core.c`. We replaced the CFS scheduler's `trigger_load_balance()` call with the WRR scheduler's `trigger_load_balance_wrr()`. The implementation of the WRR load balancer was heavily inspired by the CFS load balancer. The following are descriptions of the functions related to load balancing in the WRR scheduler.

We made sure that load balancing occurs every 2000 ms, on only one CPU at a time. We achieved this by keeping a global variable named `next_balance_wrr`, which is initialized to `jiffies + msecs_to_jiffies(2000)` at boot time by `init_sched_wrr_class()`. `scheduler_tick()` calls `trigger_load_balance_wrr()` on every tick, and `trigger_load_balance_wrr()` checks whether the current time (`jiffies`) is past `next_balance_wrr`; if so, it raises the softirq whose handler is registered as `run_load_balance_wrr()`. To make sure only one CPU actually does the load balancing at a time, we introduced a spinlock named `wrr_balancer_lock`.
- `init_sched_wrr_class()`: initialize `wrr_balancer_lock`, the softirq handler (`run_load_balance_wrr()`), and `next_balance_wrr`.
- `trigger_load_balance_wrr()`: called from `scheduler_tick()`. This function makes sure only one CPU enters the critical section of `wrr_balancer_lock`, checks whether `jiffies` is past `next_balance_wrr`, and raises the previously registered softirq handler (`run_load_balance_wrr()`). A sketch follows this list.
- `run_load_balance_wrr()`: softirq handler called when it is time to load balance. This function simply calls `load_balance_wrr()`.
- `load_balance_wrr()`: the function that actually performs load balancing. Details are provided in the next section.
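A minimal sketch of the trigger path. `SCHED_SOFTIRQ_WRR` is an assumed softirq entry added for this project, and advancing `next_balance_wrr` inside the trigger is one possible placement, not necessarily ours:

```c
static unsigned long next_balance_wrr;
static DEFINE_SPINLOCK(wrr_balancer_lock);

/* Registered once at boot by init_sched_wrr_class():
 *   open_softirq(SCHED_SOFTIRQ_WRR, run_load_balance_wrr);
 *   next_balance_wrr = jiffies + msecs_to_jiffies(2000);
 */
void trigger_load_balance_wrr(struct rq *rq)
{
    /* Only one CPU at a time evaluates the balancing timer. */
    if (!spin_trylock(&wrr_balancer_lock))
        return;

    if (time_after_eq(jiffies, next_balance_wrr)) {
        next_balance_wrr = jiffies + msecs_to_jiffies(2000);
        raise_softirq(SCHED_SOFTIRQ_WRR);  /* runs run_load_balance_wrr() */
    }

    spin_unlock(&wrr_balancer_lock);
}
```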
Choosing target CPUs and doing the actual migration atomically is complicated, since we would have to hold the RCU read lock and the runqueue spinlocks for writing all at once. Therefore, we chose a relaxed approach that divides the process into two separate atomic phases.

Choosing `max_cpu` and `min_cpu` is done with the following steps (sketched after the list):
- Acquire the RCU read lock with `rcu_read_lock()`, since we are reading data from multiple CPUs.
- Using `for_each_online_cpu()`, iterate over all online CPUs and choose `max_cpu` and `min_cpu`.
- Release the RCU read lock with `rcu_read_unlock()`.
- If `max_cpu == min_cpu`, there is only one CPU in the system, so there is no need for load balancing; return. Otherwise, proceed to Phase 2.
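A sketch of this phase as a fragment of `load_balance_wrr()` (tie-breaking details simplified):

```c
/* Phase 1 fragment of load_balance_wrr(): pick max_cpu and min_cpu. */
int cpu, max_cpu = -1, min_cpu = -1;
unsigned int max_weight = 0, min_weight = UINT_MAX;

rcu_read_lock();
for_each_online_cpu(cpu) {
    unsigned int w = cpu_rq(cpu)->wrr.total_weight;

    if (max_cpu < 0 || w > max_weight) {
        max_weight = w;
        max_cpu = cpu;
    }
    if (min_cpu < 0 || w < min_weight) {
        min_weight = w;
        min_cpu = cpu;
    }
}
rcu_read_unlock();

if (max_cpu == min_cpu)
    return;    /* nothing to balance */
```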
In this phase, we do the following things atomically: (1) choose an appropriate task from `max_cpu`, and (2) migrate the selected task to `min_cpu`. The steps are as follows (sketched after the list):
- Disable interrupts using `local_irq_save()`.
- Atomically lock the runqueues of `max_cpu` and `min_cpu` using `double_rq_lock()`.
- Iterating through the WRR runqueue of `max_cpu`, search for a task suitable for migration that meets the following conditions:
  - the task is not currently running,
  - the migration does not make the total weight of `min_cpu` equal to or greater than that of `max_cpu`, and
  - the migration does not violate the task's CPU affinity.
- If no migratable task exists, return.
- Migrate the task from `max_cpu` to `min_cpu`. This is done by dequeueing the task from `max_cpu`, setting the task's CPU with `set_task_cpu()`, and enqueueing the task on `min_cpu`.
- Print logs about the migration.
- Unlock the runqueues of `max_cpu` and `min_cpu` using `double_rq_unlock()`.
- Enable interrupts using `local_irq_restore()`.
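A sketch of Phase 2. Helper names such as `task_current()`, `double_rq_lock()`, and the `cpus_allowed` field follow 4.x kernel internals; the real code also updates statistics and prints logs:

```c
/* Phase 2 fragment of load_balance_wrr(): pick a task and migrate it. */
struct rq *max_rq = cpu_rq(max_cpu), *min_rq = cpu_rq(min_cpu);
struct task_struct *p, *target = NULL;
struct sched_wrr_entity *wrr_se;
unsigned long flags;

local_irq_save(flags);
double_rq_lock(max_rq, min_rq);

list_for_each_entry(wrr_se, &max_rq->wrr.queue, run_list) {
    p = container_of(wrr_se, struct task_struct, wrr);

    if (task_current(max_rq, p))
        continue;                       /* condition 1: not running */
    if (min_rq->wrr.total_weight + wrr_se->weight >=
        max_rq->wrr.total_weight - wrr_se->weight)
        continue;                       /* condition 2: keep min below max */
    if (!cpumask_test_cpu(min_cpu, &p->cpus_allowed))
        continue;                       /* condition 3: respect affinity */

    target = p;
    break;
}

if (target) {
    dequeue_task_wrr(max_rq, target, 0);
    set_task_cpu(target, min_cpu);
    enqueue_task_wrr(min_rq, target, 0);
    /* print migration log here */
}

double_rq_unlock(max_rq, min_rq);
local_irq_restore(flags);
```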
- We took the prime number `300000007` as the prime factorization target.
- Each test was repeated 5 times, and we measured the average execution time.

First, we executed the prime factorization `test` alone. Because the test was rarely interrupted, its execution time was around 1.9 seconds regardless of the task weight. So we decided to run spinning `dummy` tasks in the background on all CPUs, then run `test` and measure its execution time. We did 3 trials with different `dummy` weights: 1, 10, and 20.

The graph below shows execution time based on task weight.
- The colors of the graph are differentiated by the weight of the dummy tasks.
We used two kinds of locks: the RCU read lock and task/runqueue spinlocks.

- RCU read lock: read shared data (runqueues of other CPUs)
- spinlock: write to shared data (task/runqueue of other CPUs)

Locks are used in the multiprocessor functions:

- `select_task_rq_wrr()`: reads all CPUs' total weights.
- `sched_setweight()`: looks up a task across all CPUs, and changes the task's weight and its WRR runqueue's total weight.
- `sched_getweight()`: looks up a task across all CPUs.
- `load_balance_wrr()`: reads all CPUs' total weights, and moves a task from its runqueue to another runqueue.

The key insight was that the RCU read lock protects only reading, not writing. Even while the RCU read lock is held, the appropriate spinlock must be acquired before the value of shared data is changed.
At first, `requeue_task_wrr()` called the wrong list API by mistake:
```c
static void requeue_task_wrr(struct rq *rq, struct task_struct *p)
{
    ...
    list_move(&wrr_se->run_list, &wrr_rq->queue); // not list_move_tail
}
```
It was just a simple typo, but the side effect was fatal. It made the runqueue operate like a stack: when the running task ran out of its timeslice, the timeslice was refilled and the task was inserted at the head again, so no other task could execute before the current task terminated.
This small mistake destroyed the whole scheduling system:
- A single CPU could not perform multitasking.
- Interrupts arriving during a task could not be handled.
- Kernel processes suffered from starvation.
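For reference, the fix was simply to use the tail-insertion API (local declarations filled in here for completeness):

```c
static void requeue_task_wrr(struct rq *rq, struct task_struct *p)
{
    struct sched_wrr_entity *wrr_se = &p->wrr;
    struct wrr_rq *wrr_rq = &rq->wrr;

    /* Tail insertion: the preempted task lines up behind every other
     * runnable task, which is what round-robin requires. */
    list_move_tail(&wrr_se->run_list, &wrr_rq->queue);
}
```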
We keenly realized the importance of choosing appropriate data structures.