One major pain point in Tenancy for Laravel, when using a multi-database setup, has been slow migrations. This is typically only an issue once you have tenant databases in the high hundreds or thousands, but when it does become an issue, it’s a very annoying one — migrations take an eternity.
To solve this, some people asked me for a parallel migrations feature. I decided to give it a try for our new version 4, and to my surprise it works very well and was super quick to implement!
Basic concurrency concepts
There are a few things to understand about concurrency first.
Note: While this article tries to explain all the concepts necessary to understand the code implementation, it will not focus heavily on theory and as such some things may be simplified.
Threads vs processes: These are the basic building blocks of concurrency. In simple terms, the difference is that threads run as part of one process and share memory. Processes are fully isolated¹. Processes are heavier, threads are more lightweight. However, threads are not entirely free, and many applications that need good real-time performance will avoid spawning threads at runtime, instead distributing work to pre-spawned thread pools and all sorts of other approaches. We won’t need any of that here, but it’s good to have a mental model of how concurrency is used in practice!
Additionally, it’s good to understand the difference between concurrency and parallelism, even if they’re often used interchangeably. In simple terms, parallelism refers to true simultaneous execution. This generally means tasks running on separate CPU cores, which allows for multiple tasks to actually run at the same time, in parallel.
In contrast, concurrency refers to the idea of “running multiple tasks at once” which may or may not be parallel. For instance, a single-threaded event loop can just very quickly alternate between tasks, essentially creating an illusion of multiple tasks running at the same time even if strictly speaking true parallelism may not be involved.
In the context of operating systems on modern machines, you always get both: tasks run on separate CPU cores in parallel, and the operating system scheduler alternates between which processes are currently running and which are put to sleep. Things like blocking I/O (e.g. trying to connect to a network socket, which takes time) can put a thread/process to sleep automatically; other times the scheduler has to use some heuristics to reasonably balance each process (that needs to run) so nothing stalls for too long.
In short: parallelism = actual simultaneous execution, concurrency = general idea of tasks “running at the same time”, whatever that may end up being in practice. Processes are heavier but isolated, threads are lighter and share memory. Additionally both the operating system scheduler and individual programs can have their own logic for distributing tasks across existing workers (pre-spawned threads or event loops in the case of a process or CPU cores in the case of the operating system scheduler).
That’s a lot of info! Let’s see how we can use this in PHP.
Back to PHP
PHP has no native async/await or an event loop. Everything is synchronous. So, how do we get stuff to run concurrently?
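To make the problem concrete, here’s a trivial baseline, just an illustration with sleep() standing in for real work like migrating one tenant’s database: run four one-second tasks in a loop and you wait roughly four seconds.

<?php

// Synchronous baseline: each task blocks the only "thread" we have.
$start = hrtime(true);

for ($i = 0; $i < 4; $i++) {
    sleep(1); // stand-in for a slow task, e.g. migrating one tenant database
}

printf("Took %.1f s\n", (hrtime(true) - $start) / 1e9); // ~4.0 s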
We need to use PHP extensions. When making things concurrent, threads are typically the first thing people reach for, so let’s see how we can do that in PHP. The first search result yields the pthreads extension (presumably a wrapper around POSIX threads), but the docs page says the extension is dead and we should use parallel instead. Let’s check that out. Reading the docs, it looks like an extension with a well-thought-through API; however, it requires a PHP setup that hardly anyone has: you need the extension installed as well as PHP built with thread safety enabled. I checked my local setup (from Laravel Herd) and neither of those applies.
Next, we can consider Fibers. Above, I said PHP doesn’t have a native event loop like JavaScript does; in a way, fibers let you build one. Fibers themselves run on a single thread, but they can be viewed as a building block for actual asynchronous PHP. Projects like amphp handle this.
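Just to illustrate what that looks like (we won’t be using fibers here), a minimal sketch of the Fiber API: two fibers taking turns on a single thread.

<?php

// Two fibers cooperatively interleaving on one thread.
// A fiber runs until it calls Fiber::suspend(), then control
// returns to whoever called start()/resume().
$fibers = [
    new Fiber(function () { echo "A1\n"; Fiber::suspend(); echo "A2\n"; }),
    new Fiber(function () { echo "B1\n"; Fiber::suspend(); echo "B2\n"; }),
];

foreach ($fibers as $fiber) {
    $fiber->start(); // runs until the first suspend
}

foreach ($fibers as $fiber) {
    $fiber->resume(); // runs the rest
}

// Prints A1, B1, A2, B2 -- interleaved, but never in parallel.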
However… the title of the article says vanilla PHP. So we’re not going to use additional dependencies, we just want to make a couple of lines of PHP run concurrently, with better performance.
This leaves us with pcntl. It is a PHP extension, making this technically not completely vanilla PHP, but it’s fairly standard for it to be included in most PHP builds for Unix-based operating systems (like Linux or macOS).
If I run this on my Mac, which uses PHP provided by Herd:
$ php -m | grep pcntl
pcntl
I can see the pcntl extension is available. Same result with PHP coming from Homebrew or Nix.
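If you’d rather check from PHP itself (for example before deciding whether to fork at all), something like this works:

<?php

// pcntl functions can also be disabled via disable_functions,
// so check for the function as well as the extension.
if (! extension_loaded('pcntl') || ! function_exists('pcntl_fork')) {
    fwrite(STDERR, "pcntl is not available in this PHP build\n");
    exit(1);
}

echo "pcntl is available\n";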
Processes to the rescue
pcntl doesn’t let us create threads, it lets us create processes. As discussed above, the main things you should focus on when comparing threads and processes are that:
Processes are heavier
Threads share memory
Shared memory sounds great, but in reality it’s infinite footguns and can be especially painful to deal with in languages that weren’t designed around multi-threading. Which is why all of the threading solutions we’ve discussed above were implemented as extensions and with a very specific way of using them.
So instead, we’ll be creating processes! Specifically by forking our process into multiple child processes.
By now I should also clarify this is not recommended for HTTP requests. Forking php-fpm worker processes could have unexpected effects. If you ever have a task in a request that’s so slow you want to parallelize it, don’t. Put it in a queue where you can approach speeding up the work by using multiple queue workers.
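For illustration only, in Laravel that could look roughly like this (SlowReportJob is a made-up job class, not something from the package):

// Instead of forking inside an HTTP request, push the slow work
// onto the queue and scale by running more queue workers.
SlowReportJob::dispatch($reportId)->onQueue('reports');

// Then process the queue with as many workers as you need:
//   php artisan queue:work --queue=reports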
Forking processes
fork() is one of the basic ways of creating processes on Unix-like systems. It’s very simple.² You call it and a new, identical copy of the current process is created. The function call itself returns:
-1 on error
0 in the child process
> 0 (the process ID — PID — of the child process) in the parent process
Using the return value, we can determine if we’re in the child process or parent process.
<?php

echo "Hello from the parent process\n";

$pid = pcntl_fork();

if ($pid == -1) {
    throw new Exception("Fork call failed");
} else if ($pid == 0) {
    echo "Hello from the child!\n";
} else {
    echo "Hello from the parent again, the child's PID is $pid!\n";
}
We get:
Hello from the parent process
Hello from the parent again, the child's PID is 86940!
Hello from the child!
Let’s make our implementation more robust. The docs for the pcntl extension say:
The pcntl_waitpid($pid, &$status) function will suspend the parent process until the $pid child has exited, then write its status to $status
The pcntl_wifexited($status) function will tell us if the child process exited normally
The pcntl_wexitstatus($status) function will, after making sure the child process exited normally, give us the exact exit status of the child process. We want this to be 0 on success
Putting all of this together:
<?php

/**
 * Spawn a child and return its PID
 *
 * @param Closure():bool $callback The function to execute in the child process. It should return a bool indicating success.
 * @return int The PID of the child process
 */
function spawn_child(Closure $callback): int
{
    $pid = pcntl_fork();

    if ($pid === -1) {
        throw new Exception('fork() failed');
    } else if ($pid > 0) {
        return $pid;
    } else {
        exit($callback() === false ? 1 : 0);
    }
}

$pids = [];

for ($i = 0; $i < 8; $i++) {
    $pid = spawn_child(function (): bool {
        if (rand(0, 1)) {
            sleep(1);
            printf("Hello from child!\n");

            return true;
        } elseif (rand(0, 1)) {
            sleep(2);
            printf("Error in child :(\n");

            return false;
        } else {
            sleep(3);
            printf("Abnormal exit in child!\n");
            posix_kill(posix_getpid(), SIGABRT);
        }
    });

    printf("Spawned child $pid\n");

    $pids[] = $pid;
}

$success = true;

foreach ($pids as $pid) {
    $status = null;

    // Blocks execution until the child has finished running...
    pcntl_waitpid($pid, $status);

    $normalExit = pcntl_wifexited($status);

    if ($normalExit) {
        $exitCode = pcntl_wexitstatus($status);

        if ($exitCode === 0) {
            echo "Child process $pid executed successfully.\n";
        } else {
            $success = false;
            echo "Child process $pid executed with errors.\n";
        }
    } else {
        echo "Child process $pid exited abnormally.\n";
        $success = false;
    }
}

if ($success) {
    echo "All child processes executed successfully.\n";
} else {
    echo "Errors occurred in child processes.\n";
}
If you run this a few times, you should see all the possible scenarios coming from child processes: success, failure, and abnormal exits.
If you’ve ever worked with threads, the code above may seem a bit familiar even if it’s working with processes instead. The waitpid call is very much analogous to joining threads — the main thread blocks (waits) until the child thread is done, which is the exact same thing as what we’re doing here. We want the parent to exit last so we can properly report success or failure after we see if all of the child processes succeeded.
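As a side note, pcntl_waitpid() also takes a flags argument. If you didn’t want the parent to block on each child in order, you could poll with WNOHANG instead; here’s a rough sketch (not what we’ll end up using):

// Poll children without blocking on any single one.
$remaining = $pids;

while ($remaining !== []) {
    foreach ($remaining as $i => $pid) {
        $status = null;

        // With WNOHANG, waitpid returns 0 immediately if this child is still running.
        if (pcntl_waitpid($pid, $status, WNOHANG) === $pid) {
            echo "Child $pid finished with exit code " . pcntl_wexitstatus($status) . "\n";
            unset($remaining[$i]);
        }
    }

    usleep(10_000); // don't busy-wait
}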
How many processes to spawn
The general rule of thumb is that for CPU-bound tasks, meaning tasks that are computationally intensive rather than waiting on some I/O like network sockets talking to a database or an HTTP API, the ideal number of processes or threads is the number of physical cores³ your CPU has.
Given the section on concurrency vs parallelism, you probably see why that is. If you have 8 CPU cores, your CPU cannot physically run more than 8 threads/processes at the same time, meaning if you try to run more you’ll just add context-switching overhead as the operating system has to juggle processes to make sure their execution is reasonably balanced.
Now, with I/O-bound tasks, like sending a bunch of requests or running database queries, things get more complicated. If the other service is on the same machine, you likely will not want more processes than CPU cores. If it’s on another machine, you may get better performance by matching that machine’s core count rather than yours. Will this map perfectly to reality? Probably not; these are all rules of thumb at best, and you’ll want to test what works best in your case. Also, going back to the section on concurrency vs parallelism, you may notice that for I/O-bound tasks the ideal would be your process itself handling this using non-blocking I/O, but as mentioned that’s not a thing outside of projects like amphp and way beyond the scope of this article. You can think of the OS scheduler as a replacement for that, though. If you spawn more processes than can execute simultaneously, but all of them block on some network I/O, your operating system will wake each process up as its response arrives, with no work on your end and without enough overhead to make a meaningful difference here.
So for a good default, we’ll simply stick with the physical CPU core count. How can we get that number? It’s OS-specific and PHP doesn’t have any nice wrapper around that. Not a problem, we can write our own.
On Windows, this means reading the NUMBER_OF_PROCESSORS environment variable. From my testing, it’s always available, so this couldn’t be simpler.
(int) getenv('NUMBER_OF_PROCESSORS')
On Linux, you read /proc/cpuinfo and count how many times the word ‘processor’ occurs.
substr_count(file_get_contents('/proc/cpuinfo'), 'processor')
On macOS… it gets more difficult. Since macOS is BSD-based, unlike Linux, we have to use the sysctl command:
$ sysctl -n hw.ncpu
10
If you want maximal performance though, you only want to use P-cores (performance cores) and explicitly not use E-cores (efficiency cores — lower power, lower performance). You may expect using all cores to work best, but from my testing using only the performance cores ends up working better. Your mileage may vary, but for a variety of possible reasons this seems to be the case when migrating a lot of databases.
How do we get just the P-core count?
$ sysctl -n hw.perflevel0.logicalcpu
8
$ sysctl -n hw.perflevel1.logicalcpu
2
8 performance cores, 2 efficiency cores.
This is how it works on M-series Macs. On older Intel Macs, there’s only hw.perflevel0. That’s fine since that’s the only thing we care about.
So a general BSD-compatible solution with macOS-optimized handling would look something like:
$res = shell_exec('sysctl -n hw.perflevel0.logicalcpu 2>/dev/null');

if ($res === null) {
    $res = shell_exec('sysctl -n hw.ncpu');

    if ($res === null) {
        return -1; // error
    }
}

return (int) trim($res); // core count
However… we can do better. Admittedly, this is needless optimization, since we generally run this only once at the start of the program to determine how many cores we need. Still, from my testing, the approach above takes ~8ms on the first call and ~3ms on subsequent calls. No one will ever perceive this, but 8ms to determine the core count is crazy slow, and we shouldn’t need to spawn a process for it.
Well, we don’t. Enter FFI.
FFI (Foreign Function Interface), added in PHP 7.4, lets us call functions from native libraries directly in our code, without having to create entirely new processes like above.
The sysctl command also comes with a function provided by the C standard library (libc), with the same name⁴. This means that, in a C program, we could do:
$ cat ncpu.c
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main() {
    int cores;
    size_t size = sizeof(cores);

    if (sysctlbyname("hw.ncpu", &cores, &size, NULL, 0) == -1) return 1;
    printf("Number of CPUs (ncpu): %d\n", cores);

    if (sysctlbyname("hw.perflevel0.logicalcpu", &cores, &size, NULL, 0) == -1) return 1;
    printf("Number of CPUs (perflevel0): %d\n", cores);

    if (sysctlbyname("hw.perflevel1.logicalcpu", &cores, &size, NULL, 0) == -1) return 1;
    printf("Number of CPUs (perflevel1): %d\n", cores);
}

$ make ncpu && ./ncpu
cc ncpu.c -o ncpu
Number of CPUs (ncpu): 10
Number of CPUs (perflevel0): 8
Number of CPUs (perflevel1): 2
To do the same in PHP:
<?php
$ffi = FFI::cdef('int sysctlbyname(const char *name, void *oldp, size_t *oldlenp, void *newp, size_t newlen);');
$cores = $ffi->new('int');
$size = $ffi->new('size_t');
$size->cdata = FFI::sizeof($cores);
if ($ffi->sysctlbyname("hw.ncpu", FFI::addr($cores), FFI::addr($size), null, 0) == -1) exit(1);
printf("Number of CPUs (ncpu): %d\n", $cores->cdata);
if ($ffi->sysctlbyname("hw.perflevel0.logicalcpu", FFI::addr($cores), FFI::addr($size), null, 0) == -1) exit(1);
printf("Number of CPUs (perflevel0): %d\n", $cores->cdata);
if ($ffi->sysctlbyname("hw.perflevel1.logicalcpu", FFI::addr($cores), FFI::addr($size), null, 0) == -1) exit(1);
printf("Number of CPUs (perflevel1): %d\n", $cores->cdata);
And we get the exact same output! To explain the first line a little bit:
The first parameter is C type definitions: basically the signatures of the functions we want to call and any types that may be used in them. Often you may just file_get_contents() a header (.h) file here
The second parameter would be the library where the function is located. Thankfully, since our function comes from the C standard library, which PHP itself already uses heavily, it is already linked, meaning we don’t even need to specify this parameter.
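For completeness: if the function lived in a library that isn’t already loaded, you’d pass that library as the second argument. The library name below is just an illustration; since libc is already linked, we don’t actually need it here.

$ffi = FFI::cdef(
    'int sysctlbyname(const char *name, void *oldp, size_t *oldlenp, void *newp, size_t newlen);',
    'libc.dylib' // illustrative; e.g. 'libc.so.6' on Linux
);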
Now to make a more robust implementation again:
function sysctlGetLogicalCoreCount(bool $darwin): int
{
    $ffi = FFI::cdef('int sysctlbyname(const char *name, void *oldp, size_t *oldlenp, void *newp, size_t newlen);');

    $cores = $ffi->new('int');
    $size = $ffi->new('size_t');
    $size->cdata = FFI::sizeof($cores);

    // perflevel0 refers to P-cores on M-series, and the entire CPU on Intel Macs
    if ($darwin && $ffi->sysctlbyname('hw.perflevel0.logicalcpu', FFI::addr($cores), FFI::addr($size), null, 0) === 0) {
        return $cores->cdata;
    } elseif ($darwin) {
        // Reset the size, just in case
        $size->cdata = FFI::sizeof($cores);
    }

    // This should return the total number of logical cores on any BSD-based system
    if ($ffi->sysctlbyname('hw.ncpu', FFI::addr($cores), FFI::addr($size), null, 0) !== 0) {
        return -1;
    }

    return $cores->cdata;
}

function getLogicalCoreCount(): int
{
    // We use the logical core count as it should work best for I/O bound code
    return match (PHP_OS_FAMILY) {
        'Windows' => (int) getenv('NUMBER_OF_PROCESSORS'),
        'Linux' => substr_count(
            file_get_contents('/proc/cpuinfo') ?: throw new Exception('Could not open /proc/cpuinfo for core count detection, please specify -p manually.'),
            'processor',
        ),
        'Darwin', 'BSD' => sysctlGetLogicalCoreCount(PHP_OS_FAMILY === 'Darwin'),
        default => throw new Exception('Core count detection not implemented for ' . PHP_OS_FAMILY . ', please specify -p manually.'),
    };
}
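As a quick sketch of how this would typically be consumed (the real logic lives in getProcesses() further down), you’d cap the worker count by the amount of work available:

$tasks = range(1, 100); // placeholder for real work items, e.g. tenant IDs

// Default to one worker per logical core, but never more workers than tasks.
$workers = min(getLogicalCoreCount(), count($tasks));

echo "Using $workers worker processes\n";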
With FFI, we get (in my tests) 0.05ms to 0.2ms on the first call and under 0.01ms on all subsequent calls. A lot faster than the shell’s initial 8ms. Is this an important optimization? Not at all. We only call this once to get the machine’s core count if the user doesn’t specify how many processes to spawn, so 8ms is not perceptible for a long-running command.
However, if we can do something efficiently, why do it less efficiently? We can call sysctl directly from our own process, especially since libc is already linked, so all the overhead of spawning a process just to get a simple number would simply be rude to our computer.
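If you want to reproduce the comparison, here’s roughly how I’d measure both approaches (this assumes macOS for the shell variant; your numbers will differ):

// Rough timing of shelling out vs. calling the FFI-based getLogicalCoreCount().
$start = hrtime(true);
$viaShell = (int) trim((string) shell_exec('sysctl -n hw.ncpu'));
printf("shell: %d cores in %.3f ms\n", $viaShell, (hrtime(true) - $start) / 1e6);

$start = hrtime(true);
$viaFfi = getLogicalCoreCount();
printf("FFI:   %d cores in %.3f ms\n", $viaFfi, (hrtime(true) - $start) / 1e6);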
Using this in Laravel
Now let’s tie this all back to Laravel. We could simply add the code above to a specific command, but let’s go with a more generic solution:
trait ParallelCommand
{
    /** Used as protection against spawning a huge number of processes by accident */
    public const int MAX_PROCESSES = 24;

    protected bool $runningConcurrently = false;

    abstract protected function childHandle(mixed ...$args): bool;

    public function addProcessesOption(): void
    {
        $this->addOption(
            'processes',
            'p',
            InputOption::VALUE_OPTIONAL,
            'How many processes to spawn. Maximum value: ' . static::MAX_PROCESSES . ', recommended value: core count (use just -p)',
            -1,
        );

        $this->addOption(
            'forceProcesses',
            'P',
            InputOption::VALUE_OPTIONAL,
            'Same as --processes but without a maximum value. Use at your own risk',
            -1,
        );
    }

    protected function forkProcess(mixed ...$args): int
    {
        if (! app()->runningInConsole()) {
            throw new Exception('Parallel commands are only available in CLI context.');
        }

        $pid = pcntl_fork();

        if ($pid === -1) {
            return -1;
        } elseif ($pid) {
            // Parent
            return $pid;
        } else {
            // Child
            DB::reconnect();

            exit($this->childHandle(...$args) ? 0 : 1);
        }
    }

    // ...
}
This trait can be added to any CLI command that does a bunch of work that can be split up into smaller chunks. Most of the above is boilerplate: configuring command parameters and the basic forkProcess() function, which calls the abstract childHandle() that individual commands then implement.
Next, the command includes the methods for determining the machine’s core count that we saw above. For brevity I will not repeat them here. We also include a function that will give us the actual number of processes to be used, taking MAX_PROCESSES, processes, and forceProcesses into account:
protected function sysctlGetLogicalCoreCount(bool $darwin): int { ... }

protected function getLogicalCoreCount(): int { ... }

protected function getProcesses(): int
{
    $processes = $this->input->getOption('forceProcesses');
    $forceProcesses = $processes !== -1;

    if ($processes === -1) {
        $processes = $this->input->getOption('processes');
    }

    if ($processes === null) {
        // This is used when the option is set but *without* a value (-p).
        $processes = $this->getLogicalCoreCount();
    } elseif ((int) $processes === -1) {
        // Default value we set for the option -- this is used when the option is *not set*.
        $processes = 1;
    } else {
        // Option value set by the user.
        $processes = (int) $processes;
    }

    if ($processes < 1) {
        $this->components->error('Minimum value for processes is 1. Try specifying -p manually.');
        exit(1);
    }

    if ($processes > static::MAX_PROCESSES && ! $forceProcesses) {
        $this->components->error('Maximum value for processes is ' . static::MAX_PROCESSES . ' provided value: ' . $processes);
        exit(1);
    }

    if ($processes > 1 && ! function_exists('pcntl_fork')) {
        $this->components->error('The pcntl extension is required for parallel migrations to work.');
        exit(1);
    }

    return $processes;
}
And finally, the function that manages the concurrent execution:
/**
 * @param array|(ArrayAccess<int, mixed>&Countable)|null $args
 */
protected function runConcurrently(array|(ArrayAccess&Countable)|null $args = null): int
{
    $processes = $this->getProcesses();
    $success = true;
    $pids = [];

    if ($args !== null && count($args) < $processes) {
        $processes = count($args);
    }

    $this->runningConcurrently = true;

    for ($i = 0; $i < $processes; $i++) {
        $pid = $this->forkProcess($args !== null ? $args[$i] : null);

        if ($pid === -1) {
            $this->components->error("Unable to fork process (iteration $i)!");

            if ($i === 0) {
                exit(1);
            }

            // Don't add a failed fork to the list of PIDs we wait on below
            $success = false;
            continue;
        }

        $pids[] = $pid;
    }

    // Fork equivalent of joining an array of join handles
    foreach ($pids as $i => $pid) {
        pcntl_waitpid($pid, $status);

        $normalExit = pcntl_wifexited($status);

        if ($normalExit) {
            $exitCode = pcntl_wexitstatus($status);

            if ($exitCode === 0) {
                $this->components->success("Child process [$i] (PID $pid) finished successfully.");
            } else {
                $success = false;
                $this->components->error("Child process [$i] (PID $pid) completed with failures.");
            }
        } else {
            $success = false;
            $this->components->error("Child process [$i] (PID $pid) exited abnormally.");
        }
    }

    return $success ? 0 : 1;
}
We can then use this trait in any existing command with a few small adjustments. For instance, here’s how our tenants:migrate command looks:
// call addProcessesOption() in the constructor

public function handle(): int
{
    // ...

    if (! $this->confirmToProceed()) {
        return 1;
    }

    if ($this->getProcesses() > 1) {
        return $this->runConcurrently($this->getTenantChunks()->map(function ($chunk) {
            return $this->getTenants($chunk);
        }));
    }

    return $this->migrateTenants($this->getTenants()) ? 0 : 1;
}

protected function childHandle(mixed ...$args): bool
{
    $chunk = $args[0];

    return $this->migrateTenants($chunk);
}
We basically just call migrateTenants(all tenants) when running synchronously. And when using multiple processes, the chunks passed to runConcurrently() are executed by childHandle(), which then calls migrateTenants(chunk).
The getTenantChunks() method is also part of the ParallelCommand trait in our implementation, but that’s only because we limit the trait’s use to commands that run for multiple tenants in our package.
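If you’re curious what such a chunking helper could look like, here’s a minimal sketch (simplified; the actual implementation in the package differs):

/**
 * Sketch: split all tenant IDs into one chunk per process.
 */
protected function getTenantChunks(): Collection
{
    $ids = Tenant::query()->pluck('id');
    $chunkSize = max(1, (int) ceil($ids->count() / $this->getProcesses()));

    return $ids->chunk($chunkSize)->map->values();
}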
And just like that, we can apply this trait to any of our other commands (in the case of Tenancy, that is tenants:migrate-fresh and tenants:rollback).
A lot of the info in this article is very surface-level, but at the same time much lower level than the typical PHP code you’d write. My hope is this serves as a good introduction to some lower level programming you can do, not just in PHP, but in nearly any language by taking advantage of how operating systems work. You don’t have to be bound by frameworks or what the standard library lets you do, many languages give you a lot more freedom than you may realize!
Performance results
After testing this feature in a demo app with 1000 (local) tenant databases, I observed exactly what you’d expect: an 8x speedup on my machine with 8 physical cores. This is huge for apps that have hundreds or thousands of tenants. I’ve had people tell me their deployments take 30+ minutes. 30 minutes would now be under 4 minutes, or perhaps even less on serious server hardware! This feature is a perfect fit for concurrency — the exact same task running for lots and lots of completely independent databases — so really there is no reason why this should be running synchronously. You can find all the benchmarks in this Twitter thread.
³ There’s additional nuance about physical vs logical cores that I’m omitting in this article. Some CPUs also have logical cores (also called hardware threads), which let you run more I/O-bound (but not CPU-bound) tasks at the hardware level, rather than just at the OS level. This is a massive oversimplification, but you can think of it as another scheduler in addition to the OS scheduler, this time at the hardware level. Since I want to keep the main text simple, I only ever talk about physical cores, but actually end up counting logical cores on most of the mentioned operating systems. You can read a bit more about that here
⁴ See https://man.freebsd.org/cgi/man.cgi?sysctl(3); we care about sysctlbyname (the equivalent of sysctl -n).