You may have heard of "The One Billion Row Challenge" (1brc), and in case you haven't, go check out Gunnar Morling's 1brc repo.
I got sucked in because two of my colleagues have entered the competition and are on the leaderboard.
PHP is not known for its speed, but as I am working on the PHP profiler I thought I'd give it a shot and see how fast it can get.
I cloned the repository and created the billion row dataset in measurements.txt. After that I started building my first naive implementation of something that could solve the challenge:
$stations = [];
$fp = fopen('measurements.txt', 'r');
while ($data = fgetcsv($fp, null, ';')) {
    if (!isset($stations[$data[0]])) {
        $stations[$data[0]] = [$data[1], $data[1], $data[1], 1];
    } else {
        $stations[$data[0]][3]++;
        $stations[$data[0]][2] += $data[1];
        if ($data[1] < $stations[$data[0]][0]) {
            $stations[$data[0]][0] = $data[1];
        }
        if ($data[1] > $stations[$data[0]][1]) {
            $stations[$data[0]][1] = $data[1];
        }
    }
}
ksort($stations);
echo '{';
foreach ($stations as $k => &$station) {
    $station[2] = $station[2] / $station[3];
    echo $k, '=', $station[0], '/', $station[2], '/', $station[1], ', ';
}
echo '}';
There is nothing wild in here: just open the file and use fgetcsv() to read the data. If the station is not found already, create it; otherwise increment the counter, sum the temperature, and check whether the current temperature is lower than min or higher than max, updating them accordingly.
Once I have everything together, I use ksort() to bring the $stations array in order and then echo out the list, calculating the average temperature while doing so (sum / count).
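The output then looks roughly like this (the station names and numbers here are purely illustrative, not taken from my run):

{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, Abéché=-10.0/29.4/69.9, ... }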
Running this simple code on my laptop takes 25 minutes!
Time to optimise and have a look at the profiler:
The timeline visualisation helps me see that this is clearly CPU bound: file compilation at the beginning of the script is negligible and there are no garbage collection events.
The flame graph view is helpful as well, showing that I am spending 46% of CPU time in fgetcsv().
fgets() instead of fgetcsv()
The first optimisation was to use fgets() to get a line and split on the ; character manually instead of relying on fgetcsv(), because fgetcsv() does a whole lot more than what I need.
// ...
while ($data = fgets($fp, 999)) {
    $pos = strpos($data, ';');
    $city = substr($data, 0, $pos);
    $temp = substr($data, $pos+1, -1);
    // ...
Additionally I refactored $data[0] to $city and $data[1] to $temp everywhere.
Running the script again with just this one change already brought the runtime down to 19m 49s. In absolute numbers this is still a lot, but also: it's 21% down!
The flame graph reflects the change; switching to showing the CPU time by line also reveals what is going on in the root frame:
I am spending ~38% of CPU time in lines 18 and 23, which are:
18 | $stations[$city][3] ++;
| // ...
23 | if ($temp > $stations[$city][1]) {
Line 18 is the first access to the $stations array in the loop (otherwise it is only an increment), and line 23 is a comparison. Nothing that seems expensive at first glance, but let's do a few more optimisations and you'll see what is taking time here.
$station = &$stations[$city];
$station[3] ++;
$station[2] += $temp;
// instead of
$stations[$city][3] ++;
$stations[$city][2] += $temp;
This should help PHP to not search for the key in the $stations array on every single access; think of it as a cache for the "current" station in the array.
And it actually helps: running this only takes 17m 48s, another 10% down!
While looking at the code, I stumbled over this piece:
if ($temp < $station[0]) {
    $station[0] = $temp;
}
if ($temp > $station[1]) {
    $station[1] = $temp;
}
In case the temperature is lower than min, it cannot be higher than max anymore, so I made this an elseif to maybe spare some CPU cycles, as shown below.
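The check then becomes (the same shape shows up again later in the threaded version):

if ($temp < $station[0]) {
    $station[0] = $temp;
} elseif ($temp > $station[1]) {
    $station[1] = $temp;
}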
BTW: I don't know anything about the order of temperatures in measurements.txt, but depending on that order it could make a difference whether I check the one or the other first.
The new version takes 17m 30s, which is another ~2%. Better than just jitter, but not by a lot.
PHP is known as a dynamic language, and that is something I valued a lot when I first started writing software: one less problem to care about. But on the other hand, knowing the types helps the engine make better decisions when running your code.
$temp = (float)substr($data, $pos+1, -1);
Guess what? This simple cast makes the script run in just 13m 32s, which is a whopping performance increase of 21%!
18 | $station = &$stations[$city];
| // ...
23 | } elseif ($temp > $station[1]) {
Line 18 still shows up with 11% of CPU time spent, which is the access to the array (finding the key in the hash map, which is the underlying data structure used for associative arrays in PHP).
Line 23's CPU time dropped from ~32% to ~15%. This is due to PHP not doing type juggling anymore. Before the type cast, $temp, $station[0] and $station[1] were strings, so PHP had to convert them to float on every single comparison.
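To make the hidden cost visible, here is a tiny standalone sketch (my own illustration, not code from the challenge script): comparing numeric strings forces a string-to-float conversion of both operands every time, while casting once up front keeps the comparison purely numeric.

// Both operands are numeric strings, so every comparison silently converts them to float first.
$temp = '12.3';   // what substr() returns
$min  = '-5.0';
var_dump($temp < $min); // bool(false), after two string-to-float conversions

// Cast once when parsing the line and the comparison stays float vs. float.
$temp = (float) '12.3';
$min  = -5.0;
var_dump($temp < $min); // bool(false), no conversion needed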
OPcache in PHP is disabled by default in CLI and needs the opcache.enable_cli setting to be set to on. JIT (as part of OPcache) is enabled by default, but effectively disabled because the buffer size is set to 0, so I set opcache.jit_buffer_size to something; I just went with 10M. After applying these changes, I re-ran the script with JIT and saw it finish in:
7m 19s
which is 45.9% less time spent!
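For reference, here is a minimal sanity check (my own addition, not part of the measurement script; the file name in the comment is just a placeholder) to verify that the settings actually took effect for a CLI run:

// Run e.g. with: php -d opcache.enable_cli=1 -d opcache.jit_buffer_size=10M jit_check.php
$status = function_exists('opcache_get_status') ? opcache_get_status(false) : false;

var_dump(ini_get('opcache.enable_cli'));      // "1" when OPcache is enabled for the CLI
var_dump(ini_get('opcache.jit_buffer_size')); // "10M" in this case
var_dump($status ? $status['jit']['on'] : false); // true once the JIT is actually active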
I already brought the runtime down from 25 minutes in the beginning to just ~7 minutes. One thing that I found absolutely astonishing is that fgets() allocates ~56 GiB/m of RAM for reading a 13 GB file. Something seemed off, so I checked the implementation of fgets() and it looks like I can spare a lot of these allocations by just omitting the len argument to fgets():
while ($data = fgets($fp)) {
// instead of
while ($data = fgets($fp, 999)) {
Comparing the profile before and after the change gives this:
You might think that this gives a lot of performance improvement, but it is just ~1%. This is because these are small allocations that the ZendMM can serve from its bins, and those are blazingly fast.
Yes, we can! So far my approach was single threaded, which is the nature of most PHP software, but PHP does support threads in userland through the parallel extension.
Reading the data in PHP is a bottleneck, as the profiler clearly shows. Switching from fgetcsv() to fgets() and splitting manually helps, but this still takes a lot of time, so let's use threads to parallelise reading and processing the data, and afterwards combine the intermediate results from the worker threads.
$file = 'measurements.txt';
$threads_cnt = 16; // number of worker threads, used further down

/**
 * This function will split the file passed in `$file` into `$cpu_count` chunks
 * of roughly equal size. The chunk boundaries are aligned to \n characters and
 * returned as [start, end] byte offsets.
 *
 * @return array
 */
function get_file_chunks(string $file, int $cpu_count): array {
    $size = filesize($file);

    if ($cpu_count == 1) {
        $chunk_size = $size;
    } else {
        $chunk_size = (int) ($size / $cpu_count);
    }

    $fp = fopen($file, 'rb');

    $chunks = [];
    $chunk_start = 0;
    while ($chunk_start < $size) {
        $chunk_end = min($size, $chunk_start + $chunk_size);

        if ($chunk_end < $size) {
            fseek($fp, $chunk_end);
            fgets($fp); // moves fp to next \n char
            $chunk_end = ftell($fp);
        }

        $chunks[] = [
            $chunk_start,
            $chunk_end
        ];

        // the next chunk starts exactly at this chunk's end (the first byte of the next line)
        $chunk_start = $chunk_end;
    }
    fclose($fp);

    return $chunks;
}
/**
* This function will open the file passed in `$file` and read and process the
* data from `$chunk_start` to `$chunk_end`.
*
* The returned array has the name of the city as the key and an array as the
* value, containing the min temp in key 0, the max temp in key 1, the sum of
* all temperatures in key 2 and count of temperatures in key 3.
*
* @return array
*/
$process_chunk = function (string $file, int $chunk_start, int $chunk_end): array {
    $stations = [];
    $fp = fopen($file, 'rb');
    fseek($fp, $chunk_start);
    while ($data = fgets($fp)) {
        $chunk_start += strlen($data);
        if ($chunk_start > $chunk_end) {
            break;
        }
        $pos2 = strpos($data, ';');
        $city = substr($data, 0, $pos2);
        $temp = (float)substr($data, $pos2+1, -1);
        if (isset($stations[$city])) {
            $station = &$stations[$city];
            $station[3] ++;
            $station[2] += $temp;
            if ($temp < $station[0]) {
                $station[0] = $temp;
            } elseif ($temp > $station[1]) {
                $station[1] = $temp;
            }
        } else {
            $stations[$city] = [
                $temp,
                $temp,
                $temp,
                1
            ];
        }
    }
    return $stations;
};
$chunks = get_file_chunks($file, $threads_cnt);

$futures = [];
for ($i = 0; $i < $threads_cnt; $i++) {
    $runtime = new \parallel\Runtime();
    $futures[$i] = $runtime->run(
        $process_chunk,
        [
            $file,
            $chunks[$i][0],
            $chunks[$i][1]
        ]
    );
}

$results = [];
for ($i = 0; $i < $threads_cnt; $i++) {
    // `value()` blocks until a result is available, so the main thread waits
    // for the thread to finish
    $chunk_result = $futures[$i]->value();
    foreach ($chunk_result as $city => $measurement) {
        if (isset($results[$city])) {
            $result = &$results[$city];
            $result[2] += $measurement[2];
            $result[3] += $measurement[3];
            if ($measurement[0] < $result[0]) {
                $result[0] = $measurement[0];
            }
            if ($measurement[1] > $result[1]) {
                $result[1] = $measurement[1];
            }
        } else {
            $results[$city] = $measurement;
        }
    }
}

ksort($results);

echo '{', PHP_EOL;
foreach ($results as $k => &$station) {
    echo "\t", $k, '=', $station[0], '/', ($station[2]/$station[3]), '/', $station[1], ',', PHP_EOL;
}
echo '}', PHP_EOL;
This code does a few things: first I scan the file and split it into chunks that are \n aligned (since I am using fgets() later). When the chunks are ready, I start $threads_cnt worker threads that all open the same file, seek to their assigned chunk start, and read and process the data until the chunk end, returning an intermediate result that is then combined, sorted and printed out in the main thread.
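One practical note: the parallel extension only loads on a thread-safe (ZTS) build of PHP. A small guard like the following (my own addition, not part of the original script) fails early with a clear message instead of a fatal "class not found" error on the Runtime constructor:

// Bail out early if ext-parallel (and therefore a ZTS PHP build) is missing.
if (!extension_loaded('parallel')) {
    fwrite(STDERR, 'This script needs a ZTS build of PHP with the parallel extension installed.' . PHP_EOL);
    exit(1);
}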
This multithreaded approach finishes in just:
1m 35s
Nope, certainly not. There are at least two more things to this solution:
My PHP was compiled with CFLAGS="-g -O0 ..." due to my needs in my day-to-day work. I should have checked this at the beginning, so I re-compiled PHP 8.3 using CFLAGS="-Os ..." and my final number (with 16 threads) is:
27.7 seconds
This number is by no means comparable to what you can see on the leaderboard of the original challenge, because I ran this code on totally different hardware.
This is a timeline view with 10 threads:
The thread at the bottom is the main thread, waiting for the results from the worker threads. Once those workers have returned their intermediate results, the main thread can be seen combining and sorting everything. We can also clearly see that the main thread is by no means the bottleneck. In case you want to optimise this even further, concentrate on the worker threads.
Each abstraction layer simply trades usability/integration for CPU cycles or memory. fgetcsv() is super easy to use and hides a lot of stuff, but that comes at a cost. Even fgets() hides some stuff from us, but makes it super convenient to read the data.
Adding types to your code helps the language either optimise execution or stop type juggling, which is something you do not see but still pay for with CPU cycles.
JIT is awesome, especially when dealing with CPU bound problems!
It is definitely not the nature of most PHP software, but thanks to parallelisation (using ext-parallel) we could push the numbers down significantly.
I hope you had as much fun reading this article as I had. In case you want to optimise the code further, feel free to grab it and leave a comment on how far you got.