Road to the Code with Less Bugs?

I think that writing any non-trivial software without introducing bugs is close to impossible. The world is complex, the data used in programs is complex, and handling every possible interaction of our code with that data is hard. There are probably bugs even in the short script that accompanies this article. Sounds dark? Maybe it shouldn’t.

Everybody who writes software creates bugs; it is inevitable.

There is a huge amount of research about what causes bugs, with various results and conclusions.

My interpretation, from a surface reading, is that the number of “bugs per line” scales in some way with the total number of lines in a program. For example, a program with 1000 lines will have one bug, a program with 2000 lines will have 3 bugs, and a program with 50 000 lines will have 100 bugs. Personally, I feel that this is true, as it more or less matches my experience. Programmers mostly agree about this.

Another finding is that programs with more restrictions on who (who = which code) and when (under which conditions) can modify some state also have fewer bugs. (Scientists say!) I agree, because it also matches my experience. Long functions with many variables, or classes with lots of properties that are later used for logic, have more bugs. This is especially true if we write code under the assumption that objects will always be in one of a finite number of states that we accounted for, or that variables will be configured in a certain way. (Think: error calling Method() on null, index not found, Varien_Object returning null from getProperty() instead of raising an error.)

There is also something called cyclomatic complexity (roughly, the number of logical branches the code can take).
I hope there are at least some of us who have not experienced that feeling of dread upon reading the logs and seeing that an exception occurred in the middle of a 500+ line method, on a line which starts somewhere after character number 52…

So, if we could somehow:

  • Make objects/variables which are always in a valid state. Maybe by using static declarations; there will be another article about how PHP 7 is an improvement over PHP 5 here.
  • Make sections of code smaller and easier to understand. Perhaps by reducing the number of branches and loops.
  • Reduce the overall amount of code. Perhaps by using higher-level constructs, so we write less code.

Honestly, I am not really sure if such ideas will really work. So far, my feeling is that they help a lot. Maybe they will help more when I get more experience; maybe they actually make things worse and I lack the knowledge/skill to see it.

But I can see there is a lot of room for improvement. And I am very confident that we should at least examine some of the options and see how they work in practice.

So, in this teaser post, let’s do something everybody enjoys!

Parsing CSV files, and doing some analysis on that data!!!

Input: a CSV file with UN employment data. Rows contain years, work sectors and the number of people working in each sector in a given year (for example: in 2002, there were 100 000 people working for the Government).

Output: statistics with percentages that show what share of people worked in each sector in a given year, and how those percentages changed over the years.

For the first post, we are keeping it simple, only showing data for one country and for both genders.
That gives us an excuse to introduce generators. 🙂 Our goal is to show the table in HTML (an ugly-looking table, but we won’t be concerned with styling, only with filling the table with proper data).
Performance is a secondary goal, and it will not be tackled until we are satisfied with the code quality. Correctness and crash-resistance are the primary goals!

I will quickly show uses of some of the concepts, and we will try to improve the code as the series goes on.
I am not sure where we will end up, or whether the functional style is a good fit for the problem overall, but let’s see where it takes us. Along the way, we’ll also see whether functional code in PHP is easier to understand than the same thing implemented in an OOP way or with an old-school imperative approach.

The code is on GitHub:

https://github.com/grizwako/broadening_horizons

We will start with closures.

Closures are good for encapsulating data, without going full on object-oriented.
http://php.net/manual/en/functions.anonymous.php
http://culttt.com/2013/03/25/what-are-php-lambdas-and-closures/
http://fabien.potencier.org/on-php-5-3-lambda-functions-and-closures.html

It goes like this: you create a function which has some data saved in it, and you can later call that function (with new parameters, too).

The example function accesses values of CSV data rows by the column names defined in the CSV header line.

// Signature: mappedRow(data-row, column-map)

$entry = mappedRow(
    [4, 'bla@blo.com'],
    ['id' => 0, 'email' => 1]
);

echo $entry('email'); //bla@blo.com

Implementation:

/**
 * Creates dictionary-like closure which accesses array values by string keys.
 * So instead of $row[$map['key']] you can use $row('key')
 * @param array $row
 * @param array $columnMap
 * @return Closure
 */
function mappedRow($row, $columnMap)
{
    if(empty($columnMap)) {
        throw new LogicException('Column map must not be empty! Hint: reverse header row to use as column map');
    }
 
    //returns function which accepts one parameter and has saved values for $row and $columnMap
    return function($key = null) use ($row, $columnMap) {
        return $row[$columnMap[$key]];
    };
}
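
The hint in the exception message above (“reverse header row”) can be read as flipping the header with array_flip, so that column names become keys and positions become values. A minimal sketch, assuming a made-up two-column header; mappedRow is repeated from the article only so that the snippet runs on its own:

```php
<?php
// Same function as in the article, copied here so the sketch is self-contained
function mappedRow($row, $columnMap)
{
    if (empty($columnMap)) {
        throw new LogicException('Column map must not be empty!');
    }

    return function ($key = null) use ($row, $columnMap) {
        return $row[$columnMap[$key]];
    };
}

// Hypothetical header line, as it would come out of the first CSV row
$header = ['id', 'email'];

// array_flip swaps keys and values: [0 => 'id', 1 => 'email'] becomes ['id' => 0, 'email' => 1]
$columnMap = array_flip($header);

$entry = mappedRow([4, 'bla@blo.com'], $columnMap);

echo $entry('id'), ' ', $entry('email'), PHP_EOL; // prints "4 bla@blo.com"
```

The closure keeps its own copies of $row and $columnMap, so nothing outside can change what $entry returns later.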

 

Second useful thing: generators!

http://php.net/manual/en/language.generators.overview.php
Generators are like iterators/arrays, but differ in that they give you data only when they are told to do so. We can use a generator to read a huge CSV file line by line, taking only the lines that we need.

For now, you can think of the keyword yield as “feed (return) a value to the caller, but save the state of the generator function for when it is called next”.

A generator is somewhat like a closure: it can also keep some local variables saved for later use.
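
To see the “only when told to” part in isolation, here is a small sketch that is not part of the article’s script. The function countTo and the $log array are made up for illustration; the log records when the generator body actually runs.

```php
<?php
// Hypothetical generator: yields 1..$limit and notes in $log each time it produces a value
function countTo($limit, array &$log)
{
    for ($i = 1; $i <= $limit; $i++) {
        $log[] = "producing $i";
        yield $i;
    }
}

$log = [];
$numbers = countTo(3, $log);

// Creating the generator did not run any of its body yet
$logAfterCreation = $log; // []

foreach ($numbers as $number) {
    $log[] = "got $number";
    if ($number === 2) {
        break; // we stop asking, so the value 3 is never produced
    }
}

print_r($log); // producing 1, got 1, producing 2, got 2
```

The local variable $i survives between the yields, which is the “saved state” mentioned above.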

/**
 * Lazily read CSV file
 *
 * @param $filename
 * @param string $split
 * @param int $maxLineLength
 * @param array $columnMap
 * @return Generator
 * @throws Exception
 */
function mappedCsvGenerator($filename, $split = ',', $maxLineLength = 0, $columnMap = [])
{
    $fileHandle = fopen($filename,'r');
    if(FALSE === $fileHandle) {
        throw new Exception('Could not open file: '.$filename);
    }
    try {
        while (($data = fgetcsv($fileHandle, $maxLineLength, $split)) !== FALSE) {
            if(empty($columnMap)) {
                //No column map given: use the first row as header (name => index) and skip to the next row
                $columnMap = array_flip($data);
                continue;
            }

            yield mappedRow($data, $columnMap);
        }
    } finally {
        //Runs when the generator finishes or is destroyed, so the file handle is always released
        fclose($fileHandle);
    }
}

Let’s create an instance of the generator; we are still not asking it to read data from the CSV.

$rows = mappedCsvGenerator('un_employment_1980-2008.csv');

Next, we will not need all the rows from the CSV, only a few. Let’s pretend that we do not have a database system into which we could import the data. We filter the rows, as we are only interested in data for one country and for both genders.

It would be wonderful if we could somehow delay reading the CSV rows until the point when we actually need them.

Luckily, Nikita Popov created a really nice library that helps with using lazy collections. (Also, he is one of the people behind a lot of the cool stuff that PHP got recently. Generators included.)
About Nikita: https://nikic.github.io/aboutMe.html
Lazy collections: https://github.com/nikic/iter

$stats = iter\filter(
    function($row) {
        return $row('Country or Area') === 'Croatia'
            && $row('Sex') === 'Total men and women';
    },
    $rows
);

The CSV file is still unread!! We did not actually use the values; we only specified that the values we will use have to satisfy some condition. Instead of creating an array and then inserting rows into it from some foreach, we only said that $stats will give us rows that satisfy the condition when we ask for them.
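
To make the “still unread” claim checkable, here is a rough sketch of a lazy filter written as a plain generator. It is a made-up stand-in (lazyFilter, numbers and $readCount are hypothetical names), not the implementation used by nikic/iter; it only assumes the same argument order as the call above, predicate first and data second.

```php
<?php
// Hypothetical stand-in for a lazy filter; NOT the implementation used by nikic/iter
function lazyFilter(callable $predicate, $iterable)
{
    foreach ($iterable as $key => $value) {
        if ($predicate($value)) {
            yield $key => $value;
        }
    }
}

// Made-up data source that counts how many items were actually read from it
function numbers(&$readCount)
{
    foreach ([1, 2, 3, 4, 5, 6] as $number) {
        $readCount++;
        yield $number;
    }
}

$readCount = 0;
$even = lazyFilter(function ($number) {
    return $number % 2 === 0;
}, numbers($readCount));

$readBeforeIteration = $readCount; // still 0: building the pipeline read nothing

$result = [];
foreach ($even as $number) {
    $result[] = $number;
}

echo $readBeforeIteration, ' ', $readCount, ' ', implode(',', $result), PHP_EOL; // prints "0 6 2,4,6"
```

The source is only consumed inside the final foreach, which is the same behaviour the article relies on for the CSV rows.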

Grouping values with common higher order functions

As the next step, I think that we will need to group the values by work sector and year if we want to display them in a table.

Higher order just means a function that accepts another function as a parameter, nothing actually fancy.
The word group sounds like it should be a function, but there is no such function in Nikic’s library.
Some searching will bring us to: https://github.com/lstrojny/functional-php

It has group, but it seems it can only do one level of grouping per call. There is some advanced stuff in the library Lars made, so maybe we can do some mumbo-jumbo when we revisit the code and refactor it to compose a function that will group by two levels.

First we group by work sector (subclassification). And then we do something that is actually elegant, but looks weird if you are not used to it. Map is kind of like foreach, but it is not supposed to modify the collection it is iterating over. Think of it like: map = for each item, call function “a” and collect the items that function “a” returned.

Function “a” accepts one item from the collection, does something and returns one item. Map returns an array of all the items function “a” returned.
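
The same idea can be tried with PHP’s built-in array_map, used here only as a familiar stand-in for the library’s map (note that array_map takes the callback first and the array second). The sample values are made up:

```php
<?php
// Made-up input: values as they would arrive from a CSV file, i.e. as strings
$values = ['12.5', '7', '3.25'];

// Function "a": takes one item, returns one item
$toFloat = function ($value) {
    return (float)$value;
};

$floats = array_map($toFloat, $values);

var_dump($floats === [12.5, 7.0, 3.25]);     // bool(true): a new array with the converted items
var_dump($values === ['12.5', '7', '3.25']); // bool(true): the original array is untouched
```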

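// group, map and reindex are functions from the Functional namespace of lstrojny/functional-php
use function Functional\group;
use function Functional\map;
use function Functional\reindex;

$statsBySubclassification = group($stats, function($row){
    return $row('Subclassification');
});

$statsBySubclassificationAndYear = map($statsBySubclassification, function($subclass) {
    $indexed = reindex($subclass, function($row) {
        return (int)$row('Year');
    });
    return map($indexed, function($row) {
        return (float)$row('Value');
    });
});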

So, for each group of entries we call a function (which accepts all the entries that fall under one Subclassification). That function first creates a new array which has the “Year” of each entry as its key, and after that goes over all the elements and keeps only the ‘Value’ of each row.

Something like:

$statsBySubclassificationAndYear = [];
foreach($statsBySubclassification as $subclassification => $items) {
    foreach($items as $item) {
        //PHP has autovivification, so we don't have to worry about creating proper array keys
        $statsBySubclassificationAndYear[$subclassification][(int)$item('Year')] = (float)$item('Value');
    }
}

Honestly, in PHP foreach seems easier at first, even to me.
In some made-up functional language, we would have this sort of thing:

statsBySubclassificationAndYear = map(statsBySubclassification, (subclass) =>
    indexed = reindex(subclass, (row) => (int)row('Year'))
    map(indexed, (row) => (float)row('Value'))
)

 

Displaying grouped data as a table:

<?php
$totalByYear = $statsBySubclassificationAndYear['Total.'];
$years = array_keys($totalByYear);
sort($years); // Why is this bad style?
 
//now, let's build a table that will show us subclassification percentages over the years
// I feel that foreach is cleaner here, because we are mutating some object
$table = new SimpleXMLElement('<table/>');
$head = $table->addChild('tr');
$head->addChild('th', 'Subclassification');
foreach($years as $year) {
    $head->addChild('th', $year);
}
 
foreach($statsBySubclassificationAndYear as $subclass => $byYear) {
    $tableRow = $table->addChild('tr');
    $tableRow->addChild('td', $subclass);
    $percentage = 0;
    //we are relying on $years that was defined too far from where we are using it
    foreach($years as $year) {
        if(array_key_exists($year, $byYear)) {
            // can this part of code be improved by using functional style?
            // we are relying on $percentage from last iteration, stuff like that causes bugs
            $tempPerc = 100 *($byYear[$year] / $totalByYear[$year]);
            $delta = $tempPerc - $percentage;
            $percentage = $tempPerc;
            $procFormat = number_format($percentage, 2);
            $deltaFormat = number_format($delta, 2);
            $tableRow->addChild('td', $procFormat)->addChild('p', $deltaFormat);
        } else {
            $tableRow->addChild('td', ' - ');
        }
    }
}
 
?>
 
<style>
    table, th, td {
        border: 1px solid black;
    }
    p {
        color: blueviolet;
    }
</style>
 
Percentages by sector, blue is delta from previous year
<?php echo $table->asXML(); ?>
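
The comments in the loop above ask whether the running $percentage could be handled in a more functional way. One possible direction, as a rough sketch rather than the article’s solution: compute all the numbers first with small functions, and only then build the table. The helpers percentageOf and withDeltas and the sample figures are made up for illustration.

```php
<?php
// Hypothetical helper: share of $part in $total as a percentage, or null when there is no total
function percentageOf($part, $total)
{
    return $total == 0 ? null : 100 * ($part / $total);
}

// Hypothetical helper: pairs each value with its difference from the previous value.
// The "previous" state lives only inside this function instead of leaking into the table loop.
function withDeltas(array $values, $start = 0)
{
    $previous = $start;
    $result = [];
    foreach ($values as $key => $value) {
        $result[$key] = ['value' => $value, 'delta' => $value - $previous];
        $previous = $value;
    }
    return $result;
}

// Made-up sample figures for one sector and for the yearly totals
$byYear = [2001 => 50.0, 2002 => 150.0];
$totalByYear = [2001 => 200.0, 2002 => 200.0];

$percentages = [];
foreach ($byYear as $year => $value) {
    $percentages[$year] = percentageOf($value, $totalByYear[$year]);
}

$rows = withDeltas($percentages);

foreach ($rows as $year => $row) {
    echo $year, ': ', number_format($row['value'], 2), ' (', number_format($row['delta'], 2), ')', PHP_EOL;
}
// prints "2001: 25.00 (25.00)" and "2002: 75.00 (50.00)"
```

As in the original loop, the first year’s delta is measured from 0. The table-building code would then only format and insert values that are already computed.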

Conclusion of the first part:

The article turned out somewhat different than I expected. Having a nicer lambda syntax would improve readability so much. Closures are great for limiting local scope, and generators are really nice to use; in some other languages they require more syntax. Nikic does wonders for PHP.

We have nicely formatted data, look at this design masterpiece!
[Image: data screenshot]

 Credits:

UN data

http://data.un.org/
Total employment, by economic activity (Thousands)  [1980-2008]
http://data.un.org/Data.aspx?q=employment&d=LABORSTA&f=tableCode%3a2B

https://github.com/nikic/iter
https://github.com/lstrojny/functional-php


About Marin Grizelj

Backend Developer

Marin works as a Backend Developer, a position that matches his specialities - databases, network protocols and feeling for code organisation. His biggest wish is to buy a bus, gather his friends and travel the world.
