Census 2012 may well be prone to misinformation

Must we believe that the Marikana miners, after a salary increase of 22 percent, now earn between twice and almost triple the annual income of black South African households? According to the census 2012 we must.

Statistical analysis must be carried out in a responsible manner. It requires methodological soundness, theoretical coherence and proper articulation of results.

Does the statistical release of census 2012 meet those requirements?

Some have questioned its credibility by pointing out oddities in the data. Others appear unaffected by such signals and present statistical claims without qualification, unconcerned about misleading the public as long as they prove their point.

What must we make of the statement that “Census 2011 found that over the past 10 years, the average annual household income for all households in South Africa more than doubled, to R103 204 from R48 385 recorded in census 2001”?

This statement, which suggests that the income distribution of households is continuous and can be represented as a mean income per year, exposes the fallacious interpretation of data.

This interpretation “proves” the claims of incumbent politicians, business and public commentators about the alleged relationship between ethnicity and household income, and leads them to conclude that coloureds earn almost twice, Indians more than four times and white people six times as much as black people.

It gets worse: “If the difference in growth in income between the two race groups remains steady in coming years, the black population would catch up with the white population, eventually. It would take about 50 years for the groups to fall in the income bracket”.

“By 2031, white households would still earn nearly three times as much as black households. By 2041, white families would be earning R2 for every R1 brought home by black families. Only in 2061 would the balance finally shift.”

What is wrong with the representation and why are comments such as the above misleading?

First, the census release refers to the census 2001 methodology where household income was based on personal income and categorised in 12 income classes. The highest category is less than half of the highest income bracket of tax year 2011.

Census 2001 comes with a note that warns users to be aware of the limitations of this variable and that census income is understated for most of the population.

Deriving an average from class-grouped frequencies is risky: the result is the sum, over all classes, of each class midpoint times the relative frequency of that class, so it depends on the assumed midpoints rather than on the actual incomes. Furthermore, where a measure of central tendency is the best bet about an observation in a distribution, dispersion reflects how poorly that measure represents any single observation.
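The risk can be illustrated with a minimal sketch. The class boundaries and household counts below are invented for illustration and are not census figures; the function name grouped_mean and the two guesses for the open-ended top class are likewise assumptions.

```python
# Hypothetical income classes: (lower bound, upper bound, households).
# These numbers are illustrative only and are NOT census data.
classes = [
    (0, 10_000, 50),
    (10_000, 50_000, 30),
    (50_000, 200_000, 15),
    (200_000, None, 5),  # open-ended top class: its midpoint must be guessed
]


def grouped_mean(classes, top_midpoint):
    """Estimate the mean as the sum of class midpoints times relative frequencies."""
    total = sum(n for _, _, n in classes)
    mean = 0.0
    for lower, upper, n in classes:
        # The open-ended class has no upper bound, so a midpoint is assumed.
        mid = top_midpoint if upper is None else (lower + upper) / 2
        mean += mid * n / total
    return mean


# Two equally defensible guesses for the top class give very different "averages".
low_guess = grouped_mean(classes, top_midpoint=300_000)
high_guess = grouped_mean(classes, top_midpoint=1_000_000)
print(low_guess, high_guess)
```

In this made-up table only 5 percent of households sit in the top class, yet the guess made for that class moves the estimated mean from about R45 000 to about R80 000.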

Second, it is common knowledge that the household income distribution is skewed.

Maybe not all people appreciate the statistical implications, but they do know that the average (mean) doesn’t tell the whole story. Moreover, household income as measured in income classes is a discrete variable, and the statistic used to describe the true location of an observation should acknowledge the income distribution and the dispersion around the central tendency.

Although the mean is the value that minimises the average squared deviation of all observations, the average absolute deviation is smallest when the median is used.
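A small numerical check makes this property concrete. The five incomes below are invented to mimic a skewed distribution, and the helper names are assumptions introduced only for this sketch.

```python
from statistics import mean, median

# Hypothetical, strongly skewed sample (units are arbitrary); not census data.
incomes = [10, 12, 15, 20, 300]


def mean_squared_dev(xs, centre):
    """Average squared deviation of the observations from a chosen centre."""
    return sum((x - centre) ** 2 for x in xs) / len(xs)


def mean_abs_dev(xs, centre):
    """Average absolute deviation of the observations from a chosen centre."""
    return sum(abs(x - centre) for x in xs) / len(xs)


m = mean(incomes)      # pulled upwards by the single large income
med = median(incomes)  # stays with the bulk of the observations

print("squared deviation around mean / median:",
      mean_squared_dev(incomes, m), mean_squared_dev(incomes, med))
print("absolute deviation around mean / median:",
      mean_abs_dev(incomes, m), mean_abs_dev(incomes, med))
```

In this sample the mean wins on squared deviation and the median wins on absolute deviation, while the mean lies above four of the five observations.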

Then why is household income presented as an average, a statistical mean for a group such as an ethnic group, without further explanation? Read simplistically, such an average invites the interpretation that one out of three black African households earns R60 600 or less, within the range of one standard score. The question then is: “How much less?” Unfortunately the census does not allow us to calculate this, as it does not provide a measure of dispersion.

There is of course a reason why this is not provided: it is known that inferences about group differences, and predictions based on averages of incomes, are a statistical violation when the true distribution is unknown and the assumptions of the common theoretical distribution are ignored.

Why was household income not presented in a grouped frequency table as a direct reflection of the measurement? This would allow insight into the number of households per class range and cross-tabulation with important variables such as gender.

Why were other central tendencies such as the mode (modus) and median not presented? These would at least indicate the form of the distribution and give richer information than the presented averages that are prone to misrepresentation.
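As a rough sketch of what such a presentation could offer, the modal class and the median class can be read directly from a grouped frequency table. The table below is hypothetical, not census data, and the function names are assumptions. Note also that with unequal class widths the modal class depends on how the classes are drawn.

```python
# Hypothetical grouped frequency table: (income class, households).
# Illustrative numbers only; NOT taken from the census.
table = [
    ("R0 to R10 000", 45),
    ("R10 001 to R50 000", 30),
    ("R50 001 to R200 000", 20),
    ("R200 001 and over", 5),
]


def modal_class(table):
    """Return the class holding the largest number of households."""
    return max(table, key=lambda row: row[1])[0]


def median_class(table):
    """Return the first class at which the cumulative count reaches half the total."""
    total = sum(n for _, n in table)
    running = 0
    for label, n in table:
        running += n
        if running >= total / 2:
            return label


print("modal class: ", modal_class(table))
print("median class:", median_class(table))
```

In this invented table the modal class is the lowest one and the median class is the next one up, which is the pattern expected of a right-skewed income distribution.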

The mode would point at an embarrassingly low, but most frequently occurring, household income.

It would show the occurrence of ethnicity at higher income classes. It would point at the huge skewness of the income distribution and put minimum income discussions and other disturbing social imbalances out in the open.

It would reveal that the mode is not a reference for income measurements by the government. It would reveal that the president and ministers earn a questionable multiple of the mode and the median income. It would reveal that the government’s tolerance of members earning supplementary income as “entrepreneurs” is unethical, not least because it invites corruption.

It would reveal that top civil administrators and service providers earn a multiple of a minister’s income, even without bonuses.

Some western countries consider it unethical and unacceptable for such administrators to earn more than 130 percent of a minister’s income. It would reveal that the commercialisation of, for example, private schooling is unethical when the annual fee per child is a multiple of the mode and median income.

The census and a proper analysis thereof, especially in the media, could result in a discussion about ethics, such as greed versus socioeconomic opportunities, fairer income distribution, education and health care, justice and many other urgent matters.

Unfortunately, simplistic and misguided analysis obscures the realities and, worse, allows for explanations that misinform public opinion and policies.

Dr Herman Kruijsse was professor in research methodology at Stellenbosch University and is a consultant in data analysis and research.