project A DOS® Server In a Virtual Machine

An Outlier

Today we are going to talk about outliers. You may well ask, "What is that?" Wikipedia answers the question reasonably well: http://en.wikipedia.org/wiki/Outlier. Statistics is also taught in high-school mathematics classes. In this article, I'm going to explain the term with a simple example.
Suppose some organization (in Russia, this could be Oblstat) collects statistical data and then calculates average values for the country, a region, or a city. As an example, take a city polyclinic. Its staff consists of the administration, specialists, operational staff and service personnel (cleaners, janitors, doormen and so on).
Oblstat wants to calculate the average monthly earnings at the medical institution in order to assess the employees' income growth impartially. (We'll see below how "unbiased" this assessment really is.) For these calculations, the organization requests data on the incomes of the polyclinic's employees. Here is what it receives (the salary figures are arbitrary):

Position                                             Salary per person, Roubles   No. of persons
Medical Dept. Deputy Head Doctor                     23000                        1
Doorman                                              4500                         2
Head Doctor                                          65000                        1
Secretary to the Head Doctor                         15345                        1
Accountant                                           21500                        10
Chief Accountant                                     35000                        1
Therapist                                            17000                        14
Pediatrician                                         17437                        15
Otolaryngologist                                     16877                        2
Head of the Pediatric Dept.                          18000                        1
Head of the Therapeutic Dept.                        18000                        1
Neuropathologist                                     22300                        1
Deputy Head Doctor of the Control-Expert Committee   17200                        1
Head of the Personnel Dept.                          18000                        1
Inspector of the Personnel Dept.                     18000                        1
Programmer                                           18200                        1
Electronics Engineer                                 18000                        1
Electrician                                          17300                        1
Plumber                                              14700                        2
Janitor                                              4000                         2


Judging by the table, we get the following array of numbers: 23000, 4500, 4500, 65000, 15345, 21500, 21500, 21500, 21500, 21500, 21500, 21500, 21500, 21500, 21500, 35000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 16877, 16877, 18000, 18000, 22300, 17200, 18000, 18000, 18200, 18000, 17300, 14700, 14700, 4000, 4000.
If we calculate the average earnings as the arithmetic mean of the series, we get the following value: (23000 + (4500 * 2) + 65000 + 15345 + (21500 * 10) + 35000 + (17000 * 14) + (17437 * 15) + (16877 * 2) + 18000 + 18000 + 22300 + 17200 + 18000 + 18000 + 18200 + 18000 + 17300 + (14700 * 2) + (4000 * 2)) / 60 = 1098054 / 60 = 18300.90 Roubles, the average earnings in the polyclinic.
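The same arithmetic can be checked with a few lines of JavaScript. This is only a sketch: the `payroll` array and the `weightedMean` helper are hypothetical names introduced here, and each row is a [salary, headcount] pair copied from the table above.

```javascript
// Each row is [salary in Roubles, number of persons], taken from the table.
var payroll = [
  [23000, 1], [4500, 2], [65000, 1], [15345, 1], [21500, 10],
  [35000, 1], [17000, 14], [17437, 15], [16877, 2], [18000, 1],
  [18000, 1], [22300, 1], [17200, 1], [18000, 1], [18000, 1],
  [18200, 1], [18000, 1], [17300, 1], [14700, 2], [4000, 2]
];

// Hypothetical helper: total payroll, headcount and arithmetic mean.
function weightedMean(rows) {
  var total = 0, people = 0;
  for (var i = 0; i < rows.length; i++) {
    total += rows[i][0] * rows[i][1]; // salary times headcount
    people += rows[i][1];
  }
  return { total: total, people: people, mean: total / people };
}

var result = weightedMean(payroll);
console.log(result.total, result.people, result.mean.toFixed(2));
// prints: 1098054 60 18300.90
```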
But the table contains evident outliers that distort the real picture: the salaries of 65000 Roubles and 35000 Roubles stand in sharp contrast to the typical 17000 Roubles. (In practice, the following happens: employees are paid 25000 Roubles, but the director-general earns 300000 Roubles. This is also an outlier, one that artificially inflates the organization's average income.)
We understand this well and, obviously, are not satisfied with such statistics, because they do not give the expected result. But how can we tell fairly where a value differs too much from the rest of the series and where it is within bounds?
A method based on averaging averages fits this task well. It works as follows. First, with the series sorted in ascending order (as the outlier() function below does), we find the arithmetic means of the sub-series M1...Mn, M1...M(n-1), M1...M(n-2), ..., M1...M1. Then we find the average A1 of these arithmetic means. Those numbers M whose absolute value is more than twice A1 (that is, exceeds it by more than 100%) are discarded in the first pass: these are evident outliers.
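The first pass can be sketched separately from the full function. The `firstPass` name and the sample series are assumptions made for illustration; unlike outlier(), this sketch sorts a copy and leaves the input array untouched.

```javascript
// Hypothetical sketch of the first pass only.
function firstPass(values) {
  // Sort a copy in ascending numeric order.
  var sorted = values.slice().sort(function (a, b) { return a - b; });

  // Mean of every prefix M1..Mk, accumulated with a running sum.
  var running = 0, sumOfMeans = 0;
  for (var i = 0; i < sorted.length; i++) {
    running += sorted[i];
    sumOfMeans += running / (i + 1);
  }

  // A1 is the average of the prefix means.
  var a1 = sumOfMeans / sorted.length;

  // Keep only values whose magnitude does not exceed twice A1.
  var kept = sorted.filter(function (v) {
    return Math.abs(v) <= Math.abs(2 * a1);
  });
  return { a1: a1, kept: kept };
}

var first = firstPass([10, 11, 12, 13, 100]);
console.log(first.kept); // → [10, 11, 12, 13]
```

For this sample the prefix means are 10, 10.5, 11, 11.5 and 29.2, so A1 is about 14.44 and the cut-off is about 28.88; only 100 lies above it.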
In the second pass, outliers can still remain (for example, in 1, 20, 20, 20, 20, 20, 20, 21, 21 the value 1 is an outlier). We therefore count how many times each number occurs in the series. After that, we find the maximum number of repeats, maxAmount, and the average number of repeats, uniqueAver. Then we keep only those numbers whose repeat count N satisfies N <= maxAmount and N >= (uniqueAver / 2) at the same time; in other words, a number must occur at least half as often as the average (the code rounds uniqueAver to an integer and rounds the half down). Numbers that occur rarely are discarded.
This way we get unbiased statistics.
I implemented the whole algorithm as a JavaScript function, outlier(), which takes a numerical array as input and returns another array as output. The output can be used directly to find an average; all outliers are left out. The source code is given below.
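The frequency filter can also be sketched on its own. The `secondPass` name and the sample series are hypothetical; the rounding mirrors what the outlier() function quoted below does (the average count is rounded to an integer, and its half is rounded down), and a Map stands in for the two parallel arrays used there.

```javascript
// Hypothetical sketch of the second pass only.
function secondPass(values) {
  // Count how many times each value occurs.
  var counts = new Map();
  values.forEach(function (v) {
    counts.set(v, (counts.get(v) || 0) + 1);
  });
  if (counts.size === 0) return [];

  // Maximum and average number of repeats per distinct value.
  var total = 0, maxAmount = 0;
  counts.forEach(function (c) {
    total += c;
    if (c > maxAmount) maxAmount = c;
  });
  var uniqueAver = Math.round(Math.round(100 * total / counts.size) / 100);
  var threshold = Math.floor(uniqueAver / 2);

  // Keep values that occur at least `threshold` times.
  var kept = values.filter(function (v) {
    var c = counts.get(v);
    return c <= maxAmount && c >= threshold;
  });

  // If nothing survives, fall back to the unfiltered input.
  return kept.length > 0 ? kept : values.slice();
}

var sample = [1, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21];
var second = secondPass(sample);
console.log(second); // 20 six times and 21 five times; the single 1 is dropped
```

Here the counts are 1, 6 and 5, the rounded average is 4 and the threshold is 2. With this rounding a value that occurs only once is dropped only when the rounded average count is at least 4; for the shorter series 1, 20 (six times), 21 (twice) the threshold works out to 1, so the value 1 would stay.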

/* --- START of the statistical outlier analysis function --- */
/* http://ru.wikipedia.org/wiki/Выброс_(статистика) */
/* author: Efremov A. V. (a.k.a. "Nikodim") */
function outlier(nr) {
   var average = new Array(), z = 0, aver = 0, nr02 = new Array(), nr03 = new Array(), c = 0;
   var uniqueNr = new Array(), uniqueAmount = new Array(), uniqueAver = 0, maxAmount = 0;
   var nrRes = new Array(), nrQty = new Array();

/* sort the array in ascending order (in place) */
   for (var x = 0; x < nr.length - 1; x++) {
      for (var y = x + 1; y < nr.length; y++) {
         if (nr[x] > nr[y]) {
            z = nr[x];
            nr[x] = nr[y];
            nr[y] = z;
         }
      }
   }

/* find the arithmetic means of the prefixes nr[0..x] */
   for (x = 0; x < nr.length; x++) {
      z = 0;
      for (y = 0; y <= x; y++) {
         z += nr[y];
      }
      average[average.length++] = z / (x + 1);
   }

/* find the arithmetic mean of the arithmetic means */
   for (x = 0; x < average.length; x++) {
      aver += average[x];
   }
   aver = aver / average.length;

/* numbers that pass the 1st selection */
   for (x = 0; x < nr.length; x++) {
      if (Math.abs(nr[x]) <= Math.abs(2 * aver)) { /* if the value does not exceed twice the mean */
         nr02[nr02.length++] = nr[x];
      }
   }

/* arithmetic mean of the numbers from the 1st selection */
   z = 0;
   aver = 0;
   for (x = 0; x < nr02.length; x++) {
      aver += nr02[x];
   }
   aver = aver / nr02.length;


/* count how many times each distinct value occurs (the array is sorted) */
   z = nr02[0]; c = 1;
   for (x = 1; x < nr02.length; x++) {
      if (nr02[x] == z) {
         c++;
      } else {
         uniqueNr[uniqueNr.length++] = z; uniqueAmount[uniqueAmount.length++] = c;
         z = nr02[x];
         c = 1;
      }
   }
   uniqueNr[uniqueNr.length++] = z; uniqueAmount[uniqueAmount.length++] = c;

/* find the maximum and the rounded average number of repeats */
   maxAmount = 0;
   for (x = 0; x < uniqueAmount.length; x++) {
      uniqueAver += uniqueAmount[x];
      if (maxAmount < uniqueAmount[x]) {
         maxAmount = uniqueAmount[x];
      }
   }
   uniqueAver = Math.round(Math.round(100 * uniqueAver / uniqueAmount.length) / 100);

/* keep the values that occur at least half as often as the average */
   for (x = 0; x < uniqueAmount.length; x++) {
      if ((uniqueAmount[x] <= maxAmount) && (uniqueAmount[x] >= Math.floor(uniqueAver / 2))) {
         nrRes[nrRes.length++] = uniqueNr[x];
         nrQty[nrQty.length++] = uniqueAmount[x];
      }
   }


/* numbers that pass the 2nd selection */
   for (x = 0; x < nrRes.length; x++) {
      for (y = 0; y < nrQty[x]; y++) {
         nr03[nr03.length++] = nrRes[x];
      }
   }

/* if the 2nd selection found nothing, take all the numbers from the 1st selection */
   if (nr03.length < 1) {
      for (x = 0; x < uniqueNr.length; x++) {
         for (y = 0; y < uniqueAmount[x]; y++) {
            nr03[nr03.length++] = uniqueNr[x];
         }
      }
   }


   return (nr03);
}
/* --- END of the statistical outlier analysis function --- */

Using our example, let's call this function and calculate the average earnings with the so-called outliers left out.

/* SOURCE DATA HERE */
var nr = new Array(23000, 4500, 4500, 65000, 15345,
21500, 21500, 21500, 21500, 21500, 21500, 21500, 21500,
21500, 21500, 35000, 17000, 17000, 17000, 17000, 17000,
17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000,
17000, 17437, 17437, 17437, 17437, 17437, 17437, 17437,
17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437,
16877, 16877, 18000, 18000, 22300, 17200, 18000, 18000,
18200, 18000, 17300, 14700, 14700, 4000, 4000); /* source data */


var sortedNr = new Array(), z = 0;

sortedNr = outlier(nr);
document.body.innerHTML = "Initial numerical series: " + nr  + "<BR />\n";
document.body.innerHTML += "<B><U>Numbers which will be in statistics:</U></B> " + sortedNr + "<BR />\n";

for (var x = 0; x < sortedNr.length; x++) {
   z += sortedNr[x];
}
z = z / sortedNr.length;

document.body.innerHTML += "<B><U>Average:</U></B> " + z + "<BR />\n";

When the program runs, we get the following:

Initial numerical series: 4000, 4000, 4500, 4500, 14700, 14700, 15345, 16877, 16877, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17200, 17300, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 18000, 18000, 18000, 18000, 18000, 18200, 21500, 21500, 21500, 21500, 21500, 21500, 21500, 21500, 21500, 21500, 22300, 23000, 35000, 65000
Numbers which will be in statistics: 4000, 4000, 4500, 4500, 14700, 14700, 16877, 16877, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17000, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 17437, 18000, 18000, 18000, 18000, 18000, 21500, 21500, 21500, 21500, 21500, 21500, 21500, 21500, 21500, 21500
Average: 17013.634615384617

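The salaries of 65000 Roubles and 35000 Roubles have turned out to be outliers, as expected. The filter has also dropped the salaries that occur only once (15345, 17200, 17300, 18200, 22300 and 23000 Roubles). The resulting average wage is 17013.63 Roubles, which is 7.03% lower than the value calculated earlier.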
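The percentage can be re-derived from the two averages quoted in this article. The variable names below are hypothetical and the snippet is only a quick check of the arithmetic.

```javascript
// Averages quoted in the article, in Roubles.
var rawMean = 18300.90;                // mean over all 60 salaries
var filteredMean = 17013.634615384617; // mean after the outlier filter

// Relative decrease of the filtered mean with respect to the raw mean.
var decreasePercent = (rawMean - filteredMean) / rawMean * 100;
console.log(decreasePercent.toFixed(2) + "%"); // prints: 7.03%
```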

It is worth noting that overestimated values give a biased picture of the situation in the country, which is one of the causes of inflation.