stats::tabulate -- statistics of duplicate rows
Introduction
stats::tabulate(s) eliminates duplicate rows in the sample s and appends a column containing the multiplicities.
stats::tabulate(s, c1, c2, ..., f) combines all rows that are identical except for entries in the specified columns c1, c2 etc. The function f is applied to these columns; its result replaces the values in these columns.
stats::tabulate(s, [c1, f1], [c2, f2], ...) combines all rows that are identical except for entries in the columns c1, c2 etc. The functions f1, f2 etc. are applied to these columns; the results replace the values in these columns.
Call(s)
stats::tabulate(s)
stats::tabulate(s, c1, c2... <, f>)
stats::tabulate(s, c1..c2, c3..c4... <, f>)
stats::tabulate(s, [c1, f1], [c2, f2]...)
stats::tabulate(s, [c1, c2..., f1], [c3, c4..., f2]...)
Parameters
s | - | a sample of domain type stats::sample
c1, c2, ... | - | integers representing column indices of the sample s
f, f1, f2, ... | - | procedures
Returns
a sample of domain type stats::sample.
Details
stats::tabulate regards rows as duplicates if they have identical entries in the columns that are not specified.
stats::tabulate(s, c1, c2, ..., f): the function f is applied to the entries of the duplicate rows in the specified columns. Duplicates are eliminated and replaced by a single instance of the row; the result of f is inserted into the corresponding columns.
The function f must accept as many parameters as there are duplicates. Typical applications involve functions such as stats::mean, which accept arbitrarily many arguments. E.g., with stats::mean, duplicate rows are replaced by a single row in which the entries of the columns c1, c2 etc. are replaced by the mean values of the corresponding entries of the duplicates.
If no function f is specified, then the default function _plus is used.
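As a minimal sketch of this default, the two calls below should be equivalent. The sample is the one from Example 1. Following the description above, the third column of each combined row would hold the sum of the duplicates' entries (1 + 1 + 2 for the rows labeled a A, 5 + 10 for the rows labeled b B).

```mupad
// Sample taken from Example 1
s := stats::sample([[a, A, 1], [a, A, 1], [a, A, 2],
                    [b, B, 5], [b, B, 10]]):

// No function given: the default _plus is applied to column 3
stats::tabulate(s, 3);

// The same call with the default function written out
stats::tabulate(s, 3, _plus);

delete s:
```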
If column indices are specified more than once, then extra columns with the result of the specified function are inserted into the sample.
stats::tabulate(s, c1..c2, ..., f) is shorthand notation for stats::tabulate(s, c1, c1+1, ..., c2, ..., f).
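A short sketch of the range notation: assuming a three-column sample like the one in Example 2, the following two calls should produce the same result, because 2..3 expands to the column indices 2, 3.

```mupad
// Illustrative sample with a label in column 1 and data in columns 2 and 3
s := stats::sample([["f", 25, 166], ["m", 30, 180],
                    ["f", 54, 160], ["m", 40, 170]]):

// Range notation for the columns 2 and 3
stats::tabulate(s, 2..3, stats::mean);

// The same columns listed one by one
stats::tabulate(s, 2, 3, stats::mean);

delete s:
```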
stats::tabulate(s, [c1, f1], [c2, f2], ...): pairs of columns and corresponding procedures are specified. Again, rows are regarded as duplicates if they have identical entries in the columns that are not specified. Duplicates are eliminated and replaced by a single instance of the row; the result of f1 is inserted in column c1, the result of f2 is inserted in column c2, etc.
If column indices are specified more than once, then extra columns with the result of the specified functions are inserted into the sample.
stats::tabulate(s, [c1, c2, ..., f1], ...): it is possible to apply functions that act on several columns. The procedure f1 has to accept a sequence of lists (each representing a column). The specified columns are replaced by a single column containing the result of f1. If column indices are specified more than once, then extra columns with the result of the specified function(s) are inserted into the sample. Cf. Examples 2 and 3.
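The following sketch illustrates a procedure acting on several columns. The name wmean and the sample t are hypothetical and chosen only for illustration. The procedure assumes that it receives two lists of equal length, one per specified column, as described above, and computes a weighted mean of the first list using the second list as weights.

```mupad
// Hypothetical helper: v holds the values, w the weights (both lists).
// zip multiplies the lists element by element; _plus adds the entries.
wmean := proc(v, w)
begin
  _plus(op(zip(v, w, _mult)))/_plus(op(w))
end_proc:

// Column 1: group label, column 2: values, column 3: weights
t := stats::sample([[g1, 2, 1], [g1, 4, 3], [g2, 5, 2]]):

// Columns 2 and 3 are combined into a single column
stats::tabulate(t, [2, 3, wmean]);

delete wmean, t:
```

If the lists are passed as assumed, the combined column would contain (2*1 + 4*3)/(1 + 3) = 7/2 for g1 and (5*2)/2 = 5 for g2.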
Example 1
We create a sample:
>> s := stats::sample([[a, A, 1], [a, A, 1], [a, A, 2],
[b, B, 5], [b, B, 10]])
a A 1
a A 1
a A 2
b B 5
b B 10
Duplicate rows of the sample are counted. There are four unique rows, one occurring twice:
>> stats::tabulate(s)
a A 1 2
a A 2 1
b B 5 1
b B 10 1
In the following call, rows are regarded as duplicates if the entries in the first two columns coincide. We compute the mean value of the third entry of the duplicates:
>> stats::tabulate(s, 3, stats::mean)
a A 4/3
b B 15/2
We compute both the mean and the standard deviation of the data in the third column for the sub-samples labeled 'a A' and 'b B' by the first two columns:
>> stats::tabulate(s, [3, stats::mean], [3, stats::stdev])
a A 4/3 1/3*2^(1/2)
b B 15/2 5/2
>> delete s:
Example 2
We create a sample containing columns for "gender", "age" and "size":
>> s := stats::sample([["f", 25, 166], ["m", 30, 180],
["f", 54, 160], ["m", 40, 170],
["f", 34, 170], ["m", 20, 172]])
"f" 25 166
"m" 30 180
"f" 54 160
"m" 40 170
"f" 34 170
"m" 20 172
We apply stats::mean (composed with float) to the second and third columns to calculate the average "age" and "size" of each gender:
>> stats::tabulate(s, 2..3, float@stats::mean)
"f" 37.66666667 165.3333333
"m" 30.0 174.0
With the next call, both the mean and the standard deviation of "age" and "size" for each gender are inserted into the sample:
>> stats::tabulate(s,
[2, float@stats::mean], [2, float@stats::stdev],
[3, float@stats::mean], [3, float@stats::stdev])
"f" 37.66666667 12.11977264 165.3333333 4.109609335
"m" 30.0 8.164965809 174.0 4.320493799
We compute the Bravais-Pearson correlation coefficient between "age" and "size" for each gender:
>> stats::tabulate(s, [2, 3, float@stats::BPCorr])
"f" -0.7540135992
"m" -0.1889822365
>> delete s:
Example 3
We create a sample:
>> s := stats::sample([[a, x1, 1, 2], [b, x2, 2, 4],
[b, x1, 2, 4], [e, x2, 3, 5.5]])
a x1 1 2
b x2 2 4
b x1 2 4
e x2 3 5.5
We regard rows with the same entry in the second column as "of the same kind". We tabulate the sample using different functions on the remaining columns:
>> stats::tabulate(s, [1, _plus], [3, _mult], [4, stats::mean])
a + b x1 2 3
b + e x2 6 4.75
One can apply customized procedures. In the following, we define the procedure plusmult, which sums up the elements of each of two lists (representing columns) and then multiplies the two sums:
>> plusmult := proc(x, y) begin _plus(op(x))*_plus(op(y)) end_proc:
This procedure is then used to combine the first and the third column. Simultaneously, the mean and the standard deviation of the fourth column are inserted into the sample:
>> stats::tabulate(s, [1, 3, plusmult], [4, stats::mean],
[4, stats::stdev])
3*a + 3*b x1 3 1
5*b + 5*e x2 4.75 0.75
>> delete plusmult, s: