0%

Pandas数据操作:分组和聚合

导入必要的库

1
2
import numpy as np
import pandas as pd

导入数据集

我下载了seaborn中的数据集(都是小型的数据集),选择了其中的航班数据集flight.csv,并放入了本地seaborn_dataset文件夹(该文件夹放在项目根目录)中
导入flight数据集(使用read_csv函数)

1
flight_df = pd.read_csv('seaborn_dataset/flights.csv')

粗略看看该数据集(使用head函数)

1
print(flight_df.head())

输出:

1
2
3
4
5
6
   year     month  passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121

再看看数据类型(使用info函数)

1
print(flight_df.info())

输出

1
2
3
4
5
6
7
8
9
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
year 144 non-null int64
month 144 non-null object
passengers 144 non-null int64
dtypes: int64(2), object(1)
memory usage: 3.5+ KB
None

注意其中的yearpassengers为int型数据

根据单个column的值分组

使用groupby函数

参数为column名,可代入单个column名,或column名组成的列表

1
2
grouped = flight_df.groupby('year')
print(grouped)

输出:

1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002783D62D860>

可以看到返回的是一个pandas的分组对象DataFrameGroupBy(或者SeriesGroupBy)

获取分组数量

使用分组对象的ngroups属性,类型为一个int型数据

1
print(grouped.ngroups)

输出:

1
12

获取分组内容

利用分组对象的groups属性,类型为一个字典(dict),字典的键(key)为组名,值(value)为每一组所有的对应的索引(index)

1
print(grouped.groups)

输出:

1
{1949: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64'), 1950: Int64Index([12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], dtype='int64'), 1951: Int64Index([24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35], dtype='int64'), 1952: Int64Index([36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], dtype='int64'), 1953: Int64Index([48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59], dtype='int64'), 1954: Int64Index([60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], dtype='int64'), 1955: Int64Index([72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83], dtype='int64'), 1956: Int64Index([84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95], dtype='int64'), 1957: Int64Index([96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107], dtype='int64'), 1958: Int64Index([108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119], dtype='int64'), 1959: Int64Index([120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131], dtype='int64'), 1960: Int64Index([132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143], dtype='int64')}

访问分组结果

输出每个组的细节

编写一个有利于输出分组内容的函数(function),并调用

1
2
3
4
5
6
7
def print_groups(group_object):
for name, group in group_object:
print(name)
print(group)


print_groups(grouped)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
1949
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
5 1949 June 135
6 1949 July 148
7 1949 August 148
8 1949 September 136
9 1949 October 119
10 1949 November 104
11 1949 December 118
1950
year month passengers
12 1950 January 115
13 1950 February 126
14 1950 March 141
15 1950 April 135
16 1950 May 125
17 1950 June 149
18 1950 July 170
19 1950 August 170
20 1950 September 158
21 1950 October 133
22 1950 November 114
23 1950 December 140

··· ···

1960
year month passengers
132 1960 January 417
133 1960 February 391
134 1960 March 419
135 1960 April 461
136 1960 May 472
137 1960 June 535
138 1960 July 622
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432

获取每个组元素的数量

使用size函数

1
print(gruoped.size())

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
year
1949 12
1950 12
1951 12
1952 12
1953 12
1954 12
1955 12
1956 12
1957 12
1958 12
1959 12
1960 12
dtype: int64

获取每个组中每个column的元素数量

使用count函数

1
print(grouped.count())

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
      month  passengers
year
1949 12 12
1950 12 12
1951 12 12
1952 12 12
1953 12 12
1954 12 12
1955 12 12
1956 12 12
1957 12 12
1958 12 12
1959 12 12
1960 12 12

获取指定组名的组的数据

使用get_group函数,参数为组的名字。因为这里的year是整型数据,故代入整数。

1
print(grouped.get_group(1950))

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
    year      month  passengers
12 1950 January 115
13 1950 February 126
14 1950 March 141
15 1950 April 135
16 1950 May 125
17 1950 June 149
18 1950 July 170
19 1950 August 170
20 1950 September 158
21 1950 October 133
22 1950 November 114
23 1950 December 140

获取每个组中前两个个元素

使用head函数,参数为元素数量

1
print(grouped.head(2))

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
     year     month  passengers
0 1949 January 112
1 1949 February 118
12 1950 January 115
13 1950 February 126
24 1951 January 145
25 1951 February 150
36 1952 January 171
37 1952 February 180
48 1953 January 196
49 1953 February 196
60 1954 January 204
61 1954 February 188
72 1955 January 242
73 1955 February 233
84 1956 January 284
85 1956 February 277
96 1957 January 315
97 1957 February 301
108 1958 January 340
109 1958 February 318
120 1959 January 360
121 1959 February 342
132 1960 January 417
133 1960 February 391

获取每个组的第3行数据

使用nth函数,参数为行号

1
print(grouped.nth(3))

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
      month  passengers
year
1949 April 129
1950 April 135
1951 April 163
1952 April 181
1953 April 235
1954 April 227
1955 April 269
1956 April 313
1957 April 348
1958 April 348
1959 April 396
1960 April 461

根据多个column分组*

根据两个column分组,传入column名字列表(list)。并用前面的print_groups函数打印组信息

1
2
mcg = flight_df.groupby(['year', 'month'])
print_groups(mcg)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
(1949, 'April')
year month passengers
3 1949 April 129
(1949, 'August')
year month passengers
7 1949 August 148
(1949, 'December')
year month passengers
11 1949 December 118
(1949, 'February')
year month passengers
1 1949 February 118
(1949, 'January')
year month passengers
0 1949 January 112
(1949, 'July')
year month passengers
6 1949 July 148
(1949, 'June')
year month passengers

··· ···

138 1960 July 622
(1960, 'June')
year month passengers
137 1960 June 535
(1960, 'March')
year month passengers
134 1960 March 419
(1960, 'May')
year month passengers
136 1960 May 472
(1960, 'November')
year month passengers
142 1960 November 390
(1960, 'October')
year month passengers
141 1960 October 461
(1960, 'September')
year month passengers
140 1960 September 508

获取每个组的描述性统计变量

使用describe函数

1
print(mcg.describe())

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
               passengers
count mean std min 25% 50% 75% max
year month
1949 April 1.0 129.0 NaN 129.0 129.0 129.0 129.0 129.0
August 1.0 148.0 NaN 148.0 148.0 148.0 148.0 148.0
December 1.0 118.0 NaN 118.0 118.0 118.0 118.0 118.0
February 1.0 118.0 NaN 118.0 118.0 118.0 118.0 118.0
January 1.0 112.0 NaN 112.0 112.0 112.0 112.0 112.0
... ... ... .. ... ... ... ... ...
1960 March 1.0 419.0 NaN 419.0 419.0 419.0 419.0 419.0
May 1.0 472.0 NaN 472.0 472.0 472.0 472.0 472.0
November 1.0 390.0 NaN 390.0 390.0 390.0 390.0 390.0
October 1.0 461.0 NaN 461.0 461.0 461.0 461.0 461.0
September 1.0 508.0 NaN 508.0 508.0 508.0 508.0 508.0

[144 rows x 8 columns]

分析结果可知,每个组仅有一行数据

使用等级制的索引分组

创建一个原数据的浅复制,使用copy函数

1
mi = flight_df.copy()

使用set_index函数

参数为column名字列表

1
2
mi = mi.set_index(['year', 'month'])
print(mi)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
                passengers
year month
1949 January 112
February 118
March 132
April 129
May 121
... ...
1960 August 606
September 508
October 461
November 390
December 432

[144 rows x 1 columns]

使用第一个等级的索引分组

使用了groupby函数中的level参数,类型为int型。最后使用print_groups函数输出组详情。

1
2
mig_11 = mi.groupby(level=0)
print_groups(mig_11)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
1949
passengers
year month
1949 January 112
February 118
March 132
April 129
May 121
June 135
July 148
August 148
September 136
October 119
November 104
December 118
1950
passengers
year month
1950 January 115
February 126
March 141
April 135
May 125
June 149
July 170
August 170
September 158
October 133
November 114
December 140

··· ···

1959
passengers
year month
1959 January 360
February 342
March 406
April 396
May 420
June 472
July 548
August 559
September 463
October 407
November 362
December 405
1960
passengers
year month
1960 January 417
February 391
March 419
April 461
May 472
June 535
July 622
August 606
September 508
October 461
November 390
December 432

使用多个等级的索引分组

level参数赋值索引列表

1
2
mig_12 = mi.groupby(level=['year', 'month'])
print_groups(mig_12)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
(1949, 'April')
passengers
year month
1949 April 129
(1949, 'August')
passengers
year month
1949 August 148
(1949, 'December')
passengers
year month
1949 December 118
(1949, 'February')
passengers
year month
1949 February 118

··· ···

(1960, 'November')
passengers
year month
1960 November 390
(1960, 'October')
passengers
year month
1960 October 461
(1960, 'September')
passengers
year month
1960 September 508

使用聚合函数agg:参数为函数名

计算组的平均值

传入np.mean

1
print(mig_12.agg(np.mean))

输出

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
                passengers
year month
1949 April 129
August 148
December 118
February 118
January 112
... ...
1960 March 419
May 472
November 390
October 461
September 508

[144 rows x 1 columns]

不创建索引,使用groupby函数的as_index参数(设置为False,创建索引设置为True),类型为布尔型

1
2
result = flight_df.groupby(['year', 'month'], as_index=False).agg(np.mean)
print(result)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
     year      month  passengers
0 1949 April 129
1 1949 August 148
2 1949 December 118
3 1949 February 118
4 1949 January 112
.. ... ... ...
139 1960 March 419
140 1960 May 472
141 1960 November 390
142 1960 October 461
143 1960 September 508

[144 rows x 3 columns]

一次使用多个函数

向参数as_index传入函数名列表

1
print(mig_12.agg([np.mean, np.std]))

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
               passengers
mean std
year month
1949 April 129 NaN
August 148 NaN
December 118 NaN
February 118 NaN
January 112 NaN
... ... ..
1960 March 419 NaN
May 472 NaN
November 390 NaN
October 461 NaN
September 508 NaN

[144 rows x 2 columns]

对不同的column应用不同的函数

通过对as_index参数传入字典(dict)实现,字典键(key)为column名,值(value)为函数名。

1
2
result = mig_12.agg({'passengers': np.mean})
print(result)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
                passengers
year month
1949 April 129
August 148
December 118
February 118
January 112
... ...
1960 March 419
May 472
November 390
October 461
September 508

[144 rows x 1 columns]

只对指定的column应用函数

对分组对象使用[ ]运算符

1
print(mig_12['passengers'].mean())

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
year  month
1949 April 129
August 148
December 118
February 118
January 112
...
1960 March 419
May 472
November 390
October 461
September 508
Name: passengers, Length: 144, dtype: int64

返回的是一个Series对象

本文标题:Pandas数据操作:分组和聚合

文章作者:SkecisAI

发布时间:2019年11月06日 - 17:37:12

最后更新:2019年12月15日 - 16:30:30

原始链接:http://www.skecis.top/2019/11/06/pandas-groupby/

许可协议: 署名-非商业性使用-禁止演绎 4.0 国际 转载请保留原文链接及作者。

感谢你的支持,希望本文能助你一臂之力。