导入必要的库
1 | import numpy as np |
导入数据集
我下载了seaborn中的数据集(都是小型的数据集),选择了其中的航班数据集flight.csv
,并放入了本地seaborn_dataset文件夹(该文件夹放在项目根目录)中
导入flight数据集(使用read_csv
函数)1
flight_df = pd.read_csv('seaborn_dataset/flights.csv')
粗略看看该数据集(使用head
函数)1
print(flight_df.head())
输出:1
2
3
4
5
6 year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
再看看数据类型(使用info
函数)1
print(flight_df.info())
输出1
2
3
4
5
6
7
8
9<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
year 144 non-null int64
month 144 non-null object
passengers 144 non-null int64
dtypes: int64(2), object(1)
memory usage: 3.5+ KB
None
注意其中的year
和passengers
为int型数据
根据单个column的值分组
使用groupby
函数
参数为column名,可代入单个column名,或column名组成的列表1
2grouped = flight_df.groupby('year')
print(grouped)
输出:1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002783D62D860>
可以看到返回的是一个pandas的分组对象DataFrameGroupBy
(或者SeriesGroupBy
)
获取分组数量
使用分组对象的ngroups
属性,类型为一个int型数据1
print(grouped.ngroups)
输出:1
12
获取分组内容
利用分组对象的groups
属性,类型为一个字典(dict),字典的键(key)为组名,值(value)为每一组所有的对应的索引(index)1
print(grouped.groups)
输出:1
{1949: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64'), 1950: Int64Index([12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], dtype='int64'), 1951: Int64Index([24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35], dtype='int64'), 1952: Int64Index([36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], dtype='int64'), 1953: Int64Index([48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59], dtype='int64'), 1954: Int64Index([60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], dtype='int64'), 1955: Int64Index([72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83], dtype='int64'), 1956: Int64Index([84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95], dtype='int64'), 1957: Int64Index([96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107], dtype='int64'), 1958: Int64Index([108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119], dtype='int64'), 1959: Int64Index([120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131], dtype='int64'), 1960: Int64Index([132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143], dtype='int64')}
访问分组结果
输出每个组的细节
编写一个有利于输出分组内容的函数(function),并调用1
2
3
4
5
6
7def print_groups(group_object):
for name, group in group_object:
print(name)
print(group)
print_groups(grouped)
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
451949
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
5 1949 June 135
6 1949 July 148
7 1949 August 148
8 1949 September 136
9 1949 October 119
10 1949 November 104
11 1949 December 118
1950
year month passengers
12 1950 January 115
13 1950 February 126
14 1950 March 141
15 1950 April 135
16 1950 May 125
17 1950 June 149
18 1950 July 170
19 1950 August 170
20 1950 September 158
21 1950 October 133
22 1950 November 114
23 1950 December 140
··· ···
1960
year month passengers
132 1960 January 417
133 1960 February 391
134 1960 March 419
135 1960 April 461
136 1960 May 472
137 1960 June 535
138 1960 July 622
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432
获取每个组元素的数量
使用size
函数1
print(gruoped.size())
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14year
1949 12
1950 12
1951 12
1952 12
1953 12
1954 12
1955 12
1956 12
1957 12
1958 12
1959 12
1960 12
dtype: int64
获取每个组中每个column的元素数量
使用count
函数1
print(grouped.count())
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14 month passengers
year
1949 12 12
1950 12 12
1951 12 12
1952 12 12
1953 12 12
1954 12 12
1955 12 12
1956 12 12
1957 12 12
1958 12 12
1959 12 12
1960 12 12
获取指定组名的组的数据
使用get_group
函数,参数为组的名字。因为这里的year
是整型数据,故代入整数。1
print(grouped.get_group(1950))
输出:1
2
3
4
5
6
7
8
9
10
11
12
13 year month passengers
12 1950 January 115
13 1950 February 126
14 1950 March 141
15 1950 April 135
16 1950 May 125
17 1950 June 149
18 1950 July 170
19 1950 August 170
20 1950 September 158
21 1950 October 133
22 1950 November 114
23 1950 December 140
获取每个组中前两个个元素
使用head
函数,参数为元素数量1
print(grouped.head(2))
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25 year month passengers
0 1949 January 112
1 1949 February 118
12 1950 January 115
13 1950 February 126
24 1951 January 145
25 1951 February 150
36 1952 January 171
37 1952 February 180
48 1953 January 196
49 1953 February 196
60 1954 January 204
61 1954 February 188
72 1955 January 242
73 1955 February 233
84 1956 January 284
85 1956 February 277
96 1957 January 315
97 1957 February 301
108 1958 January 340
109 1958 February 318
120 1959 January 360
121 1959 February 342
132 1960 January 417
133 1960 February 391
获取每个组的第3行数据
使用nth
函数,参数为行号1
print(grouped.nth(3))
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14 month passengers
year
1949 April 129
1950 April 135
1951 April 163
1952 April 181
1953 April 235
1954 April 227
1955 April 269
1956 April 313
1957 April 348
1958 April 348
1959 April 396
1960 April 461
根据多个column分组*
根据两个column分组,传入column名字列表(list)。并用前面的print_groups
函数打印组信息1
2mcg = flight_df.groupby(['year', 'month'])
print_groups(mcg)
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42(1949, 'April')
year month passengers
3 1949 April 129
(1949, 'August')
year month passengers
7 1949 August 148
(1949, 'December')
year month passengers
11 1949 December 118
(1949, 'February')
year month passengers
1 1949 February 118
(1949, 'January')
year month passengers
0 1949 January 112
(1949, 'July')
year month passengers
6 1949 July 148
(1949, 'June')
year month passengers
··· ···
138 1960 July 622
(1960, 'June')
year month passengers
137 1960 June 535
(1960, 'March')
year month passengers
134 1960 March 419
(1960, 'May')
year month passengers
136 1960 May 472
(1960, 'November')
year month passengers
142 1960 November 390
(1960, 'October')
year month passengers
141 1960 October 461
(1960, 'September')
year month passengers
140 1960 September 508
获取每个组的描述性统计变量
使用describe
函数1
print(mcg.describe())
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16 passengers
count mean std min 25% 50% 75% max
year month
1949 April 1.0 129.0 NaN 129.0 129.0 129.0 129.0 129.0
August 1.0 148.0 NaN 148.0 148.0 148.0 148.0 148.0
December 1.0 118.0 NaN 118.0 118.0 118.0 118.0 118.0
February 1.0 118.0 NaN 118.0 118.0 118.0 118.0 118.0
January 1.0 112.0 NaN 112.0 112.0 112.0 112.0 112.0
... ... ... .. ... ... ... ... ...
1960 March 1.0 419.0 NaN 419.0 419.0 419.0 419.0 419.0
May 1.0 472.0 NaN 472.0 472.0 472.0 472.0 472.0
November 1.0 390.0 NaN 390.0 390.0 390.0 390.0 390.0
October 1.0 461.0 NaN 461.0 461.0 461.0 461.0 461.0
September 1.0 508.0 NaN 508.0 508.0 508.0 508.0 508.0
[144 rows x 8 columns]
分析结果可知,每个组仅有一行数据
使用等级制的索引分组
创建一个原数据的浅复制,使用copy
函数1
mi = flight_df.copy()
使用set_index
函数
参数为column名字列表1
2mi = mi.set_index(['year', 'month'])
print(mi)
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15 passengers
year month
1949 January 112
February 118
March 132
April 129
May 121
... ...
1960 August 606
September 508
October 461
November 390
December 432
[144 rows x 1 columns]
使用第一个等级的索引分组
使用了groupby
函数中的level
参数,类型为int型。最后使用print_groups
函数输出组详情。1
2mig_11 = mi.groupby(level=0)
print_groups(mig_11)
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
631949
passengers
year month
1949 January 112
February 118
March 132
April 129
May 121
June 135
July 148
August 148
September 136
October 119
November 104
December 118
1950
passengers
year month
1950 January 115
February 126
March 141
April 135
May 125
June 149
July 170
August 170
September 158
October 133
November 114
December 140
··· ···
1959
passengers
year month
1959 January 360
February 342
March 406
April 396
May 420
June 472
July 548
August 559
September 463
October 407
November 362
December 405
1960
passengers
year month
1960 January 417
February 391
March 419
April 461
May 472
June 535
July 622
August 606
September 508
October 461
November 390
December 432
使用多个等级的索引分组
向level
参数赋值索引列表1
2mig_12 = mi.groupby(level=['year', 'month'])
print_groups(mig_12)
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31(1949, 'April')
passengers
year month
1949 April 129
(1949, 'August')
passengers
year month
1949 August 148
(1949, 'December')
passengers
year month
1949 December 118
(1949, 'February')
passengers
year month
1949 February 118
··· ···
(1960, 'November')
passengers
year month
1960 November 390
(1960, 'October')
passengers
year month
1960 October 461
(1960, 'September')
passengers
year month
1960 September 508
使用聚合函数agg
:参数为函数名
计算组的平均值
传入np.mean
1
print(mig_12.agg(np.mean))
输出1
2
3
4
5
6
7
8
9
10
11
12
13
14
15 passengers
year month
1949 April 129
August 148
December 118
February 118
January 112
... ...
1960 March 419
May 472
November 390
October 461
September 508
[144 rows x 1 columns]
不创建索引,使用groupby
函数的as_index
参数(设置为False
,创建索引设置为True
),类型为布尔型1
2result = flight_df.groupby(['year', 'month'], as_index=False).agg(np.mean)
print(result)
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14 year month passengers
0 1949 April 129
1 1949 August 148
2 1949 December 118
3 1949 February 118
4 1949 January 112
.. ... ... ...
139 1960 March 419
140 1960 May 472
141 1960 November 390
142 1960 October 461
143 1960 September 508
[144 rows x 3 columns]
一次使用多个函数
向参数as_index
传入函数名列表1
print(mig_12.agg([np.mean, np.std]))
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16 passengers
mean std
year month
1949 April 129 NaN
August 148 NaN
December 118 NaN
February 118 NaN
January 112 NaN
... ... ..
1960 March 419 NaN
May 472 NaN
November 390 NaN
October 461 NaN
September 508 NaN
[144 rows x 2 columns]
对不同的column应用不同的函数
通过对as_index
参数传入字典(dict)实现,字典键(key)为column名,值(value)为函数名。1
2result = mig_12.agg({'passengers': np.mean})
print(result)
输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15 passengers
year month
1949 April 129
August 148
December 118
February 118
January 112
... ...
1960 March 419
May 472
November 390
October 461
September 508
[144 rows x 1 columns]
只对指定的column应用函数
对分组对象使用[ ]运算符1
print(mig_12['passengers'].mean())
输出:1
2
3
4
5
6
7
8
9
10
11
12
13year month
1949 April 129
August 148
December 118
February 118
January 112
...
1960 March 419
May 472
November 390
October 461
September 508
Name: passengers, Length: 144, dtype: int64
返回的是一个Series
对象