Выбор данных и нахождение наиболее частых жалоб



В этой части мы будем использовать новый набор данных, чтобы показать, как быть с большими объёмами данных. Это данные о 311 сервисных запросов (или жалоб) жителей, предоставленные NYC Open Data (скачать данные).

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('ggplot')

plt.rcParams['figure.figsize'] = (10, 5)
In [2]:
complaints = pd.read_csv('data/311-service-requests.csv')
/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (8) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

В зависимости от вашей версии pandas, вы можете увидеть предупреждение "DtypeWarning: Columns (8) have mixed types". Это означает, что pandas столкнулся с проблемой чтения в наших данных. В нашем случае это почти наверняка означает, что данные имеют столбцы, где некоторые записи являются строками, а некоторые представляют собой целые числа.

На данный момент мы будем игнорировать это, надеясь на то, что пронесёт, но в долгосрочной перспективе мы должны исследовать это предупреждение.

А что это такое? (сводка)

При печати большого dataframe, будут показаны только первые несколько строк.

Для получения первых 5 строк dataframe, мы можем использовать срез: df[:5].

In [3]:
complaints[:5]
Out[3]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location
0 26589651 10/31/2013 02:08:41 AM NaN NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11432 90-03 169 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.708275 -73.791604 (40.70827532593202, -73.79160395779721)
1 26593698 10/31/2013 02:01:04 AM NaN NYPD New York City Police Department Illegal Parking Commercial Overnight Parking Street/Sidewalk 11378 58 AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.721041 -73.909453 (40.721040535628305, -73.90945306791765)
2 26594139 10/31/2013 02:00:24 AM 10/31/2013 02:40:32 AM NYPD New York City Police Department Noise - Commercial Loud Music/Party Club/Bar/Restaurant 10032 4060 BROADWAY ... NaN NaN NaN NaN NaN NaN NaN 40.843330 -73.939144 (40.84332975466513, -73.93914371913482)
3 26595721 10/31/2013 01:56:23 AM 10/31/2013 02:21:48 AM NYPD New York City Police Department Noise - Vehicle Car/Truck Horn Street/Sidewalk 10023 WEST 72 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.778009 -73.980213 (40.7780087446372, -73.98021349023975)
4 26590930 10/31/2013 01:53:44 AM NaN DOHMH Department of Health and Mental Hygiene Rodent Condition Attracting Rodents Vacant Lot 10027 WEST 124 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.807691 -73.947387 (40.80769092704951, -73.94738703491433)

5 rows × 52 columns

Выбор строк и столбцов

Чтобы выбрать столбец, мы берём имя столбца в качестве индекса (как в словаре), например так:

In [4]:
complaints['Complaint Type'][:5]
Out[4]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object

неважно, в каком порядке мы это сделаем (взять первые 5 строк у столбца или столбец у первых 5 строк)

In [5]:
complaints[:5]['Complaint Type']
Out[5]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object

Выбор нескольких столбцов

Что если мы хотим знать только тип жалобы и район, а остальное нам неинтересно? Pandas позволяет легко выбрать подмножество столбцов: просто проиндексируйте списком столбцов.

In [6]:
complaints[['Complaint Type', 'Borough']]
Out[6]:
Complaint Type Borough
0 Noise - Street/Sidewalk QUEENS
1 Illegal Parking QUEENS
2 Noise - Commercial MANHATTAN
3 Noise - Vehicle MANHATTAN
4 Rodent MANHATTAN
5 Noise - Commercial QUEENS
6 Blocked Driveway QUEENS
7 Noise - Commercial QUEENS
8 Noise - Commercial MANHATTAN
9 Noise - Commercial BROOKLYN
10 Noise - House of Worship BROOKLYN
11 Noise - Commercial MANHATTAN
12 Illegal Parking MANHATTAN
13 Noise - Vehicle BRONX
14 Rodent BROOKLYN
15 Noise - House of Worship MANHATTAN
16 Noise - Street/Sidewalk STATEN ISLAND
17 Illegal Parking BROOKLYN
18 Street Light Condition BROOKLYN
19 Noise - Commercial MANHATTAN
20 Noise - House of Worship BROOKLYN
21 Noise - Commercial MANHATTAN
22 Noise - Vehicle QUEENS
23 Noise - Commercial BROOKLYN
24 Blocked Driveway STATEN ISLAND
25 Noise - Street/Sidewalk STATEN ISLAND
26 Street Light Condition BROOKLYN
27 Harboring Bees/Wasps MANHATTAN
28 Noise - Street/Sidewalk MANHATTAN
29 Street Light Condition STATEN ISLAND
... ... ...
111039 Noise - Commercial MANHATTAN
111040 Noise - Commercial MANHATTAN
111041 Noise BROOKLYN
111042 Noise - Street/Sidewalk MANHATTAN
111043 Noise - Commercial BROOKLYN
111044 Noise - Street/Sidewalk MANHATTAN
111045 Water System MANHATTAN
111046 Noise BROOKLYN
111047 Illegal Parking QUEENS
111048 Noise - Street/Sidewalk MANHATTAN
111049 Noise - Commercial BROOKLYN
111050 Noise MANHATTAN
111051 Noise - Commercial MANHATTAN
111052 Water System MANHATTAN
111053 Derelict Vehicles QUEENS
111054 Noise - Street/Sidewalk BROOKLYN
111055 Noise - Commercial BROOKLYN
111056 Street Sign - Missing QUEENS
111057 Noise MANHATTAN
111058 Noise - Commercial BROOKLYN
111059 Noise - Street/Sidewalk MANHATTAN
111060 Noise MANHATTAN
111061 Noise - Commercial QUEENS
111062 Water System MANHATTAN
111063 Water System MANHATTAN
111064 Maintenance or Facility BROOKLYN
111065 Illegal Parking QUEENS
111066 Noise - Street/Sidewalk MANHATTAN
111067 Noise - Commercial BROOKLYN
111068 Blocked Driveway BROOKLYN

111069 rows × 2 columns

Посмотрим первые 10 строк:

In [7]:
complaints[['Complaint Type', 'Borough']][:10]
Out[7]:
Complaint Type Borough
0 Noise - Street/Sidewalk QUEENS
1 Illegal Parking QUEENS
2 Noise - Commercial MANHATTAN
3 Noise - Vehicle MANHATTAN
4 Rodent MANHATTAN
5 Noise - Commercial QUEENS
6 Blocked Driveway QUEENS
7 Noise - Commercial QUEENS
8 Noise - Commercial MANHATTAN
9 Noise - Commercial BROOKLYN

Какой самый частый тип жалобы?

Это очень простой вопрос для pandas! Используем метод .value_counts():

In [8]:
complaints['Complaint Type'].value_counts()
Out[8]:
HEATING                                 14200
GENERAL CONSTRUCTION                     7471
Street Light Condition                   7117
DOF Literature Request                   5797
PLUMBING                                 5373
PAINT - PLASTER                          5149
Blocked Driveway                         4590
NONCONST                                 3998
Street Condition                         3473
Illegal Parking                          3343
Noise                                    3321
Traffic Signal Condition                 3145
Dirty Conditions                         2653
Water System                             2636
Noise - Commercial                       2578
ELECTRIC                                 2350
Broken Muni Meter                        2070
Noise - Street/Sidewalk                  1928
Sanitation Condition                     1824
Rodent                                   1632
Sewer                                    1627
Taxi Complaint                           1227
Consumer Complaint                       1227
Damaged Tree                             1180
Overgrown Tree/Branches                  1083
Missed Collection (All Materials)         973
Graffiti                                  973
Building/Use                              942
Root/Sewer/Sidewalk Condition             836
Derelict Vehicle                          803
                                        ...
Fire Alarm - Modification                   5
Posting Advertisement                       5
Internal Code                               5
Poison Ivy                                  5
Miscellaneous Categories                    5
Ferry Complaint                             4
Transportation Provider Complaint           4
Special Natural Area District (SNAD)        4
Illegal Animal Sold                         4
Illegal Fireworks                           3
Invitation                                  3
Fire Alarm - Replacement                    3
Adopt-A-Basket                              3
Misc. Comments                              2
Legal Services Provider Complaint           2
DFTA Literature Request                     2
Opinion for the Mayor                       2
Public Assembly                             2
Window Guard                                2
Tunnel Condition                            1
Snow                                        1
X-Ray Machine/Equipment                     1
Stalled Sites                               1
Municipal Parking Facility                  1
Open Flame Permit                           1
Trans Fat                                   1
Ferry Permit                                1
DWD                                         1
Highway Sign - Damaged                      1
DHS Income Savings Requirement              1
Name: Complaint Type, dtype: int64

10 наиболее частых типов:

In [9]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
Out[9]:
HEATING                   14200
GENERAL CONSTRUCTION       7471
Street Light Condition     7117
DOF Literature Request     5797
PLUMBING                   5373
PAINT - PLASTER            5149
Blocked Driveway           4590
NONCONST                   3998
Street Condition           3473
Illegal Parking            3343
Name: Complaint Type, dtype: int64

Теперь мы можем построить график!

In [10]:
complaint_counts[:10].plot(kind='bar')
Out[10]: