Getting Started with EDA — yasc 0.0.1 documentation

Imports¶

First we import yasc and check its version.

# imports
import yasc
from yasc.data import german_data
from yasc.eda import (
    missing_stat,
    numeric_stat,
    categorical_stat,
    corr_analysis,
    describe
)
import pandas as pd
import numpy as np

# show version
yasc.__version__

'0.0.1'

Load data¶

Here we use german_data() to load german credit data and show its first five rows.

# load german credit data
data = german_data()
data.head()

	StatusOfExistingCheckingAccount	DurationInMonth	CreditHistory	Purpose	CreditAmount	SavingsAccountAndBonds	PresentEmploymentSince	InstallmentRateInPercentageOfDisposableIncome	PersonalStatusAndSex	OtherDebtorsOrGuarantors	...	Property	AgeInYears	OtherInstallmentPlans	Housing	NumberOfExistingCreditsAtThisBank	Job	NumberOfPeopleBeingLiableToProvideMaintenanceFor	Telephone	ForeignWorker	Creditability
0	... < 0 DM	6	critical account/ other credits existing (not ...	radio/television	1169	unknown/ no savings account	... >= 7 years	4	male : divorced/separated	none	...	real estate	67	none	own	2	skilled employee / official	1	yes, registered under the customers name	yes	good
1	0 <= ... < 200 DM	48	existing credits paid back duly till now	radio/television	5951	... < 100 DM	1 <= ... < 4 years	2	male : divorced/separated	none	...	real estate	22	none	own	1	skilled employee / official	1	none	yes	bad
2	no checking account	12	critical account/ other credits existing (not ...	education	2096	... < 100 DM	4 <= ... < 7 years	2	male : divorced/separated	none	...	real estate	49	none	own	1	unskilled - resident	2	none	yes	good
3	... < 0 DM	42	existing credits paid back duly till now	furniture/equipment	7882	... < 100 DM	4 <= ... < 7 years	2	male : divorced/separated	guarantor	...	building society savings agreement/ life insur...	45	none	for free	1	skilled employee / official	2	none	yes	good
4	... < 0 DM	24	delay in paying off in the past	car (new)	4870	... < 100 DM	1 <= ... < 4 years	3	male : divorced/separated	none	...	unknown / no property	53	none	for free	2	skilled employee / official	2	none	yes	bad

5 rows × 21 columns

Check missing values I¶

missing_stat() function can help us find out columns with missing values and output missing rates.

missing_stat(data)

No missing values

	column	#missing	missing_rate
0	StatusOfExistingCheckingAccount	0	0.0
18	Telephone	0	0.0
17	NumberOfPeopleBeingLiableToProvideMaintenanceFor	0	0.0
16	Job	0	0.0
15	NumberOfExistingCreditsAtThisBank	0	0.0
14	Housing	0	0.0
13	OtherInstallmentPlans	0	0.0
12	AgeInYears	0	0.0
11	Property	0	0.0
19	ForeignWorker	0	0.0
10	PresentResidenceSince	0	0.0
8	PersonalStatusAndSex	0	0.0
7	InstallmentRateInPercentageOfDisposableIncome	0	0.0
6	PresentEmploymentSince	0	0.0
5	SavingsAccountAndBonds	0	0.0
4	CreditAmount	0	0.0
3	Purpose	0	0.0
2	CreditHistory	0	0.0
1	DurationInMonth	0	0.0
9	OtherDebtorsOrGuarantors	0	0.0
20	Creditability	0	0.0

# only include columns with missing values
missing_stat(data, only_missing_columns=True)

No missing values

	column	#missing	missing_rate

Check missing values II¶

The german data happens to have no missing values, let’s create a data frame with missing values to test missing_stat() function.

# create a data frame with missing values
df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 5, 6, 7], 'c': [8, 9, np.nan, 10]})

missing_stat(df1)

3 columns, of which 2 columns with missing values

	column	#missing	missing_rate
0	a	0	0.00
1	b	1	0.25
2	c	1	0.25

# check missing statistics of single column
missing_stat(df1, "b")

Column b of dtype float64, 1 missing(s) (0.25)

Describe numeric columns¶

# check statistics of numeric columns
numeric_stat(data)

	DurationInMonth	CreditAmount	InstallmentRateInPercentageOfDisposableIncome	PresentResidenceSince	AgeInYears	NumberOfExistingCreditsAtThisBank	NumberOfPeopleBeingLiableToProvideMaintenanceFor
count	1000.000000	1000.000000	1000.000000	1000.000000	1000.000000	1000.000000	1000.000000
mean	20.903000	3271.258000	2.973000	2.845000	35.546000	1.407000	1.155000
std	12.058814	2822.736876	1.118715	1.103718	11.375469	0.577654	0.362086
min	4.000000	250.000000	1.000000	1.000000	19.000000	1.000000	1.000000
25%	12.000000	1365.500000	2.000000	2.000000	27.000000	1.000000	1.000000
50%	18.000000	2319.500000	3.000000	3.000000	33.000000	1.000000	1.000000
75%	24.000000	3972.250000	4.000000	4.000000	42.000000	2.000000	1.000000
max	72.000000	18424.000000	4.000000	4.000000	75.000000	4.000000	2.000000

Describe categorial columns¶

# check statistics of categorical columns
categorical_stat(data)

	StatusOfExistingCheckingAccount	CreditHistory	Purpose	SavingsAccountAndBonds	PresentEmploymentSince	PersonalStatusAndSex	OtherDebtorsOrGuarantors	Property	OtherInstallmentPlans	Housing	Job	Telephone	ForeignWorker	Creditability
count	1000	1000	1000	1000	1000	1000	1000	1000	1000	1000	1000	1000	1000	1000
unique	4	5	10	5	5	4	3	4	3	3	4	2	2	2
top	no checking account	existing credits paid back duly till now	radio/television	... < 100 DM	1 <= ... < 4 years	male : single	none	car or other, not in attribute Savings account...	none	own	skilled employee / official	none	yes	good
freq	394	530	280	603	339	548	907	332	814	713	630	596	963	700

Generate descriptive statistics¶

describe() function is used to generate descriptive statistics of observed data. Beyond what we get from pandas.core.frame.DataFrame.describe(), from describe() we can check missing values in columns and type of column (numeric or categorical).

describe(df1)

	a	b	c
dtype	int64	float64	float64
type	numeric	numeric	numeric
#missing	0	1	1
missing_rate	0	0.25	0.25
count	4	3	3
mean	2.5	6	9
std	1.29099	1	1
min	1	5	8
25%	1.75	5.5	8.5
50%	2.5	6	9
75%	3.25	6.5	9.5
max	4	7	10

# get a descriptive statistics
describe(data)

	StatusOfExistingCheckingAccount	DurationInMonth	CreditHistory	Purpose	CreditAmount	SavingsAccountAndBonds	PresentEmploymentSince	InstallmentRateInPercentageOfDisposableIncome	PersonalStatusAndSex	OtherDebtorsOrGuarantors	...	Property	AgeInYears	OtherInstallmentPlans	Housing	NumberOfExistingCreditsAtThisBank	Job	NumberOfPeopleBeingLiableToProvideMaintenanceFor	Telephone	ForeignWorker	Creditability
dtype	object	int64	object	object	int64	object	object	int64	object	object	...	object	int64	object	object	int64	object	int64	object	object	object
type	categorical	numeric	categorical	categorical	numeric	categorical	categorical	numeric	categorical	categorical	...	categorical	numeric	categorical	categorical	numeric	categorical	numeric	categorical	categorical	categorical
#missing	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
missing_rate	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
count	1000	1000	1000	1000	1000	1000	1000	1000	1000	1000	...	1000	1000	1000	1000	1000	1000	1000	1000	1000	1000
unique	4	NaN	5	10	NaN	5	5	NaN	4	3	...	4	NaN	3	3	NaN	4	NaN	2	2	2
top	no checking account	NaN	existing credits paid back duly till now	radio/television	NaN	... < 100 DM	1 <= ... < 4 years	NaN	male : single	none	...	car or other, not in attribute Savings account...	NaN	none	own	NaN	skilled employee / official	NaN	none	yes	good
freq	394	NaN	530	280	NaN	603	339	NaN	548	907	...	332	NaN	814	713	NaN	630	NaN	596	963	700
mean	NaN	20.903	NaN	NaN	3271.26	NaN	NaN	2.973	NaN	NaN	...	NaN	35.546	NaN	NaN	1.407	NaN	1.155	NaN	NaN	NaN
std	NaN	12.0588	NaN	NaN	2822.74	NaN	NaN	1.11871	NaN	NaN	...	NaN	11.3755	NaN	NaN	0.577654	NaN	0.362086	NaN	NaN	NaN
min	NaN	4	NaN	NaN	250	NaN	NaN	1	NaN	NaN	...	NaN	19	NaN	NaN	1	NaN	1	NaN	NaN	NaN
25%	NaN	12	NaN	NaN	1365.5	NaN	NaN	2	NaN	NaN	...	NaN	27	NaN	NaN	1	NaN	1	NaN	NaN	NaN
50%	NaN	18	NaN	NaN	2319.5	NaN	NaN	3	NaN	NaN	...	NaN	33	NaN	NaN	1	NaN	1	NaN	NaN	NaN
75%	NaN	24	NaN	NaN	3972.25	NaN	NaN	4	NaN	NaN	...	NaN	42	NaN	NaN	2	NaN	1	NaN	NaN	NaN
max	NaN	72	NaN	NaN	18424	NaN	NaN	4	NaN	NaN	...	NaN	75	NaN	NaN	4	NaN	2	NaN	NaN	NaN

15 rows × 21 columns

Correlation analysis¶

corr, ax = corr_analysis(data, show_plot=True)

../_images/get_started_with_eda_20_0.png

Getting Started with EDA¶