AI Pandas 정리

데이터를 처리할때 유용하게 사용되는 Pandas에 대해 정리해보려한다.

Pandas란?

Pandas는 파이썬 프로그래밍 언어를 위한 데이터 분석 및 조작 라이브러리이다. Pandas는 구조화된 데이터를 쉽게 처리하고 분석하는 데 도움이 되는 강력한 도구와 데이터 구조를 제공한다. 주로 표 형식의 데이터를 다루는 데 사용되며, 엑셀 스프레드시트와 유사한 형태의 데이터를 다루는 데 유용하다.

Pandas의 주요 데이터 구조

Series : 1차원 데이터 배열로서, 인덱스와 값으로 구성된다.
DataFrame: 2차원 표 형식의 데이터 구조로서, 여러 개의 열로 구성된 테이블과 같은 형태의 데이터를 다루는 데 사용된다.

Pandas 주요 기능

  
import pandas as pd

Pandas를 불러올때 일반적으로 pd로 줄여서 사용한다.

데이터 로딩

  
titanic_df = pd.read_csv('titanic/train.csv')

pandas의 read_csv()를 이용하여 데이터 프레임으로 불러올 수 있다.

데이터 추출하기

head(), tail()

head()는 DataFrame의 맨 앞부터 5개의 데이터만 추출한다. tail()은 DataFrame의 맨 뒤부터 5개의 데이터만 추출한다. ()안의 숫자를 적어 추출 데이터의 개수를 조절할 수 있다.

titanic_df.head() 

titanic_df.tail() 

shape()

DataFrame의 행(Row)와 열(Column) 크기를 가지고 있는 속성이다.

  
print('DataFrame 크기: ', titanic_df.shape)

DataFrame 크기:  (891, 12)

info()

DataFrame내의 컬럼명, 데이터 타입, Null건수, 데이터 건수 정보를 제공한다.

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

describe()

데이터값들의 평균,표준편차,4분위 분포도를 제공한다. 숫자형 컬럼들에 대해서 해당 정보를 제공한다.

  
titanic_df.describe()

value_counts()

value_counts()는 개별 데이터값의 분포도를 제공한다.

  
value_counts = titanic_df['Pclass'].value_counts()
print(value_counts)

  491
  216
  184
Name: Pclass, dtype: int64

DataFrame 생성

pd.DataFrame()함수를 사용하여 DataFrame을 만들 수 있다.

  
dic1 = {'Name': ['DongGyu', 'San','Dohwan','Ian'],
        'Year': [2001, 2001, 2000, 2001],
        'Gender': ['Male', 'Male', 'Male', 'Male']
       }
# 딕셔너리를 DataFrame으로 변환
data_df = pd.DataFrame(dic1)
print(data_df)
print("#"*30)

# 새로운 컬럼명을 추가
data_df = pd.DataFrame(dic1, columns=["Name", "Year", "Gender", "Age"])
print(data_df)
print("#"*30)

# 인덱스를 새로운 값으로 할당. 
data_df = pd.DataFrame(dic1, index=['one','two','three','four'])
print(data_df)
print("#"*30)

      Name  Year Gender
0  DongGyu  2001   Male
1      San  2001   Male
2   Dohwan  2000   Male
3      Ian  2001   Male
##############################
      Name  Year Gender  Age
0  DongGyu  2001   Male  NaN
1      San  2001   Male  NaN
2   Dohwan  2000   Male  NaN
3      Ian  2001   Male  NaN
##############################
          Name  Year Gender
one    DongGyu  2001   Male
two        San  2001   Male
three   Dohwan  2000   Male
four       Ian  2001   Male
##############################

DataFrame의 컬럼명과 인덱스

.columns, .index, .index.values를 통해 컬럼명과 인덱스, 인덱스 값 을 알 수 있다.

  
print("columns:",titanic_df.columns)
print("index:",titanic_df.index)
print("index value:", titanic_df.index.values)

columns: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
index: RangeIndex(start=0, stop=891, step=1)
index value: [  0   1   2   ... 889 890] # 원래는 값 모두 나오지만 ...으로 생략

DataFrame 수정

DataFrame의 행과 열의 접근으로 원하는 데이터들을 핸들링 할 수 있다.

데이터 수정

  
titanic_df['Age_0']=0
titanic_df.head(3)

.iloc[]를 통해 인덱스를 이용해 데이터 접근이 가능하다.

  
data_df.iloc[0, 0]

데이터 삭제

drop함수를 사용하여 데이터를 삭제할 수 있다. axis를 이용해 열을 삭제할지 행을 삭제할 지 결정할 수 있다. axis = 1은 열 방향으로 데이터를 삭제하는 것이고 axis = 0은 행 방향으로 데이터를 삭제하는 것이다.

  
titanic_drop_df = titanic_df.drop('Age_0', axis=1 )
titanic_drop_df.head(3)

느낀점

Numpy와 Pandas는 데이터 전처리 과정 등 데이터를 핸들링할 때 매우 중요한 부분이다. 아직은 데이터 핸들링에 대해 익숙하지 않아 어렵게 느껴지지만, 자주 사용해보면서 익숙해져야겠다.