[Python] Reading Large CSV Files

에코프로.AI

AI_HitchHiker 2024. 10. 16. 18:08

Import libraries

import os
import time
import pandas as pd
import chardet

Set the file path

file_path = 'G:/내 드라이브/DataSet/_최종 병합 파일/서울특별시 공공자전거 대여소별 이용정보(시간대별)/서울특별시 공공자전거 대여소별 이용정보(시간대별)_2020.csv'

Check the file encoding

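def Get_ExcelEncoding(file_path):
    # Read only the first 10,000 bytes; chardet guesses the encoding from this sample.
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read(10000))
    
    # chardet returns a dict; the 'encoding' key holds the detected encoding name.
    encoding = result['encoding']
    return encoding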
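
chardet only returns a best guess from the sampled bytes, and `result['encoding']` can be `None` when nothing matches. As a fallback, one option is to try a short list of candidate encodings in order. The sketch below is not part of the original post: `guess_encoding` and its candidate list are hypothetical, chosen on the assumption that the file is either UTF-8 or the Korean legacy encoding cp949.

```python
import codecs

def guess_encoding(file_path, candidates=("utf-8-sig", "cp949"), sample_size=10000):
    """Return the first candidate encoding that decodes the sample, else None."""
    with open(file_path, "rb") as f:
        sample = f.read(sample_size)

    for name in candidates:
        # An incremental decoder tolerates a multi-byte character that is
        # cut off at the end of the sample (final=False).
        decoder = codecs.getincrementaldecoder(name)(errors="strict")
        try:
            decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

The order of the candidates matters: UTF-8 is strict enough that cp949 bytes rarely pass as valid UTF-8, so it is tried first.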

There are three ways to read a large CSV file in Python:

pandas: call read_csv with chunksize (the number of rows to read at a time) so the file is read in several pieces
pyarrow: read the file with pyarrow
dask: read the file with dask

Function definitions

pandas - chunk

# chunk_size: number of rows to read per chunk (default: 1 million)
def Read_Chunk(file_path, chunk_size = 10**6):
    try:
        _encoding = Get_ExcelEncoding(file_path)
        chunks = []
        
        for chunk in pd.read_csv(file_path, chunksize=chunk_size, encoding=_encoding, low_memory=False):
            # Process each chunk here. For example, chunk.head() would keep only its first 5 rows.
            # processed_chunk = chunk.head()
            processed_chunk = chunk
            
            # Collect the chunk; concatenating once at the end avoids re-copying the result on every iteration.
            chunks.append(processed_chunk)
            print(f" Processed a chunk of size: {len(chunk)}")
        
        # Combine all chunks into a single DataFrame.
        return pd.concat(chunks) if chunks else pd.DataFrame()
    except Exception as e:
        print(f"An error occurred while processing the file: {e}")
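
Concatenating every chunk rebuilds the full DataFrame in memory, so chunking pays off most when each chunk is reduced before it is kept. A minimal sketch, assuming the goal is the total 이용건수 per 대여소번호 (both are column names of this dataset); `sum_by_station` is a hypothetical helper and the inline sample data is made up.

```python
import io
import pandas as pd

def sum_by_station(source, chunk_size=10**6, encoding=None):
    """Sum 이용건수 per 대여소번호 without keeping the raw chunks."""
    total = None
    for chunk in pd.read_csv(source, chunksize=chunk_size, encoding=encoding,
                             usecols=["대여소번호", "이용건수"]):
        partial = chunk.groupby("대여소번호")["이용건수"].sum()
        # Combine partial sums; fill_value=0 covers stations missing from one side.
        total = partial if total is None else total.add(partial, fill_value=0)
    return total.astype("int64") if total is not None else pd.Series(dtype="int64")

# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "대여소번호,이용건수,성별\n"
    "101,1,M\n"
    "102,2,F\n"
    "101,3,\n"
    "103,4,M\n"
)
totals = sum_by_station(sample, chunk_size=2)
print(totals)
```

Only the running per-station totals stay in memory, so peak usage depends on the chunk size and the number of stations rather than on the size of the file.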

pyarrow

def Read_pyarrow(file_path):
    from pyarrow import csv
    
    pyarrow_df = csv.read_csv(file_path).to_pandas()
    
    return pyarrow_df

dask

def Read_dask(file_path, _dtype):
    import dask.dataframe as dd

    _encoding = Get_ExcelEncoding(file_path)
    dask_df = dd.read_csv(file_path, encoding=_encoding, dtype = _dtype)

    return dask_df

Test runs

pandas - chunk

start_time = time.time()  # record the start time
df_rtn = Read_Chunk(file_path)
print("time :", time.time() - start_time) # elapsed time = current time - start time

 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 1000000
 Processed a chunk of size: 26024
time : 31.085922241210938
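
The same start/stop timing is repeated for each reader in this post. One way to factor it out is a small context manager. This is a sketch, not part of the original post, and `timed` is a hypothetical name; it uses `time.perf_counter()`, which is intended for measuring elapsed intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label} time : {time.perf_counter() - start}")

# Usage with a stand-in workload; the body could be e.g. Read_Chunk(file_path).
with timed("demo"):
    total = sum(range(1000))
```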

df_rtn.columns

Index(['대여일자', '대여시간', '대여소번호', '대여소명', '대여구분코드', '성별', '연령대코드', '이용건수', '운동량',
       '탄소량', '이동거리', '사용시간'],
      dtype='object')

df_rtn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19026024 entries, 0 to 19026023
Data columns (total 12 columns):
 #   Column  Dtype  
---  ------  -----  
 0   대여일자    object 
 1   대여시간    int64  
 2   대여소번호   int64  
 3   대여소명    object 
 4   대여구분코드  object 
 5   성별      object 
 6   연령대코드   object 
 7   이용건수    int64  
 8   운동량     object 
 9   탄소량     object 
 10  이동거리    float64
 11  사용시간    int64  
dtypes: float64(1), int64(4), object(7)
memory usage: 1.7+ GB

df_rtn.head()

df_rtn.count()

대여일자      19026024
대여시간      19026024
대여소번호     19026024
대여소명      19026024
대여구분코드    19026024
성별         9530047
연령대코드     19026024
이용건수      19026024
운동량       19026024
탄소량       19026024
이동거리      19026024
사용시간      19026024
dtype: int64
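
The `info()` output above reports seven `object` columns and about 1.7 GB of memory. If some of those columns hold only a few distinct values, reading them as `category` may reduce memory. A minimal sketch on made-up data: which columns of the real file are low-cardinality is an assumption (성별 is used here as a plausible candidate), so the actual savings would need to be measured.

```python
import io
import pandas as pd

# Made-up sample: 1,000 rows with a two-valued 성별 column.
rows = ["성별,이용건수"] + [f"{'M' if i % 2 else 'F'},{i}" for i in range(1000)]
csv_text = "\n".join(rows)

as_object = pd.read_csv(io.StringIO(csv_text), dtype={"성별": "object"})
as_category = pd.read_csv(io.StringIO(csv_text), dtype={"성별": "category"})

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object["성별"].memory_usage(deep=True)
category_bytes = as_category["성별"].memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The same `dtype` mapping can be passed to `pd.read_csv` together with `chunksize`, so each chunk arrives already in the compact type.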

pyarrow

start_time = time.time()  # record the start time
df_rtn = Read_pyarrow(file_path)
print("time :", time.time() - start_time) # elapsed time = current time - start time

time : 13.659146308898926

df_rtn.columns

Index(['대여일자', '대여시간', '대여소번호', '대여소명', '대여구분코드', '성별', '연령대코드', '이용건수', '운동량',
       '탄소량', '이동거리', '사용시간'],
      dtype='object')

df_rtn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19026024 entries, 0 to 19026023
Data columns (total 12 columns):
 #   Column  Dtype  
---  ------  -----  
 0   대여일자    object 
 1   대여시간    int64  
 2   대여소번호   int64  
 3   대여소명    object 
 4   대여구분코드  object 
 5   성별      object 
 6   연령대코드   object 
 7   이용건수    int64  
 8   운동량     object 
 9   탄소량     object 
 10  이동거리    float64
 11  사용시간    int64  
dtypes: float64(1), int64(4), object(7)
memory usage: 1.7+ GB

df_rtn.head()

df_rtn.count()

대여일자      19026024
대여시간      19026024
대여소번호     19026024
대여소명      19026024
대여구분코드    19026024
성별         9530047
연령대코드     19026024
이용건수      19026024
운동량       19026024
탄소량       19026024
이동거리      19026024
사용시간      19026024
dtype: int64

dask

start_time = time.time()  # record the start time
_dtype={'성별': 'object',
       '운동량': 'object',
       '탄소량': 'object'}

df_rtn = Read_dask(file_path, _dtype)
print("time :", time.time() - start_time) # elapsed time = current time - start time

time : 0.36102890968322754

df_rtn.columns

Index(['대여일자', '대여시간', '대여소번호', '대여소명', '대여구분코드', '성별', '연령대코드', '이용건수', '운동량',
       '탄소량', '이동거리', '사용시간'],
      dtype='object')

df_rtn.info()

<class 'dask_expr.DataFrame'>
Columns: 12 entries, 대여일자 to 사용시간
dtypes: float64(1), int64(4), string(7)

df_rtn.head()

df_rtn.count()

Dask Series Structure:
npartitions=1
대여구분코드    int64
탄소량         ...
Dask Name: count, 2 expressions
Expr=ReadCSV(e7d681a).count()

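dask evaluates lazily: Read_dask only builds a task graph, so the 0.36 seconds measured above does not include reading the full file. For the same reason, calling info() or count() without compute() does not display actual values, unlike with pandas and pyarrow.

Calling compute() converts the result to a pandas DataFrame, after which everything is displayed normally!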

computed_df = df_rtn.compute()

type(df_rtn)

dask_expr._collection.DataFrame

type(computed_df)

pandas.core.frame.DataFrame

computed_df.columns

Index(['대여일자', '대여시간', '대여소번호', '대여소명', '대여구분코드', '성별', '연령대코드', '이용건수', '운동량',
       '탄소량', '이동거리', '사용시간'],
      dtype='object')

computed_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19026024 entries, 0 to 712514
Data columns (total 12 columns):
 #   Column  Dtype  
---  ------  -----  
 0   대여일자    object 
 1   대여시간    int64  
 2   대여소번호   int64  
 3   대여소명    object 
 4   대여구분코드  object 
 5   성별      object 
 6   연령대코드   object 
 7   이용건수    int64  
 8   운동량     object 
 9   탄소량     object 
 10  이동거리    float64
 11  사용시간    int64  
dtypes: float64(1), int64(4), object(7)
memory usage: 1.8+ GB

computed_df.head()

computed_df.count()

대여일자      706336
대여시간      706336
대여소번호     706336
대여소명      706336
대여구분코드    706336
성별        346011
연령대코드     706336
이용건수      706336
운동량       706336
탄소량       706336
이동거리      706336
사용시간      706336
dtype: int64

That's all~
