A comma-separated values (csv) file is widely used for storing data. Understanding how to read or import them in python is very crucial for any data scientist, analysts or anyone who works with data.
In this post you will learn –
A comma-separated values (csv) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format – Wikipedia.
B. How to read a csv file in pandas ?
To read a csv file in pandas, we use the pandas read_csv function. Let’s get some hands on practice in reading a csv file together.
Download files –
1 . Google Drive link.
2 . GitHub Link.
Before we read a csv file, first we have to import the pandas library.
# import pandas import pandas as pd
1 . Reading files locally from a computer –
When reading a file locally, you have to provide the file_path + file_name to the pandas read_csv function. You can either use a relative path or you can use an absolute path on Mac, Windows, and Linux.
Reading a csv file with relative path –
# read csv file using relative path on Mac store_sales = pd.read_csv('../data/clothing_store_sales.csv') store_sales.head()
Reading a csv file with absolute path –
# read csv file using absolute path on Windows store_sales = pd.read_csv("D:\workspace\lwd\data\clothing_store_sales.csv") store_sales.head()
2. Reading files remotely from internet –
You can also read a csv or any other format file in pandas from internet. All you have to do is provide the file url to pandas read_csv function.
# read csv file remotely from internet url="https://raw.githubusercontent.com/bprasad26/lwd/master/data/clothing_store_sales.csv" store_sales = pd.read_csv(url) store_sales.head()
C. Reading files with different separators –
By default read_csv function will read a comma-separated file but If you want, you can also uses other separators like semicolon (;), a tab (\t), a space ( ) and a pipe (|).
Let’s read the Gap minder data which is tab separated.
# read a tab separated file gap_minder = pd.read_csv('../data/gapminder.tsv', sep='\t') gap_minder.head()
D. Rename column names when reading a file –
When you read a file, you can also rename the column names using the name parameter of read_csv function.
# new column names cols = ['country','continent','year','life_exp','pop','gdp'] gap_minder = pd.read_csv('../data/gapminder.tsv', names=cols, sep='\t') gap_minder.head()
And if you look at the above result. You can see that the old column names are being added as a row in the dataframe. To avoid this you have to set the header parameter.
# new column names cols = ['country','continent','year','life_exp','pop','gdp'] gap_minder = pd.read_csv('../data/gapminder.tsv', header=0, names=cols, sep='\t') gap_minder.head()
Related post – How to Rename Column names.
E. Reading only few columns from a file –
Sometimes when you read a file, you don’t want to read all of the columns. You just want to read a few of them. In that case you can use the usecols parameter of read_csv function.
Let’s read a different version of clothing sales data that contains some useless columns.
sales = pd.read_csv('../data/cs_corrupted.csv') sales.head()
You can see that we have some useless ‘unnamed’ column. We can exclude these column when reading a file by only specifying column names in a list which we want to read using the usecols parameter and rest of the columns that are not in the list will be ignored by pandas.
# only read selected columns cols_to_read = ['Customer', 'Type of Customer', 'Items', 'Net Sales', 'Method of Payment', 'Gender', 'Marital Status', 'Age'] sales = pd.read_csv('../data/cs_corrupted.csv', usecols=cols_to_read) sales.head()
F. Select only N rows when reading a file –
Sometimes you may want to read only few rows of data from a file rather than the whole file. You can do this using the nrows parameter.
# read only 10 rows of data gap_minder = pd.read_csv('../data/gapminder.tsv', nrows=10, sep='\t') gap_minder
G. Setting columns as Index in read_csv –
Whenever you read a file in pandas, by default it adds an index for you from 0 to n-1. If you want you can set any columns as a index using the index_col parameter.
1 . Setting one column as Index –
# set country column as index gap_minder = pd.read_csv('../data/gapminder.tsv', index_col='country', sep='\t') gap_minder.head()
2. Setting Multiple columns as Index –
To set multiple columns as index, just pass the column names in a list to the index_col parameter.
# set multiple columns as index gap_minder = pd.read_csv('../data/gapminder.tsv', index_col=['continent','country'], sep='\t') gap_minder.head()
H. Parsing Date columns when reading a file –
When you read a file which contains date information, pandas may read them as string object compared to datetime type. If you want to parse these columns as a datetime type, you can use the parse_dates parameter.
# parsing date columns as datetime stock = pd.read_csv('../data/tesla_stock_prices.csv', parse_dates=['Date']) stock.head()
There are many other options in pandas read_csv function, please check the document to learn more – Pandas read_csv doc.
Related Post – Pandas to_csv – write a dataframe to a csv file.
If you like this post then please share it with others and subscribe to our blog below.