How to Read a CSV File in Python Pandas?

In this post, you will learn how to read a CSV file in Python using pandas. To read a CSV file in pandas, we use the read_csv function. This function takes the path of the CSV file and converts its contents into a pandas DataFrame.

pandas.read_csv

pandas.read_csv(filepath_or_buffer, sep, header, names, index_col, usecols,
                prefix, dtype, skiprows, nrows, na_values, parse_dates)

Parameters –

filepath_or_buffer – str, path object or file-like object. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected.

sep – The delimiter to use. The default is a comma (‘,’). You can also use other delimiters, such as a tab (‘\t’).

header – int, list of int, None, default ‘infer’. Row number(s) to use as the column names, and the start of the data. The default behavior is to infer the column names: if no names are passed, the behavior is identical to header=0 and the column names are inferred from the first line of the file; if column names are passed explicitly, the behavior is identical to header=None.

names – List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names.

index_col – Column(s) to use as the row labels of the DataFrame, given either as a string name or as a column index.

usecols – Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s).

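prefix – Prefix to add to column numbers when there is no header, e.g. ‘X’ for X0, X1. Note that this parameter was deprecated in pandas 1.4 and removed in pandas 2.0, so it only works on older versions.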

dtype – Data type for data or columns, e.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}

skiprows – Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

nrows – Number of rows of file to read. Useful for reading pieces of large files.

na_values – Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values.

parse_dates – bool or list of int or names or list of lists or dict, default False. If set to True, pandas will try to parse the index; otherwise it parses the columns passed.
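
The skiprows and na_values parameters are not used in the examples later in this post, so here is a small sketch of how they might be combined. The inline text, the column names and the "missing" marker are made up for illustration; io.StringIO simply stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical file contents: one junk line, then a header and three rows.
raw = io.StringIO(
    "exported by some tool\n"
    "name,score\n"
    "alice,90\n"
    "bob,missing\n"
    "carol,85\n"
)

# Skip the first line and treat the string "missing" as NaN.
scores = pd.read_csv(raw, skiprows=1, na_values=["missing"])
print(scores)
```

Because one value in the score column becomes NaN, pandas stores that column as floats rather than integers.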

Reading a CSV File –

To read a CSV file in pandas, first import the pandas library, then call the read_csv() function with the path to the CSV file. Here I am going to read the file from a URL so that you can follow along with me.

# import pandas library
import pandas as pd
# read a csv file in pandas
file_path = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/clothing_store_sales.csv'
df = pd.read_csv(file_path)
df.head()

Using sep in read_csv –

By default, the read_csv function reads a comma-separated file, but you can also use other separators such as a semicolon (;), a tab (\t), a space ( ) or a pipe (|).

Let’s read a tab-separated file (TSV).

file_path = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/gapminder.tsv'
gapminder = pd.read_csv(file_path, sep='\t')
gapminder.head()
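
The same idea applies to the other separators mentioned above. As a rough sketch, the snippet below reads the same small table written once with semicolons and once with pipes. The data is invented and held in memory with io.StringIO, so it runs without downloading anything.

```python
import io
import pandas as pd

# Hypothetical data: the same table with two different delimiters.
semi = io.StringIO("city;temp\nParis;21\nOslo;14\n")
pipe = io.StringIO("city|temp\nParis|21\nOslo|14\n")

df_semi = pd.read_csv(semi, sep=";")
df_pipe = pd.read_csv(pipe, sep="|")

# Both reads should produce identical DataFrames.
print(df_semi.equals(df_pipe))
```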

Using names in read_csv –

When you read a file, you can also rename the columns using the names parameter of the read_csv function.

# new column names
cols = ['country','continent','year','life_exp','pop','gdp']
gapminder = pd.read_csv('https://raw.githubusercontent.com/bprasad26/lwd/master/data/gapminder.tsv',
                       sep='\t',
                       names=cols)
gapminder.head()

If you look at the result above, you can see that the old column names have been added as a row in the dataframe. To avoid this, set header=0 so that pandas treats the first line as a header and replaces it with the new names.

# new column names
cols = ['country','continent','year','life_exp','pop','gdp']
gapminder = pd.read_csv('https://raw.githubusercontent.com/bprasad26/lwd/master/data/gapminder.tsv',
                       sep='\t',
                       names=cols,
                       header=0)
gapminder.head()
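
The opposite case is a file that has no header row at all. Then passing names on its own is enough, because pandas behaves as if header=None was set. Here is a minimal sketch with made-up inline data:

```python
import io
import pandas as pd

# Hypothetical file with no header line: every line is data.
raw = io.StringIO("pen,blue,3\nbook,red,7\n")

# With names given and no header argument, the first line is kept as data.
items = pd.read_csv(raw, names=["item", "color", "qty"])
print(items)
```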

Using index_col in read_csv –

Whenever you read a file in pandas, it adds a default index from 0 to n-1. If you want, you can set one or more columns as the index using the index_col parameter.

1. Setting one column as Index –

file_path = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/gapminder.tsv'
gapminder = pd.read_csv(file_path, sep='\t', index_col='country')
gapminder.head()

2. Setting Multiple columns as Index –

To set multiple columns as index, just pass the column names in a list to the index_col parameter.

file_path = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/gapminder.tsv'
gapminder = pd.read_csv(file_path, sep='\t', index_col=['continent','country'])
gapminder.head()
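
Once several columns form the index, a row can be looked up by passing one value per index level to .loc. The sketch below uses a tiny invented table (not the gapminder file) to show the idea:

```python
import io
import pandas as pd

# Hypothetical data with two columns that will become the index.
raw = io.StringIO(
    "region,shop,sales\n"
    "north,a,100\n"
    "north,b,150\n"
    "south,a,90\n"
)

sales = pd.read_csv(raw, index_col=["region", "shop"])

# Look up a single row by giving a value for each index level.
north_b = sales.loc[("north", "b"), "sales"]
print(north_b)
```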

Using usecols in read_csv –

Sometimes you may want to read only a few columns from a CSV file instead of all of them. You can do this using the usecols parameter of the read_csv function. Just pass the names of the columns as a Python list.

# columns to read
use_cols = ['continent','country','year','pop']
gapminder = pd.read_csv(file_path, sep='\t', usecols=use_cols)
gapminder.head()
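
As the parameter description says, usecols also accepts column positions instead of names. A small sketch with invented inline data:

```python
import io
import pandas as pd

# Hypothetical four-column file.
raw = io.StringIO("a,b,c,d\n1,2,3,4\n5,6,7,8\n")

# Keep only the first and third columns, selected by position.
subset = pd.read_csv(raw, usecols=[0, 2])
print(subset)
```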

Using dtype in read_csv –

By default, pandas infers the data type of each column by itself. But if you want, you can also specify the data types yourself.

Let’s read the year column as float instead of int.

gapminder = pd.read_csv(file_path, sep='\t', dtype={'year':'float'})
gapminder.head()
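
One case where setting dtype matters is a column of codes with leading zeros, which pandas would otherwise read as integers. The following sketch uses made-up data and column names to show this alongside the float conversion:

```python
import io
import pandas as pd

# Hypothetical data: "code" has leading zeros that should be preserved.
raw = io.StringIO("id,year,code\n1,1952,007\n2,1957,042\n")

# Read "year" as float and keep "code" as text.
df = pd.read_csv(raw, dtype={"year": "float", "code": "str"})

print(df.dtypes)
print(df["code"].tolist())
```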

Using nrows in read_csv –

This parameter allows you to control how many rows you want to load from the CSV file. It takes an integer specifying row count.

Let’s read only the first 5 rows from the gapminder dataset.

gapminder = pd.read_csv(file_path, sep='\t', nrows=5)
gapminder
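
Note that nrows counts data rows, so the header line is not included in the count. A quick sketch with invented inline data:

```python
import io
import pandas as pd

# Hypothetical file: one header line followed by four data rows.
raw = io.StringIO("x,y\n1,10\n2,20\n3,30\n4,40\n")

# Read the header plus the first two data rows.
first_two = pd.read_csv(raw, nrows=2)
print(first_two)
```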

Using parse_dates in read_csv –

When you read a file that contains date information, pandas may read the dates as strings (object type) instead of as a datetime type. If you want to parse these columns as datetime, you can use the parse_dates parameter.

file_path = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/tesla_stock_prices.csv'
tesla = pd.read_csv(file_path, parse_dates=['Date'])
tesla.head()
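
To see the difference parse_dates makes, you can compare the column type with and without it. The sketch below uses a tiny invented table rather than the Tesla file, and checks the type with pd.api.types.is_datetime64_any_dtype:

```python
import io
import pandas as pd

# Hypothetical data with an ISO-formatted date column.
text = "Date,Close\n2021-01-04,10.5\n2021-01-05,11.0\n"

plain = pd.read_csv(io.StringIO(text))
parsed = pd.read_csv(io.StringIO(text), parse_dates=["Date"])

# Without parse_dates the column is text; with it, a datetime type.
print(pd.api.types.is_datetime64_any_dtype(plain["Date"]))
print(pd.api.types.is_datetime64_any_dtype(parsed["Date"]))
```

With a datetime column you can then use accessors such as parsed["Date"].dt.year.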
