
What is a DataFrame?
A DataFrame is a two-dimensional labeled data structure with columns potentially of different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. DataFrames are generally the most commonly used pandas object.
Creating a DataFrame
A DataFrame can be created in several ways, for example from a list of lists, from a dictionary, or by reading a file. Here are some examples:
From a list:
import pandas as pd
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
From a dictionary:
import pandas as pd
data = {'Name':['Tom', 'Nick', 'John'], 'Age':[20, 21, 19]}
df = pd.DataFrame(data)
print(df)
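The "dictionary of Series" view of a DataFrame can be made concrete with a small sketch. The variable names and the index labels 'a', 'b', 'c' below are illustrative assumptions, not part of the examples above.

```python
import pandas as pd

# Two Series with different dtypes that share the same index labels
# (the labels 'a', 'b', 'c' are made up for this illustration)
names = pd.Series(['Tom', 'Nick', 'John'], index=['a', 'b', 'c'])
ages = pd.Series([20, 21, 19], index=['a', 'b', 'c'])

# Each dictionary key becomes a column name; rows are aligned on the index labels
people = pd.DataFrame({'Name': names, 'Age': ages})

print(people)
print(people.dtypes)  # one dtype per column
```

Because every column is a Series, each one keeps its own data type, which is what allows a DataFrame to mix text and numbers.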
Viewing Data
To view a small sample of a DataFrame, use the head() and tail() methods. head() returns the first n rows (default is 5), while tail() returns the last n rows (default is 5).
print(df.head(3))
print(df.tail(3))
DataFrame Information
It is often useful to get a quick description of the data, especially for large DataFrames. The info() and describe() methods help here. info() prints a summary of the DataFrame, including each column's data type, its count of non-null values, and the memory usage. describe() returns descriptive statistics such as count, mean, and standard deviation; by default it covers only the numeric columns.
df.info()  # info() prints its summary itself and returns None, so no print() is needed
print(df.describe())
Selecting Data
You can select data in a DataFrame using column names, or using iloc and loc for position-based or label-based data selection, respectively.
Select a column by name:
print(df['Name'])
Select by position:
print(df.iloc[0])
Select by label:
print(df.loc[0])
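With the default integer index, iloc[0] and loc[0] return the same row, which can hide the difference between them. The following sketch assumes a hypothetical DataFrame with string index labels ('x', 'y', 'z') so that position and label no longer coincide.

```python
import pandas as pd

people = pd.DataFrame(
    {'Name': ['Tom', 'Nick', 'John'], 'Age': [20, 21, 19]},
    index=['x', 'y', 'z'],  # hypothetical labels chosen for illustration
)

first_by_position = people.iloc[0]  # first row, whatever its label is
row_by_label = people.loc['y']      # the row whose index label is 'y'
cell = people.loc['z', 'Age']       # single value: row label, then column name

# Slices differ too: iloc excludes the end position, loc includes the end label
first_two = people.iloc[0:2]  # rows at positions 0 and 1
x_to_y = people.loc['x':'y']  # rows labeled 'x' and 'y'

print(first_by_position)
print(row_by_label)
print(cell)
```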
Sorting Data
You can sort a DataFrame by any column using sort_values(). To sort by multiple columns, pass a list of column names.
df = df.sort_values('Age')
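As a sketch of sorting by several columns, the example below uses made-up data with tied ages and passes one ascending flag per column. The DataFrame name and its contents are assumptions for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Tom', 'Nick', 'John', 'Alex'],
    'Age': [20, 21, 20, 21],
})

# Sort by Age from highest to lowest, then by Name alphabetically within each age
ranked = people.sort_values(['Age', 'Name'], ascending=[False, True])

# sort_values() returns a new DataFrame that keeps the original index labels;
# reset_index(drop=True) renumbers the rows from 0
ranked = ranked.reset_index(drop=True)

print(ranked)
```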
Applying Functions
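You can apply your own functions to a DataFrame or to a single column. The apply() method calls a function on each element of a Series, or on each row or column of a DataFrame, so a short lambda is enough to transform a whole column. Note that apply() is not vectorized: for simple arithmetic, the built-in operators (for example df['Age'] + 1) give the same result and are usually faster.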
df['Age'] = df['Age'].apply(lambda x: x + 1)
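To compare apply() with pandas' built-in arithmetic, here is a minimal sketch. The age_group helper and the 'Group' column are hypothetical names introduced only for this illustration.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Tom', 'Nick', 'John'], 'Age': [20, 21, 19]})

# apply() calls the lambda once per element of the column
with_apply = people['Age'].apply(lambda x: x + 1)

# The arithmetic operator works on the whole column at once
vectorized = people['Age'] + 1

print(with_apply.equals(vectorized))  # both approaches give the same values


def age_group(age):
    """Hypothetical helper: label an age with a coarse group name."""
    return '20+' if age >= 20 else 'under 20'


# apply() is handy when the logic is an arbitrary Python function
people['Group'] = people['Age'].apply(age_group)
print(people)
```

For plain arithmetic the operator form is generally the better choice, and apply() is best kept for logic that has no built-in equivalent.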
Missing Data
Pandas uses the special float value NaN (Not a Number) to represent missing data. Functions like isnull() or notnull() allow you to detect missing data, while functions like dropna() or fillna() allow you to handle missing data.
# Detect missing values
print(df.isnull())
# Drop rows with missing values
df = df.dropna()
# Alternatively, fill missing values with a specified value instead of dropping rows
df = df.fillna(value=0)
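The DataFrames built earlier contain no missing values, so the calls above have nothing to act on. The sketch below uses a small made-up DataFrame with one missing age to show what each call does; filling with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['Tom', 'Nick', 'John'],
    'Age': [20, np.nan, 19],  # Nick's age is missing
})

# Count missing values per column
missing_per_column = people.isnull().sum()

# Option 1: drop every row that contains a missing value
dropped = people.dropna()

# Option 2: fill missing ages with the mean of the known ages
# (mean() skips NaN by default)
filled = people.fillna({'Age': people['Age'].mean()})

print(missing_per_column)
print(dropped)
print(filled)
```

Both dropna() and fillna() return new DataFrames here, so the original people DataFrame still contains the NaN.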
Grouping Data
Grouping data is done via the groupby() method. You can group by a single column or by a list of columns. After grouping, you can apply aggregation functions like sum(), count(), mean(), etc.
df.groupby('Age').sum()
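Grouping is easier to follow with data that has repeated keys. The sketch below assumes a hypothetical sales table (the column names and values are invented) and shows grouping by one column and by a list of columns.

```python
import pandas as pd

sales = pd.DataFrame({
    'City': ['Oslo', 'Oslo', 'Rome', 'Rome', 'Rome'],
    'Product': ['A', 'B', 'A', 'A', 'B'],
    'Units': [3, 5, 2, 4, 1],
})

# Group by a single column and sum one numeric column
units_per_city = sales.groupby('City')['Units'].sum()

# Group by a list of columns and compute several aggregations at once
summary = sales.groupby(['City', 'Product'])['Units'].agg(['sum', 'mean', 'count'])

print(units_per_city)
print(summary)
```

Selecting the 'Units' column before aggregating keeps text columns out of the calculation.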
Merging, Joining, and Concatenating
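There are several ways to combine DataFrames, including merge(), join(), and concat(). merge() combines rows that share values in one or more key columns, join() combines DataFrames on their index, and concat() stacks DataFrames along an axis.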
Merge:
df1.merge(df2, on='common_column')
Join:
df1.join(df2)
Concat:
pd.concat([df1, df2])
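The snippets above assume that df1 and df2 already exist. As a runnable sketch, the example below defines two small DataFrames with invented contents that share a column named common_column and applies each method to them.

```python
import pandas as pd

# Made-up inputs; the keys 2 and 3 appear in both DataFrames
df1 = pd.DataFrame({'common_column': [1, 2, 3], 'Name': ['Tom', 'Nick', 'John']})
df2 = pd.DataFrame({'common_column': [2, 3, 4], 'Age': [21, 19, 30]})

# merge(): inner join on the key column by default, so only keys 2 and 3 remain
merged = df1.merge(df2, on='common_column')

# join(): matches on the index and keeps every row of the left DataFrame,
# so the key is moved into the index first; key 1 gets NaN for Age
joined = df1.set_index('common_column').join(df2.set_index('common_column'))

# concat(): stacks rows; columns missing from one input are filled with NaN
stacked = pd.concat([df1, df2], ignore_index=True)

print(merged)
print(joined)
print(stacked)
```

Note that calling join() directly on two DataFrames with an overlapping column name raises an error unless a suffix is supplied, which is why the shared column is turned into the index here.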
Reading and Writing to Files
Pandas can read data from many sources, including CSV files, Excel workbooks, and SQL databases. Data can be written back to these formats as well.
Reading a CSV file:
df = pd.read_csv('file.csv')
Writing to a CSV file:
df.to_csv('newfile.csv')
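A round trip shows how the two calls fit together. This sketch writes to a temporary directory so it leaves no files behind; the file name people.csv is an arbitrary choice for illustration.

```python
import os
import tempfile

import pandas as pd

people = pd.DataFrame({'Name': ['Tom', 'Nick', 'John'], 'Age': [20, 21, 19]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'people.csv')  # arbitrary file name

    # index=False leaves the row index out of the file; by default to_csv()
    # writes it as an extra unnamed first column
    people.to_csv(path, index=False)

    # Read the file back into a new DataFrame
    restored = pd.read_csv(path)

print(restored)
```

Without index=False, reading the file back would produce an additional column holding the old index values.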
Conclusion
The pandas DataFrame is a powerful data manipulation tool that forms the foundation for most Python-based data analyses. Its flexibility, functionality, and ease of use make it a common first choice for data scientists.