Why you should learn Pandas
Pandas is an essential library for data analysis in Python. It provides a variety of features for efficiently handling data preprocessing, transformation, and analysis. In data analysis work, handling large volumes of data and extracting valuable insights from it is a daily task. Because Pandas makes such data manipulation easy, it has become a must-have tool for data scientists and analysts. It also integrates smoothly with Python's rich ecosystem, dramatically improving the efficiency of data analysis.

Benefits of Using Pandas
By using Pandas, you can reduce the amount of code needed for data manipulation and write more readable code. It also enables fast data processing. Pandas is built on the NumPy numerical computing library, and it performs efficient processing under the hood. Therefore, it can handle large datasets quickly, and even complex data operations can be expressed concisely. This improves development efficiency and allows you to focus more on analysis.

Installing Pandas and Setting Up the Environment
To start using Pandas, you first need to install it. You can install it easily with pip; we will also cover environment setup such as Jupyter Notebook. Install Pandas with the following command.

pip install pandas

Jupyter Notebook is also handy for data analysis: it lets you write and run code in a web browser and see the results instantly. You can install Jupyter Notebook with the following command.

pip install notebook

Once these are set up, you can start analyzing data with Pandas right away.
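If you want to confirm that the installation worked, a minimal sanity check is to import the library and print its version:

import pandas as pd

print(pd.__version__)  # prints the installed Pandas version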
Basics of Pandas: Series and DataFrames
Creating and Manipulating Series
A Series, one of the fundamental data structures in Pandas, is one-dimensional. It can easily be created from a list or a NumPy array. This section explains how to access and manipulate its elements. A Series is like an array with labels, where each element has an index. To create a Series from a list, use pandas.Series() as shown below.

import pandas as pd
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(s)
You can also specify an index.

import pandas as pd
data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
s = pd.Series(data, index=index)
print(s)
To access an element of a Series, specify its index label.

print(s['a'])

Like NumPy arrays, slicing is also possible. Note that label-based slices include both endpoints, so the following returns the elements from 'b' through 'd'.

print(s['b':'d'])
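The same distinction can be made explicit with .loc (labels) and .iloc (positions); a small sketch using the Series above:

print(s.loc['b':'d'])  # label slice: the end label 'd' is included
print(s.iloc[1:4])     # position slice: the end position is excluded, so this selects positions 1 to 3

Both lines select the same three elements here.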
Creating and Manipulating DataFrames

A DataFrame is a two-dimensional data structure suited to handling tabular data. This section explains how to load one from a CSV file and how to create one from a dictionary. DataFrames consist of rows and columns, and each column can store a different data type. To create a DataFrame from a dictionary, use pandas.DataFrame() as shown below.

import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 28],
'city': ['Tokyo', 'Osaka', 'Kyoto']}
df = pd.DataFrame(data)
print(df)
To read a DataFrame from a CSV file, use pandas.read_csv().

import pandas as pd
df = pd.read_csv('data.csv')
print(df)
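If you do not have a data.csv at hand ('data.csv' above is just a placeholder filename), a self-contained way to try read_csv() is to read from an in-memory string, since it also accepts file-like objects:

import io
import pandas as pd

# sample CSV content, made up for illustration
csv_text = "name,age\nAlice,25\nBob,30"
df = pd.read_csv(io.StringIO(csv_text))
print(df)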
To access a DataFrame column, specify the column name.

print(df['name'])

You can also select multiple columns by passing a list of column names.

print(df[['name', 'age']])

To access rows of a DataFrame, use loc or iloc; their usage is explained in detail later.

Checking and Converting Data Types
Pandas can handle various data types. This section explains how to check data types and convert them when needed. To check the data type of each DataFrame column, use the dtypes attribute.

import pandas as pd
data = {'col1': [1, 2, 3], 'col2': [1.1, 2.2, 3.3], 'col3': ['a', 'b', 'c']}
df = pd.DataFrame(data)
print(df.dtypes)
To convert data types, use the astype() method.

df['col1'] = df['col1'].astype('float')
print(df.dtypes)
Common data types include int (integer), float (floating-point), str (string), and datetime (date and time). Converting to appropriate types as needed can reduce memory usage and improve computation speed.
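As a small sketch of both points (the column names here are made up for the example): pd.to_datetime() converts strings into proper datetime values, and a narrower integer type shrinks memory usage.

import pandas as pd

df = pd.DataFrame({'date': ['2024-01-01', '2024-01-02'], 'count': [100, 200]})
df['date'] = pd.to_datetime(df['date'])    # object (str) -> datetime64
df['count'] = df['count'].astype('int32')  # int64 -> int32, half the memory per value
print(df.dtypes)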
Data Access and Extraction

Data Access Using loc and iloc
loc and iloc are powerful tools for extracting specific rows or columns from a DataFrame. Their usage is explained here with concrete examples. loc specifies rows and columns using labels.

import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 28],
'city': ['Tokyo', 'Osaka', 'Kyoto']}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
print(df.loc['row1', 'name'])
print(df.loc['row1':'row2', ['name', 'age']])
iloc specifies rows and columns using integer positions.

print(df.iloc[0, 0])
print(df.iloc[0:2, 0:2])
loc and iloc are extremely convenient for accessing specific parts of a DataFrame and are indispensable tools for data exploration and analysis.

Extracting Rows Based on Conditions
This section explains how to extract only the rows that meet specific conditions, covering the query() method and extraction using comparison operators. To extract rows that satisfy particular conditions from a DataFrame, use comparison operators.

import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 28],
'city': ['Tokyo', 'Osaka', 'Kyoto']}
df = pd.DataFrame(data)
print(df[df['age'] > 28])
Using the query() method allows you to specify more complex conditions.

print(df.query('age > 28'))
print(df.query('city == "Tokyo"'))
It is also possible to combine multiple conditions.

print(df[(df['age'] > 25) & (df['city'] != 'Osaka')])
print(df.query('age > 25 and city != "Osaka"'))
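query() can also reference Python variables by prefixing them with @, which keeps condition strings readable when thresholds come from code; a quick sketch:

min_age = 25  # hypothetical threshold
print(df.query('age > @min_age'))  # equivalent to df[df['age'] > min_age]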
By mastering these methods, you can efficiently extract the data you need from a DataFrame.

Inspecting Data with head() and tail()
Using the head() and tail() methods lets you easily view the first or last rows of a DataFrame, which is useful for getting an overview of the data. dataframe.head() displays the first few rows of the DataFrame; if you don't specify an argument, it shows the first five rows by default.

import pandas as pd
data = {'col1': range(10), 'col2': range(10, 20)}
df = pd.DataFrame(data)
print(df.head())
print(df.head(3))
dataframe.tail() displays the last few rows of the DataFrame; if you don't specify an argument, it shows the last five rows by default.

print(df.tail())
print(df.tail(3))
These methods are extremely useful for grasping the overall picture of the data, especially when working with large DataFrames, as they allow efficient data inspection.
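Beyond head() and tail(), a few other standard helpers give a quick overview; a small sketch using the DataFrame above:

print(df.shape)      # (number of rows, number of columns)
df.info()            # prints column dtypes, non-null counts, and memory usage
print(df.sample(3))  # three randomly chosen rows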
Data Manipulation and Organization
Adding, Deleting, and Modifying Columns
This section explains how to add new columns to a DataFrame, delete unnecessary columns, and modify data in existing columns. To add a new column to a DataFrame, specify the new column name and assign values to it.

import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 28]}
df = pd.DataFrame(data)
df['gender'] = ['female', 'male', 'male']
print(df)
To change the values of an existing column, select the column and assign new values.

df['age'] = df['age'] + 1
print(df)
To delete a column, use the drop() method. Provide the name of the column to remove as an argument; setting inplace=True modifies the original DataFrame in place.

df.drop('gender', axis=1, inplace=True)
print(df)
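As an aside, assigning the result is often preferred over inplace=True because it keeps the data flow explicit; a tiny self-contained sketch (df2 is a throwaway example):

df2 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = df2.drop(columns='b')  # returns a new DataFrame; columns= avoids spelling out axis=1
print(df2)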
By combining these operations, you can flexibly manipulate a DataFrame.

Sorting Data (Sort)
This section explains how to sort DataFrame rows based on the values of specific columns, including how to specify ascending or descending order and how to sort by multiple columns. To sort a DataFrame by the values of a specific column, use the sort_values() method and pass the column name(s) to sort by as an argument.

import pandas as pd
data = {'name': ['Bob', 'Alice', 'Charlie'],
'age': [30, 25, 28]}
df = pd.DataFrame(data)
df_sorted = df.sort_values('age')
print(df_sorted)
By default, sorting is in ascending order. To sort in descending order, set ascending=False.

df_sorted = df.sort_values('age', ascending=False)
print(df_sorted)
To sort by multiple columns, provide a list of column names as the argument.

df_sorted = df.sort_values(['age', 'name'])
print(df_sorted)
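Each column can also get its own sort direction by passing a matching list to ascending; a quick sketch:

# sort by age ascending, then by name descending
df_sorted = df.sort_values(['age', 'name'], ascending=[True, False])
print(df_sorted)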
By leveraging these features, you can organize data into a more manageable form.

Data Statistics
This section explains how to calculate statistical measures such as the mean, median, and standard deviation. The describe() method provides a summary overview of the data. To compute the mean of each column in a DataFrame, use the mean() method.

import pandas as pd
data = {'col1': [1, 2, 3, 4, 5], 'col2': [2, 4, 6, 8, 10]}
df = pd.DataFrame(data)
print(df.mean())
To calculate the median, use the median() method.

print(df.median())

To compute the standard deviation, use the std() method.

print(df.std())

The describe() method displays these statistics together.

print(df.describe())
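By default, describe() summarizes numeric columns; passing include='all' extends the summary to non-numeric columns as well. A quick sketch with a made-up string column:

df2 = pd.DataFrame({'val': [1, 2, 3], 'label': ['x', 'y', 'x']})
print(df2.describe(include='all'))  # adds count/unique/top/freq for the string column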
These statistics help you understand the distribution and trends of the data.
Data Input/Output and Practical Applications
Writing to CSV Files
We explain how to export a processed DataFrame to a CSV file, including how to specify the character encoding and delimiter when writing. To export a DataFrame to a CSV file, use the to_csv() method, providing the filename as an argument.

import pandas as pd
data = {'name': ['Alice', 'Bob'], 'age': [25, 30]}
df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)
Specifying index=False prevents the index from being written to the file.
To set a delimiter, use the sep argument. For example, sep='\t' writes a tab-separated file.

df.to_csv('output.csv', sep='\t', index=False)
To set the character encoding, use the encoding argument.

df.to_csv('output.csv', encoding='utf-8', index=False)
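A practical note: if the file will be opened in Excel, encoding='utf-8-sig' (UTF-8 with a byte order mark) is a common choice so that non-ASCII text is detected correctly; a quick sketch:

df.to_csv('output.csv', encoding='utf-8-sig', index=False)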
By configuring these options appropriately, you can output CSV files in various formats.

Handling Missing Values
We explain how to handle missing values (NaN), covering detection, imputation, and removal techniques appropriate to different situations. To check whether a DataFrame contains missing values, use the isnull() method, which returns True for missing entries and False otherwise.

import pandas as pd
import numpy as np
data = {'col1': [1, 2, np.nan], 'col2': [4, np.nan, 6]}
df = pd.DataFrame(data)
print(df.isnull())
To count the number of missing values in each column, combine it with the sum() method.

print(df.isnull().sum())
To fill missing values with a specific value, use the fillna() method.

df_filled = df.fillna(0)
print(df_filled)
To remove rows that contain missing values, use the dropna() method.

df_dropped = df.dropna()
print(df_dropped)
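Instead of a single constant, each column can be filled with its own statistic, such as the column mean; a small sketch using the DataFrame above:

df_mean_filled = df.fillna(df.mean())  # fills each column's NaN with that column's mean
print(df_mean_filled)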
Performing these operations appropriately minimizes the impact of missing data on your analysis.