# Pythonproject

A data-analysis project exploring the NYSERDA Electric Vehicle Drive Clean Rebate dataset.

## Install / Use
## 1. Introduction

In recent years, climate change and environmental sustainability have become global priorities. As part of this transformation, electric vehicles (EVs) have emerged as a cleaner, more sustainable alternative to traditional fossil-fuel transportation. To accelerate EV adoption, governments and organizations have introduced incentive programs that make EVs more financially accessible to consumers. One such program is the Drive Clean Rebate initiative launched by the New York State Energy Research and Development Authority (NYSERDA).

### 1.1 Overview of the Dataset

The dataset used in this project is a detailed record of electric vehicle rebate claims under NYSERDA's Drive Clean Rebate Program. It contains individual rebate transactions, made available to the public to promote transparency and allow data-driven decision-making. The dataset begins in 2017 and is regularly updated, making it a rich source of information for analyzing EV trends in New York State.

Key details of the dataset:

- Name: NYSERDA Electric Vehicle Drive Clean Rebate Data
- Source: data.ny.gov – the official open data portal of New York State
- Format: Comma-Separated Values (CSV) file
- Time Period: January 2017 to present
- Number of Records: over 300,000 entries as of the latest update; this project works with a sample of about 18,000
- File Size: varies with updates (approx. 10–20 MB)

### 1.2 Purpose and Relevance of the Dataset

This dataset serves several important purposes:

- Promotes policy evaluation: by analyzing rebate distributions, government agencies can assess how well their incentive programs are working.
- Encourages transparency: making this data public ensures transparency in rebate distribution.
- Enables research and analysis: students, researchers, and analysts can use the data to study consumer behavior, model EV adoption, or assess environmental impact.
- Supports the EV ecosystem: car manufacturers and dealerships can understand demand trends and tailor their offerings accordingly.

### 1.3 Columns and Their Significance

The dataset contains a wide range of columns. Some of the most important ones include:

- Purchase Date: the date the vehicle was purchased or leased.
- Vehicle Make and Model: the brand and specific model of EV purchased.
- Technology Type: whether the vehicle is a Battery Electric Vehicle (BEV) or a Plug-in Hybrid (PHEV).
- MSRP: the Manufacturer's Suggested Retail Price of the vehicle.
- Rebate Amount: the dollar value of the rebate received.
- County: the geographic location of the buyer.
- Dealership Name: the name of the selling dealership.

Each of these fields offers insights into purchasing patterns, price sensitivity, dealership effectiveness, and geographic EV adoption.

### 1.4 Suitability for Data Science Projects

This dataset is ideal for data science and analytics projects for the following reasons:

- Large volume: sufficient data points to draw statistically meaningful insights.
- Structured format: well organized and easy to parse with tools like Python and pandas.
- Temporal spread: data across multiple years allows trend analysis and forecasting.
- Real-world relevance: EV adoption is a critical issue in both environmental and economic discussions today.
## Source of Dataset

Dataset link: https://catalog.data.gov/dataset/nyserda-electric-vehicle-drive-clean-rebate-data-beginning-2017
## Showing the Dataset
## Exploratory Data Analysis (EDA) Process

Exploratory Data Analysis (EDA) is a vital part of any data science or analytical project. It helps in developing a deep understanding of the dataset, identifying inconsistencies, and preparing the data for further analysis. In this section, we carry out a detailed EDA process on the NYSERDA Electric Vehicle Rebate dataset. Each step is described below, followed by its corresponding code and output screenshots.
### Step 1: Viewing Basic Information of the Dataset

The first step is to explore the fundamental structure of the dataset using the `.info()` method. This command reports the total number of entries (rows), the number of columns (features), the data type of each column (e.g., integer, float, string), and the number of non-null (non-missing) values in each column. This overview helps us gauge the completeness and consistency of the data and gives a general idea of how to proceed with further cleaning or transformation.

Code Used:
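Since the actual call appears only in the screenshot, here is a minimal runnable sketch. The small DataFrame below is a hypothetical stand-in for the real CSV; in the project itself, `df` would come from `pd.read_csv` on the downloaded file.

```python
import pandas as pd

# Hypothetical stand-in for the rebate dataset; the real project would use
# df = pd.read_csv("nyserda_rebate_data.csv")
df = pd.DataFrame({
    "Make": ["Tesla", "Toyota", "Chevrolet"],
    "County": ["Kings", "Albany", "Erie"],
    "Rebate Amount (USD)": [2000.0, 500.0, 2000.0],
})

# .info() prints row/column counts, dtypes, and non-null counts per column
df.info()
```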
Screenshot and Output:
### Step 2: Counting Unique Values in Each Column

To gain insight into the uniqueness of values across columns, we use the `.nunique()` method. This step tells us how many distinct values exist in each column, which is particularly useful for identifying categorical variables and understanding their diversity. Knowing which columns have few or many unique values is crucial when deciding how they can be grouped, visualized, or transformed in later steps.

Code Used:

```python
df.nunique()
```

Screenshot and Output:
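As a runnable illustration, `.nunique()` applied to a tiny hypothetical frame (standing in for the real dataset) returns one count per column:

```python
import pandas as pd

# Hypothetical sample: two makes repeated across three rows
df = pd.DataFrame({
    "Make": ["Tesla", "Toyota", "Tesla"],
    "County": ["Kings", "Albany", "Kings"],
})

unique_counts = df.nunique()  # distinct values per column
print(unique_counts)          # Make: 2, County: 2
```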
### Step 3: Descriptive Statistics of Numeric Columns

We use the `.describe()` method to generate a summary of all numeric columns in the dataset. This includes key statistics such as the mean, minimum, maximum, standard deviation, and percentiles for each numeric field. This summary gives us a foundational understanding of the data distribution, potential outliers, and the scale of each feature.

Code Used:

```python
print("Summary statistics of numeric columns:\n")
print(df.describe())
```

Screenshot and Output: Statistical Summary
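On a hypothetical column of four rebate amounts, `.describe()` produces the count, mean, spread, and quartiles:

```python
import pandas as pd

# Hypothetical rebate amounts, not real data
df = pd.DataFrame({"Rebate Amount (USD)": [500, 1000, 2000, 500]})

summary = df.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(summary)
```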
### Step 4: Cleaning and Standardizing Column Names

For consistency and ease of reference, we clean up the column names by removing unnecessary characters and standardizing their format: spaces are replaced with underscores and parentheses are removed. These changes help avoid syntax errors and make the code more readable during analysis. Passing `regex=False` ensures the parentheses are treated as literal characters rather than regular-expression syntax (on older pandas versions, the default would otherwise raise an error for an unmatched `(`).

Code Used:

```python
df.columns = (
    df.columns.str.strip()
    .str.replace(" ", "_", regex=False)
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
)
```

Screenshot of the code:
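On a hypothetical pair of raw column names resembling those in the dataset, the cleaning step behaves like this:

```python
import pandas as pd

# Hypothetical raw headers with stray spaces and parentheses
df = pd.DataFrame(columns=[" Rebate Amount (USD) ", "Vehicle Make"])

# strip whitespace, replace spaces with underscores, drop parentheses
df.columns = (
    df.columns.str.strip()
    .str.replace(" ", "_", regex=False)
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
)

print(list(df.columns))  # ['Rebate_Amount_USD', 'Vehicle_Make']
```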
### Step 5: Renaming Key Columns for Simplicity

Some column names in the dataset are long and complex. To simplify further analysis and improve readability, we rename a few important columns with more concise titles:

- `Annual_GHG_Emissions_Reductions_MT_CO2e` is renamed to `GHG_Reductions`
- `Annual_Petroleum_Reductions_gallons` is renamed to `Petroleum_Reductions`
- `Rebate_Amount_USD` is renamed to `Rebate_Amount`

These new names retain the original meaning but are easier to use in analysis and visualization.

Code Used:

```python
df.rename(columns={
    "Annual_GHG_Emissions_Reductions_MT_CO2e": "GHG_Reductions",
    "Annual_Petroleum_Reductions_gallons": "Petroleum_Reductions",
    "Rebate_Amount_USD": "Rebate_Amount"
}, inplace=True)
```

Updated Columns Screenshot: Renaming Key Columns for Simplicity
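A self-contained sketch of the rename, using an empty frame that carries only the (already cleaned) column names:

```python
import pandas as pd

# Frame with only headers, standing in for the cleaned dataset
df = pd.DataFrame(columns=[
    "Annual_GHG_Emissions_Reductions_MT_CO2e",
    "Annual_Petroleum_Reductions_gallons",
    "Rebate_Amount_USD",
])

# map long names to the concise titles used in the rest of the analysis
df.rename(columns={
    "Annual_GHG_Emissions_Reductions_MT_CO2e": "GHG_Reductions",
    "Annual_Petroleum_Reductions_gallons": "Petroleum_Reductions",
    "Rebate_Amount_USD": "Rebate_Amount",
}, inplace=True)

print(list(df.columns))  # ['GHG_Reductions', 'Petroleum_Reductions', 'Rebate_Amount']
```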
### Step 6: Identifying Missing Values

To ensure the dataset is complete and reliable, we check for missing (null) values using `.isnull().sum()`, which counts the missing entries in each column. Only the columns that contain missing values are displayed, allowing us to focus cleaning or imputation on those specific areas. Knowing where and how much data is missing helps us decide whether to fill in values, drop rows, or handle the gaps through other data-cleaning methods.

Code Used:

```python
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]
print("Missing values in columns:\n")
print(missing_values)
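The same check, runnable end-to-end on a hypothetical frame with a few deliberately missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one missing Make, one missing Rebate_Amount
df = pd.DataFrame({
    "Make": ["Tesla", None, "Toyota"],
    "Rebate_Amount": [2000.0, 500.0, np.nan],
    "County": ["Kings", "Albany", "Erie"],
})

missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]  # keep only columns with gaps
print(missing_values)  # Make and Rebate_Amount each report 1 missing value
```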
### Step 7: Exploring Categorical Features

To understand the categorical (non-numeric) variables, we count the unique values in each object-type column. This reveals how many different values exist within each feature, such as vehicle makes, models, dealership names, or counties. This is especially useful when deciding how to visualize these categories or whether they need to be grouped for better interpretation.

Code Used:

```python
for col in df.select_dtypes(include='object').columns:
    unique_count = df[col].nunique()
    print(f"{col}: {unique_count} unique value{'s' if unique_count > 1 else ''}")
```

Screenshot and Output:
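A runnable illustration on a hypothetical frame: the numeric column is skipped by `select_dtypes(include='object')`, and only the string column is counted:

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
df = pd.DataFrame({
    "Make": ["Tesla", "Toyota", "Tesla"],
    "Rebate_Amount": [2000, 500, 2000],
})

cat_counts = {}
for col in df.select_dtypes(include="object").columns:
    cat_counts[col] = df[col].nunique()
    print(f"{col}: {cat_counts[col]} unique value{'s' if cat_counts[col] > 1 else ''}")
```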
### Step 8: Final Review of Column Names

As a final step in data preparation, we print all current column names to verify that the earlier cleaning and renaming changes have been applied correctly. This provides a clear view of the structure we will work with in the detailed analysis phase.

Code Used:

```python
print(df.columns)
```

Screenshot and Output:
## Section 4: Exploratory Data Analysis and Visualization
### 4.0 Correlation and Covariance Analysis of Rebate Amount
#### i. Introduction

Before diving into visual interpretations, it is crucial to understand the numerical relationships between key features in the dataset. Specifically, we are interested in how the rebate amount, a critical financial incentive, relates to other quantitative variables such as GHG reductions and petroleum savings. This section uses correlation and covariance to explore these associations.

#### ii. General Description

The correlation coefficient assesses the strength and direction of a linear relationship between two variables. Covariance, on the other hand, measures how much two variables change together, but it is not normalized and is influenced by the scale of the variables. Together, these two metrics provide a strong foundation for interpreting the dynamics of rebate distribution and its connection to environmental benefits. We first clean the data by removing rows with missing rebate amounts, then select all numeric columns and compute the correlation and covariance of each with respect to `Rebate_Amount`.

#### iii. Code Used
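The code itself appears only in the screenshot; the sketch below reproduces the described steps (drop missing rebates, select numeric columns, compute correlation and covariance against `Rebate_Amount`) on a small hypothetical frame. The values are illustrative, not real rebate data.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric sample; GHG_Reductions is deliberately a perfect
# linear function of Rebate_Amount, so their correlation comes out as 1.0
df = pd.DataFrame({
    "Rebate_Amount": [2000.0, 1000.0, np.nan, 500.0],
    "GHG_Reductions": [2.5, 1.5, 2.0, 1.0],
    "Petroleum_Reductions": [400.0, 250.0, 300.0, 150.0],
})

clean = df.dropna(subset=["Rebate_Amount"])       # drop rows missing the rebate
numeric = clean.select_dtypes(include="number")   # keep only numeric columns

correlations = numeric.corr()["Rebate_Amount"]    # normalized, in [-1, 1]
covariances = numeric.cov()["Rebate_Amount"]      # joint variability, scale-dependent
print(correlations)
print(covariances)
```

Correlation is unitless and directly comparable across features, while the covariance values carry the units of both variables, which is why the text treats correlation as the primary measure.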
#### iv. Specific Requirements, Functions, and Formulas

- `dropna(subset=['Rebate_Amount'])`: removes rows with missing rebate amounts.
