How to Use ChatGPT Advanced Data Analysis
Introduction
ChatGPT Advanced Data Analysis (formerly ChatGPT Code Interpreter) began as a plug-in and has since been rolled into OpenAI’s GPT-4 implementation. It addresses several of ChatGPT’s shortcomings. Before Advanced Data Analysis, the basic ChatGPT model had these limitations:
- It couldn’t run code, so the code it produced could have errors or bugs.
- It didn’t allow for the uploading of data.
- It could not produce charts or graphs.
- It was also bad at mathematical questions.
Advanced Data Analysis addressed all these issues. It can run Python code in its sandbox. It allows us to upload data to ChatGPT so it can generate insights. It can also create visualizations based on that data. And finally, it can interpret complex math problems for us. In the following sections, we will look at all these features, and discuss some of their limitations.
Accessing Advanced Data Analysis
To access Advanced Data Analysis, you need a ChatGPT Plus subscription, which currently costs $20 USD a month. To get one, go to chat.openai.com, click the “sign up” button, and follow the prompts to create a ChatGPT account. If you already have a free ChatGPT account, select “Upgrade” from the lower left of the home screen to open a dialog where you can sign up for ChatGPT Plus. A “Plus” subscription gives you access to GPT-4, which now has Advanced Data Analysis built in. (Advanced Data Analysis used to be a separate plugin you needed to enable.) To use GPT-4 and Advanced Data Analysis, select the GPT-4 model in the upper right of the home screen.
You can tell it is enabled when a paperclip icon appears in the chat box. This is how we’ll upload files later.
Generating Code and Using Mathematics
The original name for Advanced Data Analysis was “Code Interpreter”. As this implies, it is a powerful code generation tool. The original ChatGPT model could generate code, but it wasn’t trustworthy. The code it generated might have bugs, syntax errors, or calls to non-existent functions. With Advanced Data Analysis, ChatGPT can run the code it generates in its own sandbox and debug it itself. It is limited in that it currently can only work with Python code, but it is still a major step forward.
Getting ChatGPT to generate code is as simple as asking it to write a function for you.
To see the code of the function, click on the “view analysis” link at the end of the response. This will open a window showing the code ChatGPT generated.
At the top of the window, you can see a link to copy the code that was generated. The following block lists the code this prompt generated:
def fibonacci(n):
    """
    Generate the first n numbers in the Fibonacci sequence.

    Args:
        n (int): The number of terms in the Fibonacci sequence to generate.

    Returns:
        list: A list containing the first n numbers of the Fibonacci sequence.

    Raises:
        ValueError: If n is not a positive integer.
    """
    if not isinstance(n, int) or n <= 0:
        raise ValueError("Input must be a positive integer.")

    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence

# Example usage of the function
try:
    print(fibonacci(10))  # Should print the first 10 Fibonacci numbers
    print(fibonacci(-1))  # Should raise a ValueError
except ValueError as e:
    print(e)
Scrolling to the bottom of the window, you’ll see the output for the code it ran.
You can see that ChatGPT ran a couple of basic tests on the function. If you’re not satisfied with that, you can ask it to run further tests on the function it generated.
Again, you can click on the “view analysis” link at the end of the response to view the code that ran the tests.
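If you would rather check the function yourself, similar tests can be run locally. The following is a minimal sketch that repeats the logic of the generated function and adds a few assertions. The cases chosen here, and the helper name run_extra_tests, are illustrative assumptions and not the tests ChatGPT produced.

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers (same logic as the generated function)."""
    if not isinstance(n, int) or n <= 0:
        raise ValueError("Input must be a positive integer.")
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence


def run_extra_tests():
    """Run a few illustrative edge-case checks and return how many passed."""
    passed = 0

    # Smallest valid inputs
    assert fibonacci(1) == [0]
    assert fibonacci(2) == [0, 1]
    passed += 2

    # Each term after the first two is the sum of the previous two
    seq = fibonacci(20)
    assert all(seq[i] == seq[i - 1] + seq[i - 2] for i in range(2, len(seq)))
    passed += 1

    # Invalid inputs should raise ValueError
    for bad in (0, -5, 2.5, "10"):
        try:
            fibonacci(bad)
        except ValueError:
            passed += 1
        else:
            raise AssertionError(f"expected ValueError for {bad!r}")

    return passed


print(run_extra_tests(), "checks passed")
```

Tests like these can also surface behavior you may not expect. For example, because Python’s bool is a subclass of int, fibonacci(True) passes the isinstance check and returns [0] rather than raising an error.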
Advanced Data Analysis can also use Python code to solve mathematical problems. Here’s an example of handing it a text problem.
Here is ChatGPT’s response:
You can see the work it did to get this answer by clicking on the “view analysis” link at the end of the response. The code for the above looks like this:
# Calculation
combined_speed = 50 + 60  # km/h
remaining_distance = 75  # km

# Time for the cars to meet
time_to_meet = remaining_distance / combined_speed  # hours

# Since Car 2 starts at 6:30 AM, we add this time to find the meeting time
meeting_time_hour = 6.5 + time_to_meet  # 6:30 AM is 6.5 in 24-hour format

# Distance Car 1 travels until they meet
distance_from_city_A = 50 * (0.5 + time_to_meet)  # Car 1's speed * total time it travels

time_to_meet, meeting_time_hour, distance_from_city_A
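The generated code reports the meeting time as a decimal hour, which is awkward to read. As a small sketch, the same numbers can be converted to a clock time with Python’s datetime module. The speeds, the 75 km remaining distance, and the 6:30 AM start are taken from the code above; the calendar date used below is arbitrary and only there because datetime requires one.

```python
from datetime import datetime, timedelta

# Values taken from the generated code above
combined_speed = 50 + 60          # km/h
remaining_distance = 75           # km
time_to_meet = remaining_distance / combined_speed  # hours after 6:30 AM

# The calendar date is arbitrary; only the time of day matters here
car_2_start = datetime(2000, 1, 1, 6, 30)
meeting_time = car_2_start + timedelta(hours=time_to_meet)

# Print the time of day in 24-hour format
print(meeting_time.strftime("%H:%M:%S"))
```

Under these assumptions the cars meet at roughly 7:11 AM, which you can compare against the answer ChatGPT gives in its response.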
Analyzing Data
As its name suggests, Advanced Data Analysis is a powerful tool to gain insights into data. We can upload various file types to ChatGPT, from images to PDFs, but Advanced Data Analysis is designed to primarily use .txt and .csv files. While this is a powerful tool, you should be aware of some caveats before you start uploading data:
- ChatGPT doesn’t know where the data comes from or what assumptions may have been used to produce it. Therefore, generated insights may be flawed for subtle reasons that require subject matter expertise to understand.
- ChatGPT hides the steps of the analysis (though you can view them) so it becomes easier to miss key mistakes. Checking ChatGPT’s analysis thoroughly can be a lengthy and involved process.
- Uploading data to ChatGPT comes with serious privacy concerns.
You can mitigate these issues by doing the following:
- Ensure column names accurately reflect the data.
- Know the data thoroughly yourself: how it was collected, how it was cleaned, and where it is still messy.
- Provide any key context within the prompt.
- Analyze ChatGPT’s code and output like an adversary, searching to find what’s wrong with it.
- Never upload closed data, especially data with PII.
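Some of these checks can be partly automated before you upload anything. The sketch below scans a CSV header for column names that hint at personal information. The keyword list PII_HINTS and the function name flag_pii_columns are hypothetical, and a name-based scan is crude: it can miss PII stored under an innocuous column name and can flag harmless columns, so treat it as a first pass and not a guarantee.

```python
import csv
import io

# Hypothetical keyword list: adjust it to your own data and privacy rules
PII_HINTS = ("name", "email", "phone", "address", "ssn", "birth")


def flag_pii_columns(csv_file):
    """Return header names that contain any of the PII_HINTS keywords."""
    reader = csv.reader(csv_file)
    header = next(reader, [])
    return [
        column
        for column in header
        if any(hint in column.lower() for hint in PII_HINTS)
    ]


# Tiny made-up sample standing in for a real file
sample = io.StringIO("user_email,app,rating\nsomeone@example.com,Notes,4.5\n")
print(flag_pii_columns(sample))
```

To run this against a real file, you could pass a file object opened with open(path, newline="", encoding="utf-8") in place of the in-memory sample.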
To demonstrate Advanced Data Analysis, we’ll go to Kaggle and download a dataset of Google Play Store Apps. (If you don’t have a Kaggle account, you’ll have to register. It’s free.) Once you download the archive.zip file, unzip it into three files. The one we’ll be using is googleplaystore.csv. Once you have this file locally, you can upload it to ChatGPT using the paperclip icon in the chat box we saw earlier.
After loading the file, we can prompt for insights into the data:
By having ChatGPT describe the dataset, you can confirm it “understands” what the data means, as a check on the insights it provides. The insights provided for this dataset look like this:
As before, you can click on the “view analysis” link at the end of the response to see the Python code driving these insights so you can check on ChatGPT’s work.
You can also ask ChatGPT to help clean your data. If we look at another dataset from NASA about meteorite landings, we can ask ChatGPT the following:
As far as data cleaning is concerned, this is ChatGPT’s response:
In addition to asking it to spot potential issues for you, you can ask ChatGPT to clean the data for you.
By clicking on the “view analysis” link you can see the Python code ChatGPT generated to clean our dataset. This is the code ChatGPT generated to clean the Meteorite Landings dataset:
from datetime import datetime

# Current year
current_year = datetime.now().year

# Removing rows based on specified conditions
cleaned_meteorite_data = meteorite_data.copy()

# Removing rows where year is prior to 2000, blank, or after the current year
cleaned_meteorite_data = cleaned_meteorite_data[cleaned_meteorite_data['year'].between(2000, current_year, inclusive='both')]

# Removing rows where Longitude and Latitude are blank, both zero, or duplicate other entries
cleaned_meteorite_data = cleaned_meteorite_data.dropna(subset=['reclat', 'reclong'])
cleaned_meteorite_data = cleaned_meteorite_data[~((cleaned_meteorite_data['reclat'] == 0) & (cleaned_meteorite_data['reclong'] == 0))]
cleaned_meteorite_data = cleaned_meteorite_data.drop_duplicates(subset=['reclat', 'reclong'])

# Removing rows where mass is 0 or missing
cleaned_meteorite_data = cleaned_meteorite_data[cleaned_meteorite_data['mass (g)'] > 0]

# Renaming the dataset
cleaned_meteorite_data.name = "Cleaned Meteorite Landings"

cleaned_meteorite_data.head(), cleaned_meteorite_data.shape
You can even ask ChatGPT to provide a downloadable link of the cleaned dataset.
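Once you have the cleaned file, it is worth confirming that it actually satisfies the rules you asked for. The sketch below re-checks the conditions from the cleaning code above using only the standard library. The column names year, reclat, reclong, and mass (g) come from that code; the function name find_rule_violations and the sample rows are made up for illustration.

```python
import csv
import io
from datetime import datetime


def find_rule_violations(csv_file, current_year=None):
    """Return (row_number, reason) pairs for rows that break the cleaning rules."""
    if current_year is None:
        current_year = datetime.now().year

    problems = []
    seen_coordinates = set()
    # Row 1 is the header, so data rows start at 2
    for row_number, row in enumerate(csv.DictReader(csv_file), start=2):
        try:
            year = float(row["year"])
            lat = float(row["reclat"])
            lon = float(row["reclong"])
            mass = float(row["mass (g)"])
        except (KeyError, TypeError, ValueError):
            problems.append((row_number, "missing or non-numeric value"))
            continue

        if not 2000 <= year <= current_year:
            problems.append((row_number, "year out of range"))
        if lat == 0 and lon == 0:
            problems.append((row_number, "coordinates are both zero"))
        if (lat, lon) in seen_coordinates:
            problems.append((row_number, "duplicate coordinates"))
        seen_coordinates.add((lat, lon))
        if mass <= 0:
            problems.append((row_number, "mass is not positive"))
    return problems


# Made-up rows: one clean, three that break different rules
sample = io.StringIO(
    "name,year,reclat,reclong,mass (g)\n"
    "A,2005,10.5,20.5,100\n"
    "B,1999,11.0,21.0,50\n"
    "C,2010,10.5,20.5,0\n"
    "D,2012,,30.0,12\n"
)
for problem in find_rule_violations(sample, current_year=2024):
    print(problem)
```

An empty result on the downloaded file suggests the cleaning did what was requested. It does not tell you whether the rules themselves were the right ones for your analysis.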
Note: When using the Advanced Data Analysis feature, you may occasionally run into an “Error Analyzing” message. (Clicking on the message will show the code it was trying to run.) This can be due to messy data such as inconsistent data types in columns. But if this happens repeatedly without an obvious data error, try moving your analysis to a fresh chat. If that doesn’t solve the problem, it is possibly the load on the ChatGPT servers, and you should wait a little while and try again.
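If you suspect inconsistent data types are behind the error, you can look for them yourself before retrying. The following sketch counts, for each column of a CSV, how many non-blank values parse as numbers and how many do not, and reports the columns that contain both. The function names and the sample data are hypothetical, and treating anything float() accepts as numeric is a simplifying assumption.

```python
import csv
import io


def looks_numeric(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def find_mixed_columns(csv_file):
    """Return {column: (numeric_count, text_count)} for columns that mix both."""
    counts = {}
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            if column is None or value is None or value.strip() == "":
                continue  # skip overflow fields and blanks
            numeric, text = counts.get(column, (0, 0))
            if looks_numeric(value):
                numeric += 1
            else:
                text += 1
            counts[column] = (numeric, text)
    # Keep only columns that have at least one numeric and one text value
    return {c: nt for c, nt in counts.items() if nt[0] and nt[1]}


# Made-up sample with two columns that mix numbers and text
sample = io.StringIO(
    "app,rating,reviews\n"
    "Notes,4.5,120\n"
    "Maps,4.1,3.0M\n"
    "Chat,n/a,88\n"
)
print(find_mixed_columns(sample))
```

Columns flagged this way are good candidates to mention in your prompt, for example by telling ChatGPT how you want the non-numeric entries handled.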
Visualizing Data
Because ChatGPT uses Python, it can use Python’s tools to create visualizations of the data. You just need to prompt it to do so. Let’s go back to the googleplaystore.csv dataset and ask for one.
ChatGPT produced a visualization based on the prompt and even did some data cleanup along the way. In real-world applications, it is best practice to clean your data thoroughly (perhaps with ChatGPT’s help) before asking for analysis, but this is a nice illustration of how ChatGPT will try to accommodate you. It is also a warning: ChatGPT will take it upon itself to “understand” the data, so you need to keep an eye on what it is doing by clicking the “view analysis” link.
You can specify the type of visualization in the prompt as well:
If you click on “view analysis” for this response, it will show you the specific code ChatGPT used to create the graph. In this case it looks like this:
# Filtering data for the 'Game' category
game_category_data = data_clean[data_clean['Category'] == 'GAME']

# Converting 'Reviews' to numeric
game_category_data['Reviews'] = pd.to_numeric(game_category_data['Reviews'], errors='coerce')

# Scatter plot
plt.figure(figsize=(12, 8))
sns.scatterplot(data=game_category_data, x='Reviews', y='Installs', alpha=0.6)
plt.title('Number of Installs vs Number of Reviews for Games Category')
plt.xlabel('Number of Reviews')
plt.ylabel('Number of Installs')
plt.xscale('log')  # Using logarithmic scale for better visibility
plt.yscale('log')
plt.grid(True)
plt.show()
This is particularly useful for generating code for complex visualizations that we can then edit, tweaking the settings for our particular needs.
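One detail worth checking when you review code like this is the errors='coerce' argument to pd.to_numeric, which turns any value that cannot be parsed into NaN instead of raising an error. The plain-Python sketch below approximates that idea so you can count how many values would be silently dropped from a plot. It is not pandas’ implementation, its parsing rules differ in the details, and the sample review counts are made up.

```python
import math


def coerce_to_number(value):
    """Mimic the coercion idea: return a float, or NaN if parsing fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def count_coerced(values):
    """Count how many values would silently turn into NaN."""
    return sum(1 for v in values if math.isnan(coerce_to_number(v)))


# Made-up review counts; "3.0M" is the kind of entry that fails to parse
reviews = ["159", "967", "3.0M", "87510"]
print(count_coerced(reviews), "of", len(reviews), "values would become NaN")
```

If the count is larger than you expect, ask ChatGPT how it handled those rows before trusting the chart.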
It’s also possible to ask ChatGPT to visualize its insights into the data.
For this example, we asked ChatGPT to clean the data for us, and we can see, like before, that it had a little trouble initially before coming up with the visualization. Even so, it provided us with an insight into ratings distributions in the Google Play Store.
Conclusion
In this tutorial, we covered the basics of using Advanced Data Analysis in ChatGPT: generating and testing code, solving mathematical problems, uploading data for analysis, gaining insights, and cleaning data. Finally, we illustrated the visualization capabilities available through Advanced Data Analysis. This should give you the foundation you need to explore ChatGPT’s data analysis tools.
Author
The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.