Getting Started with Using Generative AI for Analytics
Introduction
Generative AI for analytics is possible due to the vast number of AI tools available. Instead of creating custom algorithms or manually parsing through data, we can provide data sets to AI and let it do the hard work for us. In this article, we will combine Jupyter Notebook and OpenAI to analyze real-world data sets.
We will use Jupyter Notebook for creating and sharing computational documents. If you aren’t familiar with Jupyter Notebook, pandas, or Python, try out our course Getting Started with Python for Data Science to learn about these topics.
How to Set Up OpenAI in a Jupyter Notebook
Go through this guide if you don’t have Jupyter Notebook installed
Open a terminal and type `pip3 install jupyter`. This command will install Jupyter Notebook to your local device. Once complete, let’s make sure it is installed. Type `jupyter notebook` in a terminal to begin a session. You should get a webpage with your current directory contents listed.
With Jupyter Notebook installed, let’s create a new Terminal session.
Once the terminal is open, install the OpenAI library using the `pip3 install openai` command. With the library installed, open a new console, and add a statement to import OpenAI.
import openai
We must create an API key to use the OpenAI API, which we can do on the OpenAI API website. Here is a demonstration of how to create an API key.
Note: OpenAI API is not free. It requires a minimum of $5 (USD) to use. If you opt to follow along in this tutorial, the $5 (USD) worth of credits is more than enough.
The secret key will only be accessible one time, at the time of creation. Make sure to save it for future use.
Set your secret key so you can run OpenAI API function calls.
openai.api_key = '<API_KEY>'
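Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported as an environment variable before Jupyter starts (the variable name `OPENAI_API_KEY` and the helper `load_api_key` are assumptions for illustration, not part of the original walkthrough):

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(
            f"Set the {env_var} environment variable before starting Jupyter."
        )
    return key


# Usage in the notebook (assumes the variable is set):
# openai.api_key = load_api_key()
```

With this approach the notebook itself never contains the secret, so it can be shared as-is.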
Great work! OpenAI is prepared and ready to use in our Jupyter Notebook. We’ll use the OpenAI API to request code for analytics and to help analyze real-world data.
Jupyter Notebook with OpenAI Example
Let’s use OpenAI with Jupyter Notebook to analyze some real-world data. We’ll begin by creating a helper function that allows us to prompt OpenAI. We’ll then prompt OpenAI to help us write code that analyzes some college football data.
# import packages
import openai

def get_response(prompt, system = '', temperature = 1.0):
    response = openai.ChatCompletion.create(
        model = "gpt-3.5-turbo",
        messages = [
            {"role": "user", "content": prompt},
            {"role": "system", "content": system}
        ],
        temperature = temperature
    )
    return response['choices'][0]['message']['content']

# set a system prompt
system_prompt = '''You are a helpful AI assistant for data analysis, providing commented code for human review.
You put any code inside a Python code block in Markdown.
You include a bulleted code explanation after the code.'''
The code above imports the OpenAI library so we can make API calls; the API key we set earlier tells OpenAI who is making the request. We create a function `get_response(prompt, system, temperature)`, where `prompt` is the request we send, `system` is a baseline message sent along with every `prompt`, and `temperature` controls randomness on a scale from `0` to `2`, with larger values producing more random output. The function calls the OpenAI library to generate a response from the system message and the user prompt. We provide the system prompt each time because, unlike ChatGPT, the OpenAI API does not remember earlier requests in a conversation.
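The `openai.ChatCompletion` interface used above was removed in version 1.0 of the `openai` package, so a fresh install may raise an error on that call. The sketch below shows how the same helper might look with the newer client object, assuming `openai>=1.0`. The names `build_messages`, `get_response_v1`, and the `history` and `client` parameters are illustrative additions, not part of the original article. It places the system message first, which is the conventional order, and can resend earlier turns because the API keeps no memory between requests.

```python
def build_messages(prompt, system="", history=None):
    """Assemble the message list for one request.

    history is an optional list of earlier {"role": ..., "content": ...}
    dicts (a hypothetical parameter) resent so the model sees prior turns.
    """
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.extend(history or [])
    messages.append({"role": "user", "content": prompt})
    return messages


def get_response_v1(client, prompt, system="", temperature=1.0,
                    model="gpt-3.5-turbo"):
    """Send one chat request through an openai>=1.0 client; return the reply text."""
    # Keep temperature inside the 0 to 2 range described above.
    temperature = max(0.0, min(2.0, temperature))
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(prompt, system),
        temperature=temperature,
    )
    return response.choices[0].message.content


# Usage (assumes the openai package, version 1.0 or later, and a valid key):
# from openai import OpenAI
# client = OpenAI(api_key="<API_KEY>")
# print(get_response_v1(client, "Say hello", system_prompt))
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before spending any API credits.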
With OpenAI prepared, let’s import some college football data and use OpenAI to help us better visualize it. Go to Kaggle. Press the download button at the top-right of the page to download the data locally. Place that file in your Jupyter Notebook directory and rename it `data.csv`. Go back to the console and import the data.
import pandas as pd
football = pd.read_csv('data.csv')
We can do a quick console visualization in table format with the command `football.head()`.
Let’s clean up the data to focus on team rankings, wins, and losses. We’ll modify the data to create a new table that includes Rank, Team, Wins, and Losses, each as its own column.
# Create two new columns "Win" and "Loss"
football[['Win', 'Loss']] = football['Win-Loss'].str.split('-', expand=True)
# Remove the original "Win-Loss" column
football.drop('Win-Loss', axis=1, inplace=True)
# Create new table with the specified data
new_order = ['Off Rank', 'Team', 'Win', 'Loss']
football = football.reindex(columns=new_order)
# Set Wins and Losses to type `Int` instead of `String`
football["Win"] = football["Win"].apply(int)
football["Loss"] = football["Loss"].apply(int)
We split `football['Win-Loss']` into individual `Win` and `Loss` columns and dropped the original `Win-Loss` column. We then created a new table with a specific subset of the original data. Lastly, we modified the `Win` and `Loss` columns to be `int` instead of `str`.
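To see what the split does, here is a small sketch on made-up rows. The team names and records are invented for illustration and do not come from the Kaggle file. It uses `astype(int)`, a vectorized alternative to `apply(int)`.

```python
import pandas as pd

# Made-up sample rows that mimic the columns used above.
sample = pd.DataFrame({
    "Team": ["Team A", "Team B"],
    "Off Rank": [1, 2],
    "Win-Loss": ["10-3", "4-8"],
})

# Split "10-3" into two string columns, then convert both to integers.
sample[["Win", "Loss"]] = (
    sample["Win-Loss"].str.split("-", expand=True).astype(int)
)

# Drop the combined column now that it has been split.
sample = sample.drop(columns="Win-Loss")

print(sample)
```

Converting to integers matters for the later filter: comparing the values as strings would sort them alphabetically, so `"10"` would count as less than `"3"`.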
Let’s use OpenAI to help us set up code to visualize the data. We’ll prompt OpenAI to create a single bar graph where each team has the wins and losses on the same bar. Additionally, we only want to see teams that have a higher win rate than loss rate.
Prompt (in console):
message_prompt = '''Generate Seaborn code for producing a single bar graph.
Include matplotlib and seaborn import statements. Pandas has already been imported.
Create a single bar graph for each Team for their Win and Loss columns. I want the Wins and Losses on the same bar. Win should be green and Loss should be red.
Let's only include teams that have a higher win rate than their loss rate
Include code for title and axis labels.'''

## This code will send the request and display the response
gpt_response = get_response(message_prompt,system_prompt)
print(gpt_response)
Response:
import matplotlib.pyplot as plt
import seaborn as sns
# Filter out teams with a higher win rate than loss rate
filtered_df = df[df['Win'] > df['Loss']]
# Create a bar graph with seaborn
sns.set(style='whitegrid')
plt.figure(figsize=(10, 5))
sns.barplot(x='Team', y='Win', data=filtered_df, color='green', label='Wins')
sns.barplot(x='Team', y='Loss', data=filtered_df, color='red', label='Losses')
# Set title and axis labels
plt.title('Wins and Losses by Team')
plt.xlabel('Team')
plt.ylabel('Count')
# Show the legend
plt.legend()
# Display the plot
plt.show()
We didn’t tell OpenAI that our data is stored in `football` rather than `df`, a common shorthand for a DataFrame. Make sure to change every standalone `df` to `football`, leaving names such as `filtered_df` as they are.
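Because the system prompt asks for code inside a Markdown Python block, the code can also be pulled out of the response and renamed programmatically instead of by hand. This is a rough sketch under that assumption. `extract_code` and `rename_variable` are hypothetical helpers, and generated code should still be reviewed before it is run.

```python
import re

# Three backticks, built this way so the Markdown fence is easy to read here.
FENCE = "`" * 3


def extract_code(response_text):
    """Return the contents of the first Markdown Python code block, or ''."""
    pattern = re.escape(FENCE) + r"python\s*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response_text, flags=re.DOTALL)
    return match.group(1) if match else ""


def rename_variable(code, old="df", new="football"):
    """Replace whole-word occurrences of old; names like filtered_df are untouched."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


# Usage with the response from the prompt above:
# print(rename_variable(extract_code(gpt_response)))
```

The word-boundary pattern is what keeps `filtered_df` intact. Note that it would also rewrite a standalone `df` that appears inside a string or comment.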
import matplotlib.pyplot as plt
import seaborn as sns
# Filter out teams with a higher win rate than loss rate
filtered_df = football[football['Win'] > football['Loss']]
# Create a bar graph with seaborn
sns.set(style='whitegrid')
plt.figure(figsize=(10, 5))
sns.barplot(x='Team', y='Win', data=filtered_df, color='green', label='Wins')
sns.barplot(x='Team', y='Loss', data=filtered_df, color='red', label='Losses')
# Set title and axis labels
plt.title('Wins and Losses by Team')
plt.xlabel('Team')
plt.ylabel('Count')
# Show the legend
plt.legend()
# Display the plot
plt.show()
Wow, look at those axes! Let’s change the size of the graph to `figsize=(105, 30)` to get a more even image.
Here is the final visualization illustrating all college football teams in 2022 that had more wins than losses.
Conclusion
Great work! We were able to set up OpenAI with Jupyter Notebook. Once set up, we imported some college football data to analyze. We asked OpenAI for assistance in developing code for visualizing the football data. With minor corrections, we produced a great graph showcasing the teams with more wins than losses!
To see what else you can do with ChatGPT (or generative AI in general), check out some of the articles located here: AI Articles.
Author
The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.