How to tame ChatGPT for Data

For the past few weeks, the internet has been going crazy over a new AI tool called ChatGPT. It is essentially a chatbot, but one that has incredible reading comprehension and is trained on almost every topic you can think of. You can ask it questions, and as long as the internet holds some knowledge or even an opinion on the subject, it will usually have something to say about it.

With such a powerful tool available, many sectors now have to reassess their automation and job definitions. I do not expect anyone who is good at their job to have anything to worry about. Instead, they have a lot to be hyped about, because this tool can speed up and optimize almost any workflow you can think of. In this post, we will go over some of the most impactful ways to use ChatGPT for data science and data analytics.

Perspective & Approach to ChatGPT

Before starting the list, I want to share my approach to ChatGPT. It is a tool that deals with knowledge, not one that can make decisions. The decisions we humans make are the one thing that AI will not be able to replicate. For example, I can give it all the criteria I am looking for when buying a house, describe three different houses and ask it to rank them. However, it cannot tell me which is the best option, because “best” is subjective. I may want a big house, but if one place we toured had great sunlight all day long, I might pick that one even though it is smaller. Not only that, but on any given day my opinion on what is important to me might change. Not completely, of course, but I can compromise. This is the one thing that ChatGPT cannot do.

However, what it does best is help you gain all the information you need to make that decision. I can ask it these three questions:

  • What is the average price of a house in [city name]?
  • Explain to me step-by-step the process of buying a house in [country]
  • How much mortgage can I expect to get if I am making £X per year, live in [city name] and have [amount] saved?

And suddenly, I have much more information to base my decision on, before even looking at any pictures of properties! Asking ChatGPT the right questions can speed up your learning and research immensely, and this is the perspective we will take when listing the potential use cases of ChatGPT for our domain: data.

1 – Find Data Sources

We don’t usually get exposed to new data when working at an established company. There are a few sources of information that we use every day, and as long as nothing drastic changes (and you are nice to your data engineers) you can spend years without seeing a new data source. However, for side projects, research, and practice, it is very difficult to find specific data in the shape we want.

The process of finding data to practice on usually goes like this: you google “data sources for a clustering model”, see a few options from various niche areas you do not know much about, click one and get bombarded with account creation and signup forms, only to find that the “free” data is a limited slice of what you are looking for.

If this process sounds familiar, you will find ChatGPT very useful! The process completely changes: instead of starting from how you want to analyse, you start from what you want to analyse. You may want the average height of football players and their positions for a visualization you have in mind. Well, here you go:

VERY IMPORTANT DISCLAIMER: DO NOT SCRAPE DATA. Don’t do it.

There is a reason I added “open source” to my search query: I do not want to scrape anything that the owners would not want me to scrape. On top of being impolite, scraping is covered by many different laws in different countries. I cannot explain all of them and would not want to suggest any ways to scrape things or circumvent any laws. Use open-source data, and ask ChatGPT for it specifically. The warning in the screenshot above is about data scraping, and I fully understand and respect that. If you are not 100% sure whether something may be scraped, don’t do it and find alternatives.
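Once an openly licensed file is in hand, the football example could be worked through roughly as follows. This is only a minimal sketch using Python's standard library: the sample rows are made up for illustration, and the column names `position` and `height_cm` are my assumptions, so adjust them to whatever your actual source provides.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample rows standing in for an openly licensed file.
# The column names "position" and "height_cm" are assumptions.
SAMPLE_CSV = """name,position,height_cm
Player A,Goalkeeper,191
Player B,Goalkeeper,189
Player C,Defender,185
Player D,Midfielder,178
Player E,Midfielder,176
Player F,Forward,181
"""


def average_height_by_position(csv_file):
    """Return {position: mean height in cm} for the rows in csv_file."""
    heights = defaultdict(list)
    for row in csv.DictReader(csv_file):
        heights[row["position"]].append(float(row["height_cm"]))
    return {position: mean(values) for position, values in heights.items()}


# io.StringIO lets the sample text be read like a file;
# with a real download you would pass open("your_file.csv") instead.
averages = average_height_by_position(io.StringIO(SAMPLE_CSV))
for position, avg in sorted(averages.items()):
    print(f"{position}: {avg:.1f} cm")
```

The resulting dictionary is small enough to hand straight to whichever charting tool you prefer for the visualization.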

2 – Stack Overflow is dead? Long live ChatGPT.

No matter your seniority, your experience or your skill, there will be many cases where you check the language documentation and search online for solutions to your specific problems. These syntax searches can take a long time depending on how complex and specific the issue is, and they are not guaranteed to turn up a solution either.

Well, with ChatGPT, syntax and code research is a breeze. You can use it in two ways: paste your code in and ask it to correct any issues, or ask it to write the code for you.

You can find the data source as explained in the first tip, clean it up, and then ask ChatGPT to build a logistic regression on it:

The best part is the small explanation of the code it gives at the end, where you can see why it wrote each specific part. This model can be expanded, and more advanced methods like automated train-test splitting can be added as well! I do not think ChatGPT has any limits as long as you are precise in your explanation.
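To give a feel for what a logistic regression with a train-test split involves, here is a rough sketch of my own. It is not the code from the screenshot, and it is not what ChatGPT will necessarily produce. To stay dependency-free it uses only the standard library and plain gradient descent, whereas in practice you would most likely reach for a library such as scikit-learn. The data is synthetic, and all function names and hyperparameters are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle the row indices and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_fraction))
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def fit_logistic_regression(rows, labels, learning_rate=0.1, epochs=500):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 if the predicted probability is at least 0.5."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Synthetic toy data: two well-separated clusters, one per class.
rng = random.Random(42)
rows, labels = [], []
for _ in range(100):
    label = rng.randint(0, 1)
    center = 2.0 if label else -2.0
    rows.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
    labels.append(label)

X_train, y_train, X_test, y_test = train_test_split(rows, labels)
weights, bias = fit_logistic_regression(X_train, y_train)
correct = sum(predict(weights, bias, x) == y for x, y in zip(X_test, y_test))
accuracy = correct / len(y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Holding out a test set matters because accuracy measured on the training rows alone tends to flatter the model. That is a good detail to ask ChatGPT for explicitly when you describe what you want.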

3 – Data Storytelling

After the coding and the analysis are done, we always need to show the results somewhere. It can be a meeting without any visual tools, a presentation, or an infographic. Regardless of the medium, the presentation will not only be charts and numbers. It will also have words and action titles. Data stories are something you get better at with experience and practice, and the tone has to be specific to the audience. However, you can start generating ideas and building your own sentences using ChatGPT:

Number 4 looks very promising, although you can see the others are not that great. We can still ask ChatGPT the same thing many times with different descriptions until we have enough examples to write our own!

Conclusion

I really think that ChatGPT will change the working life of a data professional. I do not believe it is here to replace any data scientists or data analysts, as I mentioned in the perspective section. It will make our lives a lot better by letting us deliver results faster, more easily, and, to be fair, with more fun. It is much more interesting to chat with something than to do keyword gymnastics.