How to categorise text in a Pandas dataframe using Google’s Natural Language API

Why do you need to categorise (or ‘tag’) your content?

If you’re an online content publisher, it’s beneficial to tag articles or forum posts with categories such as Sports, Health, Beauty & Fitness, etc. Doing so has a number of benefits including:

  1. Increase advertising revenue. Usually onsite adverts (e.g. from Google Ads) are targeted to pages using the website’s URL for increased relevance. For example, if you wanted to advertise ‘sports trainers’ you might push the advert to all pages containing ‘sports’. However, realistically not all content about sports will sit within URLs containing ‘sports’, especially for user-generated-content websites where categorisation cannot be so easily controlled. Therefore pushing the tagged category to the dataLayer of the website and targeting ads to the dataLayer will allow you to cast the advertising net much wider.
  2. Improve the onsite user experience. Tagging content will allow users to easily navigate to categories they are most interested in or mute ones they are not interested in.
  3. Allow for recommendations to be shown. Once your content is tagged with a category, you can suggest other similar articles to your users and thereby increase onsite engagement.
  4. Improve SEO. Onsite tags allow search engine crawlers to easily navigate your site, meaning search engines will have a greater understanding of your content. In addition, and as mentioned in the previous point, tags will improve the user experience which can positively affect search engine rankings for your website.
  5. Improve analytics reporting accuracy. Your Data team will be able to more accurately answer questions such as ‘how many pageviews were there of content about sports trainers?’.

Google’s NLP API

Google offers a free API (try it out here) which runs a pre-trained BERT model to automatically tag text into one of 620 categories (with 27 top-level categories — see below). You can check out the full list here.

Here are the top-level categories:

Adult, Arts & Entertainment, Autos & Vehicles, Beauty & Fitness, Books & Literature, Business & Industrial, Computers & Electronics, Finance, Food & Drink, Games, Health, Hobbies & Leisure, Home & Garden, Internet & Telecom, Jobs & Education, Law & Government, News, Online Communities, People & Society, Pets & Animals, Real Estate, Reference, Science, Sensitive Subjects, Shopping, Sports, Travel.

The beauty of this API is that it returns a confidence score alongside each categorisation, so you can filter out low-confidence results.
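As a rough sketch of what the API returns: each result is a category with a `name` (the `/Top-level/Sub-level` path) and a `confidence` between 0 and 1. The values below are invented for illustration, and the response is modelled as plain dicts rather than the API's own response objects:

```python
# Illustrative shape of a classify_text result: each category carries a
# slash-delimited path and a confidence score (sample values invented).
response = {
    "categories": [
        {"name": "/News/Politics", "confidence": 0.91},
        {"name": "/Law & Government/Government", "confidence": 0.63},
    ]
}

# Pick the category the model is most confident about:
best = max(response["categories"], key=lambda c: c["confidence"])
print(best["name"])  # /News/Politics
```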

In this example we are going to use the API to tag BBC News articles with a Google category. The output will be a Pandas dataframe with the categories against each article like so:

I’ve also visualised the size & details of each category in a treemap to provide us with a top-level overview:

Start here:

Download the BBC News (train) dataset from here — there are 1,490 articles altogether.

Read the CSV into a Pandas dataframe. The CSV already contains a BBC-defined category; I've renamed that column so as not to confuse things:

import pandas as pd

bbc_news = pd.read_csv('BBC News Train.csv')
bbc_news.columns= bbc_news.columns.str.lower()
bbc_news.rename(columns={"category": "bbc_defined_category"}, inplace = True)
bbc_news.head()

Some of the articles are very long, so I've chosen to extract just the first 150 words; in my experience the key themes of an article or forum post can be understood within that span. Feel free to play around with the number, as your website might have a different textual structure. NB: the Google API only works for text containing at least 20 words.

bbc_news['first_150_words'] = bbc_news['text'].str.extract(pat=r'(^(?:\S+\s+\n?){1,150})')
bbc_news.head()
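To see what that regex is doing, here is the same pattern with a smaller limit (5 words) run on a sample string, plus a simple word-count check reflecting the 20-word minimum mentioned above:

```python
import re

# The article's pattern with a smaller limit for illustration: it captures
# up to N whitespace-separated tokens from the start of the text.
pattern = r'(^(?:\S+\s+\n?){1,5})'
text = "one two three four five six seven eight"
first_five = re.search(pattern, text).group(1).strip()
print(first_five)  # one two three four five

def long_enough(text, min_words=20):
    # classify_text needs at least 20 words, so shorter texts should be
    # filtered out before calling the API (or the call wrapped in try/except)
    return len(text.split()) >= min_words

print(long_enough("a short sentence"))  # False
```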

Next we want to connect to our Google Cloud Platform account’s credentials. To do this you’ll need to download a JSON key following these instructions.

Below is the full code to pull categories from text and add them onto your existing dataframe. I've included a commented-out line that pulls the confidence score along with the category, which I won't be using in this example:

import os
from google.cloud import language_v1

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path to your JSON file'
client = language_v1.LanguageServiceClient()

def get_category(content):
    type_ = language_v1.Document.Type.PLAIN_TEXT
    document = {"content": content, "type_": type_, "language": "en"}
    categories = client.classify_text(document=document)
    # To pull the confidence score along with the category:
    # return [(c.name, round(c.confidence * 100)) for c in categories.categories]
    return [c.name for c in categories.categories]

google_categories = bbc_news.first_150_words.apply(get_category)
bbc_news['google_defined_categories'] = google_categories
bbc_news.head()

We can already see the Google defined categories are much more detailed than the BBC defined categories 👍. Now you have your dataframe there is so much you can do with this data in relation to the 5 points (and more!) I stated at the beginning of this article. Please get in touch if you want to bounce some ideas off me :-)

How to build the TreeMap:

Next we want to split each list in google_defined_categories out into its own set of columns:

pd.DataFrame(bbc_news.google_defined_categories.tolist(), index= bbc_news.index)

We can see that the maximum number of categories Google’s API has found for this dataset is 6.
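`pd.DataFrame(series.tolist(), ...)` turns a column of lists into columns, padding shorter lists with missing values up to the longest one. A pure-Python sketch of that padding step (the sample category lists here are invented):

```python
# Rows hold a variable number of categories; the longest list decides
# how many columns the resulting dataframe will have.
rows = [
    ["/News/Politics"],
    ["/Sports", "/News/Sports News"],
    ["/Arts & Entertainment", "/News", "/Shopping"],
]
width = max(len(r) for r in rows)                      # column count
padded = [r + [None] * (width - len(r)) for r in rows]  # pad with missing values
print(padded[0])  # ['/News/Politics', None, None]
```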

Then we want to rename these columns:

bbc_news[["google_category_1", "google_category_2", "google_category_3", "google_category_4","google_category_5","google_category_6"]] =  pd.DataFrame(bbc_news.google_defined_categories.tolist(), index= bbc_news.index)

Which will produce this dataframe:

Next we want to pull the highest-level Google-defined category, i.e. the text between the first and second '/' in the google_category_1 column:

bbc_news['highest_level_google_category'] = bbc_news['google_category_1'].str.split('/').str[1]
bbc_news.head()
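Why index 1 rather than 0? Category paths begin with a '/', so splitting on '/' yields an empty string as the first element:

```python
# Splitting a leading-slash path: the element at index 1 is the
# top-level category.
path = "/News/Politics"
parts = path.split('/')
print(parts)     # ['', 'News', 'Politics']
print(parts[1])  # News
```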

Next, we want to build a smaller dataframe with the highest_level_google_category, google_category_1 and the number of BBC articles against each:

visual = bbc_news.groupby(['highest_level_google_category','google_category_1']).agg({'articleid':'nunique'})
visual.reset_index(inplace = True)
visual.rename(columns={"articleid": "num_articles"}, inplace = True)
visual.sort_values(by='num_articles', ascending = False)
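The `.agg({'articleid': 'nunique'})` step counts the number of distinct article ids behind each (top-level, full) category pair. A pure-Python sketch of that aggregation, with invented article ids:

```python
from collections import defaultdict

# Collect the distinct article ids seen for each category pair, then the
# group sizes give the nunique counts.
rows = [
    ("News", "/News/Politics", 101),
    ("News", "/News/Politics", 102),
    ("News", "/News/Politics", 101),   # duplicate id, counted once
    ("Sports", "/Sports/Team Sports", 103),
]
groups = defaultdict(set)
for top_level, full_category, article_id in rows:
    groups[(top_level, full_category)].add(article_id)
num_articles = {k: len(v) for k, v in groups.items()}
print(num_articles[("News", "/News/Politics")])  # 2
```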

Finally we plot this data:

import plotly.express as px

fig = px.treemap(visual,
                 path=['highest_level_google_category', 'google_category_1'],
                 values='num_articles',
                 color='google_category_1')
fig.show()

You can hover over the names to get more information about them in your IDE.

Pulling all the code together:

### Pulling the Google-defined categories ###
import os
import pandas as pd
from google.cloud import language_v1

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path to your JSON file'
client = language_v1.LanguageServiceClient()

bbc_news = pd.read_csv('BBC News Train.csv')
bbc_news.columns = bbc_news.columns.str.lower()
bbc_news.rename(columns={"category": "bbc_defined_category"}, inplace=True)
bbc_news['first_150_words'] = bbc_news['text'].str.extract(pat=r'(^(?:\S+\s+\n?){1,150})')

def get_category(content):
    type_ = language_v1.Document.Type.PLAIN_TEXT
    document = {"content": content, "type_": type_, "language": "en"}
    categories = client.classify_text(document=document)
    # To pull the confidence score along with the category:
    # return [(c.name, round(c.confidence * 100)) for c in categories.categories]
    return [c.name for c in categories.categories]

google_categories = bbc_news.first_150_words.apply(get_category)
bbc_news['google_defined_categories'] = google_categories
bbc_news[["google_category_1", "google_category_2", "google_category_3",
          "google_category_4", "google_category_5", "google_category_6"]] = \
    pd.DataFrame(bbc_news.google_defined_categories.tolist(), index=bbc_news.index)
bbc_news['highest_level_google_category'] = bbc_news['google_category_1'].str.split('/').str[1]

### Creating the visual ###
import plotly.express as px

visual = bbc_news.groupby(['highest_level_google_category', 'google_category_1']).agg({'articleid': 'nunique'})
visual.reset_index(inplace=True)
visual.rename(columns={"articleid": "num_articles"}, inplace=True)
visual.sort_values(by='num_articles', ascending=False)

fig = px.treemap(visual,
                 path=['highest_level_google_category', 'google_category_1'],
                 values='num_articles',
                 color='google_category_1')
fig.show()

I hope you found this helpful. Please message me with any questions :)

Data Analyst | Data Scientist