Mining Song Lyrics from Genius

Posted by Zeses Pitenis on October 21st

Compiling a Lyrics Corpus with Python

Last year, when I took a course on Corpus Linguistics, I was asked to complete an assessment describing and comparing a chosen pair of corpora. The course leader suggested that we use large, freely available corpora, but to be honest I was never a fan of comparing British and US English. I have, however, always been very passionate about music and lyrics; I even write my own occasionally. So I decided to do a project comparing two well-known and prolific Greek songwriters.

Unfortunately, there is no major lyrics source for Greek songs comparable to Genius, so I had to scrape the lyrics myself, starting from manually downloaded .html pages that list each songwriter’s songs along with the URL paths to their lyrics. The fun part was that the website containing the lyrics can be edited by anyone, hence the “greaner” function you see in the code below, a cleaner for the Greek quotation marks that vary from one song’s lyrics to the next. This way I created two corpora with the lyrics of approximately 450 songs each, in .txt format.

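import re
from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup

def greaner(line):
    # Normalize backticks and curly single quotes to a plain apostrophe.
    # Inside [...] a "|" is a literal pipe, so it is left out of the class.
    return re.sub(r"[`’‘]", "'", line)

# Collect the link to every song listed on each songwriter page
urls = []
with open('raw_urls.txt') as raw:
    for line in raw:
        # Strip the trailing newline before opening the address
        page = urlopen(line.strip())
        soup = BeautifulSoup(page.read(), features="lxml")
        links = soup.find_all('a', attrs={'href': re.compile(".+song_id.+")})
        for link in links:
            urls.append('http://www.stixoi.info/' + link['href'])

# Download each song page once and save its lyrics to a numbered .txt file
for i, url in enumerate(urls):
    response = requests.get(url)
    soupa = BeautifulSoup(response.content, "html.parser")
    lyrics = soupa.find('div', {'class': 'lyrics'})
    if lyrics is None:
        # Skip pages without a lyrics block instead of writing an empty file
        continue
    with open('%i.txt' % i, 'w', encoding='utf-8') as output:
        output.write(greaner(lyrics.text))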
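
To make the cleaning step concrete, here is a minimal sketch of the same idea in isolation. The name normalize_quotes and the sample string are made up for this illustration; only the character class mirrors the cleaner above. One detail worth knowing: inside square brackets a "|" is a literal pipe character and not an "or", so it is left out of the class here to avoid rewriting pipes that appear in the text.

```python
import re

# Hypothetical helper for illustration: maps apostrophe look-alikes
# (backtick, right and left curly single quotes) to a plain apostrophe.
_APOSTROPHES = re.compile(r"[`’‘]")

def normalize_quotes(text):
    """Replace backticks and curly single quotes with a straight apostrophe."""
    return _APOSTROPHES.sub("'", text)

# Made-up sample line containing two apostrophe variants and a pipe
sample = "Σ’ αγαπώ | σ` αγαπώ"
print(normalize_quotes(sample))
```

The pipe in the sample survives untouched, while both apostrophe variants end up as the same character, which is what keeps word counts consistent across songs typed by different contributors.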


A Genius Wrapper that Makes Our Lives Easier

The assignment went exceptionally well for a Python newbie such as myself, and it only amplified my desire to collect, computationally process and analyze song lyrics, and even to create a song genre classifier. This is where I shifted my focus to songs in English and where the Genius mining starts. I came across an amazing Python wrapper for the Genius API, lyricsgenius by John W. Miller. You can check the full functionality of the interface in his GitHub repository. Below I walk through a simple yet efficient demo of how to use this brilliant interface, broken down into three parts: the first sets up the client, the second gets all the information we want from our favorite artists' songs, and the third organizes this information, along with the lyrics, in a nice and clean .csv file.


Part 1: Setting Up the Client


import pandas as pd
import time
import lyricsgenius
#Placeholder value: paste your own Client Access Token here
client_access_token = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
#Drop section headers, skip non-songs and exclude alternate versions by title
genius = lyricsgenius.Genius(client_access_token, remove_section_headers=True,
                 skip_non_songs=True, excluded_terms=["Remix", "Live", "Edit", "Mix", "Club"])

We begin by importing pandas for handling the data we are about to mine, time for timing the queries, and lyricsgenius itself. When setting up your API client on the Genius website, note that you need to generate a Client Access Token to use this interface; the Client ID and Client Secret will not work. The parameters of the lyricsgenius package can be configured to your needs when searching for songs. Here I am removing section headers (e.g. [Chorus]), skipping non-songs such as track lists, and excluding songs with words like "Remix" or "Live" in their titles from my search.
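
If you would rather not paste the token into your script, one option is to read it from an environment variable. The sketch below is only a suggestion: the function load_token and the variable name GENIUS_ACCESS_TOKEN are my own choices for this example, not something the lyricsgenius package requires.

```python
import os

def load_token(env=None, name="GENIUS_ACCESS_TOKEN"):
    """Return the access token stored in an environment variable.

    The variable name is an assumption made for this example.
    """
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return token

# Intended use (not executed here, since it needs the lyricsgenius package):
# genius = lyricsgenius.Genius(load_token(), remove_section_headers=True)
```

This keeps the token out of any code you share or push to a repository, and the script fails early with a readable message when the variable is missing.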


Part 2: Getting and Storing Song Information


#Create list of sample artists
sample_artists = ['Rihanna', 'Justin Timberlake']

#Empty lists for artist, title, album, year and lyrics information
#(created once, before the loop, so that every artist's songs are kept)
artists = []
titles = []
albums = []
years = []
lyrics = []

#Starting the song search for the artists in question and seconds count
query_number = 0
time1 = time.time()
for name in sample_artists:
    query_number += 1
    print('\nQuery number:', query_number)
    #Search for max_songs = n and sort them by popularity
    artist = genius.search_artist(name, max_songs = 10, sort='popularity')
    if artist is None:
        #Nothing was found for this name, so move on to the next artist
        continue
    songs = artist.songs
    song_number = 0
    #Append all information for each song in the previously created lists
    for song in songs:
        if song is not None:
            song_number += 1
            print('\nSong number:', song_number)
            print('\nNow adding: Artist')
            artists.append(song.artist)
            print('Now adding: Title')
            titles.append(song.title)
            print('Now adding: Album')
            albums.append(song.album)
            print('Now adding: Year')
            #Keep only the year; the release date may be missing
            years.append(song.year[0:4] if song.year else None)
            print('Now adding: Lyrics')
            lyrics.append(song.lyrics)
    time2 = time.time()
    print('\nQuery', query_number, 'finished in', round(time2-time1,2), 'seconds.')

I started off by creating a list of sample artists, for this example Rihanna and Justin Timberlake, two famous pop (?) singers and performers who are among my personal favorites. This list is used as input to the search_artist method. The original Python wrapper prints status messages, so I thought I would do the same, for sanity purposes. What I am doing is basically creating empty lists for all the types of information to be included in the lyrics corpus (e.g. song titles, year of release, lyrics, etc.). You can get other information for your queries by taking a look at the Song objects' attributes. To populate the empty lists, we go through each of the artists and get 10 of their songs, sorted by popularity on the Genius website. Each type of information is then appended to the appropriate list.
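
As a variation on the parallel lists, each song can also be turned into one dictionary, which keeps the fields of a song together. The sketch below is an illustration under assumptions: song_to_row and FIELDS are names I made up, and the stand-in objects only imitate the attributes used in this post. The attributes on real Song objects may differ between versions of the package, which is why missing ones fall back to None here.

```python
from types import SimpleNamespace

# Fields collected in this post; adjust to the attributes your Song objects have
FIELDS = ("artist", "title", "album", "year", "lyrics")

def song_to_row(song):
    """Build one corpus row from a song-like object, tolerating missing fields."""
    row = {field: getattr(song, field, None) for field in FIELDS}
    # Keep only the first four characters (the year) of a release date string
    if isinstance(row["year"], str):
        row["year"] = row["year"][:4]
    return row

# Stand-in objects with made-up values, used instead of real API results
fake_songs = [
    SimpleNamespace(artist="Artist A", title="Song 1", album="Album X",
                    year="2010-05-01", lyrics="la la la"),
    SimpleNamespace(artist="Artist A", title="Song 2", lyrics="na na na"),
]
rows = [song_to_row(song) for song in fake_songs]
print(rows[1]["album"], rows[0]["year"])
```

A list of such dictionaries can be passed directly to pd.DataFrame, which builds the same kind of table as the separate lists do.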


Part 3: Get a Pretty and Organized Output

#Create a dataframe for our collected tracklist
tracklist = pd.DataFrame({'artist':artists, 'title':titles, 'album':albums, 'year':years, 'lyrics':lyrics})
time3 = time.time()
print('\nFinal tracklist of', query_number, 'artists finished in', round(time3-time1,2), 'seconds.')
#Save the final tracklist to csv format
tracklist.to_csv('mini_lyrics_corpus.csv', encoding = 'utf-8', index=False)

After the queries for all artists have finished, all the information for our tracklist is stored in a pandas DataFrame, which is then saved as a .csv file. I set index to False because pandas would otherwise write the row indices as an extra, unwanted column. The output should look like the screenshot below:

[Screenshot: Tracklist DataFrame]
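
To see what index=False changes, here is a small self-contained sketch that writes a one-row table to an in-memory buffer and reads it back. The table contents and the roundtrip helper are invented for this example; to_csv and read_csv are the regular pandas functions.

```python
import io

import pandas as pd

# Made-up one-row tracklist; the lyrics field spans two lines on purpose
tracklist = pd.DataFrame({
    "artist": ["Example Artist"],
    "title": ["Example Song"],
    "lyrics": ["first line\nsecond line"],
})

def roundtrip(frame, **to_csv_kwargs):
    """Write a frame to an in-memory CSV and read it back (illustration only)."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = roundtrip(tracklist)
without_index = roundtrip(tracklist, index=False)
print(list(with_index.columns))
print(list(without_index.columns))
```

With the default settings the row index comes back as an extra unnamed first column, while index=False gives back exactly the columns that went in. Lyrics that span several lines are quoted in the file, so they come back as a single field.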

Conclusion

And there you have it! A plain, if not especially elegant, approach to compiling a lyrics corpus from the Genius API with the lyricsgenius package and Python. Feel free to check the whole code in my GitHub repository and to share your feedback, comments and suggestions.

Note: You should be wary of the API response limitations and optimize the code for your mining needs.
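
If those limitations show up as failed or timed-out requests, one possible way to cope is to retry a call after a growing pause. The following is a rough sketch and not part of lyricsgenius: call_with_retries is a helper I made up, and catching every Exception is a deliberately broad assumption, since the exact error you get depends on your setup.

```python
import time

def call_with_retries(func, *args, retries=3, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func, waiting longer after each failure (exponential backoff)."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Give up after the last attempt and let the error surface
            if attempt == retries - 1:
                raise
            # Wait base_delay, then twice that, and so on, before retrying
            sleep(base_delay * 2 ** attempt)

# Intended use (not executed here, since it needs a configured client):
# artist = call_with_retries(genius.search_artist, 'Rihanna',
#                            max_songs=10, sort='popularity')
```

The sleep parameter exists only so that the waiting can be swapped out when trying the helper without real delays.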