Hi! Have you ever felt the burning need to plagiarize information from the internet? Ever copied data from Wikipedia to use as a reference for an academic article? No? Neither have I...
Since you and I are without fault, this article is for a friend.
Web scraping using Google Sheets
Web scraping is a method for extracting information from websites. When you read this definition, think of any digital possession you have: a personal website, blog, an academic article, a social media page, and so on. Now imagine that I can write code that extracts any public-facing content on your digital site and use it for whatever I want. If this intrusion is not slightly unsettling to you, read on my dear savage; this is your content.
For the most part, the average person can use either Excel or Google Sheets for any spreadsheet work. However, every now and then, either of them has a useful feature that is missing in the other. Today, we will talk about such a feature in Google Sheets useful for scraping websites. Google Sheets has WEB functions useful for extracting data from various web pages.
Scenario
You are drunk at a party full of tech bros and everyone is playing an action/adventure movie-themed drinking game. During each player's turn, they must name an action/adventure movie and if the movie is not on this Top 100 list on Rotten Tomatoes, the player must take a shot. I know, this might be an immature game, but you also should have learned how to use Excel 10 years ago, so let's not point fingers.
To win the game and ensure you take few shots, you surreptitiously pull up the corresponding web page of movies on Rotten Tomatoes on your laptop. Since you don't want your online cheatsheet to be too obvious, you decide to pull up the data in Google Sheets. Anyone at a party that walks past your laptop screen and sees a spreadsheet won't suspect you of cheating. They simply won't invite you to the next party.
WEB functions
Google Sheets uses WEB functions for data scraping. To take a look at the different WEB functions, while in your browser, press Ctrl + T to open a new tab. In the address bar, type sheets.new (this is a hack that works for creating any new file in Google Drive. Try docs.new and slides.new) and press ENTER. These shortcuts work for most browsers, such as Chrome, Edge and Opera. If you use, DuckDuckGo and the like, I am not sure if nice things are allowed on the dark web.
A new google spreadsheet is created and we browse the WEB functions by clicking Insert on the toolbar, then Function, then Web.
Extract a table from a web page using IMPORTHTML
Using IMPORTHTML, you can extract the table with the Top 100 Action/Adventure Movies from Rotten Tomatoes. The IMPORTHTML function takes the following arguments:
The tricky part about using this function is the index argument. The index refers to the position of a specific table or list on the webpage, identified by a number. There might be multiple tables or lists on the webpage, so you may have to try different numbers and check what output is returned. Usually, if there is one table on the webpage, the index will be 1. The index will likely correspond to the order in which the tables appear on the webpage.
In your browser, you have the desired webpage on Rotten Tomatoes and the new spreadsheet. To extract the table of the Top 100 Action/Adventure movies from the webpage, we click inside any cell in the spreadsheet and begin the IMPORTHTML function, by typing an equal sign and IMPORTHTML followed by an open bracket or parenthesis "(". We fill the arguments with the corresponding values while following the example shown in the figure above. With the locale, you can change the language of the text when copied in the spreadsheet. However, if we ignore the locale, Google Sheets will reflect the language from the webpage by default.
As seen in the image above, we enter 3 for the index number. This is because, there are 2 other tables on the page we do not see. These are shown below. Try indices 1 and 2 to prove this point.
After entering 3 as the index number, press ENTER and Google Sheets will load the data into the spreadsheet. You are done! Instead of just saying you don't want to play this potentially harmful drinking game like the adult you are, you have a trick to bend the rules.
You have successfully stolen data from Rotten Tomatoes and the FBI is on the way to your house. If they ask any questions, my blog does not exist and you are in fact a child prodigy erring on the wild side of life unsupervised. Subscribe to the blog as your final act as a free human though.
Conclusion
Web scraping can be useful to extract information and data from web pages for analysis. For example, if your country releases contentious election results and you want to get picked up in an unmarked van, Excel and Google Sheets can be used to extract the data and analyze it. However, issues of consent and copyright can arise depending on where you get the data and how you use it. In the very least, if you are using the extracted data for analysis and presentation, indicate the source of the data in any reports or visualizations.
Until next time, cheers!
Comments