What is Beautiful Soup?
Overview
“You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.” (Opening lines of Beautiful Soup)
Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages. It lets you extract particular content from a webpage by its location in the HTML tag hierarchy (think XPath and navigating the DOM), by CSS selector, or by HTML id or class attribute (or some combination thereof).
Say there is a website that contains data relevant to your research, such as dates or addresses, or links to other sources of data that you also want to scrape. Beautiful Soup offers easy ways to pull that particular content from the webpage, strip away its HTML wrapping, and put it into a new context where you can carry out whatever operation comes next.
I highly recommend looking at the Beautiful Soup documentation pages to get a sense of the variety of things you can do with simple Beautiful Soup commands, from isolating titles and links to extracting all of the text from the HTML tags to altering the HTML within the document you’re working with.
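To make the three selection styles concrete, here is a minimal sketch. The HTML snippet, the id “main”, and the class names are invented for illustration and do not come from any real page; the parser name “html.parser” is Python’s built-in one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration only.
html = """
<html><body>
  <div id="main">
    <p class="date">1873-03-04</p>
    <p class="address">Washington, D.C.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. By location in the tag hierarchy: the first <p> in the first <div> in <body>.
by_location = soup.body.div.p.get_text()

# 2. By CSS selector.
by_css = soup.select_one("div#main p.address").get_text()

# 3. By id or class attribute.
by_id = soup.find(id="main")
by_class = soup.find("p", class_="date").get_text()

print(by_location, "|", by_css, "|", by_class)
```

Which style to use depends mostly on the page: tag-path navigation is quick for simple documents, while selectors and attribute searches tend to survive small layout changes better.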
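As a small taste of those commands, the sketch below pulls a title, the link targets, and the plain text out of a tiny document. The markup and the example.com URLs are placeholders invented for this illustration.

```python
from bs4 import BeautifulSoup

# Placeholder document; the URLs are not real data sources.
html = """<html><head><title>Sample Page</title></head>
<body><p>See <a href="http://example.com/a">first</a> and
<a href="http://example.com/b">second</a>.</p></body></html>"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                        # text of the <title> tag
hrefs = [a.get("href") for a in soup.find_all("a")]   # every link target
all_text = soup.get_text()                            # all text with tags removed

print(page_title)
print(hrefs)
```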
Installing Beautiful Soup
Installing Beautiful Soup is easiest if you already have pip or another Python installer in place. If you don’t have pip, start with Fred’s tutorial on installing Python modules. Once you have pip installed, run the following command to install Beautiful Soup:
pip install beautifulsoup4
You may need to include “sudo” in your command. Sudo gives your computer permission to write to your root directories and requires you to re-enter your password. This is the same logic behind your being prompted to enter your password when you install a new program.
With sudo, the command is:
sudo pip install beautifulsoup4
Using Beautiful Soup in a Python Script
There are two basic steps to using Beautiful Soup in your Python script. The first is to import the library at the beginning of your script by writing:
from bs4 import BeautifulSoup
Second, you have to pass the document (an open file or a string of markup) to Beautiful Soup to make the “soup.” For this example we will be using a locally saved file and will create the soup this way:
soup = BeautifulSoup(open("example.txt"))
This creates a large soup object out of the content of our “example.txt” file and we can then run the Beautiful Soup methods on that object.
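A slightly more defensive variant of the same step is sketched below. It is not the tutorial’s own code: it assumes you are happy to use Python’s built-in “html.parser”, and it writes a throwaway file so that the example runs anywhere. With your own data you would open “example.txt” directly.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a throwaway stand-in for "example.txt" so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("<html><body><p>Hello, soup</p></body></html>")

# Naming the parser explicitly avoids the warning Beautiful Soup emits when it
# has to guess one, and the "with" block closes the file once it has been read.
with open(path, encoding="utf-8") as handle:
    soup = BeautifulSoup(handle, "html.parser")

first_paragraph = soup.p.get_text()
print(first_paragraph)
```

Different parsers can build slightly different trees from badly formed HTML, so it is worth sticking with one parser throughout a project.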
Application: Extracting names and URLs from an HTML page
Preview: Where we are going
Because I like to see where the finish line is before starting, I will begin with a view of what we are trying to create. We are attempting to go from a search results page where the html page looks like this:
<table border="1" cellspacing="2" cellpadding="3">
<tbody>
<tr>
  <th>Member Name</th>
  <th>Birth-Death</th>
</tr>
<tr>
  <td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td>
  <td>1837-1920</td>
</tr>
<tr>
  <td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td>
  <td>1816-1879</td>
</tr>
<tr>
  <td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000077">ALBRIGHT, Charles</a></td>
  <td>1830-1880</td>
</tr>
<tr>
  <td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000079">ALCORN, James Lusk</a></td>
  <td>1816-1894</td>
</tr>
<tr>
  <td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160">ALLISON, William Boyd</a></td>
  <td>1829-1908</td>
</tr>
</tbody>
</table>
to a CSV file with names and URLs that looks like this:
"ADAMS, George Madison",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035
"ALBERT, William Julian",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074
"ALBRIGHT, Charles",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000077
"ALCORN, James Lusk",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000079
"ALLISON, William Boyd",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160
"AMES, Adelbert",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000172
"ANTHONY, Henry Bowen",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000262
"ARCHER, Stevenson",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000274
"ARMSTRONG, Moses Kimball",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000283
"ARTHUR, William Evans",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000304
"ASHE, Thomas Samuel",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000309
"ATKINS, John DeWitt Clinton",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000327
"AVERILL, John Thomas",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000344
...
The finished code is:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("43rd-congress.html"))

# remove the extra link that sits outside the results table
final_link = soup.p.a
final_link.decompose()

clean_list = []
links = soup.find_all('a')
for link in links:
    single_link = link['href']
    name = link.contents[0]
    # quote the name because it contains a comma
    entry = '"%s",%s' % (name, single_link)
    clean_list.append(entry)

f = open("43rd_results.csv", "w")
f.write("\n".join(clean_list))
f.close()
but follow along to understand how Beautiful Soup gets us to that point.
Get a file to scrape
The first step is getting the files for scraping. This can be done in a variety of ways. Usually, I would recommend scraping using wget or cURL (see my slides on an introduction to webscraping). To do this, use wget or cURL in Terminal and point it at the particular webpage or folder that you want to download. However, the Congressional database is a bit more complicated because the URL for particular search results is hidden. While this can be bypassed programmatically, it is easier for our purposes to go to http://bioguide.congress.gov/biosearch/biosearch.asp, search for Congress number 43, and save a copy of the results page.
Selecting “File” and “Save Page As …” from your browser window will accomplish this. For a filename, avoid spaces – I am using “43rd-congress.html”. Move the file into the folder you want to work in and let’s proceed.
Identify content
One of the first things Beautiful Soup can help us with is getting a sense of how the different HTML tags are nested within each other. This can be very useful when you need to isolate content that is buried within the HTML structure as Beautiful Soup allows you to select content based upon tag within tag within tag (example: soup.body.p.b finds the first bold item inside a paragraph tag inside the body tag in the document). To get a good view of how the tags are nested in the document, we can use the method “prettify” on our soup object. Create a new file called soupexample.py. This file will contain your Python script that we will be developing over the course of the tutorial. In this file we need to import the Beautiful Soup library, open the file and pass it to Beautiful Soup, and then print the pretty version in the terminal.
from bs4 import BeautifulSoup
soup = BeautifulSoup (open("43rd-congress.html"))
print(soup.prettify())
Save this file in the folder with your HTML file and go to the command line. Navigate (use ‘cd’) to the folder you’re working in and execute the following:
python soupexample.py
You should see your terminal window fill up with a nicely indented version of the original html text. This is a clean picture of how the various tags relate to one another.
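The tag-within-tag navigation mentioned above can be tried on a one-line document before you run it on the full results page. The markup here is invented for illustration.

```python
from bs4 import BeautifulSoup

# A tiny made-up document with two bold items inside one paragraph.
html = "<html><body><p>Plain <b>bold one</b> and <b>bold two</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Dotted navigation always returns the FIRST matching tag at each step.
first_bold = soup.body.p.b
print(first_bold.get_text())

# prettify() returns the document as a string with one tag per line, indented by depth.
pretty = soup.prettify()
print(pretty)
```

Because dotted navigation only ever returns the first match, it suits unique landmarks in a page; to collect every match you need find_all.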
Using BeautifulSoup to select particular content
So, we are interested in the links and names of the various members of the 43rd Congress. Looking at the “pretty” version of the file, the first thing to notice is that this is a relatively flat file – our tags are not too deeply embedded within each other.
While this makes some of the identifying more difficult, we are interested in the names and urls and all of these are, most fortunately, embedded in “<a>” tags. So, we need to isolate out all of the “<a>” tags. We can do this by updating the code in “soupexample.py” to the following:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("43rd-congress.html"))

links = soup.find_all('a')
for link in links:
    print(link)
Save and run the script again to see all of the anchor tags in the document.
python soupexample.py
One thing to notice is that there is an additional link in our file – the link for an additional search. We can get rid of this with just a line or two of additional code. Going back to the pretty version, notice that this last “<a>” tag is not within the table but is within a “<p>” tag.
Because Beautiful Soup allows us to modify the data, we can remove the “<a>” that is under the “<p>” before searching for all the “<a>” tags.
To do this, we can use the “decompose” method, which erases whatever you tell Beautiful Soup to decompose. Do be careful when using “decompose” – you are deleting both the html tag and all of the data inside of that tag. If you have not correctly isolated the data, you may be deleting information that you needed to extract. Update the file as below and run again.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("43rd-congress.html"))

final_link = soup.p.a
final_link.decompose()

links = soup.find_all('a')
for link in links:
    print(link)
And success! We have isolated out all of the links we want and none of the links we don’t!
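To see exactly what “decompose” removes, here is a minimal sketch on a made-up page with one table link and one stray link in a paragraph. The relative URLs are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up page: one link we want (in the table) and one we do not (in the <p>).
html = """<table><tr><td><a href="/member/1">ADAMS</a></td></tr></table>
<p><a href="/search">Search Again</a></p>"""
soup = BeautifulSoup(html, "html.parser")

before = len(soup.find_all("a"))

# Guard against pages that have no <p> or no link inside it.
paragraph = soup.p
extra = paragraph.a if paragraph is not None else None
if extra is not None:
    extra.decompose()  # deletes the tag AND its contents from the soup

remaining = [a.get_text() for a in soup.find_all("a")]
print(before, remaining)
```

The guard is optional for a page you have inspected by hand, but it prevents an AttributeError if the script is later pointed at a page with a different layout.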
Stripping Tags and Writing Content to a CSV file
While displaying these things in the Terminal is useful for verifying that the scripts are working, we need to save the data into a file in order to use it for other projects. And, the html tags are still surrounding all of our data. Let’s strip away the tags and save the data into a file.
In order to clean up the HTML tags and split the URLs from the names, we need to isolate the information from the HTML tags. To do this, we will use two powerful and commonly used Beautiful Soup features: the “contents” attribute and the “get” method.
Here is the file – I will explain the different pieces below.
from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup(open("43rd-congress.html"))

final_link = soup.p.a
final_link.decompose()

links = soup.find_all('a')
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')

    # opened in append mode because this runs once per link
    f = csv.writer(open("43rd_Congress.csv", "a"))
    f.writerow([names, fullLink])
The first change we’ve made is to add “import csv” to the beginning of the file. This is because we are going to use the csv library to write the file. You should not need to download the csv library.
The second change comes in the for loop. Instead of merely printing all of the content of “link,” we are identifying the pieces of information that we want. To isolate the names, we are using the “contents” attribute, and for the links, we are using the “get” method.
“contents” returns a list of everything nested directly inside a tag. For example, if you started with “<h2>This is my Header text</h2>”, the list would hold the single item “This is my Header text”. In this case, we are taking the first item in that list with [0]. (There is only one item in our list at the moment, but the computer is ever literal and needs to be told where to look.)
“get” is a method for reading the value of an attribute on a tag. Here we are getting the value of the “href” attribute, which is the URL.
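The difference between the two is easier to see on a link that holds more than plain text. The sketch below uses invented markup: the first anchor has a nested tag inside it and the second has no href at all.

```python
from bs4 import BeautifulSoup

# Invented markup: a link with a nested <i> tag, and a link with no href.
html = '<a href="http://example.com/bio">ADAMS, <i>George</i> Madison</a><a>no link</a>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

children = first.contents        # a list: text, the <i> tag, more text
first_child = first.contents[0]  # only the first piece of text
whole_text = first.get_text()    # all of the text, nested tags included

href = first.get("href")         # the attribute value
missing = second.get("href")     # None, rather than an error

print(len(children), repr(str(first_child)), whole_text, href, missing)
```

On the Congress page each anchor holds a single piece of text, so contents[0] and get_text() give the same name; get_text() is the safer choice if the markup might contain nested tags.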
Finally, we are using the csv library to write the file. Because we are executing this within the loop, we need to append (‘a’) rather than write (‘w’) to the file. This syntax tells the computer to include the data from names and the data from fullLinks on each row, separated by a comma.
When executed, this gives us a clean CSV file that we can then use for other purposes. With that, we have solved our first challenge and extracted the names and URLs from the HTML file.
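One alternative to appending inside the loop is to open the output file once, before the loop starts. This is a sketch rather than the tutorial’s method: the two-row table, the example.com URLs, and the temporary output path are all invented so that the example runs on its own.

```python
import csv
import os
import tempfile

from bs4 import BeautifulSoup

# Invented stand-in for the saved results page.
html = """<table>
<tr><td><a href="http://example.com/1">ADAMS, George Madison</a></td></tr>
<tr><td><a href="http://example.com/2">ALBERT, William Julian</a></td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

out_path = os.path.join(tempfile.mkdtemp(), "names.csv")

# Open once in write mode; newline="" is what the csv module expects for files.
with open(out_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    for link in soup.find_all("a"):
        writer.writerow([link.get_text(), link.get("href")])

with open(out_path, encoding="utf-8") as handle:
    written = handle.read()
print(written)
```

Opening in write mode also means that running the script twice replaces the file instead of adding a second copy of every row, which is what append mode would do.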
In the first part of this tutorial, we extracted the names and links from the webpage. In this part, we will go one step further and move all of the table data into the csv file so that we can more easily use it elsewhere.
Reviewing the Challenge
Back to the HTML
Let’s review again the file that we’re attempting to extract data from.
<!-- saved from url=(0053)http://bioguide.congress.gov/biosearch/biosearch1.asp -->
Congressional Biographical Directory
<table width="100%" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
  <td valign="TOP" bgcolor="#990000" width="100%"><center><img src="./43rd-congress_files/topbanner.jpg" alt="" border="0" /></center></td>
</tr>
</tbody>
</table>
<center><strong><em>Click Member Name to view Biography</em></strong>
<table border="1" cellspacing="2" cellpadding="3">
<tbody>
<tr>
  <th>Member Name</th>
  <th>Birth-Death</th>
  <th>Position</th>
  <th>Party</th>
  <th>State</th>
  <th>Congress (Year)</th>
</tr>
<tr>
  <td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td>
  <td>1837-1920</td>
  <td>Representative</td>
  <td>Democrat</td>
  <td align="center">KY</td>
  <td align="center">43 (1873-1874)</td>
</tr>
<tr>
  <td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td>
  <td>1816-1879</td>
  <td>Representative</td>
  <td>Republican</td>
  <td align="center">MD</td>
  <td align="center">43 (1873-1874)</td>
</tr>
<tr>
  <td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000077">ALBRIGHT, Charles</a></td>
  <td>1830-1880</td>
  <td>Representative</td>
  <td>Republican</td>
  <td align="center">PA</td>
  <td align="center">43 (1873-1874)</td>
</tr>
<tr>
  <td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000079">ALCORN, James Lusk</a></td>
  <td>1816-1894</td>
  <td>Senator</td>
  <td>Republican</td>
  <td align="center">MS</td>
  <td align="center">43 (1873-1874)</td>
</tr>
<tr>
  <td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160">ALLISON, William Boyd</a></td>
  <td>1829-1908</td>
</tr>
</tbody>
</table>
When we were looking for names, all of the data that we wanted was contained within the anchor tags, which allowed us to make a targeted search. Now, all of the data we want is contained in the html table structure. Getting this data out is the puzzle we’re going to solve.
Previewing the Final Product
We know what the HTML file looks like. The CSV file will look as follows:
"ADAMS, George Madison",1837-1920,Representative,Democrat,KY,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035
"ALBERT, William Julian",1816-1879,Representative,Republican,MD,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074
"ALBRIGHT, Charles",1830-1880,Representative,Republican,PA,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000077
"ALCORN, James Lusk",1816-1894,Senator,Republican,MS,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000079
"ALLISON, William Boyd",1829-1908,Senator,Republican,IA,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160
"AMES, Adelbert",1835-1933,Senator,Republican,MS,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000172
"ANTHONY, Henry Bowen",1815-1884,Senator,Republican,RI,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000262
"ARCHER, Stevenson",1827-1898,Representative,Democrat,MD,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000274
"ARMSTRONG, Moses Kimball",1832-1906,Delegate,Democrat,DK,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000283
"ARTHUR, William Evans",1825-1897,Representative,Democrat,KY,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000304
"ASHE, Thomas Samuel",1812-1887,Representative,Democrat,NC,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000309
"ATKINS, John DeWitt Clinton",1825-1908,Representative,Democrat,TN,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000327
"AVERILL, John Thomas",1825-1889,Representative,Republican,MN,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000344
And this is the code that will get us there:
from bs4 import BeautifulSoup
import csv

# open the html file and create a soup object
soup = BeautifulSoup(open("43rd-congress.html"))

# get rid of the final link that is outside the table
final_link = soup.p.a
final_link.decompose()

# get rid of the banner cell at the top of the page, which is not part of the data for inclusion in the CSV file
rogue = soup.find(bgcolor="#990000")
rogue.decompose()

trs = soup.find_all("tr")  # find all of the table rows

for tr in trs:  # for each item in the list of rows
    for link in tr.find_all('a'):  # this is a bit tricky - you are combining the search for anchor tags and the for loop in one step
        fullLink = link.get('href')  # get the value of the href
        tds = tr.find_all("td")  # run another search for all of the table data

        try:  # we are using "try" because the table is not well formatted. This allows the program to continue after encountering a short row.
            names = str(tds[0].get_text())  # This structure isolates the item by its column in the table and converts it into a string.
            years = str(tds[1].get_text())
            positions = str(tds[2].get_text())
            parties = str(tds[3].get_text())
            states = str(tds[4].get_text())
            congress = tds[5].get_text()
        except IndexError:  # the row has fewer cells than expected
            print("bad tr string")
            continue  # This tells the computer to move on to the next item after it encounters an error

        f = csv.writer(open("43rd_Congress.csv", "a"))
        f.writerow([names, years, positions, parties, states, congress, fullLink])  # you can write the fields in whatever order you wish.
Writing the Script
The Problem of Extra Data
This is the code we had from the end of Part I:
from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup(open("43rd-congress.html"))

final_link = soup.p.a
final_link.decompose()

links = soup.find_all('a')
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')

    f = csv.writer(open("43rd_Congress.csv", "a"))
    f.writerow([names, fullLink])
The problem of extra data that we had in Part I was that there was an additional anchor tag, giving us an additional line. While that problem still exists, we have an additional problem. There is an additional table at the top of the file that has styling data. We can use an additional decompose line, identifying this table by the color information as this is the only place where there is color information in the file.
rogue = soup.find(bgcolor="#990000")
rogue.decompose()
These two lines find the first tag that carries the attribute bgcolor=“#990000” and delete it along with everything inside it. We know we don’t want any of this information, so we can decompose it.
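The attribute search can be checked on a cut-down version of the page. In this sketch the two tables are invented stand-ins for the banner table and the data table, and “banner.jpg” is a placeholder filename.

```python
from bs4 import BeautifulSoup

# Invented stand-ins: a styling table with a colored cell, then a data table.
html = """<table><tr><td bgcolor="#990000"><img src="banner.jpg"/></td></tr></table>
<table><tr><td>ADAMS</td></tr></table>"""
soup = BeautifulSoup(html, "html.parser")

# A keyword argument to find() is treated as an attribute filter.
rogue = soup.find(bgcolor="#990000")
tag_name = rogue.name  # which tag actually carried the attribute

rogue.decompose()
cells = [td.get_text() for td in soup.find_all("td")]
print(tag_name, cells)
```

Note that only the matching tag (here the cell) is removed; the now-empty row and table around it stay in the soup, which is harmless for this task because they contain no links.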
Identifying the Parts
We know that everything we do want for our CSV file lives within table row (“tr”) tags. We also know that these items appear in the same order within each row. Because we are dealing with lists, we can identify each piece of information by its place in the list. This means that the first item in a row is identified by [0], the second by [1], etc.
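A single row is enough to show the indexing at work. The row below reuses the first member from the page but is otherwise a simplified, three-cell stand-in.

```python
from bs4 import BeautifulSoup

# Simplified one-row table for illustration.
html = ("<table><tr><td>ADAMS, George Madison</td>"
        "<td>1837-1920</td><td>Representative</td></tr></table>")
soup = BeautifulSoup(html, "html.parser")

tds = soup.tr.find_all("td")  # a list of the cells, in document order

name = tds[0].get_text()      # first cell
years = tds[1].get_text()     # second cell
position = tds[2].get_text()  # third cell

print(name, years, position)
```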
Extracting the Data
We can extract the data in two moves. First, we isolate the link information and then we move on to the information within the various html tags.
For the first, we create a loop from a search for all of the anchor tags in the row. Then, for each anchor, we “get” the value of its “href” attribute.
for link in tr.find_all('a'):
    fullLink = link.get('href')
We then need to run a search for the table data within the table rows.
tds = tr.find_all("td")
Next, we need to extract the data we want. Because not all of the rows contain the same number of data items, we need to build in a way to tell the script to move on if it encounters an error. This is the logic of the “try” and “except” block. If a particular row fails, the script will continue on to the next one.
Within this we are using the following structure:
years = str(tds[1].get_text())
In this, we are applying the “get_text” method to the second element in the row (because computers count beginning with 0) and then creating a string from the result. We assign this to a variable, which we will use to create the CSV file. We repeat this for every item in the table that we want to capture in our file.
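Here is a minimal sketch of that skip-on-error logic, assuming the only failure we expect is a row with too few cells. The two-row table is invented; its second row is deliberately short.

```python
from bs4 import BeautifulSoup

# Invented table: the second row is missing its third cell.
html = """<table>
<tr><td>ADAMS, George Madison</td><td>1837-1920</td><td>Representative</td></tr>
<tr><td>ALLISON, William Boyd</td><td>1829-1908</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

good_rows = []
skipped = 0
for tr in soup.find_all("tr"):
    tds = tr.find_all("td")
    try:
        row = [tds[0].get_text(), tds[1].get_text(), tds[2].get_text()]
    except IndexError:
        # Asking for a cell that does not exist raises IndexError.
        skipped += 1
        continue
    good_rows.append(row)

print(good_rows, skipped)
```

Catching IndexError specifically, rather than every possible error, means that an unrelated mistake such as a misspelled variable name still stops the script and gets noticed.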
Writing the CSV file
The last step in this file is to create the CSV file. Here we are using the same process as we did in Part I, just with more variables. Again, because we are writing within the loop, use ‘a’ for append rather than ‘w’ for write.
As a result, our file will look like:
from bs4 import BeautifulSoup
import csv

# open the html file and create a soup object
soup = BeautifulSoup(open("43rd-congress.html"))

# get rid of the final link that is outside the table
final_link = soup.p.a
final_link.decompose()

# get rid of the banner cell at the top of the page, which is not part of the data for inclusion in the CSV file
rogue = soup.find(bgcolor="#990000")
rogue.decompose()

trs = soup.find_all("tr")  # find all of the table rows

for tr in trs:  # for each item in the list of rows
    for link in tr.find_all('a'):  # this is a bit tricky - you are combining the search for anchor tags and the for loop in one step
        fullLink = link.get('href')  # get the value of the href
        tds = tr.find_all("td")  # run another search for all of the table data

        try:  # we are using "try" because the table is not well formatted. This allows the program to continue after encountering a short row.
            names = str(tds[0].get_text())  # This structure isolates the item by its column in the table and converts it into a string.
            years = str(tds[1].get_text())
            positions = str(tds[2].get_text())
            parties = str(tds[3].get_text())
            states = str(tds[4].get_text())
            congress = tds[5].get_text()
        except IndexError:  # the row has fewer cells than expected
            print("bad tr string")
            continue  # This tells the computer to move on to the next item after it encounters an error

        f = csv.writer(open("43rd_Congress.csv", "a"))
        f.writerow([names, years, positions, parties, states, congress, fullLink])  # you can write the fields in whatever order you wish.
You’ve done it! You have created a CSV file from all of the data in the table, creating useful data from the confusion of the html page.
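If you want to experiment with the row-by-row approach without touching your saved page, the sketch below runs the same idea on an invented three-row table and writes to an in-memory buffer instead of a file. It checks the row length up front rather than using “try”, which is one possible variation and not the tutorial’s own code; the example.com URLs are placeholders.

```python
import csv
import io

from bs4 import BeautifulSoup

# Invented table: a header row, one complete row, and one short row.
html = """<table>
<tr><th>Member Name</th><th>Birth-Death</th></tr>
<tr><td><a href="http://example.com/1">ADAMS, George Madison</a></td><td>1837-1920</td></tr>
<tr><td><a href="http://example.com/2">ALBERT, William Julian</a></td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

buffer = io.StringIO()  # stands in for an open CSV file
writer = csv.writer(buffer, lineterminator="\n")

for tr in soup.find_all("tr"):
    link = tr.find("a")       # None for the header row
    tds = tr.find_all("td")
    if link is None or len(tds) < 2:
        continue              # skip header rows and incomplete rows
    writer.writerow([tds[0].get_text(), tds[1].get_text(), link.get("href")])

output = buffer.getvalue()
print(output)
```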















