Beautiful Soup Tutorial

What is Beautiful Soup?

Overview

“You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.” (Opening lines of Beautiful Soup)

Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup. It provides a way to extract particular content from a webpage by its location in the HTML tags (think XPath and navigating the DOM), by CSS selector, or by HTML id or class attributes (or some combination thereof).

So say there is a website that contains data that is relevant to your research, such as date or address information, or link information for other sources of data that you also want to scrape. Beautiful Soup offers easy ways to pull that particular content from the webpage, remove it from its html wrappings, and put it into a new context which will allow you to do whatever next operation you desire to do.

I highly recommend looking at the Beautiful Soup documentation pages to get a sense of the variety of things you can do with simple Beautiful Soup commands, from isolating titles and links to extracting all of the text from the html tags to altering the html within the document you’re working with.

Installing Beautiful Soup

Installing Beautiful Soup is easiest if you already have pip or another Python installer in place. If you don’t have pip, start with Fred’s tutorial on installing Python modules. Once you have pip installed, run the following command to install Beautiful Soup:

pip install beautifulsoup4

You may need to include “sudo” in your command. Sudo gives your computer permission to write to your root directories and requires you to re-enter your password. This is the same logic behind your being prompted to enter your password when you install a new program.
With sudo, the command is:

sudo pip install beautifulsoup4

Using Beautiful Soup in a Python Script

There are two basic steps to using Beautiful Soup in your python script. First is to import the library at the beginning of your script by writing:

from bs4 import BeautifulSoup

Second, you have to pass the document or url to Beautiful Soup to make the “soup.” For this example we will be using a locally saved file and will create the soup this way:

soup = BeautifulSoup(open("example.txt"))

This creates a large soup object out of the content of our “example.txt” file and we can then run the Beautiful Soup methods on that object.
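
As a minimal sketch of these two steps, here is the same idea using an inline HTML string instead of a file, so that it runs anywhere. The sample markup is made up for illustration, and passing "html.parser" as a second argument is my own addition: naming the parser explicitly avoids the warning that recent versions of Beautiful Soup print when none is given.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the contents of "example.txt".
html = "<html><head><title>Example page</title></head><body><p>Hello, soup!</p></body></html>"

# Naming the parser explicitly keeps the behaviour the same on every machine.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string    # text inside the <title> tag
first_para = soup.p.get_text()    # text inside the first <p> tag
print(title_text)
print(first_para)
```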

Application: Extracting names and URLs from an HTML page

Preview: Where we are going

Because I like to see where the finish line is before starting, I will begin with a view of what we are trying to create. We are attempting to go from a search results page where the html page looks like this:

<table border="1" cellspacing="2" cellpadding="3">
<tbody>
<tr>
<th>Member Name</th>
<th>Birth-Death</th>
</tr>
<tr>
<td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td>
<td>1837-1920</td>
</tr>
<tr>
<td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td>
<td>1816-1879</td>
</tr>
<tr>
<td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000077">ALBRIGHT, Charles</a></td>
<td>1830-1880</td>
</tr>
<tr>
<td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000079">ALCORN, James Lusk</a></td>
<td>1816-1894</td>
</tr>
<tr>
<td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160">ALLISON, William Boyd</a></td>
<td>1829-1908</td>
</tr>
</tbody>
</table>

to a CSV file with names and urls that looks like this:

"ADAMS, George Madison",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035
"ALBERT, William Julian",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074
"ALBRIGHT, Charles",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000077
"ALCORN, James Lusk",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000079
"ALLISON, William Boyd",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160
"AMES, Adelbert",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000172
"ANTHONY, Henry Bowen",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000262
"ARCHER, Stevenson",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000274
"ARMSTRONG, Moses Kimball",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000283
"ARTHUR, William Evans",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000304
"ASHE, Thomas Samuel",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000309
"ATKINS, John DeWitt Clinton",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000327
"AVERILL, John Thomas",http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000344
...

The finished code is:

from bs4 import BeautifulSoup

soup = BeautifulSoup (open("43rd-congress.html"))

final_link = soup.p.a
final_link.decompose()

clean_list = []
links = soup.find_all('a')
for link in links:
	single_link = link['href']
	name = link.contents[0]
	entry = "%s, %s" % (name, single_link)
	clean_list.append(entry)

f = open("43rd_results.csv", "w")
f.write("\n".join(clean_list))
f.close()

but follow along to understand how Beautiful Soup gets us to that point.

Get a file to scrape

The first step is getting the files for scraping. This can be done in a variety of ways. Usually, I would recommend scraping using wget or cURL (see my slides on an introduction to webscraping). To do this, use wget or cURL in Terminal and point it at the particular webpage or folder that you want to download. However, the Congressional database is a bit more complicated because the URL for particular search results is hidden. While this can be bypassed programmatically, it is easier for our purposes to go to http://bioguide.congress.gov/biosearch/biosearch.asp, search for Congress number 43, and to save a copy of the webpage of results.

Selecting “File” and “Save Page As …” from your browser window will accomplish this. For a filename, avoid spaces – I am using “43rd-congress.html”. Move the file into the folder you want to work in and let’s proceed.

Identify content

One of the first things Beautiful Soup can help us with is getting a sense of how the different HTML tags are nested within each other. This can be very useful when you need to isolate content that is buried within the HTML structure as Beautiful Soup allows you to select content based upon tag within tag within tag (example: soup.body.p.b finds the first bold item inside a paragraph tag inside the body tag in the document). To get a good view of how the tags are nested in the document, we can use the method “prettify” on our soup object. Create a new file called soupexample.py. This file will contain your Python script that we will be developing over the course of the tutorial. In this file we need to import the Beautiful Soup library, open the file and pass it to Beautiful Soup, and then print the pretty version in the terminal.

from bs4 import BeautifulSoup

soup = BeautifulSoup (open("43rd-congress.html"))

print(soup.prettify())

Save this file in the folder with your HTML file and go to the command line. Navigate (use ‘cd’) to the folder you’re working in and execute the following:

python soupexample.py

You should see your terminal window fill up with a nicely indented version of the original html text. This is a clean picture of how the various tags relate to one another.
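
To make the “tag within tag within tag” idea concrete, here is a small sketch of soup.body.p.b on a made-up snippet (the markup is my own, not taken from the Congress page). Each step in the chain returns the first matching tag at that level.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two paragraphs and several bold items.
html = """<html><body>
<p>Plain text, then <b>first bold</b> and <b>second bold</b>.</p>
<p>Another paragraph with <b>more bold</b>.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# body -> first <p> inside it -> first <b> inside that <p>
first_bold = soup.body.p.b
print(first_bold)             # the whole tag, wrappers included
print(first_bold.get_text())  # only the text inside it
```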

Using BeautifulSoup to select particular content

So, we are interested in the links and names of the various members of the 43rd Congress. Looking at the “pretty” version of the file, the first thing to notice is that this is a relatively flat file – our tags are not too deeply embedded within each other.

While this makes some of the identifying more difficult, we are interested in the names and urls and all of these are, most fortunately, embedded in “<a>” tags. So, we need to isolate out all of the “<a>” tags. We can do this by updating the code in “soupexample.py” to the following:

from bs4 import BeautifulSoup

soup = BeautifulSoup (open("43rd-congress.html"))

links = soup.find_all('a')

for link in links:
	print(link)

Save and run the script again to see all of the anchor tags in the document.

python soupexample.py

One thing to notice is that there is an additional link in our file – the link for an additional search. We can get rid of this with just a line or two of additional code. Going back to the pretty version, notice that this last “<a>” tag is not within the table but is within a “<p>” tag.

Because Beautiful Soup allows us to modify the data, we can remove the “<a>” that is under the “<p>” before searching for all the “<a>” tags.

To do this, we can use the “decompose” method, which erases whatever you tell Beautiful Soup to decompose. Do be careful when using “decompose” – you are deleting both the html tag and all of the data inside of that tag. If you have not correctly isolated the data, you may be deleting information that you needed to extract. Update the file as below and run again.

from bs4 import BeautifulSoup

soup = BeautifulSoup (open("43rd-congress.html"))

final_link = soup.p.a
final_link.decompose()

links = soup.find_all('a')

for link in links:
	print(link)

And success! We have isolated out all of the links we want and none of the links we don’t!
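
Here is a small, self-contained sketch of what decompose does, using two rows from the table plus a stand-in for the extra link. The “Search Again” link text is an assumption on my part; check your own saved file for the real wording.

```python
from bs4 import BeautifulSoup

# Two real rows from the results table, plus a hypothetical extra link in a <p>.
html = """<table>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td></tr>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td></tr>
</table>
<p><a href="http://bioguide.congress.gov/biosearch/biosearch.asp">Search Again</a></p>"""

soup = BeautifulSoup(html, "html.parser")
count_before = len(soup.find_all("a"))

# Only decompose if the tag is really there; soup.p.a is None when it is missing.
final_link = soup.p.a if soup.p is not None else None
if final_link is not None:
    final_link.decompose()

names_after = [a.get_text() for a in soup.find_all("a")]
print(count_before, names_after)
```

Checking for None first is a small safety net: calling decompose on a tag that was not found would stop the script with an error.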

Stripping Tags and Writing Content to a CSV file

While displaying these things in the Terminal is useful for verifying that the scripts are working, we need to save the data into a file in order to use it for other projects. And, the html tags are still surrounding all of our data. Let’s strip away the tags and save the data into a file.

In order to clean up the html tags and split the URLs from the names, we need to isolate the information from the html tags. To do this, we will use two powerful and commonly used Beautiful Soup features: the “contents” attribute and the “get” method.

Here is the file – I will explain the different pieces below.

from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup (open("43rd-congress.html"))

final_link = soup.p.a
final_link.decompose()

links = soup.find_all('a')
for link in links:
	names = link.contents[0]
	fullLink = link.get('href')

	f = csv.writer(open("43rd_Congress.csv", "a"))
	f.writerow([names, fullLink])

The first change we’ve made is to add “import csv” to the beginning of the file. This is because we are going to use the csv library to write the file. You should not need to download the csv library.

The second change comes in the for loop. Instead of merely printing all of the content of “link,” we are identifying the pieces of information that we want. To isolate the names, we are using the attribute “contents,” and for the links, we are using the method “get.”

Contents gives us a list of everything directly inside a tag. For example, if you started with “<h2>This is my Header text</h2>”, the contents of that tag would be a list with one item, “This is my Header text”. In this case, we are taking the first element of that list with [0]. (There is only one element in our list at the moment, but the computer is ever literal and needs to be told where to look.)

Get selects the value of an attribute on a tag. Here we are getting the value of the “href” attribute, which is the URL the link points to.
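
The difference between these is easiest to see on a link that has more than one piece inside it. The sketch below uses a made-up anchor tag (the example.com URL and the italic tag are assumptions for illustration) and also shows get_text, which gathers all of the text in one go.

```python
from bs4 import BeautifulSoup

# Hypothetical link with a nested tag, so that contents has more than one item.
html = '<a href="http://example.com/page">ADAMS, <i>George</i> Madison</a>'
link = BeautifulSoup(html, "html.parser").a

pieces = link.contents         # list of children: text, the <i> tag, more text
first_piece = link.contents[0] # just the first child
all_text = link.get_text()     # every bit of text, tags stripped
href = link.get("href")        # value of the href attribute
missing = link.get("title")    # get returns None for an attribute that is absent

print(pieces)
print(all_text, href, missing)
```

On the Congress page each link holds a single piece of text, so contents[0] and get_text() give the same name there.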

Finally, we are using the csv library to write the file. Because we are opening the file inside the loop, we need to append (‘a’) rather than write (‘w’); opening with ‘w’ would erase the file on every pass and leave only the last row. This syntax tells the computer to include the data from names and the data from fullLink on each row, separated by a comma.

When executed, this gives us a clean CSV file that we can then use for other purposes. And so we have solved our first challenge and have extracted names and URLs from the HTML file.
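
An alternative worth knowing, sketched here under the assumption that you are running Python 3: open the file once, before the loop, and let a “with” block close it for you. The rows are two entries from the table; writing to a temporary folder is only so the sketch does not clutter your working directory.

```python
import csv
import os
import tempfile

rows = [
    ("ADAMS, George Madison", "http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035"),
    ("ALBERT, William Julian", "http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074"),
]

# Temporary location for the sketch; in the tutorial this would sit in your working folder.
out_path = os.path.join(tempfile.mkdtemp(), "43rd_Congress.csv")

# Opened once in write mode; newline="" is what the csv documentation recommends in Python 3.
with open(out_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    for name, url in rows:
        writer.writerow([name, url])  # names containing commas are quoted automatically

# Read the file back to confirm what was written.
with open(out_path, newline="") as handle:
    written = list(csv.reader(handle))
print(written)
```

Because the file is opened in write mode a single time, running the script twice does not pile up duplicate rows the way append mode does.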


In the first part of this tutorial, we extracted the names and links from the webpage. In this part, we will go one step further and move all of the table data into the csv file so that we can more easily use it elsewhere.

Reviewing the Challenge

Back to the HTML

Let’s review again the file that we’re attempting to extract data from.

<!-- saved from url=(0053)http://bioguide.congress.gov/biosearch/biosearch1.asp -->
Congressional Biographical Directory</pre>
<table width="100%" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="TOP" bgcolor="#990000" width="100%"><center><img src="./43rd-congress_files/topbanner.jpg" alt="" border="0" /></center></td>
</tr>
</tbody>
</table>
<pre></pre>
&nbsp;

<center><strong><em>Click Member Name to view Biography</em></strong>
<table border="1" cellspacing="2" cellpadding="3">
<tbody>
<tr>
<th>Member Name</th>
<th>Birth-Death</th>
<th>Position</th>
<th>Party</th>
<th>State</th>
<th>Congress (Year)</th>
</tr>
<tr>
<td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td>
<td>1837-1920</td>
<td>Representative</td>
<td>Democrat</td>
<td align="center">KY</td>
<td align="center">43 (1873-1874)</td>
</tr>
<tr>
<td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td>
<td>1816-1879</td>
<td>Representative</td>
<td>Republican</td>
<td align="center">MD</td>
<td align="center">43 (1873-1874)</td>
</tr>
<tr>
<td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000077">ALBRIGHT, Charles</a></td>
<td>1830-1880</td>
<td>Representative</td>
<td>Republican</td>
<td align="center">PA</td>
<td align="center">43 (1873-1874)</td>
</tr>
<tr>
<td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000079">ALCORN, James Lusk</a></td>
<td>1816-1894</td>
<td>Senator</td>
<td>Republican</td>
<td align="center">MS</td>
<td align="center">43 (1873-1874)</td>
</tr>
<tr>
<td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160">ALLISON, William Boyd</a></td>
<td>1829-1908</td>
</tr>
</tbody>
</table>

When we were looking for names, all of the data that we wanted was contained within the anchor tags, which allowed us to make a targeted search. Now, all of the data we want is contained in the html table structure. Getting this data out is the puzzle we’re going to solve.

Previewing the Final Product

We know what the html file looks like. The CSV file will look as follows:

"ADAMS, George Madison",1837-1920,Representative,Democrat,KY,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035
"ALBERT, William Julian",1816-1879,Representative,Republican,MD,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074
"ALBRIGHT, Charles",1830-1880,Representative,Republican,PA,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000077
"ALCORN, James Lusk",1816-1894,Senator,Republican,MS,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000079
"ALLISON, William Boyd",1829-1908,Senator,Republican,IA,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160
"AMES, Adelbert",1835-1933,Senator,Republican,MS,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000172
"ANTHONY, Henry Bowen",1815-1884,Senator,Republican,RI,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000262
"ARCHER, Stevenson",1827-1898,Representative,Democrat,MD,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000274
"ARMSTRONG, Moses Kimball",1832-1906,Delegate,Democrat,DK,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000283
"ARTHUR, William Evans",1825-1897,Representative,Democrat,KY,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000304
"ASHE, Thomas Samuel",1812-1887,Representative,Democrat,NC,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000309
"ATKINS, John DeWitt Clinton",1825-1908,Representative,Democrat,TN,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000327
"AVERILL, John Thomas",1825-1889,Representative,Republican,MN,43(1873-1874),http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000344

And this is the code that will get us there:

from bs4 import BeautifulSoup
import csv

#open the html file and create a soup object
soup = BeautifulSoup(open("43rd-congress.html"))

#get rid of the final link that is outside the table
final_link = soup.p.a
final_link.decompose()

#get rid of the link that is within the table data but is not part of the data for inclusion in the CSV file
rogue = soup.find(bgcolor="#990000")
rogue.decompose()

trs = soup.find_all("tr") #find all of the table rows

for tr in trs: #for each item in the list of rows
	for link in tr.find_all('a'): #this is a bit tricky - you are combining the search for anchor tags and the for loop in one step
		fullLink = link.get('href') #get the value of the href

	tds = tr.find_all("td") #run another search for all of the table data

	try: #we are using "try" because the table is not well formatted. This allows the program to continue after encountering an error.
		names = str(tds[0].get_text()) # This structure isolates the item by its column in the table and converts it into a string.
		years = str(tds[1].get_text())
		positions = str(tds[2].get_text())
		parties = str(tds[3].get_text())
		states = str(tds[4].get_text())
		congress = tds[5].get_text()

	except:
		print("bad tr string")
		continue #This tells the computer to move on to the next item after it encounters an error

	f = csv.writer(open("43rd_Congress.csv", "a"))
	f.writerow([names, years, positions, parties, states, congress, fullLink]) #you can write the fields in whatever order you wish.

 

Writing the Script

 

The Problem of Extra Data

This is the code we had from the end of Part I:

from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup (open("43rd-congress.html"))

final_link = soup.p.a
final_link.decompose()

links = soup.find_all('a')
for link in links:
	names = link.contents[0]
	fullLink = link.get('href')

	f = csv.writer(open("43rd_Congress.csv", "a"))
	f.writerow([names, fullLink])

The problem of extra data that we had in Part I was that there was an additional anchor tag, giving us an additional line. While that problem still exists, we have an additional problem. There is an additional table at the top of the file that has styling data. We can use an additional decompose line, identifying this table by the color information as this is the only place where there is color information in the file.

rogue = soup.find(bgcolor="#990000")
rogue.decompose()

These two lines find the first tag that carries the color information bgcolor=“#990000” and store it, along with everything inside it, in “rogue.” We know we don’t want any of this information, so we can decompose it.
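
As a quick sketch of searching by attribute, the snippet below uses a cut-down version of the banner table followed by one data row (the shortened image path is my simplification). Note that find returns only the first match, or None if nothing matches.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the banner table and one row of real data.
html = """<table><tr><td bgcolor="#990000"><img src="topbanner.jpg" alt=""></td></tr></table>
<table><tr><td>ADAMS, George Madison</td><td>1837-1920</td></tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# Keyword arguments to find are treated as attribute filters.
rogue = soup.find(bgcolor="#990000")
if rogue is not None:
    rogue.decompose()  # removes the <td> and the <img> inside it

remaining = [td.get_text() for td in soup.find_all("td")]
print(remaining)
```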

Identifying the Parts

We know that everything we do want for our CSV file lives within table row (“tr”) tags. We also know that these items appear in the same order within each row. Because we are dealing with lists, we can identify each piece of information by its place in the list. This means that the first item in a row is identified by [0], the second by [1], etc.

Extracting the Data

We can extract the data in two moves. First, we isolate the link information and then we move on to the information within the various html tags.

For the first, we create a loop from a search for all of the anchor tags in the row. Then we move through this to “get” the value of each tag’s “href” attribute.

for link in tr.find_all('a'):
    fullLink = link.get('href')

We then need to run a search for the table data within the table rows.

tds = tr.find_all("td")

Next, we need to extract the data we want. Because not all of the rows contain the same number of data items, we need to build in a way to tell the script to move on if it encounters an error. This is the logic of the “try”, “except”. If a particular line fails, the script will continue on to the next line.

Within this we are using the following structure:

years = str(tds[1].get_text())

In this, we are applying the “get_text” method to the 2nd element in the row (because computers count beginning with 0) and then creating a string from the result. This we assign to a variable, which we will use to create the csv file. We repeat this for every item in the table that we want to capture in our file.
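
If you would rather not lean on “try” and “except”, one alternative is to check how many cells a row has before indexing into it. This is only a sketch of that idea, not the approach the tutorial takes; the rows are copied from the sample above, including the header row and the short ALLISON row.

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><th>Member Name</th><th>Birth-Death</th><th>Position</th><th>Party</th><th>State</th><th>Congress (Year)</th></tr>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td><td>1837-1920</td><td>Representative</td><td>Democrat</td><td>KY</td><td>43 (1873-1874)</td></tr>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160">ALLISON, William Boyd</a></td><td>1829-1908</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
records = []
skipped = 0

for tr in soup.find_all("tr"):
    tds = tr.find_all("td")
    link = tr.find("a")
    # Header rows have no <td> cells and malformed rows have too few; skip both.
    if len(tds) < 6 or link is None:
        skipped += 1
        continue
    fields = [td.get_text() for td in tds[:6]]
    records.append(fields + [link.get("href")])

print(records)
print("rows skipped:", skipped)
```

The explicit check also means a skipped row can never borrow a link left over from the row before it.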

Writing the CSV file

The last step in this file is to create the CSV file. Here we are using the same process as we did in Part I, just with more variables. Again, because we are writing within the loop, use ‘a’ for append rather than ‘w’ for write.

As a result, our file will look like:

from bs4 import BeautifulSoup
import csv

#open the html file and create a soup object
soup = BeautifulSoup(open("43rd-congress.html"))

#get rid of the final link that is outside the table
final_link = soup.p.a
final_link.decompose()

#get rid of the link that is within the table data but is not part of the data for inclusion in the CSV file
rogue = soup.find(bgcolor="#990000")
rogue.decompose()

trs = soup.find_all("tr") #find all of the table rows

for tr in trs: #for each item in the list of rows
	for link in tr.find_all('a'): #this is a bit tricky - you are combining the search for anchor tags and the for loop in one step
		fullLink = link.get('href') #get the value of the href

	tds = tr.find_all("td") #run another search for all of the table data

	try: #we are using "try" because the table is not well formatted. This allows the program to continue after encountering an error.
		names = str(tds[0].get_text()) # This structure isolates the item by its column in the table and converts it into a string.
		years = str(tds[1].get_text())
		positions = str(tds[2].get_text())
		parties = str(tds[3].get_text())
		states = str(tds[4].get_text())
		congress = tds[5].get_text()

	except:
		print("bad tr string")
		continue #This tells the computer to move on to the next item after it encounters an error

	f = csv.writer(open("43rd_Congress.csv", "a"))
	f.writerow([names, years, positions, parties, states, congress, fullLink]) #you can write the fields in whatever order you wish.

You’ve done it! You have created a CSV file from all of the data in the table, creating useful data from the confusion of the html page.

Cleaning up a Google Book for text mining

Lesson Goals & Reasons

Why?

Google Books has increasingly become one of the main repositories of textual sources for historians and other humanities scholars. As many have discussed, it has its issues. Nonetheless, it remains a valuable place to gather texts for those interested in text mining–provided we clean the text first.

Why is this important? You want your text mining to give you accurate results. The example that I am using, a volume of the Diplomatic Correspondence of the Republic of Texas, exemplifies some of these issues. When I ran the text file through Voyant, here’s what happened:

As you can see, “Digitized” and “Google” appeared as two of the most common words! Presumably Texas diplomats in the 1830s and 1840s would not have been writing about digitization or about Google. Additionally, the words United and States popped up frequently–because the top of each page included “Correspondence with the United States” for a large part of the book. I want to know if the words “United States” do indeed show up frequently otherwise, along with other important ideas, like words having to do with annexation. Using text mining techniques could help contribute to an understanding of Texas’s diplomatic policies when it was a republic–but only if we can run those features on an accurate copy of the correspondence.

Goals

This lesson will use a sample Google Books document with some of the issues that tend to accompany texts from that repository. You will clean that document–removing some of the specific Google Books formatting, then going through basic steps to prepare it for text mining. Some of these principles will be applicable for many different books.

You will need:

  • A Google Books document, preferably in text format. I downloaded this one from the Internet Archive. Otherwise, you can convert a Google Books PDF to text.
  • A text editor to create Python scripts.
  • The ability to execute Python. I personally like to use Komodo Edit because it includes the ability to execute a Python script right in your window. Programming Historian 2 has a method for setting that up. You can also, of course, test your scripts using the command line.

What Will You Do?

In this tutorial, we will create a Python script to:

  • Download the text file of a Google Book from Archive.org.
  • Strip out page numbers and titles on the tops of pages
  • Remove the introductory portions of the file & strip out the HTML.
  • Strip out “Digitized by Google”.
  • Save the file to your drive

Begin the script and get the document

To start, we’ll begin our Python script in KomodoEdit. Open KomodoEdit and save a new file as “tx-dip-corr.py” (or whatever name suits your own project).

First we want to give our script access to a couple of different Python libraries–one that deals with getting documents from the Internet, and one that gives us access to regular expressions (which will be explained later). To access those, we use the “import” command:

import urllib2
import re

Next, we want to open a file from the web. To do so, let’s create a variable, called “url,” and give it the web address of the item we want to open. So type:

url = ""

Then we want to get our URL, which will go into the quotation marks.

Many of the out-of-copyright works that Google has scanned have been uploaded to the Internet Archive and are available in multiple formats. In this case, we want to use the text version. For this tutorial, I am using a volume of the Diplomatic Correspondence of the Republic of Texas from the Internet Archive.

Right-click on the link that says “Full Text” and copy the URL. This is the file we will be using. Paste the URL between the quotation marks. So, you will now have:

url = "http://archive.org/stream/diplomaticcorre33statgoog/diplomaticcorre33statgoog_djvu.txt"

That only gave us the URL for opening the file, though. Next we want actually to open it. First we have it opened, then we have it read. Python makes us do this in two lines:

response = urllib2.urlopen(url)
txt = response.read()

If you want to use a file you have already downloaded, this lesson from Programming Historian 2 shows how to open a file already on your computer.
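
A note for readers on Python 3: urllib2 only exists in Python 2, and its job is done by urllib.request instead. The sketch below is an assumption about how you might adapt the two lines above, wrapped in a hypothetical helper called fetch_text. In Python 3 the response comes back as bytes, so it has to be decoded into text; I am guessing UTF-8 here and replacing any characters that do not fit.

```python
import urllib.request


def fetch_text(url, encoding="utf-8"):
    """Open a URL and return its contents as text (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
    # errors="replace" swaps undecodable bytes for a marker instead of crashing
    return raw.decode(encoding, errors="replace")
```

You would then call it with the address we copied earlier, for example txt = fetch_text(url).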

Strip out page numbers and titles on the tops of pages

The title of the book or section, plus the page number, was captured in the scan–not to mention digitization information:

Here is what it looks like in the text:

This could throw off our text mining results, so let’s get rid of it! We want to strip out the page numbers and titles on the tops of pages. Unfortunately, the scan of this book was not the best (to put it mildly), and so the text on the top of each page rendered differently. It should say, on one side, “[page number] American Historical Association” (because the book was published by the American Historical Association). On the other side, it should say, “Correspondence with [country]. [page number].” As you can see, it doesn’t do that; the only consistency we get is that the page number is at the beginning or ending of the line.

Luckily, we can use regular expressions to find these lines and get rid of them.

We begin with setting a variable–let’s call it “txt”–for holding the text. We set it empty–for now:

txt = ""

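Next we have the program go through the file line by line, looking for what we want it to find. To do that, we set up a loop with a variable that we’ll call “line”. (A response can only be read once: if you have already called response.read() on it, open the URL again with urllib2.urlopen(url) before starting the loop, or the loop will find nothing left to read.)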

for line in response:

Then we set up a search, using regular expressions, to find the one consistency that we identified:

for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)

We are creating a variable, “matchObj,” that holds the result of searching each line in turn with a regular expression (“re.search”). My classmate Laura explains more about regular expressions in her tutorial. Some explanation for what is happening here: We are going through the document and finding instances where either a line begins (indicated by ^) with a number (indicated by [0-9]) or ends (indicated by $) with a number (again, [0-9]). The “|” splitter gets the program to match either one or the other. The “.*” stands for any run of characters, so the rest of the line is included in the match, and “\s” stands for a space or other white space. The “+”, meanwhile, indicates that we want to find the previous item (e.g., a digit) one or more times. Here is a complete listing of regular expressions and what they do.
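
To see the two halves of the pattern at work, here is a small sketch that reports which half matched. The three sample lines are invented, modelled on the page headers described above, and classify is a hypothetical helper name. I have written the pattern as a raw string (the r before the quotes) so that Python leaves the backslashes alone.

```python
import re

# Same pattern as in the lesson, written as a raw string.
HEADER_PATTERN = re.compile(r"(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)")


def classify(line):
    """Say which half of the pattern, if any, matched the line."""
    match = HEADER_PATTERN.search(line)
    if match is None:
        return "keep"
    if match.group(1) is not None:
        return "starts with a number"
    return "ends with a number"


# Invented sample lines, each ending in a newline as lines from a file do.
samples = [
    "12 AMERICAN HISTORICAL ASSOCIATION.\n",
    "CORRESPONDENCE WITH THE UNITED STATES. 45\n",
    "The treaty was signed at Washington.\n",
]
labels = [classify(line) for line in samples]
print(labels)
```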

But, all that this has done is search through the text. We then need to tell it what to do. For that, we use an “if…else” statement. First the if:

for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)
    if matchObj is not None:
        pass

Here, we are telling the program that if the search results yield something, then that line is not to be saved; that is the equivalent to deleting it from the file.

Next, we want to make sure that every other line is coming through. So we tell it:

for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)
    if matchObj is not None:
        pass
    else:
        txt += line

This is saying that if the line does not come up in the search, it is added to the text to be saved to our computer. Now, those page headers are all gone!
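
The same filtering can be sketched as a small function that works on text you already hold in a variable, which is handy if you have read the whole file into a string. The name drop_page_headers and the sample text are my own inventions for illustration.

```python
import re

HEADER_PATTERN = re.compile(r"(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)")


def drop_page_headers(lines):
    """Join together only the lines that do not look like page headers."""
    kept = [line for line in lines if HEADER_PATTERN.search(line) is None]
    return "".join(kept)


# Invented sample: two header-like lines mixed into two lines of a letter.
raw = (
    "12 AMERICAN HISTORICAL ASSOCIATION.\n"
    "I have the honor to inform you\n"
    "CORRESPONDENCE WITH THE UNITED STATES. 45\n"
    "that the treaty was ratified.\n"
)

# splitlines(True) keeps the line endings, so the kept lines join back cleanly.
txt = drop_page_headers(raw.splitlines(True))
print(txt)
```

One thing to keep in mind: the pattern cannot tell a page header from an ordinary line of a letter that happens to end in a number, such as a year, so a few genuine lines will be dropped as well.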

Remove the introductory portions of the file & strip out the HTML

Diplomatic Correspondence of the Republic of Texas is a compilation of primary source documents. As such, for our purposes we are not interested in parts that are not the primary source documents; thus, we do not want the introductory material to the volume, not to mention Google’s information about it. This particular file also has information from Archive.org on the top.

Open the file in your web browser. Scroll to where the primary sources begin–in this case, it’s a line that says “CORRESPONDENCE HITHERTO UNPUBLISHED.” Note this, as it will be important.

Write the function

Now we are going to write a function to grab the parts of the file that we want and strip out any HTML in it, leaving us with text. We begin by naming our function:

def stripTags(pageContents):

First, in our function, we want to define where to start grabbing the file. Remember where we noted the text where we wanted to begin? Now we see it again:

def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]

This tells the program to go through that document and find the line we previously identified, and set that as the starting location. Then, it takes everything from that point forward as the text that we want. In other words, it doesn’t send over the beginning, introductory text.
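
One detail worth checking: when find cannot locate the text, it returns -1, and slicing from -1 quietly keeps only the last character of the document. Here is a hedged sketch of a safer version; text_from_marker is a hypothetical helper, and the sample text is made up.

```python
def text_from_marker(page_contents, marker):
    """Return everything from the marker onward, or complain if it is missing."""
    start = page_contents.find(marker)
    if start == -1:
        # Without this check, page_contents[-1:] would silently return one character.
        raise ValueError("marker not found: %r" % marker)
    return page_contents[start:]


sample = "Introductory material\nCORRESPONDENCE HITHERTO UNPUBLISHED.\nFirst letter"
body = text_from_marker(sample, "CORRESPONDENCE HITHERTO UNPUBLISHED.")
print(body)
```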

Next, we want to strip out the HTML. We do this with an “if…else” statement. All HTML in the file, of course, is found between “<>.” So we want to take anything between those symbols and remove it.

To do this, we will use the integers 0 and 1 to record whether we are outside or inside the “<>.” So, we define the variable “inside” and set up an empty string for the variable “cleantext” (which will hold our cleaned-up text). Here is the code:

def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]

    inside = 0
    cleantext = ''

After this, we want to search for the “<>” and remove them, plus anything inside. We do this with an if…else statement:

def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]

    inside = 0
    cleantext = ''
    for char in pageContents:
        if char == '<':
            inside = 1
        elif(inside == 1 and char == '>'):
            inside = 0
        elif inside == 1:
            continue
        else:
            cleantext += char  
    return cleantext

After all of that, we now have our cleaned text. Sort of.
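To see what the loop does, here is a minimal sketch that runs the same function on a short sample string. The sample string is made up for illustration and is not taken from the actual file.

```python
def stripTags(pageContents):
    # Drop everything before the marker line we noted earlier.
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]

    inside = 0          # 1 while we are between '<' and '>'
    cleantext = ''
    for char in pageContents:
        if char == '<':
            inside = 1
        elif (inside == 1 and char == '>'):
            inside = 0
        elif inside == 1:
            continue
        else:
            cleantext += char
    return cleantext

# Hypothetical sample, standing in for the downloaded page.
sample = ("<html><body>Intro we skip. "
          "<b>CORRESPONDENCE HITHERTO UNPUBLISHED.</b> "
          "<i>Dear</i> Sir,</body></html>")

result = stripTags(sample)
print(result)
```

One thing to watch for: if the marker text is not in the file, find returns -1 and the slice keeps only the last character, so it is worth checking the spelling of the marker against the file.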

Execute the function

We now need to execute our function on the text we’ve retrieved from the Internet.

Let’s call the variable for the executed function “ctext” for “cleaned text.” We have that variable execute that function on the text (defined as txt) downloaded:

ctext = stripTags(txt)

Next, we’ll clean some other results of the digitization out of the text.

Strip out “Digitized by Google”

As you scroll through the file, you’ll notice there are parts that repeat. In Google Books, each page image gets “Digitized by Google” placed on it. When whoever created the plain text file from Diplomatic Correspondence of the Republic of Texas did the OCR, that portion was caught.

Luckily, it’s rather easy to strip out the “Digitized by Google” lines using Python’s replace function. We set a variable to house the text without that portion–let’s call it “stripGoogle”:

stripGoogle =

Next, we get the source of the text–in this case, the variable we just created to execute the stripTags function. So we tell stripGoogle to take that variable:

stripGoogle = ctext

Unfortunately, those words exist on separate lines, so we have to replace them separately. We append this to the end of “ctext”–we are replacing the text found in “ctext.” We replace this with nothing, indicated by the quotation marks with nothing between them:

stripGoogle = ctext.replace("Digitized by","").replace("Google","")
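As a possible alternative (not the approach used in this tutorial's script), a single regular expression can match the two words together even when a line break sits between them. This is only a sketch: it assumes that nothing but whitespace separates “Digitized by” from “Google,” and the sample text is invented.

```python
import re

# Invented sample text; "Google Press" is here only to show what is left alone.
sample = ("end of page one\n"
          "Digitized by \n"
          "Google\n"
          "start of page two, printed by Google Press")

# \s+ matches spaces and line breaks, so the stamp is removed as one unit.
cleaned = re.sub(r"Digitized by\s+Google", "", sample)
print(cleaned)
```

Unlike the chained replace calls, this leaves other occurrences of the word “Google” in the text untouched.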

Save the file to your drive

Now that the file is cleaned, we’re prepared to commit it to our hard drive. From there, we can take the file into tools like Voyant, or use it for text mining.

First, we want to create the file, which we’ll call “tx-dip-corr.txt.” So we set a variable–we’ll call it “f”–to create a blank file of that title:

f = open('tx-dip-corr.txt','w')

Next, we want to write the final result of our manipulations–which we contained in a variable called “stripGoogle” (this will change as I figure out the regular expressions for the last part!)–into the file. We do this with the write function:

f = open('tx-dip-corr.txt','w')
f.write(stripGoogle)

Finally, we close that file at the end of our script:

f = open('tx-dip-corr.txt','w')
f.write(stripGoogle)
f.close()
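Another way to handle the file, sketched here as an option, is Python's with statement, which closes the file for you when the block ends. The text in stripGoogle below is a stand-in, and the sketch writes into a temporary folder (an assumption made so the example does not overwrite a real file).

```python
import os
import tempfile

# Stand-in for the cleaned text produced earlier in the script.
stripGoogle = "CORRESPONDENCE HITHERTO UNPUBLISHED.\nSample cleaned text.\n"

# Temporary folder used only for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), 'tx-dip-corr.txt')

# The file is closed automatically when the with block ends.
with open(out_path, 'w') as f:
    f.write(stripGoogle)

# Read it back to confirm what was saved.
with open(out_path) as f:
    saved = f.read()
```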

Wrapping it up

Unfortunately, this tutorial doesn’t deliver you a perfectly clean copy of the Diplomatic Correspondence of the Republic of Texas. A lot of the OCR is pretty bad, so some manual cleanup will still be needed. Some of the OCR errors are consistent, so a simple find/replace in a text editor might help–or even a simple Python script.

In the end, the cleaned version still shows “United” and “States” as two of the most common words. So, one could do even more with this. Nonetheless, we have removed some of the issues.

This tutorial has shown how to strip out some of the more common issues with the text of a Google Book. Once you execute the script, you will have the file saved on your computer. Enjoy the text mining! For your reference, here is the entire script, including comments about what does what:

#tx-dip-corr.py
#import libraries
import urllib2
import re

#open file from the web
url = "http://archive.org/stream/diplomaticcorre33statgoog/diplomaticcorre33statgoog_djvu.txt"
response = urllib2.urlopen(url)  # open('C:\\temp\\diplomaticcorre33statgoog_djvu.txt', 'r') 

# build up txt without page numbers
txt = ""
for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)
    if matchObj is not None:
        pass
    else:
        txt += line

#function to strip the introductory portion and HTML tags
def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]

    inside = 0
    cleantext = ''
    for char in pageContents:
        if char == '<':
            inside = 1
        elif(inside == 1 and char == '>'):
            inside = 0
        elif inside == 1:
            continue
        else:
            cleantext += char  
    return cleantext

#now execute the function
ctext = stripTags(txt)

#strip out "Digitized by Google" through the whole file
stripGoogle = ctext.replace("Digitized by","").replace("Google","")

#create the file and write results to it
f = open('tx-dip-corr.txt','w')
f.write(stripGoogle)
f.close()

I am eternally grateful to my friend Kelvin Pan for helping me resolve some issues that arose as I tried to figure this out, and to my professor, Fred Gibbs, for his helpful comments.

The process of processing data

My project is to create an online archive of documents found in researching for my dissertation, with the ability to crowd source the transcription and translating of said documents. Fortunately the tools to do this already exist, so I just needed to put them together and upload my data.

In my preliminary look through the documents, I named the image files in a certain format that would help with data manipulation and organization (see the post about that on my dissertation site). I also kept a spreadsheet with the file name and other information and notes about each document.

This lesson will follow the process of taking those files and that spreadsheet and getting them ready for a mass import into Omeka. The benefit of this lesson is not so much getting data into Omeka as learning how to manipulate data using spreadsheets and command line tools.

Lesson 1: Start with the end in mind.

In creating the original spreadsheet, I made columns based on information I found on the documents. This ended up causing a little bit of a problem, in that the columns needed for Omeka were totally different. I knew from the start I was going to use Omeka, so I should have started my spreadsheet with the columns required by Omeka in the first place.

Here are the columns I started with:

Date
DocumentNumber
PageNumber
To
From
FileName
Betretung
Notes
Transcription
Translation

And here are the columns I needed, and ended up with:

FileName
Title
Subject
Description
Creator
Source
Publisher
Date
Contributor
Rights
Relation
Format
Language
Type
Identifier
Coverage
Transcription
Text
Original Format

As you can see, some of the fields correspond nicely; others I had to combine. Here is the “crosswalk” I used:

Date           -> Date
PageNumber     -> Text (or didn't use at all)
To             -> Title (combined with From)
From           -> Title (combined with To)
FileName       -> FileName
Betretung      -> Subject
Notes          -> Description
Transcription  -> Transcription
Translation    -> Text

I filled in or left blank the Omeka fields that had no match in my original spreadsheet.
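The crosswalk can also be written down as a small lookup table. The following is a rough sketch in Python rather than the spreadsheet work I actually did. The sample row is hypothetical, the way To and From are joined into a Title is just a placeholder, and PageNumber is left out.

```python
# Original column name -> Omeka column name, following the crosswalk above.
CROSSWALK = {
    "Date": "Date",
    "FileName": "FileName",
    "Betretung": "Subject",
    "Notes": "Description",
    "Transcription": "Transcription",
    "Translation": "Text",
}

def to_omeka(row):
    """Rename the columns of one spreadsheet row (a dict) for Omeka."""
    omeka = {new: row.get(old, "") for old, new in CROSSWALK.items()}
    # To and From are combined into one Title; this format is a placeholder.
    omeka["Title"] = "{}_{}".format(row.get("To", ""), row.get("From", ""))
    return omeka

# Hypothetical row, for illustration only.
sample_row = {
    "Date": "1943.06.20",
    "To": "To",
    "From": "Heicke",
    "FileName": "1943.06.20--1+To_Heicke.jpg",
    "Betretung": "sample subject",
    "Notes": "sample note",
}

omeka_row = to_omeka(sample_row)
```

Columns missing from a row, such as Transcription here, simply come out as empty strings.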

Lesson 2: Fixin’ the data

Through some fancy data manipulation with regular expressions, I was able to get info for the following fields:

 Title
 Creator
 Source
 Contributor
 Rights
 Language
 Identifier
 Original Format

Let’s start with the Title field. I wanted to make the title reflect who the correspondence was written by and to whom. That information is in the file name, so a little text manipulation can get us that info.

I copied the FileName column into a text file, getting me a list of all of the file names, one on each line, like so:

194x.04.10--1+To_From.jpg
194x.xx.xx--1+Berger_Kammler.jpg
194x.xx.xx--1+Brandt_Kammler-English.jpg
194x.xx.xx--1+Frosch_From.jpg
194x.xx.xx--1+Himmler_Saur-English.jpg
194x.xx.xx--1+To_From-HandDrawnMapEnglish.jpg
194x.xx.xx--1+To_From-HandwrittenNote.jpg
194x.xx.xx--2+To_From-HandwrittenNote.jpg
1939.11.06-2657-1+To_From-HandDrawnGraph.jpg
1943.06.20--1+To_Heicke.jpg
1943.07.01--1+To_From.jpg

I only want the To and From information. When manipulating text, what you are really doing is looking for patterns of inclusion and exclusion. You also need to keep in mind the varying ways that your patterns can be interpreted. A good beginner's tutorial on regular expressions (the programming way to make patterns) is at http://regex.learncodethehardway.org/book/.

In creating the file names, I made sure to include symbols that would act as separators to the different parts of information that I would want to get. In this case, the To and From are separated by an underscore (_). This should be the only place an underscore exists in the file name.

To only get the To and From I can use a regular expression to select the correct text, and discard the unwanted text.

If I color code the text, it makes it a bit easier to see what sections exist in the file name:

194x.04.10--1+To_From.jpg
194x.xx.xx--1+Berger_Kammler.jpg
194x.xx.xx--1+Brandt_Kammler-English.jpg
194x.xx.xx--1+Frosch_From.jpg
194x.xx.xx--1+Himmler_Saur-English.jpg
194x.xx.xx--1+To_From-HandDrawnMapEnglish.jpg
194x.xx.xx--1+To_From-HandwrittenNote.jpg
194x.xx.xx--2+To_From-HandwrittenNote.jpg
1939.11.06-2657-1+To_From-HandDrawnGraph.jpg
1943.06.20--1+To_Heicke.jpg
1943.07.01--1+To_From.jpg
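The sections of a file name can also be pulled apart with named groups in Python. This is only a sketch. The group names (date, doc, page, to, sender, note) are my guesses at what each section holds, based on the spreadsheet columns listed earlier, and the pattern assumes every name follows the layout shown above.

```python
import re

# One named group per section; the separators are -, -, +, _ and an optional -.
pattern = re.compile(
    r'^(?P<date>[^-]+)-(?P<doc>[^-]*)-(?P<page>\d+)'
    r'\+(?P<to>[^_]+)_(?P<sender>[^-.]+)(?:-(?P<note>[^.]+))?\.jpg$'
)

name = "1939.11.06-2657-1+To_From-HandDrawnGraph.jpg"
parts = pattern.match(name).groupdict()

# A name with an empty document number and no trailing note.
short = pattern.match("1943.06.20--1+To_Heicke.jpg").groupdict()

print(parts)
print(short)
```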

I’ll be using vim, but the regular expressions can be used in almost any other programming language like python, perl, and php, or programs like sed, awk and grep. If using the command line is intimidating, a good crash course is found at
http://cli.learncodethehardway.org/book/
A really great and interactive tutorial for learning the basics of vim is found at http://www.openvim.com/tutorial.html.

To create a text file in vim, I would type this on the command line:

$ vim title.csv

You notice I gave the file name the .csv extension. This is a little trick that will help us easily copy and paste the manipulated text back into the master spreadsheet.

You’re presented with a blank screen in vim. To enter the lines you copied from the spreadsheet, first press the ‘i’ key to enter insert mode. Then paste the text as normal, with the CTRL-V keys. Now press the ESC key to get back to command mode.

The easiest thing to do first is to get rid of the .jpg at the end of each line. Just type:

:%s/.jpg//

and hit enter. What does this do? Again, colors help us understand what’s going on.

:%s/.jpg//

: = what follows is a command
% = apply the command to all lines in the file
s = use the search and replace command
/ = forward slashes are delimiters that separate the search and replace 
     fields. You could actually use any non-alphanumeric character.
.jpg = the search field, in between forward slashes. This can be text 
        and/or regular expressions
 = There's nothing in between the final two forward slashes, because we 
     want to replace the ".jpg" with nothing.

If we had wanted to change all of the .jpg’s to .png’s, the command would have looked like this:

:%s/.jpg/.png/

In general the format is:

:%s/search terms/replace terms/

Now, we could have removed the beginning and ending parts of the line all in one go, leaving us with just the To_From portion. Here is where a regular expression comes into play.

We want to get rid of everything from the beginning of the line until the +, and then the .jpg at the end. We would do that by using a regular expression which utilizes the grouping ability. Grouping allows us to mark a part of the regular expression and use it in the replace field. You can have up to 9 groups in the search field, and they are referenced in the replace field with a backslash and the associated number (like \1 to \9). It is kind of like creating a variable: the value is set in the search field and the variable is called in the replace field.

The regular expression we need looks like this:

:%s/^\(.*+\)\(.*_.*\)\(.jpg\)/\2/

Looks pretty scary, eh? Let’s throw some color and explanation in there.

:%s/^\(.*+\)\(.*_.*\)\(.jpg\)/\2/

: = command
% = apply to every line in the file
s = search and replace
/ = beginning of search field
^ = start at the beginning of the line
\( = beginning of a group (group 1)
.* = . means any character, * means 0 or more of the preceding 
        character (so match 0 or more of any character)
+ = in this case, a literal plus sign
\) = end of the group
\( = beginning of a new group (group 2)
.* = match 0 or more of any character
_ = a literal underscore
.* =  match 0 or more of any character
\) = end of the group
\( = beginning of a new group (group 3)
.jpg = any single character followed by jpg (the . is not escaped, so it 
        matches any character, including the literal period here)
\) = end of the group
/ = end of search field, beginning of replace field
\2 = replace with the contents of group 2
/ = end of replace field, end of search and replace command
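For comparison, here is a sketch of the same grouping idea in Python's re module. The syntax differs a little from vim: in Python the grouping parentheses are written without backslashes, and the literal plus sign and period need a backslash instead. The three file names are taken from the list above.

```python
import re

file_names = [
    "194x.04.10--1+To_From.jpg",
    "194x.xx.xx--1+Brandt_Kammler-English.jpg",
    "1939.11.06-2657-1+To_From-HandDrawnGraph.jpg",
]

# Group 1: everything up to the +, group 2: the To_From part, group 3: .jpg
pattern = re.compile(r'^(.*\+)(.*_.*)(\.jpg)')

# \2 in the replacement keeps only the contents of group 2.
names = [pattern.sub(r'\2', name) for name in file_names]
print(names)
```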

Now we have a list of just the To’s and From’s.

To_From
Berger_Kammler
Brandt_Kammler-English
Frosch_From
Himmler_Saur-English
To_From-HandDrawnMapEnglish
To_From-HandwrittenNote
To_From-HandwrittenNote
To_From-HandDrawnGraph
To_Heicke
To_From

We can get rid of the underscore and improve the title a bit.

:%s/^\(.*\)_\(.*\)/Letter from \1 to \2/

Results in:

Letter from To to From
Letter from Berger to Kammler
Letter from Brandt to Kammler-English
Letter from Frosch to From
Letter from Himmler to Saur-English
Letter from To to From-HandDrawnMapEnglish
Letter from To to From-HandwrittenNote
Letter from To to From-HandwrittenNote
Letter from To to From-HandDrawnGraph
Letter from To to Heicke
Letter from To to From

For homework, can you figure out how to do all of the regular expressions up to this point in one go? see answer

For some of the documents, it was unknown who the author or recipient was, so we can change the generic To and From to be Unknown.

:%s/To\|From/Unknown/g

This introduces two new parts of the search and replace command, highlighted above.

\| = allows us to search for multiple patterns at the same time. 
        Both of them will be replaced with the same thing.
g = at the end of the command, the g means to apply the search and 
        replace to all matches on the line. Without it, only the first 
        match is replaced.
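The same two ideas look like this in Python, as a rough sketch. Python writes the alternation as a plain | and replaces every match by default, so the count argument is what imitates leaving off the g.

```python
import re

title = "Letter from To to From"

# Like the g flag: every match on the line is replaced.
all_matches = re.sub(r'To|From', 'Unknown', title)

# Like leaving off the g: only the first match is replaced.
first_only = re.sub(r'To|From', 'Unknown', title, count=1)

print(all_matches)
print(first_only)
```

The lowercase “from” and “to” are left alone because the search is case sensitive.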

Now we end up with data like this:

Letter from Unknown to Unknown
Letter from Berger to Kammler
Letter from Brandt to Kammler-English
Letter from Frosch to Unknown
Letter from Himmler to Saur-English
Letter from Unknown to Unknown-HandDrawnMapEnglish
Letter from Unknown to Unknown-HandwrittenNote
Letter from Unknown to Unknown-HandwrittenNote
Letter from Unknown to Unknown-HandDrawnGraph
Letter from Unknown to Heicke
Letter from Unknown to Unknown

I can’t do much right now about all of the entries that will have the same title. Some day I, or someone, will have to go through and give them better, more descriptive titles.

Another homework assignment: what regex can you use to get rid of the extra text after the dash in some titles? answer 2

Now with a csv file with corrected titles, I can simply open the csv file in my favorite spread sheet software, copy the column of titles, and paste them into my master spread sheet, and they all match up nicely.

Making the link

One other field I had to edit was the FileName field. Omeka can take a URL to an image in the csv file and import that image for the object record. I had to upload all of the images to the server, then add the URL to the beginning of the existing file name to create the new FileName field. Basically, for each row, turning this: 1944.06.07-1936-1+Pohl_Brandt.jpg into this: http://nazitunnels.org/ushmm-images/1944.06.07-1936-1+Pohl_Brandt.jpg

Homework number 3: how would you do that with a regex in a csv file? answer 3

Dealing with dates

One final field to deal with is the date field. The dates are in the file name, so I just need a way to strip everything but the date away.

This can be accomplished with the following regexp:

:%s/^\(\w\+\.\w\+.\w\+\).\+/\1/

And the explanation of the above command:

: = use a command
% = apply to each line in the file
s = use the search and replace command
/ = beginning of the search field
^ = start searching at the beginning of the line
\( = beginning of group 1
\w = match any word character, same as [a-zA-Z0-9_]
\+ = match one or more of the previous characters
\. = match a literal period
\w\+. = match one or more word characters and then any single character 
        (this . is not escaped; here it lines up with the second period)
\w\+ = match one or more word characters
\) = end of group 1
.\+ = one or more of any character
/ = close search field and begin of replace field
\1 = replace with contents of group 1
/ = end of replace field

Now that I just have dates, I can clean them up a little bit by removing the xx placeholders and changing the format from YYYY.MM.DD to something like MM/DD/YYYY like we’re used to seeing.

:%s/^\(\w\+\).\(\w\+\).\(\w\+\)/\2\/\3\/\1/

We have covered all of the commands and symbols before, so I’ll leave it up to you to interpret the above command. Hint, writing it with a different field delimiter helps.

The neat thing about regular expressions is that you can use them with so many programs. For example, we could do the same change above using a command on the command line. This is useful if there are multiple files that need the change.
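As one illustration of using the same patterns outside of vim, here is a short Python sketch that pulls the date out of each file name and then rearranges YYYY.MM.DD into MM/DD/YYYY. The file names come from the list earlier in this lesson. In Python the groups and the + are written without backslashes, and the periods are escaped so they only match literal periods.

```python
import re

file_names = [
    "1939.11.06-2657-1+To_From-HandDrawnGraph.jpg",
    "1943.06.20--1+To_Heicke.jpg",
    "194x.04.10--1+To_From.jpg",
]

us_dates = []
for name in file_names:
    # Keep only the date at the start of the name.
    date = re.sub(r'^(\w+\.\w+\.\w+).+', r'\1', name)
    # Reorder year.month.day into month/day/year.
    us_dates.append(re.sub(r'^(\w+)\.(\w+)\.(\w+)', r'\2/\3/\1', date))

print(us_dates)
```

Placeholders such as the x in 194x are carried over unchanged, so they still need to be cleaned up separately.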

Answers

answer 1 :%s/^\(.*+\)\(.*\)_\(.*\)\(.jpg\)/Letter from \2 to \3/

answer 2 :%s/\(.*\)-.*/\1/

answer 3 :%s#^#http://nazitunnels.org/ushmm-images/# Notice how this regexp uses # instead of / for the field delimiters. Otherwise the forward slashes would need to be escaped with a backslash, which makes things ugly :%s/^/http:\/\/nazitunnels.org\/ushmm-images\//

Topic Modeling: A Basic Introduction

The purpose of this post is to help explain some of the basic concepts of topic modeling, introduce some topic modeling tools, and point out some other posts on topic modeling. The intended audience is historians, but it will hopefully prove useful to the general reader.

What is Topic Modeling?

Topic modeling is a form of text mining, a way of identifying patterns in a corpus. You take your corpus and run it through a tool which groups words across the corpus into ‘topics’. Miriam Posner has described topic modeling as “a method for finding and tracing clusters of words (called “topics” in shorthand) in large bodies of texts.”

What, then, is a topic? One definition offered on Twitter during a conference on topic modeling described a topic as “a recurring pattern of co-occurring words.” A topic modeling tool looks through a corpus for these clusters of words and groups them together by a process of similarity (more on that later). In a good topic model, the words in a topic make sense, for example “navy, ship, captain” and “tobacco, farm, crops.”

How does it work?

One way to think about how the process of topic modeling works is to imagine working through an article with a set of highlighters. As you read through the article, you use a different color for the key words of themes within the paper as you come across them. When you were done, you could copy out the words as grouped by the color you assigned them. That list of words is a topic, and each color represents a different topic. (Note: this description is inspired by the following illustration from Blei, 2012, which is one of the best visual representations of a topic I’ve found.)

D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

How the actual topic modeling programs work is determined by mathematics. Many topic modeling articles include equations to explain the mathematics, but I personally cannot parse them. The best non-equation explanation of how at least one topic modeling program assigns words to topics was given by David Mimno at a conference on topic modeling held in November 2012 by the Maryland Institute for Technology in the Humanities and the National Endowment for the Humanities. As he explains (starting at around 9:00), the computer compares the occurrence of topics within a document to how a word has been assigned in other documents to find the best match (you can find Mimno’s slides on his website).

The model Mimno is explaining is latent Dirichlet allocation, or LDA, which seems to be the most widely used model in the humanities. LDA has strengths and weaknesses, and it may not be right for all projects. It does form the basis of MALLET, which is an open source and fairly accessible tool for topic modeling.

For more detailed explanations of how topic modeling works, and how it can be applied, take a look at the other speaker videos from the MITH/NEH conference. Ted Underwood has offered his explanation of how the process works in a post titled Topic Modeling Made Just Simple Enough.

Scott B. Weingart has written an excellent overview of current scholarship on topic modeling with links to everything from a fable-like explanation of topic modeling to articles which delve into the technical side. Many of the more complex articles and posts include complex-looking equations, but it is possible to understand the basics of topic modeling without knowing how to unravel the equations.

What do you need to topic model?

1. A corpus, preferably a large one
If you wanted to topic model one fairly short document, you might be better off with a set of highlighters or a good pdf annotation tool. Topic modeling is built for large collections of texts. The people behind Paper Machines, a tool which allows you to topic model your Zotero library, recommend that you have at least 1,000 items in the library or collection you want to model. The question of “how big” or “how small” is ultimately subjective, but I think you want at least several hundred documents in your corpus, and preferably 1,000 or more. Bear in mind that you define what a document is for the tool. If you have a particularly long work you can divide it into pieces and call each piece a document.

With some tools, you will have to prepare the corpus before you can topic model. Essentially what you have to do is tokenize the text, changing it from human-readable sentences to a string of words by stripping out the punctuation and removing capitalization. You can also tell it to ignore “stopwords” which you define, which usually include things like a, the, and, etc. What you (hopefully) end up with is a document with no capitalization, punctuation, or numbers to throw off the algorithms.
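As a rough illustration of that preparation step, here is a small Python sketch. It is not what any particular topic modeling tool does internally. The stopword list and the sample sentence are made up, and keeping only runs of the letters a to z is a simplifying assumption that would mangle accented words.

```python
import re

# A tiny, made-up stopword list; real lists are much longer.
STOPWORDS = {"a", "the", "and", "of", "to"}

def tokenize(text, stopwords=STOPWORDS):
    """Lowercase the text, keep only runs of letters, drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in stopwords]

tokens = tokenize("The Captain of the ship, 2 days out, sailed to the Navy yard.")
print(tokens)
```

The punctuation, the capital letters, and the number are gone, and what is left is a plain list of words.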

There are a number of ways to clean up your text for topic modeling (and text mining), some of which are covered in other posts on this site. You can use Python and Regular Expressions, the command line (Terminal), and R.

If you want to give topic modeling a try, but do not have a corpus of your own, there are sources for large data. You could, for example, download the complete works of Charles Dickens as a series of text files from Project Gutenberg, which makes a large number of public domain works available as txt files. JSTOR Data for Research, which requires registration, allows you to download the results of a search as a csv file, which is accessible for MALLET and other topic modeling and text mining processes.

2. Familiarity with the corpus
This may seem counterintuitive if you’re planning to use topic modeling to help you find out more about a large corpus, and yet it is very important that you at least have an idea of what should be there. Topic modeling is not an exact science by any means. The only way to know if your results are useful or wildly off the mark is to have a general idea of what you should be seeing. Most people would probably spot the outlier in a topic of “tobacco, farm, crops, navy” but more complex topics might be less obvious.

3. A tool to do the topic modeling
However you’re going to topic model, you need to decide what you are going to use and have a way to use it.

Many humanists use MALLET and by extension LDA. MALLET is particularly useful for those who are comfortable working in the command line, and it takes care of tokenizing and stopwords for you. The Programming Historian has a tutorial which walks you through the basics of working with MALLET.

The Stanford Natural Language Processing Group has created a visual interface for working with MALLET, the Stanford Topic Modeling Toolbox. If you choose to work with TMT, read Miriam Posner’s blog post on very basic strategies for interpreting results from the Topic Modeling Tool.

If you have a WordPress install and are comfortable with Python, check out Peter Organisciak’s post on processing WordPress exports for MALLET.

It is important to be aware that you need to train these tools. Topic modeling tools only return as many topics as you tell them to; it matters whether you specify 50, 5, or 500. If you imagine topic modeling as a switchboard, there are a large number of knobs and dials which can be adjusted. These have to be tuned, mostly through trial and error, before the results are useful.

If you use Zotero, you can use Paper Machines to topic model particularly large collections. Paper Machines is an open-source project, the result of a collaboration between Jo Guldi and Chris Johnson-Roberson, supported by Google Summer of Code, the William F. Milton Fund, and metaLAB @ Harvard. You can do nifty visualizations with Paper Machines, but for topic modeling you need at least 1000 documents. Luckily, you can supplement your Zotero library with data from JSTOR Data for Research.

4. A way to understand your results
Topic modeling output is not entirely human readable. One way to understand what the program is telling you is through a visualization, but be sure that you know how to understand what the visualization is telling you. Topic modeling tools are fallible, and if the algorithm isn’t right, they can return some bizarre results.

Paper Machines output. Pretty, but what does it mean?

Ben Schmidt, who is using k-means clustering to classify whaling voyages, plugged his data into LDA to demonstrate the ways in which modeling can return results which ultimately make no sense. His post explains the dangers of chimerical models, where two clusters get stuck together (think “cat, fish, mouse” and “gun, rod, hunt”).

Topic Modeling and History

Topic modeling is not necessarily useful as evidence but it makes an excellent tool for discovery.

Cameron Blevins has a series of posts on his work text mining and topic modeling the diary of Martha Ballard. He has compared his results to Laurel Thatcher Ulrich’s work, which was done by hand, and the two result sets generally align. His work is particularly useful for understanding the potential and limitations of topic modeling, as so many historians are already familiar with the source material, having read Ulrich’s book A Midwife’s Tale. Both Blevins and Ulrich had to be familiar with the content of the diary and its historical context in order to make sense of their findings. The results of the topic modeling help to uncover evidence already in the text.

Newspapers have proved to be a popular subject for topic modeling, as it provides a way to get at change over time from a daily source. David J. Newman, a computer scientist, and Sharon Block, a historian, worked together to topic model the Pennsylvania Gazette. Table 4 in their article (pdf) lists off the most likely words in a topic and the label they assigned to that topic; some of the topics are obvious but others make it clear that you have to understand the context of a corpus in order to read the results. Another example of topic modeling a historic newspaper is a project from the University of Richmond (VA), Mining the Dispatch. The objective of the project was to explore social and political life in Richmond during the Civil War. The site allows you to interact with the topic models with some interpretation. Exploring this site might help you understand how modifying settings in a topic modeling tool changes the output.

Topic modeling is complicated and potentially messy but useful and even fun. The best way to understand how it works is to try it. Don’t be afraid to fail or to get bad results, because those will help you find the settings which give you good results. Plug in some data and see what happens.

Mapping Tools and the Google Maps JavaScript API

This tutorial will help you build custom Google Maps on your own webpage using the Google Maps JavaScript API v3 and a few mapping tools. The following example shows the Washington, D.C., residences for Democrats and African-American Republicans in the 43rd Congress (1873-1875). I’ve layered a map of Washington from 1873 underneath.

Google Map Example.

(Please note: This is the “post version” for the class website. While I am in class, I will maintain a version on my own website that includes navigation.)

Getting Started

What do you need?

A CSV file with address information as part of your data. Be sure to organize your data in the way you would want it shown in your map. For example, I concatenated the names and states to serve as the title of each point on my map. You can download my csv file, but here are the first three lines so you can see how it’s structured:

Name,Party,Street,City,State,Hotel
Edward CROSSLAND of KY,Democrat,1013 E St. NW,Washington,DC,
George Madison ADAMS of KY,Democrat,1013 E St. NW,Washington,DC,
Alonzo Jacob RANSIER of SC,Black Republican,1017 12th St. NW,Washington,DC,
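If you want to check how a file like this is structured before uploading it anywhere, a few lines of Python will do. This is only a sketch using the three sample rows above. Joining Street, City, and State into one address string is my assumption about what a geocoder would want.

```python
import csv
import io

# The header and first three rows shown above.
sample_csv = """Name,Party,Street,City,State,Hotel
Edward CROSSLAND of KY,Democrat,1013 E St. NW,Washington,DC,
George Madison ADAMS of KY,Democrat,1013 E St. NW,Washington,DC,
Alonzo Jacob RANSIER of SC,Black Republican,1017 12th St. NW,Washington,DC,
"""

# DictReader uses the first line as the column names.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# One full address per row, ready to hand to a geocoding service.
addresses = ["{}, {}, {}".format(r["Street"], r["City"], r["State"]) for r in rows]

for row, address in zip(rows, addresses):
    print(row["Name"], "|", row["Party"], "|", address)
```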

What would be nice to have?

A high resolution PNG, JPEG, or TIFF of a historic map or other map overlay.

Some places to find Historic Maps (there are many, many more)

What will you do?

  • Generate a KML file with latitude and longitude points for your addresses.
  • Warp a historic map image to accurately (well…as accurately as possible) fit over a modern map
  • Create three files: 1) A JavaScript file that uses the Google Maps JavaScript API v3; 2) An HTML file containing an empty div tag, which serves as the canvas on which the API generates your map; 3) A CSS file that structures the size and placement of the map.

Creating KML Files & Embeddable Maps

Definitions

  • KML (KMZ) - stands for “Keyhole Markup Language,” which is an XML file for maps. Sometimes you will see KMZ files, which are zipped KML files. See Google Developer documentation for more details.
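To give a sense of what is inside a KML file, here is a minimal Python sketch that writes one placemark using the standard library. The coordinates are placeholders for a spot in central Washington and are not the geocoded location of any address in my data. Note that KML lists longitude before latitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def make_kml(points):
    """Build a small KML document from (name, longitude, latitude) tuples."""
    kml = ET.Element("kml", xmlns=KML_NS)
    document = ET.SubElement(kml, "Document")
    for name, lon, lat in points:
        placemark = ET.SubElement(document, "Placemark")
        ET.SubElement(placemark, "name").text = name
        point = ET.SubElement(placemark, "Point")
        # KML coordinates are written as longitude,latitude.
        ET.SubElement(point, "coordinates").text = "{},{}".format(lon, lat)
    return ET.tostring(kml, encoding="unicode")

# Placeholder coordinates, for illustration only.
kml_text = make_kml([("Edward CROSSLAND of KY", -77.0365, 38.8977)])
print(kml_text)
```

The services described below produce files with the same basic structure, usually with styling and descriptions added.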

Resources & Instructions

There are two resources for making embeddable maps that might meet your needs. Should you want to customize further, these services also allow you to export a KML file to use elsewhere.

  • BatchGeo – This service geocodes address information in a CSV file in a single step. You can also export to a KML file for use in your own map.
  • Geocommons – This service allows much more aesthetic customization, allows you to download a cleaner KML file, and files maps into a searchable mapping library.

Using Batchgeo

  • Copy and paste the entire document into batchgeo.com. Be sure to validate and set options for your map before you geocode. Validating assures that batchgeo can read your file. Options allow you to customize how your markers will look. (Click “Show Advanced Options” for more.) Some handy options include:
  • Group by/ Thematic Value – Allows you to color code your markers based on one column in your table. (I group by Party.)
  • Title – Allows you to customize the title of your marker. (I chose the concatenated name.)
  • Enable clustering for high density markers – if you do not select this option and have multiple markers in one place, it will only show one marker. (In my case, since multiple people lived in one hotel, I select this option.)
  • Marker Description – Use the field by which you are grouping your data so that the field names show up in the bubble.
  • Click “Make Map” and watch the magic happen!
  • When the map appears, you can check the various markers. Batchgeo allows you to move markers manually (which comes in handy if you have historic addresses that no longer exist!). When you are happy with how things look, press “Save and Continue” and follow instructions. You will get an e-mail with instructions for embedding or editing your maps.
  • It might look like your map takes up the entire screen, but your data is listed below it. Scroll to the very bottom of your new map to download the KML file. Here is my KML file and here is my embedded map:

View Clio3 BatchGeo Example in a full screen map

Using Geocommons

In addition to serving as a tool to create your own maps, Geocommons offers tools for data analysis. Additionally, public maps and data (you can choose to make data public or private) are housed in a searchable library. Using Geocommons is pretty straightforward. Their tour videos are thorough enough that I do not feel the need to walk through them step by step in this tutorial.

Here is the map I made using the same CSV file as I used in BatchGeo:

Batchgeo vs. Geocommons – Which one do I use?

It depends on what you want. Here are some scenarios and my recommendation on what service to use:

  • I want to compare different data categories in a single field: BATCHGEO - using a single csv file, you can see differences by selecting that field under “Group by/ Thematic Value” while setting options. In Geocommons, you would have to create a separate csv file for each field type.
  • I want to make SURE my points are the most accurate for a close analysis: BATCHGEO- It’s been my experience that BatchGeo’s geocoding is more accurate. Moreover, you can manually move any errant points.
  • My data is really clustered together: BATCHGEO- BatchGeo deals better with densely clustered points.
  • I want some bells and whistles (filters, temporal analysis, etc.): GEOCOMMONS- filters, temporal animations, colors, point types, oh my! There are many robust features for Geocommons maps.
  • I want to do more than map my data: GEOCOMMONS- This service offers data analysis tools that go beyond mapping.

Other Resources

Geocoding

  • You can geocode from a Google spreadsheet here. It works pretty well if you have your addresses. You will have to make your own KML file (see below.)
  • I had mixed results with this program, but you can download it and work offline.
  • Several services will accurately geocode for a small fee, including Smarty Streets.

Creating KML Files

  • Earth Point: Excel to KML is a very good tool for batch converting geocoded CSV to KML files if you already have coordinates (using one of the services above). This service provides more options for customization than BatchGeo or Geocommons. The standard version limits the number of lines you can convert; however, contact the author for a free, 1-year, renewable educational/humanitarian license which allows for unlimited lines. You will have to provide your contact information and a description of your project.

These maps just might meet your needs. But, if you want to customize your map, it’s better to create your own map. You can use the KML files you create using some of these services. Or you can use Fred Gibbs’ tutorial on geocoding with Python to create your own KML file.
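If you already have coordinates and only need a simple KML file, a short script can write one. The following is a minimal Python 3 sketch, not the method used by any of the services above: the function name build_kml and the sample place are hypothetical, and it only produces bare point markers with names.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(places):
    """Return a KML document string for (name, latitude, longitude) tuples."""
    # Register the KML namespace as the default so tags are written without a prefix
    ET.register_namespace("", KML_NS)
    kml = ET.Element("{%s}kml" % KML_NS)
    doc = ET.SubElement(kml, "{%s}Document" % KML_NS)
    for name, lat, lon in places:
        placemark = ET.SubElement(doc, "{%s}Placemark" % KML_NS)
        ET.SubElement(placemark, "{%s}name" % KML_NS).text = name
        point = ET.SubElement(placemark, "{%s}Point" % KML_NS)
        # KML lists longitude first, then latitude
        coords = ET.SubElement(point, "{%s}coordinates" % KML_NS)
        coords.text = "%s,%s" % (lon, lat)
    return ET.tostring(kml, encoding="unicode")


# Hypothetical sample row: the U.S. Capitol
sample_places = [("U.S. Capitol", 38.889864, -77.009017)]
kml_text = build_kml(sample_places)
print(kml_text)
```

Note the coordinate order: KML expects longitude before latitude, which is the reverse of what the Google Maps API expects.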

Using the Google Maps JavaScript API v3

Google released its Maps JavaScript API v3 in 2010. I was able to build a custom map using the API and the KML file I created above. Later, I will add an overlay of a historic map.

Resources

Create an HTML Page and Add CSS

The page should include the following:

  • include the API for the map you are using.
    • In this case I am using the Google API. The “sensor=false” at the end of the URL means the page is not using a sensor to detect the user’s location.
  • include your JavaScript file in which you will code the maps (mymap-google.js).
  • link to a CSS file in your head (or include the actual CSS). I found that it does not work without some bounding CSS on the map div.
  • Since we are using a separate JavaScript file, add “onload=’initialize()’” to your body tag. This is the function we want to call when we load the page to create the maps.
  • Include an empty div (with an ID) for your map.

Here’s my HTML code:


<!DOCTYPE html>
<html>
  <head>
    <title>Google Map API</title>

    <!--This does not work without the stylesheet-->
    <link rel="stylesheet" type="text/css" href="style/map-style.css" />

  	<!--Including the Google Maps JavaScript API - sensor=false because not detecting the user's location-->
    <script src="https://maps.googleapis.com/maps/api/js?sensor=false"></script>
    <script src="js/mymap-google.js"></script>

  </head>

  <!--Calling our JavaScript function when the page loads-->
  <body onload="initialize()">

    <h1> Google Map API</h1>

    <p>See <a href="https://developers.google.com/maps/documentation/javascript/"> Google Maps JavaScript API v3</a> for documentation.</p>

		<!--Empty <div> tag for the map. Give it an ID to reference in the JavaScript file-->
    <div id="map_canvas"></div>

  </body>
</html>

And here’s my CSS (map-style.css):

html, body, #map_canvas {
	margin: auto;
	padding: 0;
	height: 600px;
	width: 1000px;
}

Use the Google Maps API to create the Basic Map

All the Google JavaScript API code is described in the Google Developer Reference Library.

Here’s the basic map. The code for rendering it is below.

Google Example with the basic map.

var map;

      function initialize() {
       //Setting the initial zoom level, initial center point (U.S. Capitol), and using the Google Road Map
        var mapOptions = {
          zoom: 15,
          center: new google.maps.LatLng(38.889864,-77.009017),
          mapTypeId: google.maps.MapTypeId.ROADMAP
        };

        map = new google.maps.Map(document.getElementById('map_canvas'),
            mapOptions);
      }

Line 1 – creates the map variable
Line 3 – creates the initialize() function that will create our map when the HTML page loads.
Lines 5-9 – initializes the map options object. Google provides good instructions on setting and choosing options.
Line 6 - selects initial zoom level (low zoom levels are far away, high are closer. It’s a bit of a guessing game, so try one out and adjust later.)
Line 7 – sets initial center point in latitude and longitude (I chose the U.S. Capitol)
Line 8 – sets the type of Google Map I am using.
Line 11 – finds the place on the DOM where the map should be rendered.

Adding your KML file containing your data

Use the API’s KmlLayer() constructor.

Important thing to remember: the KML file has to be publicly available, so make sure your file is on your live server and use the full url.

Here’s the basic map with the KMLLayer. The code for rendering it is below.

Example with my KML file layered over it.

var map;

      function initialize() {
       //Setting the initial zoom level, initial center point (U.S. Capitol), and using the Google Road Map
        var mapOptions = {
          zoom: 15,
          center: new google.maps.LatLng(38.889864,-77.009017),
          mapTypeId: google.maps.MapTypeId.ROADMAP
        };

        map = new google.maps.Map(document.getElementById('map_canvas'),
            mapOptions);

        //Creating layer with address markers
        //KML file has to be publicly available, so it has to be on your live site. Call by full URL.
        var addressLayer = new google.maps.KmlLayer('http://www.rungiraffe.com/clio3/mapexample/google/maps/address.kml');
        	addressLayer.setMap(map);

      }

Line 16-17 – Grabs the KML file with the address data and places it on the map.

So, now you have your points on the map. But, this is a modern map. What if you wanted to add a historic map layer for context or even for aesthetic purposes?

Adding a Historic Map Layer

If you want to layer a historic map under your data, for better context and better visual, you have to “warp” the map. Older maps are most likely not to scale. They are also flat and thus do not follow the curvature of the Earth. Warping the map changes the shape so that it will lie accurately on top of your base map. Here is my warped PNG map. And here is the original.

Resource for Warping a Historic Map Image

  • MapWarper (beta) – Tool for warping map images. Rectifies images against a real map by matching points on the image to points on an OpenStreetMap.

Using MapWarper to create a Historic Map Overlay

  • Upload a high res image (JPEG, TIFF, PNG) to mapwarper.net.
  • Follow the instructions to rectify your map.
  • Export your rectified map both as a KML file and as a PNG file (at the top of the page). This is important because you will use both when you add the historic map to your Google map.

Screenshot: Rectifying a historic map at mapwarper.net

Tips for using Map Warper

  • Find the highest quality and clearest map possible. This can be somewhat difficult for historians, but will help you warp your map more accurately.
  • Crop your map as closely as you can. Also make sure it’s oriented correctly before you upload it.
  • Map warper recommends that you rectify on 3 points. That’s a bare minimum. Your map will be more accurate the more points you can accurately pinpoint. The map I warped used 11 points.

Adding the Historic Map to your Google Map

Using the API’s GroundOverlay() constructor, you are going to layer your PNG image on top of the Google Map. You could layer the KML file that you exported from mapwarper instead, but this file grabs the map image from mapwarper’s website and I’ve found that the quality is poor. Layering the PNG file makes for a higher quality map.

Important thing to remember: order is key. You need to load the historic overlay before you load your address points so they render in the right layer order.

Once again, here’s a screenshot of the final product with the map layer. The code that puts it all together is below.

Google Map Example.

var map;

      function initialize() {
       //Setting the initial zoom level, initial center point (U.S. Capitol), and using the Google Road Map
        var mapOptions = {
          zoom: 15,
          center: new google.maps.LatLng(38.889864,-77.009017),
          mapTypeId: google.maps.MapTypeId.ROADMAP
        };

        map = new google.maps.Map(document.getElementById('map_canvas'),
            mapOptions);

        //Creating the historic map layer with PNG file

        //Setting Image boundaries from southwest to northeast corners
        var imageBounds = new google.maps.LatLngBounds(
        new google.maps.LatLng(38.8550166245068,-77.0833042589072),
        new google.maps.LatLng(38.9265575316522,-76.933616305016));

        var oldmap = new google.maps.GroundOverlay(
          "http://rungiraffe.com/clio3/mapexample/google/maps/warped_map.png", imageBounds);
            oldmap.setMap(map);

        //Creating layer with address markers
        //KML file has to be publicly available, so it has to be on your live site. Call by full URL.
        var addressLayer = new google.maps.KmlLayer('http://www.rungiraffe.com/clio3/mapexample/google/maps/address.kml');
        	addressLayer.setMap(map);

      }

Line 17 – Creates a variable imageBounds and gives it the bounding coordinates for the map. The first latitude-longitude coordinate is the point where the southwest corner of my warped map should go (line 18). Line 19 is the place for the northeast corner.

Lines 21-23 – Grabs the image file for the warped map and places it on the bounding coordinates.

Errr… how do you get your bounding coordinates?

I extracted them from the KML file that I downloaded from mapwarper. Google requires that you list the southwest corner first and the northeast corner second. As Latitude and Longitude coordinates are clear as mud, I suggest looking up the points on Google Maps (just plug in a latitude, longitude) to get your bearings.

Here’s the line of the historic map’s KML file that provides the bounding coordinates:

<href>http://mapwarper.net/maps/482.kml?DBOX=-77.0833042589072,38.8550166245068,-76.933616305016,38.9265575316522,1</href>

The southwest corner is first; northeast is second. But, Google requires Latitude first, so reverse the latitude and longitude in your code.
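If you would rather not reorder the numbers by hand, here is a minimal Python 3 sketch that pulls the four coordinates out of the href above and swaps them into latitude-first order. The function name bounds_from_href is hypothetical, and the sketch assumes the DBOX value always starts with west, south, east, north (as it does in my file); it ignores the trailing “1”.

```python
from urllib.parse import urlparse, parse_qs

href = ("http://mapwarper.net/maps/482.kml?DBOX=-77.0833042589072,"
        "38.8550166245068,-76.933616305016,38.9265575316522,1")


def bounds_from_href(href):
    """Return ((sw_lat, sw_lng), (ne_lat, ne_lng)) from a mapwarper-style href."""
    dbox = parse_qs(urlparse(href).query)["DBOX"][0]
    # Assumed order: west longitude, south latitude, east longitude, north latitude
    west, south, east, north = [float(v) for v in dbox.split(",")[:4]]
    return (south, west), (north, east)


southwest, northeast = bounds_from_href(href)
# Print the two lines in the form the JavaScript file needs
print("new google.maps.LatLng(%r, %r)," % southwest)
print("new google.maps.LatLng(%r, %r)" % northeast)
```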

So… that’s it. The Google Maps JavaScript API gives many different options, so be sure to check it out.

Cleaning Bad OCR with Regular Expressions and Python

Optical Character Recognition (OCR)—the conversion of scanned images to machine-encoded text—has proven a godsend for historical research. This process allows texts to be searchable on one hand and more easily parsed and mined on the other. But we’ve all noticed that the OCR for historic texts is far from perfect. Old type faces and formats make for unique OCR. Take for example, this page from the Congressional Directory from the 50th Congress (1887). The PDF scan downloaded from HeinOnline looks organized:

This is a screenshot of the PDF page.

However, the OCR layer (downloaded as a text file*) shows that the machine-encoded text is not nearly as neat:

This is a screenshot of the OCR.


*Note: If you do not have the option to download a text file, you can use the pdfminer module to extract text from the pdf.

Since I want to use this to map the Washington residences for Members of these late 19th-century Congresses, how might I make this data more useable?

The answer is Regular Expressions or “regex.” Here’s what regex did for me. Though this is not a “real” CSV file (the commas are not quite right), it can be easily viewed in Excel and prepped for geocoding. Much better than the text file from above, right?

Aldrich, N. W,Providence, R. I
Allison, William B, Dubuque, Iowa,24Vermont avenue,
Bate, William,Nashville, Ten, Ebbitt House
Beck, James B,Lexington, Ky
Berry, James I, Bentonville, Ark, National Hotel,
Blair, I lenry \V, Manchester, N. H,2o East Capitol stree_._"
Blodgett, Rufus,Long Branch, N. J
Bowen, Thomas M,Del Norte, Colo
Brown, Joseph E, Atlanta, Ga, Woodmont Flats,
Butler, M. C,Edgefield, S. C, 1751 P street NW
Call, Wilkinson, Jacksonville, Fla, 1903 N street NW
Cameron, J. D,Harrisburg, Pa, 21 Lafayette Square,
Chace, Jonathan,Providence, R, I
Chandler, William E, Concord, N. H, 1421 I street NW
Cockrell, Francis M,Warrensburgh,Mo, I518 R street NW
Coke, Richard,Waco, Tex, 419 Sixth street NW
Colquitt, Alfred I I,Atlanta, Ga, 920 New York avenue
Cullom, Shelby M,Springfield, Ill, 1402 Massachusetts avenue
Daniel, John W,,Lynchburgh, Va, I7OO Nineteenth st. NW
Davis, Cushman K, Saint Paul, Minn, 17oo Fifteenth street NW
Dawes, Henry L,Pittsfield, Mass, 1632Rhode Island avenue.
Dolph, Joseph N,Portland, Oregon, 8 Lafayette Square,
Edmunds, George F, Burlington, Vt, 2111 Massachusetts avenue
Eustis, James B,,New Orleans, La, 1761 N street NW
Evarts, William M,New York, N. Y, i6oi K street NW
Farwell, Charles B, Chicago, Ill,
Faulkner, Charles James, Martinsburgh, W. Va,
Frye, William P,Lewiston, Me, Hamilton House,
George, James Z,Jackson, Miss, Metropolitan Hotel
Gibson, Randall Lee, New Orleans, La, 1723 Rhode Island avenue.
Gorman, Arthur P, Laurel, Md .,1403 K street NW
Gray, George,Wilmington, Del,
Hale, Eugene,Ellsworth, Me, 917 Sixthteenth st. NW
Hampton, Wade, Columbia, S. C,
Harris, Isham G, Memphis,Tenn, 13 First street NE
Hawley, Joseph R,Hartford, Corn, 1514 K street NW
Hearst, George,San Francisco, Cal,
Hiscock, Frank, Syracuse, N. Y, Arlington Hotel
Hoar, George F, Worcester, Mass, 1325 K street NW
Ingalls, John James, Atchison, Kans, I B street NW
Jones, James K,Washington, Ark, 915 M street NW
Jones, John P,Gold Hill, Nev
Kenna, John E,Charleston, W. Va, 14o B street NW
McPherson, John ,Jersey City, N. J, 1014 Vermont avenue,
Manderson, CharlesF. Omaha, Nebr,The Portland
Morgan, John T,.Selma, Ala,I 13 First street NE
Morrill, Justin S, Stratford, Vt, x Thomas Circle,

Regular Expressions (Regex)

Regex is not a programming language. Rather it follows a syntax used in many different languages, employing a series of characters to find and/or replace precise patterns in texts. For example, using this sample text:

Let's get all this bad OCR and $tuff. Gr8!

1. You could isolate all the capital letters (L, O, C, R, G) with this regex:

[A-Z]

2. You could isolate the first capital letter (L) with this regex:

^[A-Z]

3. You could isolate all characters BUT the capital letters with this regex:

[^A-Z]

4. You could isolate the acronym “OCR” with this regex:

[A-Z]{3}

5. You could isolate the punctuation using this regex:

[[:punct:]]

6. You could isolate all the punctuation, spaces, and numbers this way:

[[:punct:], ,0-9]

The character set is not that large, but the patterns can get complicated. Moreover, different characters can mean different things depending on their placement. Take for example, the difference between example 2 and example 3 above. In example 2, the caret (^) means isolate the pattern at the beginning of the line or document. However, when you put the caret inside the character class (demarcated by []) it means “except” these sets of characters.
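To try the first five examples in Python 3, a minimal sketch follows. One assumption to flag: Python’s re module does not support the POSIX class [[:punct:]], so the sketch builds an equivalent character class from string.punctuation (ASCII punctuation only).

```python
import re
import string

sample = "Let's get all this bad OCR and $tuff. Gr8!"

# Examples 1-4 work unchanged in Python
capitals = re.findall(r"[A-Z]", sample)
first_capital = re.findall(r"^[A-Z]", sample)
not_capitals = re.findall(r"[^A-Z]", sample)
acronym = re.findall(r"[A-Z]{3}", sample)

# Example 5: stand-in for [[:punct:]], built from ASCII punctuation
punct_class = "[" + re.escape(string.punctuation) + "]"
punctuation = re.findall(punct_class, sample)

print(capitals)       # ['L', 'O', 'C', 'R', 'G']
print(first_capital)  # ['L']
print(acronym)        # ['OCR']
print(punctuation)
```

re.findall() returns every non-overlapping match as a list, which makes it a quick way to see exactly what a pattern is catching before you use it to replace anything.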

The best way to understand Regular Expressions is to learn what the characters do in different positions and practice, practice, practice. And since experimentation is the best way to learn, I suggest using a regex tester tool and experimenting with the syntax. For Mac users, I had a lot of luck with the Patterns App (Mac Store $2.99), which allowed me to see what the regular expressions were doing in real time. It also comes with a built-in cheat sheet for the symbols, but I actually found this generic (meaning it works across languages) cheat sheet more comprehensive. For PC users (or people who don’t want to pay or download software), I also found another tester tool that was fairly transparent.

Python and Regex

In this tutorial, I use the Regular Expressions Python module to extract a “cleaner” version of the Congressional Directory text file. Though the documentation for this module is fairly comprehensive, beginners will have more luck with the simpler Regular Expression HOWTO documentation.

Two things to note before you get started

  • From what I’ve observed, Python is not the most efficient way to use Regular Expressions if you have to clean a single document. Command Line programs like sed or grep appear to be more efficient for this process. (I will leave it to the better grep/sed users to create tutorials on those tools.) I use Python for several reasons: 1) I understand the syntax best; 2) I appreciate seeing each step written out in a single file so I can easily backtrack mistakes; and 3) I want a program I could use over and over again, since I am cleaning multiple pages from the Congressional Directory.
  • The OCR in this document is far from consistent (within a single page or across multiple pages). Thus, the results of this cleaning tutorial are not perfect. My goal is to let regex do the heavy lifting and export a document in my chosen format that is more organized than the document with which I started. This significantly reduces, but does not eliminate, any hand-cleaning I might need to do before geocoding the address data.

My example Python File

Here’s the Python file that I created to clean my document:

 
#cdocr.py
#strip the punctuation and extra information from HeinOnline text document

#import re module
import re

#Open the text file with the ocr
ocr = open("../../data/txt/50-1-p1.txt")
#read the text file into a list
Text = ocr.readlines()

#Create an empty list to fill with lines of corrected text
CleanText = []

# checks each line in the imported text file for all the following patterns
for line in Text:
	#lines with multi-dashes contain data - searches for those lines
	# -- does not isolate intro text lines with one dash.
	dashes = re.search("(--+)", line)
	
	#isolates lines with dashes and cleans
	if dashes:
		#replaces dashes with my chosen delimiter
		nodash = re.sub(".(-+)", ",", line)
		#strikes multiple periods
		nodots = re.sub(".(\.\.+)", "", nodash)
		#strikes extra spaces
		nospaces = re.sub("(  +)", ",", nodots)
		#strikes * 
		nostar = re.sub(".[*]", "", nospaces)
		#strikes new line and comma at the beginning of the line
		flushleft = re.sub("^\W", "", nostar)
		#getting rid of double commas (i.e. - Evarts)
		comma = re.sub(",{2,3}", ",", flushleft)
		#cleaning up some words that are stuck together (i.e. -  Dawes, Manderson)
		#skips double OO that was put in place of 00 in address
		caps = re.sub("[A-N|P-Z]{2,}", ",", comma)
		#Clean up NE and NW quadrant indicators by removing periods
		ne = re.sub("(\,*? N\. ?E.)", " NE", caps)	
		nw = re.sub("(\,*? N\. ?W[\.\,]*?_?)$", " NW", ne) #MAKE VERBOSE
		#Replace periods with commas between last and first names (i.e. - Chace, Cockrell)
		match = re.search("^([A-Z][a-z]+\. )", nw) #MAKE VERBOSE
		if match:
			names = re.sub("\.", ",", nw)
		else:
			names = nw
           #Append each line to CleanText list while it loops through
		CleanText.append(names)

#Saving into a "fake" csv file
fcsv = open("cdocr2/50-1p1.csv", "w")
#Write each line in CleanText to a file
for line in CleanText:
	fcsv.write(line)
#Close both files now that we are done with them
ocr.close()
fcsv.close()

I’ve commented it pretty extensively, so I will explain why I structured the code the way I did. I will also demonstrate a different way to format long regular expressions for better legibility.

  • Lines 16-22 – Notice in my original text file that my data is all on lines with multiple dashes. This code effectively isolates those lines. I use the re.search() function to find all lines with multiple dashes. The “if” statement on line 22 ensures that the rest of the code only works on the lines with dashes. (This eliminates all introductory text and the rows of page numbers that follow the data I want.)
  • Lines 23-40 – This is the long process by which I eliminate all of the extraneous punctuation and put the pieces of my data (last name, first name, home post office, washington address) into different fields for a csv document. I use the re.sub() function, which substitutes a pattern with another string. I comment extensively here, so you can see what each piece does. This may not be the most efficient way of doing this, but by doing this piece by piece, I could check my work as I went. As I built the loop, I checked each step by printing the variable in the command line. So, for example, after line 24 (when I eliminate the dashes), I would add “print(nodash)” (inside the if block) before I ran the file in the command line. I checked each step to make sure my patterns were only changing the things I wanted and not changing things I did not want changed.
  • Lines 41-46 – I used a slightly different method here. The OCR in the text file separated some names with a period (for example, Chace.Jonathan vs. Chace,Jonathan). I wanted to isolate the periods that came up in this pattern and change those periods to commas. So I searched for the pattern ^([A-Z][a-z]+\.), which looks at the beginning of a line (^) and finds a pattern with one capital letter, multiple lowercase letters and a period. After I had isolated that pattern, I replaced the periods in the lines that fit the pattern with commas.
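One thing to keep in mind about the last step: re.sub("\.", ",", ...) replaces every period in a matching line, not just the one after the last name. If you only want to touch the period that follows the name, a capture group can do the search and the replacement in one call. This is a minimal Python 3 sketch of that alternative, not what my script does; the function name fix_name_period and the sample lines are hypothetical.

```python
import re


def fix_name_period(line):
    """Turn 'Lastname.' at the start of a line into 'Lastname,'."""
    # \1 puts the captured last name back; only the period after it is replaced
    return re.sub(r"^([A-Z][a-z]+)\.", r"\1,", line)


print(fix_name_period("Chace.Jonathan,Providence, R. I"))
# Chace,Jonathan,Providence, R. I

# Lines that do not start with 'Lastname.' are returned unchanged
print(fix_name_period("Beck, James B,Lexington, Ky."))
```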

Using Verbose Mode

Most regular expressions are difficult to read. But lines 39 and 40 look especially bad. How might you clarify these patterns for people who might look at your code (or for yourself when you are staring at them at 2:00 AM someday)? You can use the module’s verbose mode. In verbose mode, Python ignores white space in the pattern and treats everything from a # to the end of the line as a comment, so you can split the patterns across multiple lines and comment each piece. Keep in mind that, because it ignores spaces, if spaces are part of your pattern, you need to escape them with a backslash (\). Also note that re.VERBOSE and re.X are the same thing.

Here are lines 39 and 40 in verbose mode:

#Based on (\,*? N\. ?E.), but the period after N is optional and the match is anchored to the end of the line
#All spaces need to be escaped in verbose mode.
ne_pattern = re.compile(r"""
	( 				#start group
		\,*? 		#look for comma (escaped); *? = 0 or more commas with fewest results
		\ N\.? 	    #look for (escaped) space + N that might have an (escaped) period after it
		\ ?E 		#look for an E that may or may not have an space in front of it
		. 			#the E might be followed by another character.
	) 				#close group
	$ 				#ONLY look at the end of a line
""", re.VERBOSE)

#Based on (\,*? N\. ?W[\.\,]*?_?)$, but the period after N is optional
nw_pattern = re.compile(r"""
	( 					#start group
		\,*? 			#look for comma (escaped); *? = 0 or more commas with fewest results
		\ N\.? 		    #look for (escaped) space + N that might have an (escaped) period after it
		\ ?W 			#look for an W that may or may not have an space in front of it
		[\.\,]*?		#look for commas or periods (both escaped) that might come after W
		_?				#look for underscore that comes after one of these NW quadrant indicators
	) 					#close group
	$ 					#ONLY look at the end of a line
""", re.X)

In the above example, I use the re.compile() function to save the pattern for future use. So, adjusting my full python code to use verbose mode would look like the following. Note that I define my verbose patterns on lines 17-39 and store them in variables (ne_pattern and nw_pattern). I use them in my loop on lines 65 and 66.

#cdocrverbose.py
#strip the punctuation and extra information from HeinOnline text document

#import re module
import re

#Open the text file with the ocr
ocr = open("../../data/txt/50-1-p1.txt")
#read the text file into a list
Text = ocr.readlines()

#Create an empty list to fill with lines of corrected text
CleanText = []

##Creating verbose patterns for the more complicated pieces that I use later on.##

#Based on (\,*? N\. ?E.), but the period after N is optional and the match is anchored to the end of the line
#All spaces need to be escaped in verbose mode.
ne_pattern = re.compile(r"""
	( 				#start group
		\,*? 		#look for comma (escaped); *? = 0 or more commas with fewest results
		\ N\.? 	    #look for (escaped) space + N that might have an (escaped) period after it
		\ ?E 		#look for an E that may or may not have an space in front of it
		. 			#the E might be followed by another character.
	) 				#close group
	$ 				#ONLY look at the end of a line
""", re.VERBOSE)

#Based on (\,*? N\. ?W[\.\,]*?_?)$, but the period after N is optional
nw_pattern = re.compile(r"""
	( 					#start group
		\,*? 			#look for comma (escaped); *? = 0 or more commas with fewest results
		\ N\.? 		    #look for (escaped) space + N that might have an (escaped) period after it
		\ ?W 			#look for an W that may or may not have an space in front of it
		[\.\,]*?		#look for commas or periods (both escaped) that might come after W
		_?				#look for underscore that comes after one of these NW quadrant indicators
	) 					#close group
	$ 					#ONLY look at the end of a line
""", re.VERBOSE)

# checks each line in the imported text file for all the following patterns
for line in Text:
	#lines with multi-dashes contain data - searches for those lines
	# -- does not isolate intro text lines with one dash.
	dashes = re.search("(--+)", line)
	
	#isolates lines with dashes and cleans
	if dashes:
		#replaces dashes with my chosen delimiter
		nodash = re.sub(".(-+)", ",", line)
		#strikes multiple periods
		nodots = re.sub(".(\.\.+)", "", nodash)
		#strikes extra spaces
		nospaces = re.sub("(  +)", ",", nodots)
		#strikes * 
		nostar = re.sub(".[*]", "", nospaces)
		#strikes new line and comma at the beginning of the line
		flushleft = re.sub("^\W", "", nostar)
		#getting rid of double commas (i.e. - Evarts)
		comma = re.sub(",{2,3}", ",", flushleft)
		#cleaning up some words that are stuck together (i.e. -  Dawes, Manderson)
		#skips double OO that was put in place of 00 in address
		caps = re.sub("[A-N|P-Z]{2,}", ",", comma)
		#Clean up NE and NW quadrant indicators by removing periods (using Verbose regex defined above)
		ne = re.sub(ne_pattern, " NE", caps)	
		nw = re.sub(nw_pattern, " NW", ne) 
		#Replace periods with commas between last and first names (i.e. - Chace, Cockrell)
		match = re.search("^([A-Z][a-z]+\.)", nw)
		if match:
			names = re.sub("\.", ",", nw)
		else:
			names = nw
		 #Append each line to CleanText list while it loops through
		CleanText.append(names)

#Saving into a "fake" csv file
fcsv = open("cdocr2/50-1p1.csv", "w")
#Write each line in CleanText to a file
for line in CleanText:
	fcsv.write(line)
#Close both files now that we are done with them
ocr.close()
fcsv.close()

In conclusion, I will note that this is not for the faint of heart. Regular Expressions are powerful. Yes, they are powerful enough to completely destroy your data. So practice on copies and take it one itty bitty step at a time.

Cleaning up downloaded files

So you used wget or Python to pull down a collection of files from the web. Excellent! But in looking through your loot, you notice that a number of the files are oddly small and, on further examination, find that they are functionally empty. How do you clean up your collection of files quickly?

There are a number of very powerful command line tools built into UNIX systems (Mac and Linux) that allow you to manipulate your files quickly and easily. This is a brief tutorial on how to use those tools to locate all of the files that are too small and then remove those files from your collection.

Begin by navigating in Terminal to your collection of folders. For example, my files were located a couple of folders down within my Documents folder.

cd Documents/Github/Clio3/Webscraping/hymn-files

Once in the folder with your downloaded files, you need to find a way to isolate out the files that are too small to be interesting. To do this, use the “find” command.

If the files we are interested in sorting through are all one file type (in my case they’re .json files), we can tell the computer to find all of the files of a particular type and particular size as follows:

find *.json -size 28c

This would find all of the json files that are 28 bytes in size.

However, we want all the files that are smaller than 28 bytes. A minus sign in front of the size means “less than”:

find *.json -size -28c

This is important in my case because the files are not truly empty. There is a simple way to identify truly empty files, if that is more appropriate for your data.

find *.json -empty

You can also add

-maxdepth 1

if there are additional folders that you do not want to work through.

Finding the files is great, but now we need to do something with that collection of files.

There are two ways you can do this. First, you can simply add a delete option to your command, as follows

find *.json -size -28c -delete

In general, this should work. However, if you are dealing with a large number of files, it is often better to use a pipe (“|”). The pipe takes the results of the first command and feeds them to the second command. In the example below, we are taking the results of the find command and passing them to the remove command (“rm”).

find *.json -size -28c | xargs rm

You will notice that we included “xargs” on the right side of the pipe. Xargs helps the computer handle a long list of file names.

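Note: The Wikipedia entry on xargs suggests using “-0” (zero) when dealing with file names with spaces in them, as xargs defaults to separating at white space (another reason to avoid spaces in filenames). The -0 option expects file names separated by null characters, so it needs to be paired with find’s “-print0” option. If you run this command without -0 and it doesn’t work, try adding -print0 to the find side of the pipe and -0 to the xargs side.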

And the command we are running on each filename is “rm” or remove. rm has a number of options that you can research, some of which really make data un-recoverable and should be used with care. However, a basic “rm” command will be sufficient for this example.

Run this command to remove all of the files less than 28 bytes from your current directory.

(I used 28 bytes because the computer was having trouble with 0 and all of the files I wanted to keep were larger than 28 bytes. Not exactly sure why 0 was a problem but this is why experimenting with the find options before moving on to removing the files is a good idea!)
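If you would rather stay in Python, the same cleanup can be sketched with the standard pathlib module. This is a minimal Python 3 sketch under stated assumptions: the function names small_files and remove_small_files are hypothetical, it only looks in one folder (no subfolders), and it defaults to a dry run that lists the files without deleting anything. The demonstration at the bottom runs against a throwaway temporary folder rather than real data.

```python
import tempfile
from pathlib import Path


def small_files(folder, pattern="*.json", max_bytes=28):
    """Return files in folder matching pattern that are smaller than max_bytes."""
    return sorted(p for p in Path(folder).glob(pattern)
                  if p.is_file() and p.stat().st_size < max_bytes)


def remove_small_files(folder, pattern="*.json", max_bytes=28, dry_run=True):
    """List the small files and, when dry_run is False, delete them."""
    found = small_files(folder, pattern, max_bytes)
    for path in found:
        print(path)
        if not dry_run:
            path.unlink()
    return found


# Demonstration on a temporary folder with one tiny file and one larger file
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "empty.json").write_text("{}")
    Path(demo_dir, "full.json").write_text('{"title": "A hymn with enough text to keep"}')
    removed = remove_small_files(demo_dir, dry_run=False)
    remaining = sorted(p.name for p in Path(demo_dir).glob("*.json"))
```

Running with dry_run=True first plays the same role as experimenting with find before adding rm: you see the list of files before anything is deleted.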

Python Dictionaries for Historical Data

Doing digital history often means manipulating data to get it in a form that a visualization tool can use. This may well mean converting a Python dictionary (that you have built while scraping data from a webpage or have produced from loading a CSV file) to a JavaScript object for display on a webpage. This quick example discusses how a Python dictionary can help standardize data.

Your data

Let’s say you want to track battle casualties in the Civil War, organized by year, battle, and army. This data is easily findable online, and it might look like:

1862
– Battle 1
– - Union 12
– - Confederate 15

– Battle 2
– - Union 9
– - Confederate 23

1863
– Battle 3
– - Union 252
– - Confederate 1156

– Battle 4
– - Union 2
– - Confederate 0

Visualizing this data is fairly straightforward in that a simple stacked bar graph could be very helpful. But the details of moving data across scripting languages can be tricky and frustrating.

Python Dictionaries

Python dictionaries are incredibly powerful objects that are surprisingly useful for historians who need to track data over time. How would our Civil War data look in a Python dictionary? First, a few basics.

Let’s make a dictionary with a key named ‘key’ and a value named ‘value’. The single quotes here mean that these are literal strings (i.e., not variables).

# define our dictionary with one key and one value
myDictionary = {'key': 'value'} 

# retrieve a value
print(myDictionary['key'])
# prints value

Of course we don’t have to use ‘key’ and ‘value’:

battles = {'battleName': 'Gettysburg'} 

print(battles['battleName'])
# prints Gettysburg

print(battles['key'])
# raises a KeyError, since there is no key called 'key'.
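The error Python raises in that last case is a KeyError. If you are not sure whether a key exists, there are two standard ways to check before you ask for it. The short sketch below reuses the battles dictionary; the default value 'unknown' is just an illustrative choice.

```python
battles = {'battleName': 'Gettysburg'}

# test for a key before using it
if 'key' in battles:
    print(battles['key'])
else:
    print('no such key')
# prints no such key

# .get() returns a default value instead of raising a KeyError
print(battles.get('key', 'unknown'))
# prints unknown

print(battles.get('battleName', 'unknown'))
# prints Gettysburg
```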

We can have multiple keys and values, but the keys must be different (or there would be no way to retrieve the appropriate value):

myDictionary = {'key1': 'value1', 'key2': 'value2' } # define our dictionary with two keys and two values

print(myDictionary['key1'])
# prints value1

print(myDictionary['key2'])
# prints value2

Dictionary values don’t have to be strings, they can be other objects (like dictionaries). Check this out:

myDictionary = {'casualties': {'union':12}} 

print(myDictionary['casualties']['union'])
# prints 12

Combining the last two examples:

myDictionary = {'casualties': {'union':12, 'confederate':15}} 

print(myDictionary['casualties']['union'])
# prints 12

print(myDictionary['casualties']['confederate'])
# prints 15

Let’s see if we can replicate our first year of battles, using the battle names as keys.

myBattles = {
'Battle1': {'casualties': {'union':12, 'confederate':15}},
'Battle2': {'casualties': {'union':9, 'confederate':23}}
} 

print(myBattles['Battle1']['casualties']['union'])
# prints 12

print(myBattles['Battle2']['casualties']['confederate'])
# prints 23

Adding years is easy; we just add another key and value pair, where the key is the year and the values are dictionaries that contain information about battles. We’re creating a nested dictionary. One dictionary contains another, which contains another, and so on.

civilWarBattles = {
  '1862': {
    'Battle1': {'casualties': {'union':12, 'confederate':15}},
    'Battle2': {'casualties': {'union':9, 'confederate':23}}
  },
  '1863': {
    'Battle3': {'casualties': {'union':252, 'confederate':1156}},
    'Battle4': {'casualties': {'union':2, 'confederate':0}}
  }
} 

print(civilWarBattles['1863']['Battle4']['casualties']['confederate'])
# prints 0
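Once the data is nested like this, loops can walk through every level. As a rough sketch of what that makes possible, the following adds up the casualties for each army by year. The names totalsByYear and totals are made up for this illustration.

```python
civilWarBattles = {
  '1862': {
    'Battle1': {'casualties': {'union':12, 'confederate':15}},
    'Battle2': {'casualties': {'union':9, 'confederate':23}}
  },
  '1863': {
    'Battle3': {'casualties': {'union':252, 'confederate':1156}},
    'Battle4': {'casualties': {'union':2, 'confederate':0}}
  }
}

# add up the casualties for each army, year by year
totalsByYear = {}
for year, battles in civilWarBattles.items():
    totals = {'union': 0, 'confederate': 0}
    for battleName, battle in battles.items():
        for army, count in battle['casualties'].items():
            totals[army] += count
    totalsByYear[year] = totals

print(totalsByYear['1862']['union'])
# prints 21

print(totalsByYear['1863']['confederate'])
# prints 1156
```

Totals per year and army are the kind of numbers a stacked bar graph needs.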

In the above example we are using battle names as keys. Sometimes it’s easier to have all the keys be constant, like we have union and confederate appearing in each battle entry. This would of course be essential if two different battles went by the same name in the same year. In this case, we can use a python list to hold our list of battles for each year.

civilWarBattles = {
  '1862': [{'battleName':'Battle1', 'casualties': {'union':12, 'confederate':15}},
           {'battleName':'Battle2', 'casualties': {'union':9, 'confederate':23}}
          ]
  }

# we can access an element of a list by its index (beginning with 0, of course)
print(civilWarBattles['1862'][1]['casualties']['confederate'])
# prints 23

Whether or not this is a useful technique depends on how you will access your data later on.
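If you start with battle names as keys and later decide you want lists, you do not have to retype anything. Here is one possible way to convert between the two layouts; the variable names battlesByName and battlesAsLists are invented for this sketch.

```python
battlesByName = {
  '1862': {
    'Battle1': {'casualties': {'union':12, 'confederate':15}},
    'Battle2': {'casualties': {'union':9, 'confederate':23}}
  }
}

battlesAsLists = {}
for year, battles in battlesByName.items():
    battleList = []
    # sorted() keeps the battles in a predictable order
    for battleName in sorted(battles):
        entry = {'battleName': battleName,
                 'casualties': battles[battleName]['casualties']}
        battleList.append(entry)
    battlesAsLists[year] = battleList

print(battlesAsLists['1862'][1]['battleName'])
# prints Battle2

print(battlesAsLists['1862'][1]['casualties']['confederate'])
# prints 23
```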

Since you have nicely organized your data, it would be nice to save it and keep it in this format rather than convert it to a CSV file (or similar). This is where JSON (JavaScript Object Notation) is invaluable. It allows Python to create a plain text file that looks almost exactly like the object does when we define it. Strictly speaking, this is a text file that uses JSON, but we call it a JSON file in the same way that we call something in CSV format a CSV file. Let’s save it as ‘civilWarBattleData.json’.

import json

civilWarBattles = {
  '1862': [{'battleName':'Battle1', 'casualties': {'union':12, 'confederate':15}},
           {'battleName':'Battle2', 'casualties': {'union':9, 'confederate':23}}
          ]
  }

# the with statement closes the file for us once the data is written
with open('civilWarBattleData.json', 'w') as outfile:
    json.dump(civilWarBattles, outfile)

Now, if you look at this file, you will see that it represents the Python dictionary we created. But of course it’s not really a dictionary anymore, just a plain text file that uses JavaScript Object Notation. This is especially helpful when you want to use the data in a JavaScript file, where it can be loaded as a standard JavaScript object, giving you easy access to your well-structured data.
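The trip works in the other direction too: Python can read a JSON file back into a dictionary. The sketch below writes the data and then loads it again to show that nothing is lost. Writing to a temporary folder and using indent=2 are choices made for this illustration; they keep the sketch away from your own files and make the saved file easier to read.

```python
import json
import os
import tempfile

civilWarBattles = {
  '1862': [{'battleName':'Battle1', 'casualties': {'union':12, 'confederate':15}},
           {'battleName':'Battle2', 'casualties': {'union':9, 'confederate':23}}
          ]
  }

# write to a temporary folder so the sketch does not touch your own files
path = os.path.join(tempfile.mkdtemp(), 'civilWarBattleData.json')
with open(path, 'w') as outfile:
    json.dump(civilWarBattles, outfile, indent=2)

# read the file back into a new dictionary
with open(path) as infile:
    loadedBattles = json.load(infile)

print(loadedBattles == civilWarBattles)
# prints True

print(loadedBattles['1862'][0]['battleName'])
# prints Battle1
```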

Reading with R

When reading historical documents, historians may not consider statistical packages like R to be of much help. But historians like to read texts in various ways, and R helps do exactly that.

This tutorial provides a skeletal recipe for reading a set of texts (a corpus) through the statistical software R. Although we often consider a text a vehicle for meaning, a text can also be considered as a string of words (ordered), or a bag of words (unordered). These words can be represented in ways that are useful for analysis that aren’t constrained by grammatical order. It is possible, in fact, to read texts as a function, statistically speaking, of how words relate to each other in a single text compared to how they relate to words in other texts.

In other words, we can (and must) learn new ways of reading texts, and embrace mathematical abstraction and visualization as interpretative allies rather than black-box enemies. The typical humanist creation of meaning from text is no less black-boxy than reading the text through mathematical lenses. These new ways are not the savior of the humanities, and they do not guarantee new insights into anything. They may be utterly useless for your purposes, in the same way that trying to analyze sources through a Marxist or a Feminist lens may be totally unhelpful. But we expect good historians to carry a large bag full of lenses, and R is just another one.

One example of a new way of reading is the case of document similarity: How do you know what texts are similar to others? Their “meaning”? Okay. But let’s make the reasonable assumption that an author’s intended meaning dictates her word choice. Therefore, similar documents (at least those that are not highly allegorical or metaphorical in different ways) will use similar sets of words in similar ways. Implicitly detecting these patterns is largely how we constitute meaning, even when reading in the traditional way. But it’s not the only way.
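To make the idea of similarity through word choice concrete, here is a small sketch in Python rather than R. It treats each text as a bag of words and compares the bags with cosine similarity, which is one common measure among several and is not necessarily what R does behind the scenes. The three “documents” are invented for the illustration, and the function names are assumptions of this sketch.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count each word, ignoring order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    """Return a score from 0 (no shared words) to 1 (same proportions)."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[word] * bag_b[word] for word in shared)
    norm_a = math.sqrt(sum(count ** 2 for count in bag_a.values()))
    norm_b = math.sqrt(sum(count ** 2 for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# three made-up "documents" for illustration
doc1 = bag_of_words("The army marched to the river and the army camped.")
doc2 = bag_of_words("The army camped by the river.")
doc3 = bag_of_words("Prices of cotton and grain rose sharply.")

# the two military texts should be closer to each other than to the third
print(cosine_similarity(doc1, doc2) > cosine_similarity(doc1, doc3))
# prints True
```

Notice that common words like “the” do much of the work here, which is one reason removing stopwords matters.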

R is a statistics package for carrying out generic statistical calculations, most of which have nothing to do with texts. But a special text mining module, tm, provides us with a lot of built-in mathematical functions that we can use to explore (and, more importantly, read) texts.

There are several tutorials that can help you get started with R, like this one and this one. The TM documentation (PDF) proves helpful as well.

I found that I needed a few bits here and a few bits there, so I have tried to cobble them together here.
What follows is the recipe for creating a basic dendrogram from a small set of documents. This is probably more useful as a way of getting you to think differently about your texts than to gain serious proficiency with R itself. But motivation is at least half the battle.

Picking Your Corpus

You’ll need to have a small set (= corpus) of texts (= documents) to play with. Having a small corpus is a double-edged sword, as it is easier to see what’s going on, but the results often appear to be trivial and so it’s easy to think the whole method is worthless. Whether or not you’ll immediately think R is cool probably depends on selecting a corpus and document lengths that aren’t too big or too small. You need to have all of your texts in separate plain text files (with a .txt extension, for example), gathered together in a single directory.

Download R for your operating system. Launch the R application. Everything that follows should be done in the R “console” (= application window), which provides you with an interface to your R workspace.

Getting Started

First, download and load up the text mining module with:

install.packages("tm")
require("tm")

Load your corpus into your workspace with the following command, where YOURPATH is the full path to your directory full of texts (keep the quotes when you paste in your own path).

my.corpus <- Corpus(DirSource("YOURPATH"))

Prepare your Texts

The first thing to do is to normalize the texts, which means removing punctuation, capitalization, and numbers. These are called “transformations” in R-speak and you can get a list of available transformations by typing:

getTransformations()

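(Note that there is one transformation that does not appear on this list, but that is nevertheless handy: “tolower”, which makes all words lowercase. In newer versions of tm it needs to be wrapped as content_transformer(tolower) when passed to tm_map.)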

The syntax for using the transformations is:

my.corpus <- tm_map(my.corpus, removePunctuation)

You also might want to remove common words (= stopwords) that obscure the more interesting ones. You can remove standard English stopwords:

my.corpus <- tm_map(my.corpus, function(x) removeWords(x, stopwords("english")))

(Here is a different syntax that also works):

my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))

You can also remove your own list of stopwords if you know there are words that appear in every text and might skew your results. There are two ways to do this. If you have a short list, you can create a “character vector” (more R-speak) right in the console:

my.stops <- c("history","clio", "programming")
my.corpus <- tm_map(my.corpus, removeWords, my.stops)

If you have a longer list, you can type the words in (or programmatically create) a text file with all of your stopwords. The file should list all of your words with a space in between. Like this:

history clio programming historians text mining...

From the R console, you import the file, create a character vector, and remove the words:

my.list <- unlist(read.table("PATH TO STOPWORD FILE", stringsAsFactors=FALSE))
my.stops <- c(my.list)
my.corpus <- tm_map(my.corpus, removeWords, my.stops)
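If you are more at home in Python, it may help to see what stopword removal amounts to. The following is only a rough sketch of the idea and not how tm implements removeWords; the function names and the file name stopwords.txt mentioned in the comment are hypothetical.

```python
def load_stopwords(text):
    """Split a whitespace-separated stopword list into a set of lowercase words."""
    return set(text.lower().split())


def remove_words(document, stopwords):
    """Return the document with every stopword dropped (case-insensitive)."""
    kept = [word for word in document.split() if word.lower() not in stopwords]
    return " ".join(kept)


# in practice this string would come from open("stopwords.txt").read()
my_stops = load_stopwords("history clio programming historians text mining")

print(remove_words("Programming helps historians read history differently", my_stops))
# prints helps read differently
```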

SAVE your Place

At this point, you might want to save your “place” so to speak, so that you can open this corpus with all the transformations in another R session. The command line version of this is here. I prefer using the GUI. In the Workspace Menu, choose “Save Workspace File…” and save under a name in a location of your choice. To open again, choose “Load Workspace File…”.

Play with R

We want to see our corpus expressed as a document matrix. There are two kinds that are useful to us, depending on how we want to read the text: a TermDocument matrix and a DocumentTerm matrix. The only difference is how variables are mapped to the X and Y axes of the matrix. TermDocument means that the terms are on the vertical axis and the documents are along the horizontal axis. DocumentTerm is the reverse. Each of these lets you see the texts in different ways, although one (or both) may not be very helpful for you.

my.tdm <- TermDocumentMatrix(my.corpus)

To see the new representation of your corpus, inspect it:

inspect(my.tdm)

You can compare this to the DocumentTermMatrix:

my.dtm <- DocumentTermMatrix(my.corpus, control = list(weighting = weightTfIdf, stopwords = TRUE))
inspect(my.dtm)
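If the two layouts are hard to picture, this small Python sketch builds both by hand for a made-up two-document corpus. It uses raw word counts only (no TF-IDF weighting, unlike the DocumentTermMatrix call above), and the document names and variable names are invented for the illustration.

```python
from collections import Counter

# a made-up corpus: document name -> text (already lowercased, no punctuation)
corpus = {
    "doc1.txt": "text mining helps historians read text",
    "doc2.txt": "historians read archives",
}

counts = {name: Counter(text.split()) for name, text in corpus.items()}
documents = sorted(corpus)
terms = sorted(set(word for bag in counts.values() for word in bag))

# term-document matrix: one row per term, one column per document
tdm = [[counts[doc][term] for doc in documents] for term in terms]

# document-term matrix: the same numbers with rows and columns swapped
dtm = [list(row) for row in zip(*tdm)]

for term, row in zip(terms, tdm):
    print(term, row)
```

In the term-document version, the row for “text” reads [2, 0]: the word appears twice in the first document and never in the second.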

Up until now, maybe you have considered your corpus as composed of discrete texts, themselves composed of discrete words. It can be helpful to know how they are associated with each other.

There are commands to compute word frequencies and associations:

findFreqTerms(my.tdm, 2)
findAssocs(my.tdm, 'mining', 0.20)

But sometimes it’s nice to see larger relationships of words and texts.

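# as.matrix() returns the full matrix; in recent versions of tm, inspect() may show only a sample
my.df <- as.data.frame(as.matrix(my.tdm))
my.df.scale <- scale(my.df)
d <- dist(my.df.scale,method="euclidean")
# the "ward" method was renamed "ward.D" in R 3.1.0
fit <- hclust(d, method="ward.D")
plot(fit)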

What does the dendrogram say? Terms higher in the plot appear more frequently within the corpus; terms grouped near to each other are more frequently found together. The nice thing about a trivial corpus is that you can easily inspect each document at a glance (and all at the same time) and match that up with what the dendrogram shows.

Isn’t this what you discover from reading? In this case, obviously yes. But the dendrogram scales much more nicely with more texts. Even with a smaller corpus of, say, hundreds of texts instead of many thousands, R is likely to read the texts a bit differently from how a person would. It is perhaps useful in seeing what relationships do NOT appear (perhaps ones that curious and not fully objective readers sort-of want to be there and might more easily identify) and thus provides an important complement to the more heuristic ways in which we read.

TimeMap Tutorial: Show Your Geo-Spatial Data on a Timeline

This tutorial is designed to help historians “map” their geo-spatial data to a timeline using TimeMap, an open-source JavaScript library that combines map functionality from a variety of mapping sources, including Google and OpenStreetMap, with SIMILE’s TimeLine widget. For example, the TimeMap below shows the executions of those convicted of witchcraft in colonial Massachusetts.

Example of a successful mashup of timeline and mapped information using TimeMap

For the purposes of this tutorial, you will need to download TimeMap and upload it to a folder on your server. There are a lot of files and folders in this download, so I recommend you house TimeMap in its own folder on your server. (For information on how to download JavaScript libraries and where to put your files, see Using JavaScript Libraries.)

We will create a TimeMap with just 3 elements:

  • a KML file containing your data (or other data source)
  • an HTML file containing the JavaScript used to create the TimeMap.
  • Style elements, either in a separate CSS file or in scripts in your HTML file.

The Data
TimeMap allows you to “load” from a variety of sources, including Flickr, Google spreadsheets, JSON, XML, and KML. No matter your source file, your data needs to contain coordinates (longitude & latitude or polygon), a date (in yyyy-mm-dd format), a title/name, and a description for the infobox shown above.

We’ll map from a KML file containing details of those executed for witchcraft. There are several ways to create a KML file of your data. Fred’s tutorial on creating a KML file by using Python is very handy if you’re trying to convert from a CSV or other data source. A condensed version of my file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document>
<Placemark>
<Point><coordinates>-71.1367953,42.6583356</coordinates></Point>
<name>Samuel Wardwell</name>
<TimeStamp><when>1692-09-22</when></TimeStamp>
<description>More info here.<![CDATA[If you'd like to include HTML, you must use CDATA. <a href="http://www.insertlink.com">More info.</a>]]>.</description>
</Placemark>
<Placemark>
<Point><coordinates>-70.8967155,42.5195400</coordinates></Point>
<name>Giles Corey</name>
<TimeStamp><when>1692-09-16</when></TimeStamp>
<description>Giles Corey, husband of Martha Corey, is pressed to death. He dies after two days "under the weight."</description>
</Placemark>
</Document>
</kml>

Be sure to upload your KML to your server in the same directory as your other map files.

All the data in the KML file will show up in the web page that houses your TimeMap. (See the first image.) For more complex data, the description field offers a nice home for exploratory analysis or detailed information. You can add HTML and even images here by using the CDATA section (as shown above).
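If your data starts out in Python, a KML file with this shape can be generated with the standard library. The sketch below is one possible approach and not the method from the tutorial mentioned above; the function name build_kml and the record field names (longitude, latitude, name, date, description) are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://earth.google.com/kml/2.1"


def build_kml(records):
    """Build a KML string with one Placemark per record."""
    ET.register_namespace("", KML_NS)  # write <kml xmlns=...> without a prefix
    kml = ET.Element("{%s}kml" % KML_NS)
    document = ET.SubElement(kml, "{%s}Document" % KML_NS)
    for record in records:
        placemark = ET.SubElement(document, "{%s}Placemark" % KML_NS)
        point = ET.SubElement(placemark, "{%s}Point" % KML_NS)
        coordinates = ET.SubElement(point, "{%s}coordinates" % KML_NS)
        # KML expects longitude first, then latitude
        coordinates.text = "%s,%s" % (record["longitude"], record["latitude"])
        ET.SubElement(placemark, "{%s}name" % KML_NS).text = record["name"]
        timestamp = ET.SubElement(placemark, "{%s}TimeStamp" % KML_NS)
        ET.SubElement(timestamp, "{%s}when" % KML_NS).text = record["date"]
        ET.SubElement(placemark, "{%s}description" % KML_NS).text = record["description"]
    return ET.tostring(kml, encoding="unicode")


records = [
    {"name": "Samuel Wardwell", "longitude": "-71.1367953", "latitude": "42.6583356",
     "date": "1692-09-22", "description": "More info here."},
]

kml_text = build_kml(records)
print(kml_text)
```

ElementTree escapes any HTML you put in a description rather than wrapping it in CDATA, which an XML parser should read the same way. The output also has no XML declaration line, so add one yourself if your tools require it.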

The HTML file

The HTML file will contain all of the needed JavaScript to build your map. Because you are using APIs, there will be very little code.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    
    <title>Witchcraft</title>
	<!-- Designate Your Stylesheet, or put your style elements here -->
	<link rel="stylesheet" type="text/css" href="sbox_style.css" />

	<!-- Begin Map Scripts -->
	
    <script type="text/javascript" src="http://maps.google.com/maps/api/js?sensor=false"></script>
    <script type="text/javascript" src="http://sandbox.erinbush.com/js/timemap/lib/jquery-1.6.2.min.js"></script>
    <script type="text/javascript" src="http://sandbox.erinbush.com/js/timemap/lib/mxn/mxn.js?(googlev3)"></script>
    <script type="text/javascript" src="http://sandbox.erinbush.com/js/timemap/lib/timeline-1.2.js"></script>
    <script src="http://sandbox.erinbush.com/js/timemap/src/timemap.js" type="text/javascript"></script>
	<script src="http://sandbox.erinbush.com/js/timemap/src/param.js" type="text/javascript"></script>
    <script src="http://sandbox.erinbush.com/js/timemap/src/loaders/xml.js" type="text/javascript"></script>
    <script src="http://sandbox.erinbush.com/js/timemap/src/loaders/kml.js" type="text/javascript"></script>
    <script type="text/javascript">
	
	// variable "tm" below initiates the map and includes the required buckets in which we will put our TimeMap
	var tm;
	$(function() {
    
		tm = TimeMap.init({
			mapId: "map",               // Id of map div element (required)
			timelineId: "timeline",     // Id of timeline div element (required)
			options: {
				eventIconPath: "http://sandbox.erinbush.com/js/timemap/images/" // Loads the appropriate icons for the map. Point this URL to the "images" folder from your TimeMap download on your server.
			},
			datasets: [
					{
					title: "Executions for Witchcraft",
					theme: "red",    // You can choose from any of Timeline's color themes. 
                    type: "kml",     // Data to be loaded in KML, change for other sources - must be a local URL
					options: {
						url: "witchcraftexe.kml" // KML file to load
						}
					}
			],
			bandIntervals: [
				Timeline.DateTime.YEAR,  // You can load a maximum of two timebands without adjusting your code.
				Timeline.DateTime.MONTH
			]
		});
    
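		// set the map to our custom style
		// NOTE: styledMapType is not defined in this condensed example; define it
		// (for example as a google.maps.StyledMapType) before these lines, or delete them.
		var gmap = tm.getNativeMap();
		gmap.mapTypes.set("white", styledMapType);
		gmap.setMapTypeId("white");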
	});
    </script>
	
  </head>
  <body>
  
  <div id="container">
	
		  
    <div id="help">
    <h1>Witchcraft Executions</h1>
    Puritans were executed for witchcraft long before the famous events in Salem in 1692/1693. See for yourself.
    </div>
    <div id="timemap">
        <div id="timelinecontainer">
          <div id="timeline"></div>
        </div>
        <div id="mapcontainer">
          <div id="map"></div>
        </div>
		
		</div>
    
	</div>
  </body>
</html>

In your initial script tags, you are calling all the APIs required to make the TimeMap function. These include APIs from Google Maps v3, jQuery, Mapstraction, Timeline and TimeMap. (Documentation is available for each API separately and is linked here.)

In the above example, we are calling the loaders required to read our KML file. TimeMap has separate loaders for each data source, so please check to make sure you call the one needed for your source.

Inside our <body> tags, we add some text to explain the map and then call each of our <div> containers and the elements, “map” and “timeline”.

CSS
As we created two new <div> containers to house our map and timeline, you need to style them.

/* Map Style */

div#help {
font-size: 12px;
width: 45em;
padding: 1em;
}

div#timemap {
padding: 1em;
}

div#timelinecontainer{
width: 100%;
height: 400px;
}

div#timeline{
 width: 100%;
 height: 100%;
 font-size: 12px;
 background: #CCCCCC;
}

div#mapcontainer {
 width: 100%;
 height: 400px;
}

div#map {
 width: 100%;
 height: 100%;
 background: #EEEEEE;
}

div.infotitle {
    font-size: 14px;
    font-weight: bold;
}
div.infodescription {
    font-size: 14px;
    font-style: italic;
}

div.custominfostyle {
    font-size: 1.5em;
    font-style: italic;
    width: 20em;
}

The sizes of your map and timeline will depend on your site, so adjust these accordingly. The div should be at least 300px high in order to hold both the map and the timeline. They can be bigger, but not smaller, or else they will not load.

Because TimeMap corrals the different functionality of mapping APIs with jQuery, Mapstraction and Timeline, the available features and functions seem endless. You can create many different types of timelines and maps if you familiarize yourself with the possibilities. For example, if you wanted to show overlapping reigns of monarchs, TimeMap has a timespan function that can show start and end dates. Similarly, if I wanted to “timemap” the sequence of events leading up to the Salem Witchcraft Trials, I could do a bit more research and map all the relevant information. The beauty of TimeMap is that with just 3 files you can quickly create a functional TimeMap in a day. Happy coding.