Computer programs can become long, unwieldy and confusing without special mechanisms for managing complexity. This lesson will show you how to reuse parts of your code by writing Functions and break your programs into Modules, in order to keep everything concise and easier to debug. Being able to remove a single dysfunctional module can save time and effort.
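As a minimal sketch of the idea, the function below wraps one small task so that it can be reused instead of retyped. The name `make_greeting` and the file name `greetings.py` are hypothetical, chosen only for illustration.

```python
# A hypothetical helper that could live in its own file, e.g. greetings.py.
def make_greeting(name):
    """Return a greeting for the given name."""
    return "Hello, {}!".format(name)


# Reuse the same function rather than repeating the string-building code.
for person in ["World", "Everyone"]:
    print(make_greeting(person))

# If make_greeting were saved in greetings.py, another program in the same
# folder could reuse it as a module:
#     import greetings
#     print(greetings.make_greeting("World"))
```

Keeping the function in a separate module means that a fault can be traced to, and fixed in, one place.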
Your list is now clean enough that you can begin analyzing its contents in meaningful ways. Counting the frequency of specific words in the list can provide illustrative data. Python has an easy way to count frequencies, but it requires the use of a new type of variable: the dictionary. Before you begin working with a dictionary, consider the processes used to calculate frequencies in a list.
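One way to picture the process is sketched below: walk through the list once and keep a running tally for each word in a dictionary. The function name `count_frequencies` and the sample sentence are assumptions made for this example.

```python
def count_frequencies(words):
    """Return a dictionary mapping each word to the number of times it occurs."""
    counts = {}
    for word in words:
        # get() returns 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts


# A made-up word list standing in for the cleaned list from earlier lessons.
word_list = "it was the best of times it was the worst of times".split()
frequencies = count_frequencies(word_list)
print(frequencies)
```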
This lesson uses Python to create and view an HTML file. If you write programs that output HTML, you can use any browser to look at your results. This is especially convenient if your program is automatically creating hyperlinks or graphic entities like charts and diagrams.
Here you will learn how to create HTML files with Python scripts, and how to use Python to automatically open an HTML file in Firefox.
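The sketch below shows one possible shape for such a script. It uses the standard `webbrowser` module, which opens the system's default browser rather than Firefox specifically; the function names and the file name `hello.html` are illustrative assumptions.

```python
import webbrowser
from pathlib import Path


def build_page(body_text):
    """Return a very small HTML document containing one paragraph."""
    return "<html>\n<body>\n<p>{}</p>\n</body>\n</html>\n".format(body_text)


def save_and_open(path, body_text):
    """Write the page to disk and ask the default browser to display it."""
    target = Path(path).resolve()
    target.write_text(build_page(body_text), encoding="utf-8")
    # as_uri() turns the absolute path into a file:// address.
    webbrowser.open(target.as_uri())


page = build_page("Hello World")
print(page)

# To create the file and view it, you could call:
#     save_and_open("hello.html", "Hello World")
```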
In this two-part lesson, we will build on what you’ve learned about Working with Webpages, learning how to remove the HTML markup from the webpage of Benjamin Bowsey’s 1780 criminal trial transcript. We will achieve this by using a variety of string operators, string methods and close reading skills. We introduce looping and branching so that programs can repeat tasks and test for certain conditions, making it possible to separate the content from the HTML tags. Finally, we convert content from a long string to a list of words that can later be sorted, indexed, and counted.
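The looping and branching described above can be sketched as follows. This is a simplified illustration, not a full HTML parser: it assumes that every `<` opens a tag and every `>` closes one, and the sample string is invented for the example.

```python
def strip_tags(page_contents):
    """Return the text of page_contents with anything between < and > removed."""
    inside_tag = False
    text = ""
    for char in page_contents:        # loop over every character
        if char == "<":               # branch: a tag is starting
            inside_tag = True
        elif char == ">":             # branch: a tag has ended
            inside_tag = False
        elif not inside_tag:          # branch: ordinary content, keep it
            text += char
    return text


sample = "<p>The trial of <b>Benjamin Bowsey</b></p>"
words = strip_tags(sample).split()    # long string -> list of words
print(words)
```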
In this lesson, you will learn the Python commands needed to implement the second part of the algorithm begun in From HTML to a List of Words (part 1). The first half of the algorithm gets the content of an HTML page and saves only the content that follows the tags.
As in Output Data as HTML File, this lesson takes the frequency pairs collected in Counting Frequencies and outputs them in HTML. This time the focus is on keywords in context (KWIC), which are built from n-grams of the original document content, in this case a trial transcript from the Old Bailey Online. You can use your program to select a keyword and the computer will output all instances of that keyword, along with the words to the left and right of it, making it easy to see at a glance how the keyword is used.
Once the KWICs have been created, they are then wrapped in HTML and sent to the browser where they can be viewed. This reinforces what was learned in Output Data as HTML File, opting for a slightly different output.
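A rough sketch of this step is given below, assuming the text is already a list of words. The function names, the `window` parameter, and the sample sentence are all hypothetical; the layout (a bold keyword inside a `<pre>` block) is only one of many possible displays.

```python
from html import escape


def keyword_in_context(words, keyword, window=2):
    """Return (left, keyword, right) tuples for every occurrence of keyword."""
    rows = []
    for i, word in enumerate(words):
        if word == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            rows.append((left, word, right))
    return rows


def kwic_to_html(rows):
    """Wrap the rows in HTML, right-aligning the left context so keywords line up."""
    width = max((len(left) for left, _, _ in rows), default=0)
    lines = [
        "{:>{width}} <b>{}</b> {}".format(
            escape(left), escape(kw), escape(right), width=width
        )
        for left, kw, right in rows
    ]
    return "<html>\n<body>\n<pre>\n{}\n</pre>\n</body>\n</html>\n".format(
        "\n".join(lines)
    )


sample_words = "the cat sat on the mat and the dog sat too".split()
rows = keyword_in_context(sample_words, "sat")
print(kwic_to_html(rows))
```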
At the end of this lesson, you will be able to extract all possible n-grams from the text. In the next lesson, you will learn how to output all of the n-grams of a given keyword in a document downloaded from the Internet, and display them clearly in your browser window.
The list that we created in From HTML to a List of Words (2) needs some normalizing before it can be used further. We are going to do this by applying additional string methods, as well as by using regular expressions. Once normalized, we will be able to more easily analyze our data.
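To make the idea concrete, here is a minimal sketch that slides a window of `n` words across a list. The function name `get_ngrams` and the sample sentence are assumptions for this example.

```python
def get_ngrams(words, n):
    """Return every run of n consecutive words as a list of lists."""
    # The last window starts at index len(words) - n.
    return [words[i:i + n] for i in range(len(words) - n + 1)]


sample_words = "it was the best of times".split()
for ngram in get_ngrams(sample_words, 3):
    print(ngram)
```

If the list has fewer than `n` words, the function returns an empty list.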
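As an illustration of what normalizing can involve, the sketch below lowercases the text with a string method and then uses a regular expression to keep only runs of word characters. The function name `normalize` and the pattern `\w+` are choices made for this example; note that `\w` also matches digits and underscores.

```python
import re


def normalize(text):
    """Lowercase the text and return its words with punctuation stripped."""
    lowered = text.lower()                 # string method
    return re.findall(r"\w+", lowered)     # regular expression


print(normalize("The Trial, of Benjamin BOWSEY!"))
```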
This lesson takes the frequency pairs created in Counting Frequencies and outputs them to an HTML file.
Here you will learn how to output data as an HTML file using Python. You will also learn about string formatting. The final result is an HTML file that shows the keywords found in the original source in order of descending frequency, along with the number of times that each keyword appears.
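A small sketch of that final step follows, assuming the frequencies are held in a dictionary. The function name, the sample counts, and the choice of an ordered list for the layout are illustrative assumptions.

```python
def frequencies_to_html(frequencies):
    """Return an HTML page listing words by descending frequency."""
    pairs = sorted(frequencies.items(), key=lambda pair: pair[1], reverse=True)
    # String formatting fills each {} placeholder with a word and its count.
    items = "".join("<li>{}: {}</li>\n".format(word, count) for word, count in pairs)
    return "<html>\n<body>\n<ol>\n{}</ol>\n</body>\n</html>\n".format(items)


# Made-up counts standing in for the output of the frequency-counting step.
sample_counts = {"of": 2, "trial": 5, "the": 9}
html_page = frequencies_to_html(sample_counts)
print(html_page)
```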
This lesson builds on Keywords in Context (Using N-grams), where n-grams were extracted from a text. Here, you will learn how to output all of the n-grams of a given keyword in a document downloaded from the Internet, and display them clearly in your browser window.
This first lesson in our section on dealing with Online Sources is designed to get you and your computer set up to start programming. We will focus on installing the relevant software – all free and reputable – and finally we will help you to get your feet wet with some simple programming that provides immediate results.
In this opening module you will install the Python programming language, the Beautiful Soup HTML/XML parser, and a text editor. Screencaps provided here come from Komodo Edit, but you can use any text editor capable of working with Python. Here’s a list of other options: Python Editors. Once everything is installed, you will write your first programs, “Hello World” in Python and HTML.
When you are working with online sources, much of the time you will be using files that have been marked up with HTML (Hyper Text Markup Language). Your browser already knows how to interpret HTML, which is handy for human readers. Most browsers also let you see the HTML source code for any page that you visit. The two images below show a typical web page (from the Old Bailey Online) and the HTML source used to generate that page, which you can see with the Tools -> Web Developer -> Page Source command in Firefox.
This lesson introduces Uniform Resource Locators (URLs) and explains how to use Python to download and save the contents of a web page to your local hard drive.
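The sketch below uses the standard `urllib` package to split a URL into its parts and to save a page to disk. The address shown is a placeholder on `example.org`, not a real source, and the function names are hypothetical.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def describe_url(url):
    """Return the scheme, host, path and query string of a URL."""
    parts = urlparse(url)
    return parts.scheme, parts.netloc, parts.path, parts.query


def save_page(url, filename):
    """Download the page at url and write its raw bytes to filename."""
    with urlopen(url) as response:
        contents = response.read()
    with open(filename, "wb") as f:
        f.write(contents)


# Placeholder address used only to show the parts of a URL.
example_url = "https://www.example.org/trials/view.html?id=42"
print(describe_url(example_url))

# With a working Internet connection you could then call, for example:
#     save_page(example_url, "trial.html")
```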
In this lesson you will learn how to manipulate text files using Python. This includes opening, closing, reading from, and writing to .txt files.
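These operations might look something like the sketch below. The helper names and the file name `helloworld.txt` are assumptions; the example writes into a temporary folder so that it leaves nothing behind, and the `with` statement closes each file automatically.

```python
import os
import tempfile


def write_text(path, text):
    """Create (or overwrite) a file and write text to it."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def append_text(path, text):
    """Add text to the end of an existing file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)


def read_text(path):
    """Return the whole contents of a file as one string."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "helloworld.txt")
    write_text(path, "hello world\n")
    append_text(path, "goodbye world\n")
    contents = read_text(path)

print(contents)
```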
The next few lessons will involve downloading a web page from the Internet and reorganizing the contents into useful chunks of information. You will be doing most of your work using Python code written and executed in Komodo Edit.