
# Ulysses Soup

This repository contains a script that uses BeautifulSoup to scrape the Ulysses Page by Page blog (http://ulyssespages.blogspot.com/) and collect a list of the MP3 files embedded on each page.

The process is not terribly complicated: the script works from the Ulysses Page by Page index, which provides one link for each page of Ulysses. The only difficulty is that some of the links on the index are broken; the broken links (and their corresponding correct links) are listed in the bad_links.txt CSV file.
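For reference, here is a minimal sketch of the approach, assuming bad_links.txt is a two-column CSV (broken URL, correct URL) and that the MP3s appear as ordinary `.mp3` links; the actual implementation is in get_soup.py and may differ in its details:

```python
# Minimal sketch of the scraping approach (illustrative only; the real logic
# lives in get_soup.py). Assumes bad_links.txt is a two-column CSV of
# broken_url,correct_url and that embedded audio appears as .mp3 hrefs.
import csv

import requests
from bs4 import BeautifulSoup

INDEX_URL = "http://ulyssespages.blogspot.com/"

# Map each known-broken index link to its corrected URL.
with open("bad_links.txt") as f:
    fixes = dict(tuple(row) for row in csv.reader(f) if len(row) == 2)

index = BeautifulSoup(requests.get(INDEX_URL).text, "html.parser")

mp3_links = []
for a in index.find_all("a", href=True):
    url = fixes.get(a["href"], a["href"])  # substitute the corrected link if listed
    page = BeautifulSoup(requests.get(url).text, "html.parser")
    # Collect any links to MP3 files embedded on this page of Ulysses.
    mp3_links.extend(
        tag["href"]
        for tag in page.find_all("a", href=True)
        if tag["href"].lower().endswith(".mp3")
    )

with open("OUTPUT_media_links.txt", "w") as out:
    out.write("\n".join(mp3_links) + "\n")
```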

The resulting output files contain one link per line and can be used with the wget utility to download the MP3s from the command line. For example, to download all of the MP3 files into a directory called mp3/, run the following commands from the repository root:

```bash
mkdir mp3/
cd mp3/
wget -i ../OUTPUT_media_links_sorted.txt
```
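The `-i` flag tells wget to read the list of URLs to download, one per line, from the given file.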

![Screenshot of sample output](ss.png)