Scraping data and statistics from charlesreid1.com/wiki
 
 

charlesreid1-wiki-data

This repository contains scripts for analyzing graph and edit data from the charlesreid1 wiki: https://charlesreid1.com/wiki

This data must be prepared ahead of time and loaded into a MongoDB database. For the scripts that actually assemble the data set, see the dotfiles/debian repo, specifically the dotfiles/jupiter_scripts/ directory and the push_wiki.py script.
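
Once loaded, the prepared data can be inspected with pymongo along the lines of the sketch below. The database and collection names (wiki, page_history) are placeholders for illustration, not names taken from this repo.

```python
# Minimal sketch, assuming the data has already been pushed to a local
# MongoDB instance. Database/collection names below are placeholders.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["wiki"]  # hypothetical database name

# See which collections the preparation scripts populated
print(db.list_collection_names())

# Count revision documents in a hypothetical page history collection
print(db["page_history"].count_documents({}))
```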

NOTE (08/23/2018): This repo has lots of cobwebs and needs to be cleaned up.

page history

page_history.py implements a method to scrape the wiki and create a commit for each page revision.

The page history database contains a document for every revision on the wiki. Each revision document contains the following fields (a hypothetical example document appears after the list):

  • SHA-1 hash of the revision's page contents (used as the document id)
  • Page title
  • Page revision timestamp
  • Page revision character count
  • Namespace, tags, categories
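
A hypothetical revision document might look like the sketch below. The field names, database name, and collection name are assumptions made for illustration; only the list of fields above comes from this repo.

```python
# Hypothetical revision document following the fields listed above.
# Names here are assumptions, not the actual schema used by page_history.py.
from pymongo import MongoClient

revision = {
    "_id": "0a4d55a8d778e5022fab701977c5d840bbc486d0",  # sha1 of page contents
    "title": "Linear Algebra",
    "timestamp": "2018-08-23T12:34:56Z",
    "char_count": 14280,
    "namespace": "Main",
    "tags": ["mathematics"],
    "categories": ["Mathematics"],
}

client = MongoClient("localhost", 27017)
client["wiki"]["page_history"].insert_one(revision)
```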

page graph

This implements a graph data structure across the following files (a minimal sketch of the idea appears after the list):

  • graph.py - base Graph object
  • mongo_graph.py - MongoDB-backed Graph object (MongoDB I/O)
  • graph_algorithms.py - implements several algorithms for graphs
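
The sketch below shows the general idea rather than the actual API of these files: a directed adjacency-list graph whose nodes are page titles and whose edges are wiki links, plus one simple traversal of the kind an algorithms module might provide.

```python
# Illustration only -- a minimal adjacency-list graph, not the API of graph.py.
class Graph:
    def __init__(self):
        self.adj = {}  # node -> set of neighboring nodes

    def add_node(self, u):
        self.adj.setdefault(u, set())

    def add_edge(self, u, v):
        """Add a directed edge u -> v (page u links to page v)."""
        self.add_node(u)
        self.add_node(v)
        self.adj[u].add(v)


def reachable_from(graph, start):
    """Depth-first search: all pages reachable from `start` by following links."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(graph.adj.get(u, ()))
    return seen
```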

The page graph is actually constructed in memory as the wiki is scraped, and the database is exported once the process finishes.
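
A rough sketch of that flow, with assumed names throughout (a pages iterable yielding (title, linked_titles) pairs from the scraper, and placeholder database/collection names):

```python
# Sketch of the scrape-then-export flow described above; names are assumptions.
from collections import defaultdict
from pymongo import MongoClient

def build_and_export(pages, db_name="wiki", coll_name="page_graph"):
    """`pages` is assumed to yield (title, linked_titles) pairs from the scraper."""
    # Accumulate the link graph in memory while the wiki is scraped
    adj = defaultdict(set)
    for title, links in pages:
        adj[title].update(links)

    # Export only after the scrape finishes: one document per edge
    edges = [{"source": u, "target": v} for u, vs in adj.items() for v in vs]
    if edges:
        MongoClient("localhost", 27017)[db_name][coll_name].insert_many(edges)
    return adj
```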