Scraping data and statistics from charlesreid1.com/wiki
 
 

charlesreid1-wiki-data

This repository contains scripts for analyzing graph and edit data from the charlesreid1 wiki: https://charlesreid1.com/wiki

This data must be prepared ahead of time and loaded into a MongoDB database. For the scripts that actually assemble the data set, see the dotfiles/debian repo, specifically the dotfiles/jupiter_scripts/ directory and the push_wiki.py script.
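
Once loaded, the prepared data can be inspected with pymongo along the lines of the sketch below. The database and collection names (wiki, page_history) are placeholders for illustration, not names taken from this repo.

```python
# Minimal sketch, assuming the data has already been pushed to a local
# MongoDB instance. Database/collection names below are placeholders.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["wiki"]  # hypothetical database name

# See which collections the preparation scripts populated
print(db.list_collection_names())

# Count revision documents in a hypothetical page history collection
print(db["page_history"].count_documents({}))
```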

NOTE (08/23/2018): This repo has lots of cobwebs and needs to be cleaned up.

page history

page_history.py implements a method to scrape the wiki and create a commit for each page revision.

The page history database contains a document for every revision on the wiki. Each revision document contains the following fields (a hypothetical example document appears after the list):

  • SHA-1 hash of the revision's page contents (used as the document id)
  • Page title
  • Page revision timestamp
  • Page revision character count
  • Namespace, tags, categories
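
A hypothetical revision document might look like the sketch below. The field names, database name, and collection name are assumptions made for illustration; only the list of fields above comes from this repo.

```python
# Hypothetical revision document following the fields listed above.
# Names here are assumptions, not the actual schema used by page_history.py.
from pymongo import MongoClient

revision = {
    "_id": "0a4d55a8d778e5022fab701977c5d840bbc486d0",  # sha1 of page contents
    "title": "Linear Algebra",
    "timestamp": "2018-08-23T12:34:56Z",
    "char_count": 14280,
    "namespace": "Main",
    "tags": ["mathematics"],
    "categories": ["Mathematics"],
}

client = MongoClient("localhost", 27017)
client["wiki"]["page_history"].insert_one(revision)
```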

page graph

This implements a graph data structure across the following files (a minimal sketch of the idea appears after the list):

  • graph.py - base Graph object
  • mongo_graph.py - MongoDB-backed Graph object (MongoDB I/O)
  • graph_algorithms.py - implements several algorithms for graphs
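
The sketch below shows the general idea rather than the actual API of these files: a directed adjacency-list graph whose nodes are page titles and whose edges are wiki links, plus one simple traversal of the kind an algorithms module might provide.

```python
# Illustration only -- a minimal adjacency-list graph, not the API of graph.py.
class Graph:
    def __init__(self):
        self.adj = {}  # node -> set of neighboring nodes

    def add_node(self, u):
        self.adj.setdefault(u, set())

    def add_edge(self, u, v):
        """Add a directed edge u -> v (page u links to page v)."""
        self.add_node(u)
        self.add_node(v)
        self.adj[u].add(v)


def reachable_from(graph, start):
    """Depth-first search: all pages reachable from `start` by following links."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(graph.adj.get(u, ()))
    return seen
```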

The page graph is actually constructed in memory as the wiki is scraped, and the database is exported once the process finishes.
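
A rough sketch of that flow, with assumed names throughout (a pages iterable yielding (title, linked_titles) pairs from the scraper, and placeholder database/collection names):

```python
# Sketch of the scrape-then-export flow described above; names are assumptions.
from collections import defaultdict
from pymongo import MongoClient

def build_and_export(pages, db_name="wiki", coll_name="page_graph"):
    """`pages` is assumed to yield (title, linked_titles) pairs from the scraper."""
    # Accumulate the link graph in memory while the wiki is scraped
    adj = defaultdict(set)
    for title, links in pages:
        adj[title].update(links)

    # Export only after the scrape finishes: one document per edge
    edges = [{"source": u, "target": v} for u, vs in adj.items() for v in vs]
    if edges:
        MongoClient("localhost", 27017)[db_name][coll_name].insert_many(edges)
    return adj
```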