70 Commits
v0.3 ... master

Author SHA1 Message Date
de796880c5 Merge branch 'master' of github.com:charlesreid1/centillion
* 'master' of github.com:charlesreid1/centillion:
  update config_flask.example.py to strip dc info
2018-08-13 19:14:54 -07:00
f79f711a38 Merge branch 'master' of github.com:dcppc/centillion
* 'master' of github.com:dcppc/centillion:
  Update Readme.md
2018-08-13 19:14:07 -07:00
00b862b83e Merge branch 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion
* 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion:
2018-08-13 19:13:53 -07:00
a06c3b645a Update Readme.md 2018-08-13 12:42:18 -07:00
878ff011fb locked out by rate limit, but otherwise successful in indexing so far. 2018-08-13 00:54:12 -07:00
33cf78a524 successfully grabbing threads from 1st page of every subgroup 2018-08-13 00:27:45 -07:00
c1bcd8dc22 add import pdb where things are currently stuck 2018-08-12 20:25:29 -07:00
757e9d79a1 keep going with spider idea 2018-08-12 20:24:29 -07:00
c47682adb4 fix typo with groupsio key 2018-08-12 20:13:45 -07:00
f2662c3849 adding calls to index groupsio emails
this is currently work in progress.
we have a debug statement in place as a bookmark.

we are currently:
- creating a login session
- getting all the subgroups
- going to first subgroup
- getting list of titles and links
- getting emails for each title and link

still need to:
- figure out how to assemble email {}
- assemble content/etc and how to parse text of emails
2018-08-12 18:00:33 -07:00
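Pieced together from the groupsio_util.py changes further down in this compare, the flow described in that commit message looks roughly like the sketch below; the group name ('dcppc') and the login/API URLs come from the code later in this diff, while the credential values are placeholders.

    import requests

    # 1. create a login session (same login URL the crawler uses below)
    session = requests.Session()
    session.post("https://groups.io/login",
                 data={"email": "GROUPSIO_USERNAME",       # placeholder
                       "password": "GROUPSIO_PASSWORD",    # placeholder
                       "timezone": "America/Los_Angeles"})

    # 2. get all the subgroups via the groups.io API
    #    (the API token is passed as the basic-auth username)
    resp = requests.post("https://api.groups.io/v1/getsubgroups",
                         data=[("group_name", "dcppc"), ("limit", 100)],
                         auth=("GROUPSIO_TOKEN", ""))
    subgroups = {g["id"]: g["name"] for g in resp.json()["data"]}

    # 3. for each subgroup, fetch its /topics page, scrape the (title, link)
    #    pairs, then fetch each thread; that part is what the
    #    groupsio_util.py spider further down implements.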
2478a3f857 Merge branch 'dcppc' of github.com:dcppc/centillion into dcppc
* 'dcppc' of github.com:dcppc/centillion:
  fix how search results are bundled
  fix search template
2018-08-10 06:05:44 -07:00
f174080dfd catch exception when file info not found 2018-08-10 06:05:33 -07:00
ca8b12db06 Merge pull request #2 from charlesreid1/dcppc-merge-master
Merge dcppc changes into master
2018-08-10 05:49:29 -07:00
a1ffdad292 Merge branch 'master' into dcppc-merge-master 2018-08-10 05:49:19 -07:00
ce76396096 update config_flask.example.py to strip dc info 2018-08-10 05:46:07 -07:00
175ff4f71d Merge pull request #17 from dcppc/github-files
fix search template
2018-08-09 18:57:30 -07:00
94f956e2d0 fix how search results are bundled 2018-08-09 18:56:56 -07:00
dc015671fc fix search template 2018-08-09 18:55:49 -07:00
1e9eec81d7 make it valid json 2018-08-09 18:15:14 -07:00
31e12476af Merge pull request #16 from dcppc/inception
add inception
2018-08-09 18:08:11 -07:00
bbe4e32f63 Merge pull request #15 from dcppc/github-files
index all github filenames, not just markdown
2018-08-09 18:07:56 -07:00
5013741958 while we're at it 2018-08-09 17:40:56 -07:00
1ce80a5da0 closes #11 2018-08-09 17:38:20 -07:00
3ed967bd8b remove unused function 2018-08-09 17:28:22 -07:00
1eaaa32007 index all github filenames, not just markdown 2018-08-09 17:25:09 -07:00
9c7e696b6a Merge branch 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion
* 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion:
  Move images, resize images, update image markdown in readme
  update readme to use <img> tags
  merge image files in from master
  fix <title>
  fix the readme to reflect current state of things/links/descriptions
  fix typos/wording in readme
  adding changes to enable https, update callback to http, and everything still passes through https (proxy)
  update footer repo info
  update screen shots
  add mkdocs-material-dib submodule
  remove mkdocs material submodule
  update tagline
  update tagline
  add _example_ config file for flask
2018-08-09 16:39:18 -07:00
262a0c19e7 Merge pull request #14 from dcppc/local-fixes
Fix centillion to work for local instances
2018-08-09 16:37:37 -07:00
bd2714cc0b Merge branch 'dcppc' into local-fixes 2018-08-09 16:36:34 -07:00
899d6fed53 comment out localhost only env var 2018-08-09 16:25:37 -07:00
a7756049e5 revert changes 2018-08-09 16:23:42 -07:00
3df427a8f8 fix how existing issues in search index are collected. closes #10 2018-08-09 16:17:17 -07:00
0dd06748de fix centillion to work for local instance 2018-08-09 16:16:30 -07:00
1a04814edf Merge pull request #9 from dcppc/ACharbonneau-patch-1
Update config_centillion.json
2018-08-07 16:09:45 -07:00
Amanda Charbonneau
3fb72d409b Update config_centillion.json
I fixed it
2018-08-07 18:24:32 -04:00
d89e01221a Merge pull request #8 from dcppc/dcppc-test
Fix the name of the milestones repo: 'dcppc-milestones' not 'milestones'
2018-08-07 14:59:06 -07:00
6736f3f8ad add centillion configuration json file 2018-08-07 14:54:56 -07:00
abd13aba29 Merge pull request #7 from dcppc/fix-docstrings
Fix docstrings
2018-08-07 14:43:42 -07:00
13e49cdaa6 improve docstrings on gdrive_util.py too 2018-08-07 14:42:19 -07:00
83b2ce17fb fix docstrings in centillion_search.py 2018-08-07 14:41:26 -07:00
5be0709070 Merge pull request #6 from dcppc/fix-docs
Move images, resize images, update image markdown in readme
2018-08-07 13:02:08 -07:00
9edd95a78d Merge branch 'fix-docs'
* fix-docs:
  Move images, resize images, update image markdown in readme
  update readme to use <img> tags
  merge image files in from master
  fix <title>
  fix the readme to reflect current state of things/links/descriptions
  fix typos/wording in readme
  adding changes to enable https, update callback to http, and everything still passes through https (proxy)
  update footer repo info
  update screen shots
  add mkdocs-material-dib submodule
  remove mkdocs material submodule
  update tagline
  update tagline
  add _example_ config file for flask
2018-08-07 12:50:29 -07:00
37615d8707 Move images, resize images, update image markdown in readme 2018-08-07 12:40:38 -07:00
4b218f63b9 update readme to use <img> tags 2018-08-03 15:56:49 -07:00
4e17c890bc merge image files in from master 2018-08-03 15:53:51 -07:00
1129ec38e0 update the readme 2018-08-03 15:49:46 -07:00
875508c796 update screen shot images 2018-08-03 15:49:12 -07:00
abc7a2aedf fix <title> 2018-08-03 15:45:56 -07:00
8f1e5faefc update readme to reflect latest 2018-08-03 15:38:23 -07:00
d5f63e2322 Merge pull request #1 from dcppc/fix-readme
fix the readme to reflect current state of things/links/descriptions
2018-08-03 15:28:51 -07:00
84e5560423 fix the readme to reflect current state of things/links/descriptions 2018-08-03 15:28:16 -07:00
924c562c0a fix typos/wording in readme 2018-08-03 15:22:35 -07:00
13c410ac5e adding changes to enable https, update callback to http, and everything still passes through https (proxy) 2018-08-03 15:21:41 -07:00
4e79800e83 update footer repo info 2018-08-03 15:19:55 -07:00
5b9570d8cd Merge branch 'dcppc' of github.com:dcppc/centillion into dcppc
* 'dcppc' of github.com:dcppc/centillion:
  add mkdocs-material-dib submodule
  remove mkdocs material submodule
2018-08-03 14:54:25 -07:00
297a4b5977 update screen shots 2018-08-03 14:53:43 -07:00
69a6b5d680 add mkdocs-material-dib submodule 2018-08-03 13:51:13 -07:00
3feca1aba3 remove mkdocs material submodule 2018-08-03 13:50:37 -07:00
493581f861 update tagline 2018-08-03 13:38:00 -07:00
1b0ded809d update tagline 2018-08-03 13:36:56 -07:00
78e77c7cf2 add _example_ config file for flask 2018-08-03 13:34:27 -07:00
2f890d1aee Merge branch 'all-the-docs' of charlesreid1/centillion into master 2018-08-03 20:28:27 +00:00
937327f2cb update search template to treat drive files and documents differently. 2018-08-03 13:24:03 -07:00
ca0d88cfe6 index all the google drive things 2018-08-03 13:15:02 -07:00
5eda472072 improve handling of tokens for gh api, fix set ordering/logic 2018-08-03 13:07:46 -07:00
d943c14678 Merge branch 'master' into all-the-docs
* master:
  Update '.gitignore'
  no secrets plz
2018-08-03 12:37:49 -07:00
6be785a056 indexing all markdown is working. 2018-08-03 12:36:32 -07:00
65113a95f7 Update '.gitignore' 2018-08-03 17:52:04 +00:00
87c3f12c8f no secrets plz 2018-08-03 17:51:39 +00:00
933884e9ab search all the docs. search all the repos. 2018-08-03 10:29:52 -07:00
da9dea3f6b Merge branch 'github-markdown' of charlesreid1/centillion into master 2018-08-03 07:20:45 +00:00
17 changed files with 695 additions and 167 deletions

2
.gitignore vendored

@@ -1,4 +1,4 @@
config_*
config_flask.py
vp
credentials.json
drive*.json

6
.gitmodules vendored

@@ -1,3 +1,3 @@
[submodule "mkdocs-material"]
path = mkdocs-material
url = https://git.charlesreid1.com/charlesreid1/mkdocs-material.git
[submodule "mkdocs-material-dib"]
path = mkdocs-material-dib
url = https://github.com/dib-lab/mkdocs-material-dib.git

README.md

@@ -1,18 +1,19 @@
# The Centillion
# Centillion
**the centillion**: a pan-github-markdown-issues-google-docs search engine.
**centillion**: a pan-github-markdown-issues-google-docs search engine.
**a centillion**: a very large number consisting of a 1 with 303 zeros after it.
the centillion is 3.03 log-times better than the googol.
one centillion is 3.03 log-times better than a googol.
![Screen shot of centillion](img/ss.png)
![Screen shot of centillion](docs/images/ss.png)
## what is it
The centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
a Python library for building search engines.
Centillion (https://github.com/dcppc/centillion) is a search engine that can index
three kinds of collections: Google Documents, Github issues, and Markdown files in
Github repos.
We define the types of documents the centillion should index,
what info and how. The centillion then builds and
@@ -24,6 +25,30 @@ defined in `centillion.py`.
The centillion keeps it simple.
## authentication layer
Centillion lives behind a Github authentication layer, implemented with
[flask-dance](https://github.com/singingwolfboy/flask-dance). When you first
visit the site it will ask you to authenticate with Github so that it can
verify you have permission to access the site.
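As a rough illustration, the flask-dance wiring amounts to something like the following sketch; it mirrors the make_github_blueprint() call that appears in the centillion.py diff further down, with the client ID and secret read from environment variables. The route and secret key here are placeholders.

    import os
    from flask import Flask
    from flask_dance.contrib.github import make_github_blueprint, github

    app = Flask(__name__)
    app.secret_key = os.environ.get("FLASK_SECRET_KEY", "change-me")  # placeholder

    # GitHub OAuth blueprint; the 'read:org' scope lets the app check
    # whether the logged-in user belongs to the right team
    github_bp = make_github_blueprint(
        client_id=os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
        client_secret=os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
        scope='read:org')
    app.register_blueprint(github_bp, url_prefix="/login")

    @app.route("/")
    def index():
        # flask-dance sends unauthenticated users through /login/github
        if not github.authorized:
            return "Please log in with GitHub first."
        return "You are authorized."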
## technologies
Centillion is a Python program built using whoosh (search engine library). It
indexes the full text of docx files in Google Documents, just the filenames for
non-docx files. The full text of issues and their comments are indexed, and
results are grouped by issue. Centillion requires Google Drive and Github OAuth
apps. Once you provide credentials to Flask you're all set to go.
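For a sense of what whoosh is doing under the hood, here is a minimal indexing and search sketch; the field names (id, kind, title, content) and the search_index directory match the centillion_search.py diff below, and the document values are placeholders.

    import os
    from whoosh.fields import Schema, ID, TEXT
    from whoosh.index import create_in
    from whoosh.qparser import QueryParser

    # a cut-down version of the schema used below
    schema = Schema(id=ID(stored=True, unique=True),
                    kind=ID(stored=True),
                    title=TEXT(stored=True),
                    content=TEXT(stored=True))

    os.makedirs("search_index", exist_ok=True)
    ix = create_in("search_index", schema)

    # add one document, then search its full text
    writer = ix.writer()
    writer.add_document(id="abc123", kind="gdoc",
                        title="Example doc", content="full text of the document")
    writer.commit()

    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse("document")
        for hit in searcher.search(query, limit=None):
            print(hit["title"], hit["kind"])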
## control panel
There's also a control panel at <https://search.nihdatacommons.us/control_panel>
that allows you to rebuild the search index from scratch (the Google Drive indexing
takes a while).
![Screen shot of centillion control panel](docs/images/cp.png)
## quickstart (with Github auth)
@@ -31,6 +56,8 @@ Start by creating a Github OAuth application.
Get the public and private application key
(client token and client secret token)
from the Github application's page.
You will also need a Github access token
(in addition to the app tokens).
When you create the application, set the callback
URL to `/login/github/authorized`, as in:
@@ -65,11 +92,3 @@ as HTTP by Github, even though there is an HTTPS address, and
everything else seems fine, try deleting the Github OAuth app
and creating a new one.
## more info
For more info see the documentation: <https://charlesreid1.github.io/centillion>

centillion.py

@@ -27,10 +27,16 @@ You provide:
class UpdateIndexTask(object):
def __init__(self, gh_oauth_token, diff_index=False):
def __init__(self, app_config, diff_index=False):
self.diff_index = diff_index
thread = threading.Thread(target=self.run, args=())
self.gh_oauth_token = gh_oauth_token
self.gh_token = app_config['GITHUB_TOKEN']
self.groupsio_credentials = {
'groupsio_token' : app_config['GROUPSIO_TOKEN'],
'groupsio_username' : app_config['GROUPSIO_USERNAME'],
'groupsio_password' : app_config['GROUPSIO_PASSWORD']
}
thread.daemon = True
thread.start()
@@ -43,9 +49,10 @@ class UpdateIndexTask(object):
from get_centillion_config import get_centillion_config
config = get_centillion_config('config_centillion.json')
search.update_index_markdown(self.gh_oauth_token,config)
search.update_index_issues(self.gh_oauth_token,config)
search.update_index_gdocs(config)
search.update_index_groupsioemails(self.groupsio_credentials,config)
###search.update_index_ghfiles(self.gh_token,config)
###search.update_index_issues(self.gh_token,config)
###search.update_index_gdocs(config)
@@ -55,11 +62,11 @@ app.wsgi_app = ProxyFix(app.wsgi_app)
# Load default config and override config from an environment variable
app.config.from_pyfile("config_flask.py")
github_bp = make_github_blueprint()
#github_bp = make_github_blueprint(
# client_id = os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
# client_secret = os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
# scope='read:org')
#github_bp = make_github_blueprint()
github_bp = make_github_blueprint(
client_id = os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
client_secret = os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
scope='read:org')
app.register_blueprint(github_bp, url_prefix="/login")
@@ -172,11 +179,10 @@ def update_index():
mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
if mresp.status_code==204:
gh_oauth_token = github.token['access_token']
# --------------------
# Business as usual
UpdateIndexTask(gh_oauth_token, diff_index=False)
UpdateIndexTask(app.config,
diff_index=False)
flash("Rebuilding index, check console output")
return render_template("controlpanel.html",
totals={})
@@ -216,5 +222,7 @@ def oops(e):
return contents404
if __name__ == '__main__':
# if running local instance, set to true
os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = 'true'
app.run(host="0.0.0.0",port=5000)

centillion_search.py

@@ -1,10 +1,11 @@
import shutil
import html.parser
from github import Github
from github import Github, GithubException
import base64
from gdrive_util import GDrive
from groupsio_util import GroupsIOArchivesCrawler
from apiclient.http import MediaIoBaseDownload
import mistune
@@ -128,7 +129,6 @@ class Search:
schema = Schema(
id = ID(stored=True, unique=True),
kind = ID(stored=True),
#fingerprint = ID(stored=True),
created_time = ID(stored=True),
modified_time = ID(stored=True),
@@ -252,7 +252,6 @@ class Search:
with open(fullpath_input, 'wb') as f:
f.write(r.content)
# Try to convert docx file to plain text
try:
output = pypandoc.convert_file(fullpath_input,
@@ -267,7 +266,6 @@ class Search:
# If export was successful, read contents of markdown
# into the content variable.
# into the content variable.
if os.path.isfile(fullpath_output):
# Export was successful
with codecs.open(fullpath_output, encoding='utf-8') as f:
@@ -277,12 +275,14 @@ class Search:
# No matter what happens, clean up.
print(" > Cleaning up \"%s\""%item['name'])
subprocess.call(['rm','-fr',fullpath_output])
## test
#print(" ".join(['rm','-fr',fullpath_output]))
subprocess.call(['rm','-fr',fullpath_input])
#print(" ".join(['rm','-fr',fullpath_input]))
# do it
subprocess.call(['rm','-fr',fullpath_output])
subprocess.call(['rm','-fr',fullpath_input])
if update:
print(" > Removing old record")
writer.delete_by_term('id',item['id'])
@@ -316,7 +316,7 @@ class Search:
# to a search index.
def add_issue(self, writer, issue, config, update=True):
def add_issue(self, writer, issue, gh_token, config, update=True):
"""
Add a Github issue/comment to a search index.
"""
@@ -368,44 +368,58 @@ class Search:
def add_markdown(self, writer, d, config, update=True):
def add_ghfile(self, writer, d, gh_token, config, update=True):
"""
Use a Github markdown document API record
to add a markdown document's contents to
the search index.
Use a Github file API record to add a filename
to the search index.
"""
MARKDOWN_EXTS = ['.md','.markdown']
repo = d['repo']
org = d['org']
repo_name = org + "/" + repo
repo_url = "https://github.com/" + repo_name
try:
fpath = d['path']
furl = d['url']
fsha = d['sha']
_, fname = os.path.split(fpath)
_, fext = os.path.splitext(fpath)
except:
print(" > XXXXXXXX Failed to find file info.")
return
print("Indexing markdown doc %s"%(fname))
indexed_time = clean_timestamp(datetime.now())
if fext in MARKDOWN_EXTS:
print("Indexing markdown doc %s from repo %s"%(fname,repo_name))
# Unpack the requests response and decode the content
response = requests.get(furl)
#
# don't forget the headers for private repos!
# useful: https://bit.ly/2LSAflS
headers = {'Authorization' : 'token %s'%(gh_token)}
response = requests.get(furl, headers=headers)
if response.status_code==200:
jresponse = response.json()
content = ""
try:
binary_content = re.sub('\n','',jresponse['content'])
content = base64.b64decode(binary_content).decode('utf-8')
except KeyError:
print(" > XXXXXXXX Failed to extract 'content' field. You probably hit the rate limit.")
return
# Now create the actual search index record
indexed_time = clean_timestamp(datetime.now())
else:
print(" > XXXXXXXX Failed to reach file URL. There may be a problem with authentication/headers.")
return
usable_url = "https://github.com/%s/blob/master/%s"%(repo_name, fpath)
# Add one document per issue thread,
# containing entire text of thread.
# Now create the actual search index record
writer.add_document(
id = fsha,
kind = 'markdown',
@@ -425,12 +439,41 @@ class Search:
content = content
)
else:
print("Indexing github file %s from repo %s"%(fname,repo_name))
key = fname+"_"+fsha
# Now create the actual search index record
writer.add_document(
id = key,
kind = 'ghfile',
created_time = '',
modified_time = '',
indexed_time = indexed_time,
title = fname,
url = repo_url,
mimetype='',
owner_email='',
owner_name='',
repo_name = repo_name,
repo_url = repo_url,
github_user = '',
issue_title = '',
issue_url = '',
content = ''
)
# ------------------------------
# Define how to update search index
# using different kinds of collections
# ------------------------------
# Google Drive Files/Documents
def update_index_gdocs(self,
config):
"""
@@ -478,7 +521,7 @@ class Search:
remote_ids = set()
full_items = {}
while True:
ps = 12
ps = 100
results = drive.list(
pageSize=ps,
pageToken=nextPageToken,
@@ -496,11 +539,11 @@ class Search:
# Also store the doc
full_items[f['id']] = f
# Shorter:
## Shorter:
#break
# Longer:
if nextPageToken is None:
break
## Longer:
#if nextPageToken is None:
# break
writer = self.ix.writer()
@@ -544,13 +587,13 @@ class Search:
print("Done, updated %d documents in the index" % count)
# ------------------------------
# Github Issues/Comments
def update_index_issues(self, gh_oauth_token, config):
def update_index_issues(self, gh_token, config):
"""
Update the search index using a collection of
Github repo issues and comments.
gh_oauth_token can also be an access token.
"""
# Updated algorithm:
# - get set of indexed ids
@@ -562,7 +605,7 @@ class Search:
# ------
indexed_issues = set()
p = QueryParser("kind", schema=self.ix.schema)
q = p.parse("gdoc")
q = p.parse("issue")
with self.ix.searcher() as s:
results = s.search(q,limit=None)
for result in results:
@@ -572,25 +615,29 @@ class Search:
# Get the set of remote ids:
# ------
# Start with api object
g = Github(gh_oauth_token)
g = Github(gh_token)
# Now index all issue threads in the user-specified repos
# Iterate over each repo
list_of_repos = config['repositories']
for r in list_of_repos:
# Start by collecting all the things
remote_issues = set()
full_items = {}
# Iterate over each repo
list_of_repos = config['repositories']
for r in list_of_repos:
if '/' not in r:
err = "Error: specify org/reponame or user/reponame in list of repos"
raise Exception(err)
this_org, this_repo = re.split('/',r)
try:
org = g.get_organization(this_org)
repo = org.get_repo(this_repo)
except:
print("Error: could not gain access to repository %s"%(r))
continue
# Iterate over each issue thread
issues = repo.get_issues()
@@ -622,7 +669,7 @@ class Search:
# cop out
writer.delete_by_term('id',update_issue)
item = full_items[update_issue]
self.add_issue(writer, item, config, update=True)
self.add_issue(writer, item, gh_token, config, update=True)
count += 1
@@ -631,7 +678,7 @@ class Search:
add_issues = remote_issues - indexed_issues
for add_issue in add_issues:
item = full_items[add_issue]
self.add_issue(writer, item, config, update=False)
self.add_issue(writer, item, gh_token, config, update=False)
count += 1
@@ -640,16 +687,15 @@ class Search:
# ------------------------------
# Github Files
def update_index_markdown(self, gh_oauth_token, config):
def update_index_ghfiles(self, gh_token, config):
"""
Update the search index using a collection of
Markdown files from a Github repo.
gh_oauth_token can also be an access token.
files (and, separately, Markdown files) from
a Github repo.
"""
EXT = '.md'
# Updated algorithm:
# - get set of indexed ids
# - get set of remote ids
@@ -660,6 +706,12 @@ class Search:
# ------
indexed_ids = set()
p = QueryParser("kind", schema=self.ix.schema)
q = p.parse("ghfiles")
with self.ix.searcher() as s:
results = s.search(q,limit=None)
for result in results:
indexed_ids.add(result['id'])
q = p.parse("markdown")
with self.ix.searcher() as s:
results = s.search(q,limit=None)
@@ -669,62 +721,66 @@ class Search:
# Get the set of remote ids:
# ------
# Start with api object
g = Github(gh_oauth_token)
g = Github(gh_token)
# Now index all markdown files
# in the user-specified repos
# Iterate over each repo
list_of_repos = config['repositories']
for r in list_of_repos:
# Now index all the files.
# Start by collecting all the things
remote_ids = set()
full_items = {}
# Iterate over each repo
list_of_repos = config['repositories']
for r in list_of_repos:
if '/' not in r:
err = "Error: specify org/reponame or user/reponame in list of repos"
raise Exception(err)
this_org, this_repo = re.split('/',r)
try:
org = g.get_organization(this_org)
repo = org.get_repo(this_repo)
except:
print("Error: could not gain access to repository %s"%(r))
continue
# ---------
# begin markdown-specific code
# Get head commit
commits = repo.get_commits()
try:
last = commits[0]
sha = last.sha
except GithubException:
print("Error: could not get commits from repository %s"%(r))
continue
# Get all the docs
tree = repo.get_git_tree(sha=sha, recursive=True)
docs = tree.raw_data['tree']
print("Parsing file ids from repository %s"%(r))
for d in docs:
# For each doc, get the file extension
# If it matches EXT, download the file
# and decide what to do with it.
fpath = d['path']
_, fname = os.path.split(fpath)
_, fext = os.path.splitext(fpath)
if fext==EXT:
key = d['sha']
d['org'] = this_org
d['repo'] = this_repo
value = d
# Stash the doc for later
remote_ids.add(key)
full_items[key] = value
writer = self.ix.writer()
count = 0
# Drop any id in indexed_ids
# not in remote_ids
drop_ids = indexed_ids - remote_ids
@@ -736,10 +792,10 @@ class Search:
# and in remote_ids
update_ids = indexed_ids & remote_ids
for update_id in update_ids:
# cop out
# cop out: just delete and re-add
writer.delete_by_term('id',update_id)
item = full_items[update_id]
self.add_markdown(writer, item, config, update=True)
self.add_ghfile(writer, item, gh_token, config, update=True)
count += 1
@@ -748,15 +804,42 @@ class Search:
add_ids = remote_ids - indexed_ids
for add_id in add_ids:
item = full_items[add_id]
self.add_markdown(writer, item, config, update=False)
self.add_ghfile(writer, item, gh_token, config, update=False)
count += 1
writer.commit()
print("Done, updated %d markdown documents in the index" % count)
print("Done, updated %d Github files in the index" % count)
# ------------------------------
# Groups.io Emails
def update_index_groupsioemails(self, groupsio_token, config):
"""
Update the search index using the email archives
of groups.io groups.
This requires the use of a spider.
RELEASE THE SPIDER!!!
"""
spider = GroupsIOArchivesCrawler(groupsio_token,'dcppc')
# - ask spider to crawl the archives
spider.crawl_group_archives()
# - ask spider for list of all email records
# - 1 email = 1 dictionary
# - email records compiled by the spider
archives = spider.get_archives()
# - email object is sent off to add email method
print("Finished indexing groups.io emails")
# ---------------------------------
# Search results bundler
@@ -864,31 +947,27 @@ class Search:
def get_document_total_count(self):
p = QueryParser("kind", schema=self.ix.schema)
kind_labels = {
"documents" : "gdoc",
"markdown" : "markdown",
"issues" : "issue",
}
counts = {
"documents" : None,
"gdoc" : None,
"issue" : None,
"ghfile" : None,
"markdown" : None,
"issues" : None,
"total" : None
}
for key in kind_labels:
kind = kind_labels[key]
q = p.parse(kind)
for key in counts.keys():
q = p.parse(key)
with self.ix.searcher() as s:
results = s.search(q,limit=None)
counts[key] = len(results)
## These two should NOT be different, but they are...
#counts['total'] = self.ix.searcher().doc_count_all()
counts['total'] = counts['documents'] + counts['markdown'] + counts['issues']
counts['total'] = sum(counts[k] for k in counts.keys())
return counts
if __name__ == "__main__":
raise Exception("Error: main method not implemented (fix groupsio credentials first)")
search = Search("search_index")
from get_centillion_config import get_centillion_config

config_centillion.json

@@ -1,6 +1,27 @@
{
"repositories" : [
"dcppc/project-management",
"dcppc/nih-demo-meetings",
"dcppc/internal",
"dcppc/organize",
"dcppc/dcppc-bot",
"dcppc/full-stacks",
"dcppc/design-guidelines-discuss",
"dcppc/dcppc-deliverables",
"dcppc/dcppc-milestones",
"dcppc/crosscut-metadata",
"dcppc/lucky-penny",
"dcppc/dcppc-workshops",
"dcppc/metadata-matrix",
"dcppc/data-stewards",
"dcppc/dcppc-phase1-demos",
"dcppc/apis",
"dcppc/2018-june-workshop",
"dcppc/2018-july-workshop"
"dcppc/2018-july-workshop",
"dcppc/2018-august-workshop",
"dcppc/2018-september-workshop",
"dcppc/design-guidelines",
"dcppc/2018-may-workshop",
"dcppc/centillion"
]
}

config_flask.example.py

@@ -2,8 +2,9 @@
INDEX_DIR = "search_index"
# oauth client deets
GITHUB_OAUTH_CLIENT_ID = "63f8d49c651840cbe31e"
GITHUB_OAUTH_CLIENT_SECRET = "36d9a4611f7427336d3c89ed041c45d086b793ee"
GITHUB_OAUTH_CLIENT_ID = "XXX"
GITHUB_OAUTH_CLIENT_SECRET = "YYY"
GITHUB_TOKEN = "ZZZ"
# More information footer: Repository label
FOOTER_REPO_ORG = "charlesreid1"
@@ -12,8 +13,8 @@ FOOTER_REPO_NAME = "centillion"
# Toggle to show Whoosh parsed query
SHOW_PARSED_QUERY=True
TAGLINE = "Search all the things"
TAGLINE = "Search All The Things"
# Flask settings
DEBUG = True
SECRET_KEY = '42c5a8eda356ca9d9c3ab2d149541e6b91d843fa'
SECRET_KEY = 'WWWWW'

BIN
docs/images/cp.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 498 KiB

BIN
docs/images/ss.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 355 KiB

gdrive_util.py

@@ -29,8 +29,7 @@ class GDrive(object):
):
"""
Set up the Google Drive API instance.
Factory method: create it and hand it over.
Then we're finished.
Factory method: create it here, hand it over in get_service().
"""
self.credentials_file = credentials_file
self.client_secret_file = client_secret_file
@@ -40,6 +39,9 @@ class GDrive(object):
self.store = file.Storage(credentials_file)
def get_service(self):
"""
Return an instance of the Google Drive API service.
"""
creds = self.store.get()
if not creds or creds.invalid:

382
groupsio_util.py Normal file

@@ -0,0 +1,382 @@
import requests, os, re
from bs4 import BeautifulSoup
class GroupsIOArchivesCrawler(object):
"""
This is a Groups.io spider
designed to crawl the email
archives of a group.
credentials (dictionary):
groupsio_token : api access token
groupsio_username : username
groupsio_password : password
"""
def __init__(self,
credentials,
group_name):
# template url for archives page (list of topics)
self.url = "https://{group}.groups.io/g/{subgroup}/topics"
self.login_url = "https://groups.io/login"
self.credentials = credentials
self.group_name = group_name
self.crawled_archives = False
self.archives = None
def get_archives(self):
"""
Return a list of dictionaries containing
information about each email topic in the
groups.io email archive.
Call crawl_group_archives() first!
"""
return self.archives
def get_subgroups_list(self):
"""
Use the API to get a list of subgroups.
"""
subgroups_url = 'https://api.groups.io/v1/getsubgroups'
key = self.credentials['groupsio_token']
data = [('group_name', self.group_name),
('limit',100)
]
response = requests.post(subgroups_url,
data=data,
auth=(key,''))
response = response.json()
data = response['data']
subgroups = {}
for group in data:
k = group['id']
v = re.sub(r'dcppc\+','',group['name'])
subgroups[k] = v
return subgroups
def crawl_group_archives(self):
"""
Spider will crawl the email archives of the entire group
by crawling the email archives of each subgroup.
"""
subgroups = self.get_subgroups_list()
# ------------------------------
# Start by logging in.
# Create session object to persist session data
session = requests.Session()
# Log in to the website
data = dict(email = self.credentials['groupsio_username'],
password = self.credentials['groupsio_password'],
timezone = 'America/Los_Angeles')
r = session.post(self.login_url,
data = data)
csrf = self.get_csrf(r)
# ------------------------------
# For each subgroup, crawl the archives
# and return a list of dictionaries
# containing all the email threads.
for subgroup_id in subgroups.keys():
self.crawl_subgroup_archives(session,
csrf,
subgroup_id,
subgroups[subgroup_id])
# Done. archives are now tucked away
# in the variable self.archives
#
# self.archives is a list of dictionaries,
# with each dictionary containing info about
# a topic/email thread in a subgroup.
# ------------------------------
def crawl_subgroup_archives(self, session, csrf, subgroup_id, subgroup_name):
"""
This kicks off the process to crawl the entire
archives of a given subgroup on groups.io.
For a given subgroup the url is self.url,
https://{group}.groups.io/g/{subgroup}/topics
This is the first of a paginated list of topics.
Procedure is:
- passed a starting page (or its contents)
- iterate through all topics via the HTML page elements
- assemble a bundle of information about each topic:
- topic title, by, URL, date, content, permalink
- content filtering:
- ^From, Reply-To, Date, To, Subject
- Lines containing phone numbers
- 9 digits
- XXX-XXX-XXXX, (XXX) XXX-XXXX
- XXXXXXXXXX, XXX XXX XXXX
- ^Work: or (Work) or Work$
- Home, Cell, Mobile
- +1 XXX
- \w@\w
- while next button is not greyed out,
- click the next button
everything stored in self.archives:
list of dictionaries.
"""
self.archives = []
prefix = "https://{group}.groups.io".format(group=self.group_name)
url = self.url.format(group=self.group_name,
subgroup=subgroup_name)
# ------------------------------
# Now get the first page
r = session.get(url)
# ------------------------------
# Fencepost algorithm:
# First page:
# Extract a list of (title, link) items
items = self.extract_archive_page_items_(r)
# Get the next link
next_url = self.get_next_url_(r)
# Now add each item to the archive of threads,
# then find the next button.
self.add_items_to_archives_(session,subgroup_name,items)
if next_url is None:
return
else:
full_next_url = prefix + next_url
# Now click the next button
next_request = requests.get(full_next_url)
while next_request.status_code==200:
items = self.extract_archive_page_items_(next_request)
next_url = self.get_next_url_(next_request)
self.add_items_to_archives_(session,subgroup_name,items)
if next_url is None:
return
else:
full_next_url = prefix + next_url
next_request = requests.get(full_next_url)
def add_items_to_archives_(self,session,subgroup_name,items):
"""
Given a set of items from a list of threads,
items being title and link,
get the page and store all info
in self.archives variable
(list of dictionaries)
"""
for (title, link) in items:
# Get the thread page:
prefix = "https://{group}.groups.io".format(group=self.group_name)
full_link = prefix + link
r = session.get(full_link)
soup = BeautifulSoup(r.text,'html.parser')
# soup contains the entire thread
# What are we extracting:
# 1. thread number
# 2. permalink
# 3. content/text (filtered)
# - - - - - - - - - - - - - -
# 1. topic/thread number:
# <a rel="nofollow" href="">
# where link is:
# https://{group}.groups.io/g/{subgroup}/topic/{topic_id}
# example topic id: 24209140
#
# ugly links are in the form
# https://dcppc.groups.io/g/{subgroup}/topic/some_text_here/{thread_id}?p=,,,,,1,2,3,,,4,,5
# split at ?, 0th portion
# then split at /, last (-1th) portion
topic_id = link.split('?')[0].split('/')[-1]
# - - - - - - - - - - - - - - -
# 2. permalink:
# - current link is ugly link
# - permalink is the nice one
# - topic id is available from the ugly link
# https://{group}.groups.io/g/{subgroup}/topic/{topic_id}
permalink_template = "https://{group}.groups.io/g/{subgroup}/topic/{topic_id}"
permalink = permalink_template.format(
group = self.group_name,
subgroup = subgroup_name,
topic_id = topic_id
)
# - - - - - - - - - - - - - - -
# 3. content:
# Need to rearrange how we're assembling threads here.
# This is one thread, no?
content = []
subject = soup.find('title').text
# Extract information for the schema:
# - permalink for thread (done)
# - subject/title (done)
# - original sender email/name (done)
# - content (done)
# Groups.io pages have zero CSS classes, which makes everything
# a giant pain in the neck to interact with. Thanks Groups.io!
original_sender = ''
for i, tr in enumerate(soup.find_all('tr',{'class':'test'})):
# Every other tr row contains an email.
if (i+1)%2==0:
# nope, no email here
pass
else:
# found an email!
# this is a maze, thanks groups.io
td = tr.find('td')
divrow = td.find('div',{'class':'row'}).find('div',{'class':'pull-left'})
if (i+1)==1:
original_sender = divrow.text.strip()
for div in td.find_all('div'):
if div.has_attr('id'):
# purge any signatures
for x in div.find_all('div',{'id':'Signature'}):
x.extract()
# purge any headers
for x in div.find_all('div'):
nonos = ['From:','Sent:','To:','Cc:','CC:','Subject:']
for nono in nonos:
if nono in x.text:
x.extract()
message_text = div.get_text()
# More filtering:
# phone numbers
message_text = re.sub(r'[0-9]{3}-[0-9]{3}-[0-9]{4}','XXX-XXX-XXXX',message_text)
message_text = re.sub(r'[0-9]\{10\}','XXXXXXXXXX',message_text)
content.append(message_text)
full_content = "\n".join(content)
thread = {
'permalink' : permalink,
'subject' : subject,
'original_sender' : original_sender,
'content' : full_content
}
print('*'*40)
for k in thread.keys():
if k=='content':
pass
else:
print("%s : %s"%(k,thread[k]))
print('*'*40)
self.archives.append(thread)
def extract_archive_page_items_(self, response):
"""
(Private method)
Given a response from a GET request,
use beautifulsoup to extract all items
(thread titles and ugly thread links)
and pass them back in a list.
"""
soup = BeautifulSoup(response.content,"html.parser")
rows = soup.find_all('tr',{'class':'test'})
if 'rate limited' in soup.text:
raise Exception("Error: rate limit in place for Groups.io")
results = []
for row in rows:
# We don't care about anything except title and ugly link
subject = row.find('span',{'class':'subject'})
title = subject.get_text()
link = row.find('a')['href']
print(title)
results.append((title,link))
return results
def get_next_url_(self, response):
"""
(Private method)
Given a response (which is a list of threads),
find the next button and return the URL.
If there is no next URL, or the next button is disabled, return None.
"""
soup = BeautifulSoup(response.text,'html.parser')
chevron = soup.find('i',{'class':'fa-chevron-right'})
try:
if '#' in chevron.parent['href']:
# empty link, abort
return None
except AttributeError:
# I don't even know
return None
if chevron.parent.parent.has_attr('class') and 'disabled' in chevron.parent.parent['class']:
# no next link, abort
return None
return chevron.parent['href']
def get_csrf(self,resp):
"""
Find the CSRF token embedded in the subgroup page
"""
soup = BeautifulSoup(resp.text,'html.parser')
csrf = ''
for i in soup.find_all('input'):
# Note that i.name is different from i['name']
# the first is the actual tag,
# the second is the attribute name="xyz"
if i['name']=='csrf':
csrf = i['value']
if csrf=='':
err = "ERROR: Could not find csrf token on page."
raise Exception(err)
return csrf

Binary file not shown.

Before

Width:  |  Height:  |  Size: 356 KiB

Submodule mkdocs-material deleted from 6569122bb1

1
mkdocs-material-dib Submodule

Submodule mkdocs-material-dib added at c3dd912f3c

requirements.txt

@@ -10,3 +10,4 @@ pypandoc>=1.4
requests>=2.19
pandoc>=1.0
flask-dance>=1.0.0
beautifulsoup4>=4.6

View File

@@ -1,5 +1,5 @@
<!doctype html>
<title>Markdown Search</title>
<title>Centillion Search Engine</title>
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='style.css') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='github-markdown.css') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='bootstrap.min.css') }}">

View File

@@ -86,9 +86,11 @@
<div class="container-fluid">
<div class="row">
<div class="col-xs-12 info">
<b>Indexing:</b> <span class="badge">{{totals["documents"]}}</span> Google Documents,
<span class="badge">{{totals["issues"]}}</span> Github issues,
<span class="badge">{{totals["markdown"]}}</span> markdown files.
<b>Indexing:</b> <span
class="badge">{{totals["gdoc"]}}</span> Google Documents,
<span class="badge">{{totals["issue"]}}</span> Github issues,
<span class="badge">{{totals["ghfile"]}}</span> Github files,
<span class="badge">{{totals["markdown"]}}</span> Github markdown files.
</div>
</div>
</div>
@@ -107,12 +109,19 @@
<div class="url">
{% if e.kind=="gdoc" %}
<b>Google Drive File:</b>
{% if e.mimetype=="" %}
<b>Google Document:</b>
<a href='{{e.url}}'>{{e.title}}</a>
(Owner: {{e.owner_name}}, {{e.owner_email}})<br />
<b>Document Type</b>: {{e.mimetype}}
{% else %}
<b>Google Drive:</b>
<a href='{{e.url}}'>{{e.title}}</a>
(Owner: {{e.owner_name}}, {{e.owner_email}})
{% endif %}
{% elif e.kind=="issue" %}
<b>Issue:</b>
<b>Github Issue:</b>
<a href='{{e.url}}'>{{e.title}}</a>
{% if e.github_user %}
opened by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
@@ -121,7 +130,7 @@
<b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
{% elif e.kind=="markdown" %}
<b>Markdown:</b>
<b>Github Markdown:</b>
<a href='{{e.url}}'>{{e.title}}</a>
<br/>
<b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
@@ -131,9 +140,15 @@
{% endif %}
<br />
score: {{'%d' % e.score}}
Score: {{'%d' % e.score}}
</div>
<div class="markdown-body">
{% if e.content_highlight %}
{{ e.content_highlight|safe}}
{% else %}
<p>(A preview of this document is not available.)</p>
{% endif %}
</div>
<div class="markdown-body">{{ e.content_highlight|safe}}</div>
</li>
{% endfor %}