34 Commits
v0.2 ... v0.3

Author SHA1 Message Date
4d6386e74a add results-handling for markdown files 2018-08-03 00:19:57 -07:00
a93b7519de improve counts accounting, and construct usable urls for markdown 2018-08-03 00:19:35 -07:00
5e2c37164b fix markdown indexing 2018-08-02 23:56:56 -07:00
829e9c4263 finish subsuming repotree into centillion_search 2018-08-02 23:14:55 -07:00
283991017c add repotree script. temporary/standalone, but doing exactly what centillion needs to do. 2018-08-02 22:29:18 -07:00
653af18f24 add update_index_markdown() function, rough/unfinished 2018-08-02 22:27:30 -07:00
fae184f1f3 re-indexer now calls (nonexistent file) update_index_markdown 2018-08-02 22:26:56 -07:00
d40bb3557f Merge branch 'flask-dance' of charlesreid1/centillion into master 2018-08-03 04:09:20 +00:00
a848f3ec3e complete the conversion to oauth tokens 2018-08-02 19:06:34 -07:00
50d27a915a update readme 2018-08-02 19:04:40 -07:00
1b950b7790 update re-index task to use gh token; reorganize logic; use werkzeug proxy 2018-08-02 19:02:00 -07:00
04d4195668 Add flask-dance to centillion.
- Remove config file, which now contains secrets
- Add flask dance to requirements
- Update instructions in readme to include Github application setup
2018-08-02 11:52:56 -07:00
d0fe7aa799 ignore config files, which may have keys in them 2018-08-02 11:24:33 -07:00
acc28aab44 Merge branch 'cache-and-hash' of charlesreid1/centillion into master 2018-08-02 17:59:45 +00:00
adc2666a9b actually fix flashed messages 2018-08-02 00:58:37 -07:00
581f0a67ed fix messages so they are js and dismissable 2018-08-02 00:54:56 -07:00
0b96061bc5 update documentation, add new docs pages on components/flask/whoosh 2018-08-01 23:04:35 -07:00
c7acdea889 finally. make results comprehensible. 2018-08-01 22:39:07 -07:00
4eabd4536e remove last searches from search.html 2018-08-01 22:32:20 -07:00
78276c14d9 align badges higher 2018-08-01 22:31:59 -07:00
68f90d383f fix up how issues are added, and how all issues are iterated over (use set algebra) 2018-08-01 22:31:41 -07:00
202643b85e add control_panel route, remove last_search silliness 2018-08-01 22:29:06 -07:00
dc9ac74d68 add control panel page 2018-08-01 20:12:55 -07:00
36cc94a854 Fix bootstrap div classes, badgify counts, fix <li> styles 2018-08-01 20:12:10 -07:00
740e757bcd update todo with what we have done 2018-08-01 15:54:03 -07:00
bf6afe39c6 caching is working 2018-08-01 15:48:43 -07:00
54c09ce80b call add drive file function with add/update docIDs. fix method headers. 2018-08-01 15:17:07 -07:00
1407178f39 updating flask config and templates to parameterize repo info in footer 2018-08-01 13:43:43 -07:00
2bf9abfd6f update footer: prior searches are now badges, and link to more info now points to repo 2018-08-01 13:36:45 -07:00
8328f96f76 make "prior searches" a badge and infobox bg color 2018-08-01 13:36:05 -07:00
d5a9fe85af Merge branch 'master' into cache-and-hash
* master:
  update installation preparation step
2018-08-01 12:50:10 -07:00
f8d2156d85 update installation preparation step 2018-08-01 12:48:09 -07:00
a753ba4963 update centillion search with comment blocks laying out what to change and where 2018-08-01 11:32:37 -07:00
8cca4b2c8d add TAGLINE param 2018-08-01 00:49:56 -07:00
19 changed files with 1019 additions and 305 deletions

2
.gitignore vendored
View File

@@ -1,8 +1,8 @@
config_*
vp
credentials.json
drive*.json
*.pyc
config.py
out/
search_index/
venv/

View File

@@ -8,6 +8,7 @@ the centillion is 3.03 log-times better than the googol.
![Screen shot of centillion](img/ss.png)
## what is it
The centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
@@ -24,17 +25,46 @@ defined in `centillion.py`.
The centillion keeps it simple.
## quickstart
## quickstart (with Github auth)
Run the centillion app with a github access token API key set via
environment variable:
Start by creating a Github OAuth application.
Get the public and private application keys
(client ID and client secret)
from the Github application's page.
When you create the application, set the callback
URL to `/login/github/authorized`, as in:
```
GITHUB_TOKEN="XXXXXXXX" python centillion.py
https://<url>/login/github/authorized
```
Edit the Flask configuration `config_flask.py`
and set the public and private application keys.
Now run centillion:
```
python centillion.py
```
or if you used http instead of https:
```
OAUTHLIB_INSECURE_TRANSPORT="true" python centillion.py
```
This will start a Flask server, and you can view the minimal search engine
interface in your browser at <http://localhost:5000>.
interface in your browser at `http://<ip>:5000`.
## troubleshooting
If you are having problems with your callback URL being treated
as HTTP by Github, even though there is an HTTPS address, and
everything else seems fine, try deleting the Github OAuth app
and creating a new one.
## more info

48
Todo.md
View File

@@ -1,7 +1,47 @@
# todo
current problems:
- some github issues have no title
- github issues are just being re-indexed over and over
- documents not showing up in results
Main task:
- hashing and caching
- <s>first, working out the logic of how we group items into sets
- needs to be deleted
- needs to be updated
- needs to be added
- for docs, issues, and comments</s>
- second, when we add or update an item, need to:
- go through the motions, download file, extract text
- check for existing indexed doc with that id
- check if existing indexed doc has same hash
- if so, skip
- otherwise, delete and re-index
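The hash check described in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `fingerprint`, `index_action`, and the `indexed_fingerprints` dictionary (a stand-in for looking up the stored fingerprint in the search index) are all hypothetical names, and SHA-1 is an assumed choice of hash.

```python
import hashlib


def fingerprint(text):
    """Return a SHA-1 hex digest of the extracted document text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def index_action(doc_id, text, indexed_fingerprints):
    """
    Decide what to do with a freshly downloaded document.

    indexed_fingerprints is a hypothetical dict mapping
    document id -> fingerprint stored at the last indexing run.
    Returns "add", "skip", or "reindex".
    """
    new_fp = fingerprint(text)
    old_fp = indexed_fingerprints.get(doc_id)
    if old_fp is None:
        # No existing indexed doc with that id
        return "add"
    if old_fp == new_fp:
        # Same hash: nothing changed, skip
        return "skip"
    # Hash differs: delete the old record and re-index
    return "reindex"
```

Note that under this scheme the file still has to be downloaded and its text extracted before the comparison can be made, which matches the "go through the motions" step in the list.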
Other bugs:
- Some github issues have no title (?)
- <s>Need to combine issues with comments</s>
- Not able to index markdown files _in a repo_
- (Longer term) update main index vs update diff index
Needs:
- <s>control panel</s>
Thursday product:
- Everything re-indexed nightly
- Search engine built on all documents in Google Drive, all issues, markdown files
- Using pandoc to extract Google Drive document contents
- BRIEF quickstart documentation
Future:
- Future plans to improve - plugins, improving matching
- Subdomain plans
- Folksonomy tagging and integration plans
config options for plugins
conditional blocks with import github inside
complicated tho - better to have components split off

View File

@@ -2,8 +2,11 @@ import threading
from subprocess import call
import codecs
import os
import os, json
from werkzeug.contrib.fixers import ProxyFix
from flask import Flask, request, redirect, url_for, render_template, flash
from flask_dance.contrib.github import make_github_blueprint, github
# create our application
from centillion_search import Search
@@ -22,10 +25,12 @@ You provide:
- Google Drive API key via file
"""
class UpdateIndexTask(object):
def __init__(self, diff_index=False):
def __init__(self, gh_oauth_token, diff_index=False):
self.diff_index = diff_index
thread = threading.Thread(target=self.run, args=())
self.gh_oauth_token = gh_oauth_token
thread.daemon = True
thread.start()
@@ -38,30 +43,90 @@ class UpdateIndexTask(object):
from get_centillion_config import get_centillion_config
config = get_centillion_config('config_centillion.json')
gh_token = os.environ['GITHUB_TOKEN']
search.update_index_issues(gh_token, config)
search.update_index_markdown(self.gh_oauth_token,config)
search.update_index_issues(self.gh_oauth_token,config)
search.update_index_gdocs(config)
app = Flask(__name__)
app.wsgi_app = ProxyFix(app.wsgi_app)
# Load default config and override config from an environment variable
app.config.from_pyfile("config_flask.py")
last_searches_file = app.config["INDEX_DIR"] + "/last_searches.txt"
github_bp = make_github_blueprint()
#github_bp = make_github_blueprint(
# client_id = os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
# client_secret = os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
# scope='read:org')
app.register_blueprint(github_bp, url_prefix="/login")
contents404 = "<html><body><h1>Status: Error 404 Page Not Found</h1></body></html>"
contents403 = "<html><body><h1>Status: Error 403 Access Denied</h1></body></html>"
contents200 = "<html><body><h1>Status: OK 200</h1></body></html>"
##############################
# Flask routes
@app.route('/')
def index():
if not github.authorized:
return redirect(url_for("github.login"))
else:
username = github.get("/user").json()['login']
resp = github.get("/user/orgs")
if resp.ok:
# If they are in team copper, redirect to search.
# Otherwise, hit em with a 403
all_orgs = resp.json()
for org in all_orgs:
if org['login']=='dcppc':
copper_team_id = '2700235'
mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
if mresp.status_code==204:
# --------------------
# Business as usual
return redirect(url_for("search", query="", fields=""))
return contents403
return contents404
### @app.route('/')
### def index():
### return redirect(url_for("search", query="", fields=""))
@app.route('/search')
def search():
if not github.authorized:
return redirect(url_for("github.login"))
username = github.get("/user").json()['login']
resp = github.get("/user/orgs")
if resp.ok:
all_orgs = resp.json()
for org in all_orgs:
if org['login']=='dcppc':
copper_team_id = '2700235'
mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
if mresp.status_code==204:
# --------------------
# Business as usual
query = request.args['query']
fields = request.args.get('fields')
if fields == 'None':
@@ -74,7 +139,6 @@ def search():
else:
parsed_query, result = search.search(query.split(), fields=[fields])
store_search(query, fields)
totals = search.get_document_total_count()
@@ -83,46 +147,74 @@ def search():
query=query,
parsed_query=parsed_query,
fields=fields,
last_searches=get_last_searches(),
totals=totals)
return contents403
@app.route('/update_index')
def update_index():
rebuild = request.args.get('rebuild')
UpdateIndexTask(diff_index=False)
if not github.authorized:
return redirect(url_for("github.login"))
username = github.get("/user").json()['login']
resp = github.get("/user/orgs")
if resp.ok:
all_orgs = resp.json()
for org in all_orgs:
if org['login']=='dcppc':
copper_team_id = '2700235'
mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
if mresp.status_code==204:
gh_oauth_token = github.token['access_token']
# --------------------
# Business as usual
UpdateIndexTask(gh_oauth_token, diff_index=False)
flash("Rebuilding index, check console output")
return render_template("search.html",
query="",
fields="",
last_searches=get_last_searches(),
return render_template("controlpanel.html",
totals={})
return contents403
##############
# Utility methods
def get_last_searches():
if os.path.exists(last_searches_file):
with codecs.open(last_searches_file, 'r', encoding='utf-8') as f:
contents = f.readlines()
else:
contents = []
return contents
def store_search(query, fields):
if os.path.exists(last_searches_file):
with codecs.open(last_searches_file, 'r', encoding='utf-8') as f:
contents = f.readlines()
else:
contents = []
@app.route('/control_panel')
def control_panel():
search = "query=%s&fields=%s\n" % (query, fields)
if not search in contents:
contents.insert(0, search)
if not github.authorized:
return redirect(url_for("github.login"))
with codecs.open(last_searches_file, 'w', encoding='utf-8') as f:
f.writelines(contents[:30])
username = github.get("/user").json()['login']
resp = github.get("/user/orgs")
if resp.ok:
all_orgs = resp.json()
for org in all_orgs:
if org['login']=='dcppc':
copper_team_id = '2700235'
mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
if mresp.status_code==204:
return render_template("controlpanel.html",
totals={})
return contents403
@app.errorhandler(404)
def oops(e):
return contents404
if __name__ == '__main__':
app.run()
app.run(host="0.0.0.0",port=5000)
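The login and team-membership check is repeated in each route above. One possible way to factor it out is a decorator; the sketch below is plain Python with no Flask dependency, and every name in it (`require_team_member`, the four callables, the `state` dictionary) is hypothetical. In the real app the callables would presumably wrap `github.authorized`, the `/teams/<id>/members/<username>` request, the redirect to `github.login`, and `contents403`.

```python
from functools import wraps


def require_team_member(is_logged_in, is_team_member, on_login, on_denied):
    """
    Build a decorator that guards a view function.
    All four arguments are zero-argument callables (assumed interface).
    """
    def decorator(view):
        @wraps(view)
        def wrapper(*args, **kwargs):
            if not is_logged_in():
                # Not authenticated: send the user to the login flow
                return on_login()
            if not is_team_member():
                # Authenticated but not on the team: deny access
                return on_denied()
            # Business as usual
            return view(*args, **kwargs)
        return wrapper
    return decorator


# Stand-in state for illustration only
state = {"logged_in": True, "member": False}


@require_team_member(
    is_logged_in=lambda: state["logged_in"],
    is_team_member=lambda: state["member"],
    on_login=lambda: "redirect-to-login",
    on_denied=lambda: "403",
)
def control_panel():
    return "control panel"
```

With a decorator like this, each route body would contain only its own logic, and a change to the access rule would be made in one place.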

5
centillion_prepare.py Normal file
View File

@@ -0,0 +1,5 @@
from gdrive_util import GDrive
gd = GDrive()
service = gd.get_service()

View File

@@ -2,6 +2,7 @@ import shutil
import html.parser
from github import Github
import base64
from gdrive_util import GDrive
from apiclient.http import MediaIoBaseDownload
@@ -42,6 +43,7 @@ Search object functions:
Schema:
- id
- kind
- fingerprint
- created_time
- modified_time
- indexed_time
@@ -95,6 +97,11 @@ class Search:
def __init__(self, index_folder):
self.open_index(index_folder)
# ------------------------------
# Create a schema and open a search index
# on disk.
def open_index(self, index_folder, create_new=False):
"""
Create a schema,
@@ -115,13 +122,13 @@ class Search:
# ------------------------------
# IMPORTANT:
# This is where the search index's document schema
# is defined.
schema = Schema(
id = ID(stored=True, unique=True),
kind = ID(stored=True),
#fingerprint = ID(stored=True),
created_time = ID(stored=True),
modified_time = ID(stored=True),
@@ -160,16 +167,13 @@ class Search:
# Define how to add documents
def add_drive_file(self, writer, item, indexed_ids, temp_dir, config):
def add_drive_file(self, writer, item, temp_dir, config, update=False):
"""
Add a Google Drive document/file to a search index.
If it is a document, extract the contents.
"""
gd = GDrive()
service = gd.get_service()
# ------------------------
# Two kinds of documents:
# There are two kinds of documents:
# - documents with text that can be extracted (docx)
# - everything else
@@ -179,88 +183,13 @@ class Search:
}
content = ""
if(mimetype not in mimemap.keys()):
# Not a document -
# Just a file
print("Indexing document \"%s\" of type %s"%(item['name'], mimetype))
else:
# Document with text
# Perform content extraction
if mimetype not in mimemap.keys():
# -----------
# docx Content Extraction:
#
# We can only do this with .docx files
# This is a file type we know how to convert
# Construct the URL and download it
# Not a document - just a file
print("Indexing Google Drive file \"%s\" of type %s"%(item['name'], mimetype))
writer.delete_by_term('id',item['id'])
print("Extracting content from \"%s\" of type %s"%(item['name'], mimetype))
# Create a URL and a destination filename
file_ext = mimemap[mimetype]
file_url = "https://docs.google.com/document/d/%s/export?format=%s"%(item['id'], file_ext)
# This re could probably be improved
name = re.sub('/','_',item['name'])
# Now make the pandoc input/output filenames
out_ext = 'txt'
pandoc_fmt = 'plain'
if name.endswith(file_ext):
infile_name = name
outfile_name = re.sub(file_ext,out_ext,infile_name)
else:
infile_name = name+'.'+file_ext
outfile_name = name+'.'+out_ext
# assemble input/output file paths
fullpath_input = os.path.join(temp_dir,infile_name)
fullpath_output = os.path.join(temp_dir,outfile_name)
# Use requests.get to download url to file
r = requests.get(file_url, allow_redirects=True)
with open(fullpath_input, 'wb') as f:
f.write(r.content)
# Try to convert docx file to plain text
try:
output = pypandoc.convert_file(fullpath_input,
pandoc_fmt,
format='docx',
outputfile=fullpath_output
)
assert output == ""
except RuntimeError:
print("XXXXXX Failed to index document \"%s\""%(item['name']))
# If export was successful, read the converted plain text
# into the content variable.
if os.path.isfile(fullpath_output):
# Export was successful
with codecs.open(fullpath_output, encoding='utf-8') as f:
content = f.read()
# No matter what happens, clean up.
print("Cleaning up \"%s\""%item['name'])
subprocess.call(['rm','-fr',fullpath_output])
#print(" ".join(['rm','-fr',fullpath_output]))
subprocess.call(['rm','-fr',fullpath_input])
#print(" ".join(['rm','-fr',fullpath_input]))
# ------------------------------
# IMPORTANT:
# This is where the search documents are actually created.
mimetype = re.split('[/\.]', item['mimeType'])[-1]
# Index a plain google drive file
writer.add_document(
id = item['id'],
kind = 'gdoc',
@@ -281,23 +210,143 @@ class Search:
)
def add_issue(self, writer, issue, repo, config):
else:
# Document with text
# Perform content extraction
# -----------
# docx Content Extraction:
#
# We can only do this with .docx files
# This is a file type we know how to convert
# Construct the URL and download it
print("Indexing Google Drive document \"%s\" of type %s"%(item['name'], mimetype))
print(" > Extracting content")
# Create a URL and a destination filename
file_ext = mimemap[mimetype]
file_url = "https://docs.google.com/document/d/%s/export?format=%s"%(item['id'], file_ext)
# This re could probably be improved
name = re.sub('/','_',item['name'])
# Now make the pandoc input/output filenames
out_ext = 'txt'
pandoc_fmt = 'plain'
if name.endswith(file_ext):
infile_name = name
outfile_name = re.sub(file_ext,out_ext,infile_name)
else:
infile_name = name+'.'+file_ext
outfile_name = name+'.'+out_ext
# Assemble input/output file paths
fullpath_input = os.path.join(temp_dir,infile_name)
fullpath_output = os.path.join(temp_dir,outfile_name)
# Use requests.get to download url to file
r = requests.get(file_url, allow_redirects=True)
with open(fullpath_input, 'wb') as f:
f.write(r.content)
# Try to convert docx file to plain text
try:
output = pypandoc.convert_file(fullpath_input,
pandoc_fmt,
format='docx',
outputfile=fullpath_output
)
assert output == ""
except RuntimeError:
print(" > XXXXXX Failed to index document \"%s\""%(item['name']))
# If export was successful, read the converted plain text
# into the content variable.
if os.path.isfile(fullpath_output):
# Export was successful
with codecs.open(fullpath_output, encoding='utf-8') as f:
content = f.read()
# No matter what happens, clean up.
print(" > Cleaning up \"%s\""%item['name'])
subprocess.call(['rm','-fr',fullpath_output])
#print(" ".join(['rm','-fr',fullpath_output]))
subprocess.call(['rm','-fr',fullpath_input])
#print(" ".join(['rm','-fr',fullpath_input]))
if update:
print(" > Removing old record")
writer.delete_by_term('id',item['id'])
else:
print(" > Creating a new record")
writer.add_document(
id = item['id'],
kind = 'gdoc',
created_time = item['createdTime'],
modified_time = item['modifiedTime'],
indexed_time = datetime.now().replace(microsecond=0).isoformat(),
title = item['name'],
url = item['webViewLink'],
mimetype = mimetype,
owner_email = item['owners'][0]['emailAddress'],
owner_name = item['owners'][0]['displayName'],
repo_name='',
repo_url='',
github_user='',
issue_title='',
issue_url='',
content = content
)
# ------------------------------
# Add a single github issue and its comments
# to a search index.
def add_issue(self, writer, issue, config, update=True):
"""
Add a Github issue/comment to a search index.
"""
repo = issue.repository
repo_name = repo.owner.login+"/"+repo.name
repo_url = repo.html_url
count = 0
# Handle the issue content
print("Indexing issue %s"%(issue.html_url))
# Combine comments with their respective issues.
# Otherwise just too noisy.
issue_comment_content = issue.body.rstrip()
issue_comment_content += "\n"
# Handle the comments content
if(issue.comments>0):
comments = issue.get_comments()
for comment in comments:
issue_comment_content += comment.body.rstrip()
issue_comment_content += "\n"
# Now create the actual search index record
created_time = clean_timestamp(issue.created_at)
modified_time = clean_timestamp(issue.updated_at)
indexed_time = clean_timestamp(datetime.now())
# Add one document per issue thread,
# containing entire text of thread.
writer.add_document(
id = issue.html_url,
kind = 'issue',
@@ -314,46 +363,68 @@ class Search:
github_user = issue.user.login,
issue_title = issue.title,
issue_url = issue.html_url,
content = issue.body.rstrip()
content = issue_comment_content
)
count += 1
# Handle the comments content
if(issue.comments>0):
def add_markdown(self, writer, d, config, update=True):
"""
Use a Github markdown document API record
to add a markdown document's contents to
the search index.
"""
repo = d['repo']
org = d['org']
repo_name = org + "/" + repo
repo_url = "https://github.com/" + repo_name
comments = issue.get_comments()
for comment in comments:
fpath = d['path']
furl = d['url']
fsha = d['sha']
_, fname = os.path.split(fpath)
_, fext = os.path.splitext(fpath)
print(" > Indexing comment %s"%(comment.html_url))
print("Indexing markdown doc %s"%(fname))
created_time = clean_timestamp(comment.created_at)
modified_time = clean_timestamp(comment.updated_at)
# Unpack the requests response and decode the content
response = requests.get(furl)
jresponse = response.json()
content = ""
try:
binary_content = re.sub('\n','',jresponse['content'])
content = base64.b64decode(binary_content).decode('utf-8')
except KeyError:
print(" > XXXXXXXX Failed to extract 'content' field. You probably hit the rate limit.")
return
# Now create the actual search index record
indexed_time = clean_timestamp(datetime.now())
usable_url = "https://github.com/%s/blob/master/%s"%(repo_name, fpath)
# Add one document per issue thread,
# containing entire text of thread.
writer.add_document(
id = comment.html_url,
kind = 'comment',
created_time = created_time,
modified_time = modified_time,
id = fsha,
kind = 'markdown',
created_time = '',
modified_time = '',
indexed_time = indexed_time,
title = "Comment on "+issue.title,
url = comment.html_url,
title = fname,
url = usable_url,
mimetype='',
owner_email='',
owner_name='',
repo_name = repo_name,
repo_url = repo_url,
github_user = comment.user.login,
issue_title = issue.title,
issue_url = issue.html_url,
content = comment.body.rstrip()
github_user = '',
issue_title = '',
issue_url = '',
content = content
)
count += 1
return count
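The decoding step in `add_markdown()` expects a JSON payload with a base64-encoded `content` field and treats a missing field as a probable rate-limit error. A minimal standalone sketch of that step follows; `decode_contents_payload` and the `sample` payload are hypothetical and exist only for illustration, and no network request is made.

```python
import base64


def decode_contents_payload(payload):
    """
    Decode the base64 'content' field of an already-parsed JSON payload.
    Returns the decoded text, or None if the field is missing
    (for example, when the API returned an error message instead).
    """
    encoded = payload.get("content")
    if encoded is None:
        return None
    # The encoded text may be wrapped across lines; strip newlines first
    return base64.b64decode(encoded.replace("\n", "")).decode("utf-8")


# Build a sample payload locally instead of calling the API
sample = {
    "content": base64.b64encode("# title\n".encode("utf-8")).decode("ascii")
}
```

Returning `None` rather than raising lets the caller decide whether to skip the file or abort the whole indexing run.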
# ------------------------------
@@ -365,86 +436,107 @@ class Search:
"""
Update the search index using a collection of
Google Drive documents and files.
Uses the 'id' field to uniquely identify documents.
Also see:
https://developers.google.com/drive/api/v3/reference/files
"""
gd = GDrive()
service = gd.get_service()
# -----
# Get the set of all documents on Google Drive:
# Updated algorithm:
# - get set of indexed ids
# - get set of remote ids
# - drop indexed ids not in remote ids
# - index all remote ids
# - add hash check in add_
# ------------------------------
# IMPORTANT:
# This determines what information about the Google Drive files
# you'll get back, and that's all you're going to have to work with.
# If you need more information, modify the statement below.
# Also see:
# https://developers.google.com/drive/api/v3/reference/files
# Get the set of indexed ids:
# ------
indexed_ids = set()
p = QueryParser("kind", schema=self.ix.schema)
q = p.parse("gdoc")
with self.ix.searcher() as s:
results = s.search(q,limit=None)
for result in results:
indexed_ids.add(result['id'])
# Get the set of remote ids:
# ------
# Start with google drive api object
gd = GDrive()
service = gd.get_service()
drive = service.files()
# We should do more here
# to check if we should update
# or not...
#
# loop over existing documents in index:
#
# p = QueryParser("kind", schema=self.ix.schema)
# q = p.parse("gdoc")
# with self.ix.searcher() as s:
# results = s.search(q,limit=None)
# counts[key] = len(results)
# Now index all the docs in the google drive folder
# The trick is to set next page token to None 1st time thru (fencepost)
nextPageToken = None
# Use the pager to return all the things
items = []
remote_ids = set()
full_items = {}
while True:
ps = 12
results = drive.list(
pageSize=ps,
pageToken=nextPageToken,
fields="nextPageToken, files(id, kind, createdTime, modifiedTime, mimeType, name, owners, webViewLink)",
fields = "nextPageToken, files(id, kind, createdTime, modifiedTime, mimeType, name, owners, webViewLink)",
spaces="drive"
).execute()
nextPageToken = results.get("nextPageToken")
items += results.get("files", [])
files = results.get("files",[])
for f in files:
# Keep it short
# Add all remote docs to a set
remote_ids.add(f['id'])
# Also store the doc
full_items[f['id']] = f
# Shorter:
break
## Longer:
#if nextPageToken is None:
# break
# Here is where we update.
# Grab indexed ids
# Grab remote ids
# Drop indexed ids not in remote ids
# Index all remote ids
# Change add_ to update_
# Add a hash check in update_
indexed_ids = set()
for item in items:
indexed_ids.add(item['id'])
writer = self.ix.writer()
count = 0
temp_dir = tempfile.mkdtemp(dir=os.getcwd())
print("Temporary directory: %s"%(temp_dir))
if not os.path.exists(temp_dir):
os.mkdir(temp_dir)
count = 0
for item in items:
self.add_drive_file(writer, item, indexed_ids, temp_dir, config)
# Drop any id in indexed_ids
# not in remote_ids
drop_ids = indexed_ids - remote_ids
for drop_id in drop_ids:
writer.delete_by_term('id',drop_id)
# Update any id in indexed_ids
# and in remote_ids
update_ids = indexed_ids & remote_ids
for update_id in update_ids:
# cop out
writer.delete_by_term('id',update_id)
item = full_items[update_id]
self.add_drive_file(writer, item, temp_dir, config, update=True)
count += 1
# Add any id not in indexed_ids
# and in remote_ids
add_ids = remote_ids - indexed_ids
for add_id in add_ids:
item = full_items[add_id]
self.add_drive_file(writer, item, temp_dir, config, update=False)
count += 1
print("Cleaning temporary directory: %s"%(temp_dir))
subprocess.call(['rm','-fr',temp_dir])
@@ -453,69 +545,218 @@ class Search:
def update_index_issues(self,
gh_access_token,
config):
def update_index_issues(self, gh_oauth_token, config):
"""
Update the search index using a collection of
Github repo issues and comments.
gh_oauth_token can also be an access token.
"""
# Strategy:
# To get the proof of concept up and running,
# we are just deleting and re-indexing every issue/comment.
# Updated algorithm:
# - get set of indexed ids
# - get set of remote ids
# - drop indexed ids not in remote ids
# - index all remote ids
g = Github(gh_access_token)
# Get the set of indexed ids:
# ------
indexed_issues = set()
p = QueryParser("kind", schema=self.ix.schema)
q = p.parse("issue")
with self.ix.searcher() as s:
results = s.search(q,limit=None)
for result in results:
indexed_issues.add(result['id'])
# Set of all URLs as existing on github
to_index = set()
writer = self.ix.writer()
# Get the set of remote ids:
# ------
# Start with api object
g = Github(gh_oauth_token)
# Now index all issue threads in the user-specified repos
# Iterate over each repo
list_of_repos = config['repositories']
for r in list_of_repos:
# Start by collecting all the things
remote_issues = set()
full_items = {}
if '/' not in r:
err = "Error: specify org/reponame or user/reponame in list of repos"
raise Exception(err)
this_org, this_repo = re.split('/',r)
org = g.get_organization(this_org)
repo = org.get_repo(this_repo)
count = 0
# Iterate over each thread
# Iterate over each issue thread
issues = repo.get_issues()
for issue in issues:
# This approach is more work than is needed
# but PoC||GTFO
# For each issue/comment URL,
# remove the corresponding item
# and re-add it to the index
# grab the key and store the
# corresponding issue object
key = issue.html_url
value = issue
to_index.add(issue.html_url)
writer.delete_by_term('url', issue.html_url)
count -= 1
comments = issue.get_comments()
remote_issues.add(key)
full_items[key] = value
for comment in comments:
to_index.add(comment.html_url)
writer.delete_by_term('url', comment.html_url)
writer = self.ix.writer()
count = 0
# Now re-add this issue to the index
# (this will also add the comments)
count += self.add_issue(writer, issue, repo, config)
# Drop any issues in indexed_issues
# not in remote_issues
drop_issues = indexed_issues - remote_issues
for drop_issue in drop_issues:
writer.delete_by_term('id',drop_issue)
# Update any issue in indexed_issues
# and in remote_issues
update_issues = indexed_issues & remote_issues
for update_issue in update_issues:
# cop out
writer.delete_by_term('id',update_issue)
item = full_items[update_issue]
self.add_issue(writer, item, config, update=True)
count += 1
# Add any issue not in indexed_issues
# and in remote_issues
add_issues = remote_issues - indexed_issues
for add_issue in add_issues:
item = full_items[add_issue]
self.add_issue(writer, item, config, update=False)
count += 1
writer.commit()
print("Done, updated %d documents in the index" % count)
def update_index_markdown(self, gh_oauth_token, config):
"""
Update the search index using a collection of
Markdown files from a Github repo.
gh_oauth_token can also be an access token.
"""
EXT = '.md'
# Updated algorithm:
# - get set of indexed ids
# - get set of remote ids
# - drop indexed ids not in remote ids
# - index all remote ids
# Get the set of indexed ids:
# ------
indexed_ids = set()
p = QueryParser("kind", schema=self.ix.schema)
q = p.parse("markdown")
with self.ix.searcher() as s:
results = s.search(q,limit=None)
for result in results:
indexed_ids.add(result['id'])
# Get the set of remote ids:
# ------
# Start with api object
g = Github(gh_oauth_token)
# Now index all markdown files
# in the user-specified repos
# Iterate over each repo
list_of_repos = config['repositories']
for r in list_of_repos:
# Start by collecting all the things
remote_ids = set()
full_items = {}
if '/' not in r:
err = "Error: specify org/reponame or user/reponame in list of repos"
raise Exception(err)
this_org, this_repo = re.split('/',r)
org = g.get_organization(this_org)
repo = org.get_repo(this_repo)
# ---------
# begin markdown-specific code
# Get head commit
commits = repo.get_commits()
last = commits[0]
sha = last.sha
# Get all the docs
tree = repo.get_git_tree(sha=sha, recursive=True)
docs = tree.raw_data['tree']
for d in docs:
# For each doc, get the file extension
# If it matches EXT, download the file
fpath = d['path']
_, fname = os.path.split(fpath)
_, fext = os.path.splitext(fpath)
if fext==EXT:
key = d['sha']
d['org'] = this_org
d['repo'] = this_repo
value = d
# Stash the doc for later
remote_ids.add(key)
full_items[key] = value
writer = self.ix.writer()
count = 0
# Drop any id in indexed_ids
# not in remote_ids
drop_ids = indexed_ids - remote_ids
for drop_id in drop_ids:
writer.delete_by_term('id',drop_id)
# Update any id in indexed_ids
# and in remote_ids
update_ids = indexed_ids & remote_ids
for update_id in update_ids:
# cop out
writer.delete_by_term('id',update_id)
item = full_items[update_id]
self.add_markdown(writer, item, config, update=True)
count += 1
# Add any id not in indexed_ids
# and in remote_ids
add_ids = remote_ids - indexed_ids
for add_id in add_ids:
item = full_items[add_id]
self.add_markdown(writer, item, config, update=False)
count += 1
writer.commit()
print("Done, updated %d markdown documents in the index" % count)
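All three `update_index_*` functions follow the same set-algebra pattern: compare the ids already in the index with the ids found remotely, then drop, update, or add. The pattern in isolation looks roughly like this; `plan_index_update` is a hypothetical helper, not a function in `centillion_search.py`.

```python
def plan_index_update(indexed_ids, remote_ids):
    """
    Split ids into the three groups used by the update functions.
    Both arguments are iterables of hashable ids.
    """
    indexed_ids = set(indexed_ids)
    remote_ids = set(remote_ids)
    return {
        # In the index but gone from the remote source: delete
        "drop": indexed_ids - remote_ids,
        # Present in both: delete and re-add
        "update": indexed_ids & remote_ids,
        # Remote only: add as new records
        "add": remote_ids - indexed_ids,
    }


plan = plan_index_update({"a", "b", "c"}, {"b", "c", "d"})
```

The three groups are disjoint, so each id is handled exactly once per run.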
# ---------------------------------
# Search results bundler
@@ -580,21 +821,18 @@ class Search:
highlights = self.html_parser.unescape(highlights)
html = self.markdown(highlights)
html = re.sub(r'\n','<br />',html)
sr.content_highlight = html
search_results.append(sr)
return search_results
# ------------------
# github issues
# create search results
def search(self, query_list, fields=None):
with self.ix.searcher() as searcher:
query_string = " ".join(query_list)
query = None
@@ -628,13 +866,13 @@ class Search:
kind_labels = {
"documents" : "gdoc",
"markdown" : "markdown",
"issues" : "issue",
"comments" : "comment"
}
counts = {
"documents" : None,
"markdown" : None,
"issues" : None,
"comments" : None,
"total" : None
}
for key in kind_labels:
@@ -644,7 +882,9 @@ class Search:
results = s.search(q,limit=None)
counts[key] = len(results)
counts['total'] = self.ix.searcher().doc_count_all()
## These two should NOT be different, but they are...
#counts['total'] = self.ix.searcher().doc_count_all()
counts['total'] = counts['documents'] + counts['markdown'] + counts['issues']
return counts

View File

@@ -1,9 +1,19 @@
# Location of index file
INDEX_DIR = "search_index"
# oauth client deets
GITHUB_OAUTH_CLIENT_ID = "63f8d49c651840cbe31e"
GITHUB_OAUTH_CLIENT_SECRET = "36d9a4611f7427336d3c89ed041c45d086b793ee"
# More information footer: Repository label
FOOTER_REPO_ORG = "charlesreid1"
FOOTER_REPO_NAME = "centillion"
# Toggle to show Whoosh parsed query
SHOW_PARSED_QUERY=True
TAGLINE = "Search all the things"
# Flask settings
DEBUG = True
SECRET_KEY = '42c5a8eda356ca9d9c3ab2d149541e6b91d843fa'
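The commented-out `make_github_blueprint(...)` call in `centillion.py` reads the OAuth client ID and secret from environment variables rather than from a file. A minimal sketch of that approach follows, assuming the same variable names; `load_oauth_settings` is a hypothetical helper and is not part of the repository.

```python
import os


def load_oauth_settings(environ=None):
    """
    Read the Github OAuth client credentials from a mapping
    (the process environment by default) and fail loudly if
    either one is missing.
    """
    if environ is None:
        environ = os.environ
    names = ("GITHUB_OAUTH_CLIENT_ID", "GITHUB_OAUTH_CLIENT_SECRET")
    settings = {name: environ.get(name) for name in names}
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return settings
```

Keeping the credentials out of the config file avoids committing them by accident, which is the concern behind the `.gitignore` entries for `config_*` and `config.py`.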

View File

@@ -0,0 +1,22 @@
# Centillion Components
Centillion keeps it simple.
There are two components:
* The `Search` object, which uses whoosh and various
APIs (Github, Google Drive) to build and manage
the search index. The `Search` object also runs all
queries against the search index. (See the
[Centillion Whoosh](centillion_whoosh.md) page
or the `centillion_search.py` file
for details.)
* Flask app, which uses Jinja templates to present the
user with a minimal web frontend that allows them
to interact with the search engine. (See the
[Centillion Flask](centillion_flask.md) page
or the `centillion.py` file
for details.)

30
docs/centillion_flask.md Normal file

@@ -0,0 +1,30 @@
# Centillion Flask
## What the flask server does
Flask is a web server framework
that allows developers to define
behavior for specific endpoints,
such as `/hello_world`, or
<http://localhost:5000/hello_world>
on a web server running locally.
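
As a minimal sketch of the endpoint idea described above (not taken from `centillion.py`; the app object, the function name, and the response string are illustrative assumptions), a Flask app defining `/hello_world` might look like this:

```python
from flask import Flask

# Hypothetical standalone app, used only to illustrate routing;
# centillion's real app lives in centillion.py.
app = Flask(__name__)


@app.route("/hello_world")
def hello_world():
    # Flask calls this function for GET requests to /hello_world
    # and sends the returned string back as the response body.
    return "Hello, world!"
```

Calling `app.run()` on this object would serve the route at <http://localhost:5000/hello_world> by default.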
## Flask server routes
- `/home`
    - if not logged in, this redirects to a "log into github" landing page (not implemented yet)
    - if logged in, this redirects to the search route
- `/search`
    - search template
- `/main_index_update`
    - updates the main index, re-indexing all documents
- `/control_panel`
    - this is the control panel, where you can trigger
      the search index to be re-made

34
docs/centillion_whoosh.md Normal file

@@ -0,0 +1,34 @@
# Centillion Whoosh
The `centillion_search.py` file defines a
`Search` class that serves as the backend
for centillion.
## What the Search class does
The `Search` class has two roles:
- create (and update) the search index
    - this also requires the `Search` class
      to define the schema for storing documents
- run queries against the search index,
  and package results up for Flask and Jinja
## Search class functions
The `Search` class defines several functions:
- `open_index()` creates the schema
- `add_issue()`, `add_md()`, `add_document()` have three different method signatures and add different types
  of documents to the search index
- `update_all_issues()` or `update_all_md()` or `update_all_documents()` iterates over items
  and determines whether each item needs to be updated in the search index
- `update_main_index()` - update the entire search index
    - calls all three update_all methods
- `create_search_results()` - package things up for jinja
- `search()` - run the query, pass results to the jinja-packager
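
The update methods have to decide, per item, whether it should be added, refreshed, or dropped. A minimal sketch of that decision using set algebra, assuming each item has a unique id; the function name `plan_index_update`, the sample ids, and the delete step are hypothetical and not taken from `centillion_search.py`:

```python
def plan_index_update(indexed_ids, remote_ids):
    """Split ids into those to add, update, and delete.

    indexed_ids: ids already in the search index
    remote_ids:  ids currently available from the remote source
    """
    indexed_ids = set(indexed_ids)
    remote_ids = set(remote_ids)
    return {
        # in the remote source but not yet indexed
        "add": remote_ids - indexed_ids,
        # present on both sides; candidates for re-indexing
        "update": remote_ids & indexed_ids,
        # indexed but gone from the remote source (assumed cleanup step)
        "delete": indexed_ids - remote_ids,
    }


plan = plan_index_update(
    indexed_ids={"issue-1", "issue-2", "issue-3"},
    remote_ids={"issue-2", "issue-3", "issue-4"},
)
for action in ("add", "update", "delete"):
    print(action, sorted(plan[action]))
```

Keeping the three groups disjoint means each item is handled by exactly one code path during a re-index.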


@@ -1,30 +1,31 @@
# The Centillion
# Centillion
**the centillion**: a pan-github-markdown-issues-google-docs search engine.
**centillion**: a pan-github-markdown-issues-google-docs search engine.
**a centillion**: a very large number consisting of a 1 with 303 zeros after it.
the centillion is 3.03 log-times better than the googol.
centillion is 3.03 log-times better than the googol.
## what is it
## What is centillion
The centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
Centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
a Python library for building search engines.
We define the types of documents the centillion should index,
what info and how. The centillion then builds and
We define the types of documents centillion should index,
what info and how. Centillion then builds and
updates a search index. That's all done in `centillion_search.py`.
The centillion also provides a simple web frontend for running
Centillion also provides a simple web frontend for running
queries against the search index. That's done using a Flask server
defined in `centillion.py`.
The centillion keeps it simple.
Centillion keeps it simple.
## quickstart
## Quickstart
Run the centillion app with a github access token API key set via
Run centillion with a github access token API key set via
environment variable:
```
@@ -34,21 +35,50 @@ GITHUB_TOKEN="XXXXXXXX" python centillion.py
This will start a Flask server, and you can view the minimal search engine
interface in your browser at <http://localhost:5000>.
## Configuration
## work that is done
### Centillion configuration
See [standalone.md](standalone.md) for the summary of
the three standalone whoosh servers that were built:
one for a folder of markdown files, one for github issues
and comments, and one for google drive documents.
`config_centillion.json` defines configuration variables
for centillion - namely, what to index, and how, and where.
## work that is being done
### Flask configuration
See [workinprogress.md](workinprogress.md) for details about
work in progress.
`config_flask.py` defines configuration variables
used by flask, which controls the web frontend
for centillion.
## work that is planned
## Control Panel/Rebuilding Search Index
See [plans.md](plans.md)
To rebuild the search engine, visit the control panel route (`/control_panel`),
for example at <http://localhost:5000/control_panel>.
This allows you to rebuild the search engine index. The search index
is stored in the `search_index/` directory, and that directory
can be configured with centillion's configuration file.
The diff search index is faster to build, as it only
indexes documents that have been added since the last
new document was added to the search index.
The main search index is slower to build, as it will
re-index everything.
(Cron scripts? Threaded task that runs hourly?)
## Details
More on the details of how centillion works.
Under the hood, centillion uses flask and whoosh.
Flask builds and runs the web server.
Whoosh handles search requests and management
of the search index.
[Centillion Components](centillion_components.md)
[Centillion Flask](centillion_flask.md)
[Centillion Whoosh](centillion_whoosh.md)

19
install_pandoc.sh Executable file

@@ -0,0 +1,19 @@
#!/bin/bash
#
# for ubuntu
if [ "$(id -u)" != "0" ]; then
echo ""
echo ""
echo "This script should be run as root."
echo ""
echo ""
exit 1;
fi
OFILE="/tmp/pandoc.deb"
curl -L https://github.com/jgm/pandoc/releases/download/2.2.2.1/pandoc-2.2.2.1-1-amd64.deb -o ${OFILE}
dpkg -i ${OFILE}
rm -f ${OFILE}


@@ -9,3 +9,4 @@ PyGithub>=1.39
pypandoc>=1.4
requests>=2.19
pandoc>=1.0
flask-dance>=1.0.0

7
static/bootstrap.min.js vendored Normal file

File diff suppressed because one or more lines are too long

2
static/jquery.min.js vendored Normal file

File diff suppressed because one or more lines are too long


@@ -1,17 +1,38 @@
span.badge {
vertical-align: text-bottom;
}
li.search-group-item {
position: relative;
display: block;
padding: 0px;
margin-bottom: -1px;
background-color: #fff;
border: 1px solid #ddd;
a.badgelinks, a.badgelinks:hover {
color: #fff;
text-decoration: none;
}
div.list-group {
border: 1px solid rgba(86,61,124,.2);
}
li.list-group-item {
position: relative;
display: block;
/*padding: 20px 10px;*/
margin-bottom: -1px;
background-color: #f8f8f8;
border: 1px solid #ddd;
}
li.search-group-item {
position: relative;
display: block;
padding: 0px;
margin-bottom: -1px;
background-color: #fff;
border: 1px solid #ddd;
}
div.url {
background-color: rgba(86,61,124,.15);
padding: 8px;

108
templates/controlpanel.html Executable file

@@ -0,0 +1,108 @@
{% extends "layout.html" %}
{% block body %}
{% with messages = get_flashed_messages() %}
{% if messages %}
<div class="container">
<div class="alert alert-success alert-dismissible">
<a href="#" class="close" data-dismiss="alert" aria-label="close">&times;</a>
<ul class=flashes>
{% for message in messages %}
<li>{{ message }}</li>
{% endfor %}
</ul>
</div>
</div>
{% endif %}
{% endwith %}
<div class="container">
<div class="row">
<div class="col-md-12">
<center>
<a href="{{ url_for('search')}}?query=&fields=">
<img src="{{ url_for('static', filename='centillion_white.png') }}">
</a>
{% if config['TAGLINE'] %}
<h2><a href="{{ url_for('search')}}?query=&fields=">
{{config['TAGLINE']}}
</a></h2>
{% endif %}
</center>
</div>
</div>
{% if config['zzzTAGLINE'] %}
<div class="row">
<div class="col12sm">
<center>
<h2><a href="{{ url_for('search')}}?query=&fields=">
{{config['TAGLINE']}}
</a></h2>
</center>
</div>
</div>
{% endif %}
</div>
<hr />
<div class="container">
<div class="row">
{# update main search index #}
<div class="panel panel-danger">
<div class="panel-heading">
<h3 class="panel-title">
Update Main Search Index
</h3>
</div>
<div class="panel-body">
<div class="container-fluid">
<div class="row">
<div class="col-md-12">
<p class="panel-text">Re-index <i>every</i> document in the
remote collection in the search index. <b>Warning: this operation may take a while.</b>
</p> <p>
<a href="{{ url_for('update_index') }}" class="btn btn-large btn-danger">Update Main Index</a>
</p>
</div>
</div>
</div>
</div>
</div>
{# update diff search index #}
<div class="panel panel-danger">
<div class="panel-heading">
<h3 class="panel-title">
Update Diff Search Index
</h3>
</div>
<div class="panel-body">
<div class="container-fluid">
<div class="row">
<div class="col-md-12">
<p class="panel-text">Diff search index only re-indexes documents created after the last
search index update. <b>Not currently implemented.</b>
</p> <p>
<a href="#" class="btn btn-large disabled btn-danger">Update Diff Index</a>
</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
{% endblock %}


@@ -3,9 +3,10 @@
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='style.css') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='github-markdown.css') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='bootstrap.min.css') }}">
<script src="{{ url_for('static', filename='jquery.min.js') }}"></script>
<script src="{{ url_for('static', filename='bootstrap.min.js') }}"></script>
<div>
{% for message in get_flashed_messages() %}
<div class="flash">{{ message }}</div>
{% endfor %}
{% block body %}{% endblock %}
</div>


@@ -4,34 +4,33 @@
<div class="container">
{#
banner image
#}
<div class="row">
<div class="col12sm">
<center>
<a href="{{ url_for('search')}}?query=&fields=">
<img src="{{ url_for('static', filename='centillion_white.png') }}">
</a>
{#
need a tag line
#}
{% if config['TAGLINE'] %}
<h2><a href="{{ url_for('search')}}?query=&fields=">
{{config['TAGLINE']}}
</a></h2>
{% endif %}
</center>
</div>
</div>
</div>
<div class="container">
<div class="row">
<div class="col12sm">
<center>
<h2>
<a href="{{ url_for('search')}}?query=&fields=">
Search the DCPPC
</a>
</h2>
</center>
</div>
</div>
<div class="row">
<div class="col-12">
<div class="col-xs-12">
<center>
<a class="index" href="{{ url_for('update_index')}}">[update index]</a>
<a class="index" href="{{ url_for('update_index')}}?rebuild=True">[rebuild index]</a>
<form action="{{ url_for('search') }}" name="search">
<input type="text" name="query" value="{{ query }}"> <br />
<button type="submit" style="font-size: 20px; padding: 10px; padding-left: 50px; padding-right: 50px;"
@@ -48,8 +47,8 @@
<div class="row">
{% if directories %}
<div class="col-12 info directories-cloud">
File directories:&nbsp
<div class="col-xs-12 info directories-cloud">
<b>File directories:</b>
{% for d in directories %}
<a href="{{url_for('search')}}?query={{d|trim}}&fields=filename">{{d|trim}}</a>
{% endfor %}
@@ -60,25 +59,38 @@
{% if config['SHOW_PARSED_QUERY'] and parsed_query %}
<li class="list-group-item">
<div class="col-12 info">
<div class="container-fluid">
<div class="row">
<div class="col-xs-12 info">
<b>Parsed query:</b> {{ parsed_query }}
</div>
</div>
</div>
</li>
{% endif %}
{% if parsed_query %}
<li class="list-group-item">
<div class="col-12 info">
<b>Found:</b> {{entries|length}} documents with results, out of {{totals["total"]}} total documents
<div class="container-fluid">
<div class="row">
<div class="col-xs-12 info">
<b>Found:</b> <span class="badge">{{entries|length}}</span> results
out of <span class="badge">{{totals["total"]}}</span> total items indexed
</div>
</div>
</div>
</li>
{% endif %}
<li class="list-group-item">
<div class="col-12 info">
<b>Indexing:</b> {{totals["documents"]}} Google Documents,
{{totals["issues"]}} Github issues, and
{{totals["comments"]}} Github comments
<div class="container-fluid">
<div class="row">
<div class="col-xs-12 info">
<b>Indexing:</b> <span class="badge">{{totals["documents"]}}</span> Google Documents,
<span class="badge">{{totals["issues"]}}</span> Github issues,
<span class="badge">{{totals["markdown"]}}</span> markdown files.
</div>
</div>
</div>
</li>
@@ -97,28 +109,26 @@
{% if e.kind=="gdoc" %}
<b>Google Drive File:</b>
<a href='{{e.url}}'>{{e.title}}</a>
({{e.owner_name}}, {{e.owner_email}})
{% elif e.kind=="comment" %}
<b>Comment:</b>
<a href='{{e.url}}'>Comment (link)</a>
{% if e.github_user %}
by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
{% endif %}
on issue <a href='{{e.issue_url}}'>{{e.issue_title}}</a>
<br/>
<b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
{% if e.github_user %}
{% endif %}
(Owner: {{e.owner_name}}, {{e.owner_email}})
{% elif e.kind=="issue" %}
<b>Issue:</b>
<a href='{{e.issue_url}}'>{{e.issue_title}}</a>
<a href='{{e.url}}'>{{e.title}}</a>
{% if e.github_user %}
by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
opened by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
{% endif %}
<br/>
<b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
{% elif e.kind=="markdown" %}
<b>Markdown:</b>
<a href='{{e.url}}'>{{e.title}}</a>
<br/>
<b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
{% else %}
<b>Item:</b> (<a href='{{e.url}}'>link</a>)
{% endif %}
<br />
score: {{'%d' % e.score}}
@@ -134,17 +144,29 @@
<div class="container">
<div class="row">
<div class="col-12">
<div class="last-searches">Last searches: <br/>
{% for s in last_searches %}
<span><a href="{{url_for('search')}}?{{s}}">{{s}}</a></span>
{% endfor %}
<ul class="list-group">
{% if config['FOOTER_REPO_NAME'] %}
{% if config['FOOTER_REPO_ORG'] %}
<li class="list-group-item">
<div class="container-fluid">
<div class="row">
<div class="col-xs-12 info">
More information about {{config['FOOTER_REPO_NAME']}} can be found
in the <a href="https://github.com/{{config['FOOTER_REPO_ORG']}}/{{config['FOOTER_REPO_NAME']}}">{{config['FOOTER_REPO_ORG']}}/{{config['FOOTER_REPO_NAME']}}</a>
repository on Github.
</div>
<p>
More info can be found in the <a href="https://github.com/BernhardWenzel/markdown-search">README.md file</a>
</p>
</div>
</div>
</li>
{% endif %}
{% endif %}
</ul>
</div>
</div>
{% endblock %}