Compare commits
104 Commits
| SHA1 |
|---|
| de796880c5 |
| f79f711a38 |
| 00b862b83e |
| a06c3b645a |
| 878ff011fb |
| 33cf78a524 |
| c1bcd8dc22 |
| 757e9d79a1 |
| c47682adb4 |
| f2662c3849 |
| 2478a3f857 |
| f174080dfd |
| ca8b12db06 |
| a1ffdad292 |
| ce76396096 |
| 175ff4f71d |
| 94f956e2d0 |
| dc015671fc |
| 1e9eec81d7 |
| 31e12476af |
| bbe4e32f63 |
| 5013741958 |
| 1ce80a5da0 |
| 3ed967bd8b |
| 1eaaa32007 |
| 9c7e696b6a |
| 262a0c19e7 |
| bd2714cc0b |
| 899d6fed53 |
| a7756049e5 |
| 3df427a8f8 |
| 0dd06748de |
| 1a04814edf |
| 3fb72d409b |
| d89e01221a |
| 6736f3f8ad |
| abd13aba29 |
| 13e49cdaa6 |
| 83b2ce17fb |
| 5be0709070 |
| 9edd95a78d |
| 37615d8707 |
| 4b218f63b9 |
| 4e17c890bc |
| 1129ec38e0 |
| 875508c796 |
| abc7a2aedf |
| 8f1e5faefc |
| d5f63e2322 |
| 84e5560423 |
| 924c562c0a |
| 13c410ac5e |
| 4e79800e83 |
| 5b9570d8cd |
| 297a4b5977 |
| 69a6b5d680 |
| 3feca1aba3 |
| 493581f861 |
| 1b0ded809d |
| 78e77c7cf2 |
| 2f890d1aee |
| 937327f2cb |
| ca0d88cfe6 |
| 5eda472072 |
| d943c14678 |
| 6be785a056 |
| 65113a95f7 |
| 87c3f12c8f |
| 933884e9ab |
| da9dea3f6b |
| 4d6386e74a |
| a93b7519de |
| 5e2c37164b |
| 829e9c4263 |
| 283991017c |
| 653af18f24 |
| fae184f1f3 |
| d40bb3557f |
| a848f3ec3e |
| 50d27a915a |
| 1b950b7790 |
| 04d4195668 |
| d0fe7aa799 |
| acc28aab44 |
| adc2666a9b |
| 581f0a67ed |
| 0b96061bc5 |
| c7acdea889 |
| 4eabd4536e |
| 78276c14d9 |
| 68f90d383f |
| 202643b85e |
| dc9ac74d68 |
| 36cc94a854 |
| 740e757bcd |
| bf6afe39c6 |
| 54c09ce80b |
| 1407178f39 |
| 2bf9abfd6f |
| 8328f96f76 |
| d5a9fe85af |
| f8d2156d85 |
| a753ba4963 |
| 8cca4b2c8d |
.gitignore (vendored) — 2 changes

@@ -1,8 +1,8 @@
+config_flask.py
 vp
 credentials.json
 drive*.json
 *.pyc
-config.py
 out/
 search_index/
 venv/
.gitmodules (vendored) — 6 changes

@@ -1,3 +1,3 @@
-[submodule "mkdocs-material"]
-    path = mkdocs-material
-    url = https://git.charlesreid1.com/charlesreid1/mkdocs-material.git
+[submodule "mkdocs-material-dib"]
+    path = mkdocs-material-dib
+    url = https://github.com/dib-lab/mkdocs-material-dib.git
Readme.md — 79 changes

@@ -1,17 +1,19 @@
-# The Centillion
+# Centillion
 
-**the centillion**: a pan-github-markdown-issues-google-docs search engine.
+**centillion**: a pan-github-markdown-issues-google-docs search engine.
 
 **a centillion**: a very large number consisting of a 1 with 303 zeros after it.
 
-the centillion is 3.03 log-times better than the googol.
+one centillion is 3.03 log-times better than a googol.
 
-[image]
+[image: docs/images/ss.png]
 
 ## what is it
 
-The centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
-a Python library for building search engines.
+Centillion (https://github.com/dcppc/centillion) is a search engine that can index
+three kinds of collections: Google Documents, Github issues, and Markdown files in
+Github repos.
 
 We define the types of documents the centillion should index,
 what info and how. The centillion then builds and

@@ -23,23 +25,70 @@ defined in `centillion.py`.
 
 The centillion keeps it simple.
 
-## quickstart
+## authentication layer
 
-Run the centillion app with a github access token API key set via
-environment variable:
+Centillion lives behind a Github authentication layer, implemented with
+[flask-dance](https://github.com/singingwolfboy/flask-dance). When you first
+visit the site it will ask you to authenticate with Github so that it can
+verify you have permission to access the site.
+
+## technologies
+
+Centillion is a Python program built using whoosh (search engine library). It
+indexes the full text of docx files in Google Documents, just the filenames for
+non-docx files. The full text of issues and their comments are indexed, and
+results are grouped by issue. Centillion requires Google Drive and Github OAuth
+apps. Once you provide credentials to Flask you're all set to go.
+
+## control panel
+
+There's also a control panel at <https://search.nihdatacommons.us/control_panel>
+that allows you to rebuild the search index from scratch (the Google Drive indexing
+takes a while).
+
+[image: docs/images/cp.png]
+
+## quickstart (with Github auth)
+
+Start by creating a Github OAuth application.
+Get the public and private application key
+(client token and client secret token)
+from the Github application's page.
+You will also need a Github access token
+(in addition to the app tokens).
+
+When you create the application, set the callback
+URL to `/login/github/authorized`, as in:
 
 ```
-GITHUB_TOKEN="XXXXXXXX" python centillion.py
+https://<url>/login/github/authorized
+```
+
+Edit the Flask configuration `config_flask.py`
+and set the public and private application keys.
+
+Now run centillion:
+
+```
+python centillion.py
+```
+
+or if you used http instead of https:
+
+```
+OAUTHLIB_INSECURE_TRANSPORT="true" python centillion.py
 ```
 
 This will start a Flask server, and you can view the minimal search engine
-interface in your browser at <http://localhost:5000>.
+interface in your browser at `http://<ip>:5000`.
 
-## more info
-
-For more info see the documentation: <https://charlesreid1.github.io/centillion>
+## troubleshooting
+
+If you are having problems with your callback URL being treated
+as HTTP by Github, even though there is an HTTPS address, and
+everything else seems fine, try deleting the Github OAuth app
+and creating a new one.
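The new quickstart section above describes the Github OAuth wiring in prose. As a minimal sketch of what that wiring looks like with flask-dance (assuming the client ID/secret live in environment variables and the OAuth app's callback URL is `https://<url>/login/github/authorized`; the names here are illustrative, not necessarily centillion's exact code):

```python
# Sketch only: a Flask app gated behind Github OAuth via flask-dance.
import os
from flask import Flask, redirect, url_for
from flask_dance.contrib.github import make_github_blueprint, github

app = Flask(__name__)
app.secret_key = os.environ.get("FLASK_SECRET_KEY", "change-me")

github_bp = make_github_blueprint(
    client_id=os.environ.get("GITHUB_OAUTH_CLIENT_ID"),
    client_secret=os.environ.get("GITHUB_OAUTH_CLIENT_SECRET"),
    scope="read:org",
)
app.register_blueprint(github_bp, url_prefix="/login")

@app.route("/")
def index():
    # Anyone not yet authorized with Github is bounced into the OAuth flow.
    if not github.authorized:
        return redirect(url_for("github.login"))
    username = github.get("/user").json()["login"]
    return "Hello, %s" % username

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```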
Todo.md — 48 changes

@@ -1,7 +1,47 @@
 # todo
 
-current problems:
-- some github issues have no title
-- github issues are just being re-indexed over and over
-- documents not showing up in results
+Main task:
+- hashing and caching
+    - <s>first, working out the logic of how we group items into sets
+        - needs to be deleted
+        - needs to be updated
+        - needs to be added
+        - for docs, issues, and comments</s>
+    - second, when we add or update an item, need to:
+        - go through the motions, download file, extract text
+        - check for existing indexed doc with that id
+        - check if existing indexed doc has same hash
+        - if so, skip
+        - otherwise, delete and re-index
+
+Other bugs:
+- Some github issues have no title (?)
+- <s>Need to combine issues with comments</s>
+- Not able to index markdown files _in a repo_
+- (Longer term) update main index vs update diff index
+
+Needs:
+- <s>control panel</s>
+
+Thursday product:
+- Everything re-indexed nightly
+- Search engine built on all documents in Google Drive, all issues, markdown files
+- Using pandoc to extract Google Drive document contents
+- BRIEF quickstart documentation
+
+Future:
+- Future plans to improve - plugins, improving matching
+- Subdomain plans
+- Folksonomy tagging and integration plans
+
+
+config options for plugins
+conditional blocks with import github inside
+complicated tho - better to have components split off
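The "hashing and caching" item above describes a check-before-reindex flow: look up the existing record by id, compare a content hash, and only delete and re-index when the content changed. A minimal sketch of that logic against a whoosh index, assuming a `fingerprint` field in the schema (as added in the centillion_search.py diff below); `make_record` is an illustrative helper, not centillion code:

```python
import hashlib

def upsert_document(ix, doc_id, content, make_record):
    """Sketch: skip re-indexing when the stored fingerprint matches the new content."""
    fingerprint = hashlib.sha256(content.encode("utf-8")).hexdigest()

    with ix.searcher() as searcher:
        # Stored fields of the already-indexed doc with this id, or None.
        existing = searcher.document(id=doc_id)

    if existing is not None and existing.get("fingerprint") == fingerprint:
        return False  # unchanged: skip

    writer = ix.writer()
    if existing is not None:
        writer.delete_by_term("id", doc_id)  # changed: delete, then re-add
    writer.add_document(fingerprint=fingerprint, **make_record(doc_id, content))
    writer.commit()
    return True
```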
centillion.py — 172 changes

@@ -2,8 +2,11 @@ import threading
 from subprocess import call
 
 import codecs
-import os
+import os, json
 
+from werkzeug.contrib.fixers import ProxyFix
 from flask import Flask, request, redirect, url_for, render_template, flash
+from flask_dance.contrib.github import make_github_blueprint, github
 
 # create our application
 from centillion_search import Search

@@ -22,10 +25,18 @@ You provide:
 - Google Drive API key via file
 """
 
 class UpdateIndexTask(object):
-    def __init__(self, diff_index=False):
+    def __init__(self, app_config, diff_index=False):
         self.diff_index = diff_index
         thread = threading.Thread(target=self.run, args=())
+
+        self.gh_token = app_config['GITHUB_TOKEN']
+        self.groupsio_credentials = {
+                'groupsio_token' : app_config['GROUPSIO_TOKEN'],
+                'groupsio_username' : app_config['GROUPSIO_USERNAME'],
+                'groupsio_password' : app_config['GROUPSIO_PASSWORD']
+        }
         thread.daemon = True
         thread.start()

@@ -38,30 +49,91 @@ class UpdateIndexTask(object):
         from get_centillion_config import get_centillion_config
         config = get_centillion_config('config_centillion.json')
 
-        gh_token = os.environ['GITHUB_TOKEN']
-        search.update_index_issues(gh_token, config)
-        search.update_index_gdocs(config)
+        search.update_index_groupsioemails(self.groupsio_credentials,config)
+        ###search.update_index_ghfiles(self.gh_token,config)
+        ###search.update_index_issues(self.gh_token,config)
+        ###search.update_index_gdocs(config)
 
 
 app = Flask(__name__)
+app.wsgi_app = ProxyFix(app.wsgi_app)
 
 # Load default config and override config from an environment variable
 app.config.from_pyfile("config_flask.py")
 
-last_searches_file = app.config["INDEX_DIR"] + "/last_searches.txt"
+#github_bp = make_github_blueprint()
+github_bp = make_github_blueprint(
+        client_id = os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
+        client_secret = os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
+        scope='read:org')
+
+app.register_blueprint(github_bp, url_prefix="/login")
+
+contents404 = "<html><body><h1>Status: Error 404 Page Not Found</h1></body></html>"
+contents403 = "<html><body><h1>Status: Error 403 Access Denied</h1></body></html>"
+contents200 = "<html><body><h1>Status: OK 200</h1></body></html>"
+
 ##############################
 # Flask routes
 
 @app.route('/')
 def index():
+
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+
+    else:
+
+        username = github.get("/user").json()['login']
+
+        resp = github.get("/user/orgs")
+        if resp.ok:
+
+            # If they are in team copper, redirect to search.
+            # Otherwise, hit em with a 403
+            all_orgs = resp.json()
+            for org in all_orgs:
+                if org['login']=='dcppc':
+                    copper_team_id = '2700235'
+                    mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                    if mresp.status_code==204:
+
+                        # --------------------
+                        # Business as usual
                         return redirect(url_for("search", query="", fields=""))
+
+        return contents403
+
+    return contents404
+
+### @app.route('/')
+### def index():
+###     return redirect(url_for("search", query="", fields=""))
 
 @app.route('/search')
 def search():
+
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+
+    username = github.get("/user").json()['login']
+
+    resp = github.get("/user/orgs")
+    if resp.ok:
+
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+
+                copper_team_id = '2700235'
+
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+
+                    # --------------------
+                    # Business as usual
                     query = request.args['query']
                     fields = request.args.get('fields')
                     if fields == 'None':

@@ -74,7 +146,6 @@ def search():
 
     else:
         parsed_query, result = search.search(query.split(), fields=[fields])
-    store_search(query, fields)
 
     totals = search.get_document_total_count()
 

@@ -83,46 +154,75 @@ def search():
                            query=query,
                            parsed_query=parsed_query,
                            fields=fields,
-                           last_searches=get_last_searches(),
                            totals=totals)
+
+    return contents403
+
 
 @app.route('/update_index')
 def update_index():
-    rebuild = request.args.get('rebuild')
-    UpdateIndexTask(diff_index=False)
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+
+    username = github.get("/user").json()['login']
+
+    resp = github.get("/user/orgs")
+    if resp.ok:
+
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+
+                copper_team_id = '2700235'
+
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+
+                    # --------------------
+                    # Business as usual
+                    UpdateIndexTask(app.config,
+                                    diff_index=False)
                     flash("Rebuilding index, check console output")
-                    return render_template("search.html",
-                                           query="",
-                                           fields="",
-                                           last_searches=get_last_searches(),
+                    return render_template("controlpanel.html",
                                            totals={})
+
+    return contents403
 
-##############
-# Utility methods
-
-def get_last_searches():
-    if os.path.exists(last_searches_file):
-        with codecs.open(last_searches_file, 'r', encoding='utf-8') as f:
-            contents = f.readlines()
-    else:
-        contents = []
-    return contents
 
-def store_search(query, fields):
-    if os.path.exists(last_searches_file):
-        with codecs.open(last_searches_file, 'r', encoding='utf-8') as f:
-            contents = f.readlines()
-    else:
-        contents = []
-
-    search = "query=%s&fields=%s\n" % (query, fields)
-    if not search in contents:
-        contents.insert(0, search)
-
-    with codecs.open(last_searches_file, 'w', encoding='utf-8') as f:
-        f.writelines(contents[:30])
+@app.route('/control_panel')
+def control_panel():
+
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+
+    username = github.get("/user").json()['login']
+
+    resp = github.get("/user/orgs")
+    if resp.ok:
+
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+
+                copper_team_id = '2700235'
+
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+
+                    return render_template("controlpanel.html",
+                                            totals={})
+
+    return contents403
+
+
+@app.errorhandler(404)
+def oops(e):
+    return contents404
 
 if __name__ == '__main__':
-    app.run()
+    # if running local instance, set to true
+    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = 'true'
+    app.run(host="0.0.0.0",port=5000)
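The same "is this user in the dcppc copper team" block is repeated in each route above. One way to keep that logic in a single place is a decorator around the flask-dance `github` client; this is only a sketch of that refactor under the assumption that the team id stays hard-coded, and `TEAM_ID` / `require_copper_team` are illustrative names, not centillion's API:

```python
from functools import wraps
from flask import redirect, url_for
from flask_dance.contrib.github import github

TEAM_ID = '2700235'  # the dcppc "copper" team id used in the routes above

def require_copper_team(view):
    """Sketch: gate a Flask view behind Github org/team membership."""
    @wraps(view)
    def wrapped(*args, **kwargs):
        if not github.authorized:
            return redirect(url_for("github.login"))
        username = github.get("/user").json()['login']
        resp = github.get("/user/orgs")
        if resp.ok and any(org['login'] == 'dcppc' for org in resp.json()):
            membership = github.get('/teams/%s/members/%s' % (TEAM_ID, username))
            if membership.status_code == 204:
                return view(*args, **kwargs)
        return contents403  # the 403 page defined in centillion.py above

    return wrapped

# Usage sketch:
# @app.route('/search')
# @require_copper_team
# def search():
#     ...
```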
centillion_prepare.py — new file, 5 lines

@@ -0,0 +1,5 @@
+from gdrive_util import GDrive
+
+gd = GDrive()
+service = gd.get_service()
centillion_search.py

@@ -1,9 +1,11 @@
 import shutil
 import html.parser
 
-from github import Github
+from github import Github, GithubException
+import base64
 
 from gdrive_util import GDrive
+from groupsio_util import GroupsIOArchivesCrawler
 from apiclient.http import MediaIoBaseDownload
 
 import mistune

@@ -42,6 +44,7 @@ Search object functions:
 Schema:
  - id
  - kind
+ - fingerprint
  - created_time
  - modified_time
  - indexed_time

@@ -95,6 +98,11 @@ class Search:
     def __init__(self, index_folder):
         self.open_index(index_folder)
 
 
+    # ------------------------------
+    # Create a schema and open a search index
+    # on disk.
+
     def open_index(self, index_folder, create_new=False):
         """
         Create a schema,

@@ -115,7 +123,6 @@ class Search:
 
 
         # ------------------------------
-        # IMPORTANT:
         # This is where the search index's document schema
         # is defined.
 

@@ -160,16 +167,13 @@ class Search:
     # Define how to add documents
 
 
-    def add_drive_file(self, writer, item, indexed_ids, temp_dir, config):
+    def add_drive_file(self, writer, item, temp_dir, config, update=False):
         """
         Add a Google Drive document/file to a search index.
         If it is a document, extract the contents.
         """
-        gd = GDrive()
-        service = gd.get_service()
 
-        # ------------------------
-        # Two kinds of documents:
+        # There are two kinds of documents:
         # - documents with text that can be extracted (docx)
         # - everything else
 
@@ -179,88 +183,13 @@ class Search:
         }
 
         content = ""
-        if(mimetype not in mimemap.keys()):
-            # Not a document -
-            # Just a file
-            print("Indexing document \"%s\" of type %s"%(item['name'], mimetype))
-        else:
-            # Document with text
-            # Perform content extraction
-
-            # -----------
-            # docx Content Extraction:
-            #
-            # We can only do this with .docx files
-            # This is a file type we know how to convert
-            # Construct the URL and download it
-
-            print("Extracting content from \"%s\" of type %s"%(item['name'], mimetype))
-
[... roughly 60 more removed lines: the docx download / pandoc extraction / cleanup block, which this change moves into the else: branch shown in the next hunk ...]
-
-        # ------------------------------
-        # IMPORTANT:
-        # This is where the search documents are actually created.
-
-        mimetype = re.split('[/\.]', item['mimeType'])[-1]
+        if mimetype not in mimemap.keys():
+
+            # Not a document - just a file
+            print("Indexing Google Drive file \"%s\" of type %s"%(item['name'], mimetype))
+            writer.delete_by_term('id',item['id'])
+
+            # Index a plain google drive file
         writer.add_document(
             id = item['id'],
             kind = 'gdoc',
@@ -281,23 +210,143 @@ class Search:
         )
 
 
-    def add_issue(self, writer, issue, repo, config):
+        else:
+            # Document with text
+            # Perform content extraction
+
+            # -----------
+            # docx Content Extraction:
+            #
+            # We can only do this with .docx files
+            # This is a file type we know how to convert
+            # Construct the URL and download it
+
+            print("Indexing Google Drive document \"%s\" of type %s"%(item['name'], mimetype))
+            print(" > Extracting content")
+
+            # Create a URL and a destination filename
+            file_ext = mimemap[mimetype]
+            file_url = "https://docs.google.com/document/d/%s/export?format=%s"%(item['id'], file_ext)
+
+            # This re could probablybe improved
+            name = re.sub('/','_',item['name'])
+
+            # Now make the pandoc input/output filenames
+            out_ext = 'txt'
+            pandoc_fmt = 'plain'
+            if name.endswith(file_ext):
+                infile_name = name
+                outfile_name = re.sub(file_ext,out_ext,infile_name)
+            else:
+                infile_name = name+'.'+file_ext
+                outfile_name = name+'.'+out_ext
+
+            # Assemble input/output file paths
+            fullpath_input = os.path.join(temp_dir,infile_name)
+            fullpath_output = os.path.join(temp_dir,outfile_name)
+
+            # Use requests.get to download url to file
+            r = requests.get(file_url, allow_redirects=True)
+            with open(fullpath_input, 'wb') as f:
+                f.write(r.content)
+
+            # Try to convert docx file to plain text
+            try:
+                output = pypandoc.convert_file(fullpath_input,
+                                               pandoc_fmt,
+                                               format='docx',
+                                               outputfile=fullpath_output
+                )
+                assert output == ""
+            except RuntimeError:
+                print(" > XXXXXX Failed to index document \"%s\""%(item['name']))
+
+            # If export was successful, read contents of markdown
+            # into the content variable.
+            if os.path.isfile(fullpath_output):
+                # Export was successful
+                with codecs.open(fullpath_output, encoding='utf-8') as f:
+                    content = f.read()
+
+            # No matter what happens, clean up.
+            print(" > Cleaning up \"%s\""%item['name'])
+
+            ## test
+            #print(" ".join(['rm','-fr',fullpath_output]))
+            #print(" ".join(['rm','-fr',fullpath_input]))
+
+            # do it
+            subprocess.call(['rm','-fr',fullpath_output])
+            subprocess.call(['rm','-fr',fullpath_input])
+
+            if update:
+                print(" > Removing old record")
+                writer.delete_by_term('id',item['id'])
+            else:
+                print(" > Creating a new record")
+
+            writer.add_document(
+                    id = item['id'],
+                    kind = 'gdoc',
+                    created_time = item['createdTime'],
+                    modified_time = item['modifiedTime'],
+                    indexed_time = datetime.now().replace(microsecond=0).isoformat(),
+                    title = item['name'],
+                    url = item['webViewLink'],
+                    mimetype = mimetype,
+                    owner_email = item['owners'][0]['emailAddress'],
+                    owner_name = item['owners'][0]['displayName'],
+                    repo_name='',
+                    repo_url='',
+                    github_user='',
+                    issue_title='',
+                    issue_url='',
+                    content = content
+            )
+
+
+    # ------------------------------
+    # Add a single github issue and its comments
+    # to a search index.
+
+    def add_issue(self, writer, issue, gh_token, config, update=True):
         """
         Add a Github issue/comment to a search index.
         """
+        repo = issue.repository
         repo_name = repo.owner.login+"/"+repo.name
         repo_url = repo.html_url
 
-        count = 0
-
-        # Handle the issue content
         print("Indexing issue %s"%(issue.html_url))
 
+        # Combine comments with their respective issues.
+        # Otherwise just too noisy.
+        issue_comment_content = issue.body.rstrip()
+        issue_comment_content += "\n"
+
+        # Handle the comments content
+        if(issue.comments>0):
+
+            comments = issue.get_comments()
+            for comment in comments:
+
+                issue_comment_content += comment.body.rstrip()
+                issue_comment_content += "\n"
+
+        # Now create the actual search index record
         created_time = clean_timestamp(issue.created_at)
         modified_time = clean_timestamp(issue.updated_at)
         indexed_time = clean_timestamp(datetime.now())
 
+        # Add one document per issue thread,
+        # containing entire text of thread.
         writer.add_document(
             id = issue.html_url,
             kind = 'issue',
@@ -314,45 +363,106 @@ class Search:
             github_user = issue.user.login,
             issue_title = issue.title,
             issue_url = issue.html_url,
-            content = issue.body.rstrip()
+            content = issue_comment_content
         )
-        count += 1
 
 
-        # Handle the comments content
-        if(issue.comments>0):
-
-            comments = issue.get_comments()
-            for comment in comments:
-
-                print(" > Indexing comment %s"%(comment.html_url))
-
-                created_time = clean_timestamp(comment.created_at)
-                modified_time = clean_timestamp(comment.updated_at)
-                indexed_time = clean_timestamp(datetime.now())
-
-                writer.add_document(
-                        id = comment.html_url,
-                        kind = 'comment',
-                        created_time = created_time,
-                        modified_time = modified_time,
-                        indexed_time = indexed_time,
-                        title = "Comment on "+issue.title,
-                        url = comment.html_url,
-                        mimetype='',
-                        owner_email='',
-                        owner_name='',
-                        repo_name = repo_name,
-                        repo_url = repo_url,
-                        github_user = comment.user.login,
-                        issue_title = issue.title,
-                        issue_url = issue.html_url,
-                        content = comment.body.rstrip()
-                )
-
-                count += 1
-        return count
+    def add_ghfile(self, writer, d, gh_token, config, update=True):
+        """
+        Use a Github file API record to add a filename
+        to the search index.
+        """
+        MARKDOWN_EXTS = ['.md','.markdown']
+
+        repo = d['repo']
+        org = d['org']
+        repo_name = org + "/" + repo
+        repo_url = "https://github.com/" + repo_name
+
+        try:
+            fpath = d['path']
+            furl = d['url']
+            fsha = d['sha']
+            _, fname = os.path.split(fpath)
+            _, fext = os.path.splitext(fpath)
+        except:
+            print(" > XXXXXXXX Failed to find file info.")
+            return
+
         indexed_time = clean_timestamp(datetime.now())
+
+        if fext in MARKDOWN_EXTS:
+            print("Indexing markdown doc %s from repo %s"%(fname,repo_name))
+
+            # Unpack the requests response and decode the content
+            #
+            # don't forget the headers for private repos!
+            # useful: https://bit.ly/2LSAflS
+
+            headers = {'Authorization' : 'token %s'%(gh_token)}
+
+            response = requests.get(furl, headers=headers)
+            if response.status_code==200:
+                jresponse = response.json()
+                content = ""
+                try:
+                    binary_content = re.sub('\n','',jresponse['content'])
+                    content = base64.b64decode(binary_content).decode('utf-8')
+                except KeyError:
+                    print(" > XXXXXXXX Failed to extract 'content' field. You probably hit the rate limit.")
+
+            else:
+                print(" > XXXXXXXX Failed to reach file URL. There may be a problem with authentication/headers.")
+                return
+
+            usable_url = "https://github.com/%s/blob/master/%s"%(repo_name, fpath)
+
+            # Now create the actual search index record
             writer.add_document(
+                    id = fsha,
+                    kind = 'markdown',
+                    created_time = '',
+                    modified_time = '',
+                    indexed_time = indexed_time,
+                    title = fname,
+                    url = usable_url,
+                    mimetype='',
+                    owner_email='',
+                    owner_name='',
+                    repo_name = repo_name,
+                    repo_url = repo_url,
+                    github_user = '',
+                    issue_title = '',
+                    issue_url = '',
+                    content = content
+            )
+
+        else:
+            print("Indexing github file %s from repo %s"%(fname,repo_name))
+
+            key = fname+"_"+fsha
+
+            # Now create the actual search index record
+            writer.add_document(
+                    id = key,
+                    kind = 'ghfile',
+                    created_time = '',
+                    modified_time = '',
+                    indexed_time = indexed_time,
+                    title = fname,
+                    url = repo_url,
+                    mimetype='',
+                    owner_email='',
+                    owner_name='',
+                    repo_name = repo_name,
+                    repo_url = repo_url,
+                    github_user = '',
+                    issue_title = '',
+                    issue_url = '',
+                    content = ''
+            )
+
@@ -360,91 +470,116 @@ class Search:
     # Define how to update search index
     # using different kinds of collections
 
 
+    # ------------------------------
+    # Google Drive Files/Documents
+
     def update_index_gdocs(self, 
             config):
         """
         Update the search index using a collection of
         Google Drive documents and files.
+
+        Uses the 'id' field to uniquely identify documents.
+
+        Also see:
+        https://developers.google.com/drive/api/v3/reference/files
         """
-        gd = GDrive()
-        service = gd.get_service()
 
-        # -----
-        # Get the set of all documents on Google Drive:
-
-        # ------------------------------
-        # IMPORTANT:
-        # This determines what information about the Google Drive files
-        # you'll get back, and that's all you're going to have to work with.
-        # If you need more information, modify the statement below.
-        # Also see:
-        # https://developers.google.com/drive/api/v3/reference/files
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids
+        # - add hash check in add_
 
+        # Get the set of indexed ids:
+        # ------
+        indexed_ids = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("gdoc")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+
+        # Get the set of remote ids:
+        # ------
+        # Start with google drive api object
         gd = GDrive()
         service = gd.get_service()
         drive = service.files()
 
-        # We should do more here
-        # to check if we should update
-        # or not...
-        #
-        # loop over existing documents in index:
-        #
-        # p = QueryParser("kind", schema=self.ix.schema)
-        # q = p.parse("gdoc")
-        # with self.ix.searcher() as s:
-        #     results = s.search(q,limit=None)
-        #     counts[key] = len(results)
+        # Now index all the docs in the google drive folder
 
         # The trick is to set next page token to None 1st time thru (fencepost)
         nextPageToken = None
 
         # Use the pager to return all the things
-        items = []
+        remote_ids = set()
+        full_items = {}
         while True:
-            ps = 12
+            ps = 100
             results = drive.list(
                     pageSize=ps,
                     pageToken=nextPageToken,
-                    fields="nextPageToken, files(id, kind, createdTime, modifiedTime, mimeType, name, owners, webViewLink)",
+                    fields = "nextPageToken, files(id, kind, createdTime, modifiedTime, mimeType, name, owners, webViewLink)",
                    spaces="drive"
            ).execute()
 
            nextPageToken = results.get("nextPageToken")
-            items += results.get("files", [])
-
-            # Keep it short
+            files = results.get("files",[])
+            for f in files:
+
+                # Add all remote docs to a set
+                remote_ids.add(f['id'])
+
+                # Also store the doc
+                full_items[f['id']] = f
+
+            ## Shorter:
+            #break
+            # Longer:
+            if nextPageToken is None:
                 break
 
-        #if nextPageToken is None:
-        #    break
-
-        # Here is where we update.
-        # Grab indexed ids
-        # Grab remote ids
-        # Drop indexed ids not in remote ids
-        # Index all remote ids
-        # Change add_ to update_
-        # Add a hash check in update_
-
-        indexed_ids = set()
-        for item in items:
-            indexed_ids.add(item['id'])
 
         writer = self.ix.writer()
+        count = 0
         temp_dir = tempfile.mkdtemp(dir=os.getcwd())
         print("Temporary directory: %s"%(temp_dir))
-        if not os.path.exists(temp_dir):
-            os.mkdir(temp_dir)
 
-        count = 0
-        for item in items:
-            self.add_drive_file(writer, item, indexed_ids, temp_dir, config)
+        # Drop any id in indexed_ids
+        # not in remote_ids
+        drop_ids = indexed_ids - remote_ids
+        for drop_id in drop_ids:
+            writer.delete_by_term('id',drop_id)
+
+        # Update any id in indexed_ids
+        # and in remote_ids
+        update_ids = indexed_ids & remote_ids
+        for update_id in update_ids:
+            # cop out
+            writer.delete_by_term('id',update_id)
+            item = full_items[update_id]
+            self.add_drive_file(writer, item, temp_dir, config, update=True)
             count += 1
 
+        # Add any id not in indexed_ids
+        # and in remote_ids
+        add_ids = remote_ids - indexed_ids
+        for add_id in add_ids:
+            item = full_items[add_id]
+            self.add_drive_file(writer, item, temp_dir, config, update=False)
+            count += 1
+
         print("Cleaning temporary directory: %s"%(temp_dir))
         subprocess.call(['rm','-fr',temp_dir])
 
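The update method above, and the issue and Github-file variants in the hunks that follow, all rely on the same set arithmetic over document ids. Stated on its own as a small sketch (with `indexed_ids` and `remote_ids` as plain Python sets; `plan_update` is an illustrative name, not a centillion function):

```python
# Sketch of the drop / update / add bookkeeping used by the update_index_* methods.
def plan_update(indexed_ids, remote_ids):
    drop_ids   = indexed_ids - remote_ids   # indexed locally but gone remotely: delete
    update_ids = indexed_ids & remote_ids   # present in both: delete then re-add ("cop out")
    add_ids    = remote_ids - indexed_ids   # new remotely: add
    return drop_ids, update_ids, add_ids

# Example:
# plan_update({"a", "b"}, {"b", "c"})  ->  ({"a"}, {"b"}, {"c"})
```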
@@ -452,25 +587,41 @@ class Search:
         print("Done, updated %d documents in the index" % count)
 
 
-    def update_index_issues(self,
-            gh_access_token,
-            config):
+    # ------------------------------
+    # Github Issues/Comments
+
+    def update_index_issues(self, gh_token, config):
         """
         Update the search index using a collection of
         Github repo issues and comments.
         """
-        # Strategy:
-        # To get the proof of concept up and running,
-        # we are just deleting and re-indexing every issue/comment.
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids
 
-        g = Github(gh_access_token)
+        # Get the set of indexed ids:
+        # ------
+        indexed_issues = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("issue")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_issues.add(result['id'])
 
-        # Set of all URLs as existing on github
-        to_index = set()
-
-        writer = self.ix.writer()
+        # Get the set of remote ids:
+        # ------
+        # Start with api object
+        g = Github(gh_token)
+
+        # Now index all issue threads in the user-specified repos
+
+        # Start by collecting all the things
+        remote_issues = set()
+        full_items = {}
 
         # Iterate over each repo
         list_of_repos = config['repositories']
@@ -481,41 +632,214 @@ class Search:
                 raise Exception(err)
 
             this_org, this_repo = re.split('/',r)
+            try:
                 org = g.get_organization(this_org)
                 repo = org.get_repo(this_repo)
+            except:
+                print("Error: could not gain access to repository %s"%(r))
+                continue
 
-            count = 0
-
-            # Iterate over each thread
+            # Iterate over each issue thread
             issues = repo.get_issues()
             for issue in issues:
-
-                # This approach is more work than is needed
-                # but PoC||GTFO
-
                 # For each issue/comment URL,
-                # remove the corresponding item
-                # and re-add it to the index
-
-                to_index.add(issue.html_url)
-                writer.delete_by_term('url', issue.html_url)
-                count -= 1
-                comments = issue.get_comments()
-
-                for comment in comments:
-                    to_index.add(comment.html_url)
-                    writer.delete_by_term('url', comment.html_url)
-
-                # Now re-add this issue to the index
-                # (this will also add the comments)
-                count += self.add_issue(writer, issue, repo, config)
+                # grab the key and store the
+                # corresponding issue object
+                key = issue.html_url
+                value = issue
+
+                remote_issues.add(key)
+                full_items[key] = value
+
+        writer = self.ix.writer()
+        count = 0
+
+        # Drop any issues in indexed_issues
+        # not in remote_issues
+        drop_issues = indexed_issues - remote_issues
+        for drop_issue in drop_issues:
+            writer.delete_by_term('id',drop_issue)
+
+        # Update any issue in indexed_issues
+        # and in remote_issues
+        update_issues = indexed_issues & remote_issues
+        for update_issue in update_issues:
+            # cop out
+            writer.delete_by_term('id',update_issue)
+            item = full_items[update_issue]
+            self.add_issue(writer, item, gh_token, config, update=True)
+            count += 1
+
+        # Add any issue not in indexed_issues
+        # and in remote_issues
+        add_issues = remote_issues - indexed_issues
+        for add_issue in add_issues:
+            item = full_items[add_issue]
+            self.add_issue(writer, item, gh_token, config, update=False)
+            count += 1
 
         writer.commit()
         print("Done, updated %d documents in the index" % count)
 
+
+    # ------------------------------
+    # Github Files
+
+    def update_index_ghfiles(self, gh_token, config):
+        """
+        Update the search index using a collection of
+        files (and, separately, Markdown files) from
+        a Github repo.
+        """
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids
+
+        # Get the set of indexed ids:
+        # ------
+        indexed_ids = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("ghfiles")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+
+        q = p.parse("markdown")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+
+        # Get the set of remote ids:
+        # ------
+        # Start with api object
+        g = Github(gh_token)
+
+        # Now index all the files.
+
+        # Start by collecting all the things
+        remote_ids = set()
+        full_items = {}
+
+        # Iterate over each repo
+        list_of_repos = config['repositories']
+        for r in list_of_repos:
+
+            if '/' not in r:
+                err = "Error: specify org/reponame or user/reponame in list of repos"
+                raise Exception(err)
+
+            this_org, this_repo = re.split('/',r)
+            try:
+                org = g.get_organization(this_org)
+                repo = org.get_repo(this_repo)
+            except:
+                print("Error: could not gain access to repository %s"%(r))
+                continue
+
+            # Get head commit
+            commits = repo.get_commits()
+            try:
+                last = commits[0]
+                sha = last.sha
+            except GithubException:
+                print("Error: could not get commits from repository %s"%(r))
+                continue
+
+            # Get all the docs
+            tree = repo.get_git_tree(sha=sha, recursive=True)
+            docs = tree.raw_data['tree']
+            print("Parsing file ids from repository %s"%(r))
+
+            for d in docs:
+
+                # For each doc, get the file extension
+                # and decide what to do with it.
+
+                fpath = d['path']
+                _, fname = os.path.split(fpath)
+                _, fext = os.path.splitext(fpath)
+
+                key = d['sha']
+
+                d['org'] = this_org
+                d['repo'] = this_repo
+                value = d
+
+                remote_ids.add(key)
+                full_items[key] = value
+
+        writer = self.ix.writer()
+        count = 0
+
+        # Drop any id in indexed_ids
+        # not in remote_ids
+        drop_ids = indexed_ids - remote_ids
+        for drop_id in drop_ids:
+            writer.delete_by_term('id',drop_id)
+
+        # Update any id in indexed_ids
+        # and in remote_ids
+        update_ids = indexed_ids & remote_ids
+        for update_id in update_ids:
+            # cop out: just delete and re-add
+            writer.delete_by_term('id',update_id)
+            item = full_items[update_id]
+            self.add_ghfile(writer, item, gh_token, config, update=True)
+            count += 1
+
+        # Add any issue not in indexed_ids
+        # and in remote_ids
+        add_ids = remote_ids - indexed_ids
+        for add_id in add_ids:
+            item = full_items[add_id]
+            self.add_ghfile(writer, item, gh_token, config, update=False)
+            count += 1
+
+        writer.commit()
+        print("Done, updated %d Github files in the index" % count)
+
+
+    # ------------------------------
+    # Groups.io Emails
+
+    def update_index_groupsioemails(self, groupsio_token, config):
+        """
+        Update the search index using the email archives
+        of groups.io groups.
+
+        This requires the use of a spider.
+        RELEASE THE SPIDER!!!
+        """
+        spider = GroupsIOArchivesCrawler(groupsio_token,'dcppc')
+
+        # - ask spider to crawl the archives
+        spider.crawl_group_archives()
+
+        # - ask spider for list of all email records
+        #   - 1 email = 1 dictionary
+        #   - email records compiled by the spider
+        archives = spider.get_archives()
+
+        # - email object is sent off to add email method
+
+        print("Finished indexing groups.io emails")
+
 
     # ---------------------------------
     # Search results bundler
 
@@ -580,21 +904,18 @@ class Search:
 
             highlights = self.html_parser.unescape(highlights)
             html = self.markdown(highlights)
+            html = re.sub(r'\n','<br />',html)
             sr.content_highlight = html
 
             search_results.append(sr)
 
         return search_results
 
-        # ------------------
-        # github issues
-        # create search results
-
 
     def search(self, query_list, fields=None):
 
         with self.ix.searcher() as searcher:
             query_string = " ".join(query_list)
             query = None
@@ -626,29 +947,27 @@ class Search:
     def get_document_total_count(self):
         p = QueryParser("kind", schema=self.ix.schema)
 
-        kind_labels = {
-                "documents" : "gdoc",
-                "issues" : "issue",
-                "comments" : "comment"
-        }
         counts = {
-                "documents" : None,
-                "issues" : None,
-                "comments" : None,
+                "gdoc" : None,
+                "issue" : None,
+                "ghfile" : None,
+                "markdown" : None,
                 "total" : None
         }
-        for key in kind_labels:
-            kind = kind_labels[key]
-            q = p.parse(kind)
+        for key in counts.keys():
+            q = p.parse(key)
             with self.ix.searcher() as s:
                 results = s.search(q,limit=None)
                 counts[key] = len(results)
 
-        counts['total'] = self.ix.searcher().doc_count_all()
+        counts['total'] = sum(counts[k] for k in counts.keys())
 
         return counts
 
 if __name__ == "__main__":
+
+    raise Exception("Error: main method not implemented (fix groupsio credentials first)")
+
     search = Search("search_index")
 
     from get_centillion_config import get_centillion_config
config_centillion.json

@@ -1,6 +1,27 @@
 {
     "repositories" : [
+        "dcppc/project-management",
+        "dcppc/nih-demo-meetings",
+        "dcppc/internal",
+        "dcppc/organize",
+        "dcppc/dcppc-bot",
+        "dcppc/full-stacks",
+        "dcppc/design-guidelines-discuss",
+        "dcppc/dcppc-deliverables",
+        "dcppc/dcppc-milestones",
+        "dcppc/crosscut-metadata",
+        "dcppc/lucky-penny",
+        "dcppc/dcppc-workshops",
+        "dcppc/metadata-matrix",
+        "dcppc/data-stewards",
+        "dcppc/dcppc-phase1-demos",
+        "dcppc/apis",
         "dcppc/2018-june-workshop",
-        "dcppc/2018-july-workshop"
+        "dcppc/2018-july-workshop",
+        "dcppc/2018-august-workshop",
+        "dcppc/2018-september-workshop",
+        "dcppc/design-guidelines",
+        "dcppc/2018-may-workshop",
+        "dcppc/centillion"
     ]
 }
config_flask.example.py — new file, 20 lines

@@ -0,0 +1,20 @@
+# Location of index file
+INDEX_DIR = "search_index"
+
+# oauth client deets
+GITHUB_OAUTH_CLIENT_ID = "XXX"
+GITHUB_OAUTH_CLIENT_SECRET = "YYY"
+GITHUB_TOKEN = "ZZZ"
+
+# More information footer: Repository label
+FOOTER_REPO_ORG = "charlesreid1"
+FOOTER_REPO_NAME = "centillion"
+
+# Toggle to show Whoosh parsed query
+SHOW_PARSED_QUERY=True
+
+TAGLINE = "Search All The Things"
+
+# Flask settings
+DEBUG = True
+SECRET_KEY = 'WWWWW'

Deleted file (old Flask config; file name not shown in this view) — 9 lines

@@ -1,9 +0,0 @@
-# Location of index file
-INDEX_DIR = "search_index"
-
-# Toggle to show Whoosh parsed query
-SHOW_PARSED_QUERY=True
-
-# Flask settings
-DEBUG = True
-SECRET_KEY = '42c5a8eda356ca9d9c3ab2d149541e6b91d843fa'
docs/centillion_components.md — new file, 22 lines

@@ -0,0 +1,22 @@
+# Centillion Components
+
+Centillion keeps it simple.
+
+There are two components:
+
+* The `Search` object, which uses whoosh and various
+  APIs (Github, Google Drive) to build and manage
+  the search index. The `Search` object also runs all
+  queries against the search index. (See the
+  [Centillion Whoosh](centillion_whoosh.md) page
+  or the `centillion_search.py` file
+  for details.)
+
+* Flask app, which uses Jinja templates to present the
+  user with a minimal web frontend that allows them
+  to interact with the search engine. (See the
+  [Centillion Flask](centillion_flask.md) page
+  or the `centillion.py` file
+  for details.)
30 docs/centillion_flask.md Normal file
@@ -0,0 +1,30 @@
# Centillion Flask

## What the Flask server does

Flask is a web server framework
that allows developers to define
behavior for specific endpoints,
such as `/hello_world`, or
<http://localhost:5000/hello_world>
on a web server running locally.

## Flask server routes

- `/home`
    - if not logged in, this redirects to a "log into github" landing page (not implemented yet)
    - if logged in, this redirects to the search route

- `/search`
    - renders the search template

- `/main_index_update`
    - updates the main index (re-indexes all documents)

- `/control_panel`
    - the control panel, where you can trigger
      the search index to be rebuilt
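A minimal sketch of how routes like these might look in Flask. The route names follow the list above; the handler bodies are placeholders, not centillion's actual implementation.

```python
from flask import Flask, redirect, render_template, request, url_for

app = Flask(__name__)

@app.route("/home")
def home():
    # With a Github login wired up, unauthenticated users would be sent
    # to a landing page here; for now, go straight to search.
    return redirect(url_for("search"))

@app.route("/search")
def search():
    query = request.args.get("query", "")
    return render_template("search.html", query=query)

@app.route("/main_index_update")
def main_index_update():
    # Trigger a full re-index of every document, then return to search.
    return redirect(url_for("search"))

@app.route("/control_panel")
def control_panel():
    return render_template("controlpanel.html")

if __name__ == "__main__":
    app.run(port=5000)
```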
34 docs/centillion_whoosh.md Normal file
@@ -0,0 +1,34 @@
# Centillion Whoosh

The `centillion_search.py` file defines a
`Search` class that serves as the backend
for centillion.

## What the Search class does

The `Search` class has two roles:
- create (and update) the search index
    - this also requires the `Search` class
      to define the schema for storing documents
- run queries against the search index,
  and package results up for Flask and Jinja

## Search class functions

The `Search` class defines several functions:

- `open_index()` creates the schema

- `add_issue()`, `add_md()`, `add_document()` have three different method signatures
  and add different types of documents to the search index

- `update_all_issues()`, `update_all_md()`, and `update_all_documents()` iterate over items
  and determine whether each item needs to be updated in the search index

- `update_main_index()` - updates the entire search index
    - calls all three `update_all_*` methods

- `create_search_results()` - packages results up for Jinja

- `search()` - runs the query and passes results to the Jinja packager
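For concreteness, a small sketch of the whoosh pieces the `Search` class is described as wrapping: defining a schema, opening an index, adding a document, and running a query. The field names and index directory here are illustrative assumptions, not centillion's actual schema.

```python
import os
import whoosh.index
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import MultifieldParser

# Hypothetical schema; centillion's real schema has more fields
# (kind, owner, repo, and so on).
schema = Schema(
    id=ID(stored=True, unique=True),
    url=ID(stored=True),
    title=TEXT(stored=True),
    content=TEXT(stored=True),
)

index_dir = "search_index"
if not os.path.exists(index_dir):
    os.mkdir(index_dir)
    ix = whoosh.index.create_in(index_dir, schema)
else:
    ix = whoosh.index.open_dir(index_dir)

# Add (or update) a document.
writer = ix.writer()
writer.update_document(id="doc-1", url="https://example.com",
                       title="Hello", content="hello centillion world")
writer.commit()

# Run a query across the title and content fields.
with ix.searcher() as searcher:
    query = MultifieldParser(["title", "content"], schema=ix.schema).parse("centillion")
    for hit in searcher.search(query):
        print(hit["title"], hit["url"])
```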
BIN docs/images/cp.png Normal file (binary file not shown; 498 KiB)
BIN docs/images/ss.png Normal file (binary file not shown; 355 KiB)
@@ -1,30 +1,31 @@
-# The Centillion
+# Centillion
 
-**the centillion**: a pan-github-markdown-issues-google-docs search engine.
+**centillion**: a pan-github-markdown-issues-google-docs search engine.
 
 **a centillion**: a very large number consisting of a 1 with 303 zeros after it.
 
-the centillion is 3.03 log-times better than the googol.
+centillion is 3.03 log-times better than the googol.
 
-## what is it
+## What is centillion
 
-The centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
+Centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
 a Python library for building search engines.
 
-We define the types of documents the centillion should index,
-what info and how. The centillion then builds and
+We define the types of documents centillion should index,
+what info and how. Centillion then builds and
 updates a search index. That's all done in `centillion_search.py`.
 
-The centillion also provides a simple web frontend for running
+Centillion also provides a simple web frontend for running
 queries against the search index. That's done using a Flask server
 defined in `centillion.py`.
 
-The centillion keeps it simple.
+Centillion keeps it simple.
 
 
-## quickstart
+## Quickstart
 
-Run the centillion app with a github access token API key set via
+Run centillion with a github access token API key set via
 environment variable:
 
 ```
@@ -34,21 +35,50 @@ GITHUB_TOKEN="XXXXXXXX" python centillion.py
 This will start a Flask server, and you can view the minimal search engine
 interface in your browser at <http://localhost:5000>.
 
-## work that is done
+## Configuration
 
-See [standalone.md](standalone.md) for the summary of
-the three standalone whoosh servers that were built:
-one for a folder of markdown files, one for github issues
-and comments, and one for google drive documents.
+### Centillion configuration
+
+`config_centillion.json` defines configuration variables
+for centillion - namely, what to index, and how, and where.
 
-## work that is being done
+### Flask configuration
 
-See [workinprogress.md](workinprogress.md) for details about
-work in progress.
+`config_flask.py` defines configuration variables
+used by flask, which controls the web frontend
+for centillion.
 
-## work that is planned
+## Control Panel/Rebuilding Search Index
 
-See [plans.md](plans.md)
+To rebuild the search engine, visit the control panel route (`/control_panel`),
+for example at <http://localhost:5000/control_panel>.
+
+This allows you to rebuild the search engine index. The search index
+is stored in the `search_index/` directory, and that directory
+can be configured with centillion's configuration file.
+
+The diff search index is faster to build, as it only
+indexes documents that have been added since the last
+new document was added to the search index.
+
+The main search index is slower to build, as it will
+re-index everything.
+
+(Cron scripts? Threaded task that runs hourly?)
+
+## Details
+
+More on the details of how centillion works.
+
+Under the hood, centillion uses flask and whoosh.
+Flask builds and runs the web server.
+Whoosh handles search requests and management
+of the search index.
+
+[Centillion Components](centillion_components.md)
+
+[Centillion Flask](centillion_flask.md)
+
+[Centillion Whoosh](centillion_whoosh.md)
@@ -29,8 +29,7 @@ class GDrive(object):
             ):
         """
         Set up the Google Drive API instance.
-        Factory method: create it and hand it over.
-        Then we're finished.
+        Factory method: create it here, hand it over in get_service().
         """
         self.credentials_file = credentials_file
         self.client_secret_file = client_secret_file
@@ -40,6 +39,9 @@ class GDrive(object):
         self.store = file.Storage(credentials_file)
 
     def get_service(self):
+        """
+        Return an instance of the Google Drive API service.
+        """
 
         creds = self.store.get()
         if not creds or creds.invalid:
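A brief sketch of how this factory pattern would be used from the indexing code. Only the `get_service()` call appears in the diff above; the constructor arguments and the Drive API call are assumptions for illustration.

```python
# Hypothetical usage of the GDrive helper shown above; the real
# constructor arguments in gdrive_util.py may differ.
from gdrive_util import GDrive

gd = GDrive(credentials_file="credentials.json",
            client_secret_file="client_secret.json")
service = gd.get_service()

# With an authorized service object, the indexer can list files
# using the standard Google Drive v3 API:
results = service.files().list(pageSize=10, fields="files(id, name)").execute()
for f in results.get("files", []):
    print(f["name"], f["id"])
```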
382 groupsio_util.py Normal file
@@ -0,0 +1,382 @@
import requests, os, re
from bs4 import BeautifulSoup


class GroupsIOArchivesCrawler(object):
    """
    This is a Groups.io spider
    designed to crawl the email
    archives of a group.

    credentials (dictionary):
        groupsio_token : api access token
        groupsio_username : username
        groupsio_password : password
    """
    def __init__(self,
                 credentials,
                 group_name):
        # template url for archives page (list of topics)
        self.url = "https://{group}.groups.io/g/{subgroup}/topics"
        self.login_url = "https://groups.io/login"

        self.credentials = credentials
        self.group_name = group_name
        self.crawled_archives = False
        self.archives = None


    def get_archives(self):
        """
        Return a list of dictionaries containing
        information about each email topic in the
        groups.io email archive.

        Call crawl_group_archives() first!
        """
        return self.archives


    def get_subgroups_list(self):
        """
        Use the API to get a list of subgroups.
        """
        subgroups_url = 'https://api.groups.io/v1/getsubgroups'

        key = self.credentials['groupsio_token']

        data = [('group_name', self.group_name),
                ('limit', 100)]
        response = requests.post(subgroups_url,
                                 data=data,
                                 auth=(key, ''))
        response = response.json()
        data = response['data']

        subgroups = {}
        for group in data:
            k = group['id']
            v = re.sub(r'dcppc\+', '', group['name'])
            subgroups[k] = v

        return subgroups


    def crawl_group_archives(self):
        """
        Spider will crawl the email archives of the entire group
        by crawling the email archives of each subgroup.
        """
        subgroups = self.get_subgroups_list()

        # ------------------------------
        # Start by logging in.

        # Create session object to persist session data
        session = requests.Session()

        # Log in to the website
        data = dict(email=self.credentials['groupsio_username'],
                    password=self.credentials['groupsio_password'],
                    timezone='America/Los_Angeles')

        r = session.post(self.login_url,
                         data=data)

        csrf = self.get_csrf(r)

        # ------------------------------
        # For each subgroup, crawl the archives
        # and return a list of dictionaries
        # containing all the email threads.
        for subgroup_id in subgroups.keys():
            self.crawl_subgroup_archives(session,
                                         csrf,
                                         subgroup_id,
                                         subgroups[subgroup_id])

        # Done. archives are now tucked away
        # in the variable self.archives
        #
        # self.archives is a list of dictionaries,
        # with each dictionary containing info about
        # a topic/email thread in a subgroup.
        # ------------------------------


    def crawl_subgroup_archives(self, session, csrf, subgroup_id, subgroup_name):
        """
        This kicks off the process to crawl the entire
        archives of a given subgroup on groups.io.

        For a given subgroup the url is self.url,

            https://{group}.groups.io/g/{subgroup}/topics

        This is the first of a paginated list of topics.
        Procedure is:
        - passed a starting page (or its contents)
        - iterate through all topics via the HTML page elements
        - assemble a bundle of information about each topic:
            - topic title, by, URL, date, content, permalink
            - content filtering:
                - ^From, Reply-To, Date, To, Subject
                - Lines containing phone numbers
                    - 9 digits
                    - XXX-XXX-XXXX, (XXX) XXX-XXXX
                    - XXXXXXXXXX, XXX XXX XXXX
                - ^Work: or (Work) or Work$
                - Home, Cell, Mobile
                - +1 XXX
                - \w@\w
        - while next button is not greyed out,
            - click the next button

        everything stored in self.archives:
        list of dictionaries.
        """
        # Accumulate threads across subgroups instead of clobbering
        # results collected from a previously crawled subgroup.
        if self.archives is None:
            self.archives = []

        prefix = "https://{group}.groups.io".format(group=self.group_name)

        url = self.url.format(group=self.group_name,
                              subgroup=subgroup_name)

        # ------------------------------

        # Now get the first page
        r = session.get(url)

        # ------------------------------
        # Fencepost algorithm:

        # First page:

        # Extract a list of (title, link) items
        items = self.extract_archive_page_items_(r)

        # Get the next link
        next_url = self.get_next_url_(r)

        # Now add each item to the archive of threads,
        # then find the next button.
        self.add_items_to_archives_(session, subgroup_name, items)

        if next_url is None:
            return
        else:
            full_next_url = prefix + next_url

        # Now click the next button
        next_request = requests.get(full_next_url)

        while next_request.status_code == 200:
            items = self.extract_archive_page_items_(next_request)
            next_url = self.get_next_url_(next_request)
            self.add_items_to_archives_(session, subgroup_name, items)
            if next_url is None:
                return
            else:
                full_next_url = prefix + next_url
                next_request = requests.get(full_next_url)


    def add_items_to_archives_(self, session, subgroup_name, items):
        """
        Given a set of items from a list of threads,
        items being title and link,
        get the page and store all info
        in the self.archives variable
        (list of dictionaries)
        """
        for (title, link) in items:
            # Get the thread page:
            prefix = "https://{group}.groups.io".format(group=self.group_name)
            full_link = prefix + link
            r = session.get(full_link)
            soup = BeautifulSoup(r.text, 'html.parser')

            # soup contains the entire thread

            # What are we extracting:
            # 1. thread number
            # 2. permalink
            # 3. content/text (filtered)

            # - - - - - - - - - - - - - -
            # 1. topic/thread number:
            # <a rel="nofollow" href="">
            # where link is:
            # https://{group}.groups.io/g/{subgroup}/topic/{topic_id}
            # example topic id: 24209140
            #
            # ugly links are in the form
            # https://dcppc.groups.io/g/{subgroup}/topic/some_text_here/{thread_id}?p=,,,,,1,2,3,,,4,,5
            # split at ?, 0th portion
            # then split at /, last (-1th) portion
            topic_id = link.split('?')[0].split('/')[-1]

            # - - - - - - - - - - - - - - -
            # 2. permalink:
            # - current link is ugly link
            # - permalink is the nice one
            # - topic id is available from the ugly link
            # https://{group}.groups.io/g/{subgroup}/topic/{topic_id}

            permalink_template = "https://{group}.groups.io/g/{subgroup}/topic/{topic_id}"
            permalink = permalink_template.format(
                group=self.group_name,
                subgroup=subgroup_name,
                topic_id=topic_id
            )

            # - - - - - - - - - - - - - - -
            # 3. content:

            # Need to rearrange how we're assembling threads here.
            # This is one thread, no?
            content = []

            subject = soup.find('title').text

            # Extract information for the schema:
            # - permalink for thread (done)
            # - subject/title (done)
            # - original sender email/name (done)
            # - content (done)

            # Groups.io pages have zero CSS classes, which makes everything
            # a giant pain in the neck to interact with. Thanks Groups.io!
            original_sender = ''
            for i, tr in enumerate(soup.find_all('tr', {'class': 'test'})):
                # Every other tr row contains an email.
                if (i + 1) % 2 == 0:
                    # nope, no email here
                    pass
                else:
                    # found an email!
                    # this is a maze, thanks groups.io
                    td = tr.find('td')
                    divrow = td.find('div', {'class': 'row'}).find('div', {'class': 'pull-left'})
                    if (i + 1) == 1:
                        original_sender = divrow.text.strip()
                    for div in td.find_all('div'):
                        if div.has_attr('id'):

                            # purge any signatures
                            for x in div.find_all('div', {'id': 'Signature'}):
                                x.extract()

                            # purge any headers
                            for x in div.find_all('div'):
                                nonos = ['From:', 'Sent:', 'To:', 'Cc:', 'CC:', 'Subject:']
                                for nono in nonos:
                                    if nono in x.text:
                                        x.extract()

                            message_text = div.get_text()

                            # More filtering:

                            # phone numbers
                            message_text = re.sub(r'[0-9]{3}-[0-9]{3}-[0-9]{4}', 'XXX-XXX-XXXX', message_text)
                            message_text = re.sub(r'[0-9]{10}', 'XXXXXXXXXX', message_text)

                            content.append(message_text)

            full_content = "\n".join(content)

            thread = {
                'permalink': permalink,
                'subject': subject,
                'original_sender': original_sender,
                'content': full_content
            }

            print('*' * 40)
            for k in thread.keys():
                if k == 'content':
                    pass
                else:
                    print("%s : %s" % (k, thread[k]))
            print('*' * 40)
            self.archives.append(thread)


    def extract_archive_page_items_(self, response):
        """
        (Private method)

        Given a response from a GET request,
        use beautifulsoup to extract all items
        (thread titles and ugly thread links)
        and pass them back in a list.
        """
        soup = BeautifulSoup(response.content, "html.parser")
        rows = soup.find_all('tr', {'class': 'test'})
        if 'rate limited' in soup.text:
            raise Exception("Error: rate limit in place for Groups.io")

        results = []
        for row in rows:
            # We don't care about anything except title and ugly link
            subject = row.find('span', {'class': 'subject'})
            title = subject.get_text()
            link = row.find('a')['href']
            print(title)
            results.append((title, link))

        return results


    def get_next_url_(self, response):
        """
        (Private method)

        Given a response (which is a list of threads),
        find the next button and return the URL.

        If there is no next URL, or the next button is disabled, return None.
        """
        soup = BeautifulSoup(response.text, 'html.parser')
        chevron = soup.find('i', {'class': 'fa-chevron-right'})
        try:
            if '#' in chevron.parent['href']:
                # empty link, abort
                return None
        except AttributeError:
            # no next button on the page at all, abort
            return None

        if chevron.parent.parent.has_attr('class') and 'disabled' in chevron.parent.parent['class']:
            # no next link, abort
            return None

        return chevron.parent['href']


    def get_csrf(self, resp):
        """
        Find the CSRF token embedded in the subgroup page
        """
        soup = BeautifulSoup(resp.text, 'html.parser')
        csrf = ''
        for i in soup.find_all('input'):
            # Note that i.name is different from i['name']
            # the first is the actual tag,
            # the second is the attribute name="xyz"
            if i['name'] == 'csrf':
                csrf = i['value']

        if csrf == '':
            err = "ERROR: Could not find csrf token on page."
            raise Exception(err)

        return csrf
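A short sketch of how this crawler is presumably driven from the indexing code. The credential keys match the class docstring above, and the call sequence follows the methods defined here; the group name and call site are assumptions for illustration.

```python
# Hypothetical driver for GroupsIOArchivesCrawler; the real call site
# inside centillion may differ.
from groupsio_util import GroupsIOArchivesCrawler

credentials = {
    'groupsio_token':    'XXX',
    'groupsio_username': 'user@example.com',
    'groupsio_password': 'XXXXX',
}

crawler = GroupsIOArchivesCrawler(credentials, group_name='dcppc')
crawler.crawl_group_archives()          # log in, then walk every subgroup
for thread in crawler.get_archives():   # list of dicts: permalink, subject, ...
    print(thread['subject'], thread['permalink'])
```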
BIN img/ss.png (binary file not shown; previous size: 356 KiB)
19 install_pandoc.sh Executable file
@@ -0,0 +1,19 @@
#!/bin/bash
#
# for ubuntu

if [ "$(id -u)" != "0" ]; then
    echo ""
    echo ""
    echo "This script should be run as root."
    echo ""
    echo ""
    exit 1;
fi

OFILE="/tmp/pandoc.deb"
curl -L https://github.com/jgm/pandoc/releases/download/2.2.2.1/pandoc-2.2.2.1-1-amd64.deb -o ${OFILE}
dpkg -i ${OFILE}
rm -f ${OFILE}
Submodule mkdocs-material deleted from 6569122bb1
1 mkdocs-material-dib Submodule
Submodule mkdocs-material-dib added at c3dd912f3c
@@ -9,3 +9,5 @@ PyGithub>=1.39
 pypandoc>=1.4
 requests>=2.19
 pandoc>=1.0
+flask-dance>=1.0.0
+beautifulsoup4>=4.6
7 static/bootstrap.min.js vendored Normal file
File diff suppressed because one or more lines are too long
2 static/jquery.min.js vendored Normal file
File diff suppressed because one or more lines are too long
@@ -1,17 +1,38 @@
+span.badge {
+    vertical-align: text-bottom;
+}
+
-li.search-group-item {
-    position: relative;
-    display: block;
-    padding: 0px;
-    margin-bottom: -1px;
-    background-color: #fff;
-    border: 1px solid #ddd;
+a.badgelinks, a.badgelinks:hover {
+    color: #fff;
+    text-decoration: none;
 }
 
 div.list-group {
     border: 1px solid rgba(86,61,124,.2);
 }
 
+li.list-group-item {
+    position: relative;
+    display: block;
+
+    /*padding: 20px 10px;*/
+    margin-bottom: -1px;
+
+    background-color: #f8f8f8;
+    border: 1px solid #ddd;
+}
+
+li.search-group-item {
+    position: relative;
+    display: block;
+
+    padding: 0px;
+    margin-bottom: -1px;
+
+    background-color: #fff;
+    border: 1px solid #ddd;
+}
+
 div.url {
     background-color: rgba(86,61,124,.15);
     padding: 8px;
108 templates/controlpanel.html Executable file
@@ -0,0 +1,108 @@
{% extends "layout.html" %}
{% block body %}

{% with messages = get_flashed_messages() %}
  {% if messages %}
  <div class="container">
    <div class="alert alert-success alert-dismissible">
      <a href="#" class="close" data-dismiss="alert" aria-label="close">×</a>
      <ul class=flashes>
      {% for message in messages %}
        <li>{{ message }}</li>
      {% endfor %}
      </ul>
    </div>
  </div>
  {% endif %}
{% endwith %}

<div class="container">

  <div class="row">
    <div class="col-md-12">
      <center>
        <a href="{{ url_for('search')}}?query=&fields=">
          <img src="{{ url_for('static', filename='centillion_white.png') }}">
        </a>
        {% if config['TAGLINE'] %}
        <h2><a href="{{ url_for('search')}}?query=&fields=">
          {{config['TAGLINE']}}
        </a></h2>
        {% endif %}
      </center>
    </div>
  </div>

  {% if config['zzzTAGLINE'] %}
  <div class="row">
    <div class="col12sm">
      <center>
        <h2><a href="{{ url_for('search')}}?query=&fields=">
          {{config['TAGLINE']}}
        </a></h2>
      </center>
    </div>
  </div>
  {% endif %}

</div>

<hr />

<div class="container">

  <div class="row">

    {# update main search index #}
    <div class="panel panel-danger">
      <div class="panel-heading">
        <h3 class="panel-title">
          Update Main Search Index
        </h3>
      </div>
      <div class="panel-body">
        <div class="container-fluid">
          <div class="row">
            <div class="col-md-12">
              <p class="panel-text">Re-index <i>every</i> document in the
              remote collection in the search index. <b>Warning: this operation may take a while.</b>
              <p/> <p>
              <a href="{{ url_for('update_index') }}" class="btn btn-large btn-danger">Update Main Index</a>
              <p/>
            </div>
          </div>
        </div>
      </div>
    </div>

    {# update diff search index #}
    <div class="panel panel-danger">
      <div class="panel-heading">
        <h3 class="panel-title">
          Update Diff Search Index
        </h3>
      </div>
      <div class="panel-body">
        <div class="container-fluid">
          <div class="row">
            <div class="col-md-12">
              <p class="panel-text">Diff search index only re-indexes documents created after the last
              search index update. <b>Not currently implemented.</b>
              <p/> <p>
              <a href="#" class="btn btn-large disabled btn-danger">Update Diff Index</a>
              <p/>
            </div>
          </div>
        </div>
      </div>
    </div>

  </div>

</div>

{% endblock %}
@@ -1,11 +1,12 @@
 <!doctype html>
-<title>Markdown Search</title>
+<title>Centillion Search Engine</title>
 <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='style.css') }}">
 <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='github-markdown.css') }}">
 <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='bootstrap.min.css') }}">
+
+<script src="{{ url_for('static', filename='jquery.min.js') }}"></script>
+<script src="{{ url_for('static', filename='bootstrap.min.js') }}"></script>
+
 <div>
-    {% for message in get_flashed_messages() %}
-        <div class="flash">{{ message }}</div>
-    {% endfor %}
     {% block body %}{% endblock %}
 </div>
@@ -4,34 +4,33 @@
 
 <div class="container">
 
+    {#
+        banner image
+    #}
     <div class="row">
         <div class="col12sm">
             <center>
                 <a href="{{ url_for('search')}}?query=&fields=">
                     <img src="{{ url_for('static', filename='centillion_white.png') }}">
                 </a>
+                {#
+                    need a tag line
+                #}
+                {% if config['TAGLINE'] %}
+                <h2><a href="{{ url_for('search')}}?query=&fields=">
+                    {{config['TAGLINE']}}
+                </a></h2>
+                {% endif %}
             </center>
         </div>
     </div>
 
+</div>
+
+<div class="container">
     <div class="row">
-        <div class="col12sm">
-            <center>
-                <h2>
-                    <a href="{{ url_for('search')}}?query=&fields=">
-                        Search the DCPPC
-                    </a>
-                </h2>
-            </center>
-        </div>
-    </div>
-
-    <div class="row">
-        <div class="col-12">
+        <div class="col-xs-12">
             <center>
-                <a class="index" href="{{ url_for('update_index')}}">[update index]</a>
-                <a class="index" href="{{ url_for('update_index')}}?rebuild=True">[rebuild index]</a>
                 <form action="{{ url_for('search') }}" name="search">
                     <input type="text" name="query" value="{{ query }}"> <br />
                     <button type="submit" style="font-size: 20px; padding: 10px; padding-left: 50px; padding-right: 50px;"
@@ -48,8 +47,8 @@
     <div class="row">
 
         {% if directories %}
-        <div class="col-12 info directories-cloud">
-        File directories: 
+        <div class="col-xs-12 info directories-cloud">
+        <b>File directories:</b>
         {% for d in directories %}
             <a href="{{url_for('search')}}?query={{d|trim}}&fields=filename">{{d|trim}}</a>
         {% endfor %}
@@ -60,25 +59,40 @@
 
     {% if config['SHOW_PARSED_QUERY'] and parsed_query %}
     <li class="list-group-item">
-        <div class="col-12 info">
-            <b>Parsed query:</b> {{ parsed_query }}
-        </div>
+        <div class="container-fluid">
+            <div class="row">
+                <div class="col-xs-12 info">
+                    <b>Parsed query:</b> {{ parsed_query }}
+                </div>
+            </div>
+        </div>
     </li>
     {% endif %}
 
     {% if parsed_query %}
     <li class="list-group-item">
-        <div class="col-12 info">
-            <b>Found:</b> {{entries|length}} documents with results, out of {{totals["total"]}} total documents
-        </div>
+        <div class="container-fluid">
+            <div class="row">
+                <div class="col-xs-12 info">
+                    <b>Found:</b> <span class="badge">{{entries|length}}</span> results
+                    out of <span class="badge">{{totals["total"]}}</span> total items indexed
+                </div>
+            </div>
+        </div>
     </li>
     {% endif %}
 
     <li class="list-group-item">
-        <div class="col-12 info">
-            <b>Indexing:</b> {{totals["documents"]}} Google Documents,
-            {{totals["issues"]}} Github issues, and
-            {{totals["comments"]}} Github comments
-        </div>
+        <div class="container-fluid">
+            <div class="row">
+                <div class="col-xs-12 info">
+                    <b>Indexing:</b> <span
+                    class="badge">{{totals["gdoc"]}}</span> Google Documents,
+                    <span class="badge">{{totals["issue"]}}</span> Github issues,
+                    <span class="badge">{{totals["ghfile"]}}</span> Github files,
+                    <span class="badge">{{totals["markdown"]}}</span> Github markdown files.
+                </div>
+            </div>
+        </div>
     </li>
 
@@ -95,35 +109,46 @@
 
     <div class="url">
         {% if e.kind=="gdoc" %}
-            <b>Google Drive File:</b>
-            <a href='{{e.url}}'>{{e.title}}</a>
-            ({{e.owner_name}}, {{e.owner_email}})
-        {% elif e.kind=="comment" %}
-            <b>Comment:</b>
-            <a href='{{e.url}}'>Comment (link)</a>
-            {% if e.github_user %}
-            by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
-            {% endif %}
-            on issue <a href='{{e.issue_url}}'>{{e.issue_title}}</a>
-            <br/>
-            <b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
-            {% if e.github_user %}
-            {% endif %}
+            {% if e.mimetype=="" %}
+                <b>Google Document:</b>
+                <a href='{{e.url}}'>{{e.title}}</a>
+                (Owner: {{e.owner_name}}, {{e.owner_email}})<br />
+                <b>Document Type</b>: {{e.mimetype}}
+            {% else %}
+                <b>Google Drive:</b>
+                <a href='{{e.url}}'>{{e.title}}</a>
+                (Owner: {{e.owner_name}}, {{e.owner_email}})
+            {% endif %}
 
         {% elif e.kind=="issue" %}
-            <b>Issue:</b>
-            <a href='{{e.issue_url}}'>{{e.issue_title}}</a>
+            <b>Github Issue:</b>
+            <a href='{{e.url}}'>{{e.title}}</a>
             {% if e.github_user %}
-            by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
+            opened by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
             {% endif %}
             <br/>
             <b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
 
+        {% elif e.kind=="markdown" %}
+            <b>Github Markdown:</b>
+            <a href='{{e.url}}'>{{e.title}}</a>
+            <br/>
+            <b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
+
         {% else %}
             <b>Item:</b> (<a href='{{e.url}}'>link</a>)
 
         {% endif %}
         <br />
-        score: {{'%d' % e.score}}
+        Score: {{'%d' % e.score}}
+    </div>
+    <div class="markdown-body">
+        {% if e.content_highlight %}
+            {{ e.content_highlight|safe}}
+        {% else %}
+            <p>(A preview of this document is not available.)</p>
+        {% endif %}
     </div>
-    <div class="markdown-body">{{ e.content_highlight|safe}}</div>
 
     </li>
     {% endfor %}
@@ -134,17 +159,29 @@
 
 <div class="container">
     <div class="row">
-        <div class="col-12">
-            <div class="last-searches">Last searches: <br/>
-            {% for s in last_searches %}
-                <span><a href="{{url_for('search')}}?{{s}}">{{s}}</a></span>
-            {% endfor %}
-            </div>
-            <p>
-            More info can be found in the <a href="https://github.com/BernhardWenzel/markdown-search">README.md file</a>
-            </p>
-        </div>
+        <ul class="list-group">
+
+        {% if config['FOOTER_REPO_NAME'] %}
+        {% if config['FOOTER_REPO_ORG'] %}
+
+            <li class="list-group-item">
+                <div class="container-fluid">
+                    <div class="row">
+                        <div class="col-xs-12 info">
+                            More information about {{config['FOOTER_REPO_NAME']}} can be found
+                            in the <a href="https://github.com/{{config['FOOTER_REPO_ORG']}}/{{config['FOOTER_REPO_NAME']}}">{{config['FOOTER_REPO_ORG']}}/{{config['FOOTER_REPO_NAME']}}</a>
+                            repository on Github.
+                        </div>
+                    </div>
+                </div>
+            </li>
+
+        {% endif %}
+        {% endif %}
+
+        </ul>
+
     </div>
 </div>
 
 {% endblock %}