38 Commits
v0.3 ... 1.0

Author SHA1 Message Date
1a04814edf Merge pull request #9 from dcppc/ACharbonneau-patch-1
Update config_centillion.json
2018-08-07 16:09:45 -07:00
Amanda Charbonneau
3fb72d409b Update config_centillion.json
I fixed it
2018-08-07 18:24:32 -04:00
d89e01221a Merge pull request #8 from dcppc/dcppc-test
Fix the name of the milestones repo: 'dcppc-milestones' not 'milestones'
2018-08-07 14:59:06 -07:00
6736f3f8ad add centillion configuration json file 2018-08-07 14:54:56 -07:00
abd13aba29 Merge pull request #7 from dcppc/fix-docstrings
Fix docstrings
2018-08-07 14:43:42 -07:00
13e49cdaa6 improve docstrings on gdrive_util.py too 2018-08-07 14:42:19 -07:00
83b2ce17fb fix docstrings in centillion_search.py 2018-08-07 14:41:26 -07:00
5be0709070 Merge pull request #6 from dcppc/fix-docs
Move images, resize images, update image markdown in readme
2018-08-07 13:02:08 -07:00
9edd95a78d Merge branch 'fix-docs'
* fix-docs:
  Move images, resize images, update image markdown in readme
  update readme to use <img> tags
  merge image files in from master
  fix <title>
  fix the readme to reflect current state of things/links/descriptions
  fix typos/wording in readme
  adding changes to enable https, update callback to http, and everything still passes through https (proxy)
  update footer repo info
  update screen shots
  add mkdocs-material-dib submodule
  remove mkdocs material submodule
  update tagline
  update tagline
  add _example_ config file for flask
2018-08-07 12:50:29 -07:00
37615d8707 Move images, resize images, update image markdown in readme 2018-08-07 12:40:38 -07:00
4b218f63b9 update readme to use <img> tags 2018-08-03 15:56:49 -07:00
4e17c890bc merge image files in from master 2018-08-03 15:53:51 -07:00
1129ec38e0 update the readme 2018-08-03 15:49:46 -07:00
875508c796 update screen shot images 2018-08-03 15:49:12 -07:00
abc7a2aedf fix <title> 2018-08-03 15:45:56 -07:00
8f1e5faefc update readme to reflect latest 2018-08-03 15:38:23 -07:00
d5f63e2322 Merge pull request #1 from dcppc/fix-readme
fix the readme to reflect current state of things/links/descriptions
2018-08-03 15:28:51 -07:00
84e5560423 fix the readme to reflect current state of things/links/descriptions 2018-08-03 15:28:16 -07:00
924c562c0a fix typos/wording in readme 2018-08-03 15:22:35 -07:00
13c410ac5e adding changes to enable https, update callback to http, and everything still passes through https (proxy) 2018-08-03 15:21:41 -07:00
4e79800e83 update footer repo info 2018-08-03 15:19:55 -07:00
5b9570d8cd Merge branch 'dcppc' of github.com:dcppc/centillion into dcppc
* 'dcppc' of github.com:dcppc/centillion:
  add mkdocs-material-dib submodule
  remove mkdocs material submodule
2018-08-03 14:54:25 -07:00
297a4b5977 update screen shots 2018-08-03 14:53:43 -07:00
69a6b5d680 add mkdocs-material-dib submodule 2018-08-03 13:51:13 -07:00
3feca1aba3 remove mkdocs material submodule 2018-08-03 13:50:37 -07:00
493581f861 update tagline 2018-08-03 13:38:00 -07:00
1b0ded809d update tagline 2018-08-03 13:36:56 -07:00
78e77c7cf2 add _example_ config file for flask 2018-08-03 13:34:27 -07:00
2f890d1aee Merge branch 'all-the-docs' of charlesreid1/centillion into master 2018-08-03 20:28:27 +00:00
937327f2cb update search template to treat drive files and documents differently. 2018-08-03 13:24:03 -07:00
ca0d88cfe6 index all the google drive things 2018-08-03 13:15:02 -07:00
5eda472072 improve handling of tokens for gh api, fix set ordering/logic 2018-08-03 13:07:46 -07:00
d943c14678 Merge branch 'master' into all-the-docs
* master:
  Update '.gitignore'
  no secrets plz
2018-08-03 12:37:49 -07:00
6be785a056 indexing all markdown is working. 2018-08-03 12:36:32 -07:00
65113a95f7 Update '.gitignore' 2018-08-03 17:52:04 +00:00
87c3f12c8f no secrets plz 2018-08-03 17:51:39 +00:00
933884e9ab search all the docs. search all the repos. 2018-08-03 10:29:52 -07:00
da9dea3f6b Merge branch 'github-markdown' of charlesreid1/centillion into master 2018-08-03 07:20:45 +00:00
15 changed files with 184 additions and 90 deletions

.gitignore (2 lines changed)

@@ -1,4 +1,4 @@
config_*
config_flask.py
vp
credentials.json
drive*.json

.gitmodules (6 lines changed)

@@ -1,3 +1,3 @@
[submodule "mkdocs-material"]
path = mkdocs-material
url = https://git.charlesreid1.com/charlesreid1/mkdocs-material.git
[submodule "mkdocs-material-dib"]
path = mkdocs-material-dib
url = https://github.com/dib-lab/mkdocs-material-dib.git


@@ -1,18 +1,19 @@
# The Centillion
**the centillion**: a pan-github-markdown-issues-google-docs search engine.
**centillion**: a pan-github-markdown-issues-google-docs search engine.
**a centillion**: a very large number consisting of a 1 with 303 zeros after it.
the centillion is 3.03 log-times better than the googol.
one centillion is 3.03 log-times better than a googol.
![Screen shot of centillion](img/ss.png)
![Screen shot of centillion](docs/images/ss.png)
## what is it
The centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
a Python library for building search engines.
Centillion (https://github.com/dcppc/centillion) is a search engine that can index
three kinds of collections: Google Documents, Github issues, and Markdown files in
Github repos.
We define the types of documents the centillion should index,
what info and how. The centillion then builds and
@@ -24,6 +25,30 @@ defined in `centillion.py`.
The centillion keeps it simple.
## authentication layer
Centillion lives behind a Github authentication layer, implemented with
[flask-dance](https://github.com/singingwolfboy/flask-dance). When you first
visit the site it will ask you to authenticate with Github so that it can
verify you have permission to access the site.
## technologies
Centillion is a Python program built using whoosh (search engine library). It
indexes the full text of docx files in Google Documents, just the filenames for
non-docx files. The full text of issues and their comments are indexed, and
results are grouped by issue. Centillion requires Google Drive and Github OAuth
apps. Once you provide credentials to Flask you're all set to go.
## control panel
There's also a control panel at <https://search.nihdatacommons.us/control_panel>
that allows you to rebuild the search index from scratch (the Google Drive indexing
takes a while).
![Screen shot of centillion control panel](docs/images/cp.png)
## quickstart (with Github auth)
@@ -31,6 +56,8 @@ Start by creating a Github OAuth application.
Get the public and private application key
(client token and client secret token)
from the Github application's page.
You will also need a Github access token
(in addition to the app tokens).
When you create the application, set the callback
URL to `/login/github/authorized`, as in:
@@ -65,11 +92,3 @@ as HTTP by Github, even though there is an HTTPS address, and
everything else seems fine, try deleting the Github OAuth app
and creating a new one.
## more info
For more info see the documentation: <https://charlesreid1.github.io/centillion>
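Reviewer note on the README tagline: the "3.03 log-times better" claim can be checked with a few lines of arithmetic. This is only a sketch of one reading of the phrase, assuming "log-times" means the ratio of the base-10 logarithms; the variable names are illustrative and do not come from the repository.

```python
import math

# A centillion is a 1 followed by 303 zeros; a googol is a 1 followed by 100 zeros
centillion = 10 ** 303
googol = 10 ** 100

# Count the zeros directly from the decimal representation
centillion_zeros = len(str(centillion)) - 1
googol_zeros = len(str(googol)) - 1

# Assumed meaning of "log-times": ratio of the base-10 logarithms
log_ratio = math.log10(centillion) / math.log10(googol)

print(centillion_zeros, googol_zeros, round(log_ratio, 2))
```

Under that reading the ratio is 303 / 100, which matches the 3.03 figure in the README.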


@@ -27,10 +27,10 @@ You provide:
class UpdateIndexTask(object):
def __init__(self, gh_oauth_token, diff_index=False):
def __init__(self, gh_access_token, diff_index=False):
self.diff_index = diff_index
thread = threading.Thread(target=self.run, args=())
self.gh_oauth_token = gh_oauth_token
self.gh_access_token = gh_access_token
thread.daemon = True
thread.start()
@@ -43,8 +43,8 @@ class UpdateIndexTask(object):
from get_centillion_config import get_centillion_config
config = get_centillion_config('config_centillion.json')
search.update_index_markdown(self.gh_oauth_token,config)
search.update_index_issues(self.gh_oauth_token,config)
search.update_index_issues(self.gh_access_token,config)
search.update_index_markdown(self.gh_access_token,config)
search.update_index_gdocs(config)
@@ -55,11 +55,11 @@ app.wsgi_app = ProxyFix(app.wsgi_app)
# Load default config and override config from an environment variable
app.config.from_pyfile("config_flask.py")
github_bp = make_github_blueprint()
#github_bp = make_github_blueprint(
# client_id = os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
# client_secret = os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
# scope='read:org')
#github_bp = make_github_blueprint()
github_bp = make_github_blueprint(
client_id = os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
client_secret = os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
scope='read:org')
app.register_blueprint(github_bp, url_prefix="/login")
@@ -172,11 +172,13 @@ def update_index():
mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
if mresp.status_code==204:
gh_oauth_token = github.token['access_token']
#gh_oauth_token = github.token['access_token']
gh_access_token = app.config['GITHUB_TOKEN']
# --------------------
# Business as usual
UpdateIndexTask(gh_oauth_token, diff_index=False)
UpdateIndexTask(gh_access_token,
diff_index=False)
flash("Rebuilding index, check console output")
return render_template("controlpanel.html",
totals={})
@@ -216,5 +218,6 @@ def oops(e):
return contents404
if __name__ == '__main__':
os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = 'true'
app.run(host="0.0.0.0",port=5000)
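Reviewer note: `UpdateIndexTask` in the hunks above starts a daemon thread from its constructor and hands the Github access token to the indexing calls. A stripped-down sketch of that pattern, standard library only, might look like the following. The `BackgroundTask` class, the `work` callable, and the `done` event are hypothetical additions for illustration; the real class calls the `search.update_index_*` methods instead.

```python
import threading

class BackgroundTask:
    """Run a callable on a daemon thread as soon as the object is created."""

    def __init__(self, token, work):
        # Store all state before starting the thread so run() sees it
        self.token = token
        self.work = work
        self.done = threading.Event()  # hypothetical: lets callers wait for completion
        self.thread = threading.Thread(target=self.run, daemon=True)
        self.thread.start()

    def run(self):
        try:
            self.work(self.token)
        finally:
            # Signal completion even if the work raised an exception
            self.done.set()

# Usage: the "work" here only records the token it was given
calls = []
task = BackgroundTask("ZZZ", calls.append)
task.done.wait(timeout=5)
print(calls)
```

Because the thread is a daemon, it will not keep the process alive on shutdown, which is presumably why the original marks its thread the same way.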


@@ -1,7 +1,7 @@
import shutil
import html.parser
from github import Github
from github import Github, GithubException
import base64
from gdrive_util import GDrive
@@ -252,7 +252,6 @@ class Search:
with open(fullpath_input, 'wb') as f:
f.write(r.content)
# Try to convert docx file to plain text
try:
output = pypandoc.convert_file(fullpath_input,
@@ -316,7 +315,7 @@ class Search:
# to a search index.
def add_issue(self, writer, issue, config, update=True):
def add_issue(self, writer, issue, gh_access_token, config, update=True):
"""
Add a Github issue/comment to a search index.
"""
@@ -368,7 +367,7 @@ class Search:
def add_markdown(self, writer, d, config, update=True):
def add_markdown(self, writer, d, gh_access_token, config, update=True):
"""
Use a Github markdown document API record
to add a markdown document's contents to
@@ -385,18 +384,27 @@ class Search:
_, fname = os.path.split(fpath)
_, fext = os.path.splitext(fpath)
print("Indexing markdown doc %s"%(fname))
print("Indexing markdown doc %s from repo %s"%(fname,repo_name))
# Unpack the requests response and decode the content
response = requests.get(furl)
jresponse = response.json()
content = ""
try:
binary_content = re.sub('\n','',jresponse['content'])
content = base64.b64decode(binary_content).decode('utf-8')
#
# don't forget the headers for private repos!
# useful: https://bit.ly/2LSAflS
except KeyError:
print(" > XXXXXXXX Failed to extract 'content' field. You probably hit the rate limit.")
headers = {'Authorization' : 'token %s'%(gh_access_token)}
response = requests.get(furl, headers=headers)
if response.status_code==200:
jresponse = response.json()
content = ""
try:
binary_content = re.sub('\n','',jresponse['content'])
content = base64.b64decode(binary_content).decode('utf-8')
except KeyError:
print(" > XXXXXXXX Failed to extract 'content' field. You probably hit the rate limit.")
else:
print(" > XXXXXXXX Failed to reach file URL. There may be a problem with authentication/headers.")
return
# Now create the actual search index record
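Reviewer note: the hunk above requests the file record with an `Authorization` header and then decodes its base64 `content` field. A minimal offline sketch of just the decoding step follows, assuming a record shaped like the API response. The `decode_content` helper and the `fake_record` payload are made up for illustration and are not in the repository.

```python
import base64

def decode_content(record):
    """Return the decoded text of a contents-style record, or None if absent."""
    try:
        encoded = record["content"]
    except KeyError:
        # The diff treats a missing 'content' field as a failed request
        return None
    # The encoded text may be wrapped across lines; drop newlines before decoding
    return base64.b64decode(encoded.replace("\n", "")).decode("utf-8")

# Hypothetical payload built locally so that no network access is needed
text = "# Title\n\nSome markdown text.\n"
raw = base64.encodebytes(text.encode("utf-8")).decode("ascii")
fake_record = {"content": raw, "encoding": "base64"}

print(decode_content(fake_record))
print(decode_content({"message": "API rate limit exceeded"}))
```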
@@ -431,6 +439,10 @@ class Search:
# Define how to update search index
# using different kinds of collections
# ------------------------------
# Google Drive Files/Documents
def update_index_gdocs(self,
config):
"""
@@ -478,7 +490,7 @@ class Search:
remote_ids = set()
full_items = {}
while True:
ps = 12
ps = 100
results = drive.list(
pageSize=ps,
pageToken=nextPageToken,
@@ -496,11 +508,11 @@ class Search:
# Also store the doc
full_items[f['id']] = f
# Shorter:
break
## Longer:
#if nextPageToken is None:
# break
## Shorter:
#break
# Longer:
if nextPageToken is None:
break
writer = self.ix.writer()
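Reviewer note: the hunk above swaps the early `break` for a loop that continues until `nextPageToken` is `None`, and raises the page size from 12 to 100. A sketch of that paging pattern against a stand-in for the Drive listing call is shown here. `fake_list`, `PAGES`, and `collect_ids` are hypothetical names; the real `drive.list(...)` call and its remaining arguments are not reproduced.

```python
# Two fake result pages keyed by the token that requests them
PAGES = {
    None: {"files": [{"id": "a"}, {"id": "b"}], "nextPageToken": "p2"},
    "p2": {"files": [{"id": "c"}], "nextPageToken": None},
}

def fake_list(pageToken=None):
    """Stand-in for a paged listing call: returns one page per token."""
    return PAGES[pageToken]

def collect_ids(list_page):
    """Walk every page, collecting ids and the full item records."""
    remote_ids = set()
    full_items = {}
    token = None
    while True:
        results = list_page(pageToken=token)
        for f in results.get("files", []):
            remote_ids.add(f["id"])
            full_items[f["id"]] = f
        token = results.get("nextPageToken")
        # Stop only when the service reports no further pages
        if token is None:
            break
    return remote_ids, full_items

ids, items = collect_ids(fake_list)
print(sorted(ids))
```

With the early `break` from the old code, only the first page would have been collected.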
@@ -544,13 +556,13 @@ class Search:
print("Done, updated %d documents in the index" % count)
# ------------------------------
# Github Issues/Comments
def update_index_issues(self, gh_oauth_token, config):
def update_index_issues(self, gh_access_token, config):
"""
Update the search index using a collection of
Github repo issues and comments.
gh_oauth_token can also be an access token.
"""
# Updated algorithm:
# - get set of indexed ids
@@ -572,25 +584,29 @@ class Search:
# Get the set of remote ids:
# ------
# Start with api object
g = Github(gh_oauth_token)
g = Github(gh_access_token)
# Now index all issue threads in the user-specified repos
# Start by collecting all the things
remote_issues = set()
full_items = {}
# Iterate over each repo
list_of_repos = config['repositories']
for r in list_of_repos:
# Start by collecting all the things
remote_issues = set()
full_items = {}
if '/' not in r:
err = "Error: specify org/reponame or user/reponame in list of repos"
raise Exception(err)
this_org, this_repo = re.split('/',r)
org = g.get_organization(this_org)
repo = org.get_repo(this_repo)
try:
org = g.get_organization(this_org)
repo = org.get_repo(this_repo)
except:
print("Error: could not gain access to repository %s"%(r))
continue
# Iterate over each issue thread
issues = repo.get_issues()
@@ -622,7 +638,7 @@ class Search:
# cop out
writer.delete_by_term('id',update_issue)
item = full_items[update_issue]
self.add_issue(writer, item, config, update=True)
self.add_issue(writer, item, gh_access_token, config, update=True)
count += 1
@@ -631,7 +647,7 @@ class Search:
add_issues = remote_issues - indexed_issues
for add_issue in add_issues:
item = full_items[add_issue]
self.add_issue(writer, item, config, update=False)
self.add_issue(writer, item, gh_access_token, config, update=False)
count += 1
@@ -640,13 +656,13 @@ class Search:
# ------------------------------
# Github Markdown Files
def update_index_markdown(self, gh_oauth_token, config):
def update_index_markdown(self, gh_access_token, config):
"""
Update the search index using a collection of
Markdown files from a Github repo.
gh_oauth_token can also be an access token.
"""
EXT = '.md'
@@ -669,38 +685,48 @@ class Search:
# Get the set of remote ids:
# ------
# Start with api object
g = Github(gh_oauth_token)
g = Github(gh_access_token)
# Now index all markdown files
# in the user-specified repos
# Start by collecting all the things
remote_ids = set()
full_items = {}
# Iterate over each repo
list_of_repos = config['repositories']
for r in list_of_repos:
# Start by collecting all the things
remote_ids = set()
full_items = {}
if '/' not in r:
err = "Error: specify org/reponame or user/reponame in list of repos"
raise Exception(err)
this_org, this_repo = re.split('/',r)
org = g.get_organization(this_org)
repo = org.get_repo(this_repo)
try:
org = g.get_organization(this_org)
repo = org.get_repo(this_repo)
except:
print("Error: could not gain access to repository %s"%(r))
continue
# ---------
# begin markdown-specific code
# Get head commit
commits = repo.get_commits()
last = commits[0]
sha = last.sha
try:
last = commits[0]
sha = last.sha
except GithubException:
print("Error: could not get commits from repository %s"%(r))
continue
# Get all the docs
tree = repo.get_git_tree(sha=sha, recursive=True)
docs = tree.raw_data['tree']
print("Parsing doc ids from repository %s"%(r))
for d in docs:
@@ -736,10 +762,10 @@ class Search:
# and in remote_ids
update_ids = indexed_ids & remote_ids
for update_id in update_ids:
# cop out
# cop out: just delete and re-add
writer.delete_by_term('id',update_id)
item = full_items[update_id]
self.add_markdown(writer, item, config, update=True)
self.add_markdown(writer, item, gh_access_token, config, update=True)
count += 1
@@ -748,7 +774,7 @@ class Search:
add_ids = remote_ids - indexed_ids
for add_id in add_ids:
item = full_items[add_id]
self.add_markdown(writer, item, config, update=False)
self.add_markdown(writer, item, gh_access_token, config, update=False)
count += 1
@@ -757,6 +783,16 @@ class Search:
# ------------------------------
# Groups.io Emails
#def update_index_markdown(self, gh_access_token, config):
# ---------------------------------
# Search results bundler
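Reviewer note: both `update_index_issues` and `update_index_markdown` compare the ids already in the index with the ids found remotely. The hunks show the `update` set (`indexed & remote`) and the `add` set (`remote - indexed`); the `delete` set below is an assumption about the part of the algorithm that the diff does not show. `plan_sync` is a hypothetical helper, not a function in the repository.

```python
def plan_sync(indexed_ids, remote_ids):
    """Split ids into the three groups an index refresh needs."""
    return {
        # Assumed: ids that vanished remotely get dropped from the index
        "delete": indexed_ids - remote_ids,
        # Shown in the diff: ids present on both sides are deleted and re-added
        "update": indexed_ids & remote_ids,
        # Shown in the diff: ids seen only remotely are added fresh
        "add": remote_ids - indexed_ids,
    }

plan = plan_sync({"issue-1", "issue-2"}, {"issue-2", "issue-3"})
print(plan)
```

This also illustrates why moving `remote_ids = set()` out of the per-repository loop matters: if the set were reset for each repository, only the last repository's ids would reach the comparison.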


@@ -1,6 +1,27 @@
{
"repositories" : [
"dcppc/project-management",
"dcppc/nih-demo-meetings",
"dcppc/internal",
"dcppc/organize",
"dcppc/dcppc-bot",
"dcppc/full-stacks",
"dcppc/markdown-issues",
"dcppc/design-guidelines-discuss",
"dcppc/dcppc-deliverables",
"dcppc/dcppc-milestones",
"dcppc/crosscut-metadata",
"dcppc/lucky-penny",
"dcppc/dcppc-workshops",
"dcppc/metadata-matrix",
"dcppc/data-stewards",
"dcppc/dcppc-phase1-demos",
"dcppc/apis",
"dcppc/2018-june-workshop",
"dcppc/2018-july-workshop"
"dcppc/2018-july-workshop",
"dcppc/2018-august-workshop",
"dcppc/2018-september-workshop",
"dcppc/design-guidelines",
"dcppc/2018-may-workshop"
]
}
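Reviewer note: the search code rejects any `repositories` entry that lacks a `/`, and PR #8 fixed a wrong repository name in this file. A small pre-flight check along these lines could catch malformed entries before indexing starts. This is a sketch only: `parse_repositories` is a hypothetical helper, and it cannot tell whether a well-formed name actually exists on Github.

```python
import json

def parse_repositories(config_text):
    """Parse the config JSON and split each entry into (org, repo)."""
    config = json.loads(config_text)
    pairs = []
    for entry in config["repositories"]:
        org, sep, repo = entry.partition("/")
        if not sep or not org or not repo:
            raise ValueError(
                "specify org/reponame or user/reponame, got %r" % entry
            )
        pairs.append((org, repo))
    return pairs

sample = '{"repositories": ["dcppc/dcppc-milestones", "dcppc/apis"]}'
print(parse_repositories(sample))
```

Since `json.loads` runs first, a missing comma between entries fails early as a JSON error rather than surfacing later during indexing.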


@@ -2,17 +2,18 @@
INDEX_DIR = "search_index"
# oauth client deets
GITHUB_OAUTH_CLIENT_ID = "63f8d49c651840cbe31e"
GITHUB_OAUTH_CLIENT_SECRET = "36d9a4611f7427336d3c89ed041c45d086b793ee"
GITHUB_OAUTH_CLIENT_ID = "XXX"
GITHUB_OAUTH_CLIENT_SECRET = "YYY"
GITHUB_TOKEN = "ZZZ"
# More information footer: Repository label
FOOTER_REPO_ORG = "charlesreid1"
FOOTER_REPO_ORG = "dcppc"
FOOTER_REPO_NAME = "centillion"
# Toggle to show Whoosh parsed query
SHOW_PARSED_QUERY=True
TAGLINE = "Search all the things"
TAGLINE = "Search the Data Commons"
# Flask settings
DEBUG = True
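Reviewer note: this hunk replaces committed credentials with placeholders, and the Flask app now reads the OAuth client id and secret from environment variables. One way to fail fast when a value is missing is sketched below. Reading `GITHUB_TOKEN` from the environment is an assumption (the diff takes it from `app.config`), and `load_settings` is a hypothetical helper.

```python
import os

REQUIRED = (
    "GITHUB_OAUTH_CLIENT_ID",
    "GITHUB_OAUTH_CLIENT_SECRET",
    "GITHUB_TOKEN",  # assumption: the diff reads this one from app.config
)

def load_settings(environ=os.environ):
    """Return the required settings, raising if any is unset or empty."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED}

# Usage with an explicit mapping instead of the real environment
demo_env = {
    "GITHUB_OAUTH_CLIENT_ID": "XXX",
    "GITHUB_OAUTH_CLIENT_SECRET": "YYY",
    "GITHUB_TOKEN": "ZZZ",
}
print(sorted(load_settings(demo_env)))
```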

docs/images/cp.png: binary file added (498 KiB)

docs/images/ss.png: binary file added (355 KiB)


@@ -29,8 +29,7 @@ class GDrive(object):
):
"""
Set up the Google Drive API instance.
Factory method: create it and hand it over.
Then we're finished.
Factory method: create it here, hand it over in get_service().
"""
self.credentials_file = credentials_file
self.client_secret_file = client_secret_file
@@ -40,6 +39,9 @@ class GDrive(object):
self.store = file.Storage(credentials_file)
def get_service(self):
"""
Return an instance of the Google Drive API service.
"""
creds = self.store.get()
if not creds or creds.invalid:

Binary file removed (356 KiB)

Submodule mkdocs-material deleted from 6569122bb1

mkdocs-material-dib (submodule, 1 line changed)

Submodule mkdocs-material-dib added at c3dd912f3c


@@ -1,5 +1,5 @@
<!doctype html>
<title>Markdown Search</title>
<title>Centillion Search Engine</title>
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='style.css') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='github-markdown.css') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='bootstrap.min.css') }}">


@@ -107,12 +107,18 @@
<div class="url">
{% if e.kind=="gdoc" %}
<b>Google Drive File:</b>
<a href='{{e.url}}'>{{e.title}}</a>
(Owner: {{e.owner_name}}, {{e.owner_email}})
{% if e.mimetype=="document" %}
<b>Google Document:</b>
<a href='{{e.url}}'>{{e.title}}</a>
(Type: {{e.mimetype}}, Owner: {{e.owner_name}}, {{e.owner_email}})
{% else %}
<b>Google Drive:</b>
<a href='{{e.url}}'>{{e.title}}</a>
(Type: {{e.mimetype}}, Owner: {{e.owner_name}}, {{e.owner_email}})
{% endif %}
{% elif e.kind=="issue" %}
<b>Issue:</b>
<b>Github Issue:</b>
<a href='{{e.url}}'>{{e.title}}</a>
{% if e.github_user %}
opened by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
@@ -121,7 +127,7 @@
<b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
{% elif e.kind=="markdown" %}
<b>Markdown:</b>
<b>Github Markdown:</b>
<a href='{{e.url}}'>{{e.title}}</a>
<br/>
<b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
@@ -131,9 +137,15 @@
{% endif %}
<br />
score: {{'%d' % e.score}}
Score: {{'%d' % e.score}}
</div>
<div class="markdown-body">
{% if e.content_highlight %}
{{ e.content_highlight|safe}}
{% else %}
<p>(A preview of this document is not available.)</p>
{% endif %}
</div>
<div class="markdown-body">{{ e.content_highlight|safe}}</div>
</li>
{% endfor %}