127 Commits

Author SHA1 Message Date
de796880c5 Merge branch 'master' of github.com:charlesreid1/centillion
* 'master' of github.com:charlesreid1/centillion:
  update config_flask.example.py to strip dc info
2018-08-13 19:14:54 -07:00
f79f711a38 Merge branch 'master' of github.com:dcppc/centillion
* 'master' of github.com:dcppc/centillion:
  Update Readme.md
2018-08-13 19:14:07 -07:00
00b862b83e Merge branch 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion
* 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion:
2018-08-13 19:13:53 -07:00
a06c3b645a Update Readme.md 2018-08-13 12:42:18 -07:00
878ff011fb locked out by rate limit, but otherwise successful in indexing so far. 2018-08-13 00:54:12 -07:00
33cf78a524 successfully grabbing threads from 1st page of every subgroup 2018-08-13 00:27:45 -07:00
c1bcd8dc22 add import pdb where things are currently stuck 2018-08-12 20:25:29 -07:00
757e9d79a1 keep going with spider idea 2018-08-12 20:24:29 -07:00
c47682adb4 fix typo with groupsio key 2018-08-12 20:13:45 -07:00
f2662c3849 adding calls to index groupsio emails
this is currently work in progress.
we have a debug statement in place as a bookmark.

we are currently:
- creating a login session
- getting all the subgroups
- going to first subgroup
- getting list of titles and links
- getting emails for each title and link

still need to:
- figure out how to assemble email {}
- assemble content/etc and how to parse text of emails
2018-08-12 18:00:33 -07:00
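
A minimal sketch of the five crawl steps this commit message lists, assuming hypothetical groups.io URLs and CSS selectors (the real logic lives in the GroupsIOArchivesCrawler class from groupsio_util, which appears in the diff below):

```
# Hypothetical sketch of the crawl steps above; endpoint paths and
# selectors are placeholders, not the real groups.io markup.
import requests
from bs4 import BeautifulSoup

def crawl_subgroup_threads(username, password, group="dcppc"):
    session = requests.Session()
    # 1. create a login session
    session.post("https://groups.io/login",
                 data={"email": username, "password": password})
    # 2. get all the subgroups
    listing = session.get("https://groups.io/org/groupsio/%s/subgroups" % group)
    soup = BeautifulSoup(listing.text, "html.parser")
    subgroup_urls = [a["href"] for a in soup.select("a.subgroup-link")]
    # 3-5. go to each subgroup, get thread titles and links,
    #      then fetch the emails for each thread
    threads = []
    for url in subgroup_urls:
        archive = BeautifulSoup(session.get(url + "/topics").text, "html.parser")
        for link in archive.select("a.topic-title"):
            email_page = session.get(link["href"])
            threads.append({"title": link.text.strip(),
                            "url": link["href"],
                            "html": email_page.text})
    return threads
```
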
2478a3f857 Merge branch 'dcppc' of github.com:dcppc/centillion into dcppc
* 'dcppc' of github.com:dcppc/centillion:
  fix how search results are bundled
  fix search template
2018-08-10 06:05:44 -07:00
f174080dfd catch exception when file info not found 2018-08-10 06:05:33 -07:00
ca8b12db06 Merge pull request #2 from charlesreid1/dcppc-merge-master
Merge dcppc changes into master
2018-08-10 05:49:29 -07:00
a1ffdad292 Merge branch 'master' into dcppc-merge-master 2018-08-10 05:49:19 -07:00
ce76396096 update config_flask.example.py to strip dc info 2018-08-10 05:46:07 -07:00
175ff4f71d Merge pull request #17 from dcppc/github-files
fix search template
2018-08-09 18:57:30 -07:00
94f956e2d0 fix how search results are bundled 2018-08-09 18:56:56 -07:00
dc015671fc fix search template 2018-08-09 18:55:49 -07:00
1e9eec81d7 make it valid json 2018-08-09 18:15:14 -07:00
31e12476af Merge pull request #16 from dcppc/inception
add inception
2018-08-09 18:08:11 -07:00
bbe4e32f63 Merge pull request #15 from dcppc/github-files
index all github filenames, not just markdown
2018-08-09 18:07:56 -07:00
5013741958 while we're at it 2018-08-09 17:40:56 -07:00
1ce80a5da0 closes #11 2018-08-09 17:38:20 -07:00
3ed967bd8b remove unused function 2018-08-09 17:28:22 -07:00
1eaaa32007 index all github filenames, not just markdown 2018-08-09 17:25:09 -07:00
9c7e696b6a Merge branch 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion
* 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion:
  Move images, resize images, update image markdown in readme
  update readme to use <img> tags
  merge image files in from master
  fix <title>
  fix the readme to reflect current state of things/links/descriptions
  fix typos/wording in readme
  adding changes to enable https, update callback to http, and everything still passes through https (proxy)
  update footer repo info
  update screen shots
  add mkdocs-material-dib submodule
  remove mkdocs material submodule
  update tagline
  update tagline
  add _example_ config file for flask
2018-08-09 16:39:18 -07:00
262a0c19e7 Merge pull request #14 from dcppc/local-fixes
Fix centillion to work for local instances
2018-08-09 16:37:37 -07:00
bd2714cc0b Merge branch 'dcppc' into local-fixes 2018-08-09 16:36:34 -07:00
899d6fed53 comment out localhost only env var 2018-08-09 16:25:37 -07:00
a7756049e5 revert changes 2018-08-09 16:23:42 -07:00
3df427a8f8 fix how existing issues in search index are collected. closes #10 2018-08-09 16:17:17 -07:00
0dd06748de fix centillion to work for local instance 2018-08-09 16:16:30 -07:00
1a04814edf Merge pull request #9 from dcppc/ACharbonneau-patch-1
Update config_centillion.json
2018-08-07 16:09:45 -07:00
Amanda Charbonneau
3fb72d409b Update config_centillion.json
I fixed it
2018-08-07 18:24:32 -04:00
d89e01221a Merge pull request #8 from dcppc/dcppc-test
Fix the name of the milestones repo: 'dcppc-milestones' not 'milestones'
2018-08-07 14:59:06 -07:00
6736f3f8ad add centillion configuration json file 2018-08-07 14:54:56 -07:00
abd13aba29 Merge pull request #7 from dcppc/fix-docstrings
Fix docstrings
2018-08-07 14:43:42 -07:00
13e49cdaa6 improve docstrings on gdrive_util.py too 2018-08-07 14:42:19 -07:00
83b2ce17fb fix docstrings in centillion_search.py 2018-08-07 14:41:26 -07:00
5be0709070 Merge pull request #6 from dcppc/fix-docs
Move images, resize images, update image markdown in readme
2018-08-07 13:02:08 -07:00
9edd95a78d Merge branch 'fix-docs'
* fix-docs:
  Move images, resize images, update image markdown in readme
  update readme to use <img> tags
  merge image files in from master
  fix <title>
  fix the readme to reflect current state of things/links/descriptions
  fix typos/wording in readme
  adding changes to enable https, update callback to http, and everything still passes through https (proxy)
  update footer repo info
  update screen shots
  add mkdocs-material-dib submodule
  remove mkdocs material submodule
  update tagline
  update tagline
  add _example_ config file for flask
2018-08-07 12:50:29 -07:00
37615d8707 Move images, resize images, update image markdown in readme 2018-08-07 12:40:38 -07:00
4b218f63b9 update readme to use <img> tags 2018-08-03 15:56:49 -07:00
4e17c890bc merge image files in from master 2018-08-03 15:53:51 -07:00
1129ec38e0 update the readme 2018-08-03 15:49:46 -07:00
875508c796 update screen shot images 2018-08-03 15:49:12 -07:00
abc7a2aedf fix <title> 2018-08-03 15:45:56 -07:00
8f1e5faefc update readme to reflect latest 2018-08-03 15:38:23 -07:00
d5f63e2322 Merge pull request #1 from dcppc/fix-readme
fix the readme to reflect current state of things/links/descriptions
2018-08-03 15:28:51 -07:00
84e5560423 fix the readme to reflect current state of things/links/descriptions 2018-08-03 15:28:16 -07:00
924c562c0a fix typos/wording in readme 2018-08-03 15:22:35 -07:00
13c410ac5e adding changes to enable https, update callback to http, and everything still passes through https (proxy) 2018-08-03 15:21:41 -07:00
4e79800e83 update footer repo info 2018-08-03 15:19:55 -07:00
5b9570d8cd Merge branch 'dcppc' of github.com:dcppc/centillion into dcppc
* 'dcppc' of github.com:dcppc/centillion:
  add mkdocs-material-dib submodule
  remove mkdocs material submodule
2018-08-03 14:54:25 -07:00
297a4b5977 update screen shots 2018-08-03 14:53:43 -07:00
69a6b5d680 add mkdocs-material-dib submodule 2018-08-03 13:51:13 -07:00
3feca1aba3 remove mkdocs material submodule 2018-08-03 13:50:37 -07:00
493581f861 update tagline 2018-08-03 13:38:00 -07:00
1b0ded809d update tagline 2018-08-03 13:36:56 -07:00
78e77c7cf2 add _example_ config file for flask 2018-08-03 13:34:27 -07:00
2f890d1aee Merge branch 'all-the-docs' of charlesreid1/centillion into master 2018-08-03 20:28:27 +00:00
937327f2cb update search template to treat drive files and documents differently. 2018-08-03 13:24:03 -07:00
ca0d88cfe6 index all the google drive things 2018-08-03 13:15:02 -07:00
5eda472072 improve handling of tokens for gh api, fix set ordering/logic 2018-08-03 13:07:46 -07:00
d943c14678 Merge branch 'master' into all-the-docs
* master:
  Update '.gitignore'
  no secrets plz
2018-08-03 12:37:49 -07:00
6be785a056 indexing all markdown is working. 2018-08-03 12:36:32 -07:00
65113a95f7 Update '.gitignore' 2018-08-03 17:52:04 +00:00
87c3f12c8f no secrets plz 2018-08-03 17:51:39 +00:00
933884e9ab search all the docs. search all the repos. 2018-08-03 10:29:52 -07:00
da9dea3f6b Merge branch 'github-markdown' of charlesreid1/centillion into master 2018-08-03 07:20:45 +00:00
4d6386e74a add results-handling for markdown files 2018-08-03 00:19:57 -07:00
a93b7519de improve counts accounting, and construct usable urls for markdown 2018-08-03 00:19:35 -07:00
5e2c37164b fix markdown indexing 2018-08-02 23:56:56 -07:00
829e9c4263 finish subsuming repotree into centillion_search 2018-08-02 23:14:55 -07:00
283991017c add repotree script. temporary/standalone, but doing exactly what centillion needs to do. 2018-08-02 22:29:18 -07:00
653af18f24 add update_index_markdown() function, rough/unfinished 2018-08-02 22:27:30 -07:00
fae184f1f3 re-indexer now calls (nonexistent file) update_index_markdown 2018-08-02 22:26:56 -07:00
d40bb3557f Merge branch 'flask-dance' of charlesreid1/centillion into master 2018-08-03 04:09:20 +00:00
a848f3ec3e complete the conversion to oauth tokens 2018-08-02 19:06:34 -07:00
50d27a915a update readme 2018-08-02 19:04:40 -07:00
1b950b7790 update re-index task to use gh token; reorganize logic; use werkzeug proxy 2018-08-02 19:02:00 -07:00
04d4195668 Add flask-dance to centillion.
- Remove config file, which now contains secrets
- Add flask dance to requirements
- Update instructions in readme to include Github application setup
2018-08-02 11:52:56 -07:00
d0fe7aa799 ignore config files, which may have keys in them 2018-08-02 11:24:33 -07:00
acc28aab44 Merge branch 'cache-and-hash' of charlesreid1/centillion into master 2018-08-02 17:59:45 +00:00
adc2666a9b actually fix flashed messages 2018-08-02 00:58:37 -07:00
581f0a67ed fix messages so they are js and dismissable 2018-08-02 00:54:56 -07:00
0b96061bc5 update documentation, add new docs pages on components/flask/whoosh 2018-08-01 23:04:35 -07:00
c7acdea889 finally. make results comprehensible. 2018-08-01 22:39:07 -07:00
4eabd4536e remove last searches from search.html 2018-08-01 22:32:20 -07:00
78276c14d9 align badges higher 2018-08-01 22:31:59 -07:00
68f90d383f fix up how issues are added, and how all issues are iterated over (use set algebra) 2018-08-01 22:31:41 -07:00
202643b85e add control_panel route, remove last_search silliness 2018-08-01 22:29:06 -07:00
dc9ac74d68 add control panel page 2018-08-01 20:12:55 -07:00
36cc94a854 Fix bootstrap div classes, badgify counts, fix <li> styles 2018-08-01 20:12:10 -07:00
740e757bcd update todo with what we have done 2018-08-01 15:54:03 -07:00
bf6afe39c6 caching is working 2018-08-01 15:48:43 -07:00
54c09ce80b call add drive file function with add/update docIDs. fix method headers. 2018-08-01 15:17:07 -07:00
1407178f39 updating flask config and templates to parameterize repo info in footer 2018-08-01 13:43:43 -07:00
2bf9abfd6f update footer: prior searches are now badges, and link to more info now points to repo 2018-08-01 13:36:45 -07:00
8328f96f76 make "prior searches" a badge and infobox bg color 2018-08-01 13:36:05 -07:00
d5a9fe85af Merge branch 'master' into cache-and-hash
* master:
  update installation preparation step
2018-08-01 12:50:10 -07:00
f8d2156d85 update installation preparation step 2018-08-01 12:48:09 -07:00
a753ba4963 update centillion search with comment blocks laying out what to change and where 2018-08-01 11:32:37 -07:00
8cca4b2c8d add TAGLINE param 2018-08-01 00:49:56 -07:00
69339abe24 fix the way repo name label is handled 2018-08-01 00:25:29 -07:00
8d2718d783 update how we store totals 2018-07-31 23:58:19 -07:00
8912b945fe remove print statement 2018-07-31 23:17:16 -07:00
ddceb16a2c fix template rendering in update_index url endpoint 2018-07-31 23:16:45 -07:00
f769d18b4e clean up flask config file 2018-07-31 23:16:23 -07:00
34a889479a Update config_flask.py 2018-07-31 23:12:57 -07:00
a074e6c0e7 add image to readme 2018-07-31 23:07:32 -07:00
918c9d583f update search results template 2018-07-31 23:01:38 -07:00
6cd505087b package up the counts in get_document_total_count 2018-07-31 22:37:20 -07:00
ee9b3bb811 pass a count dictionary instead of an integer to the jinja template 2018-07-31 22:36:43 -07:00
8a4e20b71c update template - gotta look good 2018-07-31 22:36:13 -07:00
64d3ce4a9b update search engine style to use centillion logo 2018-07-31 18:29:01 -07:00
5e9b584d26 uncovered the mysterious missing google docs: they were just being labeled as issues by the search template. 2018-07-31 15:59:21 -07:00
b03a42d261 start some troubleshooting 2018-07-31 05:21:58 -07:00
bd4f4da8dc more fixes - use "" not None 2018-07-31 05:15:22 -07:00
23743773a6 add mkdocs-material submodule 2018-07-31 04:33:27 -07:00
b7d2a8c960 rename some files, and move docs into docs/ 2018-07-31 04:32:38 -07:00
1f4b43163a fix env var name 2018-07-31 03:16:28 -07:00
f80ccc2520 successfully indexing, unsuccessfully searching 2018-07-31 03:06:25 -07:00
c2eae4f521 improve handling of repo names, owners, and document schema. improve timestamps. 2018-07-31 01:52:44 -07:00
c758ca7a6c add quickstart 2018-07-31 01:28:38 -07:00
3cf142465a updating readme with flask mention 2018-07-31 01:23:49 -07:00
bfd351c990 Update 'Workdone.md' 2018-07-31 08:12:28 +00:00
35 changed files with 1859 additions and 473 deletions

.gitignore

@@ -1,8 +1,8 @@
-config_flask.py
 vp
 credentials.json
 drive*.json
 *.pyc
+config.py
 out/
 search_index/
 venv/

.gitmodules (new file)

@@ -0,0 +1,3 @@
[submodule "mkdocs-material-dib"]
path = mkdocs-material-dib
url = https://github.com/dib-lab/mkdocs-material-dib.git

Readme.md

@@ -1,67 +1,94 @@
-# centillion
+# Centillion
-**the centillion**: a pan-github-markdown-issues-google-docs search engine.
+**centillion**: a pan-github-markdown-issues-google-docs search engine.
 **a centillion**: a very large number consisting of a 1 with 303 zeros after it.
-the centillion is 3.03 log-times better than the googol.
+one centillion is 3.03 log-times better than a googol.
+![Screen shot of centillion](docs/images/ss.png)
 ## what is it
-The centillion is a search engine built using [whoosh](#),
-a Python library for building search engines.
+Centillion (https://github.com/dcppc/centillion) is a search engine that can index
+three kinds of collections: Google Documents, Github issues, and Markdown files in
+Github repos.
 We define the types of documents the centillion should index,
-and how, using what fields. The centillion then builds and
-updates a search index.
+and what info and how. The centillion then builds and
+updates a search index. That's all done in `centillion_search.py`.
 The centillion also provides a simple web frontend for running
-queries against the search index.
+queries against the search index. That's done using a Flask server
+defined in `centillion.py`.
 The centillion keeps it simple.
-## work that is done
-See [Workdone.md](Workdone.md)
+## authentication layer
+Centillion lives behind a Github authentication layer, implemented with
+[flask-dance](https://github.com/singingwolfboy/flask-dance). When you first
+visit the site it will ask you to authenticate with Github so that it can
+verify you have permission to access the site.
+## technologies
+Centillion is a Python program built using whoosh (search engine library). It
+indexes the full text of docx files in Google Documents, just the filenames for
+non-docx files. The full text of issues and their comments are indexed, and
+results are grouped by issue. Centillion requires Google Drive and Github OAuth
+apps. Once you provide credentials to Flask you're all set to go.
-## work that is being done
-See [Workinprogress.md](Workinprogress.md) for details about
-route and function layout. Summary below.
+## control panel
+There's also a control panel at <https://search.nihdatacommons.us/control_panel>
+that allows you to rebuild the search index from scratch (the Google Drive indexing
+takes a while).
-### code organization
-centillion app routes:
-- home
-    - if not logged in, landing page
-    - if logged in, redirect to search
-- search
-- main_index_update
-    - update main index, all docs period
-centillion Search functions:
-- open_index creates the schema
-- add_issue, add_md, add_document have three diff method sigs and add diff types
-  of documents to the search index
-- update_all_issues or update_all_md or update_all_documents iterates over items
-  and determines whether each item needs to be updated in the search index
-- update_main_index - update the entire search index
-    - calls all three update_all methods
-- create_search_results - package things up for jinja
-- search - run the query, pass results to the jinja-packager
+![Screen shot of centillion control panel](docs/images/cp.png)
+## quickstart (with Github auth)
+Start by creating a Github OAuth application.
+Get the public and private application key
+(client token and client secret token)
+from the Github application's page.
+You will also need a Github access token
+(in addition to the app tokens).
+When you create the application, set the callback
+URL to `/login/github/authorized`, as in:
+```
+https://<url>/login/github/authorized
+```
+Edit the Flask configuration `config_flask.py`
+and set the public and private application keys.
+Now run centillion:
+```
+python centillion.py
+```
+or if you used http instead of https:
+```
+OAUTHLIB_INSECURE_TRANSPORT="true" python centillion.py
+```
+This will start a Flask server, and you can view the minimal search engine
+interface in your browser at `http://<ip>:5000`.
-## work that is planned
-See [Workplanned.md](Workplanned.md)
+## troubleshooting
+If you are having problems with your callback URL being treated
+as HTTP by Github, even though there is an HTTPS address, and
+everything else seems fine, try deleting the Github OAuth app
+and creating a new one.

Todo.md (new file)

@@ -0,0 +1,47 @@
# todo
Main task:
- hashing and caching
- <s>first, working out the logic of how we group items into sets
- needs to be deleted
- needs to be updated
- needs to be added
- for docs, issues, and comments</s>
- second, when we add or update an item, need to:
- go through the motions, download file, extract text
- check for existing indexed doc with that id
- check if existing indexed doc has same hash
- if so, skip
- otherwise, delete and re-index
Other bugs:
- Some github issues have no title (?)
- <s>Need to combine issues with comments</s>
- Not able to index markdown files _in a repo_
- (Longer term) update main index vs update diff index
Needs:
- <s>control panel</s>
Thursday product:
- Everything re-indexed nightly
- Search engine built on all documents in Google Drive, all issues, markdown files
- Using pandoc to extract Google Drive document contents
- BRIEF quickstart documentation
Future:
- Future plans to improve - plugins, improving matching
- Subdomain plans
- Folksonomy tagging and integration plans
config options for plugins
conditional blocks with import github inside
complicated tho - better to have components split off
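
A sketch of the hash-and-cache check described in the second task above, assuming a hypothetical needs_reindex helper and the fingerprint schema field added in centillion_search.py below:

```
# Sketch only: decide whether a downloaded item needs (re-)indexing.
import hashlib

def needs_reindex(writer, searcher, doc_id, content):
    fingerprint = hashlib.sha256(content.encode("utf-8")).hexdigest()
    existing = searcher.document(id=doc_id)   # whoosh lookup by unique field
    if existing is not None:
        if existing.get("fingerprint") == fingerprint:
            return False, fingerprint         # same hash: skip
        writer.delete_by_term("id", doc_id)   # changed: delete, then re-index
    return True, fingerprint
```
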

(deleted file)

@@ -1,106 +0,0 @@
# Components
The components of centillion are as follows:
- Flask application, which creates a Search object and uses it to search index
- Search object, which allows you to create/update/search an index
## Routes layout
Current application routes are as follows:
- home -> search
- search
- update_index
Ideal application routes (using github flask dance oauth):
- home
- if not logged in, landing page
- if logged in, redirect to search
- search
- main_index_update
- update main index, all docs period
- delta_index_update
- updates delta index, docs that have changed since last main index
There should be one route to update the main index
There should be another route to update the delta index
These should go off and call the update index methods
for each respective type of document/collection.
For example, if I call `main_index_update` route it should
- call `main_index_update` for all github issues
- call `main_index_update` for folder of markdown docs
- call `main_index_update` for google drive folder
These are all members of the Search class
## Functions layout
Functions of the entire search app:
- create a search index
- load a search index
- call the search() method on the index
- update the search index
The first and last, creating and updating the search index,
are of greatest interest.
The Schema affects everything so it is hard to separate
functionality into a main Search class shared by many.
(Avoid inheritance/classes if possible.)
current Search:
- open_index creates the schema
- add_issue or add_document adds an item to the index
- add_all_issues or add_all_documents iterates over items and adds them to index
- update_index_incremental - update the search index
- create_search_results - package things up for jinja
- search - run the query, pass results to the jinja-packager
centillion Search:
- open_index creates the schema
- add_issue, add_md, add_document have three diff method sigs and add diff types
of documents to the search index
- update_all_issues or update_all_md or update_all_documents iterates over items
and determines whether each item needs to be updated in the search index
- update_main_index - update the entire search index
- calls all three update_all methods
- create_search_results - package things up for jinja
- search - run the query, pass results to the jinja-packager
Nice to have but focus on it later:
- update_diff_issues or update_diff_md or update_diff_documents iterates over items
and indexes recently-added items
- update_diff_index - update the diff search index (what's been added since last
time)
- calls all three update_diff methods
## Files layout
Schema definition:
* include a "kind" or "class" to group objects
* can provide different searches of different collections
* eventually can provide user with checkboxes

centillion.py

@@ -2,8 +2,11 @@ import threading
 from subprocess import call
 import codecs
-import os
+import os, json
+from werkzeug.contrib.fixers import ProxyFix
 from flask import Flask, request, redirect, url_for, render_template, flash
+from flask_dance.contrib.github import make_github_blueprint, github
 # create our application
 from centillion_search import Search
@@ -22,10 +25,18 @@ You provide:
 - Google Drive API key via file
 """
 class UpdateIndexTask(object):
-    def __init__(self, diff_index=False):
+    def __init__(self, app_config, diff_index=False):
         self.diff_index = diff_index
         thread = threading.Thread(target=self.run, args=())
+        self.gh_token = app_config['GITHUB_TOKEN']
+        self.groupsio_credentials = {
+                'groupsio_token' : app_config['GROUPSIO_TOKEN'],
+                'groupsio_username' : app_config['GROUPSIO_USERNAME'],
+                'groupsio_password' : app_config['GROUPSIO_PASSWORD']
+        }
         thread.daemon = True
         thread.start()
@@ -38,90 +49,180 @@ class UpdateIndexTask(object):
         from get_centillion_config import get_centillion_config
         config = get_centillion_config('config_centillion.json')
-        gh_token = os.environ['GITHUB_ACESS_TOKEN']
-        search.update_index_issues(gh_token, config)
-        search.update_index_gdocs(config)
+        search.update_index_groupsioemails(self.groupsio_credentials,config)
+        ###search.update_index_ghfiles(self.gh_token,config)
+        ###search.update_index_issues(self.gh_token,config)
+        ###search.update_index_gdocs(config)
 app = Flask(__name__)
+app.wsgi_app = ProxyFix(app.wsgi_app)
 # Load default config and override config from an environment variable
 app.config.from_pyfile("config_flask.py")
-last_searches_file = app.config["INDEX_DIR"] + "/last_searches.txt"
+#github_bp = make_github_blueprint()
+github_bp = make_github_blueprint(
+        client_id = os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
+        client_secret = os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
+        scope='read:org')
+app.register_blueprint(github_bp, url_prefix="/login")
+contents404 = "<html><body><h1>Status: Error 404 Page Not Found</h1></body></html>"
+contents403 = "<html><body><h1>Status: Error 403 Access Denied</h1></body></html>"
+contents200 = "<html><body><h1>Status: OK 200</h1></body></html>"
 ##############################
 # Flask routes
 @app.route('/')
 def index():
-    return redirect(url_for("search", query="", fields=""))
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+    else:
+        username = github.get("/user").json()['login']
+        resp = github.get("/user/orgs")
+        if resp.ok:
+            # If they are in team copper, redirect to search.
+            # Otherwise, hit em with a 403
+            all_orgs = resp.json()
+            for org in all_orgs:
+                if org['login']=='dcppc':
+                    copper_team_id = '2700235'
+                    mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                    if mresp.status_code==204:
+                        # --------------------
+                        # Business as usual
+                        return redirect(url_for("search", query="", fields=""))
+            return contents403
+    return contents404
+### @app.route('/')
+### def index():
+###     return redirect(url_for("search", query="", fields=""))
 @app.route('/search')
 def search():
-    query = request.args['query']
-    fields = request.args.get('fields')
-    if fields == 'None':
-        fields = None
-    search = Search(app.config["INDEX_DIR"])
-    if not query:
-        parsed_query = ""
-        result = []
-    else:
-        parsed_query, result = search.search(query.split(), fields=[fields])
-        store_search(query, fields)
-    total = search.get_document_total_count()
-    return render_template('search.html', entries=result, query=query, parsed_query=parsed_query, fields=fields, last_searches=get_last_searches(), total=total)
-@app.route('/open')
-def open_file():
-    path = request.args['path']
-    fields = request.args.get('fields')
-    query = request.args['query']
-    call([app.config["EDIT_COMMAND"], path])
-    return redirect(url_for("search", query=query, fields=fields))
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+    username = github.get("/user").json()['login']
+    resp = github.get("/user/orgs")
+    if resp.ok:
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+                copper_team_id = '2700235'
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+                    # --------------------
+                    # Business as usual
+                    query = request.args['query']
+                    fields = request.args.get('fields')
+                    if fields == 'None':
+                        fields = None
+                    search = Search(app.config["INDEX_DIR"])
+                    if not query:
+                        parsed_query = ""
+                        result = []
+                    else:
+                        parsed_query, result = search.search(query.split(), fields=[fields])
+                    totals = search.get_document_total_count()
+                    return render_template('search.html',
+                               entries=result,
+                               query=query,
+                               parsed_query=parsed_query,
+                               fields=fields,
+                               totals=totals)
+    return contents403
 @app.route('/update_index')
 def update_index():
-    rebuild = request.args.get('rebuild')
-    UpdateIndexTask(diff_index=False)
-    flash("Rebuilding index, check console output")
-    return render_template("search.html", query="", fields="", last_searches=get_last_searches())
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+    username = github.get("/user").json()['login']
+    resp = github.get("/user/orgs")
+    if resp.ok:
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+                copper_team_id = '2700235'
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+                    # --------------------
+                    # Business as usual
+                    UpdateIndexTask(app.config,
+                            diff_index=False)
+                    flash("Rebuilding index, check console output")
+                    return render_template("controlpanel.html",
+                               totals={})
+    return contents403
-##############
-# Utility methods
-def get_last_searches():
-    if os.path.exists(last_searches_file):
-        with codecs.open(last_searches_file, 'r', encoding='utf-8') as f:
-            contents = f.readlines()
-    else:
-        contents = []
-    return contents
-def store_search(query, fields):
-    if os.path.exists(last_searches_file):
-        with codecs.open(last_searches_file, 'r', encoding='utf-8') as f:
-            contents = f.readlines()
-    else:
-        contents = []
-    search = "query=%s&fields=%s\n" % (query, fields)
-    if not search in contents:
-        contents.insert(0, search)
-    with codecs.open(last_searches_file, 'w', encoding='utf-8') as f:
-        f.writelines(contents[:30])
+@app.route('/control_panel')
+def control_panel():
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+    username = github.get("/user").json()['login']
+    resp = github.get("/user/orgs")
+    if resp.ok:
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+                copper_team_id = '2700235'
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+                    return render_template("controlpanel.html",
+                               totals={})
+    return contents403
+@app.errorhandler(404)
+def oops(e):
+    return contents404
 if __name__ == '__main__':
-    app.run()
+    # if running local instance, set to true
+    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = 'true'
+    app.run(host="0.0.0.0",port=5000)
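
Each route above repeats the same flask-dance membership gate (authorized check, dcppc org lookup, team 2700235 membership probe). A sketch of factoring that gate into a decorator, using only calls that appear in the diff; copper_team_required is a hypothetical name:

```
from functools import wraps
from flask import redirect, url_for
from flask_dance.contrib.github import github

contents403 = "<html><body><h1>Status: Error 403 Access Denied</h1></body></html>"

def copper_team_required(route):
    @wraps(route)
    def wrapper(*args, **kwargs):
        if not github.authorized:
            return redirect(url_for("github.login"))
        username = github.get("/user").json()['login']
        resp = github.get("/user/orgs")
        if resp.ok and any(org['login'] == 'dcppc' for org in resp.json()):
            # 204 means the user is a member of team copper
            mresp = github.get('/teams/%s/members/%s' % ('2700235', username))
            if mresp.status_code == 204:
                return route(*args, **kwargs)
        return contents403
    return wrapper
```
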

centillion_prepare.py (new file)

@@ -0,0 +1,5 @@
from gdrive_util import GDrive
gd = GDrive()
service = gd.get_service()

centillion_search.py

@@ -1,9 +1,11 @@
 import shutil
 import html.parser
-from github import Github
+from github import Github, GithubException
+import base64
 from gdrive_util import GDrive
+from groupsio_util import GroupsIOArchivesCrawler
 from apiclient.http import MediaIoBaseDownload
 import mistune
@@ -14,6 +16,8 @@ import tempfile, subprocess
 import pypandoc
 import os.path
 import codecs
+from datetime import datetime
 from whoosh.qparser import MultifieldParser, QueryParser
 from whoosh.analysis import StemmingAnalyzer
@@ -40,6 +44,7 @@ Search object functions:
 Schema:
  - id
  - kind
+ - fingerprint
  - created_time
  - modified_time
  - indexed_time
@@ -57,6 +62,10 @@ Schema:
""" """
def clean_timestamp(dt):
return dt.replace(microsecond=0).isoformat()
class SearchResult: class SearchResult:
score = 1.0 score = 1.0
path = None path = None
@@ -89,6 +98,11 @@ class Search:
     def __init__(self, index_folder):
         self.open_index(index_folder)
+    # ------------------------------
+    # Create a schema and open a search index
+    # on disk.
     def open_index(self, index_folder, create_new=False):
         """
         Create a schema,
@@ -109,13 +123,12 @@ class Search:
         # ------------------------------
-        # IMPORTANT:
         # This is where the search index's document schema
         # is defined.
         schema = Schema(
                 id = ID(stored=True, unique=True),
-                kind = ID(),
+                kind = ID(stored=True),
                 created_time = ID(stored=True),
                 modified_time = ID(stored=True),
@@ -154,28 +167,49 @@ class Search:
     # Define how to add documents
-    def add_drive_file(self, writer, item, indexed_ids, temp_dir, config):
+    def add_drive_file(self, writer, item, temp_dir, config, update=False):
         """
         Add a Google Drive document/file to a search index.
         If it is a document, extract the contents.
         """
-        gd = GDrive()
-        service = gd.get_service()
-        # ------------------------
-        # Two kinds of documents:
+        # There are two kinds of documents:
         # - documents with text that can be extracted (docx)
         # - everything else
         mimetype = re.split('[/\.]',item['mimeType'])[-1]
         mimemap = {
                 'document' : 'docx',
         }
-        if(mimetype not in mimemap.keys()):
-            # Not a document -
-            # Just a file
-            print("Indexing document %s of type %s"%(item['name'], mimetype))
+        content = ""
+        if mimetype not in mimemap.keys():
+            # Not a document - just a file
+            print("Indexing Google Drive file \"%s\" of type %s"%(item['name'], mimetype))
+            writer.delete_by_term('id',item['id'])
+            # Index a plain google drive file
+            writer.add_document(
+                    id = item['id'],
+                    kind = 'gdoc',
+                    created_time = item['createdTime'],
+                    modified_time = item['modifiedTime'],
+                    indexed_time = datetime.now().replace(microsecond=0).isoformat(),
+                    title = item['name'],
+                    url = item['webViewLink'],
+                    mimetype = mimetype,
+                    owner_email = item['owners'][0]['emailAddress'],
+                    owner_name = item['owners'][0]['displayName'],
+                    repo_name='',
+                    repo_url='',
+                    github_user='',
+                    issue_title='',
+                    issue_url='',
+                    content = content
+            )
         else:
             # Document with text
             # Perform content extraction
@@ -187,7 +221,8 @@ class Search:
             # This is a file type we know how to convert
             # Construct the URL and download it
-            print("Extracting content from %s of type %s"%(item['name'], mimetype))
+            print("Indexing Google Drive document \"%s\" of type %s"%(item['name'], mimetype))
+            print(" > Extracting content")
             # Create a URL and a destination filename
@@ -208,7 +243,7 @@ class Search:
             outfile_name = name+'.'+out_ext
-            # assemble input/output file paths
+            # Assemble input/output file paths
             fullpath_input = os.path.join(temp_dir,infile_name)
             fullpath_output = os.path.join(temp_dir,outfile_name)
@@ -217,7 +252,6 @@ class Search:
             with open(fullpath_input, 'wb') as f:
                 f.write(r.content)
             # Try to convert docx file to plain text
             try:
                 output = pypandoc.convert_file(fullpath_input,
@@ -227,12 +261,11 @@ class Search:
                 )
                 assert output == ""
             except RuntimeError:
-                print("XXXXXX Failed to index document %s"%(item['name']))
+                print(" > XXXXXX Failed to index document \"%s\""%(item['name']))
             # If export was successful, read contents of markdown
             # into the content variable.
-            # into the content variable.
             if os.path.isfile(fullpath_output):
                 # Export was successful
                 with codecs.open(fullpath_output, encoding='utf-8') as f:
@@ -240,88 +273,196 @@ class Search:
             # No matter what happens, clean up.
-            print("Cleaning up %s"%item['name'])
+            print(" > Cleaning up \"%s\""%item['name'])
-            subprocess.call(['rm','-fr',fullpath_output])
+            ## test
             #print(" ".join(['rm','-fr',fullpath_output]))
-            subprocess.call(['rm','-fr',fullpath_input])
             #print(" ".join(['rm','-fr',fullpath_input]))
+            # do it
+            subprocess.call(['rm','-fr',fullpath_output])
+            subprocess.call(['rm','-fr',fullpath_input])
-            # ------------------------------
-            # IMPORTANT:
-            # This is where the search documents are actually created.
+            if update:
+                print(" > Removing old record")
+                writer.delete_by_term('id',item['id'])
+            else:
+                print(" > Creating a new record")
-            mimetype = re.split('[/\.]', item['mimeType'])[-1]
-            writer.add_document(
-                    id = item['id'],
-                    kind = 'gdoc',
-                    created_time = item['createdTime'],
-                    modified_time = item['modifiedTime'],
+            writer.add_document(
+                    id = item['id'],
+                    kind = 'gdoc',
+                    created_time = item['createdTime'],
+                    modified_time = item['modifiedTime'],
+                    indexed_time = datetime.now().replace(microsecond=0).isoformat(),
                     title = item['name'],
                     url = item['webViewLink'],
                     mimetype = mimetype,
                     owner_email = item['owners'][0]['emailAddress'],
                     owner_name = item['owners'][0]['displayName'],
-                    repo_name=None,
-                    repo_url=None,
-                    github_user=None,
-                    issue_title=None,
-                    issue_url=None,
+                    repo_name='',
+                    repo_url='',
+                    github_user='',
+                    issue_title='',
+                    issue_url='',
                     content = content
             )
-    def add_issue(self, writer, issue, repo, config):
+    # ------------------------------
+    # Add a single github issue and its comments
+    # to a search index.
+    def add_issue(self, writer, issue, gh_token, config, update=True):
         """
         Add a Github issue/comment to a search index.
         """
-        repo_name = repo.name
+        repo = issue.repository
+        repo_name = repo.owner.login+"/"+repo.name
         repo_url = repo.html_url
-        count = 0
-        # Handle the issue content
         print("Indexing issue %s"%(issue.html_url))
-        writer.add_document(
-                id = issue.html_url,
-                kind = 'issue',
-                url = issue.html_url,
-                is_comment = False,
-                timestamp = issue.created_at,
-                repo_name = repo_name,
-                repo_url = repo_url,
-                issue_title = issue.title,
-                issue_url = issue.html_url,
-                user = issue.user.login,
-                content = issue.body.rstrip()
-        )
-        count += 1
+        # Combine comments with their respective issues.
+        # Otherwise just too noisy.
+        issue_comment_content = issue.body.rstrip()
+        issue_comment_content += "\n"
         # Handle the comments content
         if(issue.comments>0):
             comments = issue.get_comments()
             for comment in comments:
-                print(" > Indexing comment %s"%(comment.html_url))
-                writer.add_document(
-                        id = comment.html_url,
-                        kind = 'comment',
-                        url = comment.html_url,
-                        is_comment = True,
-                        timestamp = comment.created_at,
-                        repo_name = repo_name,
-                        repo_url = repo_url,
-                        issue_title = issue.title,
-                        issue_url = issue.html_url,
-                        user = comment.user.login,
-                        content = comment.body.strip()
-                )
-                count += 1
-        return count
+                issue_comment_content += comment.body.rstrip()
+                issue_comment_content += "\n"
+        # Now create the actual search index record
+        created_time = clean_timestamp(issue.created_at)
+        modified_time = clean_timestamp(issue.updated_at)
+        indexed_time = clean_timestamp(datetime.now())
+        # Add one document per issue thread,
+        # containing entire text of thread.
+        writer.add_document(
+                id = issue.html_url,
+                kind = 'issue',
+                created_time = created_time,
+                modified_time = modified_time,
+                indexed_time = indexed_time,
+                title = issue.title,
+                url = issue.html_url,
+                mimetype='',
+                owner_email='',
+                owner_name='',
+                repo_name = repo_name,
+                repo_url = repo_url,
+                github_user = issue.user.login,
+                issue_title = issue.title,
+                issue_url = issue.html_url,
+                content = issue_comment_content
+        )
+    def add_ghfile(self, writer, d, gh_token, config, update=True):
+        """
+        Use a Github file API record to add a filename
+        to the search index.
+        """
+        MARKDOWN_EXTS = ['.md','.markdown']
+        repo = d['repo']
+        org = d['org']
+        repo_name = org + "/" + repo
+        repo_url = "https://github.com/" + repo_name
+        try:
+            fpath = d['path']
+            furl = d['url']
+            fsha = d['sha']
+            _, fname = os.path.split(fpath)
+            _, fext = os.path.splitext(fpath)
+        except:
+            print(" > XXXXXXXX Failed to find file info.")
+            return
+        indexed_time = clean_timestamp(datetime.now())
+        if fext in MARKDOWN_EXTS:
+            print("Indexing markdown doc %s from repo %s"%(fname,repo_name))
+            # Unpack the requests response and decode the content
+            #
+            # don't forget the headers for private repos!
+            # useful: https://bit.ly/2LSAflS
+            headers = {'Authorization' : 'token %s'%(gh_token)}
+            response = requests.get(furl, headers=headers)
+            if response.status_code==200:
+                jresponse = response.json()
+                content = ""
+                try:
+                    binary_content = re.sub('\n','',jresponse['content'])
+                    content = base64.b64decode(binary_content).decode('utf-8')
+                except KeyError:
+                    print(" > XXXXXXXX Failed to extract 'content' field. You probably hit the rate limit.")
+            else:
+                print(" > XXXXXXXX Failed to reach file URL. There may be a problem with authentication/headers.")
+                return
+            usable_url = "https://github.com/%s/blob/master/%s"%(repo_name, fpath)
+            # Now create the actual search index record
+            writer.add_document(
+                    id = fsha,
+                    kind = 'markdown',
+                    created_time = '',
+                    modified_time = '',
+                    indexed_time = indexed_time,
+                    title = fname,
+                    url = usable_url,
+                    mimetype='',
+                    owner_email='',
+                    owner_name='',
+                    repo_name = repo_name,
+                    repo_url = repo_url,
+                    github_user = '',
+                    issue_title = '',
+                    issue_url = '',
+                    content = content
+            )
+        else:
+            print("Indexing github file %s from repo %s"%(fname,repo_name))
+            key = fname+"_"+fsha
+            # Now create the actual search index record
+            writer.add_document(
+                    id = key,
+                    kind = 'ghfile',
+                    created_time = '',
+                    modified_time = '',
+                    indexed_time = indexed_time,
+                    title = fname,
+                    url = repo_url,
+                    mimetype='',
+                    owner_email='',
+                    owner_name='',
+                    repo_name = repo_name,
+                    repo_url = repo_url,
+                    github_user = '',
+                    issue_title = '',
+                    issue_url = '',
+                    content = ''
+            )
@@ -329,133 +470,376 @@ class Search:
     # Define how to update search index
     # using different kinds of collections
+    # ------------------------------
+    # Google Drive Files/Documents
     def update_index_gdocs(self,
                            config):
         """
         Update the search index using a collection of
         Google Drive documents and files.
+        Uses the 'id' field to uniquely identify documents.
+        Also see:
+        https://developers.google.com/drive/api/v3/reference/files
         """
-        gd = GDrive()
-        service = gd.get_service()
-        # -----
-        # Get the set of all documents on Google Drive:
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids
+        # - add hash check in add_
-        # ------------------------------
-        # IMPORTANT:
-        # This determines what information about the Google Drive files
-        # you'll get back, and that's all you're going to have to work with.
-        # If you need more information, modify the statement below.
-        # Also see:
-        # https://developers.google.com/drive/api/v3/reference/files
+        # Get the set of indexed ids:
+        # ------
+        indexed_ids = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("gdoc")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+        # Get the set of remote ids:
+        # ------
+        # Start with google drive api object
         gd = GDrive()
         service = gd.get_service()
         drive = service.files()
-        # Now index all the docs in the google drive folder
         # The trick is to set next page token to None 1st time thru (fencepost)
         nextPageToken = None
         # Use the pager to return all the things
-        items = []
+        remote_ids = set()
+        full_items = {}
         while True:
+            ps = 100
             results = drive.list(
-                    pageSize=100,
+                    pageSize=ps,
                     pageToken=nextPageToken,
-                    fields="files(id, kind, createdTime, modifiedTime, mimeType, name, owners, webViewLink)",
+                    fields = "nextPageToken, files(id, kind, createdTime, modifiedTime, mimeType, name, owners, webViewLink)",
                     spaces="drive"
             ).execute()
             nextPageToken = results.get("nextPageToken")
-            items += results.get("files", [])
+            files = results.get("files",[])
+            for f in files:
+                # Add all remote docs to a set
+                remote_ids.add(f['id'])
+                # Also store the doc
+                full_items[f['id']] = f
+            ## Shorter:
+            #break
+            # Longer:
             if nextPageToken is None:
                 break
-        indexed_ids = set()
-        for item in items:
-            indexed_ids.add(item['id'])
         writer = self.ix.writer()
+        count = 0
         temp_dir = tempfile.mkdtemp(dir=os.getcwd())
         print("Temporary directory: %s"%(temp_dir))
-        if not os.path.exists(temp_dir):
-            os.mkdir(temp_dir)
-        count = 0
-        for item in items:
-            self.add_item(writer, item, indexed_ids, temp_dir, config)
-            count += 1
+        # Drop any id in indexed_ids
+        # not in remote_ids
+        drop_ids = indexed_ids - remote_ids
+        for drop_id in drop_ids:
+            writer.delete_by_term('id',drop_id)
+        # Update any id in indexed_ids
+        # and in remote_ids
+        update_ids = indexed_ids & remote_ids
+        for update_id in update_ids:
+            # cop out
+            writer.delete_by_term('id',update_id)
+            item = full_items[update_id]
+            self.add_drive_file(writer, item, temp_dir, config, update=True)
+            count += 1
+        # Add any id not in indexed_ids
+        # and in remote_ids
+        add_ids = remote_ids - indexed_ids
+        for add_id in add_ids:
+            item = full_items[add_id]
+            self.add_drive_file(writer, item, temp_dir, config, update=False)
+            count += 1
+        print("Cleaning temporary directory: %s"%(temp_dir))
+        subprocess.call(['rm','-fr',temp_dir])
         writer.commit()
         print("Done, updated %d documents in the index" % count)
+    # ------------------------------
+    # Github Issues/Comments
-    def update_index_issues(self,
-                            gh_access_token,
-                            config):
+    def update_index_issues(self, gh_token, config):
         """
         Update the search index using a collection of
         Github repo issues and comments.
         """
-        # Strategy:
-        # To get the proof of concept up and running,
-        # we are just deleting and re-indexing every issue/comment.
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids
-        g = Github(gh_access_token)
+        # Get the set of indexed ids:
+        # ------
+        indexed_issues = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("issue")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_issues.add(result['id'])
-        # Set of all URLs as existing on github
-        to_index = set()
-        writer = self.ix.writer()
+        # Get the set of remote ids:
+        # ------
+        # Start with api object
+        g = Github(gh_token)
-        # Iterate over each repo
-        list_of_repos = config['repos']
+        # Now index all issue threads in the user-specified repos
+        # Start by collecting all the things
+        remote_issues = set()
+        full_items = {}
+        # Iterate over each repo
+        list_of_repos = config['repositories']
         for r in list_of_repos:
             if '/' not in r:
                 err = "Error: specify org/reponame or user/reponame in list of repos"
                 raise Exception(err)
-            this_repo, this_org = re.split('/',r)
+            this_org, this_repo = re.split('/',r)
+            try:
+                org = g.get_organization(this_org)
+                repo = org.get_repo(this_repo)
+            except:
+                print("Error: could not gain access to repository %s"%(r))
+                continue
-            org = g.get_organization(this_org)
-            repo = org.get_repo(this_repo)
-            count = 0
-            # Iterate over each thread
+            # Iterate over each issue thread
             issues = repo.get_issues()
             for issue in issues:
-                # This approach is more work than is needed
-                # but PoC||GTFO
                 # For each issue/comment URL,
-                # remove the corresponding item
-                # and re-add it to the index
+                # grab the key and store the
+                # corresponding issue object
+                key = issue.html_url
+                value = issue
-                to_index.add(issue.html_url)
-                writer.delete_by_term('url', issue.html_url)
+                remote_issues.add(key)
+                full_items[key] = value
-                comments = issue.get_comments()
-                for comment in comments:
-                    to_index.add(comment.html_url)
-                    writer.delete_by_term('url', comment.html_url)
-                # Now re-add this issue to the index
-                # (this will also add the comments)
-                count += self.add_issue(writer, issue, repo, config)
+        writer = self.ix.writer()
+        count = 0
+        # Drop any issues in indexed_issues
+        # not in remote_issues
+        drop_issues = indexed_issues - remote_issues
+        for drop_issue in drop_issues:
+            writer.delete_by_term('id',drop_issue)
+        # Update any issue in indexed_issues
+        # and in remote_issues
+        update_issues = indexed_issues & remote_issues
+        for update_issue in update_issues:
+            # cop out
+            writer.delete_by_term('id',update_issue)
+            item = full_items[update_issue]
+            self.add_issue(writer, item, gh_token, config, update=True)
+            count += 1
+        # Add any issue not in indexed_issues
+        # and in remote_issues
+        add_issues = remote_issues - indexed_issues
+        for add_issue in add_issues:
+            item = full_items[add_issue]
+            self.add_issue(writer, item, gh_token, config, update=False)
+            count += 1
         writer.commit()
         print("Done, updated %d documents in the index" % count)
+    # ------------------------------
+    # Github Files
+    def update_index_ghfiles(self, gh_token, config):
+        """
+        Update the search index using a collection of
+        files (and, separately, Markdown files) from
+        a Github repo.
+        """
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids
+        # Get the set of indexed ids:
+        # ------
+        indexed_ids = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("ghfiles")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+        q = p.parse("markdown")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+        # Get the set of remote ids:
+        # ------
+        # Start with api object
+        g = Github(gh_token)
+        # Now index all the files.
+        # Start by collecting all the things
+        remote_ids = set()
+        full_items = {}
+        # Iterate over each repo
+        list_of_repos = config['repositories']
+        for r in list_of_repos:
+            if '/' not in r:
+                err = "Error: specify org/reponame or user/reponame in list of repos"
+                raise Exception(err)
+            this_org, this_repo = re.split('/',r)
+            try:
+                org = g.get_organization(this_org)
+                repo = org.get_repo(this_repo)
+            except:
+                print("Error: could not gain access to repository %s"%(r))
+                continue
+            # Get head commit
+            commits = repo.get_commits()
+            try:
+                last = commits[0]
+                sha = last.sha
+            except GithubException:
+                print("Error: could not get commits from repository %s"%(r))
+                continue
+            # Get all the docs
+            tree = repo.get_git_tree(sha=sha, recursive=True)
+            docs = tree.raw_data['tree']
+            print("Parsing file ids from repository %s"%(r))
+            for d in docs:
+                # For each doc, get the file extension
+                # and decide what to do with it.
+                fpath = d['path']
+                _, fname = os.path.split(fpath)
+                _, fext = os.path.splitext(fpath)
+                key = d['sha']
+                d['org'] = this_org
+                d['repo'] = this_repo
+                value = d
+                remote_ids.add(key)
+                full_items[key] = value
+        writer = self.ix.writer()
+        count = 0
+        # Drop any id in indexed_ids
+        # not in remote_ids
+        drop_ids = indexed_ids - remote_ids
+        for drop_id in drop_ids:
+            writer.delete_by_term('id',drop_id)
+        # Update any id in indexed_ids
+        # and in remote_ids
+        update_ids = indexed_ids & remote_ids
+        for update_id in update_ids:
+            # cop out: just delete and re-add
+            writer.delete_by_term('id',update_id)
+            item = full_items[update_id]
+            self.add_ghfile(writer, item, gh_token, config, update=True)
+            count += 1
+        # Add any issue not in indexed_ids
+        # and in remote_ids
+        add_ids = remote_ids - indexed_ids
+        for add_id in add_ids:
+            item = full_items[add_id]
+            self.add_ghfile(writer, item, gh_token, config, update=False)
+            count += 1
+        writer.commit()
+        print("Done, updated %d Github files in the index" % count)
+    # ------------------------------
+    # Groups.io Emails
+    def update_index_groupsioemails(self, groupsio_token, config):
+        """
+        Update the search index using the email archives
+        of groups.io groups.
+        This requires the use of a spider.
+        RELEASE THE SPIDER!!!
+        """
+        spider = GroupsIOArchivesCrawler(groupsio_token,'dcppc')
+        # - ask spider to crawl the archives
+        spider.crawl_group_archives()
+        # - ask spider for list of all email records
+        #   - 1 email = 1 dictionary
+        #   - email records compiled by the spider
+        archives = spider.get_archives()
+        # - email object is sent off to add email method
+        print("Finished indexing groups.io emails")
     # ---------------------------------
     # Search results bundler
@@ -477,11 +861,6 @@ class Search:
             # contains a {% for e in entries %}
             # and then an {{e.score}}
-            # ------------------
-            # cheseburger
-            # create search results
             sr = SearchResult()
             sr.score = r.score
@@ -495,37 +874,29 @@ class Search:
             sr.id = r['id']
             sr.kind = r['kind']
-            sr.url = r['url']
+            sr.created_time = r['created_time']
+            sr.modified_time = r['modified_time']
+            sr.indexed_time = r['indexed_time']
             sr.title = r['title']
+            sr.url = r['url']
             sr.mimetype = r['mimetype']
             sr.owner_email = r['owner_email']
             sr.owner_name = r['owner_name']
-            sr.content = r['content']
-            # -----------------
-            # github isuses
-            # create search results
-            sr = SearchResult()
-            sr.score = r.score
-            sr.url = r['url']
-            sr.title = r['issue_title']
             sr.repo_name = r['repo_name']
             sr.repo_url = r['repo_url']
             sr.issue_title = r['issue_title']
             sr.issue_url = r['issue_url']
-            sr.is_comment = r['is_comment']
+            sr.github_user = r['github_user']
             sr.content = r['content']
-            # ------------------
             highlights = r.highlights('content')
             if not highlights:
                 # just use the first 1,000 words of the document
@@ -533,21 +904,18 @@ class Search:
             highlights = self.html_parser.unescape(highlights)
             html = self.markdown(highlights)
+            html = re.sub(r'\n','<br />',html)
             sr.content_highlight = html
             search_results.append(sr)
         return search_results
-    # ------------------
-    # github issues
-    # create search results
     def search(self, query_list, fields=None):
         with self.ix.searcher() as searcher:
             query_string = " ".join(query_list)
             query = None
@@ -558,27 +926,15 @@ class Search:
             elif len(fields) == 2:
                 pass
             else:
-                fields = ['id',
-                          'kind',
-                          'created_time',
-                          'modified_time',
-                          'indexed_time',
-                          'title',
-                          'url',
-                          'mimetype',
-                          'owner_email',
-                          'owner_name',
-                          'repo_name',
-                          'repo_url',
-                          'issue_title',
-                          'issue_url',
-                          'github_user',
-                          'content']
+                # If the user does not specify a field,
+                # these are the fields that are actually searched
+                fields = ['title',
+                          'content']
             if not query:
                 query = MultifieldParser(fields, schema=self.ix.schema).parse(query_string)
             parsed_query = "%s" % query
             print("query: %s" % parsed_query)
-            results = searcher.search(query, terms=False, scored=True, groupedby="url")
+            results = searcher.search(query, terms=False, scored=True, groupedby="kind")
             search_result = self.create_search_result(results)
             return parsed_query, search_result
@@ -589,9 +945,29 @@ class Search:
         return s if len(s) <= l else s[0:l - 3] + '...'
     def get_document_total_count(self):
-        return self.ix.searcher().doc_count_all()
+        p = QueryParser("kind", schema=self.ix.schema)
+        counts = {
+                "gdoc" : None,
+                "issue" : None,
+                "ghfile" : None,
+                "markdown" : None,
+                "total" : None
+        }
+        for key in counts.keys():
+            q = p.parse(key)
+            with self.ix.searcher() as s:
+                results = s.search(q,limit=None)
+                counts[key] = len(results)
+        counts['total'] = sum(counts[k] for k in counts.keys())
+        return counts
 if __name__ == "__main__":
+    raise Exception("Error: main method not implemented (fix groupsio credentials first)")
     search = Search("search_index")
     from get_centillion_config import get_centillion_config
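
All of the update_index_* methods above follow the same set-algebra reconciliation; a condensed sketch of the shared pattern (reconcile and add_record are stand-in names, not functions from the diff):

```
# writer is a whoosh IndexWriter; add_record stands in for
# add_drive_file / add_issue / add_ghfile.
def reconcile(writer, indexed_ids, remote_items, add_record):
    remote_ids = set(remote_items.keys())
    for drop_id in indexed_ids - remote_ids:     # gone upstream: drop
        writer.delete_by_term('id', drop_id)
    for update_id in indexed_ids & remote_ids:   # still present: delete, re-add
        writer.delete_by_term('id', update_id)
        add_record(writer, remote_items[update_id], update=True)
    for add_id in remote_ids - indexed_ids:      # new upstream: add
        add_record(writer, remote_items[add_id], update=False)
    writer.commit()
```
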

config_centillion.json

@@ -1,7 +1,27 @@
 {
     "repositories" : [
+        "dcppc/project-management",
+        "dcppc/nih-demo-meetings",
+        "dcppc/internal",
+        "dcppc/organize",
+        "dcppc/dcppc-bot",
+        "dcppc/full-stacks",
+        "dcppc/design-guidelines-discuss",
+        "dcppc/dcppc-deliverables",
+        "dcppc/dcppc-milestones",
+        "dcppc/crosscut-metadata",
+        "dcppc/lucky-penny",
+        "dcppc/dcppc-workshops",
+        "dcppc/metadata-matrix",
+        "dcppc/data-stewards",
+        "dcppc/dcppc-phase1-demos",
+        "dcppc/apis",
         "dcppc/2018-june-workshop",
         "dcppc/2018-july-workshop",
-        "dcppc/data-stewards"
+        "dcppc/2018-august-workshop",
+        "dcppc/2018-september-workshop",
+        "dcppc/design-guidelines",
+        "dcppc/2018-may-workshop",
+        "dcppc/centillion"
     ]
 }

config_flask.example.py Normal file

@@ -0,0 +1,20 @@
# Location of index file
INDEX_DIR = "search_index"
# oauth client deets
GITHUB_OAUTH_CLIENT_ID = "XXX"
GITHUB_OAUTH_CLIENT_SECRET = "YYY"
GITHUB_TOKEN = "ZZZ"
# More information footer: Repository label
FOOTER_REPO_ORG = "charlesreid1"
FOOTER_REPO_NAME = "centillion"
# Toggle to show Whoosh parsed query
SHOW_PARSED_QUERY=True
TAGLINE = "Search All The Things"
# Flask settings
DEBUG = True
SECRET_KEY = 'WWWWW'


@@ -1,27 +0,0 @@
# Path to markdown files
MARKDOWN_FILES_DIR = "/Users/charles/codes/whoosh/markdown-search/fake-docs/"
# Location of index file
INDEX_DIR = "search_index"
# Command to use when clicking on filepath in search results
EDIT_COMMAND = "view"
# Toggle to show Whoosh parsed query
SHOW_PARSED_QUERY=True
# Toggle to use tags
USE_TAGS=True
# Optional prefix in a markdown file, e.g. "tags: python search markdown tutorial"
TAGS_PREFIX=""
# List of tags that should be ignored
TAGS_TO_IGNORE = "and are what how its not with the"
# Regular expression to select tags, eg tag has to start with alphanumeric followed by at least two alphanumeric or "-" or "."
TAGS_REGEX = r"\b([A-Za-z0-9][A-Za-z0-9-.]{2,})\b"
# Flask settings
DEBUG = True
SECRET_KEY = '42c5a8eda356ca9d9c3ab2d149541e6b91d843fa'

docs/centillion_components.md Normal file

@@ -0,0 +1,22 @@
# Centillion Components
Centillion keeps it simple.
There are two components:
* The `Search` object, which uses whoosh and various
APIs (Github, Google Drive) to build and manage
the search index. The `Search` object also runs all
queries against the search index. (See the
[Centillion Whoosh](centillion_whoosh.md) page
or the `centillion_search.py` file
for details.)
* The Flask app, which uses Jinja templates to present the
user with a minimal web frontend that allows them
to interact with the search engine. (See the
[Centillion Flask](centillion_flask.md) page
or the `centillion.py` file
for details.)

docs/centillion_flask.md Normal file

@@ -0,0 +1,30 @@
# Centillion Flask
## What the flask server does
Flask is a web server framework
that lets developers define the behavior
of specific endpoints,
such as `/hello_world`
(<http://localhost:5000/hello_world>
on a web server running locally).
## Flask server routes
- `/home`
    - if not logged in, this redirects to a "log into github" landing page (not implemented yet)
    - if logged in, this redirects to the search route
- `/search`
    - renders the search template
- `/main_index_update`
    - updates the main index (re-indexes all documents, period)
- `/control_panel`
    - this is the control panel, where you can trigger
      the search index to be re-made (a sketch of these routes follows)
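A minimal sketch of these routes, assuming a module-level `Search` object; the handler names and bodies here are illustrative, not the actual `centillion.py` code:

```python
# Sketch only: the route layout described above, with hypothetical
# handler names. The real centillion.py wires these up differently.
from flask import Flask, redirect, render_template, request, url_for
from centillion_search import Search

app = Flask(__name__)
search = Search("search_index")

@app.route('/home')
def home():
    # real app: show a "log into github" landing page if not logged in
    return redirect(url_for('search_route'))

@app.route('/search')
def search_route():
    query = request.args.get('query', '')
    parsed_query, entries = search.search(query.split())
    return render_template('search.html', query=query,
                           parsed_query=parsed_query, entries=entries)

@app.route('/main_index_update')
def update_index():
    # re-index all docs, period (may require API credentials/config)
    search.update_main_index()
    return redirect(url_for('search_route'))
```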

docs/centillion_whoosh.md Normal file

@@ -0,0 +1,34 @@
# Centillion Whoosh
The `centillion_search.py` file defines a
`Search` class that serves as the backend
for centillion.
## What the Search class does
The `Search` class has two roles:
- create (and update) the search index
- this also requires the `Search` class
to define the schema for storing documents
- run queries against the search index,
and package results up for Flask and Jinja
## Search class functions
The `Search` class defines several functions:
- `open_index()` creates the schema
- `add_issue()`, `add_md()`, and `add_document()` have three different method signatures
  and add different types of documents to the search index
- `update_all_issues()`, `update_all_md()`, and `update_all_documents()` each iterate over items
  and determine whether each item needs to be updated in the search index
- `update_main_index()` - update the entire search index
    - calls all three update_all methods
- `create_search_results()` - package things up for jinja
- `search()` - run the query, pass results to the jinja-packager (see the sketch below)
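As a concrete illustration of the query role, here is a sketch of the query path using whoosh directly; the index directory and field names follow this page's diffs, and the query term is a made-up example:

```python
# Sketch only: running a query against an existing whoosh index,
# mirroring what Search.search() does in centillion_search.py.
from whoosh import index
from whoosh.qparser import MultifieldParser

ix = index.open_dir("search_index")
with ix.searcher() as searcher:
    # default fields when the user doesn't specify any
    query = MultifieldParser(["title", "content"], schema=ix.schema).parse("centillion")
    results = searcher.search(query, terms=False, scored=True, groupedby="kind")
    for r in results:
        print(r['url'], r.score)
```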

docs/images/cp.png Normal file (binary image not shown; 498 KiB)

docs/images/ss.png Normal file (binary image not shown; 355 KiB)

docs/index.md Normal file

@@ -0,0 +1,84 @@
# Centillion
**centillion**: a pan-github-markdown-issues-google-docs search engine.
**a centillion**: a very large number consisting of a 1 with 303 zeros after it.
centillion is 3.03 log-times better than the googol.
## What is centillion
Centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
a Python library for building search engines.
We define the types of documents centillion should index,
and what information to extract from each. Centillion then builds and
updates a search index. That's all done in `centillion_search.py`.
Centillion also provides a simple web frontend for running
queries against the search index. That's done using a Flask server
defined in `centillion.py`.
Centillion keeps it simple.
## Quickstart
Run centillion with a Github API access token set via
environment variable:
```
GITHUB_TOKEN="XXXXXXXX" python centillion.py
```
This will start a Flask server, and you can view the minimal search engine
interface in your browser at <http://localhost:5000>.
## Configuration
### Centillion configuration
`config_centillion.json` defines configuration variables
for centillion - namely, what to index, how, and where.
### Flask configuration
`config_flask.py` defines configuration variables
used by flask, which controls the web frontend
for centillion.
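A sketch of how these two configuration files might be consumed; `get_centillion_config` is the helper imported in `centillion_search.py`'s main block, though its body here is an assumption:

```python
# Sketch only: loading centillion's two config files.
import json
from flask import Flask

def get_centillion_config(filename="config_centillion.json"):
    # assumed implementation of the helper referenced in the diffs
    with open(filename) as f:
        return json.load(f)

config = get_centillion_config()
print(config["repositories"])  # list of "org/repo" strings to index

app = Flask(__name__)
app.config.from_pyfile("config_flask.py")  # DEBUG, SECRET_KEY, TAGLINE, ...
```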
## Control Panel/Rebuilding Search Index
To rebuild the search engine, visit the control panel route (`/control_panel`),
for example at <http://localhost:5000/control_panel>.
This allows you to rebuild the search engine index. The search index
is stored in the `search_index/` directory, and that directory
can be configured with centillion's configuration file.
The diff search index is faster to build, as it only
re-indexes documents that have been added or changed since
the last search index update.
The main search index is slower to build, as it will
re-index everything.
(Cron scripts? Threaded task that runs hourly?)
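One possible answer to that question, sketched as a background thread rather than cron; `update_main_index()` is the `Search` method described on this page, and the hourly interval matches the question above:

```python
# Sketch only: a threaded task that re-indexes hourly.
import threading

def reindex_hourly(search):
    search.update_main_index()  # calls all three update_all_* methods
    # schedule the next run an hour from now
    threading.Timer(3600, reindex_hourly, args=(search,)).start()

# kick it off once at startup: reindex_hourly(search)
```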
## Details
More on the details of how centillion works.
Under the hood, centillion uses flask and whoosh.
Flask builds and runs the web server.
Whoosh handles search requests and management
of the search index.
[Centillion Components](centillion_components.md)
[Centillion Flask](centillion_flask.md)
[Centillion Whoosh](centillion_whoosh.md)


@@ -31,3 +31,4 @@ Stateless


@@ -1,4 +1,4 @@
## work that is done: standalone

**Stage 1: index folder of markdown files** (done)
* See [markdown-search](https://git.charlesreid1.com/charlesreid1/markdown-search.git)
@@ -13,7 +13,7 @@
Needs work:
* <s>More appropriate schema</s>
* Using more features (weights) plus pandoc filters for schema
* Sqlalchemy (and hey waddya know safari books has it covered)
@@ -25,15 +25,16 @@ Needs work:
* Main win here is uncovering metadata/linking/presentation issues

Needs work:
- <s>treat comments and issues as separate objects, fill out separate schema fields
- map out and organize how the schema is updated to make it more flexible
- configuration needs to enable user to specify organization+repos</s>

```plain
{
    "to_index" : [
        "google/google-api-python-client",
        "microsoft/TypeCode",
        "microsoft/api-guidelines"
    ]
}
```
@@ -48,3 +49,4 @@ Needs work:
* Use the google drive api (see simple-simon)
* Main win is more uncovering of metadata issues, identifying
big-picture issues for centillion

docs/workinprogress.md Normal file

@@ -0,0 +1,48 @@
# Components
The components of centillion are as follows:
- Flask application, which creates a Search object and uses it to query the search index
- Search object, which allows you to create/update/search an index
## Routes layout
Centillion flask app routes:
- `/home`
- if not logged in, landing page
- if logged in, redirect to search
- `/search`
- `/main_index_update`
- update main index, all docs period
## Functions layout
Centillion Search class functions:
- `open_index()` creates the schema
- `add_issue()`, `add_md()`, and `add_document()` have three different method signatures
  and add different types of documents to the search index
- `update_all_issues()`, `update_all_md()`, and `update_all_documents()` each iterate over items
  and determine whether each item needs to be updated in the search index
- `update_main_index()` - update the entire search index
    - calls all three update_all methods
- `create_search_results()` - package things up for jinja
- `search()` - run the query, pass results to the jinja-packager
Nice to have but focus on it later:
- update diff search index (what's been added since last index time)
- max index time
## Files layout
Schema definition:
* include a "kind" or "class" field to group objects
* can provide different searches of different collections
* eventually can provide the user with checkboxes
A sketch of such a schema follows.
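Here is what such a schema might look like in whoosh, with a stored `kind` field that queries can filter on; the field names beyond `kind` are assumptions:

```python
# Sketch only: a whoosh schema with a "kind" field for grouping
# documents, and a query filtered to a single kind.
import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

schema = Schema(id=ID(stored=True, unique=True),
                kind=ID(stored=True),      # "gdoc", "issue", "ghfile", "markdown"
                title=TEXT(stored=True),
                content=TEXT(stored=True))

os.makedirs("search_index", exist_ok=True)
ix = index.create_in("search_index", schema)

writer = ix.writer()
writer.add_document(id=u"issue-1", kind=u"issue",
                    title=u"Example issue", content=u"example text")
writer.commit()

# Search only one collection by querying the kind field:
with ix.searcher() as searcher:
    q = QueryParser("kind", schema=ix.schema).parse(u"issue")
    print(len(searcher.search(q, limit=None)))
```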

gdrive_util.py

@@ -29,8 +29,7 @@ class GDrive(object):
): ):
""" """
Set up the Google Drive API instance. Set up the Google Drive API instance.
Factory method: create it and hand it over. Factory method: create it here, hand it over in get_service().
Then we're finished.
""" """
self.credentials_file = credentials_file self.credentials_file = credentials_file
self.client_secret_file = client_secret_file self.client_secret_file = client_secret_file
@@ -40,6 +39,9 @@ class GDrive(object):
self.store = file.Storage(credentials_file) self.store = file.Storage(credentials_file)
def get_service(self): def get_service(self):
"""
Return an instance of the Google Drive API service.
"""
creds = self.store.get() creds = self.store.get()
if not creds or creds.invalid: if not creds or creds.invalid:

groupsio_util.py Normal file

@@ -0,0 +1,382 @@
import requests, os, re
from bs4 import BeautifulSoup


class GroupsIOArchivesCrawler(object):
    """
    This is a Groups.io spider
    designed to crawl the email
    archives of a group.

    credentials (dictionary):
        groupsio_token : api access token
        groupsio_username : username
        groupsio_password : password
    """
    def __init__(self,
                 credentials,
                 group_name):
        # template url for archives page (list of topics)
        self.url = "https://{group}.groups.io/g/{subgroup}/topics"
        self.login_url = "https://groups.io/login"
        self.credentials = credentials
        self.group_name = group_name
        self.crawled_archives = False
        self.archives = None

    def get_archives(self):
        """
        Return a list of dictionaries containing
        information about each email topic in the
        groups.io email archive.

        Call crawl_group_archives() first!
        """
        return self.archives

    def get_subgroups_list(self):
        """
        Use the API to get a list of subgroups.
        """
        subgroups_url = 'https://api.groups.io/v1/getsubgroups'
        key = self.credentials['groupsio_token']
        data = [('group_name', self.group_name),
                ('limit', 100)]
        response = requests.post(subgroups_url,
                                 data=data,
                                 auth=(key, ''))
        response = response.json()
        data = response['data']
        subgroups = {}
        for group in data:
            k = group['id']
            v = re.sub(r'dcppc\+', '', group['name'])
            subgroups[k] = v
        return subgroups
    def crawl_group_archives(self):
        """
        Spider will crawl the email archives of the entire group
        by crawling the email archives of each subgroup.
        """
        subgroups = self.get_subgroups_list()

        # ------------------------------
        # Start by logging in.

        # Create session object to persist session data
        session = requests.Session()

        # Log in to the website
        data = dict(email=self.credentials['groupsio_username'],
                    password=self.credentials['groupsio_password'],
                    timezone='America/Los_Angeles')
        r = session.post(self.login_url,
                         data=data)
        csrf = self.get_csrf(r)

        # ------------------------------
        # For each subgroup, crawl the archives
        # and return a list of dictionaries
        # containing all the email threads.
        for subgroup_id in subgroups.keys():
            self.crawl_subgroup_archives(session,
                                         csrf,
                                         subgroup_id,
                                         subgroups[subgroup_id])

        # Done. archives are now tucked away
        # in the variable self.archives
        #
        # self.archives is a list of dictionaries,
        # with each dictionary containing info about
        # a topic/email thread in a subgroup.
        # ------------------------------
    def crawl_subgroup_archives(self, session, csrf, subgroup_id, subgroup_name):
        """
        This kicks off the process to crawl the entire
        archives of a given subgroup on groups.io.

        For a given subgroup the url is self.url,
        https://{group}.groups.io/g/{subgroup}/topics

        This is the first of a paginated list of topics.
        Procedure is:
        - passed a starting page (or its contents)
        - iterate through all topics via the HTML page elements
        - assemble a bundle of information about each topic:
            - topic title, by, URL, date, content, permalink
        - content filtering:
            - ^From, Reply-To, Date, To, Subject
            - lines containing phone numbers:
                - 9 digits
                - XXX-XXX-XXXX, (XXX) XXX-XXXX
                - XXXXXXXXXX, XXX XXX XXXX
            - ^Work: or (Work) or Work$
            - Home, Cell, Mobile
            - +1 XXX
            - \w@\w
        - while the next button is not greyed out,
          click the next button

        Everything is stored in self.archives,
        a list of dictionaries.
        """
        # Initialize the archives once, so results from
        # earlier subgroups are not wiped out on each call.
        if self.archives is None:
            self.archives = []

        prefix = "https://{group}.groups.io".format(group=self.group_name)
        url = self.url.format(group=self.group_name,
                              subgroup=subgroup_name)

        # ------------------------------
        # Now get the first page
        r = session.get(url)

        # ------------------------------
        # Fencepost algorithm:

        # First page:
        # Extract a list of (title, link) items
        items = self.extract_archive_page_items_(r)

        # Get the next link
        next_url = self.get_next_url_(r)

        # Now add each item to the archive of threads,
        # then find the next button.
        self.add_items_to_archives_(session, subgroup_name, items)
        if next_url is None:
            return
        else:
            full_next_url = prefix + next_url

        # Now click the next button
        # (use the logged-in session for pagination)
        next_request = session.get(full_next_url)
        while next_request.status_code == 200:
            items = self.extract_archive_page_items_(next_request)
            next_url = self.get_next_url_(next_request)
            self.add_items_to_archives_(session, subgroup_name, items)
            if next_url is None:
                return
            else:
                full_next_url = prefix + next_url
                next_request = session.get(full_next_url)
    def add_items_to_archives_(self, session, subgroup_name, items):
        """
        Given a set of items from a list of threads,
        items being title and link,
        get the page and store all info
        in self.archives variable
        (list of dictionaries)
        """
        for (title, link) in items:
            # Get the thread page:
            prefix = "https://{group}.groups.io".format(group=self.group_name)
            full_link = prefix + link
            r = session.get(full_link)
            soup = BeautifulSoup(r.text, 'html.parser')
            # soup contains the entire thread

            # What are we extracting:
            # 1. thread number
            # 2. permalink
            # 3. content/text (filtered)

            # - - - - - - - - - - - - - -
            # 1. topic/thread number:
            # <a rel="nofollow" href="">
            # where link is:
            # https://{group}.groups.io/g/{subgroup}/topic/{topic_id}
            # example topic id: 24209140
            #
            # ugly links are in the form
            # https://dcppc.groups.io/g/{subgroup}/topic/some_text_here/{thread_id}?p=,,,,,1,2,3,,,4,,5
            # split at ?, 0th portion
            # then split at /, last (-1th) portion
            topic_id = link.split('?')[0].split('/')[-1]

            # - - - - - - - - - - - - - - -
            # 2. permalink:
            # - current link is ugly link
            # - permalink is the nice one
            # - topic id is available from the ugly link
            # https://{group}.groups.io/g/{subgroup}/topic/{topic_id}
            permalink_template = "https://{group}.groups.io/g/{subgroup}/topic/{topic_id}"
            permalink = permalink_template.format(
                    group=self.group_name,
                    subgroup=subgroup_name,
                    topic_id=topic_id
            )

            # - - - - - - - - - - - - - - -
            # 3. content:
            # Need to rearrange how we're assembling threads here.
            # This is one thread, no?
            content = []

            subject = soup.find('title').text

            # Extract information for the schema:
            # - permalink for thread (done)
            # - subject/title (done)
            # - original sender email/name (done)
            # - content (done)

            # Groups.io pages have zero CSS classes, which makes everything
            # a giant pain in the neck to interact with. Thanks Groups.io!
            original_sender = ''
            for i, tr in enumerate(soup.find_all('tr', {'class': 'test'})):
                # Every other tr row contains an email.
                if (i + 1) % 2 == 0:
                    # nope, no email here
                    pass
                else:
                    # found an email!
                    # this is a maze, thanks groups.io
                    td = tr.find('td')
                    divrow = td.find('div', {'class': 'row'}).find('div', {'class': 'pull-left'})
                    if (i + 1) == 1:
                        original_sender = divrow.text.strip()
                    for div in td.find_all('div'):
                        if div.has_attr('id'):
                            # purge any signatures
                            for x in div.find_all('div', {'id': 'Signature'}):
                                x.extract()
                            # purge any headers
                            for x in div.find_all('div'):
                                nonos = ['From:', 'Sent:', 'To:', 'Cc:', 'CC:', 'Subject:']
                                for nono in nonos:
                                    if nono in x.text:
                                        x.extract()
                            message_text = div.get_text()

                            # More filtering:
                            # phone numbers
                            message_text = re.sub(r'[0-9]{3}-[0-9]{3}-[0-9]{4}', 'XXX-XXX-XXXX', message_text)
                            # (fixed: braces must not be escaped inside this raw-string pattern)
                            message_text = re.sub(r'[0-9]{10}', 'XXXXXXXXXX', message_text)

                            content.append(message_text)

            full_content = "\n".join(content)

            thread = {
                    'permalink': permalink,
                    'subject': subject,
                    'original_sender': original_sender,
                    'content': full_content
            }

            print('*' * 40)
            for k in thread.keys():
                if k == 'content':
                    pass
                else:
                    print("%s : %s" % (k, thread[k]))
            print('*' * 40)

            self.archives.append(thread)
    def extract_archive_page_items_(self, response):
        """
        (Private method)
        Given a response from a GET request,
        use beautifulsoup to extract all items
        (thread titles and ugly thread links)
        and pass them back in a list.
        """
        soup = BeautifulSoup(response.content, "html.parser")
        rows = soup.find_all('tr', {'class': 'test'})
        if 'rate limited' in soup.text:
            raise Exception("Error: rate limit in place for Groups.io")

        results = []
        for row in rows:
            # We don't care about anything except title and ugly link
            subject = row.find('span', {'class': 'subject'})
            title = subject.get_text()
            link = row.find('a')['href']
            print(title)
            results.append((title, link))
        return results
    def get_next_url_(self, response):
        """
        (Private method)
        Given a response (which is a list of threads),
        find the next button and return its URL.
        If there is no next URL, or the button is disabled,
        return None.
        """
        soup = BeautifulSoup(response.text, 'html.parser')
        chevron = soup.find('i', {'class': 'fa-chevron-right'})
        try:
            if '#' in chevron.parent['href']:
                # empty link, abort
                return None
        except AttributeError:
            # no chevron link found at all, abort
            return None

        if chevron.parent.parent.has_attr('class') and 'disabled' in chevron.parent.parent['class']:
            # no next link, abort
            return None

        return chevron.parent['href']
    def get_csrf(self, resp):
        """
        Find the CSRF token embedded in the subgroup page
        """
        soup = BeautifulSoup(resp.text, 'html.parser')
        csrf = ''
        for i in soup.find_all('input'):
            # Note that i.name is different from i['name']:
            # the first is the actual tag name,
            # the second is the attribute name="xyz".
            # Use .get() so inputs without a name don't raise KeyError.
            if i.get('name') == 'csrf':
                csrf = i['value']

        if csrf == '':
            err = "ERROR: Could not find csrf token on page."
            raise Exception(err)

        return csrf
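For reference, a hypothetical usage sketch of this crawler; the credential values are placeholders:

```python
# Sketch only: driving GroupsIOArchivesCrawler as defined above.
from groupsio_util import GroupsIOArchivesCrawler

credentials = {
    'groupsio_token'    : 'XXX',
    'groupsio_username' : 'user@example.com',
    'groupsio_password' : 'YYY',
}
spider = GroupsIOArchivesCrawler(credentials, group_name='dcppc')

# Log in, walk every subgroup's paginated topic lists,
# then collect the accumulated threads:
spider.crawl_group_archives()
threads = spider.get_archives()
# each thread: {'permalink', 'subject', 'original_sender', 'content'}
```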

install_pandoc.sh Executable file

@@ -0,0 +1,19 @@
#!/bin/bash
#
# for ubuntu
if [ "$(id -u)" != "0" ]; then
    echo ""
    echo ""
    echo "This script should be run as root."
    echo ""
    echo ""
    exit 1;
fi

OFILE="/tmp/pandoc.deb"
curl -L https://github.com/jgm/pandoc/releases/download/2.2.2.1/pandoc-2.2.2.1-1-amd64.deb -o ${OFILE}
dpkg -i ${OFILE}
rm -f ${OFILE}

mkdocs-material-dib Submodule added at c3dd912f3c

requirements.txt

@@ -9,3 +9,5 @@ PyGithub>=1.39
pypandoc>=1.4
requests>=2.19
pandoc>=1.0
flask-dance>=1.0.0
beautifulsoup4>=4.6

static/bootstrap.min.css vendored Normal file (diff suppressed: lines too long)

static/bootstrap.min.js vendored Normal file (diff suppressed: lines too long)

static/centillion_black.png Normal file (binary image not shown; 29 KiB)

static/centillion_white.png Normal file (binary image not shown; 25 KiB)

(binary image not shown; 30 KiB)

static/jquery.min.js vendored Normal file (diff suppressed: lines too long)

static/style.css

@@ -1,3 +1,45 @@
span.badge {
    vertical-align: text-bottom;
}

a.badgelinks, a.badgelinks:hover {
    color: #fff;
    text-decoration: none;
}

div.list-group {
    border: 1px solid rgba(86,61,124,.2);
}

li.list-group-item {
    position: relative;
    display: block;
    /*padding: 20px 10px;*/
    margin-bottom: -1px;
    background-color: #f8f8f8;
    border: 1px solid #ddd;
}

li.search-group-item {
    position: relative;
    display: block;
    padding: 0px;
    margin-bottom: -1px;
    background-color: #fff;
    border: 1px solid #ddd;
}

div.url {
    background-color: rgba(86,61,124,.15);
    padding: 8px;
}

/***************************/

body {
    font-family: sans-serif;
}
@@ -56,7 +98,7 @@ table {
    overflow: hidden;
}

.info, .last-searches {
    color: gray;
    font-size: 12px;
    font-family: Arial, serif;

templates/controlpanel.html Executable file

@@ -0,0 +1,108 @@
{% extends "layout.html" %}
{% block body %}
{% with messages = get_flashed_messages() %}
{% if messages %}
<div class="container">
<div class="alert alert-success alert-dismissible">
<a href="#" class="close" data-dismiss="alert" aria-label="close">&times;</a>
<ul class=flashes>
{% for message in messages %}
<li>{{ message }}</li>
{% endfor %}
</ul>
</div>
</div>
{% endif %}
{% endwith %}
<div class="container">
<div class="row">
<div class="col-md-12">
<center>
<a href="{{ url_for('search')}}?query=&fields=">
<img src="{{ url_for('static', filename='centillion_white.png') }}">
</a>
{% if config['TAGLINE'] %}
<h2><a href="{{ url_for('search')}}?query=&fields=">
{{config['TAGLINE']}}
</a></h2>
{% endif %}
</center>
</div>
</div>
{% if config['zzzTAGLINE'] %}
<div class="row">
<div class="col12sm">
<center>
<h2><a href="{{ url_for('search')}}?query=&fields=">
{{config['TAGLINE']}}
</a></h2>
</center>
</div>
</div>
{% endif %}
</div>
<hr />
<div class="container">
<div class="row">
{# update main search index #}
<div class="panel panel-danger">
<div class="panel-heading">
<h3 class="panel-title">
Update Main Search Index
</h3>
</div>
<div class="panel-body">
<div class="container-fluid">
<div class="row">
<div class="col-md-12">
<p class="panel-text">Re-index <i>every</i> document in the
remote collection in the search index. <b>Warning: this operation may take a while.</b>
<p/> <p>
<a href="{{ url_for('update_index') }}" class="btn btn-large btn-danger">Update Main Index</a>
<p/>
</div>
</div>
</div>
</div>
</div>
{# update diff search index #}
<div class="panel panel-danger">
<div class="panel-heading">
<h3 class="panel-title">
Update Diff Search Index
</h3>
</div>
<div class="panel-body">
<div class="container-fluid">
<div class="row">
<div class="col-md-12">
<p class="panel-text">Diff search index only re-indexes documents created after the last
search index update. <b>Not currently implemented.</b>
<p/> <p>
<a href="#" class="btn btn-large disabled btn-danger">Update Diff Index</a>
<p/>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
{% endblock %}

templates/layout.html

@@ -1,10 +1,12 @@
<!doctype html>
<title>Centillion Search Engine</title>
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='style.css') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='github-markdown.css') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='bootstrap.min.css') }}">
<script src="{{ url_for('static', filename='jquery.min.js') }}"></script>
<script src="{{ url_for('static', filename='bootstrap.min.js') }}"></script>
<div>
    {% for message in get_flashed_messages() %}
    <div class="flash">{{ message }}</div>
    {% endfor %}
    {% block body %}{% endblock %}
</div>

templates/search.html

@@ -1,62 +1,188 @@
{% extends "layout.html" %} {% extends "layout.html" %}
{% block body %} {% block body %}
<h1><a href="{{ url_for('search')}}?query=&fields=">Search directory: {{ config.MARKDOWN_FILES_DIR }}</a></h1>
<a class="index" href="{{ url_for('update_index')}}">[update index]</a>
<a class="index" href="{{ url_for('update_index')}}?rebuild=True">[rebuild index]</a> <div class="container">
<form action="{{ url_for('search') }}" name="search">
<input type="text" name="query" value="{{ query }}"> {#
<input type="submit" value="search"> banner image
<a href="{{ url_for('search')}}?query=&fields=">[clear]</a> #}
</form> <div class="row">
<table cellspacing="3"> <div class="col12sm">
{% if directories %} <center>
<tr> <a href="{{ url_for('search')}}?query=&fields=">
<td class="directories-cloud">File directories:&nbsp <img src="{{ url_for('static', filename='centillion_white.png') }}">
</a>
{#
need a tag line
#}
{% if config['TAGLINE'] %}
<h2><a href="{{ url_for('search')}}?query=&fields=">
{{config['TAGLINE']}}
</a></h2>
{% endif %}
</center>
</div>
</div>
</div>
<div class="container">
<div class="row">
<div class="col-xs-12">
<center>
<form action="{{ url_for('search') }}" name="search">
<input type="text" name="query" value="{{ query }}"> <br />
<button type="submit" style="font-size: 20px; padding: 10px; padding-left: 50px; padding-right: 50px;"
value="search" class="btn btn-primary">Search</button>
<br />
<a href="{{ url_for('search')}}?query=&fields=">[clear all results]</a>
</form>
</center>
</div>
</div>
</div>
<div class="container">
<div class="row">
{% if directories %}
<div class="col-xs-12 info directories-cloud">
<b>File directories:</b>
{% for d in directories %} {% for d in directories %}
<a href="{{url_for('search')}}?query={{d|trim}}&fields=filename">{{d|trim}}</a> <a href="{{url_for('search')}}?query={{d|trim}}&fields=filename">{{d|trim}}</a>
{% endfor %} {% endfor %}
</td> </div>
</tr> {% endif %}
{% endif %}
{% if config['SHOW_PARSED_QUERY']%}
<tr>
<td class="info">Parsed query: {{ parsed_query }}</td>
</tr>
{% endif %}
<tr>
<td class="info">FOUND {{ entries | length }} results of {{total}} documents</td>
</tr>
{% for e in entries %} <ul class="list-group">
<tr>
<td class="search-result"> {% if config['SHOW_PARSED_QUERY'] and parsed_query %}
<!-- <li class="list-group-item">
<div class="path"><a href='{{ url_for("open_file")}}?path={{e.path|urlencode}}&query={{query}}&fields={{fields}}'>{{e.path}}</a>score: {{'%d' % e.score}}</div> <div class="container-fluid">
--> <div class="row">
<div class="url"> <div class="col-xs-12 info">
{% if e.is_comment %} <b>Parsed query:</b> {{ parsed_query }}
<b>Comment</b> <a href='{{e.url}}'>(comment link)</a> </div>
on issue <a href='{{e.issue_url}}'>{{e.issue_title}}</a> </div>
in repo <a href='{{e.repo_url}}'>dcppc/{{e.repo_name}}</a> </div>
<br /> </li>
{% else %} {% endif %}
<b>Issue</b> <a href='{{e.issue_url}}'>{{e.issue_title}}</a>
in repo <a href='{{e.repo_url}}'>dcppc/{{e.repo_name}}</a> {% if parsed_query %}
<br /> <li class="list-group-item">
{% endif %} <div class="container-fluid">
score: {{'%d' % e.score}} <div class="row">
</div> <div class="col-xs-12 info">
<div class="markdown-body">{{ e.content_highlight|safe}}</div> <b>Found:</b> <span class="badge">{{entries|length}}</span> results
</td> out of <span class="badge">{{totals["total"]}}</span> total items indexed
</tr> </div>
{% endfor %} </div>
</table> </div>
<div class="last-searches">Last searches: <br/> </li>
{% for s in last_searches %} {% endif %}
<span><a href="{{url_for('search')}}?{{s}}">{{s}}</a></span>
{% endfor %} <li class="list-group-item">
<div class="container-fluid">
<div class="row">
<div class="col-xs-12 info">
<b>Indexing:</b> <span
class="badge">{{totals["gdoc"]}}</span> Google Documents,
<span class="badge">{{totals["issue"]}}</span> Github issues,
<span class="badge">{{totals["ghfile"]}}</span> Github files,
<span class="badge">{{totals["markdown"]}}</span> Github markdown files.
</div>
</div>
</div>
</li>
</ul>
</div>
</div> </div>
<p>
More info can be found in the <a href="https://github.com/BernhardWenzel/markdown-search">README.md file</a> <div class="container">
</p> <div class="row">
<ul class="list-group">
{% for e in entries %}
<li class="search-group-item">
<div class="url">
{% if e.kind=="gdoc" %}
{% if e.mimetype=="" %}
<b>Google Document:</b>
<a href='{{e.url}}'>{{e.title}}</a>
(Owner: {{e.owner_name}}, {{e.owner_email}})<br />
<b>Document Type</b>: {{e.mimetype}}
{% else %}
<b>Google Drive:</b>
<a href='{{e.url}}'>{{e.title}}</a>
(Owner: {{e.owner_name}}, {{e.owner_email}})
{% endif %}
{% elif e.kind=="issue" %}
<b>Github Issue:</b>
<a href='{{e.url}}'>{{e.title}}</a>
{% if e.github_user %}
opened by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
{% endif %}
<br/>
<b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
{% elif e.kind=="markdown" %}
<b>Github Markdown:</b>
<a href='{{e.url}}'>{{e.title}}</a>
<br/>
<b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
{% else %}
<b>Item:</b> (<a href='{{e.url}}'>link</a>)
{% endif %}
<br />
Score: {{'%d' % e.score}}
</div>
<div class="markdown-body">
{% if e.content_highlight %}
{{ e.content_highlight|safe}}
{% else %}
<p>(A preview of this document is not available.)</p>
{% endif %}
</div>
</li>
{% endfor %}
</ul>
</div>
</div>
<div class="container">
<div class="row">
<ul class="list-group">
{% if config['FOOTER_REPO_NAME'] %}
{% if config['FOOTER_REPO_ORG'] %}
<li class="list-group-item">
<div class="container-fluid">
<div class="row">
<div class="col-xs-12 info">
More information about {{config['FOOTER_REPO_NAME']}} can be found
in the <a href="https://github.com/{{config['FOOTER_REPO_ORG']}}/{{config['FOOTER_REPO_NAME']}}">{{config['FOOTER_REPO_ORG']}}/{{config['FOOTER_REPO_NAME']}}</a>
repository on Github.
</div>
</div>
</div>
</li>
{% endif %}
{% endif %}
</ul>
</div>
</div>
{% endblock %} {% endblock %}