Merge branch 'master' of github.com:charlesreid1/centillion

* 'master' of github.com:charlesreid1/centillion: update config_flask.example.py to strip dc info
Merge branch 'master' of github.com:dcppc/centillion
2018-08-13 19:14:54 -07:00 · 2018-08-13 19:14:07 -07:00 · 2018-08-13 19:13:53 -07:00 · 2018-08-13 12:42:18 -07:00 · 2018-08-13 00:54:12 -07:00 · 2018-08-13 00:27:45 -07:00
29 changed files with 1598 additions and 356 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,8 +1,8 @@
+config_flask.py
 vp
 credentials.json
 drive*.json
 *.pyc
-config.py
 out/
 search_index/
 venv/
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,3 +1,3 @@
-[submodule "mkdocs-material"]
-	path = mkdocs-material
-	url = https://git.charlesreid1.com/charlesreid1/mkdocs-material.git
+[submodule "mkdocs-material-dib"]
+	path = mkdocs-material-dib
+	url = https://github.com/dib-lab/mkdocs-material-dib.git
--- a/Readme.md
+++ b/Readme.md
@@ -1,17 +1,19 @@
-# The Centillion
+# Centillion

-**the centillion**: a pan-github-markdown-issues-google-docs search engine.
+**centillion**: a pan-github-markdown-issues-google-docs search engine.

 **a centillion**: a very large number consisting of a 1 with 303 zeros after it.

-the centillion is 3.03 log-times better than the googol.
+one centillion is 3.03 log-times better than a googol.
+
+![Screen shot of centillion](docs/images/ss.png)

-![Screen shot of centillion](img/ss.png)

 ## what is it

-The centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
-a Python library for building search engines.
+Centillion (https://github.com/dcppc/centillion) is a search engine that can index 
+three kinds of collections: Google Documents, Github issues, and Markdown files in 
+Github repos.

 We define the types of documents the centillion should index,
 what info and how. The centillion then builds and
@@ -23,23 +25,70 @@ defined in `centillion.py`.

 The centillion keeps it simple.

+## authentication layer

-## quickstart
+Centillion lives behind a Github authentication layer, implemented with 
+[flask-dance](https://github.com/singingwolfboy/flask-dance). When you first
+visit the site it will ask you to authenticate with Github so that it can 
+verify you have permission to access the site.

-Run the centillion app with a github access token API key set via
-environment variable:
+## technologies
+
+Centillion is a Python program built using whoosh (a search engine library). It
+indexes the full text of docx files in Google Documents, and only the filenames of
+non-docx files. The full text of each issue and its comments is indexed, and
+results are grouped by issue. Centillion requires Google Drive and Github OAuth
+apps. Once you provide credentials to Flask, you're all set to go.
+
+
+## control panel
+
+There's also a control panel at <https://search.nihdatacommons.us/control_panel> 
+that allows you to rebuild the search index from scratch (the Google Drive indexing 
+takes a while).
+
+![Screen shot of centillion control panel](docs/images/cp.png)
+
+
+## quickstart (with Github auth)
+
+Start by creating a Github OAuth application.
+Get the public and private application keys
+(the client ID and client secret)
+from the Github application's page.
+You will also need a Github access token
+(in addition to the app tokens).
+
+When you create the application, set the callback
+URL to `/login/github/authorized`, as in:

 ```
-GITHUB_TOKEN="XXXXXXXX" python centillion.py
+https://<url>/login/github/authorized
+```
+
+Edit the Flask configuration `config_flask.py`
+and set the public and private application keys.
+
+Now run centillion:
+
+```
+python centillion.py
+```
+
+or if you used http instead of https:
+
+```
+OAUTHLIB_INSECURE_TRANSPORT="true" python centillion.py
 ```

 This will start a Flask server, and you can view the minimal search engine
-interface in your browser at <http://localhost:5000>.
-
-## more info
-
-For more info see the documentation: <https://charlesreid1.github.io/centillion>
+interface in your browser at `http://<ip>:5000`.


+## troubleshooting

+If you are having problems with your callback URL being treated
+as HTTP by Github, even though there is an HTTPS address, and
+everything else seems fine, try deleting the Github OAuth app
+and creating a new one.
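
Note on the quickstart above: a possible shape for `config_flask.py`, pieced together from the keys that `centillion.py` reads in this commit (`GITHUB_TOKEN`, `GROUPSIO_TOKEN`, `GROUPSIO_USERNAME`, `GROUPSIO_PASSWORD`) plus `INDEX_DIR`, which the pre-merge code read. `SECRET_KEY` and the two `GITHUB_OAUTH_*` names are assumptions: Flask needs a secret key to sign the session that holds the OAuth token, and flask-dance documents those two names as its config fallbacks, but `centillion.py` as merged here reads the client ID and secret from environment variables instead. Every value is a placeholder.

```python
# config_flask.py: hypothetical sketch, all values are placeholders.

# Assumption: used by Flask to sign the session cookie.
SECRET_KEY = "change-me"

# Directory that holds the whoosh search index.
INDEX_DIR = "search_index"

# Assumption: OAuth application keys under the names flask-dance documents.
GITHUB_OAUTH_CLIENT_ID = "XXXXXXXX"
GITHUB_OAUTH_CLIENT_SECRET = "XXXXXXXX"

# Personal access token used when indexing issues and repository files.
GITHUB_TOKEN = "XXXXXXXX"

# Groups.io credentials passed to the email archive crawler.
GROUPSIO_TOKEN = "XXXXXXXX"
GROUPSIO_USERNAME = "user@example.com"
GROUPSIO_PASSWORD = "XXXXXXXX"
```

Since `.gitignore` now lists `config_flask.py`, a file like this stays out of version control.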

--- a/Todo.md
+++ b/Todo.md
@@ -1,7 +1,47 @@
 # todo

-current problems:
- some github issues have no title
- github issues are just being re-indexed over and over
- documents not showing up in results
+Main task:
+- hashing and caching
+    - <s>first, working out the logic of how we group items into sets
+        - needs to be deleted
+        - needs to be updated
+        - needs to be added
+        - for docs, issues, and comments</s>
+    - second, when we add or update an item, need to:
+        - go through the motions, download file, extract text
+        - check for existing indexed doc with that id
+        - check if existing indexed doc has same hash
+            - if so, skip
+            - otherwise, delete and re-index
+
+Other bugs:
+- Some github issues have no title (?)
+- <s>Need to combine issues with comments</s>
+- Not able to index markdown files _in a repo_
+- (Longer term) update main index vs update diff index
+
+Needs:
+- <s>control panel</s>
+
+Thursday product:
+- Everything re-indexed nightly
+- Search engine built on all documents in Google Drive, all issues, markdown files
+- Using pandoc to extract Google Drive document contents
+- BRIEF quickstart documentation
+
+Future:
+- Future plans to improve - plugins, improving matching
+- Subdomain plans
+- Folksonomy tagging and integration plans
+
+
+
+
+config options for plugins
+conditional blocks with import github inside
+complicated tho - better to have components split off
+
+
+
+
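
Note on the "hashing and caching" task above: the grouping into delete/update/add sets, and the "same hash, skip" check, can be sketched as below. This is a minimal illustration, not the project's code. The function names are hypothetical, and SHA-256 over the extracted text is an assumed choice of fingerprint; the diff only says the schema gains a `fingerprint` field.

```python
import hashlib


def partition_ids(indexed_ids, remote_ids):
    """Split ids into (to_delete, to_update, to_add) with set algebra."""
    indexed_ids, remote_ids = set(indexed_ids), set(remote_ids)
    to_delete = indexed_ids - remote_ids   # indexed, but gone from the remote
    to_update = indexed_ids & remote_ids   # present on both sides
    to_add = remote_ids - indexed_ids      # on the remote, not yet indexed
    return to_delete, to_update, to_add


def fingerprint(text):
    """Hex digest of the extracted text (hash choice is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(stored_fingerprint, new_text):
    """True when the item is new (no stored hash) or its content changed."""
    return stored_fingerprint != fingerprint(new_text)


# Example: "a" was removed remotely, "b" exists on both sides, "c" is new.
to_delete, to_update, to_add = partition_ids({"a", "b"}, {"b", "c"})
```

With a check like `needs_reindex`, an item in the update set would only be deleted and re-indexed when its stored fingerprint differs from the freshly computed one.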

--- a/centillion.py
+++ b/centillion.py
@@ -2,8 +2,11 @@ import threading
 from subprocess import call

 import codecs
-import os
+import os, json
+
+from werkzeug.contrib.fixers import ProxyFix
 from flask import Flask, request, redirect, url_for, render_template, flash
+from flask_dance.contrib.github import make_github_blueprint, github

 # create our application
 from centillion_search import Search
@@ -22,10 +25,18 @@ You provide:
    - Google Drive API key via file
 """

+
 class UpdateIndexTask(object):
-    def __init__(self, diff_index=False):
+    def __init__(self, app_config, diff_index=False):
        self.diff_index = diff_index
        thread = threading.Thread(target=self.run, args=())
+
+        self.gh_token = app_config['GITHUB_TOKEN']
+        self.groupsio_credentials = {
+                'groupsio_token' :     app_config['GROUPSIO_TOKEN'],
+                'groupsio_username' :  app_config['GROUPSIO_USERNAME'],
+                'groupsio_password' :  app_config['GROUPSIO_PASSWORD']
+        }
        thread.daemon = True
        thread.start()

@@ -38,30 +49,91 @@ class UpdateIndexTask(object):
        from get_centillion_config import get_centillion_config
        config = get_centillion_config('config_centillion.json')

-        gh_token = os.environ['GITHUB_TOKEN']
-        search.update_index_issues(gh_token, config)
-        search.update_index_gdocs(config)
+        search.update_index_groupsioemails(self.groupsio_credentials,config)
+        ###search.update_index_ghfiles(self.gh_token,config)
+        ###search.update_index_issues(self.gh_token,config)
+        ###search.update_index_gdocs(config)



 app = Flask(__name__)
+app.wsgi_app = ProxyFix(app.wsgi_app)

 # Load default config and override config from an environment variable
 app.config.from_pyfile("config_flask.py")

-last_searches_file = app.config["INDEX_DIR"] + "/last_searches.txt"
+#github_bp = make_github_blueprint()
+github_bp = make_github_blueprint(
+                        client_id = os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
+                        client_secret = os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
+                        scope='read:org')
+
+app.register_blueprint(github_bp, url_prefix="/login")
+
+contents404 = "<html><body><h1>Status: Error 404 Page Not Found</h1></body></html>"
+contents403 = "<html><body><h1>Status: Error 403 Access Denied</h1></body></html>"
+contents200 = "<html><body><h1>Status: OK 200</h1></body></html>"


 ##############################
 # Flask routes

-
@app.route('/')
 def index():
+
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+
+    else:
+
+        username = github.get("/user").json()['login']
+
+        resp = github.get("/user/orgs")
+        if resp.ok:
+
+            # If they are in team copper, redirect to search.
+            # Otherwise, hit em with a 403
+            all_orgs = resp.json()
+            for org in all_orgs:
+                if org['login']=='dcppc':
+                    copper_team_id = '2700235'
+                    mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                    if mresp.status_code==204:
+
+                        # --------------------
+                        # Business as usual
                        return redirect(url_for("search", query="", fields=""))

+            return contents403
+
+        return contents404
+
+### @app.route('/')
+### def index():
+###     return redirect(url_for("search", query="", fields=""))
+
@app.route('/search')
 def search():
+
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+
+    username = github.get("/user").json()['login']
+
+    resp = github.get("/user/orgs")
+    if resp.ok:
+
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+
+                copper_team_id = '2700235'
+
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+
+                    # --------------------
+                    # Business as usual
                    query = request.args['query']
                    fields = request.args.get('fields')
                    if fields == 'None':
@@ -74,7 +146,6 @@ def search():

                    else:
                        parsed_query, result = search.search(query.split(), fields=[fields])
-        store_search(query, fields)

                    totals = search.get_document_total_count()

@@ -83,46 +154,75 @@ def search():
                                           query=query, 
                                           parsed_query=parsed_query, 
                                           fields=fields, 
-                           last_searches=get_last_searches(), 
                                           totals=totals)

+    return contents403
+
+
@app.route('/update_index')
 def update_index():
-    rebuild = request.args.get('rebuild')
-    UpdateIndexTask(diff_index=False)
+
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+
+    username = github.get("/user").json()['login']
+
+    resp = github.get("/user/orgs")
+    if resp.ok:
+
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+
+                copper_team_id = '2700235'
+
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+
+                    # --------------------
+                    # Business as usual
+                    UpdateIndexTask(app.config,
+                                    diff_index=False)
                    flash("Rebuilding index, check console output")
-    return render_template("search.html", 
-                           query="", 
-                           fields="", 
-                           last_searches=get_last_searches(),
+                    return render_template("controlpanel.html", 
                                           totals={})

+    return contents403

-##############
-# Utility methods

-def get_last_searches():
-    if os.path.exists(last_searches_file):
-        with codecs.open(last_searches_file, 'r', encoding='utf-8') as f:
-            contents = f.readlines()
-    else:
-        contents = []
-    return contents

-def store_search(query, fields):
-    if os.path.exists(last_searches_file):
-        with codecs.open(last_searches_file, 'r', encoding='utf-8') as f:
-            contents = f.readlines()
-    else:
-        contents = []
+@app.route('/control_panel')
+def control_panel():

-    search = "query=%s&fields=%s\n" % (query, fields)
-    if not search in contents:
-        contents.insert(0, search)
+    if not github.authorized:
+        return redirect(url_for("github.login"))

-    with codecs.open(last_searches_file, 'w', encoding='utf-8') as f:
-        f.writelines(contents[:30])
+    username = github.get("/user").json()['login']
+
+    resp = github.get("/user/orgs")
+    if resp.ok:
+
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+
+                copper_team_id = '2700235'
+
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+
+                    return render_template("controlpanel.html", 
+                                           totals={})
+
+    return contents403
+
+
+@app.errorhandler(404)
+def oops(e):
+    return contents404

 if __name__ == '__main__':
-    app.run()
+    # allow OAuth over plain HTTP; only appropriate for a local instance
+    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = 'true'
+    app.run(host="0.0.0.0",port=5000)
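
Note on the routes above: `index`, `search`, `update_index`, and `control_panel` each repeat the same authorization and team-membership check. One way to factor that out is a decorator, sketched here without Flask so it runs on its own. Everything in it is hypothetical: the five callables are injected stand-ins for `github.authorized`, the `/user/orgs` lookup, the membership request that returns 204, the login redirect, and the 403 page. It also simplifies the original, which returns a 404 page from `index` when the orgs request fails.

```python
import functools


def require_team_member(is_authorized, get_orgs, is_member,
                        on_login, on_denied, org_login="dcppc"):
    """Build a decorator that runs a view only for team members.

    All callables are hypothetical hooks supplied by the caller.
    """
    def decorator(view):
        @functools.wraps(view)
        def wrapper(*args, **kwargs):
            # Not logged in yet: send the user to the login flow.
            if not is_authorized():
                return on_login()
            # Logged in: require the org first, then team membership.
            in_org = any(org.get("login") == org_login for org in get_orgs())
            if in_org and is_member():
                return view(*args, **kwargs)
            return on_denied()
        return wrapper
    return decorator


# Usage with stand-in callables in place of the flask-dance session.
guard = require_team_member(
    is_authorized=lambda: True,
    get_orgs=lambda: [{"login": "dcppc"}],
    is_member=lambda: True,
    on_login=lambda: "redirect to login",
    on_denied=lambda: "403",
)


@guard
def control_panel():
    return "control panel"
```

Each route would then carry one extra decorator line instead of a nested block, and the team id would live in a single place.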

--- a/centillion_prepare.py
+++ b/centillion_prepare.py
@@ -0,0 +1,5 @@
+from gdrive_util import GDrive
+
+gd = GDrive()
+service = gd.get_service()
+
--- a/centillion_search.py
+++ b/centillion_search.py
@@ -1,9 +1,11 @@
 import shutil
 import html.parser

-from github import Github
+from github import Github, GithubException
+import base64

 from gdrive_util import GDrive
+from groupsio_util import GroupsIOArchivesCrawler
 from apiclient.http import MediaIoBaseDownload

 import mistune
@@ -42,6 +44,7 @@ Search object functions:
 Schema:
    - id
    - kind
+    - fingerprint
    - created_time
    - modified_time
    - indexed_time
@@ -95,6 +98,11 @@ class Search:
    def __init__(self, index_folder):
        self.open_index(index_folder)

+
+    # ------------------------------
+    # Create a schema and open a search index
+    # on disk.
+
    def open_index(self, index_folder, create_new=False):
        """
        Create a schema,
@@ -115,7 +123,6 @@ class Search:

        
        # ------------------------------
-        # IMPORTANT:
        # This is where the search index's document schema
        # is defined.

@@ -160,16 +167,13 @@ class Search:
    # Define how to add documents


-    def add_drive_file(self, writer, item, indexed_ids, temp_dir, config):
+    def add_drive_file(self, writer, item, temp_dir, config, update=False):
        """
        Add a Google Drive document/file to a search index.
        If it is a document, extract the contents.
        """
-        gd = GDrive()
-        service = gd.get_service()

-        # ------------------------
-        # Two kinds of documents:
+        # There are two kinds of documents:
        # - documents with text that can be extracted (docx)
        # - everything else
        
@@ -179,88 +183,13 @@ class Search:
        }

        content = ""
-        if(mimetype not in mimemap.keys()):
-            # Not a document - 
-            # Just a file
-            print("Indexing document \"%s\" of type %s"%(item['name'], mimetype))
-        else:
-            # Document with text
-            # Perform content extraction
+        if mimetype not in mimemap.keys():

-            # -----------
-            # docx Content Extraction:
-            # 
-            # We can only do this with .docx files
-            # This is a file type we know how to convert
-            # Construct the URL and download it
+            # Not a document - just a file
+            print("Indexing Google Drive file \"%s\" of type %s"%(item['name'], mimetype))
+            writer.delete_by_term('id',item['id'])

-            print("Extracting content from \"%s\" of type %s"%(item['name'], mimetype))
-
-
-            # Create a URL and a destination filename
-            file_ext = mimemap[mimetype]
-            file_url = "https://docs.google.com/document/d/%s/export?format=%s"%(item['id'], file_ext)
-
-            # This re could probablybe improved
-            name = re.sub('/','_',item['name'])
-
-            # Now make the pandoc input/output filenames
-            out_ext = 'txt'
-            pandoc_fmt = 'plain'
-            if name.endswith(file_ext):
-                infile_name = name
-                outfile_name = re.sub(file_ext,out_ext,infile_name)
-            else:
-                infile_name  = name+'.'+file_ext
-                outfile_name = name+'.'+out_ext
-
-
-            # assemble input/output file paths
-            fullpath_input = os.path.join(temp_dir,infile_name)
-            fullpath_output = os.path.join(temp_dir,outfile_name)
-
-            # Use requests.get to download url to file
-            r = requests.get(file_url, allow_redirects=True)
-            with open(fullpath_input, 'wb') as f:
-                f.write(r.content)
-
-
-            # Try to convert docx file to plain text
-            try:
-                output = pypandoc.convert_file(fullpath_input,
-                                               pandoc_fmt,
-                                               format='docx',
-                                               outputfile=fullpath_output
-                )
-                assert output == ""
-            except RuntimeError:
-                print("XXXXXX Failed to index document \"%s\""%(item['name']))
-
-
-            # If export was successful, read contents of markdown
-            # into the content variable.
-            # into the content variable.
-            if os.path.isfile(fullpath_output):
-                # Export was successful
-                with codecs.open(fullpath_output, encoding='utf-8') as f:
-                    content = f.read()
-
-
-            # No matter what happens, clean up.
-            print("Cleaning up \"%s\""%item['name'])
-
-            subprocess.call(['rm','-fr',fullpath_output])
-            #print(" ".join(['rm','-fr',fullpath_output]))
-
-            subprocess.call(['rm','-fr',fullpath_input])
-            #print(" ".join(['rm','-fr',fullpath_input]))
-
-
-        # ------------------------------
-        # IMPORTANT:
-        # This is where the search documents are actually created.
-
-        mimetype = re.split('[/\.]', item['mimeType'])[-1]
+            # Index a plain google drive file
            writer.add_document(
                    id = item['id'],
                    kind = 'gdoc',
@@ -281,23 +210,143 @@ class Search:
            )


-    def add_issue(self, writer, issue, repo, config):
+        else:
+            # Document with text
+            # Perform content extraction
+
+            # -----------
+            # docx Content Extraction:
+            # 
+            # We can only do this with .docx files
+            # This is a file type we know how to convert
+            # Construct the URL and download it
+
+            print("Indexing Google Drive document \"%s\" of type %s"%(item['name'], mimetype))
+            print(" > Extracting content")
+
+
+            # Create a URL and a destination filename
+            file_ext = mimemap[mimetype]
+            file_url = "https://docs.google.com/document/d/%s/export?format=%s"%(item['id'], file_ext)
+
+            # This re could probably be improved
+            name = re.sub('/','_',item['name'])
+
+            # Now make the pandoc input/output filenames
+            out_ext = 'txt'
+            pandoc_fmt = 'plain'
+            if name.endswith(file_ext):
+                infile_name = name
+                outfile_name = re.sub(file_ext,out_ext,infile_name)
+            else:
+                infile_name  = name+'.'+file_ext
+                outfile_name = name+'.'+out_ext
+
+
+            # Assemble input/output file paths
+            fullpath_input = os.path.join(temp_dir,infile_name)
+            fullpath_output = os.path.join(temp_dir,outfile_name)
+
+            # Use requests.get to download url to file
+            r = requests.get(file_url, allow_redirects=True)
+            with open(fullpath_input, 'wb') as f:
+                f.write(r.content)
+
+            # Try to convert docx file to plain text
+            try:
+                output = pypandoc.convert_file(fullpath_input,
+                                               pandoc_fmt,
+                                               format='docx',
+                                               outputfile=fullpath_output
+                )
+                assert output == ""
+            except RuntimeError:
+                print(" > XXXXXX Failed to index document \"%s\""%(item['name']))
+
+
+            # If export was successful, read the plain-text output
+            # into the content variable.
+            if os.path.isfile(fullpath_output):
+                # Export was successful
+                with codecs.open(fullpath_output, encoding='utf-8') as f:
+                    content = f.read()
+
+
+            # No matter what happens, clean up.
+            print(" > Cleaning up \"%s\""%item['name'])
+
+            ## test
+            #print(" ".join(['rm','-fr',fullpath_output]))
+            #print(" ".join(['rm','-fr',fullpath_input]))
+
+            # do it
+            subprocess.call(['rm','-fr',fullpath_output])
+            subprocess.call(['rm','-fr',fullpath_input])
+
+            if update:
+                print(" > Removing old record")
+                writer.delete_by_term('id',item['id'])
+            else:
+                print(" > Creating a new record")
+
+            writer.add_document(
+                    id = item['id'],
+                    kind = 'gdoc',
+                    created_time = item['createdTime'],
+                    modified_time = item['modifiedTime'],
+                    indexed_time = datetime.now().replace(microsecond=0).isoformat(),
+                    title = item['name'],
+                    url = item['webViewLink'],
+                    mimetype = mimetype,
+                    owner_email = item['owners'][0]['emailAddress'],
+                    owner_name = item['owners'][0]['displayName'],
+                    repo_name='',
+                    repo_url='',
+                    github_user='',
+                    issue_title='',
+                    issue_url='',
+                    content = content
+            )
+
+
+
+
+    # ------------------------------
+    # Add a single github issue and its comments
+    # to a search index.
+
+
+    def add_issue(self, writer, issue, gh_token, config, update=True):
        """
        Add a Github issue/comment to a search index.
        """
+        repo = issue.repository
        repo_name = repo.owner.login+"/"+repo.name
        repo_url = repo.html_url

-        count = 0
-
-
-        # Handle the issue content
        print("Indexing issue %s"%(issue.html_url))

+        # Combine comments with their respective issues.
+        # Otherwise just too noisy.
+        issue_comment_content = (issue.body or "").rstrip()
+        issue_comment_content += "\n"
+
+        # Handle the comments content
+        if(issue.comments>0):
+
+            comments = issue.get_comments()
+            for comment in comments:
+
+                issue_comment_content += comment.body.rstrip()
+                issue_comment_content += "\n"
+
+        # Now create the actual search index record
        created_time = clean_timestamp(issue.created_at)
        modified_time = clean_timestamp(issue.updated_at)
        indexed_time = clean_timestamp(datetime.now())

+        # Add one document per issue thread,
+        # containing entire text of thread.
        writer.add_document(
                id = issue.html_url,
                kind = 'issue',
@@ -314,45 +363,106 @@ class Search:
                github_user = issue.user.login,
                issue_title = issue.title,
                issue_url = issue.html_url,
-                content = issue.body.rstrip()
+                content = issue_comment_content
        )
-        count += 1



-        # Handle the comments content
-        if(issue.comments>0):

-            comments = issue.get_comments()
-            for comment in comments:
+    def add_ghfile(self, writer, d, gh_token, config, update=True):
+        """
+        Use a Github file API record to add a filename
+        to the search index.
+        """
+        MARKDOWN_EXTS = ['.md','.markdown']

-                print(" > Indexing comment %s"%(comment.html_url))
+        repo = d['repo']
+        org = d['org']
+        repo_name = org + "/" + repo
+        repo_url = "https://github.com/" + repo_name
+
+        try:
+            fpath = d['path']
+            furl = d['url']
+            fsha = d['sha']
+            _, fname = os.path.split(fpath)
+            _, fext = os.path.splitext(fpath)
+        except (KeyError, TypeError):
+            print(" > XXXXXXXX Failed to find file info.")
+            return

-                created_time = clean_timestamp(comment.created_at)
-                modified_time = clean_timestamp(comment.updated_at)
        indexed_time = clean_timestamp(datetime.now())

+        if fext in MARKDOWN_EXTS:
+            print("Indexing markdown doc %s from repo %s"%(fname,repo_name))
+
+            # Unpack the requests response and decode the content
+            # 
+            # don't forget the headers for private repos!
+            # useful: https://bit.ly/2LSAflS
+
+            headers = {'Authorization' : 'token %s'%(gh_token)}
+
+            response = requests.get(furl, headers=headers)
+            if response.status_code==200:
+                jresponse = response.json()
+                content = ""
+                try:
+                    binary_content = re.sub('\n','',jresponse['content'])
+                    content = base64.b64decode(binary_content).decode('utf-8')
+                except KeyError:
+                    print(" > XXXXXXXX Failed to extract 'content' field. You probably hit the rate limit.")
+
+            else:
+                print(" > XXXXXXXX Failed to reach file URL. There may be a problem with authentication/headers.")
+                return 
+
+            usable_url = "https://github.com/%s/blob/master/%s"%(repo_name, fpath)
+
+            # Now create the actual search index record
            writer.add_document(
-                        id = comment.html_url,
-                        kind = 'comment',
-                        created_time = created_time,
-                        modified_time = modified_time,
+                    id = fsha,
+                    kind = 'markdown',
+                    created_time = '',
+                    modified_time = '',
                    indexed_time = indexed_time,
-                        title = "Comment on "+issue.title,
-                        url = comment.html_url,
+                    title = fname,
+                    url = usable_url,
                    mimetype='',
                    owner_email='',
                    owner_name='',
                    repo_name = repo_name,
                    repo_url = repo_url,
-                        github_user = comment.user.login,
-                        issue_title = issue.title,
-                        issue_url = issue.html_url,
-                        content = comment.body.rstrip()
+                    github_user = '',
+                    issue_title = '',
+                    issue_url = '',
+                    content = content
            )

-        count += 1
-        return count
+        else:
+            print("Indexing github file %s from repo %s"%(fname,repo_name))
+
+            key = fname+"_"+fsha
+
+            # Now create the actual search index record
+            writer.add_document(
+                    id = key,
+                    kind = 'ghfile',
+                    created_time = '',
+                    modified_time = '',
+                    indexed_time = indexed_time,
+                    title = fname,
+                    url = repo_url,
+                    mimetype='',
+                    owner_email='',
+                    owner_name='',
+                    repo_name = repo_name,
+                    repo_url = repo_url,
+                    github_user = '',
+                    issue_title = '',
+                    issue_url = '',
+                    content = ''
+            )
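
Note on the markdown branch of `add_ghfile` above: the file body arrives base64-encoded in the `content` field of a JSON response, with newlines embedded in the encoded text. A standalone sketch of that decoding step follows; the function name is hypothetical. By default Python's `base64.b64decode` discards characters outside the base64 alphabet, so the sketch skips the manual newline stripping.

```python
import base64


def decode_contents_payload(payload):
    """Return the decoded text of a JSON dict with a base64 'content' field.

    Returns "" when the field is missing, e.g. on an error response.
    """
    encoded = payload.get("content")
    if encoded is None:
        return ""
    # With validate=False (the default), newlines and other characters
    # outside the base64 alphabet are discarded before decoding.
    return base64.b64decode(encoded).decode("utf-8")
```

Returning an empty string on a missing field mirrors the original code, which leaves `content` empty and still indexes the filename.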



@@ -360,51 +470,58 @@ class Search:
    # Define how to update search index
    # using different kinds of collections

+
+    # ------------------------------
+    # Google Drive Files/Documents
+
    def update_index_gdocs(self, 
                           config):
        """
        Update the search index using a collection of 
        Google Drive documents and files.
+        
+        Uses the 'id' field to uniquely identify documents.
+
+        Also see:
+        https://developers.google.com/drive/api/v3/reference/files
        """
-        gd = GDrive()
-        service = gd.get_service()

-        # -----
-        # Get the set of all documents on Google Drive:
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids
+        # - add hash check in add_

-        # ------------------------------
-        # IMPORTANT:
-        # This determines what information about the Google Drive files
-        # you'll get back, and that's all you're going to have to work with.
-        # If you need more information, modify the statement below.
-        # Also see:
-        # https://developers.google.com/drive/api/v3/reference/files

+        # Get the set of indexed ids:
+        # ------
+        indexed_ids = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("gdoc")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+
+
+        # Get the set of remote ids:
+        # ------
+        # Start with google drive api object
        gd = GDrive()
        service = gd.get_service()
        drive = service.files()

-
-        # We should do more here
-        # to check if we should update
-        # or not...
-        # 
-        # loop over existing documents in index:
-        #
-        #    p = QueryParser("kind", schema=self.ix.schema)
-        #    q = p.parse("gdoc")
-        #    with self.ix.searcher() as s:
-        #        results = s.search(q,limit=None)
-        #        counts[key] = len(results)
-
+        # Now index all the docs in the google drive folder

        # The trick is to set next page token to None 1st time thru (fencepost)
        nextPageToken = None

        # Use the pager to return all the things
-        items = []
+        remote_ids = set()
+        full_items = {}
        while True:
-            ps = 12
+            ps = 100
            results = drive.list(
                    pageSize=ps,
                    pageToken=nextPageToken,
@@ -413,38 +530,56 @@ class Search:
            ).execute()

            nextPageToken = results.get("nextPageToken")
-            items += results.get("files", [])
+            files = results.get("files",[])
+            for f in files:
                
-            # Keep it short
+                # Add all remote docs to a set
+                remote_ids.add(f['id'])
+
+                # Also store the doc
+                full_items[f['id']] = f
+            
+            ## Shorter:
+            #break
+            # Longer:
+            if nextPageToken is None:
                break

-            #if nextPageToken is None:
-            #    break
-
-        # Here is where we update.
-        # Grab indexed ids
-        # Grab remote ids
-        # Drop indexed ids not in remote ids
-        # Index all remote ids
-        # Change add_ to update_
-        # Add a hash check in update_
-
-        indexed_ids = set()
-        for item in items:
-            indexed_ids.add(item['id'])

        writer = self.ix.writer()
-
+        count = 0
        temp_dir = tempfile.mkdtemp(dir=os.getcwd())
        print("Temporary directory: %s"%(temp_dir))
-        if not os.path.exists(temp_dir):
-            os.mkdir(temp_dir)

-        count = 0
-        for item in items:
-            self.add_drive_file(writer, item, indexed_ids, temp_dir, config)
+
+
+        # Drop any id in indexed_ids
+        # not in remote_ids
+        drop_ids = indexed_ids - remote_ids
+        for drop_id in drop_ids:
+            writer.delete_by_term('id',drop_id)
+
+
+        # Update any id in indexed_ids
+        # and in remote_ids
+        update_ids = indexed_ids & remote_ids
+        for update_id in update_ids:
+            # cop out: just delete and re-add
+            writer.delete_by_term('id',update_id)
+            item = full_items[update_id]
+            self.add_drive_file(writer, item, temp_dir, config, update=True)
            count += 1

+
+        # Add any id not in indexed_ids
+        # and in remote_ids
+        add_ids = remote_ids - indexed_ids
+        for add_id in add_ids:
+            item = full_items[add_id]
+            self.add_drive_file(writer, item, temp_dir, config, update=False)
+            count += 1
+
+
        print("Cleaning temporary directory: %s"%(temp_dir))
        subprocess.call(['rm','-fr',temp_dir])

@@ -452,25 +587,41 @@ class Search:
        print("Done, updated %d documents in the index" % count)


+    # ------------------------------
+    # Github Issues/Comments

-
-    def update_index_issues(self, 
-                            gh_access_token,
-                            config):
+    def update_index_issues(self, gh_token, config):
        """
        Update the search index using a collection of 
        Github repo issues and comments.
        """
-        # Strategy:
-        # To get the proof of concept up and running,
-        # we are just deleting and re-indexing every issue/comment.
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids

-        g = Github(gh_access_token)
+        # Get the set of indexed ids:
+        # ------
+        indexed_issues = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("issue")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_issues.add(result['id'])

-        # Set of all URLs as existing on github
-        to_index = set()

-        writer = self.ix.writer()
+        # Get the set of remote ids:
+        # ------
+        # Start with api object
+        g = Github(gh_token)
+
+        # Now index all issue threads in the user-specified repos
+
+        # Start by collecting all the things
+        remote_issues = set()
+        full_items = {}

        # Iterate over each repo 
        list_of_repos = config['repositories']
@@ -481,41 +632,214 @@ class Search:
                raise Exception(err)

            this_org, this_repo = re.split('/',r)
-
+            try:
                org = g.get_organization(this_org)
                repo = org.get_repo(this_repo)
+            except:
+                print("Error: could not gain access to repository %s"%(r))
+                continue

-            count = 0
-
-            # Iterate over each thread
+            # Iterate over each issue thread
            issues = repo.get_issues()
            for issue in issues:

-                # This approach is more work than is needed
-                # but PoC||GTFO
-
                # For each issue/comment URL,
-                # remove the corresponding item
-                # and re-add it to the index
+                # grab the key and store the 
+                # corresponding issue object
+                key = issue.html_url
+                value = issue

-                to_index.add(issue.html_url)
-                writer.delete_by_term('url', issue.html_url)
-                count -= 1
-                comments = issue.get_comments()
+                remote_issues.add(key)
+                full_items[key] = value

-                for comment in comments:
-                    to_index.add(comment.html_url)
-                    writer.delete_by_term('url', comment.html_url)
+        writer = self.ix.writer()
+        count = 0

-                # Now re-add this issue to the index
-                # (this will also add the comments)
-                count += self.add_issue(writer, issue, repo, config)
+        # Drop any issues in indexed_issues
+        # not in remote_issues
+        drop_issues = indexed_issues - remote_issues
+        for drop_issue in drop_issues:
+            writer.delete_by_term('id',drop_issue)
+
+
+        # Update any issue in indexed_issues
+        # and in remote_issues
+        update_issues = indexed_issues & remote_issues
+        for update_issue in update_issues:
+            # cop out: just delete and re-add
+            writer.delete_by_term('id',update_issue)
+            item = full_items[update_issue]
+            self.add_issue(writer, item, gh_token, config, update=True)
+            count += 1
+
+
+        # Add any issue not in indexed_issues
+        # and in remote_issues
+        add_issues = remote_issues - indexed_issues
+        for add_issue in add_issues:
+            item = full_items[add_issue]
+            self.add_issue(writer, item, gh_token, config, update=False)
+            count += 1


        writer.commit()
        print("Done, updated %d documents in the index" % count)


+
+    # ------------------------------
+    # Github Files
+
+    def update_index_ghfiles(self, gh_token, config): 
+        """
+        Update the search index using a collection of 
+        files (and, separately, Markdown files) from 
+        a Github repo.
+        """
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids
+
+        # Get the set of indexed ids:
+        # ------
+        indexed_ids = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("ghfiles")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+
+        q = p.parse("markdown")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+
+        # Get the set of remote ids:
+        # ------
+        # Start with api object
+        g = Github(gh_token)
+
+        # Now index all the files.
+
+        # Start by collecting all the things
+        remote_ids = set()
+        full_items = {}
+
+        # Iterate over each repo 
+        list_of_repos = config['repositories']
+        for r in list_of_repos:
+
+            if '/' not in r:
+                err = "Error: specify org/reponame or user/reponame in list of repos"
+                raise Exception(err)
+
+            this_org, this_repo = re.split('/',r)
+            try:
+                org = g.get_organization(this_org)
+                repo = org.get_repo(this_repo)
+            except:
+                print("Error: could not gain access to repository %s"%(r))
+                continue
+
+
+            # Get head commit
+            commits = repo.get_commits()
+            try:
+                last = commits[0]
+                sha = last.sha
+            except GithubException:
+                print("Error: could not get commits from repository %s"%(r))
+                continue
+
+            # Get all the docs
+            tree = repo.get_git_tree(sha=sha, recursive=True)
+            docs = tree.raw_data['tree']
+            print("Parsing file ids from repository %s"%(r))
+
+            for d in docs:
+
+                # For each doc, get the file extension
+                # and decide what to do with it.
+
+                fpath = d['path']
+                _, fname = os.path.split(fpath)
+                _, fext = os.path.splitext(fpath)
+
+                key = d['sha']
+
+                d['org'] = this_org
+                d['repo'] = this_repo
+                value = d
+
+                remote_ids.add(key)
+                full_items[key] = value
+
+        writer = self.ix.writer()
+        count = 0
+
+        # Drop any id in indexed_ids
+        # not in remote_ids
+        drop_ids = indexed_ids - remote_ids
+        for drop_id in drop_ids:
+            writer.delete_by_term('id',drop_id)
+
+
+        # Update any id in indexed_ids
+        # and in remote_ids
+        update_ids = indexed_ids & remote_ids
+        for update_id in update_ids:
+            # cop out: just delete and re-add
+            writer.delete_by_term('id',update_id)
+            item = full_items[update_id]
+            self.add_ghfile(writer, item, gh_token, config, update=True)
+            count += 1
+
+
+        # Add any id not in indexed_ids
+        # and in remote_ids
+        add_ids = remote_ids - indexed_ids
+        for add_id in add_ids:
+            item = full_items[add_id]
+            self.add_ghfile(writer, item, gh_token, config, update=False)
+            count += 1
+
+
+        writer.commit()
+        print("Done, updated %d Github files in the index" % count)
+
+
+
+    # ------------------------------
+    # Groups.io Emails
+
+
+    def update_index_groupsioemails(self, groupsio_token, config):
+        """
+        Update the search index using the email archives
+        of groups.io groups.
+
+        This requires the use of a spider.
+        RELEASE THE SPIDER!!!
+        """
+        spider = GroupsIOArchivesCrawler(groupsio_token,'dcppc')
+
+        # - ask spider to crawl the archives
+        spider.crawl_group_archives()
+
+        # - ask spider for list of all email records
+        #   - 1 email = 1 dictionary
+        #   - email records compiled by the spider
+        archives = spider.get_archives()
+
+        # - email object is sent off to add email method
+
+        print("Finished indexing groups.io emails")
+
+
    # ---------------------------------
    # Search results bundler

@@ -580,21 +904,18 @@ class Search:

            highlights = self.html_parser.unescape(highlights)
            html = self.markdown(highlights)
+            html = re.sub(r'\n','<br />',html)
            sr.content_highlight = html

            search_results.append(sr)

        return search_results

-        # ------------------
-        # github issues
-        # create search results
-
-



    def search(self, query_list, fields=None):
+
        with self.ix.searcher() as searcher:
            query_string = " ".join(query_list)
            query = None
@@ -626,29 +947,27 @@ class Search:
    def get_document_total_count(self):
        p = QueryParser("kind", schema=self.ix.schema)

-        kind_labels = {
-                "documents" : "gdoc",
-                "issues" :    "issue",
-                "comments" :  "comment"
-        }
        counts = {
-                "documents" : None,
-                "issues" : None,
-                "comments" : None,
+                "gdoc" : None,
+                "issue" : None,
+                "ghfile" : None,
+                "markdown" : None,
                "total" : None
        }
-        for key in kind_labels:
-            kind = kind_labels[key]
-            q = p.parse(kind)
+        for key in counts.keys():
+            q = p.parse(key)
            with self.ix.searcher() as s:
                results = s.search(q,limit=None)
                counts[key] = len(results)

-        counts['total'] = self.ix.searcher().doc_count_all()
+        counts['total'] = sum(counts[k] for k in counts.keys())

        return counts

 if __name__ == "__main__":
+
+    raise Exception("Error: main method not implemented (fix groupsio credentials first)")
+
    search = Search("search_index")

    from get_centillion_config import get_centillion_config
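
The update methods above (`update_index_issues`, `update_index_ghfiles`, and the Google Drive update) share one pattern: collect the ids already in the index, collect the ids on the remote, then drop, update, or add by set arithmetic. A minimal sketch of that partitioning step follows. The helper name `plan_index_sync` is hypothetical and does not appear in this commit; it only illustrates the set operations and makes no Whoosh calls.

```python
def plan_index_sync(indexed_ids, remote_ids):
    """Split ids into the three groups used by the update methods.

    Hypothetical helper for illustration only; the commit inlines
    these set operations in each update_index_* method.
    """
    indexed_ids = set(indexed_ids)
    remote_ids = set(remote_ids)
    return {
        # in the index, but no longer on the remote: delete
        "drop": indexed_ids - remote_ids,
        # in both places: delete, then re-add
        "update": indexed_ids & remote_ids,
        # on the remote, but not yet indexed: add
        "add": remote_ids - indexed_ids,
    }


plan = plan_index_sync({"a", "b", "c"}, {"b", "c", "d"})
print(sorted(plan["drop"]), sorted(plan["update"]), sorted(plan["add"]))
```

Note that in the diff only the update and add groups increment `count`, so the final "updated %d documents" message does not include dropped documents.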
--- a/config_centillion.json
+++ b/config_centillion.json
@@ -1,6 +1,27 @@
 {
    "repositories" : [
+        "dcppc/project-management",
+        "dcppc/nih-demo-meetings",
+        "dcppc/internal",
+        "dcppc/organize",
+        "dcppc/dcppc-bot",
+        "dcppc/full-stacks",
+        "dcppc/design-guidelines-discuss",
+        "dcppc/dcppc-deliverables",
+        "dcppc/dcppc-milestones",
+        "dcppc/crosscut-metadata",
+        "dcppc/lucky-penny",
+        "dcppc/dcppc-workshops",
+        "dcppc/metadata-matrix",
+        "dcppc/data-stewards",
+        "dcppc/dcppc-phase1-demos",
+        "dcppc/apis",
        "dcppc/2018-june-workshop",
-        "dcppc/2018-july-workshop"
+        "dcppc/2018-july-workshop",
+        "dcppc/2018-august-workshop",
+        "dcppc/2018-september-workshop",
+        "dcppc/design-guidelines",
+        "dcppc/2018-may-workshop",
+        "dcppc/centillion"
    ]
 }
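
Each entry in `repositories` is read by the update methods as an `org/reponame` string: they check for a `/` and then split on it. A minimal sketch of that parsing step, assuming exactly one slash per entry, is below. The helper name `split_repo_spec` is hypothetical; the commit itself uses `re.split('/', r)` inline.

```python
import json


def split_repo_spec(spec):
    """Split an "org/reponame" string into (org, reponame).

    Hypothetical helper for illustration only; raises ValueError
    when the entry does not have exactly one "/" separator.
    """
    if spec.count("/") != 1:
        raise ValueError("specify org/reponame or user/reponame: %r" % spec)
    org, repo = spec.split("/")
    if not org or not repo:
        raise ValueError("empty org or repo name: %r" % spec)
    return org, repo


# Two entries copied from the config above, inlined as a JSON string
config = json.loads('{"repositories": ["dcppc/centillion", "dcppc/apis"]}')
pairs = [split_repo_spec(r) for r in config["repositories"]]
print(pairs)
```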
--- a/config_flask.example.py
+++ b/config_flask.example.py
@@ -0,0 +1,20 @@
+# Location of index file
+INDEX_DIR = "search_index"
+
+# oauth client deets
+GITHUB_OAUTH_CLIENT_ID = "XXX"
+GITHUB_OAUTH_CLIENT_SECRET = "YYY"
+GITHUB_TOKEN = "ZZZ"
+
+# More information footer: Repository label
+FOOTER_REPO_ORG = "charlesreid1"
+FOOTER_REPO_NAME = "centillion"
+
+# Toggle to show Whoosh parsed query
+SHOW_PARSED_QUERY=True
+
+TAGLINE = "Search All The Things"
+
+# Flask settings
+DEBUG = True
+SECRET_KEY = 'WWWWW'
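
The example file ships placeholder values (`"XXX"`, `"YYY"`, `"ZZZ"`, `'WWWWW'`), and the real `config_flask.py` is now ignored by git. One way to fill those values without committing them is to read them from the environment, sketched below. The variable name `CENTILLION_SECRET_KEY` and the helper `load_secret` are assumptions made for illustration; only `GITHUB_TOKEN` as an environment variable comes from the quickstart in the docs.

```python
import os
import secrets


def load_secret(name, environ=os.environ):
    """Return environ[name] if set, else a random 40-character hex string.

    Hypothetical helper. A random fallback changes on every restart,
    which invalidates existing Flask sessions, so it only suits
    local development.
    """
    value = environ.get(name)
    if value:
        return value
    return secrets.token_hex(20)


# Assumed variable name; not defined anywhere in this commit
SECRET_KEY = load_secret("CENTILLION_SECRET_KEY")

# Falls back to the placeholder from config_flask.example.py
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "ZZZ")
```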
--- a/config_flask.py
+++ b/config_flask.py
@@ -1,9 +0,0 @@
-# Location of index file
-INDEX_DIR = "search_index"
-
-# Toggle to show Whoosh parsed query
-SHOW_PARSED_QUERY=True
-
-# Flask settings
-DEBUG = True
-SECRET_KEY = '42c5a8eda356ca9d9c3ab2d149541e6b91d843fa'
--- a/docs/centillion_components.md
+++ b/docs/centillion_components.md
@@ -0,0 +1,22 @@
+# Centillion Components
+
+Centillion keeps it simple.
+
+There are two components:
+
+* The `Search` object, which uses whoosh and various
+  APIs (Github, Google Drive) to build and manage
+  the search index. The `Search` object also runs all
+  queries against the search index. (See the
+  [Centillion Whoosh](centillion_whoosh.md) page
+  or the `centillion_search.py` file
+  for details.)
+
+* Flask app, which uses Jinja templates to present the
+  user with a minimal web frontend that allows them
+  to interact with the search engine. (See the
+  [Centillion Flask](centillion_flask.md) page
+  or the `centillion.py` file
+  for details.)
+
+
--- a/docs/centillion_flask.md
+++ b/docs/centillion_flask.md
@@ -0,0 +1,30 @@
+# Centillion Flask
+
+## What the flask server does
+
+Flask is a web server framework
+that allows developers to define
+behavior for specific endpoints,
+such as `/hello_world`, or
+<http://localhost:5000/hello_world>
+on a web server running locally.
+
+## Flask server routes
+
+- `/home`
+    - if not logged in, this redirects to a "log into github" landing page (not implemented yet)
+    - if logged in, this redirects to the search route
+
+- `/search`
+    - search template
+
+- `/main_index_update`
+    - update the main index (all documents)
+
+- `/control_panel`
+    - this is the control panel, where you can trigger
+      the search index to be re-made
+
+
+
+
--- a/docs/centillion_whoosh.md
+++ b/docs/centillion_whoosh.md
@@ -0,0 +1,34 @@
+# Centillion Whoosh
+
+The `centillion_search.py` file defines a
+`Search` class that serves as the backend
+for centillion.
+
+## What the Search class does
+
+The `Search` class has two roles:
+- create (and update) the search index
+    - this also requires the `Search` class
+      to define the schema for storing documents
+- run queries against the search index,
+  and package results up for Flask and Jinja
+
+## Search class functions
+
+The `Search` class defines several functions:
+
+- `open_index()` creates the schema
+
+- `add_issue()`, `add_md()`, `add_document()` have three different method signatures
+  and add different types of documents to the search index
+
+- `update_all_issues()`, `update_all_md()`, and `update_all_documents()` each iterate over
+  items and determine whether each item needs to be updated in the search index
+
+- `update_main_index()` - update the entire search index
+    - calls all three update_all methods
+
+- `create_search_results()` - package things up for jinja
+
+- `search()` - run the query, pass results to the jinja-packager
+
--- a/docs/images/cp.png
+++ b/docs/images/cp.png
--- a/docs/images/ss.png
+++ b/docs/images/ss.png
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,30 +1,31 @@
-# The Centillion
+# Centillion

-**the centillion**: a pan-github-markdown-issues-google-docs search engine.
+**centillion**: a pan-github-markdown-issues-google-docs search engine.

 **a centillion**: a very large number consisting of a 1 with 303 zeros after it.

-the centillion is 3.03 log-times better than the googol.
+centillion is 3.03 log-times better than the googol.

-## what is it
+## What is centillion

-The centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
+Centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
 a Python library for building search engines.

-We define the types of documents the centillion should index,
-what info and how. The centillion then builds and
+
+We define the types of documents centillion should index,
+what info and how. Centillion then builds and
 updates a search index. That's all done in `centillion_search.py`.

-The centillion also provides a simple web frontend for running
+Centillion also provides a simple web frontend for running
 queries against the search index. That's done using a Flask server
 defined in `centillion.py`.

-The centillion keeps it simple.
+Centillion keeps it simple.


-## quickstart
+## Quickstart

-Run the centillion app with a github access token API key set via
+Run centillion with a Github API access token set via an
 environment variable:

 ```
@@ -34,21 +35,50 @@ GITHUB_TOKEN="XXXXXXXX" python centillion.py
 This will start a Flask server, and you can view the minimal search engine
 interface in your browser at <http://localhost:5000>.

+## Configuration

-## work that is done
+### Centillion configuration

-See [standalone.md](standalone.md) for the summary of
-the three standalone whoosh servers that were built:
-one for a folder of markdown files, one for github issues
-and comments, and one for google drive documents.
+`config_centillion.json` defines configuration variables
+for centillion - namely, what to index, and how, and where.

-## work that is being done
+### Flask configuration

-See [workinprogress.md](workinprogress.md) for details about
-work in progress.
+`config_flask.py` defines configuration variables
+used by flask, which controls the web frontend 
+for centillion.

-## work that is planned
+## Control Panel/Rebuilding Search Index

-See [plans.md](plans.md)
+To rebuild the search index, visit the control panel route (`/control_panel`),
+for example at <http://localhost:5000/control_panel>.
+
+The search index is stored in the `search_index/` directory,
+and that directory can be configured with centillion's
+configuration file.
+
+The diff search index is faster to build, as it only
+indexes documents that have been added since the
+search index was last updated.
+
+The main search index is slower to build, as it will
+re-index everything.
+
+(Cron scripts? Threaded task that runs hourly?)
+
+## Details
+
+More on the details of how centillion works.
+
+Under the hood, centillion uses flask and whoosh.
+Flask builds and runs the web server.
+Whoosh handles search requests and management
+of the search index.
+
+[Centillion Components](centillion_components.md)
+
+[Centillion Flask](centillion_flask.md)
+
+[Centillion Whoosh](centillion_whoosh.md)


--- a/gdrive_util.py
+++ b/gdrive_util.py
@@ -29,8 +29,7 @@ class GDrive(object):
    ):
        """
        Set up the Google Drive API instance.
-        Factory method: create it and hand it over.
-        Then we're finished.
+        Factory method: create it here, hand it over in get_service().
        """
        self.credentials_file = credentials_file
        self.client_secret_file = client_secret_file
@@ -40,6 +39,9 @@ class GDrive(object):
        self.store = file.Storage(credentials_file)

    def get_service(self):
+        """
+        Return an instance of the Google Drive API service.
+        """

        creds = self.store.get()
        if not creds or creds.invalid:
--- a/groupsio_util.py
+++ b/groupsio_util.py
@@ -0,0 +1,382 @@
+import requests, os, re
+from bs4 import BeautifulSoup
+
+class GroupsIOArchivesCrawler(object):
+    """
+    This is a Groups.io spider
+    designed to crawl the email
+    archives of a group.
+
+    credentials (dictionary):
+        groupsio_token    : api access token
+        groupsio_username : username
+        groupsio_password : password
+    """
+    def __init__(self,
+                 credentials,
+                 group_name):
+        # template url for archives page (list of topics)
+        self.url = "https://{group}.groups.io/g/{subgroup}/topics"
+        self.login_url = "https://groups.io/login"
+
+        self.credentials = credentials
+        self.group_name = group_name
+        self.crawled_archives = False
+        self.archives = None
+
+
+    def get_archives(self):
+        """
+        Return a list of dictionaries containing 
+        information about each email topic in the 
+        groups.io email archive.
+
+        Call crawl_group_archives() first!
+        """
+        return self.archives
+
+
+    def get_subgroups_list(self):
+        """
+        Use the API to get a list of subgroups.
+        """
+        subgroups_url = 'https://api.groups.io/v1/getsubgroups'
+
+        key = self.credentials['groupsio_token']
+
+        data = [('group_name', self.group_name),
+                ('limit',100)
+        ]
+        response = requests.post(subgroups_url,
+                                 data=data,
+                                 auth=(key,''))
+        response = response.json()
+        data = response['data']
+
+        subgroups = {}
+        for group in data:
+            k = group['id']
+            v = re.sub(r'dcppc\+','',group['name'])
+            subgroups[k] = v
+
+        return subgroups
+
+
+    def crawl_group_archives(self):
+        """
+        Spider will crawl the email archives of the entire group
+        by crawling the email archives of each subgroup.
+        """
+        subgroups = self.get_subgroups_list()
+
+        # ------------------------------
+        # Start by logging in.
+
+        # Create session object to persist session data
+        session = requests.Session()
+
+        # Log in to the website
+        data = dict(email = self.credentials['groupsio_username'],
+                    password = self.credentials['groupsio_password'],
+                    timezone = 'America/Los_Angeles')
+
+        r = session.post(self.login_url,
+                         data = data)
+
+        csrf = self.get_csrf(r)
+
+        # ------------------------------
+        # For each subgroup, crawl the archives
+        # and return a list of dictionaries
+        # containing all the email threads.
+        for subgroup_id in subgroups.keys():
+            self.crawl_subgroup_archives(session,
+                                         csrf,
+                                         subgroup_id, 
+                                         subgroups[subgroup_id])
+
+        # Done. archives are now tucked away
+        # in the variable self.archives
+        # 
+        # self.archives is a list of dictionaries,
+        # with each dictionary containing info about
+        # a topic/email thread in a subgroup.
+        # ------------------------------
+
+
+
+
+    def crawl_subgroup_archives(self, session, csrf, subgroup_id, subgroup_name):
+        """
+        This kicks off the process to crawl the entire
+        archives of a given subgroup on groups.io.
+
+        For a given subgroup the url is self.url,
+        
+            https://{group}.groups.io/g/{subgroup}/topics
+
+        This is the first of a paginated list of topics.
+        Procedure is:
+        - passed a starting page (or its contents)
+        - iterate through all topics via the HTML page elements
+        - assemble a bundle of information about each topic:
+            - topic title, by, URL, date, content, permalink
+            - content filtering:
+                - ^From, Reply-To, Date, To, Subject
+                - Lines containing phone numbers
+                    - 9 digits
+                    - XXX-XXX-XXXX, (XXX) XXX-XXXX
+                    - XXXXXXXXXX, XXX XXX XXXX
+                    - ^Work: or (Work) or Work$
+                    - Home, Cell, Mobile
+                    - +1 XXX 
+                    - \w@\w
+        - while next button is not greyed out,
+        - click the next button
+
+        everything stored in self.archives:
+        list of dictionaries.
+
+        """
+        self.archives = self.archives or []  # keep threads from earlier subgroups
+
+        prefix = "https://{group}.groups.io".format(group=self.group_name)
+
+        url = self.url.format(group=self.group_name, 
+                              subgroup=subgroup_name)
+
+        # ------------------------------
+
+        # Now get the first page
+        r = session.get(url)
+
+        # ------------------------------
+        # Fencepost algorithm:
+
+        # First page:
+
+        # Extract a list of (title, link) items
+        items = self.extract_archive_page_items_(r)
+
+        # Get the next link
+        next_url = self.get_next_url_(r)
+
+        # Now add each item to the archive of threads,
+        # then find the next button.
+        self.add_items_to_archives_(session,subgroup_name,items)
+
+        if next_url is None:
+            return
+        else:
+            full_next_url = prefix + next_url
+
+        # Now click the next button
+        next_request = session.get(full_next_url)
+
+        while next_request.status_code==200:
+            items = self.extract_archive_page_items_(next_request)
+            next_url = self.get_next_url_(next_request)
+            self.add_items_to_archives_(session,subgroup_name,items)
+            if next_url is None:
+                return
+            else:
+                full_next_url = prefix + next_url
+            next_request = session.get(full_next_url)
+        
+
+
+    def add_items_to_archives_(self,session,subgroup_name,items):
+        """
+        Given a set of items from a list of threads,
+        items being title and link,
+        get the page and store all info
+        in self.archives variable
+        (list of dictionaries)
+        """
+        for (title, link) in items:
+            # Get the thread page:
+            prefix = "https://{group}.groups.io".format(group=self.group_name)
+            full_link = prefix + link
+            r = session.get(full_link)
+            soup = BeautifulSoup(r.text,'html.parser')
+
+            # soup contains the entire thread
+
+            # What are we extracting:
+            # 1. thread number
+            # 2. permalink
+            # 3. content/text (filtered)
+
+            # - - - - - - - - - - - - - - 
+            # 1. topic/thread number:
+            # <a rel="nofollow" href="">
+            # where link is:
+            # https://{group}.groups.io/g/{subgroup}/topic/{topic_id}
+            # example topic id: 24209140
+            #
+            # ugly links are in the form 
+            # https://dcppc.groups.io/g/{subgroup}/topic/some_text_here/{thread_id}?p=,,,,,1,2,3,,,4,,5
+            # split at ?, 0th portion
+            # then split at /, last (-1th) portion
+            topic_id = link.split('?')[0].split('/')[-1]
+
+            # - - - - - - - - - - - - - - - 
+            # 2. permalink:
+            # - current link is ugly link
+            # - permalink is the nice one
+            # - topic id is available from the ugly link
+            # https://{group}.groups.io/g/{subgroup}/topic/{topic_id}
+
+            permalink_template = "https://{group}.groups.io/g/{subgroup}/topic/{topic_id}"
+            permalink = permalink_template.format(
+                    group = self.group_name,
+                    subgroup = subgroup_name, 
+                    topic_id = topic_id
+            )
+
+            # - - - - - - - - - - - - - - - 
+            # 3. content:
+
+            # Need to rearrange how we're assembling threads here.
+            # This is one thread, no?
+            content = []
+
+            subject = soup.find('title').text
+
+            # Extract information for the schema:
+            # - permalink for thread (done)
+            # - subject/title (done)
+            # - original sender email/name (done)
+            # - content (done)
+
+            # Groups.io pages have zero CSS classes, which makes everything
+            # a giant pain in the neck to interact with. Thanks Groups.io!
+            original_sender = ''
+            for i, tr in enumerate(soup.find_all('tr',{'class':'test'})):
+                # Every other tr row contains an email.
+                if (i+1)%2==0:
+                    # nope, no email here
+                    pass
+                else:
+                    # found an email!
+                    # this is a maze, thanks groups.io
+                    td = tr.find('td')
+                    divrow = td.find('div',{'class':'row'}).find('div',{'class':'pull-left'})
+                    if (i+1)==1:
+                        original_sender = divrow.text.strip()
+                    for div in td.find_all('div'):
+                        if div.has_attr('id'):
+
+                            # purge any signatures
+                            for x in div.find_all('div',{'id':'Signature'}):
+                                x.extract()
+
+                            # purge any headers
+                            for x in div.find_all('div'): 
+                                nonos = ['From:','Sent:','To:','Cc:','CC:','Subject:']
+                                for nono in nonos:
+                                    if nono in x.text:
+                                        x.extract()
+
+                            message_text = div.get_text()
+
+                            # More filtering:
+
+                            # phone numbers
+                            message_text = re.sub(r'[0-9]{3}-[0-9]{3}-[0-9]{4}','XXX-XXX-XXXX',message_text)
+                            message_text = re.sub(r'[0-9]{10}','XXXXXXXXXX',message_text)
+
+                            content.append(message_text)
+
+            full_content = "\n".join(content)
+
+            thread = {
+                    'permalink' : permalink,
+                    'subject' : subject,
+                    'original_sender' : original_sender,
+                    'content' : full_content
+            }
+            
+            print('*'*40)
+            for k in thread.keys():
+                if k=='content':
+                    pass
+                else:
+                    print("%s : %s"%(k,thread[k]))
+            print('*'*40)
+            self.archives.append(thread)
+
+
+    def extract_archive_page_items_(self, response):
+        """
+        (Private method)
+
+        Given a response from a GET request,
+        use beautifulsoup to extract all items
+        (thread titles and ugly thread links)
+        and pass them back in a list.
+        """
+        soup = BeautifulSoup(response.content,"html.parser")
+        rows = soup.find_all('tr',{'class':'test'})
+        if 'rate limited' in soup.text:
+            raise Exception("Error: rate limit in place for Groups.io")
+
+        results = []
+        for row in rows:
+            # We don't care about anything except title and ugly link
+            subject = row.find('span',{'class':'subject'})
+            title = subject.get_text()
+            link = row.find('a')['href']
+            print(title)
+            results.append((title,link))
+
+        return results
+
+
+    def get_next_url_(self, response):
+        """
+        (Private method)
+
+        Given a response (which is a list of threads),
+        find the next button and return the URL.
+
+        If there is no next URL, or the next button is disabled, return None.
+        """
+        soup = BeautifulSoup(response.text,'html.parser')
+        chevron = soup.find('i',{'class':'fa-chevron-right'})
+        try:
+            if '#' in chevron.parent['href']:
+                # empty link, abort
+                return None
+        except AttributeError:
+            # I don't even know
+            return None
+
+        if chevron.parent.parent.has_attr('class') and 'disabled' in chevron.parent.parent['class']:
+            # no next link, abort
+            return None
+
+        return chevron.parent['href']
+
+
+
+    def get_csrf(self,resp):
+        """
+        Find the CSRF token embedded in the subgroup page
+        """
+        soup = BeautifulSoup(resp.text,'html.parser')
+        csrf = ''
+        for i in soup.find_all('input'):
+            # Note that i.name is different from i['name']
+            # the first is the actual tag,
+            # the second is the attribute name="xyz"
+            if i.get('name')=='csrf':
+                csrf = i['value']
+        
+        if csrf=='':
+            err = "ERROR: Could not find csrf token on page."
+            raise Exception(err)
+    
+        return csrf
+
+
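The phone-number filtering in the thread parser above can be sketched in isolation. This is a minimal sketch, not the code from the commit: the helper name `redact_phone_numbers` and the two compiled pattern names are hypothetical, while the patterns and replacement strings mirror the ones in the diff. In Python's `re` syntax a repetition count is written with unescaped braces (`{10}`); an escaped form such as `\{10\}` matches literal brace characters instead.

```python
import re

# Hypothetical names for illustration; the patterns follow the diff above.
DASHED_PHONE = re.compile(r'[0-9]{3}-[0-9]{3}-[0-9]{4}')
PLAIN_PHONE = re.compile(r'[0-9]{10}')


def redact_phone_numbers(text):
    """Replace dashed and undashed 10-digit phone numbers with X placeholders."""
    # Dashed numbers first, so their digits are gone before the plain pass.
    text = DASHED_PHONE.sub('XXX-XXX-XXXX', text)
    text = PLAIN_PHONE.sub('XXXXXXXXXX', text)
    return text


print(redact_phone_numbers("Call 555-123-4567 or 5551234567."))
```

The plain pattern also matches any run of ten digits inside a longer digit string, so long numeric IDs in a message would be partially masked as well.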
--- a/img/ss.png
+++ b/img/ss.png
--- a/install_pandoc.sh
+++ b/install_pandoc.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+#
+# for ubuntu 
+
+if [ "$(id -u)" != "0" ]; then
+    echo ""
+    echo ""
+    echo "This script should be run as root."
+    echo ""
+    echo ""
+    exit 1;
+fi
+
+OFILE="/tmp/pandoc.deb"
+curl -L https://github.com/jgm/pandoc/releases/download/2.2.2.1/pandoc-2.2.2.1-1-amd64.deb -o ${OFILE}
+dpkg -i ${OFILE}
+rm -f ${OFILE}
+
+
--- a/1
+++ b/1
--- a/requirements.txt
+++ b/requirements.txt
@@ -9,3 +9,5 @@ PyGithub>=1.39
 pypandoc>=1.4
 requests>=2.19
 pandoc>=1.0
+flask-dance>=1.0.0
+beautifulsoup4>=4.6
--- a/static/bootstrap.min.js
+++ b/static/bootstrap.min.js
--- a/static/jquery.min.js
+++ b/static/jquery.min.js
--- a/static/style.css
+++ b/static/style.css
@@ -1,17 +1,38 @@
+span.badge {
+    vertical-align: text-bottom;
+}

-li.search-group-item {
-    position: relative;
-    display: block;
-    padding: 0px;
-    margin-bottom: -1px;
-    background-color: #fff;
-    border: 1px solid #ddd;
+a.badgelinks, a.badgelinks:hover {
+    color: #fff;
+    text-decoration: none;
 }

 div.list-group {
    border: 1px solid rgba(86,61,124,.2);
 }

+li.list-group-item {
+    position: relative;
+    display: block;
+
+    /*padding: 20px 10px;*/
+    margin-bottom: -1px;
+
+    background-color: #f8f8f8;
+    border: 1px solid #ddd;
+}
+
+li.search-group-item {
+    position: relative;
+    display: block;
+
+    padding: 0px;
+    margin-bottom: -1px;
+
+    background-color: #fff;
+    border: 1px solid #ddd;
+}
+
 div.url {
    background-color: rgba(86,61,124,.15);
    padding: 8px;
--- a/templates/controlpanel.html
+++ b/templates/controlpanel.html
@@ -0,0 +1,108 @@
+{% extends "layout.html" %}
+{% block body %}
+
+{% with messages = get_flashed_messages() %}
+{% if messages %}
+<div class="container">
+    <div class="alert alert-success alert-dismissible">
+        <a href="#" class="close" data-dismiss="alert" aria-label="close">&times;</a>
+        <ul class=flashes>
+            {% for message in messages %}
+            <li>{{ message }}</li>
+            {% endfor %}
+        </ul>
+    </div>
+</div>
+{% endif %}
+{% endwith %}
+
+<div class="container">
+
+    <div class="row">
+        <div class="col-md-12">
+            <center>
+                <a href="{{ url_for('search')}}?query=&fields=">
+                <img src="{{ url_for('static', filename='centillion_white.png') }}">
+                </a>
+                {% if config['TAGLINE'] %}
+                    <h2><a href="{{ url_for('search')}}?query=&fields=">
+                        {{config['TAGLINE']}}
+                    </a></h2>
+                {% endif %}
+            </center>
+        </div>
+    </div>
+    
+    {% if config['zzzTAGLINE'] %}
+    <div class="row">
+        <div class="col-sm-12">
+            <center>
+                <h2><a href="{{ url_for('search')}}?query=&fields=">
+                    {{config['TAGLINE']}}
+                </a></h2>
+            </center>
+        </div>
+    </div>
+    {% endif %}
+
+</div>
+
+<hr />
+
+<div class="container">
+
+    <div class="row">
+
+        {# update main search index #}
+        <div class="panel panel-danger">
+            <div class="panel-heading">
+                <h3 class="panel-title">
+                    Update Main Search Index
+                </h3>
+            </div>
+            <div class="panel-body">
+                <div class="container-fluid">
+                    <div class="row">
+                        <div class="col-md-12">
+                            <p class="panel-text">Re-index <i>every</i> document in the
+                            remote collection in the search index. <b>Warning: this operation may take a while.</b>
+                            </p> <p>
+                            <a href="{{ url_for('update_index') }}" class="btn btn-large btn-danger">Update Main Index</a>
+                            </p>
+                        </div>
+                    </div>
+                </div>
+
+            </div>
+        </div>
+
+        {# update diff search index #}
+        <div class="panel panel-danger">
+            <div class="panel-heading">
+                <h3 class="panel-title">
+                    Update Diff Search Index
+                </h3>
+            </div>
+            <div class="panel-body">
+                <div class="container-fluid">
+                    <div class="row">
+                        <div class="col-md-12">
+                            <p class="panel-text">Diff search index only re-indexes documents created after the last
+                            search index update. <b>Not currently implemented.</b>
+                            </p> <p>
+                            <a href="#" class="btn btn-large disabled btn-danger">Update Diff Index</a>
+                            </p>
+                        </div>
+                    </div>
+                </div>
+
+            </div>
+        </div>
+
+
+    </div>
+
+</div>
+
+{% endblock %}
+
--- a/templates/layout.html
+++ b/templates/layout.html
@@ -1,11 +1,12 @@
 <!doctype html>
-<title>Markdown Search</title>
+<title>Centillion Search Engine</title>
 <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='style.css') }}">
 <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='github-markdown.css') }}">
 <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='bootstrap.min.css') }}">
+
+<script src="{{ url_for('static', filename='jquery.min.js') }}"></script>
+<script src="{{ url_for('static', filename='bootstrap.min.js') }}"></script>
+
 <div>
-    {% for message in get_flashed_messages() %}
-        <div class="flash">{{ message }}</div>
-    {% endfor %}
    {% block body %}{% endblock %}
 </div>
--- a/templates/search.html
+++ b/templates/search.html
@@ -4,34 +4,33 @@

 <div class="container">

+    {#
+    banner image
+    #}
    <div class="row">
        <div class="col12sm">
            <center>
                <a href="{{ url_for('search')}}?query=&fields=">
                <img src="{{ url_for('static', filename='centillion_white.png') }}">
                </a>
+                {#
+                need a tag line
+                #}
+                {% if config['TAGLINE'] %}
+                    <h2><a href="{{ url_for('search')}}?query=&fields=">
+                        {{config['TAGLINE']}}
+                    </a></h2>
+                {% endif %}
            </center>
        </div>
    </div>

-    <div class="row">
-        <div class="col12sm">
-            <center>
-                <h2>
-                    <a href="{{ url_for('search')}}?query=&fields=">
-                    Search the DCPPC
-                    </a>
-                </h2>
-            </center>
 </div>
-    </div>
-
-
+<div class="container">
    <div class="row">
-        <div class="col-12">
+
+        <div class="col-xs-12">
            <center>
-                <a class="index" href="{{ url_for('update_index')}}">[update index]</a>
-                <a class="index" href="{{ url_for('update_index')}}?rebuild=True">[rebuild index]</a>
                <form action="{{ url_for('search') }}" name="search">
                    <input type="text" name="query" value="{{ query }}"> <br />
                    <button type="submit" style="font-size: 20px; padding: 10px; padding-left: 50px; padding-right: 50px;" 
@@ -48,8 +47,8 @@
    <div class="row">

        {% if directories %}
-        <div class="col-12 info directories-cloud">
-            File directories:&nbsp
+        <div class="col-xs-12 info directories-cloud">
+            <b>File directories:</b> 
            {% for d in directories %}
                <a href="{{url_for('search')}}?query={{d|trim}}&fields=filename">{{d|trim}}</a>
            {% endfor %}
@@ -60,25 +59,40 @@

            {% if config['SHOW_PARSED_QUERY'] and parsed_query %}
                <li  class="list-group-item">
-                    <div class="col-12 info">
+                    <div class="container-fluid">
+                        <div class="row">
+                            <div class="col-xs-12 info">
                                <b>Parsed query:</b> {{ parsed_query }}
                            </div>
+                        </div>
+                    </div>
                </li>
            {% endif %}

            {% if parsed_query %}
                <li  class="list-group-item">
-                    <div class="col-12 info">
-                        <b>Found:</b> {{entries|length}} documents with results, out of {{totals["total"]}} total documents
+                    <div class="container-fluid">
+                        <div class="row">
+                            <div class="col-xs-12 info">
+                                <b>Found:</b> <span class="badge">{{entries|length}}</span> results 
+                                out of <span class="badge">{{totals["total"]}}</span> total items indexed
+                            </div>
+                        </div>
                    </div>
                </li>
            {% endif %}

            <li  class="list-group-item">
-                <div class="col-12 info">
-                    <b>Indexing:</b> {{totals["documents"]}} Google Documents,
-                    {{totals["issues"]}} Github issues, and 
-                    {{totals["comments"]}} Github comments
+                    <div class="container-fluid">
+                        <div class="row">
+                            <div class="col-xs-12 info">
+                                <b>Indexing:</b> <span
+                                    class="badge">{{totals["gdoc"]}}</span> Google Documents,
+                                <span class="badge">{{totals["issue"]}}</span> Github issues,
+                                <span class="badge">{{totals["ghfile"]}}</span> Github files,
+                                <span class="badge">{{totals["markdown"]}}</span> Github markdown files.
+                            </div>
+                        </div>
                </div>
            </li>

@@ -95,35 +109,46 @@

                    <div class="url">
                        {% if e.kind=="gdoc" %}
-                            <b>Google Drive File:</b>
+                            {% if e.mimetype=="" %}
+                                <b>Google Document:</b>
                                <a href='{{e.url}}'>{{e.title}}</a>
-                            ({{e.owner_name}}, {{e.owner_email}})
-                        {% elif e.kind=="comment" %}
-                            <b>Comment:</b>
-                            <a href='{{e.url}}'>Comment (link)</a>
-                            {% if e.github_user %}
-                            by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
-                            {% endif %}
-                            on issue <a href='{{e.issue_url}}'>{{e.issue_title}}</a>
-                            <br/>
-                            <b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
-                            {% if e.github_user %}
+                                (Owner: {{e.owner_name}}, {{e.owner_email}})<br />
+                                <b>Document Type</b>: {{e.mimetype}}
+                            {% else %}
+                                <b>Google Drive:</b>
+                                <a href='{{e.url}}'>{{e.title}}</a>
+                                (Owner: {{e.owner_name}}, {{e.owner_email}})
                            {% endif %}
+
                        {% elif e.kind=="issue" %}
-                            <b>Issue:</b>
-                            <a href='{{e.issue_url}}'>{{e.issue_title}}</a>
+                            <b>Github Issue:</b>
+                            <a href='{{e.url}}'>{{e.title}}</a>
                            {% if e.github_user %}
-                            by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
+                            opened by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
                            {% endif %}
                            <br/>
                            <b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
+
+                        {% elif e.kind=="markdown" %}
+                            <b>Github Markdown:</b>
+                            <a href='{{e.url}}'>{{e.title}}</a>
+                            <br/>
+                            <b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
+
                        {% else %}
                            <b>Item:</b> (<a href='{{e.url}}'>link</a>)
+
                        {% endif %}
                        <br />
-                        score: {{'%d'  % e.score}}
+                        Score: {{'%d'  % e.score}}
+                    </div>
+                    <div class="markdown-body">
+                        {% if e.content_highlight %}
+                            {{ e.content_highlight|safe}}
+                        {% else %}
+                        <p>(A preview of this document is not available.)</p>
+                        {% endif %}
                    </div>
-                    <div class="markdown-body">{{ e.content_highlight|safe}}</div>

                </li>
            {% endfor %}
@@ -134,17 +159,29 @@

 <div class="container">
    <div class="row">
-        <div class="col-12">
-            <div class="last-searches">Last searches: <br/>
-                {% for s in last_searches %}
-                    <span><a href="{{url_for('search')}}?{{s}}">{{s}}</a></span>
-                {% endfor %}
+        <ul class="list-group">
+
+            {% if config['FOOTER_REPO_NAME'] %}
+                {% if config['FOOTER_REPO_ORG'] %}
+
+                    <li  class="list-group-item">
+                        <div class="container-fluid">
+                            <div class="row">
+                                <div class="col-xs-12 info">
+                                    More information about {{config['FOOTER_REPO_NAME']}} can be found
+                                    in the <a href="https://github.com/{{config['FOOTER_REPO_ORG']}}/{{config['FOOTER_REPO_NAME']}}">{{config['FOOTER_REPO_ORG']}}/{{config['FOOTER_REPO_NAME']}}</a>
+                                    repository on Github.
                                </div>
-            <p>
-                More info can be found in the <a href="https://github.com/BernhardWenzel/markdown-search">README.md file</a>
-            </p>
                            </div>
                        </div>
+                    </li>
+
+                {% endif %}
+            {% endif %}
+
+        </ul>
+
+    </div>
 </div>

 {% endblock %}
Author	SHA1	Message	Date
Charles Reid	de796880c5	Merge branch 'master' of github.com:charlesreid1/centillion * 'master' of github.com:charlesreid1/centillion: update config_flask.example.py to strip dc info	2018-08-13 19:14:54 -07:00
Charles Reid	f79f711a38	Merge branch 'master' of github.com:dcppc/centillion * 'master' of github.com:dcppc/centillion: Update Readme.md	2018-08-13 19:14:07 -07:00
Charles Reid	00b862b83e	Merge branch 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion * 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion:	2018-08-13 19:13:53 -07:00
Chaz Reid	a06c3b645a	Update Readme.md	2018-08-13 12:42:18 -07:00
Charles Reid	878ff011fb	locked out by rate limit, but otherwise successful in indexing so far.	2018-08-13 00:54:12 -07:00
Charles Reid	33cf78a524	successfully grabbing threads from 1st page of every subgroup	2018-08-13 00:27:45 -07:00
Charles Reid	c1bcd8dc22	add import pdb where things are currently stuck	2018-08-12 20:25:29 -07:00
Charles Reid	757e9d79a1	keep going with spider idea	2018-08-12 20:24:29 -07:00
Charles Reid	c47682adb4	fix typo with groupsio key	2018-08-12 20:13:45 -07:00
Charles Reid	f2662c3849	adding calls to index groupsio emails this is currently work in progress. we have a debug statement in place as a bookmark. we are currently: - creating a login session - getting all the subgroups - going to first subgroup - getting list of titles and links - getting emails for each title and link still need to: - figure out how to assemble email {} - assemble content/etc and how to parse text of emails	2018-08-12 18:00:33 -07:00
Charles Reid	2478a3f857	Merge branch 'dcppc' of github.com:dcppc/centillion into dcppc * 'dcppc' of github.com:dcppc/centillion: fix how search results are bundled fix search template	2018-08-10 06:05:44 -07:00
Charles Reid	f174080dfd	catch exception when file info not found	2018-08-10 06:05:33 -07:00
Chaz Reid	ca8b12db06	Merge pull request #2 from charlesreid1/dcppc-merge-master Merge dcppc changes into master	2018-08-10 05:49:29 -07:00
Chaz Reid	a1ffdad292	Merge branch 'master' into dcppc-merge-master	2018-08-10 05:49:19 -07:00
Charles Reid	ce76396096	update config_flask.example.py to strip dc info	2018-08-10 05:46:07 -07:00
Chaz Reid	175ff4f71d	Merge pull request #17 from dcppc/github-files fix search template	2018-08-09 18:57:30 -07:00
Charles Reid	94f956e2d0	fix how search results are bundled	2018-08-09 18:56:56 -07:00
Charles Reid	dc015671fc	fix search template	2018-08-09 18:55:49 -07:00
Charles Reid	1e9eec81d7	make it valid json	2018-08-09 18:15:14 -07:00
Chaz Reid	31e12476af	Merge pull request #16 from dcppc/inception add inception	2018-08-09 18:08:11 -07:00
Chaz Reid	bbe4e32f63	Merge pull request #15 from dcppc/github-files index all github filenames, not just markdown	2018-08-09 18:07:56 -07:00
Charles Reid	5013741958	while we're at it	2018-08-09 17:40:56 -07:00
Charles Reid	1ce80a5da0	closes #11	2018-08-09 17:38:20 -07:00
Charles Reid	3ed967bd8b	remove unused function	2018-08-09 17:28:22 -07:00
Charles Reid	1eaaa32007	index all github filenames, not just markdown	2018-08-09 17:25:09 -07:00
Charles Reid	9c7e696b6a	Merge branch 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion * 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion: Move images, resize images, update image markdown in readme update readme to use <img> tags merge image files in from master fix <title> fix the readme to reflect current state of things/links/descriptions fix typos/wording in readme adding changes to enable https, update callback to http, and everything still passes through https (proxy) update footer repo info update screen shots add mkdocs-material-dib submodule remove mkdocs material submodule update tagline update tagline add _example_ config file for flask	2018-08-09 16:39:18 -07:00
Chaz Reid	262a0c19e7	Merge pull request #14 from dcppc/local-fixes Fix centillion to work for local instances	2018-08-09 16:37:37 -07:00
Chaz Reid	bd2714cc0b	Merge branch 'dcppc' into local-fixes	2018-08-09 16:36:34 -07:00
Charles Reid	899d6fed53	comment out localhost only env var	2018-08-09 16:25:37 -07:00
Charles Reid	a7756049e5	revert changes	2018-08-09 16:23:42 -07:00
Charles Reid	3df427a8f8	fix how existing issues in search index are collected. closes #10	2018-08-09 16:17:17 -07:00
Charles Reid	0dd06748de	fix centillion to work for local instance	2018-08-09 16:16:30 -07:00
Chaz Reid	1a04814edf	Merge pull request #9 from dcppc/ACharbonneau-patch-1 Update config_centillion.json	2018-08-07 16:09:45 -07:00
Amanda Charbonneau	3fb72d409b	Update config_centillion.json I fixed it	2018-08-07 18:24:32 -04:00
Chaz Reid	d89e01221a	Merge pull request #8 from dcppc/dcppc-test Fix the name of the milestones repo: 'dcppc-milestones' not 'milestones'	2018-08-07 14:59:06 -07:00
Charles Reid	6736f3f8ad	add centillion configuration json file	2018-08-07 14:54:56 -07:00
Chaz Reid	abd13aba29	Merge pull request #7 from dcppc/fix-docstrings Fix docstrings	2018-08-07 14:43:42 -07:00
Charles Reid	13e49cdaa6	improve docstrings on gdrive_util.py too	2018-08-07 14:42:19 -07:00
Charles Reid	83b2ce17fb	fix docstrings in centillion_search.py	2018-08-07 14:41:26 -07:00
Chaz Reid	5be0709070	Merge pull request #6 from dcppc/fix-docs Move images, resize images, update image markdown in readme	2018-08-07 13:02:08 -07:00
Charles Reid	9edd95a78d	Merge branch 'fix-docs' * fix-docs: Move images, resize images, update image markdown in readme update readme to use <img> tags merge image files in from master fix <title> fix the readme to reflect current state of things/links/descriptions fix typos/wording in readme adding changes to enable https, update callback to http, and everything still passes through https (proxy) update footer repo info update screen shots add mkdocs-material-dib submodule remove mkdocs material submodule update tagline update tagline add _example_ config file for flask	2018-08-07 12:50:29 -07:00
Charles Reid	37615d8707	Move images, resize images, update image markdown in readme	2018-08-07 12:40:38 -07:00
Charles Reid	4b218f63b9	update readme to use <img> tags	2018-08-03 15:56:49 -07:00
Charles Reid	4e17c890bc	merge image files in from master	2018-08-03 15:53:51 -07:00
Charles Reid	1129ec38e0	update the readme	2018-08-03 15:49:46 -07:00
Charles Reid	875508c796	update screen shot images	2018-08-03 15:49:12 -07:00
Charles Reid	abc7a2aedf	fix <title>	2018-08-03 15:45:56 -07:00
Charles Reid	8f1e5faefc	update readme to reflect latest	2018-08-03 15:38:23 -07:00
Chaz Reid	d5f63e2322	Merge pull request #1 from dcppc/fix-readme fix the readme to reflect current state of things/links/descriptions	2018-08-03 15:28:51 -07:00
Charles Reid	84e5560423	fix the readme to reflect current state of things/links/descriptions	2018-08-03 15:28:16 -07:00
Charles Reid	924c562c0a	fix typos/wording in readme	2018-08-03 15:22:35 -07:00
Charles Reid	13c410ac5e	adding changes to enable https, update callback to http, and everything still passes through https (proxy)	2018-08-03 15:21:41 -07:00
Charles Reid	4e79800e83	update footer repo info	2018-08-03 15:19:55 -07:00
Charles Reid	5b9570d8cd	Merge branch 'dcppc' of github.com:dcppc/centillion into dcppc * 'dcppc' of github.com:dcppc/centillion: add mkdocs-material-dib submodule remove mkdocs material submodule	2018-08-03 14:54:25 -07:00
Charles Reid	297a4b5977	update screen shots	2018-08-03 14:53:43 -07:00
Charles Reid	69a6b5d680	add mkdocs-material-dib submodule	2018-08-03 13:51:13 -07:00
Charles Reid	3feca1aba3	remove mkdocs material submodule	2018-08-03 13:50:37 -07:00
Charles Reid	493581f861	update tagline	2018-08-03 13:38:00 -07:00
Charles Reid	1b0ded809d	update tagline	2018-08-03 13:36:56 -07:00
Charles Reid	78e77c7cf2	add _example_ config file for flask	2018-08-03 13:34:27 -07:00
Charles Reid	2f890d1aee	Merge branch 'all-the-docs' of charlesreid1/centillion into master	2018-08-03 20:28:27 +00:00
Charles Reid	937327f2cb	update search template to treat drive files and documents differently.	2018-08-03 13:24:03 -07:00
Charles Reid	ca0d88cfe6	index all the google drive things	2018-08-03 13:15:02 -07:00
Charles Reid	5eda472072	improve handling of tokens for gh api, fix set ordering/logic	2018-08-03 13:07:46 -07:00
Charles Reid	d943c14678	Merge branch 'master' into all-the-docs * master: Update '.gitignore' no secrets plz	2018-08-03 12:37:49 -07:00
Charles Reid	6be785a056	indexing all markdown is working.	2018-08-03 12:36:32 -07:00
Charles Reid	65113a95f7	Update '.gitignore'	2018-08-03 17:52:04 +00:00
Charles Reid	87c3f12c8f	no secrets plz	2018-08-03 17:51:39 +00:00
Charles Reid	933884e9ab	search all the docs. search all the repos.	2018-08-03 10:29:52 -07:00
Charles Reid	da9dea3f6b	Merge branch 'github-markdown' of charlesreid1/centillion into master	2018-08-03 07:20:45 +00:00
Charles Reid	4d6386e74a	add results-handling for markdown files	2018-08-03 00:19:57 -07:00
Charles Reid	a93b7519de	improve counts accounting, and construct usable urls for markdown	2018-08-03 00:19:35 -07:00
Charles Reid	5e2c37164b	fix markdown indexing	2018-08-02 23:56:56 -07:00
Charles Reid	829e9c4263	finish subsuming repotree into centillion_search	2018-08-02 23:14:55 -07:00
Charles Reid	283991017c	add repotree script. temporary/standalone, but doing exactly what centillion needs to do.	2018-08-02 22:29:18 -07:00
Charles Reid	653af18f24	add update_index_markdown() function, rough/unfinished	2018-08-02 22:27:30 -07:00
Charles Reid	fae184f1f3	re-indexer now calls (nonexistent file) update_index_markdown	2018-08-02 22:26:56 -07:00
Charles Reid	d40bb3557f	Merge branch 'flask-dance' of charlesreid1/centillion into master	2018-08-03 04:09:20 +00:00
Charles Reid	a848f3ec3e	complete the conversion to oauth tokens	2018-08-02 19:06:34 -07:00
Charles Reid	50d27a915a	update readme	2018-08-02 19:04:40 -07:00
Charles Reid	1b950b7790	update re-index task to use gh token; reorganize logic; use werkzeug proxy	2018-08-02 19:02:00 -07:00
Charles Reid	04d4195668	Add flask-dance to centillion. - Remove config file, which now contains secrets - Add flask dance to requirements - Update instructions in readme to include Github application setup	2018-08-02 11:52:56 -07:00
Charles Reid	d0fe7aa799	ignore config files, which may have keys in them	2018-08-02 11:24:33 -07:00
Charles Reid	acc28aab44	Merge branch 'cache-and-hash' of charlesreid1/centillion into master	2018-08-02 17:59:45 +00:00
Charles Reid	adc2666a9b	actually fix flashed messages	2018-08-02 00:58:37 -07:00
Charles Reid	581f0a67ed	fix messages so they are js and dismissable	2018-08-02 00:54:56 -07:00
Charles Reid	0b96061bc5	update documentation, add new docs pages on components/flask/whoosh	2018-08-01 23:04:35 -07:00
Charles Reid	c7acdea889	finally. make results comprehensible.	2018-08-01 22:39:07 -07:00
Charles Reid	4eabd4536e	remove last searches from search.html	2018-08-01 22:32:20 -07:00
Charles Reid	78276c14d9	align badges higher	2018-08-01 22:31:59 -07:00
Charles Reid	68f90d383f	fix up how issues are added, and how all issues are iterated over (use set algebra)	2018-08-01 22:31:41 -07:00
Charles Reid	202643b85e	add control_panel route, remove last_search silliness	2018-08-01 22:29:06 -07:00
Charles Reid	dc9ac74d68	add control panel page	2018-08-01 20:12:55 -07:00
Charles Reid	36cc94a854	Fix bootstrap div classes, badgify counts, fix <li> styles	2018-08-01 20:12:10 -07:00
Charles Reid	740e757bcd	update todo with what we have done	2018-08-01 15:54:03 -07:00
Charles Reid	bf6afe39c6	caching is working	2018-08-01 15:48:43 -07:00
Charles Reid	54c09ce80b	call add drive file function with add/update docIDs. fix method headers.	2018-08-01 15:17:07 -07:00
Charles Reid	1407178f39	updating flask config and templates to parameterize repo info in footer	2018-08-01 13:43:43 -07:00
Charles Reid	2bf9abfd6f	update footer: prior searches are now badges, and link to more info now points to repo	2018-08-01 13:36:45 -07:00
Charles Reid	8328f96f76	make "prior searches" a badge and infobox bg color	2018-08-01 13:36:05 -07:00
Charles Reid	d5a9fe85af	Merge branch 'master' into cache-and-hash * master: update installation preparation step	2018-08-01 12:50:10 -07:00
Charles Reid	f8d2156d85	update installation preparation step	2018-08-01 12:48:09 -07:00
Charles Reid	a753ba4963	update centillion search with comment blocks laying out what to change and where	2018-08-01 11:32:37 -07:00
Charles Reid	8cca4b2c8d	add TAGLINE param	2018-08-01 00:49:56 -07:00