Merge pull request #9 from dcppc/ACharbonneau-patch-1

Update config_centillion.json
2018-08-07 16:09:45 -07:00 · 2018-08-07 18:24:32 -04:00 · 2018-08-07 14:59:06 -07:00 · 2018-08-07 14:54:56 -07:00 · 2018-08-07 14:43:42 -07:00 · 2018-08-07 14:42:19 -07:00
15 changed files with 184 additions and 90 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,4 +1,4 @@
-config_*
+config_flask.py
 vp
 credentials.json
 drive*.json
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,3 +1,3 @@
-[submodule "mkdocs-material"]
-	path = mkdocs-material
-	url = https://git.charlesreid1.com/charlesreid1/mkdocs-material.git
+[submodule "mkdocs-material-dib"]
+	path = mkdocs-material-dib
+	url = https://github.com/dib-lab/mkdocs-material-dib.git
--- a/Readme.md
+++ b/Readme.md
@@ -1,18 +1,19 @@
 # The Centillion

-**the centillion**: a pan-github-markdown-issues-google-docs search engine.
+**centillion**: a pan-github-markdown-issues-google-docs search engine.

 **a centillion**: a very large number consisting of a 1 with 303 zeros after it.

-the centillion is 3.03 log-times better than the googol.
+one centillion is 3.03 log-times better than a googol.

-![Screen shot of centillion](img/ss.png)
+![Screen shot of centillion](docs/images/ss.png)


 ## what is it

-The centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
-a Python library for building search engines.
+Centillion (https://github.com/dcppc/centillion) is a search engine that can index 
+three kinds of collections: Google Documents, Github issues, and Markdown files in 
+Github repos.

 We define the types of documents the centillion should index,
 what info and how. The centillion then builds and
@@ -24,6 +25,30 @@ defined in `centillion.py`.

 The centillion keeps it simple.

+## authentication layer
+
+Centillion lives behind a Github authentication layer, implemented with 
+[flask-dance](https://github.com/singingwolfboy/flask-dance). When you first
+visit the site it will ask you to authenticate with Github so that it can 
+verify you have permission to access the site.
+
+## technologies
+
+Centillion is a Python program built using whoosh (a search engine library). It
+indexes the full text of docx files in Google Documents, and just the filenames of
+non-docx files. The full text of issues and their comments is indexed, and
+results are grouped by issue. Centillion requires Google Drive and Github OAuth
+apps. Once you provide credentials to Flask you're all set to go.
+
+
+## control panel
+
+There's also a control panel at <https://search.nihdatacommons.us/control_panel> 
+that allows you to rebuild the search index from scratch (the Google Drive indexing 
+takes a while).
+
+![Screen shot of centillion control panel](docs/images/cp.png)
+

 ## quickstart (with Github auth)

@@ -31,6 +56,8 @@ Start by creating a Github OAuth application.
 Get the public and private application key 
 (client token and client secret token)
 from the Github application's page.
+You will also need a Github access token
+(in addition to the app tokens).

 When you create the application, set the callback
 URL to `/login/github/authorized`, as in:
@@ -65,11 +92,3 @@ as HTTP by Github, even though there is an HTTPS address, and
 everything else seems fine, try deleting the Github OAuth app
 and creating a new one.

-
-## more info
-
-For more info see the documentation: <https://charlesreid1.github.io/centillion>
-
-
-
-
--- a/centillion.py
+++ b/centillion.py
@@ -27,10 +27,10 @@ You provide:


 class UpdateIndexTask(object):
-    def __init__(self, gh_oauth_token, diff_index=False):
+    def __init__(self, gh_access_token, diff_index=False):
        self.diff_index = diff_index
        thread = threading.Thread(target=self.run, args=())
-        self.gh_oauth_token = gh_oauth_token
+        self.gh_access_token = gh_access_token
        thread.daemon = True
        thread.start()

@@ -43,8 +43,8 @@ class UpdateIndexTask(object):
        from get_centillion_config import get_centillion_config
        config = get_centillion_config('config_centillion.json')

-        search.update_index_markdown(self.gh_oauth_token,config)
-        search.update_index_issues(self.gh_oauth_token,config)
+        search.update_index_issues(self.gh_access_token,config)
+        search.update_index_markdown(self.gh_access_token,config)
        search.update_index_gdocs(config)


@@ -55,11 +55,11 @@ app.wsgi_app = ProxyFix(app.wsgi_app)
 # Load default config and override config from an environment variable
 app.config.from_pyfile("config_flask.py")

-github_bp = make_github_blueprint()
-#github_bp = make_github_blueprint(
-#                        client_id = os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
-#                        client_secret = os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
-#                        scope='read:org')
+#github_bp = make_github_blueprint()
+github_bp = make_github_blueprint(
+                        client_id = os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
+                        client_secret = os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
+                        scope='read:org')

 app.register_blueprint(github_bp, url_prefix="/login")

@@ -172,11 +172,13 @@ def update_index():
                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
                if mresp.status_code==204:

-                    gh_oauth_token = github.token['access_token']
+                    #gh_oauth_token = github.token['access_token']
+                    gh_access_token = app.config['GITHUB_TOKEN']

                    # --------------------
                    # Business as usual
-                    UpdateIndexTask(gh_oauth_token, diff_index=False)
+                    UpdateIndexTask(gh_access_token, 
+                                    diff_index=False)
                    flash("Rebuilding index, check console output")
                    return render_template("controlpanel.html", 
                                           totals={})
@@ -216,5 +218,6 @@ def oops(e):
    return contents404

 if __name__ == '__main__':
+    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = 'true'
    app.run(host="0.0.0.0",port=5000)
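Note on `UpdateIndexTask`: the class in this file starts a daemon thread from its constructor, so the control panel request can return while the index rebuilds. A minimal sketch of that pattern follows, assuming a hypothetical `BackgroundTask` class and `work` callable (neither name exists in this repository). It assigns all state before the thread starts, which avoids any window where `run` could see a missing attribute.

```python
import threading


class BackgroundTask:
    """Run a callable on a daemon thread, started from the constructor.

    Hypothetical illustration; not part of centillion.
    """

    def __init__(self, work, *args):
        # Assign every attribute the worker needs before starting the thread.
        self.result = None
        self._work = work
        self._args = args
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        # Store the return value so the caller can inspect it after join().
        self.result = self._work(*self._args)


if __name__ == "__main__":
    task = BackgroundTask(lambda token: "indexed with " + token, "ZZZ")
    task.thread.join()
    print(task.result)
```

A daemon thread is killed when the main process exits, so a rebuild started this way may be cut short if the Flask process is restarted.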

--- a/centillion_search.py
+++ b/centillion_search.py
@@ -1,7 +1,7 @@
 import shutil
 import html.parser

-from github import Github
+from github import Github, GithubException
 import base64

 from gdrive_util import GDrive
@@ -252,7 +252,6 @@ class Search:
            with open(fullpath_input, 'wb') as f:
                f.write(r.content)

-
            # Try to convert docx file to plain text
            try:
                output = pypandoc.convert_file(fullpath_input,
@@ -316,7 +315,7 @@ class Search:
    # to a search index.


-    def add_issue(self, writer, issue, config, update=True):
+    def add_issue(self, writer, issue, gh_access_token, config, update=True):
        """
        Add a Github issue/comment to a search index.
        """
@@ -368,7 +367,7 @@ class Search:



-    def add_markdown(self, writer, d, config, update=True):
+    def add_markdown(self, writer, d, gh_access_token, config, update=True):
        """
        Use a Github markdown document API record
        to add a markdown document's contents to
@@ -385,18 +384,27 @@ class Search:
        _, fname = os.path.split(fpath)
        _, fext = os.path.splitext(fpath)

-        print("Indexing markdown doc %s"%(fname))
+        print("Indexing markdown doc %s from repo %s"%(fname,repo_name))

        # Unpack the requests response and decode the content
-        response = requests.get(furl)
-        jresponse = response.json()
-        content = ""
-        try:
-            binary_content = re.sub('\n','',jresponse['content'])
-            content = base64.b64decode(binary_content).decode('utf-8')
+        # 
+        # don't forget the headers for private repos!
+        # useful: https://bit.ly/2LSAflS

-        except KeyError:
-            print(" > XXXXXXXX Failed to extract 'content' field. You probably hit the rate limit.")
+        headers = {'Authorization' : 'token %s'%(gh_access_token)}
+
+        response = requests.get(furl, headers=headers)
+        if response.status_code==200:
+            jresponse = response.json()
+            content = ""
+            try:
+                binary_content = re.sub('\n','',jresponse['content'])
+                content = base64.b64decode(binary_content).decode('utf-8')
+            except KeyError:
+                print(" > XXXXXXXX Failed to extract 'content' field. You probably hit the rate limit.")
+
+        else:
+            print(" > XXXXXXXX Failed to reach file URL. There may be a problem with authentication/headers.")
            return 

        # Now create the actual search index record
@@ -431,6 +439,10 @@ class Search:
    # Define how to update search index
    # using different kinds of collections

+
+    # ------------------------------
+    # Google Drive Files/Documents
+
    def update_index_gdocs(self, 
                           config):
        """
@@ -478,7 +490,7 @@ class Search:
        remote_ids = set()
        full_items = {}
        while True:
-            ps = 12
+            ps = 100
            results = drive.list(
                    pageSize=ps,
                    pageToken=nextPageToken,
@@ -496,11 +508,11 @@ class Search:
                # Also store the doc
                full_items[f['id']] = f
            
-            # Shorter:
-            break
-            ## Longer:
-            #if nextPageToken is None:
-            #    break
+            ## Shorter:
+            #break
+            # Longer:
+            if nextPageToken is None:
+                break


        writer = self.ix.writer()
@@ -544,13 +556,13 @@ class Search:
        print("Done, updated %d documents in the index" % count)


+    # ------------------------------
+    # Github Issues/Comments

-    def update_index_issues(self, gh_oauth_token, config):
+    def update_index_issues(self, gh_access_token, config):
        """
        Update the search index using a collection of 
        Github repo issues and comments.
-
-        gh_oauth_token can also be an access token.
        """
        # Updated algorithm:
        # - get set of indexed ids
@@ -572,25 +584,29 @@ class Search:
        # Get the set of remote ids:
        # ------
        # Start with api object
-        g = Github(gh_oauth_token)
+        g = Github(gh_access_token)

        # Now index all issue threads in the user-specified repos

+        # Start by collecting all the things
+        remote_issues = set()
+        full_items = {}
+
        # Iterate over each repo 
        list_of_repos = config['repositories']
        for r in list_of_repos:

-            # Start by collecting all the things
-            remote_issues = set()
-            full_items = {}
-
            if '/' not in r:
                err = "Error: specify org/reponame or user/reponame in list of repos"
                raise Exception(err)

            this_org, this_repo = re.split('/',r)
-            org = g.get_organization(this_org)
-            repo = org.get_repo(this_repo)
+            try:
+                org = g.get_organization(this_org)
+                repo = org.get_repo(this_repo)
+            except Exception:
+                print("Error: could not gain access to repository %s"%(r))
+                continue

            # Iterate over each issue thread
            issues = repo.get_issues()
@@ -622,7 +638,7 @@ class Search:
            # cop out
            writer.delete_by_term('id',update_issue)
            item = full_items[update_issue]
-            self.add_issue(writer, item, config, update=True)
+            self.add_issue(writer, item, gh_access_token, config, update=True)
            count += 1


@@ -631,7 +647,7 @@ class Search:
        add_issues = remote_issues - indexed_issues
        for add_issue in add_issues:
            item = full_items[add_issue]
-            self.add_issue(writer, item, config, update=False)
+            self.add_issue(writer, item, gh_access_token, config, update=False)
            count += 1


@@ -640,13 +656,13 @@ class Search:



+    # ------------------------------
+    # Github Markdown Files

-    def update_index_markdown(self, gh_oauth_token, config): 
+    def update_index_markdown(self, gh_access_token, config): 
        """
        Update the search index using a collection of 
        Markdown files from a Github repo.
-
-        gh_oauth_token can also be an access token.
        """
        EXT = '.md'

@@ -669,38 +685,48 @@ class Search:
        # Get the set of remote ids:
        # ------
        # Start with api object
-        g = Github(gh_oauth_token)
+        g = Github(gh_access_token)

        # Now index all markdown files
        # in the user-specified repos

+        # Start by collecting all the things
+        remote_ids = set()
+        full_items = {}
+
        # Iterate over each repo 
        list_of_repos = config['repositories']
        for r in list_of_repos:

-            # Start by collecting all the things
-            remote_ids = set()
-            full_items = {}
-
            if '/' not in r:
                err = "Error: specify org/reponame or user/reponame in list of repos"
                raise Exception(err)

            this_org, this_repo = re.split('/',r)
-            org = g.get_organization(this_org)
-            repo = org.get_repo(this_repo)
+            try:
+                org = g.get_organization(this_org)
+                repo = org.get_repo(this_repo)
+            except Exception:
+                print("Error: could not gain access to repository %s"%(r))
+                continue
+

            # ---------
            # begin markdown-specific code

            # Get head commit
            commits = repo.get_commits()
-            last = commits[0]
-            sha = last.sha
+            try:
+                last = commits[0]
+                sha = last.sha
+            except GithubException:
+                print("Error: could not get commits from repository %s"%(r))
+                continue

            # Get all the docs
            tree = repo.get_git_tree(sha=sha, recursive=True)
            docs = tree.raw_data['tree']
+            print("Parsing doc ids from repository %s"%(r))

            for d in docs:

@@ -736,10 +762,10 @@ class Search:
        # and in remote_ids
        update_ids = indexed_ids & remote_ids
        for update_id in update_ids:
-            # cop out
+            # cop out: just delete and re-add
            writer.delete_by_term('id',update_id)
            item = full_items[update_id]
-            self.add_markdown(writer, item, config, update=True)
+            self.add_markdown(writer, item, gh_access_token, config, update=True)
            count += 1


@@ -748,7 +774,7 @@ class Search:
        add_ids = remote_ids - indexed_ids
        for add_id in add_ids:
            item = full_items[add_id]
-            self.add_markdown(writer, item, config, update=False)
+            self.add_markdown(writer, item, gh_access_token, config, update=False)
            count += 1


@@ -757,6 +783,16 @@ class Search:



+    # ------------------------------
+    # Groups.io Emails
+
+
+    #def update_index_markdown(self, gh_access_token, config): 
+
+
+
+
+
    # ---------------------------------
    # Search results bundler
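Note on `add_markdown`: the change above sends an `Authorization` header with the file request and then base64-decodes the `content` field of the JSON response. A minimal sketch of those two steps as standalone functions, assuming hypothetical helper names `auth_headers` and `decode_contents_payload` (they are not in this diff). The network call itself is left out so the sketch runs offline.

```python
import base64


def auth_headers(token):
    # Same header shape as the diff: 'token <access token>'.
    return {"Authorization": "token %s" % token}


def decode_contents_payload(payload):
    """Return the decoded file text, or None if 'content' is missing.

    A payload without 'content' is what the diff treats as a probable
    rate-limit or error response.
    """
    encoded = payload.get("content")
    if encoded is None:
        return None
    # The encoded body is wrapped across lines; strip newlines first,
    # as the diff does with re.sub.
    return base64.b64decode(encoded.replace("\n", "")).decode("utf-8")


if __name__ == "__main__":
    raw = base64.b64encode(b"# Hello\n").decode("ascii")
    wrapped = raw[:4] + "\n" + raw[4:]
    print(decode_contents_payload({"content": wrapped}))
    print(auth_headers("ZZZ"))
```

Returning `None` for a missing field lets the caller decide whether to skip the document or index it with empty content.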

--- a/config_centillion.json
+++ b/config_centillion.json
@@ -1,6 +1,27 @@
 {
    "repositories" : [
+        "dcppc/project-management",
+        "dcppc/nih-demo-meetings",
+        "dcppc/internal",
+        "dcppc/organize",
+        "dcppc/dcppc-bot",
+        "dcppc/full-stacks",
+        "dcppc/markdown-issues",
+        "dcppc/design-guidelines-discuss",
+        "dcppc/dcppc-deliverables",
+        "dcppc/dcppc-milestones",
+        "dcppc/crosscut-metadata",
+        "dcppc/lucky-penny",
+        "dcppc/dcppc-workshops",
+        "dcppc/metadata-matrix",
+        "dcppc/data-stewards",
+        "dcppc/dcppc-phase1-demos",
+        "dcppc/apis",
        "dcppc/2018-june-workshop",
-        "dcppc/2018-july-workshop"
+        "dcppc/2018-july-workshop",
+        "dcppc/2018-august-workshop",
+        "dcppc/2018-september-workshop",
+        "dcppc/design-guidelines",
+        "dcppc/2018-may-workshop"
    ]
 }
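Note on the `repositories` list: the indexing code expects each entry in `org/reponame` form and splits it on `/`. A minimal sketch of a stricter check, assuming a hypothetical helper `split_repo_spec` (not part of this repository). Unlike a bare split with tuple unpacking, it rejects entries with extra slashes or empty parts using an explicit error.

```python
def split_repo_spec(spec):
    """Split 'org/reponame' into (org, reponame), or raise ValueError."""
    parts = spec.split("/")
    # Exactly two non-empty parts are required.
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            "specify org/reponame or user/reponame, got %r" % spec
        )
    return parts[0], parts[1]


if __name__ == "__main__":
    for entry in ["dcppc/internal", "dcppc/dcppc-milestones"]:
        print(split_repo_spec(entry))
```

Validating the whole list once at startup would surface a typo in the config before a long index rebuild begins.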
--- a/config_flask.example.py
+++ b/config_flask.example.py
@@ -2,17 +2,18 @@
 INDEX_DIR = "search_index"

 # oauth client deets
-GITHUB_OAUTH_CLIENT_ID = "63f8d49c651840cbe31e"
-GITHUB_OAUTH_CLIENT_SECRET = "36d9a4611f7427336d3c89ed041c45d086b793ee"
+GITHUB_OAUTH_CLIENT_ID = "XXX"
+GITHUB_OAUTH_CLIENT_SECRET = "YYY"
+GITHUB_TOKEN = "ZZZ"

 # More information footer: Repository label
-FOOTER_REPO_ORG = "charlesreid1"
+FOOTER_REPO_ORG = "dcppc"
 FOOTER_REPO_NAME = "centillion"

 # Toggle to show Whoosh parsed query
 SHOW_PARSED_QUERY=True

-TAGLINE = "Search all the things"
+TAGLINE = "Search the Data Commons"

 # Flask settings
 DEBUG = True
--- a/docs/images/cp.png
+++ b/docs/images/cp.png
--- a/docs/images/ss.png
+++ b/docs/images/ss.png
--- a/gdrive_util.py
+++ b/gdrive_util.py
@@ -29,8 +29,7 @@ class GDrive(object):
    ):
        """
        Set up the Google Drive API instance.
-        Factory method: create it and hand it over.
-        Then we're finished.
+        Factory method: create it here, hand it over in get_service().
        """
        self.credentials_file = credentials_file
        self.client_secret_file = client_secret_file
@@ -40,6 +39,9 @@ class GDrive(object):
        self.store = file.Storage(credentials_file)

    def get_service(self):
+        """
+        Return an instance of the Google Drive API service.
+        """

        creds = self.store.get()
        if not creds or creds.invalid:
--- a/img/ss.png
+++ b/img/ss.png
--- a/1
+++ b/1
--- a/1
+++ b/1
--- a/templates/layout.html
+++ b/templates/layout.html
@@ -1,5 +1,5 @@
 <!doctype html>
-<title>Markdown Search</title>
+<title>Centillion Search Engine</title>
 <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='style.css') }}">
 <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='github-markdown.css') }}">
 <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='bootstrap.min.css') }}">
--- a/templates/search.html
+++ b/templates/search.html
@@ -107,12 +107,18 @@

                    <div class="url">
                        {% if e.kind=="gdoc" %}
-                            <b>Google Drive File:</b>
-                            <a href='{{e.url}}'>{{e.title}}</a>
-                            (Owner: {{e.owner_name}}, {{e.owner_email}})
+                            {% if e.mimetype=="document" %}
+                                <b>Google Document:</b>
+                                <a href='{{e.url}}'>{{e.title}}</a>
+                                (Type: {{e.mimetype}}, Owner: {{e.owner_name}}, {{e.owner_email}})
+                            {% else %}
+                                <b>Google Drive:</b>
+                                <a href='{{e.url}}'>{{e.title}}</a>
+                                (Type: {{e.mimetype}}, Owner: {{e.owner_name}}, {{e.owner_email}})
+                            {% endif %}

                        {% elif e.kind=="issue" %}
-                            <b>Issue:</b>
+                            <b>Github Issue:</b>
                            <a href='{{e.url}}'>{{e.title}}</a>
                            {% if e.github_user %}
                            opened by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
@@ -121,7 +127,7 @@
                            <b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>

                        {% elif e.kind=="markdown" %}
-                            <b>Markdown:</b>
+                            <b>Github Markdown:</b>
                            <a href='{{e.url}}'>{{e.title}}</a>
                            <br/>
                            <b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
@@ -131,9 +137,15 @@

                        {% endif %}
                        <br />
-                        score: {{'%d'  % e.score}}
+                        Score: {{'%d'  % e.score}}
+                    </div>
+                    <div class="markdown-body">
+                        {% if e.content_highlight %}
+                            {{ e.content_highlight|safe}}
+                        {% else %}
+                        <p>(A preview of this document is not available.)</p>
+                        {% endif %}
                    </div>
-                    <div class="markdown-body">{{ e.content_highlight|safe}}</div>

                </li>
            {% endfor %}
Author	SHA1	Message	Date
Chaz Reid	1a04814edf	Merge pull request #9 from dcppc/ACharbonneau-patch-1 Update config_centillion.json	2018-08-07 16:09:45 -07:00
Amanda Charbonneau	3fb72d409b	Update config_centillion.json I fixed it	2018-08-07 18:24:32 -04:00
Chaz Reid	d89e01221a	Merge pull request #8 from dcppc/dcppc-test Fix the name of the milestones repo: 'dcppc-milestones' not 'milestones'	2018-08-07 14:59:06 -07:00
Charles Reid	6736f3f8ad	add centillion configuration json file	2018-08-07 14:54:56 -07:00
Chaz Reid	abd13aba29	Merge pull request #7 from dcppc/fix-docstrings Fix docstrings	2018-08-07 14:43:42 -07:00
Charles Reid	13e49cdaa6	improve docstrings on gdrive_util.py too	2018-08-07 14:42:19 -07:00
Charles Reid	83b2ce17fb	fix docstrings in centillion_search.py	2018-08-07 14:41:26 -07:00
Chaz Reid	5be0709070	Merge pull request #6 from dcppc/fix-docs Move images, resize images, update image markdown in readme	2018-08-07 13:02:08 -07:00
Charles Reid	9edd95a78d	Merge branch 'fix-docs' * fix-docs: Move images, resize images, update image markdown in readme update readme to use <img> tags merge image files in from master fix <title> fix the readme to reflect current state of things/links/descriptions fix typos/wording in readme adding changes to enable https, update callback to http, and everything still passes through https (proxy) update footer repo info update screen shots add mkdocs-material-dib submodule remove mkdocs material submodule update tagline update tagline add _example_ config file for flask	2018-08-07 12:50:29 -07:00
Charles Reid	37615d8707	Move images, resize images, update image markdown in readme	2018-08-07 12:40:38 -07:00
Charles Reid	4b218f63b9	update readme to use <img> tags	2018-08-03 15:56:49 -07:00
Charles Reid	4e17c890bc	merge image files in from master	2018-08-03 15:53:51 -07:00
Charles Reid	1129ec38e0	update the readme	2018-08-03 15:49:46 -07:00
Charles Reid	875508c796	update screen shot images	2018-08-03 15:49:12 -07:00
Charles Reid	abc7a2aedf	fix <title>	2018-08-03 15:45:56 -07:00
Charles Reid	8f1e5faefc	update readme to reflect latest	2018-08-03 15:38:23 -07:00
Chaz Reid	d5f63e2322	Merge pull request #1 from dcppc/fix-readme fix the readme to reflect current state of things/links/descriptions	2018-08-03 15:28:51 -07:00
Charles Reid	84e5560423	fix the readme to reflect current state of things/links/descriptions	2018-08-03 15:28:16 -07:00
Charles Reid	924c562c0a	fix typos/wording in readme	2018-08-03 15:22:35 -07:00
Charles Reid	13c410ac5e	adding changes to enable https, update callback to http, and everything still passes through https (proxy)	2018-08-03 15:21:41 -07:00
Charles Reid	4e79800e83	update footer repo info	2018-08-03 15:19:55 -07:00
Charles Reid	5b9570d8cd	Merge branch 'dcppc' of github.com:dcppc/centillion into dcppc * 'dcppc' of github.com:dcppc/centillion: add mkdocs-material-dib submodule remove mkdocs material submodule	2018-08-03 14:54:25 -07:00
Charles Reid	297a4b5977	update screen shots	2018-08-03 14:53:43 -07:00
Charles Reid	69a6b5d680	add mkdocs-material-dib submodule	2018-08-03 13:51:13 -07:00
Charles Reid	3feca1aba3	remove mkdocs material submodule	2018-08-03 13:50:37 -07:00
Charles Reid	493581f861	update tagline	2018-08-03 13:38:00 -07:00
Charles Reid	1b0ded809d	update tagline	2018-08-03 13:36:56 -07:00
Charles Reid	78e77c7cf2	add _example_ config file for flask	2018-08-03 13:34:27 -07:00
Charles Reid	2f890d1aee	Merge branch 'all-the-docs' of charlesreid1/centillion into master	2018-08-03 20:28:27 +00:00
Charles Reid	937327f2cb	update search template to treat drive files and documents differently.	2018-08-03 13:24:03 -07:00
Charles Reid	ca0d88cfe6	index all the google drive things	2018-08-03 13:15:02 -07:00
Charles Reid	5eda472072	improve handling of tokens for gh api, fix set ordering/logic	2018-08-03 13:07:46 -07:00
Charles Reid	d943c14678	Merge branch 'master' into all-the-docs * master: Update '.gitignore' no secrets plz	2018-08-03 12:37:49 -07:00
Charles Reid	6be785a056	indexing all markdown is working.	2018-08-03 12:36:32 -07:00
Charles Reid	65113a95f7	Update '.gitignore'	2018-08-03 17:52:04 +00:00
Charles Reid	87c3f12c8f	no secrets plz	2018-08-03 17:51:39 +00:00
Charles Reid	933884e9ab	search all the docs. search all the repos.	2018-08-03 10:29:52 -07:00
Charles Reid	da9dea3f6b	Merge branch 'github-markdown' of charlesreid1/centillion into master	2018-08-03 07:20:45 +00:00