Compare commits: disqus...fix-output (27 commits)
| Author | SHA1 | Date |
|---|---|---|
| | 1b2f9a2278 | |
| | 937708f5d8 | |
| | 4c3ee712bb | |
| | f5af965a33 | |
| | bce16d336d | |
| | 729514ac89 | |
| | 46ce070b09 | |
| | 891fa50868 | |
| | fdb3963ede | |
| | 90379a69c5 | |
| | 0faca67c35 | |
| | 77b533b642 | |
| | ccf013e3c9 | |
| | e67db4f1ef | |
| | b11a26a812 | |
| | 55a74f7d98 | |
| | ab76226b0c | |
| | a4ebef6e6f | |
| | bad50efa9b | |
| | 629fc063db | |
| | 4f41d8597f | |
| | 3b0baa21de | |
| | 17b2d359bb | |
| | 62ca62274e | |
| | 501cae8329 | |
| | 0543c3e89f | |
| | 2191140232 | |
.github/PULL_REQUEST_TEMPLATE.md (vendored, new file, 12 lines)
@@ -0,0 +1,12 @@
+Thanks for contributing to centillion!
+
+Please place an x between the brackets to indicate a yes answer
+to the questions below.
+
+- [ ] Is this pull request mergeable?
+- [ ] Has this been tested locally?
+- [ ] Does this pull request pass the tests?
+- [ ] Have new tests been added to cover any new code?
+- [ ] Was a spellchecker run on the source code and documentation after
+changes were made?
+
CODE_OF_CONDUCT.md (new file, 43 lines)
@@ -0,0 +1,43 @@
+# Code of Conduct
+
+## DCPPC Code of Conduct
+
+All members of the Commons are expected to agree with the following code
+of conduct. We will enforce this code as needed. We expect cooperation
+from all members to help ensure a safe environment for everybody.
+
+## The Quick Version
+
+The Consortium is dedicated to providing a harassment-free experience
+for everyone, regardless of gender, gender identity and expression, age,
+sexual orientation, disability, physical appearance, body size, race, or
+religion (or lack thereof). We do not tolerate harassment of Consortium
+members in any form. Sexual language and imagery is generally not
+appropriate for any venue, including meetings, presentations, or
+discussions.
+
+## The Less Quick Version
+
+Harassment includes offensive verbal comments related to gender, gender
+identity and expression, age, sexual orientation, disability, physical
+appearance, body size, race, religion, sexual images in public spaces,
+deliberate intimidation, stalking, following, harassing photography or
+recording, sustained disruption of talks or other events, inappropriate
+physical contact, and unwelcome sexual attention.
+
+Members asked to stop any harassing behavior are expected to comply
+immediately.
+
+If you are being harassed, notice that someone else is being harassed,
+or have any other concerns, please contact [Titus
+Brown](mailto:ctbrown@ucdavis.edu) immediately. If Titus is the cause of
+your concern, please contact [Vivien
+Bonazzi](mailto:bonazziv@mail.nih.gov).
+
+We expect members to follow these guidelines at any Consortium event.
+
+Original source and credit: <http://2012.jsconf.us/#/about> & The Ada
+Initiative. Please help by translating or improving:
+<http://github.com/leftlogic/confcodeofconduct.com>. This work is
+licensed under a Creative Commons Attribution 3.0 Unported License
+
CONTRIBUTING.md (new file, 21 lines)
@@ -0,0 +1,21 @@
+# Contributing to the DCPPC Internal Repository
+
+Hello, and thank you for wanting to contribute to the DCPPC Internal
+Repository!
+
+By contributing to this repository, you agree:
+
+1. To obey the [Code of Conduct](./CODE_OF_CONDUCT.md)
+2. To release all your contributions under the same terms as the
+   license itself: the [Creative Commons Zero](./LICENSE.md) (aka
+   Public Domain) license
+
+If you are OK with these two conditions, then we welcome both you and
+your contribution!
+
+If you have any questions about contributing, please [open an
+issue](https://github.com/dcppc/internal/issues/new) and Team Copper
+will lend a hand ASAP.
+
+Thank you for being here and for being a part of the DCPPC project.
+
@@ -267,7 +267,11 @@ def list_docs(doctype):
     if org['login']=='dcppc':
         # Business as usual
         search = Search(app.config["INDEX_DIR"])
-        return jsonify(search.get_list(doctype))
+        results_list = search.get_list(doctype)
+        for result in results_list:
+            ct = result['created_time']
+            result['created_time'] = datetime.strftime(ct,"%Y-%m-%d %I:%M %p")
+        return jsonify(results_list)

    # nope
    return render_template('403.html')
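For reference, the new strftime call above renders each stored datetime in a compact 12-hour format before it is serialized to JSON. A quick standalone sketch (the sample value is hypothetical):

    from datetime import datetime

    # Same format string as the list_docs() change above.
    ct = datetime(2018, 7, 24, 15, 30)
    print(datetime.strftime(ct, "%Y-%m-%d %I:%M %p"))  # -> 2018-07-24 03:30 PM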
@@ -24,6 +24,8 @@ import dateutil.parser

+from whoosh import query
 from whoosh.qparser import MultifieldParser, QueryParser
 from whoosh.analysis import StemmingAnalyzer, LowercaseFilter, StopFilter
+from whoosh.qparser.dateparse import DateParserPlugin
 from whoosh import fields, index


 """
@@ -195,30 +197,38 @@ class Search:
         # is defined.

         schema = Schema(
-                id = ID(stored=True, unique=True),
-                kind = ID(stored=True),
+                id = fields.ID(stored=True, unique=True),
+                kind = fields.ID(stored=True),

-                created_time = ID(stored=True),
-                modified_time = ID(stored=True),
-                indexed_time = ID(stored=True),
+                created_time = fields.DATETIME(stored=True),
+                modified_time = fields.DATETIME(stored=True),
+                indexed_time = fields.DATETIME(stored=True),

-                title = TEXT(stored=True, field_boost=100.0),
-                url = ID(stored=True, unique=True),
-
-                mimetype=ID(stored=True),
-                owner_email=ID(stored=True),
-                owner_name=TEXT(stored=True),
-
-                repo_name=TEXT(stored=True),
-                repo_url=ID(stored=True),
-
-                github_user=TEXT(stored=True),
+                title = fields.TEXT(stored=True, field_boost=100.0),
+
+                url = fields.ID(stored=True),
+
+                mimetype = fields.TEXT(stored=True),
+
+                owner_email = fields.ID(stored=True),
+                owner_name = fields.TEXT(stored=True),
+
+                # mainly for email threads, groups.io, hypothesis
+                group = fields.ID(stored=True),
+
+                repo_name = fields.TEXT(stored=True),
+                repo_url = fields.ID(stored=True),
+                github_user = fields.TEXT(stored=True),
+
+                tags = fields.KEYWORD(commas=True,
+                                      stored=True,
+                                      lowercase=True),

                 # comments only
-                issue_title=TEXT(stored=True, field_boost=100.0),
-                issue_url=ID(stored=True),
+                issue_title = fields.TEXT(stored=True, field_boost=100.0),
+                issue_url = fields.ID(stored=True),

-                content=TEXT(stored=True, analyzer=stemming_analyzer)
+                content = fields.TEXT(stored=True, analyzer=stemming_analyzer)
         )

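The practical effect of moving the time fields from ID to fields.DATETIME is that the index now stores real datetime objects instead of opaque strings. A minimal sketch of the new contract, assuming a hypothetical throwaway index directory tmp_index (not part of this changeset):

    import os
    from datetime import datetime

    from whoosh import fields, index
    from whoosh.fields import Schema

    # Hypothetical miniature schema mirroring the change above.
    schema = Schema(id=fields.ID(stored=True, unique=True),
                    created_time=fields.DATETIME(stored=True))

    # Throwaway index directory (hypothetical name).
    os.makedirs("tmp_index", exist_ok=True)
    ix = index.create_in("tmp_index", schema)

    writer = ix.writer()
    # DATETIME fields take real datetime objects, not ISO strings,
    # hence the switch to dateutil.parser.parse() in the indexing code below.
    writer.add_document(id=u"abc123", created_time=datetime(2018, 7, 24))
    writer.commit()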
@@ -258,24 +268,32 @@ class Search:
             writer.delete_by_term('id',item['id'])

             # Index a plain google drive file
-            writer.add_document(
-                    id = item['id'],
-                    kind = 'gdoc',
-                    created_time = item['createdTime'],
-                    modified_time = item['modifiedTime'],
-                    indexed_time = datetime.now().replace(microsecond=0).isoformat(),
-                    title = item['name'],
-                    url = item['webViewLink'],
-                    mimetype = mimetype,
-                    owner_email = item['owners'][0]['emailAddress'],
-                    owner_name = item['owners'][0]['displayName'],
-                    repo_name='',
-                    repo_url='',
-                    github_user='',
-                    issue_title='',
-                    issue_url='',
-                    content = content
-            )
+            created_time = dateutil.parser.parse(item['createdTime'])
+            modified_time = dateutil.parser.parse(item['modifiedTime'])
+            indexed_time = datetime.now().replace(microsecond=0)
+            try:
+                writer.add_document(
+                        id = item['id'],
+                        kind = 'gdoc',
+                        created_time = created_time,
+                        modified_time = modified_time,
+                        indexed_time = indexed_time,
+                        title = item['name'],
+                        url = item['webViewLink'],
+                        mimetype = mimetype,
+                        owner_email = item['owners'][0]['emailAddress'],
+                        owner_name = item['owners'][0]['displayName'],
+                        group='',
+                        repo_name='',
+                        repo_url='',
+                        github_user='',
+                        issue_title='',
+                        issue_url='',
+                        content = content
+                )
+            except ValueError as e:
+                print(repr(e))
+                print(" > XXXXXX Failed to index Google Drive file \"%s\""%(item['name']))


         else:
@@ -329,7 +347,7 @@ class Search:
                 )
                 assert output == ""
             except RuntimeError:
-                print(" > XXXXXX Failed to index document \"%s\""%(item['name']))
+                print(" > XXXXXX Failed to index Google Drive document \"%s\""%(item['name']))


             # If export was successful, read contents of markdown
@@ -357,24 +375,33 @@ class Search:
             else:
                 print(" > Creating a new record")

-                writer.add_document(
-                        id = item['id'],
-                        kind = 'gdoc',
-                        created_time = item['createdTime'],
-                        modified_time = item['modifiedTime'],
-                        indexed_time = datetime.now().replace(microsecond=0).isoformat(),
-                        title = item['name'],
-                        url = item['webViewLink'],
-                        mimetype = mimetype,
-                        owner_email = item['owners'][0]['emailAddress'],
-                        owner_name = item['owners'][0]['displayName'],
-                        repo_name='',
-                        repo_url='',
-                        github_user='',
-                        issue_title='',
-                        issue_url='',
-                        content = content
-                )
+                try:
+                    created_time = dateutil.parser.parse(item['createdTime'])
+                    modified_time = dateutil.parser.parse(item['modifiedTime'])
+                    indexed_time = datetime.now()
+                    writer.add_document(
+                            id = item['id'],
+                            kind = 'gdoc',
+                            created_time = created_time,
+                            modified_time = modified_time,
+                            indexed_time = indexed_time,
+                            title = item['name'],
+                            url = item['webViewLink'],
+                            mimetype = mimetype,
+                            owner_email = item['owners'][0]['emailAddress'],
+                            owner_name = item['owners'][0]['displayName'],
+                            group='',
+                            repo_name='',
+                            repo_url='',
+                            github_user='',
+                            issue_title='',
+                            issue_url='',
+                            content = content
+                    )
+                except ValueError as e:
+                    print(repr(e))
+                    print(" > XXXXXX Failed to index Google Drive file \"%s\""%(item['name']))


@@ -408,31 +435,36 @@ class Search:
                 issue_comment_content += comment.body.rstrip()
                 issue_comment_content += "\n"

-            # Now create the actual search index record
-            created_time = clean_timestamp(issue.created_at)
-            modified_time = clean_timestamp(issue.updated_at)
-            indexed_time = clean_timestamp(datetime.now())
-
-            writer.add_document(
-                    id = issue.html_url,
-                    kind = 'issue',
-                    created_time = created_time,
-                    modified_time = modified_time,
-                    indexed_time = indexed_time,
-                    title = issue.title,
-                    url = issue.html_url,
-                    mimetype='',
-                    owner_email='',
-                    owner_name='',
-                    repo_name = repo_name,
-                    repo_url = repo_url,
-                    github_user = issue.user.login,
-                    issue_title = issue.title,
-                    issue_url = issue.html_url,
-                    content = issue_comment_content
-            )
+            # Now create the actual search index record.
+            # Add one document per issue thread,
+            # containing entire text of thread.
+            created_time = issue.created_at
+            modified_time = issue.updated_at
+            indexed_time = datetime.now()
+            try:
+                writer.add_document(
+                        id = issue.html_url,
+                        kind = 'issue',
+                        created_time = created_time,
+                        modified_time = modified_time,
+                        indexed_time = indexed_time,
+                        title = issue.title,
+                        url = issue.html_url,
+                        mimetype='',
+                        owner_email='',
+                        owner_name='',
+                        group='',
+                        repo_name = repo_name,
+                        repo_url = repo_url,
+                        github_user = issue.user.login,
+                        issue_title = issue.title,
+                        issue_url = issue.html_url,
+                        content = issue_comment_content
+                )
+            except ValueError as e:
+                print(repr(e))
+                print(" > XXXXXX Failed to index Github issue \"%s\""%(issue.title))


@@ -462,7 +494,8 @@ class Search:
             print(" > XXXXXXXX Failed to find file info.")
             return

-        indexed_time = clean_timestamp(datetime.now())
+
+        indexed_time = datetime.now()

         if fext in MARKDOWN_EXTS:
             print("Indexing markdown doc %s from repo %s"%(fname,repo_name))
@@ -491,24 +524,31 @@ class Search:
             usable_url = "https://github.com/%s/blob/master/%s"%(repo_name, fpath)

             # Now create the actual search index record
-            writer.add_document(
-                    id = fsha,
-                    kind = 'markdown',
-                    created_time = '',
-                    modified_time = '',
-                    indexed_time = indexed_time,
-                    title = fname,
-                    url = usable_url,
-                    mimetype='',
-                    owner_email='',
-                    owner_name='',
-                    repo_name = repo_name,
-                    repo_url = repo_url,
-                    github_user = '',
-                    issue_title = '',
-                    issue_url = '',
-                    content = content
-            )
+            try:
+                writer.add_document(
+                        id = fsha,
+                        kind = 'markdown',
+                        created_time = None,
+                        modified_time = None,
+                        indexed_time = indexed_time,
+                        title = fname,
+                        url = usable_url,
+                        mimetype='',
+                        owner_email='',
+                        owner_name='',
+                        group='',
+                        repo_name = repo_name,
+                        repo_url = repo_url,
+                        github_user = '',
+                        issue_title = '',
+                        issue_url = '',
+                        content = content
+                )
+            except ValueError as e:
+                print(repr(e))
+                print(" > XXXXXX Failed to index Github markdown file \"%s\""%(fname))


         else:
             print("Indexing github file %s from repo %s"%(fname,repo_name))
@@ -516,24 +556,29 @@ class Search:
             key = fname+"_"+fsha

             # Now create the actual search index record
-            writer.add_document(
-                    id = key,
-                    kind = 'ghfile',
-                    created_time = '',
-                    modified_time = '',
-                    indexed_time = indexed_time,
-                    title = fname,
-                    url = repo_url,
-                    mimetype='',
-                    owner_email='',
-                    owner_name='',
-                    repo_name = repo_name,
-                    repo_url = repo_url,
-                    github_user = '',
-                    issue_title = '',
-                    issue_url = '',
-                    content = ''
-            )
+            try:
+                writer.add_document(
+                        id = key,
+                        kind = 'ghfile',
+                        created_time = None,
+                        modified_time = None,
+                        indexed_time = indexed_time,
+                        title = fname,
+                        url = repo_url,
+                        mimetype='',
+                        owner_email='',
+                        owner_name='',
+                        group='',
+                        repo_name = repo_name,
+                        repo_url = repo_url,
+                        github_user = '',
+                        issue_title = '',
+                        issue_url = '',
+                        content = ''
+                )
+            except ValueError as e:
+                print(repr(e))
+                print(" > XXXXXX Failed to index Github file \"%s\""%(fname))


@@ -547,28 +592,42 @@ class Search:
        Use a Groups.io email thread record to add
        an email thread to the search index.
        """
-       indexed_time = clean_timestamp(datetime.now())
+       if 'created_time' in d.keys() and d['created_time'] is not None:
+           created_time = d['created_time']
+       else:
+           created_time = None
+
+       if 'modified_time' in d.keys() and d['modified_time'] is not None:
+           modified_time = d['modified_time']
+       else:
+           modified_time = None
+
+       indexed_time = datetime.now()

        # Now create the actual search index record
-       writer.add_document(
-               id = d['permalink'],
-               kind = 'emailthread',
-               created_time = '',
-               modified_time = '',
-               indexed_time = indexed_time,
-               title = d['subject'],
-               url = d['permalink'],
-               mimetype='',
-               owner_email='',
-               owner_name=d['original_sender'],
-               repo_name = '',
-               repo_url = '',
-               github_user = '',
-               issue_title = '',
-               issue_url = '',
-               content = d['content']
-       )
+       try:
+           writer.add_document(
+                   id = d['permalink'],
+                   kind = 'emailthread',
+                   created_time = created_time,
+                   modified_time = modified_time,
+                   indexed_time = indexed_time,
+                   title = d['subject'],
+                   url = d['permalink'],
+                   mimetype='',
+                   owner_email='',
+                   owner_name=d['original_sender'],
+                   group=d['subgroup'],
+                   repo_name = '',
+                   repo_url = '',
+                   github_user = '',
+                   issue_title = '',
+                   issue_url = '',
+                   content = d['content']
+           )
+       except ValueError as e:
+           print(repr(e))
+           print(" > XXXXXX Failed to index Groups.io thread \"%s\""%(d['subject']))


        # ------------------------------
@@ -581,28 +640,33 @@ class Search:
        to add a disqus comment thread to the
        search index.
        """
-       indexed_time = clean_timestamp(datetime.now())
+       indexed_time = datetime.now()
+
+       # created_time is already a timestamp

        # Now create the actual search index record
-       writer.add_document(
-               id = d['id'],
-               kind = 'disqus',
-               created_time = d['created_time'],
-               modified_time = '',
-               indexed_time = indexed_time,
-               title = d['title'],
-               url = d['link'],
-               mimetype='',
-               owner_email='',
-               owner_name='',
-               repo_name = '',
-               repo_url = '',
-               github_user = '',
-               issue_title = '',
-               issue_url = '',
-               content = d['content']
-       )
+       try:
+           writer.add_document(
+                   id = d['id'],
+                   kind = 'disqus',
+                   created_time = d['created_time'],
+                   modified_time = None,
+                   indexed_time = indexed_time,
+                   title = d['title'],
+                   url = d['link'],
+                   mimetype='',
+                   owner_email='',
+                   owner_name='',
+                   repo_name = '',
+                   repo_url = '',
+                   github_user = '',
+                   issue_title = '',
+                   issue_url = '',
+                   content = d['content']
+           )
+       except ValueError as e:
+           print(repr(e))
+           print(" > XXXXXX Failed to index Disqus comment thread \"%s\""%(d['title']))


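Continuing the miniature index from the earlier sketch: every writer.add_document() call in this changeset is now wrapped in try/except ValueError so that one bad record is skipped rather than aborting the whole crawl. Assuming Whoosh rejects a value it cannot interpret as a datetime for a DATETIME field with a ValueError, as these except clauses anticipate:

    # A value that cannot be interpreted as a datetime makes
    # add_document() raise, so the crawler logs the record and
    # moves on instead of aborting the indexing run.
    writer = ix.writer()
    try:
        writer.add_document(id=u"bad-record", created_time="not-a-date")
        writer.commit()
    except ValueError as e:
        print(repr(e))
        print(" > XXXXXX Failed to index record \"bad-record\"")
        writer.cancel()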
@@ -681,7 +745,7 @@ class Search:

                ## Shorter:
                #break
-               # Longer:
+               ## Longer:
                if nextPageToken is None:
                    break

@@ -691,40 +755,47 @@ class Search:
        temp_dir = tempfile.mkdtemp(dir=os.getcwd())
        print("Temporary directory: %s"%(temp_dir))

-       # Drop any id in indexed_ids
-       # not in remote_ids
-       drop_ids = indexed_ids - remote_ids
-       for drop_id in drop_ids:
-           writer.delete_by_term('id',drop_id)
-
-       # Update any id in indexed_ids
-       # and in remote_ids
-       update_ids = indexed_ids & remote_ids
-       for update_id in update_ids:
-           # cop out
-           writer.delete_by_term('id',update_id)
-           item = full_items[update_id]
-           self.add_drive_file(writer, item, temp_dir, config, update=True)
-           count += 1
-
-       # Add any id not in indexed_ids
-       # and in remote_ids
-       add_ids = remote_ids - indexed_ids
-       for add_id in add_ids:
-           item = full_items[add_id]
-           self.add_drive_file(writer, item, temp_dir, config, update=False)
-           count += 1
+       try:
+
+           # Drop any id in indexed_ids
+           # not in remote_ids
+           drop_ids = indexed_ids - remote_ids
+           for drop_id in drop_ids:
+               writer.delete_by_term('id',drop_id)
+
+           # Update any id in indexed_ids
+           # and in remote_ids
+           update_ids = indexed_ids & remote_ids
+           for update_id in update_ids:
+               # cop out
+               writer.delete_by_term('id',update_id)
+               item = full_items[update_id]
+               self.add_drive_file(writer, item, temp_dir, config, update=True)
+               count += 1
+
+           # Add any id not in indexed_ids
+           # and in remote_ids
+           add_ids = remote_ids - indexed_ids
+           for add_id in add_ids:
+               item = full_items[add_id]
+               self.add_drive_file(writer, item, temp_dir, config, update=False)
+               count += 1
+
+       except Exception as e:
+           print("ERROR: While adding Google Drive files to search index")
+           print("-"*40)
+           print(repr(e))
+           print("-"*40)
+           print("Continuing...")
+           pass

        print("Cleaning temporary directory: %s"%(temp_dir))
        subprocess.call(['rm','-fr',temp_dir])

        writer.commit()
-       print("Done, updated %d documents in the index" % count)
+       print("Done, updated %d Google Drive files in the index" % count)


        # ------------------------------
@@ -802,7 +873,7 @@ class Search:


        writer.commit()
-       print("Done, updated %d documents in the index" % count)
+       print("Done, updated %d Github issues in the index" % count)


@@ -1176,7 +1247,7 @@ class Search:
        elif doctype=='issue':
            item_keys = ['title','repo_name','repo_url','url','created_time','modified_time']
        elif doctype=='emailthread':
-           item_keys = ['title','owner_name','url']
+           item_keys = ['title','owner_name','url','group','created_time','modified_time']
        elif doctype=='disqus':
            item_keys = ['title','created_time','url']
        elif doctype=='ghfile':
@@ -1195,11 +1266,7 @@ class Search:
        for r in results:
            d = {}
            for k in item_keys:
-               if k=='created_time' or k=='modified_time':
-                   #d[k] = r[k]
-                   d[k] = dateutil.parser.parse(r[k]).strftime("%Y-%m-%d")
-               else:
-                   d[k] = r[k]
+               d[k] = r[k]
            json_results.append(d)

        return json_results
@@ -1212,13 +1279,16 @@ class Search:
            query_string = " ".join(query_list)
            query = None
            if ":" in query_string:
+
                #query = QueryParser("content",
                #                    self.schema
                #).parse(query_string)
                query = QueryParser("content",
                                    self.schema,
                                    termclass=query.Variations
-               ).parse(query_string)
+               )
+               query.add_plugin(DateParserPlugin(free=True))
+               query = query.parse(query_string)
            elif len(fields) == 1 and fields[0] == "filename":
                pass
            elif len(fields) == 2:
@@ -1226,9 +1296,12 @@ class Search:
            else:
                # If the user does not specify a field,
                # these are the fields that are actually searched
-               fields = ['title', 'content','owner_name','owner_email','url']
+               fields = ['title', 'content','owner_name','owner_email','url','created_date','modified_date']
            if not query:
-               query = MultifieldParser(fields, schema=self.ix.schema).parse(query_string)
+               query = MultifieldParser(fields, schema=self.ix.schema)
+               query.add_plugin(DateParserPlugin(free=True))
+               query = query.parse(query_string)
+               #query = MultifieldParser(fields, schema=self.ix.schema).parse(query_string)
            parsed_query = "%s" % query
            print("query: %s" % parsed_query)
            results = searcher.search(query, terms=False, scored=True, groupedby="kind")
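Both parser changes attach a DateParserPlugin before parsing. A small sketch of what that enables, using a hypothetical two-field schema (only the field names matter for parsing); free=True lets a date term appear directly in the query text:

    from whoosh.fields import Schema, TEXT, DATETIME
    from whoosh.qparser import QueryParser
    from whoosh.qparser.dateparse import DateParserPlugin

    # Hypothetical schema standing in for the real one above.
    schema = Schema(content=TEXT, created_time=DATETIME)

    parser = QueryParser("content", schema)
    # free=True allows queries like "created_time:2018-07-24"
    # or even plain date words mixed into the query string.
    parser.add_plugin(DateParserPlugin(free=True))

    q = parser.parse(u"centillion created_time:2018")
    print(q)  # a query tree containing a date-range term for created_time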
@@ -1,20 +1,38 @@
+######################################
+# github oauth
+GITHUB_OAUTH_CLIENT_ID = "XXX"
+GITHUB_OAUTH_CLIENT_SECRET = "YYY"
+
+######################################
+# github access token
+GITHUB_TOKEN = "XXX"
+
+######################################
+# groups.io
+GROUPSIO_TOKEN = "XXXXX"
+GROUPSIO_USERNAME = "XXXXX"
+GROUPSIO_PASSWORD = "XXXXX"
+
+######################################
+# Disqus API public key
+DISQUS_TOKEN = "XXXXX"
+
+######################################
+# everything else
+
 # Location of index file
 INDEX_DIR = "search_index"

-# oauth client deets
-GITHUB_OAUTH_CLIENT_ID = "XXX"
-GITHUB_OAUTH_CLIENT_SECRET = "YYY"
-GITHUB_TOKEN = "ZZZ"
-
 # More information footer: Repository label
-FOOTER_REPO_ORG = "charlesreid1"
+FOOTER_REPO_ORG = "dcppc"
 FOOTER_REPO_NAME = "centillion"

 # Toggle to show Whoosh parsed query
 SHOW_PARSED_QUERY=True

-TAGLINE = "Search All The Things"
+TAGLINE = "Search the Data Commons"

 # Flask settings
 DEBUG = True
-SECRET_KEY = 'WWWWW'
+SECRET_KEY = 'XXXXX'

@@ -1,6 +1,7 @@
 import os, re
 import requests
 import json
+import dateutil.parser

 from pprint import pprint

@@ -117,13 +118,14 @@ class DisqusCrawler(object):

                    link = response['link']
                    clean_link = re.sub('data-commons.us','nihdatacommons.us',link)
                    clean_link += "#disqus_comments"

                    # Finished working on thread.

+                   # We need to make this value a dictionary
                    thread_info = dict(
                            id = response['id'],
-                           created_time = response['createdAt'],
+                           created_time = dateutil.parser.parse(response['createdAt']),
                            title = response['title'],
                            forum = response['forum'],
                            link = clean_link,
@@ -1,5 +1,7 @@
 import requests, os, re
 from bs4 import BeautifulSoup
+import dateutil.parser
+import datetime

 class GroupsIOException(Exception):
     pass
@@ -251,7 +253,7 @@ class GroupsIOArchivesCrawler(object):
        subject = soup.find('title').text

        # Extract information for the schema:
-       # - permalink for thread (done)
+       # - permalink for thread (done above)
        # - subject/title (done)
        # - original sender email/name (done)
        # - content (done)
@@ -266,11 +268,35 @@ class GroupsIOArchivesCrawler(object):
                pass
            else:
                # found an email!
-               # this is a maze, thanks groups.io
+               # this is a maze, not amazing.
+               # thanks groups.io!
                td = tr.find('td')
-               divrow = td.find('div',{'class':'row'}).find('div',{'class':'pull-left'})
+
+               sender_divrow = td.find('div',{'class':'row'})
+               sender_divrow = sender_divrow.find('div',{'class':'pull-left'})
                if (i+1)==1:
-                   original_sender = divrow.text.strip()
+                   original_sender = sender_divrow.text.strip()
+
+               date_divrow = td.find('div',{'class':'row'})
+               date_divrow = date_divrow.find('div',{'class':'pull-right'})
+               date_divrow = date_divrow.find('font',{'class':'text-muted'})
+               date_divrow = date_divrow.find('script').text
+               try:
+                   time_seconds = re.search(' [0-9]{1,} ',date_divrow).group(0)
+                   time_seconds = time_seconds.strip()
+                   # Thanks groups.io for the weird date formatting
+                   mmicro_seconds = time_seconds[10:]
+                   time_seconds = time_seconds[:10]
+                   if (i+1)==1:
+                       created_time = datetime.datetime.utcfromtimestamp(int(time_seconds))
+                       modified_time = datetime.datetime.utcfromtimestamp(int(time_seconds))
+                   else:
+                       modified_time = datetime.datetime.utcfromtimestamp(int(time_seconds))
+
+               except AttributeError:
+                   created_time = None
+                   modified_time = None
+
                for div in td.find_all('div'):
                    if div.has_attr('id'):
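The regex above pulls an epoch timestamp, in milliseconds, out of the embedded script text; the crawler keeps the first ten digits (whole seconds) and converts them to a UTC datetime, with the leftover digits being the millisecond fraction. A standalone sketch with a hypothetical scraped value:

    import datetime

    # Hypothetical value matching the ' [0-9]{1,} ' regex above:
    # thirteen digits, i.e. epoch milliseconds.
    raw = " 1532458200000 "
    seconds = raw.strip()[:10]       # "1532458200" (whole seconds)
    millis = raw.strip()[10:]        # "000" (millisecond remainder)
    print(datetime.datetime.utcfromtimestamp(int(seconds)))
    # -> 2018-07-24 18:50:00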
@@ -299,7 +325,10 @@ class GroupsIOArchivesCrawler(object):

        thread = {
                'permalink' : permalink,
+               'created_time' : created_time,
+               'modified_time' : modified_time,
                'subject' : subject,
+               'subgroup' : subgroup_name,
                'original_sender' : original_sender,
                'content' : full_content
        }
@@ -324,11 +353,13 @@ class GroupsIOArchivesCrawler(object):

        results = []
        for row in rows:
-           # We don't care about anything except title and ugly link
+           # This is where we extract
+           # a list of thread titles
+           # and corresponding links.
            subject = row.find('span',{'class':'subject'})
            title = subject.get_text()
            link = row.find('a')['href']
-           #print(title)
+

            results.append((title,link))

        return results
BIN static/centillion_white_beta.png (new file, 29 KiB; binary file not shown)
BIN static/centillion_white_localhost.png (new file, 30 KiB; binary file not shown)
@@ -57,6 +57,25 @@ $(document).ready(function() {
});


+//////////////////////////////////
+// utility functions
+
+// https://stackoverflow.com/a/25275808
+function iso8601(date) {
+    var hours = date.getHours();
+    var minutes = date.getMinutes();
+    var ampm = hours >= 12 ? 'PM' : 'AM';
+    hours = hours % 12;
+    hours = hours ? hours : 12; // the hour '0' should be '12'
+    minutes = minutes < 10 ? '0'+minutes : minutes;
+    var strTime = hours + ':' + minutes + ' ' + ampm;
+    return date.getFullYear() + "-" + (date.getMonth()+1) + "-" + date.getDate() + " " + strTime;
+}
+
+// https://stackoverflow.com/a/7390612
+var toType = function(obj) {
+    return ({}).toString.call(obj).match(/\s([a-zA-Z]+)/)[1].toLowerCase()
+}
+
//////////////////////////////////
// API-to-Table Functions
@@ -315,8 +334,10 @@ function load_emailthreads_table(){
            var r = new Array(), j = -1, size=result.length;
            r[++j] = '<thead>'
            r[++j] = '<tr class="header-row">';
-           r[++j] = '<th width="70%">Topic</th>';
-           r[++j] = '<th width="30%">Started By</th>';
+           r[++j] = '<th width="60%">Topic</th>';
+           r[++j] = '<th width="15%">Started By</th>';
+           r[++j] = '<th width="15%">Date</th>';
+           r[++j] = '<th width="10%">Mailing List</th>';
            r[++j] = '</tr>';
            r[++j] = '</thead>'
            r[++j] = '<tbody>'
@@ -327,6 +348,10 @@ function load_emailthreads_table(){
                r[++j] = '</a>'
                r[++j] = '</td><td>';
                r[++j] = result[i]['owner_name'];
+               r[++j] = '</td><td>';
+               r[++j] = result[i]['created_time'];
+               r[++j] = '</td><td>';
+               r[++j] = result[i]['group'];
                r[++j] = '</td></tr>';
            }
            r[++j] = '</tbody>'
@@ -58,7 +58,7 @@ button#feedback {
/* search results table */
td#search-results-score-col,
td#search-results-type-col {
-    width: 100px;
+    width: 90px;
}

div.container {
@@ -86,6 +86,14 @@ div.container {
}

+/* badges for number of docs indexed */
+span.results-count {
+    background-color: #555;
+}
+
+span.indexing-count {
+    background-color: #337ab7;
+}
+
span.badge {
    vertical-align: text-bottom;
}
@@ -126,7 +134,7 @@ li.search-group-item {
}

div.url {
-    background-color: rgba(86,61,124,.15);
+    background-color: rgba(40,40,60,.15);
    padding: 8px;
}

@@ -192,7 +200,7 @@ table {

.info, .last-searches {
    color: gray;
-    font-size: 12px;
+    /*font-size: 12px;*/
    font-family: Arial, serif;
}

@@ -202,27 +210,27 @@ table {

div.tags a, td.tag-cloud a {
    color: #b56020;
-    font-size: 12px;
+    /*font-size: 12px;*/
}

td.tag-cloud, td.directories-cloud {
-    font-size: 12px;
+    /*font-size: 12px;*/
    color: #555555;
}

td.directories-cloud a {
-    font-size: 12px;
+    /*font-size: 12px;*/
    color: #377BA8;
}

div.path {
-    font-size: 12px;
+    /*font-size: 12px;*/
    color: #666666;
    margin-bottom: 3px;
}

div.path a {
-    font-size: 12px;
+    /*font-size: 12px;*/
    margin-right: 5px;
}

@@ -7,11 +7,18 @@
      <div class="col12sm" id="banner-col">
        <center>
          <a id="banner-a" href="{{ url_for('search')}}?query=&fields=">
-           <img id="banner-img" src="{{ url_for('static', filename='centillion_white.png') }}">
+           {% if 'betasearch' in request.url %}
+               <img id="banner-img" src="{{ url_for('static', filename='centillion_white_beta.png') }}">
+           {% elif 'localhost' in request.url %}
+               <img id="banner-img" src="{{ url_for('static', filename='centillion_white_localhost.png') }}">
+           {% else %}
+               <img id="banner-img" src="{{ url_for('static', filename='centillion_white.png') }}">
+           {% endif %}
          </a>
        </center>
      </div>
    </div>

    {% if config['TAGLINE'] %}
    <div class="row" id="tagline-row">
      <div class="col12sm" id="tagline-col">
@@ -5,7 +5,7 @@
    <div class="alert alert-success alert-dismissible fade in">
      <a href="#" class="close" data-dismiss="alert" aria-label="close">×</a>
      {% for message in messages %}
-         <p class="lead">{{ message }}</p>
+         <p>{{ message }}</p>
      {% endfor %}
    </div>
  </div>
@@ -52,8 +52,8 @@
<div class="container-fluid">
  <div class="row">
    <div class="col-xs-12 info">
-     <b>Found:</b> <span class="badge">{{entries|length}}</span> results
-     out of <span class="badge">{{totals["total"]}}</span> total items indexed
+     <b>Found:</b> <span class="badge results-count">{{entries|length}}</span> results
+     out of <span class="badge results-count">{{totals["total"]}}</span> total items indexed
    </div>
  </div>
</div>
@@ -67,32 +67,32 @@
    <div class="col-xs-12 info">
      <b>Indexing:</b>

-     <span class="badge">{{totals["gdoc"]}}</span>
+     <span class="badge indexing-count">{{totals["gdoc"]}}</span>
      <a href="/master_list?doctype=gdoc#gdoc">
        Google Drive files
      </a>,

-     <span class="badge">{{totals["issue"]}}</span>
+     <span class="badge indexing-count">{{totals["issue"]}}</span>
      <a href="/master_list?doctype=issue#issue">
        Github issues
      </a>,

-     <span class="badge">{{totals["ghfile"]}}</span>
+     <span class="badge indexing-count">{{totals["ghfile"]}}</span>
      <a href="/master_list?doctype=ghfile#ghfile">
        Github files
      </a>,

-     <span class="badge">{{totals["markdown"]}}</span>
+     <span class="badge indexing-count">{{totals["markdown"]}}</span>
      <a href="/master_list?doctype=markdown#markdown">
        Github Markdown files
      </a>,

-     <span class="badge">{{totals["emailthread"]}}</span>
+     <span class="badge indexing-count">{{totals["emailthread"]}}</span>
      <a href="/master_list?doctype=emailthread#emailthread">
        Groups.io email threads
      </a>,

-     <span class="badge">{{totals["disqus"]}}</span>
+     <span class="badge indexing-count">{{totals["disqus"]}}</span>
      <a href="/master_list?doctype=disqus#disqus">
        Disqus comment threads
      </a>
@@ -101,6 +101,7 @@
      </div>
    </li>

+
  </ul>
</div>
</div>