31 Commits

Author SHA1 Message Date
1985e6606c Merge pull request #95 from dcppc/fix-output-msg
change "documents" to "issues" in reindexing message
2018-08-24 09:25:09 -07:00
1b2f9a2278 fix output messages for reindexing 2018-08-24 09:23:09 -07:00
d7d929689b Merge pull request #94 from dcppc/raynamharris-patch-1
Create ISSUE_TEMPLATE.md
2018-08-24 09:20:46 -07:00
937708f5d8 do *full* indexing 2018-08-24 09:01:18 -07:00
Rayna M Harris
d2dff2217a fixed typo 2018-08-24 10:44:45 -05:00
4c3ee712bb Fix display bug. Merge branch 'dcppc' of github.com:dcppc/centillion into dcppc
* 'dcppc' of github.com:dcppc/centillion:
  fix styles
2018-08-24 08:42:03 -07:00
f5af965a33 fix display bug 2018-08-24 08:41:35 -07:00
bce16d336d fix flask example configuration 2018-08-24 08:40:46 -07:00
Rayna M Harris
9b2ce7b3ca Create ISSUE_TEMPLATE.md 2018-08-24 10:40:29 -05:00
729514ac89 Merge pull request #93 from dcppc/fix-styles
fix styles
2018-08-24 08:37:51 -07:00
46ce070b09 fix styles 2018-08-24 08:31:57 -07:00
891fa50868 fix results boxes in results table to be gray 2018-08-24 02:30:49 -07:00
fdb3963ede tack on the disqus comments anchor to disqus URLs 2018-08-24 02:01:34 -07:00
90379a69c5 Merge pull request #92 from dcppc/add-date-subgrp-emailthreads
add string formatting for dates and add date/mailing list column to email threads master list
2018-08-24 01:58:29 -07:00
0faca67c35 add string formatting for dates and add date/mailing list column to email threads master list
closes #58
2018-08-24 01:56:19 -07:00
77b533b642 Merge pull request #86 from dcppc/disqus
Add Disqus
2018-08-24 01:18:37 -07:00
ccf013e3c9 Merge pull request #85 from dcppc/add-coc-dotgithub
Add Code of Conduct, Contributing, and PR template
2018-08-24 01:18:14 -07:00
e67db4f1ef Merge pull request #89 from dcppc/fix-flashed-messages-font
fix font used in flashed messages
2018-08-24 01:17:59 -07:00
b11a26a812 Merge pull request #91 from dcppc/merge-datetime-into-disqus
Merge datetime into disqus
2018-08-24 01:14:24 -07:00
55a74f7d98 Merge branch 'use-datetime' into merge-datetime-into-disqus
* use-datetime:
  extract date and time from email threads pages
  add groups and tags to schema; update how we determine timestamps; handle exceptions when we add the document to the writer, rather than elsewhere
  move where exception is caught (exception was also incorrect.)
  switched created_time, modified_time, indexed_time over to DATETIME. added DateParserPlugin to query QueryParser. added time fields to those being searched by default. tests do not seem to be working.
2018-08-24 01:13:42 -07:00
ab76226b0c Merge pull request #90 from dcppc/add-dates-and-subgroups-to-emails
Add dates and subgroups to emails
2018-08-24 00:07:40 -07:00
a4ebef6e6f extract date and time from email threads pages 2018-08-24 00:04:35 -07:00
bad50efa9b add groups and tags to schema; update how we determine timestamps; handle exceptions when we add the document to the writer, rather than elsewhere 2018-08-24 00:03:23 -07:00
629fc063db move where exception is caught (exception was also incorrect.) 2018-08-24 00:01:26 -07:00
4f41d8597f fix font used in flashed messages 2018-08-23 19:05:16 -07:00
3b0baa21de switched created_time, modified_time, indexed_time over to DATETIME. added DateParserPlugin to query QueryParser. added time fields to those being searched by default. tests do not seem to be working. 2018-08-23 19:01:40 -07:00
17b2d359bb add contributing and code of conduct files 2018-08-23 11:03:48 -07:00
62ca62274e add github pull request template 2018-08-23 11:02:37 -07:00
501cae8329 Merge pull request #81 from dcppc/detect-beta-banner
Add custom banners for beta/localhost centillion instances
2018-08-21 13:18:11 -07:00
0543c3e89f fix filename 2018-08-21 12:01:12 -07:00
2191140232 Add custom banners for beta/localhost centillion instances 2018-08-21 11:58:19 -07:00
16 changed files with 488 additions and 226 deletions

17
.github/ISSUE_TEMPLATE.md vendored Normal file

@@ -0,0 +1,17 @@
Thanks for using Centillion. Your feedback is important to us.
### When reporting a bug, please be sure to include the following:
- [ ] A descriptive title
- [ ] The behavior you expect to see and the actual behavior observed
- [ ] Steps to reproduce the behavior
- [ ] What browser you are using
### When you open an issue for a feature request, please add as much detail as possible:
- [ ] A descriptive title
- [ ] A description of the problem you're trying to solve, including *why* you think this is a problem
- [ ] An overview of the suggested solution
- [ ] If the feature changes current behavior, please explain why your solution is better
Please read [our contributor guidelines](https://github.com/dcppc/centillion/blob/dcppc/CONTRIBUTING.md)
for more details about contributing to this project.

12
.github/PULL_REQUEST_TEMPLATE.md vendored Normal file

@@ -0,0 +1,12 @@
Thanks for contributing to centillion!
Please place an x between the brackets to indicate a yes answer
to the questions below.
- [ ] Is this pull request mergeable?
- [ ] Has this been tested locally?
- [ ] Does this pull request pass the tests?
- [ ] Have new tests been added to cover any new code?
- [ ] Was a spellchecker run on the source code and documentation after
changes were made?

43
CODE_OF_CONDUCT.md Normal file

@@ -0,0 +1,43 @@
# Code of Conduct
## DCPPC Code of Conduct
All members of the Commons are expected to agree with the following code
of conduct. We will enforce this code as needed. We expect cooperation
from all members to help ensure a safe environment for everybody.
## The Quick Version
The Consortium is dedicated to providing a harassment-free experience
for everyone, regardless of gender, gender identity and expression, age,
sexual orientation, disability, physical appearance, body size, race, or
religion (or lack thereof). We do not tolerate harassment of Consortium
members in any form. Sexual language and imagery is generally not
appropriate for any venue, including meetings, presentations, or
discussions.
## The Less Quick Version
Harassment includes offensive verbal comments related to gender, gender
identity and expression, age, sexual orientation, disability, physical
appearance, body size, race, religion, sexual images in public spaces,
deliberate intimidation, stalking, following, harassing photography or
recording, sustained disruption of talks or other events, inappropriate
physical contact, and unwelcome sexual attention.
Members asked to stop any harassing behavior are expected to comply
immediately.
If you are being harassed, notice that someone else is being harassed,
or have any other concerns, please contact [Titus
Brown](mailto:ctbrown@ucdavis.edu) immediately. If Titus is the cause of
your concern, please contact [Vivien
Bonazzi](mailto:bonazziv@mail.nih.gov).
We expect members to follow these guidelines at any Consortium event.
Original source and credit: <http://2012.jsconf.us/#/about> & The Ada
Initiative. Please help by translating or improving:
<http://github.com/leftlogic/confcodeofconduct.com>. This work is
licensed under a Creative Commons Attribution 3.0 Unported License

21
CONTRIBUTING.md Normal file

@@ -0,0 +1,21 @@
# Contributing to the DCPPC Internal Repository
Hello, and thank you for wanting to contribute to the DCPPC Internal
Repository\!
By contributing to this repository, you agree:
1. To obey the [Code of Conduct](./CODE_OF_CONDUCT.md)
2. To release all your contributions under the same terms as the
license itself: the [Creative Commons Zero](./LICENSE.md) (aka
Public Domain) license
If you are OK with these two conditions, then we welcome both you and
your contribution\!
If you have any questions about contributing, please [open an
issue](https://github.com/dcppc/internal/issues/new) and Team Copper
will lend a hand ASAP.
Thank you for being here and for being a part of the DCPPC project.


@@ -267,7 +267,11 @@ def list_docs(doctype):
if org['login']=='dcppc':
# Business as usual
search = Search(app.config["INDEX_DIR"])
return jsonify(search.get_list(doctype))
results_list = search.get_list(doctype)
for result in results_list:
ct = result['created_time']
result['created_time'] = datetime.strftime(ct,"%Y-%m-%d %I:%M %p")
return jsonify(results_list)
# nope
return render_template('403.html')
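
The `list_docs` change above renders `created_time` with `strftime`. The indexing code in this same changeset stores `created_time = None` for Markdown and GitHub file records, so a formatter probably needs to tolerate missing values. A minimal sketch, assuming the results are plain dicts; `format_created_times` is a hypothetical helper and not part of centillion:

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %I:%M %p"  # same format string used in list_docs


def format_created_times(results_list):
    """Return copies of the result dicts with created_time rendered as text.

    Hypothetical helper: a record whose created_time is missing or None
    gets an empty string instead of raising a TypeError.
    """
    formatted = []
    for result in results_list:
        row = dict(result)  # copy so the caller's dicts are left untouched
        ct = row.get("created_time")
        row["created_time"] = ct.strftime(DATE_FORMAT) if isinstance(ct, datetime) else ""
        formatted.append(row)
    return formatted
```

The empty-string fallback is an assumption about what the master list table should display; any other placeholder would work the same way.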


@@ -24,6 +24,8 @@ import dateutil.parser
from whoosh import query
from whoosh.qparser import MultifieldParser, QueryParser
from whoosh.analysis import StemmingAnalyzer, LowercaseFilter, StopFilter
from whoosh.qparser.dateparse import DateParserPlugin
from whoosh import fields, index
"""
@@ -195,30 +197,38 @@ class Search:
# is defined.
schema = Schema(
id = ID(stored=True, unique=True),
kind = ID(stored=True),
id = fields.ID(stored=True, unique=True),
kind = fields.ID(stored=True),
created_time = ID(stored=True),
modified_time = ID(stored=True),
indexed_time = ID(stored=True),
created_time = fields.DATETIME(stored=True),
modified_time = fields.DATETIME(stored=True),
indexed_time = fields.DATETIME(stored=True),
title = TEXT(stored=True, field_boost=100.0),
url = ID(stored=True, unique=True),
mimetype=ID(stored=True),
owner_email=ID(stored=True),
owner_name=TEXT(stored=True),
repo_name=TEXT(stored=True),
repo_url=ID(stored=True),
title = fields.TEXT(stored=True, field_boost=100.0),
github_user=TEXT(stored=True),
url = fields.ID(stored=True),
mimetype = fields.TEXT(stored=True),
owner_email = fields.ID(stored=True),
owner_name = fields.TEXT(stored=True),
# mainly for email threads, groups.io, hypothesis
group = fields.ID(stored=True),
repo_name = fields.TEXT(stored=True),
repo_url = fields.ID(stored=True),
github_user = fields.TEXT(stored=True),
tags = fields.KEYWORD(commas=True,
stored=True,
lowercase=True),
# comments only
issue_title=TEXT(stored=True, field_boost=100.0),
issue_url=ID(stored=True),
issue_title = fields.TEXT(stored=True, field_boost=100.0),
issue_url = fields.ID(stored=True),
content=TEXT(stored=True, analyzer=stemming_analyzer)
content = fields.TEXT(stored=True, analyzer=stemming_analyzer)
)
@@ -258,24 +268,32 @@ class Search:
writer.delete_by_term('id',item['id'])
# Index a plain google drive file
writer.add_document(
id = item['id'],
kind = 'gdoc',
created_time = item['createdTime'],
modified_time = item['modifiedTime'],
indexed_time = datetime.now().replace(microsecond=0).isoformat(),
title = item['name'],
url = item['webViewLink'],
mimetype = mimetype,
owner_email = item['owners'][0]['emailAddress'],
owner_name = item['owners'][0]['displayName'],
repo_name='',
repo_url='',
github_user='',
issue_title='',
issue_url='',
content = content
)
created_time = dateutil.parser.parse(item['createdTime'])
modified_time = dateutil.parser.parse(item['modifiedTime'])
indexed_time = datetime.now().replace(microsecond=0)
try:
writer.add_document(
id = item['id'],
kind = 'gdoc',
created_time = created_time,
modified_time = modified_time,
indexed_time = indexed_time,
title = item['name'],
url = item['webViewLink'],
mimetype = mimetype,
owner_email = item['owners'][0]['emailAddress'],
owner_name = item['owners'][0]['displayName'],
group='',
repo_name='',
repo_url='',
github_user='',
issue_title='',
issue_url='',
content = content
)
except ValueError as e:
print(repr(e))
print(" > XXXXXX Failed to index Google Drive file \"%s\""%(item['name']))
else:
@@ -329,7 +347,7 @@ class Search:
)
assert output == ""
except RuntimeError:
print(" > XXXXXX Failed to index document \"%s\""%(item['name']))
print(" > XXXXXX Failed to index Google Drive document \"%s\""%(item['name']))
# If export was successful, read contents of markdown
@@ -357,24 +375,33 @@ class Search:
else:
print(" > Creating a new record")
writer.add_document(
id = item['id'],
kind = 'gdoc',
created_time = item['createdTime'],
modified_time = item['modifiedTime'],
indexed_time = datetime.now().replace(microsecond=0).isoformat(),
title = item['name'],
url = item['webViewLink'],
mimetype = mimetype,
owner_email = item['owners'][0]['emailAddress'],
owner_name = item['owners'][0]['displayName'],
repo_name='',
repo_url='',
github_user='',
issue_title='',
issue_url='',
content = content
)
try:
created_time = dateutil.parser.parse(item['createdTime'])
modified_time = dateutil.parser.parse(item['modifiedTime'])
indexed_time = datetime.now()
writer.add_document(
id = item['id'],
kind = 'gdoc',
created_time = created_time,
modified_time = modified_time,
indexed_time = indexed_time,
title = item['name'],
url = item['webViewLink'],
mimetype = mimetype,
owner_email = item['owners'][0]['emailAddress'],
owner_name = item['owners'][0]['displayName'],
group='',
repo_name='',
repo_url='',
github_user='',
issue_title='',
issue_url='',
content = content
)
except ValueError as e:
print(repr(e))
print(" > XXXXXX Failed to index Google Drive file \"%s\""%(item['name']))
@@ -408,31 +435,36 @@ class Search:
issue_comment_content += comment.body.rstrip()
issue_comment_content += "\n"
# Now create the actual search index record
created_time = clean_timestamp(issue.created_at)
modified_time = clean_timestamp(issue.updated_at)
indexed_time = clean_timestamp(datetime.now())
# Now create the actual search index record.
# Add one document per issue thread,
# containing entire text of thread.
writer.add_document(
id = issue.html_url,
kind = 'issue',
created_time = created_time,
modified_time = modified_time,
indexed_time = indexed_time,
title = issue.title,
url = issue.html_url,
mimetype='',
owner_email='',
owner_name='',
repo_name = repo_name,
repo_url = repo_url,
github_user = issue.user.login,
issue_title = issue.title,
issue_url = issue.html_url,
content = issue_comment_content
)
created_time = issue.created_at
modified_time = issue.updated_at
indexed_time = datetime.now()
try:
writer.add_document(
id = issue.html_url,
kind = 'issue',
created_time = created_time,
modified_time = modified_time,
indexed_time = indexed_time,
title = issue.title,
url = issue.html_url,
mimetype='',
owner_email='',
owner_name='',
group='',
repo_name = repo_name,
repo_url = repo_url,
github_user = issue.user.login,
issue_title = issue.title,
issue_url = issue.html_url,
content = issue_comment_content
)
except ValueError as e:
print(repr(e))
print(" > XXXXXX Failed to index Github issue \"%s\""%(issue.title))
@@ -462,7 +494,8 @@ class Search:
print(" > XXXXXXXX Failed to find file info.")
return
indexed_time = clean_timestamp(datetime.now())
indexed_time = datetime.now()
if fext in MARKDOWN_EXTS:
print("Indexing markdown doc %s from repo %s"%(fname,repo_name))
@@ -491,24 +524,31 @@ class Search:
usable_url = "https://github.com/%s/blob/master/%s"%(repo_name, fpath)
# Now create the actual search index record
writer.add_document(
id = fsha,
kind = 'markdown',
created_time = '',
modified_time = '',
indexed_time = indexed_time,
title = fname,
url = usable_url,
mimetype='',
owner_email='',
owner_name='',
repo_name = repo_name,
repo_url = repo_url,
github_user = '',
issue_title = '',
issue_url = '',
content = content
)
try:
writer.add_document(
id = fsha,
kind = 'markdown',
created_time = None,
modified_time = None,
indexed_time = indexed_time,
title = fname,
url = usable_url,
mimetype='',
owner_email='',
owner_name='',
group='',
repo_name = repo_name,
repo_url = repo_url,
github_user = '',
issue_title = '',
issue_url = '',
content = content
)
except ValueError as e:
print(repr(e))
print(" > XXXXXX Failed to index Github markdown file \"%s\""%(fname))
else:
print("Indexing github file %s from repo %s"%(fname,repo_name))
@@ -516,24 +556,29 @@ class Search:
key = fname+"_"+fsha
# Now create the actual search index record
writer.add_document(
id = key,
kind = 'ghfile',
created_time = '',
modified_time = '',
indexed_time = indexed_time,
title = fname,
url = repo_url,
mimetype='',
owner_email='',
owner_name='',
repo_name = repo_name,
repo_url = repo_url,
github_user = '',
issue_title = '',
issue_url = '',
content = ''
)
try:
writer.add_document(
id = key,
kind = 'ghfile',
created_time = None,
modified_time = None,
indexed_time = indexed_time,
title = fname,
url = repo_url,
mimetype='',
owner_email='',
owner_name='',
group='',
repo_name = repo_name,
repo_url = repo_url,
github_user = '',
issue_title = '',
issue_url = '',
content = ''
)
except ValueError as e:
print(repr(e))
print(" > XXXXXX Failed to index Github file \"%s\""%(fname))
@@ -547,28 +592,42 @@ class Search:
Use a Groups.io email thread record to add
an email thread to the search index.
"""
indexed_time = clean_timestamp(datetime.now())
if 'created_time' in d.keys() and d['created_time'] is not None:
created_time = d['created_time']
else:
created_time = None
if 'modified_time' in d.keys() and d['modified_time'] is not None:
modified_time = d['modified_time']
else:
modified_time = None
indexed_time = datetime.now()
# Now create the actual search index record
writer.add_document(
id = d['permalink'],
kind = 'emailthread',
created_time = '',
modified_time = '',
indexed_time = indexed_time,
title = d['subject'],
url = d['permalink'],
mimetype='',
owner_email='',
owner_name=d['original_sender'],
repo_name = '',
repo_url = '',
github_user = '',
issue_title = '',
issue_url = '',
content = d['content']
)
try:
writer.add_document(
id = d['permalink'],
kind = 'emailthread',
created_time = created_time,
modified_time = modified_time,
indexed_time = indexed_time,
title = d['subject'],
url = d['permalink'],
mimetype='',
owner_email='',
owner_name=d['original_sender'],
group=d['subgroup'],
repo_name = '',
repo_url = '',
github_user = '',
issue_title = '',
issue_url = '',
content = d['content']
)
except ValueError as e:
print(repr(e))
print(" > XXXXXX Failed to index Groups.io thread \"%s\""%(d['subject']))
# ------------------------------
@@ -581,28 +640,33 @@ class Search:
to add a disqus comment thread to the
search index.
"""
indexed_time = clean_timestamp(datetime.now())
indexed_time = datetime.now()
# created_time is already a timestamp
# Now create the actual search index record
writer.add_document(
id = d['id'],
kind = 'disqus',
created_time = d['created_time'],
modified_time = '',
indexed_time = indexed_time,
title = d['title'],
url = d['link'],
mimetype='',
owner_email='',
owner_name='',
repo_name = '',
repo_url = '',
github_user = '',
issue_title = '',
issue_url = '',
content = d['content']
)
try:
writer.add_document(
id = d['id'],
kind = 'disqus',
created_time = d['created_time'],
modified_time = None,
indexed_time = indexed_time,
title = d['title'],
url = d['link'],
mimetype='',
owner_email='',
owner_name='',
repo_name = '',
repo_url = '',
github_user = '',
issue_title = '',
issue_url = '',
content = d['content']
)
except ValueError as e:
print(repr(e))
print(" > XXXXXX Failed to index Disqus comment thread \"%s\""%(d['title']))
@@ -681,7 +745,7 @@ class Search:
## Shorter:
#break
# Longer:
## Longer:
if nextPageToken is None:
break
@@ -691,40 +755,47 @@ class Search:
temp_dir = tempfile.mkdtemp(dir=os.getcwd())
print("Temporary directory: %s"%(temp_dir))
try:
# Drop any id in indexed_ids
# not in remote_ids
drop_ids = indexed_ids - remote_ids
for drop_id in drop_ids:
writer.delete_by_term('id',drop_id)
# Drop any id in indexed_ids
# not in remote_ids
drop_ids = indexed_ids - remote_ids
for drop_id in drop_ids:
writer.delete_by_term('id',drop_id)
# Update any id in indexed_ids
# and in remote_ids
update_ids = indexed_ids & remote_ids
for update_id in update_ids:
# cop out
writer.delete_by_term('id',update_id)
item = full_items[update_id]
self.add_drive_file(writer, item, temp_dir, config, update=True)
count += 1
# Update any id in indexed_ids
# and in remote_ids
update_ids = indexed_ids & remote_ids
for update_id in update_ids:
# cop out
writer.delete_by_term('id',update_id)
item = full_items[update_id]
self.add_drive_file(writer, item, temp_dir, config, update=True)
count += 1
# Add any id not in indexed_ids
# and in remote_ids
add_ids = remote_ids - indexed_ids
for add_id in add_ids:
item = full_items[add_id]
self.add_drive_file(writer, item, temp_dir, config, update=False)
count += 1
# Add any id not in indexed_ids
# and in remote_ids
add_ids = remote_ids - indexed_ids
for add_id in add_ids:
item = full_items[add_id]
self.add_drive_file(writer, item, temp_dir, config, update=False)
count += 1
except Exception as e:
print("ERROR: While adding Google Drive files to search index")
print("-"*40)
print(repr(e))
print("-"*40)
print("Continuing...")
pass
print("Cleaning temporary directory: %s"%(temp_dir))
subprocess.call(['rm','-fr',temp_dir])
writer.commit()
print("Done, updated %d documents in the index" % count)
print("Done, updated %d Google Drive files in the index" % count)
# ------------------------------
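
The Google Drive update hunk above sorts document ids into three groups with set arithmetic: ids to drop, ids to update, and ids to add. The same bookkeeping can be sketched in isolation as follows; `plan_sync` is a hypothetical name used only for illustration:

```python
def plan_sync(indexed_ids, remote_ids):
    """Split ids into drop/update/add groups, mirroring the hunk above.

    indexed_ids: ids currently in the search index
    remote_ids:  ids currently present in the remote source
    """
    indexed_ids = set(indexed_ids)
    remote_ids = set(remote_ids)
    return {
        "drop": indexed_ids - remote_ids,    # indexed, but gone remotely
        "update": indexed_ids & remote_ids,  # present in both places
        "add": remote_ids - indexed_ids,     # remote, but not yet indexed
    }
```

Each id lands in exactly one group, which is why the loop in the hunk can delete and re-add every id in the update group without touching the other two.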
@@ -802,7 +873,7 @@ class Search:
writer.commit()
print("Done, updated %d documents in the index" % count)
print("Done, updated %d Github issues in the index" % count)
@@ -1176,7 +1247,7 @@ class Search:
elif doctype=='issue':
item_keys = ['title','repo_name','repo_url','url','created_time','modified_time']
elif doctype=='emailthread':
item_keys = ['title','owner_name','url']
item_keys = ['title','owner_name','url','group','created_time','modified_time']
elif doctype=='disqus':
item_keys = ['title','created_time','url']
elif doctype=='ghfile':
@@ -1195,11 +1266,7 @@ class Search:
for r in results:
d = {}
for k in item_keys:
if k=='created_time' or k=='modified_time':
#d[k] = r[k]
d[k] = dateutil.parser.parse(r[k]).strftime("%Y-%m-%d")
else:
d[k] = r[k]
d[k] = r[k]
json_results.append(d)
return json_results
@@ -1212,13 +1279,16 @@ class Search:
query_string = " ".join(query_list)
query = None
if ":" in query_string:
#query = QueryParser("content",
# self.schema
#).parse(query_string)
query = QueryParser("content",
self.schema,
termclass=query.Variations
).parse(query_string)
)
query.add_plugin(DateParserPlugin(free=True))
query = query.parse(query_string)
elif len(fields) == 1 and fields[0] == "filename":
pass
elif len(fields) == 2:
@@ -1226,9 +1296,12 @@ class Search:
else:
# If the user does not specify a field,
# these are the fields that are actually searched
fields = ['title', 'content','owner_name','owner_email','url']
fields = ['title', 'content','owner_name','owner_email','url','created_time','modified_time']
if not query:
query = MultifieldParser(fields, schema=self.ix.schema).parse(query_string)
query = MultifieldParser(fields, schema=self.ix.schema)
query.add_plugin(DateParserPlugin(free=True))
query = query.parse(query_string)
#query = MultifieldParser(fields, schema=self.ix.schema).parse(query_string)
parsed_query = "%s" % query
print("query: %s" % parsed_query)
results = searcher.search(query, terms=False, scored=True, groupedby="kind")


@@ -1,20 +1,38 @@
######################################
# github oauth
GITHUB_OAUTH_CLIENT_ID = "XXX"
GITHUB_OAUTH_CLIENT_SECRET = "YYY"
######################################
# github access token
GITHUB_TOKEN = "XXX"
######################################
# groups.io
GROUPSIO_TOKEN = "XXXXX"
GROUPSIO_USERNAME = "XXXXX"
GROUPSIO_PASSWORD = "XXXXX"
######################################
# Disqus API public key
DISQUS_TOKEN = "XXXXX"
######################################
# everything else
# Location of index file
INDEX_DIR = "search_index"
# oauth client deets
GITHUB_OAUTH_CLIENT_ID = "XXX"
GITHUB_OAUTH_CLIENT_SECRET = "YYY"
GITHUB_TOKEN = "ZZZ"
# More information footer: Repository label
FOOTER_REPO_ORG = "charlesreid1"
FOOTER_REPO_ORG = "dcppc"
FOOTER_REPO_NAME = "centillion"
# Toggle to show Whoosh parsed query
SHOW_PARSED_QUERY=True
TAGLINE = "Search All The Things"
TAGLINE = "Search the Data Commons"
# Flask settings
DEBUG = True
SECRET_KEY = 'WWWWW'
SECRET_KEY = 'XXXXX'


@@ -1,6 +1,7 @@
import os, re
import requests
import json
import dateutil.parser
from pprint import pprint
@@ -117,13 +118,14 @@ class DisqusCrawler(object):
link = response['link']
clean_link = re.sub('data-commons.us','nihdatacommons.us',link)
clean_link += "#disqus_comments"
# Finished working on thread.
# We need to make this value a dictionary
thread_info = dict(
id = response['id'],
created_time = response['createdAt'],
created_time = dateutil.parser.parse(response['createdAt']),
title = response['title'],
forum = response['forum'],
link = clean_link,
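
The link cleanup in the Disqus crawler hunk above rewrites the domain with `re.sub` and then appends the comments anchor. Because the pattern `'data-commons.us'` contains an unescaped `.`, it matches any character in that position. A minimal sketch of the same step using a literal string replacement instead; `clean_disqus_link` is a hypothetical helper, and skipping the anchor when it is already present is an added assumption:

```python
DISQUS_ANCHOR = "#disqus_comments"


def clean_disqus_link(link):
    """Rewrite the old domain and point the link at the comments section."""
    # str.replace treats the dot literally, unlike an unescaped regex pattern
    clean_link = link.replace("data-commons.us", "nihdatacommons.us")
    # Assumption: never append the anchor a second time
    if not clean_link.endswith(DISQUS_ANCHOR):
        clean_link += DISQUS_ANCHOR
    return clean_link
```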


@@ -1,5 +1,7 @@
import requests, os, re
from bs4 import BeautifulSoup
import dateutil.parser
import datetime
class GroupsIOException(Exception):
pass
@@ -251,7 +253,7 @@ class GroupsIOArchivesCrawler(object):
subject = soup.find('title').text
# Extract information for the schema:
# - permalink for thread (done)
# - permalink for thread (done above)
# - subject/title (done)
# - original sender email/name (done)
# - content (done)
@@ -266,11 +268,35 @@ class GroupsIOArchivesCrawler(object):
pass
else:
# found an email!
# this is a maze, thanks groups.io
# this is a maze, not amazing.
# thanks groups.io!
td = tr.find('td')
divrow = td.find('div',{'class':'row'}).find('div',{'class':'pull-left'})
sender_divrow = td.find('div',{'class':'row'})
sender_divrow = sender_divrow.find('div',{'class':'pull-left'})
if (i+1)==1:
original_sender = divrow.text.strip()
original_sender = sender_divrow.text.strip()
date_divrow = td.find('div',{'class':'row'})
date_divrow = date_divrow.find('div',{'class':'pull-right'})
date_divrow = date_divrow.find('font',{'class':'text-muted'})
date_divrow = date_divrow.find('script').text
try:
time_seconds = re.search(' [0-9]{1,} ',date_divrow).group(0)
time_seconds = time_seconds.strip()
# Thanks groups.io for the weird date formatting
mmicro_seconds = time_seconds[10:]
time_seconds = time_seconds[:10]
if (i+1)==1:
created_time = datetime.datetime.utcfromtimestamp(int(time_seconds))
modified_time = datetime.datetime.utcfromtimestamp(int(time_seconds))
else:
modified_time = datetime.datetime.utcfromtimestamp(int(time_seconds))
except AttributeError:
created_time = None
modified_time = None
for div in td.find_all('div'):
if div.has_attr('id'):
@@ -299,7 +325,10 @@ class GroupsIOArchivesCrawler(object):
thread = {
'permalink' : permalink,
'created_time' : created_time,
'modified_time' : modified_time,
'subject' : subject,
'subgroup' : subgroup_name,
'original_sender' : original_sender,
'content' : full_content
}
@@ -324,11 +353,13 @@ class GroupsIOArchivesCrawler(object):
results = []
for row in rows:
# We don't care about anything except title and ugly link
# This is where we extract
# a list of thread titles
# and corresponding links.
subject = row.find('span',{'class':'subject'})
title = subject.get_text()
link = row.find('a')['href']
#print(title)
results.append((title,link))
return results
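
The Groups.io crawler above pulls a run of digits out of a `<script>` element and treats the first ten digits as seconds since the epoch. A minimal standalone sketch of that extraction follows. It assumes the digits are a millisecond-precision epoch value surrounded by spaces, which is inferred from the slicing in the hunk rather than from any Groups.io documentation. `parse_epoch_timestamp` is a hypothetical helper, and unlike the crawler's pattern it requires at least ten digits:

```python
import datetime
import re


def parse_epoch_timestamp(script_text):
    """Return a naive UTC datetime parsed from script text, or None.

    Assumption: the text holds a space-delimited run of at least ten
    digits whose first ten digits are epoch seconds.
    """
    match = re.search(r" ([0-9]{10,}) ", script_text)
    if match is None:
        return None
    digits = match.group(1)
    seconds = int(digits[:10])  # any remaining digits are sub-second precision
    aware = datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)
    return aware.replace(tzinfo=None)  # naive UTC, like utcfromtimestamp
```

Returning `None` when nothing matches plays the role of the `except AttributeError` branch in the crawler, which sets both timestamps to `None`.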

Binary file not shown. Size: 29 KiB

Binary file not shown. Size: 30 KiB


@@ -57,6 +57,25 @@ $(document).ready(function() {
});
//////////////////////////////////
// utility functions
// https://stackoverflow.com/a/25275808
function iso8601(date) {
var hours = date.getHours();
var minutes = date.getMinutes();
var ampm = hours >= 12 ? 'PM' : 'AM';
hours = hours % 12;
hours = hours ? hours : 12; // the hour '0' should be '12'
minutes = minutes < 10 ? '0'+minutes : minutes;
var strTime = hours + ':' + minutes + ' ' + ampm;
return date.getFullYear() + "-" + (date.getMonth()+1) + "-" + date.getDate() + " " + strTime;
}
// https://stackoverflow.com/a/7390612
var toType = function(obj) {
return ({}).toString.call(obj).match(/\s([a-zA-Z]+)/)[1].toLowerCase()
}
//////////////////////////////////
// API-to-Table Functions
@@ -315,8 +334,10 @@ function load_emailthreads_table(){
var r = new Array(), j = -1, size=result.length;
r[++j] = '<thead>'
r[++j] = '<tr class="header-row">';
r[++j] = '<th width="70%">Topic</th>';
r[++j] = '<th width="30%">Started By</th>';
r[++j] = '<th width="60%">Topic</th>';
r[++j] = '<th width="15%">Started By</th>';
r[++j] = '<th width="15%">Date</th>';
r[++j] = '<th width="10%">Mailing List</th>';
r[++j] = '</tr>';
r[++j] = '</thead>'
r[++j] = '<tbody>'
@@ -327,6 +348,10 @@ function load_emailthreads_table(){
r[++j] = '</a>'
r[++j] = '</td><td>';
r[++j] = result[i]['owner_name'];
r[++j] = '</td><td>';
r[++j] = result[i]['created_time'];
r[++j] = '</td><td>';
r[++j] = result[i]['group'];
r[++j] = '</td></tr>';
}
r[++j] = '</tbody>'


@@ -58,7 +58,7 @@ button#feedback {
/* search results table */
td#search-results-score-col,
td#search-results-type-col {
width: 100px;
width: 90px;
}
div.container {
@@ -86,6 +86,14 @@ div.container {
}
/* badges for number of docs indexed */
span.results-count {
background-color: #555;
}
span.indexing-count {
background-color: #337ab7;
}
span.badge {
vertical-align: text-bottom;
}
@@ -126,7 +134,7 @@ li.search-group-item {
}
div.url {
background-color: rgba(86,61,124,.15);
background-color: rgba(40,40,60,.15);
padding: 8px;
}
@@ -192,7 +200,7 @@ table {
.info, .last-searches {
color: gray;
font-size: 12px;
/*font-size: 12px;*/
font-family: Arial, serif;
}
@@ -202,27 +210,27 @@ table {
div.tags a, td.tag-cloud a {
color: #b56020;
font-size: 12px;
/*font-size: 12px;*/
}
td.tag-cloud, td.directories-cloud {
font-size: 12px;
/*font-size: 12px;*/
color: #555555;
}
td.directories-cloud a {
font-size: 12px;
/*font-size: 12px;*/
color: #377BA8;
}
div.path {
font-size: 12px;
/*font-size: 12px;*/
color: #666666;
margin-bottom: 3px;
}
div.path a {
font-size: 12px;
/*font-size: 12px;*/
margin-right: 5px;
}


@@ -7,11 +7,18 @@
<div class="col12sm" id="banner-col">
<center>
<a id="banner-a" href="{{ url_for('search')}}?query=&fields=">
<img id="banner-img" src="{{ url_for('static', filename='centillion_white.png') }}">
{% if 'betasearch' in request.url %}
<img id="banner-img" src="{{ url_for('static', filename='centillion_white_beta.png') }}">
{% elif 'localhost' in request.url %}
<img id="banner-img" src="{{ url_for('static', filename='centillion_white_localhost.png') }}">
{% else %}
<img id="banner-img" src="{{ url_for('static', filename='centillion_white.png') }}">
{% endif %}
</a>
</center>
</div>
</div>
{% if config['TAGLINE'] %}
<div class="row" id="tagline-row">
<div class="col12sm" id="tagline-col">
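
The banner hunk above picks an image by checking the request URL for `betasearch`, then `localhost`, and otherwise falls back to the default. The same decision can be sketched as a plain function, for instance to unit test the ordering outside the template; `pick_banner` is a hypothetical helper and the URLs in the usage note are made up:

```python
def pick_banner(request_url):
    """Return the banner filename for a request URL, as in the template above."""
    if "betasearch" in request_url:
        return "centillion_white_beta.png"
    if "localhost" in request_url:
        return "centillion_white_localhost.png"
    return "centillion_white.png"
```

For example, `pick_banner("http://localhost:5000/search")` selects the localhost banner. Both checks are substring tests, so a URL containing both words resolves to the beta banner because that branch comes first.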


@@ -5,7 +5,7 @@
<div class="alert alert-success alert-dismissible fade in">
<a href="#" class="close" data-dismiss="alert" aria-label="close">&times;</a>
{% for message in messages %}
<p class="lead">{{ message }}</p>
<p>{{ message }}</p>
{% endfor %}
</div>
</div>


@@ -52,8 +52,8 @@
<div class="container-fluid">
<div class="row">
<div class="col-xs-12 info">
<b>Found:</b> <span class="badge">{{entries|length}}</span> results
out of <span class="badge">{{totals["total"]}}</span> total items indexed
<b>Found:</b> <span class="badge results-count">{{entries|length}}</span> results
out of <span class="badge results-count">{{totals["total"]}}</span> total items indexed
</div>
</div>
</div>
@@ -67,32 +67,32 @@
<div class="col-xs-12 info">
<b>Indexing:</b>
<span class="badge">{{totals["gdoc"]}}</span>
<span class="badge indexing-count">{{totals["gdoc"]}}</span>
<a href="/master_list?doctype=gdoc#gdoc">
Google Drive files
</a>,
<span class="badge">{{totals["issue"]}}</span>
<span class="badge indexing-count">{{totals["issue"]}}</span>
<a href="/master_list?doctype=issue#issue">
Github issues
</a>,
<span class="badge">{{totals["ghfile"]}}</span>
<span class="badge indexing-count">{{totals["ghfile"]}}</span>
<a href="/master_list?doctype=ghfile#ghfile">
Github files
</a>,
<span class="badge">{{totals["markdown"]}}</span>
<span class="badge indexing-count">{{totals["markdown"]}}</span>
<a href="/master_list?doctype=markdown#markdown">
Github Markdown files
</a>,
<span class="badge">{{totals["emailthread"]}}</span>
<span class="badge indexing-count">{{totals["emailthread"]}}</span>
<a href="/master_list?doctype=emailthread#emailthread">
Groups.io email threads
</a>,
<span class="badge">{{totals["disqus"]}}</span>
<span class="badge indexing-count">{{totals["disqus"]}}</span>
<a href="/master_list?doctype=disqus#disqus">
Disqus comment threads
</a>
@@ -101,6 +101,7 @@
</div>
</li>
</ul>
</div>
</div>