Compare commits

23 commits:

- 69339abe24
- 8d2718d783
- 8912b945fe
- ddceb16a2c
- f769d18b4e
- 34a889479a
- a074e6c0e7
- 918c9d583f
- 6cd505087b
- ee9b3bb811
- 8a4e20b71c
- 64d3ce4a9b
- 5e9b584d26
- b03a42d261
- bd4f4da8dc
- 23743773a6
- b7d2a8c960
- 1f4b43163a
- f80ccc2520
- c2eae4f521
- c758ca7a6c
- 3cf142465a
- bfd351c990
.gitmodules (vendored, new file, +3)

@@ -0,0 +1,3 @@
+[submodule "mkdocs-material"]
+	path = mkdocs-material
+	url = https://git.charlesreid1.com/charlesreid1/mkdocs-material.git
Readme.md (66 lines changed)

@@ -1,4 +1,4 @@
-# centillion
+# The Centillion
 
 **the centillion**: a pan-github-markdown-issues-google-docs search engine.
 
@@ -6,62 +6,40 @@
 
 the centillion is 3.03 log-times better than the googol.
 
 ![](img/ss.png)
 
 ## what is it
 
-The centillion is a search engine built using [whoosh](#),
+The centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
 a Python library for building search engines.
 
 We define the types of documents the centillion should index,
-and how, using what fields. The centillion then builds and
-updates a search index.
+what info and how. The centillion then builds and
+updates a search index. That's all done in `centillion_search.py`.
 
 The centillion also provides a simple web frontend for running
-queries against the search index.
+queries against the search index. That's done using a Flask server
+defined in `centillion.py`.
 
+The centillion keeps it simple.
 
-## work that is done
+## quickstart
 
-See [Workdone.md](Workdone.md)
+Run the centillion app with a github access token API key set via
+environment variable:
+
+```
+GITHUB_TOKEN="XXXXXXXX" python centillion.py
+```
+
+This will start a Flask server, and you can view the minimal search engine
+interface in your browser at <http://localhost:5000>.
+
+## more info
+
+For more info see the documentation: <https://charlesreid1.github.io/centillion>
 
 ## work that is being done
 
 See [Workinprogress.md](Workinprogress.md) for details about
 route and function layout. Summary below.
 
 ### code organization
 
 centillion app routes:
 
 - home
     - if not logged in, landing page
     - if logged in, redirect to search
 - search
 - main_index_update
     - update main index, all docs period
 
 centillion Search functions:
 
 - open_index creates the schema
 - add_issue, add_md, and add_document have three different method signatures
   and add different types of documents to the search index
 - update_all_issues, update_all_md, and update_all_documents iterate over items
   and determine whether each item needs to be updated in the search index
 - update_main_index - update the entire search index
     - calls all three update_all methods
 - create_search_results - package things up for jinja
 - search - run the query, pass results to the jinja-packager
 
 ## work that is planned
 
 See [Workplanned.md](Workplanned.md)
Todo.md (new file, +7)

@@ -0,0 +1,7 @@
+# todo
+
+current problems:
+- some github issues have no title
+- github issues are just being re-indexed over and over
+- documents not showing up in results
+
(deleted file, 106 lines removed)

@@ -1,106 +0,0 @@
-# Components
-
-The components of centillion are as follows:
-- Flask application, which creates a Search object and uses it to search index
-- Search object, which allows you to create/update/search an index
-
-## Routes layout
-
-Current application routes are as follows:
-
-- home -> search
-- search
-- update_index
-
-Ideal application routes (using github flask dance oauth):
-
-- home
-    - if not logged in, landing page
-    - if logged in, redirect to search
-- search
-- main_index_update
-    - update main index, all docs period
-- delta_index_update
-    - updates delta index, docs that have changed since last main index
-
-There should be one route to update the main index
-
-There should be another route to update the delta index
-
-These should go off and call the update index methods
-for each respective type of document/collection.
-For example, if I call `main_index_update` route it should
-
-- call `main_index_update` for all github issues
-- call `main_index_update` for folder of markdown docs
-- call `main_index_update` for google drive folder
-
-These are all members of the Search class
-
-## Functions layout
-
-Functions of the entire search app:
-- create a search index
-- load a search index
-- call the search() method on the index
-- update the search index
-
-The first and last, creating and updating the search index,
-are of greatest interest.
-
-The Schema affects everything so it is hard to separate
-functionality into a main Search class shared by many.
-(Avoid inheritance/classes if possible.)
-
-current Search:
-- open_index creates the schema
-- add_issue or add_document adds an item to the index
-- add_all_issues or add_all_documents iterates over items and adds them to index
-- update_index_incremental - update the search index
-- create_search_results - package things up for jinja
-- search - run the query, pass results to the jinja-packager
-
-
-centillion Search:
-
-- open_index creates the schema
-
-- add_issue, add_md, add_document have three diff method sigs and add diff types
-  of documents to the search index
-
-- update_all_issues or update_all_md or update_all_documents iterates over items
-  and determines whether each item needs to be updated in the search index
-
-- update_main_index - update the entire search index
-    - calls all three update_all methods
-
-- create_search_results - package things up for jinja
-
-- search - run the query, pass results to the jinja-packager
-
-
-Nice to have but focus on it later:
-
-- update_diff_issues or update_diff_md or update_diff_documents iterates over items
-  and indexes recently-added items
-
-- update_diff_index - update the diff search index (what's been added since last
-  time)
-    - calls all three update_diff methods
-
-
-## Files layout
-
-Schema definition:
-* include a "kind" or "class" to group objects
-* can provide different searches of different collections
-* eventually can provide user with checkboxes
@@ -38,7 +38,7 @@ class UpdateIndexTask(object):
        from get_centillion_config import get_centillion_config
        config = get_centillion_config('config_centillion.json')
 
-        gh_token = os.environ['GITHUB_ACESS_TOKEN']
+        gh_token = os.environ['GITHUB_TOKEN']
        search.update_index_issues(gh_token, config)
        search.update_index_gdocs(config)
 
@@ -76,25 +76,26 @@ def search():
    parsed_query, result = search.search(query.split(), fields=[fields])
    store_search(query, fields)
 
-    total = search.get_document_total_count()
+    totals = search.get_document_total_count()
 
-    return render_template('search.html', entries=result, query=query, parsed_query=parsed_query, fields=fields, last_searches=get_last_searches(), total=total)
-
-@app.route('/open')
-def open_file():
-    path = request.args['path']
-    fields = request.args.get('fields')
-    query = request.args['query']
-    call([app.config["EDIT_COMMAND"], path])
-
-    return redirect(url_for("search", query=query, fields=fields))
+    return render_template('search.html',
+                           entries=result,
+                           query=query,
+                           parsed_query=parsed_query,
+                           fields=fields,
+                           last_searches=get_last_searches(),
+                           totals=totals)
 
@app.route('/update_index')
def update_index():
    rebuild = request.args.get('rebuild')
    UpdateIndexTask(diff_index=False)
    flash("Rebuilding index, check console output")
-    return render_template("search.html", query="", fields="", last_searches=get_last_searches())
+    return render_template("search.html",
+                           query="",
+                           fields="",
+                           last_searches=get_last_searches(),
+                           totals={})
 
 
##############
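The first hunk above fixes a misspelled environment variable name (`GITHUB_ACESS_TOKEN` → `GITHUB_TOKEN`), matching the variable the quickstart tells users to set. A defensive way to read it, sketched as a hypothetical helper (the original indexes `os.environ` directly, which raises a bare `KeyError` when the variable is missing):

```python
import os

def get_github_token():
    """Read the GitHub token, failing with a clear message if it is unset."""
    token = os.environ.get("GITHUB_TOKEN")
    if token is None:
        raise RuntimeError(
            "GITHUB_TOKEN is not set; run e.g. "
            'GITHUB_TOKEN="XXXXXXXX" python centillion.py'
        )
    return token
```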
@@ -14,6 +14,8 @@ import tempfile, subprocess
 import pypandoc
 import os.path
 import codecs
+from datetime import datetime
 
 from whoosh.qparser import MultifieldParser, QueryParser
+from whoosh.analysis import StemmingAnalyzer
 
@@ -57,6 +59,10 @@ Schema:
 """
 
+def clean_timestamp(dt):
+    return dt.replace(microsecond=0).isoformat()
+
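The `clean_timestamp` helper added here normalizes datetimes to second-resolution ISO-8601 strings, so the timestamp fields stored in the whoosh index compare consistently. A quick check of its behavior:

```python
from datetime import datetime

def clean_timestamp(dt):
    # Same helper the diff adds: drop microseconds, emit ISO-8601.
    return dt.replace(microsecond=0).isoformat()

print(clean_timestamp(datetime(2018, 7, 21, 12, 30, 45, 999999)))
# → 2018-07-21T12:30:45
```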
 class SearchResult:
     score = 1.0
     path = None
 
@@ -115,7 +121,7 @@ class Search:
 
        schema = Schema(
                id = ID(stored=True, unique=True),
-                kind = ID(),
+                kind = ID(stored=True),
 
                created_time = ID(stored=True),
                modified_time = ID(stored=True),
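The schema change above marks `kind` as a stored field. In whoosh, a field can be indexed without being stored; storing it additionally keeps the value in the index so that search hits can return it, which the new templates rely on when they branch on `e.kind`. A toy model of the indexed-versus-stored distinction (this is an illustrative sketch, not the whoosh API):

```python
class ToyIndex:
    """Hypothetical index: postings are searchable, only stored fields come back."""

    def __init__(self):
        self.postings = {}   # (field, value) -> set of doc ids (searchable)
        self.stored = {}     # doc id -> stored fields only (retrievable)

    def add(self, doc_id, fields, stored_names):
        for name, value in fields.items():
            self.postings.setdefault((name, value), set()).add(doc_id)
        self.stored[doc_id] = {k: v for k, v in fields.items() if k in stored_names}

    def search(self, name, value):
        hits = sorted(self.postings.get((name, value), ()))
        return [self.stored[d] for d in hits]

ix = ToyIndex()
ix.add("doc1", {"kind": "gdoc", "title": "Notes"}, stored_names={"title"})
# "kind" matched the query, but only the stored "title" comes back:
print(ix.search("kind", "gdoc"))  # → [{'title': 'Notes'}]
```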
@@ -172,10 +178,11 @@ class Search:
            'document' : 'docx',
        }
 
+        content = ""
        if(mimetype not in mimemap.keys()):
            # Not a document -
            # Just a file
-            print("Indexing document %s of type %s"%(item['name'], mimetype))
+            print("Indexing document \"%s\" of type %s"%(item['name'], mimetype))
        else:
            # Document with text
            # Perform content extraction
 
@@ -187,7 +194,7 @@ class Search:
            # This is a file type we know how to convert
            # Construct the URL and download it
 
-            print("Extracting content from %s of type %s"%(item['name'], mimetype))
+            print("Extracting content from \"%s\" of type %s"%(item['name'], mimetype))
 
            # Create a URL and a destination filename
 
@@ -227,7 +234,7 @@ class Search:
                )
                assert output == ""
            except RuntimeError:
-                print("XXXXXX Failed to index document %s"%(item['name']))
+                print("XXXXXX Failed to index document \"%s\""%(item['name']))
 
 
            # If export was successful, read contents of markdown
 
@@ -240,7 +247,7 @@ class Search:
 
            # No matter what happens, clean up.
-            print("Cleaning up %s"%item['name'])
+            print("Cleaning up \"%s\""%item['name'])
 
            subprocess.call(['rm','-fr',fullpath_output])
            #print(" ".join(['rm','-fr',fullpath_output]))
@@ -259,16 +266,17 @@ class Search:
                kind = 'gdoc',
                created_time = item['createdTime'],
                modified_time = item['modifiedTime'],
+                indexed_time = datetime.now().replace(microsecond=0).isoformat(),
                title = item['name'],
                url = item['webViewLink'],
                mimetype = mimetype,
                owner_email = item['owners'][0]['emailAddress'],
                owner_name = item['owners'][0]['displayName'],
-                repo_name=None,
-                repo_url=None,
-                github_user=None,
-                issue_title=None,
-                issue_url=None,
+                repo_name='',
+                repo_url='',
+                github_user='',
+                issue_title='',
+                issue_url='',
                content = content
        )
 
@@ -277,7 +285,7 @@ class Search:
        """
        Add a Github issue/comment to a search index.
        """
-        repo_name = repo.name
+        repo_name = repo.owner.login+"/"+repo.name
        repo_url = repo.html_url
 
        count = 0
@@ -285,39 +293,62 @@ class Search:
 
            # Handle the issue content
            print("Indexing issue %s"%(issue.html_url))
 
            created_time = clean_timestamp(issue.created_at)
            modified_time = clean_timestamp(issue.updated_at)
            indexed_time = clean_timestamp(datetime.now())
 
            writer.add_document(
                    id = issue.html_url,
                    kind = 'issue',
                    created_time = created_time,
                    modified_time = modified_time,
                    indexed_time = indexed_time,
                    title = issue.title,
                    url = issue.html_url,
                    is_comment = False,
                    timestamp = issue.created_at,
                    mimetype='',
                    owner_email='',
                    owner_name='',
                    repo_name = repo_name,
                    repo_url = repo_url,
                    github_user = issue.user.login,
                    issue_title = issue.title,
                    issue_url = issue.html_url,
                    user = issue.user.login,
                    content = issue.body.rstrip()
            )
            count += 1
 
            # Handle the comments content
            if(issue.comments>0):
 
                comments = issue.get_comments()
                for comment in comments:
 
                    print(" > Indexing comment %s"%(comment.html_url))
 
                    created_time = clean_timestamp(comment.created_at)
                    modified_time = clean_timestamp(comment.updated_at)
                    indexed_time = clean_timestamp(datetime.now())
 
                    writer.add_document(
                            id = comment.html_url,
                            kind = 'comment',
                            created_time = created_time,
                            modified_time = modified_time,
                            indexed_time = indexed_time,
                            title = "Comment on "+issue.title,
                            url = comment.html_url,
                            is_comment = True,
                            timestamp = comment.created_at,
                            mimetype='',
                            owner_email='',
                            owner_name='',
                            repo_name = repo_name,
                            repo_url = repo_url,
                            github_user = comment.user.login,
                            issue_title = issue.title,
                            issue_url = issue.html_url,
                            user = comment.user.login,
-                            content = comment.body.strip()
+                            content = comment.body.rstrip()
                    )
 
                    count += 1
@@ -354,24 +385,49 @@ class Search:
        drive = service.files()
 
+        # We should do more here
+        # to check if we should update
+        # or not...
+        #
+        # loop over existing documents in index:
+        #
+        # p = QueryParser("kind", schema=self.ix.schema)
+        # q = p.parse("gdoc")
+        # with self.ix.searcher() as s:
+        #     results = s.search(q,limit=None)
+        #     counts[key] = len(results)
 
        # The trick is to set next page token to None 1st time thru (fencepost)
        nextPageToken = None
 
        # Use the pager to return all the things
        items = []
        while True:
+            ps = 12
            results = drive.list(
-                    pageSize=100,
+                    pageSize=ps,
                    pageToken=nextPageToken,
-                    fields="files(id, kind, createdTime, modifiedTime, mimeType, name, owners, webViewLink)",
+                    fields="nextPageToken, files(id, kind, createdTime, modifiedTime, mimeType, name, owners, webViewLink)",
                    spaces="drive"
            ).execute()
 
            nextPageToken = results.get("nextPageToken")
            items += results.get("files", [])
 
-            if nextPageToken is None:
-                break
+            # Keep it short
+            break
+
+            #if nextPageToken is None:
+            #    break
 
+        # Here is where we update.
+        # Grab indexed ids
+        # Grab remote ids
+        # Drop indexed ids not in remote ids
+        # Index all remote ids
+        # Change add_ to update_
+        # Add a hash check in update_
 
        indexed_ids = set()
        for item in items:
 
@@ -386,9 +442,12 @@ class Search:
 
        count = 0
        for item in items:
-            self.add_item(writer, item, indexed_ids, temp_dir, config)
+            self.add_drive_file(writer, item, indexed_ids, temp_dir, config)
            count += 1
 
+        print("Cleaning temporary directory: %s"%(temp_dir))
+        subprocess.call(['rm','-fr',temp_dir])
 
        writer.commit()
        print("Done, updated %d documents in the index" % count)
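The pager above relies on a fencepost trick: pass `pageToken=None` on the first request, then keep fetching pages until `nextPageToken` comes back empty. Note the hunk also adds `nextPageToken` to the `fields` parameter, without which the Drive API response would never include the token. The loop shape can be sketched generically, with a fake pager standing in for `drive.list(...).execute()`:

```python
def fetch_all(list_page):
    """Drain a token-based pager; list_page(token) -> (items, next_token)."""
    items = []
    next_token = None  # fencepost: the first request carries no token
    while True:
        page, next_token = list_page(next_token)
        items += page
        if next_token is None:
            break
    return items

# Fake pager standing in for the Drive API: three pages of two "files" each.
pages = {None: ([1, 2], "a"), "a": ([3, 4], "b"), "b": ([5, 6], None)}

print(fetch_all(lambda tok: pages[tok]))  # → [1, 2, 3, 4, 5, 6]
```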
@@ -414,14 +473,14 @@ class Search:
        writer = self.ix.writer()
 
        # Iterate over each repo
-        list_of_repos = config['repos']
+        list_of_repos = config['repositories']
        for r in list_of_repos:
 
            if '/' not in r:
                err = "Error: specify org/reponame or user/reponame in list of repos"
                raise Exception(err)
 
-            this_repo, this_org = re.split('/',r)
+            this_org, this_repo = re.split('/',r)
 
            org = g.get_organization(this_org)
            repo = org.get_repo(this_repo)
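The swapped unpacking above was a real bug: for an `"org/reponame"` string, `re.split('/', r)` returns `[org, repo]`, so the original code handed the repository name to `get_organization`. A minimal reproduction of the corrected parse (plain `str.split` with a maxsplit works just as well here; the function name is illustrative):

```python
def parse_repo_spec(r):
    """Split an "org/reponame" string, validating the format first."""
    if '/' not in r:
        raise ValueError("specify org/reponame or user/reponame in list of repos")
    this_org, this_repo = r.split('/', 1)
    return this_org, this_repo

print(parse_repo_spec("dcppc/data-stewards"))  # → ('dcppc', 'data-stewards')
```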
@@ -441,6 +500,7 @@ class Search:
 
                to_index.add(issue.html_url)
                writer.delete_by_term('url', issue.html_url)
+                count -= 1
                comments = issue.get_comments()
 
                for comment in comments:
@@ -477,11 +537,6 @@ class Search:
            # contains a {% for e in entries %}
            # and then an {{e.score}}
 
-
-            # ------------------
-            # cheseburger
-            # create search results
-
            sr = SearchResult()
            sr.score = r.score
@@ -495,37 +550,29 @@ class Search:
 
            sr.id = r['id']
            sr.kind = r['kind']
            sr.url = r['url']
 
            sr.created_time = r['created_time']
            sr.modified_time = r['modified_time']
            sr.indexed_time = r['indexed_time']
 
            sr.title = r['title']
-            sr.url = r['url']
 
            sr.mimetype = r['mimetype']
 
            sr.owner_email = r['owner_email']
            sr.owner_name = r['owner_name']
 
            sr.content = r['content']
 
-            # -----------------
-            # github isuses
-            # create search results
-
-            sr = SearchResult()
-            sr.score = r.score
-            sr.url = r['url']
-            sr.title = r['issue_title']
 
            sr.repo_name = r['repo_name']
            sr.repo_url = r['repo_url']
 
            sr.issue_title = r['issue_title']
            sr.issue_url = r['issue_url']
 
            sr.is_comment = r['is_comment']
            sr.github_user = r['github_user']
 
            sr.content = r['content']
 
            # ------------------
 
            highlights = r.highlights('content')
            if not highlights:
                # just use the first 1,000 words of the document
@@ -558,27 +605,15 @@ class Search:
        elif len(fields) == 2:
            pass
        else:
-            fields = ['id',
-                      'kind',
-                      'created_time',
-                      'modified_time',
-                      'indexed_time',
-                      'title',
-                      'url',
-                      'mimetype',
-                      'owner_email',
-                      'owner_name',
-                      'repo_name',
-                      'repo_url',
-                      'issue_title',
-                      'issue_url',
-                      'github_user',
-                      'content']
+            # If the user does not specify a field,
+            # these are the fields that are actually searched
+            fields = ['title',
+                      'content']
        if not query:
            query = MultifieldParser(fields, schema=self.ix.schema).parse(query_string)
        parsed_query = "%s" % query
        print("query: %s" % parsed_query)
-        results = searcher.search(query, terms=False, scored=True, groupedby="url")
+        results = searcher.search(query, terms=False, scored=True, groupedby="kind")
        search_result = self.create_search_result(results)
 
        return parsed_query, search_result
@@ -589,7 +624,29 @@ class Search:
        return s if len(s) <= l else s[0:l - 3] + '...'
 
    def get_document_total_count(self):
-        return self.ix.searcher().doc_count_all()
+        p = QueryParser("kind", schema=self.ix.schema)
+
+        kind_labels = {
+                "documents" : "gdoc",
+                "issues" : "issue",
+                "comments" : "comment"
+        }
+        counts = {
+                "documents" : None,
+                "issues" : None,
+                "comments" : None,
+                "total" : None
+        }
+        for key in kind_labels:
+            kind = kind_labels[key]
+            q = p.parse(kind)
+            with self.ix.searcher() as s:
+                results = s.search(q,limit=None)
+                counts[key] = len(results)
+
+        counts['total'] = self.ix.searcher().doc_count_all()
+
+        return counts
 
if __name__ == "__main__":
    search = Search("search_index")
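With this change, `get_document_total_count` returns a dict of per-kind counts instead of a single integer, by running one `kind:` query per label and then appending the grand total. The shape of that transformation can be sketched with a stand-in `count_kind` function replacing the whoosh searcher (the function and argument names here are illustrative):

```python
def document_counts(count_kind, total):
    """Build the totals dict the templates consume.

    count_kind(kind) -> number of documents of that kind;
    total -> equivalent of doc_count_all(). Both are stand-ins
    for the whoosh searcher used in the real method.
    """
    kind_labels = {"documents": "gdoc", "issues": "issue", "comments": "comment"}
    counts = {key: count_kind(kind) for key, kind in kind_labels.items()}
    counts["total"] = total
    return counts

fake_index = {"gdoc": 12, "issue": 30, "comment": 58}
print(document_counts(lambda k: fake_index[k], total=100))
# → {'documents': 12, 'issues': 30, 'comments': 58, 'total': 100}
```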
@@ -1,7 +1,6 @@
 {
     "repositories" : [
         "dcppc/2018-june-workshop",
-        "dcppc/2018-july-workshop",
-        "dcppc/data-stewards"
+        "dcppc/2018-july-workshop"
     ]
 }
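The config shown above is plain JSON with a `repositories` list in `org/reponame` form. A minimal loader sketch (the function name mirrors `get_centillion_config` as imported in the Flask app; treating it as a thin `json.load` wrapper is an assumption):

```python
import json

def get_centillion_config(path):
    """Load the JSON config; assumed here to be a thin json.load wrapper."""
    with open(path) as f:
        return json.load(f)

# Write a sample config and read it back:
sample = '{"repositories": ["dcppc/2018-june-workshop", "dcppc/2018-july-workshop"]}'
with open("config_centillion_sample.json", "w") as f:
    f.write(sample)

config = get_centillion_config("config_centillion_sample.json")
print(config["repositories"][0])  # → dcppc/2018-june-workshop
```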
@@ -1,27 +1,9 @@
-# Path to markdown files
-MARKDOWN_FILES_DIR = "/Users/charles/codes/whoosh/markdown-search/fake-docs/"
-
 # Location of index file
 INDEX_DIR = "search_index"
 
-# Command to use when clicking on filepath in search results
-EDIT_COMMAND = "view"
-
 # Toggle to show Whoosh parsed query
 SHOW_PARSED_QUERY=True
 
-# Toogle to use tags
-USE_TAGS=True
-
-# Optional prefix in a markdown file, e.g. "tags: python search markdown tutorial"
-TAGS_PREFIX=""
-
-# List of tags that should be ignored
-TAGS_TO_IGNORE = "and are what how its not with the"
-
-# Regular expression to select tags, eg tag has to start with alphanumeric followed by at least two alphanumeric or "-" or "."
-TAGS_REGEX = r"\b([A-Za-z0-9][A-Za-z0-9-.]{2,})\b"
-
 # Flask settings
 DEBUG = True
 SECRET_KEY = '42c5a8eda356ca9d9c3ab2d149541e6b91d843fa'
docs/index.md (new file, +54)

@@ -0,0 +1,54 @@
+# The Centillion
+
+**the centillion**: a pan-github-markdown-issues-google-docs search engine.
+
+**a centillion**: a very large number consisting of a 1 with 303 zeros after it.
+
+the centillion is 3.03 log-times better than the googol.
+
+## what is it
+
+The centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
+a Python library for building search engines.
+
+We define the types of documents the centillion should index,
+what info and how. The centillion then builds and
+updates a search index. That's all done in `centillion_search.py`.
+
+The centillion also provides a simple web frontend for running
+queries against the search index. That's done using a Flask server
+defined in `centillion.py`.
+
+The centillion keeps it simple.
+
+## quickstart
+
+Run the centillion app with a github access token API key set via
+environment variable:
+
+```
+GITHUB_TOKEN="XXXXXXXX" python centillion.py
+```
+
+This will start a Flask server, and you can view the minimal search engine
+interface in your browser at <http://localhost:5000>.
+
+## work that is done
+
+See [standalone.md](standalone.md) for the summary of
+the three standalone whoosh servers that were built:
+one for a folder of markdown files, one for github issues
+and comments, and one for google drive documents.
+
+## work that is being done
+
+See [workinprogress.md](workinprogress.md) for details about
+work in progress.
+
+## work that is planned
+
+See [plans.md](plans.md)
@@ -31,3 +31,4 @@ Stateless
 
 
+
@@ -1,4 +1,4 @@
-## work that is done
+## work that is done: standalone
 
 **Stage 1: index folder of markdown files** (done)
 * See [markdown-search](https://git.charlesreid1.com/charlesreid1/markdown-search.git)
 
@@ -13,7 +13,7 @@
 
 Needs work:
 
-* More appropriate schema
+* <s>More appropriate schema</s>
 * Using more features (weights) plus pandoc filters for schema
 * Sqlalchemy (and hey waddya know safari books has it covered)
 
@@ -25,15 +25,16 @@
 * Main win here is uncovering metadata/linking/presentation issues
 
 Needs work:
-- treat comments and issues as separate objects, fill out separate schema fields
+- <s>treat comments and issues as separate objects, fill out separate schema fields
 - map out and organize how the schema is updated to make it more flexible
-- configuration needs to enable user to specify organization+repos
+- configuration needs to enable user to specify organization+repos</s>
 
 ```plain
 {
-    "to_index" : {
-        "google" : "google-api-python-client",
-        "microsoft" : ["TypeCode","api-guidelines"]
-    }
+    "to_index" : [
+        "google/google-api-python-client",
+        "microsoft/TypeCode",
+        "microsoft/api-guidelines"
+    ]
 }
 ```
 
@@ -48,3 +49,4 @@ Needs work:
 * Use the google drive api (see simple-simon)
 * Main win is more uncovering of metadata issues, identifying
   big-picture issues for centillion
+
docs/workinprogress.md (new file, +48)

@@ -0,0 +1,48 @@
+# Components
+
+The components of centillion are as follows:
+- Flask application, which creates a Search object and uses it to search index
+- Search object, which allows you to create/update/search an index
+
+## Routes layout
+
+Centillion flask app routes:
+
+- `/home`
+    - if not logged in, landing page
+    - if logged in, redirect to search
+- `/search`
+- `/main_index_update`
+    - update main index, all docs period
+
+## Functions layout
+
+Centillion Search class functions:
+
+- `open_index()` creates the schema
+
+- `add_issue()`, `add_md()`, `add_document()` have three diff method sigs and add diff types
+  of documents to the search index
+
+- `update_all_issues()` or `update_all_md()` or `update_all_documents()` iterates over items
+  and determines whether each item needs to be updated in the search index
+
+- `update_main_index()` - update the entire search index
+    - calls all three update_all methods
+
+- `create_search_results()` - package things up for jinja
+
+- `search()` - run the query, pass results to the jinja-packager
+
+Nice to have but focus on it later:
+- update diff search index (what's been added since last index time)
+- max index time
+
+## Files layout
+
+Schema definition:
+* include a "kind" or "class" to group objects
+* can provide different searches of different collections
+* eventually can provide user with checkboxes
img/ss.png (new binary file; 356 KiB)

mkdocs-material (submodule added at 6569122bb1)

static/bootstrap.min.css (vendored, new file; diff suppressed because one or more lines are too long)

static/centillion_black.png (new binary file; 29 KiB)

static/centillion_white.png (new binary file; 25 KiB)

static/centillion_xparent.png (new binary file; 30 KiB)
@@ -1,3 +1,24 @@
+li.search-group-item {
+    position: relative;
+    display: block;
+    padding: 0px;
+    margin-bottom: -1px;
+    background-color: #fff;
+    border: 1px solid #ddd;
+}
+
+div.list-group {
+    border: 1px solid rgba(86,61,124,.2);
+}
+
+div.url {
+    background-color: rgba(86,61,124,.15);
+    padding: 8px;
+}
+
+/***************************/
+
 body {
     font-family: sans-serif;
 }
 
@@ -56,7 +77,7 @@ table {
     overflow: hidden;
 }
 
-td.info, .last-searches {
+.info, .last-searches {
     color: gray;
     font-size: 12px;
     font-family: Arial, serif;
@@ -1,7 +1,8 @@
 <!doctype html>
 <title>Markdown Search</title>
-<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='github-markdown.css') }}">
 <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='style.css') }}">
+<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='github-markdown.css') }}">
+<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='bootstrap.min.css') }}">
 <div>
 {% for message in get_flashed_messages() %}
   <div class="flash">{{ message }}</div>
@@ -1,62 +1,151 @@
 {% extends "layout.html" %}
 {% block body %}
-<h1><a href="{{ url_for('search')}}?query=&fields=">Search directory: {{ config.MARKDOWN_FILES_DIR }}</a></h1>
-<a class="index" href="{{ url_for('update_index')}}">[update index]</a>
-<a class="index" href="{{ url_for('update_index')}}?rebuild=True">[rebuild index]</a>
-<form action="{{ url_for('search') }}" name="search">
-    <input type="text" name="query" value="{{ query }}">
-    <input type="submit" value="search">
-    <a href="{{ url_for('search')}}?query=&fields=">[clear]</a>
-</form>
-<table cellspacing="3">
-    {% if directories %}
-    <tr>
-        <td class="directories-cloud">File directories:&nbsp;
+<div class="container">
+
+    <div class="row">
+        <div class="col12sm">
+            <center>
+                <a href="{{ url_for('search')}}?query=&fields=">
+                    <img src="{{ url_for('static', filename='centillion_white.png') }}">
+                </a>
+            </center>
+        </div>
+    </div>
+
+    <div class="row">
+        <div class="col12sm">
+            <center>
+                <h2>
+                    <a href="{{ url_for('search')}}?query=&fields=">
+                        Search the DCPPC
+                    </a>
+                </h2>
+            </center>
+        </div>
+    </div>
+
+    <div class="row">
+        <div class="col-12">
+            <center>
+                <a class="index" href="{{ url_for('update_index')}}">[update index]</a>
+                <a class="index" href="{{ url_for('update_index')}}?rebuild=True">[rebuild index]</a>
+                <form action="{{ url_for('search') }}" name="search">
+                    <input type="text" name="query" value="{{ query }}"> <br />
+                    <button type="submit" style="font-size: 20px; padding: 10px; padding-left: 50px; padding-right: 50px;"
+                            value="search" class="btn btn-primary">Search</button>
+                    <br />
+                    <a href="{{ url_for('search')}}?query=&fields=">[clear all results]</a>
+                </form>
+            </center>
+        </div>
+    </div>
+</div>
+
+<div class="container">
+    <div class="row">
+
+        {% if directories %}
+        <div class="col-12 info directories-cloud">
+            File directories:&nbsp;
             {% for d in directories %}
             <a href="{{url_for('search')}}?query={{d|trim}}&fields=filename">{{d|trim}}</a>
             {% endfor %}
-        </td>
-    </tr>
-    {% endif %}
-    {% if config['SHOW_PARSED_QUERY']%}
-    <tr>
-        <td class="info">Parsed query: {{ parsed_query }}</td>
-    </tr>
-    {% endif %}
-    <tr>
-        <td class="info">FOUND {{ entries | length }} results of {{total}} documents</td>
-    </tr>
+        </div>
+        {% endif %}
 
-    {% for e in entries %}
-    <tr>
-        <td class="search-result">
-            <!--
-            <div class="path"><a href='{{ url_for("open_file")}}?path={{e.path|urlencode}}&query={{query}}&fields={{fields}}'>{{e.path}}</a>score: {{'%d' % e.score}}</div>
-            -->
-            <div class="url">
-                {% if e.is_comment %}
-                <b>Comment</b> <a href='{{e.url}}'>(comment link)</a>
-                on issue <a href='{{e.issue_url}}'>{{e.issue_title}}</a>
-                in repo <a href='{{e.repo_url}}'>dcppc/{{e.repo_name}}</a>
-                <br />
-                {% else %}
-                <b>Issue</b> <a href='{{e.issue_url}}'>{{e.issue_title}}</a>
-                in repo <a href='{{e.repo_url}}'>dcppc/{{e.repo_name}}</a>
-                <br />
-                {% endif %}
-                score: {{'%d' % e.score}}
-            </div>
-            <div class="markdown-body">{{ e.content_highlight|safe}}</div>
-        </td>
-    </tr>
-    {% endfor %}
-</table>
-<div class="last-searches">Last searches: <br/>
-    {% for s in last_searches %}
-    <span><a href="{{url_for('search')}}?{{s}}">{{s}}</a></span>
-    {% endfor %}
+        <ul class="list-group">
+
+            {% if config['SHOW_PARSED_QUERY'] and parsed_query %}
+            <li class="list-group-item">
+                <div class="col-12 info">
+                    <b>Parsed query:</b> {{ parsed_query }}
+                </div>
+            </li>
+            {% endif %}
+
+            {% if parsed_query %}
+            <li class="list-group-item">
+                <div class="col-12 info">
+                    <b>Found:</b> {{entries|length}} documents with results, out of {{totals["total"]}} total documents
+                </div>
+            </li>
+            {% endif %}
+
+            <li class="list-group-item">
+                <div class="col-12 info">
+                    <b>Indexing:</b> {{totals["documents"]}} Google Documents,
+                    {{totals["issues"]}} Github issues, and
+                    {{totals["comments"]}} Github comments
+                </div>
+            </li>
+
+        </ul>
+    </div>
 </div>
-<p>
-    More info can be found in the <a href="https://github.com/BernhardWenzel/markdown-search">README.md file</a>
-</p>
+
+<div class="container">
+    <div class="row">
+        <ul class="list-group">
+
+            {% for e in entries %}
+            <li class="search-group-item">
+
+                <div class="url">
+                    {% if e.kind=="gdoc" %}
+                    <b>Google Drive File:</b>
+                    <a href='{{e.url}}'>{{e.title}}</a>
+                    ({{e.owner_name}}, {{e.owner_email}})
+                    {% elif e.kind=="comment" %}
+                    <b>Comment:</b>
+                    <a href='{{e.url}}'>Comment (link)</a>
+                    {% if e.github_user %}
+                    by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
+                    {% endif %}
+                    on issue <a href='{{e.issue_url}}'>{{e.issue_title}}</a>
+                    <br/>
+                    <b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
+                    {% elif e.kind=="issue" %}
+                    <b>Issue:</b>
+                    <a href='{{e.issue_url}}'>{{e.issue_title}}</a>
+                    {% if e.github_user %}
+                    by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
+                    {% endif %}
+                    <br/>
+                    <b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
+                    {% else %}
+                    <b>Item:</b> (<a href='{{e.url}}'>link</a>)
+                    {% endif %}
+                    <br />
+                    score: {{'%d' % e.score}}
+                </div>
+                <div class="markdown-body">{{ e.content_highlight|safe}}</div>
+
+            </li>
+            {% endfor %}
+        </ul>
+
+    </div>
+</div>
+
+<div class="container">
+    <div class="row">
+        <div class="col-12">
+            <div class="last-searches">Last searches: <br/>
+                {% for s in last_searches %}
+                <span><a href="{{url_for('search')}}?{{s}}">{{s}}</a></span>
+                {% endfor %}
+            </div>
+            <p>
+                More info can be found in the <a href="https://github.com/BernhardWenzel/markdown-search">README.md file</a>
+            </p>
+        </div>
+    </div>
+</div>
+
 {% endblock %}