127 Commits

Author SHA1 Message Date
de796880c5 Merge branch 'master' of github.com:charlesreid1/centillion
* 'master' of github.com:charlesreid1/centillion:
  update config_flask.example.py to strip dc info
2018-08-13 19:14:54 -07:00
f79f711a38 Merge branch 'master' of github.com:dcppc/centillion
* 'master' of github.com:dcppc/centillion:
  Update Readme.md
2018-08-13 19:14:07 -07:00
00b862b83e Merge branch 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion
* 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion:
2018-08-13 19:13:53 -07:00
a06c3b645a Update Readme.md 2018-08-13 12:42:18 -07:00
878ff011fb locked out by rate limit, but otherwise successful in indexing so far. 2018-08-13 00:54:12 -07:00
33cf78a524 successfully grabbing threads from 1st page of every subgroup 2018-08-13 00:27:45 -07:00
c1bcd8dc22 add import pdb where things are currently stuck 2018-08-12 20:25:29 -07:00
757e9d79a1 keep going with spider idea 2018-08-12 20:24:29 -07:00
c47682adb4 fix typo with groupsio key 2018-08-12 20:13:45 -07:00
f2662c3849 adding calls to index groupsio emails
this is currently work in progress.
we have a debug statement in place as a bookmark.

we are currently:
- creating a login session
- getting all the subgroups
- going to first subgroup
- getting list of titles and links
- getting emails for each title and link

still need to:
- figure out how to assemble email {}
- assemble content/etc and how to parse text of emails
2018-08-12 18:00:33 -07:00
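
A minimal sketch of the five crawl steps this commit message lists, assuming hypothetical groups.io URLs and CSS selectors (the real logic lives in the GroupsIOArchivesCrawler class from groupsio_util, which appears in the diff below):

```
# Hypothetical sketch of the crawl steps above; endpoint paths and
# selectors are placeholders, not the real groups.io markup.
import requests
from bs4 import BeautifulSoup

def crawl_subgroup_threads(username, password, group="dcppc"):
    session = requests.Session()
    # 1. create a login session
    session.post("https://groups.io/login",
                 data={"email": username, "password": password})
    # 2. get all the subgroups
    listing = session.get("https://groups.io/org/groupsio/%s/subgroups" % group)
    soup = BeautifulSoup(listing.text, "html.parser")
    subgroup_urls = [a["href"] for a in soup.select("a.subgroup-link")]
    # 3-5. go to each subgroup, get thread titles and links,
    #      then fetch the emails for each thread
    threads = []
    for url in subgroup_urls:
        archive = BeautifulSoup(session.get(url + "/topics").text, "html.parser")
        for link in archive.select("a.topic-title"):
            email_page = session.get(link["href"])
            threads.append({"title": link.text.strip(),
                            "url": link["href"],
                            "html": email_page.text})
    return threads
```
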
2478a3f857 Merge branch 'dcppc' of github.com:dcppc/centillion into dcppc
* 'dcppc' of github.com:dcppc/centillion:
  fix how search results are bundled
  fix search template
2018-08-10 06:05:44 -07:00
f174080dfd catch exception when file info not found 2018-08-10 06:05:33 -07:00
ca8b12db06 Merge pull request #2 from charlesreid1/dcppc-merge-master
Merge dcppc changes into master
2018-08-10 05:49:29 -07:00
a1ffdad292 Merge branch 'master' into dcppc-merge-master 2018-08-10 05:49:19 -07:00
ce76396096 update config_flask.example.py to strip dc info 2018-08-10 05:46:07 -07:00
175ff4f71d Merge pull request #17 from dcppc/github-files
fix search template
2018-08-09 18:57:30 -07:00
94f956e2d0 fix how search results are bundled 2018-08-09 18:56:56 -07:00
dc015671fc fix search template 2018-08-09 18:55:49 -07:00
1e9eec81d7 make it valid json 2018-08-09 18:15:14 -07:00
31e12476af Merge pull request #16 from dcppc/inception
add inception
2018-08-09 18:08:11 -07:00
bbe4e32f63 Merge pull request #15 from dcppc/github-files
index all github filenames, not just markdown
2018-08-09 18:07:56 -07:00
5013741958 while we're at it 2018-08-09 17:40:56 -07:00
1ce80a5da0 closes #11 2018-08-09 17:38:20 -07:00
3ed967bd8b remove unused function 2018-08-09 17:28:22 -07:00
1eaaa32007 index all github filenames, not just markdown 2018-08-09 17:25:09 -07:00
9c7e696b6a Merge branch 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion
* 'master' of ssh://git.charlesreid1.com:222/charlesreid1/centillion:
  Move images, resize images, update image markdown in readme
  update readme to use <img> tags
  merge image files in from master
  fix <title>
  fix the readme to reflect current state of things/links/descriptions
  fix typos/wording in readme
  adding changes to enable https, update callback to http, and everything still passes through https (proxy)
  update footer repo info
  update screen shots
  add mkdocs-material-dib submodule
  remove mkdocs material submodule
  update tagline
  update tagline
  add _example_ config file for flask
2018-08-09 16:39:18 -07:00
262a0c19e7 Merge pull request #14 from dcppc/local-fixes
Fix centillion to work for local instances
2018-08-09 16:37:37 -07:00
bd2714cc0b Merge branch 'dcppc' into local-fixes 2018-08-09 16:36:34 -07:00
899d6fed53 comment out localhost only env var 2018-08-09 16:25:37 -07:00
a7756049e5 revert changes 2018-08-09 16:23:42 -07:00
3df427a8f8 fix how existing issues in search index are collected. closes #10 2018-08-09 16:17:17 -07:00
0dd06748de fix centillion to work for local instance 2018-08-09 16:16:30 -07:00
1a04814edf Merge pull request #9 from dcppc/ACharbonneau-patch-1
Update config_centillion.json
2018-08-07 16:09:45 -07:00
Amanda Charbonneau
3fb72d409b Update config_centillion.json
I fixed it
2018-08-07 18:24:32 -04:00
d89e01221a Merge pull request #8 from dcppc/dcppc-test
Fix the name of the milestones repo: 'dcppc-milestones' not 'milestones'
2018-08-07 14:59:06 -07:00
6736f3f8ad add centillion configuration json file 2018-08-07 14:54:56 -07:00
abd13aba29 Merge pull request #7 from dcppc/fix-docstrings
Fix docstrings
2018-08-07 14:43:42 -07:00
13e49cdaa6 improve docstrings on gdrive_util.py too 2018-08-07 14:42:19 -07:00
83b2ce17fb fix docstrings in centillion_search.py 2018-08-07 14:41:26 -07:00
5be0709070 Merge pull request #6 from dcppc/fix-docs
Move images, resize images, update image markdown in readme
2018-08-07 13:02:08 -07:00
9edd95a78d Merge branch 'fix-docs'
* fix-docs:
  Move images, resize images, update image markdown in readme
  update readme to use <img> tags
  merge image files in from master
  fix <title>
  fix the readme to reflect current state of things/links/descriptions
  fix typos/wording in readme
  adding changes to enable https, update callback to http, and everything still passes through https (proxy)
  update footer repo info
  update screen shots
  add mkdocs-material-dib submodule
  remove mkdocs material submodule
  update tagline
  update tagline
  add _example_ config file for flask
2018-08-07 12:50:29 -07:00
37615d8707 Move images, resize images, update image markdown in readme 2018-08-07 12:40:38 -07:00
4b218f63b9 update readme to use <img> tags 2018-08-03 15:56:49 -07:00
4e17c890bc merge image files in from master 2018-08-03 15:53:51 -07:00
1129ec38e0 update the readme 2018-08-03 15:49:46 -07:00
875508c796 update screen shot images 2018-08-03 15:49:12 -07:00
abc7a2aedf fix <title> 2018-08-03 15:45:56 -07:00
8f1e5faefc update readme to reflect latest 2018-08-03 15:38:23 -07:00
d5f63e2322 Merge pull request #1 from dcppc/fix-readme
fix the readme to reflect current state of things/links/descriptions
2018-08-03 15:28:51 -07:00
84e5560423 fix the readme to reflect current state of things/links/descriptions 2018-08-03 15:28:16 -07:00
924c562c0a fix typos/wording in readme 2018-08-03 15:22:35 -07:00
13c410ac5e adding changes to enable https, update callback to http, and everything still passes through https (proxy) 2018-08-03 15:21:41 -07:00
4e79800e83 update footer repo info 2018-08-03 15:19:55 -07:00
5b9570d8cd Merge branch 'dcppc' of github.com:dcppc/centillion into dcppc
* 'dcppc' of github.com:dcppc/centillion:
  add mkdocs-material-dib submodule
  remove mkdocs material submodule
2018-08-03 14:54:25 -07:00
297a4b5977 update screen shots 2018-08-03 14:53:43 -07:00
69a6b5d680 add mkdocs-material-dib submodule 2018-08-03 13:51:13 -07:00
3feca1aba3 remove mkdocs material submodule 2018-08-03 13:50:37 -07:00
493581f861 update tagline 2018-08-03 13:38:00 -07:00
1b0ded809d update tagline 2018-08-03 13:36:56 -07:00
78e77c7cf2 add _example_ config file for flask 2018-08-03 13:34:27 -07:00
2f890d1aee Merge branch 'all-the-docs' of charlesreid1/centillion into master 2018-08-03 20:28:27 +00:00
937327f2cb update search template to treat drive files and documents differently. 2018-08-03 13:24:03 -07:00
ca0d88cfe6 index all the google drive things 2018-08-03 13:15:02 -07:00
5eda472072 improve handling of tokens for gh api, fix set ordering/logic 2018-08-03 13:07:46 -07:00
d943c14678 Merge branch 'master' into all-the-docs
* master:
  Update '.gitignore'
  no secrets plz
2018-08-03 12:37:49 -07:00
6be785a056 indexing all markdown is working. 2018-08-03 12:36:32 -07:00
65113a95f7 Update '.gitignore' 2018-08-03 17:52:04 +00:00
87c3f12c8f no secrets plz 2018-08-03 17:51:39 +00:00
933884e9ab search all the docs. search all the repos. 2018-08-03 10:29:52 -07:00
da9dea3f6b Merge branch 'github-markdown' of charlesreid1/centillion into master 2018-08-03 07:20:45 +00:00
4d6386e74a add results-handling for markdown files 2018-08-03 00:19:57 -07:00
a93b7519de improve counts accounting, and construct usable urls for markdown 2018-08-03 00:19:35 -07:00
5e2c37164b fix markdown indexing 2018-08-02 23:56:56 -07:00
829e9c4263 finish subsuming repotree into centillion_search 2018-08-02 23:14:55 -07:00
283991017c add repotree script. temporary/standalone, but doing exactly what centillion needs to do. 2018-08-02 22:29:18 -07:00
653af18f24 add update_index_markdown() function, rough/unfinished 2018-08-02 22:27:30 -07:00
fae184f1f3 re-indexer now calls (nonexistent file) update_index_markdown 2018-08-02 22:26:56 -07:00
d40bb3557f Merge branch 'flask-dance' of charlesreid1/centillion into master 2018-08-03 04:09:20 +00:00
a848f3ec3e complete the conversion to oauth tokens 2018-08-02 19:06:34 -07:00
50d27a915a update readme 2018-08-02 19:04:40 -07:00
1b950b7790 update re-index task to use gh token; reorganize logic; use werkzeug proxy 2018-08-02 19:02:00 -07:00
04d4195668 Add flask-dance to centillion.
- Remove config file, which now contains secrets
- Add flask dance to requirements
- Update instructions in readme to include Github application setup
2018-08-02 11:52:56 -07:00
d0fe7aa799 ignore config files, which may have keys in them 2018-08-02 11:24:33 -07:00
acc28aab44 Merge branch 'cache-and-hash' of charlesreid1/centillion into master 2018-08-02 17:59:45 +00:00
adc2666a9b actually fix flashed messages 2018-08-02 00:58:37 -07:00
581f0a67ed fix messages so they are js and dismissable 2018-08-02 00:54:56 -07:00
0b96061bc5 update documentation, add new docs pages on components/flask/whoosh 2018-08-01 23:04:35 -07:00
c7acdea889 finally. make results comprehensible. 2018-08-01 22:39:07 -07:00
4eabd4536e remove last searches from search.html 2018-08-01 22:32:20 -07:00
78276c14d9 align badges higher 2018-08-01 22:31:59 -07:00
68f90d383f fix up how issues are added, and how all issues are iterated over (use set algebra) 2018-08-01 22:31:41 -07:00
202643b85e add control_panel route, remove last_search silliness 2018-08-01 22:29:06 -07:00
dc9ac74d68 add control panel page 2018-08-01 20:12:55 -07:00
36cc94a854 Fix bootstrap div classes, badgify counts, fix <li> styles 2018-08-01 20:12:10 -07:00
740e757bcd update todo with what we have done 2018-08-01 15:54:03 -07:00
bf6afe39c6 caching is working 2018-08-01 15:48:43 -07:00
54c09ce80b call add drive file function with add/update docIDs. fix method headers. 2018-08-01 15:17:07 -07:00
1407178f39 updating flask config and templates to parameterize repo info in footer 2018-08-01 13:43:43 -07:00
2bf9abfd6f update footer: prior searches are now badges, and link to more info now points to repo 2018-08-01 13:36:45 -07:00
8328f96f76 make "prior searches" a badge and infobox bg color 2018-08-01 13:36:05 -07:00
d5a9fe85af Merge branch 'master' into cache-and-hash
* master:
  update installation preparation step
2018-08-01 12:50:10 -07:00
f8d2156d85 update installation preparation step 2018-08-01 12:48:09 -07:00
a753ba4963 update centillion search with comment blocks laying out what to change and where 2018-08-01 11:32:37 -07:00
8cca4b2c8d add TAGLINE param 2018-08-01 00:49:56 -07:00
69339abe24 fix the way repo name label is handled 2018-08-01 00:25:29 -07:00
8d2718d783 update how we store totals 2018-07-31 23:58:19 -07:00
8912b945fe remove print statement 2018-07-31 23:17:16 -07:00
ddceb16a2c fix template rendering in update_index url endpoint 2018-07-31 23:16:45 -07:00
f769d18b4e clean up flask config file 2018-07-31 23:16:23 -07:00
34a889479a Update config_flask.py 2018-07-31 23:12:57 -07:00
a074e6c0e7 add image to readme 2018-07-31 23:07:32 -07:00
918c9d583f update search results template 2018-07-31 23:01:38 -07:00
6cd505087b package up the counts in get_document_total_count 2018-07-31 22:37:20 -07:00
ee9b3bb811 pass a count dictionary instead of an integer to the jinja template 2018-07-31 22:36:43 -07:00
8a4e20b71c update template - gotta look good 2018-07-31 22:36:13 -07:00
64d3ce4a9b update search engine style to use centillion logo 2018-07-31 18:29:01 -07:00
5e9b584d26 uncovered the mysterious missing google docs: they were just being labeled as issues by the search template. 2018-07-31 15:59:21 -07:00
b03a42d261 start some troubleshooting 2018-07-31 05:21:58 -07:00
bd4f4da8dc more fixes - use "" not None 2018-07-31 05:15:22 -07:00
23743773a6 add mkdocs-material submodule 2018-07-31 04:33:27 -07:00
b7d2a8c960 rename some files, and move docs into docs/ 2018-07-31 04:32:38 -07:00
1f4b43163a fix env var name 2018-07-31 03:16:28 -07:00
f80ccc2520 successfully indexing, unsuccessfully searching 2018-07-31 03:06:25 -07:00
c2eae4f521 improve handling of repo names, owners, and document schema. improve timestamps. 2018-07-31 01:52:44 -07:00
c758ca7a6c add quickstart 2018-07-31 01:28:38 -07:00
3cf142465a updating readme with flask mention 2018-07-31 01:23:49 -07:00
bfd351c990 Update 'Workdone.md' 2018-07-31 08:12:28 +00:00
35 changed files with 1859 additions and 473 deletions

.gitignore

@@ -1,8 +1,8 @@
-config_flask.py
 vp
 credentials.json
 drive*.json
 *.pyc
+config.py
 out/
 search_index/
 venv/

.gitmodules (new file)

@@ -0,0 +1,3 @@
[submodule "mkdocs-material-dib"]
path = mkdocs-material-dib
url = https://github.com/dib-lab/mkdocs-material-dib.git

Readme.md

@@ -1,67 +1,94 @@
-# centillion
+# Centillion
-**the centillion**: a pan-github-markdown-issues-google-docs search engine.
+**centillion**: a pan-github-markdown-issues-google-docs search engine.
 **a centillion**: a very large number consisting of a 1 with 303 zeros after it.
-the centillion is 3.03 log-times better than the googol.
+one centillion is 3.03 log-times better than a googol.
+![Screen shot of centillion](docs/images/ss.png)
 ## what is it
-The centillion is a search engine built using [whoosh](#),
-a Python library for building search engines.
+Centillion (https://github.com/dcppc/centillion) is a search engine that can index
+three kinds of collections: Google Documents, Github issues, and Markdown files in
+Github repos.
 We define the types of documents the centillion should index,
-and how, using what fields. The centillion then builds and
-updates a search index.
+and what info and how. The centillion then builds and
+updates a search index. That's all done in `centillion_search.py`.
 The centillion also provides a simple web frontend for running
-queries against the search index.
+queries against the search index. That's done using a Flask server
+defined in `centillion.py`.
 The centillion keeps it simple.
-## work that is done
-See [Workdone.md](Workdone.md)
+## authentication layer
+Centillion lives behind a Github authentication layer, implemented with
+[flask-dance](https://github.com/singingwolfboy/flask-dance). When you first
+visit the site it will ask you to authenticate with Github so that it can
+verify you have permission to access the site.
+## technologies
+Centillion is a Python program built using whoosh (search engine library). It
+indexes the full text of docx files in Google Documents, just the filenames for
+non-docx files. The full text of issues and their comments are indexed, and
+results are grouped by issue. Centillion requires Google Drive and Github OAuth
+apps. Once you provide credentials to Flask you're all set to go.
-## work that is being done
-See [Workinprogress.md](Workinprogress.md) for details about
-route and function layout. Summary below.
+## control panel
+There's also a control panel at <https://search.nihdatacommons.us/control_panel>
+that allows you to rebuild the search index from scratch (the Google Drive indexing
+takes a while).
-### code organization
-centillion app routes:
-- home
-    - if not logged in, landing page
-    - if logged in, redirect to search
-- search
-- main_index_update
-    - update main index, all docs period
-centillion Search functions:
-- open_index creates the schema
-- add_issue, add_md, add_document have three diff method sigs and add diff types
-  of documents to the search index
-- update_all_issues or update_all_md or update_all_documents iterates over items
-  and determines whether each item needs to be updated in the search index
-- update_main_index - update the entire search index
-    - calls all three update_all methods
-- create_search_results - package things up for jinja
-- search - run the query, pass results to the jinja-packager
+![Screen shot of centillion control panel](docs/images/cp.png)
+## quickstart (with Github auth)
+Start by creating a Github OAuth application.
+Get the public and private application key
+(client token and client secret token)
+from the Github application's page.
+You will also need a Github access token
+(in addition to the app tokens).
+When you create the application, set the callback
+URL to `/login/github/authorized`, as in:
+```
+https://<url>/login/github/authorized
+```
+Edit the Flask configuration `config_flask.py`
+and set the public and private application keys.
+Now run centillion:
+```
+python centillion.py
+```
+or if you used http instead of https:
+```
+OAUTHLIB_INSECURE_TRANSPORT="true" python centillion.py
+```
+This will start a Flask server, and you can view the minimal search engine
+interface in your browser at `http://<ip>:5000`.
-## work that is planned
-See [Workplanned.md](Workplanned.md)
+## troubleshooting
+If you are having problems with your callback URL being treated
+as HTTP by Github, even though there is an HTTPS address, and
+everything else seems fine, try deleting the Github OAuth app
+and creating a new one.

Todo.md (new file)

@@ -0,0 +1,47 @@
# todo
Main task:
- hashing and caching
- <s>first, working out the logic of how we group items into sets
- needs to be deleted
- needs to be updated
- needs to be added
- for docs, issues, and comments</s>
- second, when we add or update an item, need to:
- go through the motions, download file, extract text
- check for existing indexed doc with that id
- check if existing indexed doc has same hash
- if so, skip
- otherwise, delete and re-index
Other bugs:
- Some github issues have no title (?)
- <s>Need to combine issues with comments</s>
- Not able to index markdown files _in a repo_
- (Longer term) update main index vs update diff index
Needs:
- <s>control panel</s>
Thursday product:
- Everything re-indexed nightly
- Search engine built on all documents in Google Drive, all issues, markdown files
- Using pandoc to extract Google Drive document contents
- BRIEF quickstart documentation
Future:
- Future plans to improve - plugins, improving matching
- Subdomain plans
- Folksonomy tagging and integration plans
config options for plugins
conditional blocks with import github inside
complicated tho - better to have components split off
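
A sketch of the hash-and-cache check described in the second task above, assuming a hypothetical needs_reindex helper and the fingerprint schema field added in centillion_search.py below:

```
# Sketch only: decide whether a downloaded item needs (re-)indexing.
import hashlib

def needs_reindex(writer, searcher, doc_id, content):
    fingerprint = hashlib.sha256(content.encode("utf-8")).hexdigest()
    existing = searcher.document(id=doc_id)   # whoosh lookup by unique field
    if existing is not None:
        if existing.get("fingerprint") == fingerprint:
            return False, fingerprint         # same hash: skip
        writer.delete_by_term("id", doc_id)   # changed: delete, then re-index
    return True, fingerprint
```
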

(deleted file)

@@ -1,106 +0,0 @@
# Components
The components of centillion are as follows:
- Flask application, which creates a Search object and uses it to search index
- Search object, which allows you to create/update/search an index
## Routes layout
Current application routes are as follows:
- home -> search
- search
- update_index
Ideal application routes (using github flask dance oauth):
- home
- if not logged in, landing page
- if logged in, redirect to search
- search
- main_index_update
- update main index, all docs period
- delta_index_update
- updates delta index, docs that have changed since last main index
There should be one route to update the main index
There should be another route to update the delta index
These should go off and call the update index methods
for each respective type of document/collection.
For example, if I call `main_index_update` route it should
- call `main_index_update` for all github issues
- call `main_index_update` for folder of markdown docs
- call `main_index_update` for google drive folder
These are all members of the Search class
## Functions layout
Functions of the entire search app:
- create a search index
- load a search index
- call the search() method on the index
- update the search index
The first and last, creating and updating the search index,
are of greatest interest.
The Schema affects everything so it is hard to separate
functionality into a main Search class shared by many.
(Avoid inheritance/classes if possible.)
current Search:
- open_index creates the schema
- add_issue or add_document adds an item to the index
- add_all_issues or add_all_documents iterates over items and adds them to index
- update_index_incremental - update the search index
- create_search_results - package things up for jinja
- search - run the query, pass results to the jinja-packager
centillion Search:
- open_index creates the schema
- add_issue, add_md, add_document have three diff method sigs and add diff types
of documents to the search index
- update_all_issues or update_all_md or update_all_documents iterates over items
and determines whether each item needs to be updated in the search index
- update_main_index - update the entire search index
- calls all three update_all methods
- create_search_results - package things up for jinja
- search - run the query, pass results to the jinja-packager
Nice to have but focus on it later:
- update_diff_issues or update_diff_md or update_diff_documents iterates over items
and indexes recently-added items
- update_diff_index - update the diff search index (what's been added since last
time)
- calls all three update_diff methods
## Files layout
Schema definition:
* include a "kind" or "class" to group objects
* can provide different searches of different collections
* eventually can provide user with checkboxes

centillion.py

@@ -2,8 +2,11 @@ import threading
 from subprocess import call
 import codecs
-import os
+import os, json
+from werkzeug.contrib.fixers import ProxyFix
 from flask import Flask, request, redirect, url_for, render_template, flash
+from flask_dance.contrib.github import make_github_blueprint, github
 # create our application
 from centillion_search import Search
@@ -22,10 +25,18 @@ You provide:
 - Google Drive API key via file
 """
 class UpdateIndexTask(object):
-    def __init__(self, diff_index=False):
+    def __init__(self, app_config, diff_index=False):
         self.diff_index = diff_index
         thread = threading.Thread(target=self.run, args=())
+        self.gh_token = app_config['GITHUB_TOKEN']
+        self.groupsio_credentials = {
+                'groupsio_token' : app_config['GROUPSIO_TOKEN'],
+                'groupsio_username' : app_config['GROUPSIO_USERNAME'],
+                'groupsio_password' : app_config['GROUPSIO_PASSWORD']
+        }
         thread.daemon = True
         thread.start()
@@ -38,90 +49,180 @@ class UpdateIndexTask(object):
         from get_centillion_config import get_centillion_config
         config = get_centillion_config('config_centillion.json')
-        gh_token = os.environ['GITHUB_ACESS_TOKEN']
-        search.update_index_issues(gh_token, config)
-        search.update_index_gdocs(config)
+        search.update_index_groupsioemails(self.groupsio_credentials,config)
+        ###search.update_index_ghfiles(self.gh_token,config)
+        ###search.update_index_issues(self.gh_token,config)
+        ###search.update_index_gdocs(config)
 app = Flask(__name__)
+app.wsgi_app = ProxyFix(app.wsgi_app)
 # Load default config and override config from an environment variable
 app.config.from_pyfile("config_flask.py")
-last_searches_file = app.config["INDEX_DIR"] + "/last_searches.txt"
+#github_bp = make_github_blueprint()
+github_bp = make_github_blueprint(
+        client_id = os.environ.get('GITHUB_OAUTH_CLIENT_ID'),
+        client_secret = os.environ.get('GITHUB_OAUTH_CLIENT_SECRET'),
+        scope='read:org')
+app.register_blueprint(github_bp, url_prefix="/login")
+contents404 = "<html><body><h1>Status: Error 404 Page Not Found</h1></body></html>"
+contents403 = "<html><body><h1>Status: Error 403 Access Denied</h1></body></html>"
+contents200 = "<html><body><h1>Status: OK 200</h1></body></html>"
 ##############################
 # Flask routes
 @app.route('/')
 def index():
-    return redirect(url_for("search", query="", fields=""))
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+    else:
+        username = github.get("/user").json()['login']
+        resp = github.get("/user/orgs")
+        if resp.ok:
+            # If they are in team copper, redirect to search.
+            # Otherwise, hit em with a 403
+            all_orgs = resp.json()
+            for org in all_orgs:
+                if org['login']=='dcppc':
+                    copper_team_id = '2700235'
+                    mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                    if mresp.status_code==204:
+                        # --------------------
+                        # Business as usual
+                        return redirect(url_for("search", query="", fields=""))
+            return contents403
+    return contents404
+### @app.route('/')
+### def index():
+###     return redirect(url_for("search", query="", fields=""))
 @app.route('/search')
 def search():
-    query = request.args['query']
-    fields = request.args.get('fields')
-    if fields == 'None':
-        fields = None
-    search = Search(app.config["INDEX_DIR"])
-    if not query:
-        parsed_query = ""
-        result = []
-    else:
-        parsed_query, result = search.search(query.split(), fields=[fields])
-        store_search(query, fields)
-    total = search.get_document_total_count()
-    return render_template('search.html', entries=result, query=query, parsed_query=parsed_query, fields=fields, last_searches=get_last_searches(), total=total)
-@app.route('/open')
-def open_file():
-    path = request.args['path']
-    fields = request.args.get('fields')
-    query = request.args['query']
-    call([app.config["EDIT_COMMAND"], path])
-    return redirect(url_for("search", query=query, fields=fields))
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+    username = github.get("/user").json()['login']
+    resp = github.get("/user/orgs")
+    if resp.ok:
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+                copper_team_id = '2700235'
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+                    # --------------------
+                    # Business as usual
+                    query = request.args['query']
+                    fields = request.args.get('fields')
+                    if fields == 'None':
+                        fields = None
+                    search = Search(app.config["INDEX_DIR"])
+                    if not query:
+                        parsed_query = ""
+                        result = []
+                    else:
+                        parsed_query, result = search.search(query.split(), fields=[fields])
+                    totals = search.get_document_total_count()
+                    return render_template('search.html',
+                               entries=result,
+                               query=query,
+                               parsed_query=parsed_query,
+                               fields=fields,
+                               totals=totals)
+    return contents403
 @app.route('/update_index')
 def update_index():
-    rebuild = request.args.get('rebuild')
-    UpdateIndexTask(diff_index=False)
-    flash("Rebuilding index, check console output")
-    return render_template("search.html", query="", fields="", last_searches=get_last_searches())
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+    username = github.get("/user").json()['login']
+    resp = github.get("/user/orgs")
+    if resp.ok:
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+                copper_team_id = '2700235'
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+                    # --------------------
+                    # Business as usual
+                    UpdateIndexTask(app.config,
+                            diff_index=False)
+                    flash("Rebuilding index, check console output")
+                    return render_template("controlpanel.html",
+                               totals={})
+    return contents403
-##############
-# Utility methods
-def get_last_searches():
-    if os.path.exists(last_searches_file):
-        with codecs.open(last_searches_file, 'r', encoding='utf-8') as f:
-            contents = f.readlines()
-    else:
-        contents = []
-    return contents
-def store_search(query, fields):
-    if os.path.exists(last_searches_file):
-        with codecs.open(last_searches_file, 'r', encoding='utf-8') as f:
-            contents = f.readlines()
-    else:
-        contents = []
-    search = "query=%s&fields=%s\n" % (query, fields)
-    if not search in contents:
-        contents.insert(0, search)
-    with codecs.open(last_searches_file, 'w', encoding='utf-8') as f:
-        f.writelines(contents[:30])
+@app.route('/control_panel')
+def control_panel():
+    if not github.authorized:
+        return redirect(url_for("github.login"))
+    username = github.get("/user").json()['login']
+    resp = github.get("/user/orgs")
+    if resp.ok:
+        all_orgs = resp.json()
+        for org in all_orgs:
+            if org['login']=='dcppc':
+                copper_team_id = '2700235'
+                mresp = github.get('/teams/%s/members/%s'%(copper_team_id,username))
+                if mresp.status_code==204:
+                    return render_template("controlpanel.html",
+                               totals={})
+    return contents403
+@app.errorhandler(404)
+def oops(e):
+    return contents404
 if __name__ == '__main__':
-    app.run()
+    # if running local instance, set to true
+    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = 'true'
+    app.run(host="0.0.0.0",port=5000)
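
Each route above repeats the same flask-dance membership gate (authorized check, dcppc org lookup, team 2700235 membership probe). A sketch of factoring that gate into a decorator, using only calls that appear in the diff; copper_team_required is a hypothetical name:

```
from functools import wraps
from flask import redirect, url_for
from flask_dance.contrib.github import github

contents403 = "<html><body><h1>Status: Error 403 Access Denied</h1></body></html>"

def copper_team_required(route):
    @wraps(route)
    def wrapper(*args, **kwargs):
        if not github.authorized:
            return redirect(url_for("github.login"))
        username = github.get("/user").json()['login']
        resp = github.get("/user/orgs")
        if resp.ok and any(org['login'] == 'dcppc' for org in resp.json()):
            # 204 means the user is a member of team copper
            mresp = github.get('/teams/%s/members/%s' % ('2700235', username))
            if mresp.status_code == 204:
                return route(*args, **kwargs)
        return contents403
    return wrapper
```
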

centillion_prepare.py (new file)

@@ -0,0 +1,5 @@
from gdrive_util import GDrive
gd = GDrive()
service = gd.get_service()

centillion_search.py

@@ -1,9 +1,11 @@
 import shutil
 import html.parser
-from github import Github
+from github import Github, GithubException
+import base64
 from gdrive_util import GDrive
+from groupsio_util import GroupsIOArchivesCrawler
 from apiclient.http import MediaIoBaseDownload
 import mistune
@@ -14,6 +16,8 @@ import tempfile, subprocess
 import pypandoc
 import os.path
 import codecs
+from datetime import datetime
 from whoosh.qparser import MultifieldParser, QueryParser
 from whoosh.analysis import StemmingAnalyzer
@@ -40,6 +44,7 @@ Search object functions:
 Schema:
  - id
  - kind
+ - fingerprint
  - created_time
  - modified_time
  - indexed_time
@@ -57,6 +62,10 @@ Schema:
""" """
def clean_timestamp(dt):
return dt.replace(microsecond=0).isoformat()
class SearchResult: class SearchResult:
score = 1.0 score = 1.0
path = None path = None
@@ -89,6 +98,11 @@ class Search:
     def __init__(self, index_folder):
         self.open_index(index_folder)
+    # ------------------------------
+    # Create a schema and open a search index
+    # on disk.
     def open_index(self, index_folder, create_new=False):
         """
         Create a schema,
@@ -109,13 +123,12 @@ class Search:
         # ------------------------------
-        # IMPORTANT:
         # This is where the search index's document schema
         # is defined.
         schema = Schema(
                 id = ID(stored=True, unique=True),
-                kind = ID(),
+                kind = ID(stored=True),
                 created_time = ID(stored=True),
                 modified_time = ID(stored=True),
@@ -154,28 +167,49 @@ class Search:
     # Define how to add documents
-    def add_drive_file(self, writer, item, indexed_ids, temp_dir, config):
+    def add_drive_file(self, writer, item, temp_dir, config, update=False):
         """
         Add a Google Drive document/file to a search index.
         If it is a document, extract the contents.
         """
-        gd = GDrive()
-        service = gd.get_service()
-        # ------------------------
-        # Two kinds of documents:
+        # There are two kinds of documents:
         # - documents with text that can be extracted (docx)
         # - everything else
         mimetype = re.split('[/\.]',item['mimeType'])[-1]
         mimemap = {
                 'document' : 'docx',
         }
-        if(mimetype not in mimemap.keys()):
-            # Not a document -
-            # Just a file
-            print("Indexing document %s of type %s"%(item['name'], mimetype))
+        content = ""
+        if mimetype not in mimemap.keys():
+            # Not a document - just a file
+            print("Indexing Google Drive file \"%s\" of type %s"%(item['name'], mimetype))
+            writer.delete_by_term('id',item['id'])
+            # Index a plain google drive file
+            writer.add_document(
+                    id = item['id'],
+                    kind = 'gdoc',
+                    created_time = item['createdTime'],
+                    modified_time = item['modifiedTime'],
+                    indexed_time = datetime.now().replace(microsecond=0).isoformat(),
+                    title = item['name'],
+                    url = item['webViewLink'],
+                    mimetype = mimetype,
+                    owner_email = item['owners'][0]['emailAddress'],
+                    owner_name = item['owners'][0]['displayName'],
+                    repo_name='',
+                    repo_url='',
+                    github_user='',
+                    issue_title='',
+                    issue_url='',
+                    content = content
+            )
         else:
             # Document with text
             # Perform content extraction
@@ -187,7 +221,8 @@ class Search:
             # This is a file type we know how to convert
             # Construct the URL and download it
-            print("Extracting content from %s of type %s"%(item['name'], mimetype))
+            print("Indexing Google Drive document \"%s\" of type %s"%(item['name'], mimetype))
+            print(" > Extracting content")
             # Create a URL and a destination filename
@@ -208,7 +243,7 @@ class Search:
             outfile_name = name+'.'+out_ext
-            # assemble input/output file paths
+            # Assemble input/output file paths
             fullpath_input = os.path.join(temp_dir,infile_name)
             fullpath_output = os.path.join(temp_dir,outfile_name)
@@ -217,7 +252,6 @@ class Search:
             with open(fullpath_input, 'wb') as f:
                 f.write(r.content)
             # Try to convert docx file to plain text
             try:
                 output = pypandoc.convert_file(fullpath_input,
@@ -227,12 +261,11 @@ class Search:
                 )
                 assert output == ""
             except RuntimeError:
-                print("XXXXXX Failed to index document %s"%(item['name']))
+                print(" > XXXXXX Failed to index document \"%s\""%(item['name']))
             # If export was successful, read contents of markdown
             # into the content variable.
-            # into the content variable.
             if os.path.isfile(fullpath_output):
                 # Export was successful
                 with codecs.open(fullpath_output, encoding='utf-8') as f:
@@ -240,88 +273,196 @@ class Search:
             # No matter what happens, clean up.
-            print("Cleaning up %s"%item['name'])
+            print(" > Cleaning up \"%s\""%item['name'])
-            subprocess.call(['rm','-fr',fullpath_output])
+            ## test
             #print(" ".join(['rm','-fr',fullpath_output]))
-            subprocess.call(['rm','-fr',fullpath_input])
             #print(" ".join(['rm','-fr',fullpath_input]))
+            # do it
+            subprocess.call(['rm','-fr',fullpath_output])
+            subprocess.call(['rm','-fr',fullpath_input])
-            # ------------------------------
-            # IMPORTANT:
-            # This is where the search documents are actually created.
+            if update:
+                print(" > Removing old record")
+                writer.delete_by_term('id',item['id'])
+            else:
+                print(" > Creating a new record")
-            mimetype = re.split('[/\.]', item['mimeType'])[-1]
-            writer.add_document(
-                    id = item['id'],
-                    kind = 'gdoc',
-                    created_time = item['createdTime'],
-                    modified_time = item['modifiedTime'],
+            writer.add_document(
+                    id = item['id'],
+                    kind = 'gdoc',
+                    created_time = item['createdTime'],
+                    modified_time = item['modifiedTime'],
+                    indexed_time = datetime.now().replace(microsecond=0).isoformat(),
                     title = item['name'],
                     url = item['webViewLink'],
                     mimetype = mimetype,
                     owner_email = item['owners'][0]['emailAddress'],
                     owner_name = item['owners'][0]['displayName'],
-                    repo_name=None,
-                    repo_url=None,
-                    github_user=None,
-                    issue_title=None,
-                    issue_url=None,
+                    repo_name='',
+                    repo_url='',
+                    github_user='',
+                    issue_title='',
+                    issue_url='',
                     content = content
             )
-    def add_issue(self, writer, issue, repo, config):
+    # ------------------------------
+    # Add a single github issue and its comments
+    # to a search index.
+    def add_issue(self, writer, issue, gh_token, config, update=True):
         """
         Add a Github issue/comment to a search index.
         """
-        repo_name = repo.name
+        repo = issue.repository
+        repo_name = repo.owner.login+"/"+repo.name
         repo_url = repo.html_url
-        count = 0
-        # Handle the issue content
         print("Indexing issue %s"%(issue.html_url))
-        writer.add_document(
-                id = issue.html_url,
-                kind = 'issue',
-                url = issue.html_url,
-                is_comment = False,
-                timestamp = issue.created_at,
-                repo_name = repo_name,
-                repo_url = repo_url,
-                issue_title = issue.title,
-                issue_url = issue.html_url,
-                user = issue.user.login,
-                content = issue.body.rstrip()
-        )
-        count += 1
+        # Combine comments with their respective issues.
+        # Otherwise just too noisy.
+        issue_comment_content = issue.body.rstrip()
+        issue_comment_content += "\n"
         # Handle the comments content
         if(issue.comments>0):
             comments = issue.get_comments()
             for comment in comments:
-                print(" > Indexing comment %s"%(comment.html_url))
-                writer.add_document(
-                        id = comment.html_url,
-                        kind = 'comment',
-                        url = comment.html_url,
-                        is_comment = True,
-                        timestamp = comment.created_at,
-                        repo_name = repo_name,
-                        repo_url = repo_url,
-                        issue_title = issue.title,
-                        issue_url = issue.html_url,
-                        user = comment.user.login,
-                        content = comment.body.strip()
-                )
-                count += 1
-        return count
+                issue_comment_content += comment.body.rstrip()
+                issue_comment_content += "\n"
+        # Now create the actual search index record
+        created_time = clean_timestamp(issue.created_at)
+        modified_time = clean_timestamp(issue.updated_at)
+        indexed_time = clean_timestamp(datetime.now())
+        # Add one document per issue thread,
+        # containing entire text of thread.
+        writer.add_document(
+                id = issue.html_url,
+                kind = 'issue',
+                created_time = created_time,
+                modified_time = modified_time,
+                indexed_time = indexed_time,
+                title = issue.title,
+                url = issue.html_url,
+                mimetype='',
+                owner_email='',
+                owner_name='',
+                repo_name = repo_name,
+                repo_url = repo_url,
+                github_user = issue.user.login,
+                issue_title = issue.title,
+                issue_url = issue.html_url,
+                content = issue_comment_content
+        )
+    def add_ghfile(self, writer, d, gh_token, config, update=True):
+        """
+        Use a Github file API record to add a filename
+        to the search index.
+        """
+        MARKDOWN_EXTS = ['.md','.markdown']
+        repo = d['repo']
+        org = d['org']
+        repo_name = org + "/" + repo
+        repo_url = "https://github.com/" + repo_name
+        try:
+            fpath = d['path']
+            furl = d['url']
+            fsha = d['sha']
+            _, fname = os.path.split(fpath)
+            _, fext = os.path.splitext(fpath)
+        except:
+            print(" > XXXXXXXX Failed to find file info.")
+            return
+        indexed_time = clean_timestamp(datetime.now())
+        if fext in MARKDOWN_EXTS:
+            print("Indexing markdown doc %s from repo %s"%(fname,repo_name))
+            # Unpack the requests response and decode the content
+            #
+            # don't forget the headers for private repos!
+            # useful: https://bit.ly/2LSAflS
+            headers = {'Authorization' : 'token %s'%(gh_token)}
+            response = requests.get(furl, headers=headers)
+            if response.status_code==200:
+                jresponse = response.json()
+                content = ""
+                try:
+                    binary_content = re.sub('\n','',jresponse['content'])
+                    content = base64.b64decode(binary_content).decode('utf-8')
+                except KeyError:
+                    print(" > XXXXXXXX Failed to extract 'content' field. You probably hit the rate limit.")
+            else:
+                print(" > XXXXXXXX Failed to reach file URL. There may be a problem with authentication/headers.")
+                return
+            usable_url = "https://github.com/%s/blob/master/%s"%(repo_name, fpath)
+            # Now create the actual search index record
+            writer.add_document(
+                    id = fsha,
+                    kind = 'markdown',
+                    created_time = '',
+                    modified_time = '',
+                    indexed_time = indexed_time,
+                    title = fname,
+                    url = usable_url,
+                    mimetype='',
+                    owner_email='',
+                    owner_name='',
+                    repo_name = repo_name,
+                    repo_url = repo_url,
+                    github_user = '',
+                    issue_title = '',
+                    issue_url = '',
+                    content = content
+            )
+        else:
+            print("Indexing github file %s from repo %s"%(fname,repo_name))
+            key = fname+"_"+fsha
+            # Now create the actual search index record
+            writer.add_document(
+                    id = key,
+                    kind = 'ghfile',
+                    created_time = '',
+                    modified_time = '',
+                    indexed_time = indexed_time,
+                    title = fname,
+                    url = repo_url,
+                    mimetype='',
+                    owner_email='',
+                    owner_name='',
+                    repo_name = repo_name,
+                    repo_url = repo_url,
+                    github_user = '',
+                    issue_title = '',
+                    issue_url = '',
+                    content = ''
+            )
@@ -329,133 +470,376 @@ class Search:
     # Define how to update search index
     # using different kinds of collections
+    # ------------------------------
+    # Google Drive Files/Documents
     def update_index_gdocs(self,
                            config):
         """
         Update the search index using a collection of
         Google Drive documents and files.
+        Uses the 'id' field to uniquely identify documents.
+        Also see:
+        https://developers.google.com/drive/api/v3/reference/files
         """
-        gd = GDrive()
-        service = gd.get_service()
-        # -----
-        # Get the set of all documents on Google Drive:
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids
+        # - add hash check in add_
-        # ------------------------------
-        # IMPORTANT:
-        # This determines what information about the Google Drive files
-        # you'll get back, and that's all you're going to have to work with.
-        # If you need more information, modify the statement below.
-        # Also see:
-        # https://developers.google.com/drive/api/v3/reference/files
+        # Get the set of indexed ids:
+        # ------
+        indexed_ids = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("gdoc")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+        # Get the set of remote ids:
+        # ------
+        # Start with google drive api object
         gd = GDrive()
         service = gd.get_service()
         drive = service.files()
-        # Now index all the docs in the google drive folder
         # The trick is to set next page token to None 1st time thru (fencepost)
         nextPageToken = None
         # Use the pager to return all the things
-        items = []
+        remote_ids = set()
+        full_items = {}
         while True:
+            ps = 100
             results = drive.list(
-                    pageSize=100,
+                    pageSize=ps,
                     pageToken=nextPageToken,
-                    fields="files(id, kind, createdTime, modifiedTime, mimeType, name, owners, webViewLink)",
+                    fields = "nextPageToken, files(id, kind, createdTime, modifiedTime, mimeType, name, owners, webViewLink)",
                     spaces="drive"
             ).execute()
             nextPageToken = results.get("nextPageToken")
-            items += results.get("files", [])
+            files = results.get("files",[])
+            for f in files:
+                # Add all remote docs to a set
+                remote_ids.add(f['id'])
+                # Also store the doc
+                full_items[f['id']] = f
+            ## Shorter:
+            #break
+            # Longer:
             if nextPageToken is None:
                 break
-        indexed_ids = set()
-        for item in items:
-            indexed_ids.add(item['id'])
         writer = self.ix.writer()
+        count = 0
         temp_dir = tempfile.mkdtemp(dir=os.getcwd())
         print("Temporary directory: %s"%(temp_dir))
-        if not os.path.exists(temp_dir):
-            os.mkdir(temp_dir)
-        count = 0
-        for item in items:
-            self.add_item(writer, item, indexed_ids, temp_dir, config)
-            count += 1
+        # Drop any id in indexed_ids
+        # not in remote_ids
+        drop_ids = indexed_ids - remote_ids
+        for drop_id in drop_ids:
+            writer.delete_by_term('id',drop_id)
+        # Update any id in indexed_ids
+        # and in remote_ids
+        update_ids = indexed_ids & remote_ids
+        for update_id in update_ids:
+            # cop out
+            writer.delete_by_term('id',update_id)
+            item = full_items[update_id]
+            self.add_drive_file(writer, item, temp_dir, config, update=True)
+            count += 1
+        # Add any id not in indexed_ids
+        # and in remote_ids
+        add_ids = remote_ids - indexed_ids
+        for add_id in add_ids:
+            item = full_items[add_id]
+            self.add_drive_file(writer, item, temp_dir, config, update=False)
+            count += 1
+        print("Cleaning temporary directory: %s"%(temp_dir))
+        subprocess.call(['rm','-fr',temp_dir])
         writer.commit()
         print("Done, updated %d documents in the index" % count)
+    # ------------------------------
+    # Github Issues/Comments
-    def update_index_issues(self,
-                            gh_access_token,
-                            config):
+    def update_index_issues(self, gh_token, config):
         """
         Update the search index using a collection of
         Github repo issues and comments.
         """
-        # Strategy:
-        # To get the proof of concept up and running,
-        # we are just deleting and re-indexing every issue/comment.
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids
-        g = Github(gh_access_token)
+        # Get the set of indexed ids:
+        # ------
+        indexed_issues = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("issue")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_issues.add(result['id'])
-        # Set of all URLs as existing on github
-        to_index = set()
-        writer = self.ix.writer()
+        # Get the set of remote ids:
+        # ------
+        # Start with api object
+        g = Github(gh_token)
-        # Iterate over each repo
-        list_of_repos = config['repos']
+        # Now index all issue threads in the user-specified repos
+        # Start by collecting all the things
+        remote_issues = set()
+        full_items = {}
+        # Iterate over each repo
+        list_of_repos = config['repositories']
         for r in list_of_repos:
             if '/' not in r:
                 err = "Error: specify org/reponame or user/reponame in list of repos"
                 raise Exception(err)
-            this_repo, this_org = re.split('/',r)
+            this_org, this_repo = re.split('/',r)
+            try:
+                org = g.get_organization(this_org)
+                repo = org.get_repo(this_repo)
+            except:
+                print("Error: could not gain access to repository %s"%(r))
+                continue
-            org = g.get_organization(this_org)
-            repo = org.get_repo(this_repo)
-            count = 0
-            # Iterate over each thread
+            # Iterate over each issue thread
             issues = repo.get_issues()
             for issue in issues:
-                # This approach is more work than is needed
-                # but PoC||GTFO
                 # For each issue/comment URL,
-                # remove the corresponding item
-                # and re-add it to the index
+                # grab the key and store the
+                # corresponding issue object
+                key = issue.html_url
+                value = issue
-                to_index.add(issue.html_url)
-                writer.delete_by_term('url', issue.html_url)
+                remote_issues.add(key)
+                full_items[key] = value
-                comments = issue.get_comments()
-                for comment in comments:
-                    to_index.add(comment.html_url)
-                    writer.delete_by_term('url', comment.html_url)
-                # Now re-add this issue to the index
-                # (this will also add the comments)
-                count += self.add_issue(writer, issue, repo, config)
+        writer = self.ix.writer()
+        count = 0
+        # Drop any issues in indexed_issues
+        # not in remote_issues
+        drop_issues = indexed_issues - remote_issues
+        for drop_issue in drop_issues:
+            writer.delete_by_term('id',drop_issue)
+        # Update any issue in indexed_issues
+        # and in remote_issues
+        update_issues = indexed_issues & remote_issues
+        for update_issue in update_issues:
+            # cop out
+            writer.delete_by_term('id',update_issue)
+            item = full_items[update_issue]
+            self.add_issue(writer, item, gh_token, config, update=True)
+            count += 1
+        # Add any issue not in indexed_issues
+        # and in remote_issues
+        add_issues = remote_issues - indexed_issues
+        for add_issue in add_issues:
+            item = full_items[add_issue]
+            self.add_issue(writer, item, gh_token, config, update=False)
+            count += 1
         writer.commit()
         print("Done, updated %d documents in the index" % count)
+    # ------------------------------
+    # Github Files
+    def update_index_ghfiles(self, gh_token, config):
+        """
+        Update the search index using a collection of
+        files (and, separately, Markdown files) from
+        a Github repo.
+        """
+        # Updated algorithm:
+        # - get set of indexed ids
+        # - get set of remote ids
+        # - drop indexed ids not in remote ids
+        # - index all remote ids
+        # Get the set of indexed ids:
+        # ------
+        indexed_ids = set()
+        p = QueryParser("kind", schema=self.ix.schema)
+        q = p.parse("ghfiles")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+        q = p.parse("markdown")
+        with self.ix.searcher() as s:
+            results = s.search(q,limit=None)
+            for result in results:
+                indexed_ids.add(result['id'])
+        # Get the set of remote ids:
+        # ------
+        # Start with api object
+        g = Github(gh_token)
+        # Now index all the files.
+        # Start by collecting all the things
+        remote_ids = set()
+        full_items = {}
+        # Iterate over each repo
+        list_of_repos = config['repositories']
+        for r in list_of_repos:
+            if '/' not in r:
+                err = "Error: specify org/reponame or user/reponame in list of repos"
+                raise Exception(err)
+            this_org, this_repo = re.split('/',r)
+            try:
+                org = g.get_organization(this_org)
+                repo = org.get_repo(this_repo)
+            except:
+                print("Error: could not gain access to repository %s"%(r))
+                continue
+            # Get head commit
+            commits = repo.get_commits()
+            try:
+                last = commits[0]
+                sha = last.sha
+            except GithubException:
+                print("Error: could not get commits from repository %s"%(r))
+                continue
+            # Get all the docs
+            tree = repo.get_git_tree(sha=sha, recursive=True)
+            docs = tree.raw_data['tree']
+            print("Parsing file ids from repository %s"%(r))
+            for d in docs:
+                # For each doc, get the file extension
+                # and decide what to do with it.
+                fpath = d['path']
+                _, fname = os.path.split(fpath)
+                _, fext = os.path.splitext(fpath)
+                key = d['sha']
+                d['org'] = this_org
+                d['repo'] = this_repo
+                value = d
+                remote_ids.add(key)
+                full_items[key] = value
+        writer = self.ix.writer()
+        count = 0
+        # Drop any id in indexed_ids
+        # not in remote_ids
+        drop_ids = indexed_ids - remote_ids
+        for drop_id in drop_ids:
+            writer.delete_by_term('id',drop_id)
+        # Update any id in indexed_ids
+        # and in remote_ids
+        update_ids = indexed_ids & remote_ids
+        for update_id in update_ids:
+            # cop out: just delete and re-add
+            writer.delete_by_term('id',update_id)
+            item = full_items[update_id]
+            self.add_ghfile(writer, item, gh_token, config, update=True)
+            count += 1
+        # Add any issue not in indexed_ids
+        # and in remote_ids
+        add_ids = remote_ids - indexed_ids
+        for add_id in add_ids:
+            item = full_items[add_id]
+            self.add_ghfile(writer, item, gh_token, config, update=False)
+            count += 1
+        writer.commit()
+        print("Done, updated %d Github files in the index" % count)
+    # ------------------------------
+    # Groups.io Emails
+    def update_index_groupsioemails(self, groupsio_token, config):
+        """
+        Update the search index using the email archives
+        of groups.io groups.
+        This requires the use of a spider.
+        RELEASE THE SPIDER!!!
+        """
+        spider = GroupsIOArchivesCrawler(groupsio_token,'dcppc')
+        # - ask spider to crawl the archives
+        spider.crawl_group_archives()
+        # - ask spider for list of all email records
+        #   - 1 email = 1 dictionary
+        #   - email records compiled by the spider
+        archives = spider.get_archives()
+        # - email object is sent off to add email method
+        print("Finished indexing groups.io emails")
     # ---------------------------------
     # Search results bundler
@@ -477,11 +861,6 @@ class Search:
             # contains a {% for e in entries %}
             # and then an {{e.score}}
-            # ------------------
-            # cheseburger
-            # create search results
             sr = SearchResult()
             sr.score = r.score
@@ -495,37 +874,29 @@ class Search:
             sr.id = r['id']
             sr.kind = r['kind']
-            sr.url = r['url']
+            sr.created_time = r['created_time']
+            sr.modified_time = r['modified_time']
+            sr.indexed_time = r['indexed_time']
             sr.title = r['title']
+            sr.url = r['url']
             sr.mimetype = r['mimetype']
             sr.owner_email = r['owner_email']
             sr.owner_name = r['owner_name']
-            sr.content = r['content']
-            # -----------------
-            # github isuses
-            # create search results
-            sr = SearchResult()
-            sr.score = r.score
-            sr.url = r['url']
-            sr.title = r['issue_title']
             sr.repo_name = r['repo_name']
             sr.repo_url = r['repo_url']
             sr.issue_title = r['issue_title']
             sr.issue_url = r['issue_url']
-            sr.is_comment = r['is_comment']
+            sr.github_user = r['github_user']
             sr.content = r['content']
-            # ------------------
             highlights = r.highlights('content')
             if not highlights:
                 # just use the first 1,000 words of the document
@@ -533,21 +904,18 @@ class Search:
             highlights = self.html_parser.unescape(highlights)
             html = self.markdown(highlights)
+            html = re.sub(r'\n','<br />',html)
             sr.content_highlight = html
             search_results.append(sr)
         return search_results
-    # ------------------
-    # github issues
-    # create search results
     def search(self, query_list, fields=None):
         with self.ix.searcher() as searcher:
             query_string = " ".join(query_list)
             query = None
@@ -558,27 +926,15 @@ class Search:
             elif len(fields) == 2:
                 pass
             else:
-                fields = ['id',
-                          'kind',
-                          'created_time',
-                          'modified_time',
-                          'indexed_time',
-                          'title',
-                          'url',
-                          'mimetype',
-                          'owner_email',
-                          'owner_name',
-                          'repo_name',
-                          'repo_url',
-                          'issue_title',
-                          'issue_url',
-                          'github_user',
-                          'content']
+                # If the user does not specify a field,
+                # these are the fields that are actually searched
+                fields = ['title',
+                          'content']
             if not query:
                 query = MultifieldParser(fields, schema=self.ix.schema).parse(query_string)
             parsed_query = "%s" % query
             print("query: %s" % parsed_query)
-            results = searcher.search(query, terms=False, scored=True, groupedby="url")
+            results = searcher.search(query, terms=False, scored=True, groupedby="kind")
             search_result = self.create_search_result(results)
             return parsed_query, search_result
@@ -589,9 +945,29 @@ class Search:
         return s if len(s) <= l else s[0:l - 3] + '...'
     def get_document_total_count(self):
-        return self.ix.searcher().doc_count_all()
+        p = QueryParser("kind", schema=self.ix.schema)
+        counts = {
+                "gdoc" : None,
+                "issue" : None,
+                "ghfile" : None,
+                "markdown" : None,
+                "total" : None
+        }
+        for key in counts.keys():
+            q = p.parse(key)
+            with self.ix.searcher() as s:
+                results = s.search(q,limit=None)
+                counts[key] = len(results)
+        counts['total'] = sum(counts[k] for k in counts.keys())
+        return counts
 if __name__ == "__main__":
+    raise Exception("Error: main method not implemented (fix groupsio credentials first)")
     search = Search("search_index")
     from get_centillion_config import get_centillion_config
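
All of the update_index_* methods above follow the same set-algebra reconciliation; a condensed sketch of the shared pattern (reconcile and add_record are stand-in names, not functions from the diff):

```
# writer is a whoosh IndexWriter; add_record stands in for
# add_drive_file / add_issue / add_ghfile.
def reconcile(writer, indexed_ids, remote_items, add_record):
    remote_ids = set(remote_items.keys())
    for drop_id in indexed_ids - remote_ids:     # gone upstream: drop
        writer.delete_by_term('id', drop_id)
    for update_id in indexed_ids & remote_ids:   # still present: delete, re-add
        writer.delete_by_term('id', update_id)
        add_record(writer, remote_items[update_id], update=True)
    for add_id in remote_ids - indexed_ids:      # new upstream: add
        add_record(writer, remote_items[add_id], update=False)
    writer.commit()
```
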

config_centillion.json

@@ -1,7 +1,27 @@
 {
     "repositories" : [
+        "dcppc/project-management",
+        "dcppc/nih-demo-meetings",
+        "dcppc/internal",
+        "dcppc/organize",
+        "dcppc/dcppc-bot",
+        "dcppc/full-stacks",
+        "dcppc/design-guidelines-discuss",
+        "dcppc/dcppc-deliverables",
+        "dcppc/dcppc-milestones",
+        "dcppc/crosscut-metadata",
+        "dcppc/lucky-penny",
+        "dcppc/dcppc-workshops",
+        "dcppc/metadata-matrix",
+        "dcppc/data-stewards",
+        "dcppc/dcppc-phase1-demos",
+        "dcppc/apis",
         "dcppc/2018-june-workshop",
         "dcppc/2018-july-workshop",
-        "dcppc/data-stewards"
+        "dcppc/2018-august-workshop",
+        "dcppc/2018-september-workshop",
+        "dcppc/design-guidelines",
+        "dcppc/2018-may-workshop",
+        "dcppc/centillion"
     ]
 }

config_flask.example.py Normal file

@@ -0,0 +1,20 @@
# Location of index file
INDEX_DIR = "search_index"
# oauth client deets
GITHUB_OAUTH_CLIENT_ID = "XXX"
GITHUB_OAUTH_CLIENT_SECRET = "YYY"
GITHUB_TOKEN = "ZZZ"
# More information footer: Repository label
FOOTER_REPO_ORG = "charlesreid1"
FOOTER_REPO_NAME = "centillion"
# Toggle to show Whoosh parsed query
SHOW_PARSED_QUERY=True
TAGLINE = "Search All The Things"
# Flask settings
DEBUG = True
SECRET_KEY = 'WWWWW'


@@ -1,27 +0,0 @@
# Path to markdown files
MARKDOWN_FILES_DIR = "/Users/charles/codes/whoosh/markdown-search/fake-docs/"
# Location of index file
INDEX_DIR = "search_index"
# Command to use when clicking on filepath in search results
EDIT_COMMAND = "view"
# Toggle to show Whoosh parsed query
SHOW_PARSED_QUERY=True
# Toggle to use tags
USE_TAGS=True
# Optional prefix in a markdown file, e.g. "tags: python search markdown tutorial"
TAGS_PREFIX=""
# List of tags that should be ignored
TAGS_TO_IGNORE = "and are what how its not with the"
# Regular expression to select tags, eg tag has to start with alphanumeric followed by at least two alphanumeric or "-" or "."
TAGS_REGEX = r"\b([A-Za-z0-9][A-Za-z0-9-.]{2,})\b"
# Flask settings
DEBUG = True
SECRET_KEY = '42c5a8eda356ca9d9c3ab2d149541e6b91d843fa'

docs/centillion_components.md Normal file

@@ -0,0 +1,22 @@
# Centillion Components
Centillion keeps it simple.
There are two components:
* The `Search` object, which uses whoosh and various
APIs (Github, Google Drive) to build and manage
the search index. The `Search` object also runs all
queries against the search index. (See the
[Centillion Whoosh](centillion_whoosh.md) page
or the `centillion_search.py` file
for details.)
* The Flask app, which uses Jinja templates to present the
user with a minimal web frontend that allows them
to interact with the search engine. (See the
[Centillion Flask](centillion_flask.md) page
or the `centillion.py` file
for details.)

docs/centillion_flask.md Normal file

@@ -0,0 +1,30 @@
# Centillion Flask
## What the flask server does
Flask is a web server framework
that lets developers define the behavior
of specific endpoints,
such as `/hello_world`
(<http://localhost:5000/hello_world>
on a web server running locally).
## Flask server routes
- `/home`
    - if not logged in, this redirects to a "log into github" landing page (not implemented yet)
    - if logged in, this redirects to the search route
- `/search`
    - renders the search template
- `/main_index_update`
    - updates the main index (re-indexes all documents, period)
- `/control_panel`
    - this is the control panel, where you can trigger
      the search index to be re-made (a sketch of these routes follows)
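A minimal sketch of these routes, assuming a module-level `Search` object; the handler names and bodies here are illustrative, not the actual `centillion.py` code:

```python
# Sketch only: the route layout described above, with hypothetical
# handler names. The real centillion.py wires these up differently.
from flask import Flask, redirect, render_template, request, url_for
from centillion_search import Search

app = Flask(__name__)
search = Search("search_index")

@app.route('/home')
def home():
    # real app: show a "log into github" landing page if not logged in
    return redirect(url_for('search_route'))

@app.route('/search')
def search_route():
    query = request.args.get('query', '')
    parsed_query, entries = search.search(query.split())
    return render_template('search.html', query=query,
                           parsed_query=parsed_query, entries=entries)

@app.route('/main_index_update')
def update_index():
    # re-index all docs, period (may require API credentials/config)
    search.update_main_index()
    return redirect(url_for('search_route'))
```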

docs/centillion_whoosh.md Normal file

@@ -0,0 +1,34 @@
# Centillion Whoosh
The `centillion_search.py` file defines a
`Search` class that serves as the backend
for centillion.
## What the Search class does
The `Search` class has two roles:
- create (and update) the search index
- this also requires the `Search` class
to define the schema for storing documents
- run queries against the search index,
and package results up for Flask and Jinja
## Search class functions
The `Search` class defines several functions:
- `open_index()` creates the schema
- `add_issue()`, `add_md()`, and `add_document()` have three different method signatures
  and add different types of documents to the search index
- `update_all_issues()`, `update_all_md()`, and `update_all_documents()` each iterate over items
  and determine whether each item needs to be updated in the search index
- `update_main_index()` - update the entire search index
    - calls all three update_all methods
- `create_search_results()` - package things up for jinja
- `search()` - run the query, pass results to the jinja-packager (see the sketch below)
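As a concrete illustration of the query role, here is a sketch of the query path using whoosh directly; the index directory and field names follow this page's diffs, and the query term is a made-up example:

```python
# Sketch only: running a query against an existing whoosh index,
# mirroring what Search.search() does in centillion_search.py.
from whoosh import index
from whoosh.qparser import MultifieldParser

ix = index.open_dir("search_index")
with ix.searcher() as searcher:
    # default fields when the user doesn't specify any
    query = MultifieldParser(["title", "content"], schema=ix.schema).parse("centillion")
    results = searcher.search(query, terms=False, scored=True, groupedby="kind")
    for r in results:
        print(r['url'], r.score)
```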

docs/images/cp.png Normal file (binary image not shown; 498 KiB)

docs/images/ss.png Normal file (binary image not shown; 355 KiB)

docs/index.md Normal file

@@ -0,0 +1,84 @@
# Centillion
**centillion**: a pan-github-markdown-issues-google-docs search engine.
**a centillion**: a very large number consisting of a 1 with 303 zeros after it.
centillion is 3.03 log-times better than the googol.
## What is centillion
Centillion is a search engine built using [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html),
a Python library for building search engines.
We define the types of documents centillion should index,
and what information to extract from each. Centillion then builds and
updates a search index. That's all done in `centillion_search.py`.
Centillion also provides a simple web frontend for running
queries against the search index. That's done using a Flask server
defined in `centillion.py`.
Centillion keeps it simple.
## Quickstart
Run centillion with a Github API access token set via
environment variable:
```
GITHUB_TOKEN="XXXXXXXX" python centillion.py
```
This will start a Flask server, and you can view the minimal search engine
interface in your browser at <http://localhost:5000>.
## Configuration
### Centillion configuration
`config_centillion.json` defines configuration variables
for centillion - namely, what to index, how, and where.
### Flask configuration
`config_flask.py` defines configuration variables
used by flask, which controls the web frontend
for centillion.
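A sketch of how these two configuration files might be consumed; `get_centillion_config` is the helper imported in `centillion_search.py`'s main block, though its body here is an assumption:

```python
# Sketch only: loading centillion's two config files.
import json
from flask import Flask

def get_centillion_config(filename="config_centillion.json"):
    # assumed implementation of the helper referenced in the diffs
    with open(filename) as f:
        return json.load(f)

config = get_centillion_config()
print(config["repositories"])  # list of "org/repo" strings to index

app = Flask(__name__)
app.config.from_pyfile("config_flask.py")  # DEBUG, SECRET_KEY, TAGLINE, ...
```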
## Control Panel/Rebuilding Search Index
To rebuild the search engine, visit the control panel route (`/control_panel`),
for example at <http://localhost:5000/control_panel>.
This allows you to rebuild the search engine index. The search index
is stored in the `search_index/` directory, and that directory
can be configured with centillion's configuration file.
The diff search index is faster to build, as it only
re-indexes documents that have been added or changed since
the last search index update.
The main search index is slower to build, as it will
re-index everything.
(Cron scripts? Threaded task that runs hourly?)
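One possible answer to that question, sketched as a background thread rather than cron; `update_main_index()` is the `Search` method described on this page, and the hourly interval matches the question above:

```python
# Sketch only: a threaded task that re-indexes hourly.
import threading

def reindex_hourly(search):
    search.update_main_index()  # calls all three update_all_* methods
    # schedule the next run an hour from now
    threading.Timer(3600, reindex_hourly, args=(search,)).start()

# kick it off once at startup: reindex_hourly(search)
```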
## Details
More on the details of how centillion works.
Under the hood, centillion uses flask and whoosh.
Flask builds and runs the web server.
Whoosh handles search requests and management
of the search index.
[Centillion Components](centillion_components.md)
[Centillion Flask](centillion_flask.md)
[Centillion Whoosh](centillion_whoosh.md)


@@ -31,3 +31,4 @@ Stateless


@@ -1,4 +1,4 @@
## work that is done: standalone

**Stage 1: index folder of markdown files** (done)
* See [markdown-search](https://git.charlesreid1.com/charlesreid1/markdown-search.git)
@@ -13,7 +13,7 @@
Needs work:
* <s>More appropriate schema</s>
* Using more features (weights) plus pandoc filters for schema
* Sqlalchemy (and hey waddya know safari books has it covered)
@@ -25,15 +25,16 @@ Needs work:
* Main win here is uncovering metadata/linking/presentation issues

Needs work:
- <s>treat comments and issues as separate objects, fill out separate schema fields
- map out and organize how the schema is updated to make it more flexible
- configuration needs to enable user to specify organization+repos</s>

```plain
{
    "to_index" : [
        "google/google-api-python-client",
        "microsoft/TypeCode",
        "microsoft/api-guidelines"
    ]
}
```
@@ -48,3 +49,4 @@ Needs work:
* Use the google drive api (see simple-simon)
* Main win is more uncovering of metadata issues, identifying
big-picture issues for centillion

docs/workinprogress.md Normal file

@@ -0,0 +1,48 @@
# Components
The components of centillion are as follows:
- Flask application, which creates a Search object and uses it to query the search index
- Search object, which allows you to create/update/search an index
## Routes layout
Centillion flask app routes:
- `/home`
- if not logged in, landing page
- if logged in, redirect to search
- `/search`
- `/main_index_update`
- update main index, all docs period
## Functions layout
Centillion Search class functions:
- `open_index()` creates the schema
- `add_issue()`, `add_md()`, and `add_document()` have three different method signatures
  and add different types of documents to the search index
- `update_all_issues()`, `update_all_md()`, and `update_all_documents()` each iterate over items
  and determine whether each item needs to be updated in the search index
- `update_main_index()` - update the entire search index
    - calls all three update_all methods
- `create_search_results()` - package things up for jinja
- `search()` - run the query, pass results to the jinja-packager
Nice to have but focus on it later:
- update diff search index (what's been added since last index time)
- max index time
## Files layout
Schema definition:
* include a "kind" or "class" field to group objects
* can provide different searches of different collections
* eventually can provide the user with checkboxes
A sketch of such a schema follows.
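Here is what such a schema might look like in whoosh, with a stored `kind` field that queries can filter on; the field names beyond `kind` are assumptions:

```python
# Sketch only: a whoosh schema with a "kind" field for grouping
# documents, and a query filtered to a single kind.
import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

schema = Schema(id=ID(stored=True, unique=True),
                kind=ID(stored=True),      # "gdoc", "issue", "ghfile", "markdown"
                title=TEXT(stored=True),
                content=TEXT(stored=True))

os.makedirs("search_index", exist_ok=True)
ix = index.create_in("search_index", schema)

writer = ix.writer()
writer.add_document(id=u"issue-1", kind=u"issue",
                    title=u"Example issue", content=u"example text")
writer.commit()

# Search only one collection by querying the kind field:
with ix.searcher() as searcher:
    q = QueryParser("kind", schema=ix.schema).parse(u"issue")
    print(len(searcher.search(q, limit=None)))
```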

gdrive_util.py

@@ -29,8 +29,7 @@ class GDrive(object):
): ):
""" """
Set up the Google Drive API instance. Set up the Google Drive API instance.
Factory method: create it and hand it over. Factory method: create it here, hand it over in get_service().
Then we're finished.
""" """
self.credentials_file = credentials_file self.credentials_file = credentials_file
self.client_secret_file = client_secret_file self.client_secret_file = client_secret_file
@@ -40,6 +39,9 @@ class GDrive(object):
self.store = file.Storage(credentials_file) self.store = file.Storage(credentials_file)
def get_service(self): def get_service(self):
"""
Return an instance of the Google Drive API service.
"""
creds = self.store.get() creds = self.store.get()
if not creds or creds.invalid: if not creds or creds.invalid:

groupsio_util.py Normal file

@@ -0,0 +1,382 @@
import requests, os, re
from bs4 import BeautifulSoup


class GroupsIOArchivesCrawler(object):
    """
    This is a Groups.io spider
    designed to crawl the email
    archives of a group.

    credentials (dictionary):
        groupsio_token : api access token
        groupsio_username : username
        groupsio_password : password
    """
    def __init__(self,
                 credentials,
                 group_name):
        # template url for archives page (list of topics)
        self.url = "https://{group}.groups.io/g/{subgroup}/topics"
        self.login_url = "https://groups.io/login"
        self.credentials = credentials
        self.group_name = group_name
        self.crawled_archives = False
        self.archives = None

    def get_archives(self):
        """
        Return a list of dictionaries containing
        information about each email topic in the
        groups.io email archive.

        Call crawl_group_archives() first!
        """
        return self.archives

    def get_subgroups_list(self):
        """
        Use the API to get a list of subgroups.
        """
        subgroups_url = 'https://api.groups.io/v1/getsubgroups'
        key = self.credentials['groupsio_token']
        data = [('group_name', self.group_name),
                ('limit', 100)]
        response = requests.post(subgroups_url,
                                 data=data,
                                 auth=(key, ''))
        response = response.json()
        data = response['data']
        subgroups = {}
        for group in data:
            k = group['id']
            v = re.sub(r'dcppc\+', '', group['name'])
            subgroups[k] = v
        return subgroups
    def crawl_group_archives(self):
        """
        Spider will crawl the email archives of the entire group
        by crawling the email archives of each subgroup.
        """
        subgroups = self.get_subgroups_list()

        # ------------------------------
        # Start by logging in.

        # Create session object to persist session data
        session = requests.Session()

        # Log in to the website
        data = dict(email=self.credentials['groupsio_username'],
                    password=self.credentials['groupsio_password'],
                    timezone='America/Los_Angeles')
        r = session.post(self.login_url,
                         data=data)
        csrf = self.get_csrf(r)

        # ------------------------------
        # For each subgroup, crawl the archives
        # and return a list of dictionaries
        # containing all the email threads.
        for subgroup_id in subgroups.keys():
            self.crawl_subgroup_archives(session,
                                         csrf,
                                         subgroup_id,
                                         subgroups[subgroup_id])

        # Done. archives are now tucked away
        # in the variable self.archives
        #
        # self.archives is a list of dictionaries,
        # with each dictionary containing info about
        # a topic/email thread in a subgroup.
        # ------------------------------
    def crawl_subgroup_archives(self, session, csrf, subgroup_id, subgroup_name):
        """
        This kicks off the process to crawl the entire
        archives of a given subgroup on groups.io.

        For a given subgroup the url is self.url,
        https://{group}.groups.io/g/{subgroup}/topics

        This is the first of a paginated list of topics.
        Procedure is:
        - passed a starting page (or its contents)
        - iterate through all topics via the HTML page elements
        - assemble a bundle of information about each topic:
            - topic title, by, URL, date, content, permalink
        - content filtering:
            - ^From, Reply-To, Date, To, Subject
            - lines containing phone numbers:
                - 9 digits
                - XXX-XXX-XXXX, (XXX) XXX-XXXX
                - XXXXXXXXXX, XXX XXX XXXX
            - ^Work: or (Work) or Work$
            - Home, Cell, Mobile
            - +1 XXX
            - \w@\w
        - while the next button is not greyed out,
          click the next button

        Everything is stored in self.archives,
        a list of dictionaries.
        """
        # Initialize the archives once, so results from
        # earlier subgroups are not wiped out on each call.
        if self.archives is None:
            self.archives = []

        prefix = "https://{group}.groups.io".format(group=self.group_name)
        url = self.url.format(group=self.group_name,
                              subgroup=subgroup_name)

        # ------------------------------
        # Now get the first page
        r = session.get(url)

        # ------------------------------
        # Fencepost algorithm:

        # First page:
        # Extract a list of (title, link) items
        items = self.extract_archive_page_items_(r)

        # Get the next link
        next_url = self.get_next_url_(r)

        # Now add each item to the archive of threads,
        # then find the next button.
        self.add_items_to_archives_(session, subgroup_name, items)
        if next_url is None:
            return
        else:
            full_next_url = prefix + next_url

        # Now click the next button
        # (use the logged-in session for pagination)
        next_request = session.get(full_next_url)
        while next_request.status_code == 200:
            items = self.extract_archive_page_items_(next_request)
            next_url = self.get_next_url_(next_request)
            self.add_items_to_archives_(session, subgroup_name, items)
            if next_url is None:
                return
            else:
                full_next_url = prefix + next_url
                next_request = session.get(full_next_url)
    def add_items_to_archives_(self, session, subgroup_name, items):
        """
        Given a set of items from a list of threads,
        items being title and link,
        get the page and store all info
        in self.archives variable
        (list of dictionaries)
        """
        for (title, link) in items:
            # Get the thread page:
            prefix = "https://{group}.groups.io".format(group=self.group_name)
            full_link = prefix + link
            r = session.get(full_link)
            soup = BeautifulSoup(r.text, 'html.parser')
            # soup contains the entire thread

            # What are we extracting:
            # 1. thread number
            # 2. permalink
            # 3. content/text (filtered)

            # - - - - - - - - - - - - - -
            # 1. topic/thread number:
            # <a rel="nofollow" href="">
            # where link is:
            # https://{group}.groups.io/g/{subgroup}/topic/{topic_id}
            # example topic id: 24209140
            #
            # ugly links are in the form
            # https://dcppc.groups.io/g/{subgroup}/topic/some_text_here/{thread_id}?p=,,,,,1,2,3,,,4,,5
            # split at ?, 0th portion
            # then split at /, last (-1th) portion
            topic_id = link.split('?')[0].split('/')[-1]

            # - - - - - - - - - - - - - - -
            # 2. permalink:
            # - current link is ugly link
            # - permalink is the nice one
            # - topic id is available from the ugly link
            # https://{group}.groups.io/g/{subgroup}/topic/{topic_id}
            permalink_template = "https://{group}.groups.io/g/{subgroup}/topic/{topic_id}"
            permalink = permalink_template.format(
                    group=self.group_name,
                    subgroup=subgroup_name,
                    topic_id=topic_id
            )

            # - - - - - - - - - - - - - - -
            # 3. content:
            # Need to rearrange how we're assembling threads here.
            # This is one thread, no?
            content = []

            subject = soup.find('title').text

            # Extract information for the schema:
            # - permalink for thread (done)
            # - subject/title (done)
            # - original sender email/name (done)
            # - content (done)

            # Groups.io pages have zero CSS classes, which makes everything
            # a giant pain in the neck to interact with. Thanks Groups.io!
            original_sender = ''
            for i, tr in enumerate(soup.find_all('tr', {'class': 'test'})):
                # Every other tr row contains an email.
                if (i + 1) % 2 == 0:
                    # nope, no email here
                    pass
                else:
                    # found an email!
                    # this is a maze, thanks groups.io
                    td = tr.find('td')
                    divrow = td.find('div', {'class': 'row'}).find('div', {'class': 'pull-left'})
                    if (i + 1) == 1:
                        original_sender = divrow.text.strip()
                    for div in td.find_all('div'):
                        if div.has_attr('id'):
                            # purge any signatures
                            for x in div.find_all('div', {'id': 'Signature'}):
                                x.extract()
                            # purge any headers
                            for x in div.find_all('div'):
                                nonos = ['From:', 'Sent:', 'To:', 'Cc:', 'CC:', 'Subject:']
                                for nono in nonos:
                                    if nono in x.text:
                                        x.extract()
                            message_text = div.get_text()

                            # More filtering:
                            # phone numbers
                            message_text = re.sub(r'[0-9]{3}-[0-9]{3}-[0-9]{4}', 'XXX-XXX-XXXX', message_text)
                            # (fixed: braces must not be escaped inside this raw-string pattern)
                            message_text = re.sub(r'[0-9]{10}', 'XXXXXXXXXX', message_text)

                            content.append(message_text)

            full_content = "\n".join(content)

            thread = {
                    'permalink': permalink,
                    'subject': subject,
                    'original_sender': original_sender,
                    'content': full_content
            }

            print('*' * 40)
            for k in thread.keys():
                if k == 'content':
                    pass
                else:
                    print("%s : %s" % (k, thread[k]))
            print('*' * 40)

            self.archives.append(thread)
    def extract_archive_page_items_(self, response):
        """
        (Private method)
        Given a response from a GET request,
        use beautifulsoup to extract all items
        (thread titles and ugly thread links)
        and pass them back in a list.
        """
        soup = BeautifulSoup(response.content, "html.parser")
        rows = soup.find_all('tr', {'class': 'test'})
        if 'rate limited' in soup.text:
            raise Exception("Error: rate limit in place for Groups.io")

        results = []
        for row in rows:
            # We don't care about anything except title and ugly link
            subject = row.find('span', {'class': 'subject'})
            title = subject.get_text()
            link = row.find('a')['href']
            print(title)
            results.append((title, link))
        return results
    def get_next_url_(self, response):
        """
        (Private method)
        Given a response (which is a list of threads),
        find the next button and return its URL.
        If there is no next URL, or the button is disabled,
        return None.
        """
        soup = BeautifulSoup(response.text, 'html.parser')
        chevron = soup.find('i', {'class': 'fa-chevron-right'})
        try:
            if '#' in chevron.parent['href']:
                # empty link, abort
                return None
        except AttributeError:
            # no chevron link found at all, abort
            return None

        if chevron.parent.parent.has_attr('class') and 'disabled' in chevron.parent.parent['class']:
            # no next link, abort
            return None

        return chevron.parent['href']
    def get_csrf(self, resp):
        """
        Find the CSRF token embedded in the subgroup page
        """
        soup = BeautifulSoup(resp.text, 'html.parser')
        csrf = ''
        for i in soup.find_all('input'):
            # Note that i.name is different from i['name']:
            # the first is the actual tag name,
            # the second is the attribute name="xyz".
            # Use .get() so inputs without a name don't raise KeyError.
            if i.get('name') == 'csrf':
                csrf = i['value']

        if csrf == '':
            err = "ERROR: Could not find csrf token on page."
            raise Exception(err)

        return csrf
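For reference, a hypothetical usage sketch of this crawler; the credential values are placeholders:

```python
# Sketch only: driving GroupsIOArchivesCrawler as defined above.
from groupsio_util import GroupsIOArchivesCrawler

credentials = {
    'groupsio_token'    : 'XXX',
    'groupsio_username' : 'user@example.com',
    'groupsio_password' : 'YYY',
}
spider = GroupsIOArchivesCrawler(credentials, group_name='dcppc')

# Log in, walk every subgroup's paginated topic lists,
# then collect the accumulated threads:
spider.crawl_group_archives()
threads = spider.get_archives()
# each thread: {'permalink', 'subject', 'original_sender', 'content'}
```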

install_pandoc.sh Executable file

@@ -0,0 +1,19 @@
#!/bin/bash
#
# for ubuntu
if [ "$(id -u)" != "0" ]; then
    echo ""
    echo ""
    echo "This script should be run as root."
    echo ""
    echo ""
    exit 1;
fi

OFILE="/tmp/pandoc.deb"
curl -L https://github.com/jgm/pandoc/releases/download/2.2.2.1/pandoc-2.2.2.1-1-amd64.deb -o ${OFILE}
dpkg -i ${OFILE}
rm -f ${OFILE}

mkdocs-material-dib Submodule added at c3dd912f3c

requirements.txt

@@ -9,3 +9,5 @@ PyGithub>=1.39
pypandoc>=1.4
requests>=2.19
pandoc>=1.0
flask-dance>=1.0.0
beautifulsoup4>=4.6

static/bootstrap.min.css vendored Normal file (diff suppressed: lines too long)

static/bootstrap.min.js vendored Normal file (diff suppressed: lines too long)

static/centillion_black.png Normal file (binary image not shown; 29 KiB)

static/centillion_white.png Normal file (binary image not shown; 25 KiB)

(binary image not shown; 30 KiB)

static/jquery.min.js vendored Normal file (diff suppressed: lines too long)

static/style.css

@@ -1,3 +1,45 @@
span.badge {
    vertical-align: text-bottom;
}

a.badgelinks, a.badgelinks:hover {
    color: #fff;
    text-decoration: none;
}

div.list-group {
    border: 1px solid rgba(86,61,124,.2);
}

li.list-group-item {
    position: relative;
    display: block;
    /*padding: 20px 10px;*/
    margin-bottom: -1px;
    background-color: #f8f8f8;
    border: 1px solid #ddd;
}

li.search-group-item {
    position: relative;
    display: block;
    padding: 0px;
    margin-bottom: -1px;
    background-color: #fff;
    border: 1px solid #ddd;
}

div.url {
    background-color: rgba(86,61,124,.15);
    padding: 8px;
}

/***************************/

body {
    font-family: sans-serif;
}
@@ -56,7 +98,7 @@ table {
    overflow: hidden;
}

.info, .last-searches {
    color: gray;
    font-size: 12px;
    font-family: Arial, serif;

templates/controlpanel.html Executable file

@@ -0,0 +1,108 @@
{% extends "layout.html" %}
{% block body %}
{% with messages = get_flashed_messages() %}
{% if messages %}
<div class="container">
<div class="alert alert-success alert-dismissible">
<a href="#" class="close" data-dismiss="alert" aria-label="close">&times;</a>
<ul class=flashes>
{% for message in messages %}
<li>{{ message }}</li>
{% endfor %}
</ul>
</div>
</div>
{% endif %}
{% endwith %}
<div class="container">
<div class="row">
<div class="col-md-12">
<center>
<a href="{{ url_for('search')}}?query=&fields=">
<img src="{{ url_for('static', filename='centillion_white.png') }}">
</a>
{% if config['TAGLINE'] %}
<h2><a href="{{ url_for('search')}}?query=&fields=">
{{config['TAGLINE']}}
</a></h2>
{% endif %}
</center>
</div>
</div>
{% if config['zzzTAGLINE'] %}
<div class="row">
<div class="col12sm">
<center>
<h2><a href="{{ url_for('search')}}?query=&fields=">
{{config['TAGLINE']}}
</a></h2>
</center>
</div>
</div>
{% endif %}
</div>
<hr />
<div class="container">
<div class="row">
{# update main search index #}
<div class="panel panel-danger">
<div class="panel-heading">
<h3 class="panel-title">
Update Main Search Index
</h3>
</div>
<div class="panel-body">
<div class="container-fluid">
<div class="row">
<div class="col-md-12">
<p class="panel-text">Re-index <i>every</i> document in the
remote collection in the search index. <b>Warning: this operation may take a while.</b>
<p/> <p>
<a href="{{ url_for('update_index') }}" class="btn btn-large btn-danger">Update Main Index</a>
<p/>
</div>
</div>
</div>
</div>
</div>
{# update diff search index #}
<div class="panel panel-danger">
<div class="panel-heading">
<h3 class="panel-title">
Update Diff Search Index
</h3>
</div>
<div class="panel-body">
<div class="container-fluid">
<div class="row">
<div class="col-md-12">
<p class="panel-text">Diff search index only re-indexes documents created after the last
search index update. <b>Not currently implemented.</b>
<p/> <p>
<a href="#" class="btn btn-large disabled btn-danger">Update Diff Index</a>
<p/>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
{% endblock %}

templates/layout.html

@@ -1,10 +1,12 @@
<!doctype html>
<title>Centillion Search Engine</title>
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='style.css') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='github-markdown.css') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='bootstrap.min.css') }}">
<script src="{{ url_for('static', filename='jquery.min.js') }}"></script>
<script src="{{ url_for('static', filename='bootstrap.min.js') }}"></script>
<div>
    {% for message in get_flashed_messages() %}
    <div class="flash">{{ message }}</div>
    {% endfor %}
    {% block body %}{% endblock %}
</div>

templates/search.html

@@ -1,62 +1,188 @@
{% extends "layout.html" %} {% extends "layout.html" %}
{% block body %} {% block body %}
<h1><a href="{{ url_for('search')}}?query=&fields=">Search directory: {{ config.MARKDOWN_FILES_DIR }}</a></h1>
<a class="index" href="{{ url_for('update_index')}}">[update index]</a>
<a class="index" href="{{ url_for('update_index')}}?rebuild=True">[rebuild index]</a> <div class="container">
<form action="{{ url_for('search') }}" name="search">
<input type="text" name="query" value="{{ query }}"> {#
<input type="submit" value="search"> banner image
<a href="{{ url_for('search')}}?query=&fields=">[clear]</a> #}
</form> <div class="row">
<table cellspacing="3"> <div class="col12sm">
{% if directories %} <center>
<tr> <a href="{{ url_for('search')}}?query=&fields=">
<td class="directories-cloud">File directories:&nbsp <img src="{{ url_for('static', filename='centillion_white.png') }}">
</a>
{#
need a tag line
#}
{% if config['TAGLINE'] %}
<h2><a href="{{ url_for('search')}}?query=&fields=">
{{config['TAGLINE']}}
</a></h2>
{% endif %}
</center>
</div>
</div>
</div>
<div class="container">
<div class="row">
<div class="col-xs-12">
<center>
<form action="{{ url_for('search') }}" name="search">
<input type="text" name="query" value="{{ query }}"> <br />
<button type="submit" style="font-size: 20px; padding: 10px; padding-left: 50px; padding-right: 50px;"
value="search" class="btn btn-primary">Search</button>
<br />
<a href="{{ url_for('search')}}?query=&fields=">[clear all results]</a>
</form>
</center>
</div>
</div>
</div>
<div class="container">
<div class="row">
{% if directories %}
<div class="col-xs-12 info directories-cloud">
<b>File directories:</b>
{% for d in directories %} {% for d in directories %}
<a href="{{url_for('search')}}?query={{d|trim}}&fields=filename">{{d|trim}}</a> <a href="{{url_for('search')}}?query={{d|trim}}&fields=filename">{{d|trim}}</a>
{% endfor %} {% endfor %}
</td> </div>
</tr> {% endif %}
{% endif %}
{% if config['SHOW_PARSED_QUERY']%}
<tr>
<td class="info">Parsed query: {{ parsed_query }}</td>
</tr>
{% endif %}
<tr>
<td class="info">FOUND {{ entries | length }} results of {{total}} documents</td>
</tr>
{% for e in entries %} <ul class="list-group">
<tr>
<td class="search-result"> {% if config['SHOW_PARSED_QUERY'] and parsed_query %}
<!-- <li class="list-group-item">
<div class="path"><a href='{{ url_for("open_file")}}?path={{e.path|urlencode}}&query={{query}}&fields={{fields}}'>{{e.path}}</a>score: {{'%d' % e.score}}</div> <div class="container-fluid">
--> <div class="row">
<div class="url"> <div class="col-xs-12 info">
{% if e.is_comment %} <b>Parsed query:</b> {{ parsed_query }}
<b>Comment</b> <a href='{{e.url}}'>(comment link)</a> </div>
on issue <a href='{{e.issue_url}}'>{{e.issue_title}}</a> </div>
in repo <a href='{{e.repo_url}}'>dcppc/{{e.repo_name}}</a> </div>
<br /> </li>
{% else %} {% endif %}
<b>Issue</b> <a href='{{e.issue_url}}'>{{e.issue_title}}</a>
in repo <a href='{{e.repo_url}}'>dcppc/{{e.repo_name}}</a> {% if parsed_query %}
<br /> <li class="list-group-item">
{% endif %} <div class="container-fluid">
score: {{'%d' % e.score}} <div class="row">
</div> <div class="col-xs-12 info">
<div class="markdown-body">{{ e.content_highlight|safe}}</div> <b>Found:</b> <span class="badge">{{entries|length}}</span> results
</td> out of <span class="badge">{{totals["total"]}}</span> total items indexed
</tr> </div>
{% endfor %} </div>
</table> </div>
<div class="last-searches">Last searches: <br/> </li>
{% for s in last_searches %} {% endif %}
<span><a href="{{url_for('search')}}?{{s}}">{{s}}</a></span>
{% endfor %} <li class="list-group-item">
<div class="container-fluid">
<div class="row">
<div class="col-xs-12 info">
<b>Indexing:</b> <span
class="badge">{{totals["gdoc"]}}</span> Google Documents,
<span class="badge">{{totals["issue"]}}</span> Github issues,
<span class="badge">{{totals["ghfile"]}}</span> Github files,
<span class="badge">{{totals["markdown"]}}</span> Github markdown files.
</div>
</div>
</div>
</li>
</ul>
</div>
</div> </div>
<p>
More info can be found in the <a href="https://github.com/BernhardWenzel/markdown-search">README.md file</a> <div class="container">
</p> <div class="row">
<ul class="list-group">
{% for e in entries %}
<li class="search-group-item">
<div class="url">
{% if e.kind=="gdoc" %}
{% if e.mimetype=="" %}
<b>Google Document:</b>
<a href='{{e.url}}'>{{e.title}}</a>
(Owner: {{e.owner_name}}, {{e.owner_email}})<br />
<b>Document Type</b>: {{e.mimetype}}
{% else %}
<b>Google Drive:</b>
<a href='{{e.url}}'>{{e.title}}</a>
(Owner: {{e.owner_name}}, {{e.owner_email}})
{% endif %}
{% elif e.kind=="issue" %}
<b>Github Issue:</b>
<a href='{{e.url}}'>{{e.title}}</a>
{% if e.github_user %}
opened by <a href='https://github.com/{{e.github_user}}'>@{{e.github_user}}</a>
{% endif %}
<br/>
<b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
{% elif e.kind=="markdown" %}
<b>Github Markdown:</b>
<a href='{{e.url}}'>{{e.title}}</a>
<br/>
<b>Repository:</b> <a href='{{e.repo_url}}'>{{e.repo_name}}</a>
{% else %}
<b>Item:</b> (<a href='{{e.url}}'>link</a>)
{% endif %}
<br />
Score: {{'%d' % e.score}}
</div>
<div class="markdown-body">
{% if e.content_highlight %}
{{ e.content_highlight|safe}}
{% else %}
<p>(A preview of this document is not available.)</p>
{% endif %}
</div>
</li>
{% endfor %}
</ul>
</div>
</div>
<div class="container">
<div class="row">
<ul class="list-group">
{% if config['FOOTER_REPO_NAME'] %}
{% if config['FOOTER_REPO_ORG'] %}
<li class="list-group-item">
<div class="container-fluid">
<div class="row">
<div class="col-xs-12 info">
More information about {{config['FOOTER_REPO_NAME']}} can be found
in the <a href="https://github.com/{{config['FOOTER_REPO_ORG']}}/{{config['FOOTER_REPO_NAME']}}">{{config['FOOTER_REPO_ORG']}}/{{config['FOOTER_REPO_NAME']}}</a>
repository on Github.
</div>
</div>
</div>
</li>
{% endif %}
{% endif %}
</ul>
</div>
</div>
{% endblock %} {% endblock %}