Centillion is a search engine that is 3.03 log-times better than a googol.
Charles Reid de796880c5 Merge branch 'master' of github.com:charlesreid1/centillion 6 years ago
docs Move images, resize images, update image markdown in readme 6 years ago
mkdocs-material-dib@c3dd912f3c add mkdocs-material-dib submodule 6 years ago
static add results-handling for markdown files 6 years ago
templates fix search template 6 years ago
.gitignore Update '.gitignore' 6 years ago
.gitmodules add mkdocs-material-dib submodule 6 years ago
LICENSE add init license + readme 6 years ago
Readme.md Update Readme.md 6 years ago
Schema.md update schema names 6 years ago
Todo.md update documentation, add new docs pages on components/flask/whoosh 6 years ago
auth.py add auth checking and gdrive util from cheeseburger-search. clarify docs. 6 years ago
centillion.py keep going with spider idea 6 years ago
centillion_prepare.py update installation preparation step 6 years ago
centillion_search.py locked out by rate limit, but otherwise successful in indexing so far. 6 years ago
config_centillion.json make it valid json 6 years ago
config_flask.example.py update config_flask.example.py to strip dc info 6 years ago
gdrive_util.py improve docstrings on gdrive_util.py too 6 years ago
get_centillion_config.py organize the config files for flask/centillion separately. create factory method to get config file. re-use between centillion_search test method and centillion live flask method 6 years ago
groupsio_util.py locked out by rate limit, but otherwise successful in indexing so far. 6 years ago
install_pandoc.sh update installation preparation step 6 years ago
requirements.txt adding calls to index groupsio emails 6 years ago

Readme.md

Centillion

centillion: a pan-github-markdown-issues-google-docs search engine.

a centillion: a very large number consisting of a 1 with 303 zeros after it.

one centillion is 3.03 log-times better than a googol.
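The "3.03 log-times" figure is the ratio of the two exponents. A quick check, assuming the usual definition of a googol as a 1 followed by 100 zeros (the helper name is made up for this example):

```python
# A centillion is a 1 followed by 303 zeros; a googol is a 1 followed by 100 zeros.
centillion = 10 ** 303
googol = 10 ** 100


def zeros_after_one(n):
    """Count the digits after the leading 1, i.e. log10 of an exact power of ten."""
    return len(str(n)) - 1


ratio = zeros_after_one(centillion) / zeros_after_one(googol)
print(ratio)  # 3.03
```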

Screen shot of centillion

what is it

Centillion (https://github.com/dcppc/centillion) is a search engine that can index three kinds of collections: Google Documents, Github issues, and Markdown files in Github repos.

We define the types of documents centillion should index, which information to index for each type, and how. Centillion then builds and updates a search index. All of that is done in centillion_search.py.
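As a rough illustration of "define what to index, then build and update an index", here is a minimal pure-Python sketch. It is not the code in centillion_search.py, which uses whoosh. The `INDEXED_FIELDS` table, the `TinyIndex` class, and the field names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical plan: for each kind of document, which fields get indexed.
INDEXED_FIELDS = {
    "gdoc": ["name", "content"],
    "issue": ["title", "content"],
    "markdown": ["name", "content"],
}


class TinyIndex:
    """Toy inverted index mapping each lowercase word to a set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.words_by_doc = {}

    def update(self, doc_id, kind, fields):
        """Add a document, replacing any earlier version stored under doc_id."""
        self.delete(doc_id)
        words = set()
        for field in INDEXED_FIELDS[kind]:
            words.update(fields.get(field, "").lower().split())
        for word in words:
            self.postings[word].add(doc_id)
        self.words_by_doc[doc_id] = words

    def delete(self, doc_id):
        """Remove a document from the index if it is present."""
        for word in self.words_by_doc.pop(doc_id, set()):
            self.postings[word].discard(doc_id)

    def search(self, word):
        """Return the sorted ids of documents containing the word."""
        return sorted(self.postings.get(word.lower(), set()))


index = TinyIndex()
index.update("issue-1", "issue", {"title": "Crawler bug", "content": "rate limit hit"})
index.update("readme", "markdown", {"name": "Readme.md", "content": "search engine"})
index.update("issue-1", "issue", {"title": "Crawler bug", "content": "fixed now"})
print(index.search("crawler"))  # ['issue-1']
print(index.search("limit"))    # [] because the old version of issue-1 was replaced
```

Replacing a document on update, rather than appending to it, is what keeps stale text out of the results when an issue or file changes.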

The centillion also provides a simple web frontend for running queries against the search index. That's done using a Flask server defined in centillion.py.

The centillion keeps it simple.

authentication layer

Centillion lives behind a Github authentication layer, implemented with flask-dance. When you first visit the site it will ask you to authenticate with Github so that it can verify you have permission to access the site.

technologies

Centillion is a Python program built using whoosh (a search engine library). It indexes the full text of docx files in Google Documents, and only the filenames of non-docx files. The full text of each issue and its comments is indexed, and results are grouped by issue. Centillion requires Google Drive and Github OAuth apps. Once you provide credentials to Flask, you're all set to go.
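The two indexing rules above can be sketched in plain Python. This is an illustration only, not centillion's whoosh-based code. The function names `text_to_index` and `group_by_issue`, the shape of the hits, and the sample data are hypothetical.

```python
def text_to_index(filename, body):
    """Return the full text for docx files and only the filename for everything else."""
    if filename.lower().endswith(".docx"):
        return body
    return filename


def group_by_issue(hits):
    """Group (issue, snippet) hits so each issue appears once, in first-seen order."""
    grouped = {}
    for issue, snippet in hits:
        grouped.setdefault(issue, []).append(snippet)
    return grouped


print(text_to_index("notes.docx", "meeting minutes"))  # meeting minutes
print(text_to_index("budget.xlsx", "cells"))           # budget.xlsx

hits = [
    ("org/repo#1", "issue body mentions search"),
    ("org/repo#2", "another issue about search"),
    ("org/repo#1", "a comment on the first issue"),
]
for issue, snippets in group_by_issue(hits).items():
    print(issue, len(snippets))
# org/repo#1 2
# org/repo#2 1
```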

control panel

There's also a control panel at https://search.nihdatacommons.us/control_panel that allows you to rebuild the search index from scratch (the Google Drive indexing takes a while).

Screen shot of centillion control panel

quickstart (with Github auth)

Start by creating a Github OAuth application. Get the public and private application keys (the client ID and the client secret) from the Github application's page. You will also need a Github access token, in addition to the application keys.

When you create the application, set the callback URL to /login/github/authorized, as in:

https://<url>/login/github/authorized
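A small helper makes the shape of the callback URL explicit. The function name and the example hosts are hypothetical; only the /login/github/authorized path comes from the text above.

```python
CALLBACK_PATH = "/login/github/authorized"


def callback_url(base_url):
    """Join the site's base URL and the callback path without doubling slashes."""
    return base_url.rstrip("/") + CALLBACK_PATH


print(callback_url("https://search.example.com/"))
# https://search.example.com/login/github/authorized
```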

Edit the Flask configuration file config_flask.py and set the public and private application keys.
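A sketch of what config_flask.py might contain. The key names below are the ones Flask-Dance documents for its Github blueprint; the names centillion actually expects may differ, so treat config_flask.example.py in the repo as the authority. All values are placeholders.

```python
# config_flask.py (sketch: key names are an assumption, values are placeholders)
GITHUB_OAUTH_CLIENT_ID = "your-github-client-id"          # public application key
GITHUB_OAUTH_CLIENT_SECRET = "your-github-client-secret"  # private application key
SECRET_KEY = "a-long-random-string"  # Flask uses this to sign the session cookie
```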

Now run centillion:

python centillion.py

or if you used http instead of https:

OAUTHLIB_INSECURE_TRANSPORT="true" python centillion.py

This will start a Flask server, and you can view the minimal search engine interface in your browser at http://<ip>:5000.
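Which of the two commands above to use depends only on the URL scheme of your site. A hedged sketch of that decision follows. `launch_env` is a hypothetical helper, not part of centillion, while OAUTHLIB_INSECURE_TRANSPORT is the environment variable shown in the command above.

```python
import os
from urllib.parse import urlsplit


def launch_env(site_url, base_env=None):
    """Return an environment for launching centillion.py at the given site URL.

    OAUTHLIB_INSECURE_TRANSPORT is set only when the site is served over plain http.
    """
    env = dict(os.environ if base_env is None else base_env)
    if urlsplit(site_url).scheme == "http":
        env["OAUTHLIB_INSECURE_TRANSPORT"] = "true"
    return env


env = launch_env("http://localhost:5000", base_env={})
print(env)  # {'OAUTHLIB_INSECURE_TRANSPORT': 'true'}
```

With `base_env` left at its default, the result is a copy of the current environment and could be passed as the `env` argument of `subprocess.run` when starting the server from a script.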

troubleshooting

If Github treats your callback URL as HTTP even though you registered an HTTPS address, and everything else seems fine, try deleting the Github OAuth app and creating a new one.