Build Your Own Search Engine

(Wiby Install Guide)

Overview
Installation
Controlling
Scaling

Overview

Wiby is a search engine for the World Wide Web. The source code is now free as of July 8, 2022 under the GPLv2 license. I have been longing for this day! You can watch a quick demo here.

It includes a web interface allowing guardians to control where, how far, and how often it crawls websites and follows hyperlinks. The search index is stored inside of a MySQL full-text index.

Fast queries are maintained by concurrently reading different sections of the index across multiple replication servers or across duplicate server connections, returning a list of top results from each connection, then searching the combined list to ensure correct ordering. Replicas that fail are automatically excluded; new replicas are easy to include. As new pages are crawled, they are stored randomly across the index, ensuring each replica can obtain relevant results.
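
As a rough illustration of the merge step described above (this is not the actual core application code), the sketch below simulates three replicas that each return their top matches from a different section of the index as "score<TAB>url" lines. It queries them concurrently as background jobs, then re-sorts the combined list and keeps the best results. The data, the scores, the function name query_replica, and the lim value of 3 are all invented for the example.

```shell
#!/bin/sh
# Simulated merge of per-replica top results. All data here is made up.
lim=3                      # results per page (hypothetical value for the demo)
work=$(mktemp -d)

# Stand-in for one replica searching its own section of the index.
# A real replica would run a full-text query; here we just emit fixed rows.
query_replica() {
    case $1 in
        1) printf '%s\t%s\n' 40 http://a.example/ 10 http://b.example/ ;;
        2) printf '%s\t%s\n' 55 http://c.example/ 35 http://d.example/ ;;
        3) printf '%s\t%s\n' 20 http://e.example/ 5 http://f.example/ ;;
    esac
}

# Query every replica at the same time, one background job each.
for r in 1 2 3; do
    query_replica "$r" > "$work/replica$r.tsv" &
done
wait                       # block until all replicas have answered

# Combine the partial lists, sort by score (highest first), keep the top lim.
merged=$(cat "$work"/replica*.tsv | sort -t "$(printf '\t')" -k1,1nr | head -n "$lim")
printf '%s\n' "$merged"
rm -rf "$work"
```

Each replica only has to rank its own slice, so the final sort runs over a short combined list rather than the whole index.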

The search engine is not meant to index the entire web and then sort it with a ranking algorithm. It prefers to seed its index through human submissions made by guests, or by the guardian(s) of the search engine.

The software is designed for anyone with some extra computers (even a Pi) to host their own search engine catering to whatever niche matters to them. The search engine includes a simple API for meta search engines to harness.

I hope this will enable anyone with a love of computers to cheaply build and maintain a search engine of their own. I hope it can cultivate free and independent search engines, ensuring accessibility of ideas and information across the World Wide Web.


       Web Traffic
            |
            |
+-----------+-----------+
| Reverse Proxy (nginx) |
+-----------+-----------+
            |
            |
+-----------+-----------+
|  Wiby Core Server(s)  |+-----------------+----------------------------+
|(Golang or PHP version)|                  |                            |
+-----------+-----------+       +----------+----------+       +---------+---------+
            |                   |Replication Databases|+-----+|Replication Tracker|
            |                   +----------+----------+       +-------------------+
+-----------+-----------+                  |
|    Primary Database   |+-----------------+
|   (MySQL or MariaDB)  |
+----+-------------+----+
     |             |  
     |             |  
+----+-----+  +----+----+
|   Web    |  | Refresh |
|Crawler(s)|  |Scheduler|
+----------+  +---------+


Installation

I can only provide manual install instructions at this time.

Note that while the software is functionally complete, it is still in beta. Anticipate that some bugs will be discovered now that the source is released. Ensure that you isolate the search engine from your other important services, and if you are running parts of it out of your home, keep the servers on a separate VLAN. Make sure this VLAN cannot access your router or switch interface. Continue this practice even when the software reaches "1.0".

If you have created a "LAMP", or rather a "LEMP" server before, this isn't much more complicated. If you've never done that, I suggest you find a "LEMP" tutorial.

Build a LEMP server

DigitalOcean tutorials are usually pretty good, so here is a link to one for Ubuntu 20 and Ubuntu 22.

For the sake of simplicity, assume all instructions are for Ubuntu 20 or 22. If you are on a different distro, adjust the install steps to suit it.

If you don't have a physical server, you can rent computing space by looking for a "VPS provider". This virtual computer will be your reverse proxy, and if you want, it can host everything else too.

Install the following additional packages:

apt install build-essential php-gd libcurl4-openssl-dev libmysqlclient-dev mysql-server golang git

For Ubuntu 20:
go get -u github.com/go-sql-driver/mysql

For Ubuntu 22 or recent Golang versions, run the following inside the go folder of the Wiby source (the folder containing core.go) once you have downloaded it in the next step:
go mod init mysql
go get github.com/go-sql-driver/mysql

Get Wiby Source Files

Download the source directly from Wiby here, or from GitHub. The source is released under the GPLv2 license. Copy the source files for Wiby to your server.

Compile the crawler (cr), refresh scheduler (rs), replication tracker (rt):

gcc cr.c -o cr -I/usr/include/mysql -lmysqlclient -lcurl -std=c99 -O3
gcc rs.c -o rs -I/usr/include/mysql -lmysqlclient -std=c99 -O3
gcc rt.c -o rt -I/usr/include/mysql -lmysqlclient -std=c99 -O3
If you get any compile errors, it is likely due to the path of the mysql or libcurl header files. This could happen if you are not using Ubuntu 20. You might have to locate the correct path for curl.h, easy.h, and mysql.h.
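
If the compiler cannot find mysql.h, a small helper like the one below can locate it so you know which directory to pass with -I. This is only a sketch: the function name header_dir and the two search roots are assumptions, and your distro may keep headers elsewhere. If the mysql_config helper from the MySQL client development package is installed, 'mysql_config --cflags' reports the include flag directly.

```shell
#!/bin/sh
# Print the directory that contains a header file, searching the given roots.
# Returns non-zero if the header is not found.
header_dir() {
    name=$1
    shift
    path=$(find "$@" -name "$name" 2>/dev/null | head -n 1)
    [ -n "$path" ] && dirname "$path"
}

# The search roots below are common defaults, not a complete list.
if dir=$(header_dir mysql.h /usr/include /usr/local/include); then
    echo "mysql.h found, compile with: -I$dir"
else
    echo "mysql.h not found; is libmysqlclient-dev installed?"
fi
```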

Build the core server application:

Inside the go folder:

go build core.go
go build 1core.go
If you are just starting out, you can use '1core'. If you are going to set up replication servers or you are using a computer with a lot of available cores, you can use 'core', but make sure to read the scaling section. You can also use index.php in the root of the www directory and not use the Go version at all, though the PHP version is used mainly for prototyping.

Build the database:

Make sure these lines are inside of /etc/mysql/my.cnf, then restart mysql:
[client]
default-character-set=utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
max_connections = 400
ft_min_word_len=2
sql_mode = "NO_BACKSLASH_ESCAPES"
character-set-server = utf8mb4
collation-server = utf8mb4_0900_ai_ci
skip-character-set-client-handshake
default-authentication-plugin=mysql_native_password
wait_timeout = 800

#memory use settings, you should adjust this based on your hardware
innodb_buffer_pool_size = 1342177280
innodb_buffer_pool_instances = 2

Log in to MySQL and type:
create database wiby;
create database wibytemp;
Import the wiby and wibytemp database files:
mysql -u root -p wiby < wiby.sql
mysql -u root -p wibytemp < wibytemp.sql
Log in to MySQL, create the following accounts and give them the correct access:
create user 'guest'@'localhost' identified by 'qwer';
create user 'approver'@'localhost' identified by 'foobar';
create user 'crawler'@'localhost' identified by 'seekout';
use wiby;
grant select on accounts to 'approver'@'localhost';
grant select on reviewqueue to 'approver'@'localhost';
grant insert on indexqueue to 'approver'@'localhost';
grant delete on reviewqueue to 'approver'@'localhost';
grant update on reviewqueue to 'approver'@'localhost';
grant select on indexqueue to 'crawler'@'localhost';
grant insert on windex to 'crawler'@'localhost';
grant insert on indexqueue to 'crawler'@'localhost';
grant update on windex to 'crawler'@'localhost';
grant delete on indexqueue to 'crawler'@'localhost';
grant delete on windex to 'crawler'@'localhost';
grant select on windex to 'crawler'@'localhost';
grant insert on reviewqueue to 'crawler'@'localhost';
grant select on windex to 'guest'@'localhost';
grant insert on reviewqueue to 'guest'@'localhost';
grant insert on feedback to 'guest'@'localhost';
grant select on feedback to 'approver'@'localhost';
grant delete on feedback to 'approver'@'localhost';
grant insert on graveyard to 'approver'@'localhost';
grant update on graveyard to 'approver'@'localhost';
grant delete on graveyard to 'approver'@'localhost';
grant select on graveyard to 'approver'@'localhost';
grant update on accounts to 'approver'@'localhost';
grant insert on accounts to 'approver'@'localhost';
grant delete on accounts to 'approver'@'localhost';
use wibytemp;
grant select on titlecheck to 'crawler'@'localhost';
grant insert on titlecheck to 'crawler'@'localhost';
grant delete on titlecheck to 'crawler'@'localhost';
grant select on rejected to 'approver'@'localhost';
grant insert on rejected to 'approver'@'localhost';
grant delete on rejected to 'approver'@'localhost';
grant select on reserve_id to 'crawler'@'localhost';
grant insert on reserve_id to 'crawler'@'localhost';
grant delete on reserve_id to 'crawler'@'localhost';
FLUSH PRIVILEGES;

Copy the HTML files and PHP scripts to your web server

Copy the contents of the html directory into the nginx html directory (/var/www/html).

Configure nginx for Wiby

In /etc/nginx/, create a directory called 'phpcache', and another one called 'cache'.
Instead of going through every detail, I have provided a template inside /etc/nginx/sites-available/ of the source code that you can try out as your default nginx config.

You should learn nginx configuration on your own; this template is just to assist. If you are using only the PHP version, comment out all "core app" location entries to revert Wiby search to the PHP-only version.
Make sure ssl_certificate and ssl_certificate_key have the path for your SSL files instead of the example paths. If you don't want to use SSL, just remove the server {} configuration for SSL connections (on port 443). Also the example file references php7.4-fpm.sock, so if you are using a different version remember to update that as well (such as php8.1-fpm.sock on Ubuntu 22).

Start the Refresh Scheduler

This program (rs) will make sure all pages indexed are refreshed at least once per week (or sooner depending on how you assign updates to an individual website). You may want to run this on startup; the easiest way to set that up is with a cron job (crontab -e). Run './rs -h' to get more parameters.
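
As a minimal sketch of the cron approach, the helper below prints an @reboot line that you can paste into the editor opened by 'crontab -e'. The directory path, the log file name, the 60 second delay (to give MySQL time to start) and the function name rs_cron_entry are all assumptions for illustration; adjust them to your setup.

```shell
#!/bin/sh
# Print a crontab line that starts the refresh scheduler at boot.
# $1 is the directory holding the compiled rs binary (hypothetical path below).
rs_cron_entry() {
    # The sleep gives MySQL time to come up first; 60 seconds is a guess.
    printf '@reboot sleep 60 && cd %s && ./rs >> rs.log 2>&1\n' "$1"
}

# Review the printed line, then paste it into the editor opened by: crontab -e
rs_cron_entry /home/wiby/source
```

The same pattern should work for starting the core server on boot, with the binary name and directory changed accordingly.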

Start the Crawler

It is best to run the crawler in a screen session so that you can monitor its output. You can have more than one crawler running as long as you keep them in separate directories, include a symlink to the same robots folder, and also set the correct parameters on each. To view the parameters, type './cr -h'. Without any parameters set, you can only run one crawler (which is probably all you need anyway). If necessary, you can change the database connection from 'localhost' to a different IP from inside cr.c, then rebuild.
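
If you do want more than one crawler, the directory layout described above could be prepared with something like the following. It is a sketch under assumptions: the function name add_crawler_dir and the crawlerN directory naming are made up, it copies only the cr binary, and it does not set any crawler parameters (check './cr -h' for those).

```shell
#!/bin/sh
# Create a numbered working directory for an extra crawler.
# $1 = absolute path of the directory holding the compiled cr binary
# $2 = crawler number (used only to name the new directory)
add_crawler_dir() {
    base=$1
    dir=$base/crawler$2
    mkdir -p "$base/robots" "$dir" || return 1
    cp "$base/cr" "$dir/cr" || return 1
    # Every crawler shares one robots folder through a symlink.
    ln -sfn "$base/robots" "$dir/robots"
}

# Example (hypothetical path), then start the crawler in a named screen session:
#   add_crawler_dir /home/wiby/source 2
#   cd /home/wiby/source/crawler2 && screen -S crawler2 ./cr
```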

Note that you may need to change the crawler's user-agent if you have issues indexing some websites. Pages that fail to index are noted inside of abandoned.txt.

Make sure the robots folder exists. All robots.txt files are stored in the robots folder. They are downloaded once and then referenced from that folder on future updates. Clear this folder every few weeks to ensure robots.txt files get refreshed from time to time. You can turn off checking for robots.txt files by commenting out the line calling the "checkrobots" function inside of cr.c.
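
One way to automate the periodic clean-up is to delete only the cached robots.txt files that are older than a chosen age, rather than emptying the folder by hand. The sketch below assumes GNU find as shipped with Ubuntu; the function name prune_robots and the 21 day default are arbitrary choices, not something the crawler requires.

```shell
#!/bin/sh
# Delete cached robots.txt files older than a given number of days.
# $1 = robots folder, $2 = age in days (21 is an arbitrary default)
prune_robots() {
    find "$1" -type f -mtime +"${2:-21}" -delete
}

# Example (hypothetical path); could be run weekly from cron:
#   prune_robots /home/wiby/source/robots 21
```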

If crawling through hyperlinks on a page, the following file types are accepted: html, htm, txt, php, asp. Links containing parameters are ignored. These limitations do not apply to pages directly submitted by people.
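
To make the rule concrete, here is an approximation of that filter as a shell function. It is not taken from cr.c: the function name link_is_crawlable is invented, the match is case-sensitive, and the text above does not say how links without a file extension are treated, so this sketch simply rejects them. Check cr.c for the real behaviour.

```shell
#!/bin/sh
# Return 0 if a discovered link passes the rules described above, 1 otherwise.
link_is_crawlable() {
    case $1 in
        *\?*) return 1 ;;     # links containing parameters are ignored
    esac
    case $1 in
        *.html|*.htm|*.txt|*.php|*.asp) return 0 ;;
    esac
    return 1                  # anything else, including extensionless paths (assumption)
}

for url in 'http://example.com/a.html' 'http://example.com/b.php?id=1' 'http://example.com/c.pdf'; do
    if link_is_crawlable "$url"; then
        echo "crawl  $url"
    else
        echo "skip   $url"
    fi
done
```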

Start the core server

If you are just starting out, '1core' or the php version is easiest to start with. Use 'core' if you want to scale computer resources as the index grows or if you have a lot of available CPU cores. Make sure to read the scaling section. You can run the core server on startup with a cron job.

Set Administrator Password for the Web Interface

There is no default web login, you will have to set this manually the first time:
Rename the /html/hash folder to something private.

Edit html/private_folder_name/hashmake.php and change 'secretpassword' to your preferred admin password. 

Access /private_folder_name/hashmake.php from your browser and copy down the hash.

After you have copied it down, delete hashmake.php from your web server folder so that the hash cannot be discovered.
Log in to MySQL and create the account:
use wiby;
INSERT INTO accounts (name,hash,level) VALUES('your_username','your_password_hash','admin');
You can now access /accounts/ from your browser and log in to create and manage all accounts for administrators and guardians of the search engine.

admin - Can access all web forms for the search engine and use the /accounts/ page to create and delete accounts.

guardian - The main role of a guardian is to gatekeep the index of the search engine. Can access all forms except for /readf/, and can only use the /accounts/ page to change their own password.



Controlling the Search Engine


There are several forms to control the search engine. There is no central form linking everything together, just a collection of different folders that you can rename if you want.

/submit/

This public facing form allows users of the search engine to submit websites for indexing, provided they comply with your submission criteria, which you can modify on /submit/form.html.php.

/accounts/

This is the account management page. Admins have options to create, lock, change account type, delete, and reset passwords. Guardians have the option to change their password.

/review/

This is the most important form, intended for you to verify website submissions meet your criteria. Up to 10 pages are assigned to each guardian or admin that accesses the form. The pages will remain assigned to that account for up to 30 minutes. From here you can control how much, how deep, and how often the web crawler will access each submission. Here is an example of the available options for a website submission:

url_that_was_submitted
[Worksafe] [Surprise] [Skip] [Bury] [Deny] [Updatable]
[Crawl: Depth Pages Type Enforce Rules Repeat]

Explanation of the above options:

Worksafe - Indicates if the website is safe for work. Set by the user who submitted the website, however you can change it based on your determination.

Surprise - Checking this box will put it in the "surprise me" feature, where users get redirected to random websites when they click "surprise me". Note that this feature won't show NSFW websites even if they are set to surprise.

Skip - Selecting this option will skip indexing the page and it will reappear on the review form after you submit the rest of the pages for crawling.

Bury - Selecting this will move the page to a graveyard (/grave/), a holding place with the same options as /review/ for websites that might have stopped working but that you suspect may come back online. The crawler detects dead pages automatically and sends them back into review: when you click on a link in review and see a 404, you can be assured the crawler sent it back after it failed two update cycles. This also happens if the title of the page changes. The crawler will only do this for pages directly submitted by people. This courtesy is not given to websites that are automatically crawled but then fail to work later on. For those sites, after two failed update cycles, the page will be removed.

Deny - Select this to drop the page from being indexed. If the page does not meet your submission criteria, this would be the option to remove it from the queue.

Updatable - The update cycle for the web crawler to return to the page. This only applies to pages submitted by people; pages found by link crawling always go on a 1 week update cycle.

------------------- Crawl -------------------
The options listed below control how the crawler indexes hyperlinks on the website. By default, the crawler does not index any hyperlinks; it will only index the page that is submitted.

Depth - How many layers of links to crawl through. You must set at least a depth of 1 if you want to crawl any hyperlinks. Setting a negative value = no limit. Be careful about that.

Pages - How many pages to crawl on each link layer (depth). They will be randomly selected. You must set at least 1 if you want to crawl any hyperlinks. Setting a negative value = no limit. Be careful about that.

Type - Indicates if you want to only crawl links local to the website, or links external to the website, or both.

Enforce rules - This is a blunt tool that checks if pages have more than two scripts and/or CSS files. If the limit is exceeded, the page will not be indexed. I don't use it and prefer to manually check based on more forgiving criteria.

Repeat - While the crawler will always return to update each page in the index, it won't crawl through hyperlinks again unless you tell it to. Even so, it only crawls hyperlinks on the page at a depth of 1 when repeat is selected.

/ban/

You can delete or ban individual URLs from the index with this form. It's pretty simple as I don't use it much. You can't delete an entire domain with it; for that you can build your own query in the MySQL console.

/bulksubmit/

Admins/Guardians can import a list of URLs into the review queue with this form.

/feedback/

Users can submit feedback for you with this form.

/readf/

Where admin accounts can read feedback submitted by users.

/grave/

It has the same features as /review/. Websites that you don't yet want to index but don't want to forget about are stored inside /grave/ by selecting 'bury' from inside /review/. For pages submitted directly by people only, the web crawler will move 404'd pages, or pages where the title has changed, back to /review/ after two update cycles where the page does not return to normal. So after a few weeks you may notice dead pages appearing in /review/. You can decide to drop the page or to bury it, which moves it to /grave/. The page might go back to normal at some point, and you can check /grave/ to see if it resurrects.

/insert/

This was the first form created back in late 2016 to populate the Wiby index and see if the search engine could even work as a proof of concept. It was meant to manually enter pages into the index as no crawler existed yet. It is still useful if you want to manually index a page that refuses to permit the crawler to access it. In that case, set updatable to 0.

/tags/

If you want to force a website to appear at the top rank for specific single-word queries (like "weather"), you can do so by tagging the words to the target URL.

/json/

This is the JSON API developers can use to connect their services to the search engine. Instructions are located at that path.

Additional Notes

If you need to stop the web crawler in a situation where it was accidentally queued to index an unlimited number of pages, first stop the crawler program, then truncate the indexqueue table (truncate indexqueue;), then restart the crawler.



Scaling the Search Engine


You can help ensure sub-second search queries as your index grows by building MySQL replica servers on a local network close to each other. Run the core application AND replication tracker (rt) on one or more replica servers and point your reverse proxy to use it. Edit the servers.csv file for rt to indicate all available replica servers. If you have a machine with a huge amount of resources and cores, entering multiple duplicate entries for the same server inside servers.csv (e.g. one for each core) also works.

The core application checks the replication tracker (rt) output to determine if any replicas are online. It will initiate a connection on those replicas and task each one to search a different section of the index, drastically speeding up searches, especially for multi-word queries. By default, single-word queries will not initiate multiple connections across replicas. To enable that on single-word queries, comment out the IF statement on line 373 and rebuild the core application.

The number of available replicas must divide evenly into the search results per page limit (lim), OR, the search results per page limit must divide evenly into the number of available replicas. If there is an excess of available replicas such that they do not divide evenly, those will remain in sync but will not be used for searches unless another replica fails. You can adjust the search results per page limit (lim) to a different value (default 12), and then rebuild to make excess available replicas divide evenly (if necessary).
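
The divisibility rule can be checked with a little arithmetic. The sketch below is one reading of the rule, not code from the core application: it assumes the largest replica count that satisfies the rule is the one that gets used. The function name usable_replicas is made up.

```shell
#!/bin/sh
# Largest replica count (<= available) that satisfies the rule above:
# the count divides lim evenly, or lim divides the count evenly.
# $1 = replicas available, $2 = search results per page limit (lim)
usable_replicas() {
    available=$1
    lim=$2
    n=$available
    while [ "$n" -gt 0 ]; do
        if [ $((lim % n)) -eq 0 ] || [ $((n % lim)) -eq 0 ]; then
            echo "$n"
            return 0
        fi
        n=$((n - 1))
    done
    echo 0
}

# With the default lim of 12:
for r in 4 5 7 13 24; do
    echo "$r replicas available, lim=12: $(usable_replicas "$r" 12) used"
done
```

Under this reading, 5 replicas with lim=12 would leave one replica idle as a spare, while changing lim to 10 and rebuilding would let all 5 take part.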

The reverse proxy and replica servers can be connected through a VPN such as WireGuard or OpenVPN; however, the IPs for servers.csv should be the local IPs for the LAN the replicas are all connected on. Here is a tutorial for setting up MySQL replicas.

Full instructions below:

On the primary server add these lines to my.cnf under [mysqld] but only once you have a VPN to reach your replicas. Replace my.vpn.ip with your own.
#setting up replication below
bind-address = 127.0.0.1,my.vpn.ip
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = wiby
binlog_format = mixed
In MySQL on the primary server, create a user for replica access:
create user 'slave_user'@'%' identified by 'd0gemuchw0w';
GRANT REPLICATION SLAVE ON *.* TO 'slave_user'@'%';
FLUSH PRIVILEGES;
On the replica server, ensure the following my.cnf configuration, set the server-id as a unique id for each replica, then restart mysql:
[client]
default-character-set=utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
max_connections = 400
ft_min_word_len=2
sql_mode = "NO_BACKSLASH_ESCAPES"
#character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_0900_ai_ci
skip-character-set-client-handshake
default-authentication-plugin=mysql_native_password
wait_timeout = 800

#memory use settings, you should adjust this based on your hardware
innodb_buffer_pool_size = 1342177280
innodb_buffer_pool_instances = 2

#setting up replication below
bind-address = 0.0.0.0
server-id = 2
relay_log_info_repository = TABLE
relay_log_recovery = ON
sync_binlog=1
Make sure only VPN and VLAN addresses can reach your replicas. The bind address of 0.0.0.0 can be replaced with '127.0.0.1,replica.vpn.ip' which is safer but also more crash prone if the VPN address is not available on startup.

To export the database to the replica server, on the primary server, stop the web crawler and hide any web forms that can accept new data, then open MySQL and do the following.
USE wiby;
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;

+------------------+----------+--------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000055 | 15871269 | wiby         |                  |                   |
+------------------+----------+--------------+------------------+-------------------+
Keep the above session window open (or run it in a screen session).
Copy down the information from that table. In a separate session window, export the database:
mysqldump -u root -p wiby > wiby.sql
Once you have exported the database and recorded what you need, you can unlock the tables, and resume as normal. On the session window displaying the master status:
UNLOCK TABLES;
You can now close that window if you want.

On the replica server, log in to MySQL and create the database:
CREATE DATABASE wiby;
EXIT;
Import the database:
mysql -u root -p wiby < wiby.sql
Log in to MySQL and type the following, replacing primary_server_ip with your primary server's address, and MASTER_LOG_FILE and MASTER_LOG_POS with the values from the table:
CHANGE MASTER TO MASTER_HOST='primary_server_ip',MASTER_USER='slave_user', MASTER_PASSWORD='d0gemuchw0w', MASTER_LOG_FILE='mysql-bin.000055', MASTER_LOG_POS=15871269;
START SLAVE;
To verify that the replica is synchronized, type the following on the replica in MySQL:
SHOW SLAVE STATUS\G
Make sure that:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
In MySQL on the replica, create the accounts required for the replication tracker and core application:
use wiby;
create user 'remote_guest'@'%' identified by 'd0gemuchw0w';
grant select on windex to 'remote_guest'@'%';
create user 'guest'@'localhost' identified by 'qwer';
grant select on windex to 'guest'@'localhost';
FLUSH PRIVILEGES;

Load Balancing

You should run the core application on one or more of your replicas and have nginx send traffic to it; this way you can reduce the burden on your VPS. The replication tracker (rt) must run on the same server and in the same directory as the core application (not required for 1core).

Add the replica server's VPN address/port to upstream remote_core {} from the default config for nginx (see the provided example template). You can use the VPS as a backup instead by adding 'backup' to its address (e.g. server 127.0.0.1:8080 backup;).