2. Installation

This section provides a comprehensive guidelines on how to install the PilotScope Python package in the ML side, along with the necessary patches for PostgreSQL and Spark in the DB side.

The PilotScope package is a crucial tool that offers a range of frequently used functions, including database connection, data exchange, and the integration of AI algorithms into the database. Additionally, we have developed patches for both PostgreSQL and Spark to enhance their capabilities and support a wider array of functionalities.

Required Software Versions:

Python: 3.8
PostgreSQL: 13.1
Apache Spark: 3.3.2

Here, we provide two installation methods: installing from a docker image and installing from scratch. You can load test databases and check the installation referring to the section Load Test Databases.

2.1. Install from Pre-built Docker Image

We provide a pre-built docker image which contains the pre-installed PilotScope package and the modified PostgreSQL 13.1 and Spark 3.3.2. Here are some key details about this Docker image:

The provided image is based on Ubuntu 18.04.
The PilotScope Core in ML side is located in the /home/pilotscope/PilotScopeCore directory.
The source code of modified PostgreSQL is located in the /home/pilotscope/PilotScopePostgreSQL directory.
- PostgreSQL is installed in /home/pilotscope/pgsql, with data storage located at /home/pilotscope/pg_data.
The source code of modified Spark is located in the /home/pilotscope/PilotScopeSpark/spark-3.3.2 directory.
The pre-installed miniconda is located in the /home/pilotscope/miniconda3 directory with a pre-built Python environment named pilotscope.
Two system users are pre-created: root (password: root) and pilotscope (password: pilotscope). User pilotscope owns the postgresql database system, and the user must own the server process.
Some necessary tools such as vim, wget, git, ssh, etc. are pre-installed.

2.1.1. Start a Container from the Image

You can follow these steps to start a container from our docker image.

2.1.1.1. Step 1: Build an Image

To acquire the Docker image, you can download the Dockerfile from the following link and build the image locally.

# Navigate to the directory where you want to store the Dockerfile, such as ~/
cd ~/
# Download the Dockerfile
wget https://raw.githubusercontent.com/alibaba/pilotscope/master/Dockerfile

You can execute the following command to build the image for PostgreSQL.

docker build -t pilotscope:latest \
  --build-arg enable_postgresql=true \
  --build-arg enable_spark=false \
  .

You can execute the following command to build the image for Spark.

docker build -t pilotscope:latest \
  --build-arg enable_postgresql=true \
  --build-arg enable_spark=true \
  .

PilotScope only support PostgreSQL as datasource of Spark. Thus, you should set enable_postgresql=true and enable_spark=true at the same time.

2.1.1.2. Step 2: Start a Container from the Image

You can use the following command to initiate a Docker container:

docker run -it  --name pilotscope_test  --shm-size 5gb  --cap-add sys_ptrace  -p 54323:5432 -p 54023:22 -d pilotscope /bin/bash

This command boots up a container named pilotscope_test. Port mappings are configured as follows:

(Optional) The container’s port 5432 is mapped to host machine’s port 54323, allowing access to the database inside the container via host port 54323.
(Recommended) The container’s port 22 (SSH) is mapped to host machine’s port 54023, enabling SSH access to the container from the host via port 54023.

You can map more ports according to your requirements.

2.1.1.3. Step 3: Log into the Container

Access the container using the following command:

# Enter the Docker container
docker exec -it "container_id" /bin/bash

Replace container_id with the actual container ID from the previous step.

Once inside the container, enable the SSH service for remote connections:

# Start the SSH service
sudo service ssh start

2.1.2. Start PostgreSQL in the Container

To begin using PostgreSQL, switch to the pilotscope user and initiate the PostgreSQL service:

# Switch to the pilotscope user
su pilotscope

# Navigate to PostgreSQL binary path and start the service
cd $PG_PATH/bin
./pg_ctl start -D $PG_DATA

2.1.3. Start Spark in the Container

Pilotscope only support PostgreSQL as datasource of Spark. In order to enable all Spark functionalities, you should start the PostgreSQL referring to the section Start PostgreSQL in the Container.

After completing above steps, the PostgreSQL/Spark will be ready inside the container. You can use a suitable IDE to open the PilotScopeCore project in the /home/pilotscope/PilotScopeCore directory and start developing your algorithms or try different functionalities by executing the test files in test_pilotscope.

2.1.4. Get Started in the Container

To switch to the PilotScope development environment, run the following command in the container:

# Switch to PilotScope development environment
conda activate pilotscope

After completing above steps, the PilotScope/PostgreSQL/Spark will be ready inside the container.

You could leverage the PilotScope Python package to establish a connection to the started database. This enables you to access and utilize the extensive range of functionalities, especially the AI4DB algorithms, offered by PilotScope.

We recommend you to open the PilotScopeCore project in the /home/pilotscope/PilotScopeCore directory with a suitable IDE.

From now on, you can get started by following these steps:

Load test databases by referring to the section Load Test Databases or running the following command:

  # load stats_tiny
  python -m unittest test_pilotscope.test_pg.skip_test_dataset.TestDataset.test_load_stats_tiny

  # load imdb_tiny 
  python -m unittest test_pilotscope.test_pg.skip_test_dataset.TestDataset.test_load_imdb_tiny

Learn how the PilotScope system works when a SQL query is submitted into PilotScope referring to the chapter 3. System Pipeline.
Learn the components of PilotScope Core in ML side by running test_pilotscope files in the /home/pilotscope/PilotScopeCore directory and referring to the chapter 4. Core Components (ML Side).
Run the algorithm examples referring to test_example_algorithms files in the /home/pilotscope/PilotScopeCore directory and the chapter 5. Example.

2.2. Install All Components from Scratch

If you do not prefer to use Docker, you can also install all components independently. Please refer to the following sections.

2.2.1. Prepare the Environment and Configure the PilotScope User

Before installing PilotScope, you should ensure that you have prepared the environment and configured the PilotScope user.

2.2.1.1. Prepare the Environment

Ensure that your system has the required libraries for Database and PilotScope. You can install these libraries on a new Ubuntu Linux system with the following command:

apt-get update && apt-get install -y wget git bzip2 vim gcc openssh-server sudo

2.2.1.2. Configure the PilotScope User

Some databases are not allowed to be controlled by root user. Thus, if you are root user, you should create a new user for PilotScope following these commands:

# Add a new user (execute as root)
sudo adduser pilotscope

usermod -aG sudo pilotscope

# Change to the pilotscope user
su pilotscope

2.2.2. Install PilotScope Package

First, let us obtain the PilotScope repository by cloning it from GitHub:

# Clone the repository
git clone --branch master https://github.com/alibaba/pilotscope.git PilotScopeCore

# Navigate to the PilotScope directory
cd PilotScopeCore

# Install the minimal package using pip
pip install -e .

We have included a variety of examples on deploying AI4DB algorithms into the databases using PilotScope. To utilize these examples, additional dependencies must be installed. You can do so with this command:

pip install -e '.[dev]'

To verify the installation of the PilotScope package, you can run the following sample Python code:

import pilotscope

If you did not receive any error messages, you have installed PilotScope package successfully.

Next, you will need to configure PilotScope to interface with the PostgreSQL database. For those who prefer using SparkSQL, please refer to the Spark section for detailed instructions.

2.2.3. Install PostgreSQL

We also outline a comprehensive procedures for installing the PilotScope patch onto PostgreSQL from the scratch. Users could experience more customized installation through this method.

2.2.3.1. Step 1: Prepare the Environment

Ensure that your system has the required libraries for PostgreSQL. You can install these libraries on a new Ubuntu Linux system with the following command:

su root

apt-get update && apt-get install -y build-essential libreadline-dev zlib1g-dev bison flex gdb libssl-dev libbz2-dev libsqlite3-dev llvm libncurses5-dev libncursesw5-dev xz-utils libffi-dev liblzma-dev

su pilotscope

2.2.3.2. Step 2: Install PostgreSQL

Download and extract modified PostgreSQL-13.1 with PilotScope. This is the only version supported by PilotScope at now stage.

# Navigate to the directory where you want to store the PostgreSQL source code, such as ~/
cd ~/

# Download PostgreSQL 13.1 with PilotScope
git clone --branch pilotscope-postgresql https://github.com/alibaba/pilotscope.git PilotScopePostgreSQL

Set the PG_PATH environment variable to the desired installation path for PostgreSQL and PG_DATA to the path where the database data will be stored. Notice that the installation process will fail if this step is skipped. It is recommended to create a new directory specifically for this purpose.

# Example of setting PG_PATH. A path example is: echo export PG_PATH=~/pgsql/ >> ~/.bashrc 
echo 'export PG_PATH=/path/to/install/PostgreSQL' >> ~/.bashrc

# Example of setting PG_DATA. A path example is "~/pg_data/"
echo 'export PG_DATA=/path/to/store/database/data' >> ~/.bashrc

# Updating system paths
echo 'export PATH=$PG_PATH/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$PG_PATH/lib:$LD_LIBRARY_PATH' >> ~/.bashrc

# Apply the changes
source ~/.bashrc

Compile and install PostgreSQL to the specified path using the following commands:

# Navigate to the PostgreSQL source code directory
cd ~/PilotScopePostgreSQL/

# Run the configuration script with required options
./configure --prefix=$PG_PATH --enable-depend --enable-cassert --enable-debug CFLAGS="-ggdb -O0"

# Compile and install PostgreSQL
make && make install

2.2.3.3. Step 3: Install the Extension

After installing PostgreSQL, you need to install some important extensions for enabling all functionalities of PilotScope.

# Install the PilotScope extension
sh install_extension.sh

# recompile and install PostgreSQL
make && make install

2.2.3.4. Step 4: Configure Database Connection Settings

Initialize the database.

# Initialize the database
cd $PG_PATH/bin
./initdb -D $PG_DATA

To allow remote connections, edit the postgresql.conf and pg_hba.conf files located in the PG_DATA directory.

cd $PG_DATA 

# Allow listening on all IP addresses
echo "listen_addresses = '*'" >> postgresql.conf

# Configure client authentication
echo "host all all all md5" >> pg_hba.conf

# Enable the PilotScope extension pg_hint_plan
echo "shared_preload_libraries = 'pg_hint_plan'" >> postgresql.conf

Start the PostgreSQL database:

cd $PG_PATH/bin

# Start PostgreSQL
./pg_ctl -D $PG_DATA start

Set the password for the database user pilotscope to enable remote client connections. If you are non-root user and did not create user pilotscope, replace pilotscope with your username.

cd $PG_PATH/bin
./psql -d template1 -c "create database pilotscope;"
./psql -c "ALTER USER pilotscope PASSWORD 'pilotscope';"

2.2.3.5. Verify the connection to PostgreSQL via psql

To ensure the PilotScope patch onto PostgreSQL was installed correctly and the database service is running, you can connect to the PostgreSQL database server via psql by typing the following command in the terminal.

cd $PG_PATH/bin
./psql -d template1 

If you can access the postgres prompt like following and observe the word PilotScope in the prompt, Pilotscope in the DB side has been installed and started correctly.

psql (13.1)
PilotScope
Type "help" for help.

template1=#

Finally, you can quit the postgres prompt by using the \q command:

template1=# \q

Congratulations! PostgreSQL is fully configured and operational.

You can get started referring to the section Get Started

2.2.3.6. Update the Latest PilotScope Patch for PostgreSQL

Besides, if you want to update the latest modified PostgreSQL, you should pull the latest code and recompile and install PostgreSQL.

# Navigate to the PostgreSQL source code directory
cd ~/PilotScopePostgreSQL/

# Pull the latest code
git pull

# Recompile and install PostgreSQL.
make clean && make install

2.2.3.7. A Guideline for Installing PilotScope Patch into Your PostgreSQL Database.

Currently, PilotScope only support PostgreSQL 13.1 and have integrated some popular extension such as pg_hint_plan, what-if index. Thus, we do not recommend you to deploy PilotScope patch into your PostgreSQL database directly. In contrast, we recommend you to install our modified PostgreSQL 13.1 and migrate your origin data into the new PostgreSQL. This is valid if your origin PostgreSQL version is 13.1.

Specifically, since the data is in your origin PostgreSQL data path, you should skip the command ./initdb -D $PG_DATA when installing our modified PostgreSQL 13.1. Then, you can specify the origin PostgreSQL data path when executing the command ./pg_ctl -D $PG_DATA start. We recommend backing up the data before performing this operation.

2.2.4. Install Spark

We outline a comprehensive procedure for installing the PilotScope patch onto Spark from the scratch. Users could experience more customized installation through this method.

2.2.4.1. Step 1: Download PilotScope Patch for Spark

Download the PilotScope patch as follows:

# Navigate to the directory where you want to store the PilotScopeSpark repository, such as ~/
cd ~/

# Clone PilotScopeSpark repository
git clone --branch pilotscope-spark https://github.com/alibaba/pilotscope.git PilotScopeSpark
cd PilotScopeSpark/

2.2.4.2. Step 2: Download Spark and Apply Patch for Spark

Download and extract Spark 3.3.2. This is the only version supported by PilotScope at now stage.

# Download the Spark 3.3.2
wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2.tgz
tar -xzvf spark-3.3.2.tgz

Now there is a spark-3.3.2 directory which contains the source code of Spark in the same directory as apply_patch.sh . Apply the PilotScope patch into Spark as follows:

bash apply_patch.sh pilotscope_spark.patch

apply_patch.sh will make an initial git commit on the original code when first run, and then apply the patch to this source code of Spark.

2.2.4.3. Step 3: Configuring Java Environment Variables

Then, download JDK and configure the relevant environment variables. You can use the JDK provided by us (jdk-8u202-linux-x64.tar.gz), or download your own JDK.

# download and extract the JDK to the current directory
wget https://github.com/WoodyBryant/JDK/releases/download/v2/jdk-8u202-linux-x64.tar.gz
tar -xzvf jdk-8u202-linux-x64.tar.gz

# move the JDK to the a directory where you have the access permission, e.g. home directory.
mv jdk1.8.0_202 ~/ 

# Set necessary environment variables, JAVA_HOME and PATH
# for example:  echo 'export JAVA_HOME=~/jdk1.8.0_202/' >> ~/.bashrc if your save jdk in "~/jdk1.8.0_202" path
echo 'export JAVA_HOME=/path/to/your/jdk' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

2.2.4.4. Step 4: Compile and Build PilotScope and Spark

Compile and build Spark.

cd spark-3.3.2
./build/mvn -DskipTests clean package

2.2.4.5. Step 5: Install PySpark

Finally, install PySpark for allowing a Python code to control Spark.

You should use the Python environment used by PilotScope Core to install PySpark.

# This is provided by Spark 
cd python
python3 setup.py install

From now on, the Spark is fully configured and operational. You could leverage the PilotScope Python package to establish a connection to the started database. This enables you to access and utilize the extensive range of functionalities, especially the AI4DB algorithms, offered by PilotScope.

2.2.4.6. Update the Latest PilotScope Patch for Spark

If you want to update the latest pilotscope patch for Spark , you should pull the latest patch and recompile and install Spark.

Note that, apply_patch.sh will roll back the Spark into original Spark and then apply the lastest patch. Thus, if you have modified Spark codes for your requirements, you should not to execute the command directly.

cd PilotScopeSpark/

bash apply_patch.sh patch_to_update.patch

This script will update the code for Spark. Then follow steps 5-6 to complete the update, just like the first installing.

2.2.5. Get Started

After completing above steps, the PilotScope/PostgreSQL/Spark will be ready.

You could leverage the PilotScope Python package to establish a connection to the started database. This enables you to access and utilize the extensive range of functionalities, especially the AI4DB algorithms, offered by PilotScope.

Get started by following these steps:

Load test databases referring to the section Load Test Databases
Learn how the PilotScope system works when a SQL query is submitted into PilotScope referring to the chapter 3. System Pipeline.
Learn the components of PilotScope Core in ML side referring to the chapter 4. Core Components (ML Side).
Learn the algorithm examples referring to the chapter 5. Example.

2.3. Load Test Databases

Regardless of the installation method used from the above, you can follow these steps to load two test databases into PostgreSQL using PilotScope.

Please ensure that:

your database and pilotscope environment are in the same machine with the same user.
modify the following Python code according to your actual database configuration.

Here you will create a connection to PostgreSQL server at localhost, port 5432 using the username pilotscope and the password pilotscope and load stats_tiny database and imdb_tiny database into PostgreSQL.

from pilotscope.Dataset.ImdbTinyDataset import ImdbTinyDataset
from pilotscope.Dataset.StatsTinyDataset import StatsTinyDataset
from pilotscope.PilotConfig import PostgreSQLConfig
from pilotscope.PilotEnum import DatabaseEnum

# configure database settings 
db_port = "5432"  # database server port (default: "5432")
db_user = "pilotscope"  # database user name (default: "pilotscope")
db_user_pwd = "pilotscope"  # database user password (default: "pilotscope")
pg_bin_path = "/path/to/install/PostgreSQL/bin"  # database bin path, i.e. $PG_PATH/bin (default: None)
pg_data_path = "/path/to/store/database/data"  # database data path, i.e. $PG_DATA  (default: None)

db_config = PostgreSQLConfig(db_port=db_port, db_user=db_user, db_user_pwd=db_user_pwd)
# Please ensure that your database and pilotscope Core are in the same machine with the same user.
db_config.enable_deep_control_local(pg_bin_path, pg_data_path)

# load stats_tiny
ds = StatsTinyDataset(DatabaseEnum.POSTGRESQL, created_db_name="stats_tiny")
ds.load_to_db(db_config)

# load imdb_tiny
ds = ImdbTinyDataset(DatabaseEnum.POSTGRESQL, created_db_name="imdb_tiny")
ds.load_to_db(db_config)

If you can observe the following messages, these databases have been loaded successfully.

...
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
...

Congratulations! You have successfully installed complete PilotScope components.