2. Installation
This section provides a comprehensive guidelines on how to install the PilotScope Python package in the ML side, along with the necessary patches for PostgreSQL and Spark in the DB side.
The PilotScope package is a crucial tool that offers a range of frequently used functions, including database connection, data exchange, and the integration of AI algorithms into the database. Additionally, we have developed patches for both PostgreSQL and Spark to enhance their capabilities and support a wider array of functionalities.
Required Software Versions:
Python: 3.8
PostgreSQL: 13.1
Apache Spark: 3.3.2
Here, we provide two installation methods: installing from a docker image and installing from scratch. You can load test databases and check the installation referring to the section Load Test Databases.
2.1. Install from Pre-built Docker Image
We provide a pre-built docker image which contains the pre-installed PilotScope package and the modified PostgreSQL 13.1 and Spark 3.3.2. Here are some key details about this Docker image:
The provided image is based on Ubuntu 18.04.
The PilotScope Core in ML side is located in the
/home/pilotscope/PilotScopeCore
directory.The source code of modified PostgreSQL is located in the
/home/pilotscope/PilotScopePostgreSQL
directory.PostgreSQL is installed in
/home/pilotscope/pgsql
, with data storage located at/home/pilotscope/pg_data
.
The source code of modified Spark is located in the
/home/pilotscope/PilotScopeSpark/spark-3.3.2
directory.The pre-installed miniconda is located in the
/home/pilotscope/miniconda3
directory with a pre-built Python environment namedpilotscope
.Two system users are pre-created:
root
(password: root) andpilotscope
(password: pilotscope). Userpilotscope
owns the postgresql database system, and the user must own the server process.Some necessary tools such as
vim
,wget
,git
,ssh
, etc. are pre-installed.
2.1.1. Start a Container from the Image
You can follow these steps to start a container from our docker image.
2.1.1.1. Step 1: Build an Image
To acquire the Docker image, you can download the Dockerfile
from the following link and build the image locally.
# Navigate to the directory where you want to store the Dockerfile, such as ~/
cd ~/
# Download the Dockerfile
wget https://raw.githubusercontent.com/alibaba/pilotscope/master/Dockerfile
You can execute the following command to build the image for PostgreSQL.
docker build -t pilotscope:latest \
--build-arg enable_postgresql=true \
--build-arg enable_spark=false \
.
You can execute the following command to build the image for Spark.
docker build -t pilotscope:latest \
--build-arg enable_postgresql=true \
--build-arg enable_spark=true \
.
PilotScope only support PostgreSQL as datasource of Spark. Thus, you should set enable_postgresql=true
and enable_spark=true
at the same time.
2.1.1.2. Step 2: Start a Container from the Image
You can use the following command to initiate a Docker container:
docker run -it --name pilotscope_test --shm-size 5gb --cap-add sys_ptrace -p 54323:5432 -p 54023:22 -d pilotscope /bin/bash
This command boots up a container named pilotscope_test
. Port mappings are configured as follows:
(Optional) The container’s port
5432
is mapped to host machine’s port54323
, allowing access to the database inside the container via host port 54323.(Recommended) The container’s port
22
(SSH) is mapped to host machine’s port54023
, enabling SSH access to the container from the host via port 54023.
You can map more ports according to your requirements.
2.1.1.3. Step 3: Log into the Container
Access the container using the following command:
# Enter the Docker container
docker exec -it "container_id" /bin/bash
Replace container_id
with the actual container ID from the previous step.
Once inside the container, enable the SSH service for remote connections:
# Start the SSH service
sudo service ssh start
2.1.2. Start PostgreSQL in the Container
To begin using PostgreSQL, switch to the pilotscope user and initiate the PostgreSQL service:
# Switch to the pilotscope user
su pilotscope
# Navigate to PostgreSQL binary path and start the service
cd $PG_PATH/bin
./pg_ctl start -D $PG_DATA
2.1.3. Start Spark in the Container
Pilotscope only support PostgreSQL as datasource of Spark. In order to enable all Spark functionalities, you should start the PostgreSQL referring to the section Start PostgreSQL in the Container.
After completing above steps, the PostgreSQL/Spark will be ready inside the container.
You can use a suitable IDE to open the PilotScopeCore project in the /home/pilotscope/PilotScopeCore
directory and start developing your algorithms or try different functionalities by executing the test files
in test_pilotscope
.
2.1.4. Get Started in the Container
To switch to the PilotScope development environment, run the following command in the container:
# Switch to PilotScope development environment
conda activate pilotscope
After completing above steps, the PilotScope/PostgreSQL/Spark will be ready inside the container.
You could leverage the PilotScope Python package to establish a connection to the started database. This enables you to access and utilize the extensive range of functionalities, especially the AI4DB algorithms, offered by PilotScope.
We recommend you to open the PilotScopeCore project in the /home/pilotscope/PilotScopeCore
directory with a suitable IDE.
From now on, you can get started by following these steps:
Load test databases by referring to the section Load Test Databases or running the following command:
# load stats_tiny
python -m unittest test_pilotscope.test_pg.skip_test_dataset.TestDataset.test_load_stats_tiny
# load imdb_tiny
python -m unittest test_pilotscope.test_pg.skip_test_dataset.TestDataset.test_load_imdb_tiny
Learn how the PilotScope system works when a SQL query is submitted into PilotScope referring to the chapter 3. System Pipeline.
Learn the components of PilotScope Core in ML side by running
test_pilotscope
files in the/home/pilotscope/PilotScopeCore
directory and referring to the chapter 4. Core Components (ML Side).Run the algorithm examples referring to
test_example_algorithms
files in the/home/pilotscope/PilotScopeCore
directory and the chapter 5. Example.
2.2. Install All Components from Scratch
If you do not prefer to use Docker, you can also install all components independently. Please refer to the following sections.
2.2.1. Prepare the Environment and Configure the PilotScope User
Before installing PilotScope, you should ensure that you have prepared the environment and configured the PilotScope user.
2.2.1.1. Prepare the Environment
Ensure that your system has the required libraries for Database and PilotScope. You can install these libraries on a new Ubuntu Linux system with the following command:
apt-get update && apt-get install -y wget git bzip2 vim gcc openssh-server sudo
2.2.1.2. Configure the PilotScope User
Some databases are not allowed to be controlled by root
user. Thus, if you are root
user, you should create a new user for PilotScope following these commands:
# Add a new user (execute as root)
sudo adduser pilotscope
usermod -aG sudo pilotscope
# Change to the pilotscope user
su pilotscope
2.2.2. Install PilotScope Package
First, let us obtain the PilotScope repository by cloning it from GitHub:
# Clone the repository
git clone --branch master https://github.com/alibaba/pilotscope.git PilotScopeCore
# Navigate to the PilotScope directory
cd PilotScopeCore
# Install the minimal package using pip
pip install -e .
We have included a variety of examples on deploying AI4DB algorithms into the databases using PilotScope. To utilize these examples, additional dependencies must be installed. You can do so with this command:
pip install -e '.[dev]'
To verify the installation of the PilotScope package, you can run the following sample Python code:
import pilotscope
If you did not receive any error messages, you have installed PilotScope package successfully.
Next, you will need to configure PilotScope to interface with the PostgreSQL database. For those who prefer using SparkSQL, please refer to the Spark section for detailed instructions.
2.2.3. Install PostgreSQL
We also outline a comprehensive procedures for installing the PilotScope patch onto PostgreSQL from the scratch. Users could experience more customized installation through this method.
2.2.3.1. Step 1: Prepare the Environment
Ensure that your system has the required libraries for PostgreSQL. You can install these libraries on a new Ubuntu Linux system with the following command:
su root
apt-get update && apt-get install -y build-essential libreadline-dev zlib1g-dev bison flex gdb libssl-dev libbz2-dev libsqlite3-dev llvm libncurses5-dev libncursesw5-dev xz-utils libffi-dev liblzma-dev
su pilotscope
2.2.3.2. Step 2: Install PostgreSQL
Download and extract modified PostgreSQL-13.1 with PilotScope. This is the only version supported by PilotScope at now stage.
# Navigate to the directory where you want to store the PostgreSQL source code, such as ~/
cd ~/
# Download PostgreSQL 13.1 with PilotScope
git clone --branch pilotscope-postgresql https://github.com/alibaba/pilotscope.git PilotScopePostgreSQL
Set the PG_PATH
environment variable to the desired installation path for PostgreSQL and PG_DATA
to the path where
the database data will be stored. Notice that the installation process will fail if this step is skipped.
It is recommended to create a new directory specifically for this purpose.
# Example of setting PG_PATH. A path example is: echo export PG_PATH=~/pgsql/ >> ~/.bashrc
echo 'export PG_PATH=/path/to/install/PostgreSQL' >> ~/.bashrc
# Example of setting PG_DATA. A path example is "~/pg_data/"
echo 'export PG_DATA=/path/to/store/database/data' >> ~/.bashrc
# Updating system paths
echo 'export PATH=$PG_PATH/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$PG_PATH/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
# Apply the changes
source ~/.bashrc
Compile and install PostgreSQL to the specified path using the following commands:
# Navigate to the PostgreSQL source code directory
cd ~/PilotScopePostgreSQL/
# Run the configuration script with required options
./configure --prefix=$PG_PATH --enable-depend --enable-cassert --enable-debug CFLAGS="-ggdb -O0"
# Compile and install PostgreSQL
make && make install
2.2.3.3. Step 3: Install the Extension
After installing PostgreSQL, you need to install some important extensions for enabling all functionalities of PilotScope.
# Install the PilotScope extension
sh install_extension.sh
# recompile and install PostgreSQL
make && make install
2.2.3.4. Step 4: Configure Database Connection Settings
Initialize the database.
# Initialize the database
cd $PG_PATH/bin
./initdb -D $PG_DATA
To allow remote connections, edit the postgresql.conf
and pg_hba.conf
files located in the PG_DATA
directory.
cd $PG_DATA
# Allow listening on all IP addresses
echo "listen_addresses = '*'" >> postgresql.conf
# Configure client authentication
echo "host all all all md5" >> pg_hba.conf
# Enable the PilotScope extension pg_hint_plan
echo "shared_preload_libraries = 'pg_hint_plan'" >> postgresql.conf
Start the PostgreSQL database:
cd $PG_PATH/bin
# Start PostgreSQL
./pg_ctl -D $PG_DATA start
Set the password for the database user pilotscope
to enable remote client connections. If you are non-root user and did not create user pilotscope
, replace pilotscope
with your username.
cd $PG_PATH/bin
./psql -d template1 -c "create database pilotscope;"
./psql -c "ALTER USER pilotscope PASSWORD 'pilotscope';"
2.2.3.5. Verify the connection to PostgreSQL via psql
To ensure the PilotScope patch onto PostgreSQL was installed correctly and the database service is running, you can connect to the PostgreSQL database server via psql
by typing the following command in the terminal.
cd $PG_PATH/bin
./psql -d template1
If you can access the postgres prompt like following and observe the word PilotScope
in the prompt, Pilotscope in the DB side has been installed and started correctly.
psql (13.1)
PilotScope
Type "help" for help.
template1=#
Finally, you can quit the postgres prompt by using the \q
command:
template1=# \q
Congratulations! PostgreSQL is fully configured and operational.
You can get started referring to the section Get Started
2.2.3.6. Update the Latest PilotScope Patch for PostgreSQL
Besides, if you want to update the latest modified PostgreSQL, you should pull the latest code and recompile and install PostgreSQL.
# Navigate to the PostgreSQL source code directory
cd ~/PilotScopePostgreSQL/
# Pull the latest code
git pull
# Recompile and install PostgreSQL.
make clean && make install
2.2.3.7. A Guideline for Installing PilotScope Patch into Your PostgreSQL Database.
Currently, PilotScope only support PostgreSQL 13.1 and have integrated some popular extension such as pg_hint_plan, what-if index. Thus, we do not recommend you to deploy PilotScope patch into your PostgreSQL database directly. In contrast, we recommend you to install our modified PostgreSQL 13.1 and migrate your origin data into the new PostgreSQL. This is valid if your origin PostgreSQL version is 13.1.
Specifically, since the data is in your origin PostgreSQL data path, you should skip the command ./initdb -D $PG_DATA
when installing our modified PostgreSQL 13.1.
Then, you can specify the origin PostgreSQL data path when executing the command ./pg_ctl -D $PG_DATA start
.
We recommend backing up the data before performing this operation.
2.2.4. Install Spark
We outline a comprehensive procedure for installing the PilotScope patch onto Spark from the scratch. Users could experience more customized installation through this method.
2.2.4.1. Step 1: Download PilotScope Patch for Spark
Download the PilotScope patch as follows:
# Navigate to the directory where you want to store the PilotScopeSpark repository, such as ~/
cd ~/
# Clone PilotScopeSpark repository
git clone --branch pilotscope-spark https://github.com/alibaba/pilotscope.git PilotScopeSpark
cd PilotScopeSpark/
2.2.4.2. Step 2: Download Spark and Apply Patch for Spark
Download and extract Spark 3.3.2. This is the only version supported by PilotScope at now stage.
# Download the Spark 3.3.2
wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2.tgz
tar -xzvf spark-3.3.2.tgz
Now there is a spark-3.3.2
directory which contains the source code of Spark in the same directory as apply_patch.sh
.
Apply the PilotScope patch into Spark as follows:
bash apply_patch.sh pilotscope_spark.patch
apply_patch.sh
will make an initial git commit on the original code when first run, and then apply the patch to this
source code of Spark.
2.2.4.3. Step 3: Configuring Java Environment Variables
Then, download JDK and configure the relevant environment variables.
You can use the JDK provided by us (jdk-8u202-linux-x64.tar.gz
), or download your own JDK.
# download and extract the JDK to the current directory
wget https://github.com/WoodyBryant/JDK/releases/download/v2/jdk-8u202-linux-x64.tar.gz
tar -xzvf jdk-8u202-linux-x64.tar.gz
# move the JDK to the a directory where you have the access permission, e.g. home directory.
mv jdk1.8.0_202 ~/
# Set necessary environment variables, JAVA_HOME and PATH
# for example: echo 'export JAVA_HOME=~/jdk1.8.0_202/' >> ~/.bashrc if your save jdk in "~/jdk1.8.0_202" path
echo 'export JAVA_HOME=/path/to/your/jdk' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
2.2.4.4. Step 4: Compile and Build PilotScope and Spark
Compile and build Spark.
cd spark-3.3.2
./build/mvn -DskipTests clean package
2.2.4.5. Step 5: Install PySpark
Finally, install PySpark for allowing a Python code to control Spark.
You should use the Python environment used by PilotScope Core to install PySpark.
# This is provided by Spark
cd python
python3 setup.py install
From now on, the Spark is fully configured and operational. You could leverage the PilotScope Python package to establish a connection to the started database. This enables you to access and utilize the extensive range of functionalities, especially the AI4DB algorithms, offered by PilotScope.
2.2.4.6. Update the Latest PilotScope Patch for Spark
If you want to update the latest pilotscope patch for Spark , you should pull the latest patch and recompile and install Spark.
Note that, apply_patch.sh
will roll back the Spark into original Spark and then apply the lastest patch. Thus, if you
have modified Spark codes for your requirements, you should not to execute the command directly.
cd PilotScopeSpark/
bash apply_patch.sh patch_to_update.patch
This script will update the code for Spark. Then follow steps 5-6 to complete the update, just like the first installing.
2.2.5. Get Started
After completing above steps, the PilotScope/PostgreSQL/Spark will be ready.
You could leverage the PilotScope Python package to establish a connection to the started database. This enables you to access and utilize the extensive range of functionalities, especially the AI4DB algorithms, offered by PilotScope.
Get started by following these steps:
Load test databases referring to the section Load Test Databases
Learn how the PilotScope system works when a SQL query is submitted into PilotScope referring to the chapter 3. System Pipeline.
Learn the components of PilotScope Core in ML side referring to the chapter 4. Core Components (ML Side).
Learn the algorithm examples referring to the chapter 5. Example.
2.3. Load Test Databases
Regardless of the installation method used from the above, you can follow these steps to load two test databases into PostgreSQL using PilotScope.
Please ensure that:
your database and pilotscope environment are in the same machine with the same user.
modify the following Python code according to your actual database configuration.
Here you will create a connection to PostgreSQL server at localhost
, port 5432
using the username pilotscope
and the password pilotscope
and load stats_tiny
database and imdb_tiny
database into PostgreSQL.
from pilotscope.Dataset.ImdbTinyDataset import ImdbTinyDataset
from pilotscope.Dataset.StatsTinyDataset import StatsTinyDataset
from pilotscope.PilotConfig import PostgreSQLConfig
from pilotscope.PilotEnum import DatabaseEnum
# configure database settings
db_port = "5432" # database server port (default: "5432")
db_user = "pilotscope" # database user name (default: "pilotscope")
db_user_pwd = "pilotscope" # database user password (default: "pilotscope")
pg_bin_path = "/path/to/install/PostgreSQL/bin" # database bin path, i.e. $PG_PATH/bin (default: None)
pg_data_path = "/path/to/store/database/data" # database data path, i.e. $PG_DATA (default: None)
db_config = PostgreSQLConfig(db_port=db_port, db_user=db_user, db_user_pwd=db_user_pwd)
# Please ensure that your database and pilotscope Core are in the same machine with the same user.
db_config.enable_deep_control_local(pg_bin_path, pg_data_path)
# load stats_tiny
ds = StatsTinyDataset(DatabaseEnum.POSTGRESQL, created_db_name="stats_tiny")
ds.load_to_db(db_config)
# load imdb_tiny
ds = ImdbTinyDataset(DatabaseEnum.POSTGRESQL, created_db_name="imdb_tiny")
ds.load_to_db(db_config)
If you can observe the following messages, these databases have been loaded successfully.
...
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
...
Congratulations! You have successfully installed complete PilotScope components.