# Installation

This section provides comprehensive guidelines on how to install the PilotScope Python package on the ML side, along with the necessary patches for PostgreSQL and Spark on the DB side. The PilotScope package is a crucial tool that offers a range of frequently used functions, including database connection, data exchange, and the integration of AI algorithms into the database. Additionally, we have developed patches for both PostgreSQL and Spark to enhance their capabilities and support a wider array of functionalities.

Required Software Versions:

- Python: 3.8
- PostgreSQL: 13.1
- Apache Spark: 3.3.2

Here, we provide two installation methods: installing from a Docker image and installing from scratch. You can load test databases and check the installation by referring to the section [Load Test Databases](#load-test-databases).

## Install from Pre-built Docker Image

We provide a pre-built Docker image which contains the pre-installed PilotScope package and the modified PostgreSQL 13.1 and Spark 3.3.2. Here are some key details about this Docker image:

- The provided image is based on Ubuntu 18.04.
- The PilotScope Core on the ML side is located in the `/home/pilotscope/PilotScopeCore` directory.
- The source code of the modified PostgreSQL is located in the `/home/pilotscope/PilotScopePostgreSQL` directory.
- PostgreSQL is installed in `/home/pilotscope/pgsql`, with data storage located at `/home/pilotscope/pg_data`.
- The source code of the modified Spark is located in the `/home/pilotscope/PilotScopeSpark/spark-3.3.2` directory.
- The pre-installed Miniconda is located in the `/home/pilotscope/miniconda3` directory with a pre-built Python environment named `pilotscope`.
- Two system users are pre-created: `root` (password: root) and `pilotscope` (password: pilotscope). User `pilotscope` owns the PostgreSQL database system, and this user must own the server process.
- Some necessary tools such as `vim`, `wget`, `git`, `ssh`, etc. are pre-installed.

### Start a Container from the Image

You can follow these steps to start a container from our Docker image.

#### Step 1: Build an Image

To acquire the Docker image, you can download the `Dockerfile` from the following link and build the image locally.

```bash
# Navigate to the directory where you want to store the Dockerfile, such as ~/
cd ~/

# Download the Dockerfile
wget https://raw.githubusercontent.com/alibaba/pilotscope/master/Dockerfile
```

You can execute the following command to build the image for PostgreSQL:

```bash
docker build -t pilotscope:latest \
  --build-arg enable_postgresql=true \
  --build-arg enable_spark=false \
  .
```

You can execute the following command to build the image for Spark:

```bash
docker build -t pilotscope:latest \
  --build-arg enable_postgresql=true \
  --build-arg enable_spark=true \
  .
```

PilotScope only supports PostgreSQL as the data source of Spark. Thus, you should set `enable_postgresql=true` and `enable_spark=true` at the same time.

#### Step 2: Start a Container from the Image

You can use the following command to initiate a Docker container:

```bash
docker run -it --name pilotscope_test --shm-size 5gb --cap-add sys_ptrace -p 54323:5432 -p 54023:22 -d pilotscope /bin/bash
```

This command boots up a container named `pilotscope_test`. Port mappings are configured as follows:

- (Optional) The container's port `5432` is mapped to the host machine's port `54323`, allowing access to the database inside the container via host port 54323.
- (Recommended) The container's port `22` (SSH) is mapped to the host machine's port `54023`, enabling SSH access to the container from the host via port 54023.

You can map more ports according to your requirements.

#### Step 3: Log into the Container

Access the container using the following command:

```bash
# Enter the Docker container
docker exec -it "container_id" /bin/bash
```

Replace `container_id` with the actual container ID from the previous step. Once inside the container, enable the SSH service for remote connections:

```bash
# Start the SSH service
sudo service ssh start
```

### Start PostgreSQL in the Container

To begin using PostgreSQL, switch to the pilotscope user and start the PostgreSQL service:

```bash
# Switch to the pilotscope user
su pilotscope

# Navigate to the PostgreSQL binary path and start the service
cd $PG_PATH/bin
./pg_ctl start -D $PG_DATA
```

### Start Spark in the Container

PilotScope only supports PostgreSQL as the data source of Spark. To enable all Spark functionalities, you should start PostgreSQL by referring to the section [Start PostgreSQL in the Container](#start-postgresql-in-the-container).

After completing the above steps, PostgreSQL/Spark will be ready inside the container. You can use a suitable IDE to open the PilotScopeCore project in the `/home/pilotscope/PilotScopeCore` directory and start developing your algorithms, or try different functionalities by executing the test files in `test_pilotscope`.

### Get Started in the Container

To switch to the PilotScope development environment, run the following command in the container:

```bash
# Switch to the PilotScope development environment
conda activate pilotscope
```

After completing the above steps, PilotScope/PostgreSQL/Spark will be ready inside the container. You can leverage the PilotScope Python package to establish a connection to the started database (a minimal connection sketch is shown after the list below). This enables you to access and utilize the extensive range of functionalities, especially the AI4DB algorithms, offered by PilotScope. We recommend opening the PilotScopeCore project in the `/home/pilotscope/PilotScopeCore` directory with a suitable IDE. From now on, you can get started by following these steps:

- Load test databases by referring to the section [Load Test Databases](#load-test-databases) or by running the following commands:

  ```bash
  # load stats_tiny
  python -m unittest test_pilotscope.test_pg.skip_test_dataset.TestDataset.test_load_stats_tiny
  # load imdb_tiny
  python -m unittest test_pilotscope.test_pg.skip_test_dataset.TestDataset.test_load_imdb_tiny
  ```

- Learn how the PilotScope system works when a SQL query is submitted to PilotScope by referring to the chapter [3. System Pipeline](section-system-pipeline).
- Learn the components of PilotScope Core on the ML side by running the `test_pilotscope` files in the `/home/pilotscope/PilotScopeCore` directory and referring to the chapter [4. Core Components (ML Side)](section-core-components).
- Run the algorithm examples by referring to the `test_example_algorithms` files in the `/home/pilotscope/PilotScopeCore` directory and the chapter [5. Example](section-example).
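The following minimal sketch, referenced above, connects to the PostgreSQL instance started in the container and loads the `stats_tiny` test database. It reuses the APIs from the [Load Test Databases](#load-test-databases) section together with the image's default paths (`/home/pilotscope/pgsql` and `/home/pilotscope/pg_data`); adjust these values if you changed them. It does roughly what the `test_load_stats_tiny` command above does.

```python
from pilotscope.Dataset.StatsTinyDataset import StatsTinyDataset
from pilotscope.PilotConfig import PostgreSQLConfig
from pilotscope.PilotEnum import DatabaseEnum

# Connection settings matching the pre-built image (adjust if you changed them)
db_config = PostgreSQLConfig(db_port="5432", db_user="pilotscope", db_user_pwd="pilotscope")

# The image installs PostgreSQL in /home/pilotscope/pgsql with data in /home/pilotscope/pg_data
db_config.enable_deep_control_local("/home/pilotscope/pgsql/bin", "/home/pilotscope/pg_data")

# Load the stats_tiny test database to confirm that PilotScope can reach PostgreSQL
ds = StatsTinyDataset(DatabaseEnum.POSTGRESQL, created_db_name="stats_tiny")
ds.load_to_db(db_config)
```

If the script finishes without errors, the connection works and `stats_tiny` is ready for the later examples.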
## Install All Components from Scratch

If you prefer not to use Docker, you can also install all components independently. Please refer to the following sections.

### Prepare the Environment and Configure the PilotScope User

Before installing PilotScope, you should ensure that you have prepared the environment and configured the PilotScope user.

#### Prepare the Environment

Ensure that your system has the required libraries for the database and PilotScope. You can install these libraries on a new Ubuntu Linux system with the following command:

```bash
apt-get update && apt-get install -y wget git bzip2 vim gcc openssh-server sudo
```

#### Configure the PilotScope User

Some databases cannot be controlled by the `root` user. Thus, if you are the `root` user, you should create a new user for PilotScope with the following commands:

```bash
# Add a new user (execute as root)
sudo adduser pilotscope
usermod -aG sudo pilotscope

# Change to the pilotscope user
su pilotscope
```

### Install PilotScope Package

First, let us obtain the PilotScope repository by cloning it from GitHub:

```bash
# Clone the repository
git clone --branch master https://github.com/alibaba/pilotscope.git PilotScopeCore

# Navigate to the PilotScope directory
cd PilotScopeCore

# Install the minimal package using pip
pip install -e .
```

We have included a variety of examples of deploying AI4DB algorithms into databases using PilotScope. To utilize these examples, additional dependencies must be installed. You can do so with this command:

```bash
pip install -e '.[dev]'
```

To verify the installation of the PilotScope package, you can run the following sample Python code:

```python
import pilotscope
```

If you did not receive any error messages, you have installed the PilotScope package successfully. Next, you will need to configure PilotScope to interface with the PostgreSQL database. For those who prefer using SparkSQL, please refer to the [Spark](#install-spark) section for detailed instructions.
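Beyond `import pilotscope`, a slightly fuller smoke test is to import the configuration and dataset classes used later in this guide (they appear again in the [Load Test Databases](#load-test-databases) section). This is only a convenience check; if these imports succeed, the installation is ready for the examples in this guide.

```python
# A fuller smoke test: import the classes used later in this guide.
# If any import fails, re-check the `pip install -e .` / `pip install -e '.[dev]'` steps.
from pilotscope.PilotConfig import PostgreSQLConfig
from pilotscope.PilotEnum import DatabaseEnum
from pilotscope.Dataset.StatsTinyDataset import StatsTinyDataset
from pilotscope.Dataset.ImdbTinyDataset import ImdbTinyDataset

print("PilotScope imports OK")
```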
### Install PostgreSQL

We also outline a comprehensive procedure for installing the PilotScope patch onto PostgreSQL from scratch. This method allows for a more customized installation.

#### Step 1: Prepare the Environment

Ensure that your system has the required libraries for PostgreSQL. You can install these libraries on a new Ubuntu Linux system with the following commands:

```bash
su root
apt-get update && apt-get install -y build-essential libreadline-dev zlib1g-dev bison flex gdb libssl-dev libbz2-dev libsqlite3-dev llvm libncurses5-dev libncursesw5-dev xz-utils libffi-dev liblzma-dev
su pilotscope
```

#### Step 2: Install PostgreSQL

Download the modified PostgreSQL 13.1 with PilotScope. This is the only version supported by PilotScope at the current stage.

```bash
# Navigate to the directory where you want to store the PostgreSQL source code, such as ~/
cd ~/

# Download PostgreSQL 13.1 with PilotScope
git clone --branch pilotscope-postgresql https://github.com/alibaba/pilotscope.git PilotScopePostgreSQL
```

Set the `PG_PATH` environment variable to the desired installation path for PostgreSQL and `PG_DATA` to the path where the database data will be stored. Notice that the installation process will fail if this step is skipped. It is recommended to create a new directory specifically for this purpose.

```bash
# Example of setting PG_PATH, e.g. echo 'export PG_PATH=~/pgsql/' >> ~/.bashrc
echo 'export PG_PATH=/path/to/install/PostgreSQL' >> ~/.bashrc

# Example of setting PG_DATA. A path example is "~/pg_data/"
echo 'export PG_DATA=/path/to/store/database/data' >> ~/.bashrc

# Update system paths
echo 'export PATH=$PG_PATH/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$PG_PATH/lib:$LD_LIBRARY_PATH' >> ~/.bashrc

# Apply the changes
source ~/.bashrc
```

Compile and install PostgreSQL to the specified path using the following commands:

```bash
# Navigate to the PostgreSQL source code directory
cd ~/PilotScopePostgreSQL/

# Run the configuration script with required options
./configure --prefix=$PG_PATH --enable-depend --enable-cassert --enable-debug CFLAGS="-ggdb -O0"

# Compile and install PostgreSQL
make && make install
```

#### Step 3: Install the Extension

After installing PostgreSQL, you need to install some important extensions (still in the `~/PilotScopePostgreSQL/` directory) to enable all functionalities of PilotScope.

```bash
# Install the PilotScope extension
sh install_extension.sh

# Recompile and install PostgreSQL
make && make install
```

#### Step 4: Configure Database Connection Settings

Initialize the database:

```bash
# Initialize the database
cd $PG_PATH/bin
./initdb -D $PG_DATA
```

To allow remote connections, edit the `postgresql.conf` and `pg_hba.conf` files located in the `PG_DATA` directory:

```bash
cd $PG_DATA

# Allow listening on all IP addresses
echo "listen_addresses = '*'" >> postgresql.conf

# Configure client authentication
echo "host all all all md5" >> pg_hba.conf

# Enable the PilotScope extension pg_hint_plan
echo "shared_preload_libraries = 'pg_hint_plan'" >> postgresql.conf
```

Start the PostgreSQL database:

```bash
cd $PG_PATH/bin

# Start PostgreSQL
./pg_ctl -D $PG_DATA start
```

Set the password for the database user `pilotscope` to enable remote client connections. If you are a non-root user and did not create the user `pilotscope`, replace `pilotscope` with your username.

```bash
cd $PG_PATH/bin
./psql -d template1 -c "create database pilotscope;"
./psql -c "ALTER USER pilotscope PASSWORD 'pilotscope';"
```

#### Verify the connection to PostgreSQL via psql

To ensure the PilotScope patch onto PostgreSQL was installed correctly and the database service is running, connect to the PostgreSQL database server via `psql` by typing the following command in the terminal:

```bash
cd $PG_PATH/bin
./psql -d template1
```

If you can access the postgres prompt as shown below and observe the word `PilotScope` in it, PilotScope on the DB side has been installed and started correctly.

```bash
psql (13.1) PilotScope
Type "help" for help.

template1=#
```

Finally, you can quit the postgres prompt by using the `\q` command:

```bash
template1=# \q
```

Congratulations! PostgreSQL is fully configured and operational. You can get started by referring to the section [Get Started](#get-started).
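The `$PG_PATH/bin` and `$PG_DATA` paths chosen above are also what PilotScope Core expects when configuring local deep control from Python (see [Load Test Databases](#load-test-databases)). The following is a minimal sketch of that mapping; the paths are placeholders to be replaced with your actual values.

```python
from pilotscope.PilotConfig import PostgreSQLConfig

# Placeholders: substitute the PG_PATH/PG_DATA values chosen during installation
pg_bin_path = "/path/to/install/PostgreSQL/bin"   # i.e. $PG_PATH/bin
pg_data_path = "/path/to/store/database/data"     # i.e. $PG_DATA

db_config = PostgreSQLConfig(db_port="5432", db_user="pilotscope", db_user_pwd="pilotscope")

# Only valid when PilotScope Core and PostgreSQL run on the same machine under the same user
db_config.enable_deep_control_local(pg_bin_path, pg_data_path)
```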
#### Update the Latest PilotScope Patch for PostgreSQL

Besides, if you want to update to the latest modified PostgreSQL, you should pull the latest code and then recompile and install PostgreSQL.

```bash
# Navigate to the PostgreSQL source code directory
cd ~/PilotScopePostgreSQL/

# Pull the latest code
git pull

# Recompile and install PostgreSQL
make clean && make install
```

#### A Guideline for Installing PilotScope Patch into Your PostgreSQL Database

Currently, PilotScope only supports PostgreSQL 13.1 and has integrated some popular extensions such as pg_hint_plan and what-if index. Thus, we do not recommend deploying the PilotScope patch into your own PostgreSQL database directly. Instead, we recommend installing our modified PostgreSQL 13.1 and migrating your original data into the new PostgreSQL. This is valid if your original PostgreSQL version is 13.1. Specifically, since the data is in your original PostgreSQL data path, you should skip the command `./initdb -D $PG_DATA` when installing our modified PostgreSQL 13.1. Then, you can specify the original PostgreSQL data path when executing the command `./pg_ctl -D $PG_DATA start`. We recommend backing up the data before performing this operation.

### Install Spark

We outline a comprehensive procedure for installing the PilotScope patch onto Spark from scratch. This method allows for a more customized installation.

#### Step 1: Download PilotScope Patch for Spark

Download the PilotScope patch as follows:

```bash
# Navigate to the directory where you want to store the PilotScopeSpark repository, such as ~/
cd ~/

# Clone the PilotScopeSpark repository
git clone --branch pilotscope-spark https://github.com/alibaba/pilotscope.git PilotScopeSpark
cd PilotScopeSpark/
```

#### Step 2: Download Spark and Apply the Patch for Spark

Download and extract Spark 3.3.2. This is the only version supported by PilotScope at the current stage.

```bash
# Download Spark 3.3.2
wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2.tgz
tar -xzvf spark-3.3.2.tgz
```

Now there is a `spark-3.3.2` directory containing the Spark source code in the same directory as `apply_patch.sh`. Apply the PilotScope patch to Spark as follows:

```bash
bash apply_patch.sh pilotscope_spark.patch
```

`apply_patch.sh` makes an initial git commit on the original code when first run, and then applies the patch to the Spark source code.

#### Step 3: Configure Java Environment Variables

Then, download a JDK and configure the relevant environment variables. You can use the JDK provided by us (`jdk-8u202-linux-x64.tar.gz`) or download your own JDK.

```bash
# Download and extract the JDK to the current directory
wget https://github.com/WoodyBryant/JDK/releases/download/v2/jdk-8u202-linux-x64.tar.gz
tar -xzvf jdk-8u202-linux-x64.tar.gz

# Move the JDK to a directory where you have access permission, e.g. the home directory
mv jdk1.8.0_202 ~/

# Set the necessary environment variables, JAVA_HOME and PATH.
# For example: echo 'export JAVA_HOME=~/jdk1.8.0_202/' >> ~/.bashrc if you saved the JDK in ~/jdk1.8.0_202
echo 'export JAVA_HOME=/path/to/your/jdk' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
```

#### Step 4: Compile and Build PilotScope and Spark

Compile and build Spark:

```bash
cd spark-3.3.2
./build/mvn -DskipTests clean package
```

#### Step 5: Install PySpark

Finally, install PySpark to allow Python code to control Spark. You should install PySpark into the Python environment used by PilotScope Core.

```bash
# This directory is provided by Spark (inside spark-3.3.2/)
cd python
python3 setup.py install
```
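To check that PySpark landed in the PilotScope Python environment, a quick smoke test is to start a local Spark session and print its version; it should report 3.3.2.

```python
from pyspark.sql import SparkSession

# Start a minimal local Spark session and print its version; expect 3.3.2
spark = SparkSession.builder.master("local[1]").appName("pilotscope-pyspark-check").getOrCreate()
print(spark.version)
spark.stop()
```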
From now on, Spark is fully configured and operational. You can leverage the PilotScope Python package to establish a connection to the started database. This enables you to access and utilize the extensive range of functionalities, especially the AI4DB algorithms, offered by PilotScope.

#### Update the Latest PilotScope Patch for Spark

If you want to update to the latest PilotScope patch for Spark, you should pull the latest patch, then recompile and install Spark. Note that `apply_patch.sh` will roll Spark back to the original code and then apply the latest patch. Thus, if you have modified the Spark code for your own requirements, you should not execute the command directly.

```bash
cd PilotScopeSpark/
bash apply_patch.sh patch_to_update.patch
```

This script updates the Spark code. Then recompile and build Spark and reinstall PySpark (Steps 4 and 5) to complete the update, just like the first installation.

### Get Started

After completing the above steps, PilotScope/PostgreSQL/Spark will be ready. You can leverage the PilotScope Python package to establish a connection to the started database. This enables you to access and utilize the extensive range of functionalities, especially the AI4DB algorithms, offered by PilotScope. Get started by following these steps:

- Load test databases by referring to the section [Load Test Databases](#load-test-databases).
- Learn how the PilotScope system works when a SQL query is submitted to PilotScope by referring to the chapter [3. System Pipeline](section-system-pipeline).
- Learn the components of PilotScope Core on the ML side by referring to the chapter [4. Core Components (ML Side)](section-core-components).
- Learn the algorithm examples by referring to the chapter [5. Example](section-example).

## Load Test Databases

Regardless of which installation method you used above, you can follow these steps to load two test databases into PostgreSQL using PilotScope. Please ensure that:

1. Your database and PilotScope environment are on the same machine with the same user.
2. You modify the following Python code according to your actual database configuration.

Here you will create a connection to the PostgreSQL server at `localhost`, port `5432`, using the username `pilotscope` and the password `pilotscope`, and load the `stats_tiny` and `imdb_tiny` databases into PostgreSQL.

```python
from pilotscope.Dataset.ImdbTinyDataset import ImdbTinyDataset
from pilotscope.Dataset.StatsTinyDataset import StatsTinyDataset
from pilotscope.PilotConfig import PostgreSQLConfig
from pilotscope.PilotEnum import DatabaseEnum

# configure database settings
db_port = "5432"  # database server port (default: "5432")
db_user = "pilotscope"  # database user name (default: "pilotscope")
db_user_pwd = "pilotscope"  # database user password (default: "pilotscope")
pg_bin_path = "/path/to/install/PostgreSQL/bin"  # database bin path, i.e. $PG_PATH/bin (default: None)
pg_data_path = "/path/to/store/database/data"  # database data path, i.e. $PG_DATA (default: None)

db_config = PostgreSQLConfig(db_port=db_port, db_user=db_user, db_user_pwd=db_user_pwd)
# Please ensure that your database and PilotScope Core are on the same machine with the same user.
db_config.enable_deep_control_local(pg_bin_path, pg_data_path)

# load stats_tiny
ds = StatsTinyDataset(DatabaseEnum.POSTGRESQL, created_db_name="stats_tiny")
ds.load_to_db(db_config)

# load imdb_tiny
ds = ImdbTinyDataset(DatabaseEnum.POSTGRESQL, created_db_name="imdb_tiny")
ds.load_to_db(db_config)
```

If you can observe the following messages, these databases have been loaded successfully.

```bash
...
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
...
```

Congratulations! You have successfully installed the complete set of PilotScope components.
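As a next step, you can interact with the loaded databases from Python through PilotScope Core (see chapter [4. Core Components (ML Side)](section-core-components)). The sketch below is only illustrative: the class name `PilotDataInteractor`, its import path, and the `pull_execution_time`/`execute` calls are assumptions based on the PilotScope Core documentation, so check the `test_pilotscope` and `test_example_algorithms` files in the repository for the exact interface.

```python
# Illustrative sketch only: the import path, class name, and method/attribute names
# below are assumptions about the PilotScope Core API; verify them against the
# test_pilotscope and test_example_algorithms files in PilotScopeCore.
from pilotscope.PilotConfig import PostgreSQLConfig
from pilotscope.PilotDataInteractor import PilotDataInteractor  # assumed import path

config = PostgreSQLConfig(db_port="5432", db_user="pilotscope", db_user_pwd="pilotscope")
config.db = "stats_tiny"  # assumed attribute selecting the target database

data_interactor = PilotDataInteractor(config)
data_interactor.pull_execution_time()                          # assumed "pull" operator
data = data_interactor.execute("SELECT count(*) FROM badges")  # badges is a stats_tiny table
print(data.execution_time)                                     # assumed result field
```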