In this blog post, we will show you how to build an end-to-end machine learning project using the Spark MLlib library.
In this project, we will take up a regression problem statement. This tutorial will give you an overall idea of how you can build such a project.
To start this project, you need a code editor. You can use any editor you like, but I recommend Visual Studio Code.
Create a virtual environment, or use the default base environment from Anaconda Navigator, and install the libraries required for this project. To install them, create a requirements.txt file listing pandas, pyspark, and ipykernel (the SQL functions and types modules used later ship with pyspark itself).
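For reference, a minimal requirements.txt matching the description above could look like this (ipykernel is the package that provides the notebook kernel):

pandas
pyspark
ipykernel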
Run pip install -r requirements.txt (or conda install --file requirements.txt) in the terminal inside VS Code. Once you have installed the above libraries, let's start executing our project.
You can use any dataset you want, but when choosing one, check whether the column values are in numeric form.
Import the libraries you installed earlier into your notebook or script.
Start the Spark session using the SparkSession.builder.getOrCreate() function. getOrCreate() returns an existing SparkSession if one is already running and creates a new one otherwise, so the underlying SparkContext can be shared across your application.
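As a rough sketch, the imports and session setup could look like this (the app name "regression-demo" is just a placeholder):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Reuse the running SparkSession if there is one, otherwise create a new one
spark = SparkSession.builder.appName("regression-demo").getOrCreate()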
Then run the spark command (just type the variable name in a notebook cell) to see the specifications of the session, such as the Spark version, master, and app name.
Read the CSV file with spark.read.csv(path_to_dataset, header=True, inferSchema=True). This function reads CSV files; header=True tells Spark that the first row of the file contains the column names, and inferSchema=True automatically guesses the data type of each field.
Print the schema using the printSchema() function. It displays the schema of the DataFrame in tree format, with each column name and its data type.
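Put together, the read and schema check might look like this (the file path "data/housing.csv" is only a placeholder for your own dataset):

# header=True treats the first row as column names; inferSchema=True guesses the types
df = spark.read.csv("data/housing.csv", header=True, inferSchema=True)
df.printSchema()  # prints each column name and data type in tree format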
Split the dataset into train and test sets using the randomSplit() function. A 70:30 ratio is common, but you can specify the split according to your choice.
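For example, a 70:30 split could be written like this (the seed is optional and only makes the split reproducible):

train, test = df.randomSplit([0.7, 0.3], seed=42)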
The most important step is to identify the categorical and numerical columns before deciding on the further steps, so here we will use this code:
cat_cols = [x for x, dataType in train.dtypes if dataType == "string"]
num_cols = [x for x, dataType in train.dtypes if dataType != "string"]
The code above iterates over the columns, checks each column's data type, and collects the column names into the cat_cols and num_cols variables (note that dtypes reports type names in lowercase, such as "string").
If your dataset has missing values or categorical columns, you need to handle them and convert the categorical columns into numeric form, because the Spark VectorAssembler only works with numeric columns.
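One simple way to handle this, assuming the cat_cols and num_cols lists from the previous step, is to fill missing numeric values and index the string columns with StringIndexer; this is just a sketch, and you may prefer a different imputation strategy:

from pyspark.ml.feature import StringIndexer

# Fill missing numeric values with 0 (a simple default, not always appropriate)
train = train.fillna(0, subset=num_cols)
test = test.fillna(0, subset=num_cols)

# One indexer per categorical column; the "_idx" suffix is just a naming convention
indexers = [
    StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
    for c in cat_cols
]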
The next step is to combine the columns into vector form. For this, we use the VectorAssembler, a transformer that combines a given list of columns into a single vector column.
Import VectorAssembler(), define its input and output columns, then create a stages list (starting from an empty []) and append each step to it with += in the order the steps should run.
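A sketch of this step, assuming the indexers list from above and a target column named "label" that should not be part of the features:

from pyspark.ml.feature import VectorAssembler

stages = []
stages += indexers  # categorical indexers go first

# Use the indexed categorical columns plus the numeric columns (minus the target)
feature_cols = [c + "_idx" for c in cat_cols] + [c for c in num_cols if c != "label"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
stages += [assembler]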
Now make the pipeline. It consists of a sequence of stages, each of which is either an Estimator or a Transformer.
Import Pipeline, create a Pipeline object, and set its stages. Fit the pipeline on the training set, then use the fitted pipeline to transform both the training and the testing sets. Transforming here means converting the data from its raw columns into the assembled feature representation.
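Continuing the sketch, with the stages list and the train/test DataFrames from the earlier steps:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=stages)
pipeline_model = pipeline.fit(train)  # fit the stages on the training data
train_transformed = pipeline_model.transform(train)
test_transformed = pipeline_model.transform(test)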
Store the transformed test dataset in a variable and pick out the columns you need with select(); this gives you a DataFrame containing just those columns. To view the DataFrame, use show().
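For example (again assuming the target column is called "label"):

test_data = test_transformed.select("features", "label")
test_data.show(5)  # inspect the first few rows of the assembled test set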
Now the main step is to pick the model. In Spark, there are many models available depending on the use case.
We use the Linear Regression model, so we import it:
from pyspark.ml.regression import LinearRegression
After that, take the transformed training DataFrame, select the feature vector column produced by the VectorAssembler along with the output (target) column of your dataset, and store the result in a data variable.
Create a LinearRegression object and fit it on that data. Evaluate the model using its summary metrics, for example model.summary.meanSquaredError. You can also try different evaluation metrics according to your use case.
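Putting it together, a sketch of the training and evaluation step, reusing the LinearRegression import shown above and the "features"/"label" column names assumed earlier:

train_data = train_transformed.select("features", "label")

lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_data)

# Training-set metrics exposed by the model summary
print(model.summary.meanSquaredError)
print(model.summary.rootMeanSquaredError)
print(model.summary.r2)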
Now that we have trained our model, it's time to save it. For this, use model.save("model_path"). You can load the saved model later with LinearRegressionModel.load("model_path").
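In code, saving and loading could look like this (the "lr_model" path is just an example):

model.save("lr_model")

from pyspark.ml.regression import LinearRegressionModel
loaded_model = LinearRegressionModel.load("lr_model")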
So, that's it for this end-to-end project implementation using the PySpark MLlib library. I hope you learned something from this post. The same process works for any dataset; whether you are using supervised or unsupervised ML algorithms, you can follow this step-by-step guide.