First, I train on the training data with the random forest algorithm. It is a proven, successful algorithm, so I start by looking at its baseline results.

from sklearn.model_selection import train_test_split

X = train.drop('SalePrice', axis=1)
y = train['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Then I fit the model on the “X_train” features with the “y_train” labels and take predictions on “X_test”, the held-out test features.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X_train, y_train)
predictions = regr.predict(X_test)

`mean_squared_error(predictions, y_test)`

The result is 2220031963.926703.

That seems very high. However, log-transforming the target brings the error down to a very low value, since it is then measured on the log scale.
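A minimal sketch of that log-transform step (my illustration, not code from the original; np.log1p is an assumed choice of transform):

import numpy as np

# Train on log1p(price) and measure MSE on the log scale (assumed transform)
regr_log = RandomForestRegressor(max_depth=2, random_state=0)
regr_log.fit(X_train, np.log1p(y_train))
log_predictions = regr_log.predict(X_test)
print(mean_squared_error(log_predictions, np.log1p(y_test)))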

Then I use randomized search to tune the hyperparameters. However, I deleted the code that actually runs the search, because it takes too long on a Kaggle kernel. The grid-building code below is adapted from another Medium post.

import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]

# Number of features to consider at every split
max_features = ['auto', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print(random_grid)

This random grid is built for the random forest algorithm, but you can use the same randomized-search approach with any other machine learning algorithm, as in the sketch below.
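For reference, the deleted search step would look roughly like this (a sketch, not the author's exact code; the n_iter, cv, and n_jobs values are assumptions):

from sklearn.model_selection import RandomizedSearchCV

# Sample random combinations from the grid with cross-validation (settings assumed)
rf_random = RandomizedSearchCV(estimator=RandomForestRegressor(), param_distributions=random_grid,
                               n_iter=100, cv=3, random_state=42, n_jobs=-1)
rf_random.fit(X_train, y_train)
print(rf_random.best_params_)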

Next, I extract new features with PCA, keeping three components in total.

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
principalComponents_train = pca.fit_transform(X)

# Use transform (not fit_transform) so the test set is projected onto the components fitted on the train set
principalComponents_test = pca.transform(test)

sum(pca.explained_variance_ratio_)

Then, I load these components into the “train” and “test” dataframes.

train['component_1'] = [i[0] for i in principalComponents_train]
train['component_2'] = [i[1] for i in principalComponents_train]
train['component_3'] = [i[2] for i in principalComponents_train]

test['component_1'] = [i[0] for i in principalComponents_test]
test['component_2'] = [i[1] for i in principalComponents_test]
test['component_3'] = [i[2] for i in principalComponents_test]

Then I repeat the random forest steps, this time with the tuned hyperparameters.

X = train.drop('SalePrice', axis=1)
y = train['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

regr = RandomForestRegressor(n_estimators=400, min_samples_split=2, min_samples_leaf=1,
                             max_features='sqrt', max_depth=None, bootstrap=False)
regr.fit(X, y)
predictions = regr.predict(X)
mean_squared_error(predictions, y)

The error is 23.29888698630137, far lower than before. Note, though, that it is measured on the same data the model was trained on, so it is an optimistic, in-sample figure.
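A quick sketch (my addition, not in the original) of measuring the error on the held-out split instead:

# Refit on the training split only, then score on the held-out split
regr.fit(X_train, y_train)
print(mean_squared_error(regr.predict(X_test), y_test))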

The next method is a form of ensemble learning, finished off with a bagging algorithm. For details you can look up the functions and libraries on Google. I use seven different regressors and collect their predictions into a table that serves as the input for the ensemble.

from sklearn import linear_model, svm, tree
from sklearn.neighbors import KNeighborsRegressor
import xgboost as xgb

model_1 = RandomForestRegressor(n_estimators=400, min_samples_split=2, min_samples_leaf=1,
                                max_features='sqrt', max_depth=None, bootstrap=False)
model_1.fit(X, y)
predict_1 = model_1.predict(X)

model_2 = linear_model.Ridge()
model_2.fit(X, y)
predict_2 = model_2.predict(X)

model_3 = KNeighborsRegressor(10, weights='uniform')
model_3.fit(X, y)
predict_3 = model_3.predict(X)

model_4 = linear_model.BayesianRidge()
model_4.fit(X, y)
predict_4 = model_4.predict(X)

model_5 = tree.DecisionTreeRegressor(max_depth=1)
model_5.fit(X, y)
predict_5 = model_5.predict(X)

model_6 = svm.SVR(C=1.0, epsilon=0.2)
model_6.fit(X, y)
predict_6 = model_6.predict(X)

model_7 = xgb.XGBRegressor()
model_7.fit(X, y)
predict_7 = model_7.predict(X)

Then, I collect the predictions in another dataframe.

final_df = pd.DataFrame()
final_df['SalePrice'] = y
final_df['RandomForest'] = predict_1
final_df['Ridge'] = predict_2
final_df['Kneighboors'] = predict_3
final_df['BayesianRidge'] = predict_4
final_df['DecisionTreeRegressor'] = predict_5
final_df['Svm'] = predict_6
final_df['XGBoost'] = predict_7

I loaded the predictions into this dataframe. Next, I will use a bagging algorithm on these predictions.

Again, if you print each model's error on the training data, random forest is the most accurate.

print(mean_squared_error(final_df['SalePrice'], predict_1))
print(mean_squared_error(final_df['SalePrice'], predict_2))
print(mean_squared_error(final_df['SalePrice'], predict_3))
print(mean_squared_error(final_df['SalePrice'], predict_4))
print(mean_squared_error(final_df['SalePrice'], predict_5))
print(mean_squared_error(final_df['SalePrice'], predict_6))
print(mean_squared_error(final_df['SalePrice'], predict_7))

After that, I take the features and label from this final dataframe and train a BaggingRegressor on them.

from sklearn.ensemble import BaggingRegressor

X_final = final_df.drop('SalePrice', axis=1)
y_final = final_df['SalePrice']

final_dt = RandomForestRegressor()
model_last = BaggingRegressor(base_estimator=final_dt, n_estimators=40, random_state=1, oob_score=True)
model_last.fit(X_final, y_final)
predict_final = model_last.predict(X_final)

# Out-of-bag estimate of the ensemble's score
acc_oob = model_last.oob_score_
print(acc_oob)

`mean_squared_error(predict_final, y_final)`

The error is 8578886.582733957. That is very high. Only the plain random forest with the PCA features had a very low error, so I select that model over this more complicated one. Sometimes, even when the ideas seem good, the results in machine learning research are not pleasant. This area is fuzzy.

Although the results are not delightful, I will explain the remaining steps; this methodology could work, and be improved, with some changes.

Each previously trained model makes predictions on the test dataframe.

test_predictions_1 = model_1.predict(test)
test_predictions_2 = model_2.predict(test)
test_predictions_3 = model_3.predict(test)
test_predictions_4 = model_4.predict(test)
test_predictions_5 = model_5.predict(test)
test_predictions_6 = model_6.predict(test)
test_predictions_7 = model_7.predict(test)

Next, I create another dataframe for test results.

test_final_df = pd.DataFrame()
test_final_df['RandomForest'] = test_predictions_1
test_final_df['Ridge'] = test_predictions_2
test_final_df['Kneighboors'] = test_predictions_3
test_final_df['BayesianRidge'] = test_predictions_4
test_final_df['DecisionTreeRegressor'] = test_predictions_5
test_final_df['Svm'] = test_predictions_6
test_final_df['XGBoost'] = test_predictions_7

Finally, I run the last trained model on this dataframe.

`last_predictions = model_last.predict(test_final_df)`

Then, I load the sample submission CSV:

`submission = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv')`

Then I match the predicted values to the right indices:

`submission['SalePrice'] = last_predictions`

In the end I replaced last_predictions with the “test_predictions_1” variable, since the plain random forest performed best. Finally, I write the CSV file on the Kaggle platform. That is it. Then you can find the output and submit it.

`submission.to_csv('submission.csv', index=False)`

Thank you for reading. Have a nice week.

Credit: BecomingHuman By: Barış Can Tayiz