This is the final part of a three-part article recently published in DataScience+. Using the dataset prepared in part 1, this post continues the applications of unsupervised machine learning algorithms covered in part 2 and illustrates principal component analysis as a data reduction technique.

The scikit-learn documentation recommends using PCA to first lower the dimensionality of the data: "It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high."

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import scale

# X and y are the predictors and the response prepared in part 1
regr = LinearRegression()
pca2 = PCA()

# Split into training and test sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.5, random_state=1)

# Scale the data
X_reduced_train = pca2.fit_transform(scale(X_train))
n = len(X_reduced_train)

# 10-fold CV, with shuffle
kf_10 = model_selection.KFold(n_splits=10, shuffle=True, random_state=1)

mse = []

# Calculate MSE with only the intercept (no principal components in regression)
score = -1 * model_selection.cross_val_score(regr, np.ones((n, 1)), y_train.ravel(),
                                             cv=kf_10, scoring='neg_mean_squared_error').mean()
mse.append(score)

# Calculate MSE using CV for the 19 principal components, adding one component at a time
for i in np.arange(1, 20):
    score = -1 * model_selection.cross_val_score(regr, X_reduced_train[:, :i], y_train.ravel(),
                                                 cv=kf_10, scoring='neg_mean_squared_error').mean()
    mse.append(score)

# Plot the cross-validated MSE against the number of principal components
plt.plot(mse, '-v')
plt.xlabel('Number of principal components in regression')
```
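As a small follow-up sketch (not part of the original code above), the `mse` list can also be inspected directly to report how many components give the lowest cross-validated error; `np.argmin` is used here purely for illustration:

```python
# mse[0] is the intercept-only model and mse[i] uses the first i principal components,
# so the index of the minimum equals the number of components with the lowest CV MSE.
best_k = int(np.argmin(mse))
print(f"Lowest cross-validated MSE ({min(mse):.2f}) with {best_k} principal components")
```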
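The scikit-learn recommendation quoted earlier also names TruncatedSVD as the counterpart of PCA for sparse data. The snippet below is only an illustrative sketch of that alternative; the random sparse matrix, its shape, and the choice of 50 components are assumptions for demonstration and are not part of the dataset used in this series:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Illustrative sparse input: 100 samples with 1,000 mostly-zero features (assumed shape)
X_sparse = sparse_random(100, 1000, density=0.01, random_state=1)

# Reduce to a "reasonable" number of dimensions (e.g. 50) before fitting a downstream model
svd = TruncatedSVD(n_components=50, random_state=1)
X_svd = svd.fit_transform(X_sparse)
print(X_svd.shape)  # (100, 50)
```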