- The median income is a very important attribute for predicting median housing prices. If we look at the median income histogram more closely, most of the values are clustered around 2 to 5 (i.e., $20,000 to $50,000), but some median incomes go far beyond 6 ($60,000).
- It is important to have a sufficient number of instances in our dataset for each stratum, or else estimates based on that stratum may be biased. This means that we should not have too many strata.
# Limit the number of categories: divide by 1.5 and round up with ceil to get discrete categories
housing['income_cat'] = np.ceil(housing['median_income'] / 1.5)
# Keep only the categories lower than 5 and merge the other categories into category 5
housing['income_cat'] = housing['income_cat'].where(housing['income_cat'] < 5, 5.0)
housing['income_cat'].hist()
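To make the binning concrete, here is a small sketch that applies the same three steps (divide by 1.5, round up, cap at 5) to a handful of made-up median income values. The `sample` DataFrame and its values are assumptions for illustration only, not rows from the housing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical median incomes, in tens of thousands of dollars
sample = pd.DataFrame({'median_income': [0.9, 2.4, 3.1, 4.6, 6.2, 11.0]})

# Divide by 1.5 and round up to get a discrete category
sample['income_cat'] = np.ceil(sample['median_income'] / 1.5)

# Keep categories below 5 and merge everything else into category 5
sample['income_cat'] = sample['income_cat'].where(sample['income_cat'] < 5, 5.0)

income_cats = sample['income_cat'].tolist()
print(income_cats)  # [1.0, 2.0, 3.0, 4.0, 5.0, 5.0]
```

The two highest incomes would land in categories 5 and 8 after rounding up, and the cap folds both into category 5, which keeps the number of strata small.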
- Now we are ready to do stratified sampling based on the income category. In stratified sampling, the population is divided into separate groups, called strata, and a probability sample (often a simple random sample) is then drawn from each group.
Here we use scikit-learn’s StratifiedShuffleSplit class.
from sklearn.model_selection import StratifiedShuffleSplit
# test set generation
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

strat_test_set['income_cat'].value_counts() / len(strat_test_set)
3.0 0.350533
Name: income_cat, dtype: float64

strat_test_set['income_cat'].describe()
count 4128.000000
Name: income_cat, dtype: float64
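One way to see what stratification buys us is to compare the category proportions of a stratified test set and a purely random one against the full data. The sketch below does this on synthetic data. The `data` DataFrame, its `cat` column, and the category probabilities are all assumptions made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic data: 1,000 rows with a deliberately imbalanced category column
rng = np.random.RandomState(42)
data = pd.DataFrame({
    'value': rng.rand(1000),
    'cat': rng.choice([1, 2, 3, 4, 5], size=1000,
                      p=[0.05, 0.30, 0.35, 0.20, 0.10]),
})

# Stratified split: the test set mirrors the category proportions
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data['cat']):
    strat_test = data.iloc[test_index]

# Purely random split for comparison
_, random_test = train_test_split(data, test_size=0.2, random_state=42)

def proportions(df):
    """Share of rows in each category, ordered by category."""
    return df['cat'].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    'overall': proportions(data),
    'stratified': proportions(strat_test),
    'random': proportions(random_test),
})
comparison['strat_error'] = (comparison['stratified'] - comparison['overall']).abs()
comparison['random_error'] = (comparison['random'] - comparison['overall']).abs()
print(comparison)
```

The split yields positional indices, so the sketch uses `iloc`. Using `loc` with them, as in the walkthrough, gives the same rows only when the DataFrame has its default 0..n-1 index.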
- Now we should remove the income_cat attribute so that the data is back to its original state.
for set_ in (strat_train_set, strat_test_set):
    set_.drop('income_cat', axis=1, inplace=True)
- Let’s have a look at all the districts of California to visualize the data.
housing.plot(kind='scatter', x='longitude', y='latitude')
- This looks like California, but it is hard to see any particular pattern in the density of the points.
# Setting alpha to 0.1 makes it much easier to see where there is a high density of data points.
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1)
- Now that’s much better: we can clearly see the high-density areas, namely the Bay Area and around Los Angeles and San Diego, plus a long line of fairly high density in the Central Valley, in particular around Sacramento and Fresno.
- Now let us look at the housing price. The radius of each circle represents the district’s population and the color represents the price.
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
             s=housing['population']/100, label='population', figsize=(10,7),
             # color each point by the district's median house value
             c='median_house_value', colorbar=True)
plt.title('California housing prices')
plt.legend()
The housing prices are closely related to the location and to the population density.
- From the scatter plot it is clear that house prices are higher close to the ocean, so the ocean proximity attribute is useful in this case, although house prices in the coastal areas of northern California are not too high.
- Next, we will look for correlations between attributes, i.e., Pearson’s r between every pair of attributes.
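# numeric_only=True skips the non-numeric ocean_proximity column (required on pandas 2.x)
corr_matrix = housing.corr(numeric_only=True)
corr_matrix['median_house_value'].sort_values(ascending=False)
Name: median_house_value, dtype: float64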
- The correlation coefficient ranges from -1 to 1. When it is close to 1, there is a strong positive correlation; here, the median house value tends to go up when the median income goes up. When it is close to -1, there is a strong negative correlation. We can see a small negative correlation between latitude and median house value: in our plot, house prices tend to go down as we go north.
- When the coefficient is close to zero, there is no linear correlation.
- The plot below shows the standard correlation coefficient for various datasets:
from IPython.display import SVG
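As a quick check of what the coefficient does and does not capture, the following sketch computes Pearson’s r with NumPy for three made-up relationships. The arrays `x`, `y_linear`, `y_negative`, and `y_parabola` are illustrative assumptions, not part of the housing data.

```python
import numpy as np

# 201 evenly spaced points, symmetric around zero
x = np.linspace(-1, 1, 201)

y_linear = 2 * x + 1    # perfectly linear, increasing
y_negative = -3 * x     # perfectly linear, decreasing
y_parabola = x ** 2     # fully determined by x, but not linear

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r between the two inputs
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_negative = np.corrcoef(x, y_negative)[0, 1]
r_parabola = np.corrcoef(x, y_parabola)[0, 1]

print(r_linear, r_negative, r_parabola)
```

The first two values come out at 1 and -1 regardless of the slope, while the parabola gives a value of essentially zero even though y depends entirely on x. A coefficient near zero therefore only rules out a linear relationship.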
- We can also show these correlations using the pandas scatter_matrix function. In this plot we will focus on a few promising attributes.
from pandas.plotting import scatter_matrix
attributes = ['median_house_value', 'median_income', 'total_rooms', 'housing_median_age']
scatter_matrix(housing[attributes], figsize=(12,8))

housing.plot(x='median_income', y='median_house_value', alpha=0.4, kind='scatter')
- The plot confirms that the house price is clearly correlated with the median income.
- Now it’s time to prepare the data for our machine learning algorithm. One useful step is to try out various attribute combinations.
- For example, the total number of rooms in a district is not very informative if we don’t know how many households there are.
- Similarly, we want to relate the total population to the number of households.
housing['rooms_per_household'] = housing['total_rooms']/housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms']/housing['total_rooms']
housing['population_per_household'] = housing['population']/housing['households']
I want to see the correlation matrix again.
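# numeric_only=True skips the non-numeric ocean_proximity column (required on pandas 2.x)
corr_matrix = housing.corr(numeric_only=True)
corr_matrix['median_house_value'].sort_values(ascending=False)
median_house_value 1.000000
Name: median_house_value, dtype: float64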
- The new bedrooms_per_room attribute is much more correlated with the median house value than the total number of rooms or bedrooms. Apparently, houses with a lower bedroom/room ratio tend to be more expensive.
- The number of rooms per household is also more informative than the total number of rooms in a district: the larger the houses, the more expensive they tend to be.
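To illustrate why the ratios can tell a different story from the raw totals, here is a minimal sketch on three made-up districts. The `districts` DataFrame, its labels A to C, and all its values are assumptions chosen for the example.

```python
import pandas as pd

# Three hypothetical districts (values invented for illustration)
districts = pd.DataFrame(
    {
        'total_rooms': [6000, 1200, 3000],
        'total_bedrooms': [1500, 240, 660],
        'population': [4500, 500, 1800],
        'households': [1500, 200, 600],
    },
    index=['A', 'B', 'C'],
)

# Same attribute combinations as in the walkthrough
districts['rooms_per_household'] = districts['total_rooms'] / districts['households']
districts['bedrooms_per_room'] = districts['total_bedrooms'] / districts['total_rooms']
districts['population_per_household'] = districts['population'] / districts['households']

# The district with the most rooms overall vs. the most rooms per household
most_rooms_total = districts['total_rooms'].idxmax()
most_rooms_per_household = districts['rooms_per_household'].idxmax()

print(districts)
print(most_rooms_total, most_rooms_per_household)
```

District A has by far the most rooms in total, but only because it has the most households. Per household, district B has the largest homes, which is the kind of signal the raw totals hide.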