Using a sledgehammer to crack a nut: A machine learning model to sell my car

October 28, 2018 Thomas Jansson

Introduction

Circumstances changed, and after only 16 months I was going to sell my rather new Skoda Fabia Combi. But how much was it worth? There are pages dedicated to answering this question, but I suspected a bias toward pricing it too cheaply, in favor of resellers, so I wanted to do my own research. As I intended to sell my car on a large Danish car site, www.bilbasen.dk, I used the data on this site as a reference for my research. All the code and data from the research can be found in this GitHub repository: https://github.com/tjansson60/car_sale_model

Getting the data for the analysis

To get data for the analysis I wrote a simple scraper that collected basic data on the 490 Skoda Fabias on sale on bilbasen.dk at the time. The script can be seen here and should be fairly easy to adapt to another model:

scrape_bilbasen.py

The only manual part is setting up the trim package filters. For the 490 cars, it took around 10 minutes to run. The scraped data has the following form:

Analysing data from the 490 cars

After downloading the data, I analyzed the data set in the following Jupyter notebook:

data_analysis.ipynb

A key learning from the analysis was that the major features in determining the price seem to be the model year and the odometer (total distance traveled by the car). The following plots sum this up:

This was confirmed by inspecting the correlation between all variables. The price is strongly correlated with the model year and the horsepower of the car, and heavily anticorrelated with the odometer. It can also be seen that a manual gearbox is anticorrelated with the price, so my car with its automatic gearbox should benefit from this.
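
As a rough sketch of how such a correlation check could look (this is not the notebook's actual code), the snippet below builds a small synthetic table and correlates every column with the price. The column names `price`, `model_year`, `odometer_km` and `gear`, and all the numbers, are assumptions made up for illustration; the real scraped data may be structured differently.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the scraped listings.
# Column names and coefficients are hypothetical, not taken from the real data.
rng = np.random.default_rng(0)
n = 490
model_year = rng.integers(2008, 2019, size=n)
odometer_km = np.clip((2019 - model_year) * 15_000 + rng.normal(0, 10_000, size=n), 0, None)
is_manual = rng.integers(0, 2, size=n)
price = (
    20_000
    + 15_000 * (model_year - 2008)
    - 0.3 * odometer_km
    - 20_000 * is_manual
    + rng.normal(0, 5_000, size=n)
)

cars = pd.DataFrame(
    {
        "price": price,
        "model_year": model_year,
        "odometer_km": odometer_km,
        "gear": np.where(is_manual == 1, "manual", "automatic"),
    }
)

# One-hot encode the categorical column so it can take part in the correlation.
encoded = pd.get_dummies(cars, columns=["gear"], dtype=float)

# Pearson correlation of every feature with the price, most negative first.
price_corr = encoded.corr()["price"].drop("price").sort_values()
print(price_corr)
```

One-hot encoding is one simple way to include categorical features such as gear type or region in a correlation matrix; the two gear columns are exact complements, so their correlations with the price are mirror images of each other.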

I had expected the region to play a role, but I found no real correlation between the price and the region, which in hindsight makes sense as the internet removes this bias:

Machine learning model – Random Forest Regressor

I could very well just have used the cross-plots and found a reasonable price, but this would not be using a sledgehammer to crack the nut, so instead I fitted a random forest regressor, a machine learning model, to predict the price based on the data I had. This was done in the Jupyter notebook:

predictive_price_model.ipynb

I trained a Random Forest Regressor to predict the selling price of my car using the 490 cars as input. The median absolute error of the model is 4882 DKK, about 5.6% of the mean price, which is not bad considering that car prices ranged from 4999 to 207800 DKK with a mean of 87660 DKK.
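
For readers who want to try something similar, here is a minimal sketch of training a random forest and measuring its median absolute error on held-out data with scikit-learn. It runs on synthetic data; the features, the price formula, the 75/25 split and the hyperparameters are all assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 490 listings (hypothetical features and prices).
rng = np.random.default_rng(42)
n = 490
model_year = rng.integers(2008, 2019, size=n)
odometer_km = rng.uniform(0, 250_000, size=n)
horsepower = rng.choice([60, 75, 90, 110], size=n)

X = np.column_stack([model_year, odometer_km, horsepower])
y = (
    15_000 * (model_year - 2008)
    - 0.2 * odometer_km
    + 300 * horsepower
    + rng.normal(0, 5_000, size=n)
)

# Hold out a quarter of the cars so the error is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Median absolute error: half of the test predictions are closer than this.
test_error = median_absolute_error(y_test, model.predict(X_test))
print(f"Median absolute error on held-out cars: {test_error:.0f}")
```

The median absolute error is less sensitive to a few badly mispriced listings than the mean absolute error, which makes it a reasonable choice when the price range is as wide as it is here.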

One of the benefits of using a random forest regressor is that it is quite easy to understand the results. From the model, it is possible to extract the feature importances and hence understand which factors are the most defining in the model. In this case, and a bit to my surprise, the model year was even more important than the odometer.
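
Extracting the importances from a fitted scikit-learn forest takes only a few lines. The sketch below again uses synthetic data with hypothetical feature names, so the ranking it prints follows from how the fake prices were constructed and says nothing about the real listings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature names and synthetic data, for illustration only.
feature_names = ["model_year", "odometer_km", "horsepower"]
rng = np.random.default_rng(0)
n = 490
model_year = rng.integers(2008, 2019, size=n)
odometer_km = rng.uniform(0, 250_000, size=n)
horsepower = rng.choice([60, 75, 90, 110], size=n)

X = np.column_stack([model_year, odometer_km, horsepower])
y = (
    15_000 * (model_year - 2008)
    - 0.2 * odometer_km
    + 300 * horsepower
    + rng.normal(0, 5_000, size=n)
)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based score per input column;
# the scores are normalised so that they sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:12s} {importance:.3f}")
```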

Model explainer and test case

Using a tool such as lime or shap, it is also rather easy to understand how the model makes a prediction. In the following example, I have taken a known example from the test part of my data (which has not been used in the training of the model). The estimate is only 7000 DKK from the actual listed price, which I consider fairly good, but more importantly, using lime I can understand why this is the case:

As can be seen, the largest contributor to this price is that the car is fairly new, but also that it has driven less than 42000 km. Apparently, the car being sold from the north of Zealand takes the price estimate down by 2129 DKK, but this is a minor effect.

What is my car worth?

By describing my car with the same parameters as the model uses, it is possible to ask the model what the price estimate should be for my car and which factors contributed to it:

So according to the model, I should sell my car for 180600 DKK.

Model limitations and follow up

I only had a limited amount of metadata from the scraping to train my model, and in reality the price is also influenced by factors such as the following:

Is it a dealer sale or a private sale?
What kind of warranty is available?
How many cylinders does the engine have?
…

Using the model output as a starting point, I decided to set the price at 189900 DKK, as I had extended the factory warranty by 3 years in the original purchase. This is not something my model accounted for, and I knew it would be a strong selling point when competing with dealership sales, which usually include some sort of extended warranty.

After a week or so I sold the car for 192000 DKK, as it turned out that the engine having 4 cylinders was worth extra to the final buyer, who had entered a small bidding war. The newer 2018 Skoda Fabia Combi 1.0 TSI engine has only 3 cylinders, so this increased the value of my older car.

Conclusion

The model gave a good indication of the selling price of my car, but factors unknown to me turned out to make my car worth even more than I anticipated. An added bonus of this whole exercise was that it was fairly easy for me to reject buyers trying to low-ball the price. As it turned out, these people had seldom done the same amount of research when I asked them.