Data Science Portfolio
I have 3+ years of IT experience and 1+ year of data science experience. I have worked on multi-class classification algorithms (SVM, Decision Tree, Random Forest, and XGBoost) for handwritten digit recognition, on raisin class prediction with KNN and Logistic Regression, and on concrete strength prediction with Linear Regression. I also have extensive experience in web scraping and web automation. A motivated, detail-oriented team player with a strong work ethic and a passion for innovation, I am looking for opportunities to leverage my skills and experience to drive impactful projects and achieve business objectives.
The concrete slump test measures the consistency of fresh concrete before it sets. It is performed to check the workability of freshly made concrete, and therefore the ease with which concrete flows. It can also be used as an indicator of an improperly mixed batch.
The data set consists of various cement properties and the resulting slump test metrics in cm. The set concrete is then tested for its compressive strength 28 days later.
The Ridge model seems better according to the R-squared and mean absolute error (MAE) scores.
According to the feature importance coefficients, Cement, Fly ash, Coarse aggr., and Water are the most important and representative features for the model.
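Reading Ridge coefficients as feature importances only works when the features are on a comparable scale, so the model is usually fit on standardized inputs. A minimal sketch of that idea (the data below are synthetic stand-ins, not the actual concrete dataset):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four features (e.g. Cement, Fly ash, Coarse aggr., Water)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Standardize first so the Ridge coefficients are comparable across features
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)

coefs = model.named_steps["ridge"].coef_
print(coefs)  # larger |coef| -> more influential feature on this scale
```

Here the first feature dominates by construction; on the real data the same comparison of absolute coefficient values ranks the features.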
Go to the GitHub page of Logistic Regression...
Go to the GitHub page of KNN...
Images of the Kecimen and Besni raisin varieties grown in Turkey were obtained with a computer vision system (CVS). A total of 900 raisin grains were used, 450 from each variety. These images were subjected to various pre-processing stages, and 7 morphological features were extracted. These features were then classified using three different artificial intelligence techniques.
https://archive.ics.uci.edu/ml/datasets/Raisin+Dataset
The Area, MajorAxisLength, ConvexArea, and Perimeter features appear to be the most useful for distinguishing the two classes.
Since the data contain marginal values, two scaling methods (Standard Scaler and Robust Scaler) were applied and compared. No significant difference was found between them, yet the Robust Scaler results are slightly better on the confusion matrix, since the Robust Scaler is more resistant to outliers.
* Standard Scaled Model ---> ROC-AUC = 0.92 ---> 23 false predictions
* Robust Scaled Model ---> ROC-AUC = 0.92 ---> 22 false predictions
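The scaler comparison above can be sketched as a pair of pipelines scored side by side. This is a minimal illustration on synthetic data (the real project uses the 7 raisin features), not the project's exact code:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic stand-in: 900 samples, 7 features, like the raisin dataset's shape
X, y = make_classification(n_samples=900, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for scaler in (StandardScaler(), RobustScaler()):
    pipe = make_pipeline(scaler, LogisticRegression(max_iter=1000))
    pipe.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
    print(type(scaler).__name__, "ROC-AUC:", round(auc, 3))
```

RobustScaler centers on the median and scales by the IQR, which is why it is less affected by the marginal values mentioned above.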
Red dots represent test data, while blue ones represent train data. The smallest gap between the train and test error rates corresponds to k=13. However, smaller k values could also be a good option if there is a cost concern, so smaller k values were also applied to compare the results.
* KNN_19 (grid) Model ---> accuracy: 0.83 ---> roc_auc: 0.92 ---> 30 false predictions
* KNN_13 (elbow) Model ---> accuracy: 0.87 ---> roc_auc: 0.91 ---> 29 false predictions
* KNN_5 (elbow) Model ---> accuracy: 0.86 ---> roc_auc: 0.89 ---> 26 false predictions
* Two of the scores for the KNN_5 model are better than those of the grid model.
* KNN_5 would be the more cost-efficient option to choose. However, if cost is not an issue, KNN_13 is also a good option.
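The elbow search described above scans k values and tracks the train/test error gap. A minimal sketch on synthetic data (not the project's actual dataset or plotting code):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the same rough shape as the raisin data
X, y = make_classification(n_samples=900, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gaps = {}
for k in range(1, 20, 2):  # odd k avoids ties in binary voting
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_err = 1 - knn.score(X_tr, y_tr)
    test_err = 1 - knn.score(X_te, y_te)
    gaps[k] = abs(test_err - train_err)  # the gap plotted in the elbow chart

best_k = min(gaps, key=gaps.get)
print("k with smallest train/test gap:", best_k)
```

Plotting `gaps` against k reproduces the red/blue error-rate chart; the elbow is where the gap stops shrinking meaningfully.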
Logistic Regression scores were better than the KNN scores.
* KNN_13 (elbow) Model ---> accuracy: 0.87 ---> roc_auc: 0.91 ---> 28 false predictions
* Robust Scaled Model ---> ROC-AUC = 0.92 ---> 22 false predictions
Go to the GitHub page of SVM...
Go to the GitHub page of Decision Tree...
Go to the GitHub page of Random Forest...
Go to the GitHub page of XGBoost...
This digit database was created by collecting 250 samples from each of 44 writers. A WACOM PL-100V pressure-sensitive tablet with an integrated LCD display and a cordless stylus was used, so the input and display areas are located in the same place. Attached to the serial port of an Intel 486-based PC, it allows handwriting samples to be collected. The tablet sends $x$ and $y$ tablet coordinates and pressure-level values of the pen at fixed time intervals (sampling rate) of 100 milliseconds.
The writers were asked to write 250 digits in random order inside boxes of 500 by 500 tablet pixel resolution. Subjects were monitored only during the first entry screens. Each screen contains five boxes with the digits to be written displayed above them. Subjects were told to write only inside these boxes. If they made a mistake or were unhappy with their writing, they were instructed to clear the content of a box using an on-screen button. The first ten digits are ignored because most writers are not familiar with this type of input device, but subjects are not made aware of this.
In this study, the researchers use only the $(x, y)$ coordinate information; the stylus pressure-level values are ignored. The raw data captured from the tablet consist of integer values between 0 and 500 (tablet input box resolution). The coordinates are then rescaled so that the coordinate with the maximum range varies between 0 and 100. Usually $x$ stays in this range, since most characters are taller than they are wide.
Attribute information:
In order to train and test the classifiers, digits must be represented as constant-length feature vectors. A commonly used technique that leads to good results is resampling the $(x_t, y_t)$ points. Either temporal resampling (points regularly spaced in time) or spatial resampling (points regularly spaced in arc length) can be used here. The raw point data are already regularly spaced in time, but the distance between them is variable.
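Spatial resampling can be sketched as linear interpolation along the stroke's cumulative arc length. This is an illustration of the idea, not the dataset authors' exact preprocessing code:

```python
import numpy as np

def spatial_resample(points, n=8):
    """Resample a stroke to n points equally spaced in arc length.

    `points` is an (m, 2) sequence of raw (x, y) tablet coordinates.
    """
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)  # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])            # cumulative arc length
    targets = np.linspace(0.0, s[-1], n)                   # equally spaced stations
    x = np.interp(targets, s, points[:, 0])
    y = np.interp(targets, s, points[:, 1])
    return np.stack([x, y], axis=1)

# A simple L-shaped stroke reduced to a constant-length feature vector
stroke = [(0, 0), (0, 100), (100, 100)]
features = spatial_resample(stroke, n=8).ravel()
print(features.shape)  # (16,): 8 resampled points * 2 coordinates
```

Whatever the original sampling rate produced, every digit ends up as the same fixed-length vector, which is exactly what the classifiers require.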
- Data reference link :
https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
The data (input2, input4, and input13) contain outliers. All rows containing outliers will be dropped. However, both the cleaned and uncleaned datasets will be used for the models in order to compare the results.
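One common way to drop such rows is the 1.5-IQR rule applied per column. A minimal sketch on synthetic data (the column names match the text, but the values and the exact rule used in the project are assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the three columns flagged as having outliers
rng = np.random.default_rng(0)
df = pd.DataFrame({"input2": rng.normal(size=200),
                   "input4": rng.normal(size=200),
                   "input13": rng.normal(size=200)})
df.loc[0, "input2"] = 15.0  # plant an obvious outlier

# Keep only rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for every flagged column
mask = pd.Series(True, index=df.index)
for col in ["input2", "input4", "input13"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask &= df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

cleaned = df[mask]
print(len(df), "->", len(cleaned))
```

Keeping both `df` and `cleaned` makes the with/without-outliers model comparison described above straightforward.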
Comparison between cleaned and uncleaned data results
* The cleaned data (with outliers dropped) perform slightly better than the uncleaned data.
* Outlier cleaning seemed to help remove noise from the data.
* The current results are good enough, but for better results, further cleaning could be done by considering the outliers of each class individually.
Comparison between cleaned, uncleaned, bagging results
* Decision Tree test results increased slightly while train results decreased with the cleaned data. It seems that overfitting decreased slightly after dropping the outliers.
* The cleaned data were also used with the bagging method for comparison. The results were better for both the test and training scores. Overall, overfitting was eliminated and test scores improved with the cleaned data and bagging.
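The bagging comparison above can be sketched by scoring a single tree against a bagged ensemble of trees. Here scikit-learn's bundled digits set stands in for the pen-based dataset; this is an illustration, not the project's exact setup:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# sklearn's digits dataset as a stand-in for the pen-based digit data
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Single tree vs. 100 bootstrap-aggregated trees
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
bag = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                        n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single tree test accuracy:", round(tree.score(X_te, y_te), 3))
print("bagged trees test accuracy:", round(bag.score(X_te, y_te), 3))
```

Averaging many trees trained on bootstrap samples reduces variance, which is why bagging narrows the train/test gap noted above.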
* All four models have pretty good scores.
* SVM takes first place in terms of scores.
* However, in a real-world scenario, the score is not the only thing to consider.
* For example, if an explainable model is needed, the Decision Tree is the only option.
* In this context, which algorithm to choose for deployment depends on the business problem we plan to solve.
The goal was to gain insights into the income distribution of US citizens and to explore potential relationships between income and other variables such as race, sex, education, and age.
The data were cleaned and explored, and univariate/multivariate analysis and outlier detection were performed. Trends, patterns, and interesting insights were identified about the income distribution and its relationship with other variables. Data visualization was performed using graphs, charts, and plots to present the findings in a clear and understandable way.
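The univariate/multivariate steps described above can be sketched with a couple of pandas operations. The column names and values below are hypothetical placeholders, not the project's actual census fields:

```python
import pandas as pd

# Tiny hypothetical sample of income records (not the real survey data)
df = pd.DataFrame({
    "income": [">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"],
    "education": ["Bachelors", "HS-grad", "Masters",
                  "HS-grad", "Bachelors", "Masters"],
    "age": [39, 50, 38, 53, 28, 45],
})

# Univariate: overall class balance of the income variable
print(df["income"].value_counts(normalize=True))

# Multivariate: share of high earners within each education level
rate = (df.assign(high=df["income"].eq(">50K"))
          .groupby("education")["high"].mean())
print(rate)
```

The same pattern (a boolean flag plus `groupby(...).mean()`) extends to race, sex, or age bins for the other relationships mentioned.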
Data was downloaded from the website of the Turkish Statistical Institute. Subsequently, data cleaning and preparation were performed using Python (Pandas library). The information was then visualized in interactive dashboards to showcase regional, city-based, and year-based differences. Below, you can access and explore the Tableau dashboards interactively.
Go to the GitHub page of Drinking Water Quality Assessment Project...
Go to the GitHub page of Automation Project...
The project's aim was to produce final Excel sheets presenting cleaned raw chemical monitoring data and the calculated drinking water quality classes for the Drinking Water Treatment Plants of Turkey. The raw data were scraped from a web-based database using Selenium. The data were then sorted by chemical name, date, etc., and assigned to the relevant quality classes according to the legislation requirements, using Python, Pandas, and SQL. The project automated the process 100% for four colleagues and saved each user 3-4 months of hard work.
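The class-assignment step after scraping can be sketched as a lookup of each measurement against per-parameter limits. The parameter names, limits, and class labels below are placeholders, not the actual legislation values:

```python
import pandas as pd

# Hypothetical scraped monitoring records (parameters and values are made up)
records = pd.DataFrame({
    "parameter": ["Nitrate", "Nitrate", "Arsenic"],
    "date": pd.to_datetime(["2021-03-01", "2021-06-01", "2021-03-01"]),
    "value": [12.0, 60.0, 0.004],
})

# Hypothetical class limits per parameter: value <= limit -> that class
limits = {"Nitrate": [(25.0, "A1"), (50.0, "A2")],
          "Arsenic": [(0.01, "A1"), (0.05, "A2")]}

def quality_class(row):
    for limit, label in limits.get(row["parameter"], []):
        if row["value"] <= limit:
            return label
    return "A3"  # exceeds all listed limits

# Sort as described (by chemical name and date), then assign classes
records = records.sort_values(["parameter", "date"])
records["class"] = records.apply(quality_class, axis=1)
print(records)
```

With the classes assigned, writing the final sheets is a single `DataFrame.to_excel` call per plant.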
The program automates official letter traffic, including preparing response letters and attachments, downloading/uploading files from web-based and cloud platforms, and keeping an organized file system. It has processed nearly 5000 official letters, automating 90-95% of the work, using Python and Selenium.