10 skills found
UniprJRC / FSDA
Flexible Statistics and Data Analysis (FSDA) extends MATLAB for a robust analysis of data sets affected by different sources of heterogeneity. It is open source software licensed under the European Union Public Licence (EUPL). FSDA is a joint project by the University of Parma and the Joint Research Centre of the European Commission.
jroakes / ForecastGA
A Python tool to forecast Google Analytics data using several popular time series models.
tk3369 / BoxCoxTrans.jl
Box Cox transformation in Julia
shenxiangzhuang / Beer
Beer: Challenging Problems in Probability and Statistics
jvirico / Normality Tests Pvalues Boxcoxtransformations
Strategies for analyzing the distribution of datasets and shifting the data toward a normal distribution by testing different manual transformations and the Box-Cox transformation.
sharmaroshan / Predicting Money Spent At Resort
From the Analytics Vidhya hackathons, sponsored by Club Mahindra. This is a regression problem in which accuracy matters most, measured by RMSE score. It uses techniques such as stacking, ensembling, and boosting, along with scientific operations such as the Box-Cox transformation to reduce the skewness of the data.
janeminmin / Bluebikes Project
1> Background information
Bluebikes is Metro Boston’s public bike share program, with more than 1800 bikes at over 200 stations across Boston and nearby areas. The bike sharing program launched in 2011. It is intended for short-term use for a fee. It allows individuals to borrow a bike from one dock station and return it to another after use, which makes it ideal for one-way trips. The City of Boston is committed to providing bike share as a part of the public transportation system. However, to build a transport system that encourages bicycling, it is important to build knowledge about the current bicycle flows and about the factors involved in potential bicyclists' decisions on whether to use the bicycle. It is reasonable to hypothesize that age and gender, bicycle infrastructure, and safety perception are possible determinants of bicycling. From a short-term perspective, weather has been shown to play an important role in whether people choose the bicycle.
2> Data collection
Bluebikes collects and provides system data to the public. The datasets used in the project can be downloaded through this link (https://www.bluebikes.com/system-data). This time series dataset (from 2017-01-01 00:00:00 to 2019-03-31 23:00:00) includes: trip duration, start time and date, stop time and date, start station name and id, end station name and id, bike id, user type (casual or subscribed), birth year, and gender. Trips shorter than 60 seconds are considered potential false starts and have already been removed from the datasets. The number of bicycles used during a particular time period varies over time based on several factors, including the current weather conditions, time of day, time of year, and the biker's current interest in using the bicycle as a transport mode.
The current interest differs between subscribed users and casual users, so we should analyze them separately. Factors such as season, day of the week, month, hour, and whether a day is a holiday can be extracted from the date and time column in the datasets. Since we analyze the hourly bicycle rental flow, we need hourly weather data from 2017-01-01 00:00:00 to 2019-03-31 23:00:00 to complete our regression model. The weather data used in the project was scraped using Python Selenium from the Logan Airport station (42.38 °N, 71.04 °W) webpage (https://www.wunderground.com/history/daily/us/ma/boston/KBOS/date/2019-7-15) maintained by the Weather Underground website. The hourly weather observations include time, temperature, dew point, humidity, wind, wind speed, wind gust, pressure, precipitation, accumulated precipitation, and condition.
3> The problem
The aim of the project is to gain insight into the factors that could give a short-term perspective on bicycle flows in Boston. It also aims to investigate how busy each station is, the division of bicycle trip direction and duration of usage at a busy station, and the variation of mean flows within a day or during that period. In addition to the factors included in the regression model, there are other factors that influence how bicycle flows vary over longer periods of time, for example the general tendency to use the bicycle. Therefore, there is potential to improve the regression model's accuracy by incorporating a long-term trend estimate taken over the time series of bicycle usage. The result from the machine learning regression model should then be compared with time series forecasting models.
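As a minimal sketch of the calendar factors mentioned in the data collection notes (season, day of the week, month, hour, holiday flag), the following standard-library Python derives them from a timestamp in the format the dataset range uses. The names `HOLIDAYS`, `season_of`, and `calendar_features` are hypothetical, the holiday list is a tiny illustrative stand-in, and the month-to-season mapping is an assumption; none of this is taken from the project itself.

```python
from datetime import date, datetime

# Hypothetical, incomplete holiday list used only for illustration.
HOLIDAYS = {date(2017, 1, 1), date(2017, 7, 4), date(2018, 12, 25)}


def season_of(month: int) -> str:
    """Map a month number to a season (assumed Northern Hemisphere mapping)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"


def calendar_features(timestamp: str) -> dict:
    """Extract calendar factors from a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "season": season_of(ts.month),
        "day_of_week": ts.weekday(),  # Monday = 0 ... Sunday = 6
        "month": ts.month,
        "hour": ts.hour,
        "is_holiday": ts.date() in HOLIDAYS,
    }


features = calendar_features("2017-07-04 08:00:00")
```

Each hourly row would get one such feature dictionary, which can then be joined with the hourly weather observations on the timestamp.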
4> Possible solutions
Data preprocessing/exploration and variable selection: date approximation manipulation, correlation analysis among variables, merging data, scrubbing for duplicate data, verifying errors, interpolation for missing values, handling outliers and skewness, binning low-frequency levels, encoding categorical variables.
Data visualization: split the number of bike usages by subscribed/casual to build time series; build a heatmap to show how busy each station is and locate the busiest station in the busiest period of a busy day; use boxplots and histograms to check outliers and determine an appropriate data transformation; use weather condition text to build a word cloud.
Time series trend curve estimates: the two approaches we considered are fitting polynomials of various degrees to the data points in the time series, or using time series decomposition and forecast functions to extract and forecast the trend. We emphasize the importance of generating trend curve estimates that do not follow the seasonal variations: the seasonal variations should be captured explicitly by the weather-related input variables in the regression model.
Prediction/regression/time series forecasting: It is possible to build a multilayer perceptron neural network regressor that gives predictions based on all date, time, and weather variables. However, considering model interpretability, we prefer to build regression models based on machine learning algorithms (like random forest or SVM) separately for subscribed and casual users. The regressor would then be combined with the trend curve extracted and forecasted by ARIMA, and compared with the results of time series forecasting by STL (Seasonal and Trend decomposition using Loess) with multiple seasonal periods and by TBATS (Trigonometric Seasonal, Box-Cox Transformation, ARMA residuals, Trend and Seasonality).
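The polynomial option for trend curve estimates can be sketched as follows. This is a minimal standard-library illustration, not the project's code: `fit_polynomial` and `evaluate` are hypothetical helpers that solve the least-squares normal equations directly, and the hourly counts are synthetic values invented for the example. In practice a library routine such as `numpy.polyfit` would be the more usual choice.

```python
import math


def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns c with y ~ c[0] + c[1]*x + ... + c[degree]*x**degree."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - known) / a[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Synthetic hourly counts (invented): a linear trend plus a daily cycle, two weeks long.
hours = list(range(24 * 14))
t = [h / hours[-1] for h in hours]  # rescale time to [0, 1] for numerical stability
counts = [100 + 50 * ti + 20 * math.sin(2 * math.pi * h / 24) for h, ti in zip(hours, t)]

# A low degree keeps the curve from following the daily (seasonal) variation.
trend_coeffs = fit_polynomial(t, counts, degree=1)
trend = [evaluate(trend_coeffs, ti) for ti in t]
```

Keeping the degree low is what stops the trend estimate from tracking the seasonal swings, which the text leaves to the weather-related inputs of the regression model.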
kentmacdonald2 / Box Cox Transformation Python Example
Companion code from Box-Cox transformation tutorial on kmdatascience.com
czaj / BoxCox
Box-Cox regression
JuliaMixedModels / BoxCox.jl
Box-Cox transformation in Julia
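Most of the repositories listed above center on the Box-Cox transformation. As a rough, self-contained sketch of the idea (not the implementation used by any of the listed projects), the standard-library Python below applies the transform y = (x^λ − 1)/λ, with y = log x when λ = 0, and picks λ from a coarse grid by maximizing the usual profile log-likelihood. The function names and the grid are assumptions made for this example.

```python
import math


def boxcox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def boxcox_loglik(values, lam):
    """Profile log-likelihood of lambda, up to an additive constant."""
    n = len(values)
    y = boxcox(values, lam)
    mean = sum(y) / n
    var = sum((yi - mean) ** 2 for yi in y) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)


def best_lambda(values, grid=None):
    """Pick the lambda on a coarse grid that maximizes the log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]  # -2.0 ... 2.0 in steps of 0.1
    return max(grid, key=lambda lam: boxcox_loglik(values, lam))


# Right-skewed sample whose logarithms are evenly spaced around zero.
sample = [math.exp(z) for z in (-2, -1, 0, 1, 2)]
lam_hat = best_lambda(sample)
transformed = boxcox(sample, lam_hat)
```

A grid search is used here only to keep the sketch dependency-free; libraries such as SciPy (`scipy.stats.boxcox`) estimate λ with a numerical optimizer instead.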