Six definite ways to improve efficiency and reduce load times

Photo by Jia Ye on Unsplash

Sqoop is an Apache Software Foundation tool commonly used in the Big Data world to import and export millions of records between heterogeneous relational databases (RDBMS) and the Hadoop Distributed File System (HDFS). These transfers can take anywhere from a couple of minutes to multiple hours. That is when data engineers worldwide look under the hood to fine-tune settings. The goal of performance tuning is to load more data in less time, which increases efficiency and lessens the chance of data loss in case of network timeouts.

In general, performance tuning…


Regression and EDA on personal health data to determine factors contributing to treatment

Photo by Kendal on Unsplash

Introduction

Linear regression is one of the most important algorithms in the supervised learning category of Machine Learning. It is also one of the simplest and most commonly used models for predictive analysis. Using it, we explore the personal health dataset and predict treatment and insurance costs.

What is a Linear Regression?

In the simplest terms, when the relationship between the target and one or more predictors is linear, the model is called a linear regression.
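To make the definition concrete, here is a minimal sketch of fitting a straight line with one predictor using ordinary least squares. The function name `fit_simple_linear_regression` and the toy numbers are hypothetical and chosen only for illustration; they are not taken from the personal health dataset or from the article's own code.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Covariance-like and variance-like sums (unnormalized)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical toy data lying exactly on the line y = 2x + 1
predictor = [1, 2, 3, 4, 5]
target = [3, 5, 7, 9, 11]

slope, intercept = fit_simple_linear_regression(predictor, target)
print(slope, intercept)
```

Because the toy points lie exactly on a line, the fitted slope and intercept recover that line. With real data the points scatter around the line, and the fit minimizes the sum of squared vertical distances.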


A Comparative Analytics Study Benchmarking Popular Programming Languages and Execution Engines

Photo by PAUL SMITH on Unsplash

Introduction

Have you ever wondered which programming languages and execution engines are the quickest or the slowest at processing files? Are you unsure which programming language you should code in to solve your business problem efficiently? Well, look no further. Here's your answer.

We take a look at popular languages like Python, Java, and Scala, and execution engines like Hadoop and Spark, and benchmark how they fare at processing files.

Methodology

We conduct data analysis and compare the execution times taken to compute the word count of input text files varying from…
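As a rough illustration of the kind of measurement described, the sketch below times a single word-count run in plain Python. It is not the article's benchmark harness: the helper names `word_count` and `timed_word_count` and the sample sentence are assumptions made for this example, and a real comparison would read files of varying sizes and repeat each run several times.

```python
import time
from collections import Counter


def word_count(text):
    """Count occurrences of each whitespace-separated word, case-insensitively."""
    return Counter(text.lower().split())


def timed_word_count(text):
    """Return (counts, elapsed_seconds) for a single word-count run."""
    start = time.perf_counter()
    counts = word_count(text)
    elapsed = time.perf_counter() - start
    return counts, elapsed


# Hypothetical in-memory input standing in for a text file
sample = "the quick brown fox jumps over the lazy dog the end"
counts, elapsed = timed_word_count(sample)
print(counts["the"], f"{elapsed:.6f}s")
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock suited to measuring short durations.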


IN-DEPTH ANALYSIS

Clustering Neighborhoods of London and Paris using Machine Learning

Photo by cyril mazarin on Unsplash

Introduction

A Tale of Two Cities, a novel by Charles Dickens, is set in London and Paris during the French Revolution. Both cities were happening places then, and they still are now. A lot has changed over the years, and we now take a look at how the cities have grown.

London and Paris are popular tourist and vacation destinations for people all around the world. They are diverse and multicultural, and they offer a wide variety of sought-after experiences. …

Thomas George Thomas

Senior Data Engineer & IBM Certified Data Science Professional. See more: https://thomasgeorgethomas.ml
