One day a client asked me: "Look, we have a set of users and they generate some data, but we are not sure what questions to ask of it. You have to analyze the data, but I am not going to tell you what to analyze."
This is not an uncommon request; we often hear, "Here is some data, do some magic." Is there any science to this, or is it just that some people are good with data and can find interesting things in it?
This process does have a name: Exploratory Data Analysis, also called EDA.
EDA (Getting a better understanding of data)
Exploratory data analysis is the first and foremost step in analyzing any kind of data. EDA is an approach that seeks to uncover the most important, and often hidden, patterns in a data set.
While exploring the data we form hypotheses, which can then be proven or disproven using hypothesis testing techniques. Statisticians call this taking a bird's-eye view of the data and trying to make some sense of it.
To solve business problems we require clean historical data; the better the data, the more insight you can get. Better data means a clean, normalized, and standardized data set. Data comes from many different sources: relational systems such as Oracle, MySQL, and Postgres; the web; IoT devices; marketing; sales; and the list goes on.
There are two main sources of data: private and public data sources.
- Private Data: Private data is more sensitive and is not available in the public domain. Banking, retail, telecom, and media are some of the sectors that rely on such data to make decisions; organizations in these sectors leverage data analytics to become customer-centric.
- Public Data: Governments and public agencies collect large amounts of data for research purposes. Such data does not require any special permission for access; it is available on the internet and various other platforms, and anyone can use these datasets for analysis.
Data Cleaning Approach
There are various types of data quality issues that come with data. Data cleaning is the most time-consuming job in the analytics process.
Fix Rows and Columns
Checklist for Fixing Rows
- Delete summary rows: for example, rows that carry totals and subtotals.
- Delete incorrect rows: for example, unnecessary header rows and footer rows.
- Delete extra rows: for example, column-number indicator rows, blank rows, page numbers, etc. (see the sketch after this list).
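Here is a minimal pandas sketch of these row fixes. The file name `sales.csv` and the "Total"/"Subtotal" labels in the first column are assumptions for illustration; adapt them to your data.

```python
import pandas as pd

# Hypothetical raw export with summary, header, and blank rows mixed in.
df = pd.read_csv("sales.csv")

# Delete blank rows (every value missing).
df = df.dropna(how="all")

# Delete summary rows, assuming they are flagged by a literal
# "Total" or "Subtotal" label in the first column.
df = df[~df.iloc[:, 0].isin(["Total", "Subtotal"])]

# Delete repeated header rows that slipped into the data, i.e. rows
# whose first cell just repeats the first column's name.
df = df[df.iloc[:, 0] != df.columns[0]]
```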
Checklist for Fixing Columns
- Merge columns to create unique identifiers if needed: for example, merge State and City into a full address.
- Split columns to get more data: for example, split an address into State and City to analyze each separately.
- Add column names: add column names if they are missing.
- Rename columns: expand abbreviations and decode encoded column names.
- Delete columns: delete unnecessary columns.
- Align misaligned columns: the data set may have shifted columns (see the sketch after this list).
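A minimal pandas sketch of the column fixes, using a small made-up address table (all column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Mumbai", "Pune"],
    "state": ["Maharashtra", "Maharashtra"],
    "addr": ["12 MG Road", "4 FC Road"],
    "unused": [0, 1],
})

# Merge columns into a single identifier-like field.
df["full_address"] = df["addr"] + ", " + df["city"] + ", " + df["state"]

# Split a column to analyze parts separately (street vs. city/state).
df[["street", "region"]] = df["full_address"].str.split(", ", n=1, expand=True)

# Rename an abbreviated column and delete an unnecessary one.
df = df.rename(columns={"addr": "address_line"})
df = df.drop(columns=["unused"])
```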
Fix Missing Values
- Set values as missing: identify values that indicate missing data yet are not recognized as such by the software, e.g. treat blank strings, "NA", "XX", "999", etc. as missing.
- Delete rows and columns: you can delete a row if it has a very large number of missing values, as this would not impact the analysis. The same goes for columns: we can delete a column if most of its values are missing.
- Fill partial missing values using business judgment: for example, a missing time zone or century, where the value is identifiable from context.
- Try to get information from reliable external sources as much as possible; if you can't, it is better to keep missing values as such rather than filling them with guesses that distort the existing rows/columns (see the sketch after this list).
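The sketch below shows these ideas in pandas. The file name `customers.csv`, the 50% thresholds, and the `timezone` column are assumptions for illustration, not fixed rules:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file

# Set sentinel strings as missing so the software recognizes them.
df = df.replace(["", "NA", "XX", "999"], np.nan)

# Delete columns that are mostly missing (here: keep a column only
# if at least half of its values are present).
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))

# Delete rows where too many fields are missing.
df = df.dropna(thresh=int(0.5 * df.shape[1]))

# Fill a partial missing value using business judgment, e.g. an
# assumed default time zone for an India-only customer base.
df["timezone"] = df["timezone"].fillna("IST")
```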
Standardizing Values
- Standardize units: ensure all observations under a variable have a common and consistent unit, e.g. convert lbs to kg, miles/hr to km/hr, etc.
- Scale values if required: make sure the observations under a variable share a common scale.
- Standardize precision for better presentation of the data, e.g. 4.5312341 kg to 4.53 kg.
- Remove outliers: remove high and low values that would disproportionately affect the results of your analysis.
- Remove extra characters such as common prefixes/suffixes and leading/trailing/repeated spaces; these are irrelevant to the analysis.
- Standardize case: string variables may come in various cases, e.g. UPPERCASE, lowercase, Title Case, Sentence case, etc.
- Standardize format: e.g. 23/10/16 to 2016/10/23, "Modi, Narendra" to "Narendra Modi", etc. (see the sketch after this list).
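A short pandas sketch of unit, precision, case, and format standardization, run on made-up values matching the examples above:

```python
import pandas as pd

df = pd.DataFrame({
    "weight": ["4.5312341 lbs", "10 lbs"],
    "name": ["modi, narendra", "SHARMA, PRIYA"],
    "date": ["23/10/16", "20/10/16"],
})

# Standardize units: strip the unit label and convert lbs to kg.
df["weight_kg"] = (
    df["weight"].str.replace(" lbs", "", regex=False).astype(float) * 0.453592
)

# Standardize precision for presentation.
df["weight_kg"] = df["weight_kg"].round(2)

# Standardize case, then reorder "Last, First" to "First Last".
df["name"] = df["name"].str.title().str.split(", ").str[::-1].str.join(" ")

# Standardize date format: day/month/two-digit-year to YYYY/MM/DD.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%y").dt.strftime("%Y/%m/%d")
```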
Fix Invalid Values
- Encode Unicode correctly: if the data is being read as junk characters, try changing the encoding, e.g. CP1252 instead of UTF-8.
- Convert incorrect data types: correct data types for ease of analysis. For example, if numeric values are stored as strings, it is not possible to compute metrics such as mean, median, etc. Some common data type corrections are: string to number ("12,300" to 12300), string to date ("2013-Aug" to "2013/08"), number to string (PIN code 231304 to "231304"), etc.
- Correct values that go beyond range: if some values are beyond the logical range, e.g. a temperature below -273 °C (0 K), you would need to correct them. A closer look will tell you whether there is scope for correction or whether the value must be removed.
- Correct values not in the allowed list: remove values that don't belong to the allowed list. For example, in a data set containing blood groups of individuals, the strings "E" or "F" are invalid values and can be removed.
- Correct wrong structure: values that don't follow a defined structure can be removed. For example, in a data set containing PIN codes of Indian cities, a PIN code of 12 digits would be an invalid value and can be removed; the same goes for a 12-digit phone number (see the sketch after this list).
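Here is a hedged pandas sketch of these invalid-value fixes on a toy frame; the column names and the allowed blood-group list are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "revenue": ["12,300", "4,500"],
    "temp_c": [25.0, -300.0],
    "blood_group": ["A+", "E"],
    "pincode": ["231304", "123456789012"],
})

# String to number: remove the thousands separator, then cast.
df["revenue"] = df["revenue"].str.replace(",", "", regex=False).astype(int)

# Values beyond range: nothing is colder than absolute zero (-273.15 °C),
# so mark such readings as missing rather than trusting them.
df.loc[df["temp_c"] < -273.15, "temp_c"] = np.nan

# Values not in the allowed list: keep only valid blood groups.
valid_groups = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}
df = df[df["blood_group"].isin(valid_groups)]

# Wrong structure: Indian PIN codes are exactly 6 digits.
df = df[df["pincode"].str.fullmatch(r"\d{6}")]
```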
After fixing the missing values, standardizing the existing values, and fixing the invalid values, we come to the last stage of data cleaning.
Filtering Data
- Deduplicate data: remove identical rows; remove rows where some key columns are identical.
- Filter rows: filter by segment and by date period to keep only the rows relevant to the analysis.
- Filter columns: pick the columns relevant to the analysis.
- Aggregate data: group by the required keys and aggregate the rest (see the sketch after this list).
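A minimal pandas sketch of the filtering steps, assuming a hypothetical orders.csv with order_id, order_date, segment, and amount columns:

```python
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date"])  # hypothetical file

# Deduplicate: drop fully identical rows, then rows sharing a key column.
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["order_id"])

# Filter rows: keep one segment and one date period.
df = df[(df["segment"] == "Retail") & (df["order_date"] >= "2016-01-01")]

# Filter columns: pick only the columns relevant to the analysis.
df = df[["order_id", "order_date", "segment", "amount"]]

# Aggregate: group by the required key and aggregate the rest.
monthly_revenue = df.groupby(df["order_date"].dt.to_period("M"))["amount"].sum()
```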
This brings us to the end of the data cleaning process. After applying these steps to your data set, the data is ready for analysis.
In the next section, we will look at a very important topic, a set of techniques for analyzing the data:
- Univariate Analysis
- Bivariate Analysis
- Derive Metrics
OK, that’s it, we are done for now. If you have any questions or suggestions, please feel free to comment. I’ll come up with more machine learning topics soon.