Introduce several data types for machine learning exploratory data analysis

The data type is an important concept in statistics. We need to have a correct understanding of it in order to use the right data types to reach conclusions. This article will introduce several types of data for machine learning exploratory data analysis in order to properly grasp and use data.

A good understanding of the data structure is very important for exploratory analysis in machine learning. For different types of data we need different statistical measures to perform analysis tests. At the same time, we also need to choose the appropriate visualization based on the type of data to help us better understand the data. The last data type also provides an effective way to classify variables.

Classification data

The classification data represents the attributes of the object. Most of the population's gender, language, and nationality are classified data. The categorical data can also usually be represented numerically (eg 1 for women and 0 for men), but it should be noted that this value is not a mathematical indicator but merely a categorical marker.

Class data

Qualitative variables are used to mark the characteristics of different variables, but do not require quantitative values, they are just labels. It is important to note that the data is unordered. Changes to the order of variables do not change the essential characteristics of the data.

The figure above shows a sample of typical categorical data, describing the individual's gender and language attributes. A special plot is a binary branch with only two attributes.

Sequencing data

Sequencing data represents discrete but ordered variable units. It is a very typed, but well-ordered data organization. The following educational background data well describes the characteristics of the sequencing data.

The four options in the figure above represent different degrees of education, but they cannot quantify the difference between primary education and high school and the difference between high school and university. The lack of sequencing data quantifies the differences between features so that it can only be used to evaluate a range of non-numerical features such as emotion and user satisfaction.

Numerical data

Discrete data

Discrete data refers to discrete values ​​whose values ​​are not continuous. Data can only be taken at specific points. Such data cannot be measured quantitatively but can be statistically measured and the information it contains can be expressed in a categorical manner. Tossing a coin is the most famous example. We cannot predict the positive and negative of the next coin, but we can use statistical data to estimate the distribution of probability.

When dealing with discrete data, we need to think deeply about two issues: whether the data can count statistics and whether it can be divided into smaller parts. If the conclusion that this relevant data can be measured and not counted, then it means that we need to deal with continuous data types.

Continuous data

The continuous data type represents a continuous measure of the object's measurability. Although it can't be counted, it can use a certain scale for continuous measurements. For example, the person's height and age are consecutive values. Usually people only use real numbers or they are expressed.

Distance data

The distance variable is used to describe the description method of the object's equidistance properties. When we use a distance variable we can clearly know the order and difference between the values ​​and measure the difference. The description of temperature is a typical example of distance data.

However, the problem with the distance variable is that it does not have an absolute reference zero. For the temperature in the above figure, 0 degrees does not mean that there is no temperature. For fixed-range variables, we can perform addition and subtraction but we cannot perform multiplication and division or proportional calculations. Because there is no absolute zero, descriptive and inferential statistical methods cannot be applied to distance data.

Ratio data

The fixed ratio data and the fixed distance data are all ordered data arrangements. However, the fixed ratio data has an absolute zero value. All described variables are zero-based references, including weight, height, and length.

Why is the data type so important?

Since different statistical methods are suitable for different data types, the type of data is very important for statistical and machine learning analysis. Just imagine that if you use continuous data analysis methods to study categorical data, then there will be erroneous conclusions. An understanding of the data types will help us choose the right method and statistical model to explore and analyze the data. What kind of statistical model should we choose for different data types?

For the categorized data, it mainly needs to pay attention to three factors: frequency, proportion/percentage, and visualization method. Use frequency to measure the number of occurrences of something at a certain time or in a data set. At the same time, it can be used to count and separate the percentage from the data. Pie charts and histograms are the best way to present this data.

For the sequence data, in addition to indicators such as percentage and frequency, statistics such as percentiles and medians can be used to describe the data.

For continuous data, more abundant means can be used for processing. In addition to the mean and variance of common statistical means, there are peak-to-peak, range, and other indicators. In order to represent the error and discrepancies of the data, box maps and histograms with error bars are an intuitive way of presentation. Boxes can be used to see the degree of data concentration and degree of error, while histograms can provide the overall shape, median value, distribution, and trend of the data.

In this article we see that in addition to continuous and discrete numerical types, statistics also include categories such as sequencing data, classification data, distance data, and ratio data. There are different methods of analysis and visualization for different data types. Understanding the data is the primary condition for starting work when dealing with data. It not only helps us to choose the right tools and methods, but also helps us to use correct thinking. Exploring and analyzing data makes it easier to draw correct and valid conclusions.

Bookshelf Speaker

Comcn Electronics Limited , http://www.comencnspeaker.com