Top ten ways to clean your data.
Misspelled words, stubborn trailing spaces, unwanted prefixes, improper cases, and nonprinting characters make a bad first impression. And that is not even a complete list of ways your data can get dirty. Roll up your sleeves. It is time for some major spring-cleaning of your worksheets with Microsoft Excel.
You don’t always have control over the format and type of data that you import from an external data source, such as a database, text file, or a Web page. Before you can analyze the data, you often need to clean it up. Fortunately, Excel has many features to help you get data in the precise format that you want. Sometimes, the task is straightforward and there is a specific feature that does the job for you. For example, you can easily use Spell Checker to clean up misspelled words in columns that contain comments or descriptions. Or, if you want to remove duplicate rows, you can quickly do this by using the Remove Duplicates dialog box.
At other times, you may need to manipulate one or more columns by using a formula to convert the imported values into new values. For example, if you want to remove trailing spaces, you can create a new column to clean the data by using a formula, filling down the new column, converting that new column’s formulas to values, and then removing the original column.
The basic steps for cleaning data are as follows:
1. Import the data from an external data source.
2. Create a backup copy of the original data in a separate workbook.
3. Ensure that the data is in a tabular format of rows and columns with similar data in each column, all columns and rows visible, and no blank rows within the range. For best results, use an Excel table.
4. Do tasks that don’t require column manipulation first, such as spell-checking or using the Find and Replace dialog box.
5. Next, do tasks that do require column manipulation. The general steps for manipulating a column are:
   a. Insert a new column (B) next to the original column (A) that needs cleaning.
   b. Add a formula that will transform the data at the top of the new column (B).
   c. Fill down the formula in the new column (B). In an Excel table, a calculated column is automatically created with values filled down.
   d. Select the new column (B), copy it, and then paste as values into the new column (B).
   e. Remove the original column (A), which converts the new column from B to A.
To periodically clean the same data source, consider recording a macro or writing code to automate the entire process. There are also a number of external add-ins written by third-party vendors, listed in the Third-party providers section, that you can consider using if you don’t have the time or resources to automate the process on your own.
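For readers who decide to write code rather than record a macro, the column-manipulation steps above might be sketched in Python as follows. This is only an illustration outside Excel, not an Excel feature: the `rows` sample data, the `comment` column name, and the `clean_column` helper are all hypothetical, and stripping trailing spaces is just one possible transform.

```python
def clean_column(rows, column, transform):
    """Return new rows with `transform` applied to one column.

    Mirrors the worksheet steps: compute the cleaned values separately,
    then let them take the place of the original column. The input
    rows are left untouched, so they serve as the backup copy.
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)                        # copy, do not modify the original
        new_row[column] = transform(row[column])   # plays the role of the formula
        cleaned.append(new_row)
    return cleaned

# Hypothetical imported data with stray trailing spaces.
rows = [
    {"id": 1, "comment": "late delivery   "},
    {"id": 2, "comment": "ok "},
]

cleaned_rows = clean_column(rows, "comment", str.rstrip)
```

Keeping the original rows unchanged follows the same reasoning as the backup step: if the transform turns out to be wrong, nothing has been lost.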
Shows how to use the Fill command.
Shows how to create an Excel table and add or delete columns or calculated columns.
Shows several ways to automate repetitive tasks by using a macro.
You can use a spell checker not only to find misspelled words, but also to find values that are not used consistently, such as product or company names, by adding those values to a custom dictionary.
Shows how to correct misspelled words on a worksheet.
Explains how to use custom dictionaries.
Duplicate rows are a common problem when you import data. It is a good idea to filter for unique values first to confirm that the results are what you want before you remove duplicate values.
Shows two closely related procedures: how to filter for unique rows and how to remove duplicate rows.
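The "preview first, then remove" advice can be illustrated with a short Python sketch. It assumes each row is a tuple of hashable values and that only exact matches count as duplicates; the `unique_rows` helper and the sample company names are hypothetical.

```python
def unique_rows(rows):
    """Return rows in their original order with exact duplicates dropped."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row)          # values must be hashable
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

# Hypothetical imported rows; the third repeats the first.
rows = [
    ("Contoso", "Seattle"),
    ("Fabrikam", "Boston"),
    ("Contoso", "Seattle"),
]

# Step 1: build the unique view and inspect it without touching `rows`.
preview = unique_rows(rows)
duplicates_found = len(rows) - len(preview)

# Step 2: only after confirming the preview, replace the data.
deduplicated = preview
```

Rows that differ only by case or by stray spaces are not treated as duplicates here, which is one reason to clean text before removing duplicates.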
You may want to remove a common leading string, such as a label followed by a colon and space, or a suffix, such as a parenthetic phrase at the end of the string that is obsolete or unnecessary. You can do this by finding instances of that text and then replacing it with no text or other text.
Shows how to use the Find command and several functions to find text.
Shows how to use the Replace command and several functions to remove text.
Shows how to use the Find and Replace dialog boxes.
Functions such as FIND, SEARCH, REPLACE, SUBSTITUTE, LEFT, MID, RIGHT, and LEN handle various string manipulation tasks, such as finding and replacing a substring within a string, extracting portions of a string, or determining the length of a string.
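As a rough illustration of removing a leading label and an obsolete trailing parenthetic phrase, here is a small Python sketch. The helper names, the `"Name: "` label, and the sample value are assumptions made for the example; the regular expression only removes a parenthetic phrase that sits at the very end of the string.

```python
import re

def strip_label(text, label):
    """Remove a leading label such as 'Name: ' if the text starts with it."""
    if text.startswith(label):
        return text[len(label):]
    return text

def strip_trailing_parenthetical(text):
    """Remove one parenthetic phrase at the end of the string, if any."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", text)

value = "Name: Widget (discontinued)"   # hypothetical imported value
result = strip_trailing_parenthetical(strip_label(value, "Name: "))
```

Both helpers leave the text unchanged when the pattern is absent, so they are safe to apply to a whole column.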
Sometimes text comes in a mixed bag, especially where the case of text is concerned. Using one or more of the three Case functions, you can convert text to lowercase letters (for e-mail addresses, for example), uppercase letters (for product codes), or proper case (for names or book titles).
Shows how to use the three Case functions.
LOWER: Converts all uppercase letters in a text string to lowercase letters.
PROPER: Capitalizes the first letter in a text string and any other letters in text that follow any character other than a letter. Converts all other letters to lowercase letters.
UPPER: Converts text to uppercase letters.
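For comparison, Python's built-in `str.lower`, `str.upper`, and `str.title` behave as close analogues of the three case conversions described above. This is a sketch with hypothetical sample values, and it is not guaranteed to match Excel character for character in every language.

```python
email = "Someone@Example.COM"      # hypothetical e-mail address
product_code = "ab-1234x"          # hypothetical product code
name = "mary o'neil-smith"         # hypothetical name

email_lower = email.lower()        # all letters to lowercase
code_upper = product_code.upper()  # all letters to uppercase

# title() capitalizes every letter that follows a non-letter, so the
# letters after the apostrophe and the hyphen are capitalized too.
name_proper = name.title()
```

The same rule means a value such as "2nd floor" becomes "2Nd Floor", so proper-casing is best reserved for columns that hold names or titles.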
Sometimes text values contain leading, trailing, or multiple embedded space characters (Unicode character set values 32 and 160), or nonprinting characters (Unicode character set values 0 to 31, 127, 129, 141, 143, 144, and 157). These characters can sometimes cause unexpected results when you sort, filter, or search. For example, in the external data source, users may make typographical errors by inadvertently adding extra space characters, or imported text data from external sources may contain nonprinting characters that are embedded in the text. Because these characters are not easily noticed, the unexpected results may be difficult to understand. To remove these unwanted characters, you can use a combination of the TRIM, CLEAN, and SUBSTITUTE functions.
CODE: Returns a numeric code for the first character in a text string.
CLEAN: Removes the first 32 nonprinting characters in the 7-bit ASCII code (values 0 through 31) from text.
TRIM: Removes 7-bit ASCII space characters (value 32) from text, except for single spaces between words.
You can use the SUBSTITUTE function to replace the higher-value Unicode characters (values 127, 129, 141, 143, 144, 157, and 160) with the 7-bit ASCII characters for which the TRIM and CLEAN functions were designed.
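The combined effect of the SUBSTITUTE, CLEAN, and TRIM steps can be approximated in Python as shown below. This is a sketch of the idea rather than a reimplementation of the Excel functions: the `clean_text` helper is hypothetical, and it removes the higher-value nonprinting characters directly instead of first mapping them to 7-bit characters.

```python
import re

# Code points named in the text as nonprinting: 0 to 31 plus these.
NONPRINTING = set(range(32)) | {127, 129, 141, 143, 144, 157}

def clean_text(text):
    # SUBSTITUTE-like step: nonbreaking space (160) becomes a plain space (32).
    text = text.replace(chr(160), " ")
    # CLEAN-like step, extended to the higher-value code points.
    text = "".join(ch for ch in text if ord(ch) not in NONPRINTING)
    # TRIM-like step: collapse runs of spaces, then drop leading/trailing ones.
    return re.sub(r" +", " ", text).strip(" ")

dirty = "\u00a0 Quarterly\x07   report \x7f "   # hypothetical imported value
cleaned = clean_text(dirty)
```

The order matters: converting character 160 first lets the final step treat it like any other space.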
There are two main issues with numbers that may require you to clean the data: the number was inadvertently imported as text, and the negative sign needs to be changed to the standard for your organization.
Shows how to convert numbers that are formatted and stored in cells as text, which can cause problems with calculations or produce confusing sort orders, to number format.
DOLLAR: Converts a number to text format and applies a currency symbol.
TEXT: Converts a value to text in a specific number format.
FIXED: Rounds a number to the specified number of decimals, formats the number in decimal format by using a period and commas, and returns the result as text.
VALUE: Converts a text string that represents a number to a number.
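Both issues, numbers stored as text and nonstandard negative signs, can be illustrated with a short Python sketch. It assumes a period as the decimal separator, commas as thousands separators, and two hypothetical negative conventions (a trailing minus sign and surrounding parentheses); the `to_number` helper is not part of any library.

```python
from decimal import Decimal

def to_number(text):
    """Convert text such as '1,234.50', '156-' or '(1,234.50)' to a Decimal."""
    s = text.strip().replace(",", "")   # drop thousands separators
    negative = False
    if s.startswith("(") and s.endswith(")"):
        negative, s = True, s[1:-1]     # accounting style: (123) means -123
    elif s.endswith("-"):
        negative, s = True, s[:-1]      # trailing minus: 123- means -123
    value = Decimal(s)                  # raises an error if s is not numeric
    return -value if negative else value

amounts = [to_number(t) for t in ["1,234.50", "156-", "(1,234.50)"]]
```

`Decimal` is used instead of `float` so that values such as currency amounts keep their exact decimal digits.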
Because there are so many different date formats, and because these formats may be confused with numbered part codes or other strings that contain slash marks or hyphens, dates and times often need to be converted and reformatted.
Describes how the date system works in Office Excel.
Shows how to convert between different time units.
Shows how to convert dates that are formatted and stored in cells as text, which can cause problems with calculations or produce confusing sort orders, to date format.
DATE: Returns the sequential serial number that represents a particular date. If the cell format was General before the function was entered, the result is formatted as a date.
DATEVALUE: Converts a date represented by text to a serial number.
TIME: Returns the decimal number for a particular time. If the cell format was General before the function was entered, the result is formatted as a date.
TIMEVALUE: Returns the decimal number of the time represented by a text string. The decimal number is a value ranging from 0 (zero) to approximately 0.99998843, representing the times from 0:00:00 (12:00:00 AM) to 23:59:59 (11:59:59 PM).
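The idea of dates as serial numbers and times as fractions of a day can be sketched in Python. The helpers below are hypothetical, and the day-zero value is an assumption chosen so that results line up with the 1900 date system for dates on or after 1 March 1900; earlier dates and the 1904 date system are not handled.

```python
from datetime import date, datetime, time

# Assumed day zero (see the note above about its valid range).
EPOCH = date(1899, 12, 30)

def date_serial(d):
    """Whole days since EPOCH, analogous to a date serial number."""
    return (d - EPOCH).days

def datevalue(text, fmt="%Y-%m-%d"):
    """Parse a text date in an assumed format and return its serial number."""
    return date_serial(datetime.strptime(text, fmt).date())

def time_fraction(t):
    """Fraction of a 24-hour day, analogous to a time decimal number."""
    return (t.hour * 3600 + t.minute * 60 + t.second) / 86400

serial = datevalue("2008-01-01")        # 39448
noon = time_fraction(time(12, 0, 0))    # 0.5
```

Passing the expected format explicitly, for example `datevalue("01/02/2008", "%m/%d/%Y")`, avoids guessing whether a value such as 01/02 means January 2 or February 1.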
A common task after importing data from an external data source is to either merge two or more columns into one, or split one column into two or more columns. For example, you may want to split a column that contains a full name into a first and last name. Or, you may want to split a column that contains an address field into separate street, city, region, and postal code columns. The reverse may also be true. You may want to merge a First and Last Name column into a Full Name column, or combine separate address columns into one column. Additional common values that may require merging into one column or splitting into multiple columns include product codes, file paths, and Internet Protocol (IP) addresses.
Shows typical examples of combining values from two or more columns.
Shows how to use the Convert Text to Columns Wizard to split columns based on various common delimiters.
Shows how to use the LEFT, MID, RIGHT, SEARCH, and LEN functions to split a name column into two or more columns.
Shows how to use the CONCATENATE function, & (ampersand) operator, and Convert Text to Columns Wizard.
Shows how to use the Merge Cells, Merge Across, and Merge and Center commands.
CONCATENATE: Joins two or more text strings into one text string.
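Splitting and merging columns can be illustrated with a few lines of Python. The helpers and sample values are hypothetical, and the name split assumes the first space separates the first name from everything else, which will not suit every naming convention.

```python
def split_full_name(full_name):
    """Split on the first space into (first, rest of name)."""
    first, _, last = full_name.strip().partition(" ")
    return first, last.strip()

def merge_name(first, last):
    """Join two name parts with a single space."""
    return f"{first} {last}".strip()

first, last = split_full_name("Nancy Davolio")   # hypothetical full name
full = merge_name(first, last)

# Delimited values such as IP addresses split on their delimiter.
ip_parts = "192.168.0.1".split(".")
```

A value with a middle name, such as "Mary Ann Smith", ends up as ("Mary", "Ann Smith") under this rule, so it is worth checking a sample of the results before discarding the original column.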
Most of the analysis and formatting features in Office Excel assume that the data exists in a single, flat two-dimensional table. Sometimes you may want to make the rows become columns, and the columns become rows. At other times, data is not even structured in a tabular format, and you need a way to transform the data from a nontabular to a tabular format.
TRANSPOSE: Returns a vertical range of cells as a horizontal range, or vice versa.
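Swapping rows and columns has a compact Python counterpart, shown here as a sketch with a hypothetical table. It assumes every row has the same number of cells; `zip` silently truncates to the shortest row otherwise.

```python
# Hypothetical table: one header row followed by two data rows.
table = [
    ["Region", "Q1", "Q2"],
    ["North", 10, 12],
    ["South", 7, 9],
]

# zip(*table) pairs up the first cells, the second cells, and so on,
# so each original column becomes a row.
transposed = [list(column) for column in zip(*table)]
```

Transposing twice returns the original table, which makes a convenient sanity check.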
Occasionally, database administrators use Office Excel to find and correct matching errors when two or more tables are joined. This might involve reconciling two tables from different worksheets, for example, to see all records in both tables or to compare tables and find rows that don’t match.
Shows common ways to look up data by using the lookup functions.
LOOKUP: Returns a value either from a one-row or one-column range or from an array. The LOOKUP function has two syntax forms: the vector form and the array form.
HLOOKUP: Searches for a value in the top row of a table or an array of values, and then returns a value in the same column from a row you specify in the table or array.
VLOOKUP: Searches for a value in the first column of a table array and returns a value in the same row from another column in the table array.
INDEX: Returns a value or the reference to a value from within a table or range. There are two forms of the INDEX function: the array form and the reference form.
MATCH: Returns the relative position of an item in an array that matches a specified value in a specified order. Use MATCH instead of one of the LOOKUP functions when you need the position of an item in a range instead of the item itself.
OFFSET: Returns a reference to a range that is a specified number of rows and columns from a cell or range of cells. The reference that is returned can be a single cell or a range of cells. You can specify the number of rows and the number of columns to be returned.
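Reconciling two tables, that is, seeing all records and finding the rows that don't match, can be sketched in Python with a dictionary acting as an exact-match lookup. The `orders` and `products` data and all codes in them are hypothetical; `None` stands in for a failed lookup.

```python
# Hypothetical tables: orders reference products by code.
orders = [("A100", 3), ("B200", 1), ("C300", 5)]
products = {"A100": "Widget", "B200": "Gadget", "D400": "Gizmo"}

# Exact-match lookup of each order's product name; None marks a miss.
joined = [(code, qty, products.get(code)) for code, qty in orders]

# Compare the key sets to find rows that do not match in either direction.
order_codes = {code for code, _ in orders}
missing_in_products = sorted(order_codes - set(products))
unused_products = sorted(set(products) - order_codes)
```

Listing the mismatches in both directions shows whether the problem lies in the orders (an unknown code) or in the product list (an entry nothing refers to).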
The following is a partial list of third-party providers that have products that are used to clean data in a variety of ways.
Note: Microsoft does not provide support for third-party products.