Updated: Jan 6, 2022
Database Structure Affects Quality of Data
Modern Community Development Data Management Systems (CDDMS) are typically highly relational. This means that instead of data such as entities/names and addresses being typed in each time an application is entered and stored redundantly on every record, they are stored a single time in a separate table and referenced by the application record. This allows a user to search for an existing address, parcel, or location to attach an application record to, and to choose from a list of existing names to set as an applicant, owner, contractor, or other role. Of course, this is a bit of a simplification of the process, so let's take a closer look to fully understand the concept.
Names and Addresses are two of the most common record types to contain bad data, since they don't typically have truly unique identifiers in the real world. Hundreds of people in a town could share the same first and last name, and a county could even have two sites with the same address, such as '100 Main St'. Some systems use a Social Security number or driver's license number to track names, but asking for that kind of PII is not typical of Community Development processes.
As someone who has managed hundreds of software conversions for municipalities, I have seen and converted the large majority of the vendor systems out there, and as a result I have seen how each manages names and addresses. A couple do some things a little differently, but nearly all of them use one of these three data schemes.
Non-Relational - Flattened
This is an older and less common approach, typically seen in home-grown systems such as custom Access databases and spreadsheets used to manage data.
It involves all pertinent data being stored directly on the application record (think of an application record as a permit, planning/zoning app, or code enforcement record).
Application Table Example

PermitNo   ApplicantName   SiteAddress
B-1001     Jane Dow        123 N Main St
B-1002     Jan Dow         123 Main St
In this type of system, it is easy to end up with poor, inconsistent data because it typically requires the user to type in an applicant name and address each time an application is entered, which can lead to many accidental variations. Some systems will perform a lookup of data already entered to assist, but many users simply bypass this and commit what they have typed. This results in many instances of the same name, like 'Jan Dow', 'Jane Dow', 'Jane Do', etc. The same goes for the address, where over time you will be left with variations such as '123 Main', '123 N Main', '123 Main St', and '123 N Main Street'. You can see that the number of possibilities for just one address is seemingly endless.
Relational - One-to-One
Relational databases separate this data into dedicated tables, so instead of typing in names and addresses from scratch, the user can search for ones already stored in the database. This results in fewer redundancies and generally higher accuracy. The first example of this, and the most common among systems I've seen in the past, is the one-to-one model, which links an application record to exactly one name of each type: one permit to one applicant, one owner, or one contractor.
The application record stores a reference (see ApplNameID) that points to the ID of the record in the related table; the software uses this to look up and display the name when you access that record.
Application Table Example

ID   PermitNo   ApplNameID
1    B-1001     55

Name Table Example

ID   Name       Address
55   Jane Dow   123 N Main St
In this example, the ID in the name table is referenced by the ApplNameID in the application table. This is what the software uses to match (join) the two tables in order to pull up the correct name for that application record. Note that the fields are examples only and may be broken out differently in different systems, such as separate First Name and Last Name fields, or additional fields for address parts. In this design, the system allows only one name for the applicant, or for any other name type it supports, such as contractor or owner.
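The join described above can be sketched with a few lines of SQL. This is a minimal illustration using SQLite; the table and column names (Application, Name, ApplNameID) mirror the example tables here and are not any specific vendor's schema.

```python
import sqlite3

# In-memory sketch of the one-to-one scheme; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Name (ID INTEGER PRIMARY KEY, FullName TEXT, Address TEXT);
CREATE TABLE Application (
    ID INTEGER PRIMARY KEY,
    PermitNo TEXT,
    ApplNameID INTEGER REFERENCES Name(ID)  -- a reference, not a copy of the name
);
INSERT INTO Name VALUES (55, 'Jane Dow', '123 N Main St');
INSERT INTO Application VALUES (1, 'B-1001', 55);
""")

# The join the software performs to display the applicant for a permit:
row = conn.execute("""
    SELECT a.PermitNo, n.FullName, n.Address
    FROM Application a
    JOIN Name n ON n.ID = a.ApplNameID
""").fetchone()
print(row)  # ('B-1001', 'Jane Dow', '123 N Main St')
```

Because the name lives in one place, correcting 'Jane Dow' once corrects it for every application that references ID 55.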
Relational - One-To-Many
The method deployed more commonly in modern software systems, instead of pre-defining which name types are available or required, allows a theoretically unlimited number of user-defined types. It has all of the benefits described above but adds much more flexibility.
In this example, there are oftentimes three tables involved: an Application table, a Link table, and a Name table.
In some systems, there are other tables as well. We won't go in-depth on this, but some vendors store only a name on the name table and keep addresses, phones, emails, etc., in their own tables, allowing a single name to have an unlimited amount of related contact data.
For this example, we'll assume a single name table.
Application Table Example

ID   PermitNo
1    B-1001

Link Table Example

ID   ApplID   NameID   NameType     Primary
1    1        55       Applicant    1
2    1        72       Contractor   0

Note: Primary column would typically use a Boolean (0 or 1) value.

Name Table Example

ID   Name            Address
55   Jane Dow        123 N Main St
72   Acme Builders   500 Oak Ave
As you can see in this example, the Link table sits in the middle and acts as a hub linking application records to name records. As mentioned, this allows an unlimited number of user-defined name types to be associated with a record.
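The link-table scheme can be sketched the same way. Again this is a minimal SQLite illustration; the table names, the NameType values, and the quoted "Primary" column are assumptions drawn from the example tables above, not a real vendor schema.

```python
import sqlite3

# In-memory sketch of the one-to-many (link table) scheme; names illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Application (ID INTEGER PRIMARY KEY, PermitNo TEXT);
CREATE TABLE Name (ID INTEGER PRIMARY KEY, FullName TEXT);
CREATE TABLE Link (
    ID INTEGER PRIMARY KEY,
    ApplID INTEGER REFERENCES Application(ID),
    NameID INTEGER REFERENCES Name(ID),
    NameType TEXT,        -- user-defined: Applicant, Owner, Contractor, ...
    "Primary" INTEGER     -- Boolean 0/1 (quoted: PRIMARY is an SQL keyword)
);
INSERT INTO Application VALUES (1, 'B-1001');
INSERT INTO Name VALUES (55, 'Jane Dow'), (72, 'Acme Builders');
INSERT INTO Link VALUES (1, 1, 55, 'Applicant', 1),
                        (2, 1, 72, 'Contractor', 0);
""")

# All names of any type attached to one permit, via the link table hub:
rows = conn.execute("""
    SELECT l.NameType, n.FullName
    FROM Link l JOIN Name n ON n.ID = l.NameID
    WHERE l.ApplID = 1
    ORDER BY l.ID
""").fetchall()
print(rows)  # [('Applicant', 'Jane Dow'), ('Contractor', 'Acme Builders')]
```

Adding a fourth or fifth name type to a permit requires only another Link row, which is where the flexibility comes from.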
Where Data Goes Bad
Now that we understand the basics of database structure and how it can contribute to quality of data, let's look at ways data within these structures goes bad.
The majority of inaccurate or inconsistent data stems from user input. This is not to say it is necessarily the users' fault; in most cases it's the fault of the system and database design. Software should lead users toward choosing or inputting correct data by guiding them with a combination of data types, lookup tables, and data input validations.
Data types restrict the type of data required for a field. The common types are Text, Numerical, Dates, and Boolean (Yes/No) values.
Lookup tables are tables with preset values for a user to choose from. These can be attributes that use preset lists such as Permit Types, Use Types, Fee Types, etc., or they can be user-input lists such as valid Addresses, Names, Contractors, etc.
Data input validations are constraints on the values that users are able to input. Examples might be that dates in a certain field must fall within a specific range, or that the value of a numerical field must fall between a minimum and a maximum. Another example is a predefined list of valid address ranges that the software evaluates when someone is creating a new address.
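The three validation examples above can be sketched as simple checks. The field names and the ranges here are illustrative assumptions, not any system's actual rules.

```python
from datetime import date

# Illustrative input validations; field names and ranges are assumptions.
def validate_issue_date(d, lo=date(2000, 1, 1), hi=date(2030, 12, 31)):
    """Date must fall within a specific range."""
    return lo <= d <= hi

def validate_valuation(v, lo=0, hi=10_000_000):
    """Numerical value must fall between a minimum and a maximum."""
    return lo <= v <= hi

# A predefined list of valid street-number ranges, checked when creating
# a new address (hypothetical data for one street):
VALID_RANGES = {"N MAIN ST": (100, 199)}

def validate_address(number, street):
    rng = VALID_RANGES.get(street.upper())
    return rng is not None and rng[0] <= number <= rng[1]

print(validate_issue_date(date(2021, 6, 1)))  # True
print(validate_address(123, "N Main St"))     # True
print(validate_address(250, "N Main St"))     # False: outside the valid range
```

A user who types an out-of-range street number is stopped at entry time, before a bad address record is ever created.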
Without these types of constraints in a system, users are likely to make small input errors over time that lead to inaccurate data and, in some cases, create many variations of a single record such as an address or name.
When you migrate from one software system to another and take the historical data across, it is common for your vendor to perform a data conversion to move that data from the legacy system to the new one.
Some vendors require you to export data from your current vendor system, 'massage' it a bit to fit a specific format, and then import it into the new software using a canned (non-customized) import process. This is typically the quickest and easiest approach, but it is susceptible to the most data loss, as you may have to leave behind data that is not included in the import format. Also, since it is not custom, if you have fields where data was not stored consistently, it provides no rules to conditionally route or operate on data as it is converted in order to clean it up. This is usually best for converting from small or simple databases, but it is not ideal if the data in those databases needs adjustments on its way across.
The other approach is a custom conversion. In this case, a vendor will either build a conversion from the ground up for the system you are leaving or start with a base conversion if they have converted that system before. Some vendors will even help extract data from the source system and work to identify mappings for custom fields that you may have configured differently from the previous customer they converted from that system. Data review is a shared responsibility, but it may be performed in part by the vendor, who may even work with you to some extent to clean up data using custom logic in the conversion to transform it as it comes across.
What goes wrong
During a conversion, data is by default brought across verbatim. This recalls the old adage of "garbage in, garbage out": all of the issues your legacy system allowed, because it lacked the constraints of the new one, will appear in the new one.
In many cases, migrations to modern systems let you visualize data in ways not previously possible. While this can be great in some ways, it can also expose issues with legacy data.
Imagine moving from a system where you worked from a list view of permits and the address was simply an attribute on the record. Now move those records into a system where you can visualize them by address. With a number of variations of the same address, you would not be able to call up a single address and truly view all of the records for it since they would be spread across all of the variations of it.
How Conversions Can Help Fix It
Good conversion logic will start with sound lookup/update/create logic. This logic will define rules to evaluate each record from the source system, then establish some guidelines in what it would consider a match in the new database in order to determine whether or not to create it in the system or use it to update or append data on an existing record.
For instance, if two permits were converted and both listed "123 Main St" as their site address, the conversion would process the first permit, insert the permit record, then look up the address "123 Main St" in the new system. Since this is the first record being converted, the address does not exist yet, so the conversion creates it. When the next permit is processed, it performs the same lookup; now that the address exists, instead of creating it again, it simply attaches the second permit to it.
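The lookup/create step just described can be sketched in a few lines. This is a toy model where a dict stands in for the new system's address table; real conversion logic would query the destination database instead.

```python
# Minimal sketch of lookup-or-create logic; a dict stands in for the
# destination system's address table.
addresses = {}   # normalized address string -> new address record ID
next_id = 1

def lookup_or_create(address):
    """Return the ID of an existing matching address, creating it only once."""
    global next_id
    key = " ".join(address.upper().split())   # trivial normalization
    if key not in addresses:                  # first permit: address not found,
        addresses[key] = next_id              # so create the address record
        next_id += 1
    return addresses[key]                     # later permits attach to it

permit_1_addr = lookup_or_create("123 Main St")
permit_2_addr = lookup_or_create("123  main st")   # same address, messier input
print(permit_1_addr == permit_2_addr)  # True: both permits share one record
```

The important part is the rule defining a "match" (here, case- and whitespace-insensitive equality); everything else is just insert-if-missing.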
The second thing a good conversion will do is compare some record types using sound logic to determine whether small variations in the data can still be considered a match. It usually does this with an address parser, which takes an address string like "123 E Main St" and breaks it into individual address parts using a series of rules. This example would place "123" in the street number field, "E" in the street directional field, and so on. A great parser would evaluate "123 East Main St" and "123 E Main Street", recognize those common variations, and convert each permit to the same address of "123 E Main St", reducing the number of variations of the same address in the new system. This is one of the most impactful ways a conversion can clean up address data.
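A toy parser in the spirit described above might look like this. It handles only the variations used in the example; production parsers (and the rule tables behind them) are far more extensive.

```python
# A toy address parser: split a string into parts and normalize common
# variations. The abbreviation tables are small illustrative samples.
DIRS = {"E": "E", "EAST": "E", "W": "W", "WEST": "W",
        "N": "N", "NORTH": "N", "S": "S", "SOUTH": "S"}
TYPES = {"ST": "St", "STREET": "St", "AVE": "Ave", "AVENUE": "Ave"}

def parse(address):
    tokens = address.upper().split()
    parts = {"number": tokens[0], "dir": "", "name": "", "type": ""}
    rest = tokens[1:]
    if rest and rest[0] in DIRS:        # leading directional: E / EAST -> E
        parts["dir"] = DIRS[rest[0]]
        rest = rest[1:]
    if rest and rest[-1] in TYPES:      # trailing street type: STREET -> St
        parts["type"] = TYPES[rest[-1]]
        rest = rest[:-1]
    parts["name"] = " ".join(t.title() for t in rest)
    return parts

def standardize(address):
    p = parse(address)
    return " ".join(x for x in (p["number"], p["dir"], p["name"], p["type"]) if x)

# Common variations collapse to one standardized address:
print(standardize("123 East Main St"))   # 123 E Main St
print(standardize("123 E Main Street"))  # 123 E Main St
```

Run every incoming address through standardize() before the lookup step, and "123 East Main St" and "123 E Main Street" attach to the same address record instead of creating two.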
Another technique along the same thread is a bit riskier but can return good results if used sparingly and in the right cases: the Levenshtein distance method. In this technique, the data is sent through a function where a "distance" parameter represents the maximum number of character differences allowed between two strings while still considering them a match. For instance, a distance of 1 would allow "Pine" and "Pone" streets to be considered the same street. This can be especially helpful when a master list of valid, clean addresses is available: the valid records are imported/converted first, and the Levenshtein method is then used to compare converted records against them to determine whether to attach or create a new record. The risk, of course, is over-merging, so this must be done with caution and the distance kept as low as possible while still achieving the desired results.
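Levenshtein distance itself is a standard dynamic-programming algorithm. The sketch below implements it and applies it to matching a converted street name against a hypothetical master street list, with the distance threshold exposed as the parameter described above.

```python
# Standard dynamic-programming Levenshtein distance, row by row.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Hypothetical master list of valid, clean street names, converted first:
MASTER_STREETS = ["PINE", "OAK", "MAPLE"]

def match_street(name, max_distance=1):
    """Return the master street within max_distance, else None (create new)."""
    best = min(MASTER_STREETS, key=lambda m: levenshtein(name.upper(), m))
    return best if levenshtein(name.upper(), best) <= max_distance else None

print(match_street("Pone"))   # PINE: distance 1, treated as the same street
print(match_street("Birch"))  # None: too far from any master street
```

Note how easily max_distance=2 or 3 would start merging genuinely different streets; keeping it at 1 is what makes the technique safe enough to use.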
Well-written and mature conversion logic can clean up upwards of 90% of the issues in a database. Proper expectations are important with data conversions: sometimes it can feel like a conversion made data worse, when all that happened was that the new system exposed issues your prior system was hiding.
Imports and Integrations
The third way data can go bad is through imports and integrations with third party software.
In many cases these interfaces are necessary and do more good than harm. In some cases, however, they can create a pipeline of inconsistent or inaccurate data directly into your system. This happens when the system inserting the data lacks the validation constraints your CDDMS has.
If citizen service requests are submitted via an online form which allows free-form text in certain fields and that data is inserted directly into your database, it defeats the purpose of the software's data validation logic.
It's best to ask any interface vendor you are implementing how their product validates data, so you can see what kind of impact it may have on the quality of your data.
Other problems with imports and integrations that can lead to bad data are:
Lack of hard keys between source and destination records
Incorrectly or poorly mapped fields
Poor parsing logic
Overwriting and reverting data already once corrected/fixed
Why Bad Data is A Problem
Inconsistent and inaccurate data can make it more difficult and time-consuming to find all pertinent related data within your system. This can lead to added costs and resources over time in trying to work around this issue.
It can also lead to incomplete or inaccurate information in lookups and reports. If data is spread across multiple address variations, or there are many redundant names in your system, it is impossible to find all case history related to just one of them.
Not only can this be problematic internally when trying to search or produce a report, it can be an issue when attempting to issue online permits. Users online who begin their application process by searching for a valid address are not going to know which variation to choose. Resolving these duplicates is essential.
How to Prevent and Fix Bad Data
Choose software that validates against master tables
Choose software that maintains both a string version and individual address parts for Addresses. This helps ensure that the software has some level of parsing logic built-in to standardize addresses.
Consider using GIS as an enterprise master address source. This can be essential when addresses are maintained in multiple systems from different vendors. Each vendor should be able to consume this data via a REST endpoint import to keep their systems updated, making changes necessary only in the central GIS system.
Implement Style Guides in your internal SOP so that team members know how certain data types that aren't as easily validated within the software can be standardized.
Using ArcGIS to geocode bad addresses can return a "match" for each address. This can also return a probability score that indicates the likelihood that the returned match is an actual one. The higher the score, the more likely it is that the address returned by the GIS system is the correct version. The results of this can be dumped to a spreadsheet and used to create scripts to update and fix many of your bad addresses.
Some CDDMS's come with data merge utilities. These allow you to review records that are similar and merge them. This is a bit more of a manual process, but it's still quicker than fixing each record by hand.
Fix manually. This involves a user or experienced consultant reviewing records individually and making corrections. This is usually done after resolving the majority of records using other technical methods, and it focuses on records lacking enough of a predictable pattern to resolve systematically.
If you have data you need to clean up, we can help. Give us a call at 517-614-3463 or email at firstname.lastname@example.org. We specialize in data cleanup and can help you get your system back on track quickly and accurately.