Introduction

This Jupyter Notebook documents the process of creating a SQLite database from three large CSV files. The primary objective is to enable data analysis on low-performance computers with limited RAM. Loading large datasets fully into memory with pandas often crashes the kernel when RAM runs out. Moving the data into a SQLite database significantly reduces the memory footprint of our data processing tasks.

SQLite offers a lightweight, file-based database system that is well suited to handling large datasets on machines with constrained resources. With the data stored in a database, SQL queries can retrieve only the rows and columns that are needed, so complex queries and analysis can run without loading the entire dataset into memory. This approach is resource-efficient and also more scalable and manageable when dealing with large volumes of data.

The following sections detail the steps taken to import, clean, and transfer the data from the CSV files into a structured database. This includes data preprocessing, schema definition, and data insertion, so that the database is well organized and ready for subsequent analysis. The setup aims to make robust data analysis possible even on computers that struggle with high memory demands, opening it up to a wider range of computational environments.
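
As a rough illustration of the idea, the sketch below streams a CSV into SQLite in small chunks and then lets SQLite compute a summary. It is only a minimal sketch under assumptions: the sample rows, the table name "sales", and the in-memory database are placeholders chosen for the example, not the notebook's actual files or schema.

```python
import io
import sqlite3

import pandas as pd

# Tiny in-memory stand-in for a large CSV file (hypothetical sample rows).
csv_text = (
    "Price,Postcode\n"
    "70000,MK15 9HP\n"
    "44500,SR6 0AQ\n"
    "56500,CO6 1SQ\n"
)

# ":memory:" keeps the sketch self-contained; a real run would pass a file path.
conn = sqlite3.connect(":memory:")

# Read the CSV two rows at a time so only one chunk is held in RAM,
# appending each chunk to the (hypothetical) table "sales".
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk.to_sql("sales", conn, if_exists="append", index=False)

# The aggregation runs inside SQLite; only the one-row result reaches pandas.
summary = pd.read_sql_query(
    "SELECT COUNT(*) AS n_rows, AVG(Price) AS avg_price FROM sales", conn
)
conn.close()

print(summary)
```

With a real file, the chunk size would be far larger (tens or hundreds of thousands of rows), chosen to fit comfortably in the available RAM.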

Variable Descriptions

Data Item	Explanation
TUI	A reference number which is generated automatically recording each published sale. The number is unique and will change each time a sale is recorded.
Price	Sale price stated on the transfer deed.
Date	Date when the sale was completed, as stated on the transfer deed.
Postcode	This is the postcode used at the time of the original transaction. Note that postcodes can be reallocated and these changes are not reflected in the Price Paid Dataset.
Property Type	D = Detached, S = Semi-Detached, T = Terraced, F = Flats/Maisonettes, O = Other. Note that only these categories are recorded to describe property type: bungalows are not separately identified; end-of-terrace properties are included in the Terraced category; and ‘Other’ is only valid where the transaction relates to a property type that is not covered by existing values, for example where a property comprises more than one large parcel of land.
Old/New	Indicates the age of the property and applies to all price paid transactions, residential and non-residential. Y = a newly built property, N = an established residential building
Duration	Relates to the tenure: F = Freehold, L = Leasehold, etc. Note that HM Land Registry does not record leases of 7 years or less in the Price Paid Dataset.
PAON	Primary Addressable Object Name. Typically the house number or name.
SAON	Secondary Addressable Object Name. Where a property has been divided into separate units (for example, flats), the PAON (above) will identify the building and a SAON will be specified that identifies the separate unit/flat.
Price Type	Indicates the type of Price Paid transaction. A = Standard Price Paid entry, includes single residential property sold for value. B = Additional Price Paid entry including transfers under a power of sale/repossessions, buy-to-lets (where they can be identified by a Mortgage), transfers to non-private individuals and sales where the property type is classed as ‘Other’. Note that category B does not separately identify the transaction types stated. HM Land Registry has been collecting information on Category A transactions from January 1995. Category B transactions were identified from October 2013.
Status	Indicates additions, changes and deletions to the records. A = Added records: records added into the price paid dataset in the monthly refresh due to new sales transactions. C = Changed records: records changed in the price paid dataset in the monthly refresh; you should replace or update records in any stored data using the unique identifier to recognise them. D = Deleted records: records deleted from the price paid dataset in the monthly refresh; you should delete records from any stored data using the unique identifier to recognise them. Note that where a transaction changes category type due to misallocation (as above) it will be deleted from the original category type and added to the correct category with a new transaction unique identifier.

=================================================================================================================================================================================

Implementation: Executing the Code

This code snippet imports the libraries used for data handling:

Pandas: Organizes and processes large data sets.
NumPy: Facilitates advanced mathematical operations.
SQLite3: Manages data storage in a structured database.

Purpose: These tools are essential for efficient data analysis and storage, especially suitable for handling large volumes of data on less powerful computers.

In [ ]:

import pandas as pd
import numpy as np
import sqlite3

Load the dataset 'pp-complete.csv' from a local folder:

In [ ]:

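# Note: the file has no header row, so without header=None pandas
# uses the first record as the column labels (visible in the output below)
df_pp = pd.read_csv('/Users/Albakov/Desktop/Data Analysis/My Analysis of Property/pp-complete.csv')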

Detailed information about the property dataset:

The call below reports the number of entries, the data type of each column, and the memory usage. With memory_usage="deep", pandas also counts the memory held by the strings in 'object' columns, so the figure reflects the real footprint.
Knowing the scale and complexity of the data helps in planning how to handle and analyze it efficiently.

In [ ]:

df_pp.info(memory_usage="deep")
          

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28782629 entries, 0 to 28782628
Data columns (total 16 columns):
 #   Column                                  Dtype 
---  ------                                  ----- 
 0   {F887F88E-7D15-4415-804E-52EAC2F10958}  object
 1   70000                                   int64 
 2   1995-07-07 00:00                        object
 3   MK15 9HP                                object
 4   D                                       object
 5   N                                       object
 6   F                                       object
 7   31                                      object
 8   Unnamed: 8                              object
 9   ALDRICH DRIVE                           object
 10  WILLEN                                  object
 11  MILTON KEYNES                           object
 12  MILTON KEYNES.1                         object
 13  MILTON KEYNES.2                         object
 14  A                                       object
 15  A.1                                     object
dtypes: int64(1), object(15)
memory usage: 22.6 GB

The output provides a detailed overview of the property dataset's structure:

It reveals that the dataset has over 28 million entries and 16 columns.
The column labels are not descriptive names but the values of the file's first record (for example '{F887F88E-7D15-4415-804E-52EAC2F10958}' and '70000'), because the CSV has no header row and pandas used its first line as the header. That record is therefore not counted among the entries. The data types are mostly 'object', with one 'int64' column.
Notably, the memory usage is approximately 22.6 GB, which underscores the need for efficient data handling strategies, especially on less powerful computers.

This information is crucial for planning how to manage, process, and analyze this extensive dataset effectively.
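
The column labels in the output above are values taken from the file's first record. A minimal sketch of how a header-less CSV can be read so that no record is consumed as a header is shown below. It uses two sample records copied from this notebook, truncated to four fields, and borrows the names "TUI", "Price", "Date" and "Postcode" from the renaming step later on. It is an illustration, not the code that produced the recorded outputs.

```python
import io

import pandas as pd

# Two sample records in the layout seen above, truncated to four fields.
raw = (
    "{F887F88E-7D15-4415-804E-52EAC2F10958},70000,1995-07-07 00:00,MK15 9HP\n"
    "{40FD4DF2-5362-407C-92BC-566E2CCE89E9},44500,1995-02-03 00:00,SR6 0AQ\n"
)
names = ["TUI", "Price", "Date", "Postcode"]

# Default behaviour: the first line is treated as the header row.
default_read = pd.read_csv(io.StringIO(raw))

# header=None tells pandas there is no header row; names supplies the labels.
explicit_read = pd.read_csv(io.StringIO(raw), header=None, names=names)

print("rows with default header handling:", len(default_read))
print("rows with header=None:", len(explicit_read))
print(explicit_read)
```

Passing the names at read time also makes a separate renaming step unnecessary.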

This part of the code gives us a peek into the dataset and its size:

It shows the first two entries of the dataset, much like viewing the first lines of a spreadsheet, to give a feel for the kind of information the dataset contains.
It also prints the total number of rows and columns, giving an idea of the dataset's overall size and scope.

Understanding the basic layout and size of the dataset is a crucial first step before starting any detailed analysis or processing.
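
On a machine with little RAM, a similar peek can be taken without loading the whole file. The sketch below is a hedged illustration on a made-up four-row CSV: nrows limits how many rows are read, and iterating with chunksize counts rows one chunk at a time. The sample data and the chunk size are assumptions made for the example.

```python
import io

import pandas as pd

# Made-up sample standing in for a large file.
raw = "a,b\n1,2\n3,4\n5,6\n7,8\n"

# Read only the first two data rows instead of the whole file.
preview = pd.read_csv(io.StringIO(raw), nrows=2)

# Count rows chunk by chunk, so the full file is never held in memory at once.
total_rows = sum(len(chunk) for chunk in pd.read_csv(io.StringIO(raw), chunksize=3))

print(preview)
print("Number of Rows:", total_rows)
```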

In [ ]:

display(df_pp.head(2))
print('Number of Rows:', df_pp.shape[0])
print('Number of Columns:', df_pp.shape[1])

	{F887F88E-7D15-4415-804E-52EAC2F10958}	70000	1995-07-07 00:00	MK15 9HP	D	N	F	31	Unnamed: 8	ALDRICH DRIVE	WILLEN	MILTON KEYNES	MILTON KEYNES.1	MILTON KEYNES.2	A	A.1
0	{40FD4DF2-5362-407C-92BC-566E2CCE89E9}	44500	1995-02-03 00:00	SR6 0AQ	T	N	F	50	NaN	HOWICK PARK	SUNDERLAND	SUNDERLAND	SUNDERLAND	TYNE AND WEAR	A	A
1	{7A99F89E-7D81-4E45-ABD5-566E49A045EA}	56500	1995-01-13 00:00	CO6 1SQ	T	N	F	19	NaN	BRICK KILN CLOSE	COGGESHALL	COLCHESTER	BRAINTREE	ESSEX	A	A

Number of Rows: 28782629
Number of Columns: 16

In this part of the code, the dataset undergoes a few changes to make it more user-friendly and organized:

First, the columns are renamed with clear, descriptive names such as "Price", "Date", and "Postcode". This makes the dataset easier to understand and work with.
The "Date" column is trimmed to the date part only, dropping the "00:00" time component.
The modified dataset is then saved as a new CSV file named 'pp-complete-modified.csv' for future use.
Lastly, the DataFrame is deleted from the notebook's memory to free up space, which is especially important for computers with limited resources.

These steps improve the dataset's readability and structure and help manage memory efficiently.

In [ ]:

# Descriptive names for the 16 columns, in file order
column_names = [
    "TUI",
    "Price",
    "Date",
    "Postcode",
    "Property Type",
    "Old/New",
    "Duration",
    "PAON",
    "SAON",
    "Street",
    "Locality",
    "Town/City",
    "District",
    "County",
    "Price Type",
    "Status"
]

df_pp.columns = column_names
# Keep only the date part, e.g. "1995-07-07 00:00" -> "1995-07-07"
df_pp['Date'] = df_pp['Date'].str.split().str[0]

# Save the cleaned data, then release the DataFrame's memory
df_pp.to_csv('/Users/Albakov/Desktop/Data Analysis/My Analysis of Property/pp-complete-modified.csv', index=False)

del df_pp
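
The renaming and date-trimming steps can be tried on a tiny frame before running them on the full dataset. The sketch below uses two records copied from this notebook, reduced to three columns. The final pd.to_datetime call with an explicit format is an addition made for the illustration, to confirm that the trimmed strings parse as dates.

```python
import pandas as pd

# Two sample records from the notebook, reduced to three columns.
sample = pd.DataFrame(
    [
        ["{F887F88E-7D15-4415-804E-52EAC2F10958}", 70000, "1995-07-07 00:00"],
        ["{40FD4DF2-5362-407C-92BC-566E2CCE89E9}", 44500, "1995-02-03 00:00"],
    ]
)

# Assign descriptive column names, as in the cell above.
sample.columns = ["TUI", "Price", "Date"]

# Split on whitespace and keep the first piece: the date without the time.
sample["Date"] = sample["Date"].str.split().str[0]

# Check that the trimmed strings are valid ISO dates.
parsed = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

print(sample)
print(parsed.dt.year.tolist())
```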
          

Here, the dataset is loaded again with specific adjustments to how the data is handled:

The data type of each column is defined explicitly: text-based columns are read as 'category', and "Price" as 'int32', a more compact integer type than the default 'int64'.
These adjustments reduce the amount of memory the dataset uses, making it more manageable, especially on less powerful computers.
Additionally, the "Date" column is converted to a proper datetime type, which simplifies any analysis or operation that involves dates.

These changes are geared towards optimizing the dataset for more efficient analysis and processing.

In [ ]:

# Compact dtypes: 'category' for text columns, 'int32' for prices
dtypes = {
    "TUI": "category",
    "Price": "int32",
    "Postcode": "category",
    "Property Type": "category",
    "Old/New": "category",
    "Duration": "category",
    "PAON": "category",
    "SAON": "category",
    "Street": "category",
    "Locality": "category",
    "Town/City": "category",
    "District": "category",
    "County": "category",
    "Price Type": "category",
    "Status": "category"
}

df_modified = pd.read_csv('/Users/Albakov/Desktop/Data Analysis/My Analysis of Property/pp-complete-modified.csv', dtype=dtypes)

# "Date" is not in the dtype mapping, so it is read as text and converted here
df_modified["Date"] = pd.to_datetime(df_modified["Date"])

In [ ]:

df_modified.info(memory_usage="deep")
          

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28782629 entries, 0 to 28782628
Data columns (total 16 columns):
 #   Column         Dtype         
---  ------         -----         
 0   TUI            category      
 1   Price          int32         
 2   Date           datetime64[ns]
 3   Postcode       category      
 4   Property Type  category      
 5   Old/New        category      
 6   Duration       category      
 7   PAON           category      
 8   SAON           category      
 9   Street         category      
 10  Locality       category      
 11  Town/City      category      
 12  District       category      
 13  County         category      
 14  Price Type     category      
 15  Status         category      
dtypes: category(14), datetime64[ns](1), int32(1)
memory usage: 4.7 GB

This output shows the results of the recent modifications to the dataset:

The dataset still contains over 28 million entries, but now with more efficient data types like 'category' for most text columns and 'int32' for numerical data.
Notably, the memory usage has dropped to 4.7 GB from the initial 22.6 GB, a reduction of roughly 79%. This is a key improvement, especially for analysis on computers with limited memory.
The changes in data types, including the conversion of the "Date" column to a proper date format, are reflected in the output.

This demonstrates the effectiveness of the data optimization steps taken to make the dataset more manageable for processing and analysis.
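
How much 'category' helps depends on how often values repeat. The sketch below compares the memory of a text column stored as plain text and as 'category', once for a column with five repeating codes (like "Property Type") and once for a column in which every value is different (like "TUI", which is unique per sale). The data is synthetic and the column contents are assumptions made for the illustration. Exact byte counts vary with the pandas version, so none are quoted here.

```python
import pandas as pd

n = 100_000

# Low-cardinality text column: five property-type codes repeated many times.
property_type = pd.Series(["D", "S", "T", "F", "O"] * (n // 5))

# High-cardinality text column: every value is different (hypothetical IDs).
unique_ids = pd.Series([f"ID-{i:08d}" for i in range(n)])


def deep_bytes(series: pd.Series) -> int:
    """Memory used by a Series, including the strings it holds."""
    return int(series.memory_usage(deep=True))


for label, col in [("property_type", property_type), ("unique_ids", unique_ids)]:
    as_text = deep_bytes(col)
    as_category = deep_bytes(col.astype("category"))
    print(f"{label}: text={as_text:,} bytes, category={as_category:,} bytes")
```

For columns with few distinct values the saving is large, because each row stores only a small integer code. For columns where nearly every value is unique, 'category' typically brings little or no saving, since every distinct string must still be stored once alongside the codes.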

Here's a quick look at the updated dataset and its dimensions:

The first two entries of the modified dataset are displayed. This snapshot includes columns like 'TUI', 'Price', and 'Date', showing how the data is now structured.
The total number of rows and columns is also printed, confirming that the dataset still holds over 28 million entries across 16 columns.

This glimpse into the dataset, along with its size, gives a sense of the data's scope after the recent optimizations.

In [ ]:

display(df_modified.head(2))
print('Number of Rows:', df_modified.shape[0])
print('Number of Columns:', df_modified.shape[1])

	TUI	Price	Date	Postcode	Property Type	Old/New	Duration	PAON	SAON	Street	Locality	Town/City	District	County	Price Type	Status
0	{40FD4DF2-5362-407C-92BC-566E2CCE89E9}	44500	1995-02-03	SR6 0AQ	T	N	F	50	NaN	HOWICK PARK	SUNDERLAND	SUNDERLAND	SUNDERLAND	TYNE AND WEAR	A	A
1	{7A99F89E-7D81-4E45-ABD5-566E49A045EA}	56500	1995-01-13	CO6 1SQ	T	N	F	19	NaN	BRICK KILN CLOSE	COGGESHALL	COLCHESTER	BRAINTREE	ESSEX	A	A

Number of Rows: 28782629
Number of Columns: 16

This part of the code introduces a second dataset, covering UK companies that own property in England and Wales:

The dataset is loaded from a CSV file named 'UK companies that own property in England and Wales.csv'.
After loading, an overview is printed, including the number of entries, the data types, and the memory usage.

This step brings additional data into the analysis, expanding its scope to property ownership by UK companies.

In [ ]:

# low_memory=False makes pandas infer each column's dtype from the whole file
df_uk = pd.read_csv('/Users/Albakov/Desktop/Data Analysis/My Analysis of Property/UK companies that own property in England and Wales.csv', low_memory=False)
df_uk.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4105590 entries, 0 to 4105589
Data columns (total 35 columns):
 #   Column                           Dtype  
---  ------                           -----  
 0   Title Number                     object 
 1   Tenure                           object 
 2   Property Address                 object 
 3   District                         object 
 4   County                           object 
 5   Region                           object 
 6   Postcode                         object 
 7   Multiple Address Indicator       object 
 8   Price Paid                       float64
 9   Proprietor Name (1)              object 
 10  Company Registration No. (1)     object 
 11  Proprietorship Category (1)      object 
 12  Proprietor (1) Address (1)       object 
 13  Proprietor (1) Address (2)       object 
 14  Proprietor (1) Address (3)       object 
 15  Proprietor Name (2)              object 
 16  Company Registration No. (2)     object 
 17  Proprietorship Category (2)      object 
 18  Proprietor (2) Address (1)       object 
 19  Proprietor (2) Address (2)       object 
 20  Proprietor (2) Address (3)       object 
 21  Proprietor Name (3)              object 
 22  Company Registration No. (3)     object 
 23  Proprietorship Category (3)      object 
 24  Proprietor (3) Address (1)       object 
 25  Proprietor (3) Address (2)       object 
 26  Proprietor (3) Address (3)       object 
 27  Proprietor Name (4)              object 
 28  Company Registration No. (4)     object 
 29  Proprietorship Category (4)      object 
 30  Proprietor (4) Address (1)       object 
 31  Proprietor (4) Address (2)       object 
 32  Proprietor (4) Address (3)       float64
 33  Date Proprietor Added            object 
 34  Additional Proprietor Indicator  object 
dtypes: float64(2), object(33)
memory usage: 5.9 GB

The overview of the newly loaded dataset about UK companies owning property in England and Wales reveals:

The dataset comprises over 4 million entries, each representing a property.
It contains a wide array of information across 35 columns, including details like property address, price paid, and proprietor names.
The majority of the data is text-based: 33 columns are of 'object' type, and two ('Price Paid' and 'Proprietor (4) Address (3)') are typed as 'float64'.
The memory usage of this dataset is approximately 5.9 GB, indicating its considerable size.

This information helps us understand the depth and breadth of the data available for analysis regarding property ownership by companies in the UK.

This part of the code provides a brief glimpse into the dataset on UK property ownership by companies:

The first two entries are displayed, giving an initial view of the information included, such as property address, district, price, and proprietor details.
The total number of rows and columns is also printed. With over 4 million rows and 35 columns, the dataset is extensive and covers a wide range of details.

These insights show the composition and scale of the data and offer a foundation for more in-depth analysis.

In [ ]:

display(df_uk.head(2))
print('Number of Rows:', df_uk.shape[0])
print('Number of Columns:', df_uk.shape[1])

	Title Number	Tenure	Property Address	District	County	Region	Postcode	Multiple Address Indicator	Price Paid	Proprietor Name (1)	...	Proprietor (3) Address (2)	Proprietor (3) Address (3)	Proprietor Name (4)	Company Registration No. (4)	Proprietorship Category (4)	Proprietor (4) Address (1)	Proprietor (4) Address (2)	Proprietor (4) Address (3)	Date Proprietor Added	Additional Proprietor Indicator
0	356353	Freehold	37 Ixworth Place, London (SW3 3QH)	KENSINGTON AND CHELSEA	GREATER LONDON	GREATER LONDON	SW3 3QH	N	NaN	ZURICH ASSURANCE LTD	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01-02-2005	N
1	356361	Freehold	70 Marylebone High Street, The Lord Tyrawley, ...	CITY OF WESTMINSTER	GREATER LONDON	GREATER LONDON	NaN	N	NaN	HOWARD DE WALDEN ESTATES LIMITED	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	29-11-1963	N

2 rows × 35 columns

Number of Rows: 4105590
Number of Columns: 35

In this section, the dataset is refined to make it more manageable and easier to analyze:

Columns with too many missing values are removed. Specifically, any column with more than 40% missing data is dropped (dropna keeps only the columns with at least 60% non-null values). This step helps focus on more reliable and complete information.
The column names are then cleaned up for consistency. The marker "(1)" is removed from the names, and leading and trailing spaces are trimmed. Spaces inside a name are left as they are, which is why 'Proprietor (1) Address (1)' becomes 'Proprietor  Address' with a double space.
An overview of the revised dataset is provided, showing the new structure, data types, and memory usage.

These modifications are aimed at enhancing the dataset's quality and usability for more effective data analysis.

In [ ]:

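# Keep only columns with at least 60% non-null values
threshold = len(df_uk) * 0.6
df_uk.dropna(axis=1, thresh=threshold, inplace=True)
# regex=False treats "(1)" as literal text rather than a regex capture group
df_uk.columns = df_uk.columns.str.replace("(1)", "", regex=False)
df_uk.columns = df_uk.columns.str.strip()

df_uk.info(memory_usage="deep")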

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4105590 entries, 0 to 4105589
Data columns (total 14 columns):
 #   Column                           Dtype 
---  ------                           ----- 
 0   Title Number                     object
 1   Tenure                           object
 2   Property Address                 object
 3   District                         object
 4   County                           object
 5   Region                           object
 6   Postcode                         object
 7   Multiple Address Indicator       object
 8   Proprietor Name                  object
 9   Company Registration No.         object
 10  Proprietorship Category          object
 11  Proprietor  Address              object
 12  Date Proprietor Added            object
 13  Additional Proprietor Indicator  object
dtypes: object(14)
memory usage: 3.5 GB

The overview of the updated UK property ownership dataset shows:

The dataset now contains 14 columns, reduced from the original 35, following the removal of columns with significant missing data.
The remaining columns include key information like property address, proprietor name, and registration details.
All columns are of 'object' type, which is typical for text-based information.
The memory usage has decreased to 3.5 GB, reflecting the effect of removing less informative columns and optimizing the dataset's size.

These changes make the dataset more focused and manageable, improving its suitability for in-depth analysis.
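
The column-dropping and renaming steps can be reproduced on a small frame to see exactly what they do. The sketch below is an illustration on made-up values with three of the original column names. The extra step that collapses repeated spaces is an optional addition, not part of the notebook's code, and would turn 'Proprietor  Address' into 'Proprietor Address'.

```python
import math

import pandas as pd

# Made-up values; only the column names come from the dataset.
df = pd.DataFrame(
    {
        "Proprietor Name (1)": ["A", "B", "C", "D", "E"],           # 0% missing
        "Proprietor (1) Address (1)": ["x", "y", "z", None, "w"],   # 20% missing
        "Proprietor Name (2)": [None, None, None, "B2", None],      # 80% missing
    }
)

# A column is kept only if at least 60% of its values are non-null.
threshold = math.ceil(len(df) * 0.6)
kept = df.dropna(axis=1, thresh=threshold)

# Remove the literal "(1)" marker, collapse repeated spaces, trim the ends.
clean_names = (
    kept.columns.str.replace("(1)", "", regex=False)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(list(kept.columns))
print(list(clean_names))
```

Passing regex=False matters on older pandas versions, where the pattern "(1)" would otherwise be read as a regular expression group that matches only the digit and leaves the parentheses behind.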

Here's another look at the streamlined UK property ownership dataset:

The code displays the first two rows of the revised dataset, providing a snapshot of the information now available, such as property address, proprietor name, and tenure type.
It also confirms the row and column counts after the cleanup: the dataset still contains over 4 million entries, now with a more focused set of 14 columns.

These updates offer a clearer and more concise view of the dataset, supporting a more efficient analysis process.

In [ ]:

display(df_uk.head(2))
print('Number of Rows:', df_uk.shape[0])
print('Number of Columns:', df_uk.shape[1])

	Title Number	Tenure	Property Address	District	County	Region	Postcode	Multiple Address Indicator	Proprietor Name	Company Registration No.	Proprietorship Category	Proprietor Address	Date Proprietor Added	Additional Proprietor Indicator
0	356353	Freehold	37 Ixworth Place, London (SW3 3QH)	KENSINGTON AND CHELSEA	GREATER LONDON	GREATER LONDON	SW3 3QH	N	ZURICH ASSURANCE LTD	02456671	Limited Company or Public Limited Company	The Grange, Bishops Cleeve, Cheltenham, Glouce...	01-02-2005	N
1	356361	Freehold	70 Marylebone High Street, The Lord Tyrawley, ...	CITY OF WESTMINSTER	GREATER LONDON	GREATER LONDON	NaN	N	HOWARD DE WALDEN ESTATES LIMITED	NaN	Limited Company or Public Limited Company	23 Queen Anne Street, London W1G 9DL	29-11-1963	N

Number of Rows: 4105590
Number of Columns: 14

This code introduces a third dataset, covering overseas companies that own property in England and Wales:

The dataset is loaded from a file titled 'Overseas companies that own property in England and Wales.csv'.
An initial overview is printed, detailing the structure, the data types, and the memory usage of this new dataset.

Incorporating this dataset allows for a broader analysis of property ownership that covers both UK and overseas entities.

In [ ]:

# low_memory=False makes pandas infer each column's dtype from the whole file
df_overseas = pd.read_csv('/Users/Albakov/Desktop/Data Analysis/My Analysis of Property/Overseas companies that own property in England and Wales.csv', low_memory=False)
df_overseas.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93510 entries, 0 to 93509
Data columns (total 39 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Title Number                     93510 non-null  object 
 1   Tenure                           93510 non-null  object 
 2   Property Address                 93507 non-null  object 
 3   District                         93509 non-null  object 
 4   County                           93509 non-null  object 
 5   Region                           93509 non-null  object 
 6   Postcode                         71570 non-null  object 
 7   Multiple Address Indicator       93509 non-null  object 
 8   Price Paid                       33497 non-null  float64
 9   Proprietor Name (1)              93509 non-null  object 
 10  Company Registration No. (1)     4356 non-null   object 
 11  Proprietorship Category (1)      93509 non-null  object 
 12  Country Incorporated (1)         93509 non-null  object 
 13  Proprietor (1) Address (1)       93509 non-null  object 
 14  Proprietor (1) Address (2)       25183 non-null  object 
 15  Proprietor (1) Address (3)       2572 non-null   object 
 16  Proprietor Name (2)              6260 non-null   object 
 17  Company Registration No. (2)     82 non-null     object 
 18  Proprietorship Category (2)      6260 non-null   object 
 19  Country Incorporated (2)         6260 non-null   object 
 20  Proprietor (2) Address (1)       6260 non-null   object 
 21  Proprietor (2) Address (2)       1609 non-null   object 
 22  Proprietor (2) Address (3)       140 non-null    object 
 23  Proprietor Name (3)              40 non-null     object 
 24  Company Registration No. (3)     1 non-null      object 
 25  Proprietorship Category (3)      40 non-null     object 
 26  Country Incorporated (3)         40 non-null     object 
 27  Proprietor (3) Address (1)       40 non-null     object 
 28  Proprietor (3) Address (2)       6 non-null      object 
 29  Proprietor (3) Address (3)       0 non-null      float64
 30  Proprietor Name (4)              8 non-null      object 
 31  Company Registration No. (4)     0 non-null      float64
 32  Proprietorship Category (4)      8 non-null      object 
 33  Country Incorporated (4)         8 non-null      object 
 34  Proprietor (4) Address (1)       8 non-null      object 
 35  Proprietor (4) Address (2)       1 non-null      object 
 36  Proprietor (4) Address (3)       0 non-null      float64
 37  Date Proprietor Added            93308 non-null  object 
 38  Additional Proprietor Indicator  93509 non-null  object 
dtypes: float64(4), object(35)
memory usage: 149.8 MB

The newly loaded dataset on overseas companies owning property in England and Wales shows the following characteristics:

It comprises 93,510 entries, each potentially representing a property.
The dataset is detailed, with 39 columns covering various aspects like property address, proprietor name, and price paid.
Several columns have a significant number of missing values, indicating areas where data might be incomplete.
The memory usage for this dataset is about 149.8 MB, which is relatively modest compared to the previous datasets.

This dataset adds an international dimension to the property ownership analysis, offering insights into overseas investments in England and Wales.

Here's a first look at the overseas property ownership dataset:

The first two rows of the dataset are displayed, showing the kind of information it contains, such as the property address and proprietor details.
The dataset is confirmed to have 93,510 rows and 39 columns.

This initial look provides a basic understanding of the dataset's structure and the extent of the data available for analysis.

In [ ]:

display(df_overseas.head(2))
print('Number of Rows:', df_overseas.shape[0])
print('Number of Columns:', df_overseas.shape[1])

	Title Number	Tenure	Property Address	District	County	Region	Postcode	Multiple Address Indicator	Price Paid	Proprietor Name (1)	...	Proprietor (3) Address (3)	Proprietor Name (4)	Company Registration No. (4)	Proprietorship Category (4)	Country Incorporated (4)	Proprietor (4) Address (1)	Proprietor (4) Address (2)	Proprietor (4) Address (3)	Date Proprietor Added	Additional Proprietor Indicator
0	SYK570104	Freehold	276 Sheffield Road, Birdwell, Barnsley (S70 5TG)	BARNSLEY	SOUTH YORKSHIRE	YORKS AND HUMBER	S70 5TG	N	NaN	MILLER ROSS DEVELOPMENTS LIMITED	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	09-10-2012	N
1	SYK571158	Freehold	Land on the south east side of Oakwells, Barto...	DONCASTER	SOUTH YORKSHIRE	YORKS AND HUMBER	DN3 3AB	N	NaN	CHATFORD LIMITED	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	18-06-2012	Y

2 rows × 39 columns

Number of Rows: 93510
Number of Columns: 39

The overseas property ownership dataset is refined and streamlined in the same way:

Columns with more than 40% missing values are removed to focus on more complete and reliable data.
The column names are cleaned up for consistency by removing the "(1)" marker and trimming leading and trailing spaces.
The refined dataset has 14 columns, down from the original 39, making it more concise and focused.
The memory usage of the dataset decreases to approximately 83.8 MB as a result.

These changes help in making the dataset more manageable and relevant for detailed analysis.
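
The 40% rule can be checked by hand against the non-null counts that info() printed for this dataset earlier. The sketch below copies five of those counts and applies the same threshold of 60% non-null values. It is a worked arithmetic example, not part of the notebook's pipeline.

```python
# Row count and non-null counts copied from the info() output shown earlier.
total_rows = 93510
non_null_counts = {
    "Postcode": 71570,
    "Price Paid": 33497,
    "Company Registration No. (1)": 4356,
    "Proprietor (1) Address (2)": 25183,
    "Date Proprietor Added": 93308,
}

# A column survives only if its non-null count reaches 60% of the rows.
threshold = total_rows * 0.6

kept = [name for name, count in non_null_counts.items() if count >= threshold]
dropped = [name for name, count in non_null_counts.items() if count < threshold]

for name, count in non_null_counts.items():
    share_missing = 1 - count / total_rows
    print(f"{name}: {share_missing:.1%} missing")

print("kept:", kept)
print("dropped:", dropped)
```

The threshold works out to about 56,106 non-null values, so 'Postcode' and 'Date Proprietor Added' clear it while the other three columns do not.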

In [ ]:

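# Keep only columns with at least 60% non-null values
threshold = len(df_overseas) * 0.6
df_overseas.dropna(axis=1, thresh=threshold, inplace=True)
# regex=False treats "(1)" as literal text rather than a regex capture group
df_overseas.columns = df_overseas.columns.str.replace("(1)", "", regex=False)
df_overseas.columns = df_overseas.columns.str.strip()

df_overseas.info(memory_usage="deep")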

<class 'pandas.core.frame.DataFrame'>
          RangeIndex: 93510 entries, 0 to 93509
          Data columns (total 14 columns):
           #   Column                           Non-Null Count  Dtype 
          ---  ------                           --------------  ----- 
           0   Title Number                     93510 non-null  object
           1   Tenure                           93510 non-null  object
           2   Property Address                 93507 non-null  object
           3   District                         93509 non-null  object
           4   County                           93509 non-null  object
           5   Region                           93509 non-null  object
           6   Postcode                         71570 non-null  object
           7   Multiple Address Indicator       93509 non-null  object
           8   Proprietor Name                  93509 non-null  object
           9   Proprietorship Category          93509 non-null  object
           10  Country Incorporated             93509 non-null  object
           11  Proprietor  Address              93509 non-null  object
           12  Date Proprietor Added            93308 non-null  object
           13  Additional Proprietor Indicator  93509 non-null  object
          dtypes: object(14)
          memory usage: 83.8 MB
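
The cleaning step above can be tried on a small scale. The following is a minimal sketch on a made-up frame (the `toy` data and the `cleaned` name are illustrative assumptions, not part of the notebook). It also collapses repeated internal spaces, an extra step the notebook does not perform, which would turn "Proprietor  Address" into "Proprietor Address".

```python
import pandas as pd

# Toy frame standing in for df_overseas; the values are invented for illustration.
toy = pd.DataFrame({
    "Title Number": ["A1", "A2", "A3", "A4", "A5"],
    "Proprietor Name (1)": ["X LTD", "Y LTD", None, "Z LTD", "W LTD"],
    "Proprietor (1) Address (1)": ["a", "b", "c", "d", "e"],
    "Proprietor Name (2)": [None, None, None, None, "V LTD"],
})

# Keep a column only if at least 60% of its values are non-null.
threshold = int(len(toy) * 0.6)
cleaned = toy.dropna(axis=1, thresh=threshold)

cleaned.columns = (
    cleaned.columns
    .str.replace("(1)", "", regex=False)   # literal match, not a regex group
    .str.replace(r"\s+", " ", regex=True)  # extra step: collapse repeated spaces
    .str.strip()                           # trim leading and trailing spaces
)

print(list(cleaned.columns))
```

Passing `regex=False` matters on older pandas versions, where `str.replace` treated the pattern as a regular expression by default and `"(1)"` would have matched only the digit.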

A quick overview of the updated overseas property ownership dataset:¶

The first two rows are shown, giving an insight into the type of information now available, such as property address, proprietor name, and the country where the company is incorporated.
The refined dataset has a total of 93,510 rows and 14 columns, indicating a substantial amount of data but in a more focused format compared to the original dataset.

This display helps in understanding the structure and key elements of the streamlined dataset, setting the stage for further analysis.

In [ ]:

display(df_overseas.head(2))
print('Number of Rows:', df_overseas.shape[0])
print('Number of Columns:', df_overseas.shape[1])

	Title Number	Tenure	Property Address	District	County	Region	Postcode	Multiple Address Indicator	Proprietor Name	Proprietorship Category	Country Incorporated	Proprietor Address	Date Proprietor Added	Additional Proprietor Indicator
0	SYK570104	Freehold	276 Sheffield Road, Birdwell, Barnsley (S70 5TG)	BARNSLEY	SOUTH YORKSHIRE	YORKS AND HUMBER	S70 5TG	N	MILLER ROSS DEVELOPMENTS LIMITED	Limited Company or Public Limited Company	IRELAND	25/26 Windsor Place, Lower Pembroke Street, Du...	09-10-2012	N
1	SYK571158	Freehold	Land on the south east side of Oakwells, Barto...	DONCASTER	SOUTH YORKSHIRE	YORKS AND HUMBER	DN3 3AB	N	CHATFORD LIMITED	Limited Company or Public Limited Company	GUERNSEY	Le Vauquiedor Manor, St Martin's, Guernsey, GY...	18-06-2012	Y

Number of Rows: 93510
          Number of Columns: 14

In this part of the code, the datasets are transferred to a structured database:¶

A connection is established to a database named pp-complete-modified.db. Think of this as setting up a large filing cabinet where data can be organized and stored.
Each of the three datasets (df_modified, df_uk, and df_overseas) is saved into this database as separate tables. This is akin to placing different documents into clearly labeled folders within the filing cabinet.
The tables are named property_paid_price, UK_comp_own_prop_Engl_Wales, and Overseas_comp_own_prop_Engl_Wales, corresponding to each dataset.

Storing data in a database like this helps manage large amounts of information more efficiently and allows for easier retrieval and analysis.

In [ ]:

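# Open the SQLite database file (it is created if it does not exist yet)
conn = sqlite3.connect('pp-complete-modified.db')

# Write each DataFrame to its own table, replacing any existing table of the same name
df_modified.to_sql('property_paid_price', conn, if_exists='replace', index=False)
df_uk.to_sql('UK_comp_own_prop_Engl_Wales', conn, if_exists='replace', index=False)
df_overseas.to_sql('Overseas_comp_own_prop_Engl_Wales', conn, if_exists='replace', index=False)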
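
The cell above writes DataFrames that are already fully loaded. On a machine where even that load is too heavy, one possible alternative is to stream the CSV into the database in pieces. This is a sketch only, not the notebook's method: the CSV text, the `demo_prices` table name, and the chunk size are assumptions chosen to keep the example self-contained.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
csv_text = "TUI,Price\n" + "\n".join(f"id{i},{1000 + i}" for i in range(10))

conn_demo = sqlite3.connect(":memory:")  # in-memory database for the demo

# Read a few rows at a time and append each chunk to the same table,
# so only one chunk is held in memory at any moment.
for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=4)):
    chunk.to_sql(
        "demo_prices",
        conn_demo,
        if_exists="replace" if i == 0 else "append",
        index=False,
    )

row_count = conn_demo.execute("SELECT COUNT(*) FROM demo_prices").fetchone()[0]
print(row_count)

conn_demo.close()  # release the database once writing is finished
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file path and the chunk size raised to something like tens of thousands of rows.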

Checking the tables created in the database:¶

A query is run against the database to list all the tables in it. Think of this as checking the labels on the folders in our filing cabinet to see what's inside.
The output confirms the presence of three tables: property_paid_price, UK_comp_own_prop_Engl_Wales, and Overseas_comp_own_prop_Engl_Wales.

This step ensures that all the datasets have been successfully stored in the database and are ready for future analysis.

In [ ]:

query = """
SELECT name
FROM sqlite_master
WHERE type='table';
"""
db = pd.read_sql_query(query, conn)
db

Out[ ]:

	name
0	property_paid_price
1	UK_comp_own_prop_Engl_Wales
2	Overseas_comp_own_prop_Engl_Wales
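
The same check can be turned into an explicit test rather than a visual inspection. Below is a small sketch using only the standard `sqlite3` module; the in-memory database, the simplified table definitions, and the `expected` set are assumptions for illustration.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE property_paid_price (TUI TEXT, Price INTEGER)")
conn_demo.execute('CREATE TABLE UK_comp_own_prop_Engl_Wales ("Title Number" TEXT)')

# List user tables, skipping SQLite's own internal tables.
rows = conn_demo.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
).fetchall()
table_names = {name for (name,) in rows}

# Compare against the tables we expect to find.
expected = {
    "property_paid_price",
    "UK_comp_own_prop_Engl_Wales",
    "Overseas_comp_own_prop_Engl_Wales",
}
missing = expected - table_names
print("Missing tables:", missing)
```

Here the third table was deliberately not created, so it is reported as missing.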

This section of the code examines the structure of the tables in the database:¶

For each table in the database (property_paid_price, UK_comp_own_prop_Engl_Wales, Overseas_comp_own_prop_Engl_Wales), a query is performed to retrieve the details of its structure.
The output for each table shows the column names and their data types, providing an overview of the kind of information stored in each table.
- For instance, the property_paid_price table includes columns like TUI, Price, Date, with types such as TEXT, INTEGER, and TIMESTAMP.
This information is valuable for understanding the format and layout of the data within each table, which is crucial for any further analysis or queries.

These steps help in getting familiar with the database's structure, ensuring that the data is organized as intended and ready for use.

In [ ]:

tables = ['property_paid_price', 'UK_comp_own_prop_Engl_Wales', 'Overseas_comp_own_prop_Engl_Wales']

for table in tables:
    # PRAGMA table_info returns one row per column, including its declared type
    query = f"PRAGMA table_info({table});"
    df = pd.read_sql(query, conn)
    print(f"Structure of {table}:")
    display(df[['name','type']])
    print('='*50)

Structure of property_paid_price:

	name	type
0	TUI	TEXT
1	Price	INTEGER
2	Date	TIMESTAMP
3	Postcode	TEXT
4	Property Type	TEXT
5	Old/New	TEXT
6	Duration	TEXT
7	PAON	TEXT
8	SAON	TEXT
9	Street	TEXT
10	Locality	TEXT
11	Town/City	TEXT
12	District	TEXT
13	County	TEXT
14	Price Type	TEXT
15	Status	TEXT

==================================================
          Structure of UK_comp_own_prop_Engl_Wales:

	name	type
0	Title Number	TEXT
1	Tenure	TEXT
2	Property Address	TEXT
3	District	TEXT
4	County	TEXT
5	Region	TEXT
6	Postcode	TEXT
7	Multiple Address Indicator	TEXT
8	Proprietor Name	TEXT
9	Company Registration No.	TEXT
10	Proprietorship Category	TEXT
11	Proprietor Address	TEXT
12	Date Proprietor Added	TEXT
13	Additional Proprietor Indicator	TEXT

==================================================
          Structure of Overseas_comp_own_prop_Engl_Wales:

	name	type
0	Title Number	TEXT
1	Tenure	TEXT
2	Property Address	TEXT
3	District	TEXT
4	County	TEXT
5	Region	TEXT
6	Postcode	TEXT
7	Multiple Address Indicator	TEXT
8	Proprietor Name	TEXT
9	Proprietorship Category	TEXT
10	Country Incorporated	TEXT
11	Proprietor Address	TEXT
12	Date Proprietor Added	TEXT
13	Additional Proprietor Indicator	TEXT

==================================================
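
One detail worth knowing when reading this output: the types reported by `PRAGMA table_info` are the declared column types, and SQLite has no dedicated date storage class. The sketch below, on a made-up `demo` table, shows that a column declared as TIMESTAMP holds the date as text. It assumes the dates were written as ISO-formatted strings such as `1995-02-03 00:00:00`.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute('CREATE TABLE demo (TUI TEXT, Price INTEGER, "Date" TIMESTAMP)')
conn_demo.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [
        ("id1", 44500, "1995-02-03 00:00:00"),
        ("id2", 56500, "1995-01-13 00:00:00"),
    ],
)

# Declared types, as reported by PRAGMA table_info (row = cid, name, type, ...).
declared = {row[1]: row[2] for row in conn_demo.execute("PRAGMA table_info(demo)")}

# Actual storage class of the value held in the Date column.
stored = conn_demo.execute('SELECT typeof("Date") FROM demo LIMIT 1').fetchone()[0]

# ISO-formatted text sorts chronologically, so plain comparisons work as date filters.
after_feb = conn_demo.execute(
    'SELECT TUI FROM demo WHERE "Date" >= ?', ("1995-02-01",)
).fetchall()

print(declared, stored, after_feb)
```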

This code provides a quick preview of the first few rows in each table of the database:¶

For each table (property_paid_price, UK_comp_own_prop_Engl_Wales, Overseas_comp_own_prop_Engl_Wales), the first two entries are displayed.
This glimpse into each table shows the types of data stored, such as property details, proprietor information, and other relevant attributes.
- For example, in the property_paid_price table, entries include details like transaction unique identifier (TUI), price, date, postcode, and property type.
These previews are useful for getting a practical sense of what the data looks like in each table, which is important for understanding the context and potential uses of the data.

This step is helpful in ensuring that the data has been loaded correctly into the database and provides a tangible view of the datasets for further analysis.

In [ ]:

for table in tables:
    query = f"SELECT * FROM {table} LIMIT 2;"
    df = pd.read_sql(query, conn)
    print(f"First few rows of {table}:")
    display(df)
    print("\n")

First few rows of property_paid_price:

	TUI	Price	Date	Postcode	Property Type	Old/New	Duration	PAON	SAON	Street	Locality	Town/City	District	County	Price Type	Status
0	{40FD4DF2-5362-407C-92BC-566E2CCE89E9}	44500	1995-02-03 00:00:00	SR6 0AQ	T	N	F	50	None	HOWICK PARK	SUNDERLAND	SUNDERLAND	SUNDERLAND	TYNE AND WEAR	A	A
1	{7A99F89E-7D81-4E45-ABD5-566E49A045EA}	56500	1995-01-13 00:00:00	CO6 1SQ	T	N	F	19	None	BRICK KILN CLOSE	COGGESHALL	COLCHESTER	BRAINTREE	ESSEX	A	A

          
          First few rows of UK_comp_own_prop_Engl_Wales:

	Title Number	Tenure	Property Address	District	County	Region	Postcode	Multiple Address Indicator	Proprietor Name	Company Registration No.	Proprietorship Category	Proprietor Address	Date Proprietor Added	Additional Proprietor Indicator
0	356353	Freehold	37 Ixworth Place, London (SW3 3QH)	KENSINGTON AND CHELSEA	GREATER LONDON	GREATER LONDON	SW3 3QH	N	ZURICH ASSURANCE LTD	02456671	Limited Company or Public Limited Company	The Grange, Bishops Cleeve, Cheltenham, Glouce...	01-02-2005	N
1	356361	Freehold	70 Marylebone High Street, The Lord Tyrawley, ...	CITY OF WESTMINSTER	GREATER LONDON	GREATER LONDON	None	N	HOWARD DE WALDEN ESTATES LIMITED	None	Limited Company or Public Limited Company	23 Queen Anne Street, London W1G 9DL	29-11-1963	N

          
          First few rows of Overseas_comp_own_prop_Engl_Wales:

	Title Number	Tenure	Property Address	District	County	Region	Postcode	Multiple Address Indicator	Proprietor Name	Proprietorship Category	Country Incorporated	Proprietor Address	Date Proprietor Added	Additional Proprietor Indicator
0	SYK570104	Freehold	276 Sheffield Road, Birdwell, Barnsley (S70 5TG)	BARNSLEY	SOUTH YORKSHIRE	YORKS AND HUMBER	S70 5TG	N	MILLER ROSS DEVELOPMENTS LIMITED	Limited Company or Public Limited Company	IRELAND	25/26 Windsor Place, Lower Pembroke Street, Du...	09-10-2012	N
1	SYK571158	Freehold	Land on the south east side of Oakwells, Barto...	DONCASTER	SOUTH YORKSHIRE	YORKS AND HUMBER	DN3 3AB	N	CHATFORD LIMITED	Limited Company or Public Limited Company	GUERNSEY	Le Vauquiedor Manor, St Martin's, Guernsey, GY...	18-06-2012	Y
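
Beyond a two-row preview, the benefit of the database is that only the rows a question needs are pulled into memory. The following is a hedged sketch of a filtered, chunked read. The `demo_prices` table and its values are invented, and `chunksize=1` is used only to make the chunking visible.

```python
import sqlite3

import pandas as pd

conn_demo = sqlite3.connect(":memory:")
pd.DataFrame({
    "Price": [44500, 56500, 120000, 310000, 95000],
    "County": ["ESSEX", "ESSEX", "KENT", "ESSEX", "KENT"],
}).to_sql("demo_prices", conn_demo, index=False)

# "?" placeholders keep the filter values out of the SQL string itself.
query = "SELECT Price, County FROM demo_prices WHERE County = ? AND Price >= ?"

total_rows = 0
# With chunksize set, read_sql_query yields DataFrames instead of one large frame.
for chunk in pd.read_sql_query(
    query, conn_demo, params=("ESSEX", 50000), chunksize=1
):
    total_rows += len(chunk)

print(total_rows)
```

Against the real tables, the same pattern lets a filter such as a county or a price range run inside SQLite before any data reaches pandas.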

This code segment counts the number of rows in each table of the database:¶

It runs a query for each table (property_paid_price, UK_comp_own_prop_Engl_Wales, Overseas_comp_own_prop_Engl_Wales) to determine how many entries they contain.
The results show the scale of each dataset:
- The property_paid_price table has 28,782,629 rows.
- The UK_comp_own_prop_Engl_Wales table contains 4,105,590 rows.
- The Overseas_comp_own_prop_Engl_Wales table includes 93,510 rows.

Knowing the number of rows in each table is important for understanding the volume of data available for analysis in each category of property ownership.

In [ ]:

for table in tables:
    query = f"SELECT COUNT(*) FROM {table};"
    count = pd.read_sql(query, conn).iloc[0, 0]
    print(f"Number of rows in {table}: {count}")

Number of rows in property_paid_price: 28782629
          Number of rows in UK_comp_own_prop_Engl_Wales: 4105590
          Number of rows in Overseas_comp_own_prop_Engl_Wales: 93510
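
Counting can also be done per category inside the database. Many of the column names in these tables contain spaces, so they need to be wrapped in double quotes in SQL. The sketch below illustrates this on an invented `demo_overseas` table and is not taken from the notebook.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    'CREATE TABLE demo_overseas ("Title Number" TEXT, "Country Incorporated" TEXT)'
)
conn_demo.executemany(
    "INSERT INTO demo_overseas VALUES (?, ?)",
    [("T1", "IRELAND"), ("T2", "GUERNSEY"), ("T3", "GUERNSEY")],
)

# Double quotes mark identifiers, so column names with spaces can be used directly.
counts = conn_demo.execute(
    'SELECT "Country Incorporated", COUNT(*) AS n '
    "FROM demo_overseas "
    'GROUP BY "Country Incorporated" '
    "ORDER BY n DESC"
).fetchall()

print(counts)
```

Only the aggregated result, one row per country, is returned to Python, regardless of how many rows the table holds.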

Conclusion¶

At the beginning of this task, we set out to transform extensive property-related datasets into a format suitable for analysis on computers with limited resources. Our goal was to make this information more accessible and manageable, especially for systems that struggle with large volumes of data.

What We Achieved:

Data Consolidation and Optimization: We successfully imported large datasets from CSV files and refined them, focusing on key information and reducing memory usage. This optimization made the data more manageable for analysis.
Creation of a Structured Database: We transferred these optimized datasets into a SQLite database. This database consists of three tables: property sale transactions (property_paid_price), UK companies that own property in England and Wales (UK_comp_own_prop_Engl_Wales), and overseas companies that own property in England and Wales (Overseas_comp_own_prop_Engl_Wales). The structured nature of this database allows for more efficient data handling and querying.
Data Validation and Inspection: We confirmed the successful creation and structure of the database tables and gained an understanding of their contents. This step ensured the data was correctly organized and ready for future analysis.
Preparation for In-Depth Analysis: With these steps, the datasets are now in a state that is more accessible for detailed analysis. The reduced memory footprint means that even less powerful computers can handle the data, democratizing the ability to perform complex data analysis.

In summary, we transformed unwieldy and large datasets into a streamlined and efficient format, paving the way for comprehensive analysis on a wide range of computing systems. This process has not only made the data more accessible but also laid the groundwork for insightful exploration into property ownership patterns in England and Wales.

Ramzan Albakov

Data Optimization

Transformation of three voluminous CSV files into a SQLite database, designed to facilitate data analysis on computers with limited capacity by optimizing memory use and employing SQL for effective data management.
