This Jupyter Notebook documents the process of creating a SQLite database from three large CSV files. The primary objective of this endeavor is to enable data analysis on low-performance computers with limited RAM. Working directly with large datasets in memory-intensive environments like pandas often leads to kernel crashes due to memory overloads. By transitioning the data into a SQLite database, we can significantly reduce the memory footprint of our data processing tasks.
SQLite offers a lightweight, file-based database system that is ideal for handling large datasets on machines with constrained resources. By storing data in a database, we can leverage SQL's efficient data retrieval capabilities, allowing for complex queries and data analysis without the need for loading the entire dataset into memory. This approach is not only resource-efficient but also scalable and more manageable, especially when dealing with large volumes of data.
The following sections in this notebook detail the steps taken to import, clean, and transfer the data from CSV files into a structured database. This includes data preprocessing, schema definition, and data insertion, ensuring that the database is well-organized and ready for subsequent analysis. This setup aims to facilitate robust data analysis even on computers that struggle with high-memory demands, thus democratizing data analysis to a wider range of computational environments.
Data Item | Explanation |
---|---|
TUI | A reference number which is generated automatically recording each published sale. The number is unique and will change each time a sale is recorded. |
Price | Sale price stated on the transfer deed. |
Date | Date when the sale was completed, as stated on the transfer deed. |
Postcode | This is the postcode used at the time of the original transaction. Note that postcodes can be reallocated and these changes are not reflected in the Price Paid Dataset. |
Property Type | D = Detached, S = Semi-Detached, T = Terraced, F = Flats/Maisonettes, O = Other. Note that: - we only record the above categories to describe property type, we do not separately identify bungalows - end-of-terrace properties are included in the Terraced category above - ‘Other’ is only valid where the transaction relates to a property type that is not covered by existing values, for example where a property comprises more than one large parcel of land |
Old/New | Indicates the age of the property and applies to all price paid transactions, residential and non-residential. Y = a newly built property, N = an established residential building |
Duration | Relates to the tenure: F = Freehold, L= Leasehold etc. Note that HM Land Registry does not record leases of 7 years or less in the Price Paid Dataset. |
PAON | Primary Addressable Object Name. Typically the house number or name. |
SAON | Secondary Addressable Object Name. Where a property has been divided into separate units (for example, flats), the PAON (above) will identify the building and a SAON will be specified that identifies the separate unit/flat. |
Price Type | Indicates the type of Price Paid transaction. A = Standard Price Paid entry, includes single residential property sold for value. B = Additional Price Paid entry including transfers under a power of sale/repossessions, buy-to-lets (where they can be identified by a Mortgage), transfers to non-private individuals and sales where the property type is classed as ‘Other’. Note that category B does not separately identify the transaction types stated. HM Land Registry has been collecting information on Category A transactions from January 1995. Category B transactions were identified from October 2013. |
Status | Indicates additions, changes and deletions to the records.(see guide below). A = Added records: records added into the price paid dataset in the monthly refresh due to new sales transactions C = Changed records: records changed in the price paid dataset in the monthly refresh. You should replace or update records in any stored data using the unique identifier to recognise them D = Deleted records: records deleted from the price paid dataset in the monthly refresh. You should delete records from any stored data using the unique identifier to recognise them. Note that where a transaction changes category type due to misallocation (as above) it will be deleted from the original category type and added to the correct category with a new transaction unique identifier. |
=================================================================================================================================================================================
Purpose: These tools are essential for efficient data analysis and storage, especially suitable for handling large volumes of data on less powerful computers.
import pandas as pd
import numpy as np
import sqlite3
df_pp = pd.read_csv('/Users/Albakov/Desktop/Data Analysis/My Analysis of Property/pp-complete.csv')
df_pp.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'> RangeIndex: 28782629 entries, 0 to 28782628 Data columns (total 16 columns): # Column Dtype --- ------ ----- 0 {F887F88E-7D15-4415-804E-52EAC2F10958} object 1 70000 int64 2 1995-07-07 00:00 object 3 MK15 9HP object 4 D object 5 N object 6 F object 7 31 object 8 Unnamed: 8 object 9 ALDRICH DRIVE object 10 WILLEN object 11 MILTON KEYNES object 12 MILTON KEYNES.1 object 13 MILTON KEYNES.2 object 14 A object 15 A.1 object dtypes: int64(1), object(15) memory usage: 22.6 GB
This information is crucial for planning how to manage, process, and analyze this extensive dataset effectively.
Understanding the basic layout and size of the dataset is a crucial first step before starting any detailed analysis or processing.
display(df_pp.head(2))
print('Number of Rows:', df_pp.shape[0])
print('Number of Columns:', df_pp.shape[1])
{F887F88E-7D15-4415-804E-52EAC2F10958} | 70000 | 1995-07-07 00:00 | MK15 9HP | D | N | F | 31 | Unnamed: 8 | ALDRICH DRIVE | WILLEN | MILTON KEYNES | MILTON KEYNES.1 | MILTON KEYNES.2 | A | A.1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {40FD4DF2-5362-407C-92BC-566E2CCE89E9} | 44500 | 1995-02-03 00:00 | SR6 0AQ | T | N | F | 50 | NaN | HOWICK PARK | SUNDERLAND | SUNDERLAND | SUNDERLAND | TYNE AND WEAR | A | A |
1 | {7A99F89E-7D81-4E45-ABD5-566E49A045EA} | 56500 | 1995-01-13 00:00 | CO6 1SQ | T | N | F | 19 | NaN | BRICK KILN CLOSE | COGGESHALL | COLCHESTER | BRAINTREE | ESSEX | A | A |
Number of Rows: 28782629 Number of Columns: 16
These steps enhance the dataset's readability and structure, and help in managing computer memory efficiently.
column_names = [
"TUI",
"Price",
"Date",
"Postcode",
"Property Type",
"Old/New",
"Duration",
"PAON",
"SAON",
"Street",
"Locality",
"Town/City",
"District",
"County",
"Price Type",
"Status"
]
df_pp.columns = column_names
df_pp['Date'] = df_pp['Date'].str.split().str[0]
df_pp.to_csv('/Users/Albakov/Desktop/Data Analysis/My Analysis of Property/pp-complete-modified.csv', index=False)
del df_pp
These changes are geared towards optimizing the dataset for more efficient analysis and processing.
dtypes = {
"TUI": "category",
"Price": "int32",
"Postcode": "category",
"Property Type": "category",
"Old/New": "category",
"Duration": "category",
"PAON": "category",
"SAON": "category",
"Street": "category",
"Locality": "category",
"Town/City": "category",
"District": "category",
"County": "category",
"Price Type": "category",
"Status": "category"
}
df_modified = pd.read_csv('/Users/Albakov/Desktop/Data Analysis/My Analysis of Property/pp-complete-modified.csv', dtype=dtypes)
df_modified["Date"] = pd.to_datetime(df_modified["Date"])
df_modified.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'> RangeIndex: 28782629 entries, 0 to 28782628 Data columns (total 16 columns): # Column Dtype --- ------ ----- 0 TUI category 1 Price int32 2 Date datetime64[ns] 3 Postcode category 4 Property Type category 5 Old/New category 6 Duration category 7 PAON category 8 SAON category 9 Street category 10 Locality category 11 Town/City category 12 District category 13 County category 14 Price Type category 15 Status category dtypes: category(14), datetime64[ns](1), int32(1) memory usage: 4.7 GB
This output shows the results of the recent modifications to the dataset:
This demonstrates the effectiveness of the data optimization steps taken to make the dataset more manageable for processing and analysis.
This glimpse into the dataset, along with its size, helps in getting a sense of the data's scope and the effectiveness of the recent optimizations.
display(df_modified.head(2))
print('Number of Rows:', df_modified.shape[0])
print('Number of Columns:', df_modified.shape[1])
TUI | Price | Date | Postcode | Property Type | Old/New | Duration | PAON | SAON | Street | Locality | Town/City | District | County | Price Type | Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {40FD4DF2-5362-407C-92BC-566E2CCE89E9} | 44500 | 1995-02-03 | SR6 0AQ | T | N | F | 50 | NaN | HOWICK PARK | SUNDERLAND | SUNDERLAND | SUNDERLAND | TYNE AND WEAR | A | A |
1 | {7A99F89E-7D81-4E45-ABD5-566E49A045EA} | 56500 | 1995-01-13 | CO6 1SQ | T | N | F | 19 | NaN | BRICK KILN CLOSE | COGGESHALL | COLCHESTER | BRAINTREE | ESSEX | A | A |
Number of Rows: 28782629 Number of Columns: 16
This step is important for incorporating additional data into our analysis, expanding the scope to include information about property ownership by UK companies.
df_uk = pd.read_csv('/Users/Albakov/Desktop/Data Analysis/My Analysis of Property/UK companies that own property in England and Wales.csv', low_memory=False)
df_uk.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4105590 entries, 0 to 4105589 Data columns (total 35 columns): # Column Dtype --- ------ ----- 0 Title Number object 1 Tenure object 2 Property Address object 3 District object 4 County object 5 Region object 6 Postcode object 7 Multiple Address Indicator object 8 Price Paid float64 9 Proprietor Name (1) object 10 Company Registration No. (1) object 11 Proprietorship Category (1) object 12 Proprietor (1) Address (1) object 13 Proprietor (1) Address (2) object 14 Proprietor (1) Address (3) object 15 Proprietor Name (2) object 16 Company Registration No. (2) object 17 Proprietorship Category (2) object 18 Proprietor (2) Address (1) object 19 Proprietor (2) Address (2) object 20 Proprietor (2) Address (3) object 21 Proprietor Name (3) object 22 Company Registration No. (3) object 23 Proprietorship Category (3) object 24 Proprietor (3) Address (1) object 25 Proprietor (3) Address (2) object 26 Proprietor (3) Address (3) object 27 Proprietor Name (4) object 28 Company Registration No. (4) object 29 Proprietorship Category (4) object 30 Proprietor (4) Address (1) object 31 Proprietor (4) Address (2) object 32 Proprietor (4) Address (3) float64 33 Date Proprietor Added object 34 Additional Proprietor Indicator object dtypes: float64(2), object(33) memory usage: 5.9 GB
The overview of the newly loaded dataset about UK companies owning property in England and Wales reveals:
This information helps us understand the depth and breadth of the data available for analysis regarding property ownership by companies in the UK.
These insights are helpful for understanding the composition and scale of the data, offering a foundation for more in-depth analysis.
display(df_uk.head(2))
print('Number of Rows:', df_uk.shape[0])
print('Number of Columns:', df_uk.shape[1])
Title Number | Tenure | Property Address | District | County | Region | Postcode | Multiple Address Indicator | Price Paid | Proprietor Name (1) | ... | Proprietor (3) Address (2) | Proprietor (3) Address (3) | Proprietor Name (4) | Company Registration No. (4) | Proprietorship Category (4) | Proprietor (4) Address (1) | Proprietor (4) Address (2) | Proprietor (4) Address (3) | Date Proprietor Added | Additional Proprietor Indicator | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 356353 | Freehold | 37 Ixworth Place, London (SW3 3QH) | KENSINGTON AND CHELSEA | GREATER LONDON | GREATER LONDON | SW3 3QH | N | NaN | ZURICH ASSURANCE LTD | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01-02-2005 | N |
1 | 356361 | Freehold | 70 Marylebone High Street, The Lord Tyrawley, ... | CITY OF WESTMINSTER | GREATER LONDON | GREATER LONDON | NaN | N | NaN | HOWARD DE WALDEN ESTATES LIMITED | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 29-11-1963 | N |
2 rows × 35 columns
Number of Rows: 4105590 Number of Columns: 35
These modifications are aimed at enhancing the dataset's quality and usability for more effective data analysis.
threshold = len(df_uk) * 0.6
df_uk.dropna(axis=1, thresh=threshold, inplace=True)
df_uk.columns = df_uk.columns.str.replace("(1)", "")
df_uk.columns = df_uk.columns.str.strip()
df_uk.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4105590 entries, 0 to 4105589 Data columns (total 14 columns): # Column Dtype --- ------ ----- 0 Title Number object 1 Tenure object 2 Property Address object 3 District object 4 County object 5 Region object 6 Postcode object 7 Multiple Address Indicator object 8 Proprietor Name object 9 Company Registration No. object 10 Proprietorship Category object 11 Proprietor Address object 12 Date Proprietor Added object 13 Additional Proprietor Indicator object dtypes: object(14) memory usage: 3.5 GB
The overview of the updated UK property ownership dataset shows:
These changes make the dataset more focused and manageable, improving its suitability for in-depth analysis.
These updates offer a clearer and more concise view of the dataset, aiding in a more efficient analysis process.
display(df_uk.head(2))
print('Number of Rows:', df_uk.shape[0])
print('Number of Columns:', df_uk.shape[1])
Title Number | Tenure | Property Address | District | County | Region | Postcode | Multiple Address Indicator | Proprietor Name | Company Registration No. | Proprietorship Category | Proprietor Address | Date Proprietor Added | Additional Proprietor Indicator | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 356353 | Freehold | 37 Ixworth Place, London (SW3 3QH) | KENSINGTON AND CHELSEA | GREATER LONDON | GREATER LONDON | SW3 3QH | N | ZURICH ASSURANCE LTD | 02456671 | Limited Company or Public Limited Company | The Grange, Bishops Cleeve, Cheltenham, Glouce... | 01-02-2005 | N |
1 | 356361 | Freehold | 70 Marylebone High Street, The Lord Tyrawley, ... | CITY OF WESTMINSTER | GREATER LONDON | GREATER LONDON | NaN | N | HOWARD DE WALDEN ESTATES LIMITED | NaN | Limited Company or Public Limited Company | 23 Queen Anne Street, London W1G 9DL | 29-11-1963 | N |
Number of Rows: 4105590 Number of Columns: 14
Incorporating this dataset allows for a broader and more comprehensive analysis of property ownership, including both UK and overseas entities.
df_overseas = pd.read_csv('/Users/Albakov/Desktop/Data Analysis/My Analysis of Property/Overseas companies that own property in England and Wales.csv', low_memory=False)
df_overseas.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'> RangeIndex: 93510 entries, 0 to 93509 Data columns (total 39 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Title Number 93510 non-null object 1 Tenure 93510 non-null object 2 Property Address 93507 non-null object 3 District 93509 non-null object 4 County 93509 non-null object 5 Region 93509 non-null object 6 Postcode 71570 non-null object 7 Multiple Address Indicator 93509 non-null object 8 Price Paid 33497 non-null float64 9 Proprietor Name (1) 93509 non-null object 10 Company Registration No. (1) 4356 non-null object 11 Proprietorship Category (1) 93509 non-null object 12 Country Incorporated (1) 93509 non-null object 13 Proprietor (1) Address (1) 93509 non-null object 14 Proprietor (1) Address (2) 25183 non-null object 15 Proprietor (1) Address (3) 2572 non-null object 16 Proprietor Name (2) 6260 non-null object 17 Company Registration No. (2) 82 non-null object 18 Proprietorship Category (2) 6260 non-null object 19 Country Incorporated (2) 6260 non-null object 20 Proprietor (2) Address (1) 6260 non-null object 21 Proprietor (2) Address (2) 1609 non-null object 22 Proprietor (2) Address (3) 140 non-null object 23 Proprietor Name (3) 40 non-null object 24 Company Registration No. (3) 1 non-null object 25 Proprietorship Category (3) 40 non-null object 26 Country Incorporated (3) 40 non-null object 27 Proprietor (3) Address (1) 40 non-null object 28 Proprietor (3) Address (2) 6 non-null object 29 Proprietor (3) Address (3) 0 non-null float64 30 Proprietor Name (4) 8 non-null object 31 Company Registration No. (4) 0 non-null float64 32 Proprietorship Category (4) 8 non-null object 33 Country Incorporated (4) 8 non-null object 34 Proprietor (4) Address (1) 8 non-null object 35 Proprietor (4) Address (2) 1 non-null object 36 Proprietor (4) Address (3) 0 non-null float64 37 Date Proprietor Added 93308 non-null object 38 Additional Proprietor Indicator 93509 non-null object dtypes: float64(4), object(35) memory usage: 149.8 MB
The newly loaded dataset on overseas companies owning property in England and Wales shows the following characteristics:
This dataset adds an international dimension to the property ownership analysis, offering insights into overseas investments in England and Wales.
This initial look provides a basic understanding of the dataset's structure and the extent of the data available for analysis.
display(df_overseas.head(2))
print('Number of Rows:', df_overseas.shape[0])
print('Number of Columns:', df_overseas.shape[1])
Title Number | Tenure | Property Address | District | County | Region | Postcode | Multiple Address Indicator | Price Paid | Proprietor Name (1) | ... | Proprietor (3) Address (3) | Proprietor Name (4) | Company Registration No. (4) | Proprietorship Category (4) | Country Incorporated (4) | Proprietor (4) Address (1) | Proprietor (4) Address (2) | Proprietor (4) Address (3) | Date Proprietor Added | Additional Proprietor Indicator | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SYK570104 | Freehold | 276 Sheffield Road, Birdwell, Barnsley (S70 5TG) | BARNSLEY | SOUTH YORKSHIRE | YORKS AND HUMBER | S70 5TG | N | NaN | MILLER ROSS DEVELOPMENTS LIMITED | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 09-10-2012 | N |
1 | SYK571158 | Freehold | Land on the south east side of Oakwells, Barto... | DONCASTER | SOUTH YORKSHIRE | YORKS AND HUMBER | DN3 3AB | N | NaN | CHATFORD LIMITED | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 18-06-2012 | Y |
2 rows × 39 columns
Number of Rows: 93510 Number of Columns: 39
These changes help in making the dataset more manageable and relevant for detailed analysis.
threshold = len(df_overseas) * 0.6
df_overseas.dropna(axis=1, thresh=threshold, inplace=True)
df_overseas.columns = df_overseas.columns.str.replace("(1)", "")
df_overseas.columns = df_overseas.columns.str.strip()
df_overseas.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'> RangeIndex: 93510 entries, 0 to 93509 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Title Number 93510 non-null object 1 Tenure 93510 non-null object 2 Property Address 93507 non-null object 3 District 93509 non-null object 4 County 93509 non-null object 5 Region 93509 non-null object 6 Postcode 71570 non-null object 7 Multiple Address Indicator 93509 non-null object 8 Proprietor Name 93509 non-null object 9 Proprietorship Category 93509 non-null object 10 Country Incorporated 93509 non-null object 11 Proprietor Address 93509 non-null object 12 Date Proprietor Added 93308 non-null object 13 Additional Proprietor Indicator 93509 non-null object dtypes: object(14) memory usage: 83.8 MB
This display helps in understanding the structure and key elements of the streamlined dataset, setting the stage for further analysis.
display(df_overseas.head(2))
print('Number of Rows:', df_overseas.shape[0])
print('Number of Columns:', df_overseas.shape[1])
Title Number | Tenure | Property Address | District | County | Region | Postcode | Multiple Address Indicator | Proprietor Name | Proprietorship Category | Country Incorporated | Proprietor Address | Date Proprietor Added | Additional Proprietor Indicator | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SYK570104 | Freehold | 276 Sheffield Road, Birdwell, Barnsley (S70 5TG) | BARNSLEY | SOUTH YORKSHIRE | YORKS AND HUMBER | S70 5TG | N | MILLER ROSS DEVELOPMENTS LIMITED | Limited Company or Public Limited Company | IRELAND | 25/26 Windsor Place, Lower Pembroke Street, Du... | 09-10-2012 | N |
1 | SYK571158 | Freehold | Land on the south east side of Oakwells, Barto... | DONCASTER | SOUTH YORKSHIRE | YORKS AND HUMBER | DN3 3AB | N | CHATFORD LIMITED | Limited Company or Public Limited Company | GUERNSEY | Le Vauquiedor Manor, St Martin's, Guernsey, GY... | 18-06-2012 | Y |
Number of Rows: 93510 Number of Columns: 14
pp-complete-modified.db
. Think of this as setting up a large filing cabinet where data can be organized and stored.df_modified
, df_uk
, and df_overseas
) is saved into this database as separate tables. This is akin to placing different documents into clearly labeled folders within the filing cabinet.property_paid_price
, UK_comp_own_prop_Engl_Wales
, and Overseas_comp_own_prop_Engl_Wales
, corresponding to each dataset.Storing data in a database like this helps manage large amounts of information more efficiently and allows for easier retrieval and analysis.
conn = sqlite3.connect('pp-complete-modified.db')
df_modified.to_sql('property_paid_price', conn, if_exists='replace', index=False)
df_uk.to_sql('UK_comp_own_prop_Engl_Wales', conn, if_exists='replace', index=False)
df_overseas.to_sql('Overseas_comp_own_prop_Engl_Wales', conn, if_exists='replace', index=False)
property_paid_price
, UK_comp_own_prop_Engl_Wales
, and Overseas_comp_own_prop_Engl_Wales
.This step ensures that all the datasets have been successfully stored in the database and are ready for future analysis.
query = """
SELECT name
FROM sqlite_master
WHERE type='table';
"""
db = pd.read_sql_query(query, conn)
db
name | |
---|---|
0 | property_paid_price |
1 | UK_comp_own_prop_Engl_Wales |
2 | Overseas_comp_own_prop_Engl_Wales |
property_paid_price
, UK_comp_own_prop_Engl_Wales
, Overseas_comp_own_prop_Engl_Wales
), a query is performed to retrieve the details of its structure.property_paid_price
table includes columns like TUI
, Price
, Date
, with types such as TEXT
, INTEGER
, and TIMESTAMP
.These steps help in getting familiar with the database's structure, ensuring that the data is organized as intended and ready for use.
tables = ['property_paid_price', 'UK_comp_own_prop_Engl_Wales', 'Overseas_comp_own_prop_Engl_Wales']
for table in tables:
query = f"PRAGMA table_info({table});"
df = pd.read_sql(query, conn)
print(f"Structure of {table}:")
display(df[['name','type']])
print('='*50)
Structure of property_paid_price:
name | type | |
---|---|---|
0 | TUI | TEXT |
1 | Price | INTEGER |
2 | Date | TIMESTAMP |
3 | Postcode | TEXT |
4 | Property Type | TEXT |
5 | Old/New | TEXT |
6 | Duration | TEXT |
7 | PAON | TEXT |
8 | SAON | TEXT |
9 | Street | TEXT |
10 | Locality | TEXT |
11 | Town/City | TEXT |
12 | District | TEXT |
13 | County | TEXT |
14 | Price Type | TEXT |
15 | Status | TEXT |
================================================== Structure of UK_comp_own_prop_Engl_Wales:
name | type | |
---|---|---|
0 | Title Number | TEXT |
1 | Tenure | TEXT |
2 | Property Address | TEXT |
3 | District | TEXT |
4 | County | TEXT |
5 | Region | TEXT |
6 | Postcode | TEXT |
7 | Multiple Address Indicator | TEXT |
8 | Proprietor Name | TEXT |
9 | Company Registration No. | TEXT |
10 | Proprietorship Category | TEXT |
11 | Proprietor Address | TEXT |
12 | Date Proprietor Added | TEXT |
13 | Additional Proprietor Indicator | TEXT |
================================================== Structure of Overseas_comp_own_prop_Engl_Wales:
name | type | |
---|---|---|
0 | Title Number | TEXT |
1 | Tenure | TEXT |
2 | Property Address | TEXT |
3 | District | TEXT |
4 | County | TEXT |
5 | Region | TEXT |
6 | Postcode | TEXT |
7 | Multiple Address Indicator | TEXT |
8 | Proprietor Name | TEXT |
9 | Proprietorship Category | TEXT |
10 | Country Incorporated | TEXT |
11 | Proprietor Address | TEXT |
12 | Date Proprietor Added | TEXT |
13 | Additional Proprietor Indicator | TEXT |
==================================================
property_paid_price
, UK_comp_own_prop_Engl_Wales
, Overseas_comp_own_prop_Engl_Wales
), the first two entries are displayed.property_paid_price
table, entries include details like transaction unique identifier (TUI), price, date, postcode, and property type.This step is helpful in ensuring that the data has been loaded correctly into the database and provides a tangible view of the datasets for further analysis.
for table in tables:
query = f"SELECT * FROM {table} LIMIT 2;"
df = pd.read_sql(query, conn)
print(f"First few rows of {table}:")
display(df)
print("\n")
First few rows of property_paid_price:
TUI | Price | Date | Postcode | Property Type | Old/New | Duration | PAON | SAON | Street | Locality | Town/City | District | County | Price Type | Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {40FD4DF2-5362-407C-92BC-566E2CCE89E9} | 44500 | 1995-02-03 00:00:00 | SR6 0AQ | T | N | F | 50 | None | HOWICK PARK | SUNDERLAND | SUNDERLAND | SUNDERLAND | TYNE AND WEAR | A | A |
1 | {7A99F89E-7D81-4E45-ABD5-566E49A045EA} | 56500 | 1995-01-13 00:00:00 | CO6 1SQ | T | N | F | 19 | None | BRICK KILN CLOSE | COGGESHALL | COLCHESTER | BRAINTREE | ESSEX | A | A |
First few rows of UK_comp_own_prop_Engl_Wales:
Title Number | Tenure | Property Address | District | County | Region | Postcode | Multiple Address Indicator | Proprietor Name | Company Registration No. | Proprietorship Category | Proprietor Address | Date Proprietor Added | Additional Proprietor Indicator | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 356353 | Freehold | 37 Ixworth Place, London (SW3 3QH) | KENSINGTON AND CHELSEA | GREATER LONDON | GREATER LONDON | SW3 3QH | N | ZURICH ASSURANCE LTD | 02456671 | Limited Company or Public Limited Company | The Grange, Bishops Cleeve, Cheltenham, Glouce... | 01-02-2005 | N |
1 | 356361 | Freehold | 70 Marylebone High Street, The Lord Tyrawley, ... | CITY OF WESTMINSTER | GREATER LONDON | GREATER LONDON | None | N | HOWARD DE WALDEN ESTATES LIMITED | None | Limited Company or Public Limited Company | 23 Queen Anne Street, London W1G 9DL | 29-11-1963 | N |
First few rows of Overseas_comp_own_prop_Engl_Wales:
Title Number | Tenure | Property Address | District | County | Region | Postcode | Multiple Address Indicator | Proprietor Name | Proprietorship Category | Country Incorporated | Proprietor Address | Date Proprietor Added | Additional Proprietor Indicator | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SYK570104 | Freehold | 276 Sheffield Road, Birdwell, Barnsley (S70 5TG) | BARNSLEY | SOUTH YORKSHIRE | YORKS AND HUMBER | S70 5TG | N | MILLER ROSS DEVELOPMENTS LIMITED | Limited Company or Public Limited Company | IRELAND | 25/26 Windsor Place, Lower Pembroke Street, Du... | 09-10-2012 | N |
1 | SYK571158 | Freehold | Land on the south east side of Oakwells, Barto... | DONCASTER | SOUTH YORKSHIRE | YORKS AND HUMBER | DN3 3AB | N | CHATFORD LIMITED | Limited Company or Public Limited Company | GUERNSEY | Le Vauquiedor Manor, St Martin's, Guernsey, GY... | 18-06-2012 | Y |
property_paid_price
, UK_comp_own_prop_Engl_Wales
, Overseas_comp_own_prop_Engl_Wales
) to determine how many entries they contain.property_paid_price
table has 28,782,629 rows.UK_comp_own_prop_Engl_Wales
table contains 4,105,590 rows.Overseas_comp_own_prop_Engl_Wales
table includes 93,510 rows.Knowing the number of rows in each table is important for understanding the volume of data available for analysis in each category of property ownership.
for table in tables:
query = f"SELECT COUNT(*) FROM {table};"
count = pd.read_sql(query, conn).iloc[0, 0]
print(f"Number of rows in {table}: {count}")
Number of rows in property_paid_price: 28782629 Number of rows in UK_comp_own_prop_Engl_Wales: 4105590 Number of rows in Overseas_comp_own_prop_Engl_Wales: 93510
At the beginning of this task, we embarked on a journey to transform extensive property-related datasets into a format suitable for analysis on computers with limited resources. Our goal was to make this information more accessible and manageable, especially for systems that might struggle with large volumes of data.
What We Achieved:
Data Consolidation and Optimization: We successfully imported large datasets from CSV files and refined them, focusing on key information and reducing memory usage. This optimization made the data more manageable for analysis.
Creation of a Structured Database: We transferred these optimized datasets into a SQLite database. This database consists of three main tables, each representing a unique aspect of property ownership in the UK and overseas. The structured nature of this database allows for more efficient data handling and querying.
Data Validation and Inspection: We confirmed the successful creation and structure of the database tables and gained an understanding of their contents. This step ensured the data was correctly organized and ready for future analysis.
Preparation for In-Depth Analysis: With these steps, the datasets are now in a state that is more accessible for detailed analysis. The reduced memory footprint means that even less powerful computers can handle the data, democratizing the ability to perform complex data analysis.
In summary, we transformed unwieldy and large datasets into a streamlined and efficient format, paving the way for comprehensive analysis on a wide range of computing systems. This process has not only made the data more accessible but also laid the groundwork for insightful exploration into property ownership patterns in England and Wales.
Contains HM Land Registry data © Crown copyright and database right 2021. This data is licensed under the Open Government Licence v3.0.