Features
Extraction
Extracts data from either a web URL or a local directory path.
Parameters:
- url (str): The URL of the web page from which data needs to be extracted.
- path (str): The local directory path from which data needs to be extracted.
- depth (int): The depth of crawling if a URL is provided (default is 0, which means only the provided URL is considered).
- ext (list): A list of file extensions to filter the extracted files (e.g., [‘.csv’, ‘.xls’]).
Returns:
- list: A list of extracted data file paths.
Note:
- If both url and path are provided, preference will be given to url.
- If no extension filter is provided (ext=None), all files will be considered.
Example:
harmonizer = Harmonizer()
# Extract data from ur
url = "https://microdatos.dane.gov.co/index.php/catalog/548/get-microdata"
list_datainfo = harmonizer.extract(url=url, path=None, depth=0, ext=['.csv','.xls','.xlsx', ".txt", ".sav", ".zip"])
harmonizer = Harmonizer(list_datainfo)
# Extract data from path
url = "data/input"
list_datainfo = harmonizer.extract(url=None, path=url, depth=0, ext=['.csv'])
harmonizer = Harmonizer(list_datainfo)
Transformation
Transforms the extracted data into a structured format.
Parameters: -none
Returns:
- list: A list of transformed data (DataInfo object) file paths.
Example:
harmonizer.transform()
Load
Loads the extracted datasets.
Parameters: -none
Returns:
- list: A list of tuples containing success status and message for each dataset. Each tuple has two elements: success (bool) and message (str).
Example:
results = harmonizer.load()
# Print the results
for i, result in enumerate(results):
print(f"Dataset {i + 1}: Success: {result[0]}, Message: {result[1]}")
Querier
Queries the loaded datasets using SQL queries.
Create a querier intance
from nyctibius.db.querier import Querier
querier = Querier(db_path='data/output/nyctibius.db')
Start a SELECT query for a specific table and columns.
Select all columns from the table
parameters:
- table (str): The table name.
- columns (list, optional): A list of columns to be selected.
returns:
- Querier: The current Querier instance.
Example:
df = querier.select(table="Estructura CHC_2017").execute()
Note: Select specific columns from the table
df = querier.select(table="Estructura CHC_2017", columns=['DIRECTORIO', 'P1']).execute()
Add a condition to the query.
parameters:
- condition (str): The condition to be added.
- operator (str, optional): The operator to be used to combine the condition with the previous one (default is ‘AND’).
returns:
- Querier: The current Querier instance.
Add a filter to the query
df = querier.select(table="Estructura CHC_2017").filter("P1 = 1").execute()
Note: Add multiple filters to the query
df = querier.select(table="Estructura CHC_2017").filter(condition="P1 > 5").filter(condition="P2 < 10", operator='AND').execute()
Add an IN or NOT IN condition to the query.
parameters:
- column (str): The column name.
- values (list): A list of values to be checked.
- operator (str, optional): The operator to be used to combine the condition with the previous one (default is ‘AND’).
- not_in (bool, optional): If True, the condition will be NOT IN (default is False).
returns:
- Querier: The current Querier instance.
Add an IN condition to the query
df = querier.select(table="Estructura CHC_2017").filter_in(column="P1", values=[1, 2, 3]).execute()
Add a LIKE condition to the query that checks if a column starts with, ends with, or contains a pattern.
parameters:
- column (str): The column name.
- pattern (str): The pattern to be checked.
- condition_type (str): The type of LIKE condition. Must be ‘startswith’, ‘endswith’, or ‘contains’.
- operator (str, optional): The operator to be used to combine the condition with the previous one (default is ‘AND’).
returns:
- Querier: The current Querier instance.
Add a LIKE condition to the query
df = querier.select(table="Estructura CHC_2017").filter_like(column="P1", pattern="1%", condition_type='startswith').execute()
Add a JOIN clause to the query.
parameters:
- table (str): The table name to be joined.
- join_type (str, optional): The type of join. Defaults to ‘INNER’.
- on_condition (str, optional): The condition for the ON clause. Defaults to None.
returns:
- Querier: The current Querier instance.
Add a JOIN clause to the query
df = querier.select(table="Estructura CHC_2017").join(table="Estructura_CHC_2017_Copy", on_condition="'Estructura CHC_2017.DIRECTORIO' = Estructura_CHC_2017_Copy.DIRECTORIO").execute()
Add a LIMIT clause to the query
parameters:
- limit (int): The number of rows to be returned.
returns:
- Querier: The current Querier instance.
Add a LIMIT clause to the query
df = querier.select(table="Estructura CHC_2017").limit(5).execute()
Modifier
Initialize the Modifier instance
from nyctibius.db.modifier import Modifier
modifier = Modifier(db_path='data/output/nyctibius.db')
Get a list of all tables in the database
parameters:
- none
returns:
- list: A list of table names.
Example:
tables = modifier.get_tables()
Get a list of all columns in a table
parameters:
- table_name (str): The table name.
returns:
- list: A list of column names.
Example:
columns = modifier.get_columns(table_name="Estructura CHC_2017")
Get the datatypes of all columns in the table
parameters:
- table_name (str): The table name.
returns:
- dict: A dictionary with column names as keys and their datatypes as values.
Example:
datatypes = modifier.get_column_datatypes(table_name="Estructura CHC_2017")
Rename a table in the database
parameters:
- old_table_name (str): The old table name.
- new_table_name (str): The new table name.
returns:
- dict: A dictionary containing the status of the operation and a message.
Example:
result = modifier.rename_table(old_table_name="Estructura CHC_2017", new_table_name="New_Estructura_CHC_2017")
Rename a column in a table
parameters:
- table_name (str): The table name.
- old_column_name (str): The old column name.
- new_column_name (str): The new column name.
returns:
- dict: dict: A dictionary containing ‘status’ and ‘message’. ‘status’ is a boolean indicating whether the operation was successful or not. ‘message’ is a string describing the result of the operation.
Example:
result = modifier.rename_column(table_name="New_Estructura_CHC_2017", old_column_name="P1", new_column_name="New_P1")
Rename the columns of a table.
parameters:
- table_name (str): The table name.
- new_column_names (list): A list of new column names.
returns:
- dict: A dictionary containing ‘status’ and ‘message’. ‘status’ is a boolean indicating whether the operation was successful or not. ‘message’ is a string describing the result of the operation.
Example:
result = modifier.rename_table_columns(table_name="Estructura CHC_2017", new_column_names=["new_index", "new_DIRECTORIO", "new_TIP_FOR", "new_P1"])
Remove duplicate rows in a table based on a specific column
parameters:
- table_name (str): The table name.
- column_name (str): The column name based on which duplicates are removed.
returns:
- dict: A dictionary containing ‘status’ and ‘message’. ‘status’ is a boolean indicating whether the operation was successful or not. ‘message’ is a string describing the result of the operation.
Example:
result = modifier.remove_duplicates(table_name="New_Estructura_CHC_2017", column_name="New_P1")
Set a column as the primary key of a table
parameters:
- table_name (str): The table name.
- pk_column_name (str): The name of the column to set as the primary key.
returns:
- dict: A dictionary containing ‘status’ and ‘message’. ‘status’ is a boolean indicating whether the operation was successful or not. ‘message’ is a string describing the result of the operation.
Example:
result = modifier.set_primary_key(table_name="New_Estructura_CHC_2017", pk_column_name="New_P1")
Delete a column from a table
parameters:
- table_name (str): The table name.
- column_name (str): The column name to be deleted.
returns:
- dict: A dictionary containing ‘status’ and ‘message’. ‘status’ is a boolean indicating whether the operation was successful or not. ‘message’ is a string describing the result of the operation.
Example:
result = modifier.delete_column(table_name="Estructura CHC_2017", column_name="new_index")
Duplicate a table in the database.
parameters:
- old_table_name (str): The old table name.
- new_table_name (str): The new table name.
returns:
- dict: A dictionary containing ‘status’ and ‘message’. ‘status’ is a boolean indicating whether the operation was successful or not. ‘message’ is a string describing the result of the operation.
Example:
result = modifier.duplicate_table(old_table_name="Estructura CHC_2017", new_table_name="Estructura_CHC_2017_Copy")
Set a column as a foreign key of a table
parameters:
- table_name (str): The table name.
- fk_column_name (str): The name of the column to set as the foreign key.
- referenced_table_name (str): The name of the table that the foreign key references.
- referenced_column_name (str): The name of the column in the referenced table.
returns:
- dict: A dictionary containing ‘status’ and ‘message’. ‘status’ is a boolean indicating whether the operation was successful or not. ‘message’ is a string describing the result of the operation.
Example:
result = modifier.set_foreign_key(table_name="Estructura CHC_2017", fk_column_name="new_DIRECTORIO", referenced_table_name="Other_Table", referenced_column_name="Estructura CHC_2017_copy")
Merge two tables into a new table
parameters:
- table1 (str): The name of the first table.
- table2 (str): The name of the second table.
- new_table (str): The name of the new table.
returns:
- dict: A dictionary containing ‘status’ and ‘message’. ‘status’ is a boolean indicating whether the operation was successful or not. ‘message’ is a string describing the result of the operation.
Example:
result = modifier.merge_tables(table1="Estructura CHC_2017", table2="Other_Table", new_table="Merged_Table")
Delete a table from the database
parameters:
- table_name (str): The table name to be deleted.
returns:
- dict: A dictionary containing ‘status’ and ‘message’. ‘status’ is a boolean indicating whether the operation was successful or not. ‘message’ is a string describing the result of the operation.
Example:
result = modifier.delete_table(table_name="New_Estructura_CHC_2017")