Deciphering Big Data

Learning Objectives

  • Introduce and review various concepts of big data, technologies, and data management to enable you to identify and manage challenges associated with security risks and limitations.
  • Critically analyse data wrangling problems and determine appropriate methodologies and tools for problem solving. Explore different data types and formats, evaluating storage formats ranging from structured and quasi-structured to semi-structured and unstructured, along with their memory and storage requirements.
  • Examine data exploration methods and analyse data for presentation in an organisation. Critically evaluate data readability, readiness, and longevity within the data pipeline. Examine cloud services and APIs (Application Programming Interfaces) and how these enable data interoperability and connectivity.
  • Examine and analyse the ideas and theoretical concepts underlying DBMS (Database Management Systems), database design, and modelling.
  • Explore the future use of data and how it can be deciphered by examining fundamental ideas and concepts of machine learning and how these are applied to handling big data.

Unit 3, Data Collection and Storage

I recently delved into web scraping using Python, specifically focusing on the 'Beautiful Soup' and 'Requests' libraries. This exploration helped me grasp not only the mechanics but also the appropriate contexts for employing web scraping. In addition, our discussions extended to the utilization of APIs as an alternative method for data collection, highlighting the differences and use cases in comparison to web scraping. Following this, I contributed my solution for the formative activity to our module's wiki page. This process not only solidified my understanding of web scraping and API usage but also allowed me to share my insights and solutions with peers, fostering a collaborative learning environment.

My coding file can be accessed here.
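Below is a minimal sketch of the Requests and Beautiful Soup pattern from this unit; the URL and the elements scraped are placeholders rather than the ones used in my actual exercise.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target page; replace with a site whose terms permit scraping.
URL = "https://example.com/articles"

response = requests.get(URL, timeout=10)
response.raise_for_status()  # fail early on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <h2> heading as a simple example of parsing.
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headings)
```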

Unit 4, Data Cleaning and Transformation

In Unit 4 I delved into the essential concepts, techniques, and methods of data cleaning and transformation, exploring the data management pipeline, the various factors affecting data cleaning, and the critical requirements for data design and process automation. The unit gave me the knowledge and ability to clean and transform data effectively, using the data pipeline as a guide, while understanding what is needed to design automated processes.

For the exercise I used data from this GitHub repository by Jackie Kazil: UNICEF’s Child Labor datasets, taken from the Kazil & Jarmul (2016) textbook.

My Python code for the exercise on Chapter 7 is here.
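As a brief illustration of the kind of cleaning steps involved, here is a small pandas sketch; the file name and column names are hypothetical stand-ins, not the actual UNICEF headers.

```python
import pandas as pd

# Hypothetical file and column names; the real UNICEF dataset uses different headers.
df = pd.read_csv("unicef_child_labor.csv")

# Standardise column names: lower-case, underscores instead of spaces.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Drop rows that are entirely empty and remove exact duplicates.
df = df.dropna(how="all").drop_duplicates()

# Coerce a numeric column, turning non-numeric entries into NaN for later review.
if "total_percent" in df.columns:
    df["total_percent"] = pd.to_numeric(df["total_percent"], errors="coerce")

df.to_csv("unicef_child_labor_clean.csv", index=False)
```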

Unit 5, Data Cleaning and Automating Data Collections

I continued exploring the practical aspects of data cleaning using Python, with examples drawn from household-level surveys conducted by UNICEF. The files can be found in the GitHub repository online. Additionally, we evaluated how Python scripts can be developed to automate the cleaning process and how these automations can incorporate machine learning strategies. We also explored the significance of database representation and architecture.
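A small sketch of what such an automated cleaning script might look like is shown below; the folder layout and cleaning steps are assumptions for illustration rather than my actual script.

```python
from pathlib import Path
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same basic cleaning steps to every survey file."""
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df.dropna(how="all").drop_duplicates()


# Hypothetical folder layout: raw CSV exports in ./raw, cleaned output in ./clean.
raw_dir, clean_dir = Path("raw"), Path("clean")
clean_dir.mkdir(exist_ok=True)

for csv_path in raw_dir.glob("*.csv"):
    cleaned = clean(pd.read_csv(csv_path))
    cleaned.to_csv(clean_dir / csv_path.name, index=False)
    print(f"Cleaned {csv_path.name}: {len(cleaned)} rows kept")
```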

Unit 6, Database Design and Normalisation

For Unit 6, we focused on database design and normalisation, with a particular emphasis on the construction of relational databases. I evaluated how cleaning methods enhance the storage of usable datasets and came to understand how databases are created and why key fields are important for linking data. I also explored the construction and terminology associated with creating a database, identifying the technical terms used in building a relational database and evaluating the requirements for establishing a physical database architecture.

We also carried out our group project on database design and discussed the outline for a MySQL database. My assignment can be accessed here.
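To illustrate the idea of key fields linking tables, here is a small sketch. Our group outline targeted MySQL, but the sketch uses Python's built-in sqlite3 module so it is self-contained, and the table and column names are illustrative rather than the actual project schema.

```python
import sqlite3

# Illustrative two-table design: staff rows reference their branch by key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE branch (
    branch_id INTEGER PRIMARY KEY,
    city      TEXT NOT NULL
);
CREATE TABLE staff (
    staff_id  INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    branch_id INTEGER NOT NULL,
    FOREIGN KEY (branch_id) REFERENCES branch (branch_id)  -- key field linking the tables
);
""")
conn.commit()
conn.close()
```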

Unit 7, Constructing Normalised Tables and Database Build

Next, we examined an un-normalised table and broke it down into 1NF (First Normal Form), 2NF (Second Normal Form), and 3NF (Third Normal Form). I used this data to create a relational database based on the normalised tables, using an application environment of my choice. During the unit, we explored data attributes, associations, operations, and relationships. Upon completion of this unit, I understood how to transform a flat-file database model into a normalised model, built a relational database model, and tested its usability.

For this unit we carried out two tasks: normalisation and the database build.
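As a toy illustration of moving from a flat, un-normalised table towards a normalised structure, the sketch below splits repeated customer details out of an orders table; the column names are invented for the example and are not taken from the actual task.

```python
import pandas as pd

# A hypothetical un-normalised flat file: customer details repeat on every order row.
flat = pd.DataFrame({
    "order_id":      [1, 2, 3],
    "customer_id":   [10, 10, 11],
    "customer_name": ["Ada", "Ada", "Grace"],
    "product":       ["Keyboard", "Mouse", "Monitor"],
})

# 3NF-style decomposition: customer attributes move to their own table,
# leaving the orders table to reference customers by key only.
customers = flat[["customer_id", "customer_name"]].drop_duplicates()
orders = flat[["order_id", "customer_id", "product"]]

print(customers)
print(orders)
```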

Unit 8, Compliance and Regulatory Framework for Managing Data

For the unit on compliance frameworks related to data management, we explored the rights of individuals and the obligations organisations have to their stakeholders. I gained knowledge of the regulations used for managing data and the impact they have on organisations and industries. I am now able to apply compliance frameworks in different scenarios, exercise rights with respect to data held about individuals, organisations, and stakeholders, and address the regulatory requirements associated with data storage. The case study for the module was DreamHome, and I have written a document here discussing it in more detail.

Unit 9, Database Management Systems (DBMS) and Models

For Unit 9, I delved into the diverse world of Database Management Systems (DBMS), using lecture casts and reading lists to cover a wide spectrum from flat files and relational SQL databases such as PostgreSQL and Oracle, to non-relational models including MongoDB and data lakes, as well as technologies such as SAP, object orientation, data warehouses, cloud platforms, Hadoop, and HDF. I gained an understanding of the design concepts and theories that form the backbone of these systems, while also analysing their strengths, limitations, and suitability for various application environments. Throughout this unit, my aim was to thoroughly understand the principles behind the design and development of DBMS and to become adept at making informed design choices and solutions suitable for applications involving large datasets (i.e. big data).
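A small sketch of the contrast between the relational and document models covered in this unit is shown below; the record and schema are invented for illustration, and plain JSON stands in for a document store such as MongoDB.

```python
import json
import sqlite3

# The same customer record in two shapes: a relational row versus a
# JSON document of the kind stored in a document database such as MongoDB.
record = {"customer_id": 10, "name": "Ada", "orders": ["Keyboard", "Mouse"]}

# Relational: fixed schema; the orders would normally live in a separate, linked table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer VALUES (?, ?)", (record["customer_id"], record["name"]))

# Document: schema-flexible; nested data is kept together in one document.
print(json.dumps(record, indent=2))
conn.close()
```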

We held another collaborative discussion for the module here, examining the importance of GDPR in maintaining security. My thoughts, along with those of my peers, can be found here.

Unit 10, More on APIs (Application Programming Interfaces) for Data Parsing

Reasons to use APIs are:

  • Real-Time Data Access
  • Efficiency and Speed
  • Consistency and Reliability
  • Automation
  • Security

I analysed and evaluated Application Programming Interfaces (APIs) to understand how they facilitate data parsing and inter-process communication. I examined the critical security requirements needed to ensure the robust functionality of APIs and addressed the various challenges and issues associated with their implementation. By the end of the unit, I had learned to configure APIs for a variety of platforms requiring data parsing and connectivity. This knowledge has equipped me to ensure that the APIs I work with are functional, secure, and effective in facilitating seamless communication between different software systems.
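The sketch below shows the general pattern of calling a key-protected API and parsing its JSON response; the endpoint, authentication scheme, and field names are assumptions for illustration, as real services differ.

```python
import requests

# Hypothetical endpoint and API key; real services differ in auth scheme and fields.
BASE_URL = "https://api.example.com/v1/records"
API_KEY = "YOUR_API_KEY"

response = requests.get(
    BASE_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},  # keep real keys out of source control
    params={"limit": 10},
    timeout=10,
)
response.raise_for_status()

# The API is assumed to return JSON; parse it and pull out one field per record.
for record in response.json().get("results", []):
    print(record.get("id"), record.get("name"))
```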

Unit 11, DBMS Transaction and Recovery

For this unit, we explored the vulnerabilities of database systems and the mechanisms in place to prevent data loss and maintain data integrity during system failures. I was introduced to transaction processing and learned about the ACID properties (Atomicity, Consistency, Isolation, and Durability) and their crucial role in managing the transaction cycle. I also examined how scheduled transactions, system failures, and checkpoints help achieve a database state where transactions are either fully committed or not committed at all, thus preventing errors within the database. By the end of the unit, I understood the significance of transaction consistency, realised that data transactions are fast-moving and interleaved, and grasped the purpose of a transaction manager in maintaining the smooth operation of database systems.

The YouTube video by ByteByteGo, ‘ACID Properties in Databases With Examples’, gives a quick overview of the importance of ACID properties in relational (SQL) databases.

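As a minimal, hypothetical demonstration of atomicity, the sketch below uses Python's sqlite3 module, where the connection's context manager commits the transaction on success and rolls it back on error.

```python
import sqlite3

# Minimal atomicity demo: both balance updates commit together or not at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE account SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    print("Transfer failed; no partial update was applied")

print(conn.execute("SELECT * FROM account").fetchall())
conn.close()
```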

Unit 12, Future of Big Data Analytics

This unit focused on understanding the role of machine learning and its impact on advancing big data analysis. We explored the future possibilities within big data analytics and examined how machine learning techniques and applications are pivotal in modelling large and complex datasets. Additionally, I reflected on the relevance and applicability of these topics to organisational contexts, focusing on the importance of adhering to compliance frameworks to ensure that data concerning individuals is protected. This included understanding the existing regulations, laws, rules, and standards crucial for enforcing compliance requirements. By the end of the unit, I had gained a clear understanding of the emerging and future trends in big data analytics and learned how machine learning strategies can be effectively applied in this evolving field.
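As a toy example of the kind of machine learning workflow discussed here, the sketch below fits a simple classifier with scikit-learn on synthetic data standing in for a large, complex dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a large, complex dataset.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a simple model and check how well it generalises to held-out data.
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```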

Our final assignment was to expand on the group work completed in Unit 6 and provide a summary of SQL and NoSQL databases. My assignment can be found here, on my GitHub repository.