diff --git a/advanced/geo-hashing/README.md b/advanced/geo-hashing/README.md new file mode 100644 index 0000000..4fbab63 --- /dev/null +++ b/advanced/geo-hashing/README.md @@ -0,0 +1,29 @@ +# Geo Hashing + +## What is GeoHash? + +- A Geohash is a **unique identifier** of a specific region on the Earth. +- The basic idea is that the Earth is divided into rectangular regions of user-defined size, and each region is assigned a unique id, called its **Geohash**. +- For a given location on Earth, the algorithm converts an arbitrary-precision latitude and longitude into a string, and regions with a **shared string prefix** are **closer together**. + - Conceptually, GeoHashing **reduces proximity search** to _string prefix matching_. As each character encodes additional precision, shared prefixes denote geographic proximity. +


GeoHash "w3gvk1" corresponds to the latitude & longitude of the HCMC Opera House (red dot); its precision=6 neighbors, such as "w3gv7f" and "w3gv7b", share the same geohash prefix

+- Geohashes also provide a degree of **anonymity**, since it isn’t necessary to expose exact GPS coordinates: only the entity’s bounding-box cell at a given precision is revealed. + +## Algorithm + +- The user specifies a level of precision, usually between 1 and 12, and a GeoHash of that length is returned. +- The GeoHash symbol map consists of 32 characters: the digits 0 through 9 plus all lowercase letters except a, i, l, and o. + - `base32 = "0123456789bcdefghjkmnpqrstuvwxyz"` +- Generate a geohash using the `pygeohash` package: + +```Python +import pygeohash
lat, lon, precision = 10.776775578390142, 106.7031296241205, 6 +gh_center = pygeohash.encode(latitude=lat, longitude=lon, precision=precision) +print(gh_center) # 'w3gvk1' +``` + +- As you can see, with `precision=6`, the returned geohash also has length 6. +- The table below gives the dimensions of GeoHash cells at each level of precision (taken from [here](https://www.movable-type.co.uk/scripts/geohash.html)): + +


The cell sizes of geohashes of different lengths

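The alternating longitude/latitude bisection described above can also be sketched from scratch. This is a minimal sketch: `geohash_encode` is a hypothetical helper written for illustration, but for the same inputs it should agree with `pygeohash.encode`.

```python
# From-scratch GeoHash encoder (illustrative sketch, not the pygeohash implementation).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, code = [], []
    even = True  # bits alternate, starting with longitude
    while len(code) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        even = not even
        if len(bits) == 5:  # every 5 bits map to one base32 character
            code.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(code)

print(geohash_encode(10.776775578390142, 106.7031296241205, precision=6))  # 'w3gvk1'
```

Note that each extra character simply continues the same bisection, which is why a precision-4 geohash of the same point is the prefix `'w3gv'`.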
diff --git a/advanced/geo-hashing/geo-hashing.ipynb b/advanced/geo-hashing/geo-hashing.ipynb new file mode 100644 index 0000000..120ec52 --- /dev/null +++ b/advanced/geo-hashing/geo-hashing.ipynb @@ -0,0 +1,449 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Geo Hashing\n", + "- [GeoHashing from Scratch in Python](https://www.jtrive.com/posts/geohash-python/index.html)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# !pip install pygeohash\n", + "# !pip install folium" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import pygeohash\n", + "\n", + "# visualise geohash with folium\n", + "import folium \n", + "from folium.features import DivIcon" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "def get_bbox_geohash(lat, lon, precision=12):\n", + " min_lat, max_lat = -90, 90\n", + " min_lon, max_lon = -180, 180\n", + " for ii in range(5 * precision):\n", + " if ii % 2 == 0:\n", + " # Bisect longitude (E-W).\n", + " mid_lon = (min_lon + max_lon) / 2\n", + " if lon >= mid_lon: \n", + " min_lon = mid_lon\n", + " else:\n", + " max_lon = mid_lon\n", + " else:\n", + " # Bisect latitude (N-S).\n", + " mid_lat = (min_lat + max_lat) / 2\n", + " if lat >= mid_lat:\n", + " min_lat = mid_lat\n", + " else:\n", + " max_lat = mid_lat\n", + " return [min_lat, min_lon, max_lat, max_lon]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [], + "source": [ + "lat, lon, precision = 10.776775578390142, 106.7031296241205, 6\n", + "gh_center = pygeohash.encode(latitude=lat, longitude=lon, precision=precision)\n", + "min_lat, min_lon, max_lat, max_lon = get_bbox_geohash(lat, lon, precision=precision)" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + 
"text/plain": [ + "'w3gvk1'" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "gh_center" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "# Get mid_lat and mid_lon for GeoHash id placement. \n", + "mid_lat = (min_lat + max_lat) / 2\n", + "mid_lon = (min_lon + max_lon) / 2" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "m = folium.Map(\n", + " location=[lat, lon], \n", + " #width=900, \n", + " #height=600, \n", + " zoom_start=16, \n", + " zoom_control=True, \n", + " no_touch=True,\n", + " tiles=\"OpenStreetMap\"\n", + " )\n", + "\n", + "# precision = 6 GeoHash bounding box. \n", + "folium.Rectangle(\n", + " [(min_lat, min_lon), (max_lat, max_lon)], \n", + " fill_color=\"red\", fill_opacity=.15\n", + " ).add_to(m)\n", + "\n", + "# Red dot at the HCMC Opera House. \n", + "folium.CircleMarker(\n", + " location=[lat, lon], radius=5, color=\"red\", fill_color=\"red\", \n", + " fill_opacity=1\n", + " ).add_to(m)\n", + "\n", + "# precision = 6 GeoHash id.\n", + "folium.map.Marker(\n", + " [mid_lat, mid_lon],\n", + " icon=DivIcon(\n", + " icon_size=(250,36),\n", + " icon_anchor=(100,50),\n", + " html=f'
{gh_center}
',\n", + " )\n", + " ).add_to(m)\n", + "# m" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [], + "source": [ + "def get_bbox_and_geohash(lat, lon, precision):\n", + " geohash = pygeohash.encode(latitude=lat, longitude=lon, precision=precision)\n", + " min_lat, min_lon, max_lat, max_lon = get_bbox_geohash(lat, lon, precision=precision)\n", + " bbox = [min_lat, min_lon, max_lat, max_lon]\n", + " return geohash, bbox" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Identify GeoHash Neighboring Cells: once the bounding box for the target GeoHash is known we simply increment those coordinates by a small amount, then lookup the GeoHash and bounding box associated with the new coordinate" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": {}, + "outputs": [], + "source": [ + "eps = 1e-10\n", + "gh_center, bb_center = get_bbox_and_geohash(lat, lon, precision=precision)\n", + "min_lat, min_lon, max_lat, max_lon = bb_center\n", + "# Get GeoHash id and bounding box for Northwest cell.\n", + "gh_nw, bb_nw = get_bbox_and_geohash(max_lat + eps, min_lon - eps, precision=precision)\n", + "\n", + "# Get GeoHash id and bounding box for Northeast cell.\n", + "gh_ne, bb_ne = get_bbox_and_geohash(max_lat + eps, max_lon + eps, precision=precision)\n", + "\n", + "# Get GeoHash id and bounding box for Southeast cell.\n", + "gh_se, bb_se = get_bbox_and_geohash(min_lat - eps, max_lon + eps, precision=precision)\n", + "\n", + "# Get GeoHash id and bounding box for Southwest cell.\n", + "gh_sw, bb_sw = get_bbox_and_geohash(min_lat - eps, min_lon - eps, precision=precision)" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [], + "source": [ + "coord_list = zip([gh_center, gh_nw, gh_ne, gh_se, gh_sw],[bb_center, bb_nw, bb_ne, bb_se, bb_sw])" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [ + { + "data": { + 
"text/html": [ + "
Make this Notebook Trusted to load map: File -> Trust Notebook
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 65, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "m = folium.Map(\n", + " location=[lat, lon], \n", + " #width=900, \n", + " #height=600, \n", + " zoom_start=16, \n", + " zoom_control=True, \n", + " no_touch=True,\n", + " tiles=\"OpenStreetMap\"\n", + ")\n", + "# Red dot at the HCMC Opera House. \n", + "folium.CircleMarker(\n", + " location=[lat, lon], radius=5, color=\"red\", fill_color=\"red\", \n", + " fill_opacity=1\n", + ").add_to(m)\n", + "\n", + "for gh, bb in coord_list:\n", + " min_lat, min_lon, max_lat, max_lon = bb\n", + " # Get mid_lat and mid_lon for GeoHash id placement. \n", + " mid_lat = (min_lat + max_lat) / 2\n", + " mid_lon = (min_lon + max_lon) / 2\n", + " # precision = 6 GeoHash bounding box. \n", + " folium.Rectangle(\n", + " [(min_lat, min_lon), (max_lat, max_lon)], \n", + " fill_color=\"red\", fill_opacity=.15\n", + " ).add_to(m)\n", + "\n", + " # precision = 6 GeoHash id.\n", + " folium.map.Marker(\n", + " [mid_lat, mid_lon],\n", + " icon=DivIcon(\n", + " icon_size=(250,36),\n", + " icon_anchor=(100,50),\n", + " html=f'
{gh}
',\n", + " )\n", + " ).add_to(m)\n", + "m" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/advanced/others/fuzzy-name-matching.ipynb b/advanced/others/fuzzy-name-matching.ipynb new file mode 100644 index 0000000..dd93a3e --- /dev/null +++ b/advanced/others/fuzzy-name-matching.ipynb @@ -0,0 +1,283 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# The Fuzz\n", + "- `theFuzz` uses the Levenshtein edit distance to calculate the degree of closeness between two strings. 
\n", + " - It also provides features for determining string similarity in various situations\n", + "- [Reference](https://www.datacamp.com/tutorial/fuzzy-string-python)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# !conda install thefuzz" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "from thefuzz import fuzz" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## String Matching Methods\n", + "|Technique|\tDescription|\tCode Example|\n", + "|:------:|:------|:------|\n", + "|Simple Ratio|\tCalculates similarity considering the order of input strings.\t|`fuzz.ratio(name, full_name)`|\n", + "|Partial Ratio|\tFinds partial similarity by comparing the shortest string with sub-strings.|\t`fuzz.partial_ratio(name, full_name)`\n", + "|Token Sort Ratio|\tIgnores order of words in strings.|\t`fuzz.token_sort_ratio(full_name_reordered, full_name)`|\n", + "|Token Set Ratio|\tRemoves common tokens before calculating similarity.|\t`fuzz.token_set_ratio(name, full_name)`|" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Simple Ratio \n", + "- `ratio()` calculates the edit distance based on the ordering of both input strings\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Similarity score: 86\n" + ] + } + ], + "source": [ + "# Check the similarity score\n", + "name = \"Kurtis Pykes\"\n", + "full_name = \"Kurtis K D Pykes\"\n", + "\n", + "print(f\"Similarity score: {fuzz.ratio(name, full_name)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Partial Ratio\n", + "- `partial_ratio()` seeks to find how partially similar two strings are.\n", + " - it calculates the similarity by taking the **shortest** string, which in this scenario is stored in 
the variable `name`, then compares it against the **sub-strings** of the same length in the longer string, which is stored in `full_name`. " + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Similarity score: 67\n" + ] + } + ], + "source": [ + "print(f\"Similarity score: {fuzz.partial_ratio(name, full_name)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Since order matters in partial ratio, our score dropped to 67 in this instance. \n", + "- Therefore, to get a 100% similarity match, you would have to move the \"K D\" part to the end of the string" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Partial ratio similarity score: 100\n", + "Simple ratio similarity score: 86\n" + ] + } + ], + "source": [ + "# Order matters with partial ratio\n", + "# Check the similarity score\n", + "name = \"Kurtis Pykes\"\n", + "full_name = \"Kurtis Pykes K D\" # move K D to the end \n", + "\n", + "print(f\"Partial ratio similarity score: {fuzz.partial_ratio(name, full_name)}\")\n", + "\n", + "# But order will not affect simple ratio if strings do not match\n", + "print(f\"Simple ratio similarity score: {fuzz.ratio(name, full_name)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Token Sort Ratio\n", + "- Token sort doesn’t care about what order words occur in. 
It accounts for similar strings that aren’t in order as expressed above" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Token sort ratio similarity score: 100\n", + "Partial ratio similarity score: 75\n", + "Simple ratio similarity score: 86\n" + ] + } + ], + "source": [ + "# Check the similarity score\n", + "full_name = \"Kurtis K D Pykes\"\n", + "full_name_reordered = \"Kurtis Pykes K D\"\n", + "\n", + "# Order does not matter for token sort ratio\n", + "print(f\"Token sort ratio similarity score: {fuzz.token_sort_ratio(full_name_reordered, full_name)}\")\n", + "\n", + "# Order matters for partial ratio\n", + "print(f\"Partial ratio similarity score: {fuzz.partial_ratio(full_name, full_name_reordered)}\")\n", + "\n", + "# Order will not affect simple ratio if strings do not match\n", + "print(f\"Simple ratio similarity score: {fuzz.ratio(name, full_name)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- If there are dissimilar words in the strings, they will negatively impact the similarity ratio" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Token sort ratio similarity score: 86\n" + ] + } + ], + "source": [ + "# Check the similarity score\n", + "name = \"Kurtis Pykes\"\n", + "full_name = \"Kurtis K D Pykes\" # \"Kurtis Pykes K D\"\n", + "\n", + "print(f\"Token sort ratio similarity score: {fuzz.token_sort_ratio(name, full_name)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Token set ratio\n", + "- The `token_set_ratio()` method is pretty similar to `token_sort_ratio()`, except it takes out common tokens before calculating how similar the strings are: this is extremely helpful when the strings are significantly different in length. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Token sort ratio similarity score: 100\n" + ] + } + ], + "source": [ + "# Check the similarity score\n", + "name = \"Kurtis Pykes\"\n", + "full_name = \"Kurtis K D Pykes\"\n", + "\n", + "print(f\"Token sort ratio similarity score: {fuzz.token_set_ratio(name, full_name)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Process\n", + "- The process module enables users to extract text from a collection using fuzzy string matching. Calling the extract() method on the process module returns the strings with a similarity score in a vector. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[('barcelona fc', 86), ('AFC Barcelona', 82)]\n" + ] + } + ], + "source": [ + "from thefuzz import process\n", + "\n", + "collection = [\"AFC Barcelona\", \"Barcelona AFC\", \"barcelona fc\", \"afc barcalona\"]\n", + "print(process.extract(\"barcelona\", collection, scorer=fuzz.ratio, limit=2))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python_tutorial", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/assets/img/geo-hash-cell-size-with-different-length.png b/assets/img/geo-hash-cell-size-with-different-length.png new file mode 100644 index 0000000..e0513a9 Binary files /dev/null and b/assets/img/geo-hash-cell-size-with-different-length.png differ diff --git 
a/assets/img/geo-hash.png b/assets/img/geo-hash.png new file mode 100644 index 0000000..28ed28d Binary files /dev/null and b/assets/img/geo-hash.png differ diff --git a/assets/img/lru_cache_example.png b/assets/img/lru_cache_example.png new file mode 100644 index 0000000..e09c4e2 Binary files /dev/null and b/assets/img/lru_cache_example.png differ diff --git a/assets/img/plotly_subplots_example.png b/assets/img/plotly_subplots_example.png new file mode 100644 index 0000000..88b1031 Binary files /dev/null and b/assets/img/plotly_subplots_example.png differ diff --git a/basics/README.md b/basics/README.md index 93e414d..71b0037 100644 --- a/basics/README.md +++ b/basics/README.md @@ -3,9 +3,9 @@ ## Topics - [`*args` and `**kwargs`](./args_kwargs_tutorial.py) -- [Google Colab](./google_colab_tutorial.ipynb) +- [Google Colab](./notebooks/google_colab_tutorial.ipynb) - [Pathlib](./pathlib_tutorial.py) -- [Subprocess](./subprocess_tutorial.py) +- [Subprocess](./notebooks/subprocess.ipynb) - [YAML](./yaml/README.md) ## Argument Parser @@ -26,8 +26,8 @@ args = parser.parse_args() print('Hello, ' + args.name + '!') -# use var() to make args as the dict -args = var(parser.parse_args()) +# use vars() to make args as the dict +args = vars(parser.parse_args()) print('Hello, ' + args['name'] + '!') ``` diff --git a/basics/dotvenv.md b/basics/dotvenv.md index aef373e..1a47edf 100644 --- a/basics/dotvenv.md +++ b/basics/dotvenv.md @@ -2,7 +2,7 @@ - Installation: `pip install python-dotenv==1.0.0` - Config file stores in `.env` file ```shell -HUGGINGFACEHUB_API_TOKEN="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" +HUGGINGFACEHUB_API_TOKEN="" ``` - Load environmental variables ```Python diff --git a/basics/google_colab_tutorial.ipynb b/basics/notebooks/google_colab_tutorial.ipynb similarity index 100% rename from basics/google_colab_tutorial.ipynb rename to basics/notebooks/google_colab_tutorial.ipynb diff --git a/basics/notebooks/pandas-pivot-melt-crosstab.ipynb 
b/basics/notebooks/pandas-pivot-melt-crosstab.ipynb new file mode 100644 index 0000000..8c91038 --- /dev/null +++ b/basics/notebooks/pandas-pivot-melt-crosstab.ipynb @@ -0,0 +1,941 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Reshaping Dataframes" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pivot\n", + "- In pandas, there are two methods: `.pivot()` and `.pivot_table()` (RECOMMENDED)\n", + "- However, `.pivot()` is unable to handle duplicate values in the index column; in this case, the index column `cusid` contains multiple rows for `cusid=1` and `cusid=2`" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
cusidpayment_methodmerchanttotal_txn
01DEBITSHOPEE1
11DEBITGRAB2
21CREDITSHOPEE3
32CREDITSHOPEE4
42CREDITLAZADA5
52DEBITGSM6
\n", + "
" + ], + "text/plain": [ + " cusid payment_method merchant total_txn\n", + "0 1 DEBIT SHOPEE 1\n", + "1 1 DEBIT GRAB 2\n", + "2 1 CREDIT SHOPEE 3\n", + "3 2 CREDIT SHOPEE 4\n", + "4 2 CREDIT LAZADA 5\n", + "5 2 DEBIT GSM 6" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame({'cusid': [1,1,1,2,2,2],\n", + " 'payment_method': ['DEBIT', 'DEBIT', 'CREDIT', 'CREDIT', 'CREDIT', 'DEBIT'],\n", + " 'merchant': ['SHOPEE', 'GRAB', 'SHOPEE', 'SHOPEE', 'LAZADA', 'GSM'],\n", + " 'total_txn': [1, 2, 3, 4, 5, 6],\n", + "})\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
payment_methodCREDITDEBIT
merchantLAZADASHOPEEGRABGSMSHOPEE
cusid
1NaN3.02.0NaN1.0
25.04.0NaN6.0NaN
\n", + "
" + ], + "text/plain": [ + "payment_method CREDIT DEBIT \n", + "merchant LAZADA SHOPEE GRAB GSM SHOPEE\n", + "cusid \n", + "1 NaN 3.0 2.0 NaN 1.0\n", + "2 5.0 4.0 NaN 6.0 NaN" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pivot_df = df.pivot_table(index=[\"cusid\"], columns=[\"payment_method\", \"merchant\"], values=[\"total_txn\"])\n", + "pivot_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "MultiIndex([('CREDIT', 'LAZADA'),\n", + " ('CREDIT', 'SHOPEE'),\n", + " ( 'DEBIT', 'GRAB'),\n", + " ( 'DEBIT', 'GSM'),\n", + " ( 'DEBIT', 'SHOPEE')],\n", + " names=['payment_method', 'merchant'])" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# MultiIndex\n", + "pivot_df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CREDIT_LAZADACREDIT_SHOPEEDEBIT_GRABDEBIT_GSMDEBIT_SHOPEE
cusid
1NaN3.02.0NaN1.0
25.04.0NaN6.0NaN
\n", + "
" + ], + "text/plain": [ + " CREDIT_LAZADA CREDIT_SHOPEE DEBIT_GRAB DEBIT_GSM DEBIT_SHOPEE\n", + "cusid \n", + "1 NaN 3.0 2.0 NaN 1.0\n", + "2 5.0 4.0 NaN 6.0 NaN" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "flatten_name_df = pivot_df.copy()\n", + "flatten_name_df.columns = list(map(\"_\".join, pivot_df.columns))\n", + "flatten_name_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Melt (Unpivot)\n", + "- Unpivot a DataFrame from **wide** to **long** format, optionally leaving identifiers set.\n", + "- For example, we want to melt the dataframe `df` below into `subjects` and `grades` for each student instead of having multiple subjects columns" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
NameMathEnglishAge
0BobA+C13
1JohnBB16
2FooAB16
3BarFA+15
4AlexDF15
5TomCA13
\n", + "
" + ], + "text/plain": [ + " Name Math English Age\n", + "0 Bob A+ C 13\n", + "1 John B B 16\n", + "2 Foo A B 16\n", + "3 Bar F A+ 15\n", + "4 Alex D F 15\n", + "5 Tom C A 13" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame({'Name': ['Bob', 'John', 'Foo', 'Bar', 'Alex', 'Tom'],\n", + " 'Math': ['A+', 'B', 'A', 'F', 'D', 'C'],\n", + " 'English': ['C', 'B', 'B', 'A+', 'F', 'A'],\n", + " 'Age': [13, 16, 16, 15, 15, 13]})\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
NameAgeSubjectGrades
4Alex15MathD
10Alex15EnglishF
3Bar15MathF
9Bar15EnglishA+
0Bob13MathA+
6Bob13EnglishC
2Foo16MathA
8Foo16EnglishB
1John16MathB
7John16EnglishB
5Tom13MathC
11Tom13EnglishA
\n", + "
" + ], + "text/plain": [ + " Name Age Subject Grades\n", + "4 Alex 15 Math D\n", + "10 Alex 15 English F\n", + "3 Bar 15 Math F\n", + "9 Bar 15 English A+\n", + "0 Bob 13 Math A+\n", + "6 Bob 13 English C\n", + "2 Foo 16 Math A\n", + "8 Foo 16 English B\n", + "1 John 16 Math B\n", + "7 John 16 English B\n", + "5 Tom 13 Math C\n", + "11 Tom 13 English A" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.melt(\n", + " id_vars=[\"Name\", \"Age\"],\n", + " value_vars=[\"Math\", \"English\"],\n", + " var_name=\"Subject\",\n", + " value_name=\"Grades\",\n", + ").sort_values(by=[\"Name\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Crosstab\n", + "- Crosstab: displays the relationship between two or more categorical variables by showing the frequency of different combinations of those variables" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
GenderEducationAge
0MaleGraduate27
1FemaleUndergraduate18
2FemaleUndergraduate19
3MaleGraduate24
4MaleGraduate29
5FemaleGraduate23
6MaleUndergraduate18
\n", + "
" + ], + "text/plain": [ + " Gender Education Age\n", + "0 Male Graduate 27\n", + "1 Female Undergraduate 18\n", + "2 Female Undergraduate 19\n", + "3 Male Graduate 24\n", + "4 Male Graduate 29\n", + "5 Female Graduate 23\n", + "6 Male Undergraduate 18" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male', 'Male','Female', 'Male'],\n", + " 'Education': ['Graduate', 'Undergraduate', 'Undergraduate', 'Graduate', 'Graduate', 'Graduate', 'Undergraduate'],\n", + " 'Age': [27, 18, 19, 24, 29, 23,18]})\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
EducationGraduateUndergraduate
Gender
Female02
Male30
\n", + "
" + ], + "text/plain": [ + "Education Graduate Undergraduate\n", + "Gender \n", + "Female 0 2\n", + "Male 3 0" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Basic crosstab\n", + "cross_tab = pd.crosstab(df['Gender'], df['Education'])\n", + "cross_tab" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
EducationGraduateUndergraduate
Gender
Female0.1428570.285714
Male0.4285710.142857
\n", + "
" + ], + "text/plain": [ + "Education Graduate Undergraduate\n", + "Gender \n", + "Female 0.142857 0.285714\n", + "Male 0.428571 0.142857" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Crosstab with normalization: shows the proportion of each combination relative to the total.\n", + "cross_tab_normalized = pd.crosstab(df['Gender'], df['Education'], normalize='all')\n", + "cross_tab_normalized" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
EducationGraduateUndergraduate
Gender
Female23.00000018.5
Male26.66666718.0
\n", + "
" + ], + "text/plain": [ + "Education Graduate Undergraduate\n", + "Gender \n", + "Female 23.000000 18.5\n", + "Male 26.666667 18.0" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Crosstab with aggregation for each combination\n", + "cross_tab_agg = pd.crosstab(df['Gender'], df['Education'], values=df['Age'], aggfunc='mean')\n", + "cross_tab_agg" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "# Crosstab with margins\n", + "cross_tab_margins = pd.crosstab(df['Gender'], df['Education'], margins=True, margins_name=\"Total\")\n", + "print(\"\\nCrosstab with Margins:\")\n", + "print(cross_tab_margins)\n", + "\n", + "\n", + "print(\"\\nCrosstab with Normalization:\")\n", + "print(cross_tab_normalized)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "ml_env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/basics/notebooks/subprocess.ipynb b/basics/notebooks/subprocess.ipynb new file mode 100644 index 0000000..1461054 --- /dev/null +++ b/basics/notebooks/subprocess.ipynb @@ -0,0 +1,270 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Subprocess\n", + "- `subprocess.run` is a higher-level wrapper around Popen that is intended to be more convenient to use.\n", + " - Usage: to run a command and capture its output\n", + "- `subprocess.call` \n", + " - Usage: to run a command and check the return code, but do not need to capture the output.\n", + "- `subprocess.Popen` is a lower-level interface to running subprocesses\n", + " - Usage: if you need more control over 
the process, such as interacting with its input and output streams." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## subprocess.run()\n", + "- `subprocess.run()` method is a convenient way to run a subprocess and wait for it to complete.\n", + " - Once the subprocess is started, the `run()` method blocks until the subprocess completes and returns a `CompletedProcess` object\n", + "- `subprocess.run()`'s input arguments:\n", + " - `args`: The command to run and its arguments, passed as a **list of strings**.\n", + " - `capture_output`: When set to True, will capture the standard output and standard error.\n", + " - `text`: when set to True, will return the stdout and stderr as string, otherwise as bytes `b'/Users/codexplore/Developer/repos/`.\n", + " - `check`: \n", + " - when check is set to True, the function will check the return code of the command and raise a `CalledProcessError` exception if the return code is non-zero. \n", + " - when check is set to False (default), the function will not check the return code and will not raise an exception, even if the command fails.\n", + " - `timeout`: A value in seconds that specifies how long to wait for the subprocess to complete before timing out.\n", + "- `subprocess.run()`` method also returns a `CompletedProcess` object, which contains the following attributes:\n", + " - `args`: The command and arguments that were run.\n", + " - `returncode`: The return code of the subprocess.\n", + " - `stdout`: The standard output of the subprocess, as a bytes object.\n", + " - `stderr`: The standard error of the subprocess, as a bytes object.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total 40\n", + "drwxr-xr-x 4 codexplore staff 128 Feb 6 17:32 
\u001b[1m\u001b[36m.\u001b[m\u001b[m\n", + "drwxr-xr-x@ 8 codexplore staff 256 Feb 6 17:32 \u001b[1m\u001b[36m..\u001b[m\u001b[m\n", + "-rw-r--r--@ 1 codexplore staff 18414 Oct 22 10:29 google_colab_tutorial.ipynb\n", + "-rw-r--r-- 1 codexplore staff 0 Feb 6 17:32 subprocess.ipynb\n" + ] + }, + { + "data": { + "text/plain": [ + "CompletedProcess(args=['ls', '-la'], returncode=0)" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "subprocess.run([\"ls\", \"-la\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "result = subprocess.run([\"pwd\"], capture_output=True, text=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('/Users/codexplore/Developer/repos/python/basics/notebooks\\n', '')" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result.stdout, result.stderr" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result.returncode" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## subprocess.call()" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Python 3.9.6\n", + "Command executed successfully.\n" + ] + } + ], + "source": [ + "return_code = subprocess.call([\"python3\", \"--version\"])\n", + "\n", + "if return_code == 0:\n", + " print(\"Command executed successfully.\")\n", + "else:\n", + " print(\"Command failed with return code\", return_code)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## subprocess.Popen()\n", + "- `Popen` 
allows you to start a new process and interact with its standard input, output, and error streams. It returns a handle to the running process that can be used to wait for the process to complete, check its return code, or terminate it.\n", + "- The Popen class has several methods that allow you to interact with the process, such as `communicate(`), `poll()`, `wait()`, `terminate()`, and `kill()`." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Python 3.9.6\n", + "\n" + ] + } + ], + "source": [ + "import subprocess\n", + "\n", + "p = subprocess.Popen([\"python3\", \"--version\"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)\n", + "\n", + "output, errors = p.communicate()\n", + "\n", + "print(output)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Subprocess `PIPE`\n", + "- A `PIPE` is a unidirectional communication channel that connects one process's standard output to another's standard input. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Ouptut:\n", + "google_colab_tutorial.ipynb\n", + "subprocess.ipynb\n", + "\n", + "Error : None\n" + ] + } + ], + "source": [ + "# creates a pipe that connects the output of the `ls` command to the input of the `grep` command,\n", + "ls_process = subprocess.Popen([\"ls\"], stdout=subprocess.PIPE, text=True)\n", + "\n", + "grep_process = subprocess.Popen([\"grep\", \".ipynb\"], stdin=ls_process.stdout, stdout=subprocess.PIPE, text=True)\n", + "\n", + "output, error = grep_process.communicate()\n", + "\n", + "print(f\"Ouptut:\\n{output}\")\n", + "print(f\"Error : {error}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "google_colab_tutorial.ipynb\n", + "subprocess.ipynb\n", + "\n" + ] + } + ], + "source": [ + "result = subprocess.run([\"ls\"], stdout=subprocess.PIPE)\n", + "\n", + "print(result.stdout.decode()) # decode() to convert from bytes to strings\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/basics/subprocess_tutorial.py b/basics/subprocess_tutorial.py deleted file mode 100644 index 54d5ee6..0000000 --- a/basics/subprocess_tutorial.py +++ /dev/null @@ -1 +0,0 @@ -# Link: https://www.simplilearn.com/tutorials/python-tutorial/subprocess-in-python \ No newline at end of file diff --git a/daily_knowledge.md b/daily_knowledge.md index 5fe76fc..0a5d01b 100644 --- a/daily_knowledge.md +++ 
b/daily_knowledge.md
@@ -1,9 +1,407 @@
+# 2024
+
+## Day 2
+
+- Notebook: make changes to your module's code and see the effects immediately, without needing to manually reload the module or restart the notebook's kernel.
+
+```Python
+%load_ext autoreload  # enable the autoreload extension in your notebook session
+%autoreload 2  # mode "2": modules are automatically re-imported before executing any code that depends on them
+```
+
+- How to express $10^{n}$: `2.944297e+10` is equivalent to `2.944297*(10**10)`
+
+## Day 1
+
+### Matplotlib
+
+- Modify the x-axis to show only highlighted values, instead of listing all the dates
+  - In the example below, instead of putting every month on the x-axis, we only mark the start of each year from 1949 to 1962
+
+```Python
+fig, ax = plt.subplots()
+
+ax.plot(df['Month'], df['Passengers'])
+ax.set_xlabel('Date')
+ax.set_ylabel('Number of air passengers')
+
+plt.xticks(np.arange(0, 145, 12),  # place a tick only every 12 months
+           np.arange(1949, 1962, 1)  # the year label for each of those ticks
+)
+```
+
+### Pandas
+
+#### `.loc` vs `.iloc`
+
+- `loc` gets rows (and/or columns) with particular **labels**.
+- `iloc` gets rows (and/or columns) at **integer locations**.
+- Example: given the following dataframe that has the index starting from 80 to 84
+
+```Python
+df = pd.DataFrame(np.arange(25).reshape(5, 5),
+                  index=[80, 81, 82, 83, 84],
+                  columns=['col_A', 'col_B', 'col_C', 'col_D', 'col_E'])
+```
+
+|     | col_A | col_B | col_C | col_D | col_E |
+| --: | ----: | ----: | ----: | ----: | ----: |
+|  80 |     0 |     1 |     2 |     3 |     4 |
+|  81 |     5 |     6 |     7 |     8 |     9 |
+|  82 |    10 |    11 |    12 |    13 |    14 |
+|  83 |    15 |    16 |    17 |    18 |    19 |
+|  84 |    20 |    21 |    22 |    23 |    24 |
+
+- Return the first row of the df with two columns `col_A` and `col_D`
+  - `.loc`: since the first row in the dataframe corresponds to `index=80`, we need to specify that label in `.loc`
+    - `df.loc[80, ["col_A", "col_D"]]`
+  - `.iloc`: since `iloc` is based on integer location, the first row of the df corresponds to location 0
+    - `df.iloc[0, [df.columns.get_loc(c) for c in ["col_A", "col_D"]]]`
+
+```Python
+# example of .loc and .iloc to return the first row of the df with two columns A and D
+df.loc[80, ["col_A", "col_D"]]
+# col_A    0
+# col_D    3
+df.iloc[0, [df.columns.get_loc(c) for c in ["col_A", "col_D"]]]
+# col_A    0
+# col_D    3
+```
+
+- Select the last 2 rows of the `df` → `.iloc` has the advantage here, as it is based on integer location rather than labels.
+
+```Python
+df.iloc[-2:, :]
+```
+
+|     | col_A | col_B | col_C | col_D | col_E |
+| --: | ----: | ----: | ----: | ----: | ----: |
+|  83 |    15 |    16 |    17 |    18 |    19 |
+|  84 |    20 |    21 |    22 |    23 |    24 |
+
+- Select the first 3 columns of the rows from `index=82` onwards
+
+```Python
+df.iloc[df.index.get_loc(82):, :3]
+```
+
+|     | col_A | col_B | col_C |
+| --: | ----: | ----: | ----: |
+|  82 |    10 |    11 |    12 |
+|  83 |    15 |    16 |    17 |
+|  84 |    20 |    21 |    22 |
+
 # 2023
 
+## Day 11
+
+### Python
+
+- `1e-4` is equal to `0.0001`, i.e. a total of 4 zeros
+- Validate that an input string variable, say `split`, belongs to a set of allowed options: `assert split in ['train', 'test', 'both']`
+  - We can add a handling path in case the assertion error would otherwise stop the program
+  ```Python
+  try:
+      assert 'value' not in numerical_vars
+  except AssertionError:
+      # do something
+  ```
+
+### Pandas
+
+#### `pd.melt` vs contingency table (`pd.crosstab`)
+
+- The `melt` function in pandas is used to unpivot a DataFrame, meaning that it **converts columns of data into rows**.
+  - This is useful when you want to combine the data for plotting
+  - For example: the melt function has converted the `Math` and `Physics` columns into rows, and created two new columns: `subject` and `score`.
+
+```Python
+# Create a sample DataFrame
+df = pd.DataFrame({'student_id': [1, 2, 3], 'Math': [4, 5, 6], 'Physics': [7, 8, 9]})
+
+# Melt the DataFrame
+df_melted = df.melt(id_vars=['student_id'], value_vars=['Math', 'Physics'], var_name='subject', value_name='score')
+
+# Print the melted DataFrame
+print(df_melted)
+
+#    student_id  subject  score
+# 0           1     Math      4
+# 1           2     Math      5
+# 2           3     Math      6
+# 3           1  Physics      7
+# 4           2  Physics      8
+# 5           3  Physics      9
+
+# plotting the melted dataframe by subject using hue of seaborn
+sns.barplot(df_melted, x='student_id', y='score', hue='subject')
+```
+
+- The `crosstab` function produces a DataFrame where the rows represent the levels of one factor and the columns represent the levels of another factor.
+- Usage: to create frequency table for each category in a feature vs another feature +- The crosstab function takes the following arguments: + - `index`: The name of the column to use for the row labels. + - `columns`: The name of the column to use for the column labels. + - `values`: The name of the column to use for the cell values. If not specified, the counts of observations are used. + - `margins`: If True, the DataFrame will include a row and column for the totals. + +```Python +# Create a sample DataFrame +df = pd.DataFrame({'gender': ['M', 'F', 'M', 'F', 'F'], 'favorite_color': ['blue', 'red', 'green', 'blue', 'purple']}) + +# gender favorite_color +# 0 M blue +# 1 F red +# 2 M green +# 3 F blue +# 4 F purple +# Create a crosstabulation of gender and favorite color +crosstab = pd.crosstab(df['gender'], # rows + df['favorite_color'], # columns + margins=True # True to calculate the total + ) + +# re-name columns "All" to 'row_totals' & 'col_totals' +# note: columns "All" are only avail if margins=True +crosstab.columns = [*crosstab.columns.to_list()[:-1], 'row_totals'] +crosstab.index = [*crosstab.index.to_list()[:-1], 'col_totals'] +# blue green purple red row_totals +# F 1 0 1 1 3 +# M 1 1 0 0 2 +# col_totals 2 1 1 1 5 +``` + +### Matplotlib + +- Subplots: + +```Python +# Annual, weekly and daily seasonality +# ============================================================================== +fig, axs = plt.subplots(2, 2, figsize=(8.5, 5.5), sharex=False, sharey=True) + +# Before: +ax1 = axs[0,0] +ax2 = axs[0,1] +# After with axs.ravel() +axs = axs.ravel() +ax1 = axs[0] +#... +ax4 = axs[3] + +``` + +### Conda vs Pip: Package Availability + +- Pip installs packages from the Python Package Index (`PyPI`), which hosts a vast array of Python libraries. Almost any Python library can be installed using pip. +- conda installs packages from the Anaconda distribution and other channels (`conda-forge`). 
While the number of packages available through conda is smaller than with pip, conda can install packages for multiple languages, not just Python.
+- Example: `skforecast` is not available on `conda-forge` but is available on `PyPI`, so you cannot `conda install skforecast`, but it can be installed via the `pip install` command
+
+## Day 10
+
+### Pandas
+
+- Find the positional index of the row where a column equals a given value: `index_choice = df.index.get_loc(df[df['CustomerId'] == 15674932].index[0])`
+
+#### `pd.cut` vs `pd.qcut`
+
+|           | Space between 2 bins | Frequency of Samples in each bin |
+| --------- | -------------------- | -------------------------------- |
+| `pd.cut`  | Equal Spacing        | Different                        |
+| `pd.qcut` | Un-equal Spacing     | Same                             |
+
+- `pd.cut` will choose the bins to be **evenly spaced** according to the values themselves and not the frequency of those values.
+  - You also can define the bounds for each bin with `pd.cut()`
+  - You can use the Fisher-Jenks algorithm to determine the natural bounds and then pass those values into `pd.cut()`
+- With `pd.qcut` the bin intervals are chosen based on percentiles, so that each bin contains the same number of records.
+
+```Python
+factors = np.random.randn(30)
+
+pd.cut(factors, 5).value_counts()  # bins have an equal interval of ~1
+
+# (-2.583, -1.539]    5
+# (-1.539, -0.5]      5
+# (-0.5, 0.539]       9
+# (0.539, 1.578]      9
+# (1.578, 2.617]      2
+
+pd.qcut(factors, 5).value_counts()  # each bin has an equal size of 6
+
+# (-2.578, -0.829]    6
+# (-0.829, -0.36]     6
+# (-0.36, 0.366]      6
+# (0.366, 0.868]      6
+# (0.868, 2.617]      6
+```
+
+#### `.read_csv()` by chunk
+
+- If the csv file is large, consider reading it by chunk
+
+```Python
+df_iter = pd.read_csv(file_path, iterator=True, chunksize=100000)
+df = next(df_iter)
+```
+
+### Python
+
+#### `lru_cache` from `functools`
+
+- `@lru_cache` modifies the function it decorates to return the same value that was returned the first time, instead of computing it again by executing the body of the function every time.
+
+```Python
+from functools import lru_cache
+
+@lru_cache
+def say_hi(name: str, salutation: str = "Ms."):
+    return f"Hello {salutation} {name}"
+```
+
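To see the memoization in action, a small self-contained sketch (the `slow_square` function and `call_count` counter are made up for illustration):

```Python
from functools import lru_cache

call_count = 0  # counts how many times the function body actually runs

@lru_cache
def slow_square(n: int) -> int:
    global call_count
    call_count += 1
    return n * n

print(slow_square(4))  # 16 -> first call: a cache miss, the body executes
print(slow_square(4))  # 16 -> repeated call: answer served from the cache
print(call_count)      # 1  -> the body only ran once
print(slow_square.cache_info())
```

Arguments must be hashable, since they form the cache key; `slow_square.cache_clear()` resets the cache.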

+ +#### typing `Annotated` + +- `Annotated` in python allows developers to declare the type of a reference and provide additional information related to it. + +```Python +from typing_extensions import Annotated +# This tells that "name" is of type "str" and that "name[0]" is a capital letter. +name = Annotated[str, "first letter is capital"] +``` + +- Fast API examples: + ```Python + from fastapi import Query + def read_items(q: Annotated[str, Query(max_length=50)]) + ``` + - The parameter `q` is of type `str` with a maximum length of 50. + +#### Float to Decimal conversion + +- Convert `float` directly to `Decimal` constructor introduces a rounding error. + +```Python +from decimal import Decimal +x = 0.1234 +Decimal(x) +# Decimal('0.12339999999999999580335696691690827719867229461669921875') +``` + +- **Solution**: to convert a float to a string before passing it to the constructor. + - You also can round the float before converting it to string + +```Python +Decimal(str(x)) +# Decimal('0.1234') +Decimal(str(round(x,2))) +# Decimal('0.12') +``` + +- + +## Day 9 + +### `subprocess` module + +- `subprocess.run` is a higher-level wrapper around Popen that is intended to be more convenient to use. + - Usage: to run a command and capture its output +- `subprocess.call` + - Usage: to run a command and check the return code, but do not need to capture the output. +- `subprocess.Popen` is a lower-level interface to running subprocesses + + - Usage: if you need more control over the process, such as interacting with its input and output streams. + +- A `PIPE` is a unidirectional communication channel that connects one process's standard output to another's standard input. 
+
+```Python
+# create a pipe that connects the output of the `ls` command to the input of the `grep` command
+ls_process = subprocess.Popen(["ls"], stdout=subprocess.PIPE, text=True)
+
+grep_process = subprocess.Popen(["grep", ".ipynb"], stdin=ls_process.stdout, stdout=subprocess.PIPE, text=True)
+
+output, error = grep_process.communicate()
+
+print(f"Output:\n{output}")
+print(f"Error : {error}")
+
+# Output:
+# google_colab_tutorial.ipynb
+# subprocess.ipynb
+#
+# Error : None
+
+result = subprocess.run(["ls"], stdout=subprocess.PIPE)
+
+print(result.stdout.decode())  # decode() to convert from bytes to strings
+# google_colab_tutorial.ipynb
+# subprocess.ipynb
+```
+
+#### `subprocess` vs `os.system`
+
+- `subprocess.run` is generally more flexible than `os.system` (you can get the stdout, stderr, the "real" status code, better error handling, etc.)
+- Even the [documentation for `os.system`](https://docs.python.org/3/library/os.html#os.system) recommends using subprocess instead.
+
+### `sys` module
+
+- The kernel knows to execute a script with a **python** interpreter via the shebang `#!/usr/bin/env python`
+- `sys.argv[0]` returns the name of the script
+- `sys.argv[1:]` returns the arguments passed to the script
+
+```Python
+#############################
+#  in the python_script.py  #
+#############################
+#!/usr/bin/env python
+import sys
+print(f"Running: {sys.argv[0]}")
+for arg in reversed(sys.argv[1:]):
+    print(arg)
+
+#############################
+#  in the interactive shell #
+#############################
+bash-5.2$ chmod +x python_script.py
+bash-5.2$ ./python_script.py a b c
+# Running: ./python_script.py
+# c
+# b
+# a
+```
+
 ## Day 8
 
 - `bytes("yes", "utf-8")` convert string to binary objects:
 
+### Matplotlib
+
+- Color Map
+
+```Python
+cmap = plt.get_cmap("viridis")
+fig = plt.figure(figsize=(8, 6))
+m1 = plt.scatter(X_train, y_train, color=cmap(0.9), s=10)
+m2 = plt.scatter(X_test, y_test, color=cmap(0.5), s=10)
+```
+
+- Plot 2 charts on the same figure with a shared
x-axis + +```Python +fig, ax1 = plt.subplots() + +ax1.hist(housing["housing_median_age"], bins=50) + +ax2 = ax1.twinx() # key: create a twin axis that shares the same x-axis +color = "blue" +ax2.plot(ages, rbf1, color=color, label="gamma = 0.10") +ax2.tick_params(axis='y', labelcolor=color) +ax2.set_ylabel("Age similarity", color=color) # second y-axis's measurement + +plt.show() +``` + ### Relative import ``` @@ -93,6 +491,12 @@ pip list --format=freeze > requirements.txt - `conda env create` to create an environment from a given `environment.yml` - Command: `conda env create --name tensorflow --file environment.yml` +#### Other common conda commands + +- `conda env list` to list down conda envs +- `du -h -s $(conda info --base)/envs/*` list down the size of each env +- `conda config --show channels` to show available channels in the config + ## Day 6 - Number: `1000000` can be written as `1_000_000` for the ease of visualisation @@ -146,6 +550,15 @@ np.hstack((a,b)) - To get color palatte `color_pal = sns.color_palette()` +#### Histogram (Displot) + +- Since the `sns.displot` is deprecated, so we will use `sns.histplot` as follow to plot the histogram + kde distribution by specifying `kde=True` +- Also can `set_xlim()` to the zone that containing the data in case the distribution is skewed + +```Python +sns.histplot(df['col_name'], kde=True, bins=50).set_xlim(0,8); +``` + #### Pairplot - Pairplot is to use `scatterplot()` for each pairing of the variables and `histplot()` for the marginal plots along the diagonal @@ -301,6 +714,14 @@ grouped_single = df.groupby(['Team', 'Position']).agg({'Age': ['mean', 'min', 'm grouped_single.columns = ['age_mean', 'age_min', 'age_max'] # rename columns # reset index to get grouped columns back grouped_single = grouped_single.reset_index() + +## Example 4: +df_customers.groupby('rfm_score').agg( # groupby 'rfm_score' col + customers=('customer_id', 'count'), # create new 'customers' col by count('customer_id' col) + 
mean_recency=('recency', 'mean'), # create new 'mean_recency' col by mean('recency' col) + mean_frequency=('frequency', 'mean'), + mean_monetary=('monetary', 'mean'), +).sort_values(by='rfm_score') ``` - Group by the first column and get second column as lists in rows using `.apply(list)` @@ -352,7 +773,7 @@ Name: b, dtype: object - Config file stores in `.env` file ```shell -HUGGINGFACEHUB_API_TOKEN="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" +HUGGINGFACEHUB_API_TOKEN="" ``` - Load environmental variables @@ -375,6 +796,11 @@ os.environ["HUGGINGFACEHUB_API_TOKEN"] = ... # insert your API_TOKEN here ```Python import warnings warnings.filterwarnings('ignore') + +# [Optional] If you do not want to supress all the warnings, you also can explicitly specify which warning needs ignore +import warnings +warnings.filterwarnings('ignore', category=FutureWarning) +warnings.filterwarnings('ignore', category=DeprecationWarning) ``` - Both `!` and `%` allow you to run shell commands from a Jupyter notebook @@ -449,7 +875,19 @@ df.loc[:,'C'] = df.apply(lambda row: 'Hi' if row['A'] > 10 and row['B'] < 5 else data.insert(len(data.columns), 'rolling', data['open'].rolling(5).mean().values) ``` -#### Joining Pandas DataFrame +#### Joining Pandas DataFrame with Numpy array + +- Concat `df` with `numpy_array` and assign the name for the `numpy_array` as `new_col` + - Syntax: `df = df.assign(new_col=numpy_array)` + +#### Joining Pandas DataFrame with Pandas Series + +```Python +# need to convert pd's Series into the Dataframe, and transpose it before concat with pd's DF +pd.concat([df, pd_series.to_frame().T], ignore_index=True) +``` + +#### Joining Pandas DataFrames - Experience: before joining (either `concat`, `merge`), need to careful about the _index_ of the dataframes (might need to `.reset_index()`) as apart from the joining condition, pandas also matching the index of each dataframe. 
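The index-alignment pitfall above can be made concrete with a minimal sketch (the toy frames `a` and `b` are made up for illustration): column-wise `pd.concat` aligns on index labels, so mismatched indexes silently introduce NaN rows.

```Python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'y': [3, 4]}, index=[5, 6])

# concat aligns on the index: labels {0, 1} never match {5, 6},
# so the result has 4 rows padded with NaN
misaligned = pd.concat([a, b], axis=1)
print(misaligned.shape)  # (4, 2)

# resetting the index first gives the intended side-by-side join
aligned = pd.concat([a.reset_index(drop=True), b.reset_index(drop=True)], axis=1)
print(aligned)
#    x  y
# 0  1  3
# 1  2  4
```

When a proper key column exists, an explicit `merge`/`join` on that key is usually safer than relying on index positions.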
diff --git a/docs/plotly.md b/docs/plotly.md
new file mode 100644
index 0000000..44cad66
--- /dev/null
+++ b/docs/plotly.md
@@ -0,0 +1,101 @@
+# Plotly
+
+## Installation
+
+```bash
+pip install plotly
+```
+
+## Plotly Express
+
+- Plotly Express is a high-level interface for creating various types of plots easily.
+
+```Python
+import plotly.express as px
+fig = px.bar(x=["a", "b", "c"], y=[1, 3, 2])
+fig.show()
+```
+
+## Plotly Graph Objects
+
+### Single Plot
+
+```Python
+import plotly.graph_objects as go
+
+# define the fig object
+fig = go.Figure()
+# add multiple lines into the fig object
+fig.add_trace(go.Scatter(x=data_train.index, y=data_train['users'], mode='lines', name='Train'))
+fig.add_trace(go.Scatter(x=data_val.index, y=data_val['users'], mode='lines', name='Validation'))
+fig.add_trace(go.Scatter(x=data_test.index, y=data_test['users'], mode='lines', name='Test'))
+fig.update_layout(
+    title = 'Number of users',
+    xaxis_title="Time",
+    yaxis_title="Users",
+    legend_title="Partition:",
+    width=800,
+    height=350,
+    margin=dict(l=20, r=20, t=35, b=20),
+    legend=dict(
+        orientation="h",
+        yanchor="top",
+        y=1,
+        xanchor="left",
+        x=0.001
+    )
+)
+#fig.update_xaxes(rangeslider_visible=True)
+fig.show()
+```
+
+### Subplots
+
+- The `make_subplots` function from `plotly.subplots` is used to create subplots.
+- In the example below, `make_subplots()` creates the figure `fig` with 2 rows and 1 column.
+  - `fig.add_trace()` adds two traces to the figure, one for the stock price data and one for the volume data.
+  - `go.Ohlc()` creates an interactive candlestick chart based on the stock price data
+  - `go.Scatter()` is used to plot the volume data
+  - By setting `layout_xaxis_rangeslider_visible` to `False`, the `update()` call modifies the layout of the plot to get rid of the range slider for the x-axis.
+ +```Python +import plotly.graph_objs as go +from plotly.subplots import make_subplots + +# Define the range for cropping the x-axis +start_date = '2021-01-10' +end_date = '2021-01-20' + +# Create subplots +# shared_xaxes=True to help crop both plots at the same time. +fig = make_subplots(rows=2, cols=1, shared_xaxes=True) + +# Add OHLC plot at row=1, col=1 +fig.add_trace( + go.Ohlc( + x=df["Date"], + open=df["Open"],high=df["High"],low=df["Low"],close=df["Close"], + name="Price" + ), + row=1, col=1 # specify the plot order row=1, col=1 +) + +# Add Volume scatter plot at row=2, col=1 +fig.add_trace( + go.Scatter( + x=df["Date"], y=df["Volume"], + name="Volume" + ), + row=2, col=1 # specify the plot order row=2, col=1 +) + +# Update x-axes range +fig.update_xaxes(range=[start_date, end_date]) + +# Hide the rangeslider +fig.update(layout_xaxis_rangeslider_visible=False) + +fig.show() +``` + +

diff --git a/requirements.txt b/requirements.txt index 77005ea..f434a2a 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,4 +1,9 @@ attrs black +folium pydantic +pygeohash pyaml +tqdm +librosa +wandb